Given a URL, how do I extract all of its image links?

extrovertive · September 28, 2009

Given a link/page, what's the best way to extract all the image URLs from it? I know I can use $string = file_get_contents("http://www.domain.com"); and then use a regular expression to get all the image links.

But what if some images have URLs like http://www.domain.com/resources/data.php?=123? It would be difficult to find the ones without the typical .jpg, .gif, or .png extensions.

Can anyone show me a simple script to do this?

thebadbad · September 28, 2009

The URLs would be found inside img tags (as the src attribute), right?

extrovertive · September 28, 2009

"The URLs would be found inside img tags (as the src attribute), right?"

Ah, thanks. What would be the best method to do this? Use cURL? Use file_get_contents and extract the <img> tags? And how do I grab only the image URLs?

Alex · September 28, 2009

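<?php
// Fetch the page (needs allow_url_fopen); file_get_contents returns false on failure
$file = file_get_contents('http://www.phpfreaks.com/forums/index.php/topic,270900.0.html');
// Capture the src value of each <img> tag; this only matches double-quoted attributes
preg_match_all('~<img(.+?)src="(.+?)"~', $file, $matches);
// $matches[2] holds the second capture group, i.e. the URLs
print_r($matches[2]);
?>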

That works, but to be honest it's probably not the best way; my regex is rusty.

thebadbad · September 28, 2009

You could use DOM to grab the URLs (more readable than regular expressions, but probably slower, and it takes more lines). I normally use cURL to fetch the remote contents:

<?php
// Fetch a URL with cURL; pass an array as $postdata to send a POST request
function curl_load($url, $postdata = false) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    if (is_array($postdata)) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
    }
    curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; da; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3');
    $contents = curl_exec($ch); // false on failure
    curl_close($ch);
    return $contents;
}
$site = 'http://example.com/';
$html = curl_load($site);
if ($html === false || $html === '') {
    die('Could not load ' . $site);
}
// Real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
// Collect the src attribute of every <img> tag, whatever the file extension
$tags = $dom->getElementsByTagName('img');
$urls = array();
foreach ($tags as $tag) {
    if ($tag->hasAttribute('src')) {
        $urls[] = $tag->getAttribute('src');
    }
}
echo '<pre>' . print_r($urls, true) . '</pre>';
?>
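To tie this back to the original question about images without a .jpg/.gif/.png extension, here is a minimal sketch of the same DOM idea wrapped in a function and run on an inline HTML string instead of a live page. The function name extract_image_urls and the sample markup are assumptions made up for illustration; they are not from the posts above. It uses DOMXPath to select only img elements that carry a src attribute.

```php
<?php
// Hypothetical helper (not from the thread): return the src of every <img> in $html.
function extract_image_urls(string $html): array
{
    // loadHTML() rejects an empty string, so bail out early
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select only <img> elements that actually have a src attribute
    $xpath = new DOMXPath($dom);
    $urls = [];
    foreach ($xpath->query('//img[@src]') as $img) {
        $urls[] = $img->getAttribute('src');
    }
    return $urls;
}

// Sample markup (an assumption for the demo), including an extension-less image URL
$sample = '<html><body>'
    . '<img src="logo.png" alt="logo">'
    . '<img src="/resources/data.php?=123">'
    . '<img alt="no source">'
    . '</body></html>';

print_r(extract_image_urls($sample));
```

Because the lookup is by tag and attribute rather than by file extension, the data.php URL is returned alongside logo.png, and the img without a src is skipped.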

The grabbed URLs could be relative; here's how to convert them to absolute (append to the above code):

<?php
// http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
           $rparts = (explode("/", $relative));
           $cparts = array();
           foreach(array_merge($aparts, $rparts) as $part) {
                if($part === '.') {
                    continue;
                } else if($part === '..') {
                    //step up one directory; popping also copes with consecutive '..' segments
                    array_pop($cparts);
                } else {
                    $cparts[] = $part;
                }
            }
        }
        $path = implode("/", $cparts);

        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
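        if(isset($parts['host'])) {
            $url .= $parts['host'];
            //keep a non-default port from the base URL
            if(isset($parts['port'])) $url .= ":".$parts['port'];
            $url .= "/";
        }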
        $url .= $path;

        return $url;
}

//for the absolute URL use the base href if found
$base = $dom->getElementsByTagName('base');
if ($base = $base->item(0)) {
    if ($base->hasAttribute('href')) {
        $site = $base->getAttribute('href');
    }
}
//convert URLs
$abs_urls = array();
foreach ($urls as $url) {
    $abs_urls[] = relative2absolute($site, $url);
}
echo '<pre>' . print_r($abs_urls, true) . '</pre>';
?>
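Two inputs that are easy to trip over when resolving image URLs are protocol-relative references (//cdn.example.net/c.jpg) and query strings that contain slashes. Below is a compact sketch that handles both. The function name resolve_url and all example URLs are assumptions for illustration. It is a simplified resolver, not a full RFC 3986 implementation: it collapses empty path segments, and a fragment-only reference loses the base query.

```php
<?php
// Hypothetical helper (not from the thread): resolve $relative against $base.
function resolve_url(string $base, string $relative): string
{
    // Already absolute: starts with a scheme such as "http:" or "data:"
    if (preg_match('~^[a-z][a-z0-9+.-]*:~i', $relative)) {
        return $relative;
    }
    $b = parse_url($base);
    if ($b === false || !isset($b['scheme'], $b['host'])) {
        return $relative; // no usable base, leave the reference untouched
    }
    // Protocol-relative reference: borrow only the scheme from the base
    if (strpos($relative, '//') === 0) {
        return $b['scheme'] . ':' . $relative;
    }
    $origin = $b['scheme'] . '://' . $b['host']
        . (isset($b['port']) ? ':' . $b['port'] : '');

    // Split off ?query/#fragment so slashes inside them are not treated as path separators
    $suffix = '';
    if (preg_match('~^([^?#]*)([?#].*)$~s', $relative, $m)) {
        $relative = $m[1];
        $suffix = $m[2];
    }
    $basePath = $b['path'] ?? '/';
    if ($relative === '') {
        return $origin . $basePath . $suffix; // query-only reference keeps the base path
    }
    if ($relative[0] === '/') {
        $path = $relative; // root-relative
    } else {
        // Relative to the directory of the base document
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;
    }
    // Resolve "." and ".." with a stack
    $out = [];
    foreach (explode('/', $path) as $seg) {
        if ($seg === '..') {
            array_pop($out);
        } elseif ($seg !== '.' && $seg !== '') {
            $out[] = $seg;
        }
    }
    $resolved = '/' . implode('/', $out);
    // Keep a trailing slash when the path pointed at a directory
    if ($out && preg_match('~/(\.\.?)?$~', $path)) {
        $resolved .= '/';
    }
    return $origin . $resolved . $suffix;
}

$page = 'http://example.com:8080/blog/post.html';
foreach (['img/a.png', '../b.gif', '/resources/data.php?=123', '//cdn.example.net/c.jpg'] as $src) {
    echo resolve_url($page, $src), "\n";
}
```

Using a stack for the dot segments means several ".." in a row each remove one directory, and splitting off the query first keeps a URL like data.php?path=a/b from being cut at the wrong slash.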

And my regex alternative:

<?php
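// \s* allows any amount of whitespace around "="; the backreference \1 matches the same quote type that opened the value
preg_match_all('~<img\b[^>]+\bsrc\s*=\s*([\'"])(.*?)\1~is', $html, $matches);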
$urls = $matches[2];
?>

and for grabbing the possible base tag:

<?php
if (preg_match('~<base\b[^>]+\bhref\s*=\s*([\'"])(.*?)\1~is', $html, $matches)) {
    $site = $matches[2];
}
?>
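One difference worth knowing between the two approaches: getAttribute() in the DOM version returns attribute values with HTML entities already decoded, while a regex returns the raw markup, so a URL written as chart.php?w=200&amp;h=100 comes back with the &amp; intact. A minimal sketch of decoding the regex matches follows; the function name extract_image_urls_regex and the sample markup are assumptions for illustration.

```php
<?php
// Hypothetical helper (not from the thread): regex extraction plus entity decoding.
function extract_image_urls_regex(string $html): array
{
    // Same idea as the pattern above: quoted src value inside an <img> tag
    preg_match_all('~<img\b[^>]+\bsrc\s*=\s*([\'"])(.*?)\1~is', $html, $matches);

    $urls = [];
    foreach ($matches[2] as $raw) {
        // Turn &amp; and friends back into literal characters
        $urls[] = html_entity_decode(trim($raw), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
    return $urls;
}

// Sample markup (an assumption for the demo): single quotes, entities, upper-case tag
$html = '<img alt="x" src=\'chart.php?w=200&amp;h=100\'><IMG SRC = "a.gif">';
print_r(extract_image_urls_regex($html));
```

Without the decoding step, requesting the captured URL as-is would send "amp;h=100" as part of the query string.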
