
Given a URL, extract all the image links?


extrovertive


Given a link/page, what's the best way to extract all the image URLs from it? I know I can use $string = file_get_contents("http://www.domain.com"); and then use a regular expression to get all the image links.

 

But what if some images look like http://www.domain.com/resources/data.php?=123? It would be difficult to find the ones without the typical .jpg, .gif, or .png extensions.

 

Can anyone show me a simple script to do this?

The URLs would be found inside img tags (as the src attribute), right?

 

Ah, thanks. What would be the best method to do this? Use cURL? Use file_get_contents and extract the <img tags? But how do I grab just the image URLs?

<?php
$file = file_get_contents('http://www.phpfreaks.com/forums/index.php/topic,270900.0.html');
preg_match_all('~<img(.+?)src="(.+?)"~', $file, $matches);
print_r($matches[2]);
?>

 

That works, but to be honest it's probably not the best way; my regex is rusty.

You could use DOM to grab the URLs (better readability compared to regular expressions - but probably slower and takes up more lines). And I normally use cURL to get the remote contents:

 

<?php
// Fetch a URL with cURL; sends a POST request when $postdata is an array.
// Returns the response body, or false on failure.
function curl_load($url, $postdata = false) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (is_array($postdata)) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
    }
    curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; da; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3');
    $contents = curl_exec($ch);
    curl_close($ch);
    return $contents;
}
$site = 'http://example.com/';
$html = curl_load($site);
if ($html === false) {
    die('Could not fetch ' . $site);
}
$dom = new DOMDocument();
// Real-world HTML is rarely valid; keep libxml's parse warnings out of the output
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$tags = $dom->getElementsByTagName('img');
$urls = array();
foreach ($tags as $tag) {
    if ($tag->hasAttribute('src')) {
        $urls[] = $tag->getAttribute('src');
    }
}
echo '<pre>' . print_r($urls, true) . '</pre>';
?>
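
Since the worry earlier in the thread was images without a .jpg/.gif/.png extension, here is a minimal sketch of the same DOM idea run on an inline string, so it can be tried without fetching anything. The sample markup is made up for illustration, and using DOMXPath with the `//img/@src` query is just one possible variation on the loop above, not the only way to do it.

```php
<?php
// Hypothetical sample markup; in practice $html would be the fetched page.
$html = '<html><body>'
      . '<img src="/images/logo.png" alt="logo">'
      . '<img src="http://www.domain.com/resources/data.php?=123">'
      . '<img alt="no source here">'
      . '<p>Unclosed paragraph'
      . '</body></html>';

// Collect libxml parse warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select the src attribute node of every img element that has one.
$xpath = new DOMXPath($dom);
$urls = array();
foreach ($xpath->query('//img/@src') as $attr) {
    $urls[] = $attr->nodeValue;
}
print_r($urls);
```

The file extension never enters into it: whatever sits in the src attribute is collected, and the img tag without a src is simply skipped.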

 

The grabbed URLs could be relative; here's how to convert them to absolute (append to the above code):

 

<?php
// http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
            $rparts = (explode("/", $relative));
            // Resolve "." and ".." with a stack so consecutive ".." segments work
            $cparts = array();
            foreach(array_merge($aparts, $rparts) as $part) {
                if($part == '.') {
                    continue;
                } else if($part == '..') {
                    array_pop($cparts);
                } else {
                    $cparts[] = $part;
                }
            }
        }
        $path = implode("/", $cparts);

        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}

//for the absolute URL use the base href if found
$base = $dom->getElementsByTagName('base');
if ($base = $base->item(0)) {
    if ($base->hasAttribute('href')) {
        $site = $base->getAttribute('href');
    }
}
//convert URLs
$abs_urls = array();
foreach ($urls as $url) {
    $abs_urls[] = relative2absolute($site, $url);
}
echo '<pre>' . print_r($abs_urls, true) . '</pre>';
?>
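
It can help to see which kinds of src values turn up before converting them. The sketch below uses parse_url() to sort a URL into one of a few forms; classify_url() and the sample URLs are hypothetical names made up for this illustration. One thing to watch: protocol-relative URLs such as //cdn.example.com/a.png start with a slash, so relative2absolute() as posted appears to treat them as root-relative paths on the current host, and they may need separate handling.

```php
<?php
// Hypothetical helper: names the kind of reference found in a src attribute.
function classify_url($url) {
    $p = parse_url($url);
    if ($p === false) {
        return 'malformed';          // parse_url() gave up on it
    }
    if (isset($p['scheme'])) {
        return 'absolute';           // e.g. http://example.com/a.png
    }
    if (isset($p['host'])) {
        return 'protocol-relative';  // e.g. //cdn.example.com/a.png
    }
    if (isset($p['path']) && substr($p['path'], 0, 1) === '/') {
        return 'root-relative';      // e.g. /img/a.png
    }
    return 'relative';               // e.g. ../img/a.png or data.php?=123
}

$samples = array(
    'http://example.com/a.png',
    '//cdn.example.com/a.png',
    '/img/a.png',
    '../img/a.png',
);
foreach ($samples as $u) {
    echo $u . ' => ' . classify_url($u) . "\n";
}
```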

 

And my regex alternative:

 

<?php
// \s* allows any amount of whitespace around "="; the \1 backreference matches the opening quote
preg_match_all('~<img\b[^>]+\bsrc\s*=\s*([\'"])(.*?)\1~is', $html, $matches);
$urls = $matches[2];
?>
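
To show what the backreference buys you, here is a small sketch run against made-up markup that mixes quote styles. The sample string is hypothetical, and the pattern is a variant of the one above that assumes any amount of whitespace may surround the equals sign.

```php
<?php
// Hypothetical sample markup mixing quote styles.
$html = '<img src="a.png">' . "\n"
      . "<IMG alt='x' SRC = 'b.gif'>" . "\n"
      . '<img src=c.jpg>';

// ([\'"]) captures the opening quote and \1 requires the same character to close it.
preg_match_all('~<img\b[^>]+\bsrc\s*=\s*([\'"])(.*?)\1~is', $html, $matches);
print_r($matches[2]);
```

The double-quoted and single-quoted sources are both picked up, but the unquoted src=c.jpg is not, which is one reason the DOM approach tends to be more forgiving on sloppy markup.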

 

and for grabbing the possible base tag:

 

<?php
if (preg_match('~<base\b[^>]+\bhref\s*=\s*([\'"])(.*?)\1~is', $html, $matches)) {
    $site = $matches[2];
}
?>

