
Given a URL, how do I extract all the image links?


extrovertive


Given a link/page, what's the best way to extract all of its image URLs? I know I can use $string = file_get_contents("http://www.domain.com"); and then use a regular expression to get all the image links.

 

But what if some images look like http://www.domain.com/resources/data.php?=123? It would be quite difficult to find the ones without the typical .jpg, .gif, or .png extensions.

 

Can anyone show me a simple script to do this?


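<?php
$file = file_get_contents('http://www.phpfreaks.com/forums/index.php/topic,270900.0.html');
if ($file === false) {
    die('Could not fetch the page.');
}
// $matches[2] holds the captured src values (double-quoted attributes only)
preg_match_all('~<img(.+?)src="(.+?)"~', $file, $matches);
print_r($matches[2]);
?>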

 

That works, but to be honest it's probably not the best way; my regex is rusty.


You could use DOM to grab the URLs (more readable than regular expressions, but probably slower and it takes more lines). I normally use cURL to get the remote contents:

 

<?php
// Fetch a URL with cURL; pass an array as $postdata to send a POST request
function curl_load($url, $postdata = false) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (is_array($postdata)) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
    }
    curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; da; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3');
    $contents = curl_exec($ch); // false on failure
    curl_close($ch);
    return $contents;
}
$site = 'http://example.com/';
$html = curl_load($site);
if ($html === false || $html === '') {
    die('Could not fetch ' . $site);
}
$dom = new DOMDocument();
// Real-world HTML is rarely valid; @ suppresses the parser warnings
@$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('img');
$urls = array();
foreach ($tags as $tag) {
    if ($tag->hasAttribute('src')) {
        $urls[] = $tag->getAttribute('src');
    }
}
echo '<pre>' . print_r($urls, true) . '</pre>';
?>
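
On the original question about images without .jpg, .gif, or .png extensions: the DOM approach reads the src attribute instead of matching file extensions, so such URLs come out like any other. A minimal sketch on an inline HTML string, where the helper name extract_image_urls and the sample markup are made up for illustration:

```php
<?php
// Hypothetical helper (the name is illustrative): returns the src of every <img> in an HTML string.
function extract_image_urls($html) {
    // Collect parser warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = array();
    foreach ($dom->getElementsByTagName('img') as $tag) {
        if ($tag->hasAttribute('src')) {
            $urls[] = $tag->getAttribute('src');
        }
    }
    return $urls;
}

// Made-up sample: one ordinary image, one script-generated image, one <img> without src
$sample = '<html><body><img src="logo.png"><img alt="chart" src="/resources/data.php?id=123"><img alt="no source"></body></html>';
print_r(extract_image_urls($sample));
```

The script-generated URL is returned alongside logo.png, and the tag without a src is skipped.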

 

The grabbed URLs could be relative; here's how to convert them to absolute (append to the above code):

 

<?php
// http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
           $rparts = (explode("/", $relative));
           // Resolve '.' and '..' with a stack so that consecutive '..' segments work
           $cparts = array();
           foreach(array_merge($aparts, $rparts) as $part) {
                if($part == '.') {
                    continue;
                } else if($part == '..') {
                    array_pop($cparts);
                } else {
                    $cparts[] = $part;
                }
            }
        }
        $path = implode("/", $cparts);

        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}

//for the absolute URL use the base href if found
$base = $dom->getElementsByTagName('base');
if ($base = $base->item(0)) {
    if ($base->hasAttribute('href')) {
        $site = $base->getAttribute('href');
    }
}
//convert URLs
$abs_urls = array();
foreach ($urls as $url) {
    $abs_urls[] = relative2absolute($site, $url);
}
echo '<pre>' . print_r($abs_urls, true) . '</pre>';
?>
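
One case the function above does not appear to handle is a scheme-relative src such as //cdn.example.com/logo.png, which it would treat as a path on the current host. A minimal sketch of a pre-check that could run before the conversion; the helper name expand_scheme_relative and the http fallback are assumptions, not part of the original code:

```php
<?php
// Hypothetical pre-check (not part of the function above): expand scheme-relative URLs.
function expand_scheme_relative($base, $url) {
    if (strncmp($url, '//', 2) === 0) {
        $scheme = parse_url($base, PHP_URL_SCHEME);
        // Assumption: fall back to http when the base URL has no scheme
        return ($scheme ? $scheme : 'http') . ':' . $url;
    }
    return $url; // anything else is left for the relative-to-absolute conversion
}

echo expand_scheme_relative('https://example.com/gallery/', '//cdn.example.com/logo.png'), "\n";
echo expand_scheme_relative('https://example.com/gallery/', 'thumbs/a.png'), "\n";
```

The first call borrows the scheme from the base URL; the second returns its input unchanged.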

 

And my regex alternative:

 

<?php
// \s* allows any amount of whitespace around "="; the src value ends up in $matches[2]
preg_match_all('~<img\b[^>]+\bsrc\s*=\s*([\'"])(.*?)\1~is', $html, $matches);
$urls = $matches[2];
?>
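
A quick way to check a pattern like this is to run it on an inline string. The sketch below uses a made-up sample with both quote styles and mixed-case tags, and allows any amount of whitespace around the equals sign (\s*):

```php
<?php
// Made-up sample: double-quoted src, then an uppercase tag with a single-quoted src
$html = '<img src="a.png"> <IMG alt=\'x\' src = \'b.php?id=1\'>';

// Group 1 captures the opening quote, \1 requires the same closing quote,
// and group 2 is the URL between them
preg_match_all('~<img\b[^>]+\bsrc\s*=\s*([\'"])(.*?)\1~is', $html, $matches);
print_r($matches[2]);
```

Both URLs are captured. Unquoted attribute values (src=a.png) are not matched by this pattern, which is one reason the DOM version is more robust.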

 

and for grabbing the possible base tag:

 

<?php
if (preg_match('~<base\b[^>]+\bhref\s*=\s*([\'"])(.*?)\1~is', $html, $matches)) {
    $site = $matches[2];
}
?>

