Given a link/page, what's the best way to extract all of its image URLs? I know I can use $string = file_get_contents("http://www.domain.com"); and then use a regular expression to get all the image links.


But what if some images look like http://www.domain.com/resources/data.php?=123? It would be hard to find the ones without the typical .jpg, .gif, or .png extensions.


Can anyone show me a simple script to do this?

The URLs would be found inside img tags (as the src attribute), right?


Ah, thanks. What would be the best method to do this? Use cURL? Use file_get_contents and extract the <img> tags? But how do I grab just the image URLs?

$file = file_get_contents('http://www.phpfreaks.com/forums/index.php/topic,270900.0.html');
preg_match_all('~<img(.+?)src="(.+?)"~', $file, $matches);


That works, but to be honest it's probably not the best way; my regex is rusty.

You could use DOM to grab the URLs (better readability compared to regular expressions - but probably slower and takes up more lines). And I normally use cURL to get the remote contents:


function curl_load($url, $postdata = false) {
	$ch = curl_init();
	curl_setopt($ch, CURLOPT_URL, $url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	// send a POST request only when form data is supplied
	if (is_array($postdata)) {
		curl_setopt($ch, CURLOPT_POST, true);
		curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
	}
	curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
	curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
	curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; da; rv: Gecko/20090824 Firefox/3.5.3');
	$contents = curl_exec($ch);
	return $contents;
}

$site = 'http://example.com/';
$html = curl_load($site);
// curl_exec() returns false on failure
if ($html === false || $html === '') {
	die('Could not load ' . $site);
}
$dom = new DOMDocument();
// @ hides the warnings libxml raises on malformed real-world HTML
@$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('img');
$urls = array();
foreach ($tags as $tag) {
	if ($tag->hasAttribute('src')) {
		$urls[] = $tag->getAttribute('src');
	}
}
echo '<pre>' . print_r($urls, true) . '</pre>';
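
To try the DOM approach without a network request, here is a minimal sketch that runs on a hard-coded HTML string. The sample markup is made up for illustration. It also shows why the file extension doesn't matter: anything in an img tag's src attribute is collected, including the data.php?=123 style URL from the question.

```php
<?php
// Hypothetical sample page; the second image has no file extension.
$html = '<html><body>'
      . '<img src="/images/logo.png" alt="Logo">'
      . '<img src="http://www.domain.com/resources/data.php?=123">'
      . '<img alt="no source here">'
      . '</body></html>';

$dom = new DOMDocument();
// Silence parser warnings, since real-world HTML is often malformed.
@$dom->loadHTML($html);

$urls = array();
foreach ($dom->getElementsByTagName('img') as $tag) {
    // Skip <img> tags that have no src attribute at all.
    if ($tag->hasAttribute('src')) {
        $urls[] = $tag->getAttribute('src');
    }
}

print_r($urls);
```

The third img tag is skipped because it has no src, so $urls ends up with two entries.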


The grabbed URLs could be relative; here's how to convert them to absolute (append to the above code):


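// Adapted from http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
            //$relative is a seriously malformed URL
            return false;
        }
        //already absolute
        if(isset($p["scheme"])) return $relative;

        $parts = parse_url($absolute);

        //protocol-relative URL ("//host/path"): borrow the scheme from the base
        if(substr($relative,0,2)=='//' && isset($parts['scheme'])) {
            return $parts['scheme'].':'.$relative;
        }

        if(substr($relative,0,1)=='/') {
            //root-relative: ignore the path of the base URL
            $cparts = explode("/", $relative);
            array_shift($cparts);
        } else {
            //path-relative: start from the directory of the base URL
            if(isset($parts['path'])) {
                $aparts = explode("/", $parts['path']);
                array_pop($aparts);
                $aparts = array_filter($aparts, 'strlen');
            } else {
                $aparts = array();
            }
            $rparts = explode("/", $relative);
            $cparts = array();
            //resolve "." and ".." segments
            foreach(array_merge($aparts, $rparts) as $part) {
                if($part === '.') {
                    continue;
                } else if($part === '..') {
                    array_pop($cparts);
                } else {
                    $cparts[] = $part;
                }
            }
        }
        $path = implode("/", $cparts);

        $url = '';
        if(isset($parts['scheme'])) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host'];
            if(isset($parts['port'])) {
                $url .= ":".$parts['port'];
            }
            $url .= "/";
        }
        $url .= $path;

        return $url;
}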

//for the absolute URL use the base href if found
$base = $dom->getElementsByTagName('base');
if ($base = $base->item(0)) {
	if ($base->hasAttribute('href')) {
		$site = $base->getAttribute('href');
	}
}
//convert URLs
$abs_urls = array();
foreach ($urls as $url) {
	$abs_urls[] = relative2absolute($site, $url);
}
echo '<pre>' . print_r($abs_urls, true) . '</pre>';
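
To make the relative-to-absolute step concrete, here is a small self-contained sketch with a few worked cases. Note that resolve_url() is a hypothetical, simplified helper written for this illustration, not the relative2absolute() function above. It assumes the base is a well-formed absolute http(s) URL and it does not handle query-only or fragment-only references.

```php
<?php
// Hypothetical, simplified resolver for illustration only.
// Assumes $base is a well-formed absolute http(s) URL.
function resolve_url($base, $relative) {
    $scheme = parse_url($relative, PHP_URL_SCHEME);
    if ($scheme === false) {
        return false;                           // seriously malformed URL
    }
    if ($scheme !== null) {
        return $relative;                       // already absolute
    }
    $b = parse_url($base);
    if (substr($relative, 0, 2) === '//') {
        return $b['scheme'] . ':' . $relative;  // protocol-relative
    }
    // Split off "?query" / "#fragment" so they are not treated as path segments.
    $cut  = strcspn($relative, '?#');
    $tail = (string) substr($relative, $cut);
    $path = (string) substr($relative, 0, $cut);

    if (substr($path, 0, 1) === '/') {
        // Root-relative: ignore the path of the base URL.
        $segments = array();
        $path = (string) substr($path, 1);
    } else {
        // Path-relative: start from the directory part of the base URL.
        $segments = isset($b['path']) ? explode('/', $b['path']) : array();
        array_pop($segments);                   // drop the file name
        $segments = array_values(array_filter($segments, 'strlen'));
    }
    // Resolve "." and ".." segments with a simple stack.
    foreach (explode('/', $path) as $part) {
        if ($part === '.') {
            continue;
        } elseif ($part === '..') {
            array_pop($segments);
        } else {
            $segments[] = $part;
        }
    }
    $root = $b['scheme'] . '://' . $b['host'];
    if (isset($b['port'])) {
        $root .= ':' . $b['port'];
    }
    return $root . '/' . implode('/', $segments) . $tail;
}

$base = 'http://example.com/blog/post/index.html';
echo resolve_url($base, 'img/a.png'), "\n";                // http://example.com/blog/post/img/a.png
echo resolve_url($base, '../../logo.gif'), "\n";           // http://example.com/logo.gif
echo resolve_url($base, '/resources/data.php?=123'), "\n"; // http://example.com/resources/data.php?=123
echo resolve_url($base, '//cdn.example.com/x.jpg'), "\n";  // http://cdn.example.com/x.jpg
```

Using a stack for the ".." segments means several of them in a row (as in ../../logo.gif) each remove one directory, which is easy to get wrong when unsetting array entries by index.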


And my regex alternative:


preg_match_all('~<img\b[^>]+\bsrc\s?=\s?([\'"])(.*?)\1~is', $html, $matches);
$urls = $matches[2];
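
For a quick check of what this pattern does and doesn't pick up, here is a sketch that runs it against a made-up HTML string (the sample markup is an assumption for illustration). The backreference \1 makes the closing quote match the opening one, so both single- and double-quoted src values are captured, and the i flag handles uppercase tags.

```php
<?php
// Hypothetical sample markup mixing quote styles, attribute order and case.
$html = '<IMG class="a" src=\'one.png\'>'
      . '<img src = "two.php?id=2" alt="x">'
      . '<p src="not-an-image.png">text</p>';

// Group 1 captures the opening quote, group 2 the URL between the quotes.
preg_match_all('~<img\b[^>]+\bsrc\s?=\s?([\'"])(.*?)\1~is', $html, $matches);
$urls = $matches[2];

print_r($urls);
```

Only the two img tags match; the src on the p tag is ignored. Unquoted attribute values (src=one.png) would not match this pattern, which is one reason the DOM approach is more forgiving.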


and for grabbing the possible base tag:


if (preg_match('~<base\b[^>]+\bhref\s?=\s?([\'"])(.*?)\1~is', $html, $matches)) {
	$site = $matches[2];
}
