
How to crawl web pages for specific tags such as 'img'


ivytony


I am wondering how to crawl web pages and search for the 'img' tag. I want to fetch the images on a target URL so that my users can choose one of them when they submit an article. This is a feature digg.com has.

 

 

 

 

PS: I'm still using PHP 4 for now.

 

thanks!

Use regular expressions. This will extract all the img-tags from a site (be sure to read the comments):

 

<?php
$url = 'http://en.wikipedia.org/wiki/Google';
//the image paths may be relative. For the images to show on this page, the simplest solution is to use the base tag, but beware; this may cause some trouble later..
echo '<base href="', $url, '" />', "\r\n";
$pageString = file_get_contents($url);
preg_match_all('|<img .*?>|s', $pageString, $matches);
//the array $matches[0] now contains the img tags, let's echo 'em:
foreach($matches[0] as $path) {
	echo $path, "<br />";
}
?>
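A variation on the snippet above, in case it helps: the pattern `<img .*?>` only matches a lowercase `<img` followed by a space, so tags written as `<IMG`, or with a line break right after the tag name, slip through. The sketch below runs on a made-up HTML string (`$sampleHtml` is my own, nothing is fetched) and uses a case-insensitive pattern. It assumes no attribute value contains a `>` character.

```php
<?php
// Made-up sample markup so the example runs without fetching anything.
$sampleHtml = '<p><IMG SRC="/a.png"><img
 src="b.gif" alt="b"></p>';

// \b requires the tag name to end right after "img";
// the i flag makes the tag name case-insensitive.
preg_match_all('/<img\b[^>]*>/i', $sampleHtml, $matches);

// Escape the tags so they are shown as text instead of being rendered.
foreach ($matches[0] as $tag) {
    echo htmlspecialchars($tag), "<br />\n";
}
?>
```

To use it on a real page, replace `$sampleHtml` with the string returned by file_get_contents.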

One more question, though:

 

For an image tag like this:

 

<im g src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/> (notice: the <im g has been spaced out to avoid showing the picture in this post)

 

How do I get the http://images.bestbuy.com:80.............logo.gif part between src=" and " alt="Best Buy Logo"/>?

 

I tried to get it with preg_replace, as below:

 

<?php
echo preg_replace("/<img src=\" > /", "", "<img src=\"http:\/\/images.bestbuy.com:80\/BestBuy_US\/en_US\/images\/global\/header\/logo.gif\" alt=\"Best Buy Logo\"\/>");

 

But this gives me 'Best Buy Logo', which is the value of alt. I find PHP regular expressions very hard to understand. Can anyone here please help me get the part between src=" and " alt=?

 

I appreciate your help!

 

Tony

Thanks for the super fast reply. The result is an array, as below:

 

Array  (  [0] => src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" [1] => http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif )

 

I then changed the output code to print_r($b[0]); and it gives me this:

 

src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"

However, the src=" part is still in there.

 

I am wondering what I did wrong?

 

thanks again!!

 

Try this:

<?php
$a = '<img src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo" />';
preg_match('/src="([^"]+)"/',$a, $b);
print_r($b);
?>
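On the question of why src=" still shows up: in the array that preg_match fills, index 0 holds the whole matched text and index 1 holds only what the first pair of parentheses captured, so the bare URL is `$b[1]`, not `$b[0]`. A minimal sketch follows. The helper name `extract_src` is made up for this post, and its pattern is widened (my assumption, not part of the reply above) to accept single-quoted attributes too.

```php
<?php
// Hypothetical helper: returns the src value of one img tag, or null if none is found.
function extract_src($tag) {
    // Group 1 captures the quote character, group 2 the URL up to the matching quote.
    if (preg_match('/\bsrc\s*=\s*(["\'])(.*?)\1/i', $tag, $m)) {
        return $m[2];
    }
    return null;
}

$a = '<img src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo" />';
preg_match('/src="([^"]+)"/', $a, $b);
echo $b[1], "\n";           // the URL only, without src=" and the closing quote
echo extract_src($a), "\n"; // same URL via the helper
?>
```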

The getimagesize function will return the dimensions of an image (along with some other stuff):

 

<?php
$img = 'http://upload.wikimedia.org/wikipedia/en/5/51/Google.png';
list($x, $y) = getimagesize($img);
echo 'width, height: ', $x, ', ', $y, '<br />';
if($x * $y > 10000) {
echo 'This image is bigger than 10000 pixels.';
} else {
echo 'This image is smaller than 10000 pixels.';
}
?>

 

Again we will run into problems with relative paths. You will need some code to detect these paths and make them absolute.

I found a smart way to scale the images on the getimagesize function page on php.net:

 

<?php
//imagesize practice
$location = "http://images.bestbuy.com/BestBuy_US/en_US/images/global/header/logo.gif";
echo scaleimage($location, 30, 60);


function scaleimage($location, $maxw=NULL, $maxh=NULL){
    $img = @getimagesize($location);
    if($img){
        $w = $img[0];
        $h = $img[1];

        $dim = array('w','h');
        foreach($dim AS $val){
            $max = "max{$val}";
            if(${$val} > ${$max} && ${$max}){
                $alt = ($val == 'w') ? 'h' : 'w';
                $ratio = ${$alt} / ${$val};
                ${$val} = ${$max};
                ${$alt} = ${$val} * $ratio;
            }
        }

        return("<img src='{$location}' alt='image' width='{$w}' height='{$h}' />");
    }
}
?>

 

As thebadbad said, I'll need to convert relative image paths to absolute ones. Is there a smart way to detect relative image paths and convert them? Regular expressions again?

 

thanks

I found a function to convert a relative path to an absolute. Haven't tested it, but sure looks good!

 

<?php
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
           $rparts = (explode("/", $relative));
           $cparts = array_merge($aparts, $rparts);
           foreach($cparts as $i => $part) {
                if($part == '.') {
                    unset($cparts[$i]);
                } else if($part == '..') {
                    unset($cparts[$i]);
                    unset($cparts[$i-1]);
                }
            }
        }
        $path = implode("/", $cparts);

        $url = '';
        if($parts['scheme']) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}
?>

Just run the function on every path, since it will detect absolute paths and quickly return them.
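One thing to watch for, as far as I can tell from reading the function: for a `..` segment it unsets the entry at `$i-1`, so two `..` segments in a row (as in `../../x.gif`) remove only one parent directory, because the second `..` points at an entry that is already gone. A stack-based pass avoids that. The sketch below is my own and the name `collapse_dot_segments` is hypothetical. It also drops empty segments, so a trailing slash is not preserved.

```php
<?php
// Hypothetical helper: collapses "." and ".." segments in a slash-separated path.
function collapse_dot_segments($path) {
    $out = array();
    foreach (explode('/', $path) as $part) {
        if ($part === '' || $part === '.') {
            continue;               // skip empty and "current directory" segments
        }
        if ($part === '..') {
            array_pop($out);        // step up one level; does nothing when already at the top
        } else {
            $out[] = $part;
        }
    }
    return implode('/', $out);
}

echo collapse_dot_segments('a/b/c/../../x.gif'), "\n"; // a/x.gif
?>
```

Something like this could replace the inner foreach loop over `$cparts`, applied to the merged path before it is joined to the host.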

 

Another problem you could run into: if the page you're crawling has a base tag whose href differs from the page's own path, the relative paths will resolve incorrectly. But I'm sure this happens very rarely.

 

..and it's pretty easy to solve actually, if needed.
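For what it's worth, here is a rough sketch of one way to handle the base tag: look for it in the fetched page and resolve image paths against its href when there is one, falling back to the page URL otherwise. The function name `find_base_href` and the example.com URLs are made up for illustration.

```php
<?php
// Hypothetical helper: returns the href of the page's <base> tag, or $pageUrl if there is none.
function find_base_href($html, $pageUrl) {
    // Group 1 captures the quote character, group 2 the href value.
    if (preg_match('/<base\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/i', $html, $m)) {
        return $m[2];
    }
    return $pageUrl;
}

// Made-up markup and URLs so the example runs without fetching anything.
$html = '<head><base href="http://example.com/static/" /></head>';
echo find_base_href($html, 'http://example.com/articles/page.html'), "\n";
?>
```

The returned value would then be passed as the first argument to relative2absolute instead of the page URL.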

thank you very much!

 

Now you can see the entire script in action: http://www.uspie.net/getimgs.php

 

I'm searching for img tags on this URL: http://www.circuitcity.com/ssm/Philips-DVP-5140-DVD-Player-DVP5140-37/sem/rpsm/oid/172059/rpem/ccd/productDetail.do , which uses relative image paths. Right now, the script filters out images that are less than 150px wide or less than 40px high.

 

The entire code is as below:

<?php
$url = 'http://www.circuitcity.com/ssm/Philips-DVP-5140-DVD-Player-DVP5140-37/sem/rpsm/oid/172059/rpem/ccd/productDetail.do';
//the image paths may be relative. For the images to show on this page, the simplest solution is to use the base tag, but beware; this may cause some trouble later..
echo '<base href="', $url, '" />', "\r\n";
$pageString = file_get_contents($url);
preg_match_all('|<img .*?>|s', $pageString, $matches);
//the array $matches[0] now contains the img tags, let's echo 'em:
// Resolve relative image paths against the page URL itself, so that both
// root-relative ("/img/a.gif") and document-relative ("img/a.gif") paths work.
foreach ($matches[0] as $path) {
    // $b[1] holds the src value; skip tags without a double-quoted src attribute
    if (!preg_match('/src="([^"]+)"/', $path, $b)) {
        continue;
    }

    $location = relative2absolute($url, $b[1]); // convert relative image paths to absolute ones

    $filteredimg = filterimage($location, 150, 40); // empty for missing or too-small images
    if ($filteredimg) {
        echo scaleimage($filteredimg, 100, 100); // scale down images larger than 100x100
    }
}

function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
           $rparts = (explode("/", $relative));
           $cparts = array_merge($aparts, $rparts);
           foreach($cparts as $i => $part) {
                if($part == '.') {
                    unset($cparts[$i]);
                } else if($part == '..') {
                    unset($cparts[$i]);
                    unset($cparts[$i-1]);
                }
            }
        }
        $path = implode("/", $cparts);

        $url = '';
        if($parts['scheme']) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}

function filterimage($location, $minW=NULL, $minH=NULL){
    $img = @getimagesize($location);
    if ($img){                      //if image exists
        $w = $img[0];
        $h = $img[1];

        if ($w >= $minW && $h >= $minH){   //only images at least as large as the minimums are returned
            return $location;
        }
    }
    return;                  //return no image
}

function scaleimage($location, $maxw=NULL, $maxh=NULL){
    $img = @getimagesize($location);
    if($img){
        $w = $img[0];
        $h = $img[1];

        $dim = array('w','h');
        foreach($dim AS $val){
            $max = "max{$val}";
            if(${$val} > ${$max} && ${$max}){
                $alt = ($val == 'w') ? 'h' : 'w';
                $ratio = ${$alt} / ${$val};
                ${$val} = ${$max};
                ${$alt} = ${$val} * $ratio;
            }
        }

        return("<img src='{$location}' alt='image' width='{$w}' height='{$h}' />");
    }
}

?>

 

I'd be happy to hear if anything could be improved for efficiency.
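On efficiency, one thing stands out: the script calls getimagesize twice per image, once in filterimage and once in scaleimage, and for a remote URL each call has to fetch the image data again. A possible fix is to call it once and pass the result to a single function that both filters and scales. The sketch below is only a suggestion: the name `image_tag` and its parameters are made up, and the dimensions are hard-coded so that it runs without network access.

```php
<?php
// Hypothetical helper: $size is the array returned by a single getimagesize() call (or false).
// Returns an <img> tag scaled to fit $maxW x $maxH, or '' when the image is missing or too small.
function image_tag($location, $size, $minW, $minH, $maxW, $maxH) {
    if (!$size || $size[0] < max(1, $minW) || $size[1] < max(1, $minH)) {
        return '';
    }
    // Shrink by the tighter of the two ratios; never enlarge (ratio capped at 1).
    $ratio = min(1, $maxW / $size[0], $maxH / $size[1]);
    $w = (int) round($size[0] * $ratio);
    $h = (int) round($size[1] * $ratio);
    return "<img src='" . htmlspecialchars($location, ENT_QUOTES) . "' alt='image' width='$w' height='$h' />";
}

// In the real loop this would be: $size = @getimagesize($location);
$size = array(300, 150); // stand-in dimensions so the example runs offline
echo image_tag('http://example.com/a.gif', $size, 150, 40, 100, 100), "\n";
?>
```

Unlike scaleimage, this rounds the scaled dimensions to whole pixels and escapes the URL before putting it into the attribute.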

 

Thanks a million :D

Archived

This topic is now archived and is closed to further replies.
