
How to crawl web pages for specific tags such as 'img'


ivytony


I am wondering how to crawl web pages and search them for the 'img' tag. I want to fetch the images on a target URL so that my users can choose one of them when they submit an article. This is a feature digg.com has.

 

 

 

 

PS: I'm still using PHP 4 for now.

 

thanks!


Use regular expressions. This will extract all the img-tags from a site (be sure to read the comments):

 

<?php
$url = 'http://en.wikipedia.org/wiki/Google';
//the image paths may be relative. For the images to show on this page, the simplest solution is to use the base tag, but beware; this may cause some trouble later..
echo '<base href="', $url, '" />', "\r\n";
$pageString = file_get_contents($url);
preg_match_all('|<img .*?>|s', $pageString, $matches);
//the array $matches[0] now contains the img tags, let's echo 'em:
foreach($matches[0] as $path) {
	echo $path, "<br />";
}
?>
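
A possible refinement, offered only as a sketch: the pattern above is case-sensitive and expects a single space after `<img`, so it would miss `<IMG ...>` or a tag whose attributes start on a new line. The helper name `extractImgTags` and the sample HTML are made up for illustration, and a `>` inside an attribute value would still cut a tag short.

```php
<?php
// Hypothetical helper (not from the thread): collect every <img ...> tag from an HTML string.
function extractImgTags($html) {
    // 'i' makes the match case-insensitive (<IMG ...>), 's' lets '.' span newlines,
    // and \s+ accepts any whitespace after the tag name, not only a single space.
    preg_match_all('|<img\s+.*?>|is', $html, $matches);
    return $matches[0];
}

// Made-up sample page fragment, used instead of a network fetch.
$html = '<p><IMG SRC="a.gif"><img' . "\n" . 'src="b.png" alt="B" /></p>';
$tags = extractImgTags($html);
foreach ($tags as $tag) {
    // Escape the tag so it is printed as text rather than rendered.
    echo htmlspecialchars($tag), "<br />\n";
}
```

With `file_get_contents($url)` as the input, the function would take the place of the `preg_match_all` line in the script above.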


One more question, though:

 

For an image tag like this:

 

<im g src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/> (notice: the <im g has been spaced out to avoid showing the picture in this post)

 

How do I get the http://images.bestbuy.com:80.............logo.gif part between src=" and " alt="Best Buy Logo"/>?

 

I tried to get it by creating a preg_replace as below:

 

<?php
echo preg_replace("/<img src=\" > /", "", "<img src=\"http:\/\/images.bestbuy.com:80\/BestBuy_US\/en_US\/images\/global\/header\/logo.gif\" alt=\"Best Buy Logo\"\/>");

 

But this gives me 'Best Buy Logo', which is the value of alt. I find PHP regular expressions very hard to understand. Can anyone here please help me get the part between src=" and " alt=?

 

I appreciate your help!

 

Tony


Thanks for the super fast reply. Running the code gives the array below:

 

Array  (  [0] => src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" [1] => http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif )

 

I then changed the output code to print_r($b[0]); and it gives me this:

 

src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"

However, the src=" part is still in there.

 

What am I doing wrong?

 

thanks again!!

 

Try:

<?php
$a = '<img src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo" />';
preg_match('/src="([^"]+)"/',$a, $b);
print_r($b);
?>
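
For what it's worth, in the array that print_r shows, index 0 is the whole match (including `src="`) and index 1 is what the parentheses captured, so the bare URL is `$b[1]`, not `$b[0]`. The sketch below shows both, plus a hypothetical helper `extractSrc` (my own name, not something from the thread) that also accepts single-quoted and upper-case attributes.

```php
<?php
// Hypothetical helper: return the src value of one img tag, or false if none is found.
function extractSrc($imgTag) {
    // Group 1 remembers the opening quote, group 2 is the URL,
    // and \1 requires the same kind of quote to close the value.
    if (preg_match('/\bsrc\s*=\s*(["\'])(.*?)\1/i', $imgTag, $m)) {
        return $m[2];
    }
    return false;
}

$a = '<img src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo" />';
preg_match('/src="([^"]+)"/', $a, $b);
echo $b[0], "\n"; // the whole match, which still starts with src="
echo $b[1], "\n"; // the first capture group: just the URL

// Made-up tag with single quotes and an upper-case attribute name.
echo extractSrc("<img alt='x' SRC='pics/logo.gif'>"), "\n";
```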


The getimagesize function will return the dimensions of an image (along with some other stuff):

 

<?php
$img = 'http://upload.wikimedia.org/wikipedia/en/5/51/Google.png';
$size = @getimagesize($img); //returns false if the image cannot be read
if($size) {
	list($x, $y) = $size;
	echo 'width, height: ', $x, ', ', $y, '<br />';
	if($x * $y > 10000) {
		echo 'This image is bigger than 10000 pixels.';
	} else {
		echo 'This image is 10000 pixels or smaller.';
	}
}
?>

 

Again we will run into problems with relative paths. You will need some code to detect these paths and make them absolute.
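
One way to detect them, as a rough sketch: treat a path as absolute only when parse_url finds a scheme in it. The helper name `isAbsoluteUrl` is hypothetical. Note that under this test a protocol-relative path such as `//host/pic.gif` counts as relative, because it has no scheme.

```php
<?php
// Hypothetical helper: true when $path already carries a scheme such as http://.
function isAbsoluteUrl($path) {
    // parse_url returns false for seriously malformed input, otherwise an array of parts.
    $p = @parse_url($path);
    return is_array($p) && isset($p['scheme']);
}

$candidates = array(
    'http://upload.wikimedia.org/wikipedia/en/5/51/Google.png',
    '/images/logo.gif',   // root-relative
    'images/logo.gif',    // relative to the current directory
);
foreach ($candidates as $path) {
    echo $path, ': ', isAbsoluteUrl($path) ? 'absolute' : 'relative', "\n";
}
```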


I found a smart way to scale the images on the getimagesize function page on php.net:

 

<?php
//imagesize practice
$location = "http://images.bestbuy.com/BestBuy_US/en_US/images/global/header/logo.gif";
echo scaleimage($location, 30, 60);


function scaleimage($location, $maxw=NULL, $maxh=NULL){
    $img = @getimagesize($location);
    if($img){
        $w = $img[0];
        $h = $img[1];

        $dim = array('w','h');
        foreach($dim AS $val){
            $max = "max{$val}";
            if(${$val} > ${$max} && ${$max}){
                $alt = ($val == 'w') ? 'h' : 'w';
                $ratio = ${$alt} / ${$val};
                ${$val} = ${$max};
                ${$alt} = ${$val} * $ratio;
            }
        }

        return("<img src='{$location}' alt='image' width='{$w}' height='{$h}' />");
    }
}
?>
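
The arithmetic behind that kind of scaling can also be written without variable variables. This is only a sketch under my own assumptions: `fitWithin` is a hypothetical name, both sides are assumed to be positive, and the result is rounded to whole pixels, which the function above does not do.

```php
<?php
// Hypothetical helper: shrink ($w, $h) to fit inside ($maxW, $maxH), keeping the aspect ratio.
// Assumes all four arguments are positive numbers.
function fitWithin($w, $h, $maxW, $maxH) {
    // The smaller of the two ratios is the one that keeps both sides inside the box.
    $scale = min($maxW / $w, $maxH / $h);
    if ($scale >= 1) {
        return array($w, $h); // already small enough, never enlarge
    }
    return array((int) round($w * $scale), (int) round($h * $scale));
}

// A 200x100 image squeezed into a 30x60 box: the width is the limiting side.
list($w, $h) = fitWithin(200, 100, 30, 60);
echo "width=$w height=$h\n";
```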

 

As thebadbad said, I'll need to convert relative image paths to absolute ones. Is there a smart way to detect the image path and convert relative paths? Regular expressions again?

 

thanks


I found a function that converts a relative path to an absolute one. I haven't tested it, but it looks good!

 

<?php
function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
           $rparts = (explode("/", $relative));
           $cparts = array_merge($aparts, $rparts);
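           //resolve '.' and '..' with a stack so that consecutive '..' segments each remove one directory
           $stack = array();
           foreach($cparts as $part) {
                if($part == '.') {
                    continue;
                } else if($part == '..') {
                    array_pop($stack);
                } else {
                    $stack[] = $part;
                }
            }
            $cparts = $stack;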
        }
        $path = implode("/", $cparts);

        $url = '';
        if($parts['scheme']) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}
?>


Just run the function on every path, since it will detect absolute paths and quickly return them.

 

Another problem you could run into: if the page you're crawling has a base tag whose href differs from the URL of the page itself, the relative paths will resolve to the wrong place. But I'm sure this happens very rarely.

 

..and it's pretty easy to solve actually, if needed.
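
A rough sketch of how that could be handled, assuming the page source is already in a string: look for a base tag and, if one is found, resolve relative paths against its href instead of the page URL. The helper name `findBaseUrl` and the example.com URLs are made up for illustration, and the sketch does not handle a base href that is itself relative.

```php
<?php
// Hypothetical helper: pick the URL that relative paths should be resolved against.
function findBaseUrl($html, $pageUrl) {
    // Look for <base ... href="..."> anywhere in the page; the quotes may be single or double.
    // The \b after "base" keeps tags such as <basefont> from matching.
    if (preg_match('/<base\b[^>]*\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $m)) {
        return $m[2];
    }
    return $pageUrl; // no base tag: fall back to the page's own URL
}

// Made-up page source and URLs.
$pageUrl = 'http://www.example.com/shop/item.html';
$html = '<html><head><base href="http://static.example.com/assets/" /></head><body></body></html>';
echo findBaseUrl($html, $pageUrl), "\n";
echo findBaseUrl('<html><head></head></html>', $pageUrl), "\n";
```

The returned value would then be passed as the first argument of whatever function turns relative paths into absolute ones.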


thank you very much!

 

Now you can see the entire script in action: http://www.uspie.net/getimgs.php

 

I'm searching for img tags on this URL: http://www.circuitcity.com/ssm/Philips-DVP-5140-DVD-Player-DVP5140-37/sem/rpsm/oid/172059/rpem/ccd/productDetail.do, which uses relative image paths. Right now, the script filters out images that are narrower than 150px or shorter than 40px.

 

The entire code is below:

<?php
$url = 'http://www.circuitcity.com/ssm/Philips-DVP-5140-DVD-Player-DVP5140-37/sem/rpsm/oid/172059/rpem/ccd/productDetail.do';
//the image paths may be relative. For the images to show on this page, the simplest solution is to use the base tag, but beware; this may cause some trouble later..
echo '<base href="', $url, '" />', "\r\n";
$pageString = file_get_contents($url);
preg_match_all('|<img .*?>|s', $pageString, $matches);
//the array $matches[0] now contains the img tags; resolve, filter and scale each one:
foreach ($matches[0] as $tag){
    //get the image path as $b[1]; skip tags without a double-quoted src
    if (!preg_match('/src="([^"]+)"/', $tag, $b)) {
        continue;
    }
    //convert relative image paths to absolute ones, using the page URL itself as the base
    $location = relative2absolute($url, $b[1]);
    //filter out images smaller than 150x40; filterimage returns nothing for those
    $filteredimg = filterimage($location, 150, 40);
    if ($filteredimg) {
        echo scaleimage($filteredimg, 100, 100); //scale down anything larger than 100x100
    }
}

function relative2absolute($absolute, $relative) {
        $p = @parse_url($relative);
        if(!$p) {
        //$relative is a seriously malformed URL
        return false;
        }
        if(isset($p["scheme"])) return $relative;

        $parts=(parse_url($absolute));

        if(substr($relative,0,1)=='/') {
            $cparts = (explode("/", $relative));
            array_shift($cparts);
        } else {
            if(isset($parts['path'])){
                 $aparts=explode('/',$parts['path']);
                 array_pop($aparts);
                 $aparts=array_filter($aparts);
            } else {
                 $aparts=array();
            }
           $rparts = (explode("/", $relative));
           $cparts = array_merge($aparts, $rparts);
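           //resolve '.' and '..' with a stack so that consecutive '..' segments each remove one directory
           $stack = array();
           foreach($cparts as $part) {
                if($part == '.') {
                    continue;
                } else if($part == '..') {
                    array_pop($stack);
                } else {
                    $stack[] = $part;
                }
            }
            $cparts = $stack;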
        }
        $path = implode("/", $cparts);

        $url = '';
        if($parts['scheme']) {
            $url = "$parts[scheme]://";
        }
        if(isset($parts['user'])) {
            $url .= $parts['user'];
            if(isset($parts['pass'])) {
                $url .= ":".$parts['pass'];
            }
            $url .= "@";
        }
        if(isset($parts['host'])) {
            $url .= $parts['host']."/";
        }
        $url .= $path;

        return $url;
}

function filterimage($location, $minW=NULL, $minH=NULL){
    $img = @getimagesize($location);
    if ($img){                      //if the image exists and can be read
        $w = $img[0];
        $h = $img[1];

        if ($w >= $minW && $h >= $minH){   //return the location only if the image meets both minimums
            return $location;
        }
    }
    return NULL;                    //missing or too small: return no image
}

function scaleimage($location, $maxw=NULL, $maxh=NULL){
    $img = @getimagesize($location);
    if($img){
        $w = $img[0];
        $h = $img[1];

        $dim = array('w','h');
        foreach($dim AS $val){
            $max = "max{$val}";
            if(${$val} > ${$max} && ${$max}){
                $alt = ($val == 'w') ? 'h' : 'w';
                $ratio = ${$alt} / ${$val};
                ${$val} = ${$max};
                ${$alt} = ${$val} * $ratio;
            }
        }

        return("<img src='{$location}' alt='image' width='{$w}' height='{$h}' />");
    }
}

?>

 

I'd be happy to hear about anything that could be improved for efficiency.

 

Thanks a million :D
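
One possible efficiency gain, offered as a sketch and not as a tested replacement: filterimage and scaleimage each call getimagesize, so every remote image is fetched twice. Looking the size up once and passing it along avoids the second fetch. The helper `thumbTag`, its parameter order and the example.com URL are my own assumptions; it also rounds to whole pixels and assumes positive minimum sizes.

```php
<?php
// Hypothetical helper: filter and scale from a single size lookup.
// $size is whatever getimagesize($location) returned: array(width, height, ...) or false.
function thumbTag($location, $size, $minW, $minH, $maxW, $maxH) {
    if (!$size || $size[0] < $minW || $size[1] < $minH) {
        return ''; // unreadable or too small: emit nothing
    }
    // Never enlarge (cap at 1); otherwise use the ratio of the limiting side.
    $scale = min(1, $maxW / $size[0], $maxH / $size[1]);
    $w = (int) round($size[0] * $scale);
    $h = (int) round($size[1] * $scale);
    return "<img src='" . htmlspecialchars($location, ENT_QUOTES) . "' alt='image' width='$w' height='$h' />";
}

// In the crawl loop the call would look like:
//   echo thumbTag($location, @getimagesize($location), 150, 40, 100, 100);
// Here a hand-made size array stands in for the network call.
echo thumbTag('http://www.example.com/pics/player.jpg', array(300, 200), 150, 40, 100, 100), "\n";
```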

