ivytony Posted February 18, 2008

I am wondering how to crawl web pages and search for the 'img' tag. I want to collect the images on a target URL so that my users can choose among them when they submit an article. This is a feature digg.com has.

PS: I'm still using PHP 4 for now.

Thanks!
thebadbad Posted February 18, 2008

Use regular expressions. This will extract all the img tags from a site (be sure to read the comments):

<?php
$url = 'http://en.wikipedia.org/wiki/Google';

// The image paths may be relative. For the images to show on this page, the
// simplest solution is to use the base tag, but beware: this may cause some
// trouble later.
echo '<base href="', $url, '" />', "\r\n";

$pageString = file_get_contents($url);
preg_match_all('|<img .*?>|s', $pageString, $matches);

// The array $matches[0] now contains the img tags, let's echo 'em:
foreach ($matches[0] as $path) {
    echo $path, "<br />";
}
?>
ivytony Posted February 18, 2008

Wow, awesome! Is there a way to filter out small images by evaluating their dimensions? Thank you again!
ivytony Posted February 19, 2008

One more question, though. For an image tag like this:

<im g src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo"/>

(notice: the <im g has been spaced out to avoid showing the picture in this post)

how do I get the http://images.bestbuy.com:80.............logo.gif part between src=" and " alt="Best Buy Logo"/>?

I tried to get it with a preg_replace as below:

<?php
echo preg_replace("/<img src=\" > /", "", "<img src=\"http:\/\/images.bestbuy.com:80\/BestBuy_US\/en_US\/images\/global\/header\/logo.gif\" alt=\"Best Buy Logo\"\/>");

But this gives me 'Best Buy Logo', which is the value of alt. I find PHP regular expressions very hard to understand. Can anyone here please help me get the part between src=" and " alt=? I appreciate your help!

Tony
sasa Posted February 19, 2008

Try:

<?php
$a = '<img src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif" alt="Best Buy Logo" />';
// The parentheses capture everything between the quotes that follow src=
preg_match('/src="([^"]+)"/', $a, $b);
print_r($b);
?>
ivytony Posted February 19, 2008

Thanks for the super fast reply. The result is an array, as below:

Array
(
    [0] => src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"
    [1] => http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif
)

I then changed the output code to print_r($b[0]); and it gives me this:

src="http://images.bestbuy.com:80/BestBuy_US/en_US/images/global/header/logo.gif"

However, the src=" part is still in there. I am wondering what is wrong? Thanks again!
sasa Posted February 19, 2008

Try:

echo $b[1];

($b[0] is the full match; $b[1] is the text captured by the parentheses.)
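The two steps above (find the img tags, then pull out each src) can also be done in one pass. What follows is only a minimal sketch, not code from the thread: the function name extractImageSources and the sample markup are made up for illustration, and the pattern assumes src values are wrapped in either single or double quotes.

```php
<?php
// Sample markup standing in for a fetched page (hypothetical content).
$html = '<p><img src="/images/logo.gif" alt="Logo" />'
      . "<img class='photo' src='photos/dvd.jpg'>"
      . '<img alt="no source here"></p>';

// Hypothetical helper: returns the src value of every img tag in $html.
// Group 1 captures the opening quote character, group 2 the path;
// \1 requires the closing quote to be the same character as the opening one.
function extractImageSources($html)
{
    preg_match_all('/<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1/is', $html, $m);
    return $m[2];
}

print_r(extractImageSources($html));
```

Tags without a quoted src, like the third one in the sample, are simply skipped, so the result holds the two paths /images/logo.gif and photos/dvd.jpg.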
thebadbad Posted February 19, 2008

The getimagesize function will return the dimensions of an image (along with some other stuff):

<?php
$img = 'http://upload.wikimedia.org/wikipedia/en/5/51/Google.png';

// getimagesize() returns an array; indexes 0 and 1 hold width and height
list($x, $y) = getimagesize($img);
echo 'width, height: ', $x, ', ', $y, '<br />';

if ($x * $y > 10000) {
    echo 'This image is bigger than 10000 pixels.';
} else {
    echo 'This image is smaller than 10000 pixels.';
}
?>

Again we will run into problems with relative paths. You will need some code to detect these paths and make them absolute.
ivytony Posted February 19, 2008

I found a smart way to scale the images on the getimagesize function page on php.net:

<?php
// imagesize practice
$location = "http://images.bestbuy.com/BestBuy_US/en_US/images/global/header/logo.gif";
echo scaleimage($location, 30, 60);

function scaleimage($location, $maxw = NULL, $maxh = NULL) {
    $img = @getimagesize($location);
    if ($img) {
        $w = $img[0];
        $h = $img[1];
        $dim = array('w', 'h');
        // Check width first, then height, via variable variables ($w, $maxw, ...)
        foreach ($dim AS $val) {
            $max = "max{$val}";
            if (${$val} > ${$max} && ${$max}) {
                $alt = ($val == 'w') ? 'h' : 'w';
                $ratio = ${$alt} / ${$val};
                ${$val} = ${$max};
                ${$alt} = ${$val} * $ratio;
            }
        }
        return("<img src='{$location}' alt='image' width='{$w}' height='{$h}' />");
    }
}
?>

As thebadbad said, I'll need to convert relative image paths to absolute ones. Any smart idea for detecting the image path and converting relative paths? Use a regular expression again? Thanks!
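The scaling arithmetic in scaleimage can be hard to follow because of the variable variables. Here is a minimal sketch of the same idea as a plain calculation, assuming both maximums are always given (the posted function also accepts NULL for "no limit", which this sketch does not). The name fitWithin is hypothetical, and the results are rounded to whole pixels, which the posted function does not do.

```php
<?php
// Hypothetical helper: returns array(width, height) scaled down to fit
// inside $maxW x $maxH while keeping the aspect ratio. Images that
// already fit are returned unchanged.
function fitWithin($w, $h, $maxW, $maxH)
{
    if ($w <= 0 || $h <= 0) {
        return array(0, 0); // nothing sensible to scale
    }
    // Shrink factor required by each axis; the cap of 1 prevents enlarging.
    $scale = min(1, $maxW / $w, $maxH / $h);
    return array((int) round($w * $scale), (int) round($h * $scale));
}

list($w, $h) = fitWithin(400, 200, 100, 100);
echo "width={$w} height={$h}\n"; // width=100 height=50
```

Taking the smaller of the two shrink factors handles width and height in one step, so the order in which the limits are checked no longer matters.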
thebadbad Posted February 19, 2008

I found a function to convert a relative path to an absolute one. Haven't tested it, but it sure looks good!

<?php
function relative2absolute($absolute, $relative) {
    $p = @parse_url($relative);
    if (!$p) {
        // $relative is a seriously malformed URL
        return false;
    }
    // Already absolute: return it unchanged
    if (isset($p["scheme"])) return $relative;

    $parts = (parse_url($absolute));

    if (substr($relative, 0, 1) == '/') {
        // Root-relative path: ignore the base path
        $cparts = (explode("/", $relative));
        array_shift($cparts);
    } else {
        if (isset($parts['path'])) {
            $aparts = explode('/', $parts['path']);
            array_pop($aparts);
            $aparts = array_filter($aparts);
        } else {
            $aparts = array();
        }
        $rparts = (explode("/", $relative));
        $cparts = array_merge($aparts, $rparts);
        // Resolve "." and ".." segments
        foreach ($cparts as $i => $part) {
            if ($part == '.') {
                unset($cparts[$i]);
            } else if ($part == '..') {
                unset($cparts[$i]);
                unset($cparts[$i-1]);
            }
        }
    }
    $path = implode("/", $cparts);

    // Rebuild the URL from the base's scheme, credentials and host
    $url = '';
    if ($parts['scheme']) {
        $url = "$parts[scheme]://";
    }
    if (isset($parts['user'])) {
        $url .= $parts['user'];
        if (isset($parts['pass'])) {
            $url .= ":".$parts['pass'];
        }
        $url .= "@";
    }
    if (isset($parts['host'])) {
        $url .= $parts['host']."/";
    }
    $url .= $path;
    return $url;
}
?>
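One thing to watch in the function above: it handles ".." by unsetting index $i-1, and that index may already be gone when two ".." segments follow each other, so a path such as a/b/../../c appears to come out as a/c instead of c. Below is a minimal sketch of a stack-based alternative for the path part only. The name normalizePath is hypothetical, and the sketch assumes a plain slash-separated path with no query string; it also drops empty segments, so a trailing slash is not preserved.

```php
<?php
// Hypothetical helper: collapses "." and ".." segments in a
// slash-separated path using a stack of the segments kept so far.
function normalizePath($path)
{
    $stack = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue;              // skip empty and "current directory" segments
        }
        if ($segment === '..') {
            array_pop($stack);     // step up one level (no-op when already at the top)
        } else {
            $stack[] = $segment;
        }
    }
    return implode('/', $stack);
}

echo normalizePath('a/b/../../c/./logo.gif'), "\n"; // c/logo.gif
```

Because each ".." pops whatever segment is currently on top of the stack, any number of consecutive ".." segments is resolved correctly.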
thebadbad Posted February 19, 2008

Just run the function on every path, since it will detect absolute paths and quickly return them.

There is another problem you could run into: if the page you're crawling has a base tag with an href different from the page path itself, the relative paths will fail. But I'm sure this would happen very rarely, and it's pretty easy to solve if needed.
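For the base tag case, one possible approach is to look for a base href in the fetched page and fall back to the page URL when there is none. This is only a sketch under assumptions: the function name resolveBaseUrl and the example URLs are made up, the href is assumed to be quoted, and a base href that is itself relative is not handled.

```php
<?php
// Hypothetical helper: returns the href of the first <base> tag in $html,
// or $pageUrl when the page does not declare one.
function resolveBaseUrl($html, $pageUrl)
{
    if (preg_match('/<base\b[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is', $html, $m)) {
        return $m[2];
    }
    return $pageUrl;
}

// Made-up sample pages for illustration.
$withBase    = '<head><base href="http://static.example.com/assets/" /></head>';
$withoutBase = '<head><title>No base tag</title></head>';
$pageUrl     = 'http://www.example.com/shop/item.html';

echo resolveBaseUrl($withBase, $pageUrl), "\n";    // the declared base
echo resolveBaseUrl($withoutBase, $pageUrl), "\n"; // falls back to the page URL
```

The returned value would then be passed as the first argument to relative2absolute in place of the page URL.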
ivytony Posted February 19, 2008

Thank you very much! Now you can see the entire script in action: http://www.uspie.net/getimgs.php

I'm searching for img tags on this URL, which uses relative image paths: http://www.circuitcity.com/ssm/Philips-DVP-5140-DVD-Player-DVP5140-37/sem/rpsm/oid/172059/rpem/ccd/productDetail.do

Right now, the script filters out images that are smaller than 150px in width and 40px in height. The entire code is as below:

<?php
$url = 'http://www.circuitcity.com/ssm/Philips-DVP-5140-DVD-Player-DVP5140-37/sem/rpsm/oid/172059/rpem/ccd/productDetail.do';

// The image paths may be relative. For the images to show on this page, the
// simplest solution is to use the base tag, but beware: this may cause some
// trouble later.
echo '<base href="', $url, '" />', "\r\n";

$pageString = file_get_contents($url);
preg_match_all('|<img .*?>|s', $pageString, $matches);
// The array $matches[0] now contains the img tags.

// Extract the domain (e.g. circuitcity.com) from the URL
preg_match('/([^\.\/]+\.[^\/\.]+)((\/)|($))/', $url, $d);
$domain = strtolower($d[1]);
$fulldomain = 'http://www.' . $domain;

foreach ($matches[0] as $path) {
    // Get the image path as $b[1]; skip tags without a double-quoted src
    if (!preg_match('/src="([^"]+)"/', $path, $b)) {
        continue;
    }
    // Convert relative image paths to absolute image paths
    $location = relative2absolute($fulldomain, $b[1]);
    // Filter out smaller images, then scale bigger ones
    $filteredimg = filterimage($location, 150, 40);
    echo scaleimage($filteredimg, 100, 100);
}

function relative2absolute($absolute, $relative) {
    $p = @parse_url($relative);
    if (!$p) {
        // $relative is a seriously malformed URL
        return false;
    }
    if (isset($p["scheme"])) return $relative;

    $parts = (parse_url($absolute));

    if (substr($relative, 0, 1) == '/') {
        $cparts = (explode("/", $relative));
        array_shift($cparts);
    } else {
        if (isset($parts['path'])) {
            $aparts = explode('/', $parts['path']);
            array_pop($aparts);
            $aparts = array_filter($aparts);
        } else {
            $aparts = array();
        }
        $rparts = (explode("/", $relative));
        $cparts = array_merge($aparts, $rparts);
        foreach ($cparts as $i => $part) {
            if ($part == '.') {
                unset($cparts[$i]);
            } else if ($part == '..') {
                unset($cparts[$i]);
                unset($cparts[$i-1]);
            }
        }
    }
    $path = implode("/", $cparts);

    $url = '';
    if ($parts['scheme']) {
        $url = "$parts[scheme]://";
    }
    if (isset($parts['user'])) {
        $url .= $parts['user'];
        if (isset($parts['pass'])) {
            $url .= ":".$parts['pass'];
        }
        $url .= "@";
    }
    if (isset($parts['host'])) {
        $url .= $parts['host']."/";
    }
    $url .= $path;
    return $url;
}

function filterimage($location, $minW = NULL, $minH = NULL) {
    $img = @getimagesize($location);
    if ($img) { // if image exists
        $w = $img[0];
        $h = $img[1];
        // Only images at least as large as the minimums are returned
        if ($w >= $minW && $h >= $minH) {
            return $location;
        } else {
            return; // return no image
        }
    }
}

function scaleimage($location, $maxw = NULL, $maxh = NULL) {
    $img = @getimagesize($location);
    if ($img) {
        $w = $img[0];
        $h = $img[1];
        $dim = array('w', 'h');
        foreach ($dim AS $val) {
            $max = "max{$val}";
            if (${$val} > ${$max} && ${$max}) {
                $alt = ($val == 'w') ? 'h' : 'w';
                $ratio = ${$alt} / ${$val};
                ${$val} = ${$max};
                ${$alt} = ${$val} * $ratio;
            }
        }
        return("<img src='{$location}' alt='image' width='{$w}' height='{$h}' />");
    }
}
?>

I'll be happy to hear if anything needs improvement for efficiency. Thanks a million!
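On efficiency, one thing that stands out is that filterimage and scaleimage each call getimagesize, so every image that passes the filter is fetched twice, and an image that appears several times on the page is fetched again for each occurrence. A minimal sketch of one way around that, assuming both maximums are always given: look the size up once per distinct URL and do the filtering and scaling on the numbers. The names renderThumbnail and renderGallery are hypothetical, not part of the script above.

```php
<?php
// Hypothetical helper: builds the <img> tag from an already known size,
// or returns '' when the image is below the minimum dimensions.
function renderThumbnail($location, $w, $h, $minW, $minH, $maxW, $maxH)
{
    if ($w <= 0 || $h <= 0 || $w < $minW || $h < $minH) {
        return '';
    }
    // Shrink to fit inside the maximums; never enlarge.
    $scale = min(1, $maxW / $w, $maxH / $h);
    $w = (int) round($w * $scale);
    $h = (int) round($h * $scale);
    $src = htmlspecialchars($location, ENT_QUOTES);
    return "<img src='{$src}' alt='image' width='{$w}' height='{$h}' />";
}

// Hypothetical driver: one getimagesize() call per distinct image URL.
function renderGallery($locations, $minW, $minH, $maxW, $maxH)
{
    $out = '';
    foreach (array_unique($locations) as $location) {
        $size = @getimagesize($location); // false when the image cannot be read
        if ($size) {
            $out .= renderThumbnail($location, $size[0], $size[1],
                                    $minW, $minH, $maxW, $maxH);
        }
    }
    return $out;
}
```

In the loop of the posted script, the absolute paths would be collected into an array first and then passed in one go, for example echo renderGallery($locations, 150, 40, 100, 100);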