extrovertive Posted September 28, 2009

Given a link/page, what's the best way to extract all of its image URLs? I know I can use

$string = file_get_contents("http://www.domain.com");

and then use a regular expression to get all the image links. But what if some images are served from URLs like http://www.domain.com/resources/data.php?=123? It would be quite a challenge to find the ones without the typical .jpg, .gif, or .png extensions. Can anyone show me a simple script to do this?

thebadbad Posted September 28, 2009

The URLs would be found inside img tags (as the src attribute), right?

extrovertive Posted September 28, 2009

The URLs would be found inside img tags (as the src attribute), right?

Ah, thanks. What would be the best method to do this? Use cURL? Use file_get_contents and extract the <img> tags? But how do I grab just the image URLs?

Alex Posted September 28, 2009

<?php
$file = file_get_contents('http://www.phpfreaks.com/forums/index.php/topic,270900.0.html');
preg_match_all('~<img(.+?)src="(.+?)"~', $file, $matches);
print_r($matches[2]);
?>

That works, but to be honest it's probably not the best way; my regex is rusty.
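
For what it's worth, a quick way to sanity-check that snippet against an inline HTML string instead of a live page (the markup below is made up for the test) also shows it handles the data.php?=123 case, since it matches on the src attribute rather than on a file extension:

<?php
// Hypothetical test markup - note the second src has no image extension
$file = '<p><img class="a" src="http://www.domain.com/pic.jpg" alt="" />'
      . '<img src="http://www.domain.com/resources/data.php?=123" /></p>';

preg_match_all('~<img(.+?)src="(.+?)"~', $file, $matches);
print_r($matches[2]);
// Both URLs are captured, extension or not.
?>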

thebadbad Posted September 28, 2009

You could use DOM to grab the URLs (better readability compared to regular expressions - but probably slower and takes up more lines). And I normally use cURL to get the remote contents:

<?php
function curl_load($url, $postdata = false) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if (is_array($postdata)) {
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
    }
    curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
    curl_setopt($ch, CURLOPT_FRESH_CONNECT, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; da; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3');
    $contents = curl_exec($ch);
    curl_close($ch);
    return $contents;
}

$site = 'http://example.com/';
$html = curl_load($site);

$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('img');
$urls = array();
foreach ($tags as $tag) {
    if ($tag->hasAttribute('src')) {
        $urls[] = $tag->getAttribute('src');
    }
}
echo '<pre>' . print_r($urls, true) . '</pre>';
?>

The grabbed URLs could be relative; here's how to convert them to absolute (append to the above code):

<?php
// http://w-shadow.com/blog/2007/07/16/how-to-extract-all-urls-from-a-page-using-php/
function relative2absolute($absolute, $relative) {
    $p = @parse_url($relative);
    if (!$p) {
        //$relative is a seriously malformed URL
        return false;
    }
    if (isset($p["scheme"])) return $relative;

    $parts = parse_url($absolute);

    if (substr($relative, 0, 1) == '/') {
        $cparts = explode("/", $relative);
        array_shift($cparts);
    } else {
        if (isset($parts['path'])) {
            $aparts = explode('/', $parts['path']);
            array_pop($aparts);
            $aparts = array_filter($aparts);
        } else {
            $aparts = array();
        }
        $rparts = explode("/", $relative);
        $cparts = array_merge($aparts, $rparts);
        foreach ($cparts as $i => $part) {
            if ($part == '.') {
                unset($cparts[$i]);
            } else if ($part == '..') {
                unset($cparts[$i]);
                unset($cparts[$i - 1]);
            }
        }
    }
    $path = implode("/", $cparts);

    $url = '';
    if ($parts['scheme']) {
        $url = "$parts[scheme]://";
    }
    if (isset($parts['user'])) {
        $url .= $parts['user'];
        if (isset($parts['pass'])) {
            $url .= ":" . $parts['pass'];
        }
        $url .= "@";
    }
    if (isset($parts['host'])) {
        $url .= $parts['host'] . "/";
    }
    $url .= $path;
    return $url;
}

//for the absolute URL use the base href if found
$base = $dom->getElementsByTagName('base');
if ($base = $base->item(0)) {
    if ($base->hasAttribute('href')) {
        $site = $base->getAttribute('href');
    }
}

//convert URLs
$abs_urls = array();
foreach ($urls as $url) {
    $abs_urls[] = relative2absolute($site, $url);
}
echo '<pre>' . print_r($abs_urls, true) . '</pre>';
?>

And my regex alternative:

<?php
preg_match_all('~<img\b[^>]+\bsrc\s?=\s?([\'"])(.*?)\1~is', $html, $matches);
$urls = $matches[2];
?>

and for grabbing the possible base tag:

<?php
if (preg_match('~<base\b[^>]+\bhref\s?=\s?([\'"])(.*?)\1~is', $html, $matches)) {
    $site = $matches[2];
}
?>
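
To circle back to the original concern about URLs like data.php?=123 that have no image extension: none of the posts above need the extension at all, since they match on the src attribute. If you also want to confirm that such a URL really serves an image, one hedged option (just a sketch, not from the code above; the helper name is illustrative) is to send a HEAD request with cURL and check the reported Content-Type:

<?php
// Rough sketch: returns true if the server reports an image/* Content-Type.
// Assumes the URL is absolute; some servers answer HEAD requests poorly,
// so you may need to fall back to a normal GET or getimagesize().
function looks_like_image($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD request, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
    curl_exec($ch);
    $type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
    curl_close($ch);
    return is_string($type) && strpos($type, 'image/') === 0;
}

// e.g. filter the $abs_urls array built in the previous post
//$images = array_filter($abs_urls, 'looks_like_image');
?>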