REGEX help, build a list of images on a page


EthanV2

Hi, just wondering if anyone could help me out with this. I just need a piece of code, or a function, that will search a string for any images, then put the list of images into an array.

 

So far my Googling hasn't turned up anything useful, so any help would be much appreciated.

 

I'll be using the function to scan an HTML page, and generate a list of images embedded on that page (using the <img> tag), but I just can't seem to get it to work.

This is a starter function. You could certainly extend it, for example by checking for relative URLs in the returned array and prepending the directory URL of the page so that each image src is complete.

<pre>
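<?php
// Fetch a page with cURL and return the src value of every <img> tag found in it.
function getimgs($page){
  $ch=curl_init();
  curl_setopt($ch, CURLOPT_URL, $page);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the body instead of printing it
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1); // follow redirects
  $html=curl_exec($ch);
  curl_close($ch);
  if($html===false) return array(); // request failed, nothing to scan
  // Group 2 holds a quoted src (single or double quotes), group 3 an unquoted src.
  // Both stop at a "?" so query strings are dropped.
  preg_match_all('/<img\s[^>]*\bsrc=(?:(?:([\'"])((?:(?!\1|\?).)*))|([^\s?>]*))/is',$html,$images);
  $images=array_merge(array_filter($images[2]),array_filter($images[3]));
  return $images;
}
echo print_r(getimgs('http://www.phpfreaks.com/forums/'),true);
?>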
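
The relative-URL extension mentioned above could look roughly like the sketch below. The helper name absolutize_src is hypothetical (it is not part of the posted function), and it assumes the page URL is a well-formed absolute http(s) URL. It does not collapse "../" segments.

```php
<?php
// Hypothetical helper (not from the thread): make an image src absolute,
// using the URL of the page it was found on.
function absolutize_src($src, $pageUrl) {
    // Already absolute: starts with a scheme such as "http:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
        return $src;
    }
    $parts  = parse_url($pageUrl);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $root   = $scheme . '://' . $host . $port;

    // Protocol-relative src ("//cdn.example.com/x.png"): reuse the page's scheme.
    if (substr($src, 0, 2) === '//') {
        return $scheme . ':' . $src;
    }
    // Root-relative src ("/images/x.png"): prepend scheme and host only.
    if (substr($src, 0, 1) === '/') {
        return $root . $src;
    }
    // Anything else is taken as relative to the page's directory.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $src;
}

echo absolutize_src('/mainpage_images/buttons/join.gif', 'http://www.phpfreaks.com/forums/'), "\n";
```

You would map this over the array returned by getimgs(), passing the same page URL you fetched.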

This solution assumes you only want the directories and image file names.

 

$str = <<<DATA
<table id="logintable"> 
<tr id="usernamerow"> 
	<th rowspan="3" valign="top">Login</th> 
	<td>Username:</td> 
	<td><input type="text" class="text150" name="vb_login_username" /></td> 
	<td id="submit"><input type="image" src="/mainpage_images/buttons/submit.gif" alt="Submit" /></td>
</tr> 
<tr id="passwordrow"> 
	<td>Password:</td> 
	<td><input type="password" class="text150" name="vb_login_password" /></td> 
	<td id="join"><a href="/forums/register.php"><img src="/mainpage_images/buttons/join.gif" alt="Join" title="Join" border="0" /></a></td> 
</tr> 
<tr id="remembermerow"> 
	<td align="right"><input type="checkbox" name="cookieuser" value="1" tabindex="3" id="cb_cookieuser_navbar" accesskey="c" checked="checked" /></td><td align="left"><label for="cb_cookieuser_navbar">Remember Me</label></td> 
	<td id="help"><a href="/forums/faq.php?"><img src="/mainpage_images/buttons/help.gif" alt="Help" title="Help" width="15" height="15" border="0" /></a></td> 
</tr> 
</table> 
DATA;

preg_match_all('#<img[^"]+"([^"]+)"#', $str, $matches);
echo '<pre>';
print_r($matches[1]);
echo '</pre>';

 

Output:

Array
(
    [0] => /mainpage_images/buttons/join.gif
    [1] => /mainpage_images/buttons/help.gif
)
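
One caveat with the pattern above: it captures the first double-quoted value after <img, which is the src only when src happens to be the first attribute. The sketch below shows the difference on a hypothetical tag where alt comes first. The second pattern is a suggested variant, not something posted in the thread, and it still assumes double-quoted attributes.

```php
<?php
// Hypothetical input: alt precedes src, unlike the sample data above.
$html = '<a href="/forums/register.php"><img alt="Join" src="/mainpage_images/buttons/join.gif" /></a>';

// Pattern from the post: grabs the first double-quoted value after "<img".
preg_match_all('#<img[^"]+"([^"]+)"#', $html, $first);

// Variant (an assumption): insist on the src attribute name before the quote.
preg_match_all('#<img\s(?:[^>]*?\s)?src="([^"]+)"#i', $html, $bySrc);

print_r($first[1]); // contains the alt text, "Join"
print_r($bySrc[1]); // contains the image path
```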

 

If you want only the filenames, and nothing else, you can use this pattern instead:

preg_match_all('#<img.+?([^/]+\.[^"]+)"#', $str, $matches);

 

Output:

Array
(
    [0] => join.gif
    [1] => help.gif
)
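
If you already have the full src values, a non-regex route to the filenames is also possible. This is only a sketch using the standard parse_url() and basename() functions. The second sample value, with its query string, is made up for illustration.

```php
<?php
// Sample src values; the "?v=2" query string is a hypothetical addition.
$srcs = array(
    '/mainpage_images/buttons/join.gif',
    '/mainpage_images/buttons/help.gif?v=2',
);

$names = array();
foreach ($srcs as $src) {
    // Keep only the path component, which drops any query string or fragment.
    $path = parse_url($src, PHP_URL_PATH);
    if (is_string($path)) {
        // basename() returns the last path segment, i.e. the file name.
        $names[] = basename($path);
    }
}

print_r($names);
```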

My code in my original post was munged by the forum server during posting: backslashes were removed before alpha characters, and a ' was removed from within a [ ] character set and from before the pattern. These were quite unintended changes, and they make me concerned about posting source code here in future.

 

Here's the proper code posting:

http://pastebin.com/f42a371cc

 

Here's the pattern in action matching against img tags without quotes, with single quotes, and with double quotes:

http://www.myregextester.com/?r=020906eb

 

The array_filter and array_merge calls in my code merge the results for quoted and non-quoted images into one array, in case both kinds appear on the same page. You can see the results of the preg_match_all call before merging in the example link above.
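
For anyone who cannot reach the links, here is a small offline sketch of the same idea. The pattern is a reconstruction and the sample HTML is invented, so treat the pastebin copy as authoritative.

```php
<?php
// Reconstructed pattern (an assumption; see the pastebin link for the real one).
// Group 1: opening quote, group 2: quoted src, group 3: unquoted src.
$pattern = '/<img\s[^>]*\bsrc=(?:(?:([\'"])((?:(?!\1|\?).)*))|([^\s?>]*))/is';

// Invented sample: one unquoted, one single-quoted, one double-quoted src.
$html = '<img src=plain.gif> <img src=\'single.png\'> <img alt="x" src="double.jpg?v=2">';

preg_match_all($pattern, $html, $m);

// Before merging, each group has one entry per match; the unused group is "".
print_r($m[2]); // quoted results
print_r($m[3]); // unquoted results

// array_filter drops the empty strings, array_merge joins and renumbers.
$images = array_merge(array_filter($m[2]), array_filter($m[3]));
print_r($images);
```

Note that the merged array lists all quoted matches first and then the unquoted ones, so it is not in page order.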

That's perfect, thanks. This is a huge help for me.

While testing with various sites I noticed that some sites such as msn.com had img src URLs that referenced an image with a # sign and then more (unwanted) data, so I modified the expression to drop the # sign and anything after that if encountered (as it does with the ?):

http://pastebin.com/f1935e1d6
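
In case the pastebin link is unavailable, the change can be sketched as follows. This is a guess at the modification based on the description, not a copy of the linked code: "#" is added next to "?" in both places where the match stops. The sample HTML is invented.

```php
<?php
// Assumed modification: stop at "?" or "#" in both the quoted and unquoted branches.
$pattern = '/<img\s[^>]*\bsrc=(?:(?:([\'"])((?:(?!\1|[?#]).)*))|([^\s?#>]*))/is';

// Invented sample tags with a fragment, an unquoted fragment, and a query string.
$html = '<img src="chart.gif#section"> <img src=logo.png#top> <img src=\'photo.jpg?size=2\'>';

preg_match_all($pattern, $html, $m);

// Same merge step as before: quoted results first, then unquoted.
$images = array_merge(array_filter($m[2]), array_filter($m[3]));
print_r($images);
```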

Archived

This topic is now archived and is closed to further replies.
