REGEX help, build a list of images on a page


EthanV2

Hi, just wondering if anyone could help me out with this. I just need a piece of code, or a function, that will search a string for any images, then put the list of images into an array.

 

So far all of my Googling hasn't turned up any results, so any help would be much appreciated.

 

I'll be using the function to scan an HTML page, and generate a list of images embedded on that page (using the <img> tag), but I just can't seem to get it to work.

This is a starter function. You could extend it by checking the returned array for relative URLs and prepending the directory URL of the page, so that each image src is a complete URL.

<pre>
<?php
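function getimgs($page){
  // Fetch the page, following redirects and returning the body as a string
  $ch=curl_init();
  curl_setopt($ch, CURLOPT_URL, $page);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
  $html=curl_exec($ch);
  curl_close($ch);
  if($html===false) return array(); // request failed, nothing to scan
  // Group 2: quoted src value (up to the closing quote or a "?"); group 3: unquoted src value
  preg_match_all('/<img\s[^>]*\bsrc=(?:(?:([\'"])((?:(?!\1|\?).)*))|([^\s?>]*))/is',$html,$images);
  // Drop empty captures and merge the quoted and unquoted results into one list
  $images=array_merge(array_filter($images[2]),array_filter($images[3]));
  return $images;
}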
echo print_r(getimgs('http://www.phpfreaks.com/forums/'),true);
?>
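
The extension suggested above (completing relative URLs) could look roughly like the sketch below. The helper name `absolutize_imgs` and its parameters are hypothetical, not part of the original post. It only handles already-absolute, protocol-relative, root-relative and plain relative paths, and does not resolve `../` segments.

```php
<?php
// Hypothetical helper (an assumption, not from the original post):
// turn the src values returned by getimgs() into absolute URLs.
function absolutize_imgs(array $images, $pageUrl) {
    $parts  = parse_url($pageUrl);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    if (isset($parts['port'])) {
        $host .= ':' . $parts['port'];
    }
    $root = $scheme . '://' . $host;
    $path = isset($parts['path']) ? $parts['path'] : '/';
    // Directory of the page: everything up to and including the last "/"
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    $out = array();
    foreach ($images as $src) {
        if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $src)) {
            $out[] = $src;                  // already has a scheme (http:, https:, data:)
        } elseif (substr($src, 0, 2) === '//') {
            $out[] = $scheme . ':' . $src;  // protocol-relative
        } elseif (substr($src, 0, 1) === '/') {
            $out[] = $root . $src;          // root-relative
        } else {
            $out[] = $root . $dir . $src;   // relative to the page's directory
        }
    }
    return $out;
}
```

You would call it as `absolutize_imgs(getimgs($url), $url)`, passing the same page URL to both functions.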

This solution assumes you only want the path of each image (directories plus file name), and that the src is double-quoted.

 

$str = <<<DATA
<table id="logintable"> 
<tr id="usernamerow"> 
	<th rowspan="3" valign="top">Login</th> 
	<td>Username:</td> 
	<td><input type="text" class="text150" name="vb_login_username" /></td> 
	<td id="submit"><input type="image" src="/mainpage_images/buttons/submit.gif" alt="Submit" /></td>
</tr> 
<tr id="passwordrow"> 
	<td>Password:</td> 
	<td><input type="password" class="text150" name="vb_login_password" /></td> 
	<td id="join"><a href="/forums/register.php"><img src="/mainpage_images/buttons/join.gif" alt="Join" title="Join" border="0" /></a></td> 
</tr> 
<tr id="remembermerow"> 
	<td align="right"><input type="checkbox" name="cookieuser" value="1" tabindex="3" id="cb_cookieuser_navbar" accesskey="c" checked="checked" /></td><td align="left"><label for="cb_cookieuser_navbar">Remember Me</label></td> 
	<td id="help"><a href="/forums/faq.php?"><img src="/mainpage_images/buttons/help.gif" alt="Help" title="Help" width="15" height="15" border="0" /></a></td> 
</tr> 
</table> 
DATA;

preg_match_all('#<img[^"]+"([^"]+)"#', $str, $matches);
echo '<pre>';
print_r($matches[1]);
echo '</pre>';

 

Output:

Array
(
    [0] => /mainpage_images/buttons/join.gif
    [1] => /mainpage_images/buttons/help.gif
)

 

If you want only the filenames, and nothing else, you can use this pattern instead:

preg_match_all('#<img.+?([^/]+\.[^"]+)"#', $str, $matches);

 

Output:

Array
(
    [0] => join.gif
    [1] => help.gif
)
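
One caveat: the first pattern above captures the first double-quoted value after `<img`, so it relies on src being the first attribute in the tag. Below is a minimal sketch that anchors on the src attribute itself and uses `basename()` for the file names. The sample markup is made up for illustration, and the pattern still assumes double-quoted values.

```php
<?php
// Sample markup (an assumption for illustration): src is not the first attribute in tag one
$str = '<img alt="Join" src="/mainpage_images/buttons/join.gif" />'
     . '<img src="/mainpage_images/buttons/help.gif" alt="Help" />';

// Anchor on src="..." rather than on the first quoted value after <img
preg_match_all('#<img\b[^>]*\bsrc="([^"]+)"#i', $str, $matches);

$paths = $matches[1];                    // full paths as written in the markup
$names = array_map('basename', $paths);  // file names only

print_r($paths);
print_r($names);
```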

My code in my original post was munged by the forum server during posting: backslashes were removed before alpha characters, and a ' was removed from within a [ ] character set and before the pattern. These unintended changes make me wary of posting source code here in future.

 

Here's the proper code posting:

http://pastebin.com/f42a371cc

 

Here's the pattern in action matching against img tags without quotes, with single quotes, and with double quotes:

http://www.myregextester.com/?r=020906eb

 

The array_filter and array_merge calls in my code merge the quoted and non-quoted image results into one array, in case both kinds appear on the same page. You can see the results of the preg_match_all call before merging in the example link above.
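
To make the merge step concrete, here is a small sketch on a made-up string. The pattern is an assumed reading of the one discussed in this thread, so treat it as illustrative rather than as the exact pastebin version.

```php
<?php
// Made-up input with a double-quoted, an unquoted and a single-quoted src
$html = '<img src="a.gif"><img src=b.jpg><img src=\'c.png?x=1\'>';

// Group 2 holds quoted values, group 3 holds unquoted values;
// whichever group did not take part in a match is left as an empty string
preg_match_all('/<img\s[^>]*\bsrc=(?:(?:([\'"])((?:(?!\1|\?).)*))|([^\s?>]*))/is', $html, $m);

// array_filter drops the empty strings, array_merge joins and reindexes the two lists
$merged = array_merge(array_filter($m[2]), array_filter($m[3]));
print_r($merged);
```

Note that the merged list puts all quoted results first and unquoted results after them, so it does not preserve the order in which the images appear on the page.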

That's perfect, thanks. This is a huge help for me.
