Excluding a 'word' from matches - PHP Regex

webhead2 · March 10, 2009

Hi.

I am building a scraper that gathers information from a couple of websites (I have permission).

Sometimes I scrape image tags. Sometimes the image tags have a full URL as the src attribute, sometimes they don't. I need to fix the ones that don't and leave the ones that have "http://" alone.

My thought was to build a regex to match the img tag and exclude it if 'http' is found.

Here's a working regex that finds the image tag and source correctly (and replaces it):

$text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);

The problem is that this also rewrites image sources that already contain a full URL, so the output looks like http://..blah/blah/http://blah...

Is there a way to exclude 'http://' from the matches?

Thanks in advance.

JonnoTheDev · March 10, 2009

Personally I wouldn't do it like that.

If all the image sources are from the same url i.e. http://www.xyz.com/images then I would just add it on later.

Extract all the paths from the img src, then check if the url is included

i.e.

$url  = "http://www.xyz.com/";
$src = "images/x.jpg";
// if the src is not absolute prepend with the url
if(!strstr($src, $url)) {
  // will give me http://www.xyz.com/images/x.jpg
  $src = $url.$src;
}
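
A slightly stricter variant of the check above, as a sketch only. strstr() reports whether $url appears anywhere in $src, so an absolute link to a different host would still get the base prepended. The helper name make_absolute() is hypothetical, and the sketch assumes "absolute" means the src starts with http:// or https://.

```php
<?php
// Hypothetical helper: prepend $base only when $src has no scheme of its own.
function make_absolute($src, $base)
{
    // Assumption: "absolute" means the src begins with http:// or https://
    if (preg_match('#^https?://#i', $src)) {
        return $src;
    }
    // Avoid a doubled or missing slash where the two parts join
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

echo make_absolute('images/x.jpg', 'http://www.xyz.com/'), "\n";
echo make_absolute('http://www.abc.com/y.jpg', 'http://www.xyz.com/'), "\n";
```

The first call prepends the base; the second leaves the src untouched because it already carries a scheme.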

webhead2 · March 10, 2009

Thank you, however, I can't really do it like that. I have over 20 websites and will be adding more in the future. I don't have control over the format of the incoming links in all cases.

My script needs to be dynamic. Some sources have full paths and some do not.

Note: The contents of $text is the scraped HTML from a page.

$text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);

Hopefully this clarifies. Thanks!

JonnoTheDev · March 10, 2009

"I don't have control over the format of the incoming links in all cases. My script needs to be dynamic."

I don't understand.

If you have the url that you are scraping image paths from then why can this not be prepended to the src attribute if it doesn't already exist? That is what you are trying to do from your initial post.

webhead2 · March 10, 2009

Not exactly.

Some of the src's come in as

src="http://......

Others come in as

src="images/...."

$text contains all the html of a single webpage within a loop.

I am cleaning the text before insertion into a database, so that the images and links will display from my server.

Hence, images with full paths already supplied need to be left alone.

Thanks again mate!

JonnoTheDev · March 10, 2009

Ah, so you are not extracting the image paths. You are storing the entire page HTML and want to replace the src attribute within the HTML.

webhead2 · March 10, 2009

Correct.

I didn't go to the bother of breaking $text down into smaller pieces. It contains everything within the <body> tags. On the next iteration of the loop, $text will contain the HTML from a different page.

I am trying to avoid breaking down $text into an array, if at all possible. All I intend to do is store the entire chunk into a database field (once sanitized) along with some other info. My one line of code is getting me really close.

In effect, this line looks for all occurrences of an img tag and replaces each one (if needed)...

$text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);

Note: I will need to change <img src="$1"> to whatever I need, not a problem. I am looking to exclude the occurrence of "http://" in the regex itself. Wouldn't that be nice?

But the above also replaces images that already have full paths supplied - NOT Good

JonnoTheDev · March 10, 2009

Here you go. Use this as a starter

$url = "http://www.test.com/";
$html = '<img src="http://www.test.com/images/x.jpg"><br /><img src="images/y.jpg"><br /><img src="images/abc.jpg">';
$result = preg_replace('%<img src="([^http://])([a-zA-Z0-9\\./-_ ]+)"%', '<img src="'.$url.'$1$2', $html);

print "Original html:<br />";
print "<xmp>".$html."</xmp>";
print "New html:<br />";
print "<xmp>".$result."</xmp>";

webhead2 · March 10, 2009

I tested this, and it definitely ignores the full URLs.

But try adding alt or other attributes to your images. I get output like:

img src=http://www.ppcwebspy.com/mg/checkmark.png%3C/td%3E%20%20%20%20%20%20%20%20%20%20%20%20%3Ctd%20width=

$text2 = preg_replace('/<img src="([^http:])([^"]+)"[^>]+?>/i', '<img src="http://'.$out.'/$2', $text);

Note that the "i" in img is truncated and the regex picks up stuff after the img path.

I also tried using your REGEX code verbatim with similar results.

This:

$text2 = preg_replace('%<img src="([^http://])([a-zA-Z0-9\\./-_ ]+)"%', '<img src="http://'.$out.'/$2', $text);

Produces this:

img src=http://www.ppcwebspy.com/mg/PPCWebSpy_box_small2.jpg%20align=

JonnoTheDev · March 10, 2009

Will look later. You should be able to extend the regex to capture the entire image tag with all attributes. The example was just to start you off.

A useful tool is regex buddy http://www.regexbuddy.com/

Get a copy of this.

webhead2 · March 10, 2009

I already have regex buddy, thanks.

Update:

This gets me real close:

$text2 = preg_replace('/<img[^>]+?src="([^http:\/\/])([^"]+)"[^>]+?>/i', '<center><img src="http://'.$out.'/$1/$2"></center>', $text);

Output:

http://www.ppcwebspy.com/i/mg/ppc_web_spy_small.jpg

What's with the "i"?

JonnoTheDev · March 10, 2009

This

$out.'/$1/$2

change to

$out.'/$1$2

Nearly there
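
To see where the stray "i" came from, it may help to print what each group captures. This is a sketch with a made-up sample tag: [^http:\/\/] matches exactly one character that is not h, t, p, : or /, so $1 holds only the first letter of the path and $2 holds the rest.

```php
<?php
// Made-up sample tag for illustration
$tag = '<img alt="box" src="img/ppc_web_spy_small.jpg" width="100">';

// Same pattern as in the post above
preg_match('/<img[^>]+?src="([^http:\/\/])([^"]+)"[^>]+?>/i', $tag, $m);

echo $m[1], "\n";               // first character of the path
echo $m[2], "\n";               // remainder of the path
echo $m[1] . '/' . $m[2], "\n"; // what '$1/$2' produces
echo $m[1] . $m[2], "\n";       // what '$1$2' produces
```

Joining the two groups without a slash restores the original path, which is why the change above removes the "i/" from the output.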

webhead2 · March 10, 2009

Perfect. Kudos to you mate.

.josh · March 10, 2009

That regex isn't right. You can't put blocks of text inside character classes like that. A character class only ever matches a single character, so [^http://] matches any one character that is not h, t, p, : or /, rather than excluding the string "http://" as a whole. That's what lookaheads and lookbehinds are for.

~<img.*?src\s?=\s?"(?!http:)[^"]*"[^>]*>~is
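
The pattern above only matches; it has no capture groups for a replacement. One way it might be wired into preg_replace is sketched below. The $base value, the sample markup, and the extra capture groups are assumptions for illustration, and the lookahead is widened to https? so that https:// sources are skipped as well.

```php
<?php
// Hypothetical base URL and sample markup for illustration
$base = 'http://www.test.com/';
$html = '<img src="http://www.test.com/images/x.jpg">'
      . '<img alt="y" src="images/y.jpg" width="10">'
      . '<img src="pics/z.jpg">';

// (?!https?://) is a negative lookahead: it fails the match when the
// src value starts with http:// or https://, without consuming any text.
$pattern = '~(<img\b[^>]*?\bsrc\s*=\s*")(?!https?://)([^"]*)("[^>]*>)~i';

// Keep everything around the src value and prepend the base URL to it
$result = preg_replace($pattern, '${1}' . $base . '${2}${3}', $html);

echo $result, "\n";
```

The first tag is left alone, the second keeps its alt and width attributes, and the third is rewritten even though its path starts with "p", which the character-class version would have skipped.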
