
Excluding a 'word' from matches - PHP Regex


webhead2


Hi.

 

I am building a scraper that gathers information from a couple of websites (I have permission). 

 

Sometimes I scrape image tags.  Sometimes the image tags have a full URL as the src attribute, sometimes they don't.  I need to fix the ones that don't and leave the ones that have "http://" alone.

 

My thought was to build a regex to match the img tag and exclude it if 'http' is found.

 

Here's a working regex that finds the image tag and source correctly (and replaces it):

 

$text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);

 

The problem is that this also rewrites image sources that already have a full path.  The output then looks like http://..blah/blah/http://blah...

 

Is there a way to exclude 'http://' from the matches?

 

Thanks in advance.

 


Personally I wouldn't do it like that.

If all the image sources are from the same url i.e. http://www.xyz.com/images then I would just add it on later.

 

Extract all the paths from the img src, then check if the url is included

i.e.

$url = "http://www.xyz.com/";
$src = "images/x.jpg";
// if the src does not already start with the url, prepend it
if (strpos($src, $url) !== 0) {
  // will give me http://www.xyz.com/images/x.jpg
  $src = $url . $src;
}
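
The check above only looks for $url itself, so an image hosted on some other domain would still get the prefix. A minimal sketch of a more general test, assuming a hypothetical helper named make_absolute() (not from this thread) and assuming every relative path is relative to the site root:

```php
<?php
// Hypothetical helper for illustration: returns $src unchanged when it
// already starts with http:// or https://, otherwise prepends $base.
function make_absolute(string $src, string $base): string
{
    if (preg_match('#^https?://#i', $src)) {
        return $src; // already absolute, leave it alone
    }
    // Normalise the slash between base and path so it is neither doubled nor missing.
    return rtrim($base, '/') . '/' . ltrim($src, '/');
}

echo make_absolute('images/x.jpg', 'http://www.xyz.com/'), "\n";
echo make_absolute('http://cdn.example.org/y.jpg', 'http://www.xyz.com/'), "\n";
```

The first call prints http://www.xyz.com/images/x.jpg and the second prints its input unchanged. Paths relative to a sub-directory of the page (for example ../images/x.jpg) would need extra handling that this sketch does not attempt.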


Thank you, however, I can't really do it like that. I have over 20 websites and will be adding more in the future. I don't have control over the format of the incoming links in all cases.

 

My script needs to be dynamic. Some sources have full paths and some do not.

 

Note: The contents of $text is the scraped HTML from a page.

 

$text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);

 

Hopefully this clarifies. Thanks!

 

 

 


I don't have control over the format of the incoming links in all cases. My script needs to be dynamic.

 

I don't understand.

 

If you have the url that you are scraping image paths from then why can this not be prepended to the src attribute if it doesn't already exist? That is what you are trying to do from your initial post.


Not exactly. 

 

Some of the src's come in as

 

Others come in as

src="images/...."

 

$text contains all the html of a single webpage within a loop.

 

I am cleaning the text before insertion into a database, so that the images and links will display from my server. 

 

Hence, images with full paths already supplied need to be left alone. 

 

Thanks again mate!


Ah, so you are not extracting the image paths. You are storing the entire page HTML and want to replace the src attribute within the HTML.

 

Correct.

 

I didn't go to the bother of breaking $text down into smaller pieces. It contains everything within the <body> tags. On the next iteration of the loop, $text will contain the HTML from a different page.

 

I am trying to avoid breaking down $text into an array, if at all possible. All I intend to do is store the entire chunk into a database field (once sanitized) along with some other info. My one line of code is getting me really close.

 

In effect, this line looks for all occurrences of an img tag and replaces each one (if needed)...

 

$text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);

 

Note: I will need to change <img src="$1"> to whatever I need, which is not a problem.  I am looking to exclude the occurrence of "http://" in the regex itself. Wouldn't that be nice?

 

But the above also replaces images that already have full paths supplied - NOT Good :)

 

 


Here you go. Use this as a starter

 

$url = "http://www.test.com/";
$html = '<img src="http://www.test.com/images/x.jpg"><br /><img src="images/y.jpg"><br /><img src="images/abc.jpg">';
$result = preg_replace('%<img src="([^http://])([a-zA-Z0-9\\./-_ ]+)"%', '<img src="'.$url.'$1$2', $html);

print "Original html:<br />";
print "<xmp>".$html."</xmp>";
print "New html:<br />";
print "<xmp>".$result."</xmp>";


I tested this, and it definitely ignores the full URLs.

 

But try adding alt or other attributes to your images. I get output like:

 

img src=http://www.ppcwebspy.com/mg/checkmark.png%3C/td%3E%20%20%20%20%20%20%20%20%20%20%20%20%3Ctd%20width=

 

$text2 = preg_replace('/<img src="([^http:])([^"]+)"[^>]+?>/i', '<img src="http://'.$out.'/$2', $text);

 

Note that the "i" in img is truncated and the regex picks up stuff after the img path.

 

I also tried using your REGEX code verbatim with similar results.

 

This:

$text2 = preg_replace('%<img src="([^http://])([a-zA-Z0-9\\./-_ ]+)"%', '<img src="http://'.$out.'/$2', $text);

 

Produces this:

 

img src=http://www.ppcwebspy.com/mg/PPCWebSpy_box_small2.jpg%20align=


Will look later. You should be able to extend the regex to capture the entire image tag with all attributes. The example was just to start you off.

 

A useful tool is regex buddy http://www.regexbuddy.com/

Get a copy of this.

 

I already have regex buddy, thanks.  :)

 

Update:

 

This gets me real close:

 

$text2 = preg_replace('/<img[^>]+?src="([^http:\/\/])([^"]+)"[^>]+?>/i', '<center><img src="http://'.$out.'/$1/$2"></center>', $text);

 

Output:

 

http://www.ppcwebspy.com/i/mg/ppc_web_spy_small.jpg

 

What's with the "i"?

 


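That regex isn't right. You can't put blocks of text inside character classes like that.  A character class only matches a single character, so [^http://] matches any one character that isn't h, t, p, : or /, rather than excluding the whole string http://.  That is most likely where your stray "i" comes from: it is the first character of the path, captured on its own as $1.  Excluding a whole string is what lookaheads and lookbehinds are for.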
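
The difference can be checked in isolation. This is a small sketch with made-up src values (the pics/ and a.b names are assumptions for illustration only), comparing the character class with a negative lookahead:

```php
<?php
// [^http://] is a character class: it matches ONE character that is not
// h, t, p, ':' or '/'. It does not mean "not the string http://".
$class = '~src="([^http://])~';
// (?!http://) is a negative lookahead: it fails only if the text at this
// position starts with the whole string http://.
$lookahead = '~src="(?!http://)~';

$samples = ['src="images/y.jpg"', 'src="pics/y.jpg"', 'src="http://a.b/y.jpg"'];
foreach ($samples as $s) {
    printf("%-24s class:%d lookahead:%d\n", $s, preg_match($class, $s), preg_match($lookahead, $s));
}

// The class version also swallows the first character of the path into $1.
preg_match($class, 'src="images/y.jpg"', $m);
echo 'captured by the class: ', $m[1], "\n";
```

The class pattern rejects pics/y.jpg simply because it starts with "p", and it captures the lone "i" of images/y.jpg. The lookahead accepts both relative paths, rejects only the http:// one, and consumes no characters.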

 

~<img[^>]*?src\s*=\s*"(?!http:)[^"]*"[^>]*>~i
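
Here is one way the lookahead could be dropped into the preg_replace call from the first post. This is only a sketch: $base and the sample HTML are assumptions, it handles double-quoted src attributes only, and it treats every relative path as relative to the site root.

```php
<?php
// Assumed inputs: $base is the site being scraped, $text is the page HTML.
$base = 'http://www.test.com/';
$text = '<img src="http://www.test.com/images/x.jpg" alt="x"><br />'
      . '<img class="pic" src="images/y.jpg" alt="y"><br />'
      . '<img src="/images/abc.jpg">';

// Group 1: the tag up to and including src=" (other attributes are kept).
// (?!(?:https?:)?//) skips src values starting with http://, https:// or //.
// /? drops one leading slash so it is not doubled after $base.
// Group 2: the rest of the path plus the closing quote.
$pattern = '~(<img\b[^>]*?\ssrc\s*=\s*")(?!(?:https?:)?//)/?([^"]*")~i';

// ${1} is used instead of $1 so the digit cannot run into the text after it.
$text2 = preg_replace($pattern, '${1}' . $base . '${2}', $text);

echo $text2, "\n";
```

With this input the first tag is left untouched and the other two get http://www.test.com/ prepended, with their alt and class attributes intact. Since only the src value is rewritten, the remainder of each tag never has to be matched at all.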
