webhead2 Posted March 10, 2009
Hi. I am building a scraper that gathers information from a couple of websites (I have permission). Sometimes I scrape image tags. Sometimes the image tags have a full URL as the src attribute, sometimes they don't. I need to fix the ones that don't and leave the ones that have "http://" alone. My thought was to build a regex that matches the img tag but skips it if 'http' is found. Here's a working regex that finds the image tag and source correctly (and replaces it):
    $text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);
The problem is that this also replaces image sources that already have the full path. This outputs http://..blah/blah/http://blah... Is there a way to exclude 'http://' from the matches? Thanks in advance.
JonnoTheDev Posted March 10, 2009
Personally I wouldn't do it like that. If all the image sources are from the same URL, i.e. http://www.xyz.com/images, then I would just add it on later. Extract all the paths from the img src, then check if the URL is included, i.e.
    $url = "http://www.xyz.com/";
    $src = "images/x.jpg";
    // if the src is not absolute prepend with the url
    if(!strstr($src, $url)) {
        // will give me http://www.xyz.com/images/x.jpg
        $src = $url.$src;
    }
webhead2 (Author) Posted March 10, 2009
Thank you, however, I can't really do it like that. I have over 20 websites and will be adding more in the future. I don't have control over the format of the incoming links in all cases. My script needs to be dynamic. Some sources have full paths and some do not. Note: $text contains the scraped HTML from a page.
    $text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);
Hopefully this clarifies. Thanks!
JonnoTheDev Posted March 10, 2009
Quoting webhead2: "I don't have control over the format of the incoming links in all cases. My script needs to be dynamic."
I don't understand. If you have the URL that you are scraping image paths from, why can't it be prepended to the src attribute when it isn't already there? That is what you were trying to do in your initial post.
webhead2 (Author) Posted March 10, 2009
Not exactly. Some of the srcs come in as src="http://...... Others come in as src="images/...." $text contains all the HTML of a single webpage within a loop. I am cleaning the text before insertion into a database, so that the images and links will display from my server. Hence, images with full paths already supplied need to be left alone. Thanks again mate!
JonnoTheDev Posted March 10, 2009
Ah, so you are not extracting the image paths. You are storing the entire page HTML and want to replace the src attribute within the HTML.
webhead2 (Author) Posted March 10, 2009
Correct. I didn't go to the bother of breaking $text down into smaller pieces. It contains everything within the <body> tags. On the next iteration of the loop, $text will contain the HTML from a different page. I am trying to avoid breaking down $text into an array, if at all possible. All I intend to do is store the entire chunk in a database field (once sanitized) along with some other info. My one line of code is getting me really close. In effect, this line looks for all occurrences of an img tag and replaces each one (if needed)...
    $text2 = preg_replace('/<img[^>]+?src="([^"]+)"[^>]+?>/i', '<img src="$1">', $text);
Note: I will need to change <img src="$1"> to whatever I need, not a problem. I am looking to exclude the occurrence of "http://" in the regex itself. Wouldn't that be nice? But the above also replaces images that already have full paths supplied. NOT good.
JonnoTheDev Posted March 10, 2009
Here you go. Use this as a starter:
    $url = "http://www.test.com/";
    $html = '<img src="http://www.test.com/images/x.jpg"><br /><img src="images/y.jpg"><br /><img src="images/abc.jpg">';
    $result = preg_replace('%<img src="([^http://])([a-zA-Z0-9\\./-_ ]+)"%', '<img src="'.$url.'$1$2', $html);
    print "Original html:<br />";
    print "<xmp>".$html."</xmp>";
    print "New html:<br />";
    print "<xmp>".$result."</xmp>";
webhead2 (Author) Posted March 10, 2009
I tested this, and it definitely ignores the full URLs. But try adding alt or other attributes to your images. I get output like:
    img src=http://www.ppcwebspy.com/mg/checkmark.png%3C/td%3E%20%20%20%20%20%20%20%20%20%20%20%20%3Ctd%20width=
    $text2 = preg_replace('/<img src="([^http:])([^"]+)"[^>]+?>/i', '<img src="http://'.$out.'/$2', $text);
Note that the "i" in img is truncated and the regex picks up stuff after the img path. I also tried using your regex code verbatim with similar results. This:
    $text2 = preg_replace('%<img src="([^http://])([a-zA-Z0-9\\./-_ ]+)"%', '<img src="http://'.$out.'/$2', $text);
Produces this:
    img src=http://www.ppcwebspy.com/mg/PPCWebSpy_box_small2.jpg%20align=
JonnoTheDev Posted March 10, 2009
Will look later. You should be able to extend the regex to capture the entire image tag with all attributes. The example was just to start you off. A useful tool is RegexBuddy: http://www.regexbuddy.com/ Get a copy of this.
webhead2 (Author) Posted March 10, 2009
I already have RegexBuddy, thanks. Update: this gets me real close:
    $text2 = preg_replace('/<img[^>]+?src="([^http:\/\/])([^"]+)"[^>]+?>/i', '<center><img src="http://'.$out.'/$1/$2"></center>', $text);
Output:
    http://www.ppcwebspy.com/i/mg/ppc_web_spy_small.jpg
What's with the "i"?
JonnoTheDev Posted March 10, 2009
Change this:
    $out.'/$1/$2
to this:
    $out.'/$1$2
Nearly there.
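To see where the stray "i" comes from, it may help to dump the capture groups. This is only a minimal sketch: the sample tag and the example.com host are made up for illustration, and the pattern is the one from the update above. The first group holds the single character matched by the character class, and the second group holds the rest of the path, so a slash between $1 and $2 splits "img" into "i/mg".

```php
<?php
// Hypothetical sample tag, shaped like the ones discussed in the thread.
$tag = '<img src="img/ppc_web_spy_small.jpg" alt="box">';

// Pattern copied from the update above.
$pattern = '/<img[^>]+?src="([^http:\/\/])([^"]+)"[^>]+?>/i';

preg_match($pattern, $tag, $m);

// $m[1] is one character (the class matches exactly one),
// $m[2] is everything after it up to the closing quote.
$withSlash    = 'http://example.com/' . $m[1] . '/' . $m[2];
$withoutSlash = 'http://example.com/' . $m[1] . $m[2];

echo $m[1], PHP_EOL;          // i
echo $m[2], PHP_EOL;          // mg/ppc_web_spy_small.jpg
echo $withSlash, PHP_EOL;     // http://example.com/i/mg/ppc_web_spy_small.jpg
echo $withoutSlash, PHP_EOL;  // http://example.com/img/ppc_web_spy_small.jpg
```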
webhead2 (Author) Posted March 10, 2009
Perfect. Kudos to you mate.
.josh Posted March 10, 2009
That regex isn't right. You can't put blocks of text inside character classes like that. A character class only ever matches a single character, so [^http://] rejects any one character that is h, t, p, : or /, rather than rejecting the string http:// as a whole. That's what lookbehinds and lookaheads are for.
    ~<img.*?src\s?=\s?"(?!http:)[^"]*"[^>]*>~is
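The lookahead idea can be sketched as a complete replacement. This is an illustration under stated assumptions, not the exact code anyone in the thread used: $base and the sample HTML are hypothetical, the pattern is a variation on the one above that also skips https://, and it uses [^>]*? instead of .*? so that a match cannot run past the end of one tag. Only the src value is rewritten, so alt and other attributes are left as they were.

```php
<?php
// Hypothetical base URL and sample HTML, for illustration only.
$base = 'http://www.test.com/';
$html = '<img src="http://www.test.com/images/x.jpg" alt="a">'
      . '<img class="c" src="images/y.jpg" alt="b">'
      . '<img src="photos/z.png">';

// Group 1: the tag up to and including the opening quote of src.
// (?!https?://): negative lookahead, fails if the value starts with
//                http:// or https://, and consumes no characters.
// Group 2: the src value and its closing quote.
$pattern = '~(<img\b[^>]*?\bsrc\s*=\s*")(?!https?://)([^"]*")~i';

// ${1} and ${2} keep the references unambiguous next to the URL text.
$result = preg_replace($pattern, '${1}' . $base . '${2}', $html);

echo $result, PHP_EOL;
```

With this input, the first tag is left alone and the other two gain the base URL, including photos/z.png, which a character class such as [^http://] would skip because it starts with "p". Root-relative paths like /images/x.jpg and single-quoted attributes are not handled in this sketch.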