scvinodkumar Posted August 8, 2009

I want to extract all images from the content that have a minimum width of 100 px.
thebadbad Posted August 8, 2009

Because we don't know in which order the src and width attributes appear, I think the easiest and fastest way is to grab all image tags where the width attribute is present and at least 100, and then grab each image source individually:

    <?php
    $str = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';
    //grab image tags where the width is at least 100 (three or more digits, no leading zero)
    preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $str, $matches);
    //grab the image sources
    $images = array();
    foreach ($matches[0] as $img) {
        //skip tags that have a width but no src
        if (preg_match('~src=[\'"]([^\'"]+)[\'"]~i', $img, $match)) {
            $images[] = $match[1];
        }
    }
    echo '<pre>' . print_r($images, true) . '</pre>';
    ?>

Something tells me there's a smarter solution, but I can't think of any.

Edit: This should be a bit smarter and faster:

    <?php
    $str = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';
    //grab images where the width is at least 100
    preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $str, $matches);
    //remove anything but the image sources
    $matches[0] = preg_replace('~.+?src=[\'"](.+?)[\'"].+~is', '$1', $matches[0]);
    echo '<pre>' . print_r($matches[0], true) . '</pre>';
    ?>
.josh Posted August 8, 2009

    preg_match_all('~<img[^>]*(width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?[^>]*src\s?=\s?[\'"]([^\'"]*)[\'"][^>]*(?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])[^>]*>~i',$string,$matches);
    echo "<pre>";
    print_r($matches[2]);

This will retrieve the src URL for all images with a width of 100 or more (whether it is written as 100 or 100px), regardless of the width location, spacing, quoting, or capitalization conventions.

Edit: doh! I thought it was supposed to be exactly 100; missed that 'minimum'. Regex edited.
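For readers unfamiliar with the `(?(1)|...)` part, here is a sketch of the same pattern rewritten in free-spacing mode (the added `x` flag and the `#` comments are my additions, not part of the original post). It is run against the sample string from earlier in the thread; the variable name `$pattern` is only for illustration.

```php
<?php
$string = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';

// Same pattern as above, split over several lines via the x (extended) flag.
$pattern = '~<img[^>]*
    (width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?      # group 1: width before src, optional
    [^>]*src\s?=\s?[\'"]([^\'"]*)[\'"]                  # group 2: the src value
    [^>]*
    (?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])  # if group 1 did not match, width must follow src
    [^>]*>~ix';

preg_match_all($pattern, $string, $matches);

// Group 2 holds the src of every tag that carried a qualifying width.
print_r($matches[2]);
```

The conditional is what lets a single pattern accept the width on either side of src, at the cost of a lot of backtracking.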
thebadbad Posted August 8, 2009

@CV FYI, my method is ~12 times faster than yours when testing with a string containing 50 image tags. I like your elaborate pattern, though.
.josh Posted August 8, 2009

Yeah lol... it's way faster to break it down, but what's the fun in that?
Daniel0 Posted August 9, 2009

I haven't bothered doing any benchmark, but you could also do this:

    <?php
    $html = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $images = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        // keep images whose width attribute is at least 100
        if ((int) $img->getAttribute('width') >= 100) {
            $images[] = $img->getAttribute('src');
        }
    }
    var_dump($images);

Edit: As expected, DOM is significantly slower than thebadbad's regex, but still faster than CV's regex.

    <?php
    header('Content-type: text/plain');
    $iterations = 10000;
    $tags = 50;
    // build a test string of 50 image tags with random widths
    $html = '';
    for ($i = 0; $i < $tags; ++$i) {
        $html .= '<img src="test' . $i . '.jpg" width="' . mt_rand(50, 200) . '">';
    }

    /**
     * Test DOM
     */
    $start = microtime(true);
    for ($i = 0; $i < $iterations; ++$i) {
        $doc = new DOMDocument();
        $doc->loadHTML($html);
        $images = array();
        foreach ($doc->getElementsByTagName('img') as $img) {
            if ($img->getAttribute('width') >= 100) {
                $images[] = $img->getAttribute('src');
            }
        }
    }
    echo 'Time (DOM): ' . (microtime(true) - $start);

    /**
     * Test regex
     */
    $start = microtime(true);
    for ($i = 0; $i < $iterations; ++$i) {
        preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $html, $matches);
        $matches[0] = preg_replace('~.+?src=[\'"](.+?)[\'"].+~is', '$1', $matches[0]);
    }
    echo PHP_EOL . 'Time (regex): ' . (microtime(true) - $start);

    /**
     * Test regex 2
     */
    $start = microtime(true);
    for ($i = 0; $i < $iterations; ++$i) {
        preg_match_all('~<img[^>]*(width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?[^>]*src\s?=\s?[\'"]([^\'"]*)[\'"][^>]*(?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])[^>]*>~i',$html,$matches);
    }
    echo PHP_EOL . 'Time (regex 2): ' . (microtime(true) - $start);

Output on my computer (in seconds):

    Time (DOM): 5.2172110080719
    Time (regex): 1.754891872406
    Time (regex 2): 9.2111718654633
Garethp Posted August 9, 2009

Why not just

    '~<img.*?(?:src="(.*?)".*?)?width="?[1-9][0-9][0-9]*"?.*(?:src="(.*?)".*?)?>~'

? That way it has to be at least 3 digits, with the first being at least 1, so it has to be at least 100.
Daniel0 Posted August 9, 2009

Isn't that what thebadbad's does?
.josh Posted August 9, 2009

thebadbad: Also want to point out though that even though my pattern is a lot slower, it does provide a lot more breathing room for matching. I suppose DOM still beats me out.

garethp:

    ~<img.*?(?:src="(.*?)".*?)?width="?[1-9][0-9][0-9]*"?.*(?:src="(.*?)".*?)?>~

[1-9][0-9][0-9]* will match 10 or higher, not 100 or higher, because the trailing * lets the third digit be absent. You would need to use a + instead of the *.
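To make the `*` versus `+` point concrete, here is a minimal sketch that tests only the digit part of the pattern. The list of widths is made up for illustration, and the `^`/`$` anchors are added so each value is checked as a whole.

```php
<?php
// Hypothetical sample widths, chosen only for illustration.
$widths = array('9', '10', '50', '99', '100', '256', '1024');

// With *: a non-zero digit, one more digit, then zero or more digits (two digits suffice).
$star = preg_grep('~^[1-9][0-9][0-9]*$~', $widths);

// With +: the third digit becomes mandatory, so the value is at least 100.
$plus = preg_grep('~^[1-9][0-9][0-9]+$~', $widths);

echo 'star: ' . implode(', ', $star) . PHP_EOL; // star: 10, 50, 99, 100, 256, 1024
echo 'plus: ' . implode(', ', $plus) . PHP_EOL; // plus: 100, 256, 1024
```

`[1-9][0-9]+` after a leading `[1-9]` is equivalent to the `[1-9][0-9]{2,}` used in the other patterns in this thread.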
thebadbad Posted August 9, 2009

.josh said: "thebadbad: Also want to point out though that even though my pattern is a lot slower, it does provide a lot more breathing room for matching. I suppose DOM still beats me out"

When adding that extra 'breathing room' to my patterns, it's still ~11 times faster. But I get your point.

    preg_match_all('~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i', $string, $matches);
    $matches[0] = preg_replace('~.+?\bsrc\s?=\s?[\'"](.+?)[\'"].+~is', '$1', $matches[0]);

In turn I'm also using word boundaries, to make sure invalid tags like <imgrandombull ... /> and <imgwidth="150" ... /> aren't matched.
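A small sketch of what the word boundaries buy, assuming a made-up test string that contains the two invalid tags mentioned above. The `$loose` pattern is the same expression with both `\b` removed; it is my own comparison point, not something posted in the thread.

```php
<?php
// Hypothetical markup: two valid tags plus the two invalid ones named above.
$string = '<img src="a.jpg" width="150" /> <imgrandombull src="b.jpg" width="150" />'
        . ' <imgwidth="150" src="c.jpg" /> <img width = "300px" src=\'d.jpg\' />';

// Pattern from the post, with word boundaries.
$strict = '~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i';
// Same pattern without the boundaries, for comparison only.
$loose  = '~<img[^>]+width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i';

$strictCount = preg_match_all($strict, $string, $matches);
$looseCount  = preg_match_all($loose, $string, $looseMatches);

// Reduce each strictly matched tag to its src value, as in the post.
$sources = preg_replace('~.+?\bsrc\s?=\s?[\'"](.+?)[\'"].+~is', '$1', $matches[0]);

echo "strict: $strictCount, loose: $looseCount" . PHP_EOL; // strict: 2, loose: 3
print_r($sources);                                         // a.jpg, d.jpg
```

In this sample the pattern without boundaries additionally accepts the `<imgrandombull ...>` tag, which is the kind of false positive the `\b` after `img` rules out.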
.josh Posted August 9, 2009

Garet did have one thing in there that we didn't think of: making the quotes around the width optional.
.josh Posted August 9, 2009

Oh, and also for yours: I wonder if, instead of doing a preg_replace, you were to do

    $matched = implode('',$matches[0]);
    preg_match_all('~src\s?=\s?[\'"]([^\'"]*)[\'"]~i',$matched,$matches);

that might possibly be faster.
thebadbad Posted August 9, 2009

.josh said: "Garet did have one thing in there that we didn't think of: making the quotes around the width optional."

Yea, but that's not valid HTML. But who knows if the content he's grabbing from is, so good point.

.josh said: "oh and also for yours: I wonder if instead of doing a preg_replace, if you were to do

    $matched = implode('',$matches[0]);
    preg_match_all('~src\s?=\s?[\'"]([^\'"]*)[\'"]~i',$matched,$matches);

that might possibly be faster."

Good idea. It's ~20% faster when testing with a string containing 50 tags.
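Putting the two steps together, a minimal end-to-end sketch of the implode() variant might look like this. It reuses the sample string from earlier in the thread; the result variable `$sources` is a name chosen here for illustration.

```php
<?php
$string = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';

// Step 1: keep only the tags whose width is at least 100.
preg_match_all('~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i', $string, $matches);

// Step 2: join the surviving tags and pull every src value in a single pass.
$matched = implode('', $matches[0]);
preg_match_all('~\bsrc\s?=\s?[\'"]([^\'"]*)[\'"]~i', $matched, $sources);

print_r($sources[1]); // test1.jpg, test3.jpg
```

One behavioural difference worth noting: a tag that passes the width filter but has no src contributes nothing here, whereas the preg_replace version would leave that whole tag in the result array.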
Daniel0 Posted August 9, 2009

.josh said: "Garet did have one thing in there that we didn't think of: making the quotes around the width optional."

That's why I like using DOM for parsing HTML. You don't have to think about optional quotes, attribute order, etc. It's also much more readable, and it's quick to write. You don't just look at the first regex you made and see exactly what it's doing. The DOM interface is much more descriptive and you should be able to figure out the semantics instantaneously. I think the performance overhead is worth it in the majority of cases.

thebadbad said: "Yea, but that's not valid HTML. But who knows if the content he's grabbing from is, so good point."

Yes it is. In HTML 4, attribute quoting is optional as long as you only use letters, digits, periods, hyphens, underscores and colons in the attribute value.
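As a sketch of that point, the DOM loop from earlier in the thread can be fed markup that mixes unquoted values, single quotes, a "px" suffix and both attribute orders. The sample markup is invented for illustration, and the libxml error-silencing calls are an addition that assumes parser warnings are not of interest here.

```php
<?php
// Hypothetical markup: unquoted values, single quotes, "px" suffix, either attribute order.
$html = '<img src=a.jpg width=150> <img width=\'300px\' src="b.jpg"> '
      . '<img src="c.jpg" width="50"> <img src="d.jpg">';

// Collect libxml warnings internally instead of emitting them (assumption: not needed here).
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    // (int) turns "300px" into 300 and a missing attribute ("") into 0.
    if ((int) $img->getAttribute('width') >= 100) {
        $images[] = $img->getAttribute('src');
    }
}

print_r($images); // a.jpg, b.jpg
```

The loop itself does not change when the quoting or ordering conventions do, which is the readability argument made above.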
.josh Posted August 9, 2009

Yeah, overall I think the DOM approach would be the best solution. One-lining it (mine) vs. speed (thebadbad's) is more of a thought exercise than anything.
thebadbad Posted August 9, 2009

I agree with both of you. And you're right, Daniel; I didn't know it was optional. But I see that it's required when writing XHTML.
nrg_alpha Posted August 10, 2009

thebadbad said: "But I see that it's required when writing XHTML."

Yep.

<sidenote> That's what I like about XHTML: required quotes, element and attribute names must be in lowercase, self-closing tags, etc. While I can see the 'flexibility' in old-school HTML's system, I personally prefer XHTML's more. All I can say is I'm really grateful X/HTML 5 will support both. I for one will continue to embrace the XHTML way. </sidenote>
thebadbad Posted August 10, 2009

I also prefer XHTML over HTML any day. Being a bit of a perfectionist, the stricter the code the better.
.josh Posted August 10, 2009

Yeah well... in the real world, most people don't have time to make stuff perfect. I've got too much work on my plate on any given day to make sure all my t's are crossed and i's are dotted. If what I do works in most major browsers, as far as I'm concerned, it's good enough. I promise you, the people with the money who make the decisions don't care one whit about that sort of thing. As long as they can go to the site and it looks pretty and does what it's supposed to, it's all good.

I'm not necessarily promoting bad coding... I'm just sayin'... when you have 500 things to do, you have to prioritize. Which is why I lean towards "breathing room" regexes etc., because I know most other people out there outside of the classroom are under the same pressures, and therefore do the same thing.
nrg_alpha Posted August 10, 2009

No disputes there. I don't do this stuff for a living... more of a twisted penchant for learning webdev (with the possibility later on of doing contract work). So yeah, time and budgets might certainly dictate otherwise. I'm not in that boat, so I have the time to tweak and nudge things here and there.
thebadbad Posted August 10, 2009

Yeah, obviously it depends on why you're coding. NRG put it pretty well.