Extract image having width 100px

scvinodkumar · August 8, 2009

I want to extract all images from the content that have a minimum width of 100px.

thebadbad · August 8, 2009

Because we don't know in which order the src and width attributes appear, I think the easiest and fastest way is to grab all image tags where the width attribute is present and at least 100, and then grab each image source individually:

<?php
$str = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';
//grab image tags where the width is at least 100
preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $str, $matches);
//grab the image sources
$images = array();
foreach ($matches[0] as $img) {
	//skip tags without a src attribute, so $match[1] is always defined
	if (preg_match('~src=[\'"]([^\'"]+)[\'"]~i', $img, $match)) {
		$images[] = $match[1];
	}
}
echo '<pre>' . print_r($images, true) . '</pre>';
?>

Something tells me there's a smarter solution, but I can't think of one.

Edit: This should be a bit smarter and faster:

<?php
$str = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';
//grab images where the width is at least 100
preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $str, $matches);
//remove anything but the image sources
$matches[0] = preg_replace('~.+?src=[\'"](.+?)[\'"].+~is', '$1', $matches[0]);
echo '<pre>' . print_r($matches[0], true) . '</pre>';
?>

.josh · August 8, 2009

preg_match_all('~<img[^>]*(width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?[^>]*src\s?=\s?[\'"]([^\'"]*)[\'"][^>]*(?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])[^>]*>~i',$string,$matches);

echo "<pre>";
print_r($matches[2]);

This will retrieve the src URL for all images with a width of at least 100 (whether it is written as, say, 100 or 100px), regardless of the width location, spacing, quoting, or capitalization conventions.

edit: doh! I thought it was supposed to be exactly 100; missed that 'minimum'. regex edited.
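A minimal sketch of what that pattern returns, assuming the sample string from the first reply. The conditional `(?(1)|...)` only demands a width after the src when no width was captured before it. The names `$pattern` and `$sources` are just for illustration.

```php
<?php
// Sample markup from the first reply: width appears both before and after src.
$string = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';

// Group 1: optional width before src. Group 2: the src value.
// (?(1)|width...) requires a width after src only if group 1 did not match.
$pattern = '~<img[^>]*(width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?[^>]*src\s?=\s?[\'"]([^\'"]*)[\'"][^>]*(?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])[^>]*>~i';

preg_match_all($pattern, $string, $matches);
$sources = $matches[2];

print_r($sources); // test1.jpg and test3.jpg
```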

thebadbad · August 8, 2009

@CV

FYI, my method is ~ 12 times faster than yours, when testing with a string containing 50 image tags. I like your elaborate pattern, though

.josh · August 8, 2009

yeah lol... way faster to break it down but what's the fun in that?

Daniel0 · August 9, 2009

I haven't bothered doing any benchmark, but you could also do this:

<?php
$html = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';

$doc = new DOMDocument();
$doc->loadHTML($html);

$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
	// minimum width is 100, as asked in the original question
	if ((int) $img->getAttribute('width') >= 100) {
		$images[] = $img->getAttribute('src');
	}
}

var_dump($images);

Edit: As expected, DOM is significantly slower, but still faster than CV's regex.

<?php
header('Content-type: text/plain');

$iterations = 10000;
$tags = 50;

$html = '';
for ($i = 0; $i < $tags; ++$i) {
$html .= '<img src="test' . $i . '.jpg" width="' . mt_rand(50, 200) . '">';
}

/**
* Test DOM
*/

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
	$doc = new DOMDocument();
	$doc->loadHTML($html);

	$images = array();
	foreach ($doc->getElementsByTagName('img') as $img) {
		if ($img->getAttribute('width') >= 100) {
			$images[] = $img->getAttribute('src');
		}
	}
}
echo 'Time (DOM): ' . (microtime(true) - $start);

/**
* Test regex
*/

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $html, $matches);
$matches[0] = preg_replace('~.+?src=[\'"](.+?)[\'"].+~is', '$1', $matches[0]);
}
echo PHP_EOL . 'Time (regex): ' . (microtime(true) - $start);

/**
* Test regex 2
*/

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
preg_match_all('~<img[^>]*(width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?[^>]*src\s?=\s?[\'"]([^\'"]*)[\'"][^>]*(?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])[^>]*>~i',$html,$matches);
}
echo PHP_EOL . 'Time (regex 2): ' . (microtime(true) - $start);

Output on my computer:

Time (DOM): 5.2172110080719
Time (regex): 1.754891872406
Time (regex 2): 9.2111718654633

Garethp · August 9, 2009

Why not just '~<img.*?(?:src="(.*?)".*?)?width="?[1-9][0-9][0-9]*"?.*(?:src="(.*?)".*?)?>~'?

That way it has to be at least 3 digits, with the first being at least 1, so it has to be at least 100.

Daniel0 · August 9, 2009

Isn't that what thebadbad's does?

.josh · August 9, 2009

thebadbad: Also want to point out though that even though my pattern is a lot slower, it does provide a lot more breathing room for matching. I suppose DOM still beats me out

garethp:

~<img.*?(?:src="(.*?)".*?)?width="?[1-9][0-9][0-9]*"?.*(?:src="(.*?)".*?)?>~

[1-9][0-9][0-9]* will match 10 or higher, not 100 or higher. You would need to use a + instead of *
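A quick way to check the difference, as a sketch: the two sub-patterns are anchored with `^` and `$` here (an addition for the test, not part of the original pattern) so each one has to cover the whole value.

```php
<?php
// Anchored copies of the width sub-pattern, so the whole value must match.
$withStar = '~^[1-9][0-9][0-9]*$~'; // two or more digits: 10 and up
$withPlus = '~^[1-9][0-9][0-9]+$~'; // three or more digits: 100 and up

$results = array();
foreach (array('9', '10', '99', '100', '256') as $width) {
    $results[$width] = array(
        'star' => preg_match($withStar, $width),
        'plus' => preg_match($withPlus, $width),
    );
}

print_r($results);
```

With `*`, the values 10 and 99 are accepted; with `+`, only 100 and 256 are.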

thebadbad · August 9, 2009

.josh wrote: "thebadbad: Also want to point out though that even though my pattern is a lot slower, it does provide a lot more breathing room for matching. I suppose DOM still beats me out"

When adding that extra 'breathing room' to my patterns, it's still ~ 11 times faster. But I get your point.

preg_match_all('~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i', $string, $matches);
$matches[0] = preg_replace('~.+?\bsrc\s?=\s?[\'"](.+?)[\'"].+~is', '$1', $matches[0]);

In turn I'm also using word boundaries, to make sure invalid tags like <imgrandombull ... /> and <imgwidth="150" ... /> aren't matched.
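A small sketch of what the word boundaries buy, using made-up test tags. The `maxwidth` attribute is hypothetical and only there to exercise the `\b` in front of `width`.

```php
<?php
// Pattern from the post above, with \b after "img" and before "width".
$pattern = '~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i';

$tags = array(
    'valid'        => '<img src="a.jpg" width="150" />',
    'bad tag name' => '<imgrandombull src="b.jpg" width="150" />',
    'fused name'   => '<imgwidth="150" src="c.jpg" />',
    // "maxwidth" is a made-up attribute, used only to test \bwidth
    'other attr'   => '<img src="d.jpg" maxwidth="150" />',
);

$matched = array();
foreach ($tags as $label => $tag) {
    $matched[$label] = (bool) preg_match($pattern, $tag);
}

var_dump($matched); // only 'valid' is true
```

Note that `\b` treats a hyphen as a boundary, so an attribute such as `data-width="150"` would still match.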

.josh · August 9, 2009

Garet did have one thing in there that we didn't think of: making the quotes around the width optional.

.josh · August 9, 2009

oh and also for yours:

I wonder if instead of doing a preg_replace, if you were to do

$matched = implode('',$matches[0]);
preg_match_all('~src\s?=\s?[\'"]([^\'"]*)[\'"]~i',$matched,$matches);

that might possibly be faster.

thebadbad · August 9, 2009

.josh wrote: "Garet did have one thing in there that we didn't think of: making the quotes around the width optional."

Yea, but that's not valid HTML. But who knows if the content he's grabbing from is, so good point.

.josh wrote: "oh and also for yours: I wonder if instead of doing a preg_replace, if you were to do

$matched = implode('',$matches[0]);
preg_match_all('~src\s?=\s?[\'"]([^\'"]*)[\'"]~i',$matched,$matches);

that might possibly be faster."

Good idea. It's ~ 20% faster when testing with a string containing 50 tags.
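For reference, a sketch of the two variants side by side on the sample string from the first reply, to show they produce the same list. The variable names (`$tags`, `$viaReplace`, `$viaImplode`) are just for illustration, and no timing is measured here.

```php
<?php
$string = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';

// Step 1 (shared): grab the image tags whose width is at least 100.
preg_match_all('~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i', $string, $tags);

// Variant 1: reduce each matched tag to its src with preg_replace.
$viaReplace = preg_replace('~.+?\bsrc\s?=\s?[\'"](.+?)[\'"].+~is', '$1', $tags[0]);

// Variant 2: join the matched tags and pull every src in a single pass.
$joined = implode('', $tags[0]);
preg_match_all('~src\s?=\s?[\'"]([^\'"]*)[\'"]~i', $joined, $srcMatches);
$viaImplode = $srcMatches[1];

print_r($viaReplace);
print_r($viaImplode);
```

Variant 2 runs one regex over one string instead of one replacement per tag, which is presumably where the speedup comes from.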

Daniel0 · August 9, 2009

.josh wrote: "Garet did have one thing in there that we didn't think of: making the quotes around the width optional."

That's why I like using DOM for parsing HTML. You don't have to think about optional quotes, attribute order, etc. It's also much more readable, and it's quick to write. You don't just look at the first regex you made and see exactly what it's doing. The DOM interface is much more descriptive and you should be able to figure out the semantics instantaneously. I think the performance overhead is worth it in the majority of the cases.

.josh wrote: "Garet did have one thing in there that we didn't think of: making the quotes around the width optional."

thebadbad wrote: "Yea, but that's not valid HTML. But who knows if the content he's grabbing from is, so good point."

Yes it is. Attribute quoting is optional as long as you only use alphanumeric characters, periods, hyphens and colons in the attribute value.
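A minimal sketch of that point, assuming the DOM extension is enabled: the parser copes with unquoted values, single quotes, uppercase attribute names and either attribute order, without any change to the extraction code. The file names are made up for the example.

```php
<?php
// Mixed quoting styles and attribute orders in one string.
$html = '<img src=plain.jpg width=150> <img width="300" src="quoted.jpg"> <img WIDTH=\'80\' SRC=\'small.jpg\'> <img src="nowidth.jpg">';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    // a missing width gives an empty string, which casts to 0
    if ((int) $img->getAttribute('width') >= 100) {
        $images[] = $img->getAttribute('src');
    }
}

print_r($images); // plain.jpg and quoted.jpg
```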

.josh · August 9, 2009

Yeah overall I think the DOM approach would be the best solution. One-lining it (mine) vs. speed (thebadbad's) is more of a thought exercise than anything.

thebadbad · August 9, 2009

I agree with both of you. And you're right Daniel; I didn't know it was optional. But I see that it's required when writing XHTML.

nrg_alpha · August 10, 2009

thebadbad wrote: "But I see that it's required when writing XHTML."

Yep.

That's what I like about XHTML.. required quotes, element and attribute names must be in lowercase, self closing tags, etc.. While I can see the 'flexibility' in old school HTML's system, I personally prefer XHTML's more.

All I can say is I'm really grateful X/HTML 5 will support both. I for one will continue to embrace the XHTML way.

</sidenote>

thebadbad · August 10, 2009

I also prefer XHTML over HTML any day. Being a bit of a perfectionist, the stricter the code the better

nrg_alpha · August 10, 2009

thebadbad wrote: "I also prefer XHTML over HTML any day. Being a bit of a perfectionist, the stricter the code the better"

:qft:

.josh · August 10, 2009

Yeah well...in the real world, most people don't have time to make stuff perfect. I've got too much work on my plate on any given day to make sure all my t's are crossed and i's are dotted. If what I do works in most major browsers, as far as I'm concerned, it's good enough. I promise you, the people with the money who make the decisions don't care one whit about that sort of thing. As long as they can go to the site and it looks pretty and does what it's supposed to, it's all good. I'm not necessarily promoting bad coding...I'm just sayin'...when you have 500 things to do, you have to prioritize. Which is why I lean towards "breathing room" regexes etc.. because I know most other people out there outside of the classroom are under the same pressures, and therefore do the same thing.

nrg_alpha · August 10, 2009

No disputes there.. I don't do this stuff for a living... more of a twisted penchant for learning webdev (with the possibility later on of doing contract work). So yeah, time / budgets might certainly dictate otherwise for sure. I'm not in that boat, so I have the time to tweak and nudge things here and there. :geek:

thebadbad · August 10, 2009

Yeah, obviously it depends on why you're coding.. NRG put it pretty well
