Because we don't know in which order the src and width attributes appear, I think the easiest and fastest way is to grab all image tags where the width attribute is present and at least 100, and then grab each image source individually:

 

<?php
$str = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';
//grab image tags where the width is at least 100
preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $str, $matches);
//grab the image sources
$images = array();
foreach ($matches[0] as $img) {
	//skip tags that have a width but no src attribute
	if (preg_match('~src=[\'"]([^\'"]+)[\'"]~i', $img, $match)) {
		$images[] = $match[1];
	}
}
echo '<pre>' . print_r($images, true) . '</pre>';
?>

 

Something tells me there's a smarter solution, but I can't think of any :P

 

Edit: This should be a bit smarter and faster:

 

<?php
$str = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';
//grab images where the width is at least 100
preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $str, $matches);
//remove anything but the image sources
$matches[0] = preg_replace('~.+?src=[\'"](.+?)[\'"].+~is', '$1', $matches[0]);
echo '<pre>' . print_r($matches[0], true) . '</pre>';
?>

preg_match_all('~<img[^>]*(width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?[^>]*src\s?=\s?[\'"]([^\'"]*)[\'"][^>]*(?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])[^>]*>~i',$string,$matches);

echo "<pre>";
print_r($matches[2]);

 

This will retrieve the src URL of every image with a width of at least 100 (whether it is written as 100 or 100px), regardless of where the width attribute sits in the tag, and regardless of spacing, quote style, or capitalization.
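
For anyone who wants to try that pattern, here is a minimal sketch that runs it against the sample string used earlier in the thread. The snippet above never defines `$string`, so the value below is an assumption. The `(?(1)|...)` part is a conditional subpattern: if group 1 already captured a width before src, nothing more is required; otherwise a width has to follow src.

```php
<?php
// Assumption: $string holds the HTML to scan (same sample as earlier in the thread).
$string = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';

// Group 1: optional width that appears before src.
// Group 2: the src value.
// (?(1)|...): if group 1 matched, require nothing more; otherwise require a width after src.
$pattern = '~<img[^>]*(width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?[^>]*src\s?=\s?[\'"]([^\'"]*)[\'"][^>]*(?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])[^>]*>~i';

preg_match_all($pattern, $string, $matches);

// $matches[2] holds one src per matched tag.
$sources = $matches[2];
print_r($sources);
```

With that sample, only test1.jpg and test3.jpg should come back, since the other tags either have a width below 100 or no width at all.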

 

edit: doh!  I thought it was supposed to be exactly 100; missed that 'minimum'. regex edited.

I haven't bothered doing any benchmark, but you could also do this:

 

<?php
$html = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';

$doc = new DOMDocument();
$doc->loadHTML($html);

$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
	//keep images whose width is at least 100
	if ((int) $img->getAttribute('width') >= 100) {
		$images[] = $img->getAttribute('src');
	}
}

var_dump($images);

 

Edit: As expected, DOM is significantly slower, but still faster than CV's regex.

 

<?php
header('Content-type: text/plain');

$iterations = 10000;
$tags = 50;

$html = '';
for ($i = 0; $i < $tags; ++$i) {
$html .= '<img src="test' . $i . '.jpg" width="' . mt_rand(50, 200) . '">';
}

/**
* Test DOM
*/

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
$doc = new DOMDocument();
$doc->loadHTML($html);

$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
	if ($img->getAttribute('width') >= 100) {
		$images[] = $img->getAttribute('src');
	}
}
}
echo 'Time (DOM): ' . (microtime(true) - $start);

/**
* Test regex
*/

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
preg_match_all('~<img\b[^>]+width=[\'"][1-9][0-9]{2,}[\'"][^>]*>~i', $html, $matches);
$matches[0] = preg_replace('~.+?src=[\'"](.+?)[\'"].+~is', '$1', $matches[0]);
}
echo PHP_EOL . 'Time (regex): ' . (microtime(true) - $start);

/**
* Test regex 2
*/

$start = microtime(true);
for ($i = 0; $i < $iterations; ++$i) {
preg_match_all('~<img[^>]*(width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])?[^>]*src\s?=\s?[\'"]([^\'"]*)[\'"][^>]*(?(1)|width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"])[^>]*>~i',$html,$matches);
}
echo PHP_EOL . 'Time (regex 2): ' . (microtime(true) - $start);

 

Output on my computer:

Time (DOM): 5.2172110080719
Time (regex): 1.754891872406
Time (regex 2): 9.2111718654633

thebadbad: Also want to point out though that even though my pattern is a lot slower, it does provide a lot more breathing room for matching. I suppose DOM still beats me out

 

garethp:

~<img.*?(?:src="(.*?)".*?)?width="?[1-9][0-9][0-9]*"?.*(?:src="(.*?)".*?)?>~

 

[1-9][0-9][0-9]* will match 10 or higher, not 100 or higher. You would need to use a + instead of *
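
A quick way to see the difference, as a rough sketch: the list of widths and the variable names are made up for illustration, and the anchors `^` and `$` are only there so each value is tested as a whole.

```php
<?php
// Hypothetical sample widths, chosen to sit on both sides of the 100 boundary.
$widths = ['9', '10', '99', '100', '256'];

// One non-zero digit, one digit, then ZERO or more digits: two digits are enough.
$star = '~^[1-9][0-9][0-9]*$~';
// One non-zero digit, one digit, then ONE or more digits: at least three digits.
$plus = '~^[1-9][0-9][0-9]+$~';

// preg_grep keeps the entries that match; array_values reindexes them.
$matchedByStar = array_values(preg_grep($star, $widths));
$matchedByPlus = array_values(preg_grep($plus, $widths));

print_r($matchedByStar); // 10, 99, 100, 256
print_r($matchedByPlus); // 100, 256
```

The `+` form accepts the same numbers as the `[1-9][0-9]{2,}` used in the other patterns in this thread.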

 

 

thebadbad: Also want to point out though that even though my pattern is a lot slower, it does provide a lot more breathing room for matching. I suppose DOM still beats me out

 

When adding that extra 'breathing room' to my patterns, it's still ~ 11 times faster :) But I get your point.

 

preg_match_all('~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i', $string, $matches);
$matches[0] = preg_replace('~.+?\bsrc\s?=\s?[\'"](.+?)[\'"].+~is', '$1', $matches[0]);

In turn I'm also using word boundaries, to make sure invalid tags like <imgrandombull ... /> and <imgwidth="150" ... /> aren't matched.
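
To show what the word boundaries buy you, here is a small sketch. The test string and the "loose" pattern are assumptions made up for contrast (the loose one drops both `\b` and uses `[^>]*` so that nothing has to sit between `<img` and `width`); the strict pattern is the one posted above.

```php
<?php
// Hypothetical input: two bogus tags followed by one real image tag.
$html = '<imgrandombull width="150" src="a.jpg" /> <imgwidth="150" src="b.jpg" /> <img src="c.jpg" width="150" />';

// Loose variant without word boundaries (made up for comparison).
$loose = '~<img[^>]*width\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i';
// Pattern from the post above, with \b after "img" and before "width".
$strict = '~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i';

$looseCount = preg_match_all($loose, $html, $looseMatches);
$strictCount = preg_match_all($strict, $html, $strictMatches);

echo "loose: $looseCount, strict: $strictCount" . PHP_EOL;
print_r($strictMatches[0]);
```

The loose pattern picks up all three tags, while the strict one should only keep the real `<img ...>` tag.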

Garet did have one thing in there that we didn't think of: making the quotes around the width optional.

 

Yea, but that's not valid HTML. But who knows if the content he's grabbing from is, so good point.

 

oh and also for yours:

 

Instead of doing a preg_replace, what if you were to do

 

$matched = implode('',$matches[0]);
preg_match_all('~src\s?=\s?[\'"]([^\'"]*)[\'"]~i',$matched,$matches);

 

that might possibly be faster.

 

Good idea. It's ~ 20% faster when testing with a string containing 50 tags.
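
Put together, the implode variant might look like the sketch below. It reuses the sample string from earlier in the thread; `$srcMatches` and `$sources` are names picked for this example only.

```php
<?php
// Assumption: same sample string as earlier in the thread.
$string = 'Images: <img src="test1.jpg" width="130" /> <img src="test2.jpg" width="50" /> <img width="256" src="test3.jpg" /> <img width="12" src="test4.jpg" /> <img src="nowidth.jpg" />';

// Step 1: grab whole image tags where the width is at least 100.
preg_match_all('~<img\b[^>]+\bwidth\s?=\s?[\'"][1-9][0-9]{2,}(?:px)?[\'"][^>]*>~i', $string, $matches);

// Step 2: glue the matched tags together and pull every src out in one pass.
$matched = implode('', $matches[0]);
preg_match_all('~\bsrc\s?=\s?[\'"]([^\'"]*)[\'"]~i', $matched, $srcMatches);

// Group 1 of the second pattern holds the src values.
$sources = $srcMatches[1];
print_r($sources);
```

One thing to keep in mind: a matched tag that has no src simply contributes nothing, so the result list can be shorter than the list of matched tags.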

Garet did have one thing in there that we didn't think of: making the quotes around the width optional. 

 

That's why I like using DOM for parsing HTML. You don't have to think about optional quotes, attribute order, etc. It's also much more readable, and it's quick to write. You don't just look at the first regex you made and see exactly what it's doing. The DOM interface is much more descriptive and you should be able to figure out the semantics instantaneously. I think the performance overhead is worth it in the majority of the cases.
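
As a rough illustration of that point, the sketch below feeds DOMDocument some markup with unquoted values, a different attribute order and a px suffix. The sample markup is made up for this example. The `(int)` cast takes the leading digits, so "300px" counts as 300 and a missing width counts as 0.

```php
<?php
// Hypothetical markup: unquoted attributes, swapped order, a px suffix, a missing width.
$html = '<img src=a.jpg width=150><img width="300px" src=\'b.jpg\'><img src="c.jpg" width="50"><img src="d.jpg">';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    // No need to care about quoting or attribute order here.
    if ((int) $img->getAttribute('width') >= 100) {
        $images[] = $img->getAttribute('src');
    }
}

print_r($images);
```

The same loop handles all four tags without any change to the matching logic, which is the readability argument in a nutshell.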

 

Garet did have one thing in there that we didn't think of: making the quotes around the width optional.

 

Yea, but that's not valid HTML. But who knows if the content he's grabbing from is, so good point.

 

Yes it is. In HTML 4, attribute quoting is optional as long as the attribute value contains only letters, digits, periods, hyphens, underscores and colons.

But I see that it's required when writing XHTML.
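
If you want the regex to cope with unquoted widths as well, one possible tweak is to make both quotes optional with `[\'"]?`. This is only a sketch: the sample string is made up, and the src values are still assumed to be quoted.

```php
<?php
// Hypothetical sample mixing unquoted, single-quoted and double-quoted widths.
$string = '<img src="a.jpg" width=150> <img width=\'200\' src="b.jpg"> <img src="c.jpg" width=50> <img src="d.jpg" width="1000px">';

// Same idea as the patterns above, but the quotes around the width are optional.
preg_match_all('~<img\b[^>]+\bwidth\s?=\s?[\'"]?[1-9][0-9]{2,}(?:px)?[\'"]?[^>]*>~i', $string, $matches);

// Pull the (quoted) src values out of the matched tags.
$matched = implode('', $matches[0]);
preg_match_all('~\bsrc\s?=\s?[\'"]([^\'"]*)[\'"]~i', $matched, $srcMatches);
$sources = $srcMatches[1];

print_r($sources);
```

The trade-off is that the pattern gets looser: with an optional closing quote, a value such as width="100%" would also slip through, because the rest of the value is swallowed by `[^>]*`.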

 

Yep.

 

<sidenote>

That's what I like about XHTML.. required quotes, element and attribute names must be in lowercase, self closing tags, etc.. While I can see the 'flexibility' in old school HTML's system, I personally prefer XHTML's more.

 

All I can say is I'm really grateful X/HTML 5 will support both. I for one will continue to embrace the XHTML way.

</sidenote>

Yeah well...in the real world, most people don't have time to make stuff perfect.  I've got too much work on my plate on any given day to make sure all my t's are crossed and i's are dotted.  If what I do works in most major browsers, as far as I'm concerned, it's good enough.  I promise you, the people with the money who make the decisions don't care one whit about that sort of thing.  As long as they can go to the site and it looks pretty and does what it's supposed to, it's all good.  I'm not necessarily promoting bad coding...I'm just sayin'...when you have 500 things to do, you have to prioritize.  Which is why I lean towards "breathing room" regexes etc.. because I know most other people out there outside of the classroom are under the same pressures, and therefore do the same thing.

No disputes there.. I don't do this stuff for a living... more of a twisted penchant for learning webdev (with the possibility later on of doing contract work). So yeah, time / budgets might certainly dictate otherwise for sure. I'm not in that boat, so I have the time to tweak and nudge things here and there.  :geek:
