Match only when something is not there

Sildhe · December 21, 2008

Okay basically I'm looking for an expanded version of a negative character class. For instance, say I have this string:

$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";

I want to capture the id, but only if '[FLAG] ' is not present between the anchor tags. When present, '[FLAG] ' always comes immediately after the closing '>' of the opening anchor tag and is always followed by a space. When it is absent, the link text starts immediately after the '>' of the opening tag (no space).

So for this example, I want to capture 829.1 and 634.5 but not 1234.2

I think the answer is negative lookbehind, so here's what I've tried:

preg_match_all("/<a.*?id=([0-9.]+)[^>]*?>(?<!\[FLAG\] ).*?<\/a>/",$string,$matches);

I've also tried wrapping the .*? in non-capturing parentheses, like so:

preg_match_all("/<a.*?id=([0-9.]+)[^>]*?>(?<!\[FLAG\] )(?:.*?)<\/a>/",$string,$matches);

But it continues to match all 3 ids. So...what am I doing wrong?

DarkWater · December 21, 2008

<?php
$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";
preg_match_all('~<a.*?id=([0-9.]+)[^>]*?>(?!\[FLAG\]).*?</a>~', $string, $matches);
print_r($matches);
?>

You should be using negative lookaheads for this.

This regex works as expected.
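
For anyone wondering why the lookbehind attempts matched all three ids: a lookbehind placed right after the '>' inspects the characters to the left of that position (the end of the opening tag), which can never be '[FLAG] ', so the negative assertion always passes. A lookahead at the same position inspects what follows the '>' instead. A minimal sketch of the difference, reusing the sample string from the thread (the names $behind and $ahead are only illustrative):

```php
<?php
$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";

// Lookbehind: at the position just after '>', the text to the LEFT is
// "...id=1234.2>", never "[FLAG] ", so the assertion always passes.
preg_match_all('~<a.*?id=([0-9.]+)[^>]*?>(?<!\[FLAG\] ).*?</a>~', $string, $behind);

// Lookahead: at the same position, the text to the RIGHT is checked.
preg_match_all('~<a.*?id=([0-9.]+)[^>]*?>(?!\[FLAG\] ).*?</a>~', $string, $ahead);

print_r($behind[1]); // all three ids
print_r($ahead[1]);  // only the ids of the unflagged links
```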

nrg_alpha · December 21, 2008

Here is my take:

$str = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";
preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\])#s', $str, $matches);
echo '<pre>'.print_r($matches[1], true);

EDIT - I threw in the s modifier just in case the dot-all matching part has to deal with a long multi-lined anchor tag.. but in this situation, it isn't needed.

EDIT2 - I modified my line to encase the closing a tag, just in case (although from the initial code I provided, it still does work).

preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\]).*</a>#', $str, $matches);

DarkWater · December 21, 2008

Seems like the quotation marks got taken off. Let's see if it does it again:

<?php
$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";
preg_match_all('~<a.*?id=([0-9.]+)[^>]*?>(?!\[FLAG\]).*?</a>~', $string, $matches);
print_r($matches);
?>

EDIT:

That's so odd. It seems to be stripping the opening quotation mark, but it's there if you try to quote my post. =/

nrg_alpha · December 21, 2008

Oops.. damn, that last line of mine should be:

preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\]).*?</a>#', $str, $matches);

I neglected the last .* part by not making it a lazy quantifier. The quote system timed out.. so I could not include this in my last post.
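
One caveat about the open-ended lookahead (?!.*?\[FLAG\]): it rejects an id whenever [FLAG] appears anywhere later in the text that the dot can reach. Without the s modifier that reach ends at the newline, which is why the pattern behaves on the sample (one link per line). With the s modifier, the lookahead can also see flags in later links. A small sketch, assuming the flagged link happens to come last (the reordered $str is a made-up variation of the sample, not the OP's data):

```php
<?php
// Hypothetical reordering of the sample: the flagged link is now last.
$str = "
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
<a href='blah' id=1234.2>[FLAG] something</a>
";

// Without /s the dot stops at each newline, so the lookahead only
// scans the remainder of the current line.
preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\])#', $str, $perLine);

// With /s the dot crosses newlines, so the [FLAG] in the last link
// is visible from every earlier id and vetoes all of them.
preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\])#s', $str, $dotAll);

print_r($perLine[1]); // 829.1 and 634.5
print_r($dotAll[1]);  // empty
```

Anchoring the lookahead directly after the tag's closing '>' avoids this dependence on line layout.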

Sildhe · December 21, 2008

Hmm... your ideas seem to work on their own, but this is actually part of a larger regex. Probably some other part of my regex is causing this to break; I don't know.

I ended up just doing a capture on everything between the anchor tags with no condition, and then using a foreach loop and strpos to further filter out the ids, based on the captured stuff between the anchor tags. That seems to work fine.

Thanks for the help guys, appreciate it.
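
The capture-then-filter workaround described above might look roughly like this. It is only a sketch of the idea, not Sildhe's actual code (which was not posted); the pattern and the $ids variable are assumptions made for illustration:

```php
<?php
$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";

// Capture the id (group 1) and the link text (group 2) unconditionally.
preg_match_all('~<a[^>]*\bid=([0-9.]+)[^>]*>(.*?)</a>~', $string, $matches, PREG_SET_ORDER);

$ids = [];
foreach ($matches as $m) {
    // Keep the id unless the link text starts with "[FLAG] ".
    // strpos() returns 0 for a match at the start and false for no match,
    // so the strict !== 0 comparison is required.
    if (strpos($m[2], '[FLAG] ') !== 0) {
        $ids[] = $m[1];
    }
}

print_r($ids);
```

This keeps the regex simple at the cost of a second pass over the matches.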

nrg_alpha · December 21, 2008

Sildhe wrote:
> Hmm... your ideas seem to work on their own, but this is actually part of a larger regex. Probably some other part of my regex is causing this to break; I don't know.
>
> I ended up just doing a capture on everything between the anchor tags with no condition, and then using a foreach loop and strpos to further filter out the ids, based on the captured stuff between the anchor tags. That seems to work fine.

Well, we can only work with what you give us, so if you neglected to give us the 'bigger picture', how are we supposed to know? It's not fair to say that things are still not working when we can only try to solve what you show us. If you present the whole problem, we may be able to solve the entire issue at hand; otherwise, you will only get solutions to a portion of it, and whatever you withhold remains yours alone to sort out.

effigy · December 22, 2008

When you're working with a known format (e.g., HTML tags begin with "<" and end with ">"), conform to its rules in your pattern: don't use <a.*?...> but <a[^>]*...>. The greedy negated class is not only better optimized but also safer, since it ensures that you stay within the tag boundary.

Sildhe · December 22, 2008

effigy wrote:
> When you're working with a known format (e.g., HTML tags begin with "<" and end with ">"), conform to its rules in your pattern: don't use <a.*?...> but <a[^>]*...>. The greedy negated class is not only better optimized but also safer, since it ensures that you stay within the tag boundary.

Inside the anchor tag I have

<a.*?id=([0-9.]+)[^>]*?>

I suppose I don't need that last ? in there but I did use [^>]* to match till the end of the anchor tag.

I have .*? after <a because there can be any number of things between the start of the anchor tag and the id=1234.5 part. I thought a non-greedy match-all would be the best thing for that. Am I wrong?

nrg_alpha · December 22, 2008

Sildhe wrote:
> Inside the anchor tag I have
> <a.*?id=([0-9.]+)[^>]*?>
> I suppose I don't need that last ? in there but I did use [^>]* to match till the end of the anchor tag.

Correct.. You don't need that last ?, as the negated character class will keep greedily matching until it hits the > character.

@Effigy

I'm not sure I can agree with your statement (in this particular case). The reason is that I would be hard-pressed to think that there would be an 'id=' outside of a tag.. therefore, I would still use <a.+? in this case, as we know the 'id=' will be inside the tag..

Could you demonstrate a sample that uses <a[^>]* that works with this particular situation? Perhaps when we see how this would be implemented in this situation, we can better understand why using .+? is not as wise a choice.

effigy · December 22, 2008

nrg_alpha wrote:
> The reason is that I would be hard-pressed to think that there would be an 'id=' outside of a tag..

The concern isn't of id= being outside of a tag, but of a tag not having id=. In this instance the regex would keep consuming data--going outside of the tag and running into another, possibly not even an a--until it finds id=. Arguably, the data in question may always have id= in the a; however, (1) data may change; and (2) [^>]* will work in both cases.
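
The overrun described here can be reproduced with a small made-up input. This is only a sketch; the $html string is hypothetical and chosen so that the link has no id= while a later tag on the same line does:

```php
<?php
// Hypothetical input: the first link has no id=, and a different tag
// later on the same line does.
$html = "<a href='blah'>no id here</a> <span id=99.9>not a link</span>";

// Lazy dot: keeps expanding past the end of the <a ...> tag until it
// finds an id= somewhere on the line.
preg_match_all('~<a.*?id=([0-9.]+)[^>]*>~', $html, $dot);

// Negated class: cannot step over the '>' that closes the <a ...> tag.
preg_match_all('~<a[^>]*id=([0-9.]+)[^>]*>~', $html, $bounded);

print_r($dot[1]);     // contains 99.9, taken from the <span>
print_r($bounded[1]); // empty
```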

Additionally, according to Mastering Regular Expressions:

Common Optimizations
With a lazy quantifier, as in "(.*?)", the engine normally must jump between checking what the quantifier controls (the dot) with checking what comes after ". For this and other reasons, lazy quantifiers are generally much slower than greedy ones....

Lazy Versus Greedy: Be Specific
"...use a greedy quantifier, as they are generally optimized a bit better than non-greedy quantifier..."

"Again, this is dependent on the data and the language, but with most engines, using a negated character class is much more efficient than a lazy quantifier."

Sildhe · December 22, 2008

Oh okay, I get what you're saying. If id=.. isn't there, then it's gonna keep on gobbling stuff up. That makes sense. So how would I write it? It seems like the opposite of what I'm trying to do with the FLAG, so would I use a positive lookahead?

effigy · December 22, 2008

<pre>
<?php
$string = <<<STR
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
STR;
preg_match_all('%<a[^>]+id=([\d.]+)[^>]*>(?!\[FLAG\]\s)%si', $string, $matches);
array_shift($matches);
print_r($matches);
?>
</pre>

nrg_alpha · December 22, 2008

effigy wrote:
> The concern isn't of id= being outside of a tag, but of a tag not having id=. In this instance the regex would keep consuming data--going outside of the tag and running into another, possibly not even an a--until it finds id=. Arguably, the data in question may always have id= in the a; however, (1) data may change; and (2) [^>]* will work in both cases.

Point taken. Haven't considered it that way. So the question now is, how would one incorporate [^>] (and not use .*? within the a tag) into this situation? I'm scratching my head on this one (and I am curious how to resolve this in this manner). Here's what I have thus far..

$str = "
<a href='blah' id=1234.2>[FLAG] somethingElse</a>
<a href='blah'>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";
preg_match_all('#<a.+?id=([0-9.]+)[^>]*>(?!.*?\[FLAG\])#', $str, $matches);
echo '<pre>'.print_r($matches[1], true);

Note that I altered the middle entry in the OP's string by removing the 'id=' business... So in this case, only the last tag's id should be recorded (and this is indeed the case). But I am using a mix of both .+? and [^>] (much like the OP's last code snippet). When I go through the motions on that second line, do I understand correctly that the pattern will keep on matching all the way to the end of the line, then fail? How can the <a.+?id= part be rewritten better, yet still take into account all the other aspects?

effigy wrote:
> Additionally, according to Mastering Regular Expressions:
>
> Common Optimizations
> With a lazy quantifier, as in "(.*?)", the engine normally must jump between checking what the quantifier controls (the dot) with checking what comes after ". For this and other reasons, lazy quantifiers are generally much slower than greedy ones....
>
> Lazy Versus Greedy: Be Specific
> "...use a greedy quantifier, as they are generally optimized a bit better than non-greedy quantifier..."
>
> "Again, this is dependent on the data and the language, but with most engines, using a negated character class is much more efficient than a lazy quantifier."

No disputes there. The only reason I resorted to .+? was because in this case, I don't know what else to use..

EDIT - just noticed you posted while I was writing.. so I'll have a go at your code and come back to you...

nrg_alpha · December 22, 2008

Ok.. I have thought of using something like what you just proposed.. but when I see this:

<a[^>]+

Would this not mean some backtracking? It would match all the way to the first > character it runs into, then must start backtracking until it finds the complete set of characters id=... (and obviously, if there is no id=, it fails).

This makes me wonder out loud which is faster: using .+? until it matches an id (and, if there is no id, matching on till the end of the line, without the s modifier, and then failing), or matching up to the > character and then backtracking? It's the backtracking speed that concerns me, as from the book we know the regex engine must go back one saved state at a time and check whether the character at that saved state matches (in this case) the i in id, and if so, whether the next matches the d, then the =, and so forth.

I can understand using [^>] as a safety net.. but when I did think of using it instead of .+?, I (a) made the assumption that there would be an id within the tag (a bad assumption on my part, admittedly), and (b) figured backtracking would be slower than a lazy quantifier.. (of course, I could be all wrong here)..

nrg_alpha · December 22, 2008

Never mind.. I did a speed test.. turns out, my solution was slower

EDIT:

$string = <<<STR
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
STR;
$loop = 1000;

$start = gettimeofday();
for($i = 0; $i < $loop; $i++){
   preg_match_all('%<a[^>]+id=([\d.]+)[^>]*>(?!\[FLAG\]\s)%si', $string, $matches);
}
$final = gettimeofday();
echo 'Effigy\'s method:' . '<br />';
echo '<pre>'.print_r($matches[1], true) . '<br />';
$sec = ($final['sec'] + $final['usec'] / 1000000)-($start['sec'] + $start['usec'] / 1000000);
printf("Time: %3f", $sec);

echo '<br /><br />';

$start = gettimeofday();
for($i = 0; $i < $loop; $i++){
   preg_match_all('#<a.+?id=([0-9.]+)[^>]*>(?!.*?\[FLAG\])#', $string, $matches);
}
$final = gettimeofday();
echo 'NRG\'s method:' . '<br />';
echo '<pre>'.print_r($matches[1], true) . '<br />';
$sec = ($final['sec'] + $final['usec'] / 1000000)-($start['sec'] + $start['usec'] / 1000000);
printf("Time: %3f", $sec);

Output sample:

Effigy's method:
Array
(
    [0] => 829.1
    [1] => 634.5
)

Time: 0.033764

NRG's method:
Array
(
    [0] => 829.1
    [1] => 634.5
)

Time: 0.062249

effigy · December 22, 2008

nrg_alpha wrote:
> I can understand using [^>] as a safety net.. but when I did think of using it instead of .+?, I (a) made the assumption that there would be an id within the tag (a bad assumption on my part, admittedly), and (b) figured backtracking would be slower than a lazy quantifier.. (of course, I could be all wrong here)..

Actually, this is correct. I crossed my wires on the lazy/greedy portion, while the real issue is using [^>]*? rather than .*? (or with +, doesn't matter). My apologies.

The performance difference between the lazy and greedy approaches depends, as the book says, on the data.
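
The correction above can be sketched as follows: what keeps the match inside the tag is the negated class [^>], and it can be paired with either a greedy or a lazy quantifier. The example below only checks that both variants return the same ids on nrg_alpha's modified sample; it makes no claim about which is faster. The variable names are illustrative.

```php
<?php
// Sample from the thread, with the middle link's id= removed
// (the same variation nrg_alpha tried earlier).
$str = "
<a href='blah' id=1234.2>[FLAG] somethingElse</a>
<a href='blah'>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";

// Greedy negated class: runs to the end of the tag, then backtracks to id=.
preg_match_all('%<a[^>]+id=([\d.]+)[^>]*>(?!\[FLAG\]\s)%i', $str, $greedy);

// Lazy negated class: steps forward one character at a time until id=.
preg_match_all('%<a[^>]*?id=([\d.]+)[^>]*>(?!\[FLAG\]\s)%i', $str, $lazy);

print_r($greedy[1]); // 634.5 only
print_r($lazy[1]);   // 634.5 only
```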
