
Match only when something is not there


Sildhe


Okay basically I'm looking for an expanded version of a negated character class.  For instance, say I have this string:

 

$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";

 

I want to capture the id but only if '[FLAG] ' between the anchor tag is not there.  '[FLAG] ' will always be immediately after the closing '>' of the opening anchor tag and it will always have a space right after it.  If it is not there, the stuff between the anchor tags will start immediately after the '>' for the opening tag (no space).

 

So for this example, I want to capture 829.1 and 634.5 but not 1234.2

 

I think the answer is negative lookbehind, so here's what I've tried:

 

preg_match_all("/<a.*?id=([0-9.]+)[^>]*?>(?<!\[FLAG\] ).*?<\/a>/",$string,$matches);

 

I've also tried wrapping the .*? in non-capturing parentheses like so:

 

preg_match_all("/<a.*?id=([0-9.]+)[^>]*?>(?<!\[FLAG\] )(?:.*?)<\/a>/",$string,$matches);

 

But it continues to match all 3 ids.  So...what am I doing wrong?

 

Link to comment
Share on other sites

<?php
$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";
preg_match_all('~<a.*?id=([0-9.]+)[^>]*?>(?!\[FLAG\]).*?</a>~', $string, $matches);
print_r($matches);
?>

 

You should be using negative lookaheads for this.

This regex works as expected.
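
For what it's worth, here is a minimal sketch of why the lookbehind attempt kept matching all three ids. Right after the closing '>' of the opening tag, the text behind that position is the end of the tag itself, never '[FLAG] ', so (?<!\[FLAG\] ) always succeeds. The flag sits ahead of that position, which is what a lookahead inspects. The variable names $behind and $ahead are only for illustration.

```php
<?php
$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";

// Lookbehind: tests the 7 characters BEFORE the position right after '>'.
// Those characters always belong to the opening tag, so this never rejects anything.
preg_match_all('~<a.*?id=([0-9.]+)[^>]*?>(?<!\[FLAG\] ).*?</a>~', $string, $behind);

// Lookahead: tests the characters AFTER '>', which is where '[FLAG] ' actually is.
preg_match_all('~<a.*?id=([0-9.]+)[^>]*?>(?!\[FLAG\] ).*?</a>~', $string, $ahead);

print_r($behind[1]); // all three ids
print_r($ahead[1]);  // only the ids of the unflagged anchors
```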

Link to comment
Share on other sites

Here is my take:

 

$str = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";
preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\])#s', $str, $matches);
echo '<pre>'.print_r($matches[1], true);

 

EDIT - I threw in the s modifier just in case the dot-all matching part has to deal with a long multi-lined anchor tag.. but in this situation, it isn't needed.

 

EDIT2 - I modified my line to encase the closing a tag, just in case (although from the initial code I provided, it still does work).

preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\]).*</a>#', $str, $matches);

Link to comment
Share on other sites

Seems like the quotation marks got taken off.  Let's see if it does it again:

 

<?php
$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";
preg_match_all('~<a.*?id=([0-9.]+)[^>]*?>(?!\[FLAG\]).*?</a>~', $string, $matches);
print_r($matches);
?>

 

EDIT:

That's so odd.  It seems to be stripping the opening quotation mark, but it's there if you try to quote my post. =/

Link to comment
Share on other sites

Oops.. damn, that last line of mine should be:

 

preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\]).*?</a>#', $str, $matches);

 

I forgot to make the last .* a lazy quantifier. The quote system timed out.. so I could not include this in my last post.
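
To illustrate why the lazy version matters, here is a small sketch. It assumes a made-up input where two unflagged anchors sit on the same line (the sample string in this thread has one anchor per line, which is why both versions happen to work there).

```php
<?php
// Hypothetical input: two unflagged anchors on a single line.
$str = "<a href='blah' id=829.1>one</a> <a href='blah' id=634.5>two</a>";

// Greedy .* runs to the end of the line and backs up to the LAST </a>,
// so both anchors are swallowed by a single match.
preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\]).*</a>#', $str, $greedy);

// Lazy .*? stops at the FIRST </a>, leaving the second anchor for the next match.
preg_match_all('#<a.+?id=([0-9.]+)(?!.*?\[FLAG\]).*?</a>#', $str, $lazy);

print_r($greedy[1]); // one id only
print_r($lazy[1]);   // both ids
```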

Link to comment
Share on other sites

hmm...you guys' ideas seem to work on their own, but it's actually part of a larger regex.  Probably another part of my regex is causing this stuff to mess up, or something, I don't know.

 

I ended up just doing a capture on everything between the anchor tags with no condition, and then using a foreach loop and strpos to further filter out the ids, based on the captured stuff between the anchor tags.  That seems to work fine. 
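
Roughly, that fallback could look like the sketch below. The larger regex was never posted, so the pattern here is only an assumption that captures the id and the text between the tags; the strpos check is the part that does the filtering.

```php
<?php
$string = "
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";

// Assumed pattern: group 1 is the id, group 2 is everything between the anchor tags.
preg_match_all('~<a[^>]*\bid=([0-9.]+)[^>]*>(.*?)</a>~', $string, $matches, PREG_SET_ORDER);

$ids = [];
foreach ($matches as $m) {
    // strpos() returns 0 when the text starts with the flag and false when the
    // flag is absent, so the comparison has to be the strict !== 0.
    if (strpos($m[2], '[FLAG] ') !== 0) {
        $ids[] = $m[1];
    }
}

print_r($ids);
```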

 

Thanks for the help guys, appreciate it.

 

 

Link to comment
Share on other sites

hmm...you guys' ideas seem to work on their own, but it's actually part of a larger regex.  Probably another part of my regex is causing this stuff to mess up, or something, I don't know.

 

I ended up just doing a capture on everything between the anchor tags with no condition, and then using a foreach loop and strpos to further filter out the ids, based on the captured stuff between the anchor tags.  That seems to work fine. 

 

Well, we can only work with what you give us.. so if you neglected to give us the 'bigger picture', how are we supposed to know? As a result, it's not a fair statement to say that things are still not working. We can only try to solve what you give us. Nothing more. If you present the entire package of the problem so-to-speak, we may be able to solve the entire issue at hand.. otherwise, you will only get solutions to a portion of the problem (which means anything else you are withholding is your problem and yours alone for obvious reasons).

Link to comment
Share on other sites

When you're working with a known format--e.g., HTML tags begin with "<" and end with ">"--conform to these rules in your pattern: don't use <a.*?...> but <a[^>]*...>. Not only is the greediness optimal, but safer, ensuring that you stay within your tag boundary.

Link to comment
Share on other sites

When you're working with a known format--e.g., HTML tags begin with "<" and end with ">"--conform to these rules in your pattern: don't use <a.*?...> but <a[^>]*...>. Not only is the greediness optimal, but safer, ensuring that you stay within your tag boundary.

 

Inside the anchor tag I have

<a.*?id=([0-9.]+)[^>]*?>

 

I suppose I don't need that last ? in there but I did use [^>]* to match till the end of the anchor tag.

 

I have .*? after <a because there can be any number of things between the start of the anchor tag and where it lists id=1234.5, so I thought a non-greedy match-all would be the best thing for that.  Am I wrong? 

Link to comment
Share on other sites

 

Inside the anchor tag I have

<a.*?id=([0-9.]+)[^>]*?>

 

I suppose I don't need that last ? in there but I did use [^>]* to match till the end of the anchor tag.

 

Correct.. You don't need that last ?, as the negated character class will keep greedily matching until it hits the > character.

 

@Effigy

 

I'm not sure I can agree with your statement (in this particular case). The reason is that I would be hard-pressed to think that there would be an 'id=' outside of a tag.. therefore, I would still use <a.+? in this case, as we know the 'id=' will be inside the tag..

 

Could you demonstrate a sample that uses <a[^>]* that works with this particular situation? Perhaps when we see how this would be implemented in this situation, we can better understand why using .+? is not as wise a choice.

Link to comment
Share on other sites

The reason is that I would be hard-pressed to think that there would be an 'id=' outside of a tag..

 

The concern isn't of id= being outside of a tag, but of a tag not having id=. In this instance the regex would keep consuming data--going outside of the tag and running into another, possibly not even an a--until it finds id=. Arguably, the data in question may always have id= in the a; however, (1) data may change; and (2) [^>]* will work in both cases.
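
A minimal sketch of that runaway behaviour, using a made-up line where the first anchor has no id= and a second anchor follows on the same line:

```php
<?php
// Hypothetical input: first anchor lacks id=, second anchor has one.
$html = "<a href='x'>[FLAG] no id here</a> <a href='y' id=829.1>somethingelse</a>";

// Lazy dot: starts at the first <a and keeps expanding past its '>'
// until it finally finds an id= inside the NEXT tag.
preg_match('~<a.*?id=([0-9.]+)[^>]*>~', $html, $dot);

// Negated class: cannot step over '>', so the first anchor fails cleanly
// and the match begins at the second anchor instead.
preg_match('~<a[^>]*id=([0-9.]+)[^>]*>~', $html, $cls);

echo $dot[0], "\n"; // spans both anchors
echo $cls[0], "\n"; // only the second opening tag
```

Both patterns capture 829.1 here, but the first one attributes it to a match that began in the wrong tag, which matters as soon as anything else in the pattern depends on what follows the opening tag.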

 

Additionally, according to Mastering Regular Expressions:

 

Common Optimizations

With a lazy quantifier, as in "(.*?)", the engine normally must jump between checking what the quantifier controls (the dot) with checking what comes after ("). For this and other reasons, lazy quantifiers are generally much slower than greedy ones....

 

Lazy Versus Greedy: Be Specific

"...use a greedy quantifier, as they are generally optimized a bit better than non-greedy quantifier..."

"Again, this is dependent on the data and the language, but with most engines, using a negated character class is much more efficient than a lazy quantifier."

 

Link to comment
Share on other sites

oh okay I get what you're saying.  If id=.. isn't there, then it's gonna keep on gobbling stuff up.  That makes sense. So how would I write it? Seems like the opposite of what I'm trying to do with the FLAG, so would I use positive lookahead?

 

 

Link to comment
Share on other sites

<pre>
<?php
$string = <<<STR
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
STR;
preg_match_all('%<a[^>]+id=([\d.]+)[^>]*>(?!\[FLAG\]\s)%si', $string, $matches);
array_shift($matches);
print_r($matches);
?>
</pre>

Link to comment
Share on other sites

The concern isn't of id= being outside of a tag, but of a tag not having id=. In this instance the regex would keep consuming data--going outside of the tag and running into another, possibly not even an a--until it finds id=. Arguably, the data in question may always have id= in the a; however, (1) data may change; and (2) [^>]* will work in both cases.

 

Point taken. Haven't considered it that way. So the question now is, how would one incorporate [^>] (and not use .*? within the a tag) into this situation? I'm scratching my head on this one (and I am curious how to resolve this in this manner). Here's what I have thus far..

 

$str = "
<a href='blah' id=1234.2>[FLAG] somethingElse</a>
<a href='blah'>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";
preg_match_all('#<a.+?id=([0-9.]+)[^>]*>(?!.*?\[FLAG\])#', $str, $matches);
echo '<pre>'.print_r($matches[1], true);

 

Note that I altered the middle entry in the OP's string by removing the 'id=' business... So in this case, only the last tag's id should be recorded (and this is indeed the case). But I am using a mix of both .+? and [^>] (much like the OP's last code snippet). But when I start going through the motions in that second line, do I understand correctly that the pattern will keep on matching all the way to the end of the line, then fail? How can the <a.+?id= part be re-written better, yet still take into account all the other aspects?

 

Additionally, according to Mastering Regular Expressions:

 

Common Optimizations

With a lazy quantifier, as in "(.*?)", the engine normally must jump between checking what the quantifier controls (the dot) with checking what comes after ("). For this and other reasons, lazy quantifiers are generally much slower than greedy ones....

 

Lazy Versus Greedy: Be Specific

"...use a greedy quantifier, as they are generally optimized a bit better than non-greedy quantifier..."

"Again, this is dependent on the data and the language, but with most engines, using a negated character class is much more efficient than a lazy quantifier."

 

No disputes there. The only reason I resorted to .+? was because in this case, I don't know what else to use..

 

EDIT - just noticed you posted as I did.. so I'll have a go at your code and come back to you...

Link to comment
Share on other sites

Ok.. I have thought of using something like what you just proposed.. but when I see this:

 

<a[^>]+

 

Would this not mean some backtracking? It would match all the way to the first > character it runs into, then must start backtracking until it finds the complete set of characters id=... (and obviously, if there is no id=, it fails).

 

This makes me wonder out loud which is faster: .+? until it matches an id (and, assuming there is no id, keep on matching till the end of the line (no s modifier), then fail), or match up to the > character, then backtrack? It's the backtracking speed that concerns me.. as from the book, we know the regex engine must start going back one saved state at a time, and check to see if the character starting at that saved state matches (in this case) the i in id, then if so, does the next match the d, then the =, and so forth.

 

I can understand using [^>] as a safety net.. but when I did think of using it instead of .+?, I  a) made the assumption that there would be an id within the tag (bad assumption on my part admittedly), but also figured backtracking would be slower than a lazy quantifier.. (of course, I could be all wrong here)..

Link to comment
Share on other sites

Never mind.. I did a speed test.. turns out, my solution was slower :(

 

EDIT:

$string = <<<STR
<a href='blah' id=1234.2>[FLAG] something</a>
<a href='blah' id=829.1>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
STR;
$loop = 1000;

$start = gettimeofday();
for($i = 0; $i < $loop; $i++){
   preg_match_all('%<a[^>]+id=([\d.]+)[^>]*>(?!\[FLAG\]\s)%si', $string, $matches);
}
$final = gettimeofday();
echo 'Effigy\'s method:' . '<br />';
echo '<pre>'.print_r($matches[1], true) . '<br />';
$sec = ($final['sec'] + $final['usec'] / 1000000)-($start['sec'] + $start['usec'] / 1000000);
printf("Time: %3f", $sec);

echo '<br /><br />';

$start = gettimeofday();
for($i = 0; $i < $loop; $i++){
   preg_match_all('#<a.+?id=([0-9.]+)[^>]*>(?!.*?\[FLAG\])#', $string, $matches);
}
$final = gettimeofday();
echo 'NRG\'s method:' . '<br />';
echo '<pre>'.print_r($matches[1], true) . '<br />';
$sec = ($final['sec'] + $final['usec'] / 1000000)-($start['sec'] + $start['usec'] / 1000000);
printf("Time: %3f", $sec);

 

Output sample:

Effigy's method:
Array
(
    [0] => 829.1
    [1] => 634.5
)

Time: 0.033764

NRG's method:
Array
(
    [0] => 829.1
    [1] => 634.5
)

Time: 0.062249

Link to comment
Share on other sites

I can understand using [^>] as a safety net.. but when I did think of using it instead of .+?, I  a) made the assumption that there would be an id within the tag (bad assumption on my part admittedly), but also figured backtracking would be slower than a lazy quantifier.. (of course, I could be all wrong here)..

 

Actually, this is correct. I crossed my wires on the lazy/greedy portion, while the real issue is using [^>]*? rather than .*? (or with +, doesn't matter). My apologies.

 

The difference between the lazy/greedy approach depends, as the book says, on the data.
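
As a rough sketch of that point: the quantifier can stay lazy as long as it is applied to [^>] rather than the dot. This uses NRG's modified test string (middle anchor without id=); the exact pattern is my own combination of the snippets in this thread, not something posted by either of them.

```php
<?php
$str = "
<a href='blah' id=1234.2>[FLAG] somethingElse</a>
<a href='blah'>somethingelse</a>
<a href='blah' id=634.5>somerandomcharlength</a>
";

// [^>]*? expands one character at a time looking for id=, but it can never
// cross the '>' that closes the opening tag, so a tag without id= just fails.
preg_match_all('#<a[^>]*?id=([0-9.]+)[^>]*>(?!\[FLAG\] )#', $str, $matches);

print_r($matches[1]); // only the id of the unflagged anchor that has one
```

Whether the lazy or the greedy form of [^>] is faster will depend on where id= tends to sit inside the tag, so it is worth timing both against real data.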

 

 

Link to comment
Share on other sites
