preg_replace_callback getting bad result

charley12 · April 28, 2011

Hello!

I'am newbie with regex, but at now I need your help.

I write a regex for get exp atribute and content of tag line.

<line exp="something">

content

</line>

function clb($matches)
{
    var_dump($matches);
}

preg_replace_callback('/<line +exp *= *"([a-zA-Z0-9\.\-\ ]*)">.*\<\/line\>/siu', "clb", $str);

If the line tag is only single in $str it's everything okey.

array(2) { [0]=> string(45) " content " [1]=> string(9) "something" }

But when in $str is more than one line tag, I getting bad result.

$str = "
<line exp=\"something\">
\n
content
</line>

<line exp=\"something2\">
\n
content2
</line>
";

array(2) { [0]=> string(96) " content content2 " [1]=> string(9) "something" }

I need something like :

array(4) { [0]=> string(7) "content" [1]=> string(9) "something" [2]=> string( "content2" [3]=> string(10) "something2" }

Where I doing mistake please?

Thank you for respond.

.josh · April 28, 2011

If you are looking for an easy fix...

Your mistake is the .* here :

preg_replace_callback('/<line +exp *= *"([a-zA-Z0-9\.\-\ ]*)">.*\<\/line\>/siu', "clb", $str);

Change it to .*?

If you want to actually learn something, read this tl;dr....

* is a quantifier. Quantifiers are by default greedy (versus lazy).

The difference between greedy and lazy matching is this: A greedy quantifier will match everything it can possibly match, and then start giving back stuff if it has to, to satisfy whatever comes after it in your pattern. A lazy quantifier will creep forward and only match until the first time it hits something that matches the rest of your pattern.

Here is an example. I have a string, and I want to match everything up to "O".

$string = "XXXXOXXXXXOXXXXXO";
preg_match("/.*O/",$string,$greedy);
preg_match("/.*?O/",$string,$lazy);

Pattern 1 (greedy) : "/.*O/"

With this pattern, what happens is .* will first match everything it can match. the dot is a "match all" character that match (almost) everything. So first it matches literally the whole string:

"XXXXOXXXXXOXXXXXO"

But then the pattern ends with an O so it has to give up characters until it satisfies that part of the pattern. So it gives up characters 1 at a time until it finds the first O:

"XXXXOXXXXXOXXXXXO"

And then the O in your pattern matches the last O in the string and $greedy will return the full string.

Pattern 2 (lazy) : "/.*?O/"

So in this pattern, a ? is thrown into the mix. The .* part of the pattern still matches (almost) everything, but the ? tells it to be a lazy match. With this pattern, what happens is .*? Will match only up to the point where the rest of the pattern can be fulfilled. So instead of consuming everything and then moving backwards to give up stuff until the rest of the pattern can be fulfilled, it moves forward until it finds the first instance in which the rest of the pattern can be fulfilled:

"XXXXOXXXXXOXXXXXO"

So the overall takeaway here is that greedy quantifiers will consume everything it can and then give up only what it has to, to satisfy the rest of the pattern. If the pattern can be fulfilled in more than one place in your string, it will be fulfilled from the last instance in the string (based on what the greedy quantifier actually matches). Whereas lazy quantifiers will move forward and consume only to the point where the first instance of the rest of the pattern occurs.

So for your pattern, you should be doing this:

function clb($matches)
{
    var_dump($matches);
}

preg_replace_callback('/<line +exp *= *"([a-zA-Z0-9\.\-\ ]*)">.*?\<\/line\>/siu', "clb", $str);

However, a couple notes here...

First off, you say you want it to display the following:

array(4) { [0]=> string(7) "content" [1]=> string(9) "something" [2]=> string( "content2" [3]=> string(10) "something2" }

This isn't going to happen with preg_replace_callback(). What you will actually see is two separate calls to clb() and two var_dumps, not one single var_dump: one for each matched pattern. If you are looking to just match something (versus replacing it with something else), you should be using preg_match (if you are wanting to just match the first occurrence) or preg_match_all (if you do indeed want to grab all occurrences).

Another thing to mention is your pattern is a bit bloated and inefficient. So you have this:

'/<line +exp *= *"([a-zA-Z0-9\.\-\ ]*)">.*?\<\/line\>/siu'

First off, you have a lot of unnecessary escaping going on. You only need to escape characters that mean something special to the regex engine. So you don't need to escape < and you don't need to escape . or - or the space in the character class (well, maybe the hyphen, with the way you have it positioned in there, but you can move it to the front or the back of the list to avoid having to escape it).

Also, you can avoid escaping the / by using a different delimiter for your pattern. You can use pretty much any non-alphanumeric character as your pattern delimiter, so it's usually a good idea to use something other than a / if you are trying to parse html, because / creeps up in html a lot, and therefore you have to escape it and make your pattern more bloated and harder to read.

So your pattern can be cleaned up as so:

'~<line +exp *= *"([-a-zA-Z0-93. ]*)">.*?</line>~siu'

But there's more that can be done. So back to greedy and lazy quantifiers... all fine and dandy, except .*? isn't really efficient. In order for it to work, it still has to look ahead and find where the rest of the pattern can be satisfied. This is extra work for the regex engine. The ideal thing is to have the quality of the greedy quantifier where it can just match everything it's supposed to, but not have to turn around and backtrack to give stuff up.

So there are two ways we can do that: with a negative character class, or with a negative look-ahead.

negative character class

You already have a character class in your pattern: [-a-zA-Z0-93. ]

This is a positive character class, which tells the regex engine to match any one of those characters (literal characters or within the ranges, like a-z), and your * quantifier after that says match 0 or more of that character class.

A negative character class tells the regex engine to match anything not specified. A negative character class starts with ^ as the first character inside the square brackets. So, going back to my XO string example:

$string = "XXXXOXXXXXOXXXXXO";
preg_match("/.*O/",$string,$greedy);
preg_match("/.*?O/",$string,$lazy);
preg_match("/[^O]*O/",$string,$ncc);

matches:

.* : "XXXXOXXXXXOXXXXXO"

.*? : "XXXXOXXXXXOXXXXXO"

[^O]* : "XXXXOXXXXXOXXXXXO"

So here you see that the negative character class will essentially match the same thing as the .*? lazy quantifier. But the difference is that the negative character class doesn't have to look ahead and try to match the rest of the pattern. It just looks at the next character and says "are you in this character class? No? Okay matched. Yes? Okay stop matching."

So how can negative character class be applied to your pattern? Well, you could use [^<]* and it will match everything up until the next < but that assumes that the content between your tags won't ever have a < in it, and that the next < will be from </line>. If you know this is the case, then that is what you should use, as it will be more efficient than .*? for sure, but also more efficient than a negative look-ahead. But if that is not the case... the next best option would be a negative look-ahead.

Negative Look-ahead

Okay so the problem with a negative character class is that it boils down to matching a single character (though it can match anything specified in the class, it still only matches one character at a time). But (not) matching for a single character isn't really enough in some cases, especially when it comes to matching a delimiter that is more than one character (like closing html tags). Look-ahead gives you the ability to look ahead to more than one character at a time. It looks like this: (?!) And you put what you want to look for between the ! and ) so for instance (?!</line)

So (?!</line>) just looks ahead to see if that is what is next in your string, but it doesn't actually match anything (the official term for this is "zero width assertion"). You still need to use something to actually match for the content. Since we want to match anything that is not </line> we can use a dot.

((?!</line>).)

The outer (...) is to capture what you are matching. So basically this first looks ahead to see if "</line>" is not there, and then matches the next thing, which is (almost) any one character. But we want to match more than one character. Just like with .*? we want to match everything up to </line>. This is where it gets kind of tricky...

We could try...

((?!</line>).*?)

But this won't actually work. What really happens here is that it tries to match everything in your string (lazy) - .*? - not followed by </line>. Since this is a lazy match, it moves forward, looking for </line> and since it occurs later on in your string as a whole, the .*? will not match anything. If we were to change it to this (make it greedy):

((?!</line>).*)

...it would match something..but not what you want! It will greedily consume your entire string and since there is nothing after the end of your string, the negative look-ahead still works because </line> is not matched. I know, that's probably the most confusing part of this, and has tripped me up for the longest time.

The solution here is to use that negative look-ahead to look ahead and then match one character at a time:

((??!</line>).)*)

So first we have the core (?!</line>) which is the negative look-ahead. Then we have (?:(?!</line>).) which makes this as a group, but does not capture it. Then we have the * to quantify it, match 0 or more of that group, and wrap all that in (..) to actually capture it.

So now it will look at things one character at a time and see if the character is part of the </line>. If not, add it to the group to capture.

example string: "foo<bar</line>"

Is 'f' part of </line>? Nope, put it into the captured group.
Is 'o' part of </line>? Nope, put it into the captured group.
Is 'o' part of </line>? Nope, put it into the captured group.
Is '<' part of </line>? Yes! It's the same as the starting character of </line> BUT overall, the next bits of the string will not match </line> because the next character after that is a b not a /, so it does not match, put it into the captured group.
Is 'b' part of </line>? Nope, put it into the captured group.
Is 'a' part of </line>? Nope, put it into the captured group.
Is 'r' part of </line>? Nope, put it into the captured group.
Is '<' part of </line>? Yes! and the following characters will all match up to </line>, so there is nothing else to capture.

So overall, we now have:

'~<line +exp *= *"([-a-zA-Z0-93. ]*)">((??!</line>).)*)</line>~isu'

So overall, the regex engine is still having to look head in the string and see if something is there, so it's not quite as efficient as a negative character class, but in many cases, a negative character class isn't really an option, and this is still more efficient than having to look ahead at the entire rest of the pattern when you use a lazy match all .*?

But we can still make this a bit more streamlined. You have that positive character class to match and capture the value of the exp attribute. Unless you are specifically wanting to match values that only have those characters (like for instance, you don't want to match if exp="foo_bar"), you can make this more efficient by applying a negative character class instead, by just matching anything that is not a closing quote. And what about if single quotes are used instead of double (single quote is escaped because that's the quote you are using to wrap this whole pattern for your function call)?

'~<line +exp *= *[\'"]([^\'"]*)[\'"]>((??!</line>).)*)</line>~isu'

Also I see that you attempt to quantify the spaces in there, which means you're trying to match in case there is a difference in spacing. But what if there are other attributes in that opening line tag? Your pattern currently only matches if exp is the first (and only) attribute. Also, it doesn't consider any errant spacing that might occur between the " and >. Also, most people use either 0 or 1 spaces between equal signs and attributes, so you can probably safely use a ? instead of * which is technically more efficient. But if you want to cover just-in-cases, you may as well just use \s instead of a space because it is easier to read \s than " ". \s will match any "white space" character (like a space from spacebar, or a tab, etc..), so technically it will match more than just a spacebar space but it's still whitespace and \s is easier to see in a pattern than a blank space.

So overall, a more flexible, efficient pattern would be:

'~<line[^>]*exp\s*=\s*[\'"]([^\'"]*)[\'"][^>]*>((??!</line>).)*)</line>~isu'

So...using your example content...

<?php

$str = <<<CONTENT
<line exp="something">
content
</line>

<line exp="something2">
content2
</line>
CONTENT;

function clb($matches)
{
    var_dump($matches);
    echo "\n";
}

preg_replace_callback('~<line[^>]*exp\s*=\s*[\'"]([^\'"]*)[\'"][^>]*>((??!</line>).)*)</line>~isu', "clb", $str);
?>

outputs (from rightclick > view source):

array(3) {
  [0]=>
  string(38) "<line exp="something"> 
content
</line>"
  [1]=>
  string(9) "something"
  [2]=>
  string(9) "
content
"
}

array(3) {
  [0]=>
  string(40) "<line exp="something2"> 
content2
</line>"
  [1]=>
  string(10) "something2"
  [2]=>
  string(10) "
content2
"
}

But again I have to question whether or not you should be using preg_replace_callback vs. preg_match or preg_match_all.. maybe you are just var_dumping to test the match, and you really do intend to replace something, but if you are just looking to match things and use those matches somewhere else and do not intend to actually change $str, you should be using preg_match or preg_match_all instead.

Philip · April 28, 2011

^^ that's what I call TL;DR

charley12 · April 29, 2011

Really thank you for your respond. It was really helpful.

I appreciate it.

Sign In

preg_replace_callback getting bad result

Recommended Posts

charley12

Link to comment

Share on other sites

.josh

Link to comment

Share on other sites

Philip

Link to comment

Share on other sites

charley12

Link to comment

Share on other sites

Join the conversation

Browse

Activity

Important Information