bitt3n Posted August 10, 2011

I am using this regex to parse HTML:

$total_matches = preg_match_all('{
    <a\shref="
    (?<link>[^"]+)
    "
    ((?!src=).)+src="
    (?<image>[^"]+)
    ((?!designer-name">).)+designer-name">
    (?<brand>[^<]+)
    ((?!title=).)+title="
    (?<title>((?!">).)+)
    ((?!"price">).)+"price">\$
    (?<price>[\d.,]+)
}xsi', $output, $all_matches, PREG_SET_ORDER);

This seems to work fine when I test it against several matches in a row. However, when I try parsing the full page these matches come from (I have permission to parse/use these data), http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12, the regex fails and I actually get a 500 error. (I have confirmed that my script scrapes the page fine.) I've tried increasing the backtrack/recursion limits using

ini_set('pcre.backtrack_limit', 100000000);
ini_set('pcre.recursion_limit', 100000000);

but this does not solve the problem. I am wondering what I am doing wrong that causes the regex to fail via PHP when it seems to be valid and matches code on the relevant page. Fiddling with it suggests the negative lookaheads (in conjunction with the page length) are causing problems, but I'm not sure how I screwed them up. I am running PHP 5.2.17, and my PCRE_VERSION is 8.02 2010-03-19. (I've recently been told that I should be using DOMDocument to parse HTML, but I'd like to fix this for my own edification if nothing else.)
JAY6390 Posted August 10, 2011

Hmm, I'm really not sure what the issue is with your regex, but I would definitely second using DOMDocument to parse it instead.
requinix Posted August 10, 2011

All those ((?!X).)+X can be written much better as .*?X. See if that makes a big enough difference.
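The two forms above really do match the same text; here is a quick sketch using Python's re module, whose syntax for these constructs matches PCRE (the HTML fragment and pattern pieces are invented for illustration, not taken from the page being scraped):

```python
import re

# Invented fragment of the kind of markup being scraped.
html = '<a href="page.html" class="new" src="pic.jpg"'

# Tempered-greedy version: consume one character at a time, as long as
# "src=" does not start at the current position.
tempered = re.search(r'href="([^"]+)"((?!src=).)+src="([^"]+)', html, re.S)

# Lazy version: let the engine skip ahead minimally to the next src=".
lazy = re.search(r'href="([^"]+)".*?src="([^"]+)', html, re.S)

print(tempered.group(1), tempered.group(3))  # page.html pic.jpg
print(lazy.group(1), lazy.group(2))          # page.html pic.jpg
```

The results are identical; the difference is purely in how much work the engine does (and how much backtracking state it keeps) to get there.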
.josh Posted August 11, 2011

Dunno how much it will help, but you could also optimize by not naming the captured groups. When you name them, $all_matches will still contain the numerically indexed versions of those matches in addition to the named elements, so you are basically doubling the size of $all_matches (aside from element 0 not being doubled, of course). Perhaps this is causing the server to run out of memory. For example:

$string = "abc123def";
preg_match_all('~(?<letters>[a-z]+)~', $string, $matches);
print_r($matches);

output:

Array
(
    [0] => Array
        (
            [0] => abc
            [1] => def
        )
    [letters] => Array
        (
            [0] => abc
            [1] => def
        )
    [1] => Array
        (
            [0] => abc
            [1] => def
        )
)

Remove the ?<name> stuff and reference the numerical indexes instead. Or rename the numeric indexes after the match, to save having to change code further down the pipe.
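The doubled array above is specific to PHP's $matches layout; for comparison, Python's re also makes every named capture reachable by both name and number, but through separate views rather than one doubled structure (a small illustration of the same named-group idea, not the PHP behavior itself):

```python
import re

# Same toy pattern as the PHP example, in Python's named-group syntax.
m = re.search(r'(?P<letters>[a-z]+)', "abc123def")

# One capture, two ways to reach it: by name or by position.
print(m.group('letters'))   # abc
print(m.group(1))           # abc
print(m.groupdict())        # {'letters': 'abc'}
```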
bitt3n Posted August 11, 2011

I ended up replacing the negative lookaheads as suggested, and it solved the problem. I'm not sure why the negative lookaheads were a problem, since it seems like they should be just as efficient as lazy stars, but clearly they are not. I could remove the names of the captured groups, but that makes the code that uses the resulting array slightly more confusing, since I have to keep track of which number corresponds to which type of information. Another thing I noticed is that increasing the backtrack and recursion limits by 1000x actually caused the script to take far longer to execute, to the point where it was timing out before completion; I had to remove those increases to get the script to complete. Incidentally, if anyone has a favorite tutorial on DOMDocument to recommend, I would love to hear about it.
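On the DOMDocument question: the general idea is to load the HTML into a document tree and query elements and attributes, instead of running a regex over the raw string. A minimal sketch of that idea using Python's stdlib html.parser (PHP's DOMDocument + DOMXPath is structurally similar; the markup here is invented):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

p = LinkCollector()
p.feed('<p><a href="/one.html">one</a> <a href="/two.html">two</a></p>')
print(p.links)  # ['/one.html', '/two.html']
```

A tree-based parser never backtracks over the whole page, which is exactly the cost that made the tempered-greedy regex blow up here.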
.josh Posted August 11, 2011

"I could remove the names of the captured groups, but this makes the code that uses the resulting array slightly more confusing, since I have to keep track of which number corresponds to which type of information."

Well, that's why I suggested renaming the array elements after the match. Example:

$string = "abc123def";
preg_match_all('~([a-z]+)~', $string, $matches);
$matches['letters'] = $matches[1];
unset($matches[1]);
print_r($matches);

I mean, sure, you will initially have to sort out the numerical index of your named groups "on paper", but it's really not that hard to do: first parenthesis == 1, second == 2, skip the non-capturing groups when counting, etc. Then your code can go about using $matches['namedelementhere'] as it was before. The difference is that you aren't wasting time and memory naming the groups in the regex and doubling the matched array.
requinix Posted August 11, 2011

"I ended up replacing the negative lookaheads as suggested and it solved the problem. I'm not sure why the negative lookaheads were a problem, since it seems like they should be just as efficient as using lazy stars, but clearly they are not as efficient."

They may be functionally equivalent, but the regex engine can optimize lazy quantifiers much better.

"Another thing I noticed is that increasing the backtrack and recursion limits by 1000x actually causes the script to take far longer to execute, to the point where it was timing out before conclusion. I had to remove these increases to get the script to complete."

That probably means the backtrack limit is being reached, which is generally a very bad thing. Depending on the text and expression, you may be able to optimize out any backtracking with a once-only subpattern, (?>expr), which the regex engine won't backtrack into. For example, (?>((?!X).)+X).
bitt3n Posted August 12, 2011

"Well that's why I suggested renaming the array elements after the match. ... The difference though is that you aren't wasting time and memory with naming the groups in the regex and doubling the matched array."

That's true. Is there some way to benchmark different regexes in PHP so we can see the difference?

"That probably means the backtrack limit is being reached, which is generally a very bad thing. ..."

Now that you mention it, that is kind of disturbing. Based on your observation, I ran my script with

if (preg_last_error() == PREG_NO_ERROR) {
    print 'There is no error.';
} else if (preg_last_error() == PREG_INTERNAL_ERROR) {
    print 'There is an internal error!';
} else if (preg_last_error() == PREG_BACKTRACK_LIMIT_ERROR) {
    print 'Backtrack limit was exhausted!';
} else if (preg_last_error() == PREG_RECURSION_LIMIT_ERROR) {
    print 'Recursion limit was exhausted!';
} else if (preg_last_error() == PREG_BAD_UTF8_ERROR) {
    print 'Bad UTF8 error!';
} else if (preg_last_error() == PREG_BAD_UTF8_OFFSET_ERROR) {
    print 'Bad UTF8 offset error!';
}

which gave "no error" as a result. So it seems like something else must be going on? It seems odd that reducing the limits would cause the script to run to completion if the reduced limits were not being reached.
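On the benchmarking question: one simple approach in PHP is to wrap repeated preg_match_all() calls between two microtime(true) calls. The same idea, sketched with Python's timeit against the two pattern styles from earlier in the thread (the subject string and patterns are invented for illustration):

```python
import re
import timeit

# A subject with 200 link/image pairs of the invented markup shape.
subject = '<a href="x.html" junk junk src="x.jpg" ' * 200

tempered = re.compile(r'href="([^"]+)"((?!src=).)+src="([^"]+)', re.S)
lazy = re.compile(r'href="([^"]+)".*?src="([^"]+)', re.S)

# Time 50 full scans of the subject with each pattern.
t1 = timeit.timeit(lambda: tempered.findall(subject), number=50)
t2 = timeit.timeit(lambda: lazy.findall(subject), number=50)

# Both find the same 200 anchors; only the cost differs.
print(len(tempered.findall(subject)), len(lazy.findall(subject)))
print(f"tempered: {t1:.4f}s  lazy: {t2:.4f}s")
```

The check that both patterns return the same match count matters: a benchmark comparing two regexes is only meaningful if they actually produce identical results on the test subject.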
.josh Posted August 12, 2011

Well, you said you were getting a server error, so my guess is that your server ran out of resources trying to run it. Anyway, instead of all those conditions checking preg_last_error(), why not just echo out preg_last_error()? But that's neither here nor there.
bitt3n Posted August 12, 2011

No, this is actually a different issue. I was getting a server error on account of the negative lookaheads, and removing them solved the server-error problem. However, the script was then timing out before the regex had parsed all the matches. Removing the increases to the backtrack and recursion limits solved the timeout problem, and now all results get parsed. requinix suggested this might be because the increased backtrack and recursion allowances were actually being used by the regex, which seems logical and is obviously not good. However, preg_last_error() indicates that the lower limits are not being exceeded, so the reason why raising the limits causes the script to time out remains unclear.