Jump to content

what is causing this regex to fail when parsing longer strings?


bitt3n

Recommended Posts

I am using this regex to parse html:

 

	$total_matches = preg_match_all('{

	<a\shref="
	(?<link>[^"]+)
	"(??!src=).)+src="
	(?<image>[^"]+)
	(??!designer-name">).)+designer-name">
	(?<brand>[^<]+)
	(??!title=).)+title="
	(?<title>((?!">).)+)
	(??!"price">).)+"price">\$
	(?<price>[\d.,]+)

	}xsi',$output,$all_matches,PREG_SET_ORDER);

 

This seems to work fine when I try to parse several matches in a row. However when I try parsing the full page these matches come from (I have permission to parse/use these data)

 

http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12

 

the regex fails (I actually get a 500 error). (I have confirmed that my script scrapes the page fine.) I've tried increasing the backtrack/recursion limits using

 

ini_set('pcre.backtrack_limit',100000000);
ini_set('pcre.recursion_limit',100000000);

 

but this does not solve the problem. I am wondering what I am doing wrong that is causing the regex to fail via PHP when it seems to be valid, and match code on the relevant page. Fiddling with it seems to suggest the negative lookaheads (in conjunction with the page length) are causing problems, but I'm not sure how I screwed them up.

 

I am running PHP 5.2.17. My PCRE_VERSION is 8.02 2010-03-19. (I've recently been informed that I should be using DomDocument to parse html, but I'd like to fix this for my own edification if nothing else.)

 

Link to comment
Share on other sites

Dunno how much it will help, but you could also optimize by not naming the captured groups.  When you name them, in addition to the named elements, $all_matches will still have the numerically indexed version of those matches, so you are basically doubling the size of $all_matches (aside from element 0 not being doubled, of course).  Perhaps this is causing the server to run out of memory...

 

For example:

 

$string = "abc123def";
preg_match_all('~(?<letters>[a-z]+)~',$string,$matches);
print_r($matches);

 

output:

Array
(
    [0] => Array
        (
            [0] => abc
            [1] => def
        )

    [letters] => Array
        (
            [0] => abc
            [1] => def
        )

    [1] => Array
        (
            [0] => abc
            [1] => def
        )

)

 

Remove the ?<name> stuff and reference the numerical indexes instead.  Or rename the numeric indexes after the match to save having to change code further down the pipe if you have to.

Link to comment
Share on other sites

I ended up replacing the negative lookaheads as suggested and it solved the problem. I'm not sure why the negative lookaheads were a problem, since it seems like they should be just as efficient as using lazy stars, but clearly they are not as efficient.

 

I could remove the names of the captured groups, but this makes the code that uses the resulting array slightly more confusing, since I have to keep track of which number corresponds to which type of information.

 

Another thing I noticed is that increasing the recursion backtrack and recursion limits by 1000X actually causes the script to take far longer to execute, to the point where it was timing out before conclusion. I had to remove these increases to get the script to complete.

 

Incidentally, if anyone has a favorite tutorial on domdocument() to recommend, I would love to hear about it.

Link to comment
Share on other sites

I could remove the names of the captured groups, but this makes the code that uses the resulting array slightly more confusing, since I have to keep track of which number corresponds to which type of information.

 

Well that's why I suggested renaming the array elements after the match.  Example:

 

$string = "abc123def";
preg_match_all('~([a-z]+)~',$string,$matches);
$matches['letters'] = $matches[1];
unset($matches[1]);
print_r($matches);

 

I mean sure, you will have to initially "on paper" sort out the numerical index of your named groups, but it's really not that hard to do... first parenthesis == 1, 2nd == 2, skip the non-captured groups when counting, etc... then your code can go about using $matches['namedelementhere'] as it was before.  The difference though is that you aren't wasting time and memory with naming the groups in the regex and doubling the matched array.

Link to comment
Share on other sites

I ended up replacing the negative lookaheads as suggested and it solved the problem. I'm not sure why the negative lookaheads were a problem, since it seems like they should be just as efficient as using lazy stars, but clearly they are not as efficient.

They may be functionality equivalent, but the regex engine can optimize lazy quantifiers much better.

 

Another thing I noticed is that increasing the recursion backtrack and recursion limits by 1000X actually causes the script to take far longer to execute, to the point where it was timing out before conclusion. I had to remove these increases to get the script to complete.

That probably means the backtrack limit is being reached, which is generally a very bad thing. Depending on the text and expression you may be able to optimize out any backtracking with a once-only subpattern (?>expr) which the regex engine won't backtrack into. For example,

(?>(??!X).)+X)

Link to comment
Share on other sites

 

Well that's why I suggested renaming the array elements after the match.  Example:

 

$string = "abc123def";
preg_match_all('~([a-z]+)~',$string,$matches);
$matches['letters'] = $matches[1];
unset($matches[1]);
print_r($matches);

 

I mean sure, you will have to initially "on paper" sort out the numerical index of your named groups, but it's really not that hard to do... first parenthesis == 1, 2nd == 2, skip the non-captured groups when counting, etc... then your code can go about using $matches['namedelementhere'] as it was before.  The difference though is that you aren't wasting time and memory with naming the groups in the regex and doubling the matched array.

 

that's true, is there some way to benchmark different regexes in PHP so we can see the difference?

 

That probably means the backtrack limit is being reached, which is generally a very bad thing. Depending on the text and expression you may be able to optimize out any backtracking with a once-only subpattern (?>expr) which the regex engine won't backtrack into. For example,

(?>(??!X).)+X)

 

now that you mention it, that is kind of disturbing. based on your observation, I ran my script with

 

if (preg_last_error() == PREG_NO_ERROR) {
    print 'There is no error.';
}
else if (preg_last_error() == PREG_INTERNAL_ERROR) {
    print 'There is an internal error!';
}
else if (preg_last_error() == PREG_BACKTRACK_LIMIT_ERROR) {
    print 'Backtrack limit was exhausted!';
}
else if (preg_last_error() == PREG_RECURSION_LIMIT_ERROR) {
    print 'Recursion limit was exhausted!';
}
else if (preg_last_error() == PREG_BAD_UTF8_ERROR) {
    print 'Bad UTF8 error!';
}
else if (preg_last_error() == PREG_BAD_UTF8_ERROR) {
    print 'Bad UTF8 offset error!';
}

 

which gave 'no error' as a result. so it seems like something else must be going on? it seems odd that reducing the limits would cause the script to run to completion if the reduced limits were not being reached.

Link to comment
Share on other sites

No, this is actually a different issue. I was getting a server error on account of using negative lookaheads. Removing the negative lookaheads solved the server error problem. However, the script was now timing out before the regex had parsed all the matches. Removing the limit increases on backtrack and recursion solved the timeout problem, and now all results get parsed.

 

requinix suggested this might be on account of the fact that the backtrack and recursion increases were actually being used by the regex, which seems logical, and is obviously not good. however preg_last_error() indicates that the lower limits are not being exceeded. Thus the reason why raising the limits causes the script to time out remains unclear.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.