bitt3n Posted August 10, 2011

I am using this regex to parse HTML:

$total_matches = preg_match_all('{
    <a\shref="
    (?<link>[^"]+)
    "
    ((?!src=).)+src="
    (?<image>[^"]+)
    ((?!designer-name">).)+designer-name">
    (?<brand>[^<]+)
    ((?!title=).)+title="
    (?<title>((?!">).)+)
    ((?!"price">).)+"price">\$
    (?<price>[\d.,]+)
}xsi', $output, $all_matches, PREG_SET_ORDER);

This seems to work fine when I test it against several matches in a row. However, when I try parsing the full page these matches come from (I have permission to parse/use these data), http://www.mytheresa.com/us_en/new-arrivals/what-s-new-this-week-1.html?limit=12, the regex fails and I actually get a 500 error. (I have confirmed that my script scrapes the page fine.) I've tried increasing the backtrack/recursion limits using

ini_set('pcre.backtrack_limit', 100000000);
ini_set('pcre.recursion_limit', 100000000);

but this does not solve the problem. I am wondering what I am doing wrong that causes the regex to fail via PHP when it seems to be valid and matches code on the relevant page. Fiddling with it suggests the negative lookaheads (in conjunction with the page length) are causing problems, but I'm not sure how I screwed them up. I am running PHP 5.2.17, and my PCRE_VERSION is 8.02 2010-03-19. (I've recently been told that I should be using DOMDocument to parse HTML, but I'd like to fix this for my own edification if nothing else.)
JAY6390 Posted August 10, 2011

Hmm, I'm really not sure what the issue is with your regex, but I would definitely second using DOMDocument to parse it instead.
requinix Posted August 10, 2011

All those ((?!X).)+X can be written much better as .*?X. See if that makes a big enough difference.
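The two forms above really do match the same text; here is a quick sketch using Python's re module, whose syntax for these constructs matches PCRE (the HTML fragment and pattern pieces are invented for illustration, not taken from the page being scraped):

```python
import re

# Invented fragment of the kind of markup being scraped.
html = '<a href="page.html" class="new" src="pic.jpg"'

# Tempered-greedy version: consume one character at a time, as long as
# "src=" does not start at the current position.
tempered = re.search(r'href="([^"]+)"((?!src=).)+src="([^"]+)', html, re.S)

# Lazy version: let the engine skip ahead minimally to the next src=".
lazy = re.search(r'href="([^"]+)".*?src="([^"]+)', html, re.S)

print(tempered.group(1), tempered.group(3))  # page.html pic.jpg
print(lazy.group(1), lazy.group(2))          # page.html pic.jpg
```

The results are identical; the difference is purely in how much work the engine does (and how much backtracking state it keeps) to get there.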
.josh Posted August 11, 2011

Dunno how much it will help, but you could also optimize by not naming the captured groups. When you name them, $all_matches will still contain the numerically indexed versions of those matches in addition to the named elements, so you are basically doubling the size of $all_matches (aside from element 0 not being doubled, of course). Perhaps this is causing the server to run out of memory. For example:

$string = "abc123def";
preg_match_all('~(?<letters>[a-z]+)~', $string, $matches);
print_r($matches);

output:

Array
(
    [0] => Array
        (
            [0] => abc
            [1] => def
        )
    [letters] => Array
        (
            [0] => abc
            [1] => def
        )
    [1] => Array
        (
            [0] => abc
            [1] => def
        )
)

Remove the ?<name> stuff and reference the numerical indexes instead. Or rename the numeric indexes after the match, to save having to change code further down the pipe.
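The doubled array above is specific to PHP's $matches layout; for comparison, Python's re also makes every named capture reachable by both name and number, but through separate views rather than one doubled structure (a small illustration of the same named-group idea, not the PHP behavior itself):

```python
import re

# Same toy pattern as the PHP example, in Python's named-group syntax.
m = re.search(r'(?P<letters>[a-z]+)', "abc123def")

# One capture, two ways to reach it: by name or by position.
print(m.group('letters'))   # abc
print(m.group(1))           # abc
print(m.groupdict())        # {'letters': 'abc'}
```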
bitt3n Posted August 11, 2011

I ended up replacing the negative lookaheads as suggested, and it solved the problem. I'm not sure why the negative lookaheads were a problem, since it seems like they should be just as efficient as lazy stars, but clearly they are not. I could remove the names of the captured groups, but that makes the code that uses the resulting array slightly more confusing, since I have to keep track of which number corresponds to which type of information. Another thing I noticed is that increasing the backtrack and recursion limits by 1000x actually caused the script to take far longer to execute, to the point where it was timing out before completion; I had to remove those increases to get the script to complete. Incidentally, if anyone has a favorite tutorial on DOMDocument to recommend, I would love to hear about it.
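On the DOMDocument question: the general idea is to load the HTML into a document tree and query elements and attributes, instead of running a regex over the raw string. A minimal sketch of that idea using Python's stdlib html.parser (PHP's DOMDocument + DOMXPath is structurally similar; the markup here is invented):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag seen."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

p = LinkCollector()
p.feed('<p><a href="/one.html">one</a> <a href="/two.html">two</a></p>')
print(p.links)  # ['/one.html', '/two.html']
```

A tree-based parser never backtracks over the whole page, which is exactly the cost that made the tempered-greedy regex blow up here.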
.josh Posted August 11, 2011

"I could remove the names of the captured groups, but this makes the code that uses the resulting array slightly more confusing, since I have to keep track of which number corresponds to which type of information."

Well, that's why I suggested renaming the array elements after the match. Example:

$string = "abc123def";
preg_match_all('~([a-z]+)~', $string, $matches);
$matches['letters'] = $matches[1];
unset($matches[1]);
print_r($matches);

I mean, sure, you will initially have to sort out the numerical index of your named groups "on paper", but it's really not that hard to do: first parenthesis == 1, second == 2, skip the non-capturing groups when counting, etc. Then your code can go about using $matches['namedelementhere'] as it was before. The difference is that you aren't wasting time and memory naming the groups in the regex and doubling the matched array.
requinix Posted August 11, 2011

"I ended up replacing the negative lookaheads as suggested and it solved the problem. I'm not sure why the negative lookaheads were a problem, since it seems like they should be just as efficient as using lazy stars, but clearly they are not as efficient."

They may be functionally equivalent, but the regex engine can optimize lazy quantifiers much better.

"Another thing I noticed is that increasing the backtrack and recursion limits by 1000x actually causes the script to take far longer to execute, to the point where it was timing out before conclusion. I had to remove these increases to get the script to complete."

That probably means the backtrack limit is being reached, which is generally a very bad thing. Depending on the text and expression, you may be able to optimize out any backtracking with a once-only subpattern, (?>expr), which the regex engine won't backtrack into. For example, (?>((?!X).)+X).
bitt3n Posted August 12, 2011

"Well that's why I suggested renaming the array elements after the match. ... The difference though is that you aren't wasting time and memory with naming the groups in the regex and doubling the matched array."

That's true. Is there some way to benchmark different regexes in PHP so we can see the difference?

"That probably means the backtrack limit is being reached, which is generally a very bad thing. ..."

Now that you mention it, that is kind of disturbing. Based on your observation, I ran my script with

if (preg_last_error() == PREG_NO_ERROR) {
    print 'There is no error.';
} else if (preg_last_error() == PREG_INTERNAL_ERROR) {
    print 'There is an internal error!';
} else if (preg_last_error() == PREG_BACKTRACK_LIMIT_ERROR) {
    print 'Backtrack limit was exhausted!';
} else if (preg_last_error() == PREG_RECURSION_LIMIT_ERROR) {
    print 'Recursion limit was exhausted!';
} else if (preg_last_error() == PREG_BAD_UTF8_ERROR) {
    print 'Bad UTF8 error!';
} else if (preg_last_error() == PREG_BAD_UTF8_OFFSET_ERROR) {
    print 'Bad UTF8 offset error!';
}

which gave "no error" as a result. So it seems like something else must be going on? It seems odd that reducing the limits would cause the script to run to completion if the reduced limits were not being reached.
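On the benchmarking question: one simple approach in PHP is to wrap repeated preg_match_all() calls between two microtime(true) calls. The same idea, sketched with Python's timeit against the two pattern styles from earlier in the thread (the subject string and patterns are invented for illustration):

```python
import re
import timeit

# A subject with 200 link/image pairs of the invented markup shape.
subject = '<a href="x.html" junk junk src="x.jpg" ' * 200

tempered = re.compile(r'href="([^"]+)"((?!src=).)+src="([^"]+)', re.S)
lazy = re.compile(r'href="([^"]+)".*?src="([^"]+)', re.S)

# Time 50 full scans of the subject with each pattern.
t1 = timeit.timeit(lambda: tempered.findall(subject), number=50)
t2 = timeit.timeit(lambda: lazy.findall(subject), number=50)

# Both find the same 200 anchors; only the cost differs.
print(len(tempered.findall(subject)), len(lazy.findall(subject)))
print(f"tempered: {t1:.4f}s  lazy: {t2:.4f}s")
```

The check that both patterns return the same match count matters: a benchmark comparing two regexes is only meaningful if they actually produce identical results on the test subject.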
.josh Posted August 12, 2011

Well, you said you were getting a server error, so my guess is that your server ran out of resources trying to run it. Anyway, instead of all those conditions checking preg_last_error(), why not just echo out preg_last_error()? But that's neither here nor there.
bitt3n Posted August 12, 2011

No, this is actually a different issue. I was getting a server error on account of the negative lookaheads, and removing them solved the server-error problem. However, the script was then timing out before the regex had parsed all the matches. Removing the increases to the backtrack and recursion limits solved the timeout problem, and now all results get parsed. requinix suggested this might be because the increased backtrack and recursion allowances were actually being used by the regex, which seems logical and is obviously not good. However, preg_last_error() indicates that the lower limits are not being exceeded, so the reason why raising the limits causes the script to time out remains unclear.