Chinchilla3k Posted September 25, 2008

Hello,

I am using preg_match to parse large dynamic HTML pages. preg_match fails at a certain point when the pattern uses a lazy quantifier (it also fails when the entire expression is set to ungreedy). Here is sample code illustrating the problem. test.txt is a 1 megabyte file with the letter 'A' repeating for the first 500kb. The text "I WILL NOT REACH HERE" occurs a little after the 500kb mark.

    <?php
    $handle = fopen("test.txt", "r");
    $var = fread($handle, 1048576);
    $arr = array();
    echo preg_match('/.+?I WILL NOT REACH/s', $var, $arr); //will output 0
    ?>

Why does it fail? Can anyone suggest a workaround for this? It seems to fail around the 99999th character mark.

To give you an example of the situation where I use lazy quantifiers, consider the HTML code:

    <div class="topictitle">title</div>(dynamically generated data)<div class="post">first post</div>

I would use a regular expression to extract the topic title and the first post via capture groups:

    preg_match('/opictitle">([^<]+)<\/div>.+?ost">(.+?)<\/div>/s', $data, $output); //something similar to this.

The goal is to extract only the topic title and the first post. I cannot use greedy quantifiers because they would give the last post on the page. Regardless, even with greedy quantifiers, if there are more than 99999 characters after the text being matched, it also fails.

Can anyone suggest an alternative approach or maybe a workaround?

Thanks.

Link to comment: https://forums.phpfreaks.com/topic/125840-quantifier-limit/

effigy Posted September 25, 2008

I believe this is what the manual is referring to here:

    "All values in repeating quantifiers must be less than 65536."

There must be a better way of partitioning or analyzing your data.

Chinchilla3k (Author) Posted September 25, 2008

effigy said: "I believe this is what the manual is referring to here: There must be a better way of partitioning or analyzing your data."

Thank you for linking me to the document; lots of useful information.

Yes, in light of this information there must be a better way of partitioning/analyzing the data. It's strange that it would only break at the 99999th character, though. The implementation I'm using probably has the limitation hardcoded.

I would rather use regex without the limitation, as the data I'm analyzing doesn't get much larger. However, the project I'm working on won't be launched on a server I completely control, so I can't run my own build of PHP. I'll work around this.

Thank you.
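
A note on the 99999-character observation: the thread does not confirm the cause, but the figure is close to 100000, which was the default value of PHP's pcre.backtrack_limit setting in releases before 5.3.7. One way to check is to call preg_last_error() right after the failed match. The sketch below is only a diagnostic under that assumption. The helper names preg_error_name and match_with_diagnostics are hypothetical, the subject string is an in-memory stand-in for test.txt, and the raised limit of 5000000 is an arbitrary illustrative value.

```php
<?php
// Hypothetical helper: map a preg_last_error() code to its constant name.
function preg_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? 'unknown (' . $code . ')';
}

// Hypothetical helper: run preg_match and return [result, PCRE error name].
function match_with_diagnostics(string $pattern, string $subject): array
{
    $matches = [];
    $result = preg_match($pattern, $subject, $matches);
    return [$result, preg_error_name(preg_last_error())];
}

// Stand-in for test.txt (assumption): 500kb of 'A', then the marker text.
$subject = str_repeat('A', 500 * 1024) . 'I WILL NOT REACH HERE' . str_repeat('B', 1024);
$pattern = '/.+?I WILL NOT REACH/s';

list($result, $error) = match_with_diagnostics($pattern, $subject);
echo 'pcre.backtrack_limit = ', ini_get('pcre.backtrack_limit'), "\n";
echo 'result = ', var_export($result, true), ', error = ', $error, "\n";

// If the backtrack limit was the cause, raise it for this request only and retry.
// ini_set() works at runtime, so no custom PHP build is required.
if ($error === 'PREG_BACKTRACK_LIMIT_ERROR') {
    ini_set('pcre.backtrack_limit', '5000000'); // illustrative value, an assumption
    list($result, $error) = match_with_diagnostics($pattern, $subject);
    echo 'retry result = ', var_export($result, true), ', error = ', $error, "\n";
}
```

Whether the first attempt fails depends on the PHP version and its configured limits, so the script reports what it sees instead of assuming an outcome. If the reported error is something other than PREG_BACKTRACK_LIMIT_ERROR, raising this setting will not help.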

effigy Posted September 29, 2008

You could try a split between the two desired patterns, /(?:A|B)/, and move forward from there. If you need additional help, please provide more data.
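
A minimal sketch of how the split suggestion might look for the topic title and first post example from the first post. It is one possible reading of /(?:A|B)/, not necessarily what was intended: A and B are taken to be the two opening div tags, and the title is assumed to appear before the first post. The function names text_before_close and extract_title_and_first_post are hypothetical. Because the split pattern contains no unbounded quantifier, it should not need to backtrack across the dynamically generated data.

```php
<?php
// Hypothetical helper: return the text of $chunk up to the first closing </div>.
function text_before_close(string $chunk): string
{
    $end = strpos($chunk, '</div>');
    return $end === false ? $chunk : substr($chunk, 0, $end);
}

// Hypothetical helper: split on the two opening tags and keep the title and first post.
// Assumes the topic title occurs before the first post in $data.
function extract_title_and_first_post(string $data): ?array
{
    // Limit of 3 pieces: text before the title, the title chunk,
    // and everything from the first post onward (later posts stay unsplit).
    $parts = preg_split('/(?:<div class="topictitle">|<div class="post">)/', $data, 3);
    if ($parts === false || count($parts) < 3) {
        return null; // one or both markers were not found
    }
    return [
        'title' => text_before_close($parts[1]),
        'post'  => text_before_close($parts[2]),
    ];
}

// Sample page (assumption): 200000 filler characters stand in for the dynamic data.
$data = '<div class="topictitle">title</div>'
      . str_repeat('x', 200000)
      . '<div class="post">first post</div>'
      . '<div class="post">second post</div>';

print_r(extract_title_and_first_post($data));
```

For the sample page above, the returned array holds 'title' for the title and 'first post' for the post; the second post is ignored because the split stops after three pieces.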