
Quantifier limit


Chinchilla3k


Hello, I am using preg_match to parse large dynamic HTML pages. preg_match fails at a certain point when the pattern uses a lazy quantifier (it also fails when the entire expression is set to ungreedy).

 

Here is sample code illustrating the problem:

test.txt is a 1 MB file in which the letter 'A' repeats for the first 500 KB. The text "I WILL NOT REACH HERE" occurs shortly after the 500 KB mark.

 

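<?php
// Read up to 1 MB of test.txt into memory.
$handle = fopen("test.txt", "r");
$var = fread($handle, 1048576);
fclose($handle);
$arr = array();
// Lazy match from the start of the data up to the marker text.
echo preg_match('/.+?I WILL NOT REACH/s', $var, $arr); // outputs 0
?>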

 

Why does it fail? Can anyone suggest a workaround? It seems to fail around the 99,999th character.
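One way to find out why preg_match gives up is to check preg_last_error() right after the call. The sketch below is not from the original post: checked_preg_match() is a hypothetical helper, and the subject string is a synthetic stand-in for test.txt. It assumes the failure comes from PCRE hitting an internal limit such as pcre.backtrack_limit, which may not hold on every build.

```php
<?php
// Hypothetical helper (not from the original post): run preg_match() and
// return its result together with the PCRE error code for that call.
function checked_preg_match(string $pattern, string $subject): array
{
    $matches = array();
    $result = preg_match($pattern, $subject, $matches);
    return array(
        'result'  => $result,           // 1 or 0; may be false when an error occurred
        'error'   => preg_last_error(), // PREG_NO_ERROR when matching ran to completion
        'matches' => $matches,
    );
}

// Synthetic stand-in for test.txt: 500,000 'A's, then the marker text.
$subject = str_repeat('A', 500000) . ' I WILL NOT REACH HERE ' . str_repeat('B', 1000);

$report = checked_preg_match('/.+?I WILL NOT REACH/s', $subject);

if ($report['error'] === PREG_BACKTRACK_LIMIT_ERROR) {
    echo "Backtrack limit hit (pcre.backtrack_limit = "
        . ini_get('pcre.backtrack_limit') . ")\n";
} elseif ($report['error'] !== PREG_NO_ERROR) {
    echo "PCRE error code: " . $report['error'] . "\n";
} else {
    echo "Completed without a PCRE error; result = "
        . var_export($report['result'], true) . "\n";
}
```

Whether the limit is actually reached depends on the PHP version, the configured pcre.backtrack_limit, and whether PCRE's JIT is enabled, so no particular output is assumed here.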

 

To give you an example of the situation where I use lazy quantifiers:

 

Consider the HTML code:

 

<div class="topictitle">title</div>(dynamically generated data)<div class="post">first post</div>

 

I would use a regular expression to extract the topic title and the first post via backreferences.

 

preg_match('/opictitle">([^<]+)<\/div>.+?ost">(.+?)<\/div>/s', $data, $output); //something similar to this. 

 

My goal is to extract only the topic title and the first post. I cannot use greedy quantifiers because they would give the last post on the page. Regardless, even with greedy quantifiers, the match also fails if there are more than 99,999 characters AFTER the part of the page being matched.

 

Can anyone suggest an alternative approach or maybe a workaround?
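One possible alternative, sketched here under assumptions, is to skip the regex for the long gap and locate the markers with strpos(), which does no backtracking. extract_between() is a hypothetical helper, the full opening tags are used instead of the partial strings in the pattern above, and the sample $data (including the second post and the filler text) is made up for illustration.

```php
<?php
// Hypothetical helper: find the first $open at or after $offset and the next
// $close after it. Returns array(text between them, offset just past $close),
// or null if either marker is missing.
function extract_between(string $html, string $open, string $close, int $offset = 0): ?array
{
    $start = strpos($html, $open, $offset);
    if ($start === false) {
        return null;
    }
    $start += strlen($open);
    $end = strpos($html, $close, $start);
    if ($end === false) {
        return null;
    }
    return array(substr($html, $start, $end - $start), $end + strlen($close));
}

// Made-up page: a title, a long stretch of generated data, then two posts.
$data = '<div class="topictitle">title</div>'
      . str_repeat('x', 200000)
      . '<div class="post">first post</div><div class="post">second post</div>';

$title = extract_between($data, '<div class="topictitle">', '</div>');
// Search for the first post only after the end of the title block.
$post = ($title === null)
    ? null
    : extract_between($data, '<div class="post">', '</div>', $title[1]);

if ($title !== null && $post !== null) {
    echo $title[0], "\n";
    echo $post[0], "\n";
}
```

Like the `(.+?)<\/div>` part of the regex, this stops at the first closing div, so a post containing nested divs would be cut short. For markup like that, an HTML parser may be a better fit.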

 

Thanks.


I believe this is what the manual is referring to here:

There must be a better way of partitioning or analyzing your data.

Thank you for linking me to the document; it has lots of useful information. Yes, in light of this there must be a better way of partitioning or analyzing the data. It's strange that it would break only at the 99,999th character, though. The implementation I'm using probably has the limit hardcoded.

 

I would rather use regex without the limitation, as the data I'm analyzing doesn't get much larger. However, the project I'm working on won't be launched on a server I completely control, so I can't run my own build of PHP. I'll work around this.
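If the limit involved is the pcre.backtrack_limit ini setting (an assumption, since the quoted manual passage is not shown in this thread), it can usually be changed at runtime with ini_set(), without a custom PHP build. The sketch below uses a hypothetical helper, with_backtrack_limit(), and an arbitrary illustrative value of 5000000. Some hosts disable ini_set(), in which case this will not help.

```php
<?php
// Hypothetical helper: run $callback with pcre.backtrack_limit temporarily
// set to $limit, then restore the previous value.
function with_backtrack_limit(int $limit, callable $callback)
{
    $previous = ini_get('pcre.backtrack_limit');
    if (ini_set('pcre.backtrack_limit', (string) $limit) === false) {
        throw new RuntimeException('Could not change pcre.backtrack_limit');
    }
    try {
        return $callback();
    } finally {
        // Always put the old limit back, even if the callback throws.
        ini_set('pcre.backtrack_limit', $previous);
    }
}

// Synthetic stand-in for test.txt.
$subject = str_repeat('A', 500000) . ' I WILL NOT REACH HERE';

$outcome = with_backtrack_limit(5000000, function () use ($subject) {
    $matches = array();
    $result = preg_match('/.+?I WILL NOT REACH/s', $subject, $matches);
    // Return the match result and the PCRE error code for inspection.
    return array($result, preg_last_error());
});

var_dump($outcome);
```

Raising the limit only moves the ceiling. A pattern that backtracks once per character will still get slower as the input grows.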

 

Thank you.

