Jump to content

Backtracking is something good to do ?


idris

Recommended Posts

Hi. It's me again.

 

Question: is backtracking a good thing to do ?

 

Do we always need to ignore backtracking ? Does it always reduce the performance ?

 

Example:

 

String: abcdefghi

 

Regex  #1 , .*i

Regex  #2 , .*?i

 

First regex will go through a to i then backtrack one character and it's ok.

Second regex will check if the next character is i every time it loops.

 

So regex #1 needs 2 steps to be done while regex #2 needs 8. Or am I completely wrong ?

 

So in this example, which one is better to use ? #1 or #2 ? Is backtracking completely BAD thing to do ?

Link to comment
Share on other sites

Regex #1 does this:

Nothing at the start matches "anything", can I still do that? Yes, the 'a' matches, can I still match anything, yeah the 'b' does. It continues to do that all the way until: can I still match anything, yes, the 'i' matches. Then it finally checks, can I match 'i'? Nope, so I'll go back one step. Now can I match 'i'? Yes. Done.

 

Regex #2 does this:

Can I match anything 0 or more times? Yes, the nothing at the start matches that. Now does 'i' match? No, okay then I'll try matching .*? again which 'a' does. Now does 'i' match? Not yet, but 'b' matches. It does all that until it has matched 'h'. At that point 'i' matches, the regex has ended and the engine tells you there is a match.

 

Link to comment
Share on other sites

If my understanding is correct the performance of the two would be very similar given the input string you gave. The thing to bare in mind here is that the two patterns are not different ways of writing the same thing as they will match different patterns, as such comparing the efficiency of the two patterns has no 'real' meaning outside the context of your simple example input. I would guess the first pattern will backtrack once as it will initially match the i then realise it needs to match an i and hence have to backtrack. Whereas the second example doesn't backtrack because the 'lazy' nature of the match. Backtracking is generally best avoided as much as possible as the more backtracking that is being performed the more steps the PCRE engine has to take and hence the longer it will take. Let's take your two patterns and use a different input script.

 

$input = "abcdefghijklmnopqrtuvwxyz";

 

Since the first pattern is greedy, it will initially match all the way to z, it will then have to backtrack back to the i. If I'm honest I'm not 100% sure of the internals on this, but I imagine this would involve backtracking a letter at a time meaning you actual backtrack 17 times. The second pattern however will stop as soon as it matches the first i meaning it doesn't have to backtrack at all. Given a third input...

 

$input = "abcdefghijklmnopqrtuvwxyzi";

 

The first pattern will match the whole string whereas the second pattern will only match 'abcdefghi'. This just goes to re-enforce the fact that comparing the efficiency of those two patterns only really have a meaning given the limited scope of a single input.

 

Having said all this, I'm pretty green with regards to regular expressions, this is simply my interpretation of it I'm sure somebody else will chime in if I'm wrong.

Link to comment
Share on other sites

You got it pretty much correct, cags. :)

 

@idris

The issue of backtracking and its effectiveness is highly circumstantial, depending on the pattern and the source string in question. In your initial example, using a greedy quantifier is prefereable, as the speed of backtracking once vs lazily checking every character prior to matching would be faster in this case I'd wager (this of course assumes that the required character is found only once, and at the end).

 

In general, I am finding that using greedy quantifiers is like walking a slippery slope.. the issue I find more often than not is not speed, but rather accuracy issues. Given the 3rd input excample in cag's post would be an example of greedy not yielding to correct expectations if the desired outcome is abcdefghi and not abcdefghijklmnopqrtuvwxyzi.

 

Like everything else, greedy quantifiers are tools. It's not correct to label it as evil, but when used incorrectly (and yes, it is used incorrectly more often than not), your concerns should be more centered around accuracy rather than speed. Yes, backtracking eats away at performance, but depending on how much backtracking will dictate whether it is desirable or not (assuming the end results are the same between using greedy vs lazy quantifiers).

 

I always recommend this book to people.. so here I recommend it to you :) There is tons of info on the mechanics of how regex engines *think* and will definitely open the doors of understanding a hell of a lot more once you read that puppy!

Link to comment
Share on other sites

You got it pretty much correct, cags. :)

Wonders will never cease... ;)

 

In yourinitial example, using a greedy quantifier is prefereable, as the speedof backtracking once vs lazily checking every character prior tomatching would be faster in this case I'd wager (this of course assumesthat the required character is found only once, and at the end).

I theorised this myself, but wasn't sure enough to post it, I'm glad you've pretty much confirmed that though.

Link to comment
Share on other sites

This thread is more than a year old. Please don't revive it unless you have something important to add.

Join the conversation

You can post now and register later. If you have an account, sign in now to post with your account.

Guest
Reply to this topic...

×   Pasted as rich text.   Restore formatting

  Only 75 emoji are allowed.

×   Your link has been automatically embedded.   Display as a link instead

×   Your previous content has been restored.   Clear editor

×   You cannot paste images directly. Upload or insert images from URL.

×
×
  • Create New...

Important Information

We have placed cookies on your device to help make this website better. You can adjust your cookie settings, otherwise we'll assume you're okay to continue.