
Case-Insensitive Efficiency


cags

OK, so with regular expressions there are multiple ways of making something case-insensitive: we can either list both cases in a character class, [a-zA-Z], or we can use the i modifier after the pattern, '#blah#i'. I also recently came across a couple of new variations: '#(?i:blah)#i' and '#(?i)blah(?-i)#'. Obviously these have the advantage of applying to only a section of the pattern, as opposed to the whole thing like the modifier, but what is the advantage/disadvantage of using them as opposed to a character class?

 

Whilst writing out this topic I think I've pretty much worked out that the main advantage is the ability to match specific words/character sequences insensitively, so you could do '#something (?i:blah) something#', which wouldn't be possible using either of the other methods without a lot of alternation. But is anybody aware of the internals of this and how much of a performance hit is involved? Also, is there actually a difference between the (?i:blah) method and the (?i)blah(?-i) method, or are they internally the same?
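
To make the scoped behaviour concrete, here is a minimal sketch of that pattern run against a few hand-picked strings. The subject strings and the $results array are purely illustrative; only "blah" is expected to ignore case, while the surrounding words stay case-sensitive.

```php
<?php
// Only "blah" sits inside the (?i:...) group, so only it ignores case.
$pattern = '#something (?i:blah) something#';

// Illustrative subjects (made up for this example).
$subjects = [
    'something blah something',
    'something BLAH something',
    'something BlAh something',
    'Something blah something',   // capital S is outside the group
];

$results = [];
foreach ($subjects as $subject) {
    $results[$subject] = preg_match($pattern, $subject) === 1;
    printf("%-26s => %s\n", $subject, $results[$subject] ? 'match' : 'no match');
}
```

The first three subjects should match and the last should not, since the leading "Something" is still compared case-sensitively.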


Stuff like (?i)blah(?-i) uses mode modifiers, and (?i:blah) is called a mode-modified span (note, in your sample pattern, you had the i modifier after the closing delimiter, which isn't necessary, as the span already does this). And it isn't restricted to case insensitivity. Common mode modifiers include i (case insensitive), x (free-spacing), s (dot matches all) and m (enhanced line anchor mode); there is also (?#...), which isn't a mode at all but an inline comment. But as far as comparing them to character classes, you don't, as mode modifiers and character classes are two completely different things. The mode modifier quite simply applies the given mode to the part of the pattern it covers, whereas a character class matches an individual character, so you are comparing apples to oranges.

 

But is anybody aware of the internals on this and how much of a performance hit is involved. Also is there actually a difference between the (?i:blah) method over the (?i)blah(?-i) method, or are they internally the same?

 

While I'm not sure about the internal differences, I would wager that there probably isn't any speed difference (it may simply be an issue of preference / code readability, but I'm not entirely sure on that).
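
This says nothing about the internals, but as a quick sanity check the two forms can at least be shown to accept and reject the same strings. A minimal sketch, with the foo/bar padding and the subject list invented for illustration:

```php
<?php
// Two spellings of "match blah case-insensitively, everything else as-is".
$span     = '#foo(?i:blah)bar#';        // mode-modified span
$switched = '#foo(?i)blah(?-i)bar#';    // switch on, then switch off again

// Illustrative subjects (made up for this example).
$subjects = ['fooblahbar', 'fooBLAHbar', 'fooBlAhbar', 'FOOblahbar', 'fooblahBAR'];

$spanResults = [];
$switchedResults = [];
foreach ($subjects as $subject) {
    $spanResults[$subject]     = preg_match($span, $subject) === 1;
    $switchedResults[$subject] = preg_match($switched, $subject) === 1;
    printf("%-12s span: %-5s switched: %s\n",
        $subject,
        var_export($spanResults[$subject], true),
        var_export($switchedResults[$subject], true));
}
```

Both patterns should agree on every subject: the first three match, and the two with upper-case "FOO" or "BAR" do not.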

 

 


Oops, yeah, the i in '#(?i:blah)#i' was just a typo; I realise it wasn't required.

 

I did figure that other 'mode modifiers' would exist but hadn't got around to playing around to see what they might be. I assume they will also include many of the full pattern modifiers, such as U. But I can play around to find out exactly what is and isn't allowed (for example, D wouldn't seem to have any specific meaning outside of a full pattern modifier).

 

I realise that a character class is a completely different entity to modifiers. What I was trying to get across is: given a pattern that matches a sequence of non-specific characters, is there a performance difference between, for example, '#[a-z]{5}#i' and '#[a-zA-Z]{5}#'? The main reason I ask is that this is probably the most common use I see the modifier employed for. I would guess that overall they both break down to the same thing; I was just curious.
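
Performance aside, the two patterns should at least find the same matches on plain ASCII input. A minimal sketch, where the sample string and variable names are assumptions made for illustration:

```php
<?php
// Illustrative ASCII sample (made up for this example).
$subject = 'Hello WORLD abc12 MiXeD';

// Same character class, case handled two different ways.
preg_match_all('#[a-z]{5}#i', $subject, $withFlag);
preg_match_all('#[a-zA-Z]{5}#', $subject, $withClass);

print_r($withFlag[0]);
print_r($withClass[0]);
var_dump($withFlag[0] === $withClass[0]);
```

Both calls should return the same three five-letter runs; "abc12" contributes nothing because it only has three letters in a row.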


Yeah, some modifiers won't really be applicable for sure. Personally, I haven't run into much need of mode modifiers (but they certainly have their place for sure). It's fun to experiment in regex, eh? :)

 

Ah, now I understand what you meant.. sorry, my bad. :-[ As far as performance differences between '#[a-z]{5}#i' and '#[a-zA-Z]{5}#' are concerned, we could do a simple benchmarking test:

 

$loop = 5000;

// Time the i-modifier version.
$start = gettimeofday();
for ($a = 0; $a < $loop; $a++) {
    $str = 'testing 1 2 3!';
    $str = preg_replace('#[a-z]{5}#i', '*', $str);
}
$final = gettimeofday();
// 'usec' is in microseconds, so divide by 1,000,000 to get seconds.
$sec = ($final['sec'] + $final['usec'] / 1000000) -
       ($start['sec'] + $start['usec'] / 1000000);

printf("Result: %s - Time of executing #[a-z]{5}#i: %.4f<br />\n\n", $str, $sec);

// Time the explicit character-class version the same way.
$start = gettimeofday();
for ($a = 0; $a < $loop; $a++) {
    $str = 'testing 1 2 3!';
    $str = preg_replace('#[a-zA-Z]{5}#', '*', $str);
}
$final = gettimeofday();
$sec = ($final['sec'] + $final['usec'] / 1000000) -
       ($start['sec'] + $start['usec'] / 1000000);

printf("Result: %s - Time of executing #[a-zA-Z]{5}#: %.4f<br />\n\n", $str, $sec);

 

After hitting refresh numerous times on the above script (as there is going to be some margin of error to account for), on average (on my system anyway), the '#[a-z]{5}#i' solution edges out the pure character class (but even in a loop of 5000 iterations, the difference is tiny). Even when I change the above benchmark to use $str = 'Hello Mr.Smith!'; with the first pattern as #Hello (?i:mr)\.(?i:s)mith# and the second pattern as #Hello [MmRr]+\.[Ss]mith# (strictly speaking, [Mm][Rr] would be the exact equivalent of (?i:mr)), the mode modifier edges it out by a slim margin.

 

So all in all, either way, the speed difference is so small that we're in all likelihood splitting hairs, admittedly.  :P
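
Rather than hitting refresh and averaging by eye, the repeats can be folded into the script. This is only a sketch: bestTime() is a hypothetical helper (not something from PHP itself), it assumes PHP 7.3+ for hrtime(), and it reports the best of several runs to dampen noise. The numbers it prints will vary by machine and PCRE version.

```php
<?php
// Hypothetical helper: best-of-N wall-clock time for one pattern, in seconds.
// Assumes PHP 7.3+ (hrtime) on a 64-bit build.
function bestTime(string $pattern, string $subject, int $loops = 5000, int $runs = 5): float
{
    $best = INF;
    for ($r = 0; $r < $runs; $r++) {
        $start = hrtime(true);                        // monotonic clock, nanoseconds
        for ($i = 0; $i < $loops; $i++) {
            preg_replace($pattern, '*', $subject);
        }
        $best = min($best, (hrtime(true) - $start) / 1e9);
    }
    return $best;
}

$subject = 'Hello Mr.Smith!';
$timings = [];
foreach (['#Hello (?i:mr)\.(?i:s)mith#', '#Hello [Mm][Rr]\.[Ss]mith#'] as $pattern) {
    $timings[$pattern] = bestTime($pattern, $subject);
    printf("%-30s %.6f s\n", $pattern, $timings[$pattern]);
}
```

Taking the minimum rather than the mean is a design choice: slow runs are usually caused by other activity on the machine, not by the pattern.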


Modifiers that are allowed in those option settings are:

  • i (PCRE_CASELESS)
  • m (PCRE_MULTILINE)
  • s (PCRE_DOTALL)
  • x (PCRE_EXTENDED)
  • J (PCRE_INFO_JCHANGED)  (changes the local PCRE_DUPNAMES option)
  • U (PCRE_UNGREEDY)
  • X (PCRE_EXTRA)
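
As a rough illustration of a few of these used inline rather than after the closing delimiter, here is a minimal sketch. The subject strings and the $checks array are made up for the example; it covers s, m, x and U only.

```php
<?php
// Illustrative subject (made up for this example).
$subject = "one\ntwo\nthree";

$checks = [
    // (?s:...): inside the group, the dot also matches a newline
    'dotall inside group'     => preg_match('/one(?s:.)two/', $subject) === 1,
    'dot without s fails'     => preg_match('/one.two/', $subject) === 0,
    // (?m): ^ and $ also match at line breaks
    'multiline anchors'       => preg_match('/(?m)^two$/', $subject) === 1,
    'anchors without m fail'  => preg_match('/^two$/', $subject) === 0,
    // (?x): whitespace in the pattern is ignored
    'extended ignores spaces' => preg_match('/(?x) t w o /', $subject) === 1,
];

// (?U:...): quantifiers inside the group become lazy by default
preg_match('/<(?U:.+)>/', '<a><b>', $lazy);
preg_match('/<.+>/', '<a><b>', $greedy);

foreach ($checks as $label => $ok) {
    printf("%-24s %s\n", $label, $ok ? 'ok' : 'unexpected');
}
printf("ungreedy: %s, greedy: %s\n", $lazy[0], $greedy[0]);
```

Every check should print "ok", and the ungreedy group should stop at the first ">" while the plain version runs on to the last one.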

 

The point of changing these options isn't particularly to do with performance (well, it might be if you're really needing to super-duper-micro-optimise and these settings changes prove most efficient) but as with most things like this, they are tools to get a job done. 

 

Continuing from the examples already mentioned, if we wanted to match only a portion of an expression case-insensitively, we could have an expression like abc[dD][eE][fF]ghi. Or we could temporarily change the caseless setting (in conjunction with a non-capturing group) like abc(?i:def)ghi (or abc(?:(?i)def)ghi if we don't use the shorthand form). Or we could change the setting and then change it back, like abc(?i)def(?-i)ghi. Note that the last pattern does something subtly different: it changes the setting for the rest of the pattern and then has to change it back, whereas the other two never affect how "ghi" is matched.
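
For what it's worth, the variants can be checked against each other with a short script. A minimal sketch: the anchors, the # delimiters and the subject list are additions made for illustration, so that each string has to match in full.

```php
<?php
// The variants discussed above, anchored so the whole subject must match.
$patterns = [
    'character classes' => '#^abc[dD][eE][fF]ghi$#',
    'shorthand span'    => '#^abc(?i:def)ghi$#',
    'long-form span'    => '#^abc(?:(?i)def)ghi$#',
    'on then off'       => '#^abc(?i)def(?-i)ghi$#',
];

// Illustrative subjects (made up for this example).
$subjects = ['abcdefghi', 'abcDEFghi', 'abcDeFghi', 'ABCdefghi', 'abcdefGHI'];

$table = [];
foreach ($patterns as $label => $pattern) {
    foreach ($subjects as $subject) {
        $table[$label][] = preg_match($pattern, $subject) === 1;
    }
    printf("%-18s %s\n", $label, implode(' ', array_map(
        function ($hit) { return $hit ? 'Y' : 'n'; },
        $table[$label]
    )));
}
```

All four rows should read "Y Y Y n n": any casing of "def" is accepted, but upper-case "ABC" or "GHI" is rejected by every variant.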

 

Of the three, for that particular case, I'd be swayed towards the second expression (using the shorthand) not for performance reasons but purely because it (to me) most clearly illustrates what the pattern is meant to do.

 

It's fun to experiment in regex, eh? :)

Sure is.  ;)
