Case-Insensitive Efficiency

cags · December 7, 2009

OK, so with regular expressions there are multiple ways of making something case-insensitive: we can either list both cases in a character class, [a-zA-Z], or we can use the i modifier after the pattern, '#blah#i'. I also recently came across some new variations: '#(?i:blah)#i' and '#(?i)blah(?-i)#'. Obviously these have the advantage of applying to only a section of the pattern, as opposed to the whole thing like the modifier, but what is the advantage/disadvantage of using them as opposed to a character class?

Whilst writing out this topic I think I've pretty much worked out that the main advantage is the ability to match specific words/character chains insensitively, so you could do '#something (?i:blah) something#', which wouldn't be possible using either of the other methods without a lot of alternation. But is anybody aware of the internals on this and how much of a performance hit is involved? Also, is there actually a difference between the (?i:blah) method and the (?i)blah(?-i) method, or are they internally the same?
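A minimal PHP sketch of the partial case-insensitivity described above. The sample strings are made up purely for illustration.

```php
<?php
// Only "blah" is matched case-insensitively; the surrounding words stay case-sensitive.
$pattern = '#something (?i:blah) something#';

$samples = [
    'something BLAH something', // only the middle word differs in case
    'something blah something', // exact match
    'Something blah something', // capital S sits outside the (?i:...) group
];

$results = [];
foreach ($samples as $sample) {
    $results[$sample] = preg_match($pattern, $sample);
    printf("%-26s => %d\n", $sample, $results[$sample]);
}
```

The first two samples should match and the third should not, since the capital S falls outside the case-insensitive group.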

nrg_alpha · December 8, 2009

OK, so with regular expressions there are multiple ways of making something case-insensitive: we can either list both cases in a character class, [a-zA-Z], or we can use the i modifier after the pattern, '#blah#i'. I also recently came across some new variations: '#(?i:blah)#i' and '#(?i)blah(?-i)#'. Obviously these have the advantage of applying to only a section of the pattern, as opposed to the whole thing like the modifier, but what is the advantage/disadvantage of using them as opposed to a character class?

Something like (?i) is called a mode modifier, and (?i:blah) is called a mode-modified span (note that in your sample pattern you had the i modifier after the closing delimiter, which isn't necessary, as the span already handles this). And it isn't restricted to case insensitivity. Common mode modifiers include i (case insensitive), x (free-form), s (dot matches all) and m (enhanced line anchor mode); comments use a similar-looking syntax, (?#...). But as far as comparing them to character classes goes, you don't, as mode modifiers and character classes are two completely different things. A mode modifier quite simply applies the given mode to the part of the pattern it covers, whereas a character class matches an individual character, so you are comparing apples to oranges.

But is anybody aware of the internals on this and how much of a performance hit is involved? Also, is there actually a difference between the (?i:blah) method and the (?i)blah(?-i) method, or are they internally the same?

While I'm not sure about the internal differences, I would wager that there probably isn't any speed difference (it may simply be an issue of preference / code readability, but I'm not entirely sure on that).

cags · December 8, 2009

Oops, yer, the i in the '#(?i:blah)#i' was just a typo, I realise it wasn't required.

I did figure that other 'mode modifiers' would exist but hadn't got around to playing around to see what they might be. I assume they will include many of the full pattern modifiers such as U also. But I can play around to find out exactly what is and isn't (for example D wouldn't seem to have any specific meaning outside of a full pattern modifier).

I realise that a character class is a completely different entity to modifiers; what I was trying to get across is: given a pattern to match a sequence of non-specific characters, is there a performance difference between, for example, '#[a-z]{5}#i' and '#[a-zA-Z]{5}#'? The main reason I ask is because this is probably the most common use I see it employed for. I would guess that overall they both break down to the same thing; I was just curious.

nrg_alpha · December 8, 2009

I did figure that other 'mode modifiers' would exist but hadn't got around to playing around to see what they might be. I assume they will include many of the full pattern modifiers such as U also. But I can play around to find out exactly what is and isn't (for example D wouldn't seem to have any specific meaning outside of a full pattern modifier).

Yeah, some modifiers won't really be applicable for sure. Personally, I haven't run into much need of mode modifiers (but they certainly have their place for sure). It's fun to experiment in regex, eh?

I realise that a character class is a completely different entity to modifiers; what I was trying to get across is: given a pattern to match a sequence of non-specific characters, is there a performance difference between, for example, '#[a-z]{5}#i' and '#[a-zA-Z]{5}#'? The main reason I ask is because this is probably the most common use I see it employed for. I would guess that overall they both break down to the same thing; I was just curious.

Ah, now I understand what you meant.. sorry, my bad. :-[ As far as performance differences between '#[a-z]{5}#i' and '#[a-zA-Z]{5}#' are concerned, we could do a simple benchmarking test:

$loop = 5000;

// Time the i-modifier version.
$start = gettimeofday();
for ($a = 0; $a < $loop; $a++) {
    $str = 'testing 1 2 3!';
    $str = preg_replace('#[a-z]{5}#i', '*', $str);
}
$final = gettimeofday();
// 'usec' is in microseconds, so divide by 1,000,000 to convert to seconds.
$sec = ($final['sec'] + $final['usec'] / 1000000) -
       ($start['sec'] + $start['usec'] / 1000000);

printf("Result: %s - Time of executing #[a-z]{5}#i: %4f<br />\n\n", $str, $sec);

// Time the pure character class version.
$start = gettimeofday();
for ($a = 0; $a < $loop; $a++) {
    $str = 'testing 1 2 3!';
    $str = preg_replace('#[a-zA-Z]{5}#', '*', $str);
}
$final = gettimeofday();
$sec = ($final['sec'] + $final['usec'] / 1000000) -
       ($start['sec'] + $start['usec'] / 1000000);

printf("Result: %s - Time of executing #[a-zA-Z]{5}#: %4f<br />\n\n", $str, $sec);

After hitting refresh numerous times on the above script (as there is going to be some margin of error to account for), on average (on my system anyway), the '#[a-z]{5}#i' solution edges out the pure character class (but even in a loop of 5000 iterations, the difference is tiny). Even when I change the above benchmark to use $str = 'Hello Mr.Smith!'; with #Hello (?i:mr)\.(?i:s)mith# as the first pattern and #Hello [MmRr]+\.[Ss]mith# as the second, the mode modifier edges it out by a slim margin.

So all in all, either way, the speed difference is so small that we're in all likelihood splitting hairs, admittedly.
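One caveat worth keeping in mind when comparing the two 'Hello Mr.Smith!' patterns: they are not strictly equivalent, so the timings compare slightly different work. Here is a rough sketch, with the second pattern written using [Ss] and with hypothetical input strings chosen to show where the two diverge:

```php
<?php
// The two patterns from the second benchmark run.
$modeModified = '#Hello (?i:mr)\.(?i:s)mith#';
$charClasses  = '#Hello [MmRr]+\.[Ss]mith#';

// Hypothetical inputs; the last one has the letters of "Mr" scrambled and repeated.
$inputs = ['Hello Mr.Smith!', 'Hello MR.smith!', 'Hello rrM.Smith!'];

$matches = [];
foreach ($inputs as $input) {
    $matches[$input] = [
        preg_match($modeModified, $input),
        preg_match($charClasses, $input),
    ];
    printf(
        "%-18s mode-modified: %d  character classes: %d\n",
        $input,
        $matches[$input][0],
        $matches[$input][1]
    );
}
```

The character class [MmRr]+ accepts any run of those four letters, so it also matches 'rrM', whereas (?i:mr) insists on an m followed by an r.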

salathe · December 8, 2009

I did figure that other 'mode modifiers' would exist but hadn't got around to playing around to see what they might be. I assume they will include many of the full pattern modifiers such as U also. But I can play around to find out exactly what is and isn't (for example D wouldn't seem to have any specific meaning outside of a full pattern modifier).

Modifiers that are allowed in those option settings are:

i (PCRE_CASELESS)
m (PCRE_MULTILINE)
s (PCRE_DOTALL)
x (PCRE_EXTENDED)
J (PCRE_INFO_JCHANGED) (changes the local PCRE_DUPNAMES option)
U (PCRE_UNGREEDY)
X (PCRE_EXTRA)
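As a quick illustration of a few of these used inside a pattern rather than after the closing delimiter, here is a small PHP sketch. The subject strings are invented for the example, and X is left out because its availability depends on the PCRE version PHP was built against.

```php
<?php
// (?s): the dot also matches newlines from this point on.
$dotDefault = preg_match('#a.b#', "a\nb");
$dotAll     = preg_match('#(?s)a.b#', "a\nb");

// (?m): ^ and $ also match at line breaks inside the subject.
$anchorsDefault = preg_match('#^two$#', "one\ntwo\nthree");
$anchorsMulti   = preg_match('#(?m)^two$#', "one\ntwo\nthree");

// (?x): whitespace in the pattern is ignored, so it can be spaced out.
$extended = preg_match('#(?x) a b c#', 'abc');

// (?U): quantifiers that follow become ungreedy by default.
preg_match('#<.+>#', '<b>bold</b>', $greedy);
preg_match('#(?U)<.+>#', '<b>bold</b>', $lazy);

var_dump($dotDefault, $dotAll, $anchorsDefault, $anchorsMulti, $extended);
var_dump($greedy[0], $lazy[0]);
```

In each pair the first call uses the default behaviour and the second switches the option on from within the pattern.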

The point of changing these options isn't particularly to do with performance (well, it might be if you're really needing to super-duper-micro-optimise and these settings changes prove most efficient) but as with most things like this, they are tools to get a job done.

Continuing from the examples already mentioned, if we wanted to match only a portion of an expression case-insensitively we could have an expression like abc[dD][eE][fF]ghi, or we could temporarily change the caseless setting (in conjunction with a non-capturing group) like abc(?i:def)ghi (or abc(?:(?i)def)ghi if we don't use the shorthand form), or we could change the setting then change it back like abc(?i)def(?-i)ghi. Note that that last pattern does something subtly different: it explicitly switches caseless matching off for "ghi", rather than restoring whatever setting was in effect beforehand, which matters if the whole pattern already carries the i modifier (the other forms don't affect how "ghi" is matched).

Of the three, for that particular case, I'd be swayed towards the second expression (using the shorthand) not for performance reasons but purely because it (to me) most clearly illustrates what the pattern is meant to do.
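To make the comparison concrete, here is a small PHP sketch that runs each form twice: once as-is, and once with the i modifier also applied to the whole pattern. The labels and subject strings are just for illustration.

```php
<?php
$forms = [
    'character classes' => 'abc[dD][eE][fF]ghi',
    'shorthand group'   => 'abc(?i:def)ghi',
    'longhand group'    => 'abc(?:(?i)def)ghi',
    'on then off'       => 'abc(?i)def(?-i)ghi',
];

$plain = [];
$withI = [];
foreach ($forms as $label => $body) {
    // No outer modifier: only "def" may vary in case.
    $plain[$label] = preg_match('#' . $body . '#', 'abcDEFghi');
    // Outer i modifier: "GHI" can only match if caseless mode is still on for "ghi".
    $withI[$label] = preg_match('#' . $body . '#i', 'abcDEFGHI');
    printf("%-18s plain: %d  with i: %d\n", $label, $plain[$label], $withI[$label]);
}
```

Without the outer modifier all four behave the same on this input. With it, only the last form fails on 'abcDEFGHI', because (?-i) turns caseless matching off for the rest of the pattern instead of restoring the outer setting.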

It's fun to experiment in regex, eh?

Sure is.
