[SOLVED] Regex Engine Question

shane18 · September 24, 2009

Can someone explain to me in as much detail as possible how:

<?php
$CC = ".Username Reason is blah blah blah";
preg_match("/^([\.!#@\+^\$-_~]{1})(.+?)(.+?)$/", $CC, $CCM);
echo "<pre>";
print_r($CCM);
echo "</pre>";
?>

Makes this:

Array
(
    [0] => .Username Reason is blah blah blah
    [1] => .
    [2] => U
    [3] => sername Reason is blah blah blah
)

the outcome.

I know how to make this work the way I want it to, but that is not the question: I am trying to learn how the engine works inside and out, and this is the last piece of the puzzle.

Garethp · September 24, 2009

Ok, well, [0] will always be the entire string matched

[1] is what was matched by the first bracket, which happened to be ([\.!#@\+^\$-_~]{1})

Now, ([\.!#@\+^\$-_~]{1}) means one character (the {1} means one and only one) from the class: . ! # @ + ^ ~ or anything in the range from $ to _ (the hyphen between $ and _ makes a range rather than a literal -). In this case it was .

[2] is what was matched by the second bracket, which was (.+?). That means match any character, one or more times, but as few times as it can get away with (the ? makes the + lazy).

[3] is the third bracket, which is the same as above

Now, [2] matched only one character because there was another (.+?) after it to pick up where it stopped. Because it's lazy, it said "Well, it's your job to match now, I'm gonna sit down and have a cup of coffee" and passed the job on as soon as it could. [3] is just as lazy, but the only thing after it is the $ anchor, which demands the end of the string. So [3] had to keep taking one more character at a time until it reached the end, which .+ allows because it means anything, once or more.
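
If it helps, here is a small sketch you can run to watch the lazy groups behave. To keep the focus on the quantifiers, the first group is simplified to a literal dot (that simplification is mine, not part of the original pattern), and the variable names are only illustrative.

```php
<?php
$subject = ".Username Reason is blah blah blah";

// Same two lazy groups as the original, followed by the $ anchor:
// group 2 stops after one character, group 3 is forced to reach the end.
preg_match('/^(\.)(.+?)(.+?)$/', $subject, $anchored);

// No $ anchor: nothing forces group 3 to grow, so it also stops after one character.
preg_match('/^(\.)(.+?)(.+?)/', $subject, $unanchored);

// Greedy second group: it grabs everything, then gives back a single
// character so that group 3 can satisfy its "at least one" requirement.
preg_match('/^(\.)(.+)(.+?)$/', $subject, $greedy);

print_r($anchored);   // [2] => U, [3] => sername Reason is blah blah blah
print_r($unanchored); // [0] => .Us, [2] => U, [3] => s
print_r($greedy);     // [2] => Username Reason is blah blah bla, [3] => h
```

Comparing the first two results shows that it is the $ anchor, not the quantifier itself, that makes the last group swallow the rest of the string.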

nrg_alpha · September 24, 2009

Shane18,

To further expand on the explanation of things, I advise you to have a look at this thread, which explains things regarding .+ and .+? (in particular, read posts #11 and #14).

Also note that in your pattern you used {1} (called an interval) after the character class (character class = the [...] notation). This is not necessary, as a character class already matches exactly one character. So [abc] will check for either an a, b or c at the current location in the source string, just as [abc]{1} will.

Intervals are more useful for things like {1,} (minimum of one, no maximum, the same as the + quantifier) or {2,7} (minimum 2, maximum 7). Simply using {1} is redundant, as whatever precedes it already matches exactly once by default. So the pattern #sle{1}pt# is the same as #slept#: in both cases a single 'e' is understood automatically.
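
A minimal sketch of the interval behaviour described above. The sample strings and variable names are made up for illustration.

```php
<?php
$subject = 'I slept well';

// {1} adds nothing: both patterns produce identical matches.
$withInterval    = preg_match('#sle{1}pt#', $subject, $a);
$withoutInterval = preg_match('#slept#', $subject, $b);
var_dump($a === $b); // bool(true)

// Intervals that actually change what is accepted:
var_dump(preg_match('#^e{2,7}$#', 'e'));         // int(0): one 'e' is below the minimum of 2
var_dump(preg_match('#^e{2,7}$#', 'eee'));       // int(1): three is within 2..7
var_dump(preg_match('#^e{1,}$#', 'eeeeeeeee'));  // int(1): {1,} behaves like +
var_dump(preg_match('#^e+$#', 'eeeeeeeee'));     // int(1)
```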

As well, with regard to character classes, it is important to understand that most meta characters lose their special meaning within a character class. (Meta characters are characters that have special meanings; an example is the dot, a match-all character that by default matches any single character other than a newline.) Some meta characters retain a special meaning depending on their location within the class, but the dot is not one of them: for a literal dot in a character class you don't need to escape it, and its position in the class doesn't matter.
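
To illustrate, a short sketch with throwaway single-character subjects (chosen by me purely for demonstration). It also shows the caret as one example of a meta character whose meaning depends on its position in the class.

```php
<?php
// Outside a class, the dot matches any single character except a newline.
var_dump(preg_match('/^.$/', 'x'));     // int(1)

// Inside a class, the dot is just a literal dot, escaped or not.
var_dump(preg_match('/^[.]$/', 'x'));   // int(0)
var_dump(preg_match('/^[.]$/', '.'));   // int(1)
var_dump(preg_match('/^[\.]$/', '.'));  // int(1)

// The caret negates the class only when it comes first.
var_dump(preg_match('/^[^a]$/', 'b'));  // int(1): "anything but a"
var_dump(preg_match('/^[a^]$/', '^'));  // int(1): literal caret
var_dump(preg_match('/^[a^]$/', 'b'));  // int(0)
```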

Notice, however, the location of your hyphen (-) in the class; this is where location in the character class becomes crucial. If you want to look for a literal hyphen, list it as the very first or very last character in the character class, otherwise you are creating a range instead. In your case, you have \$-_ which creates a range from the dollar sign to the underscore (much like [a-z] creates a range from a all the way to z). In ASCII that range covers the digits, the uppercase letters and a good deal of punctuation, so your class accepts far more characters than you listed. Relocate that hyphen to the start or end: with no characters listed on both sides of it, the regex engine cannot read it as a range and will treat it as a literal.
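
Here is a rough sketch comparing the two classes. The single-quoted pattern below is what the engine receives from your original double-quoted string (PHP turns \$ into $ there); the second class, with the hyphen moved to the end, is my guess at the set you intended. Test characters and variable names are illustrative only.

```php
<?php
// Original class: "$-_" is a range from "$" (ASCII 36) to "_" (ASCII 95).
$rangeClass = '/^[\.!#@\+^$-_~]$/';

// Hyphen moved to the end, so it is a literal hyphen.
$literalClass = '/^[.!#@+^$_~-]$/';

foreach (['U', '7', '-', '.', 'u'] as $char) {
    printf(
        "%s  range:%d  literal:%d\n",
        $char,
        preg_match($rangeClass, $char),
        preg_match($literalClass, $char)
    );
}
// U  range:1  literal:0
// 7  range:1  literal:0
// -  range:1  literal:1
// .  range:1  literal:1
// u  range:0  literal:0
```

Uppercase letters and digits slip through the original class because they sit between $ and _ in ASCII, while the lowercase u falls outside the range.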
