Tut: Two Little-Known but Way-Cool Features of PHP Regex


Greetings, PHP heads!

 

A couple days ago, during a full revamp of my regex tutorial (see sig), I added two very sweet regex features that I haven't seen discussed in the online PHP world. These are features that I found buried in the PCRE documentation. One of them is briefly mentioned in the PHP manual. The other might be too, but I couldn't find it. 

 

I thought I'd make a quick tut on the forum to share these two "secret features" of PHP regex with my fellow regex lovers. :D

 

A. (?(DEFINE)) LETS YOU REUSE A PATTERN

You already know how a back-reference (named or numbered) lets you match, a second time, the exact text previously captured by a set of parentheses:

Both

(\d\d) abc \1

and

(?P<Nb>\d\d) abc (?P=Nb)

will match "12 abc 12", where "12" is captured in Group 1 or in a named capture group: "Nb"
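To make the point concrete, here is a minimal sketch showing that a back-reference repeats the captured text, not the pattern. The "~" delimiter and the variable names are just my choices for the illustration:

```php
<?php
// A back-reference repeats the captured text, not the \d\d pattern.
$pattern = '~(\d\d) abc \1~';

// "12" is captured, then "12" must appear again: match.
$same = preg_match($pattern, '12 abc 12', $m);

// "34" fits \d\d, but it is not the captured "12": no match.
$different = preg_match($pattern, '12 abc 34');

echo $same, ' ', $different, ' ', $m[1], "\n";
```

The second test fails even though "34" is two digits, which is exactly the limitation the next feature gets around.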

 

What if instead of referring to a string already captured, you could refer to a regex pattern? This is what (?(DEFINE)) does. In the following example, the DEFINE statement defines "phone" as this regex pattern:

(?:Tel|Fax):[ ]415-\d{3}-\d{4}

Then the regex uses this defined phone pattern multiple times in the expression.

 

Run this:

 

<?php
// (?x) turns on free-spacing mode, so the line breaks and indentation inside the pattern are ignored.
$pattern = ',(?x)(?(DEFINE)(?<phone>(?:Tel|Fax):[ ]415-\d{3}-\d{4}))
            ^start[ ](?&phone)[ ]////[ ](?&phone)
            [ ]----[ ]((?&phone)),';
$string = 'start Tel: 415-555-1212 //// Fax: 415-555-0000 ---- Fax: 415-555-9999';
if (preg_match($pattern, $string, $match)) {
    echo 'Properly formatted string.<br /> The third number is: <b>'.$match[2].'</b><br />';
}
?>

 

Three phone numbers are matched, but the pattern to match a phone number is only given once!

 

Note that the third number is captured by an additional set of parentheses. It is actually Group 2, because the named group "phone" inside the DEFINE statement counts as Group 1, even though it never captures anything itself.
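If you want to see the group numbering for yourself, here is a small sketch that inspects the match array. It reuses the same pattern and string as the example; the checks are only there for illustration:

```php
<?php
$pattern = ',(?x)(?(DEFINE)(?<phone>(?:Tel|Fax):[ ]415-\d{3}-\d{4}))
            ^start[ ](?&phone)[ ]////[ ](?&phone)
            [ ]----[ ]((?&phone)),';
$string = 'start Tel: 415-555-1212 //// Fax: 415-555-0000 ---- Fax: 415-555-9999';

preg_match($pattern, $string, $match);

// Group 1 is the named group "phone" inside DEFINE. It is only a definition,
// so it stays unset and PHP reports it as an empty string.
var_dump($match[1] === '');
var_dump($match['phone'] === '');

// Group 2 is the extra set of parentheses around the third (?&phone) call.
echo $match[2], "\n";
```

The calls to (?&phone) run the pattern but do not fill in Group 1, which is why you need your own parentheses around a call when you want to keep what it matched.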

 

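Now, you could also accomplish this with a subroutine call to a numbered group: the (?1) syntax, which re-runs the pattern of Group 1 (the pattern itself, not the text it captured). For instance: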

<?php
$string = 'start Tel: 415-555-1212 //// Fax: 415-555-0000 ---- Fax: 415-555-9999';
$pattern = ',(?x)
            ^start[ ]((?:Tel|Fax):[ ]415-\d{3}-\d{4})
            [ ]////[ ](?1)
            [ ]----[ ]((?1)),';
if (preg_match($pattern, $string, $match)) {
    echo 'Properly formatted string.<br /> The third number is: <b>'.$match[2].'</b><br />';
}
?>

 

Or, using a named group with an Oniguruma-style subroutine call, \g<Phone>:

<?php
$string = 'start Tel: 415-555-1212 //// Fax: 415-555-0000 ---- Fax: 415-555-9999';
$pattern = ',(?x)
            ^start[ ](?<Phone>(?:Tel|Fax):[ ]415-\d{3}-\d{4})
            [ ]////[ ]\g<Phone>
            [ ]----[ ](\g<Phone>),';
if (preg_match($pattern, $string, $match)) {
    echo 'Properly formatted string.<br /> The third number is: <b>'.$match[2].'</b><br />';
}
?>

 

So what is the benefit of the DEFINE syntax over these other techniques?

Well, if you wanted, you could set up all your definitions at the beginning of the expression, which could be handy for a long regex!

(?(DEFINE)(?<Gender>M|F))
(?(DEFINE)(?<Age>\b\d\d\b))
(?(DEFINE)(?<Name>\b[[:alpha:]]+\b))

 

Then you can pepper your names in the expression: (?&Age) to match an Age, (?&Gender) to match a Gender, and so on. Then, if you change your mind about a sub-pattern, all you have to do is change it at the top!
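Here is a rough sketch of how the three definitions above could be used together. The "Name, Age, Gender" record format and the capture names (who, years, sex) are made up for the illustration:

```php
<?php
// Definitions first, then the actual expression on the last pattern line.
$pattern = '~(?x)
    (?(DEFINE)(?<Gender>M|F))
    (?(DEFINE)(?<Age>\b\d\d\b))
    (?(DEFINE)(?<Name>\b[[:alpha:]]+\b))
    ^ (?<who>(?&Name)) , [ ] (?<years>(?&Age)) , [ ] (?<sex>(?&Gender)) $
~';

// Matches: a name, a two-digit age, and M or F.
$ok = preg_match($pattern, 'Alice, 34, F', $m);

// Fails: the Age definition requires exactly two digits.
$bad = preg_match($pattern, 'Bob, 7, M');

if ($ok) {
    echo $m['who'], ' / ', $m['years'], ' / ', $m['sex'], "\n";
}
```

I wrapped each call in its own named group so the results can be read by name instead of counting the groups used up by the definitions.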

 

See my page on regex (? syntax disambiguation for more on PCRE regex DEFINE syntax.

 

As of Jan 28 2012, I couldn't find this feature in the PHP manual, but if you find it please let me know.

 

In the next post, we will look at an even more interesting and useful feature of PHP regex: Capture Groups with Duplicate Numbers

 

...part 2 of this post:

 

B. (?| ) LETS YOU USE ONE GROUP NUMBER FOR MULTIPLE CAPTURES

 

Sometimes, you have data where what you want to capture almost fits in one set of parentheses. Almost: you can fit it in an alternation, but you end up using multiple sets of parentheses. As a result, your data can find itself in Group 1, Group 2, Group 3... You don't know. To sort it out, you have to write some code to look at the array of results.

 

Here's an example:

(?:shipping (\w+)|mailing (\w+) to): \w+

 

This could match either of these strings:

shipping books: today

mailing books to: john

 

In both cases, "books" would be captured... But in Group 1 for the first case, and Group 2 for the second. That's because group numbers are assigned from left to right as you read the regex, whether or not a group takes part in the match. On the second string, Group 1 is not set (PHP returns it as an empty string), but Group 2 captures "books". You'll have to sort that out in PHP by examining your matches.
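For illustration, this is roughly the kind of sorting-out code you end up writing with the plain alternation. The variable names and the "~" delimiter are my own choices:

```php
<?php
// Plain alternation: each set of parentheses gets its own group number.
$regex = '~(?:shipping (\w+)|mailing (\w+) to): \w+~';

$found = [];
foreach (['shipping books: today', 'mailing books to: john'] as $subject) {
    if (preg_match($regex, $subject, $match)) {
        // Group 1 is empty when the "mailing" branch matched, so fall back to Group 2.
        $group   = ($match[1] !== '') ? 1 : 2;
        $found[] = "Group $group: " . $match[$group];
    }
}

echo implode("\n", $found), "\n";
```

With only two alternatives this is bearable, but every extra alternative adds another group to check.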

 

Well, PCRE has a magical feature that lets you capture "books" in Group 1 in both cases, even though they are captured by different sets of parentheses!!! That syntax is (?|, the "branch reset" group: each time you pass "|", the alternation marker, group numbering starts over from the same number, so the parentheses in every alternative share Group 1.

 

Here's the piece of magic syntax that always returns "books" in Group 1:

 

(?|shipping (\w+)|mailing (\w+) to): \w+

 

Here's code to test it:

 

<?php
$regex=',(?|shipping (\w+)|mailing (\w+) to): \w+,';
preg_match($regex, 'shipping books: today', $match);
echo $match[1].'<br />';
preg_match($regex, 'mailing books to: john', $match);
echo $match[1].'<br />';
?>

 

Output:

books

books

 

In both cases, "books" is found in $match[1], which is the content of Group 1!
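The payoff is bigger when you collect many matches at once. Here is a short sketch with preg_match_all; the sample text is invented for the example:

```php
<?php
$regex = '~(?|shipping (\w+)|mailing (\w+) to): \w+~';
$text  = "shipping books: today\nmailing pens to: john\nshipping maps: friday";

// With the default flags, $matches[1] holds every Group 1 capture in order.
preg_match_all($regex, $text, $matches);

// All items land in one list, whichever alternative matched.
echo implode(', ', $matches[1]), "\n";
```

Without (?| you would get two partly empty lists, one per set of parentheses, and you would have to merge them yourself.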

 

This feature is briefly mentioned on the PHP manual's subpattern page.

 

See my page on regex (? syntax disambiguation for more on PCRE (?| group reset syntax.

 

Wishing you all a fun weekend!

:)
