regex question

justinh · February 3, 2009

 
<?php 

      $string = "01-09-1980";
      $pattern = '/01[-./]09[-./]1980/';
      echo preg_match($pattern, $string);
      
?>

Shouldn't this be outputting 1?

Instead it is giving me an error =/

Warning: preg_match() [function.preg-match]: Unknown modifier ']' in C:\xampp\htdocs\match.php on line 5

corbin · February 3, 2009

Uhhhh...... No x.x.

What exactly are you trying to do?

justinh · February 3, 2009

Well I just received my "Mastering Regular Expressions, 3rd Edition" book, so I'm trying to get a grasp on regex.

<?php 

      $string = "01-09-1980";
      $pattern = '/01[-./]09[-./]1980/';
      echo preg_match($pattern, $string);
      
?>

I really don't understand why this wouldn't return 1. Isn't the literal meaning of $pattern:

Find 0 Find 1 (Find - or Find . or Find / ) ... etc

That's what the book is explaining

.josh · February 3, 2009

the problem is that since you chose forward slashes as the delimiters, you're not escaping the forward slashes inside your pattern.
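A minimal sketch of the two usual fixes, reusing the $string from the question (the variable names $escaped and $hashed are only illustrative): either escape the slashes inside the pattern, or pick a delimiter that does not occur in it.

```php
<?php
$string = "01-09-1980";

// Option 1: keep / as the delimiter and escape every slash inside the pattern.
$escaped = '/01[-.\/]09[-.\/]1980/';

// Option 2: switch to a delimiter that does not appear in the pattern.
$hashed = '#01[-./]09[-./]1980#';

echo preg_match($escaped, $string), "\n"; // prints 1
echo preg_match($hashed, $string), "\n";  // prints 1
```

In the original pattern the / inside the first character class ends the regex early, so the ] that follows is read as a modifier, which is exactly what the warning complains about.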

justinh · February 3, 2009

I wrote my own pattern!!

<?php
      $string  = "<table border=\"1\">"; 
      $pattern = '/<table( +border *= *"*[0-9]+"*)? *>/'; 
      if(preg_match($pattern, $string)){ 
        
        echo "Match!"; 
        
      }else{ 

        echo "No Match!"; 
        
      }
      ?>

Not bad for just getting the book tonight .. regex isn't that bad..

P.S this returns Match!

nrg_alpha · February 3, 2009

the problem is that since you chose forward slashes as the delimiters, you're not escaping the forward slashes inside your pattern.

@OP: This is why I generally avoid using / as a delimiter. My personal fav is #, as I almost never use the x modifier (under x, whitespace in the pattern is ignored and # starts a comment). Granted, this is a personal preference. You can use !, or ~, or a slew of other delimiters...

When I am typing out my regex, I typically start by adding both delimiters right away. So I'll first type, for example, preg_match('##', $str), then go back in between the delimiters and start hammering away at the pattern.

One mistake some people make is they forget about putting in delimiters and do something like this:

preg_match('<div[^>]*>(.*?)</div>', $str).

The problem here is that you can actually use <, (, and [ as opening delimiters, but then you must close them off with their opposite counterparts >, ), and ]. In the last example, the first < and > characters are what the regex engine sees as delimiters (after all, they are legal). But there is more stuff after that first closing delimiter >, so the example is illegal (you'll run into unknown modifier errors). By putting in your delimiters first (and again, I prefer to avoid /, as it is common in date formats, file paths, etc. I would stick to something less used: #, ~, !, etc.), you will not run into those errors (or at least, the odds are much lower). So:

preg_match('##', $str) then becomes: preg_match('#<div[^>]*>(.*?)</div>#', $str). Now we have our relatively safe delimiters, and the pattern that would cause delimiter errors on its own is now safe.
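To see the difference for yourself, here is a small sketch. The sample $str and the variable names are made up for illustration, and the @ is there only to silence the warning so the return value can be inspected.

```php
<?php
$str = '<div class="box">hello</div>';

// No explicit delimiters: PHP takes < and the first > as the delimiter pair,
// so everything after that > is read as modifiers and compilation fails.
$broken = @preg_match('<div[^>]*>(.*?)</div>', $str);

// Delimiters typed first, pattern filled in between them.
$fixed = preg_match('#<div[^>]*>(.*?)</div>#', $str, $m);

var_dump($broken);             // bool(false)
echo $fixed, ' ', $m[1], "\n"; // 1 hello
```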

This concludes your public internet safety broadcast message, and we now return you to your regular....expressions.

nrg_alpha · February 3, 2009

I wrote my own pattern!!

<?php
      $string  = "<table border=\"1\">"; 
      $pattern = '/<table( +border *= *"*[0-9]+"*)? *>/'; 
      if(preg_match($pattern, $string)){ 
        
        echo "Match!"; 
        
      }else{ 

        echo "No Match!"; 
        
      }
      ?>

Not bad for just getting the book tonight .. regex isn't that bad..

P.S this returns Match!

Good stuff! You're getting there!

You could also revise it to read:

$string  = '<table border="1">';
$pattern = '#<table.+?border=([\'"])[0-9]+\1>#i';
echo (preg_match($pattern, $string))? 'Match!' : 'Oh noes! No Match!';

Looking at the pattern, here is what I have done:

Match <table, then anything one or more times, then border= ... Now we capture either ' or " using a character class ([\'"]), as some code might use single quotes while other code might use double quotes. Then match any digit 0-9 one or more times, then finally we require the matching quote (which we captured initially) and the closing tag bracket: \1>. Notice I used the i modifier after the closing delimiter. Some code might use <TABLE...> while others use <table..>. The i modifier catches either version, just in case.

To wrap it up, since we are simply outputting a string depending on a boolean outcome, I resorted to simple ternary operator notation.
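A quick way to check the backreference and the i modifier is to run the pattern over a few variations. This is only a sketch; the $samples list is invented for the test and is not from the original code.

```php
<?php
$pattern = '#<table.+?border=([\'"])[0-9]+\1>#i';

$samples = [
    '<table border="1">',  // double quotes
    "<table border='1'>",  // single quotes
    '<TABLE BORDER="10">', // upper case, caught by the i modifier
    '<table border="1\'>', // mismatched quotes: \1 must equal the opening quote
];

$results = [];
foreach ($samples as $s) {
    $results[] = preg_match($pattern, $s);
}

echo implode(' ', $results), "\n"; // 1 1 1 0
```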

This all falls into place as you get more comfortable with regex... but good stuff! Keep going!

justinh · February 4, 2009

<?php 

$string = "[email protected]"; 
$pattern = '#[a-zA-Z0-9.-_]+@+[a-zA-Z0-9.-_].+[(com|net|org|)]+#'; 

if(preg_match($pattern, $string)){ 

echo "Match!"; 

}else{ 

echo "No Match!"; 

} 

?>

Okay, here it is, this time with a different delimiter

Any suggestions on this bit of code? I'm pretty sure this is probably the worst way of validating an email.

Lol.

nrg_alpha · February 4, 2009

Just a few points to make with regards to:

$pattern = '#[a-zA-Z0-9.-_]+@+[a-zA-Z0-9.-_].+[(com|net|org|)]+#';

If you add the i modifier at the end of the pattern (after the closing delimiter), you can simply list things like a-z instead of a-zA-Z.

Notice the location of your dash in your character class: [a-zA-Z0-9.-_]. The problem is that if the dash is not the very first or very last character, it creates a range, so in this case you are creating a range from dot to underscore. For a literal dash, always list it as the very first or very last character. You can also escape it with \, but it is easier to simply position it so no escape is needed.

The plus sign after the @ means that the at sign can appear many times in a row. You don't want that plus.

Instead of using .+ as you have between the second and last character class, you should use .+? (this is a lazy quantifier). With .* or .+, things can become very inefficient, because the regex engine has to match all the way to the end of the string (or until it reaches a newline if there is no s modifier, for example), then start backtracking, relinquishing each matched character one at a time and checking whether what follows .* or .+ in the pattern can match at that point.

Don't surround alternations in a character class, as you have here: [(com|net|org|)], as this is not correct. A character class [..] checks for an individual character that is (or is not, depending on your setup) listed inside the square brackets. So in this case, you are saying: check whether the next character is a (, or a c, or an o, or an m, etc. You simply want (com|net|org).

Be aware that this example only accepts a very limited set of emails (it will not take ca, de, co.uk, etc.).

So based on your pattern, I would write it as such:

$pattern = '#[a-z0-9._-]+@[a-z0-9._-].+?(com|net|org)#i';

But even such patterns are not recommended, because they will not match all domains (there are plenty of repositories that go into email patterns).
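The dash and character-class points above can be checked in isolation. A rough sketch follows; the one-character test patterns and variable names are made up for the demonstration and are not part of either email pattern.

```php
<?php
// In [.-_] the dash forms a range from "." (0x2E) to "_" (0x5F). That range
// covers digits, upper-case letters and characters such as < and @,
// but not the dash itself (0x2D).
$range = '#^[.-_]$#';
// With the dash last, the class is exactly: dot, underscore, dash.
$literal = '#^[._-]$#';

$rangeHitsLt     = preg_match($range, '<');   // 1: "<" falls inside the range
$rangeHitsDash   = preg_match($range, '-');   // 0: "-" sits just below "."
$literalHitsLt   = preg_match($literal, '<'); // 0
$literalHitsDash = preg_match($literal, '-'); // 1

// [(com|net|org|)] is a single-character class, not an alternation.
$classHit = preg_match('#^[(com|net|org|)]$#', '|'); // 1: "|" is just a listed character
$altHit   = preg_match('#^(com|net|org)$#', 'net');  // 1: the whole word is matched
$altMiss  = preg_match('#^(com|net|org)$#', '|');    // 0
```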

Damn, sorry for all the edits.. I'm catching all my mistypings / mistakes..

justinh · February 4, 2009

Okay I understand everything except for this:

.+?

Doesn't this mean that in the search the dot is optional?

.josh · February 4, 2009

No, that makes the quantifier non-greedy. The dot matches (almost) any one thing. The + matches 1 or more of that dot. But it's greedy. It will keep on matching until it finds the last match it can make. The ? tells it to stop after the first match it finds.

nrg_alpha · February 4, 2009

Okay I understand everything except for this:
.+?
Doesn't this mean that in the search the dot is optional?

Here is the big difference between say .+ and .+?

With .+, the regex engine will be greedy and match anything (except a newline by default), so it will match up to the end of the line / string. But since more comes after .+ in the pattern, the engine needs to take this into account. To make a long story short, the engine starts backtracking, relinquishing the characters it matched in reverse order, one at a time, and checking each time whether the rest of the pattern can match from there. This is more work for the regex engine than is needed.

An often faster way is to make it lazy, and this is where the ? comes into play. With .+?, the engine will match a character, then check whether the next character in front of it is the next one in the pattern. If not, it will move forward, match that character, then check all over again. So it is basically 'creeping' forward instead of matching everything under the sun (well, almost) and then backtracking. As a rule, using .+ or .* is not recommended. Instead, resorting to lazy quantifiers or, even better (most of the time), negated character classes is preferable.

If you really want to get on the ball, I suggest getting this book, as I didn't fully explain everything. I'm just hoping you get the basic idea of it all.
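As a rough illustration of the three options (greedy, lazy, negated character class), here is a sketch that pulls an attribute value out of a tag. The $tag string and the variable names are invented for the example.

```php
<?php
$tag = '<a href="home.php" id="blah">';

preg_match('#href="(.+)"#',    $tag, $greedy);  // runs on to the last quote in the string
preg_match('#href="(.+?)"#',   $tag, $lazy);    // stops at the first closing quote
preg_match('#href="([^"]+)"#', $tag, $negated); // can never cross a quote at all

echo $greedy[1], "\n";  // home.php" id="blah
echo $lazy[1], "\n";    // home.php
echo $negated[1], "\n"; // home.php
```

The lazy and negated versions capture the same thing here, but the negated class says outright what is allowed between the quotes, so it does not depend on how the engine backtracks.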

justinh · February 4, 2009

Just bought it a couple of days ago, I kind of understand what you're saying. Thanks for all your help, very kind of you.

nrg_alpha · February 4, 2009

Just bought it a couple of days ago, I kind of understand what you're saying. Thanks for all your help, very kind of you.

Happy readings!

FYI, my explanation above is more or less circumstantial (it depends on what you are checking, where in the string this info is, the size of the string, etc.). So .+ and .* are not all evil, but they must be exercised with care, which is more or less what I am trying to get at.

.josh · February 4, 2009

More visual way of explaining:

Consider this string:

alkasdjflskdfjlaksfjdslfkj

.+j will match the whole string:

alkasdjflskdfjlaksfjdslfkj

.+?j will match only up to the first j:

alkasdj
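The same comparison as a runnable sketch (the ~ delimiter and the variable names are arbitrary choices):

```php
<?php
$s = 'alkasdjflskdfjlaksfjdslfkj';

preg_match('~.+j~',  $s, $greedy); // greedy: grabs everything, backs up to the last j
preg_match('~.+?j~', $s, $lazy);   // lazy: creeps forward, stops at the first j

echo $greedy[0], "\n"; // alkasdjflskdfjlaksfjdslfkj
echo $lazy[0], "\n";   // alkasdj
```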

A more practical example:

<a href="home.php" id="blah">home</a><a href="about.php" id="blah">about</a>

~<a.*href="[^"]*"[^>]*>~

will match everything from the first <a through the end of the second opening tag:

<a href="home.php" id="blah">home</a><a href="about.php" id="blah">

Because the .* is greedy and will first match all the way to the end of the string and then go backwards, giving up one character at a time until the first time the other patterns are satisfied. Doing the same regex but with a ? after the .* will make it non-greedy. Instead of it gobbling the whole line up and working its way backwards, it moves forward and matches at the first instance that the patterns are satisfied. So:

~<a.*?href="[^"]*"[^>]*>~

will match only the first opening tag:

<a href="home.php" id="blah">
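A short sketch that runs both versions of the pattern against the sample string, in case you want to see the two results side by side (the variable names are only for illustration):

```php
<?php
$html = '<a href="home.php" id="blah">home</a><a href="about.php" id="blah">about</a>';

preg_match('~<a.*href="[^"]*"[^>]*>~',  $html, $greedy);
preg_match('~<a.*?href="[^"]*"[^>]*>~', $html, $lazy);

echo $greedy[0], "\n"; // <a href="home.php" id="blah">home</a><a href="about.php" id="blah">
echo $lazy[0], "\n";   // <a href="home.php" id="blah">
```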

nrg_alpha · February 4, 2009

Good explanation CV (I know I didn't fully explain things, just trying to get the gist out).. but your example also illustrates the other pitfall of non greedy quantifiers.

@ OP, so yeah, what he said lol
