preg_match: can someone explain this to me?

enil14someone · January 1, 2014

Hello, I've just started learning PHP at w3schools.com and I've stumbled upon the function known as "preg_match". It says it can be used for validation (like the syntax of emails), but when I saw the code I got confused.

$name = test_input($_POST["name"]);
if (!preg_match("/^[a-zA-Z ]*$/",$name))
  {
  $nameErr = "Only letters and white space allowed";
  }

I understand that "[a-zA-Z]" is a regex meaning the string should only contain lowercase/uppercase letters, but I don't quite get what "/^" and "*" are for.

another example that got me even more confused is this:

$email = test_input($_POST["email"]);
if (!preg_match("/([\w\-]+\@[\w\-]+\.[\w\-]+)/",$email))
  {
  $emailErr = "Invalid email format";
  }

I know that "\w" is another regex, but the "(-@+" combinations are quite mind-boggling.

and the last example got even worse:

$website = test_input($_POST["website"]);
if (!preg_match("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$website))
  {
  $websiteErr = "Invalid URL";
  }

I've tried searching for the function on Google, and so far I've understood that the letters and other symbols (i.e. \i, \b, \w, $) are called "regex", yet I'm still getting more confused by these so-called "patterns" and the function's other parameters. I've read through the pages Google suggested, and with every search I just get more and more confused. Can anyone help me?
Thank you!

Ch0cu3r · January 1, 2014

Regex is not easy to understand at first (it took me a while to start to understand it).

The best place I learnt regex was regular-expressions.info. They explain everything from the basic patterns through to the more advanced ones.

You can also check out the php manual on the PCRE pattern syntax too.

.josh · January 1, 2014

The / at the beginning and end is the pattern delimiter. The pattern for every PCRE function (preg_xxx) must be wrapped in a delimiter. This is because you can add pattern modifiers, and those are also specified in the first argument: "/pattern/modifiers".

The delimiter doesn't have to be /. It can be pretty much any character that isn't alphanumeric, a backslash, or whitespace, as long as the opening and closing delimiters match up (well.. you can also use brackets, in which case you'd use them as opening/closing pairs, but let's not open that can of worms). For example, these are the 3 most common delimiters you will see:

/pattern/modifiers
~pattern~modifiers
#pattern#modifiers

/ is popular because it is the only delimiter you can use in some languages. For example, javascript uses / to delimit a regex literal and you can't use anything else. IOW it's the most "universal" delimiter. The main thing to remember about the delimiter is that if you need to use that symbol in your pattern, you must escape it (prefix it with a \). IOW it works basically like quotes for strings.

A good chunk of the time people work with regex (in php), they are trying to parse html (which isn't necessarily a good idea, but that's a different discussion). As you probably know, html contains a lot of forward slashes, so rather than having to deal with escaping them (which isn't *that* big a deal, but it does technically make for a longer and uglier pattern), a lot of people instead use something else (like the tilde or hash).
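To make the escaping trade-off concrete, here is a minimal sketch. The URL and the patterns are made up for illustration (they are not from the tutorial); the point is only that all three delimiters give the same result.

```php
<?php
// Hypothetical sample string; any text containing slashes would do.
$url = 'http://example.com/docs';

// With / as the delimiter, every literal / in the pattern must be escaped.
$withSlash = preg_match('/^http:\/\/[a-z.]+\/docs$/', $url);

// With ~ or # as the delimiter, the slashes need no escaping.
$withTilde = preg_match('~^http://[a-z.]+/docs$~', $url);
$withHash  = preg_match('#^http://[a-z.]+/docs$#', $url);

var_dump($withSlash, $withTilde, $withHash); // int(1) three times
```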

The ^ in this context is an anchor, signifying the start of a string. For example, let's say you have the following string:

"foobar"

And your pattern is /^bar/. This pattern says to match for beginning of string, followed by "bar". So this pattern would not match your "foobar" string, because "bar" is not at the beginning of the string. The counterpart to ^ is $, which stands for "end of string". So for example, if your pattern is /bar$/ it would match, because "bar" is at the end of "foobar".
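The "foobar" example can be checked directly. A small sketch (the variable names are just for illustration):

```php
<?php
$subject = 'foobar';

// preg_match returns 1 on a match and 0 on no match.
$startsWithBar = preg_match('/^bar/', $subject); // 0: "bar" is not at the start
$endsWithBar   = preg_match('/bar$/', $subject); // 1: "bar" is at the end
$startsWithFoo = preg_match('/^foo/', $subject); // 1: "foo" is at the start

var_dump($startsWithBar, $endsWithBar, $startsWithFoo);
```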

One thing to note is that ^ can mean other things, depending on the context. If you use the m modifier (multi-line mode), ^ and $ will change to mean start and end of line, respectively. IOW they will also match right after and right before newline chars, not just at string start/termination.
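A quick sketch of the difference the m modifier makes, using a made-up two-line string:

```php
<?php
// Hypothetical two-line subject: "foo" on line one, "bar" on line two.
$text = "foo\nbar";

$noModifier = preg_match('/^bar/', $text);  // 0: ^ only matches at the start of the string
$multiLine  = preg_match('/^bar/m', $text); // 1: with m, ^ also matches right after the newline

var_dump($noModifier, $multiLine);
```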

^ is also used within a character class, to signify a negative character class. For example, [0-9] will match any one digit. But [^0-9] will invert that. It will match any one character that is not a digit.
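For instance, with a few made-up test strings:

```php
<?php
$digit    = preg_match('/[0-9]/', 'abc5');   // 1: the string contains a digit
$nonDigit = preg_match('/[^0-9]/', '12345'); // 0: every character is a digit
$mixed    = preg_match('/[^0-9]/', '123a5'); // 1: "a" is not a digit

var_dump($digit, $nonDigit, $mixed);
```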

The * is a quantifier. A quantifier tells the regex engine to match the preceding thing for x amount of times. * means 0 or more times. So in your example, [a-zA-Z ]* says to match for 0 or more lower/uppercase letters or spaces.
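The "0 or more" part is easy to overlook, so here is a small sketch of it. The patterns and subjects are made up; the third argument of preg_match receives the matched text, with the whole match in index 0.

```php
<?php
preg_match('/fo*/', 'f', $m1);       // "f" followed by zero o's
preg_match('/fo*/', 'fooo!', $m2);   // "f" followed by three o's
$matched = preg_match('/[a-z ]*/', '123', $m3); // still a match: zero characters

var_dump($m1[0], $m2[0], $matched, $m3[0]);
```

The last call succeeds with an empty match, which is why the tutorial's pattern needs the ^ and $ anchors around the character class.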

So overall, the pattern: "/^[a-zA-Z ]*$/" says start at the beginning of the string and match 0 or more letters or spaces until end of string. So if the pattern matches, that means $name only contains letters and spaces.
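Putting that together, a minimal sketch of the check wrapped in a function. isLettersAndSpaces() is a hypothetical helper name used only for this example; it is not part of the tutorial code.

```php
<?php
// Hypothetical helper: true if $name contains only letters and spaces.
function isLettersAndSpaces(string $name): bool
{
    // preg_match returns 1 on a match, 0 on no match.
    return preg_match('/^[a-zA-Z ]*$/', $name) === 1;
}

var_dump(isLettersAndSpaces('John Smith'));  // true
var_dump(isLettersAndSpaces('John_Smith1')); // false: "_" and "1" are not allowed
var_dump(isLettersAndSpaces(''));            // true: * allows zero characters
```

If an empty name should be rejected, using + (one or more) in place of * would do that.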

On a side note, I mentioned modifiers, and there is an opportunity to improve this pattern with the "i" modifier. The "i" modifier makes the pattern case-insensitive. So the pattern can be shortened by doing this: "/^[a-z ]*$/i"
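A rough way to convince yourself the two patterns behave the same is to run both over a few inputs. The sample names below are made up:

```php
<?php
// Hypothetical sample inputs.
$names = ['John Smith', 'JOHN', 'john', 'J0hn'];

$results = [];
foreach ($names as $name) {
    $explicit = preg_match('/^[a-zA-Z ]*$/', $name); // both cases spelled out
    $withFlag = preg_match('/^[a-z ]*$/i', $name);   // case-insensitive modifier
    $results[$name] = [$explicit, $withFlag];
    printf("%-12s %d %d\n", $name, $explicit, $withFlag);
}
```

Both columns should agree for every input: a match for the first three names and no match for "J0hn", which contains a digit.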

As for the email example: "(-@+" isn't in that pattern, so I'm not sure what you're confused about.

Things preceded with a backslash are called escape sequences. What it is depends on the context. For example, if I used / as my pattern delimiter and wanted to match a closing html anchor tag, I'd have to do this: "/<\/a>/". In this context, I'm simply escaping the forward slash in my pattern to tell the regex engine to look for a literal forward slash, instead of thinking it's the end of my pattern.

Some escape sequences are shorthand character classes. For example, \w will match any "word" character, and is the equivalent of [a-zA-Z0-9_]. Well, it's a little more complicated than that (read the entry for it in the link above).
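A small sketch of what \w does and doesn't cover, using made-up ASCII strings (locale and unicode details are left out here):

```php
<?php
$letters    = preg_match('/^\w+$/', 'abc');    // 1: letters are word characters
$digits     = preg_match('/^\w+$/', 'abc123'); // 1: digits are too
$underscore = preg_match('/^\w+$/', 'a_b');    // 1: so is the underscore
$hyphen     = preg_match('/^\w+$/', 'a-b');    // 0: the hyphen is not

var_dump($letters, $digits, $underscore, $hyphen);
```

That last case is why the email pattern in the tutorial uses [\w\-] rather than plain \w.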

So this last chunk of code you posted.. actually, it doesn't make much sense. Overall it looks like the intention is to validate a url. I *assume* test_input() is supposed to do this.. but then why turn around and have regex validation after that? IOW even out of context, this code has an "improperly structured" vibe to it. Anyways..

So the regex itself.. again, under the assumption that it's supposed to be validating a url.. this pattern is bad. You can go here and enter in the pattern to get a breakdown of it.
