I'm not going to try to hide it. I'm crap at regex and have never found the time to brush up on it. But for a site I'm doing, I sure could do with some help on a number of user-submitted fields.

 

1: Probably the easiest. I ask them to enter their Bebo username. Bebo usernames only allow alphanumeric characters and underscores; so I'd have to make sure that was the case when they entered theirs on my site.

 

2: A little more difficult. They have to enter a link similar to http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407

The only thing that will be different is the PreviewSkinId number.

 

Hopefully I can learn a little from any examples. Or at least, re-use the code in the future.

For the first one, ^[0-9A-Z_]+$ should work (add the i flag at the end for case insensitivity, as mentioned at http://php.net/preg_match).

For the second one substr with some negative numbers if the number at the end is always the same length.
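A rough sketch of the substr idea, assuming the ID is always 10 digits as in the example link (that length is only a guess from one sample, and the helper name lastDigits is made up for illustration). Note that it only looks at the tail of the string and says nothing about the rest of the URL.

```php
<?php
// Hypothetical helper: returns the last $length characters of $url
// if they are all digits, otherwise null.
function lastDigits(string $url, int $length = 10): ?string
{
    // A negative start offset counts from the end of the string.
    $tail = substr($url, -$length);

    // ctype_digit() is true only when every character is 0-9.
    return (strlen($tail) === $length && ctype_digit($tail)) ? $tail : null;
}

$u = "http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407";
echo lastDigits($u) ?? "invalid"; // prints 4867349407
```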

Did some minor testing, I'm not regexpert but they seem to work fine:

 

<?php

$s = "2user_n3ame_";
if(preg_match("~^([\w\d_])+$~i", $s))
{
   echo "valid";
}
else
{
   echo "invalid";
}

$u = "http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407";
if(preg_match("~http://www\.bebo\.com/Profile\.jsp\?previewSkinId=([\d])+$~i", $u))
{
   echo "\nvalid";
}
else
{
   echo "\ninvalid";
}


?>

You're getting there, Maq :)

Just note that for ([\w\d_])+, the shorthand character class \w by default matches a-zA-Z0-9_ (locale issues of potentially matching even more characters aside), so you don't need the \d or the _ after it, and as a result you don't need the character class brackets [] either. The parentheses are for capturing, but since the goal is simply to check a format (at least from the look of things), they shouldn't be needed:

 

if(preg_match('~^\w+$~i', $s))
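To make the simplified pattern concrete, here is a small sketch wrapped in a function (the name isValidUsername is made up). One change that is my own assumption about what is wanted: I've added the D modifier, because without it $ also matches just before a trailing newline, so "name\n" would pass.

```php
<?php
// Hypothetical helper for illustration; not part of any library.
function isValidUsername(string $name): bool
{
    // \w+ : one or more word characters (letters, digits, underscore)
    // D   : make $ match only at the very end, not before a final "\n"
    // The i flag is not needed, since \w already covers both cases.
    return preg_match('~^\w+$~D', $name) === 1;
}

var_dump(isValidUsername("2user_n3ame_")); // bool(true)
var_dump(isValidUsername("bad name!"));    // bool(false)
var_dump(isValidUsername("name\n"));       // bool(false), thanks to D
```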

 

Same kind of ordeal with the second solution (again, assuming it's just a format, not doing anything with the numbers):

 

...previewSkinId=([\d])+$

 

could simply be

 

previewSkinId=\d+$
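Put together with delimiters and flags, the shortened pattern might look like the sketch below. The function name isValidSkinUrl is hypothetical, and the leading ^ and the D modifier are my additions so that the whole string has to match rather than just its tail.

```php
<?php
// Hypothetical helper for illustration.
function isValidSkinUrl(string $url): bool
{
    // ^ and $ (with D) pin the match to the whole string;
    // i lets "PreviewSkinId" match regardless of case.
    return preg_match(
        '~^http://www\.bebo\.com/Profile\.jsp\?PreviewSkinId=\d+$~Di',
        $url
    ) === 1;
}

var_dump(isValidSkinUrl("http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407")); // bool(true)
var_dump(isValidSkinUrl("http://www.bebo.com/Profile.jsp?PreviewSkinId="));           // bool(false)
```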

If they're entering the same link with just the ID different, why not just ask for that? Be easier to validate and would mean if bebo alter anything it shouldn't affect you.

 

Oh I wish. But you see, I'm expecting the majority of my users to be between 14 & 18. The majority of them probably don't understand (and don't want to) which number is needed for the link to be valid.
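One possible middle ground, sketched under assumptions: let users paste the full link and pull the number out on the server with a capture group, so only the ID needs to be stored. The function name extractSkinId is made up, and trimming the input is just a guess at what copy and paste tends to produce.

```php
<?php
// Hypothetical helper: returns the PreviewSkinId digits, or null when
// the pasted link does not have the expected shape.
function extractSkinId(string $url): ?string
{
    $pattern = '~^http://www\.bebo\.com/Profile\.jsp\?PreviewSkinId=(\d+)$~Di';

    // trim() forgives stray whitespace around a pasted link.
    if (preg_match($pattern, trim($url), $m) === 1) {
        return $m[1]; // first capture group: the digits
    }
    return null;
}

echo extractSkinId(" http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407 "); // prints 4867349407
```

Here the parentheses earn their keep, because the digits are actually used afterwards.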

 

Ok...  I've edited Maq's code and added nrg_alpha's recommendation, so is the code below fit for use?

 

<?php

$u = "http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407";
if(preg_match("~^http://www\.bebo\.com/Profile\.jsp\?PreviewSkinId=\d+$~i", $u))
{
   echo "\nvalid";
}
else
{
   echo "\ninvalid";
}


?>

Haha thanks,  I try.  :)

 

Is this true for just PCRE, or all regex engines?

You mean about \w?

I only use PCRE, so I'm not versed in other engines, but according to the book Mastering Regular Expressions:

 

Perl and most other programs consider alphanumerics and underscore to be part of a word

 

\w Part-of-word character  Often the same as [a-zA-Z0-9_]. Some tools omit the underscore, while others include all alphanumerics in the current locale. If Unicode is supported, \w usually refers to all alphanumerics; notable exceptions include java.util.regex and PCRE (and by extension, PHP), whose \w are exactly [a-zA-Z0-9_].

 

But yeah, depending on your locale, it may not be exactly [a-zA-Z0-9_]. For me, if I want that to be the case, I have to set my LC_CTYPE variable to 'C' (I just link to threads to save some retyping). But I digress...

 

All in all your solution works, and that's the important thing!  :)

I agree, thanks for the info.  ;)

I suppose for all intents and purposes, \w, \d and the like won't get you into trouble by matching stuff you didn't expect (but I guess you can never be too sure; one day it's bound to bite someone in the rear). Declaring things explicitly in your own character class is a sure-fire way to avoid that. Either that, or simply use setlocale(LC_CTYPE, 'C'); to make sure that those shorthand character classes behave as expected.
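Both options side by side, as a minimal sketch. The function names isAsciiWord and isWord are hypothetical, and whether \w really picks up extra characters on a given server depends on its locale setup, which I can't vouch for here.

```php
<?php
// Option 1: spell the allowed characters out, so the locale cannot matter.
function isAsciiWord(string $s): bool
{
    return preg_match('~^[A-Za-z0-9_]+$~D', $s) === 1;
}

// Option 2: pin the character-type locale before relying on \w.
setlocale(LC_CTYPE, 'C');

function isWord(string $s): bool
{
    return preg_match('~^\w+$~D', $s) === 1;
}

var_dump(isAsciiWord("user_42"));  // bool(true)
var_dump(isAsciiWord("us\xE9r"));  // bool(false): byte 0xE9 is not in the class
var_dump(isWord("user_42"));       // bool(true)
```

The explicit class has the advantage that the allowed set is visible right in the pattern, with no dependence on server configuration.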
