Help. I need somebody. Help. Not just anybody.

waynew · May 19, 2009

I'm not going to try to hide it. I'm crap at regex and have never found the time to brush up on it. But for a site I'm doing, I sure could do with some help on a number of user-submitted fields.

1: Probably the easiest. I ask them to enter their Bebo username. Bebo usernames only allow alphanumeric characters and underscores, so I'd have to make sure that was the case when they entered theirs on my site.

2: A little more difficult. They have to enter a link similar to http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407

The only thing that will be different is the PreviewSkinId number.

Hopefully I can learn a little from any examples. Or at least, re-use the code in the future.

Axeia · May 19, 2009

For the first one, [0-9A-Z_] should work (put the i flag at the end for case insensitivity, as mentioned on http://php.net/preg_match).

For the second one, substr with a negative offset would do, if the number at the end is always the same length.
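
A minimal sketch of both suggestions, not a definitive implementation. It assumes the username must consist solely of the allowed characters (the ^ and $ anchors, the + quantifier and the D modifier are additions, not part of the suggestion above) and that the ID is always 10 digits long, as in the example link. The function names are hypothetical.

```php
<?php
// Hypothetical helper: true when $name holds only letters, digits and underscores.
function isValidBeboUsername(string $name): bool
{
    // ^...$ ties the class to the whole string, i makes A-Z cover a-z as well,
    // and D stops $ from also matching just before a trailing newline.
    return preg_match('/^[0-9A-Z_]+$/iD', $name) === 1;
}

// Hypothetical helper: returns the last $length characters of the link.
// Only reliable if the ID length never changes, which is an assumption.
function lastChars(string $url, int $length = 10): string
{
    return substr($url, -$length);
}

var_dump(isValidBeboUsername('2user_n3ame_'));  // bool(true)
var_dump(isValidBeboUsername('bad name!'));     // bool(false)
echo lastChars('http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407'), "\n"; // 4867349407
```

Note that the substr route checks nothing about the rest of the link, so it only helps once the format has been validated some other way.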

Maq · May 19, 2009

Did some minor testing, I'm not regexpert but they seem to work fine:

<?php
$s = "2user_n3ame_";
if(preg_match("~^([\w\d_])+$~i", $s))
{
   echo "valid";
}
else
{
   echo "invalid";
}

$u = "http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407";
if(preg_match("~http://www\.bebo\.com/Profile\.jsp\?previewSkinId=([\d])+$~i", $u))
{
   echo "\nvalid";
}
else
{
   echo "\ninvalid";
}


?>

waynew · May 19, 2009

Thanks guys. For your help you will be awarded seventeen virgins in heaven.

nrg_alpha · May 20, 2009

Maq wrote: "Did some minor testing, I'm not regexpert but they seem to work fine"

You're getting there, Maq

Just note that for ([\w\d_])+, the shorthand character class \w by default will match a-zA-Z0-9_ (locale issues of potentially matching even more characters than that aside), so you don't need the \d or the _ after it, and as a result you don't need the surrounding [] character class either. The parentheses are for capturing, but since the goal is simply to check a format (at least from the look of things), they shouldn't be needed:

if(preg_match('~^\w+$~i', $s))

Same kind of deal with the second solution (again, assuming it's just a format check and nothing is done with the number):

...previewSkinId=([\d])+$

could simply be

previewSkinId=\d+$
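
To see the simplification in action, here is a rough sketch using the sample values from earlier in the thread. The leading ^ anchor on the link pattern is an addition, and the last two matches only illustrate what the parentheses would capture if the number were wanted after all.

```php
<?php
$s = '2user_n3ame_';
$u = 'http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407';

// \w already covers letters, digits and underscore, so no class or group is needed.
$nameOk = preg_match('~^\w+$~', $s) === 1;

// Format check only: no parentheses required.
$linkOk = preg_match('~^http://www\.bebo\.com/Profile\.jsp\?PreviewSkinId=\d+$~i', $u) === 1;

// If the number itself is wanted, wrap the whole \d+ in the group.
preg_match('~PreviewSkinId=(\d+)$~i', $u, $m);
$id = $m[1];         // "4867349407"

// ([\d])+ repeats the group instead, so only the last digit is kept.
preg_match('~PreviewSkinId=([\d])+$~i', $u, $m2);
$lastDigit = $m2[1]; // "7"

var_dump($nameOk, $linkOk, $id, $lastDigit);
```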

GingerRobot · May 20, 2009

If they're entering the same link with just the ID different, why not just ask for that? It'd be easier to validate, and it would mean that if Bebo alters anything it shouldn't affect you.

waynew · May 20, 2009

Oh I wish. But you see, I'm expecting the majority of my users to be between 14 & 18. Most of them probably don't understand (nor want to understand) which number is needed for the link to be valid.

Ok... I've edited Maq's code and added nrg_alpha's recommendation... so is the code below fit for use?

<?php

$u = "http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407";
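// Closing ~ delimiter restored, plus the i flag so "previewSkinId" also matches "PreviewSkinId".
if(preg_match("~http://www\.bebo\.com/Profile\.jsp\?previewSkinId=\d+$~i",$u))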
{
   echo "\nvalid";
}
else
{
   echo "\ninvalid";
}


?>
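
Since users will paste the whole link while only the number really matters, one possible approach is to accept the link but keep just the ID. This is a sketch under assumptions, not a tested production routine: the function name is hypothetical, the host and path are taken from the example link, and the parameter name is matched case-sensitively (unlike the i flag in the regex).

```php
<?php
// Hypothetical helper: returns the skin ID from a pasted link, or null if the link does not fit.
function extractPreviewSkinId(string $url): ?string
{
    $parts = parse_url(trim($url));
    if ($parts === false || !isset($parts['host'], $parts['path'], $parts['query'])) {
        return null;
    }
    // Host and path are assumed from the example link in this thread.
    if (strcasecmp($parts['host'], 'www.bebo.com') !== 0 || $parts['path'] !== '/Profile.jsp') {
        return null;
    }
    // Split the query string into an array of parameters.
    parse_str($parts['query'], $params);
    $id = $params['PreviewSkinId'] ?? null;
    // ctype_digit() is true only for non-empty strings made up solely of digits.
    return (is_string($id) && ctype_digit($id)) ? $id : null;
}

var_dump(extractPreviewSkinId('http://www.bebo.com/Profile.jsp?PreviewSkinId=4867349407')); // string(10) "4867349407"
var_dump(extractPreviewSkinId('http://www.bebo.com/Profile.jsp?PreviewSkinId=abc'));        // NULL
```

Storing only the ID means the link can be rebuilt later, even if the URL layout changes.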

Maq · May 20, 2009

nrg_alpha wrote: "You're getting there, Maq"

Haha thanks, I try.

nrg_alpha wrote: "Just note that for ([\w\d_])+, the shorthand character class \w by default will match a-zA-Z0-9_ [...]"

Is this true for just PCRE, or all regex engines?

nrg_alpha · May 20, 2009

You mean about \w?

I only use PCRE, so I'm not versed in other engines, but according to the book Mastering Regular Expressions:

"Perl and most other programs consider alphanumerics and underscore to be part of a word"

"\w: Part-of-word character. Often the same as [a-zA-Z0-9_]. Some tools omit the underscore, while others include all alphanumerics in the current locale. If Unicode is supported, \w usually refers to all alphanumerics; notable exceptions include java.util.regex and PCRE (and by extension, PHP), whose \w are exactly [a-zA-Z0-9_]."

But yeah, depending on your locale, it may not be exactly [a-zA-Z0-9_]. For me, if I want that to be the case, I have to set my LC_CTYPE variable to 'C' (I just link to threads to save some retyping). But I digress...

All in all your solution works, and that's the important thing!

Maq · May 20, 2009

I agree, thanks for the info.

.josh · May 20, 2009

I used to like the shortcut char classes, but then I found out about potential locale discrepancies in how they're interpreted, so I usually use a char class and write the stuff out explicitly.

nrg_alpha · May 20, 2009

I suppose for all intents and purposes, \w, \d and the like won't get you into trouble by matching stuff you didn't expect (but I guess you can never be too sure; one day it's bound to bite someone in the rear). Declaring things explicitly in your own character class is definitely a sure-fire way. Either that, or simply use setlocale(LC_CTYPE, 'C'); to make sure that those shorthand character classes behave as expected.
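
The two options can be sketched side by side as follows. The function names are hypothetical, and the D modifier (so that $ only matches at the very end) is an addition. Keep in mind that setlocale() changes process-wide state, so the second variant affects other locale-aware code too.

```php
<?php
// Option 1: spell the class out, so locale settings cannot widen it.
function isWordOnlyExplicit(string $s): bool
{
    return preg_match('~^[A-Za-z0-9_]+$~D', $s) === 1;
}

// Option 2: pin LC_CTYPE to the "C" locale first, then rely on \w.
function isWordOnlyShorthand(string $s): bool
{
    setlocale(LC_CTYPE, 'C');
    return preg_match('~^\w+$~D', $s) === 1;
}

var_dump(isWordOnlyExplicit('2user_n3ame_'));   // bool(true)
var_dump(isWordOnlyShorthand('2user_n3ame_'));  // bool(true)
var_dump(isWordOnlyExplicit('na me'));          // bool(false)
var_dump(isWordOnlyShorthand('na me'));         // bool(false)
```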

.josh · May 20, 2009

Well, using setlocale is fine and dandy within a PHP environment, but from a portability perspective...

nrg_alpha · May 20, 2009

True enough. I'm always assuming a PHP environment (unless the OP specifies otherwise), as this is a PHP forum after all.
