[SOLVED] Help with preg_replace please

Chappers · May 29, 2009

Hi everyone,

Made an upload form so users can upload images on my site, and have now modified it so that each image is given the user's name, so I know what's come from whom. One of the checks performed on the name was grabbed from elsewhere, as I've never fully grasped the use of preg_replace:

$uploaddir = 'files/';
$filedone = $uploaddir.strtolower(preg_replace("/[^\W\D]+/", '', $name));

Trouble is that if $name contains any numbers, preg_replace is removing them. I found a list of all the special character definitions but just don't get how they work, why some are in parentheses, etc. I saw that \W means match a non-word character, so thought that means anything other than normal letters. \D states it's for matching non-digit characters. If that's right, why is it matching digits?

Tried removing the \D anyway, then found that everything was removed except for non-letters like @ and dot (.). What's going wrong?

Thanks! James

Hybride · May 29, 2009

If you scroll down, POSIX regex function from PHP.net is a good guide as well. Also check out this Regex Cheat Sheet.

Are you actually asking to remove the digits or what do you want to be replaced?

akitchin · May 29, 2009

notice that your bracketed character set begins with a "^" - this tells the engine to match anything that is NOT in the character set specified. since the set itself lists NON-word characters (\W) and NON-digit characters (\D), the double negative means the pattern only matches characters that are both word characters and digits, i.e. the digits themselves.
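a quick sketch of that double negative, using a made-up sample string rather than the real $name. the variable names here are just for illustration:

```php
<?php
// Hypothetical sample value, for illustration only.
$sample = 'abc_123-x.y';

// [^\W\D] reads as "not a non-word character and not a non-digit",
// which only leaves the characters that are digits.
$doubleNegative = preg_replace('/[^\W\D]+/', '', $sample);

// \d+ is the plain way of writing the same match.
$plainDigits = preg_replace('/\d+/', '', $sample);

// Both calls strip just the digits, so the results are identical.
var_dump($doubleNegative === $plainDigits);
```

in other words, the pattern as posted behaves like a "remove digits" filter, which is the opposite of what the upload check needs.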

what exactly are you trying to do with the string? we might be able to guide you to the correct pattern.

nrg_alpha · May 29, 2009

Hybride wrote: "If you scroll down, POSIX regex function from PHP.net is a good guide as well. Also check out this Regex Cheat Sheet."

Actually, POSIX is now discouraged as it will no longer be included within the core of PHP as of version 6. PCRE is preferable.

@Chappers, akitchin nailed it.. Also note that even if you meant [^\w\d]+, \w by default will match a-zA-Z0-9_ (so there is no need to include the \d in there.. and depending on your locale, \w might match more than you think..) so no harm in either a) declaring that character class as: [^a-zA-Z0-9_]+ instead, or b) if you do use [^\w]+, ensure that you have your ctype setting set to 'C' prior to any regex:

setlocale(LC_CTYPE, 'C');
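As a rough sketch of options a) and b) side by side (the sample string and variable names are made up for the example):

```php
<?php
// Pin the character-type locale so \w keeps its default ASCII meaning.
setlocale(LC_CTYPE, 'C');

// Hypothetical sample value, for illustration only.
$sample = 'user name!_01';

// a) Spell the allowed characters out explicitly.
$explicit = preg_replace('/[^a-zA-Z0-9_]+/', '', $sample);

// b) Use the \w shorthand, relying on the 'C' locale set above.
$shorthand = preg_replace('/[^\w]+/', '', $sample);

// For plain ASCII input both approaches give the same result.
var_dump($explicit === $shorthand);
```

The explicit class has the advantage that what it allows is visible in the pattern itself, whatever the locale happens to be.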

Chappers · May 30, 2009

Sorry, should have said what my intentions were. As I grabbed it from elsewhere for this kind of thing, I gathered it was designed to strip out dodgy characters that could be used for bad purposes, perhaps before the value is sent to a SQL database.

Anyway, I just wanted it to allow normal letters, all numerals, and underscores, dashes, full stops, etc. Of course, wouldn't want apostrophes or quotes nor parentheses, square brackets, etc. I want the same kind of thing that email suppliers use when you sign up and enter a desired email username - you're allowed underscores, etc., but not many other things like apostrophes, colons, slashes, etc.

Thanks

nrg_alpha · May 30, 2009

Chappers wrote: "Anyway, I just wanted it to allow normal letters, all numerals, and underscores, dashes, full stops, etc. Of course, wouldn't want apostrophes or quotes nor parentheses, square brackets, etc. I want the same kind of thing that email suppliers use when you sign up and enter a desired email username - you're allowed underscores, etc., but not many other things like apostrophes, colons, slashes, etc."

This character class (given what you mentioned) should do it: [^a-zA-Z0-9_-]+ or if you are using the i modifier after the closing delimiter (which makes the pattern case insensitive), you can simply use [^a-z0-9_-]+ instead.
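A minimal sketch of both spellings, run against a made-up name (the sample value and variable names are only for illustration):

```php
<?php
// Hypothetical sample value, for illustration only.
$sample = "O'Brien_(Jr)-2009";

// Both letter cases listed explicitly in the class.
$withRanges = preg_replace('/[^a-zA-Z0-9_-]+/', '', $sample);

// Same class with only a-z, plus the i modifier after the closing delimiter.
$withModifier = preg_replace('/[^a-z0-9_-]+/i', '', $sample);

// The apostrophe and parentheses are stripped in both cases.
echo $withRanges, "\n";
echo $withModifier, "\n";
```

Note that neither class lists the full stop, so dots are stripped as well; if they should be kept, the dot would need adding to the class too (with the dash still last).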

Chappers · May 30, 2009

Hi, thanks for that, I'll give it a try in a minute. I don't suppose you have the time to briefly explain a bit about the definitions and how the way it's set out works? It's just I can't find anything on the net that properly explains it and the php.net manual doesn't even give the different things you can use and what they do (the letters like D that then match whatever D stands for...).

I can't understand what I was using: preg_replace("/[^\W\D]+/". What do the D and W properly stand for and have they been used incorrectly together in this instance? I don't know what the forward slashes are for, nor what the ^ means. If you could even just recommend a good tutorial I'd really appreciate it. If I can understand it, I can do it myself from then on instead of bothering others... Oh, and don't know what an i modifier is, sorry. Bit confused by it all.

Thanks again, James

nrg_alpha · May 31, 2009

ok.. the basic run down of what you were using: [^\W\D]+ is this..

[^...] This is a negated character class that checks to see if the current character being examined in the string is NOT ... (and in this case, not \W or \D).

\W is any non word character.. but to understand what is not a word character, you must know what is a word character. In this case, by default, \w is a-zA-Z0-9_ So \W is anything that is not any of those characters.

\D is any non digit.. so if \d is 0-9, \D is anything that is not 0-9.

So the pattern I last proposed: [^a-zA-Z0-9_-]+ is simply a character class that checks to see if the current character being checked within the string is not a-zA-Z0-9_- (one or more times consecutively, because of the use of the + quantifier).
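To see what the + quantifier changes, here is a small sketch that uses the optional count argument of preg_replace (the sample string and variable names are invented for the example):

```php
<?php
// Hypothetical sample value, for illustration only.
$sample = 'ab!!cd??ef';

// With +, each consecutive run of unwanted characters is a single match.
$withPlus = preg_replace('/[^a-zA-Z0-9_-]+/', '', $sample, -1, $runs);

// Without +, every unwanted character is matched on its own.
$withoutPlus = preg_replace('/[^a-zA-Z0-9_-]/', '', $sample, -1, $chars);

// Same cleaned string either way; only the number of matches differs.
echo "$withPlus ($runs matches), $withoutPlus ($chars matches)\n";
```

The end result is identical when replacing with an empty string; the quantifier just lets the engine remove a whole run in one go.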

You can learn more about regex from these sites:

regular-expressions

weblogtoolscollection

phpfreaks regex tutorial

phpfreaks regex resources

These should be more than enough to get you started.. Googling regex tutorials will obviously yield some results as well.

Chappers · June 1, 2009

Excellent tutorials, thanks. After having a read, I tried this experiment to help me grasp it:

<table cellpadding='5' border='1'>
<?php
$gimp = '[email protected]';
echo "<tr><td>this is original:</td><td>$gimp</td></tr>";
$gimp1 = preg_replace("/[^\W\D]+/", '', $gimp);
echo "<tr><td>1) this is after \W\D:</td><td>$gimp1</td></tr>";
$gimp2 = preg_replace("/[^\W]+/", '', $gimp);
echo "<tr><td>2) this is after \W:</td><td>$gimp2</td></tr>";
$gimp3 = preg_replace("/[^\D]+/", '', $gimp);
echo "<tr><td>3) this is after \D:</td><td>$gimp3</td></tr>";
$gimp4 = preg_replace("/[^\w\d]+/", '', $gimp);
echo "<tr><td>4) this is after \w\d:</td><td>$gimp4</td></tr>";
$gimp5 = preg_replace("/[^\w]+/", '', $gimp);
echo "<tr><td>5) this is after \w:</td><td>$gimp5</td></tr>";
$gimp6 = preg_replace("/[^\d]+/", '', $gimp);
echo "<tr><td>6) this is after \d:</td><td>$gimp6</td></tr>";
$gimp7 = preg_replace("/[^\w\.@-]+/", '', $gimp);
echo "<tr><td>7) this is after \w\.@-:</td><td>$gimp7</td></tr>";
$gimp8 = preg_replace("/[^\w@-\.]+/", '', $gimp);
echo "<tr><td>8) this is after \w@-\.:</td><td>$gimp8</td></tr>";
$gimp9 = preg_replace("/[^\w\.-@]+/", '', $gimp);
echo "<tr><td>9) this is after \w\.-@:</td><td>$gimp9</td></tr>";
?>
</table>

Which outputs:

this is original: [email protected]

1) this is after \W\D: [email protected]

2) this is after \W: -@.

3) this is after \D: [email protected]

4) this is after \w\d: theword_1234testcom

5) this is after \w: theword_1234testcom

6) this is after \d: 1234

7) this is after \w\.@-: [email protected]

8) this is after \w@-\.:

9) this is after \w\.-@: [email protected]

What seems odd is that \W\D should be matching letters and digits but is only removing the digits. However, both \W and \D work as expected separately. \w\d should be matching non-words and non-digits and works normally even when together, and works as expected when separate. It gets odd when I try adding other characters I want to be left alone, like @, _, - and a dot. The order seems to affect whether it works or not. 7 works perfectly. 8 leaves nothing behind afterwards, so putting the escaped dot at the end of the search list doesn't work. 9 ignores the dash if it comes before the @. Weird?

MadTechie · June 1, 2009

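\w matches digits as well

So [\w\d] is the same as [\w]. Likewise, every non-word character is also a non-digit, so [\W\D] is the same as [\D], which is why [^\W\D] only strips the digits.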

8 & 9 should have the - at the end of the set

8.

[^\w@\.-]+

9.

[^\w\.@-]+
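A rough sketch of why the dash position matters in 8 and 9, using a made-up sample string (not the address from the experiment above):

```php
<?php
// Hypothetical sample value, for illustration only.
$sample = 'a-b.c@d;e';

// Pattern 9: ".-@" is read as a range from "." (0x2E) to "@" (0x40).
// That range covers . / 0-9 : ; < = > ? @ but not "-" (0x2D),
// so the dash is not protected and gets removed.
$nine = preg_replace('/[^\w\.-@]+/', '', $sample);

// Pattern 8: "@-." is a range that runs backwards, so the pattern does not
// compile. preg_replace() raises a warning (suppressed here with @) and
// returns null, which prints as an empty string.
$eight = @preg_replace('/[^\w@-\.]+/', '', $sample);

// Dash last: it is a literal, so "-", "." and "@" are kept and ";" is removed.
$fixed = preg_replace('/[^\w\.@-]+/', '', $sample);

var_dump($nine, $eight, $fixed);
```

So 8 printed nothing because the whole call failed, and 9 lost the dash because it was acting as a range operator rather than a character.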

nrg_alpha · June 1, 2009

Also note that inside a character class, you don't need to escape the dot, as almost all special meta characters lose their special abilities within the character class. So using number 8 as an example (taking into account MadTechie's correction of the dash location (which must be the very first or last character in the class to be treated as a literal [unless using a negated character class, in that case the dash must be the second or last character])), it could be:

8.

[^\w@.-]+
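A small sketch of the dot losing its special meaning inside a class (the test strings and variable names are just illustrative):

```php
<?php
// Outside a class, "." is a meta character and matches any single character
// (other than a newline, by default).
$outsideClass = preg_match('/^.$/', 'x');

// Inside a class, "." only matches a literal dot, escaped or not.
$bareInClass    = preg_match('/^[.]$/', 'x');
$escapedInClass = preg_match('/^[\.]$/', 'x');
$literalDot     = preg_match('/^[.]$/', '.');

var_dump($outsideClass, $bareInClass, $escapedInClass, $literalDot);
```

Escaping the dot inside the class does no harm; it is simply not needed.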

Chappers · June 1, 2009

Thanks so much to everyone who's helped, the above explains a great deal to me and with the tutorials I'm learning more and more about it. Much appreciated, and my form is now doing what I want it to. I'll keep dashes at the end, and it's good to know why, not just that it should be that way. Thanks.

nrg_alpha · June 1, 2009

Chappers wrote: "I'll keep dashes at the end, and it's good to know why, not just that it should be that way."

Oh right..the reason for placing the dash first or last is because there cannot be a range (like say [a-z] for instance) if there is a valid value on only one side.. so if you take an example like [a-z-], the last part involving the dash would be z- but since there is nothing on the right-hand side of the dash, regex will not be able to consider this as a range. Same with the [-a-z]... the dash is missing a character on the left-hand side.

With negated character classes, you could use [^a-z-] or [^-a-z]. The latter is not treated as a range despite there being a character on both sides of the dash, because the first character in the negated class, ^, is a meta character with the special purpose of making the class negated, so the regex engine won't treat it as a literal. In these cases, and in general, I prefer placing the dash last - but have seen myself sometimes place it first.. depends on the mood.
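A quick sketch comparing the two dash positions in a negated class (the sample string and variable names are made up for the example):

```php
<?php
// Hypothetical sample value, for illustration only.
$sample = 'a-b c_d';

// Dash last: nothing on its right-hand side, so it is a literal.
$dashLast = preg_replace('/[^a-z-]+/', '', $sample);

// Dash directly after the ^: the ^ is not a literal, so no range is formed
// and the dash is again a literal.
$dashFirst = preg_replace('/[^-a-z]+/', '', $sample);

// Both keep the letters and the dash, and strip the space and underscore.
var_dump($dashLast, $dashFirst);
```

Either position works; the point is only that the dash never ends up between two literal characters.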

Chappers · June 2, 2009

Thanks for that. I wasn't hinting that I wanted an explanation, I'd see that as being rude. I was actually being sincere and saying that it was good that you'd explained about why instead of just saying to do it, because from my point of view you had explained when you said "taking into account MadTechie's correction of the dash location (which must be the very first or last character in the class to be treated as a literal [unless using a negated character class, in that case the dash must be the second or last character])". Sorry if I didn't make it clear. But thanks anyway for the expanded explanation, it all helps!
