
[SOLVED] Help with preg_replace please


Chappers


Hi everyone,

 

Made an upload form so users can upload images on my site, and now modified it again so that each image has the user's name given to it so I know what's come from whom. One of the checks on the name that is performed was grabbed from elsewhere as I've never fully grasped the use of preg_replace:

 

$uploaddir = 'files/';
$filedone = $uploaddir.strtolower(preg_replace("/[^\W\D]+/", '', $name));

 

Trouble is that if $name contains any numbers, preg_replace is removing them. I found a list of all the special character definitions but just don't get how they work, why some are in parentheses, etc. I saw that \W means match a non-word character, so thought that means anything other than normal letters. \D states it's for matching non-digit characters. If that's right, why is it matching digits?

 

Tried removing the \D anyway, then found that everything was removed except for non-letters like @ and dot (.). What's going wrong?

 

Thanks! James

Link to comment
Share on other sites

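notice that your bracketed character set begins with a "^" - this tells the engine to match anything that is NOT in the character set specified. since the set itself lists NON-word characters (\W) and NON-digit characters (\D), the double negative means it matches only characters that are neither of those, i.e. characters that are both word characters and digits. in practice that is just the digits 0-9, which is why your numbers disappear.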
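A quick way to see the double negative at work is to list what the class actually matches. This is only a minimal sketch: the sample value in $name is made up for illustration and is not taken from the upload form.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$name = 'James_99@home';

// [^\W\D] means "not a non-word character and not a non-digit",
// which leaves only the digits.
preg_match_all('/[^\W\D]+/', $name, $matches);
print_r($matches[0]);   // the runs of characters the pattern matches

// preg_replace therefore strips exactly those runs.
$stripped = preg_replace('/[^\W\D]+/', '', $name);
echo $stripped, "\n";   // James_@home
```

The only match is "99", so the letters, the underscore and the @ all survive the replace.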

 

what exactly are you trying to do with the string? we might be able to guide you to the correct pattern.

Link to comment
Share on other sites

If you scroll down, POSIX regex function from PHP.net is a good guide as well. Also check out this Regex Cheat Sheet.

 

Actually, POSIX is now discouraged: the ereg functions were deprecated in PHP 5.3 and removed in PHP 7. PCRE (the preg_ functions) is preferable.

 

@Chappers, akitchin nailed it.. Also note that even if you meant [^\w\d]+, \w by default will match a-zA-Z0-9_ (so there is no need to include the \d in there.. and depending on your locale, \w might return more than you think..) so no harm in either a) declaring that character class as: [^a-zA-Z0-9_]+ instead, or b) if you do use [^\w]+, ensure that you have your ctype setting set to 'C' prior to any regex:

 

setlocale(LC_CTYPE, 'C');
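To compare the two options side by side, here is a small sketch. The sample value in $name is hypothetical, and the comparison only claims to hold for plain ASCII input like this one.

```php
<?php
// Option b): pin the character-type locale before using \w.
setlocale(LC_CTYPE, 'C');

// Hypothetical sample input for illustration.
$name = 'My File (1).JPG';

// Option a): list the allowed characters explicitly...
$explicit = preg_replace('/[^a-zA-Z0-9_]+/', '', $name);

// ...which, for this input under the 'C' locale, gives the same
// result as negating \w.
$shorthand = preg_replace('/[^\w]+/', '', $name);

echo $explicit, "\n";   // MyFile1JPG
echo $shorthand, "\n";  // MyFile1JPG
```

Spelling the class out has the advantage that the reader can see exactly which characters are allowed without knowing what \w covers.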

 

 

 

Link to comment
Share on other sites

Sorry, should have said what my intentions were. As I grabbed it from elsewhere for this kind of thing, I gathered it was designed to keep out dodgy characters that could be used for bad purposes, perhaps before the value is sent to an SQL database.

 

Anyway, I just wanted it to allow normal letters, all numerals, and underscores, dashes, full stops, etc. Of course, wouldn't want apostrophes or quotes nor parentheses, square brackets, etc. I want the same kind of thing that email suppliers use when you sign up and enter a desired email username - you're allowed underscores, etc., but not many other things like apostrophes, colons, slashes, etc.

 

Thanks

Link to comment
Share on other sites

Anyway, I just wanted it to allow normal letters, all numerals, and underscores, dashes, full stops, etc. Of course, wouldn't want apostrophes or quotes nor parentheses, square brackets, etc. I want the same kind of thing that email suppliers use when you sign up and enter a desired email username - you're allowed underscores, etc., but not many other things like apostrophes, colons, slashes, etc.

 

This character class (given what you mentioned) should do it: [^a-zA-Z0-9_-]+ or, if you are using the i modifier after the closing delimiter (which makes the match case-insensitive), you can simply use [^a-z0-9_-]+ instead.
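Both spellings can be checked against each other with a short sketch like this one (the sample value in $name is made up for illustration):

```php
<?php
// Hypothetical sample input for illustration.
$name = "O'Brien-Smith_42";

// Upper and lower case listed explicitly in the class.
$a = preg_replace('/[^a-zA-Z0-9_-]+/', '', $name);

// Only a-z in the class, plus the i modifier after the closing delimiter.
$b = preg_replace('/[^a-z0-9_-]+/i', '', $name);

echo $a, "\n";  // OBrien-Smith_42
echo $b, "\n";  // OBrien-Smith_42
```

The apostrophe is removed in both cases, while letters of either case, digits, the underscore and the dash are kept.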

Link to comment
Share on other sites

Hi, thanks for that, I'll give it a try in a minute. I don't suppose you have the time to briefly explain a bit about the definitions and how the way it's set out works? It's just I can't find anything on the net that properly explains it and the php.net manual doesn't even give the different things you can use and what they do (the letters like D that then match whatever D stands for...).

 

I can't understand what I was using: preg_replace("/[^\W\D]+/". What do the D and W properly stand for and have they been used incorrectly together in this instance? I don't know what the forward slashes are for, nor what the ^ means. If you could even just recommend a good tutorial I'd really appreciate it. If I can understand it, I can do it myself from then on instead of bothering others... Oh, and don't know what an i modifier is, sorry. Bit confused by it all.

 

Thanks again, James

Link to comment
Share on other sites

ok.. the basic run down of what you were using: [^\W\D]+ is this..

[^...] This is a negated character class that checks to see if the current character being examined in the string is NOT ... (and in this case, not \W or \D).

 

\W is any non word character.. but to understand what is not a word character, you must know what is a word character. In this case, by default, \w is a-zA-Z0-9_ So \W is anything that is not any of those characters.

 

\D is any non digit.. so if \d is 0-9, \D is anything that is not 0-9.

 

So the pattern I last proposed, [^a-zA-Z0-9_-]+, is simply a negated character class that checks whether the current character in the string is not one of a-zA-Z0-9_- (one or more times consecutively, because of the + quantifier).
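The effect of the + quantifier is easier to see when the replacement is a visible character instead of an empty string. A rough sketch, with a made-up input string; the forward slashes around the pattern are just delimiters marking where it starts and ends:

```php
<?php
// Hypothetical sample input for illustration.
$name = 'a@@b!!c';

// With +, each run of disallowed characters is replaced once.
$withPlus = preg_replace('/[^a-zA-Z0-9_-]+/', '*', $name);

// Without +, every disallowed character is replaced individually.
$withoutPlus = preg_replace('/[^a-zA-Z0-9_-]/', '*', $name);

echo $withPlus, "\n";     // a*b*c
echo $withoutPlus, "\n";  // a**b**c
```

When the replacement is an empty string, as in the upload script, both versions give the same final result; the + just lets the engine remove a whole run in one go.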

 

You can learn more about regex from these sites:

regular-expressions

weblogtoolscollection

phpfreaks regex tutorial

phpfreaks regex resources

 

These should be more than enough to get you started.. Googling regex tutorials will obviously yield some results as well.

 

 

 

Link to comment
Share on other sites

Excellent tutorials, thanks. After having a read, I tried this experiment to help me grasp it:

<table cellpadding='5' border='1'>
<?php
$gimp = 'the-word_1234@test.com';
echo "<tr><td>this is original:</td><td>$gimp</td></tr>";
$gimp1 = preg_replace("/[^\W\D]+/", '', $gimp);
echo "<tr><td>1) this is after \W\D:</td><td>$gimp1</td></tr>";
$gimp2 = preg_replace("/[^\W]+/", '', $gimp);
echo "<tr><td>2) this is after \W:</td><td>$gimp2</td></tr>";
$gimp3 = preg_replace("/[^\D]+/", '', $gimp);
echo "<tr><td>3) this is after \D:</td><td>$gimp3</td></tr>";
$gimp4 = preg_replace("/[^\w\d]+/", '', $gimp);
echo "<tr><td>4) this is after \w\d:</td><td>$gimp4</td></tr>";
$gimp5 = preg_replace("/[^\w]+/", '', $gimp);
echo "<tr><td>5) this is after \w:</td><td>$gimp5</td></tr>";
$gimp6 = preg_replace("/[^\d]+/", '', $gimp);
echo "<tr><td>6) this is after \d:</td><td>$gimp6</td></tr>";
$gimp7 = preg_replace("/[^\w\.@-]+/", '', $gimp);
echo "<tr><td>7) this is after \w\.@-:</td><td>$gimp7</td></tr>";
$gimp8 = preg_replace("/[^\w@-\.]+/", '', $gimp);
echo "<tr><td>8) this is after \w@-\.:</td><td>$gimp8</td></tr>";
$gimp9 = preg_replace("/[^\w\.-@]+/", '', $gimp);
echo "<tr><td>9) this is after \w\.-@:</td><td>$gimp9</td></tr>";
?>
</table>

 

Which outputs:

 

this is original: the-word_1234@test.com

1) this is after \W\D: the-word_@test.com

2) this is after \W: -@.

3) this is after \D: the-word_@test.com

4) this is after \w\d: theword_1234testcom

5) this is after \w: theword_1234testcom

6) this is after \d: 1234

7) this is after \w\.@-: the-word_1234@test.com

8) this is after \w@-\.: 

9) this is after \w\.-@: theword_1234@test.com

 

What seems odd is that \W\D should be matching letters and digits but is only removing the digits. However, both \W and \D work as expected separately. \w\d should be matching non-words and non-digits and works normally even when together, and works as expected when separate. It gets odd when I try adding other characters I want to be left alone, like @, _, - and a dot. The order seems to affect whether it works or not. 7 works perfectly. 8 leaves nothing behind afterwards, so putting the escaped dot at the end of the search list doesn't work. 9 ignores the dash if it comes before the @. Weird?

Link to comment
Share on other sites

Also note that inside a character class you don't need to escape the dot, as almost all special meta characters lose their special abilities within the character class. So, using number 8 as an example and taking into account MadTechie's correction of the dash location (the dash must be the very first or last character in the class to be treated as a literal; in a negated character class, "first" means directly after the ^, i.e. the second character), it could be:

8.

[^\w@.-]+
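Running that pattern against the test string from the experiment, plus one made-up input containing unwanted characters, might look like this (a sketch only; the second input is hypothetical):

```php
<?php
// Same test string as in the experiment.
$gimp = 'the-word_1234@test.com';

// The unescaped dot inside the class is a literal dot, and the dash
// comes last, so it is a literal dash rather than a range operator.
$kept = preg_replace('/[^\w@.-]+/', '', $gimp);
echo $kept, "\n";   // the-word_1234@test.com

// Hypothetical input with characters outside the allowed set.
$other = preg_replace('/[^\w@.-]+/', '', "bad'name(1)@test.com");
echo $other, "\n";  // badname1@test.com
```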

 

Link to comment
Share on other sites

Thanks so much to everyone who's helped, the above explains a great deal to me and with the tutorials I'm learning more and more about it. Much appreciated, and my form is now doing what I want it to. I'll keep dashes at the end, and it's good to know why, not just that it should be that way. Thanks.

Link to comment
Share on other sites

I'll keep dashes at the end, and it's good to know why, not just that it should be that way.

 

Oh right..the reason for placing the dash first or last is because there cannot be a range (like say [a-z] for instance) if there is a valid value on only one side.. so if you take an example like [a-z-], the last part involving the dash would be z- but since there is nothing on the right-hand side of the dash, regex will not be able to consider this as a range. Same with the [-a-z]... the dash is missing a character on the left-hand side.

 

With negated character classes, you could use [^a-z-] or [^-a-z]. The latter is not treated as a range despite there being a character on both sides of the dash: the ^ that opens the class is a meta character whose special purpose is to negate the class, so the regex engine won't treat it as a literal range endpoint. In these cases, and in general, I prefer placing the dash last, but have seen myself sometimes place it first.. depends on the mood ;)
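The difference between a literal dash and a range can be tried out with a short sketch like the following. The input string is made up for illustration, and the third pattern is there only to show what happens when the dash sits between two characters:

```php
<?php
// Hypothetical sample input for illustration.
$s = 'a-b.c/9';

// Dash last, or directly after the ^: a literal dash in both cases.
$last  = preg_replace('/[^a-z-]+/', '', $s);
$first = preg_replace('/[^-a-z]+/', '', $s);

// Dash between two characters: a range. ".-@" covers every character
// from "." to "@" in ASCII order, which includes "/" and the digits
// but not "-" itself.
$range = preg_replace('/[^a-z.-@]+/', '', $s);

echo $last, "\n";   // a-bc
echo $first, "\n";  // a-bc
echo $range, "\n";  // ab.c/9
```

In the third case the dash is stripped from the input because it is no longer one of the allowed characters, even though it appears in the pattern.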

Link to comment
Share on other sites

Thanks for that. I wasn't hinting that I wanted an explanation, I'd see that as being rude. I was actually being sincere and saying that it was good that you'd explained about why instead of just saying to do it, because from my point of view you had explained when you said "taking into account MadTechie's correction of the dash location (which must be the very first or last character in the class to be treated as a literal [unless using a negated character class, in that case the dash must be the second or last character])". Sorry if I didn't make it clear. But thanks anyway for the expanded explanation, it all helps!

Link to comment
Share on other sites
