[SOLVED] simple filename regex

scottybwoy · February 26, 2009

Hiya,

I'm trying to create a filename based on some other data, that includes characters that aren't good for filenames. I have a simple regular expression but it keeps the "=" character. I believe the ASCII code for this is x3D if that helps. Here's my Regex Code line :

 $image = $row[3] . "_" . preg_replace("^[a-zA-Z0-9_-]^", '', $row[4]) . ".jpg|" . $row[5];

Can someone point me into how to get rid of the "=" char. Thanks in advance.

nrg_alpha · February 26, 2009

I personally wouldn't use ^ as delimiters. Simply add the equal sign to your character class (you can probably get away with using \w, which is a shorthand character class for a-zA-Z0-9_):

'#[\w=-]#'

scottybwoy · February 27, 2009

Hmm,

What's the difference between ^ and #, and why should I use one over the other?

Also using your code, with or without the = sign still leaves the = sign in when I want that stripped. Any ideas?

Thanks.

Thanks also for the \w pointer

nrg_alpha · February 27, 2009

scottybwoy wrote: "Whats the difference between ^ and # and why should I use one over the other?"

The only reason I personally wouldn't use ^ as delimiters is that ^ is commonly used to signify the beginning of a string (there are other ways to do this, granted), so imagine using a pattern like ^^[123]$^. I suppose you could use \A....\z instead of ^....$ to anchor to the start and end of the string, but using ^ as delimiters still makes things messy in my opinion.

By the way, it doesn't have to be #. A delimiter can be any non-alphanumeric, non-whitespace ASCII character (except a backslash). So delimiters are typically /...../, but can also be !.....! or ~.......~ or |.....| etc. I personally stick to #.....#, except when I use the x modifier after the closing delimiter (for free spacing / commenting within the pattern). With that modifier, # inside the pattern starts a comment, so in that case I would not use #.....# as delimiters.
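To make the delimiter point concrete, here is a minimal sketch. It assumes the sample string that appears later in this thread and a small character class of =, / and -; the variable names are only for illustration. It runs the same class through several delimiter choices, including one with the x modifier where # is left free to act as a comment marker.

```php
<?php
// Sample input borrowed from this thread.
$str = 'FGH/3456-z==';

// The same character class wrapped in different delimiters.
$patterns = [
    '#[=/-]#',                      // hash delimiters
    '~[=/-]~',                      // tilde delimiters
    '![=/-]!',                      // exclamation mark delimiters
    '/[=\/-]/',                     // slash delimiters: a literal / in the pattern must be escaped
    '~[=/-] # strip =, / and - ~x', // x modifier: # starts a comment, so # is not used as the delimiter
];

$results = [];
foreach ($patterns as $pattern) {
    $results[] = preg_replace($pattern, '', $str);
}

// Every entry should be the same string, since only the delimiters differ.
print_r($results);
```

The choice of delimiter does not change what the pattern matches; it only changes which character has to be escaped inside the pattern.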

scottybwoy wrote: "Also using your code, with or without the = sign still leaves the = sign in when I want that stripped. Any ideas?"

Could you post a (small) sample of what string you are using that is causing you problems and the code that checks this using preg? (don't post the whole code of your entire page, just those small specific sections). Because it should take out the equal sign in your string since the equal sign is listed within the character list.

laffin · February 27, 2009

i think the regex ya need is '@[^\w=-]@'

which just grabs the characters not in the list

note the delimiters, and alpha's explanation above.

nrg_alpha · February 27, 2009

My understanding is that the OP wants to remove those characters, not anything but those.

laffin · February 27, 2009

Wrong way, he wants to remove anything but a-z 0-9 - .

yer regex '\w-' matches those characters, which are valid filename characters

so ya have to strip out everything else
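A small sketch of the difference being argued here, using a hypothetical input string picked only for illustration: a positive class removes the characters it lists, while a negated class (leading ^ inside the brackets) removes everything except them.

```php
<?php
// Hypothetical input, chosen only to illustrate the two classes.
$input = 'abc/12-3==';

// Positive class: the listed characters (word characters and -) are matched and removed.
$positive = preg_replace('@[\w-]@', '', $input);

// Negated class: everything EXCEPT word characters and - is matched and removed.
$negated = preg_replace('@[^\w-]@', '', $input);

echo $positive, "\n"; // what survives the positive class
echo $negated, "\n";  // what survives the negated class
```

With the positive class only the unwanted characters are left behind, which matches the symptom in the first post, where the "=" kept surviving.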

scottybwoy · February 27, 2009

Sorry, the file I'm working on is at work.

But, it gets some data out of a csv file and I use some of the data to get a remote image file and create a local one. It takes manufacturers code names, but some have unsightly characters in, so I want to strip them for the filename.

An example would be FGH/3456-z==

I want that to become FGH3456-z

Thanks also for the info.

nrg_alpha · February 27, 2009

Oh ok.. well, in that case, if your string is 'FGH/3456-z==', you can simply use something like as an example:

$str = 'FGH/3456-z==';
$newStr = str_replace('=', '', $str);
echo $newStr; // outputs FGH/3456-z

This of course assumes that you only want to take out the = (which doesn't require regex).

EDIT If however, you wanted to say take out multiple characters like / = and - for example, then you can use something like preg_replace as such:

$str = 'FGH/3456-z==';
$newStr = preg_replace('#[=/-]#', '', $str);
echo $newStr; // outputs FGH3456z
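As a side note, the multi-character case can also be done without regex. This is only a sketch of an alternative, not what was posted above: str_replace accepts an array of search strings, so the same three characters can be stripped in one call.

```php
<?php
$str = 'FGH/3456-z==';

// str_replace accepts an array of search strings; each one is replaced by ''.
$newStr = str_replace(['=', '/', '-'], '', $str);

echo $newStr;
```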

laffin · February 27, 2009

I think this will work as well

$string = preg_replace('@[^\w-]@', '', $string);

remember yer not just dealing with = \ u may get other characters that u may not want as well, so easier to assign a list of valid characters, and use the NOT operator.

I think \w handles AlphaNums, _ can't remember about - & .

nrg_alpha · February 28, 2009

laffin wrote: "I think this will work as well
$string=preg_replace('@[^\w-]@','',$string)
remember yer not just dealing with = \ u may get other characters that u may not want as well, so easier to assign a list of valid characters, and use the NOT operator."

Actually, a quick speed test shows my method to be faster for this input. Keep in mind that it is more efficient for regex to replace only a small set of characters in a positive character class than to sift through a larger negative character class to avoid specific characters. In fact, ALL character classes are actually positive assertions: even a negative character class must make a successful positive match (read: it must positively ID a character that is not in the list before it can match).

Consider this simple timed sample:

$str = 'FGH/3456-z==';
$loop = 1000; // number of iterations per pattern

echo 'Initial string: ' . $str . "<br /><br />\n";

// Test 1: positive character class listing only the characters to strip
$time_start = microtime(true);
for ($i = 0; $i < $loop; $i++) {
    $newStr = preg_replace('#[=/-]#', '', $str);
}
$time_end = microtime(true);
$elapsed_time = round($time_end - $time_start, 4);
echo 'Elapsed time:' . $elapsed_time . '<br />';
echo '<span style="color:#A8A8A8">Output using #[=/-]#:</span> ' . $newStr . '<br />-------------------------<br />';

// Test 2: negated character class that strips everything except \w
$time_start = microtime(true);
for ($i = 0; $i < $loop; $i++) {
    $newStr = preg_replace('#[^\w]#', '', $str);
}
$time_end = microtime(true);
$elapsed_time = round($time_end - $time_start, 4);
echo 'Elapsed time:' . $elapsed_time . '<br />';
echo '<span style="color:#A8A8A8">Output using #[^\w]#: </span>' . $newStr . '<br />-------------------------<br />';

Sample Output:

Initial string: FGH/3456-z==

Elapsed time:0.014
Output using #[=/-]#: FGH3456z
-------------------------
Elapsed time:0.0187
Output using #[^\w]#: FGH3456z

Both give the same end results, but on the whole, the shorter positive character class is faster.

Take the sample string FGH/3456-z== and suppose we want to get rid of the /, - and = characters. The timed test code takes two approaches: my positive character class [=/-], and your negated character class [^\w]. Over 1000 iterations, the positive class ran faster in this sample. Sure, we may be splitting hairs on a single pass of execution, but the point is, if you only need to check a very small list of characters to replace, it's faster to use a positive character class as opposed to going heavier on a negated character class. Granted, if the list of characters to be eliminated gets too long, this might start to favor a negated character class instead, but in this thread, the list of characters is small enough that the positive character class is actually the faster choice.

laffin wrote: "I think \w handles AlphaNums, _ can't remember about - & ."

\w = a-zA-Z0-9_
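If you want to check this for yourself, here is a quick sketch that tests a handful of single characters against \w. It assumes the default, non-Unicode mode (no u modifier); the list of candidate characters is just an illustration.

```php
<?php
// Candidate characters to test; the list is only illustrative.
$candidates = ['a', 'Z', '7', '_', '-', '.', '='];

$isWord = [];
foreach ($candidates as $char) {
    // preg_match returns 1 when the single character is matched by \w.
    $isWord[$char] = preg_match('#^\w$#', $char) === 1;
}

var_export($isWord);
```

Letters, digits and the underscore come back true; the hyphen, the dot and the equal sign come back false, which is why the hyphen has to be listed explicitly in [^\w-].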

laffin · February 28, 2009

But yer missing the point I made.

The whitelist method is better because ya know what characters ya want to accept a-z0-9_-

so yer method only is stripping out 3 characters, but when ya add more, of the possible 256 characters in an ascii chart it just becomes ridiculous.

so no longer are u looking at a simple pattern of '-/='

but a huge one to strip control characters, spaces, and symbols, when we know we just want 64 of the 256 characters.

This is a prime example of when to use a whitelist over a blacklist.
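A rough sketch of that argument, using a hypothetical CSV cell containing characters that nobody listed in advance: the blacklist only strips what it was told about, while the whitelist strips anything it does not recognise.

```php
<?php
// Hypothetical CSV cell containing characters nobody planned for.
$cell = 'AB:12*z?';

// Blacklist: removes only =, / and -, so the other symbols slip through.
$blacklisted = preg_replace('#[=/-]#', '', $cell);

// Whitelist: removes everything that is not a word character or a hyphen.
$whitelisted = preg_replace('@[^\w-]@', '', $cell);

echo $blacklisted, "\n";
echo $whitelisted, "\n";
```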

nrg_alpha · February 28, 2009

I got your point. Hence, from my last post:

"Granted, if the list of characters to be eliminated gets too long, this might start to favor a negated character class instead.."

laffin wrote: "The whitelist method is better because ya know what characters ya want to accept a-z0-9_-"

I guess this is a 'glass half empty, glass half full' kind of thing from a perception standpoint. One might indeed find it easier to know which characters to keep, while someone else might know, I just need to remove this, this and that (everything else stays). It all depends on what needs to stay versus what needs to go. In this particular case, if it's only a handful of characters, I think it's easier (and it definitely is faster) to use a positive character class instead. Otherwise, yes, a negated one is better.

scottybwoy · March 2, 2009

This is all good stuff here, and I'm learning more about timing and Regex at the same time. This is by far the best php forum I've used so far, keep it up.

However, just to clarify and clear things up: nrg_alpha's timing comparison only took out 3 characters, according to my example.

I want to keep only characters that are allowed in a Windows/Linux filename but the input string taken could be any character allowed in a csv cell. As there are a lot more characters to sift through I think it would be best to use laffin's regex '@[^\w=-]@' in this case. But appreciate everyone else's contribution.

Are we in agreement, or have I missed something?

Thanks

laffin · March 2, 2009

Oh I'm not arguing that point.

It's good to hear different viewpoints on the same topic. Sometimes ya learn a lot more than from a single viewpoint of the matter....

nrg_alpha · March 2, 2009

Ok, I was under the impression you wanted only a small handful of known characters to eliminate.

As mentioned before.. if the list of characters to be eliminated is large (perhaps even the term unknown would fit the bill), then negated character classes would do better.

I am rather confused on the = though. In your last post, you wanted to get rid of the equal sign... but if you use '@[^\w=-]@' the equal sign is protected (it will not be removed because it is inside a negated character class). I think you mean you want to use Laffin's @[^\w-]@ pattern from reply #9, as this one will remove the equal sign.

scottybwoy · March 3, 2009

Yes, you are correct, sorry that was my copy/paste typo. I am in fact using Laffin's @[^\w-]@ pattern and it is working perfectly. Thanks for all your help.

By the way I meant in agreement that @[^\w-]@ would be the best pattern to use, not the different viewpoints. lol.
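For anyone finding this later, here is a sketch of the line from my first post with that pattern dropped in. The $row values are made up for illustration (the manufacturer name and URL are hypothetical); only the pattern and the overall shape of the line come from this thread.

```php
<?php
// Hypothetical CSV row: index 3 = manufacturer, 4 = product code, 5 = remote image URL.
$row = [
    3 => 'ACME',
    4 => 'FGH/3456-z==',
    5 => 'http://example.com/remote.jpg',
];

// Keep only word characters and hyphens in the product code.
$safeCode = preg_replace('@[^\w-]@', '', $row[4]);

// Same layout as the original line: local filename, a pipe, then the remote source.
$image = $row[3] . '_' . $safeCode . '.jpg|' . $row[5];

echo $image;
```

Only $row[4] is cleaned here, as in the original line; $row[3] would need the same treatment if it can also contain unsafe characters.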

laffin · March 3, 2009

It's all good viewpoints, cuz sometimes ya learn something that never crossed yer mind.

I think the = was in the first couple of posts, so could be the mixup why it kept coming back to haunt us...

Anyways, have a nice day all
