Hiya,

 

I'm trying to create a filename from some other data that includes characters that aren't good for filenames.  I have a simple regular expression, but it keeps the "=" character.  I believe the ASCII code for this is 0x3D, if that helps.  Here's my regex line:

 $image = $row[3] . "_" . preg_replace("^[a-zA-Z0-9_-]^", '', $row[4]) . ".jpg|" . $row[5];

 

Can someone point me to how to get rid of the "=" character?  Thanks in advance.


Hmm,

 

What's the difference between ^ and #, and why should I use one over the other?

 

Also, using your code, with or without the = sign, still leaves the = sign in when I want it stripped.  Any ideas?

 

Thanks.

 

Thanks also for the \w pointer

Hmm,

 

What's the difference between ^ and #, and why should I use one over the other?

 

The only reason I personally wouldn't use ^ as a delimiter is that ^ is commonly used to signify the beginning of a string (granted, there are other ways to do this). Imagine a pattern like ^^[123]$^: PHP takes the second ^ as the closing delimiter, so the anchor never gets a chance to act as one. I suppose you could use \A....\z instead of ^....$ to anchor the match to the start and end of the string, but using ^ as a delimiter still makes things messy in my opinion.
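
A small sketch of what the caret delimiters do to the line from the first post, using the sample string given later in the thread (the variable names are only for illustration). Because the two ^ characters are consumed as delimiters, the pattern that actually runs is the positive class [a-zA-Z0-9_-], so the allowed characters are the ones that get deleted:

```php
<?php
$str = 'FGH/3456-z==';

// The carets are delimiters here, so the class is NOT negated:
// every letter, digit, underscore and hyphen is removed.
$caretDelimited = preg_replace('^[a-zA-Z0-9_-]^', '', $str);

// With a different delimiter, the ^ inside the brackets negates the class:
// everything EXCEPT those characters is removed.
$negatedClass = preg_replace('#[^a-zA-Z0-9_-]#', '', $str);

echo $caretDelimited, "\n"; // /==
echo $negatedClass, "\n";   // FGH3456-z
```

That would explain why the "=" survives in the original line: it is one of the few characters the positive class does not touch.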

 

By the way, it doesn't have to be #. A delimiter can be any non-alphanumeric, non-whitespace ASCII character (except a backslash). Delimiters are typically /...../, but can also be !.....! or ~.......~ or |.....| etc. I personally stick to #.....#, except when I use an x modifier after the closing delimiter (for free spacing / commenting within the pattern). In that mode, # inside the pattern starts a comment, so I would not use #.....# as delimiters there.
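
To make the delimiter choice concrete, here is a rough sketch that runs the same character class under several delimiters, plus one free-spacing (x modifier) variant. The sample string is the one from later in the thread; the array and variable names are made up for the example.

```php
<?php
$str = 'FGH/3456-z==';

// The same character class written with different delimiters.
// With / as the delimiter, the literal slash inside the class must be escaped.
$patterns = ['/[=\/-]/', '#[=/-]#', '~[=/-]~', '![=/-]!', '|[=/-]|'];

$results = [];
foreach ($patterns as $pattern) {
    $results[$pattern] = preg_replace($pattern, '', $str);
}

// With the x modifier, whitespace in the pattern is ignored and # starts a
// comment, so a delimiter other than # is used here.
$commented = preg_replace('~
    [=/-]   # one of: equals sign, slash, hyphen
~x', '', $str);

print_r($results);       // every entry is FGH3456z
echo $commented, "\n";   // FGH3456z
```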

 

Also, using your code, with or without the = sign, still leaves the = sign in when I want it stripped.  Any ideas?

 

Could you post a (small) sample of the string that is causing you problems, and the code that checks it using preg? (Don't post the code of your entire page, just those small specific sections.) It should take out the equal sign in your string, since the equal sign is listed within the character class.

 

 

 

Sorry, the file I'm working on is at work.

 

But, it gets some data out of a csv file and I use some of the data to get a remote image file and create a local one.  It takes manufacturers code names, but some have unsightly characters in, so I want to strip them for the filename.

 

An example would be FGH/3456-z==

 

I want that to become FGH3456-z

 

Thanks also for the info.

Oh ok.. well, in that case, if your string is 'FGH/3456-z==', you can simply use something like this:

$str = 'FGH/3456-z==';
$newStr = str_replace('=', '', $str);
echo $newStr; // outputs FGH/3456-z

 

This of course assumes that you only want to take out the = (which doesn't require regex).

 

EDIT: If, however, you wanted to take out multiple characters, say / = and - for example, then you can use preg_replace like so:

 

$str = 'FGH/3456-z==';
$newStr = preg_replace('#[=/-]#', '', $str);
echo $newStr; // outputs FGH3456z
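
As a side note, str_replace also accepts an array of search strings, so a short fixed list of characters can be removed without regex as well. A minimal sketch using the same sample string:

```php
<?php
$str = 'FGH/3456-z==';

// Each string in the array is replaced with the empty string.
$newStr = str_replace(['=', '/', '-'], '', $str);

echo $newStr, "\n"; // FGH3456z
```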

I think this will work as well

 

$string = preg_replace('@[^\w-]@', '', $string);

 

Remember, you're not just dealing with = and \; you may get other characters that you don't want as well, so it's easier to define a list of valid characters and use the NOT operator (a negated character class).

 

I think \w handles alphanumerics and _; can't remember about - and .
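
A quick sketch of that pattern on the sample string from this thread, followed by a check of which single characters \w matches. This assumes plain ASCII input and the default locale; the $matches array is only there for the demonstration.

```php
<?php
$str = 'FGH/3456-z==';

// Negated class: remove everything that is NOT a word character or a hyphen.
$clean = preg_replace('@[^\w-]@', '', $str);
echo $clean, "\n"; // FGH3456-z

// Which single characters does \w match?
$matches = [];
foreach (['a', 'Z', '7', '_', '-', '.', '='] as $ch) {
    $matches[$ch] = preg_match('@^\w$@', $ch) === 1;
}
var_export($matches); // true for a, Z, 7 and _; false for -, . and =
```

So the hyphen has to be listed explicitly in the class if it should survive, which is what [^\w-] does.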

 

 

I think this will work as well

 

$string = preg_replace('@[^\w-]@', '', $string);

 

Remember, you're not just dealing with = and \; you may get other characters that you don't want as well, so it's easier to define a list of valid characters and use the NOT operator (a negated character class).

 

Actually, in my speed tests my method comes out faster. Keep in mind that it is more efficient for regex to replace a small set of characters listed in a positive character class than to sift through a larger negated character class to avoid specific characters. (In fact, ALL character classes are positive assertions: even a negated character class must make a successful match, that is, it must positively identify a character that is not in its list.)

 

Consider this simple timed sample:

$str = 'FGH/3456-z==';
$loop = 1000; // number of iterations per pattern

echo 'Initial string: ' . $str . "<br /><br />\n";

// Positive character class: remove only =, / and -
$time_start = microtime(true);
for ($i = 0; $i < $loop; $i++) {
    $newStr = preg_replace('#[=/-]#', '', $str);
}
$time_end = microtime(true);
$elapsed_time = round($time_end - $time_start, 4);
echo 'Elapsed time:' . $elapsed_time . '<br />';
echo '<span style="color:#A8A8A8">Output using #[=/-]#:</span> ' . $newStr . '<br />-------------------------<br />';

// Negated character class: remove everything that is not a word character
$time_start = microtime(true);
for ($i = 0; $i < $loop; $i++) {
    $newStr = preg_replace('#[^\w]#', '', $str);
}
$time_end = microtime(true);
$elapsed_time = round($time_end - $time_start, 4);
echo 'Elapsed time:' . $elapsed_time . '<br />';
echo '<span style="color:#A8A8A8">Output using #[^\w]#: </span>' . $newStr . '<br />-------------------------<br />';

 

Sample Output:

Initial string: FGH/3456-z==

Elapsed time:0.014
Output using #[=/-]#: FGH3456z
-------------------------
Elapsed time:0.0187
Output using #[^\w]#: FGH3456z

 

Both give the same end result, but on the whole, the shorter positive character class is faster.

 

So we start with the sample string FGH/3456-z== and suppose we want to get rid of the /, - and = characters. The timed test code takes two approaches: my positive character class [=/-], and your negated character class [^\w]. Notice which one executes faster over a 1000-iteration loop? Sure, we may be splitting hairs on a single pass of execution, but the point is that if you only need to check a very small list of characters to replace, it's faster to use a positive character class than to go heavier on a negated one. Granted, if the list of characters to be eliminated gets too long, this might start to favor a negated character class instead, but in this thread, the list of characters is small enough that the positive character class is actually the faster choice.
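
For anyone who wants to repeat the comparison, here is a rough sketch that wraps the timing loop in a helper and uses a larger iteration count to smooth out noise. The helper name timePattern and the count of 100000 are assumptions for illustration, not part of the test above, and the timings will differ by machine and PHP version.

```php
<?php
// Hypothetical helper: run one pattern $iterations times,
// return [elapsed seconds, last result].
function timePattern(string $pattern, string $subject, int $iterations): array
{
    $result = '';
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $result = preg_replace($pattern, '', $subject);
    }
    return [microtime(true) - $start, $result];
}

$str = 'FGH/3456-z==';
$outputs = [];

foreach (['#[=/-]#', '#[^\w]#'] as $pattern) {
    [$seconds, $result] = timePattern($pattern, $str, 100000);
    printf("%-10s %.4f s  %s\n", $pattern, $seconds, $result);
    $outputs[$pattern] = $result;
}
```

Note that the two patterns only agree on this particular string; on other input they remove different sets of characters, so the timing alone does not make them interchangeable.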

 

I think \w handles alphanumerics and _; can't remember about - and .

 

\w = a-zA-Z0-9_

But you're missing the point I made.

The whitelist method is better because you know which characters you want to accept: a-zA-Z0-9_-

Your method only strips out 3 characters, but when you add more of the possible 256 characters in an (extended) ASCII chart, it just becomes ridiculous.

You're no longer looking at a simple pattern of '-/='

but a huge one to strip control characters, spaces, and symbols, when we know we just want 64 of the 256 characters.

 

This is a prime example of when to use a whitelist over a blacklist.
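
A small sketch of the difference, using a made-up input string that contains characters the blacklist did not anticipate (a space, *, ? and a dot):

```php
<?php
// Hypothetical input, not from the thread.
$str = 'AB C*12?.x==';

$blacklisted = preg_replace('#[=/-]#', '', $str);  // remove only known-bad characters
$whitelisted = preg_replace('@[^\w-]@', '', $str); // keep only known-good characters

echo $blacklisted, "\n"; // AB C*12?.x
echo $whitelisted, "\n"; // ABC12x
```

The blacklist lets the unexpected characters through, while the whitelist drops them without needing to know about them in advance.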

I got your point. Hence, from my last post:

Granted, if the list of characters to be eliminated gets too long, this might start to favor a negated character class instead..

 

The whitelist method is better because you know which characters you want to accept: a-zA-Z0-9_-

 

I guess this is a 'glass half empty, glass half full' kind of thing from a perception standpoint. One person might find it easier to know which characters to keep, while someone else might know, 'I just need to remove this, this and that' (everything else stays). It all depends on what needs to stay versus what needs to go. In this particular case, if it's only a handful of characters, I think it's easier (and it definitely is faster) to use a positive character class. Otherwise, yes, a negated one is better.

This is all good stuff here, and I'm learning more about timing and Regex at the same time.  This is by far the best php forum I've used so far, keep it up.

 

However, just to clarify and clear things up: nrg_alpha's timing comparison only took out 3 characters, following my example.

 

I want to keep only characters that are allowed in a Windows/Linux filename but the input string taken could be any character allowed in a csv cell.  As there are a lot more characters to sift through I think it would be best to use laffin's regex '@[^\w=-]@' in this case.  But appreciate everyone else's contribution.

 

Are we in agreement, or have I missed something?

 

Thanks

Ok, I was under the impression you wanted only a small handful of known characters to eliminate.

As mentioned before, if the list of characters to be eliminated is large (perhaps even the term unknown would fit the bill), then negated character classes would do better.

 

I am rather confused on the = though. In your last post, you wanted to get rid of the equal sign... but if you use '@[^\w=-]@' the equal sign is protected (it will not be removed because it is inside a negated character class). I think you mean you want to use Laffin's @[^\w-]@ pattern from reply #9, as this one will remove the equal sign.
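
The difference between the two patterns can be checked directly on the sample string from this thread (variable names are just for illustration):

```php
<?php
$str = 'FGH/3456-z==';

// "=" is listed inside the negated class, so it is kept.
$keepsEquals = preg_replace('@[^\w=-]@', '', $str);

// "=" is not listed, so it is removed along with "/".
$stripsEquals = preg_replace('@[^\w-]@', '', $str);

echo $keepsEquals, "\n";  // FGH3456-z==
echo $stripsEquals, "\n"; // FGH3456-z
```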

Yes, you are correct, sorry that was my copy/paste typo.  I am in fact using Laffin's @[^\w-]@ pattern and it is working perfectly.  Thanks for all your help.

 

By the way I meant in agreement that @[^\w-]@ would be the best pattern to use, not the different viewpoints. lol. ;)
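
Putting it together, the line from the first post might look something like this with the agreed pattern. The $row values below are invented placeholders (only indexes 3, 4 and 5 are used, as in the original line), and this sketch only cleans $row[4]:

```php
<?php
// Hypothetical CSV row; the real data comes from the csv file.
$row = [null, null, null, 'acme', 'FGH/3456-z==', 'http://example.com/remote.jpg'];

// Keep only word characters and hyphens in the manufacturer's code.
$code = preg_replace('@[^\w-]@', '', $row[4]);

$image = $row[3] . '_' . $code . '.jpg|' . $row[5];

echo $image, "\n"; // acme_FGH3456-z.jpg|http://example.com/remote.jpg
```

If $row[3] can also contain awkward characters, the same replacement could be applied to it as well.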
