nrg_alpha
Posted April 22, 2009

Quote:
    (also it should be [-_]. - should always come first in a class unless you want a black hole to open up and swallow the universe).

Actually, the order doesn't matter (be it [-_] or [_-]), because it is perfectly acceptable to have the dash as the very first or very last character listed in a character class. It's when the dash sits somewhere in the middle (not possible in this case, as we are only dealing with 2 characters) that we have to worry about inadvertently creating a range when the intent is to treat it as a literal.
.josh
Posted April 22, 2009

In my experience I have had adverse side effects from putting the dash as the last character. For instance, with [a-], sometimes the engine will treat it as "a or -" and sometimes it will treat it as "a to anything above it". I haven't fully investigated it, but I swear I've made patterns that just did not seem to work until I moved the - from the end to the beginning.
nrg_alpha
Posted April 22, 2009

Hmm.. I've never had that issue. Every time I tested the dash in the first or last position, it was treated as a literal. My understanding is that for a range to be considered, there has to be a character on both sides of the dash. That is why the first and last positions are safe: in both cases a character is missing on one side, so the engine treats the dash as a literal, not a range.

EDIT - If you stumble upon a pattern that treats this differently, I would be very interested in seeing it (perhaps there is some very oddball condition that causes rare issues like that). If so, knowing about it would help in avoiding that situation the next time around.
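For anyone who wants to check the first/last position behaviour for themselves, here is a minimal sketch. The helper name matchedChars() and the test string are made up for illustration (they are not from the thread); the helper just reports which characters of a string a given character class accepts.

```php
<?php
// Hypothetical helper (not from the thread): returns the characters of
// $candidates that the given character class matches, in their original order.
function matchedChars(string $class, string $candidates): string
{
    $out = '';
    foreach (str_split($candidates) as $ch) {
        // Anchor the class so it is tested against exactly one character.
        if (preg_match('/^' . $class . '$/', $ch) === 1) {
            $out .= $ch;
        }
    }
    return $out;
}

$candidates = 'a-_bz';

echo matchedChars('[-_]', $candidates), "\n";  // dash first: literal dash and underscore
echo matchedChars('[_-]', $candidates), "\n";  // dash last: same two characters
echo matchedChars('[a-]', $candidates), "\n";  // dash last after a letter: 'a' and '-'
echo matchedChars('[a-z]', $candidates), "\n"; // dash in the middle: the range a to z
```

With PHP's PCRE functions, the first three classes should only ever pick up the characters literally listed, and only the last one behaves as a range.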
trace
Posted April 23, 2009

Quote:
    the OP said it had to end with a letter, so the end of the pattern had to have 2 separate classes. No reason the first one can't be combined though... also, you forgot the ? after the [_-] as OP said it's optional

Ow... How did I miss that? :oops: I actually missed your working solution also...

    preg_match('/^[a-z\d]+[-_]?[a-z\d]*[a-z]$/i',$string);

And of course, we could, as before, debate about whether to make the + greedy or not. I'm no expert at that. I think it should be left up to the regexp engine to decide how to handle it, as we don't care whether it is greedy or not in this case.

And I'd be willing to argue that the + should be greedy. Then it will only need to backtrack one character, as everything except the last [a-z] is optional, right? If we make the + non-greedy, then the [a-z\d]* has to backtrack if it finds a [-_]...

And the engine might optimize the end of the regexp anyway? I could imagine the engine doing something like

    ( /^[a-z\d]/i && /[a-z]$/i && (/^[a-z\d]*[-_]?[a-z\d]*$/ ~ $string_minus_first_and_last_character) )

if it wanted to optimize.

Now that we have the OP's problem solved: how would you solve it with one regexp if multiple - or _ are valid, but not the same one directly after itself? I.e. /--+/ and /__+/ are invalid, but "-_" and "_-" are valid.

I would probably take the easy way out and write it as two regexps:

    /^[a-z\d][-_a-z\d]+[a-z]$/i and not /--+|__+/

For a single regexp, something along the lines of

    /^[a-z\d][-_]?([a-z\d][-_]?)*+[a-z]$/i

but this would also mark "-_" and "_-" as invalid...
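The two-regexp "easy way out" from the post above can be sketched in PHP as follows. The function names are made up for illustration. The second function is only one possible way to fold the two checks into a single pattern with a negative lookahead; it is an assumption on my part and was not proposed by anyone in the thread.

```php
<?php
// Hypothetical function name; the two patterns are the ones from the post above.
function isValidName(string $s): bool
{
    // Overall shape: starts with a letter or digit, ends with a letter,
    // with letters, digits, - or _ in between.
    $shapeOk = preg_match('/^[a-z\d][-_a-z\d]+[a-z]$/i', $s) === 1;
    // Reject a run of two or more identical separators.
    $hasRepeat = preg_match('/--+|__+/', $s) === 1;
    return $shapeOk && !$hasRepeat;
}

// Assumption, not from the thread: the same two checks in one pattern,
// using a negative lookahead to refuse "--" or "__" anywhere in the string.
function isValidNameSingle(string $s): bool
{
    return preg_match('/^(?!.*(?:--|__))[a-z\d][-_a-z\d]+[a-z]$/i', $s) === 1;
}

foreach (['ab-_cd', 'ab_-cd', 'ab--cd', 'ab__cd', 'a-b-c', 'abc-'] as $s) {
    echo str_pad($s, 8), isValidName($s) ? 'valid' : 'invalid', "\n";
}
```

Note that both versions inherit the shape pattern's minimum length of three characters, since [-_a-z\d]+ needs at least one character between the first and the last.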
nrg_alpha
Posted April 23, 2009

Quote:
    preg_match('/^[a-z\d]+[-_]?[a-z\d]*[a-z]$/i',$string);
    and of course, we could, as before, debate about whether to make the + greedy or not. I'm no expert at that. And I think it should be left up to the regexp engine to decide how to handle it, as we don't care if it is greedy or not in this case.

While we might not care whether something is greedy or not in certain circumstances, the regex engine simply cannot 'decide' how to handle greedy vs non-greedy quantifiers.. a quantifier is ultimately either greedy or not. Regex is a very terse language that follows things down to the wire, so to speak. So it is up to us to exploit regex's strengths and avoid its weaknesses.

Quote:
    And I'd be willing to argue that the + should be greedy. Then it will only need to backtrack one character as everything except the last [a-z] is optional, right?

Actually, in this case it won't matter, as both versions (greedy + vs non-greedy +) yield the same amount of backtracking (you'll see in my examples below):

Quote:
    If we make the + nongreedy, then the [a-z\d]* has to backtrack if it finds a [-_]...

No, because the [-_] comes first in the pattern (as an optional) prior to [a-z\d]*..

Here comes the non-greedy backtracking explanation.. Assuming the pattern uses the lazy + quantifier as such:

    preg_match('/^[a-z\d]+?[-_]?[a-z\d]*[a-z]$/i',$string);

and assuming that $string = 'a454hui43P', here's the breakdown of what ultimately matches what:

    [a-z\d]+?  matches  a
    [-_]?      matches  nothing
    [a-z\d]*   matches  454hui43
    [a-z]      matches  P

So if the + is lazy, it is important to understand that all it really needs is one qualifying character. Ordinarily the + alone is greedy: it requires at least one character but grabs as many as it can. That is not the case here, so after the first character is matched, this part is finished. Since [-_]? is optional (and there is none in this case), the next item in the pattern is [a-z\d]*.
So you can probably guess what happens with this.. it is greedy, so this part picks up from where [a-z\d]+? left off, and thus initially matches all the rest of the way to the end of the string:

    [a-z\d]+?  holds  a
    [a-z\d]*   holds  454hui43P

But finally, there is [a-z]$ to try and satisfy, so the engine must backtrack once and check whether that single final character fits in [a-z]. If so, that part gets the final character and the pattern is complete:

    [a-z\d]+?  matches  a
    [a-z\d]*   matches  454hui43
    [a-z]      matches  P

The pure code way to check could be:

    $str = 'a454hui43P';
    preg_match('#^([a-z\d]+?)([-_]?)([a-z\d]*)([a-z])$#i', $str, $match);
    echo "<pre>".print_r($match, true);

But what if the + is greedy? Using the same assumptions as above:

    $string = 'a454hui43P';
    preg_match('/^[a-z\d]+[-_]?[a-z\d]*[a-z]$/i',$string);

The end matched results:

    [a-z\d]+  matches  a454hui43
    [-_]?     matches  nothing
    [a-z\d]*  matches  nothing
    [a-z]     matches  P

Why is this the case? Well, since [a-z\d]+ is greedy, and the entire string fits this part of the pattern, everything in the string (from start to finish) is matched by it. But again, there is more in the pattern for the regex engine to try and satisfy.. so first up, [-_]?. Since it's optional and not present, there is nothing for [a-z\d]+ to relinquish to satisfy this. On to the next part: [a-z\d]*. Same kind of story really... since this requires a minimum of zero (and the entire string at this point is taken up), nothing happens. Finally, we have [a-z]$. For the pattern to succeed, the engine needs a single character to be matched by [a-z]. Therefore, the regex engine backtracks one character to see if that character indeed matches [a-z], and if so, that character is relinquished from [a-z\d]+ and matched via [a-z]$.

Once again, the code shows this:

    $str = 'a454hui43P';
    preg_match('#^([a-z\d]+)([-_]?)([a-z\d]*)([a-z])$#i', $str, $match);
    echo "<pre>".print_r($match, true);

So as you can see.. both the greedy and non-greedy versions of the + quantifier actually yield the same amount of backtracking.. one. If there is a - or _ in the string, then this obviously changes the dynamics of things.
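To see what each group ends up holding for both versions side by side, including a string that does contain a separator, something like the following sketch could be used. The helper captureGroups() and the second sample string 'abc-de9f' are made up for illustration; the two patterns are the capturing versions already shown above.

```php
<?php
// Hypothetical helper: returns capture groups 1 to 4, or null when there is no match.
function captureGroups(string $pattern, string $subject): ?array
{
    if (preg_match($pattern, $subject, $m) !== 1) {
        return null;
    }
    // $m[0] is the whole match; groups 1..4 follow it.
    return array_slice($m, 1, 4);
}

$lazy   = '#^([a-z\d]+?)([-_]?)([a-z\d]*)([a-z])$#i';
$greedy = '#^([a-z\d]+)([-_]?)([a-z\d]*)([a-z])$#i';

foreach (['a454hui43P', 'abc-de9f'] as $subject) {
    echo $subject, "\n";
    echo '  lazy:   ', json_encode(captureGroups($lazy, $subject)), "\n";
    echo '  greedy: ', json_encode(captureGroups($greedy, $subject)), "\n";
}
```

For 'a454hui43P' the two versions split the string differently between the first and third groups, as described above. For 'abc-de9f' both should end up with the same groups ("abc", "-", "de9", "f"), because the separator pins down where the first group has to stop; only the route the engine takes to get there differs.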
Quote:
    And the engine might anyway optimize the end of the regexp? I could imagine the engine might do a ( /^[a-z\d]/i && /[a-z]$/i && (/^[a-z\d]*[-_]?[a-z\d]*$/ ~ $string_minus_first_and_last_character) ) if it wanted to optimize.

Huh?

Quote:
    Now that we have the OP's problem solved, how would you with one regexp solve it if multiple - or _ are valid, but not after one another. Ie /--+/ and /__+/ are invalid, but "-_" and "_-" are valid. I would probably take the easy way out and write it in two regexps: /^[a-z\d][-_a-z\d]+[a-z]$/i and not /--+|__+/ Something along the lines of /^[a-z\d][-_]?([a-z\d][-_]?)*+[a-z]$/i but this would also mark "-_" and "_-" as invalid...

Hmm.. Don't know about the others, but I'll have to mull that one over and get back to you.