lwc (Author) Posted June 13, 2008

I want to match a whole word, but I can't get it to work in either ereg or preg when the word is in Unicode.

    $pattern = "pattern";
    $text = "a phrase that contains the word pattern as a whole word.";
    if ($pattern == utf8_encode($pattern)) {
        // The following patterns only work if $pattern is in pure Latin letters
        $ereg_pattern = "[[:<:]]{$pattern}[[:>:]]";
        $preg_pattern = "/\b$pattern\b/i";
    } else {
        $ereg_pattern = ??????????????????????
        $preg_pattern = ??????????????????????
        // Note: "/\b$pattern\b/u" does NOT work - see below.
    }
    // Now I can highlight the pattern
    $highlight_by_ereg = eregi_replace($ereg_pattern, '<font class="highlight">\\0</font>', $text);
    // Or
    $highlight_by_preg = preg_replace($preg_pattern, '<font class="highlight">\\0</font>', $text);
    // Note: if $pattern is a word in Unicode and $preg_pattern was set to "/\b$pattern\b/u",
    // then $highlight_by_preg just remains equal to the original $text (i.e. nothing is replaced).

What I need is something to replace either or both of those ?????? placeholders. Thanks!

Link to comment: https://forums.phpfreaks.com/topic/110150-matching-whole-words-in-unicode/
effigy Posted June 16, 2008

    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <pre>
    <?php
    $word = 'hei' . pack('C', 0xDF);
    echo $str = utf8_encode("Es ist $word. Wie ${word}en Sie?"), '<br>';
    $word = utf8_encode($word);
    echo preg_replace("/\b$word\b/u", '<font style="color:red;">\\0</font>', $str);
    ?>
    </pre>
lwc (Author) Posted July 28, 2008

I don't get it. Your code actually replaces your word only when it's not a whole word (when it has an "en" after it).
effigy Posted July 29, 2008

This is the result I get:

    Es ist heiß. Wie heißen Sie?
    Es ist heiß. Wie heißen Sie?
lwc (Author) Posted July 30, 2008

And this is what I get when I copy and paste your code:

    Es ist heiß. Wie heißen Sie?
    Es ist heiß. Wie heißen Sie?
effigy Posted July 30, 2008

Interesting. What's your PHP version and locale? Does the following change make a difference?

    echo preg_replace('/\b' . preg_quote($word) . '\b/u', '<font style="color:red;">\\0</font>', $str);
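A side note on the preg_quote() call above: it escapes regex metacharacters in the search word, and its optional second argument additionally escapes the pattern delimiter. The sketch below only illustrates that effect; the sample word "a.b" and the variable names are made up for the example and are not from the thread.

```php
<?php
// Sketch: why a search word should pass through preg_quote() before it is
// embedded in a pattern. The sample word "a.b" is made up for illustration.
$word = 'a.b';

$raw    = '/\b' . $word . '\b/';                   // "." still means "any character"
$quoted = '/\b' . preg_quote($word, '/') . '\b/';  // "." is escaped to "\."

$raw_hits_axb    = preg_match($raw, 'axb');    // the unescaped dot also matches "x"
$quoted_hits_axb = preg_match($quoted, 'axb'); // only a literal dot is accepted
$quoted_hits_ab  = preg_match($quoted, 'a.b');

echo "$raw_hits_axb $quoted_hits_axb $quoted_hits_ab\n"; // prints "1 0 1"
```

For a plain word such as "heiß" the quoting changes nothing, which is consistent with the change making no difference to the problem in this thread.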
lwc (Author) Posted August 1, 2008

That line makes no difference. I've tried this on v5.2.5. Which command shows my locale?
effigy Posted August 4, 2008

    echo setlocale(LC_ALL, 0);
lwc (Author) Posted August 5, 2008

    LC_COLLATE=C;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=C;LC_NUMERIC=C;LC_TIME=C
effigy Posted August 5, 2008

I believe the answer is here (quoting the PCRE documentation):

    PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character value. When running in UTF-8 mode, this applies only to characters with codes less than 128. Higher-valued codes never match escapes such as \w or \d, but can be tested with \p if PCRE is built with Unicode character property support. The use of locales with Unicode is discouraged. If you are handling characters with codes greater than 128, you should either use UTF-8 and Unicode, or use locales, but not try to mix the two.

Try this pattern:

    /(?<!\p{L})$word(?!\p{L})/u

This differs from \b in that it treats every Unicode character with the letter property as part of a word, and it does not treat digits or "_" as part of a word. These can easily be added if needed.
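The pattern above can be wrapped into a small helper along these lines. This is only a sketch: highlight_whole_word() is a hypothetical name, the span markup is swapped in for the font tag used earlier in the thread, and it assumes the script file and the input text are both UTF-8 and that PCRE was built with Unicode property support.

```php
<?php
// Sketch: whole-word highlighting for UTF-8 text using Unicode letter lookarounds.
// highlight_whole_word() is a hypothetical helper name, not from the thread.
function highlight_whole_word(string $word, string $text): string
{
    // preg_quote() escapes regex metacharacters in the word, plus the "/" delimiter.
    $pattern = '/(?<!\p{L})' . preg_quote($word, '/') . '(?!\p{L})/u';

    // preg_replace() returns null on error (e.g. invalid UTF-8 with /u),
    // so fall back to the unchanged text in that case.
    return preg_replace($pattern, '<span class="highlight">$0</span>', $text) ?? $text;
}

echo highlight_whole_word('heiß', 'Es ist heiß. Wie heißen Sie?'), "\n";
// prints: Es ist <span class="highlight">heiß</span>. Wie heißen Sie?
```

Only the standalone "heiß" is wrapped; the one inside "heißen" is followed by a letter, so the negative lookahead rejects it.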
lwc (Author) Posted August 5, 2008

You're a genius! I've tried messing with \p{L} by myself before, but the real challenge was to surround it with things that would make it work like \b (both at the start and at the end), and that's what you managed to do.

Any chance you can explain how come "?<!" and "?!" do the trick?
nrg_alpha Posted August 6, 2008

Quote: Any chance you can explain how come "?<!" and "?!" do the trick?

These are called assertions. You can look ahead or behind to test for subpatterns. Assertions are not included in the capturing of patterns.

(?<! is a negative lookbehind. So (?<!foo)bar means: do NOT capture 'bar' if 'foo' comes right before it. Thus, in the string 'foobar', 'bar' is not captured because 'foo' is right before it.

(?! is a negative lookahead. So foo(?!bar) means: do not capture 'foo' if 'bar' comes right after it. Thus, in the string 'foobar', 'foo' is not captured because 'bar' follows directly after it.

You can find more on this in the PHP manual: http://www.php.net/manual/en/regexp.reference.php#regexp.reference.assertions

Cheers,
NRG
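The two 'foobar' examples above can be checked directly with preg_match(). This is a minimal sketch; the variable names and the extra test strings 'bazbar' and 'foobaz' are made up for illustration.

```php
<?php
// (?<!foo)bar : match "bar" only when it is NOT immediately preceded by "foo".
$behind_foobar = preg_match('/(?<!foo)bar/', 'foobar'); // "bar" is preceded by "foo"
$behind_bazbar = preg_match('/(?<!foo)bar/', 'bazbar'); // "bar" is preceded by "baz"

// foo(?!bar) : match "foo" only when it is NOT immediately followed by "bar".
$ahead_foobar = preg_match('/foo(?!bar)/', 'foobar');   // "foo" is followed by "bar"
$ahead_foobaz = preg_match('/foo(?!bar)/', 'foobaz');   // "foo" is followed by "baz"

echo "$behind_foobar $behind_bazbar $ahead_foobar $ahead_foobaz\n"; // prints "0 1 0 1"

// Lookarounds are zero-width: they test the neighbouring text without consuming it,
// so the matched text is just "bar".
preg_match('/(?<!foo)bar/', 'bazbar', $m);
echo $m[0], "\n"; // prints "bar"
```

Because the assertions consume nothing, \\0 (or $0) in a replacement covers only the word itself, which is why they can stand in for \b in the highlighting pattern.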
effigy Posted August 6, 2008

To be pedantic, alpha's post should use the word "match" rather than "capture," because ( and ) are the only mechanisms that capture in regexp.
nrg_alpha Posted August 6, 2008

Quote: To be pedantic, alpha's post should use the word "match" rather than "capture," because ( and ) are the only mechanisms that capture in regexp.

Oops, my bad. I stand corrected.

Cheers,
NRG
lwc (Author) Posted December 16, 2008

Turns out there's still one problem. Both "\b" (when not in Unicode) and "\p" (when in Unicode) also match "<match", "match>", "&lt;match" and "match&gt;".

Is there a way to make PHP (both in and not in Unicode) realize that a < or > (or their entity equivalents) next to the match means the match is not a whole word?

Thanks!
lwc (Author) Posted December 16, 2008

Okay, on further thought I guess "\p" doesn't even need this, because Unicode characters can't be used in HTML tags anyway. As for "\b", fataqui's (relatively) simple approach seems to work great for me. I'll continue the debate there.
effigy Posted December 16, 2008

If you're not planning to parse the entities into characters:

    /(?<![\p{L}<]|&lt;)$word(?![\p{L}>]|&gt;)/u
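A sketch of how that could look in practice, assuming the pattern above is meant to reject a match that touches either the literal characters < and > or the entities &lt; and &gt;. highlight_outside_tags() is a hypothetical helper name and the span markup is swapped in for illustration. PCRE accepts the lookbehind because each top-level alternative in it has a fixed length.

```php
<?php
// Hypothetical helper: Unicode whole-word match that also refuses a match
// directly next to "<" / ">" or the entities "&lt;" / "&gt;".
function highlight_outside_tags(string $word, string $text): string
{
    $quoted  = preg_quote($word, '/');
    // Lookbehind: no letter, "<" or "&lt;" right before the word.
    // Lookahead:  no letter, ">" or "&gt;" right after the word.
    $pattern = '/(?<![\p{L}<]|&lt;)' . $quoted . '(?![\p{L}>]|&gt;)/u';

    // Fall back to the unchanged text if preg_replace() reports an error (null).
    return preg_replace($pattern, '<span class="highlight">$0</span>', $text) ?? $text;
}

$sample = 'match <match match> &lt;match match&gt; rematch';
echo highlight_outside_tags('match', $sample), "\n";
// prints: <span class="highlight">match</span> <match match> &lt;match match&gt; rematch
```

Only the first, free-standing "match" is wrapped. This checks only the characters immediately adjacent to the word; it does not detect whether the word sits deeper inside a tag.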
lwc (Author) Posted December 17, 2008

Quote: If you're not planning to parse the entities into characters:

What do you mean?
effigy Posted December 17, 2008

Well, what I was really after is: are you using any HTML or XML tools for this? Typically these will handle entities, isolation of content, etc.