Matching whole words in Unicode

lwc · June 13, 2008

I want to find a whole word, but I can't manage it with either ereg or preg when the word is in Unicode.

     $pattern = "pattern";
     $text = "a phrase that contains the word pattern as a whole word.";

     if ($pattern == utf8_encode($pattern)) {
         // The following patterns only work if $pattern is in pure Latin letters
        $ereg_pattern = "[[:<:]]{$pattern}[[:>:]]";
        $preg_pattern = "/\b$pattern\b/i";
     } else {
        $ereg_pattern = ??????????????????????
        $preg_pattern = ?????????????????????? //Note: "/\b$pattern\b/u" does NOT work - see below.
     }

// Now I can highlight the pattern
     $highlight_by_ereg = eregi_replace($ereg_pattern, '<font class="highlight">\\0</font>', $text);
// Or
     $highlight_by_preg = preg_replace($preg_pattern, '<font class="highlight">\\0</font>', $text);
// Note: if $pattern is a word in Unicode and $preg_pattern was set to "/\b$pattern\b/u",
// then $highlight_by_preg just remains equal to the original $text (i.e. nothing is replaced).

What I need is something to replace either or both of those ?-?-?-?-?-?-?-?-?...

Thanks!

effigy · June 16, 2008

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<pre>
<?php
$word = 'hei' . pack('C', 0xDF);
echo $str = utf8_encode("Es ist $word. Wie ${word}en Sie?"), '<br>';
$word = utf8_encode($word);
echo preg_replace("/\b$word\b/u", '<font style="color:red;">\\0</font>', $str);
?>
</pre>

lwc · July 28, 2008

I don't get it. Your code actually replaces your word only when it's not a whole word (when it has an "en" after it).

effigy · July 29, 2008

This is the result I get:

Es ist heiß. Wie heißen Sie?

lwc · July 30, 2008

And this is what I get when I copy and paste your code:

Es ist heiß. Wie heißen Sie?

effigy · July 30, 2008

Interesting. What's your PHP version and locale? Does the following change make a difference?

echo preg_replace('/\b' . preg_quote($word) . '\b/u', '<font style="color:red;">\\0</font>', $str);

lwc · August 1, 2008

That line makes no difference. I've tried this on v5.2.5. Tell me which command tells me my locale.

effigy · August 4, 2008

echo setlocale(LC_ALL, 0);

lwc · August 5, 2008

LC_COLLATE=C;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=C;LC_NUMERIC=C;LC_TIME=C

effigy · August 5, 2008

I believe the answer is here:

PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character value. When running in UTF-8 mode, this applies only to characters with codes less than 128. Higher-valued codes never match escapes such as \w or \d, but can be tested with \p if PCRE is built with Unicode character property support. The use of locales with Unicode is discouraged. If you are handling characters with codes greater than 128, you should either use UTF-8 and Unicode, or use locales, but not try to mix the two.

Try this pattern: /(?<!\p{L})$word(?!\p{L})/u

This is different from \b in that it encompasses all Unicode characters with the letter property and does not include digits or "_". These can easily be added if needed.
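For anyone who does want digits and "_" treated as word characters, here is a minimal sketch of how they might be added. The helper name highlight_whole_word and the <b> wrapper are made up for illustration and are not from this thread. It assumes PHP 7 or later, UTF-8 input (and a UTF-8 source file), and PCRE built with Unicode property support. It also passes the delimiter to preg_quote so that a "/" inside the word cannot break the pattern.

```php
<?php
// Hypothetical helper, not part of the thread: wraps whole-word matches in <b>.
// Assumes $word and $text are valid UTF-8 and PCRE has Unicode property support.
function highlight_whole_word(string $word, string $text): string
{
    // Letters (\p{L}), digits (\p{N}) and "_" count as word characters here.
    $wordChar = '[\p{L}\p{N}_]';
    $pattern = '/(?<!' . $wordChar . ')'
             . preg_quote($word, '/')
             . '(?!' . $wordChar . ')/u';

    // preg_replace returns null on error (e.g. invalid UTF-8); fall back to the input.
    return preg_replace($pattern, '<b>$0</b>', $text) ?? $text;
}

echo highlight_whole_word('heiß', 'Es ist heiß. Wie heißen Sie?'), "\n";
```

With this sketch the standalone "heiß" is wrapped and the "heiß" inside "heißen" is left alone, because the "e" that follows it is a letter.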

lwc · August 5, 2008

You're a genius! I've tried messing with \p{L} by myself before, but the real challenge was to surround it with things that would make it work like \b (both for the start and for the end) - and that's what you managed to do.

Any chance you can explain how come "?<!" and "?!" do the trick?

nrg_alpha · August 6, 2008

Any chance you can explain how come "?<!" and "?!" do the trick?

These are called assertions.. you can look ahead or behind to find subpatterns. Assertions are not included in the capturing of patterns.

(?<! this means negative look behind..

so in this sample: (?<!foo)bar - means do NOT capture 'bar' if the word 'foo' comes right before it. Thus: in the string 'foobar', bar is not captured because the 'foo' part is right before it..

(?! is a negative look ahead assertion...

so foo(?!bar) - means do not include 'foo' if the word 'bar' comes right after it. Thus: in the string 'foobar', 'foo' is not captured because 'bar' follows directly after it.
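The two samples above can be tried out with a short script. This is only a sketch; the variable names and the 'bazbar' / 'foobaz' test strings are invented for illustration.

```php
<?php
// Negative lookbehind: 'bar' matches only when 'foo' is NOT directly before it.
$behindBlocked = preg_match('/(?<!foo)bar/', 'foobar'); // 'foo' precedes 'bar'
$behindAllowed = preg_match('/(?<!foo)bar/', 'bazbar'); // 'baz' precedes 'bar'

// Negative lookahead: 'foo' matches only when 'bar' does NOT directly follow it.
$aheadBlocked = preg_match('/foo(?!bar)/', 'foobar');   // 'bar' follows 'foo'
$aheadAllowed = preg_match('/foo(?!bar)/', 'foobaz');   // 'baz' follows 'foo'

// Assertions consume no characters, so the matched text is just 'bar'.
preg_match('/(?<!foo)bar/', 'bazbar', $m);

var_dump($behindBlocked, $behindAllowed, $aheadBlocked, $aheadAllowed, $m[0]);
```

The "blocked" calls return 0 and the "allowed" calls return 1, and $m[0] holds only 'bar' even though the assertion looked at the three characters before it.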

You can find out a bunch of this stuff in the php manual:

http://www.php.net/manual/en/regexp.reference.php#regexp.reference.assertions

Cheers,

NRG

effigy · August 6, 2008

To be pedantic, alpha's post should use the word "match" rather than "capture," because ( and ) are the only mechanisms that capture in regexp.

nrg_alpha · August 6, 2008

To be pedantic, alpha's post should use the word "match" rather than "capture," because ( and ) are the only mechanisms that capture in regexp.

Oops.. my bad. I stand corrected.

Cheers,

NRG

lwc · December 16, 2008

Turns out there's still one problem.

Both "\b" (when not in Unicode) and "\p" (when in Unicode) also match "<match", "match>", "&lt;match" and "match&gt;".

Is there a way to make PHP (both in and not in Unicode) realize a < or > (and their equivalents) next to the match means the match is not a whole word?

Thanks!

lwc · December 16, 2008

Okay, on further thinking I guess "\p" doesn't even need this because Unicode characters can't be used in HTML tags anyway.

As for "\b", fataqui's simple (relatively...) approach seems to work great for me. I'll continue the debate there.

effigy · December 16, 2008

If you're not planning to parse the entities into characters:

/(?<![\p{L}<]|&lt;)$word(?![\p{L}>]|&gt;)/u
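A rough sketch of how a pattern of this shape behaves on text where angle brackets appear both literally and as the entities &lt; and &gt;. The sample word, the sample text and the [ ] markers are made up for illustration; it assumes UTF-8 input and PCRE with Unicode property support.

```php
<?php
// Hypothetical sample data, not from the thread.
$word = 'match';
$text = 'match <match match> &lt;match match&gt; rematch';

// Reject the word when a letter, a literal < or >, or the entity &lt; / &gt;
// sits directly next to it. The two lookbehind branches have different
// lengths (1 and 4 characters), which PCRE accepts because each branch
// is itself of fixed length.
$pattern = '/(?<![\p{L}<]|&lt;)' . preg_quote($word, '/') . '(?![\p{L}>]|&gt;)/u';

$result = preg_replace($pattern, '[$0]', $text);
echo $result, "\n";
```

Only the first, free-standing "match" ends up wrapped in brackets; the ones touching a bracket, an entity or another letter are left untouched.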

lwc · December 17, 2008

If you're not planning to parse the entities into characters:

What do you mean?

effigy · December 17, 2008

Well, what I was really after is: are you using any HTML or XML tools for this? Typically these will handle entities, isolation of content, etc.
