Matching whole words in Unicode



I want to match a whole word, but I can't get it to work with either ereg or preg when the text is Unicode.


     $pattern = "pattern";
     $text = "a phrase that contains the word pattern as a whole word.";

     if ($pattern == utf8_encode($pattern)) {
         // The following patterns only work if $pattern is in pure Latin letters
        $ereg_pattern = "[[:<:]]{$pattern}[[:>:]]";
        $preg_pattern = "/\b$pattern\b/i";
     } else {
        $ereg_pattern = ??????????????????????
        $preg_pattern = ?????????????????????? //Note: "/\b$pattern\b/u" does NOT work - see below.
     }

// Now I can highlight the pattern
     $highlight_by_ereg = eregi_replace($ereg_pattern, '<font class="highlight">\\0</font>', $text);
// Or
     $highlight_by_preg = preg_replace($preg_pattern, '<font class="highlight">\\0</font>', $text);
// Note: if $pattern is a word in Unicode and $preg_pattern was set to "/\b$pattern\b/u",
// then $highlight_by_preg just remains equal to the original $text (i.e. nothing is replaced).

What I need is something to replace either or both of those ?-?-?-?-?-?-?-?-?...  ;D




<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
$word = 'hei' . pack('C', 0xDF);
echo $str = utf8_encode("Es ist $word. Wie ${word}en Sie?"), '<br>';
$word = utf8_encode($word);
echo preg_replace("/\b$word\b/u", '<font style="color:red;">\\0</font>', $str);


  • 1 month later...

I believe the answer is here:


      PCRE handles caseless matching, and determines whether characters are
      letters, digits, or whatever, by reference to a set of tables, indexed
      by character value. When running in UTF-8 mode, this applies only to
      characters with codes less than 128. Higher-valued codes never match
      escapes such as \w or \d, but can be tested with \p if PCRE is built
      with Unicode character property support. The use of locales with
      Unicode is discouraged. If you are handling characters with codes
      greater than 128, you should either use UTF-8 and Unicode, or use
      locales, but not try to mix the two.


Try this pattern: /(?<!\p{L})$word(?!\p{L})/u


This is different from \b in that it encompasses all Unicode characters with the letter property and does not include digits or "_". These can easily be added if needed.
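
Here is a minimal sketch of how that pattern could be wrapped up for highlighting. The function name `highlight_whole_word` and its `$wordChars` parameter are made up for illustration and are not from this thread. It assumes the input is valid UTF-8 and that PCRE was built with Unicode property support. `preg_quote()` is used so that special characters in the word are not treated as regex syntax.

```php
<?php
// Hypothetical helper (not from the thread): wraps whole-word matches of $word.
// $wordChars is the body of the character class that counts as "part of a word".
function highlight_whole_word(string $word, string $text, string $wordChars = '\p{L}'): string
{
    // Escape regex metacharacters in the word, including the "/" delimiter.
    $quoted = preg_quote($word, '/');
    // Negative lookbehind and lookahead: no word character on either side.
    $pattern = '/(?<![' . $wordChars . '])' . $quoted . '(?![' . $wordChars . '])/u';
    $result = preg_replace($pattern, '<font class="highlight">$0</font>', $text);
    // preg_replace() returns null on error (e.g. invalid UTF-8); fall back to the input.
    return $result ?? $text;
}

// "\u{00DF}" is the letter sharp s, written as an escape so the file encoding does not matter.
$text = "Es ist hei\u{00DF}. Wie hei\u{00DF}en Sie?";
$lettersOnly = highlight_whole_word("hei\u{00DF}", $text);

// Treat digits and "_" as word characters too, closer to what \b does.
$withDigits = highlight_whole_word('cat', 'cat_1 cat', '\p{L}\p{N}_');

echo $lettersOnly, "\n", $withDigits, "\n";
```

With the default class only the standalone word in the German sentence is wrapped, and the one inside the longer word is left alone. Passing `'\p{L}\p{N}_'` is one way to add digits and the underscore, as mentioned above.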


You're a genius! I've tried messing with \p{L} by myself before, but the real challenge was to surround it with things that would make it work like \b (both at the start and at the end), and that's what you managed to do.


Any chance you can explain how come "?<!" and "?!" do the trick?


Any chance you can explain how come "?<!" and "?!" do the trick?


These are called assertions (lookarounds). They look ahead of or behind the current position to test for a subpattern, but they are zero-width: the text they test is not consumed and is not part of the match.


(?<! is a negative lookbehind.

So (?<!foo)bar means: do NOT match 'bar' if 'foo' comes right before it. In the string 'foobar', 'bar' is not matched because 'foo' is right before it.


(?! is a negative lookahead.

So foo(?!bar) means: do NOT match 'foo' if 'bar' comes right after it. In the string 'foobar', 'foo' is not matched because 'bar' follows it directly.
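
A small sketch to try the two assertions out, using the same 'foobar' strings. The variable names and the extra test strings 'bazbar' and 'foobaz' are only for illustration.

```php
<?php
// Negative lookbehind: match "bar" only when "foo" is NOT immediately before it.
$behindBlocked = preg_match('/(?<!foo)bar/', 'foobar');   // "foo" precedes "bar"
$behindAllowed = preg_match('/(?<!foo)bar/', 'bazbar');   // "baz" precedes "bar"

// Negative lookahead: match "foo" only when "bar" does NOT immediately follow it.
$aheadBlocked = preg_match('/foo(?!bar)/', 'foobar');     // "bar" follows "foo"
$aheadAllowed = preg_match('/foo(?!bar)/', 'foobaz', $m); // "baz" follows "foo"

// The assertion itself is zero-width, so only "foo" ends up in the match.
$matchedText = $m[0];

var_dump($behindBlocked, $behindAllowed, $aheadBlocked, $aheadAllowed, $matchedText);
```

Note that in the last case the match is just 'foo', not 'foobaz': the lookahead checks what follows but does not add it to the result.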

You can find out a bunch of this stuff in the php manual:







  • 4 months later...

Turns out there's still one problem.


Both "\b" (when not in Unicode) and "\p{L}" (when in Unicode) also match "<match", "match>", "&lt;match" and "match&gt;".


Is there a way to make PHP (both in and not in Unicode) realize a < or > (and their equivalents) next to the match means the match is not a whole word?
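
One possible approach, offered only as a sketch and not as a confirmed answer from this thread: add `<` and `>` to the character classes inside the lookarounds, and add separate fixed-length lookarounds for the entities. It assumes "their equivalents" means the HTML entities `&lt;` and `&gt;`, and the function name `count_whole_words` is made up for illustration.

```php
<?php
// Hypothetical helper: counts whole-word matches of $word that are not
// directly next to "<", ">", "&lt;" or "&gt;".
function count_whole_words(string $word, string $text): int
{
    $quoted = preg_quote($word, '/');
    $pattern = '/(?<![\p{L}<])(?<!&lt;)'   // no letter, "<" or "&lt;" right before
             . $quoted
             . '(?![\p{L}>])(?!&gt;)/u';   // no letter, ">" or "&gt;" right after
    // preg_match_all() returns false on error; the cast turns that into 0.
    return (int) preg_match_all($pattern, $text);
}

$plain    = count_whole_words('match', 'a match here');
$bracket  = count_whole_words('match', '<match and match>');
$entities = count_whole_words('match', '&lt;match and match&gt;');
$mixed    = count_whole_words('match', '<match> but also match');

var_dump($plain, $bracket, $entities, $mixed);
```

The two lookbehinds are kept separate because each one has a fixed length, which PCRE requires. Without the `u` modifier the same idea applies, with `\p{L}` swapped for whatever class fits the single-byte encoding in use.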



