Regex: match words with accents

anarchoi · June 14, 2009

Hello,

I am using the following pattern to replace words in a string:

"#\b$word(\n|s|\b)(?=\s|[.,?!;:]\s)#i";

$word is a name, like "Élisée Reclus" or "Emile Pouget". Right now the syntax works perfectly and will replace these names.

But I am wondering if there is a way to match the same words with or without accents.

Example:

If $word is "Élisée Reclus", I would like to match "Élisée Reclus", "Elisee Reclus", "Èlisêè Reclùs", etc.

If $word is "Emile Pouget", I would like to match "Émile Pouget", "Emile Pouget", "Êmîlè Pôùgèt", etc.

Thanks a lot!

thebadbad · June 14, 2009

Not with regular expressions alone, as far as I know. I guess you could first 'normalize' every accented letter in $name (e.g. using strtr()) and then replace every character in $name with a character class of all the similar accented characters, like a => [aáàâäå], but that would be pretty long-winded. It should work, though.
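For illustration, here is a minimal sketch of the normalization step only, assuming a UTF-8 source file. The map $accentMap and the function normalize_accents() are hypothetical names, and the map is deliberately tiny; a usable one would have to list every accented letter you expect to see.

```php
<?php
// Hypothetical, deliberately small map: accented letter => base letter.
$accentMap = [
    'É' => 'E', 'È' => 'E', 'Ê' => 'E',
    'é' => 'e', 'è' => 'e', 'ê' => 'e',
    'î' => 'i', 'ô' => 'o', 'ù' => 'u', 'û' => 'u',
    'à' => 'a', 'â' => 'a', 'ç' => 'c',
];

// strtr() with an array replaces whole keys, so multi-byte UTF-8
// letters are swapped as units rather than byte by byte.
function normalize_accents(string $text, array $map): string
{
    return strtr($text, $map);
}

echo normalize_accents('Èlisêè Reclùs', $accentMap), "\n"; // Elisee Reclus
```

Letters that are missing from the map pass through unchanged, so the result is only as good as the map.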

thebadbad · June 14, 2009

Okay, I actually had some fun writing this, since it ended up working. Since the code contains some very odd Unicode characters, I uploaded it to my server instead of posting it here, because the forum messes with the characters:

http://kronb.org/php/accents.phps

Note that in my example I use ~ as the pattern delimiter and pass it to preg_quote() as the delimiter argument. Also, my regex pattern uses the u modifier, which treats the pattern and subject as UTF-8. That's important since my function handles UTF-8 characters. And yeah, obviously your PHP file needs to be encoded in UTF-8 too.
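A small sketch of those two points, assuming the file is saved as UTF-8. The name in $name is made up for the demonstration; it is not taken from the linked script.

```php
<?php
// Hypothetical name that happens to contain the delimiter and a dot.
$name = 'St~Élie.';

// Passing '~' as the second argument makes preg_quote() escape the
// delimiter along with the usual regex metacharacters.
$quoted = preg_quote($name, '~');
echo $quoted, "\n"; // St\~Élie\.

// With the u modifier, [éè] is a class of two characters.
$withU = preg_match('~^[éè]$~u', 'é');
var_dump($withU); // int(1)

// Without it, the class is read as a set of single bytes, so the
// two-byte "é" cannot fit between ^ and $.
$withoutU = preg_match('~^[éè]$~', 'é');
var_dump($withoutU); // int(0)
```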

Basically, my function accents() either 'normalizes' the input string:

accents('Ȩḷiséẽ Řeclůs', true):

elisee reclus

Or returns an array of all different versions of the input character:

accents('a'):

Array
(
    [0] => A
    [1] => a
    [2] => Á
    [3] => á
    [4] => À
    [5] => à
    [6] => Ă
    [7] => ă
    [8] => Ắ
    [9] => ắ
    [10] => Ằ
    [11] => ằ
    [12] => Ẵ
    [13] => ẵ
    [14] => Ẳ
    [15] => ẳ
    [16] => Â
    [17] => â
    [18] => Ấ
    [19] => ấ
    [20] => Ầ
    [21] => ầ
    [22] => Ẫ
    [23] => ẫ
    [24] => Ẩ
    [25] => ẩ
    [26] => Ǎ
    [27] => ǎ
    [28] => Å
    [29] => å
    [30] => Ǻ
    [31] => ǻ
    [32] => Ä
    [33] => ä
    [34] => Ǟ
    [35] => ǟ
    [36] => Ã
    [37] => ã
    [38] => Ȧ
    [39] => ȧ
    [40] => Ǡ
    [41] => ǡ
    [42] => Ą
    [43] => ą
    [44] => Ā
    [45] => ā
    [46] => Ả
    [47] => ả
    [48] => Ȁ
    [49] => ȁ
    [50] => Ȃ
    [51] => ȃ
    [52] => Ạ
    [53] => ạ
    [54] => Ặ
    [55] => ặ
    [56] => Ậ
    [57] => ậ
    [58] => Ḁ
    [59] => ḁ
    [60] => Ⱥ
    [61] => ⱥ
    [62] => ᶏ
    [63] => Ɐ
    [64] => ɐ
    [65] => Ɑ
    [66] => ɑ
)

With that functionality you can normalize $name and then build a string with character classes to be used in a regular expression. See the example in the script.
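The pattern-building step could look roughly like the sketch below. It is not the accents() function from the linked script: $variants and accent_pattern() are hypothetical, the variant lists are heavily shortened, and the \p{L} lookarounds are an assumed stand-in for word boundaries.

```php
<?php
// Hypothetical, shortened variant lists keyed by base letter. Both
// cases are listed so that an input like "É" can be looked up directly.
$variants = [
    'a' => 'aàâäAÀÂÄ',
    'c' => 'cçCÇ',
    'e' => 'eéèêëEÉÈÊË',
    'i' => 'iîïIÎÏ',
    'o' => 'oôöOÔÖ',
    'u' => 'uùûüUÙÛÜ',
];

// Each letter found in a variant list becomes a character class of
// that whole list; every other character is quoted literally.
function accent_pattern(string $name, array $variants): string
{
    // Reverse lookup: single character => key of its variant list.
    $base = [];
    foreach ($variants as $letter => $chars) {
        foreach (preg_split('//u', $chars, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
            $base[$ch] = $letter;
        }
    }

    $out = '';
    foreach (preg_split('//u', $name, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $out .= isset($base[$ch])
            ? '[' . $variants[$base[$ch]] . ']'
            : preg_quote($ch, '~');
    }
    return $out;
}

// Lookarounds keep the name from matching inside a longer word.
$pattern = '~(?<!\p{L})' . accent_pattern('Élisée Reclus', $variants) . '(?!\p{L})~iu';

var_dump(preg_match($pattern, 'Élisée Reclus')); // int(1)
var_dump(preg_match($pattern, 'Elisee Reclus')); // int(1)
var_dump(preg_match($pattern, 'Èlisêè Reclùs')); // int(1)
var_dump(preg_match($pattern, 'Elise Reclus'));  // int(0)
```

Because the class is built from the base letter, it makes no difference whether $name itself is written with or without accents.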
