
regex : match words with accents


anarchoi


Hello,

 

I am using the following pattern to replace words in a string:

 

"#\b$word(\n|s|\b)(?=\s|[.,?!;:]\s)#i";

 

$word is a name, like "Élisée Reclus" or "Emile Pouget". Right now the syntax works perfectly and will replace these names.

 

But I am wondering if there is a way to match the same words with or without accents.

 

Example:

If $word is "Élisée Reclus", I would like to match "Élisée Reclus" AND "Elisee Reclus" AND "Èlisêè Reclùs", etc.

If $word is "Emile Pouget", I would like to match "Émile Pouget" AND "Emile Pouget" AND "Êmîlè Pôùgèt", etc.

 

Thanks a lot!


Not with regular expressions alone, as far as I know. I guess you could first 'normalize' every accented letter in $name (e.g. using strtr()) and then replace every character in $name with a character class of all the similar accented characters, like a => [aáàâäå]. That would be pretty long-winded, but it should work.
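A minimal sketch of the 'normalize' step with strtr(), under some assumptions: the map `$fold` and the helper name `fold_accents()` are made up for illustration, the map covers only a handful of letters, and both the file and the input are assumed to be UTF-8 with precomposed characters.

```php
<?php
// Hypothetical, deliberately tiny map: accented letter => base letter.
// A real map would have to cover every letter you care about.
$fold = [
    'É' => 'E', 'È' => 'E', 'Ê' => 'E', 'Ë' => 'E',
    'é' => 'e', 'è' => 'e', 'ê' => 'e', 'ë' => 'e',
    'î' => 'i', 'ô' => 'o', 'ù' => 'u',
];

// With an array argument, strtr() replaces whole keys, so each multi-byte
// UTF-8 sequence is swapped as a unit.
function fold_accents(string $text, array $fold): string
{
    return strtr($text, $fold);
}

echo fold_accents('Élisée Reclus', $fold), "\n"; // Elisee Reclus
echo fold_accents('Êmîlè Pôùgèt', $fold), "\n";  // Emile Pouget
```

The three-argument form strtr($str, $from, $to) works byte by byte, so it would not be suitable for UTF-8 input; the array form is used here for that reason.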

Okay, I actually had some fun writing this, since it ended up working :D Since the code contains some very odd Unicode characters, I uploaded it to my server instead of posting it here, 'cause the forum messes with the chars:

 

http://kronb.org/php/accents.phps

 

Note that in my example I use ~ as the pattern delimiter, and pass it to preg_quote() as the delimiter argument so it gets escaped too. Also, my regex pattern uses the u modifier, which makes the pattern and subject be treated as UTF-8. That's important since my function handles UTF-8 chars. And yeah, obviously your PHP file needs to be encoded in UTF-8 too.
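To illustrate the two points above (this is only a sketch, not the code from the linked script; the search term is a made-up example chosen because it contains both the delimiter and a regex metacharacter):

```php
<?php
// Hypothetical search term containing the delimiter "~" and a metacharacter ".".
$word = 'Reclus ~ É.';

// The second argument tells preg_quote() to escape the delimiter as well.
$quoted = preg_quote($word, '~');

// "u" makes PCRE treat pattern and subject as UTF-8.
$pattern = '~' . $quoted . '~u';

var_dump(preg_match($pattern, 'Élisée Reclus ~ É. (1830-1905)')); // int(1)
var_dump(preg_match($pattern, 'Élisée Reclus ~ Éx (1830-1905)')); // int(0)

// Why "u" matters: without it "." matches a single byte, and "É" is two bytes.
var_dump(preg_match('~^.$~u', 'É')); // int(1)
var_dump(preg_match('~^.$~', 'É'));  // int(0)
```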

 

Basically, my function accents() either 'normalizes' the input string:

 

accents('Ȩḷiséẽ Řeclůs', true):

elisee reclus

 

Or it returns an array of all the different versions of the input character:

 

accents('a'):

Array
(
    [0] => A
    [1] => a
    [2] => Á
    [3] => á
    [4] => À
    [5] => à
    [6] => Ă
    [7] => ă
    [8] => Ắ
    [9] => ắ
    [10] => Ằ
    [11] => ằ
    [12] => Ẵ
    [13] => ẵ
    [14] => Ẳ
    [15] => ẳ
    [16] => Â
    [17] => â
    [18] => Ấ
    [19] => ấ
    [20] => Ầ
    [21] => ầ
    [22] => Ẫ
    [23] => ẫ
    [24] => Ẩ
    [25] => ẩ
    [26] => Ǎ
    [27] => ǎ
    [28] => Å
    [29] => å
    [30] => Ǻ
    [31] => ǻ
    [32] => Ä
    [33] => ä
    [34] => Ǟ
    [35] => ǟ
    [36] => Ã
    [37] => ã
    [38] => Ȧ
    [39] => ȧ
    [40] => Ǡ
    [41] => ǡ
    [42] => Ą
    [43] => ą
    [44] => Ā
    [45] => ā
    [46] => Ả
    [47] => ả
    [48] => Ȁ
    [49] => ȁ
    [50] => Ȃ
    [51] => ȃ
    [52] => Ạ
    [53] => ạ
    [54] => Ặ
    [55] => ặ
    [56] => Ậ
    [57] => ậ
    [58] => Ḁ
    [59] => ḁ
    [60] => Ⱥ
    [61] => ⱥ
    [62] => ᶏ
    [63] => Ɐ
    [64] => ɐ
    [65] => Ɑ
    [66] => ɑ
)

 

With that functionality you can normalize $name and then build a string with character classes to be used in a regular expression. See the example in the script.
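Since the linked script is not reproduced here, the following is only a rough sketch of that last step. The `$variants` table and the `accent_pattern()` helper are hypothetical stand-ins (they are not the accents() function described above), only a few vowels are covered, and the word-boundary and lookahead parts of the original poster's pattern are left out to keep it short.

```php
<?php
// Hypothetical stand-in for the variant lists; only four vowels are covered.
// Accented upper-case forms are listed explicitly so the match does not
// depend on case folding of non-ASCII characters.
$variants = [
    'e' => 'eéèêëÉÈÊË',
    'i' => 'iíìîïÍÌÎÏ',
    'o' => 'oóòôöÓÒÔÖ',
    'u' => 'uúùûüÚÙÛÜ',
];

// Builds a pattern from an already normalized (accent-free, lower-case) name.
function accent_pattern(string $normalized, array $variants): string
{
    $out = '';
    // Split into UTF-8 characters without needing the mbstring extension.
    foreach (preg_split('//u', $normalized, -1, PREG_SPLIT_NO_EMPTY) as $ch) {
        $out .= isset($variants[$ch])
            ? '[' . $variants[$ch] . ']'   // letter with known variants
            : preg_quote($ch, '~');        // anything else, escaped literally
    }
    // "i" covers plain upper/lower case; "u" is needed for the UTF-8 classes.
    return '~' . $out . '~iu';
}

$pattern = accent_pattern('emile pouget', $variants);

foreach (['Émile Pouget', 'Emile Pouget', 'Êmîlè Pôùgèt', 'Emile Zola'] as $s) {
    echo $s, ' => ', preg_match($pattern, $s), "\n";
}
```

With this table, the first three names should match and "Emile Zola" should not. The same pattern can be handed to preg_replace() to do the actual replacement.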
