
Regex: match words with accents


anarchoi


Hello,

I am using the following pattern to replace words in a string:

"#\b$word(\n|s|\b)(?=\s|[.,?!;:]\s)#i";

 

$word is a name, like "Élisée Reclus" or "Emile Pouget". Right now the pattern works perfectly and replaces these names.

But I am wondering whether there is a way to match the same words with or without accents.

Example:

If $word is "Élisée Reclus", I would like to match "Élisée Reclus", "Elisee Reclus", "Èlisêè Reclùs", etc.

If $word is "Emile Pouget", I would like to match "Émile Pouget", "Emile Pouget", "Êmîlè Pôùgèt", etc.

Thanks a lot!


Not directly with regular expressions, as far as I know. I guess you could first 'normalize' every accented letter in $name (e.g. using strtr()) and then replace every character in $name with a character class of all the similar accented characters, like a => [aáàâäå]. That would be pretty long-winded, but it should work.
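The idea above can be sketched roughly as follows. This is only a minimal sketch, not a complete solution: the table in accent_variants() and the helper names utf8_chars(), fold_accents() and accent_pattern() are made up for illustration, and the table covers just a handful of vowels plus ç. It assumes PHP 8 and that the file and all strings are UTF-8.

```php
<?php
// Hypothetical, deliberately tiny table: base letter => accented variants
// in both cases. A real table would need many more letters and marks.
function accent_variants(): array
{
    return [
        'a' => 'áàâäåÁÀÂÄÅ',
        'c' => 'çÇ',
        'e' => 'éèêëÉÈÊË',
        'i' => 'íìîïÍÌÎÏ',
        'o' => 'óòôöÓÒÔÖ',
        'u' => 'úùûüÚÙÛÜ',
    ];
}

// Split a UTF-8 string into an array of single characters.
function utf8_chars(string $s): array
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}

// Step 1: 'normalize' the name with strtr(), then lowercase the ASCII rest
// (strtolower() only touches ASCII letters in PHP 8).
function fold_accents(string $s): string
{
    $map = [];
    foreach (accent_variants() as $base => $variants) {
        foreach (utf8_chars($variants) as $ch) {
            $map[$ch] = $base;
        }
    }
    return strtolower(strtr($s, $map));
}

// Step 2: replace every base letter with a character class of its variants.
// Returns a pattern fragment without delimiters or modifiers.
function accent_pattern(string $name): string
{
    $table = accent_variants();
    $out = '';
    foreach (utf8_chars(fold_accents($name)) as $ch) {
        $out .= isset($table[$ch])
            ? '[' . $ch . $table[$ch] . ']'
            : preg_quote($ch, '~');
    }
    return $out;
}

// The i modifier covers upper/lower case of the plain ASCII letters,
// the u modifier makes PCRE read pattern and subject as UTF-8.
$regex = '~' . accent_pattern('Élisée Reclus') . '~iu';
var_dump(preg_match($regex, 'Èlisêè Reclùs')); // int(1)
var_dump(preg_match($regex, 'Elisee Reclus')); // int(1)
```

Listing both cases of each accented letter in the table keeps the sketch independent of the mbstring extension. The price is that the table has to be maintained by hand.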


Okay, I actually had some fun writing this, since it ended up working :D Since the code contains some very odd Unicode characters, I uploaded it to my server instead of posting it here, 'cause the forum messes with the chars:

 

http://kronb.org/php/accents.phps

 

Note that in my example I use ~ as the pattern delimiter and pass it to preg_quote() as the second argument, so it gets escaped too. Also, my regex pattern uses the u modifier, which makes PCRE treat the pattern and the subject as UTF-8. That's important since my function handles UTF-8 chars. And yeah, obviously your PHP file needs to be encoded in UTF-8 too.
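To make those two points concrete, here is a small sketch that uses only standard PHP functions. The sample strings are made up for illustration and have nothing to do with the linked script.

```php
<?php
// preg_quote() escapes regex metacharacters. The second argument names the
// pattern delimiter so that it gets escaped as well.
$quoted = preg_quote('a~b.c', '~');
var_dump($quoted); // string(7) "a\~b\.c"

// Without the u modifier PCRE works on bytes, so "é" (two bytes in UTF-8)
// is not a single character for the dot.
var_dump(preg_match('~^.$~', 'é'));  // int(0)

// With the u modifier, pattern and subject are treated as UTF-8.
var_dump(preg_match('~^.$~u', 'é')); // int(1)
```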

 

Basically, my function accents() either 'normalizes' the input string:

 

accents('Ȩḷiséẽ Řeclůs', true):

elisee reclus

 

Or returns an array of all different versions of the input character:

 

accents('a'):

Array
(
    [0] => A
    [1] => a
    [2] => Á
    [3] => á
    [4] => À
    [5] => à
    [6] => Ă
    [7] => ă
    [8] => Ắ
    [9] => ắ
    [10] => Ằ
    [11] => ằ
    [12] => Ẵ
    [13] => ẵ
    [14] => Ẳ
    [15] => ẳ
    [16] => Â
    [17] => â
    [18] => Ấ
    [19] => ấ
    [20] => Ầ
    [21] => ầ
    [22] => Ẫ
    [23] => ẫ
    [24] => Ẩ
    [25] => ẩ
    [26] => Ǎ
    [27] => ǎ
    [28] => Å
    [29] => å
    [30] => Ǻ
    [31] => ǻ
    [32] => Ä
    [33] => ä
    [34] => Ǟ
    [35] => ǟ
    [36] => Ã
    [37] => ã
    [38] => Ȧ
    [39] => ȧ
    [40] => Ǡ
    [41] => ǡ
    [42] => Ą
    [43] => ą
    [44] => Ā
    [45] => ā
    [46] => Ả
    [47] => ả
    [48] => Ȁ
    [49] => ȁ
    [50] => Ȃ
    [51] => ȃ
    [52] => Ạ
    [53] => ạ
    [54] => Ặ
    [55] => ặ
    [56] => Ậ
    [57] => ậ
    [58] => Ḁ
    [59] => ḁ
    [60] => Ⱥ
    [61] => ⱥ
    [62] => ᶏ
    [63] => Ɐ
    [64] => ɐ
    [65] => Ɑ
    [66] => ɑ
)

 

With that functionality you can normalize $name and then build a string with character classes to be used in a regular expression. See the example in the script.
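For anyone who can't reach the script, here is a rough sketch of how the two call styles could be combined with the lookahead from the original question. The accents() below is a hypothetical stand-in that only knows "e" and "u"; the real function covers far more characters and may differ in detail. The (\n|s|\b) group from the question is left out for brevity.

```php
<?php
// Hypothetical stand-in for the accents() function described above. It only
// mimics the two call styles and knows just "e" and "u".
function accents(string $input, bool $normalize = false)
{
    static $table = [
        'e' => ['E', 'e', 'É', 'é', 'È', 'è', 'Ê', 'ê', 'Ë', 'ë'],
        'u' => ['U', 'u', 'Ú', 'ú', 'Ù', 'ù', 'Û', 'û', 'Ü', 'ü'],
    ];
    if (!$normalize) {
        // One character in: return every known variant of it.
        return $table[$input] ?? [$input];
    }
    // Whole string in: map every variant to its base letter, then lowercase.
    $map = [];
    foreach ($table as $base => $variants) {
        foreach ($variants as $variant) {
            $map[$variant] = $base;
        }
    }
    return strtolower(strtr($input, $map));
}

// Build one character class per letter of the normalized name.
$name = 'Élisée Reclus';
$pattern = '';
foreach (preg_split('//u', accents($name, true), -1, PREG_SPLIT_NO_EMPTY) as $ch) {
    $variants = accents($ch);
    $pattern .= count($variants) > 1
        ? '[' . implode('', $variants) . ']'
        : preg_quote($ch, '~');
}

// Same lookahead as in the original question, with ~ as delimiter and
// the i and u modifiers.
$regex = '~\b' . $pattern . '(?=\s|[.,?!;:]\s)~iu';
$text = 'Books by Elisee Reclus and Èlisêè Reclùs are listed.';
$result = preg_replace($regex, '<b>$0</b>', $text);
echo $result, "\n";
```

Here both spellings of the name in $text should end up wrapped in <b> tags, since each "e" and "u" of the normalized name has become a character class.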
