[SOLVED] How to strip out accented characters from a text string?


Hi, I have a website where people sign up using a form, and sometimes their name contains a French accented character. Unfortunately, when the PHP script processes the record, the name gets mangled and doesn't display properly.

 

Is there an easy script or function to strip out French accented characters from a string?

 

Thanks for your help,

Mikey

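<?php
// Transliterates ISO-8859-1 (Latin-1) accented characters to plain ASCII.
// The input must be single-byte Latin-1; UTF-8 input will not match these bytes.
function transcribe($string) {
    // One-to-one replacements: byte N of the first string maps to byte N of the second.
    $string = strtr($string,
        "\xA1\xAA\xBA\xBF\xC0\xC1\xC2\xC3\xC5\xC7"
      . "\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD0\xD1"
      . "\xD2\xD3\xD4\xD5\xD8\xD9\xDA\xDB\xDD\xE0"
      . "\xE1\xE2\xE3\xE5\xE7\xE8\xE9\xEA\xEB\xEC"
      . "\xED\xEE\xEF\xF0\xF1\xF2\xF3\xF4\xF5\xF8"
      . "\xF9\xFA\xFB\xFD\xFF",
        "!ao?AAAAAC"
      . "EEEEIIIIDN"
      . "OOOOOUUUYa"
      . "aaaaceeeei"
      . "iiidnooooo"
      . "uuuyy");
    // One-to-many replacements (umlauts, ligatures, thorn, sharp s).
    $string = strtr($string, array("\xC4"=>"Ae", "\xC6"=>"AE", "\xD6"=>"Oe", "\xDC"=>"Ue", "\xDE"=>"TH", "\xDF"=>"ss", "\xE4"=>"ae", "\xE6"=>"ae", "\xF6"=>"oe", "\xFC"=>"ue", "\xFE"=>"th"));
    return $string;
}
// usage:
$data = "\xC0orvar"; // "Àorvar" as ISO-8859-1 bytes, independent of this file's own encoding
print transcribe($data); // prints "Aorvar"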

You shouldn't try to strip them out. Instead, use one character encoding consistently for your database storage, your PHP scripts, and your HTML document. UTF-8 should work. Check your HTML headers:

 

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

 

If that doesn't work, send the header from PHP (place this at the top of each script, preferably in a common include, before any output is sent):

 

header('Content-Type: text/html; charset=UTF-8');
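Declaring UTF-8 only helps if the bytes you output really are UTF-8. As a rough sketch, assuming the mbstring extension is available and that any input which is not valid UTF-8 is ISO-8859-1 (both are assumptions, and `to_utf8` is a made-up helper name), legacy strings could be normalised like this:

```php
<?php
// Hypothetical helper: returns $input as UTF-8.
// Assumption: anything that is not valid UTF-8 is encoded as $assumedLegacy.
function to_utf8(string $input, string $assumedLegacy = 'ISO-8859-1'): string
{
    // Valid UTF-8 is passed through unchanged.
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    // Otherwise re-encode from the assumed legacy encoding.
    return mb_convert_encoding($input, 'UTF-8', $assumedLegacy);
}

// usage: Latin-1 bytes in, UTF-8 bytes out
$name = to_utf8("Ren\xE9e");
```

This is only a heuristic: a short Latin-1 string can happen to be valid UTF-8, so knowing the real source encoding is always better than guessing.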

I agree with "don't just strip them out". If you do want to stick with the "get rid of them" route rather than the "make sure they display" route, I would let the user re-enter the name in an accepted format instead of silently stripping characters: check whether such characters are present, tell the user they can't be used, and ask for the name again.

 

if(!preg_match('~^[a-z ]+$~i',$name)) {
  // name has more than a-z, A-Z or space, give error
}
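If the aim is to reject junk while still accepting accented names, a Unicode-aware variant of the check above is one option. This is a sketch, not a complete validator: it assumes the name arrives as UTF-8, the function name `is_valid_name` is made up, and the set of allowed punctuation (space, apostrophe, hyphen) is my own guess at what names need.

```php
<?php
// Hypothetical validator: accepts names made of Unicode letters,
// plus spaces, apostrophes and hyphens after the first letter.
function is_valid_name(string $name): bool
{
    // \p{L} matches any Unicode letter; the "u" modifier makes PCRE
    // treat pattern and subject as UTF-8. Invalid UTF-8 makes
    // preg_match() return false, which is rejected here as well.
    return preg_match('~^\p{L}[\p{L} \'-]*\z~u', $name) === 1;
}

// usage
if (!is_valid_name("Jos\u{E9}")) {
    // name contains something other than letters, space, ' or -
}
```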

 

 

*snip*

 

Ø and Å aren't O and A, but OE and AA. Same goes for their lower case variants.

 

There is another problem with your function. Take the word "Æble" (apple in Danish): it should be transliterated as "Aeble". In all caps, however, "ÆBLE" should become "AEBLE". Your function doesn't take that into account; it would have to look at the case of the surrounding characters as well. Otherwise you could end up with something like "AeBLE" or "AEble", both of which look stupid. The same goes for all the other letters that stand for more than one letter.
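The case handling described above could be sketched roughly as follows. It assumes UTF-8 input and PHP 7 or later, covers only the three Danish letters mentioned in this thread, and uses a simple rule of my own choosing: a capital that sits next to another capital letter expands to two capitals, otherwise only the first letter is capitalised. The function name is made up.

```php
<?php
// Hypothetical helper: case-aware expansion of Æ, Ø and Å (UTF-8 input).
function transliterate_danish(string $utf8): string
{
    $upper = ["\u{C6}" => 'AE', "\u{D8}" => 'OE', "\u{C5}" => 'AA'];
    $title = ["\u{C6}" => 'Ae', "\u{D8}" => 'Oe', "\u{C5}" => 'Aa'];
    $lower = ["\u{E6}" => 'ae', "\u{F8}" => 'oe', "\u{E5}" => 'aa'];

    // Step 1: a capital preceded or followed by another capital letter
    // is treated as part of an all-caps word and becomes two capitals.
    $result = preg_replace_callback(
        '~(?<=\p{Lu})[\x{C6}\x{D8}\x{C5}]|[\x{C6}\x{D8}\x{C5}](?=\p{Lu})~u',
        function (array $m) use ($upper): string {
            return $upper[$m[0]];
        },
        $utf8
    );
    // preg_replace_callback() returns null on invalid UTF-8; fall back to the input.
    $result = $result ?? $utf8;

    // Step 2: remaining capitals get title case, lower-case letters stay lower case.
    return strtr($result, $title + $lower);
}

// usage
print transliterate_danish("\u{C6}BLE og \u{C6}ble"); // prints "AEBLE og Aeble"
```

A single capital standing alone in an otherwise all-caps sentence is not detected by this rule, so a real implementation would need to look at more context.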

 

Niel and CV, it could be the case that he needs it in a URL (e.g. /user/Daniel). In that case he might only want the letters A-Z without any sort of diacritics.
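For the URL case, something along these lines might do. It is a minimal sketch that assumes UTF-8 input; the function name and the small replacement table are purely illustrative and cover only a handful of French letters, so anything not in the table simply turns into a hyphen.

```php
<?php
// Hypothetical helper: turns a UTF-8 name into a lower-case a-z/0-9 URL segment.
function name_to_slug(string $utf8Name): string
{
    // Illustrative table only; extend it for the letters you expect.
    $map = [
        "\u{E0}" => 'a', "\u{E2}" => 'a', "\u{E7}" => 'c',
        "\u{E8}" => 'e', "\u{E9}" => 'e', "\u{EA}" => 'e', "\u{EB}" => 'e',
        "\u{EE}" => 'i', "\u{EF}" => 'i', "\u{F4}" => 'o',
        "\u{F9}" => 'u', "\u{FB}" => 'u',
        "\u{C0}" => 'A', "\u{C7}" => 'C', "\u{C8}" => 'E', "\u{C9}" => 'E',
    ];
    $ascii = strtolower(strtr($utf8Name, $map));

    // Collapse every run of characters outside a-z and 0-9 into one hyphen.
    $slug = preg_replace('~[^a-z0-9]+~', '-', $ascii);

    return trim($slug, '-');
}

// usage
print '/user/' . name_to_slug("Ren\u{E9}e Lef\u{E8}vre"); // prints "/user/renee-lefevre"
```

Slugs produced this way are not guaranteed to be unique, so the original name should still be stored and displayed as entered.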

Well, that's easy to say for someone whose primary language is English. English doesn't really use other characters than a through z. Many other languages use various diacritics to give different meanings. Compare these words in Spanish for instance: año (year) vs. ano (anus), papá (dad) vs. papa (potato (or pope)).
