How to keep only alphanumeric and native-language characters using preg_replace?


thara

I need to create an SEO-friendly string that contains only alphanumeric characters and characters from my native language, Sinhala.

My expected string should be something like this:

$myString = "this-is-a-දහසක්-බාධක-දුක්-කම්කටොලු-මැදින්-ලෝකය-දිනන්නට-වෙර-දරන";

I am using the following function to create the string:

function seoUrl($string) {
    //Lower case everything
    $string = strtolower($string);
    //Make alphanumeric (removes all other characters)
    $string = preg_replace("/[^a-z0-9_\s-]/", "", $string);
    //Clean up multiple dashes or whitespaces
    $string = preg_replace("/[\s-]+/", " ", $string);
    //Convert whitespaces and underscore to dash
    $string = preg_replace("/[\s_]/", "-", $string);
    return $string;
}

This function only works for English characters. For the string above, the output is:

$title = seoUrl("this-is-a-දහසක්-බාධක-දුක්-කම්කටොලු-මැදින්-ලෝකය-දිනන්නට-වෙර-දරන");
echo $title; // this-is-a-

I modified the function to use `mb_ereg_replace`, as below:

function seoUrl($string) {
    //Lower case everything
    //$string = strtolower($string);
    //Make alphanumeric (removes all other characters)
    $string = mb_ereg_replace("/[^a-z0-9_\s-]/", "", $string);
    //Clean up multiple dashes or whitespaces
    $string = mb_ereg_replace("/[\s-]+/", " ", $string);
    //Convert whitespaces and underscore to dash
    $string = mb_ereg_replace("/[\s_]/", "-", $string);
    return $string;
}

But it is not working for me.

Can anybody tell me how to modify the above function so that it keeps all my characters, including those of my native language?

I hope somebody can help me out. Thank you.


Thanks for the reply.

Yes, I tried it like that.

Updated version:

function seoUrl($string) {
  //Lower case everything
  $string = strtolower($string);
  //Make alphanumeric (removes all other characters)
  $string = preg_replace("/[^\pL\pN_\s-]/u", "", $string);
  //Clean up multiple dashes or whitespaces
  $string = preg_replace("/[\s-]+/", " ", $string);
  //Convert whitespaces and underscore to dash
  $string = preg_replace("/[\s_]/", "-", $string);
  return $string;
}

$title = seoUrl("this-is-a-දහසක්-බාධක-දුක්-කම්කටොලු-මැදින්-ලෝකය-දිනන්නට-වෙර-දරන");
echo $title;

Output:

this-is-a-දහසක-බධක-දක-කමකටල-මදන-ලකය-දනනනට-වර-දරන

But some parts of the Sinhala characters are missing. Compare the two strings closely and you will notice the difference.


My guess is that the altered characters use combining marks, which \pL does not include. Another \pX property, most likely \pM (marks), should cover them.
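For what it's worth, here is a minimal sketch of that idea. It assumes the missing pieces are Unicode marks (the Sinhala vowel signs and the virama U+0DCA fall in the M category), so it adds \pM to the character class. The function name seoUrlUnicode is made up for this example, and mb_strtolower needs the mbstring extension. One caveat: a zero-width joiner (U+200D), which some Sinhala conjuncts use, is neither a letter nor a mark and would still be stripped.

```php
<?php
// Hypothetical helper name; a sketch, not a drop-in replacement.
function seoUrlUnicode(string $string): string
{
    // Lowercase with a multibyte-aware function (requires mbstring).
    $string = mb_strtolower($string, 'UTF-8');
    // Keep letters (\pL), combining marks (\pM), numbers (\pN),
    // underscore, whitespace and dash; remove everything else.
    $string = preg_replace('/[^\pL\pM\pN_\s-]/u', '', $string);
    // Collapse runs of whitespace and dashes into a single space.
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Convert whitespace and underscores to dashes.
    $string = preg_replace('/[\s_]/u', '-', $string);
    return $string;
}

$title = seoUrlUnicode("this-is-a-දහසක්-බාධක-දුක්-කම්කටොලු-මැදින්-ලෝකය-දිනන්නට-වෙර-දරන");
echo $title, "\n";
```

The only real change from the updated version is the extra \pM in the first pattern. The /u modifier on the other two patterns is there so that all three treat the input as UTF-8 consistently.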

Or you could drop the /u modifier and blindly accept all high bytes (\x80-\xFF). You would then only be able to filter out standard ASCII characters, but maybe that's all you need.
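A rough sketch of that byte-oriented variant, in case it helps. The name seoUrlBytes is invented for this example. It assumes the input is UTF-8, where every byte of a non-ASCII character is in the range \x80-\xFF, so keeping those bytes keeps Sinhala (and any other non-ASCII text) untouched. Whitespace is spelled out as ASCII characters instead of \s, and lowercasing is done with a plain byte map, so neither step depends on the locale.

```php
<?php
// Hypothetical helper name; works on bytes, so no /u modifier is used.
function seoUrlBytes(string $string): string
{
    // Lowercase ASCII letters only; bytes >= 0x80 are left alone.
    $string = strtr($string, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
    // Keep a-z, 0-9, underscore, ASCII whitespace, dash and all high bytes.
    $string = preg_replace('/[^a-z0-9_ \t\r\n\x80-\xFF\-]/', '', $string);
    // Collapse runs of ASCII whitespace and dashes into a single space.
    $string = preg_replace('/[ \t\r\n\-]+/', ' ', $string);
    // Convert ASCII whitespace and underscores to dashes.
    $string = preg_replace('/[ \t\r\n_]/', '-', $string);
    return $string;
}

$title = seoUrlBytes("this-is-a-දහසක්-බාධක-දුක්-කම්කටොලු-මැදින්-ලෝකය-දිනන්නට-වෙර-දරන");
echo $title, "\n";
```

The trade-off is the one already mentioned: non-ASCII punctuation and symbols pass through as well, because the pattern cannot tell them apart from letters.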

