
[SOLVED] help with regexp on a multibyte string


itaym02


I have the following string:

PHP Code:

$text="א אב אבי אביהו מדינה שול של";

I want to append 'אאא' to every word shorter than 4 characters, so that the string becomes:

"אאאא אבאאא אביאאא אביהו מדינה שולאאא שלאאא"

 

The code I am using is:

PHP Code:

    $text="א אב אבי אביהו מדינה שול של";
    $pattern='/\s(.{1,6})\s/';
    $text=preg_replace($pattern,' $1אאא ',$text);
    echo $text;

Which results in:

א אבאאא אבי אביהו מדינה שולאאא של

 

 

Problems:

1. It seems word boundary is not recognized (hence my use of \s).

2. Why was the אבי not replaced?
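Both symptoms can be reproduced in isolation. The following is only a minimal sketch, assuming the script file is saved as UTF-8 and the mbstring extension is loaded. Without the u modifier PCRE works on bytes, and each Hebrew letter takes two bytes in UTF-8, so .{1,6} covers at most three letters, and \b finds no boundary because (in the default locale) those bytes are not "word" characters. Separately, every match of \s(...)\s consumes its trailing space, so the word right after a match has no leading whitespace left to match.

```php
<?php
// Assumes this file is saved as UTF-8 and mbstring is available.
$word = 'אבי';
$bytes = strlen($word);              // counts bytes: 6
$chars = mb_strlen($word, 'UTF-8');  // counts characters: 3
echo $bytes, ' bytes, ', $chars, " characters\n";

// Without /u, "." matches one byte; with /u it matches one code point.
$byteMode = preg_match('/^.{3}$/', $word);   // 0: the subject is 6 bytes long
$utf8Mode = preg_match('/^.{3}$/u', $word);  // 1: the subject is 3 code points long
var_dump($byteMode, $utf8Mode);

// The trailing \s of each match swallows the space the next word would need,
// so only אב and שול are captured and אבי is skipped.
preg_match_all('/\s(.{1,6})\s/', 'א אב אבי אביהו מדינה שול של', $m);
print_r($m[1]);
```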

Finally got it working after a lot of tweaking:

 

<?php
header('Content-type: text/plain; charset=utf-8');

$text = 'א אב אבי אביהו מדינה שול של';
$add = 'אאא';

$text = preg_replace('~\S+~ue', "(mb_strlen('$0', 'utf-8') < 4) ? '$0$add' : '$0'", $text);
echo $text;
?>

 

Using a curly bracket quantifier inside the pattern didn't work properly for me, so I'm grabbing each word (\S+: any run of non-whitespace characters) and then checking its length with mb_strlen() inside the replacement. Two pattern modifiers matter here: u makes PCRE treat the pattern and the subject as UTF-8, and e evaluates the replacement string as PHP code. Note that e was deprecated in PHP 5.5.0 and removed in PHP 7.0.0; preg_replace_callback() is the replacement for it.
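For anyone on PHP 7 or later, the same idea can be written without the e modifier. This is only a sketch under a few assumptions: the file is saved as UTF-8, mbstring is loaded, and the function names pad_short_words() and pad_short_words_regex() are made up for illustration. The second function shows that a curly-bracket quantifier can also do the job once the u modifier is on and lookarounds are used instead of \s, so that no whitespace gets consumed.

```php
<?php
// Hypothetical helper: append $add to every word shorter than $minLen characters.
function pad_short_words(string $text, string $add, int $minLen = 4): string
{
    return preg_replace_callback(
        '~\S+~u',  // each run of non-whitespace, matched as UTF-8
        function (array $m) use ($add, $minLen): string {
            // mb_strlen() counts characters, not bytes
            return mb_strlen($m[0], 'UTF-8') < $minLen ? $m[0] . $add : $m[0];
        },
        $text
    );
}

// Hypothetical pure-regex variant, fixed at "fewer than 4 characters".
// (?<!\S) and (?!\S) assert "no non-space on this side" without consuming it.
// Assumes $add contains no "$" or backslash, which preg_replace would interpret.
function pad_short_words_regex(string $text, string $add): string
{
    return preg_replace('~(?<!\S)\S{1,3}(?!\S)~u', '$0' . $add, $text);
}

$text = 'א אב אבי אביהו מדינה שול של';
echo pad_short_words($text, 'אאא'), "\n";
echo pad_short_words_regex($text, 'אאא'), "\n";
// both lines: אאאא אבאאא אביאאא אביהו מדינה שולאאא שלאאא
```

The callback version is the more flexible of the two, since the length test can be any PHP expression; the lookaround version keeps everything inside the pattern.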

 

Edit: Unicode chars didn't display properly. Fixed by removing tags.
