PHP + preg + international chars problem


29 replies to this topic

#21 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 08 September 2006 - 08:26 PM

This certainly looks promising! I'll have a good look and report back...

Thanks for all your effort, effigy :)

#22 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 09 September 2006 - 06:20 PM

Hello again!

How would I translate something like this:

preg_replace("/(.*)\b(".($a).")(.*)/ui", "$1___$2____$3", ($b));

to use \x etc? I mean, when the pattern contains some chars that are obtained dynamically, what do I do?

To make it even more difficult (!), what do I need to put to match both Greek and Latin (English) characters? So it would match for $a and $b being either both Greek, or both English?

:-\

#23 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 10 September 2006 - 07:09 AM

How would I translate something like this to use \x etc?

Depends on what you're trying to do... can you expand? Also, without the proper locale set, I wouldn't use \b.

what do I need to put to match both greek and latin (english) characters?

Simply add another code point range. The Basic Latin chart goes from 0000 to 007F; therefore, to match Greek and Latin, use /([\x{0370}-\x{03FF}\x{0000}-\x{007F}])/u.
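
A minimal sketch of that two-range class in use. The function name isGreekOrLatin() is hypothetical, not something from this thread, and the pattern is anchored so that the whole string must fall inside the two ranges. Note that 0000 to 007F covers all of ASCII, so digits, spaces, and punctuation pass as well as letters.

```php
<?php
// Hypothetical helper: true when every character of the UTF-8 string $s
// lies in U+0370-U+03FF (Greek and Coptic) or U+0000-U+007F (Basic Latin).
function isGreekOrLatin(string $s): bool
{
    // /u makes PCRE treat both pattern and subject as UTF-8.
    return preg_match('/\A[\x{0370}-\x{03FF}\x{0000}-\x{007F}]+\z/u', $s) === 1;
}

$samples = [
    'hello',             // Basic Latin only
    "\xCE\xB1\xCE\xB2",  // Greek alpha + beta, written as UTF-8 bytes
    "\xD0\xB6",          // Cyrillic zhe, outside both ranges
];

foreach ($samples as $s) {
    echo $s, ' => ', isGreekOrLatin($s) ? 'match' : 'no match', "\n";
}
```
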
Regexp | Unicode Article | Letter Database
/\A(e)?((1)?ff(?:(?:ig)?y)?|f(?:ig)?)\z/

#24 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 10 September 2006 - 12:36 PM

OK here's what I need to do:

Given $_GET['searchLetter'], I perform a full-text (FT) search in my DB for all words/phrases that contain at least one word that starts with that letter. For example, if $_GET['searchLetter'] = 'a', then the search would return 'A dog', 'some phrase with ALetter', and so on...

I then want to apply some css to highlight that letter (well, actually it can be a word or part of a word). With normal preg, I'd do it this way:

$word = preg_replace("/(.*)\b(".($_GET['searchLetter']).")(.*)/i", "$1<span class=\"highlightmatch\">$2</span>$3", $word);

which works fine for english chars (and greek chars, on my local PC).

Thanks once again for all the help :)

#25 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 11 September 2006 - 02:10 PM

Try this:

<meta charset="utf-8"/>
<pre>
<?php
	
	### Create the "GREEK SMALL LETTER ALPHA" character from its UTF-8 bytes.
	### "C*" packs unsigned bytes; 0xCE is out of range for a signed "c".
	$alpha = pack("C*", 0xCE, 0xB1);
	### Create a string using the alpha, and show it before the replace.
	echo $string = "{$alpha} string with {$alpha}n {$alpha}lph{$alpha} ch{$alpha}r{$alpha}cter: {$alpha}bc, {$alpha}{$alpha}{$alpha}.";
	### Separate the before and after output.
	echo '<br/><br/>';
	### Run replace and highlight.
	echo $string = preg_replace('/(?<=\p{Z})(' . $alpha . ')(?=.)/u', '<b><u>\1</u></b>', $string);
?>
</pre>

It's my understanding that the /u modifier affects the . as well, so that it matches a whole UTF-8 character rather than a single byte. You'll want to run this through more tests.
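
A sketch of how the same idea might be applied to a search term that arrives dynamically, for example from $_GET['searchLetter']. The function name highlightWordStart() is hypothetical. Two things here are assumptions rather than part of the pattern above: the term is passed through preg_quote() so that regex metacharacters in user input are taken literally, and the lookbehind is swapped for a negative one, (?<![\p{L}\p{N}]), so that a term at the very start of the string is also highlighted.

```php
<?php
// Hypothetical helper: wrap $term in a highlight span wherever it begins
// a word in the UTF-8 string $text.
function highlightWordStart(string $term, string $text): string
{
    // Escape regex metacharacters; the second argument also escapes
    // the "/" delimiter used below.
    $quoted = preg_quote($term, '/');

    // (?<![\p{L}\p{N}])  not preceded by a letter or digit; this also
    //                    holds at the very start of the string.
    // i                  case-insensitive, as in the original pattern.
    // u                  treat pattern and subject as UTF-8.
    $pattern = '/(?<![\p{L}\p{N}])(' . $quoted . ')/iu';

    // preg_replace() returns null on error; fall back to the input.
    return preg_replace($pattern, '<span class="highlightmatch">$1</span>', $text) ?? $text;
}

$alpha = "\xCE\xB1"; // GREEK SMALL LETTER ALPHA as UTF-8 bytes
echo highlightWordStart('a', 'A dog and a banana'), "\n";
echo highlightWordStart($alpha, "{$alpha}bc x{$alpha} {$alpha}"), "\n";
```

The alpha inside "xα" is left alone because a letter precedes it, while the ones at the start of the string and after a space are wrapped.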

#26 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 11 September 2006 - 02:30 PM

So are you saying I should "rebuild" every word I get off the DB to UTF8?

That seems a lot of processing :/

Even if I did that, how do I replace say 'α' with $alpha dynamically?

#27 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 11 September 2006 - 02:38 PM

I'm just using pack() to create the character for the sake of example; try it as you normally would, without extra processing. Also, the above code is not block/language/chart specific. The \p{...} escapes are basically Unicode classes, in the same spirit as \b, \w, and \s.

#28 yaba

yaba
  • Members
  • PipPip
  • Member
  • 27 posts

Posted 11 September 2006 - 03:32 PM

Right, roger that :)

Could you elaborate a bit on the pattern you used please?

I don't think I've encountered \p before, for example. And those '?<=' and '?=.' bits you used... And correct me if I'm wrong, the \1 is what is captured by the first set of parentheses, right? I.e. (?<=\p{Z})

I don't quite get it, too advanced for me.

Any place I could check these out with examples etc? Couldn't find anything... Or some feedback? Thanks! ;)

#29 effigy

effigy
  • Staff Alumni
  • Advanced Member
  • 3,600 posts
  • LocationIL

Posted 11 September 2006 - 04:10 PM

\p{property} matches Unicode characters that have the property, whereas \P{property} matches Unicode characters that do not have the property. (This is a common syntax for specifying "match" and "don't match"; compare to \s and \S, \w and \W.) The "Z" property is for "Separators," which "mark the boundaries between units of text." The (?<=...) and (?=...) are lookarounds, specifically, a positive lookbehind and a positive lookahead. They look--they don't match. You can find more information on these from the links in my signature.

Therefore, the pattern results in:

/
	(?<=\p{Z}) ### Make sure the next character is preceded by a separator.
	(' . $alpha . ') ### Match the character.
	(?=.) ### Make sure the character is followed by another character, e.g., match "an" but not "a".
/xu
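
A small sketch of what "they look, they don't match" means in practice, assuming the same pattern and a made-up subject string. It also shows that \1 refers to the first capturing group, which is the alpha itself: a lookbehind such as (?<=\p{Z}) is not a capturing group, so it is not counted.

```php
<?php
$alpha   = "\xCE\xB1";              // GREEK SMALL LETTER ALPHA as UTF-8 bytes
$subject = "x {$alpha}n {$alpha}";  // reads as: x, space, alpha, n, space, alpha
$pattern = '/(?<=\p{Z})(' . $alpha . ')(?=.)/u';

preg_match_all($pattern, $subject, $m);

// $m[0] holds the whole matches and $m[1] holds capture group 1.
// They are identical: the space checked by the lookbehind and the "n"
// checked by the lookahead are not part of the match.
var_dump($m[0] === $m[1]); // bool(true)

// Only the alpha before "n" qualifies; the last alpha has no character
// after it, so (?=.) fails there.
$replaced = preg_replace($pattern, '[\1]', $subject);
echo $replaced, "\n";
```
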


#30 kevins

kevins
  • New Members
  • Pip
  • Newbie
  • 2 posts

Posted 25 September 2006 - 08:10 AM

If your server does not fully support Unicode, you can test the server's response here and work out some workarounds accordingly:

http://www.nottodoli...turkishTest.php

The script is here. ( Just replace <> with < ):
http://www.nottodoli...urkishTest.html
http://www.nottodolist.com/test2.html




