Add a space

haku · June 16, 2009

I need to add a space character to this regex:

/(?:\xEF\xBD[\xA1-\xBF]|\xEF\xBE[\x80-\x9F])/

and this regex as well

/^(\xe3(\x82[\xa1-\xbf]|\x83[\x80-\xb6]|\x83\xbc))*$/

I'm right crap with PCRE regex, and I'm even more crap with hexadecimal regex, so I really don't know how or where to add the space. Can anyone give me a hand?

Zane · June 16, 2009

Spaces are matched with \s.

Have you tried that?

haku · June 16, 2009

Not at all - as I say, I'm crap with these. Where do I add that?

Zane · June 16, 2009

Uh... hmm.

I don't know... wherever you want to match a space.

What's one of the texts that you're trying to match?

haku · June 16, 2009

They're matching strings to check if they are katakana (one of the Japanese alphabets). I got it off the Japanese interwebs. They're both checking each character to see if they are in that alphabet.

thebadbad · June 16, 2009

Well,

/(?:\xEF\xBD[\xA1-\xBF]|\xEF\xBE[\x80-\x9F])/

either matches \xEF, \xBD and \xA1-\xBF (3 bytes total) or \xEF, \xBE and \x80-\x9F (also 3 bytes total). Where do you want to allow whitespace?

/^(\xe3(\x82[\xa1-\xbf]|\x83[\x80-\xb6]|\x83\xbc))*$/

matches \xe3 followed by either \x82 and \xa1-\xbf, \x83 and \x80-\xb6, or \x83 and \xbc (3 bytes total in each case). All of that is matched 0 or more times across the full string. Again: where do you want to allow whitespace?
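Each three-part sequence in these patterns is the UTF-8 encoding of a single character, which is why the patterns walk through the string byte by byte. A minimal sketch for checking this with bin2hex(); the helper name utf8_hex is made up for illustration, and the "\u{...}" string escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper: hex dump of the UTF-8 bytes of a string.
function utf8_hex(string $s): string
{
    return bin2hex($s);
}

echo utf8_hex("\u{30A2}"), "\n"; // fullwidth katakana A: e382a2
echo utf8_hex("\u{FF71}"), "\n"; // halfwidth katakana A: efbdb1
echo utf8_hex("\u{3000}"), "\n"; // zenkaku (ideographic) space: e38080
```

The first result falls inside \xe3\x82[\xa1-\xbf] from the second pattern, and the second inside \xEF\xBD[\xA1-\xBF] from the first.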

haku · June 16, 2009

The first one was missing the *$ - I figured that out after posting this.

I want to allow whitespace anywhere. And actually, I would like to combine the two statements into one if possible.

Any help is muchly appreciated.
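One way to combine the two patterns and allow spaces anywhere, staying at the byte level, is to put every permitted sequence into a single alternation and repeat it. This is only a sketch: it assumes the field is UTF-8, the function name is_katakana_bytes is illustrative, and the * keeps the original behaviour of accepting an empty string (use + to require at least one character).

```php
<?php
// Sketch: one byte-level pattern (no /u modifier) combining both original
// patterns, plus an ASCII space and the zenkaku space (bytes E3 80 80).
function is_katakana_bytes(string $value): bool
{
    $pattern = '/^(?:'
        . '\xe3(?:\x82[\xa1-\xbf]|\x83[\x80-\xb6]|\x83\xbc)' // fullwidth katakana
        . '|\xef\xbd[\xa1-\xbf]|\xef\xbe[\x80-\x9f]'         // halfwidth katakana
        . '|\x20|\xe3\x80\x80'                               // single and double width space
        . ')*$/D';
    return preg_match($pattern, $value) === 1;
}

// Fullwidth katakana, a zenkaku space, then halfwidth katakana.
var_dump(is_katakana_bytes("\u{30C8}\u{30E0}\u{3000}\u{FF84}\u{FF91}"));
// Katakana followed by Latin letters.
var_dump(is_katakana_bytes("\u{30C8}\u{30E0} tom"));
```

The D modifier stops $ from also matching before a trailing newline.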

thebadbad · June 16, 2009

It would really help if you explained what you're trying to do with this regex. And e.g. provide sample haystacks and expected matches/non-matches.

But here's a guess: do you want to allow all the characters found in your patterns, including whitespace, and require the string to be between 0 and 3 in length? Do the allowed characters have to appear in a certain order? Not sure if that would make sense, but that's what I can decipher from your posts.

haku · June 17, 2009

You may not be able to see this, but I am looking to confirm that the entire field consists of only these characters:

アイウエオカキクケコサシスセソタチツテトラリルレロマミムメモナニヌネノワヲンガギグゲゴダヂヅデドザジズゼゾバビブベボパピプペポ　ｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾗﾘﾙﾚﾛﾏﾐﾑﾒﾓﾅﾆﾇﾈﾉﾜｦﾝｶﾞｷﾞｸﾞｹﾞｺﾞﾀﾞﾁﾞﾂﾞﾃﾞﾄﾞｻﾞｼﾞｽﾞｾﾞｿﾞﾊﾞﾋﾞﾌﾞﾍﾞﾎﾞﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ

(including the double width and single width spaces in the middle). There are no length limitations. I just want to make sure that the submitted value contains only characters within this range.

MadTechie · June 17, 2009

Tip #1: use \p{Katakana}

i.e.

<?php
$subject = "アイ"; //<-- forum doesn't like Katakana in code lol
if (preg_match('/^\p{Katakana}+$/iu', $subject)) {
    echo "ok";
} else {
    echo "failed";
}
?>

haku · June 17, 2009

Nice! Thanks MadT. That's much easier to use than what I was using.

I have one last problem to solve with it now - I need to allow zenkaku spaces. This is a double-byte space. The hex code is 8140 (retrieved using bin2hex()). Any idea how I can add this?

haku · June 17, 2009

Actually, I may have answered my own question. I did this:

/^\p{Katakana}|\x8140+$/iu

and it seems to work. Does that look right?

edit: Nope, it doesn't work. It allows this through:

トム　tom

which it shouldn't because of the English.
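The reason the English slips through is operator precedence: | binds more loosely than the anchors, so the pattern reads as "starts with one katakana character" OR "ends with the other sequence". A sketch of the difference grouping makes; the function names are illustrative, and the \x8140 escape is deliberately left as posted so that only the grouping changes.

```php
<?php
// As posted. The top-level alternation splits the pattern into two halves:
//   ^\p{Katakana}   the string starts with one katakana character
//   \x8140+$        the string ends with a particular sequence
// and either half alone is enough for a match.
function matches_as_posted(string $s): bool
{
    return preg_match('/^\p{Katakana}|\x8140+$/iu', $s) === 1;
}

// Grouped, so that ^, + and $ apply to the alternation as a whole.
function matches_grouped(string $s): bool
{
    return preg_match('/^(?:\p{Katakana}|\x8140)+$/iu', $s) === 1;
}

$subject = "\u{30C8}\u{30E0}\u{3000}tom"; // the katakana + "tom" string from the post

var_dump(matches_as_posted($subject)); // accepted because of the leading katakana
var_dump(matches_grouped($subject));   // rejected: "t" fits neither alternative
```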

haku · June 17, 2009

And also, do you know if these also exist?

/^\p{Hiragana}+$/iu
/^\p{Kanji}+$/iu

MadTechie · June 17, 2009

try this

if (preg_match('/^[\p{Katakana}\x8140]+$/iu', $subject)) {

\p{Hiragana} exists.

Kanji... I don't think so!
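A note on the \x8140 escape: 8140 is the Shift_JIS byte pair for the zenkaku space, whereas the u modifier makes PCRE treat pattern and subject as UTF-8, where the same character is the code point U+3000, written \x{3000}. Without braces, \x takes at most two hex digits, so \x8140 is read as \x81 followed by a literal 4 and 0. A sketch of the effect, assuming UTF-8 input; the function names are illustrative.

```php
<?php
// As posted: the class holds \p{Katakana}, U+0081, "4" and "0".
function matches_posted_class(string $s): bool
{
    return preg_match('/^[\p{Katakana}\x8140]+$/iu', $s) === 1;
}

// With the code point in braces, the class holds katakana plus the
// zenkaku space U+3000.
function matches_with_zenkaku_space(string $s): bool
{
    return preg_match('/^[\p{Katakana}\x{3000}]+$/u', $s) === 1;
}

var_dump(matches_posted_class('40'));       // the digits 4 and 0 slip through
var_dump(matches_with_zenkaku_space('40')); // rejected
// Katakana on both sides of a zenkaku space.
var_dump(matches_with_zenkaku_space("\u{30C8}\u{30E0}\u{3000}\u{30C8}\u{30E0}"));
```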

MadTechie · June 17, 2009

I'm not sure, but mb_convert_kana() may help you out.

I'm jumping ship for a bit (it's 4:30am, my bedtime).

Ken2k7 · June 17, 2009

How in the world can you stay up until 4:30!? That's insanity.

haku · June 17, 2009

I stay up till that time every weekend. I would on weeknights too if it weren't for work!

Thanks for the mb_convert_kana tip, Techie, but I'm trying to do this without relying on the mbstring functions, as they aren't enabled at runtime.

MadTechie · June 17, 2009

Ken2k7 wrote: "How in the world can you stay up until 4:30!? That's insanity."

Well, 3:00am is my normal time; any later and I find it hard to get up for work at 8:40am. Of course, I have lie-ins as well!

I just code better at night

thebadbad · June 17, 2009

I tried to match some of your characters with \p{Katakana}, and not all of them matched. The easy solution is to simply add all the characters including a single and double-byte space inside a character class. I also tried that, and it worked.

~^[アイウエオカキクケコサシスセソタチツテトラリルレロマミムメモナニヌネノワヲンガギグゲゴダヂヅデドザジズゼゾバビブベボパピプペポ　ｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾗﾘﾙﾚﾛﾏﾐﾑﾒﾓﾅﾆﾇﾈﾉﾜｦﾝｶﾞｷﾞｸﾞｹﾞｺﾞﾀﾞﾁﾞﾂﾞﾃﾞﾄﾞｻﾞｼﾞｽﾞｾﾞｿﾞﾊﾞﾋﾞﾌﾞﾍﾞﾎﾞﾊﾟﾋﾟﾌﾟﾍﾟﾎﾟ]*$~iuD

I'm not sure the double-byte space is added in there^, because of the forum, so you might have to add that afterwards in your script.

thebadbad · June 17, 2009

This also matches all the characters you provided:

~^[\p{Katakana} 　ﾞﾟ]*$~iuD

But I don't know if it matches other characters too (probably does).
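If the uncertainty about what else \p{Katakana} lets through is a concern, one alternative is to spell out code-point ranges that mirror the byte ranges in haku's original patterns. This is a sketch under the assumption that the input is UTF-8; the function name is illustrative, and the halfwidth range includes the halfwidth punctuation at U+FF61 to U+FF65 because the original byte pattern did.

```php
<?php
// Sketch: explicit code-point ranges instead of \p{Katakana}.
function is_katakana_codepoints(string $value): bool
{
    $pattern = '/^['
        . '\x{30A1}-\x{30F6}' // fullwidth katakana, small A through small KE
        . '\x{30FC}'          // prolonged sound mark
        . '\x{FF61}-\x{FF9F}' // halfwidth forms, including the two voicing marks
        . '\x{3000}\x{20}'    // zenkaku space and ASCII space
        . ']*$/uD';
    return preg_match($pattern, $value) === 1;
}

var_dump(is_katakana_codepoints("\u{FF76}\u{FF9E}"));            // halfwidth KA + voicing mark
var_dump(is_katakana_codepoints("\u{30C8}\u{30E0}\u{3000}tom")); // Latin letters at the end
var_dump(is_katakana_codepoints("\u{3042}"));                    // hiragana A
```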

MadTechie · June 18, 2009

Hi haku,

My Chinese friend 徐曌 (Rick) says

"Kanji" could be "Han", i.e. if you put \p{Han} instead of \p{Kanji}, it should work, as Japanese kanji are actually ancient Chinese characters (AKA: Traditional Chinese).

Also, we were playing with some Chinese today and found that some characters were missing, e.g. commas. To resolve this we used \p{Common} to check that only Chinese was entered.

this is what we used for Chinese

<?php
$subject = "這是中文測試，这是中文测试哦。";
if (preg_match('/^[\p{Han}\p{Common}]+$/iu', $subject)) {
    echo "ok";
} else {
    echo "failed";
}
?>

This is the correct way of dealing with Unicode, instead of just adding the characters.

I hope this helps
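One caveat with \p{Common}: it covers characters shared across scripts, which includes ASCII digits, spaces and punctuation as well as CJK punctuation, so a string with no Han characters at all can still pass. A small sketch, assuming UTF-8 input; the function name passes_han_common is illustrative.

```php
<?php
function passes_han_common(string $s): bool
{
    return preg_match('/^[\p{Han}\p{Common}]+$/u', $s) === 1;
}

// Chinese text with a fullwidth comma and an ideographic full stop.
var_dump(passes_han_common("\u{4E2D}\u{6587}\u{FF0C}\u{6D4B}\u{8BD5}\u{3002}"));
var_dump(passes_han_common('123 !?')); // no Han at all, still accepted
var_dump(passes_han_common('abc'));    // Latin letters are rejected
```

If digits and ASCII punctuation should be refused, listing the wanted punctuation explicitly in the class is the stricter option.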

thebadbad · June 18, 2009

MadTechie wrote: "This is the correct way of dealing with Unicode instead of just adding the characters,"

I resorted to that because the Unicode scripts I tried either didn't match all the characters he wanted to allow or possibly allowed heaps of other characters he didn't specify.

haku · June 19, 2009

Japanese kanji isn't exactly the same as traditional Chinese unfortunately. For the most part they are the same, but there are a few different kanji, so that wouldn't work. Fortunately I already have a regex for kanji, I was just wondering earlier in this thread if a shortcut existed.

This thread though was about katakana, which is a phonetic alphabet that doesn't exist in Chinese at all. There are three alphabets in Japanese (four if you count English letters) - kanji, which is Chinese characters, hiragana, which is phonetic characters used with Japanese words, and katakana which is phonetic characters used with words taken from other languages, or for onomatopoeia, or emphasizing words.
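On the \p{Han} question from earlier in the thread: Unicode's Han script covers the unified CJK ideographs as a whole, so it is not limited to traditional Chinese forms and also includes characters coined in Japan. The flip side is that it cannot tell Japanese kanji apart from Chinese-only characters. A sketch, assuming UTF-8 input; the function name is_han_only is illustrative.

```php
<?php
function is_han_only(string $s): bool
{
    return preg_match('/^\p{Han}+$/u', $s) === 1;
}

var_dump(is_han_only("\u{5CE0}\u{50CD}")); // U+5CE0 and U+50CD, both coined in Japan
var_dump(is_han_only("\u{30C8}\u{30E0}")); // katakana is not Han
```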
