haku Posted June 16, 2009

I need to add a space character to this regex:

/(?:\xEF\xBD[\xA1-\xBF]|\xEF\xBE[\x80-\x9F])/

and to this regex as well:

/^(\xe3(\x82[\xa1-\xbf]|\x83[\x80-\xb6]|\x83\xbc))*$/

I'm right crap with PCRE regex, and I'm even more crap with hexadecimal regex, so I really don't know how/where to add the space into this. Can anyone give me a hand?

Zane Posted June 16, 2009

Spaces are matched with \s. Have you not tried that?

haku Posted June 16, 2009

Not at all - as I say, I'm crap with these. Where do I add that?

Zane Posted June 16, 2009

Uh... hmm, I don't know... wherever you want to match a space. What's one of the texts that you're trying to match?

haku Posted June 16, 2009

They're matching strings to check if they are katakana (one of the Japanese alphabets). I got them off the Japanese interwebs. They're both checking each character to see if it is in that alphabet.

thebadbad Posted June 16, 2009

Well,

/(?:\xEF\xBD[\xA1-\xBF]|\xEF\xBE[\x80-\x9F])/

either matches \xEF, \xBD and \xA1-\xBF (3 chars total) or \xEF, \xBE and \x80-\x9F (also 3 chars total). Where do you want to allow whitespace?

/^(\xe3(\x82[\xa1-\xbf]|\x83[\x80-\xb6]|\x83\xbc))*$/

matches \xe3 followed by either \x82 and \xa1-\xbf, \x83 and \x80-\xb6 or \x83 and \xbc (3 chars total in either case). And all that is matched 0 or more times within the full string. Again - where do you want to allow whitespace?

haku Posted June 16, 2009

The first one was missing the *$ - I figured that out after posting this. I want to allow whitespace anywhere. And actually, I would like to combine the two statements into one if possible. Any help is muchly appreciated.

thebadbad Posted June 16, 2009

It would really help if you explained what you're trying to do with this regex, and e.g. provided sample haystacks and expected matches/non-matches. But here's a guess: Do you want to allow all the characters found in your patterns, including whitespace, and require the string to be between 0 and 3 in length? Do the allowed characters have to appear in a certain order? Not sure if that would make sense, but that's what I can decipher from your posts.

haku Posted June 17, 2009

You may not be able to see this, but I am looking to confirm that the entire field consists of only these characters:

アイウエオカキクケコサシスセソタチツテトラリルレロマミムメモナニヌネノワヲンガギグゲゴダヂヅデドザジズゼゾバビブベボパピプペポ アイウエオカキクケコサシスセソタチツテトラリルレロマミムメモナニヌネノワヲンガギグゲゴダヂヅデドザジズゼゾバビブベボパピプペポ

(including the double width and single width spaces in the middle). There are no length limitations. I just want to make sure that the submitted value is only within this range.

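A minimal sketch of the "combine the two statements and allow spaces" request, staying at the byte level like the two original patterns. It assumes the input is UTF-8 and reads "whitespace anywhere" as "between whole characters"; the function name is hypothetical, and the two space alternatives (E3 80 80 for the double-width space, 20 for the ASCII space) are additions that were not in either original pattern.

```php
<?php
// Hypothetical helper; the name is illustrative and not from the thread.
// Byte-oriented (no /u), so every alternative is spelled as UTF-8 bytes.
function is_katakana_bytes(string $value): bool
{
    $pattern = '/^(?:'
        . '\xE3\x82[\xA1-\xBF]'      // U+30A1..U+30BF full-width katakana
        . '|\xE3\x83[\x80-\xB6\xBC]' // U+30C0..U+30F6 and U+30FC (long vowel mark)
        . '|\xEF\xBD[\xA1-\xBF]'     // U+FF61..U+FF7F half-width forms
        . '|\xEF\xBE[\x80-\x9F]'     // U+FF80..U+FF9F half-width forms
        . '|\xE3\x80\x80'            // U+3000 double-width (zenkaku) space
        . '|\x20'                    // ASCII space
        . ')*$/D';                   // D: "$" matches only at the very end

    return preg_match($pattern, $value) === 1;
}

var_dump(is_katakana_bytes("\u{30C8}\u{30E0}\u{3000}\u{30C8}\u{30E0}")); // bool(true)
var_dump(is_katakana_bytes("\u{30C8}\u{30E0} tom"));                     // bool(false)
```

Like the original first pattern, this also lets through the half-width punctuation at U+FF61 to U+FF65; narrowing \xEF\xBD[\xA1-\xBF] to start at \xA6 would exclude it, if that is wanted.
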
MadTechie Posted June 17, 2009

Tip #1: use \p{Katakana}, i.e.

<?php
$subject = "アイ"; // <-- forum doesn't like Katakana in code lol
if (preg_match('/^\p{Katakana}+$/iu', $subject)) {
    echo "ok";
} else {
    echo "failed";
}
?>

haku Posted June 17, 2009

Nice! Thanks MadT. That's much easier to use than what I was using. I have one last problem to solve with it now - I need to allow zenkaku spaces. This is a double-byte space. The hex code is 8140 (retrieved using bin2hex()). Any idea how I can add this?

haku Posted June 17, 2009

Actually, I may have answered my own question. I did this:

/^\p{Katakana}|\x8140+$/iu

and it seems to work. Does that look right?

edit: Nope, it doesn't work. It allows this through:

トム tom

which it shouldn't because of the English.

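A likely reason "トム tom" gets through: alternation binds more loosely than the anchors, so the pattern reads as "starts with one katakana character" OR "ends with the \x8140+ part", and the first branch is already satisfied by ト. The sketch below groups the alternatives so both anchors apply to every character. It assumes UTF-8 input and writes the double-width space as the code point \x{3000}; the variable names are illustrative.

```php
<?php
$subject = "トム tom";

// As posted: reads as  (^\p{Katakana})  OR  (\x81 "4" "0"+ $).
// The first branch only needs one katakana character at the start.
$loose = '/^\p{Katakana}|\x8140+$/iu';

// Grouped, so ^ and $ constrain the whole string, not one branch each.
$strict = '/^(?:\p{Katakana}|\x{3000})+$/u';

var_dump(preg_match($loose, $subject));                // int(1)
var_dump(preg_match($strict, $subject));               // int(0)
var_dump(preg_match($strict, "トム\u{3000}トム"));      // int(1)
```
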
haku Posted June 17, 2009

And also, do you know if these also exist?

/^\p{Hiragana}+$/iu
/^\p{Kanji}+$/iu

MadTechie Posted June 17, 2009

Try this:

if (preg_match('/^[\p{Katakana}\x8140]+$/iu', $subject)) {

\p{Hiragana} exists. Kanji... I don't think so!

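One caveat with \x8140, sketched below: in PCRE, \x without braces takes at most two hex digits, so inside the class it means the character U+0081 followed by the literal digits 4 and 0. The value 8140 is the Shift_JIS byte pair for the double-width space; under the u modifier the pattern works on Unicode code points, where that space is U+3000. The sketch assumes the subject is UTF-8, and the variable names are illustrative.

```php
<?php
// As posted: the class holds katakana, U+0081, "4" and "0".
$posted = '/^[\p{Katakana}\x8140]+$/u';
var_dump(preg_match($posted, "ア40"));         // int(1): the digits slip through

// bin2hex() reports the bytes of a string in whatever encoding it is in.
var_dump(bin2hex("\u{3000}"));                 // string(6) "e38080" (UTF-8)

// With /u, name the code point itself rather than any byte sequence.
$fixed = '/^[\p{Katakana}\x{3000}]+$/u';
var_dump(preg_match($fixed, "ア\u{3000}イ"));  // int(1)
var_dump(preg_match($fixed, "ア40"));          // int(0)
```
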
MadTechie Posted June 17, 2009

I'm not sure, but mb_convert_kana() may help you out.

I'm jumping ship for a bit (it's 4:30am, my bedtime).

Ken2k7 Posted June 17, 2009

How in the world can you stay up until 4:30!? That's insanity.

haku Posted June 17, 2009

I stay up till that time every weekend. I would on weeknights too if it weren't for work!

Thanks for the mb_convert_kana tip, Techie - but I'm trying to do this without relying on the mbstring functions, as they aren't enabled at runtime.

MadTechie Posted June 17, 2009

"How in the world can you stay up until 4:30!? That's insanity."

Well, 3:00am is my normal time; any later and I find it hard to get up for work at 8:40am... of course I have lie-ins as well! I just code better at night.

thebadbad Posted June 17, 2009

I tried to match some of your characters with \p{Katakana}, and not all of them matched. The easy solution is to simply add all the characters, including a single-byte and a double-byte space, inside a character class. I also tried that, and it worked.

~^[アイウエオカキクケコサシスセソタチツテトラリルレロマミムメモナニヌネノワヲンガギグゲゴダヂヅデドザジズゼゾバビブベボパピプペポ アイウエオカキクケコサシスセソタチツテトラリルレロマミムメモナニヌネノワヲンガギグゲゴダヂヅデドザジズゼゾバビブベボパピプペポ]*$~iuD

I'm not sure the double-byte space is added in there^, because of the forum, so you might have to add that afterwards in your script.

thebadbad Posted June 17, 2009

This also matches all the characters you provided:

~^[\p{Katakana} ゙゚]*$~iuD

But I don't know if it matches other characters too (probably does).

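A possible explanation for the characters that failed, assuming the PCRE build classifies by the Unicode Script property alone: the long vowel mark (U+30FC) and the half-width voiced and semi-voiced marks (U+FF9E, U+FF9F) are assigned to the Common script rather than Katakana, so \p{Katakana} skips them. One way to remove the guesswork is to spell out code point ranges, as sketched below. The function name and the exact ranges are assumptions chosen to cover the characters haku listed plus both spaces, not a definitive list.

```php
<?php
// Hypothetical helper; the name and the exact ranges are assumptions.
function is_katakana_field(string $value): bool
{
    $pattern = '/^['
        . '\x{30A1}-\x{30FA}'  // full-width katakana letters
        . '\x{30FC}'           // long vowel mark
        . '\x{FF66}-\x{FF9F}'  // half-width katakana, incl. voiced marks
        . '\x{3000}\x{0020}'   // double-width space and ASCII space
        . ']*$/uD';

    // preg_match() returns false on invalid UTF-8, which also fails here.
    return preg_match($pattern, $value) === 1;
}

var_dump(is_katakana_field("\u{30AC}\u{30AE}\u{3000}\u{FF76}\u{FF9E}")); // bool(true)
var_dump(is_katakana_field("\u{30C8}\u{30E0} tom"));                     // bool(false)
```

Because the ranges are explicit, nothing outside them can pass, which addresses the worry about a property matching "other characters too". The trade-off is that decomposed input (a base letter followed by a combining mark such as U+3099) would be rejected unless those marks are added to the class.
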
MadTechie Posted June 18, 2009

Hi haku,

My Chinese friend 徐曌 (Rick) says "Kanji" could be "Han", i.e. if you put \p{Han} instead of \p{Kanji}, it should work, as Japanese Kanji is actually ancient Chinese characters (AKA: Traditional Chinese).

Also, we were playing with some Chinese today and found some characters were missing, i.e. commas. To resolve this we used \p{Common}. To check that only Chinese was entered, this is what we used:

<?php
$subject = "這是中文測試,这是中文测试哦。";
if (preg_match('/^[\p{Han}\p{Common}]+$/iu', $subject)) {
    echo "ok";
} else {
    echo "failed";
}
?>

This is the correct way of dealing with Unicode instead of just adding the characters. I hope this helps.

thebadbad Posted June 18, 2009

"This is the correct way of dealing with Unicode instead of just adding the characters,"

I resorted to that because the Unicode scripts I tried either didn't match all the characters he wanted to allow or possibly allowed heaps of other characters he didn't specify.

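The "heaps of other characters" concern can be checked directly for the \p{Han}\p{Common} class. The Common script covers characters shared between scripts, which includes ASCII digits, spaces and most punctuation, so a value with no Chinese in it at all can still pass. A small sketch, using the pattern from the post above minus the i flag (which has no effect here); the variable name is illustrative.

```php
<?php
$pattern = '/^[\p{Han}\p{Common}]+$/u';

var_dump(preg_match($pattern, "這是中文測試,这是中文测试哦。")); // int(1)
var_dump(preg_match($pattern, "123 !?"));  // int(1): no Han characters at all
var_dump(preg_match($pattern, "abc"));     // int(0): Latin letters are rejected
```

If only a handful of punctuation marks are needed, listing them explicitly in the class next to \p{Han} keeps the check tighter than admitting all of \p{Common}.
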
haku Posted June 19, 2009

Japanese kanji isn't exactly the same as traditional Chinese unfortunately. For the most part they are the same, but there are a few different kanji, so that wouldn't work. Fortunately I already have a regex for kanji, I was just wondering earlier in this thread if a shortcut existed.

This thread though was about katakana, which is a phonetic alphabet that doesn't exist in Chinese at all. There are three alphabets in Japanese (four if you count English letters) - kanji, which is Chinese characters, hiragana, which is phonetic characters used with Japanese words, and katakana, which is phonetic characters used with words taken from other languages, or for onomatopoeia, or emphasizing words.