Staggan, posted January 5, 2015:

Hello,

I have a PHP page that sends text entered by a user to our database, which we use to display news. The system supports various languages, but occasionally we get issues with odd characters being entered. For example, the precomposed ellipsis glyph, which is normally typed as three periods, broke our system today.

How can I check that each character is valid and within range? These are our character ranges:

    ExtendedLatin_c_iLowerAlphaChar = 0x00C0;
    ExtendedLatin_c_iUpperAlphaChar = 0x01FF;
    Arabic_c_iLowerChar = 0x600;
    Arabic_c_iUpperChar = 0x6FF;
    Arabic_c_iLowerAlphaChar = 0x621;
    Arabic_c_iUpperAlphaChar = 0x64A;
    Arabic_c_iLowerNumericChar = 0x660;
    Arabic_c_iUpperNumericChar = 0x669;

So each character must fall within one of these ranges, but I have no idea how to get the hex value of a character in PHP.

Thanks

QuickOldCar, posted January 5, 2015:

If you want to support multiple languages and character sets, you should detect the input encoding, convert it to UTF-8 (for example with iconv), and save it to the database that way.

It's tough to cover everything. Someone made an Arabic class to help with it: http://sourceforge.net/projects/ar-php/files/ar-php/

Staggan, posted January 5, 2015:

I can't just get the hex code of each character? Or is this because the code changes depending on the encoding?

Staggan, posted January 5, 2015:

I thought I should explain a little more. Our text is saved into the database as UTF-8, but several characters cause issues when imported into our application, the ellipsis being one of them. Hence we check that each character is within a specific range on import.

So my question is still: how do I get the hex value of a character in PHP? The rest is simple.

Thanks

Jacques1, posted January 5, 2015:

Why does a character like the ellipsis cause trouble? This is not normal. You should rather fix this problem than come up with weird workarounds.

Staggan, posted January 5, 2015:

We support several languages within our application, and our font system has limitations. Those limitations include omitting some valid but less-used characters, which is what allows us to support so many languages in the way we do.

BTW, this is not a web application, but the text is added via a web page.

Jacques1, posted January 5, 2015 (edited January 5, 2015):

I still find it odd to have such severe limitations in today's times. It's some legacy application, I guess?

Anyway, this is not quite that easy. The problem is that there's a difference between the abstract Unicode code point (which is unique) and the actual byte representation (which depends on the character encoding). Your hex values above are not how those characters are represented in UTF-8, so if you literally searched the input for those byte patterns, you wouldn't find anything.

However, there are several solutions which do work:

1. If those ranges have a specific meaning (I'm not familiar with Arabic), you may be able to express them with a Unicode character class within a regular expression. This is by far the most elegant approach, because you don't need to hard-code any byte sequences.
2. You could look up the concrete UTF-8 representation of each character and use byte ranges within a regular expression: [\xUUUUUU-\xVVVVVV\xYYYYYY-\xZZZZZZ]
3. You might (ab)use json_decode() to enter the Unicode code points directly and get back the UTF-8 representation. JSON supports Unicode escape sequences like \u00FF.

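Editor's sketch of the first and third options above, offered as an illustration rather than as anything the posters wrote. The helper name utf8FromCodePoint() is hypothetical, and the \p{Arabic} script property is only an approximation of the asker's ranges: the Arabic script is not identical to the block U+0600 to U+06FF, so whether it fits depends on what the application really accepts.

```php
<?php
// Option 3: build the UTF-8 string for a code point through a JSON \uXXXX escape.
// utf8FromCodePoint() is a hypothetical helper. It only covers the Basic
// Multilingual Plane (U+0000 to U+FFFF) and does not handle the surrogate
// range (U+D800 to U+DFFF).
function utf8FromCodePoint($codePoint)
{
    return json_decode(sprintf('"\u%04x"', $codePoint));
}

// The code point U+2026 (ellipsis) is stored as three bytes in UTF-8,
// which is why searching the raw input for "0x2026" finds nothing.
$ellipsis = utf8FromCodePoint(0x2026);
echo bin2hex($ellipsis), "\n"; // e280a6

// Option 1: a Unicode script property instead of hard-coded ranges.
// Assumption: "any character of the Arabic script" is close enough to
// what the application supports.
$arabicOnly = '/\A\p{Arabic}+\z/u';
var_dump(preg_match($arabicOnly, utf8FromCodePoint(0x062B))); // int(1)
var_dump(preg_match($arabicOnly, $ellipsis));                 // int(0)
```

The bin2hex() line makes the code point versus byte representation distinction concrete: one code point, three bytes.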
Staggan, posted January 6, 2015:

The Unicode solution sounds possible. Can you give me some more information on that?

Thanks

Jacques1, posted January 7, 2015:

Which of the solutions?

Staggan, posted January 7, 2015:

I have rechecked with our coders, and we actually compare each character's Unicode value against the ranges I specified in my first post.

So, how can I confirm in PHP that each character falls within one of those ranges?

Jacques1, posted January 7, 2015:

Um, that's exactly what I'm trying to help you with. What do you think the whole discussion was about?

If you want to validate the data with PHP (which you apparently do), then you have to use the features of PHP. It's great that your desktop coders have some Unicode range check they can use. We don't. PHP can do byte ranges, sure, but then you have to use the UTF-8 representations, not the code points. So that's the solution you've chosen?

By the way, what's the whole point of the last two ranges when they're already covered by the second one?

Staggan, posted January 7, 2015:

I appreciate your help. Sorry, maybe I got a little confused by the answer. Let me explain a little more; perhaps it will help.

Our app reads the latest news on login and displays it to the user. This news could be in a multitude of languages. We limit the characters for each language so that we can handle so many in a single Windows app without ever needing to use the Windows font systems, as they have a number of additional limitations.

To put the news into the database we use a PHP page, but occasionally the person responsible for the news in a particular language manages to use a character we do not support, which causes an issue. Our app uses the ranges above, which are the Unicode numerical representation of the characters we allow, at least in Windows. I thought PHP might allow me a similar way to make the comparison, but it seems there is nothing quite that simple.

As for your last point, I see it, and can only assume it is an oversight somehow. I will point it out.

Thanks

Jacques1, posted January 7, 2015:

Again, you can do this in PHP, and I've offered three possible approaches. Now it's up to you to pick one.

If the encoding will always be UTF-8, then this is actually very easy. There's even a special escape sequence which converts code points to UTF-8 sequences, so you don't have to do that yourself:

    <?php

    // only accept characters from the Arabic block and some characters from the Latin blocks
    $character_validation_pattern = '/\\A[\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]+\\z/u';

    // should match
    var_dump( preg_match($character_validation_pattern, 'ĕث') );

    // should not match
    var_dump( preg_match($character_validation_pattern, 'ab') );

Staggan, posted January 7, 2015:

Can you clarify how this works for me? As I do not understand it...

Thanks

Jacques1, posted January 7, 2015:

This is a regular expression. It defines a pattern which the input must match:

- \A is the beginning of the string.
- [...] is a character from a certain character set; for example, [a-z] is a lowercase character from the Latin alphabet (you may read it as “a to z”).
- \x{...} is the UTF-8 representation of a certain Unicode code point; for example, \x{00C0} is the UTF-8 sequence of the “latin capital letter A with grave”.
- + means that the pattern should be repeated once or more.
- \z is the end of the string.
- u turns the regular expression into “UTF-8 mode”.

So the pattern simply says: a sequence of at least one UTF-8 encoded character from the Unicode ranges U+00C0 to U+01FF and U+0600 to U+06FF.

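Editor's sketch building on the pattern explained above: negating the character class finds the offending characters, and decoding each one yields the hex code point the original question asked about. Both helper names (codePointOf() and findUnsupportedChars()) are hypothetical, and the ranges are simply the ones quoted in this thread, so ordinary ASCII text and spaces are reported as unsupported too.

```php
<?php
// Hypothetical helper: decode one valid UTF-8 character to its code point.
// (On PHP 7.2+ with the mbstring extension, mb_ord() does the same job.)
function codePointOf($char)
{
    $bytes = array_values(unpack('C*', $char));
    $first = $bytes[0];
    if ($first < 0x80) {
        return $first; // single-byte (ASCII) character
    }
    // Keep only the payload bits of the lead byte (2-, 3- or 4-byte sequence).
    $codePoint = $first & ($first < 0xE0 ? 0x1F : ($first < 0xF0 ? 0x0F : 0x07));
    for ($i = 1; $i < count($bytes); $i++) {
        // Each continuation byte contributes its low 6 bits.
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }
    return $codePoint;
}

// Hypothetical helper: list every character outside the allowed ranges,
// keyed by the character, with its code point formatted as U+XXXX.
function findUnsupportedChars($text)
{
    $outsideRanges = '/[^\x{00C0}-\x{01FF}\x{0600}-\x{06FF}]/u';
    if (preg_match_all($outsideRanges, $text, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    $report = array();
    foreach (array_unique($matches[0]) as $char) {
        $report[$char] = sprintf('U+%04X', codePointOf($char));
    }
    return $report;
}

// "\xC4\x95" is U+0115 and "\xD8\xAB" is U+062B (both allowed);
// "\xE2\x80\xA6" is the ellipsis, U+2026 (not allowed).
print_r(findUnsupportedChars("\xC4\x95\xD8\xAB\xE2\x80\xA6"));
```

For the sample input, the report contains a single entry mapping the ellipsis to U+2026. A report like this could be shown to the news editor so they know which character to replace, instead of only rejecting the whole text.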
hansford, posted January 7, 2015:

> Can you clarify how this works for me? As I do not understand it...

Over my head as well. Jacques, could you give a concrete example in PHP?

voodooKobra, posted January 7, 2015:

It's probably worth noting that \xBB is for byte literals (most usually ASCII), while \uBBBB is meant for Unicode characters (UTF-8).

Jacques1, posted January 7, 2015:

PHP doesn't have a \u escape sequence. There's no such thing. What it does have is a \x{...} sequence for multibyte characters (as I already explained above).

voodooKobra, posted January 8, 2015:

> PHP doesn't have a \u escape sequence. There's no such thing. What it does have is a \x{...} sequence for multibyte characters (as I already explained above).

Oh, you're right. For some reason I thought it did.

Staggan, posted January 8, 2015:

I have tried the above suggestion. This is what I do:

    function validateString($string)
    {
        $character_validation_pattern = '/\\A[\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]+\\z/u';

        if (preg_match($character_validation_pattern, $string)) {
            return true;
        }

        return false;
    }
