Validate input

Staggan · January 5, 2015

Hello

I have a PHP page that sends text entered by a user to our database which we use to display news. This system supports various languages but occasionally we get issues with odd characters being entered...

For example, the precomposed ellipsis glyph (…), which is normally typed as three periods, broke our system today.

How can I check that each character is valid and within range?

These are our character ranges

ExtendedLatin_c_iLowerAlphaChar = 0x00C0;

ExtendedLatin_c_iUpperAlphaChar = 0x01FF;

Arabic_c_iLowerChar = 0x600;

Arabic_c_iUpperChar = 0x6FF;

Arabic_c_iLowerAlphaChar = 0x621;

Arabic_c_iUpperAlphaChar = 0x64A;

Arabic_c_iLowerNumericChar = 0x660;

Arabic_c_iUpperNumericChar = 0x669;

So each character must fall within one of these ranges... but I have no idea how to get the hex value of a character in PHP

Thanks

QuickOldCar · January 5, 2015

If you want to support multiple languages and character sets, you should detect the input encoding, convert it to UTF-8, and save it to the database that way.

iconv

It's tough to handle everything yourself; someone wrote an Arabic class to help with it:

http://sourceforge.net/projects/ar-php/files/ar-php/

Staggan · January 5, 2015

I can't just get the hex code of each character? Or is this because the code changes depending on the encoding?

Staggan · January 5, 2015

I thought I should explain a little more..

Our text is saved as UTF-8 in the database, but several characters cause issues when imported into our application, the ellipsis being one of them. Hence we check that each character is within a specific range on import.

So, my question is still how do I get the hex value of a character in PHP...

The rest is simple.

Thanks

Jacques1 · January 5, 2015

Why does a character like the ellipsis cause trouble? This is not normal. You should rather fix this problem than come up with weird workarounds.

Staggan · January 5, 2015

We support several languages within our application and our font system has limitations. Those limitations include omitting some valid but less used characters to allow us to support so many languages in the way we do...

BTW, this is not a web application, but the text is added via a web page...

Jacques1 · January 5, 2015

I still find it odd to have such severe limitations these days. It's some legacy application, I guess?

Anyway, this is not quite that easy. The problem is that there's a difference between the abstract Unicode codepoint (which is unique) and the actual byte representation (which depends on the character encoding).
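To make the distinction concrete, here is a minimal sketch that prints both values for the ellipsis character. It assumes PHP 7.2 or later with the mbstring extension (mb_ord does not exist in older versions), and the helper names codePointHex and utf8BytesHex are made up for illustration.

```php
<?php
// Assumption: PHP 7.2+ with mbstring enabled (needed for mb_ord).

// Hypothetical helper: formats the Unicode code point of a single UTF-8 character.
function codePointHex(string $char): string
{
    return sprintf('U+%04X', mb_ord($char, 'UTF-8'));
}

// Hypothetical helper: hex dump of the raw bytes of a string.
function utf8BytesHex(string $text): string
{
    return strtoupper(bin2hex($text));
}

// The single-glyph ellipsis, written as its raw UTF-8 bytes.
$ellipsis = "\xE2\x80\xA6";

echo codePointHex($ellipsis), "\n"; // U+2026 (the abstract code point)
echo utf8BytesHex($ellipsis), "\n"; // E280A6 (the three bytes UTF-8 uses for it)
```

The code point U+2026 is what a range such as 0x0600 to 0x06FF refers to; the byte string E2 80 A6 is what PHP actually sees in the submitted text.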

Your hex strings above are not how those characters are represented in UTF-8, so if you literally searched the input for the byte patterns, you wouldn't find anything. However, there are several solutions which do work:

If those ranges have a specific meaning (I'm not familiar with Arabic), you may be able to express them with a Unicode character class within a regular expression. This is by far the most elegant approach, because you don't need to hard-code any byte sequences.
You could look up the concrete UTF-8 representation of each character and use byte ranges within a regular expression: [\xUUUUUU-\xVVVVVV\xYYYYYY-\xZZZZZZ]
You might (ab)use json_decode() to enter the Unicode code points directly and get back the UTF-8 representation. JSON supports Unicode escape sequences like \u00FF.
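As a rough sketch of the first and third options: json_decode() can turn a \uXXXX escape into the corresponding UTF-8 bytes, and PCRE offers script classes such as \p{Arabic} when the u modifier is set. The helper name utf8FromCodePoint is hypothetical, and note that \p{Arabic} selects characters by Unicode script, which is not the same set as the U+0600 to U+06FF block, so it may or may not fit the ranges in question.

```php
<?php
// Hypothetical helper: UTF-8 string for a code point given as four hex digits.
// Only covers the Basic Multilingual Plane; invalid input makes json_decode
// return null, which would trigger a TypeError here.
function utf8FromCodePoint(string $hex4): string
{
    // Builds a JSON string literal such as "\u00C0" and lets json_decode expand it.
    return json_decode('"\\u' . $hex4 . '"');
}

echo bin2hex(utf8FromCodePoint('00C0')), "\n"; // c380
echo bin2hex(utf8FromCodePoint('06FF')), "\n"; // dbbf

// Script-based character class instead of hard-coded ranges.
var_dump(preg_match('/\\A\\p{Arabic}+\\z/u', 'ث'));  // int(1)
var_dump(preg_match('/\\A\\p{Arabic}+\\z/u', 'ab')); // int(0)
```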

Staggan · January 6, 2015

The unicode solution sounds possible..

Can you provide me some more information on that?

Thanks

Jacques1 · January 7, 2015

Which of the solutions?

Staggan · January 7, 2015

I have rechecked with our coders, and we actually compare each character's Unicode value against the ranges I specified in my first post.

So, how can I confirm that each character falls within one of those ranges in my first post?

Jacques1 · January 7, 2015

Um, that's exactly what I'm trying to help you with. What do you think the whole discussion was about?

If you want to validate the data with PHP (which you apparently do), then you have to use the features of PHP. It's great that your desktop coders have some Unicode range check they can use. We don't.

PHP can do byte ranges, sure, but then you have to use the UTF-8 representations, not the code points. So that's the solution you've chosen?

By the way, what's the whole point of the last two ranges when they're already covered by the second one?

Staggan · January 7, 2015

I appreciate your help, sorry, maybe I got a little confused with the answer...

Let me explain a little more, perhaps it will help...

Our app reads the latest news on login and displays it to the user. This news could be in a multitude of languages. We limit the characters for each language so that a single Windows app can handle all of them without ever needing the Windows font systems, which have a number of additional limitations.

So, to put the news into the database we use a PHP page, but occasionally the person responsible for the news in a particular language manages to use a character we do not support.. which causes an issue....

Our app uses the ranges above, which are the Unicode numerical values of the characters we allow... at least in Windows.

I thought PHP might allow me a similar way to make the comparison, but it seems there is nothing quite that simple...

As for your last point.. I see it, and can only assume it is an oversight somehow.. I will point it out...

Thanks

Jacques1 · January 7, 2015

Again, you can do this in PHP, and I've offered three possible approaches. Now it's up to you to pick one.

If the encoding will always be UTF-8, then this is actually very easy. There's even a special escape sequence which converts code points to UTF-8 sequences, so you don't have to do that yourself:

<?php

// only accept characters from the Arabic block and some characters from the Latin blocks
$character_validation_pattern = '/\\A[\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]+\\z/u';

// should match
var_dump( preg_match($character_validation_pattern, 'ĕث') );

// should not match
var_dump( preg_match($character_validation_pattern, 'ab') );

Staggan · January 7, 2015

Can you clarify how this works for me? As I do not understand it...

Thanks

Jacques1 · January 7, 2015

This is a regular expression. It defines a pattern which the input must match:

\A is the beginning of the string.
[...] is a character from a certain character set; for example, [a-z] is a lowercase character from the latin alphabet (you may read it as “a to z”).
\x{...} is the UTF-8 representation of a certain Unicode code point; for example, \x{00C0} is the UTF-8 sequence of the “latin capital letter A with grave”.
+ means that the pattern should be repeated once or more.
\z is the end of the string.
u turns the regular expression into “UTF-8 mode”.

So the pattern simply says: A sequence of at least one UTF-8 encoded character from the Unicode ranges U+00C0 to U+01FF and U+0600 to U+06FF.
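When a submission fails this check, it can help to know which characters were rejected. The following is only a sketch: it negates the same character class and reports the offending code points. The function name findRejectedCodePoints is made up, and it assumes PHP 7.2+ with mbstring for mb_ord.

```php
<?php
// Assumption: PHP 7.2+ with mbstring enabled (needed for mb_ord).

// Hypothetical helper: lists the code points of all characters that fall
// outside U+00C0 to U+01FF and U+0600 to U+06FF.
function findRejectedCodePoints(string $text): array
{
    // Same character class as the validation pattern, but negated with ^.
    $rejected_pattern = '/[^\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]/u';

    // With the u modifier, preg_match_all returns false for invalid UTF-8.
    if (preg_match_all($rejected_pattern, $text, $matches) === false) {
        return ['invalid UTF-8'];
    }

    $code_points = [];
    foreach ($matches[0] as $char) {
        $code_points[] = sprintf('U+%04X', mb_ord($char, 'UTF-8'));
    }
    return $code_points;
}

// The ellipsis sits between two allowed characters; this file must be saved as UTF-8.
print_r(findRejectedCodePoints('ĕ…ث')); // the only entry is U+2026
```

Reporting the code point rather than just true or false lets the person entering the news see exactly which character to replace.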

hansford · January 7, 2015

Staggan wrote: "Can you clarify how this works for me? As I do not understand it..."

Over my head as well. Jacques, could you give a concrete example in PHP?

voodooKobra · January 7, 2015

It's probably worth noting that \xBB is for byte literals (most usually ASCII), while \uBBBB is meant for Unicode characters (UTF-8).

Jacques1 · January 7, 2015

PHP doesn't have a \u escape sequence. There's no such thing. What it does have is a \x{...} sequence for multibyte characters (as I already explained above).

voodooKobra · January 8, 2015

Jacques1 wrote: "PHP doesn't have a \u escape sequence. There's no such thing. What it does have is a \x{...} sequence for multibyte characters (as I already explained above)."

Oh, you're right. For some reason I thought it did.

Staggan · January 8, 2015

I have tried the above suggestion...

This is what I do...

function validateString($string)
{
    // only accept characters from U+00C0 to U+01FF and U+0600 to U+06FF
    $character_validation_pattern = '/\\A[\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]+\\z/u';

    if (preg_match($character_validation_pattern, $string)) {
        return true;
    }

    return false;
}
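For reference, this is how a function like this behaves on a few inputs. It is only a sketch: the type declarations and the compact return are additions, and the sample strings are assumptions about what news text might contain. One thing that stands out is that ordinary spaces, digits and ASCII punctuation all lie outside the two ranges, so any text containing them is rejected.

```php
<?php
// Compact variant of the function above, with type declarations added.
function validateString(string $string): bool
{
    $character_validation_pattern = '/\\A[\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]+\\z/u';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    return preg_match($character_validation_pattern, $string) === 1;
}

// This file must be saved as UTF-8 for the literals to work.
var_dump(validateString('ĕث'));  // bool(true)
var_dump(validateString('ab'));  // bool(false): plain ASCII letters are below U+00C0
var_dump(validateString('ĕ ث')); // bool(false): the space (U+0020) is outside both ranges
var_dump(validateString('ĕ…'));  // bool(false): the ellipsis (U+2026) is outside both ranges
var_dump(validateString(''));    // bool(false): + requires at least one character
```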
