Validate input


Staggan

Hello

 

I have a PHP page that sends text entered by a user to our database which we use to display news. This system supports various languages but occasionally we get issues with odd characters being entered...  

 

For example, the precomposed ellipsis glyph (U+2026), which is normally written as three periods, broke our system today.

 

How can I check that each character is valid and within range?

 

These are our character ranges 

 

ExtendedLatin_c_iLowerAlphaChar = 0x00C0;

ExtendedLatin_c_iUpperAlphaChar = 0x01FF;

Arabic_c_iLowerChar = 0x600;

Arabic_c_iUpperChar = 0x6FF;

Arabic_c_iLowerAlphaChar = 0x621;

Arabic_c_iUpperAlphaChar = 0x64A;

Arabic_c_iLowerNumericChar = 0x660;

Arabic_c_iUpperNumericChar = 0x669;

 

So each character must fall within one of these ranges... but I have no idea how to get the hex value of a character in PHP

 

Thanks

 

 


I thought I should explain a little more..

 

Our text is saved to the database as UTF-8, but several characters cause issues when imported into our application, the ellipsis being one of them. Hence, on import we check that each character is within a specific range.

So, my question is still how do I get the hex value of a character in PHP... 

 

The rest is simple.

 

Thanks


We support several languages within our application and our font system has limitations. Those limitations include omitting some valid but less used characters to allow us to support so many languages in the way we do... 

 

BTW, this is not a web application, but the text is added via a web page...


I still find it odd to have such a severe limitation these days. It's some legacy application, I guess?

 

Anyway, this is not quite that easy. The problem is that there's a difference between the abstract Unicode codepoint (which is unique) and the actual byte representation (which depends on the character encoding).
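
To illustrate the difference, here is a minimal sketch using the ellipsis from the first post. It assumes PHP 7.2 or later with the mbstring extension, since mb_ord() is not available before that; the variable names are only for illustration.

```php
<?php
// U+2026 HORIZONTAL ELLIPSIS, written as byte escapes so the
// encoding of this source file does not matter.
$ellipsis = "\xE2\x80\xA6";

// The UTF-8 byte representation: three bytes.
$utf8_bytes = bin2hex($ellipsis);

// The abstract code point, independent of the encoding
// (assumption: PHP 7.2+ with ext-mbstring for mb_ord()).
$code_point = mb_ord($ellipsis, 'UTF-8');
$code_point_hex = sprintf('U+%04X', $code_point);

echo $utf8_bytes, "\n";      // e280a6
echo $code_point_hex, "\n";  // U+2026
```

The same character is 0x2026 as a code point but E2 80 A6 as UTF-8 bytes, which is why comparing raw bytes against the ranges from the first post cannot work.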

 

The hex values above are Unicode code points, not the UTF-8 byte sequences of those characters, so if you literally searched the input for those byte patterns, you wouldn't find anything. However, there are several solutions which do work:

  • If those ranges have a specific meaning (I'm not familiar with Arabic), you may be able to express them with a Unicode character class within a regular expression. This is by far the most elegant approach, because you don't need to hard-code any byte sequences.
  • You could look up the concrete UTF-8 representation of each character and use byte ranges within a regular expression: [\xUUUUUU-\xVVVVVV\xYYYYYY-\xZZZZZZ]
  • You might (ab)use json_decode() to enter the Unicode code points directly and get back the UTF-8 representation. JSON supports Unicode escape sequences like \u00FF.
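
A rough sketch of the json_decode() idea from the last bullet, assuming the input is a single well-formed \uXXXX escape. The helper name utf8_from_json_escape is made up for this example and is not part of PHP.

```php
<?php
// Hypothetical helper: turn a code point written as a JSON escape
// (e.g. \u00C0) into its UTF-8 byte sequence.
function utf8_from_json_escape(string $escape): string
{
    // json_decode() expects a complete JSON document, so wrap the
    // escape in double quotes to form a JSON string literal.
    return json_decode('"' . $escape . '"');
}

// Single quotes: PHP leaves the backslash alone, so JSON sees \u00C0.
$a_grave = utf8_from_json_escape('\u00C0');

echo bin2hex($a_grave), "\n"; // c380
```

The resulting bytes could then be used to build the byte ranges mentioned in the second bullet.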
Edited by Jacques1

I have rechecked with our coders, and we actually compare each character's Unicode value against the ranges I specified in my first post.

 

So, how can I confirm that each character falls within one of those ranges in my first post?


Um, that's exactly what I'm trying to help you with. What do you think the whole discussion was about?

 

If you want to validate the data with PHP (which you apparently do), then you have to use the features of PHP. It's great that your desktop coders have some Unicode range check they can use. We don't.

 

PHP can do byte ranges, sure, but then you have to use the UTF-8 representations, not the code points. So that's the solution you've chosen?

 

By the way, what's the whole point of the last two ranges when they're already covered by the second one?


I appreciate your help, sorry, maybe I got a little confused with the answer...

 

Let me explain a little more, perhaps it will help...

 

Our app reads the latest news on login and displays it to the user. This news could be in a multitude of languages. We limit the characters for each language so that we can handle so many in a single Windows app without ever needing to use the Windows font systems, which have a number of additional limitations.

 

So, to put the news into the database we use a PHP page, but occasionally the person responsible for the news in a particular language manages to use a character we do not support.. which causes an issue.... 

 

Our app uses the ranges above, which are the Unicode code points of the characters we allow... at least in Windows.

 

I thought PHP might allow me a similar way to make the comparison, but it seems there is nothing quite that simple...

 

As for your last point.. I see it, and can only assume it is an oversight somehow.. I will point it out...

 

Thanks


Again, you can do this in PHP, and I've offered three possible approaches. Now it's up to you to pick one.

 

If the encoding will always be UTF-8, then this is actually very easy. There's even a special escape sequence which converts code points to UTF-8 sequences, so you don't have to do that yourself:

<?php

// only accept characters from the Arabic block and some characters from the Latin blocks
$character_validation_pattern = '/\\A[\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]+\\z/u';

// should match
var_dump( preg_match($character_validation_pattern, 'ĕث') );

// should not match
var_dump( preg_match($character_validation_pattern, 'ab') );

This is a regular expression. It defines a pattern which the input must match:

  • \A is the beginning of the string.
  • [...] is a single character from a certain character set; for example, [a-z] is a lowercase character from the Latin alphabet (you may read it as “a to z”).
  • \x{...} is a Unicode code point given in hexadecimal, which is matched against its UTF-8 sequence in the input; for example, \x{00C0} is the “Latin capital letter A with grave”.
  • + means that the preceding character set should be repeated once or more.
  • \z is the end of the string.
  • u turns the regular expression into “UTF-8 mode”, so both the pattern and the input are treated as UTF-8.

So the pattern simply says: A sequence of at least one UTF-8 encoded character from the Unicode ranges U+00C0 to U+01FF and U+0600 to U+06FF.
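
If it helps to see which characters caused a rejection, the same character class can be negated. The following is only a sketch: find_unsupported_characters is a hypothetical helper name, and mb_ord() assumes PHP 7.2 or later with the mbstring extension.

```php
<?php
// Hypothetical helper: list every distinct character outside the
// allowed ranges, formatted as "U+XXXX".
function find_unsupported_characters(string $text): array
{
    // Negated version of the character class used for validation.
    $unsupported_pattern = '/[^\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]/u';

    // preg_match_all() returns false on failure, which in UTF-8 mode
    // most likely means the input is not valid UTF-8.
    if (preg_match_all($unsupported_pattern, $text, $matches) === false) {
        throw new InvalidArgumentException('Matching failed; input is probably not valid UTF-8.');
    }

    $code_points = [];
    foreach (array_unique($matches[0]) as $character) {
        $code_points[] = sprintf('U+%04X', mb_ord($character, 'UTF-8'));
    }
    return $code_points;
}

// "\xC4\x95" is U+0115 (allowed); "\xE2\x80\xA6" is U+2026, the ellipsis.
print_r(find_unsupported_characters("\xC4\x95\xE2\x80\xA6")); // lists U+2026
```

Note that spaces, digits and plain ASCII letters are also reported, because they lie outside the two ranges in the pattern. If the news text may contain them, the character class would need to be extended accordingly.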


I have tried the above suggestion... 

 

This is what I do...

 

 

 
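function validateString($string)
{
    // Accept only strings made up entirely of characters from the allowed ranges.
    $character_validation_pattern = '/\\A[\\x{00C0}-\\x{01FF}\\x{0600}-\\x{06FF}]+\\z/u';

    // preg_match() returns 1 on a match, 0 on no match, and false on error (e.g. invalid UTF-8).
    return preg_match($character_validation_pattern, $string) === 1;
}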
 
