
A problem with my ASCII range Regular expression, PCRE...



Hi, I'm trying to make something that ensures that certain input only contains characters within a specified range. Specifically \x20 through \x7e, as seen on this chart:

 

chart33.png

 

I came up with this expression (using preg_match):

 

/^[\x20-\x7e\t\s]+?$/

 

That seemed to work at first, but something is leaking through. I only want to allow \x20-\x7e, tabs, spaces and new lines (\t\s). While it does block some characters outside that range, others still slip through and I am unsure how. Can anyone see the problem here?

First I'd like to point out that having a non-greedy match when you're trying to match the entire string is a bit unnecessary, and might actually hurt the performance.

Secondly, you don't need to define the range with the hex values; you can write it with the literal characters, space through tilde:

$RegExp = "/^[ -~\\t\\n]+\\z/";

 

That said, I don't see any problem with the RegExp you have there. Do you have any examples of strings containing data that got through despite it? Also, what does your code look like where you use that RegExp and store the data?
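
To make the two suggestions above concrete, here is a minimal sketch. The helper name is_plain_ascii() is hypothetical and not from the thread. It checks that the literal range [ -~] accepts the same bytes as [\x20-\x7e], and that \s is broader than \t\n because it also admits carriage returns:

```php
<?php
// Hypothetical helper (not from the thread): true if $s contains only
// printable ASCII (space through tilde), tabs, and newlines.
function is_plain_ascii(string $s): bool
{
    // [ -~] is the same range as [\x20-\x7e]; \z anchors at the very end.
    return preg_match('/^[ -~\t\n]+\z/', $s) === 1;
}

// Compare both spellings of the range on every possible single byte.
$rangesDiffer = false;
for ($byte = 0; $byte < 256; $byte++) {
    $literal = preg_match('/^[ -~]\z/', chr($byte));
    $hex     = preg_match('/^[\x20-\x7e]\z/', chr($byte));
    if ($literal !== $hex) {
        $rangesDiffer = true;
    }
}
var_dump($rangesDiffer);                                 // bool(false)

var_dump(is_plain_ascii("Hello, world!\n\tindented"));  // bool(true)
var_dump(is_plain_ascii("caf\xC3\xA9"));                 // bool(false)

// \s also matches carriage return, which \t\n alone does not.
var_dump(preg_match('/^[ -~\t\n]+\z/', "a\r\nb"));       // int(0)
var_dump(preg_match('/^[ -~\s]+\z/', "a\r\nb"));         // int(1)
```

The last two lines matter for a textarea, since browsers normally submit line breaks as \r\n. If multi-line input should pass, \r probably needs to be allowed explicitly or normalized first.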


I'll check that out soon, thank you for the tips. I had set up a test page (using a textarea) to send data to the regular expression.

The php...

 

<?php
$response = isset($_POST['post']) ? $_POST['post'] : null;
$result = null;
if ( $response !== null ) {
	$response = trim($response);
	if ( preg_match('/^[\x20-\x7e\t\s]+?$/', $response) ) {
		$result = 'string length: ' . strlen($response) . ' Validated input ' . $response;
	} else {
		$result = 'nope';
	}
}
?>
<!DOCTYPE html>
<html>
<head>
	<title></title>
	<link rel="stylesheet" type="text/css" href="style/base.css" />
	<link rel="stylesheet" type="text/css" href="style/xform.css" />
</head>
<body>
	<?php echo $result; ?>
	<div class="xform">
		<form method="post" action="">
			<div class="inxform">
				<fieldset>
					<legend>Message</legend>
					<div class="overlay">
						<textarea name="post"><?php echo isset($_POST['post']) ? $_POST['post'] : null; ?></textarea>
					</div>
					<div class="overlay">
						<input type="submit" name="submit_post" value="Post" />
					</div>
				</fieldset>
			</div>
		</form>
	</div>
</body>
</html>

 

For example, it validates īĬĭ and strange characters like that, but not , and so on with many random characters that I generated with the following code:

<?php
for( $i=0; $i < 1000; $i++)
	echo "&#" . $i . ";";
?>

 

So of course all the weird false positives had thrown me off that expression altogether when I was certain I was doing it right. I'm not sure if something else is causing it, as you say the initial expression should in theory work.

 

Hmm... It might be related to the fact that you're trying to validate a non-ASCII string. If you're using UTF-8 (which you should), then just add the "u" modifier to the RegExp to switch it to UTF-8 mode.
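
As a rough illustration of what the "u" modifier changes for this kind of pattern, the sketch below runs the same character class in byte mode and in UTF-8 mode. The sample strings are made up for the example. For valid UTF-8 input the outcome is the same in both modes; the difference shows up with invalid UTF-8, where preg_match() returns false instead of 0:

```php
<?php
$pattern = '/^[ -~\t\n]+\z/';

$validUtf8   = "abc\xC4\x9C"; // "abcĜ" encoded as UTF-8 (U+011C)
$invalidUtf8 = "abc\xDC";     // lone lead byte, not valid UTF-8

// Valid UTF-8 containing a non-ASCII character is rejected in both modes.
$byteModeValid = preg_match($pattern, $validUtf8);
$utf8ModeValid = preg_match($pattern . 'u', $validUtf8);
var_dump($byteModeValid);   // int(0)
var_dump($utf8ModeValid);   // int(0)

// Invalid UTF-8: byte mode simply does not match, UTF-8 mode reports an error.
$byteModeInvalid = preg_match($pattern, $invalidUtf8);
$utf8ModeInvalid = preg_match($pattern . 'u', $invalidUtf8);
$lastError       = preg_last_error();
var_dump($byteModeInvalid); // int(0)
var_dump($utf8ModeInvalid); // bool(false)
var_dump($lastError === PREG_BAD_UTF8_ERROR); // bool(true)
```

Because an if statement treats both 0 and false as failure, either mode ends up rejecting such input in a simple validity check.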

I tried your regular expression as follows,

 

/^[ -~\\t\\n]+\\z/

 

It validated

abcdefghijklmnopqrstuvwxyzABCDEFGHIKLMNOPQRSTUVWXYZ0123456789~!@#$%^&*()_+`-=[]\{}|;':",./<>?Ĝĝ

 

But not

abcdefghijklmnopqrstuvwxyzABCDEFGHIKLMNOPQRSTUVWXYZ0123456789~!@#$%^&*()_+`-=[]\{}|;':",./<>?

 

I even attempted to change mine with your given tips, which resulted in:

 

/^[\x20-\x7e\t\s]+$/u

 

But it produced the same false positives as yours did (I also added the u modifier to your Regex). Still not sure what the proper solution may be.

Just tried on my local server:

$String = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIKLMNOPQRSTUVWXYZ0123456789~!@#$%^&*()_+`-=[]\{}|;\':",./<>?Ĝĝ';
$RegExp = '/^[ -~\\t\\n]+\\z/u';
var_dump (preg_match ($RegExp, $String));
// int(0)
$String = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIKLMNOPQRSTUVWXYZ0123456789~!@#$%^&*()_+`-=[]\{}|;\':",./<>?';
var_dump (preg_match ($RegExp, $String));
// int(1)

 

PS: Sorry for the HTML entities in the post, the forum software seems to double escape sometimes.


I had the same results on my server when statically placing the input string as you have, however for some reason in the context of input that's being received through $_POST, it still accepts these characters as valid input.
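
One possible explanation, which nobody in this thread confirmed: when a form page does not declare a UTF-8 charset, browsers commonly submit characters the page encoding cannot represent as numeric character references such as &#299;. Those references consist only of ASCII characters, so they would pass the check, and the browser would render them back as the original glyphs when the result is echoed. The sketch below assumes that is what arrives in $_POST; the $posted value is invented for illustration:

```php
<?php
// Assumption: what a browser might send for "īĬĭ" from a non-UTF-8 page.
// Every character in this string is plain ASCII.
$posted = '&#299;&#300;&#301;';

$pattern = '/^[ -~\t\n]+\z/';

$rawResult = preg_match($pattern, $posted);
var_dump($rawResult);       // int(1)

// Dumping the raw bytes shows what actually arrived on the server.
echo bin2hex($posted), "\n";

// Decoding the references first exposes the non-ASCII characters.
$decoded = html_entity_decode($posted, ENT_QUOTES | ENT_HTML5, 'UTF-8');
$decodedResult = preg_match($pattern, $decoded);
var_dump($decodedResult);   // int(0)
```

If bin2hex() on the real $_POST value shows sequences like this, declaring UTF-8 for the page (for example with a charset meta tag or accept-charset="UTF-8" on the form) would be one thing to try.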
