[SOLVED] A decent, effective filter: using preg_replace


Recently I've noticed that even a simple preg_replace filter like this one can be worked around:

 

$blockedarray=array(
'/alert=/i', '/alert\(/i', '/iframe/i', '/<me/i', '/<object/i', '/\.cookie/i', '/<app/i', '/mysql/i', '/document\.location/i', '/\0/', '/@import/i', '/<xml/i', '/<meta/i', '/s\nc\nr\ni/i', '/<emb/i', '/<java/i','/moz[^A-Z]/i','/onload/i', '/\+ document/i', '/\;p\?/i', '/\' \+ \'/', '/\' \+\'/', '/\'\+\'/', '/\'\+ \'/', '/-binding:/i', '/-binding :/i'

);
$input = preg_replace($blockedarray, '*blocked*', $input);

 

Here, instead of only filtering out HTML, we're also trying to filter JavaScript. It turns out that, given the right combination of filtered words, you can craft an input that only turns into a blocked word after the filter has run, e.g.:

 

$replace = array('/apple/','/banana/','/cookie/'); // preg patterns need delimiters
$input = preg_replace($replace,'',$input);

 

Let's say my inputs and outputs are as follows:

 

$input = 'apple';
//output: ''
$input = 'applebanana';
//output: ''
$input = 'appbananale';
//output: 'apple'

 

Let's re-arrange the array:

$replace = array('/banana/','/cookie/','/apple/'); // preg patterns need delimiters
$input = preg_replace($replace,'',$input);

 

Now, we have:

$input = 'appbananale';
//output: ''

 

That last case shows the problem with preg_replace: it applies the patterns in array order, one pass each, so removing one word can assemble another word that has already been checked. A simple workaround would be to replace anything filtered out with '***' or something similar instead of an empty string. Re-arranging the array doesn't really solve it, because then one can submit 'banappleana' and get 'banana' through the filter.
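To make the '***' idea concrete, here is a minimal sketch. The function name mask_blocked and its $marker parameter are made up for illustration, and the patterns are written with '/' delimiters so preg_replace accepts them.

```php
<?php
// Hypothetical word list, mirroring the apple/banana/cookie example above.
$patterns = array('/banana/', '/cookie/', '/apple/');

// Replacing each hit with a marker instead of '' keeps the fragments on
// either side of the hit from joining into a new blocked word.
function mask_blocked($input, array $patterns, $marker = '***')
{
    return preg_replace($patterns, $marker, $input);
}

echo mask_blocked('appbananale', $patterns), "\n"; // app***le
echo mask_blocked('banappleana', $patterns), "\n"; // ban***ana
```

The marker itself must not contain anything that could complete a blocked word, otherwise the same problem comes back.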

 

So I wonder, should we try to recursively replace our input until there is no change? Or should we just simply replace anything filtered out with '***'?

 

E.g.:

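$replace = array('/banana/','/cookie/','/apple/');
$counter = 0;
// Re-run the filter until a pass leaves the string unchanged.
while ($input != ($filtered = preg_replace($replace, '', $input))) {
    $input = $filtered; // keep the filtered result, otherwise the loop never ends
    $counter++;
}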

 

Then, with:

$input = 'banappleana';
//output: '', $counter = 2 (0->1 after replacing 'apple', 1->2 after replacing 'banana')

 

What's your view? Comments? Suggestions?

so basically what you're saying is someone can type "asfuckshole" and you replace fuck with nothing and you wind up with asshole?  I suppose if you really wanted to address that, a recursive function would do the trick. 
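For what it's worth, a recursive version could look roughly like this. It is only a sketch: filter_recursive and the $depth cap are hypothetical names, and the word list is the harmless one from the first post.

```php
<?php
// Recursive variant: filter once, and recurse only if that pass changed
// the string. $depth is a hypothetical safety cap, not part of the thread.
function filter_recursive($input, array $patterns, $depth = 50)
{
    $filtered = preg_replace($patterns, '', $input);
    if ($filtered === $input || $depth <= 0) {
        return $filtered;
    }
    return filter_recursive($filtered, $patterns, $depth - 1);
}

$patterns = array('/banana/', '/cookie/', '/apple/');
echo filter_recursive('banappleana', $patterns); // prints an empty string
```

Since every pass that changes anything makes the string shorter, the recursion stops on its own; the cap is just a belt-and-braces limit.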

 

I wasn't sure if you could really curse on these forums >.< But yeah, I was just curious which of the two ways would be better. The main issue with the JavaScript was that we wanted to filter all of it, yet some people were able to work around the filter by mixing in other blocked words (curse words, etc.) using the methods above.

As far as the escaping JS goes, why not just use htmlentities?
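A quick illustration of what that does, assuming no HTML at all needs to survive (the sample input is made up):

```php
<?php
$input = '<b>hi</b><script>alert(1)</script>';

// Every markup character becomes an entity, so the browser shows the
// text instead of interpreting it.
$safe = htmlentities($input, ENT_QUOTES, 'UTF-8');

echo $safe;
// &lt;b&gt;hi&lt;/b&gt;&lt;script&gt;alert(1)&lt;/script&gt;
```

Note that this neutralises the harmless <b> tag as well, so it only fits when no markup is meant to be rendered.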

 

So here's the fun part. We are allowing html, but we don't want javascript. >.< At least, that's what the site owner wants. So... that's where it is a tad difficult.

Oh.... I had to do something like that recently....

 

 

I went with whitelisting where I parsed every tag then parsed the attributes and so on....

 

 

In your case though, you could probably just recursively strip out script tags, on<blah>= event attributes (onclick=, onload= and so on) and attribute="javascript:..." crap.
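A rough sketch of that suggestion follows. The function name strip_js_naive and the four patterns are assumptions made for illustration; regex stripping like this is easy to get wrong and is not a complete XSS defence.

```php
<?php
// Naive stripper: repeats the replacements until the string stops changing.
function strip_js_naive($html)
{
    $patterns = array(
        '#<script\b[^>]*>.*?</script\s*>#is',            // whole <script>...</script> blocks
        '#</?script\b[^>]*>#i',                          // stray opening or closing script tags
        '#\son\w+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)#i', // onclick=..., onload=...
        '#javascript\s*:#i',                             // javascript: URL scheme
    );
    do {
        $before = $html;
        $html = preg_replace($patterns, '', $html);
    } while ($html !== $before);
    return $html;
}

echo strip_js_naive('<p onclick="x()">hi</p><scr<script></script>ipt>alert(1)</script>');
// <p>hi</p>alert(1)
```

The nested <scr<script></script>ipt> trick from earlier in the thread is caught because removing the inner pair exposes a new tag that the following pattern then strips, and the loop keeps going until nothing changes.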

 

 

Well it wasn't too bad since I only allowed certain elements....

 

 

It was like p, a, b, i, img, center, div and so on that were allowed.

 

It was quite a pain though to parse the style="var: val; var2: val2;" pairs to make sure they were allowed lol.
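Something along these lines is one way to check the pairs. The property whitelist, the value pattern and the name filter_style are all hypothetical, since the actual class was never posted.

```php
<?php
// Hypothetical whitelist of CSS properties; the thread does not list the real one.
$allowed_props = array('color', 'font-weight', 'text-align');

function filter_style($style, array $allowed_props)
{
    $kept = array();
    foreach (explode(';', $style) as $pair) {
        $parts = explode(':', $pair, 2);
        if (count($parts) !== 2) {
            continue; // empty piece or no "property: value" shape
        }
        $prop = strtolower(trim($parts[0]));
        $val = trim($parts[1]);
        // Accept only whitelisted properties with plain word-like values.
        if (in_array($prop, $allowed_props, true) && preg_match('/^[a-z0-9#% .-]+$/i', $val)) {
            $kept[] = $prop . ': ' . $val;
        }
    }
    return implode('; ', $kept);
}

echo filter_style('color: red; -moz-binding: url(x); font-weight: bold;', $allowed_props);
// color: red; font-weight: bold
```

Restricting values to a small character set rules out url(...) and expression(...) without having to list them.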

 

(There's a reason BBCode developed :), but there were issues with it.)

 

 

 

It was a pretty simple design actually....  I had an array of allowed tags, and for some tags I had handlers.

 

Then I parsed everything with preg_replace() with an /e modifier.  If the tag name was mapped to true in the array, it was blindly returned, if it was mapped to a string, the tag, its content and its attributes were passed to a method.  The method would further parse and decide what to return and so on....
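Since the class itself wasn't shared, here is only a guess at the general shape, using preg_replace_callback rather than the /e modifier. The tag list, handle_link and filter_tags are hypothetical, and unlike the description above this sketch throws away the attributes of tags mapped to true instead of returning them blindly.

```php
<?php
// Hypothetical whitelist: true = keep the bare tag, string = name of a handler.
$allowed = array(
    'p' => true,
    'b' => true,
    'i' => true,
    'a' => 'handle_link',
);

// Hypothetical handler: keeps only an http(s) href and drops everything else.
function handle_link($closing, $attrs)
{
    if ($closing) {
        return '</a>';
    }
    if (preg_match('#\bhref\s*=\s*"(https?://[^"]*)"#i', $attrs, $m)) {
        return '<a href="' . htmlspecialchars($m[1], ENT_QUOTES, 'UTF-8') . '">';
    }
    return '<a>';
}

function filter_tags($html, array $allowed)
{
    return preg_replace_callback(
        '#<(/?)([a-z][a-z0-9]*)\b([^>]*)>#i',
        function ($m) use ($allowed) {
            $tag = strtolower($m[2]);
            if (!isset($allowed[$tag])) {
                return '';                       // unknown tag: drop it
            }
            if ($allowed[$tag] === true) {
                return '<' . $m[1] . $tag . '>'; // keep the tag, discard attributes
            }
            return call_user_func($allowed[$tag], $m[1] === '/', $m[3]);
        },
        $html
    );
}

echo filter_tags('<p onclick="x()">Hi <a href="http://example.com" onmouseover="y()">link</a><script>z()</script></p>', $allowed);
// <p>Hi <a href="http://example.com">link</a>z()</p>
```

The script tags are removed but their content stays behind as plain text, which is inert once the tags are gone.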

 

It was much simpler than that sounded lol.

 

 

(I would share the class with you, but as I used it in a paid project.....)

Hmm.. wow. About that "e" modifier, how were you using that? I've never really seen an example with it, and seeing how I was able to do normal substitutions with $1, $2, etc.. I never got the point of it. For the most part, I assumed it was a stupid form of "eval($replacement)".

kinda off-topic but here's an example of using the 'e' modifier with preg_replace:

 

$string = "A AA AAA AAAA AAAAA AAAA AAA AA A";
$string = preg_replace("~(\w{3,})~e","strtolower('$1')",$string);
echo $string;

 

output:

A AA aaa aaaa aaaaa aaaa aaa AA A
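Worth knowing: the 'e' modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the snippet above no longer runs on current versions. The same result with preg_replace_callback, as a minimal sketch:

```php
<?php
$string = "A AA AAA AAAA AAAAA AAAA AAA AA A";

// The callback receives the match array; $m[0] is the whole match.
$string = preg_replace_callback(
    '~\w{3,}~',
    function ($m) {
        return strtolower($m[0]);
    },
    $string
);

echo $string; // A AA aaa aaaa aaaaa aaaa aaa AA A
```

Because the match is passed as data rather than spliced into code, there is no eval step for user input to break out of.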
