[SOLVED] A decent, effective filter: using preg_replace

kratsg · September 14, 2009

Recently, I've noticed that there can be workarounds to even a simple preg_replace:

$blockedarray = array(
    '/alert=/i', '/alert\(/i', '/iframe/i', '/<me/i', '/<object/i',
    '/\.cookie/i', '/<app/i', '/mysql/i', '/document\.location/i', '/\0/',
    '/@import/i', '/<xml/i', '/<meta/i', '/s\nc\nr\ni/i', '/<emb/i',
    '/<java/i', '/moz[^A-Z]/i', '/onload/i', '/\+ document/i', '/\;p\?/i',
    '/\' \+ \'/', '/\' \+\'/', '/\'\+\'/', '/\'\+ \'/',
    '/-binding:/i', '/-binding :/i'
);
$input = preg_replace($blockedarray, '*blocked*', $input);

Here we're filtering not just the HTML but also JavaScript. It turns out that, given the right combination of filtered words, you can craft an input that only turns into a blocked word after the filter has run, e.g.:

// preg_replace patterns need delimiters
$replace = array('/apple/','/banana/','/cookie/');
$input = preg_replace($replace,'',$input);

Let's say my inputs/outputs are as below:

$input = 'apple';
//output: ''
$input = 'applebanana';
//output: ''
$input = 'appbananale';
//output: 'apple'

Let's re-arrange the array:

$replace = array('/banana/','/cookie/','/apple/');
$input = preg_replace($replace,'',$input);

Now, we have:

$input = 'appbananale';
//output: ''

That last case shows the problem with preg_replace: the patterns are applied one after another, so the result depends on their order, and removing one word can expose another. A simple workaround would be to replace anything filtered out with '***' or something similar, or even to re-arrange the array, but then one can enter 'banappleana' and get past the filter again.
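A minimal sketch of the '***' idea, assuming the same three example words (the function name filter_with_placeholder is made up for illustration). Because the removed word leaves a visible marker behind, the fragments on either side can no longer join into a new blocked word:

```php
<?php
// Hypothetical helper: replaces each blocked word with a placeholder
// instead of deleting it, so the surrounding fragments cannot join up.
function filter_with_placeholder(string $input): string
{
    $patterns = array('/banana/', '/cookie/', '/apple/');
    return preg_replace($patterns, '***', $input);
}

echo filter_with_placeholder('appbananale'), "\n"; // app***le
echo filter_with_placeholder('banappleana'), "\n"; // ban***ana
```

This only holds as long as the placeholder itself cannot match, or help complete, any of the patterns.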

So I wonder, should we try to recursively replace our input until there is no change? Or should we just simply replace anything filtered out with '***'?

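For example:

$replace = array('/banana/','/cookie/','/apple/');
$counter = 0;
// keep filtering until a pass leaves $input unchanged
while($input !== ($filtered = preg_replace($replace,'',$input))) {
    $input = $filtered;
    $counter++;
}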

Then, with:

$input = 'banappleana';
//output: '', $counter = 2 (0->1 after replacing 'apple', 1->2 after replacing 'banana')

What's your view? Comments? Suggestions?

.josh · September 14, 2009

so basically what you're saying is someone can type "asfuckshole" and you replace fuck with nothing and you wind up with asshole? I suppose if you really wanted to address that, a recursive function would do the trick.
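A rough sketch of what such a recursive function might look like, using the fruit words from the first post. The name filter_recursive and the depth limit of 50 are assumptions made up for this example:

```php
<?php
// Hypothetical helper: strips the blocked words, then calls itself again
// whenever a pass changed something, so words exposed by a removal are caught.
function filter_recursive(string $input, array $patterns, int $depth = 0): string
{
    $filtered = preg_replace($patterns, '', $input);
    // Stop when a pass changes nothing (or at an arbitrary safety limit).
    if ($filtered === $input || $depth >= 50) {
        return $filtered;
    }
    return filter_recursive($filtered, $patterns, $depth + 1);
}

$patterns = array('/banana/', '/cookie/', '/apple/');
echo filter_recursive('banappleana', $patterns), "\n"; // prints an empty line
echo filter_recursive('appbananale', $patterns), "\n"; // prints an empty line
```

Since every pass that changes the string also makes it shorter, the recursion always terminates; the depth limit is only a guard.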

kratsg · September 14, 2009

> so basically what you're saying is someone can type "asfuckshole" and you replace fuck with nothing and you wind up with asshole? I suppose if you really wanted to address that, a recursive function would do the trick.

I wasn't sure if you could really curse on these forums >.< But yeah, I was just curious as to whether either way would be better. The main issue with the javascript was that we wanted to filter all of it, yet some people were able to work around the filter by mixing in other filtered words (curse words, etc.) using the methods above.

corbin · September 14, 2009

As far as the escaping JS goes, why not just use htmlentities?

kratsg · September 14, 2009

> As far as the escaping JS goes, why not just use htmlentities?

So here's the fun part. We are allowing html, but we don't want javascript. >.< At least, that's what the site owner wants. So... that's where it is a tad difficult.

corbin · September 14, 2009

Oh.... I had to do something like that recently....

I went with whitelisting where I parsed every tag then parsed the attributes and so on....

In your case though, you could probably just recursively strip out script tags, on<blah>= attributes and tag="javascript:" crap.
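A minimal sketch of that idea, assuming a made-up helper named strip_js. It is not a complete XSS filter (entity-encoded or otherwise obfuscated input is not handled); it only illustrates stripping repeatedly until nothing changes:

```php
<?php
// Hypothetical helper, NOT a complete XSS filter: repeatedly removes
// <script> elements, on*= attributes and javascript: URLs until a
// pass leaves the input unchanged.
function strip_js(string $html): string
{
    $patterns = array(
        '~<script\b[^>]*>.*?</script\s*>~is',            // whole script elements
        '~</?script\b[^>]*>~i',                          // stray script tags
        '~\son\w+\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)~i', // onclick=, onload=, ...
        '~javascript\s*:~i',                             // javascript: URLs
    );
    do {
        $before = $html;
        $html = preg_replace($patterns, '', $html);
    } while ($html !== $before);
    return $html;
}

echo strip_js('<p onclick="alert(1)">hi</p>'), "\n";                   // <p>hi</p>
echo strip_js('<b>ok</b><script>alert(1)</script>'), "\n";             // <b>ok</b>
echo strip_js('<a href="javajavascript:script:alert(1)">x</a>'), "\n"; // <a href="alert(1)">x</a>
```

The last input needs two passes: removing the inner javascript: joins the outer pieces into a new one, which the second pass removes.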

kratsg · September 14, 2009

> I went with whitelisting where I parsed every tag then parsed the attributes and so on....

Wow, that sounds painful to code. I assume you put this into one giant pattern or something and used preg_match/replace?

corbin · September 14, 2009

Well it wasn't too bad since I only allowed certain elements....

It was like p, a, b, i, img, center, div and so on that were allowed.

It was quite a pain though to parse the style="var: val; var2: val2;" pairs to make sure they were allowed lol.

(There's a reason BBCode developed, but there were issues with it.)

It was a pretty simple design actually.... I had an array of allowed tags, and for some tags I had handlers.

Then I parsed everything with preg_replace() with an /e modifier. If the tag name was mapped to true in the array, it was blindly returned, if it was mapped to a string, the tag, its content and its attributes were passed to a method. The method would further parse and decide what to return and so on....

It was much simpler than that sounded lol.

(I would share the class with you, but as I used it in a paid project.....)
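Since the class itself was not shared, here is only a rough sketch of the general shape described above. It is not the original code. The names $allowed, handle_a and filter_tags are made up, it uses preg_replace_callback() rather than the /e modifier, and it handles each tag on its own instead of passing tag contents to the handler. Tags mapped to true are also returned without their attributes, which is stricter than returning them blindly:

```php
<?php
// Hypothetical whitelist: true = keep the tag (attributes dropped),
// string = name of a handler function that rebuilds the opening tag.
$allowed = array('p' => true, 'b' => true, 'i' => true, 'a' => 'handle_a');

// Hypothetical handler: keeps only an http(s) href on <a> tags.
function handle_a(string $attrs): string
{
    if (preg_match('~\bhref\s*=\s*"(https?://[^"]*)"~i', $attrs, $m)) {
        return '<a href="' . htmlspecialchars($m[1], ENT_QUOTES) . '">';
    }
    return '<a>';
}

function filter_tags(string $html, array $allowed): string
{
    return preg_replace_callback(
        '~<(/?)([a-z][a-z0-9]*)\b([^>]*)>~i',
        function (array $m) use ($allowed) {
            $name = strtolower($m[2]);
            if (!isset($allowed[$name])) {
                return htmlspecialchars($m[0], ENT_QUOTES); // not whitelisted: show as text
            }
            if ($m[1] === '/') {
                return '</' . $name . '>';                  // closing tag
            }
            if ($allowed[$name] === true) {
                return '<' . $name . '>';                   // allowed, attributes dropped
            }
            return call_user_func($allowed[$name], $m[3]);  // delegate to the handler
        },
        $html
    );
}

echo filter_tags('<p onclick="x()">Hi <b>there</b></p><script>alert(1)</script>', $allowed), "\n";
// <p>Hi <b>there</b></p>&lt;script&gt;alert(1)&lt;/script&gt;
```

A stray "<" that never closes with ">" is not matched by this pattern, so a real filter would need more than this.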

kratsg · September 14, 2009

Hmm.. wow. About that "e" modifier, how were you using that? I've never really seen an example with it, and seeing how I was able to do normal substitutions with $1, $2, etc.. I never got the point of it. For the most part, I assumed it was a stupid form of "eval($replacement)".

.josh · September 14, 2009

kinda off-topic but here's an example of using the 'e' modifier with preg_replace:

$string = "A AA AAA AAAA AAAAA AAAA AAA AA A";
$string = preg_replace("~(\w{3,})~e","strtolower('$1')",$string);
echo $string;

output:

A AA aaa aaaa aaaaa aaaa aaa AA A
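For anyone trying this on a newer PHP: the /e modifier was deprecated in PHP 5.5 and removed in PHP 7.0. The same example can be sketched with preg_replace_callback(), where the replacement is a function that receives the match array instead of a string that gets evaluated:

```php
<?php
$string = "A AA AAA AAAA AAAAA AAAA AAA AA A";
// The callback receives the match array; $m[1] is the first capture group.
$string = preg_replace_callback(
    '~(\w{3,})~',
    function (array $m) {
        return strtolower($m[1]);
    },
    $string
);
echo $string, "\n"; // A AA aaa aaaa aaaaa aaaa aaa AA A
```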
