A Dynamic Word Filter

kratsg · August 7, 2009

So, I was googling around looking for ideas on how to expand the filter already made for one of our AJAX chatrooms. It allows the moderators to dynamically add/delete words from the list of filtered words using a chatroom command inside the chatroom.

So, in order to scan and cover more bad words, I was wondering how to do this (we already have it set-up using preg_replace() where all the words are matched case-insensitively.)

The list looks something like:

someword

cursing

swears

whee

The filter builds the list of words by putting them all into an array, with each word surrounded by forward slashes (the regex delimiters) and the case-insensitive "i" flag at the end.

E.g.: .....,'/someword/i',........
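A minimal sketch of the setup described above, assuming the word list is already available as a plain PHP array. The names $words and build_patterns() and the sample message are hypothetical. preg_quote() is used so that a word containing regex characters cannot break the pattern.

```php
<?php
// Hypothetical word list; in the chatroom this would come from the moderators' commands.
$words = array("someword", "cursing", "swears", "whee");

// Wrap each word in "/" delimiters and add the case-insensitive "i" flag.
function build_patterns(array $words) {
    $patterns = array();
    foreach ($words as $word) {
        $patterns[] = "/" . preg_quote($word, "/") . "/i";
    }
    return $patterns;
}

$patterns = build_patterns($words);
$clean = preg_replace($patterns, "****", "No CURSING here, SomeWord!");
echo $clean, "\n"; // No **** here, ****!
```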

So basically, how can I expand that scope so the filter looks not just for someword, but also for s0meword, somew0rd, s0mew0rd, som3word, s0m3word, som3w0rd, s0m3w0rd? For example, by looping through and switching out to similar characters...

Or with "cursing", the i can be L,1,!, etc... Any ideas?

monkeytooth · August 7, 2009

Unfortunately, I personally think you're going to have to build your own dictionary/array for the filter to work properly. However, I suppose (and this might not be the best concept) you could expand the function so that, when you add a word to the array, it runs str_replace() or a similar function several times on the original string to generate the spellings commonly used in 1337 talk, or Hax0r-style talk.

Example: you're adding

monkey for whatever reason, using your filter and expanding upon it to look for other variations

$string = "monkey";

$altversion = str_replace('e', '3', $string); // "monk3y"

Like I said, it's not the best notion in the world, but there's bound to be a way to build an array and a loop that can do that based on the concept.
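One way the array-and-loop idea might look, as a rough sketch. The substitution table $leet and the function leet_variants() are made up for illustration, and the table is far from complete. For each letter the loop keeps the letter itself and adds every look-alike, so the number of variants multiplies with each replaceable letter.

```php
<?php
// Hypothetical substitution table: each letter maps to the look-alikes to try.
$leet = array("o" => array("0"), "e" => array("3"), "i" => array("1", "!"));

// Build every spelling of $word by choosing, per letter, the letter itself or a look-alike.
function leet_variants($word, array $leet) {
    $variants = array("");
    foreach (str_split($word) as $char) {
        $choices = array($char);
        if (isset($leet[$char])) {
            $choices = array_merge($choices, $leet[$char]);
        }
        $next = array();
        foreach ($variants as $prefix) {
            foreach ($choices as $choice) {
                $next[] = $prefix . $choice;
            }
        }
        $variants = $next;
    }
    return $variants;
}

$variants = leet_variants("monkey", $leet);
print_r($variants); // monkey, monk3y, m0nkey, m0nk3y
```

With this table, "someword" yields 8 spellings (three replaceable letters, two choices each), so long words with many replaceable letters can produce a large list.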

kratsg · August 7, 2009

Now, what about detecting excessive punctuation or spaces?

E.g.:

"m!o!n!k!e!y" and "m on key" perhaps?

I would imagine that the loop itself wouldn't be too hard to manage, but how would something actually recognize excessive punctuation, etc.?

mikesta707 · August 7, 2009

Something like that would be very advanced. I think that monkeytooth's method is the best so far (well, I can't think of a better way). You could probably do some wicked regex and work it out: for example, take a bad word, explode it into its individual chars, and build a regex that searches for all those characters with any number of spaces in between each character.

And then of course you could expand that to any number of other characters.
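A small sketch of that regex idea, under some assumptions: the function name spaced_pattern() is hypothetical, and the set of filler characters allowed between letters (whitespace plus ! . * _ -) is just a guess at what people might type.

```php
<?php
// Turn a word into a pattern that tolerates separators between its letters.
// $separator is a regex fragment; the default is an assumed set of filler characters.
function spaced_pattern($word, $separator = "[\\s!.*_-]*") {
    $letters = array();
    foreach (str_split($word) as $char) {
        $letters[] = preg_quote($char, "/");
    }
    return "/" . implode($separator, $letters) . "/i";
}

$pattern = spaced_pattern("monkey");
$first = preg_replace($pattern, "****", "you m!o!n!k!e!y");
$second = preg_replace($pattern, "****", "a m on key b");
echo $first, "\n";  // you ****
echo $second, "\n"; // a **** b
```

The trade-off is false positives: because spaces are allowed between letters, an innocent phrase such as "them on keys" would also match, so the separator set probably needs tuning for each chatroom.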

kratsg · August 11, 2009

See, I'd imagine that in almost any web programming language, we run into a seemingly overwhelming number of iterations when we deal with filters. On the other hand, filtering out html tags (like <div> and <span>) is quite easy, because you cannot replace the i with a 1, and capitalization doesn't matter with case-insensitive matches. Speaking of this as a side note, anyone know why there are some html tags that do not get stripped via strip_tags() function (I don't know all the exceptions, but <script> and <embed> do not get stripped out, for example).

@mikesta707, thinking about your idea.. couldn't one just prepare a table of replacements ahead of time? I made the following off the top of my head as a basic... If there's any insight into a better way of how to do what's illustrated below, don't hesitate to post :-D Any idea or thought would be nice. User feedback and input = <3

<?php
$str = "mixes";

//map each letter to its look-alikes; letters that aren't listed only match themselves
//note, there's only 26 letters in the alphabet, so this can't really become a giant array
$map = array("i" => "(i|1|!|l)", "x" => "(x|cks)", "e" => "(e|3)");
//first, break the word into letters and swap each one for the appropriate regex stuff
//(splitting has to happen before the swap, otherwise the groups themselves get broken up)
$parts = array();
foreach(str_split(strtolower($str)) as $char){
$parts[] = isset($map[$char]) ? $map[$char] : preg_quote($char,"/");
}
//next, add appropriate whitespace checkers in between each letter
$str = implode("\\s*",$parts);//put it back together with the checking of whitespace in between letters

//at this point, our word should look like: m\s*(i|1|!|l)\s*(x|cks)\s*(e|3)\s*s

//We can store this line either in the database (after escaping it) or put it in a text file
//we can call it back among with other "smart" words and use this
//assume it was in a file, each line = a new word
$filter = array();
$filename = "filter_list.txt";
$file = fopen($filename,"r");
while(($line = fgets($file)) !== false){
$line = trim($line);//remove line breaks and extra whitespace
if($line === "") continue;//skip blank lines, an empty pattern would match everywhere
$filter[] = "/$line/i";//add our filter words with the forward-slash delimiters and a case-insensitive specification
}
fclose($file);

$message = preg_replace($filter,"****",$message);//$message refers to a message the user posted
?>

oni-kun · August 11, 2009

kratsg wrote: "Speaking of this as a side note, anyone know why there are some html tags that do not get stripped via strip_tags() function (I don't know all the exceptions, but <script> and <embed> do not get stripped out, for example)."

Some multi-line tags may accidentally pass through; it may be a bug. You could do something such as this:

$htmlstring = preg_replace("'<embed[^>]*>.*</embed>'siU",'',$htmlstring);
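A possible explanation for what looks like unstripped tags: strip_tags() removes the tags themselves but keeps the text that sat between them, so the body of a <script> element stays in the output. The sketch below shows the difference, with a made-up sample string; the variable names are hypothetical, and a regex like this is only a rough pre-filter, not a complete HTML sanitizer.

```php
<?php
$html = "Hello <script>alert('x');</script>world <b>bold</b>";

// strip_tags() drops the tags but keeps whatever text sat between them.
$stripped = strip_tags($html);

// Removing the whole element first (tag plus contents) avoids the leftover script text.
// The "s" flag lets "." span newlines; ".*?" stops at the first closing tag.
$noScript = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
$cleaned = strip_tags($noScript);

echo $stripped, "\n"; // Hello alert('x');world bold
echo $cleaned, "\n";  // Hello world bold
```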

All in all, that suggestion to 'brute-force' look-alike characters is about all you can do. You may also want to match repeated characters, such as in 'daaaamn':

$content = "grrrrrrrrrrr arggggg loooool shiiiiiit";
$pattern = '{([a-zA-Z])\1+}';
$replacement = '$1$1';
$filtered = preg_replace($pattern, $replacement, $content);

Another suggestion is to use 'iconv' to convert characters such as 'ú' into 'u' beforehand, so 'fú*k' can't pass through unfiltered.
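A sketch of the "convert accents first" step. Note that this swaps in a different technique: iconv() with 'ASCII//TRANSLIT' can do this, but its output depends on the platform and locale, so the example uses a small hand-written table with strtr() instead. The table $plain and the function fold_accents() are hypothetical, cover only a few lowercase letters, and assume UTF-8 text.

```php
<?php
// Hypothetical, deliberately tiny table of accented letters and their plain forms.
// This file is assumed to be saved as UTF-8, like the incoming messages.
$plain = array("ú" => "u", "ù" => "u", "é" => "e", "è" => "e", "í" => "i", "ó" => "o", "á" => "a");

// Replace accented letters before the word filter runs.
function fold_accents($text, array $plain) {
    return strtr($text, $plain);
}

$folded = fold_accents("fúnny wórds", $plain);
echo $folded, "\n"; // funny words
```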

kratsg · August 11, 2009

I will try to test these tags using a textarea to see what really is going on, and I'll apply your suggestion.

I was afraid of trying to match the multiple characters because there are double-letter words (mississippi or sssshhhiiittt), which we could fix by using {2,} instead. I just think of regex as always being greedy xD
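To compare the two quantifiers, here is a quick sketch using made-up sample words. The first pattern is oni-kun's; the second is one reading of the {2,} idea, where only runs of three or more identical letters are collapsed (down to a single letter), so legitimate double letters are left alone.

```php
<?php
$samples = array("mississippi", "sssshhhiiittt", "loooool");
$results = array();

foreach ($samples as $word) {
    // oni-kun's pattern: any run of 2+ identical letters becomes exactly two.
    $toTwo = preg_replace('{([a-zA-Z])\1+}', '$1$1', $word);
    // Variant with {2,}: only runs of 3+ identical letters collapse, down to one.
    $toOne = preg_replace('/([a-zA-Z])\1{2,}/', '$1', $word);
    $results[$word] = array($toTwo, $toOne);
    echo $word, " => ", $toTwo, " / ", $toOne, "\n";
}
// mississippi => mississippi / mississippi
// sssshhhiiittt => sshhiitt / shit
// loooool => lool / lol
```

Either way "mississippi" survives untouched; the difference is whether a stretched word ends up with doubled letters or single ones, which decides what the word list has to contain.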
