
A Dynamic Word Filter


kratsg


So, I was googling around looking for ideas on how to expand the filter already made for one of our AJAX chatrooms. It allows the moderators to dynamically add/delete words from the list of filtered words using a chatroom command inside the chatroom.

 

So, in order to scan and cover more bad words, I was wondering how to do this (we already have it set-up using preg_replace() where all the words are matched case-insensitively.)

 

The list looks something like:

someword

cursing

swears

whee

 

The filter builds the list of words by putting them all into an array, with each word surrounded by forward slashes (the regex delimiters) and the case-insensitive "i" modifier at the end.

 

E.g.: .....,'/someword/i',........

 

So basically, how can I expand that scope so the filter looks not just for someword, but also for s0meword, somew0rd, s0mew0rd, som3word, s0m3word, som3w0rd and s0m3w0rd? For example, by looping through the word and swapping in similar-looking characters...

 

Or with "cursing", the i can be L,1,!, etc... Any ideas?


Unfortunately, I think you're going to have to build your own dictionary/array for the filter to work properly. However, and this might not be the best approach, you could expand the function so that when a word is added to the array, it also runs str_replace() (or a similar function) several times on the original word to add the spellings commonly used in 1337 talk or Hax0r-style talk.

 

For example, say you're adding monkey, for whatever reason, and expanding your filter to look for other variations:

 

$string = "monkey";

$altversion = str_replace('e', '3', $string); // gives "monk3y"

 

Like I said, it's not the best notion in the world, but there's bound to be a way to build an array and a loop that can do that based on this concept.
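
One way the array-and-loop idea might look, as a rough sketch only: the function name leet_variants() and the substitution table are made up for illustration, and the table would need to be filled in with whatever look-alikes your users actually type. It expands a word into every combination of original and substituted letters.

```php
<?php
// Hypothetical helper: expand one word into every look-alike spelling.
// $subs maps a letter to its substitutes; the table used below is an assumption.
function leet_variants($word, array $subs)
{
    $variants = array('');
    foreach (str_split($word) as $char) {
        // Each letter can stay as-is or become any of its look-alikes.
        $options = array($char);
        if (isset($subs[$char])) {
            $options = array_merge($options, $subs[$char]);
        }
        // Append every option to every prefix built so far.
        $next = array();
        foreach ($variants as $prefix) {
            foreach ($options as $option) {
                $next[] = $prefix . $option;
            }
        }
        $variants = $next;
    }
    return $variants;
}

$subs = array('o' => array('0'), 'e' => array('3'));
$variants = leet_variants('someword', $subs);
// Two o's and one e each have two spellings, so this yields 2 * 2 * 2 = 8 variants.
```

The list grows quickly as the table gets bigger (it multiplies for every letter with substitutes), so for long words a single regex with alternatives per letter is probably easier to live with than a stored list of variants.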

 


Now, what about detecting excessive punctuation or spaces?

 

IE:

 

"m!o!n!k!e!y" and "m    on  key" perhaps?

 

I would imagine that the loop itself wouldn't be too hard to manage, but how would something actually recognize excessive punctuation, etc.?


Something like that would be very advanced. I think moneytooth's method is the best so far (well, I can't think of a better way). You could probably do some wicked regex to work it out: for example, take a bad word, split it into its individual characters, and build a regular expression that searches for those characters with any number of spaces in between each one.

 

And then, of course, you could expand that to any number of other characters besides spaces.
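
A minimal sketch of that regex idea, assuming a made-up helper name spaced_pattern() and assuming that "other characters" means anything that is not a letter or digit:

```php
<?php
// Hypothetical helper: build a regex that tolerates spaces and punctuation
// between the letters of a filtered word.
function spaced_pattern($word)
{
    $letters = array();
    foreach (str_split($word) as $char) {
        // Escape each letter in case the word contains regex metacharacters.
        $letters[] = preg_quote($char, '/');
    }
    // [\W_]* allows any run of non-alphanumeric characters (spaces, "!", "_") between letters.
    return '/' . implode('[\W_]*', $letters) . '/i';
}

$pattern = spaced_pattern('monkey');
$a = preg_replace($pattern, '****', 'what a m!o!n!k!e!y');    // "what a ****"
$b = preg_replace($pattern, '****', 'm    on  key business'); // "**** business"
```

Be aware that a pattern this loose can also match across word boundaries (the end of one innocent word plus the start of the next), so it will probably produce some false positives.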


See, I'd imagine that in almost any web programming language we run into a seemingly overwhelming number of permutations when we deal with filters. On the other hand, filtering out HTML tags (like <div> and <span>) is quite easy, because you cannot replace the i with a 1, and capitalization doesn't matter with case-insensitive matches. As a side note, does anyone know why some HTML tags do not get stripped by the strip_tags() function? (I don't know all the exceptions, but <script> and <embed> do not get stripped out, for example.)

 

@mikesta707, thinking about your idea.. couldn't one just prepare a table of replacements ahead of time? I made the following off the top of my head as a basic... If there's any insight into a better way of how to do what's illustrated below, don't hesitate to post :-D Any idea or thought would be nice. User feedback and input = <3

 

<?php
$str = "mixes";

//letters that have common look-alikes, mapped to the regex alternatives for each
//(letters without an entry are matched literally, so this doesn't need all 26)
$replace = array("i" => "(i|1|!|l)", "x" => "(x|cks)", "e" => "(e|3)");
//first, break the word up into its individual letters
$arr = str_split($str);
//next, swap each letter for its regex alternatives, or escape it if it has none
foreach ($arr as $k => $char) {
    $lower = strtolower($char);
    $arr[$k] = isset($replace[$lower]) ? $replace[$lower] : preg_quote($char, "/");
}
//then put it back together, allowing optional whitespace in between letters
$str = implode('\s*', $arr);

//at this point, our word should look like: m\s*(i|1|!|l)\s*(x|cks)\s*(e|3)\s*s

//We can store this line either in the database (after escaping it) or put it in a text file
//we can call it back along with other "smart" words and use this
//assume it was in a file, each line = a new word
$filter = array();
$filename = "filter_list.txt";
$file = fopen($filename,"r");
while(($line = fgets($file)) !== false){
    $line = trim($line);//remove line breaks and extra whitespace
    if ($line === "") continue;//skip blank lines, an empty pattern would match everywhere
    $filter[] = "/$line/i";//add our filter words with the forward-slash delimiters and a case-insensitive flag
}
fclose($file);

$message = preg_replace($filter,"****",$message);//$message refers to a message the user posted
?>
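
As a quick sanity check (just a sketch, with sample strings made up for illustration), the pattern shown in the comment above can be tried against a few inputs once it is wrapped in "/" delimiters with the "i" flag:

```php
<?php
// The pattern from the comment above, with delimiters and the case-insensitive flag.
$pattern = '/m\s*(i|1|!|l)\s*(x|cks)\s*(e|3)\s*s/i';

// Sample inputs are assumptions chosen to exercise each alternative.
$samples = array('mixes', 'M1XES', 'm i cks 3 s', 'mazes');
$hits = array();
foreach ($samples as $sample) {
    // preg_match() returns 1 on a match and 0 otherwise.
    $hits[$sample] = preg_match($pattern, $sample);
}
// The first three samples match; 'mazes' does not.
```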


Speaking of this as a side note, anyone know why there are some html tags that do not get stripped via strip_tags() function (I don't know all the exceptions, but <script> and <embed> do not get stripped out, for example).

 

Some multi-line tags may accidentally pass through, which may be a bug. You could do something like this:

$htmlstring = preg_replace("'<embed[^>]*>.*</embed>'siU",'',$htmlstring);
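
It may also be worth checking what is actually left behind. As far as I know, strip_tags() removes the tags themselves but keeps the text between them, which can look like the tag "passed through". A small sketch, where the sample markup and the <script> pattern are my own assumptions modeled on the <embed> line above:

```php
<?php
$html = '<p>Hello</p><script>alert(1)</script>';

// strip_tags() drops the tags but keeps the text that was inside them.
$stripped = strip_tags($html); // "Helloalert(1)"

// Removing the whole element (tags plus contents) first avoids the leftover text.
// "s" lets "." span line breaks; ".*?" stops at the first closing tag.
$noScript = preg_replace('#<script\b[^>]*>.*?</script>#is', '', $html);
$clean = strip_tags($noScript); // "Hello"
```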

All in all, that suggestion to brute-force look-alike characters is about all you can do. You may also want to match repeated characters, such as 'daaaamn':

$content = "grrrrrrrrrrr arggggg loooool shiiiiiit";
$pattern = '{([a-zA-Z])\1+}';
$replacement = '$1$1';
$filtered = preg_replace($pattern, $replacement, $content); 

 

Another suggestion is to use iconv to transliterate characters such as 'ú' into 'u' beforehand, so 'fú*k' can't pass through unfiltered.
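
A rough sketch of the folding step. iconv('UTF-8', 'ASCII//TRANSLIT', $text) is the built-in route, but its output depends on the platform and locale, so this example uses a small hand-written map instead. The helper name fold_accents() and the map are assumptions, and the map is nowhere near complete:

```php
<?php
// Hypothetical helper: fold a few accented letters to plain ASCII before filtering.
// Assumes the text (and this source file) is UTF-8 encoded.
function fold_accents($text)
{
    $map = array(
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'Á' => 'A', 'É' => 'E', 'Í' => 'I', 'Ó' => 'O', 'Ú' => 'U',
    );
    // strtr() with an array replaces each key with its value in one pass.
    return strtr($text, $map);
}

$folded = fold_accents('fúnny wórds'); // "funny words"
```

Run this on the message before the word filter so the filter list only has to contain the plain spellings.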



I will try to test these tags using a textarea to see what really is going on, and I'll apply your suggestion.

 

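I was afraid of trying to match the multiple characters because there are legitimate double-letter words (mississippi, as opposed to sssshhhiiittt), which we could fix by using {2,} instead of + after the backreference, so that only runs of three or more of the same letter match. I just think of regex as always being greedy xD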
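
A small sketch of the {2,} variant, under the assumption that runs of three or more should collapse to a single letter while ordinary double letters are left alone:

```php
<?php
$content = "mississippi sssshhhiiittt loooool";

// \1{2,} means the captured letter is followed by two or more copies of itself,
// so only runs of three or more identical letters are collapsed to one.
$filtered = preg_replace('/([a-zA-Z])\1{2,}/', '$1', $content);
// "mississippi" is untouched because its repeated letters only come in pairs.
```

A word stretched with only doubled letters would still slip past this, so it is a complement to the filter patterns rather than a replacement for them.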

