
Better rude word filter... when they change le++er5 for characters


15 replies to this topic

#1 Mouse

Mouse
  • Members
  • PipPipPip
  • Advanced Member
  • 95 posts
  • LocationToo Close to LONDON

Posted 20 October 2006 - 03:10 PM

I am using a simple but effective “bad word” filter on my site –

Bad_word Filter code:

Code:

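$bad_words = explode('|', 'badword1|badword2|badword3|etc|etc');
foreach ($bad_words as $naughty)
{
    // str_ireplace() does the same case-insensitive literal replacement;
    // eregi_replace() was removed in PHP 7
    $comments = str_ireplace($naughty, "#!@%*#", $comments);
}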

But as always, when you solve one little problem another one turns up: those pesky kids and their rude #!@%*# words.

As I’ve added words to the filter, people have switched to variants in which one or more characters are replaced with symbols or digits. For example, the word ‘*****’ [sorry for any offence to Meredith Brooks fans, or anybody else for that matter] is easily filtered on its own using a list, but the same word spelt w1tch, wi+ch etc. could be equally offensive.

Any ideas? (Short of trying to work out every variation of every word)

Mouse

MOD EDIT(shoz): edited for language

#2 obsidian

obsidian
  • Staff Alumni
  • Advanced Member
  • 3,202 posts
  • LocationSeattle, WA

Posted 20 October 2006 - 03:17 PM

hmm... that could get a bit tricky, but you might be able to write up a function that when you enter a word into your filter bank comes up with all possible combinations of haxor for that word. you would definitely want it to be on the banking side, though, or else your system resources would be eaten alive by trying to find combinations of every word in a post.
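
A minimal sketch of that idea, run once when a word is added to the filter bank rather than on every post. The substitution table and the function name leet_variants() are hypothetical, not something from this thread, so adjust them to whatever swaps you actually see on your site.

```php
<?php
// Hypothetical substitution table: each letter maps to the characters
// people commonly swap in for it. Extend it to suit your own site.
$leet_map = array(
    'i' => array('1', '!'),
    't' => array('+', '7'),
    's' => array('5', '$'),
);

// Returns every spelling of $word obtained by optionally replacing
// each letter with one of its substitutes (the plain word is included).
function leet_variants($word, $map)
{
    $variants = array('');
    foreach (str_split(strtolower($word)) as $chr) {
        // The letter itself plus any substitutes listed for it
        $choices = array($chr);
        if (isset($map[$chr])) {
            $choices = array_merge($choices, $map[$chr]);
        }
        // Extend every prefix built so far with every choice
        $next = array();
        foreach ($variants as $prefix) {
            foreach ($choices as $choice) {
                $next[] = $prefix . $choice;
            }
        }
        $variants = $next;
    }
    return $variants;
}

$variants = leet_variants('witch', $leet_map);
echo count($variants), " variants\n"; // 9 variants
```

The list grows multiplicatively with each substitutable letter, which is why it makes sense to generate it once at entry time and store the results.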
You can't win, you can't lose, you can't break even... you can't even get out of the game.

<?php
while (count($life->getQuestions()) > 0)
{   $life->study(); } ?>

#3 Psycho

Psycho
  • Moderators
  • Move along, nothing to see here
  • 11,892 posts
  • LocationCanada

Posted 20 October 2006 - 03:47 PM

You are fighting a losing battle. The problem is that there is no magic bullet: you would either need very complex tables to search for all the combinations, or you would have to come up with simpler rules that would create a lot of false positives. For example, the word "botch" could get caught in the screening for variations of the word "bitch". And they will just start using euphemisms or some other way to get their point across that you can't filter.

Perhaps you just need to start banning people.
The quality of the responses received is directly proportional to the quality of the question asked.

I do not always test the code I provide, so there may be some syntax errors. In 99% of all cases I found the solution to your problem here: http://www.php.net

#4 Mouse

Mouse
  • Members
  • PipPipPip
  • Advanced Member
  • 95 posts
  • LocationToo Close to LONDON

Posted 20 October 2006 - 03:57 PM

ok, what if we went for

take each word,
explode it,
if first character is a letter then this is a word
  carry on filtering if you find a space then go to next group
    if you find a number before you find a space treat as bad_word
if first character is a number then this is a number group
  carry on filtering if you find a space then go to next  group
    if you find a letter before you find a space treat as bad_word

ok, you're stuffed if you're telling the world about your love of U2, otherwise you should be ok... what do you think?
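
The rule above could be sketched roughly like this. It is simplified to "any space-separated group that mixes letters and digits is suspect", and the function names are made up for illustration.

```php
<?php
// Returns true when a token mixes letters and digits, which the
// rule above treats as a disguised word.
function is_mixed_token($token)
{
    $has_letter = preg_match('/[a-z]/i', $token) === 1;
    $has_digit  = preg_match('/[0-9]/', $token) === 1;
    return $has_letter && $has_digit;
}

// Splits the text on whitespace and collects the suspect tokens.
function flag_mixed_tokens($text)
{
    $flagged = array();
    foreach (preg_split('/\s+/', trim($text)) as $token) {
        if ($token !== '' && is_mixed_token($token)) {
            $flagged[] = $token;
        }
    }
    return $flagged;
}

print_r(flag_mixed_tokens('I love U2 but w1tch is rude in 2006'));
```

As predicted, "U2" is flagged along with "w1tch", while a plain number such as "2006" passes. Symbol swaps like "wi+ch" are not covered by this rule at all.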

Mouse


#5 oracle259

oracle259
  • Members
  • PipPipPip
  • Advanced Member
  • 119 posts

Posted 20 October 2006 - 04:09 PM

mjdamato has a good point, but to some extent you can implement a language filter. Here is the code I am working on to screen usernames on my site. It's in no way perfect, and in fact I have some more work to do, but you can try it. I only ask that if you modify it, you let me have a look at the new code. It's based on the phpfreaks language filter tutorial. However, since I'm screening usernames that follow a set convention (letters, numbers, periods, hyphens and underscores only), it's easier to screen. In your earlier post you had a problem with +$%^ and other symbols: just add them to $character_arr = array('.', '-', '_', ' ') and that should eliminate the problem.

The table structure is pretty simple:

ID
Profanity : varchar 255
Severity: varchar 1  (1 - high or 2 - low)

<?php
// Passed from database file on include
$dbhost = 'localhost'; 
$dbport = '3306';
$dbuser = '***'; 
$dbpwd = '*******'; 
$database = '***********'; 
$table = 'filter'; 
$col1 = 'profanity';
$col2 = 'severity';

// Passed from settings file on include
$max = '25';
$min = '5';
$show_analysis = '1';


// Passed from language file on include
$msg_fail = "<b>Sorry, the username contains a word or phrase that is not allowed. Please enter another username.</b>";
$msg_pass = '<b>Congratulations, username is available.</b>';
$err_prog = "<b>Program Error:</b>&nbsp;Username must be $min to $max characters in length and contain only letters.";
$err_username = "Please enter your username. Note username may only contain letters, numbers, periods, hyphens and underscores.";
$err_max = "Please enter maximum limit for username. Note maximum limit must be greater than 0.";
$err_min = "Please enter minimum limit for username. Note minimum limit must be greater than 0.";
$err_min_max = "Program Error: Username minimum character limit <b>($min)</b> cannot exceed maximum character limit <b>($max)</b>.";
$err_show_analysis = "Program Error: only 1 - on and 2 - off are appropriate for this setting.";
?>



<?php

global $dbhost, $dbport, $dbuser, $dbpwd, $database, $table, $col1, $col2;
global $username, $max, $min, $show_analysis, $msg_pass, $msg_fail;
global $err_prog, $err_username, $err_max, $err_min, $err_min_max, $err_show_analysis;

// Ensures that username is not numeric, not empty and has been set
if (!isset($_POST['username']) || is_numeric($_POST['username']) || empty($_POST['username'])) {
    die ($err_username); } else {
        $username = $_POST['username'];
      $string = trim($username);
      }

// Ensures that username maximum character limit is numeric, not empty
// and greater than 0
if (!is_numeric($max) || $max <= 0) {
    die ($err_max);} else { $maximum = $max; }
    
// Ensures that username minimum character limit is numeric, not empty
// and greater than 0
if (!is_numeric($min) || $min <= 0) {
    die ($err_min);} else { $minimum = $min; }

// Ensures that username minimum character limit is not greater than maximum
// character limit
if ($min > $max) {
    die ($err_min_max);}

// Ensures that show_analysis is numeric, not empty
if (!is_numeric($show_analysis) || empty($show_analysis)) {
    die ($err_show_analysis); } 

// Ensures that show_analysis is 1 or 2
if ($show_analysis == '1' || $show_analysis == '2') {
$show_analysis = $show_analysis; 
} else { die ($err_show_analysis); 
} 


// Converts username to lower case letters
$string = strtolower($string);
// Strips out HTML and PHP tags
$string = strip_tags($string);

$strip_string = $string;

	$character_arr = array('.', '-', '_', ' ');
	$number_arr = range(0,9);
	$combo_arr = array_merge($number_arr, $character_arr);

    // Strips username of numbers and symbols for an improved word match
    $string = str_replace($combo_arr, '', $string);

    // Ensures that username contains only letters at this stage
    $pattern = "/^[a-z]"."{".$minimum.",". $maximum. "}$/i";
    $isletter = preg_match($pattern, $string);
          if (!$isletter) { 
	die ($err_prog);
		} 
		
// Open database connection
$conn = mysql_connect($dbhost.":".$dbport, $dbuser, $dbpwd);
if (!$conn) {
  die("Failed to connect to database:\n" . mysql_error());
}

// Select database
$db = mysql_select_db($database, $conn);
if (!$db) {
  die("Failed to select database:\n" . mysql_error());
}

// Query database for present curse words
$query = sprintf("SELECT * FROM $table ORDER BY length($col1) DESC");
$request = mysql_query($query);

	if (!$request) {
 die ("Invalid Query: $query<br>\n" . "<b>" . mysql_error() ."</b>");
	} 

// Checks if a record is found
$num_res_rows = mysql_num_rows($request);
if ($num_res_rows < 1) {
die ("Sorry, no entries found.\n" . mysql_error());
}

// Define $obscenities and loop through the results and assign it to
// $obscenities
$obscenities = array();

while ($row = mysql_fetch_assoc($request))  {

$obscenities[] = $row[$col1];
              }

foreach ($obscenities as $curse_word)
	{

     	if (stristr(trim($string), $curse_word))
		{

   			$length = strlen($curse_word);
			$stars = '';
			for ($i = 1; $i <= $length; $i++)
			$stars .= '*';

			// str_ireplace() replaces eregi_replace(), which was removed in PHP 7
			$string = str_ireplace($curse_word, $stars, trim($string));
			$stars = '';

   // Saves curse words used for later analysis
   $badword[] = $curse_word;

		}
	}

// Free $request results
mysql_free_result($request) or die ("Could not free result\n" . mysql_error());

  $new_string = $string;
  $string_cnt = strlen(trim($string));
  $username_cnt = strlen(trim($username));
  $new_string_cnt = strlen(trim($new_string));

  $do_match  = str_replace('*','', $new_string);
  $do_match  = strlen(trim($do_match));

  $badword_cnt = ($username_cnt - $do_match);
  
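// Count the matches outside the analysis block so the checks below
// still work when $show_analysis is '2'
if (!isset($badword)) { $badword = array(); }
$curse_word_cnt = count($badword);
// Last matched word; used by the single-match branch further down
$profanity = $curse_word_cnt > 0 ? $badword[$curse_word_cnt - 1] : '';

if ($show_analysis == '1') {

echo "<br>Username: $username";
echo "<br>Profanities Used: ";
echo $curse_word_cnt > 0 ? implode("  ", $badword) : "none";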

echo "<br>No. of Profanities Used: $curse_word_cnt";
echo "<br>Username after Replacement: $new_string";
echo "<br>Strlen Username: $username_cnt";
echo "<br>Strlen Badword: $badword_cnt<br>";
}


// No match to any curse word in the database
if ($do_match != '0' && $curse_word_cnt == "0" && $new_string_cnt == $do_match)
{
die ($msg_pass);
}

// Exact match to the curse word used
if ($do_match == '0' && $curse_word_cnt >= "1" && $new_string_cnt != $do_match)
{
die ($msg_fail);
}


// Not an exact match, but word or phrase contains 2 or more curse words
if ($do_match != '0' && $curse_word_cnt > "1" && $new_string_cnt != $do_match)
{
 die ($msg_fail);
}

// Not an exact match, but word or phrase contains just 1 curse word
if ($do_match != '0' && $curse_word_cnt == "1" && $new_string_cnt != $do_match)
{

// Calculates percentage similarity between username and curse word
// Compare against the matched word ($curse_word would be the last row of the list)
$text_match = similar_text($username, $profanity, $percent);
$percent = round($percent, 2);

// Query database for the severity level set for the curse word used
$sql = sprintf("SELECT * FROM $table WHERE $col1='$profanity'");
$record = mysql_query($sql);

	if (!$record) {
 die ("Invalid Query: $sql<br>" . "<b>" . mysql_error() ."</b>");
	}

// Checks if a record is found
$num_rec_rows = mysql_num_rows($record);
if ($num_rec_rows < 1) {
die ("Sorry, no entries found.\n" . mysql_error());
}

while ($rec = mysql_fetch_assoc($record)) {
$ranking = $rec[$col2];
}

// Free $record results
mysql_free_result($record) or die ("Could not free result\n" . mysql_error());


// Calculates the spread between the minimum and maximum
// allowable username characters
$spread = $maximum - $minimum;


// Ensures the $ranking is either 1 or 2
if ($ranking == '1' || $ranking == '2') {
} else { die ("Program Error: Invalid System Resource.");
}

// Creates dynamic rank based on spread
if ($ranking == 1) {
    $adv_rank = abs(45 + (($spread/20)*2.25));
} else {
    $adv_rank = abs(35 + (($spread/20)*2.25));
}

$adv_rank = round($adv_rank, 2);

// Assigns rank based on the spread and the severity of the curse word used
if ($ranking == '2') {
$rank = $adv_rank;
$threshold = round(($minimum+$maximum)/($maximum+$minimum+5), 2);
}
else {
$rank = $adv_rank;
$threshold = round(($minimum+$maximum)/($maximum+$minimum+1), 2);
}

// Calculates trigger to test threshold statistic
$lev_distance = levenshtein($username, $profanity);
$trigger = ($badword_cnt/$username_cnt)*$lev_distance;
$trigger = round($trigger, 2);


if ($show_analysis == '1') {
echo "<br>Percent: $percent";
echo "<br>Ranking:  $ranking<br>Rank: $rank<br>Levenshtein Distance: $lev_distance<br>Trigger: $trigger";
echo "<br>Spread: $spread<br>Threshold: $threshold<br>";
}

if ($percent >= $rank && ($ranking == '1' || $ranking == '2')) {
       echo $msg_fail;
}

// Adjustment for zero percent and low percent due to variation in spread
// Minimizes false positives for high severity curse words
if ($percent < $rank && $ranking == '1') {
	if ($trigger <= $threshold || $badword_cnt >= $username_cnt*'0.25') {
		echo $msg_fail;
 			} else { echo $msg_pass;
		  }
		}


// Adjustment for severity, zero and low percent due to variation in spread
// Adds flexibility by allowing phrases with curse words with lower severity
// that would otherwise be blocked provided that they are not too close to curse word
if ($percent < $rank && $ranking == '2') {
	if ($trigger <= $threshold  || $badword_cnt >= $username_cnt*'0.5') {
		echo $msg_fail;
 			} else { echo $msg_pass;
		  }
		}
}

// Close database connection
mysql_close($conn) or die ("Could not close connection\n" . mysql_error());

?>






#6 shoz

shoz
  • Staff Alumni
  • Advanced Member
  • 600 posts

Posted 20 October 2006 - 08:31 PM

Thought I'd give it a shot. It's far from perfect, but does do what you're asking.

Note that characters added as synonym characters should not be letters that have synonyms of their own. For example, since "l" (lower case L) is a synonym for "i", it should not be given synonyms of its own, i.e. no $this->syn_chrs_map['l'] (lower case L). Also note that if a synonym character is part of a "bad word", the filter may not catch that word.


<?php

class word_filter
{
    var $bad_words;
    var $syn_chrs_map;
    var $syn_chrs;
    function __construct() // PHP 4-style word_filter() constructors no longer run in PHP 8
    {
        $this->bad_words = array('word', 'bird', 'cat', 'witch', 'sit');

        $this->syn_chrs_map = array();
        $this->syn_chrs_map['i'] = array('1', 'l');
        $this->syn_chrs_map['t'] = array('+');
        $this->syn_chrs_map['b'] = array('8');
        $this->syn_chrs_map['s'] = array('5');

        $this->syn_chrs = array();
        foreach ($this->syn_chrs_map as $l => $s)
        {
            $this->syn_chrs = array_merge($this->syn_chrs, $s);
        }



    }
    function filter($string)
    {
        $tmp_string = strtolower($string);
        if ($this->syn_chrs_map)
        {
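            // preg_replace_callback() replaces the /e modifier, which was removed in PHP 7
            $regex = '/'.implode('|', array_map('preg_quote', $this->syn_chrs)).'/';
            $self = $this;
            $tmp_string = preg_replace_callback($regex, function ($m) use ($self) { return $self->replace_syn_chrs($m[0]); }, $tmp_string);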

        }
        
        $str_words = explode(' ', $string);
        $tmp_words = explode(' ', $tmp_string);
        $num_words = count($str_words);
        
        $new_string = array();
        for ($i = 0; $i < $num_words; $i++)
        {
            $bad_regex = '/'.implode('|', $this->bad_words).'/i';
            if (preg_match($bad_regex, $tmp_words[$i])
                || preg_match($bad_regex, preg_replace('/[^\w_]/', '', $str_words[$i])))
            {
                $new_string[] = 'x';
            }
            else
            {
                $new_string[] = $str_words[$i];
            }
        }
        return implode(' ', $new_string);

    }
    function replace_syn_chrs($chr)
    {
        foreach (array_keys($this->syn_chrs_map) as $key)
        {
            if (in_array($chr, $this->syn_chrs_map[$key]))
            {
                return $key;

            }
        }

    }

}
$string = 'what word and b1rd 8ird w1+ch and 5iti s*i*t Wo*rd S1tsyou';
print $string."\n";

$word_filter = new word_filter();
print $word_filter->filter($string);
?>

//output
what word and b1rd 8ird w1+ch and 5iti s*i*t Wo*rd S1tsyou
what x and x x x and x x x x


#7 Psycho

Psycho
  • Moderators
  • Move along, nothing to see here
  • 11,892 posts
  • LocationCanada

Posted 20 October 2006 - 08:56 PM

Not trying to rain on your parade, but there are so many permutations that 1) you won't catch all "bad" words and 2) you will start to run into false positives. Once that starts happening people will stop using your forum.

For example I see a lot of people using M$ for Microsoft. That would be included in the banned words. I've also seen album names and such using special characters in them as well.

I do like the approach shoz takes above though. Replacing certain characters with letters and THEN checking the content against your language filter would probably do the best job with a low occurrence of false positives. However, it looks like his code goes a little further than that by filtering any words with special characters. Why is "5iti" filtered?

Here's what I'm thinking. If you want to filter the word "witch" and you know people use "+" to represent a "t", then just do a replace to change "wi+ch" to "witch" and then check against your bad words. Of course, they could always use "w-i-t-c-h"
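
A rough sketch of that replace-then-check idea. The swap list and the function name contains_bad_word() are assumptions for illustration only.

```php
<?php
$bad_words = array('witch');   // placeholder list
$swap_from = array('+', '1');  // assumed look-alike characters...
$swap_to   = array('t', 'i');  // ...and the letters they stand for

// Undo the known character swaps, then do a plain substring check.
function contains_bad_word($text, $bad_words, $swap_from, $swap_to)
{
    $normalised = str_replace($swap_from, $swap_to, strtolower($text));
    foreach ($bad_words as $bad) {
        if (strpos($normalised, $bad) !== false) {
            return true;
        }
    }
    return false;
}

var_dump(contains_bad_word('a wi+ch!', $bad_words, $swap_from, $swap_to));  // bool(true)
var_dump(contains_bad_word('w-i-t-c-h', $bad_words, $swap_from, $swap_to)); // bool(false)
```

The second call shows the gap mentioned above: separators between the letters still get through.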

MOD EDIT(shoz): edited for language

#8 shoz

shoz
  • Staff Alumni
  • Advanced Member
  • 600 posts

Posted 20 October 2006 - 09:05 PM

Why is "5iti" filtered?


5iti is filtered because "sit" is in the bad word list and "5" is a synonym for "s".

Using a "-" in between letters will also be caught by the filter. It won't currently catch a word where a "-"(or other special character) and a "synonym" character is used. That can be added however.
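
One possible way to add that, sketched under the assumption that synonym characters are mapped back to letters first and any remaining non-letters are stripped afterwards. The function name normalise_word() and the character table are illustrative, not part of the class above.

```php
<?php
// Map look-alike characters to letters, then drop separators.
// The order matters: stripping first would also throw away "1" and "+".
function normalise_word($word)
{
    // Step 1: assumed synonym table (1 -> i, + -> t, 8 -> b, 5 -> s)
    $word = str_replace(
        array('1', '+', '8', '5'),
        array('i', 't', 'b', 's'),
        strtolower($word)
    );
    // Step 2: remove whatever is still not a letter, e.g. "-" or "*"
    return preg_replace('/[^a-z]/', '', $word);
}

echo normalise_word('w-1-+-c-h'), "\n"; // witch
echo normalise_word('S*1*t'), "\n";     // sit
```

The normalised word can then be checked against the bad word list as before.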

#9 Psycho

Psycho
  • Moderators
  • Move along, nothing to see here
  • 11,892 posts
  • LocationCanada

Posted 20 October 2006 - 09:23 PM

Ok, so is it replacing any words that begin with bad words? If the user input "he cocked the gun" would "cocked" be filtered if "cock" was in the filter list? I'm not knocking your logic, but just like mail filters, false positives would be worse than missing a few true positives (IMHO).

Here's a good report concerning dealing with leet speak: http://www.mail-arch...u/msg01687.html

#10 shoz

shoz
  • Staff Alumni
  • Advanced Member
  • 600 posts

Posted 20 October 2006 - 09:58 PM

I do agree that false positives are possible. Currently if a word in the "bad" word list is found anywhere in the word, it matches.

Before I try to address some of the "problems" let me make it clear that I'm not for or against the use of this or any filter, but of course will try to "defend" or modify the code up to a point to handle the points raised.

I think that the majority of the words that are generally considered to be offensive are unlikely to cause false positives.

I think a possible solution for the possible/likely positives would be to generate the code using a dictionary which would create a list of exceptions for words that would cause false positives. Meaning the words would be tested against the dictionary and any word matched would become an exception.
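
A small sketch of how such an exception list might be generated. The dictionary here is a tiny hypothetical array; in practice it would be loaded from a word list file, and the function names are made up.

```php
<?php
$bad_words  = array('sit', 'cat');                                     // placeholders from the example above
$dictionary = array('sit', 'site', 'visit', 'cat', 'catalog', 'dog'); // hypothetical word list

// Collect dictionary words that contain a bad word without being one.
function build_exceptions($bad_words, $dictionary)
{
    $exceptions = array();
    foreach ($dictionary as $entry) {
        $entry = strtolower($entry);
        if (in_array($entry, $bad_words, true)) {
            continue; // the bad word itself is never an exception
        }
        foreach ($bad_words as $bad) {
            if (strpos($entry, $bad) !== false) {
                $exceptions[$entry] = true;
                break;
            }
        }
    }
    return $exceptions;
}

// A word is blocked if it contains a bad word and is not an exception.
function is_blocked($word, $bad_words, $exceptions)
{
    $word = strtolower($word);
    if (isset($exceptions[$word])) {
        return false;
    }
    foreach ($bad_words as $bad) {
        if (strpos($word, $bad) !== false) {
            return true;
        }
    }
    return false;
}

$exceptions = build_exceptions($bad_words, $dictionary);
var_dump(is_blocked('visit', $bad_words, $exceptions)); // bool(false)
var_dump(is_blocked('sit', $bad_words, $exceptions));   // bool(true)
```

Building the list once, when the bad word list changes, keeps the per-post check down to one array lookup per word.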

#11 Mouse

Mouse
  • Members
  • PipPipPip
  • Advanced Member
  • 95 posts
  • LocationToo Close to LONDON

Posted 21 October 2006 - 07:46 AM

gentlemen, this is going far further than i thought... many thanks, i'll work on it when i get to my home pc

Mouse

#12 shoz

shoz
  • Staff Alumni
  • Advanced Member
  • 600 posts

Posted 21 October 2006 - 10:38 AM

I've been thinking of better ways to handle the false positive situation, and although I don't know if it's possible to eliminate them, I think that the following approach may effectively deal with the problem.

Instead of removing the text it could instead be hidden. The "xxxx" would be seen and the user could decide whether or not to turn off the filter.

If you decided to, perhaps for "guests" or a user registered as being below a certain age filtering could be turned on by default with no option to turn it off. They'd either have to deal with the false positives or have a more precise filter applied to significantly reduce them. Users could also have filtering turned off/on by default for other users or be allowed to add words to their filter.

Note that this method would be best in a situation where you're simply trying not to offend anyone's sensibilities (i.e. some users do not want to read certain words) but still want to allow users to express themselves freely, or where it's only being used to stop broad "profanity" usage.
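
The hide-instead-of-remove idea could look roughly like this: the post is stored untouched and the masking happens only at display time, depending on a per-user setting. The function render_post() and the $is_bad callback are hypothetical; any filter could be plugged in.

```php
<?php
// Masks filtered words at display time only; the stored text is untouched.
// $is_bad is any callable that decides whether a word should be hidden.
function render_post($text, $is_bad, $filter_on)
{
    $out = array();
    foreach (explode(' ', $text) as $word) {
        if ($filter_on && $is_bad($word)) {
            $out[] = str_repeat('x', strlen($word));
        } else {
            $out[] = $word;
        }
    }
    return implode(' ', $out);
}

// Stand-in check for the example; swap in a real filter here.
$is_bad = function ($word) {
    return stripos($word, 'witch') !== false;
};

$stored = 'that witch again';
echo render_post($stored, $is_bad, true), "\n";  // that xxxxx again
echo render_post($stored, $is_bad, false), "\n"; // that witch again
```

The $filter_on flag would come from the user's profile, or be forced on for guests.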

If the purpose of the filtering is to create a certain environment where rules have been established, and the filter is only there to remove the occasional slip by a member, then the rules should specify what the penalty for breaking them is. Suspending or banning accounts as mjdamato suggests, or perhaps removing specific privileges from users, may be best.

Btw, the filter above can be restricted to matching whole words by changing the $bad_regex assignment to the following (the group keeps ^ and $ applied to every alternative, not just the first and last):

$bad_regex = '/^(?:'.implode('|', $this->bad_words).')$/i';

You can use "\b" in place of ^ and $ to match word boundaries. You can find out the difference here.
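
A short illustration of the difference, using placeholder words. Note the grouping: without it, the anchors bind only to the first and last alternatives.

```php
<?php
$bad_words = array('sit', 'cat'); // placeholder list

// Escape each word for use inside a /.../ pattern
$quoted = array();
foreach ($bad_words as $word) {
    $quoted[] = preg_quote($word, '/');
}
$alternatives = implode('|', $quoted);

$anchored = '/^(?:' . $alternatives . ')$/i';   // the whole string must be a bad word
$bounded  = '/\b(?:' . $alternatives . ')\b/i'; // a bad word standing alone inside longer text

var_dump(preg_match($anchored, 'sit'));             // int(1)
var_dump(preg_match($anchored, 'visit'));           // int(0)
var_dump(preg_match($bounded, 'please sit down'));  // int(1)
var_dump(preg_match($bounded, 'visit the site'));   // int(0)

// Without the group, ^ only applies to "sit" and $ only to "cat"
var_dump(preg_match('/^sit|cat$/i', 'sitting'));    // int(1)
```

The anchored form suits the filter above, which tests one word at a time; \b is the one to use when matching against a whole sentence.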

#13 Mouse

Mouse
  • Members
  • PipPipPip
  • Advanced Member
  • 95 posts
  • LocationToo Close to LONDON

Posted 22 October 2006 - 12:58 PM

Shoz, thank you very much for all your assistance on the bad word filter. It is much appreciated!

I think you’re on to a good thing with the switchable filter. It’ll preserve both personal expression and people’s sensibilities. So yes… and the false positives would become a non-issue.

For my part I have been working on the bad words array/list. (I don’t have php on my works laptop.) I was also thinking that as a combined project it should be posted on the site for use by all… what would you say?

Again, many thanks

Mouse


#14 shoz

shoz
  • Staff Alumni
  • Advanced Member
  • 600 posts

Posted 23 October 2006 - 09:11 AM

For my part I have been working on the bad words array/list. (I don’t have php on my works laptop.) I was also thinking that as a combined project it should be posted on the site for use by all… what would you say?


I don't have any problems with you posting any changes you've made.


#15 Mouse

Mouse
  • Members
  • PipPipPip
  • Advanced Member
  • 95 posts
  • LocationToo Close to LONDON

Posted 23 October 2006 - 01:59 PM

sorry i meant in the scripts pages... my bad for not explaining too well

Mouse

#16 shoz

shoz
  • Staff Alumni
  • Advanced Member
  • 600 posts

Posted 24 October 2006 - 02:25 PM

I don't think this qualifies as a full-fledged script. It's probably more suited as a code example to be put in the PHP Code Library, but I'm not sure what the current status of the code library is.

The main site is being worked on but some sections aren't fully working. To be honest I think I'd want to spend more time on it (which I'm not sure I'll do) before thinking about putting it in the library, but if you'd like to submit it you can try to.





