Better rude word filter... when they change le++er5 for characters


Mouse

I am using a simple but effective “bad word” filter on my site:

[code]
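// str_ireplace() does a case-insensitive literal replacement
// (eregi_replace() no longer exists as of PHP 7)
$bad_words = explode('|', 'badword1|badword2|badword3|etc|etc');
foreach ($bad_words as $naughty)
{
    $comments = str_ireplace($naughty, "#!@%*#", $comments);
}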
[/code]

But as always, when you solve one little problem another comes up: those pesky kids and their rude #!@%*# words.

As I’ve added words to the filter, they have been replaced with other words where one or more characters have been swapped for symbols or digits. E.g. the word ‘*****’ [sorry for any offence to Meredith Brooks fans, or anybody else for that matter] can easily be filtered on its own using a list, but the same word spelt w1tch, wi+ch etc. could be equally offensive.

Any ideas? (Short of trying to work out every variation of every word)

Mouse

[b]MOD EDIT(shoz): edited for language[/b]

Hmm... that could get a bit tricky, but you might be able to write a function that, [b]when you enter a word into your filter bank[/b], comes up with all possible leetspeak combinations for that word. You would definitely want to do it at that point, when the word is added, or else your system resources would be eaten alive trying to find combinations of every word in a post.
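
A minimal sketch of that idea, assuming a hand-made substitution table. The names $leet_map and leet_variants() are made up for illustration, and the table is nowhere near complete. The variants are built once, when the word is added, so posts are only ever compared against a ready-made list:

```php
<?php
// Hypothetical substitution table: letter => look-alike characters (an assumption, extend as needed)
$leet_map = array(
    'i' => array('1', '!'),
    't' => array('+', '7'),
    's' => array('5', '$'),
);

// Builds every spelling of $word that can be made by swapping mapped letters
function leet_variants($word, $map)
{
    $variants = array('');
    foreach (str_split(strtolower($word)) as $chr) {
        // Each position can be the letter itself or any of its look-alikes
        $choices = array($chr);
        if (isset($map[$chr])) {
            $choices = array_merge($choices, $map[$chr]);
        }
        $next = array();
        foreach ($variants as $prefix) {
            foreach ($choices as $choice) {
                $next[] = $prefix . $choice;
            }
        }
        $variants = $next;
    }
    return $variants;
}

$variants = leet_variants('sit', $leet_map);
echo count($variants), " variants\n";   // 3 * 3 * 3 = 27
```

The list grows multiplicatively with word length, which is why this probably only makes sense when a word is added and not on every post.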

You are fighting a losing battle. There is no magic bullet: you would either need very complex tables to search for all the combinations, or simpler rules that would create a lot of false positives. For example, the word "botch" could get caught in the screening for variations of the word "bitch". And they will just start using euphemisms or some other way to get their point across that you can't filter.

Perhaps you just need to start banning people.

ok, what if we went for

take each word,
explode it,
if first character is a letter then this is a word
  carry on filtering if you find a space then go to next group
    if you find a number before you find a space treat as bad_word
if first character is a number then this is a number group
  carry on filtering if you find a space then go to next  group
    if you find a letter before you find a space treat as bad_word

OK, you're stuffed if you're telling the world about your love of U2, otherwise you should be OK... what do you think?
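
A rough sketch of the rule above, simplified to "a space-separated group that contains both letters and digits is suspect". The function name mixed_tokens() is made up for illustration:

```php
<?php
// Returns the space-separated groups that mix letters and digits, e.g. "w1tch"
function mixed_tokens($text)
{
    $suspect = array();
    foreach (preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $has_letter = preg_match('/[a-z]/i', $token) === 1;
        $has_digit  = preg_match('/[0-9]/', $token) === 1;
        if ($has_letter && $has_digit) {
            $suspect[] = $token;
        }
    }
    return $suspect;
}

// "1984" is a pure number group and passes; "U2" is flagged along with "w1tch"
print_r(mixed_tokens('that w1tch likes U2 and 1984'));
```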

Mouse

mjdamato has a good point, but to some extent you can implement a language filter. Here is the code I am working on to screen usernames on my site. It's in no way perfect, in fact I have some more work to do, but you can try it. I only ask that [b]if you modify it, let me have a look at the new code[/b]. It's based on the phpfreaks lang filter tutorial. Since I'm screening usernames that follow a set convention (letters, numbers, periods, hyphens and underscores only), it's easier to screen. In your earlier post you had a problem with +$%^ and other symbols: just add them to [b]$character_arr = array('.', '-', '_', ' ')[/b] and that should eliminate the problem.

The table structure is pretty simple:

ID
Profanity : varchar 255
Severity: varchar 1  (1 - high or 2 - low)

[code]
<?php
// Passed from database file on include
$dbhost = 'localhost';
$dbport = '3306';
$dbuser = '***';
$dbpwd = '*******';
$database = '***********';
$table = 'filter';
$col1 = 'profanity';
$col2 = 'severity';

// Passed from settings file on include
$max = '25';
$min = '5';
$show_analysis = '1';


// Passed from language file on include
$msg_fail = "<b>Sorry, the username contains a word or phrase that is not allowed. Please enter another username.</b>";
$msg_pass = '<b>Congratulations, username is available.</b>';
$err_prog = "<b>Program Error:</b>&nbsp;Username must be $min to $max characters in length and contain only letters.";
$err_username = "Please enter your username. Note username may only contain letters, numbers, periods, hyphens and underscores.";
$err_max = "Please enter maximum limit for username. Note maximum limit must be greater than 0.";
$err_min = "Please enter minimum limit for username. Note minimum limit must be greater than 0.";
$err_min_max = "Program Error: Username minimum character limit <b>($min)</b> cannot exceed maximum character limit <b>($max)</b>.";
$err_show_analysis = "Program Error: only 1 - on and 2 - off are appropriate for this setting.";
?>



<?php

global $dbhost, $dbport, $dbuser, $dbpwd, $database, $table, $col1, $col2;
global $username, $max, $min, $show_analysis, $msg_pass, $msg_fail;
global $err_prog, $err_username, $err_max, $err_min, $err_min_max, $err_show_analysis;

// Ensures that username is not numeric, not empty and has been set
if (!isset($_POST['username']) || is_numeric($_POST['username']) || empty($_POST['username'])) {
    die ($err_username); } else {
        $username = $_POST['username'];
      $string = trim($username);
      }

// Ensures that username maximum character limit is numeric, not empty
// and not 0
if (!is_numeric($max) || empty($max)) {
    die ($err_max);} else { $maximum = $max; }
   
// Ensures that username minimum character limit is numeric, not empty
// and not 0
if (!is_numeric($min) || empty($min)) {
    die ($err_min);} else { $minimum = $min; }

// Ensures that username minimum character limit is not greater than maximum
// character limit
if ($min > $max) {
    die ($err_min_max);}

// Ensures that show_analysis is numeric, not empty
if (!is_numeric($show_analysis) || empty($show_analysis)) {
    die ($err_show_analysis); }

// Ensures that show_analysis is 1 or 2
if ($show_analysis == '1' || $show_analysis == '2') {
$show_analysis = $show_analysis;
} else { die ($err_show_analysis);
}


// Converts username to lower case letters
$string = strtolower($string);
// Strips out HTML and PHP tags
$string = strip_tags($string);

$strip_string = $string;

$character_arr = array('.', '-', '_', ' ');
$number_arr = range(0,9);
$combo_arr = array_merge($number_arr, $character_arr);

    // Strips username of numbers and symbols for an improved word match
    $string = str_replace($combo_arr, '', $string);

    // Ensures that username contains only letters at this stage
    $pattern = "/^[a-z]"."{".$minimum.",". $maximum. "}$/i";
    $isletter = preg_match($pattern, $string);
          if (!$isletter) {
die ($err_prog);
}

// Open database connection
// (the mysql_* extension was removed in PHP 7; mysqli or PDO replaces it)
$conn = mysql_connect($dbhost.":".$dbport, $dbuser, $dbpwd);
if (!$conn) {
  die("Failed to connect to database:\n" . mysql_error());
}

// Select database
$db = mysql_select_db($database, $conn);
if (!$db) {
  die("Failed to select database:\n" . mysql_error());
}

// Query database for present curse words
$query = sprintf("SELECT * FROM $table ORDER BY length($col1) DESC");
$request = mysql_query($query);

if (!$request) {
die ("Invalid Query: $query<br>\n" . "<b>" . mysql_error() ."</b>");
}

// Checks if a record is found
$num_res_rows = mysql_num_rows($request);
if ($num_res_rows < 1) {
die ("Sorry, no entries found.\n" . mysql_error());
}

// Define $obscenities and loop through the results and assign it to
// $obscenities
$obscenities = array();

while ($row = mysql_fetch_assoc($request))  {

$obscenities[] = $row[$col1];
              }

// Collects the curse words found, for later analysis
$badword = array();

foreach ($obscenities as $curse_word)
{
    if (stristr(trim($string), $curse_word))
    {
        // Mask the word with one asterisk per character.
        // str_ireplace() does a case-insensitive literal replacement
        // (eregi_replace() no longer exists as of PHP 7)
        $stars = str_repeat('*', strlen($curse_word));
        $string = str_ireplace($curse_word, $stars, trim($string));

        $badword[] = $curse_word;
    }
}

// Free $request results
mysql_free_result($request) or die ("Could not free result\n" . mysql_error());

  $new_string = $string;
  $string_cnt = strlen(trim($string));
  $username_cnt = strlen(trim($username));
  $new_string_cnt = strlen(trim($new_string));

  $do_match  = str_replace('*','', $new_string);
  $do_match  = strlen(trim($do_match));

  $badword_cnt = ($username_cnt - $do_match);
 
// Count the matches outside the analysis block so the checks below
// still work when $show_analysis is '2'
$found = isset($badword) ? $badword : array();
$curse_word_cnt = count($found);
// Last matched word, used by the single-match checks further down
$profanity = $curse_word_cnt ? end($found) : '';

if ($show_analysis == '1') {
    echo "<br>Username: $username";
    echo "<br>Profanities Used: ";
    echo $curse_word_cnt ? implode("  ", $found) : "none";
    echo "<br>No. of Profanities Used: $curse_word_cnt";
    echo "<br>Username after Replacement: $new_string";
    echo "<br>Strlen Username: $username_cnt";
    echo "<br>Strlen Badword: $badword_cnt<br>";
}


// No match to any curse word in the database
if ($do_match != '0' && $curse_word_cnt == "0" && $new_string_cnt == $do_match)
{
die ($msg_pass);
}

// Exact match to the curse word used
if ($do_match == '0' && $curse_word_cnt >= "1" && $new_string_cnt != $do_match)
{
die ($msg_fail);
}


// Not an exact match, but word or phrase contains 2 or more curse words
if ($do_match != '0' && $curse_word_cnt > "1" && $new_string_cnt != $do_match)
{
die ($msg_fail);
}

// Not an exact match, but word or phrase contains just 1 curse word
if ($do_match != '0' && $curse_word_cnt == "1" && $new_string_cnt != $do_match)
{

// Calculates percentage similarity between username and the matched curse word
// ($profanity; $curse_word only holds the last row fetched from the filter table)
$text_match = similar_text($username, $profanity, $percent);
$percent = round($percent, 2);

// Query database for the severity level set for the curse word used
$sql = sprintf("SELECT * FROM $table WHERE $col1='%s'", mysql_real_escape_string($profanity));
$record = mysql_query($sql);

if (!$record) {
die ("Invalid Query: $sql<br>" . "<b>" . mysql_error() ."</b>");
}

// Checks if a record is found
$num_rec_rows = mysql_num_rows($record);
if ($num_rec_rows < 1) {
die ("Sorry, no entries found.\n" . mysql_error());
}

while ($rec = mysql_fetch_assoc($record)) {
$ranking = $rec[$col2];
}

// Free $record results
mysql_free_result($record) or die ("Could not free result\n" . mysql_error());


// Calculates the spread between the minimum and maximum
// allowable username characters
$spread = $maximum - $minimum;


// Ensures the $ranking is either 1 or 2
if ($ranking == '1' || $ranking == '2') {
} else { die ("Program Error: Invalid System Resource.");
}

// Creates dynamic rank based on spread
if ($ranking == 1) {
    $adv_rank = abs(45 + (($spread/20)*2.25));
} else {
    $adv_rank = abs(35 + (($spread/20)*2.25));
}

$adv_rank = round($adv_rank, 2);

// Assigns rank based on the spread and the severity of the curse word used
if ($ranking == '2') {
$rank = $adv_rank;
$threshold = round(($minimum+$maximum)/($maximum+$minimum+5), 2);
}
else {
$rank = $adv_rank;
$threshold = round(($minimum+$maximum)/($maximum+$minimum+1), 2);
}

// Calculates trigger to test threshold statistic
$lev_distance = levenshtein($username, $profanity);
$trigger = ($badword_cnt/$username_cnt)*$lev_distance;
$trigger = round($trigger, 2);


if ($show_analysis == '1') {
echo "<br>Percent: $percent";
echo "<br>Ranking:  $ranking<br>Rank: $rank<br>Levenshtein Distance: $lev_distance<br>Trigger: $trigger";
echo "<br>Spread: $spread<br>Threshold: $threshold<br>";
}

// High similarity fails regardless of severity level
if ($percent >= $rank && ($ranking == '1' || $ranking == '2')) {
      echo $msg_fail;
}

// Adjustment for zero percent and low percent due to variation in spread
// Minimizes false positives for high severity curse words
if ($percent < $rank && $ranking == '1') {
if ($trigger <= $threshold || $badword_cnt >= $username_cnt*'0.25') {
echo $msg_fail;
} else { echo $msg_pass;
  }
}


// Adjustment for severity, zero and low percent due to variation in spread
// Adds flexibility by allowing phrases with curse words with lower severity
// that would otherwise be blocked provided that they are not too close to curse word
if ($percent < $rank && $ranking == '2') {
if ($trigger <= $threshold  || $badword_cnt >= $username_cnt*'0.5') {
echo $msg_fail;
} else { echo $msg_pass;
  }
}
}

// Close database connection
mysql_close($conn) or die ("Could not close connection\n" . mysql_error());

?>




[/code]

Thought I'd give it a shot. It's far from perfect, but does do what you're asking.

Note that characters added as synonym characters should not have synonyms of their own. For example, since "l" (lower case L) is a synonym for "i", it should not be given its own entry, i.e. no $this->syn_chrs_map['l']. Also note that if a synonym character is itself part of a "bad word", the filter may not catch that word.


[code]
<?php

class word_filter
{
    public $bad_words;
    public $syn_chrs_map;
    public $syn_chrs;
    // PHP 8 no longer treats a method named after the class as the constructor
    function __construct()
    {
        $this->bad_words = array('word', 'bird', 'cat', 'witch', 'sit');

        $this->syn_chrs_map = array();
        $this->syn_chrs_map['i'] = array('1', 'l');
        $this->syn_chrs_map['t'] = array('+');
        $this->syn_chrs_map['b'] = array('8');
        $this->syn_chrs_map['s'] = array('5');

        $this->syn_chrs = array();
        foreach ($this->syn_chrs_map as $l => $s)
        {
            $this->syn_chrs = array_merge($this->syn_chrs, $s);
        }



    }
    function filter($string)
    {
        $tmp_string = strtolower($string);
        if ($this->syn_chrs_map)
        {
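            // preg_replace_callback() takes the place of the /e modifier,
            // which no longer exists as of PHP 7
            $regex = '/'.implode('|', array_map('preg_quote', $this->syn_chrs)).'/';
            $tmp_string = preg_replace_callback($regex, function ($m) {
                return $this->replace_syn_chrs($m[0]);
            }, $tmp_string);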

        }
       
        $str_words = explode(' ', $string);
        $tmp_words = explode(' ', $tmp_string);
        $num_words = count($str_words);
       
        $new_string = array();
        for ($i = 0; $i < $num_words; $i++)
        {
            $bad_regex = '/'.implode('|', $this->bad_words).'/i';
            if (preg_match($bad_regex, $tmp_words[$i])
                || preg_match($bad_regex, preg_replace('/[^\w_]/', '', $str_words[$i])))
            {
                $new_string[] = 'x';
            }
            else
            {
                $new_string[] = $str_words[$i];
            }
        }
        return implode(' ', $new_string);

    }
    function replace_syn_chrs($chr)
    {
        foreach (array_keys($this->syn_chrs_map) as $key)
        {
            if (in_array($chr, $this->syn_chrs_map[$key]))
            {
                return $key;

            }
        }

    }

}
$string = 'what word and b1rd 8ird w1+ch and 5iti s*i*t Wo*rd S1tsyou';
print $string."\n";

$word_filter = new word_filter();
print $word_filter->filter($string);
?>
[/code]

//output
[code]
what word and b1rd 8ird w1+ch and 5iti s*i*t Wo*rd S1tsyou
what x and x x x and x x x x
[/code]

Not trying to rain on your parade, but there are so many permutations that 1) you won't catch all "bad" words and 2) you will start to run into false positives. Once that starts happening people will stop using your forum.

For example I see a lot of people using M$ for Microsoft. That would be included in the banned words. I've also seen album names and such using special characters in them as well.

I do like the approach [b]shoz[/b] takes above though. Replacing certain characters with letters and THEN checking the content against your language filter would probably do the best job with a low occurrence of false positives. However, it looks like his code goes a little further than that by filtering any words with special characters. Why is "5iti" filtered?

Here's what I'm thinking. If you want to filter the word "witch" and you know people use "+" to represent a "t", then just do a replace to change "wi+ch" to "witch" and then check against your bad words. Of course, they could always use "w-i-t-c-h".
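
A minimal sketch of that replace-then-check idea. The symbol table, the word list and the name contains_bad_word() are placeholders I made up, not anything from the code posted above:

```php
<?php
// Assumed look-alike symbols and the letters they stand for (same positions in both arrays)
$symbols = array('+', '1', '5', '8');
$letters = array('t', 'i', 's', 'b');
$bad_words = array('witch', 'badword1');

// Returns true if $text contains a listed word once the symbols are mapped back to letters
function contains_bad_word($text, $symbols, $letters, $bad_words)
{
    $normalised = str_replace($symbols, $letters, strtolower($text));
    foreach ($bad_words as $bad) {
        if (strpos($normalised, $bad) !== false) {
            return true;
        }
    }
    return false;
}

var_dump(contains_bad_word('you wi+ch', $symbols, $letters, $bad_words));   // bool(true)
var_dump(contains_bad_word('w-i-t-c-h', $symbols, $letters, $bad_words));   // bool(false)
var_dump(contains_bad_word('I use M$', $symbols, $letters, $bad_words));    // bool(false)
```

Only symbols you have explicitly listed are touched, so something like "M$" passes through unchanged unless "$" is added to the table.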

[b]MOD EDIT(shoz): edited for language[/b]

[quote=mjdamato]
Why is "5iti" filtered?
[/quote]

5iti is filtered because "sit" is in the bad word list and "5" is a synonym for "s".

Using a "-" in between letters will also be caught by the filter. It won't currently catch a word where a "-"(or other special character) and a "synonym" character is used. That can be added however.
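
One possible way to add that, sketched as a standalone function and not as a patch to the class: map the synonym characters first, then drop whatever non-letters are left before matching. The name normalise_word() and the tables are assumptions for illustration:

```php
<?php
// Assumed tables: synonym characters and the letters they stand for
$symbols = array('1', '+', '8', '5');
$letters = array('i', 't', 'b', 's');
$bad_words = array('witch', 'sit');

// Maps synonym characters to letters, then strips any remaining non-letters
function normalise_word($word, $symbols, $letters)
{
    $word = str_replace($symbols, $letters, strtolower($word));
    return preg_replace('/[^a-z]/', '', $word);
}

foreach (array('w-1-+-c-h', 's*1*t', 'hello') as $word) {
    $clean = normalise_word($word, $symbols, $letters);
    $hit = in_array($clean, $bad_words, true) ? 'filtered' : 'ok';
    echo "$word => $clean ($hit)\n";
}
```

The order matters: stripping the special characters first would throw away "+" before it could be read as "t".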

[quote author=shoz link=topic=112126.msg455182#msg455182 date=1161378303]
5iti is filtered because "sit" is in the bad word list and "5" is a synonym for "s".

Using a "-" in between letters will also be caught by the filter. It won't currently catch a word where a "-"(or other special character) and a "synonym" character is used. That can be added however.
[/quote]

Ok, so is it replacing any words that [i]begin[/i] with bad words? If the user input "he cocked the gun" would "cocked" be filtered if "cock" was in the filter list? I'm not knocking your logic, but just like mail filters, false positives would be worse than missing a few true positives (IMHO).

Here's a good report concerning dealing with leet speak: http://www.mail-archive.com/wryting-l@listserv.wvu.edu/msg01687.html

[quote author=mjdamato link=topic=112126.msg455190#msg455190 date=1161379380]
Ok, so is it replacing any words that [i]begin[/i] with bad words? If the user input "he cocked the gun" would "cocked" be filtered if "cock" was in the filter list? I'm not knocking your logic, but just like mail filters, false positives would be worse than missing a few true positives (IMHO).

Here's a good report concerning dealing with leet speak: http://www.mail-archive.com/wryting-l@listserv.wvu.edu/msg01687.html
[/quote]

I do agree that false positives are possible. Currently if a word in the "bad" word list is found anywhere in the word, it matches.

Before I try to address some of the "problems" let me make it clear that I'm not for or against the use of this or any filter, but of course will try to "defend" or modify the code up to a point to handle the points raised.

I think that the majority of the words that are generally considered to be offensive are unlikely to cause false positives.

I think a possible solution for the possible/likely positives would be to generate the code using a dictionary which would create a list of exceptions for words that would cause false positives. Meaning the words would be tested against the dictionary and any word matched would become an exception.
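
A small sketch of what I mean, with tiny placeholder arrays standing in for a real dictionary file. The function names build_exceptions() and is_filtered() are made up for illustration:

```php
<?php
// Placeholder lists; a real dictionary would be loaded from a word file
$bad_words  = array('sit', 'cat');
$dictionary = array('sit', 'site', 'visit', 'cat', 'catalog', 'dog');

// Collects dictionary words that merely contain a bad word, so they can be exempted
function build_exceptions($bad_words, $dictionary)
{
    $exceptions = array();
    foreach ($dictionary as $entry) {
        foreach ($bad_words as $bad) {
            if ($entry !== $bad && stripos($entry, $bad) !== false) {
                $exceptions[] = $entry;
                break;
            }
        }
    }
    return $exceptions;
}

// A word is filtered if it contains a bad word and is not a known exception
function is_filtered($word, $bad_words, $exceptions)
{
    if (in_array(strtolower($word), $exceptions, true)) {
        return false;
    }
    foreach ($bad_words as $bad) {
        if (stripos($word, $bad) !== false) {
            return true;
        }
    }
    return false;
}

$exceptions = build_exceptions($bad_words, $dictionary);
print_r($exceptions);   // site, visit, catalog
var_dump(is_filtered('visit', $bad_words, $exceptions));     // bool(false)
var_dump(is_filtered('sitsyou', $bad_words, $exceptions));   // bool(true)
```

The exception list would be generated once, ahead of time, and not rebuilt for every post.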

I've been thinking of ways to handle the false positive situation in a better way and although I don't know if it's possible to eliminate them, I think that the following approach may effectively deal with the problem.

Instead of removing the text it could instead be hidden. The "xxxx" would be seen and the user could decide whether or not to turn off the filter.

If you decided to, perhaps for "guests" or a user registered as being below a certain age filtering could be turned on by default with no option to turn it off. They'd either have to deal with the false positives or have a more precise filter applied to significantly reduce them. Users could also have filtering turned off/on by default for other users or be allowed to add words to their filter.
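
As a rough sketch, assuming the post is stored unfiltered and masked only when it is displayed. The function name render_post() and the $filter_on flag are hypothetical; where the flag comes from (user preference, guest status, age) is up to the site:

```php
<?php
// Renders a stored post for one viewer; listed words are masked only if that viewer's filter is on
function render_post($text, $bad_words, $filter_on)
{
    if (!$filter_on) {
        return $text;
    }
    $pattern = '/\b(?:' . implode('|', array_map('preg_quote', $bad_words)) . ')\b/i';
    return preg_replace_callback($pattern, function ($m) {
        // One "x" per hidden character, so the reader can see something was hidden
        return str_repeat('x', strlen($m[0]));
    }, $text);
}

$post = 'She is a Witch';
echo render_post($post, array('witch'), true), "\n";    // She is a xxxxx
echo render_post($post, array('witch'), false), "\n";   // She is a Witch
```

Because the original text is never changed, a false positive costs the reader one click and not the content.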

Note that this method would be best in a situation where you're simply trying not to offend anyone's sensibilities (i.e. some users do not want to read certain words) but still want to allow users to express themselves freely, or where it's only being used to stop broad "profanity" usage.

If the purpose of the filtering is to create a certain environment where rules have been established, and the filtering is there to remove the occasional slip by a member, then the rules should specify what the penalty for breaking them is. Suspending or banning accounts as mjdamato suggests, or perhaps removing specific privileges from users, may be best.

Btw, the filter above can be restricted to matching whole words by changing the $bad_regex assignment to the following

[code]
// The (?: ) group makes ^ and $ apply to every alternative, not just the first and last
$bad_regex = '/^(?:'.implode('|', $this->bad_words).')$/i';
[/code]

You can use "\b" in place of ^ and $ to match word boundaries. You can find out the difference [url=http://regularexpressions.info/wordboundaries.html]here[/url].
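
A quick comparison of the two, using a couple of words from the example list. The alternatives are wrapped in a (?: ) group so the anchors or boundaries apply to each word in the list:

```php
<?php
$bad_words = array('sit', 'cat');
$alternatives = implode('|', array_map('preg_quote', $bad_words));

$anchored = '/^(?:' . $alternatives . ')$/i';    // the whole string must be the word
$bounded  = '/\b(?:' . $alternatives . ')\b/i';  // the word may appear inside a longer string

var_dump(preg_match($anchored, 'sit'));          // int(1)
var_dump(preg_match($anchored, 'please sit'));   // int(0)
var_dump(preg_match($bounded,  'please sit'));   // int(1)
var_dump(preg_match($bounded,  'visiting'));     // int(0)
```

Since the filter above tests one space-separated word at a time, ^ and $ are enough there; "\b" is the one to use if you match against a whole sentence.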

Shoz, thank you very much for all your assistance on the bad word filter. It is much appreciated!

I think you’re on to a good thing with the switchable filter. It’ll preserve personal expression and people’s sensibilities. So yes… and the false positives would become a non-issue.

For my part I have been working on the bad words array/list. (I don’t have PHP on my work laptop.) I was also thinking that as a combined project it should be posted on the site for use by all… what would you say?

Again, many thanks

Mouse

[quote author=Mouse link=topic=112126.msg455802#msg455802 date=1161521901]
For my part I have been working on the bad words array/list. (I don’t have php on my works laptop.) I was also thinking that as a combined project it should be posted on the site for use by all… what would you say?
[/quote]

I don't have any problems with you posting any changes you've made.

I don't think this qualifies as a full-fledged script. It's probably more suited as a code example to be put in the PHP Code Library, but I'm not sure what the current status of the code library is.

The main site is being worked on but some sections aren't fully working. To be honest I think I'd want to spend more time on it (which I'm not sure I'll do) before thinking about putting it in the library, but if you'd like to submit it you can try to.

