Mouse Posted October 20, 2006

I am using a simple but effective "bad word" filter on my site.

Bad word filter code:

[code]$bad_words = explode('|', 'badword1|badword2|badword3|etc|etc');
foreach ($bad_words as $naughty) {
    // Case-insensitive literal replacement (eregi_replace() is not available in PHP 7+)
    $comments = str_ireplace($naughty, "#!@%*#", $comments);
}[/code]

But as always, when you solve one little problem you come up with another: those pesky kids and their rude #!@%*# words. As I've added words to the filter, they get replaced with variants where one or more characters have been swapped for symbols or digits. Take the word '*****' [sorry for any offence to Meredith Brooks fans, or anybody else for that matter]: it can easily be filtered on its own using a list, but the same word spelt w1tch, wi+ch etc. could be equally offensive.

Any ideas? (Short of trying to work out every variation of every word.)

Mouse

[b]MOD EDIT(shoz): edited for language[/b]
obsidian Posted October 20, 2006

Hmm... that could get a bit tricky, but you might be able to write a function that, [b]when you enter a word into your filter bank[/b], comes up with all possible haxor combinations for that word. You would definitely want it to be on the banking side, though, or else your system resources would be eaten alive by trying to find combinations of every word in a post.
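A minimal sketch of the banking-side idea above, assuming a hypothetical substitution table ($substitutes) and a hypothetical helper name (leet_variants); neither comes from the thread. It expands one word into every spelling the table allows, which would be done once when the word is added to the list rather than on every post:

```php
<?php
// Hypothetical substitution table: letter => look-alike characters (assumed).
$substitutes = array(
    'i' => array('1', '!'),
    't' => array('+'),
    's' => array('5', '$'),
);

/**
 * Returns every spelling of $word obtained by optionally swapping each
 * letter for one of its look-alikes.
 */
function leet_variants($word, array $substitutes)
{
    $variants = array('');
    foreach (str_split(strtolower($word)) as $chr) {
        // The letter itself plus any look-alikes listed for it
        $choices = array($chr);
        if (isset($substitutes[$chr])) {
            $choices = array_merge($choices, $substitutes[$chr]);
        }
        // Extend every spelling built so far with every choice
        $next = array();
        foreach ($variants as $prefix) {
            foreach ($choices as $choice) {
                $next[] = $prefix . $choice;
            }
        }
        $variants = $next;
    }
    return $variants;
}

$variants = leet_variants('witch', $substitutes);
print implode(', ', $variants) . "\n";
```

The number of variants is the product of the choices per letter (3 x 2 = 6 for this word and table), so it grows quickly with longer words and bigger tables. That is the reason for doing it when the word is stored and not per post.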
Psycho Posted October 20, 2006

You are fighting a losing battle. The problem is that there is no magic bullet: you would either need very complex tables to search for all the combinations, or you would have to come up with simpler rules that would create a lot of false positives. For example, the word "botch" could get caught in the screening for variations of the word "bitch". And they will just start using euphemisms or some other way to get their point across that you can't filter.

Perhaps you just need to start banning people.
Mouse Posted October 20, 2006

OK, what if we went for:

take each word,
explode it,
if first character is a letter then this is a word
    carry on filtering
    if you find a space then go to next group
    if you find a number before you find a space treat as bad_word
if first character is a number then this is a number group
    carry on filtering
    if you find a space then go to next group
    if you find a letter before you find a space treat as bad_word

OK, you're stuffed if you're telling the world about your love of U2, otherwise you should be OK... what do you think?

Mouse
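The rule above can be sketched as follows. This is a simplified reading, not Mouse's exact steps: instead of looking at the first character, it flags any space-delimited token that contains both a letter and a digit. The function names are made up for illustration:

```php
<?php
/**
 * Returns true when a token mixes letters and digits,
 * e.g. "w1tch" or "U2", but not "witch" or "42".
 */
function is_mixed_token($token)
{
    return preg_match('/[a-z]/i', $token) === 1
        && preg_match('/[0-9]/', $token) === 1;
}

/**
 * Splits $text on whitespace and returns the tokens that
 * is_mixed_token() flags, in their original order.
 */
function flag_mixed_tokens($text)
{
    $flagged = array();
    foreach (preg_split('/\s+/', trim($text)) as $token) {
        if ($token !== '' && is_mixed_token($token)) {
            $flagged[] = $token;
        }
    }
    return $flagged;
}

print_r(flag_mixed_tokens('the w1tch saw 42 fans of U2'));
```

As predicted in the post, "U2" is flagged along with "w1tch", so a rule like this would probably need an allow-list for legitimate mixed tokens. Symbol substitutions such as "wi+ch" are not covered by it at all.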
oracle259 Posted October 20, 2006

mjdamato has a good point, but to some extent you can implement a language filter. Here is the code I am working on to screen usernames on my site. It's in no way perfect; in fact I have some more work to do, but you can try it. I only ask that [b]if you modify it, let me have a look at the new code[/b]. It's based on the phpfreaks lang filter tutorial. However, since I'm screening usernames that follow a set convention (letters, numbers, periods, hyphens and underscores only), it's easier to screen. In your earlier post you had a problem with +$%^ and other symbols: just add them to [b]$character_arr = array('.', '-', '_', ' ')[/b] and that should eliminate the problem.

The table structure is pretty simple:

ID
Profanity: varchar 255
Severity: varchar 1 (1 - high or 2 - low)

[code]<?php
// Passed from database file on include
$dbhost = 'localhost';
$dbport = '3306';
$dbuser = '***';
$dbpwd = '*******';
$database = '***********';
$table = 'filter';
$col1 = 'profanity';
$col2 = 'severity';

// Passed from settings file on include
$max = '25';
$min = '5';
$show_analysis = '1';

// Passed from language file on include
$msg_fail = "<b>Sorry, the username contains a word or phrase that is not allowed. Please enter another username.</b>";
$msg_pass = '<b>Congratulations, username is available.</b>';
$err_prog = "<b>Program Error:</b> Username must be $min to $max characters in length and contain only letters.";
$err_username = "Please enter your username. Note username may only contain letters, numbers, periods, hyphens and underscores.";
$err_max = "Please enter maximum limit for username. Note maximum limit must be greater than 0.";
$err_min = "Please enter minimum limit for username. Note minimum limit must be greater than 0.";
$err_min_max = "Program Error: Username minimum character limit <b>($min)</b> cannot exceed maximum character limit <b>($max)</b>.";
$err_show_analysis = "Program Error: only 1 - on and 2 - off are appropriate for this setting.";

global $dbhost, $dbport, $dbuser, $dbpwd, $database, $table, $col1, $col2;
global $username, $max, $min, $show_analysis, $msg_pass, $msg_fail;
global $err_prog, $err_username, $err_max, $err_min, $err_min_max, $err_show_analysis;

// Ensures that username has been set, is not empty and is not numeric
if (!isset($_POST['username']) || is_numeric($_POST['username']) || empty($_POST['username'])) {
    die($err_username);
} else {
    $username = $_POST['username'];
    $string = trim($username);
}

// Ensures that username maximum character limit is numeric and greater than 0
if (!is_numeric($max) || empty($max)) {
    die($err_max);
} else {
    $maximum = (int) $max;
}

// Ensures that username minimum character limit is numeric and greater than 0
if (!is_numeric($min) || empty($min)) {
    die($err_min);
} else {
    $minimum = (int) $min;
}

// Ensures that username minimum character limit is not greater than maximum
// character limit
if ($minimum > $maximum) {
    die($err_min_max);
}

// Ensures that show_analysis is 1 or 2
if ($show_analysis != '1' && $show_analysis != '2') {
    die($err_show_analysis);
}

// Converts username to lower case letters
$string = strtolower($string);

// Strips out HTML and PHP tags
$string = strip_tags($string);

$character_arr = array('.', '-', '_', ' ');
$number_arr = range(0, 9);
$combo_arr = array_merge($number_arr, $character_arr);

// Strips username of numbers and symbols for an improved word match
$string = str_replace($combo_arr, '', $string);

// Ensures that username contains only letters at this stage
$pattern = '/^[a-z]{' . $minimum . ',' . $maximum . '}$/i';
$isletter = preg_match($pattern, $string);
if (!$isletter) {
    die($err_prog);
}

// Open database connection and select database
// (mysqli is used here because the mysql_* functions are not available in PHP 7+)
$conn = mysqli_connect($dbhost, $dbuser, $dbpwd, $database, (int) $dbport);
if (!$conn) {
    die("Failed to connect to database:\n" . mysqli_connect_error());
}

// Query database for present curse words, longest first
$query = "SELECT * FROM $table ORDER BY LENGTH($col1) DESC";
$request = mysqli_query($conn, $query);
if (!$request) {
    die("Invalid Query: $query<br>\n" . "<b>" . mysqli_error($conn) . "</b>");
}

// Checks if a record is found
$num_res_rows = mysqli_num_rows($request);
if ($num_res_rows < 1) {
    die("Sorry, no entries found.\n" . mysqli_error($conn));
}

// Define $obscenities and loop through the results and assign it to
// $obscenities
$obscenities = array();
while ($row = mysqli_fetch_assoc($request)) {
    $obscenities[] = $row[$col1];
}

$badword = array();
foreach ($obscenities as $curse_word) {
    if (stristr($string, $curse_word) !== false) {
        // Replaces the curse word with the same number of stars
        $stars = str_repeat('*', strlen($curse_word));
        $string = str_ireplace($curse_word, $stars, $string);
        // Saves curse words used for later analysis
        $badword[] = $curse_word;
    }
}

// Free $request results
mysqli_free_result($request);

$new_string = $string;
$username_cnt = strlen(trim($username));
$new_string_cnt = strlen(trim($new_string));
$do_match = strlen(trim(str_replace('*', '', $new_string)));
$badword_cnt = ($username_cnt - $do_match);
$curse_word_cnt = count($badword);

if ($show_analysis == '1') {
    echo "<br>Username: " . htmlspecialchars($username);
    echo "<br>Profanities Used: ";
    echo ($curse_word_cnt > 0) ? implode(' ', $badword) : 'none';
    echo "<br>No. of Profanities Used: $curse_word_cnt";
    echo "<br>Username after Replacement: $new_string";
    echo "<br>Strlen Username: $username_cnt";
    echo "<br>Strlen Badword: $badword_cnt<br>";
}

// No match to any curse word in the database
if ($do_match != 0 && $curse_word_cnt == 0 && $new_string_cnt == $do_match) {
    die($msg_pass);
}

// Exact match to the curse word used
if ($do_match == 0 && $curse_word_cnt >= 1 && $new_string_cnt != $do_match) {
    die($msg_fail);
}

// Not an exact match, but word or phrase contains 2 or more curse words
if ($do_match != 0 && $curse_word_cnt > 1 && $new_string_cnt != $do_match) {
    die($msg_fail);
}

// Not an exact match, but word or phrase contains just 1 curse word
if ($do_match != 0 && $curse_word_cnt == 1 && $new_string_cnt != $do_match) {
    // The single curse word that was matched
    $profanity = $badword[0];

    // Calculates percentage similarity between username and curse word
    similar_text($username, $profanity, $percent);
    $percent = round($percent, 2);

    // Query database for the severity level set for the curse word used
    $sql = sprintf(
        "SELECT * FROM $table WHERE $col1='%s'",
        mysqli_real_escape_string($conn, $profanity)
    );
    $record = mysqli_query($conn, $sql);
    if (!$record) {
        die("Invalid Query: $sql<br>" . "<b>" . mysqli_error($conn) . "</b>");
    }

    // Checks if a record is found
    $num_rec_rows = mysqli_num_rows($record);
    if ($num_rec_rows < 1) {
        die("Sorry, no entries found.\n" . mysqli_error($conn));
    }
    while ($rec = mysqli_fetch_assoc($record)) {
        $ranking = $rec[$col2];
    }

    // Free $record results
    mysqli_free_result($record);

    // Calculates the spread between the minimum and maximum
    // allowable username characters
    $spread = $maximum - $minimum;

    // Ensures the $ranking is either 1 or 2
    if ($ranking != '1' && $ranking != '2') {
        die("Program Error: Invalid System Resource.");
    }

    // Creates dynamic rank based on spread
    if ($ranking == '1') {
        $adv_rank = abs(45 + (($spread / 20) * 2.25));
    } else {
        $adv_rank = abs(35 + (($spread / 20) * 2.25));
    }
    $rank = round($adv_rank, 2);

    // Assigns threshold based on the severity of the curse word used
    if ($ranking == '2') {
        $threshold = round(($minimum + $maximum) / ($maximum + $minimum + 5), 2);
    } else {
        $threshold = round(($minimum + $maximum) / ($maximum + $minimum + 1), 2);
    }

    // Calculates trigger to test threshold statistic
    $lev_distance = levenshtein($username, $profanity);
    $trigger = round(($badword_cnt / $username_cnt) * $lev_distance, 2);

    if ($show_analysis == '1') {
        echo "<br>Percent: $percent";
        echo "<br>Ranking: $ranking<br>Rank: $rank<br>Levenshtein Distance: $lev_distance<br>Trigger: $trigger";
        echo "<br>Spread: $spread<br>Threshold: $threshold<br>";
    }

    // Assumed intent: fail for either severity once the similarity reaches the rank
    if ($percent >= $rank && ($ranking == '1' || $ranking == '2')) {
        echo $msg_fail;
    }

    // Adjustment for zero percent and low percent due to variation in spread
    // Minimizes false positives for high severity curse words
    if ($percent < $rank && $ranking == '1') {
        if ($trigger <= $threshold || $badword_cnt >= $username_cnt * 0.25) {
            echo $msg_fail;
        } else {
            echo $msg_pass;
        }
    }

    // Adjustment for severity, zero and low percent due to variation in spread
    // Adds flexibility by allowing phrases with curse words with lower severity
    // that would otherwise be blocked provided that they are not too close to curse word
    if ($percent < $rank && $ranking == '2') {
        if ($trigger <= $threshold || $badword_cnt >= $username_cnt * 0.5) {
            echo $msg_fail;
        } else {
            echo $msg_pass;
        }
    }
}

// Close database connection
mysqli_close($conn) or die("Could not close connection\n");
?>[/code]
shoz Posted October 20, 2006

Thought I'd give it a shot. It's far from perfect, but it does do what you're asking.

Note that characters added as synonym characters should not be letters. For example, since "l" (lower case L) is a synonym for "i", it should not be given synonyms of its own, i.e. no $this->syn_chrs_map['l'] (lower case L). Also note that if a synonym is part of a "bad word", the filter may not catch it.

[code]<?php
class word_filter
{
    public $bad_words;
    public $syn_chrs_map;
    public $syn_chrs;

    function __construct()
    {
        $this->bad_words = array('word', 'bird', 'cat', 'witch', 'sit');

        // Letter => characters commonly used in its place
        $this->syn_chrs_map = array();
        $this->syn_chrs_map['i'] = array('1', 'l');
        $this->syn_chrs_map['t'] = array('+');
        $this->syn_chrs_map['b'] = array('8');
        $this->syn_chrs_map['s'] = array('5');

        // Flat list of every synonym character
        $this->syn_chrs = array();
        foreach ($this->syn_chrs_map as $l => $s) {
            $this->syn_chrs = array_merge($this->syn_chrs, $s);
        }
    }

    function filter($string)
    {
        $tmp_string = strtolower($string);
        if ($this->syn_chrs_map) {
            // Replace each synonym character with the letter it stands for
            $regex = '/'.implode('|', array_map('preg_quote', $this->syn_chrs)).'/';
            $tmp_string = preg_replace_callback($regex, array($this, 'replace_syn_chrs'), $tmp_string);
        }

        $str_words = explode(' ', $string);
        $tmp_words = explode(' ', $tmp_string);
        $num_words = count($str_words);
        $new_string = array();
        $bad_regex = '/'.implode('|', $this->bad_words).'/i';

        for ($i = 0; $i < $num_words; $i++) {
            // Check the synonym-replaced word, then the original word with
            // all non-word characters stripped out
            if (preg_match($bad_regex, $tmp_words[$i])
                || preg_match($bad_regex, preg_replace('/[^\w_]/', '', $str_words[$i]))) {
                $new_string[] = 'x';
            } else {
                $new_string[] = $str_words[$i];
            }
        }
        return implode(' ', $new_string);
    }

    // Callback for preg_replace_callback(): $match[0] is the synonym character found
    function replace_syn_chrs($match)
    {
        $chr = $match[0];
        foreach (array_keys($this->syn_chrs_map) as $key) {
            if (in_array($chr, $this->syn_chrs_map[$key], true)) {
                return $key;
            }
        }
        return $chr;
    }
}

$string = 'what word and b1rd 8ird w1+ch and 5iti s*i*t Wo*rd S1tsyou';
print $string."\n";
$word_filter = new word_filter();
print $word_filter->filter($string);
?>[/code]

//output
[code]what word and b1rd 8ird w1+ch and 5iti s*i*t Wo*rd S1tsyou
what x and x x x and x x x x[/code]
Psycho Posted October 20, 2006

Not trying to rain on your parade, but there are so many permutations that 1) you won't catch all "bad" words and 2) you will start to run into false positives. Once that starts happening people will stop using your forum.

For example, I see a lot of people using M$ for Microsoft. That would be included in the banned words. I've also seen album names and such using special characters in them as well.

I do like the approach [b]shoz[/b] takes above, though. Replacing certain characters with letters and THEN checking the content against your language filter would probably do the best job with a low occurrence of false positives. However, it looks like his code goes a little further than that by filtering any words with special characters. Why is "5iti" filtered?

Here's what I'm thinking. If you want to filter the word "witch" and you know people use "+" to represent a "t", then just do a replace to change "wi+ch" to "witch" and then check against your bad words. Of course, they could always use "w-i-t-c-h".

[b]MOD EDIT(shoz): edited for language[/b]
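A minimal sketch of the replace-then-check idea, assuming a hypothetical look-alike map ($lookalikes), word list ($bad_words) and function name (contains_bad_word) that are not from the thread. It only swaps characters that are in the map and then looks for a listed word that is not embedded in a longer run of letters:

```php
<?php
// Hypothetical look-alike map and word list (assumed for illustration).
$lookalikes = array('+' => 't', '1' => 'i', '5' => 's', '8' => 'b');
$bad_words  = array('witch', 'sit');

/**
 * Lower-cases $text, maps look-alike characters back to letters, then
 * reports whether any listed word appears with no letter on either side.
 */
function contains_bad_word($text, array $lookalikes, array $bad_words)
{
    $normalised = strtr(strtolower($text), $lookalikes);
    $quoted = array();
    foreach ($bad_words as $word) {
        $quoted[] = preg_quote($word, '/');
    }
    // Lookarounds stop "sit" from matching inside "visit"
    $regex = '/(?<![a-z])(?:' . implode('|', $quoted) . ')(?![a-z])/';
    return preg_match($regex, $normalised) === 1;
}

var_dump(contains_bad_word('the wi+ch is here', $lookalikes, $bad_words)); // bool(true)
var_dump(contains_bad_word('M$ Office', $lookalikes, $bad_words));         // bool(false)
var_dump(contains_bad_word('w-i-t-c-h', $lookalikes, $bad_words));         // bool(false)
```

Because "$" is not in the map, "M$" passes untouched, and words that merely contain a listed word are left alone. The last call shows the gap mentioned in the post: separators between letters defeat this version.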
shoz Posted October 20, 2006

[quote=mjdamato]Why is "5iti" filtered?[/quote]

5iti is filtered because "sit" is in the bad word list and "5" is a synonym for "s". Using a "-" in between letters will also be caught by the filter. It won't currently catch a word where a "-" (or other special character) and a "synonym" character are both used. That can be added, however.
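One way the missing case (separators combined with synonym characters) might be added is to map the synonym characters first and only then strip whatever non-letters remain in each token. This is a sketch under assumptions, not shoz's code; $lookalikes, $bad_words, squash_token and token_is_bad are hypothetical names:

```php
<?php
// Hypothetical look-alike map and word list (assumed for illustration).
$lookalikes = array('+' => 't', '1' => 'i', '5' => 's', '8' => 'b');
$bad_words  = array('witch', 'sit');

/**
 * Maps look-alike characters to letters, then removes every remaining
 * non-letter. The order matters: stripping first would delete "+" and "1"
 * before they could be translated.
 */
function squash_token($token, array $lookalikes)
{
    $mapped = strtr(strtolower($token), $lookalikes);
    return preg_replace('/[^a-z]/', '', $mapped);
}

/**
 * Returns true when the squashed token contains any listed word.
 */
function token_is_bad($token, array $lookalikes, array $bad_words)
{
    $squashed = squash_token($token, $lookalikes);
    foreach ($bad_words as $bad) {
        if (strpos($squashed, $bad) !== false) {
            return true;
        }
    }
    return false;
}

print squash_token('w-1-+-c-h', $lookalikes) . "\n"; // witch
```

Like the filter in the thread, this matches a listed word anywhere inside the token, so it shares the same false-positive risk for innocent words that contain a listed word.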
Psycho Posted October 20, 2006

[quote author=shoz link=topic=112126.msg455182#msg455182 date=1161378303][quote=mjdamato]Why is "5iti" filtered?[/quote]5iti is filtered because "sit" is in the bad word list and "5" is a synonym for "s". Using a "-" in between letters will also be caught by the filter. It won't currently catch a word where a "-"(or other special character) and a "synonym" character is used. That can be added however.[/quote]

OK, so is it replacing any words that [i]begin[/i] with bad words? If the user input "he cocked the gun", would "cocked" be filtered if "cock" was in the filter list? I'm not knocking your logic, but just like mail filters, false positives would be worse than missing a few true positives (IMHO).

Here's a good report on dealing with leet speak: http://www.mail-archive.com/wryting-l@listserv.wvu.edu/msg01687.html
shoz Posted October 20, 2006

[quote author=mjdamato link=topic=112126.msg455190#msg455190 date=1161379380][quote author=shoz link=topic=112126.msg455182#msg455182 date=1161378303][quote=mjdamato]Why is "5iti" filtered?[/quote]5iti is filtered because "sit" is in the bad word list and "5" is a synonym for "s". Using a "-" in between letters will also be caught by the filter. It won't currently catch a word where a "-"(or other special character) and a "synonym" character is used. That can be added however.[/quote]Ok, so is it replacing any words that [i]begin[/i] with bad words? If the user input "he cocked the gun" would "cocked" be filtered if "cock" was in the filter list? I'm not knocking your logic, but just like mail filters, false positives would be worse than missing a few true positives (IMHO).Here's a good report concerning dealing with leet speak: http://www.mail-archive.com/wryting-l@listserv.wvu.edu/msg01687.html[/quote]

I do agree that false positives are possible. Currently, if a word in the "bad" word list is found anywhere in the word, it matches.

Before I try to address some of the "problems", let me make it clear that I'm not for or against the use of this or any filter, but of course I will try to "defend" or modify the code up to a point to handle the points raised. I think that the majority of the words that are generally considered to be offensive are unlikely to cause false positives.

I think a possible solution for the possible/likely false positives would be to generate the code using a dictionary, which would create a list of exceptions for words that would otherwise be matched. Meaning the words in the dictionary would be tested against the filter, and any word matched would become an exception.
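The dictionary idea could look something like the sketch below. It assumes the dictionary is available as a plain array of words (a tiny stand-in list is used here) and that a word identical to a listed word should not become an exception; both are assumptions, as is the function name build_exceptions:

```php
<?php
/**
 * Returns the dictionary words that contain a listed bad word as a
 * substring without being that bad word themselves. These are the words
 * a substring filter would wrongly catch.
 */
function build_exceptions(array $dictionary, array $bad_words)
{
    $exceptions = array();
    foreach ($dictionary as $word) {
        $word = strtolower($word);
        foreach ($bad_words as $bad) {
            if ($word !== $bad && strpos($word, $bad) !== false) {
                $exceptions[$word] = true; // keyed by word to avoid duplicates
                break;
            }
        }
    }
    return array_keys($exceptions);
}

// Stand-in dictionary; a real one would be loaded from a word list file.
$dictionary = array('visit', 'Sitting', 'category', 'dog', 'cat');
$bad_words  = array('sit', 'cat');

print_r(build_exceptions($dictionary, $bad_words));
```

The resulting list would be generated once, stored next to the bad word list, and consulted before a token is replaced.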
Mouse Posted October 21, 2006

Gentlemen, this is going far further than I thought... many thanks, I'll work on it when I get to my home PC... many thanks

Mouse
shoz Posted October 21, 2006

I've been thinking of ways to handle the false positive situation in a better way, and although I don't know if it's possible to eliminate them, I think that the following approach may effectively deal with the problem.

Instead of removing the text, it could instead be hidden. The "xxxx" would be seen and the user could decide whether or not to turn off the filter. If you decided to, perhaps for "guests" or a user registered as being below a certain age, filtering could be turned on by default with no option to turn it off. They'd either have to deal with the false positives or have a more precise filter applied to significantly reduce them. Users could also have filtering turned off/on by default for other users or be allowed to add words to their filter.

Note that this method would be best in a situation where you're simply trying not to offend anyone's sensibilities (i.e. some users do not want to read certain words) but still want to allow users to express themselves freely, or where it's only being used to stop broad "profanity" usage.

If the purpose of the filtering is to create a certain environment where rules have been established, and the filtering is only there to remove the occasional slip by a member, then the rules should specify what the penalties for breaking them are. Suspending or banning accounts as mjdamato suggests, or perhaps removing specific privileges from users, may be best.

Btw, the filter above can be restricted to matching whole words by changing the $bad_regex assignment to the following. The (?: ) group makes ^ and $ apply to every word in the list rather than only the first and last.

[code]$bad_regex = '/^(?:'.implode('|', $this->bad_words).')$/i';[/code]

You can use "\b" in place of ^ and $ to match word boundaries. You can find out the difference [url=http://regularexpressions.info/wordboundaries.html]here[/url].
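To see how the anchoring choices differ in practice, here is a small comparison, assuming a three-word list chosen only for illustration. It contrasts an ungrouped alternation between ^ and $, a grouped one, and a grouped one between \b word boundaries:

```php
<?php
// Illustrative word list (assumed).
$bad_words   = array('word', 'bird', 'cat');
$alternation = implode('|', array_map('preg_quote', $bad_words));

// Without a group, ^ only binds to "word" and $ only binds to "cat";
// "bird" is free to match anywhere in the subject.
$ungrouped = '/^' . $alternation . '$/i';

// The subject must be exactly one of the listed words.
$whole = '/^(?:' . $alternation . ')$/i';

// A listed word must appear as a separate word somewhere in the subject.
$bounded = '/\b(?:' . $alternation . ')\b/i';

var_dump(preg_match($ungrouped, 'birdbath')); // int(1)
var_dump(preg_match($whole, 'birdbath'));     // int(0)
var_dump(preg_match($whole, 'Bird'));         // int(1)
var_dump(preg_match($bounded, 'a bird!'));    // int(1)
var_dump(preg_match($bounded, 'birdbath'));   // int(0)
```

The \b form is the more forgiving of the two whole-word variants, since it still matches when punctuation is attached to the word, while ^ and $ require the token to be the word and nothing else.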
Mouse Posted October 22, 2006

Shoz, thank you very much for all your assistance on the bad word filter. It is much appreciated!

I think you're on to a good thing with the switchable filter. It'll preserve personal expression and people's sensibilities. So yes... and the false positives would become a non-issue. For my part, I have been working on the bad words array/list. (I don't have PHP on my work laptop.) I was also thinking that, as a combined project, it should be posted on the site for use by all... what would you say?

Again, many thanks

Mouse
shoz Posted October 23, 2006

[quote author=Mouse link=topic=112126.msg455802#msg455802 date=1161521901]For my part I have been working on the bad words array/list. (I don’t have php on my works laptop.) I was also thinking that as a combined project it should be posted on the site for use by all… what would you say?[/quote]

I don't have any problems with you posting any changes you've made.
Mouse Posted October 23, 2006

Sorry, I meant in the scripts pages... my bad for not explaining too well.

Mouse
shoz Posted October 24, 2006

I don't think this qualifies as a full-fledged script. It's probably more suited as a code example to be put in the PHP Code Library, but I'm not sure what the current status of the code library is. The main site is being worked on, but some sections aren't fully working.

To be honest, I think I'd want to spend more time on it (which I'm not sure I'll do) before thinking about putting it in the library, but if you'd like to submit it you can try to.