Username: Posted November 1, 2011

I'm currently writing some regular expressions for a word filter.

If the user enters, for example, f**k, it censors it to ****, and if he enters f**************************************kkkkkkkkkkkkkkkkkkkkkkkkkkkk it censors it to ***************************************************************

But if the user enters f * * k, it won't censor.

My current code (I apologize for the swearing, I just wanted you to see what I see):

    public static function wordFilter($text) {
        $filter_terms = array('s+h+i+t(|ting|er|e|ing|s)\b','f+u+c+k(|ing|ed|s|er)','a+s+s(|hole)\b','c+u+n+t', 'p+u+s+s+y', 'n+i+g+g+e+r', 'f+a+g(|got)');
        $filtered_text = $text;
        foreach($filter_terms as $word) {
            $match_count = preg_match_all('/' . $word . '/i', $text, $matches);
            for($i = 0; $i < $match_count; $i++) {
                $bwstr = trim($matches[0][$i]);
                $filtered_text = preg_replace('/' . $bwstr . '/', str_repeat("*", strlen($bwstr)), $filtered_text);
            }
        }
        return $filtered_text;
    }

A) Yes, I am aware that this is not a good way to filter words.
B) Could someone point me in a better direction?

Thanks in advance. If you need any more details, just ask.
Username: Posted November 7, 2011 (Author)

bump
xyph Posted November 9, 2011

This isn't foolproof, considering s+h+i+t could be typed as sh1t, shït, etc. I could even write a phrase like 'Everyone of this skin color is useless and a detriment to society', which is, to some, worse than a single offensive word, and no filter would have an issue with it.

Theory aside, you could apply a one-or-more quantifier to each letter of a word to detect things like ppppppppoooooooooooooooooooooopppppp

p+o+o+p+ will match poop as well as ppoooooooooooooooooooooooooooooppppppp

If you don't want bad false positives, you either need to create a whitelist or check for word boundaries. a+s+s+ will match ass, but it will also match assignment. \ba+s+s+\b makes sure there's a word boundary at the start and end of the word, so things like assignment won't be matched.

This is an issue when bad words are embedded within a string, though. Things like unfuckingbelievable won't get filtered by \bf+u+c+k+\b

Now, it seems you somewhat understand what I'm talking about. The only issue I see with your code is that the 'empty' OR clause should be LAST instead of FIRST. Otherwise, wherever nothing after the group forces the regex to backtrack, the empty option will ALWAYS be matched and none of the other options will be checked. That, and you should have a quantifier on your last letter as well.

f+u+c+k+(ing|ed|s|er|)

Though honestly, I don't see why you'd want to filter the suffix as well. This will match 'fuck' as well.
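The two points in the post above (word boundaries, and putting the empty alternative last) can be checked with a small PHP sketch. This is only an illustration using the milder patterns from the post; the helper name firstMatch is hypothetical and not part of the original code.

```php
<?php
// Hypothetical helper: returns the first full match of $pattern in $text,
// or null when the pattern does not match at all.
function firstMatch(string $pattern, string $text): ?string
{
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

// Without boundaries the pattern fires inside an innocent word.
$loose  = '/a+s+s+/i';
$strict = '/\ba+s+s+\b/i';

var_dump(firstMatch($loose, 'assignment'));       // string "ass"
var_dump(firstMatch($strict, 'assignment'));      // NULL
var_dump(firstMatch($strict, 'what an asssss'));  // string "asssss"

// Empty alternative FIRST: nothing follows the group, so the empty branch
// succeeds immediately and the suffix is left out of the match.
$emptyFirst = '/p+o+o+p(|ing|s)/i';
// Empty alternative LAST: the suffixes are tried before falling back to "".
$emptyLast  = '/p+o+o+p+(ing|s|)/i';

var_dump(firstMatch($emptyFirst, 'pooping'));  // string "poop"
var_dump(firstMatch($emptyLast, 'pooping'));   // string "pooping"
var_dump(firstMatch($emptyLast, 'poop'));      // string "poop"
```

Note that when something such as \b follows the group, the regex engine can backtrack into the later alternatives, so the ordering matters most for patterns that end with the group.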
Username: Posted November 10, 2011 (Author)

I don't believe in censorship myself, but the user I'm making this script for wants a censor!

I am aware that filters can easily be bypassed, but simply allowing only alphanumeric characters should do the trick, right? The users should theoretically be English-only.

I have tried using the one-or-more quantifier, but it still allows anyone to simply write "f.uck"
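One way to handle input like "f.uck" without rejecting punctuation outright is to let the pattern tolerate non-alphanumeric separators between the letters. The PHP sketch below is an assumption-laden illustration, not the poster's code: buildLoosePattern and censorLoose are hypothetical names, the word list uses the mild example from earlier in the thread, and it assumes plain ASCII input. It also replaces matches in a single pass with preg_replace_callback instead of feeding the matched text back in as a second pattern.

```php
<?php
// Hypothetical helper: turns a plain word such as "poop" into a pattern that
// tolerates repeated letters and runs of non-alphanumeric separators
// (".", " ", "*", "_", ...) between them.
function buildLoosePattern(string $word): string
{
    $parts = [];
    foreach (str_split($word) as $letter) {
        // One or more of each letter, escaped in case it is a regex metacharacter.
        $parts[] = preg_quote($letter, '/') . '+';
    }
    // [^a-z0-9]* allows any run of non-alphanumeric characters between letters;
    // the /i flag makes the class case-insensitive as well.
    return '/\b' . implode('[^a-z0-9]*', $parts) . '\b/i';
}

// Hypothetical helper: replaces every match with asterisks of the same length.
function censorLoose(string $text, array $words): string
{
    foreach ($words as $word) {
        $text = preg_replace_callback(
            buildLoosePattern($word),
            function (array $m): string {
                // strlen counts bytes, which equals characters for ASCII input.
                return str_repeat('*', strlen($m[0]));
            },
            $text
        );
    }
    return $text;
}

echo censorLoose('p.o.o.p happens, P O O P too, but not snooping', ['poop']), "\n";
// ******* happens, ******* too, but not snooping
```

Allowing spaces as separators lets a match span several words, so this widens the room for false positives; whether that trade-off is acceptable depends on the site.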
joe92 Posted November 10, 2011

You might want to consider placing a 'zero or more' quantifier on the vowels rather than a 'one or more' quantifier on everything, as it is usually the vowel which makes the word sound derogatory. This will censor most variations of the word (without the substitution of letters for symbols):

s+h+i+t(ting|er|e|ing|s|)

becomes

shi*t(ting|er|e|ing|s|)

That will censor shit, shiiiiiiit and sht. It should speed up your regex too, as it has fewer quantifiers to process.

But you will always have problems with censorship. Take fuck, for example: rearrange the u to get fcuk and you've got a brand name which you cannot censor. You might want to try arguing to your client that he should not waste his time worrying about censorship, and should instead just cover his ass with a simple clause in his terms and conditions stating that words will not be censored and that users who repeatedly swear will instead be 'silenced' or banned.
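A quick PHP check of the vowel-optional pattern from the post above. The helper name vowelOptionalMatch and the test inputs "washtub" and "shot" are additions for illustration only; they are not from the thread.

```php
<?php
// Hypothetical helper: applies the vowel-optional pattern from the post and
// returns the matched text, or null when nothing matches.
function vowelOptionalMatch(string $text): ?string
{
    $pattern = '/shi*t(ting|er|e|ing|s|)/i';
    return preg_match($pattern, $text, $m) === 1 ? $m[0] : null;
}

foreach (['shit', 'shiiiiiiit', 'sht', 'washtub', 'shot'] as $input) {
    $hit = vowelOptionalMatch($input);
    printf("%-12s => %s\n", $input, $hit === null ? '(no match)' : $hit);
}
```

The three inputs named in the post all match. Because the vowel is now optional and there are no word boundaries, "washtub" also matches on its "sht", while "shot" does not match, so this variant probably wants \b around it or a whitelist.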
xyph Posted November 10, 2011

Taking some minor measures doesn't hurt, assuming the overhead of the regex isn't too much for your server to handle under load. You just can't expect it to catch everything that can be interpreted as those words.