I'm currently writing some regular expressions for a word filter.

 

Now, if the user enters for example:

f**k

It would censor to

****

 

and if he enters

f**************************************kkkkkkkkkkkkkkkkkkkkkkkkkkkk

it would censor to

***************************************************************

 

Now, if the user enters:

f * * k

it won't censor.

 

My current code (I apologize for the swearing, I just wanted you to see what I see):

 

public static function wordFilter($text)
{
    // Each term lets every letter repeat ('+') and lists optional suffixes.
    $filter_terms = array('s+h+i+t(|ting|er|e|ing|s)\b','f+u+c+k(|ing|ed|s|er)','a+s+s(|hole)\b','c+u+n+t', 'p+u+s+s+y', 'n+i+g+g+e+r', 'f+a+g(|got)');
    $filtered_text = $text;
    foreach ($filter_terms as $word)
    {
        // Replace each case-insensitive match with asterisks of the same length.
        $filtered_text = preg_replace_callback('/' . $word . '/i', function ($match) {
            return str_repeat('*', strlen($match[0]));
        }, $filtered_text);
    }
    return $filtered_text;
}

 

A) Yes, I am aware that this is not a good way to filter words.

B) Could someone point me in a better direction?

 

 

Thanks in advance :) If you need any more details, just ask.

This isn't foolproof... considering

 

s+h+i+t could be typed as sh1t, shït, etc.

I could even write a phrase like 'Everyone of this skin color is useless and a detriment to society', which is, to some, worse than a single offensive word, and no filter would have an issue with it.

 

Theory aside, you could apply a one-or-more quantifier to each word to detect things like

ppppppppoooooooooooooooooooooopppppp

 

p+o+o+p+

will match poop as well as ppoooooooooooooooooooooooooooooppppppp
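
A quick way to check this, as a minimal sketch: build the pattern by putting a + after every letter, then try it on a few inputs. The helper name stretchy_pattern is made up for illustration.

```php
<?php
// Hypothetical helper: turn a plain word into a pattern in which every
// letter may repeat, e.g. "poop" -> "/p+o+o+p+/i".
function stretchy_pattern($word)
{
    $parts = array();
    foreach (str_split($word) as $letter) {
        $parts[] = preg_quote($letter, '/') . '+';
    }
    return '/' . implode('', $parts) . '/i';
}

$pattern = stretchy_pattern('poop');

var_dump(preg_match($pattern, 'poop'));               // int(1)
var_dump(preg_match($pattern, 'ppppooooooooopppp'));  // int(1)
var_dump(preg_match($pattern, 'pop'));                // int(0), only one "o"
```

Note that every letter still has to appear at least once, so dropping a letter gets past the pattern.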

 

If you don't want bad false positives, you either need to create a white list, or check for word boundaries.

a+s+s+

will match ass, but it will also match assignment.

 

\ba+s+s+\b

will make sure there's a word boundary at the start and end of the word, so things like assignment won't be matched.

 

This is an issue when embedding bad words within a string though. Things like unfuckingbelievable won't get filtered by

\bf+u+c+k+\b
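
To see the trade-off side by side, here is a small sketch using only preg_match() and the patterns above (the variable names are just for illustration):

```php
<?php
$loose  = '/a+s+s+/i';      // no boundaries
$strict = '/\ba+s+s+\b/i';  // whole words only

var_dump(preg_match($loose,  'assignment'));   // int(1): false positive
var_dump(preg_match($strict, 'assignment'));   // int(0)
var_dump(preg_match($strict, 'what an ass'));  // int(1)

// The flip side: a boundary-anchored pattern misses embedded words.
var_dump(preg_match('/\bf+u+c+k+\b/i', 'unfuckingbelievable'));  // int(0)
```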

 

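Now, it seems you somewhat understand what I'm talking about. The only issue I see with your code is that the 'empty' OR clause should be LAST instead of FIRST. An empty alternative always matches, so when nothing follows the group (as in your f+u+c+k pattern) none of the other options ever get used. Where a \b follows the group the engine will backtrack into them, but putting the empty option last is still clearer. That, and you should have a quantifier on your last letter as well.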

 

f+u+c+k+(ing|ed|s|er|)
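
A short sketch of the difference, comparing what each version of the group actually captures (the variable names are only for illustration):

```php
<?php
$text = 'fucking';

// Empty alternative first: the group matches nothing and the regex is done.
preg_match('/f+u+c+k+(|ing|ed|s|er)/i', $text, $first);

// Empty alternative last: the suffixes are tried before falling back to nothing.
preg_match('/f+u+c+k+(ing|ed|s|er|)/i', $text, $last);

var_dump($first[0]);  // string(4) "fuck"
var_dump($last[0]);   // string(7) "fucking"
```

With the empty option first, only the first four characters would be starred out and the "ing" would be left behind.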

 

Though honestly, I don't see why you'd want to filter the suffix as well. This will match 'fuck' as well


I don't believe in censorship myself, but the user I'm making this script for wants a censor!

I am aware that filters can easily be bypassed, but simply allowing only alphanumeric characters should do the trick, right? The users should theoretically be English only.

 

I have tried using the one-or-more blabbity blah, but this allows anyone to simply write "f.uck"
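
One possible way to catch both the "f.uck" case and the spaced-out case from the first post, as a rough sketch rather than a proven solution: allow an optional run of non-letter characters between the letters. The helper names spaced_pattern and censor, and the choice of [^a-z]* as the separator class, are assumptions made for illustration; they are not part of the code posted above.

```php
<?php
// Hypothetical helper: "poop" -> /p+[^a-z]*o+[^a-z]*o+[^a-z]*p+/i
// Each letter may repeat, and any run of non-letters may sit between letters.
function spaced_pattern($word)
{
    $letters = array();
    foreach (str_split($word) as $letter) {
        $letters[] = preg_quote($letter, '/') . '+';
    }
    return '/' . implode('[^a-z]*', $letters) . '/i';
}

// Hypothetical helper: star out every match of $word in $text.
function censor($text, $word)
{
    return preg_replace_callback(spaced_pattern($word), function ($m) {
        return str_repeat('*', strlen($m[0]));
    }, $text);
}

echo censor('poop', 'poop'), "\n";     // ****
echo censor('p.oop', 'poop'), "\n";    // *****
echo censor('p o o p', 'poop'), "\n";  // *******
```

The cost is more false positives: because separators are allowed, the pattern can also match across a word break, for example the "p, oop" inside "stop, oops".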

You might want to consider placing a 'zero or more' quantifier on the vowels rather than a 'one or more' quantifier on everything, as it is usually the vowel which makes the word sound derogatory. This will censor most variations of the word (as long as letters aren't substituted with symbols):

 

s+h+i+t(ting|er|e|ing|s|)

becomes

shi*t(ting|er|e|ing|s|)

 

That will censor shit, shiiiiiiit and sht. It should speed up your regex too, as it has fewer quantifiers to use.
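
A small sketch to check what the vowel-only version does and does not catch (the list of inputs is just an illustration):

```php
<?php
$pattern = '/shi*t(ting|er|e|ing|s|)/i';

foreach (array('shit', 'shiiiiiiit', 'sht', 'sshhiitt') as $input) {
    $result = preg_match($pattern, $input) ? 'censored' : 'missed';
    printf("%-12s %s\n", $input, $result);
}
// shit         censored
// shiiiiiiit   censored
// sht          censored
// sshhiitt     missed
```

The last line shows the trade-off: with the + removed from the consonants, doubled consonants are no longer caught.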

 

But you will always have problems with censorship. Take fuck for example, rearrange the u to become fcuk and you've got a brand name which you cannot censor. You might want to try arguing to your client that he should not waste his time worrying about censorship, and should instead just cover his ass with a simple clause in his terms and conditions that words will not be censored and instead users who repeatedly swear will be 'silenced' or banned.
