murfy Posted January 26, 2014

Hello,

I have a filter for bad words in my native language. My language is more inflected than English, so it is hard to find a good English word or verb for an example, but I will try.

I split the pattern into three parts: prefix, middle and suffix. The middle never changes. The suffix can change. The prefix is either one specific letter or absent.

Now an example. Take the middle part: mil. I want to find words like smile, smiles, smiling or mile, miles, miling, but not words like mill, mills, milling, and not words with a prefix other than s. I can't find a realistic example here, so suppose there were words like emile, emiles, emiling, amile, amiles, amiling, and so on. These words must not be included in the result.

In my native language I tried something similar: \bv?ojet.?\bal where ojet is the middle part and v is the only possible prefix. But this does not find anything. My second try was [v^d]?ojed[ueo][usm], this time with a different middle part: ojed. The suffix changes here according to declension. But this does not work either.

The common problem in my tests was that some characters of an incorrect word were captured. I need to capture only the characters of the correct word; in the English example that is mil, or smile, or smil, etc.

So how can I write a pattern that captures as many characters as possible from the correct word, but does not capture characters that are not present?

Note: I work only with a-z (lowercase letters); there is no whitespace in the text.
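The English illustration above can be written down as a small sketch. This is not a pattern from the thread: the suffix list (e, es, ing) and the use of \b on both sides are assumptions made only to mirror the smile/mile example, and Python's re module is used here because these constructs behave the same way in PCRE.

```python
import re

# Hypothetical pattern for the English illustration only:
# optional prefix "s", fixed middle "mil", one of the listed suffixes,
# and a word boundary on each side so the match must be a whole word.
PATTERN = re.compile(r"\bs?mil(?:e|es|ing)\b")


def find_hits(text):
    """Return every whole word in `text` that the pattern accepts."""
    return PATTERN.findall(text)


wanted = "smile smiles smiling mile miles miling"
unwanted = "mill mills milling emile emiles emiling amile amiles amiling"

print(find_hits(wanted))    # all six words are reported
print(find_hits(unwanted))  # nothing is reported
```

The leading \b is what rejects emile and amile: between e and m there is no word boundary, so the match cannot start in the middle of the word.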
.josh Posted January 26, 2014

I'm not sure I 100% understand you, but I have a small suspicion that your issue isn't the prefix but the suffix. For example, this:

\bv?ojet.?\bal

What is the purpose of the .?? The second \b may cause it to not match, depending on what that dot actually matches. If that dot matches a "word" character, then that second \b will cause the pattern to fail. For example,

"vojet-al" will match
"ojet-al" will match
"ojetal" will fail
"ojetfal" will fail

\b is a zero-width assertion, similar to a lookaround. It looks at the character before and the character after its position. It only matches if there is a non-word char followed by a word char, or vice versa. You have \bv?ojet.?\bal, and the "a" after the \b is a word char, so that dot has to match a non-word char in order for the \b to match.
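The four cases listed above can be checked with a short script. This is a minimal sketch in Python rather than PHP; \b, ? and . behave the same way in both engines for these inputs, and the helper name is made up for the illustration.

```python
import re

# The pattern from the first post, unchanged.
PATTERN = re.compile(r"\bv?ojet.?\bal")


def matches(text):
    """True if the pattern is found anywhere in `text`."""
    return PATTERN.search(text) is not None


for sample in ["vojet-al", "ojet-al", "ojetal", "ojetfal"]:
    print(sample, matches(sample))
```

Only the two inputs with a hyphen match, because "-" is the only character here that lets the second \b sit between a non-word character and the "a".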
murfy Posted January 26, 2014 Author

I checked it again and found that the al should not be there. I am not sure how it got there, but it is a mistake and I will correct it. The original pattern was:

\bv?ojet.?\b

but that was just a tip somebody gave me. v? means that the v is not necessary, but if the v is there, it will be included in the result.

Edited January 26, 2014 by murfy
murfy Posted January 27, 2014 Author

First pattern

So I tried this regular expression: v?ojet.? and applied it to the sentences "Tvoje auto je vojete. Otec, dojel v cas. Kojeni neni dojeni.". This is the result of my program:

Tvoje auto je ******. Otec, d**** * cas. Kojeni neni dojeni.

The first sentence works fine. "Tvoje" was not found and "vojete" was correctly replaced with stars. But the second sentence is wrong. The word "dojel" should not be in the result, because the filter should not look for dojel and dojel is not a bad word. The last sentence is fine (no bad word).

Second pattern

Now I try the second pattern, for words like the one in the sentence "Vojedu tvoje auto klicema." I changed this: [v^d]?ojed[ueo][usm]? to this: v?ojed[ueo][usm]? But it does not work correctly. Try these sentences:

Vojedu tvoje auto klicema. Projedu se autem. Dojedu az zitra. Tak jsem se projel.

My program returns:

****** tvoje auto klicema. Pr***** *e autem. D***** az zitra. Tak jsem se projel.

Vojedu was found OK. But Projedu and Dojedu are not bad words, so they should not be in the result. These are also wrong matches.
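The partial hits inside Projedu and Dojedu follow from the pattern having nothing on its left side that ties it to the start of a word: with v? matching nothing, ojedu is found in the middle of any longer word. The sketch below shows that effect and one possible remedy, a \b on each side. The \b variant is an assumption for illustration, not the fix this thread ends up with, and it only helps if bad words always stand as whole words. Case-insensitive matching is also assumed, since the sample text contains capitals. The censor helper is hypothetical and does not try to reproduce the poster's program (its output above also stars a few other characters).

```python
import re

TEXT = ("Vojedu tvoje auto klicema. Projedu se autem. "
        "Dojedu az zitra. Tak jsem se projel.")

# Pattern as posted: free to match in the middle of a word.
unanchored = re.compile(r"v?ojed[ueo][usm]?", re.IGNORECASE)

# Assumed variant: the match must start and end on a word boundary.
anchored = re.compile(r"\bv?ojed[ueo][usm]?\b", re.IGNORECASE)


def censor(pattern, text):
    """Replace every match with the same number of stars."""
    return pattern.sub(lambda m: "*" * len(m.group()), text)


print(censor(unanchored, TEXT))  # stars appear inside Projedu and Dojedu
print(censor(anchored, TEXT))    # only Vojedu is starred
```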
murfy Posted January 27, 2014 Author

I have successfully solved it with these patterns:

(?:[^cdkprmt])oje[tl].?
(?:[^cdkprmt])ojed[ueo][usm]?

For the sentences:

$str='Tvoje auto je vojete. Otec, dojel v cas. Kojeni neni dojeni. Vojedu tvoje auto klicema. Projedu se autem. Dojedu az zitra. ';

Correct result after match and replace:

Tvoje auto je ******. Otec, dojel v cas. Kojeni neni dojeni. ****** tvoje auto klicema. Projedu se autem. Dojedu az zitra.

Edited January 27, 2014 by murfy
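The two patterns above can be replayed against the sample string with a short script. This is a sketch in Python, not the poster's PHP code: the star replacement, the order in which the patterns are applied, and the case-insensitive flag are all assumptions. The flag matters, because without it the capital D of Dojedu is not in [cdkprmt] and the word would be matched.

```python
import re

TEXT = ("Tvoje auto je vojete. Otec, dojel v cas. Kojeni neni dojeni. "
        "Vojedu tvoje auto klicema. Projedu se autem. Dojedu az zitra. ")

# The two patterns from the post; IGNORECASE is an assumption.
PATTERNS = [
    re.compile(r"(?:[^cdkprmt])oje[tl].?", re.IGNORECASE),
    re.compile(r"(?:[^cdkprmt])ojed[ueo][usm]?", re.IGNORECASE),
]


def censor(text):
    """Apply each pattern in turn, replacing matches with stars."""
    for pattern in PATTERNS:
        text = pattern.sub(lambda m: "*" * len(m.group()), text)
    return text


print(censor(TEXT))
```

Two properties of the negated class are worth keeping in mind. It always consumes one character, so a word that begins directly with ojet or ojed at the very start of the text is not matched, and any character outside the list, including a space, is accepted as the "prefix" and replaced along with the word.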