Finding bad words from a large list of email addresses

abhilashss · September 6, 2013

Hi,

I have a large list of email addresses from a file, around 1 million email ids. I also have a list of bad words such as spam, junk, etc.; it consists of 20,000+ bad words.

I need to validate the email ids. If a bad word is present anywhere in an email id, that id will be marked as invalid.

For example:

[email protected] - invalid

I would like to know which comparison method will be fastest, as looping over the array takes time.

I tried the following method, but it also takes time.

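// $keyword_list - array of bad words
// $check_key - the email id that needs to be validated

// Escape each keyword so regex metacharacters (and the "/" delimiter) match literally.
$quoted = array_map(function ($word) { return preg_quote($word, '/'); }, $keyword_list);

// Test the email against the keywords in alternations of 2000 words each.
foreach (array_chunk($quoted, 2000) as $chunk) {
    if (preg_match('/' . implode('|', $chunk) . '/', $check_key)) {
        return 1;
    }
}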

Please help me find the best method, considering the performance of the script.

Thanks,

Abhilash

Muddy_Funster · September 6, 2013

What possible legitimate reason could you have to be holding almost 1 million unverified email addresses?

gizmola · September 6, 2013

Hey, here's a "bad" word: "Dick". Well, it's not really a bad word, because in English "Dick" is a short form of the first name "Richard". Dick Nixon, Dick Cavett, Dick Butkus are just a few of the myriad famous people who are popularly known by that first name.

There is no way you have a valid 20k list of bad words. That is absolutely ridiculous. And even with a small list, there are variations of "bad" words that are part of other valid words. These are email addresses... so using phrases, nicknames and the like is common practice.

Practically speaking, micro optimization of something you are going to run once or twice is a complete waste of time. If it takes 3 hours to run, who cares, if you're going to run it once.

However, again, there is no way there are 20k bad words. Spam is not a bad word -- so there's something you're not explaining about this list, and without that information, we can't really help.

abhilashss · September 9, 2013

Thank you for your answers. I gave spam and junk as examples. I am writing a PHP script to validate email addresses. (These emails will be stored in a CSV file, and an administrator uploads this file.) This is not a one-time task.

We have various parameters for validating these email addresses.

For example:

We have a list of bounced emails, and we check our input email addresses against it.

We have a list of throwaway domains, and we check our input email addresses against it.

etc.

Keyword checking is one of those validation checks. Of all the checks described above, only keyword checking is slow. So I am looking for the best method to check whether a keyword is present anywhere in an email address.

You are right, there won't be 20K bad words. We are using 20K just to check how it works with 1 million email addresses. I hope I have explained the scenario. Please let me know if you have any questions.

vinny42 · September 9, 2013

I'd move this to a database, because databases can use indexes to find matches, where PHP cannot (unless you write an indexing routine, which would be interesting but a waste of time because databases already do that).

If you use a proper database like PostgreSQL you can use trigram indexes to seriously speed up substring searches.

This is actually an interesting issue from a performance point of view.
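For comparison, here is a rough sketch of what such an "indexing routine" might look like in plain PHP. It is not a PostgreSQL trigram index, just a similar idea on a small scale: keywords are grouped by their first three characters so each position in the email is only compared against keywords that could start there. The function names `buildKeywordIndex` and `findBadWord` and the sample addresses are made up for illustration.

```php
<?php
// Hypothetical helper names; a sketch under stated assumptions, not a tuned implementation.

/**
 * Group keywords by their first three characters.
 * Keywords shorter than 3 characters are kept in a separate list
 * and checked with a plain substring search.
 */
function buildKeywordIndex(array $keywords): array
{
    $index = ['byTrigram' => [], 'short' => []];
    foreach ($keywords as $word) {
        $word = strtolower($word);
        if (strlen($word) < 3) {
            $index['short'][] = $word;
        } else {
            $index['byTrigram'][substr($word, 0, 3)][] = $word;
        }
    }
    return $index;
}

/** Return the first keyword found anywhere in $email, or null if none matches. */
function findBadWord(string $email, array $index): ?string
{
    $email  = strtolower($email);
    $length = strlen($email);
    for ($pos = 0; $pos + 3 <= $length; $pos++) {
        $trigram = substr($email, $pos, 3);
        if (!isset($index['byTrigram'][$trigram])) {
            continue; // no keyword starts with these three characters
        }
        foreach ($index['byTrigram'][$trigram] as $word) {
            // Compare the keyword against the email starting at $pos.
            if (substr_compare($email, $word, $pos, strlen($word)) === 0) {
                return $word;
            }
        }
    }
    foreach ($index['short'] as $word) {
        if ($word !== '' && strpos($email, $word) !== false) {
            return $word;
        }
    }
    return null;
}

$index = buildKeywordIndex(['spam', 'junk']);            // sample keywords from the thread
var_dump(findBadWord('myspambox@example.com', $index));  // string(4) "spam"
var_dump(findBadWord('alice@example.com', $index));      // NULL
```

The index is built once per upload and reused for every address, so the per-email cost depends mostly on the length of the address rather than on the size of the keyword list.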

abhilashss · September 9, 2013

Actually, we are using MongoDB to store these keywords, bounce emails, throwaway domains, etc., and we have indexed the collections.

Here is the sample code we use to compare against the bounce list of emails.

$collection = $db->bounce;

// Find which of the uploaded emails are in the bounce collection.
// $emailds - list of uploaded emails
$cursor_bounce = $collection->find(array($section => array('$in' => $emailds)));

The above code returns results in a fraction of a second when 1 million email ids are compared against the 'bounce' collection (30 million records).

vinny42 · September 9, 2013

Why do you use MongoDB for something that is not document-oriented?

Anyway, exact matches are always going to take milliseconds because it's just a comparison of two indexes.

I'd be more interested in knowing how Mongo deals with partial matches.

gizmola · September 11, 2013

vinny42 said: "Why do you use MongoDB for something that is not document-oriented? Anyway, exact matches are always going to take milliseconds because it's just a comparison of two indexes. I'd be more interested in knowing how Mongo deals with partial matches."

Mongo's basic indexing is the good old B-tree. It also has a hash index, a text index, and a geospatial and "geohaystack" index for dimensional data. The text index is similar to MySQL's fulltext index. I think the "document" idea can be confusing in the case of Mongo. A Mongo document is essentially a JSON structure, and it's quite fine for strict hierarchical data or something like this where there's no real need for a relational model. So long as you have the memory to support it, as it's memory mapped, the performance should be very good.

gizmola · September 11, 2013

abhilashss said: "Thank you for your answers. I gave spam and junk as examples. I am writing a PHP script to validate email addresses. (These emails will be stored in a CSV file, and an administrator uploads this file.) This is not a one-time task. We have various parameters for validating these email addresses. For example: we have a list of bounced emails, and we check our input email addresses against it."

When you import the data, you will want to save the original email in one field, and split the name and domain into separate fields (name and domain, for example). Then you should be able to exact match using an index on any combination of fields. If you maintain a separate collection of the bad addresses, this will allow you to loop through the list and query each time for an exact match. Obviously the larger the list, the longer this will take, but each individual query will be very fast.

abhilashss said: "We have a list of throwaway domains, and we check our input email addresses against it."

This is an exact match so long as you have separated the host into a field.
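A minimal sketch of that split in PHP, assuming the domain is whatever follows the last "@". The helper name `splitEmail`, the field names, and the sample domains are illustrative only, and the in-memory lookup stands in for the indexed query you would run against your own collection.

```php
<?php
// Hypothetical helper; the field names "email", "name" and "domain" are illustrative.
function splitEmail(string $email): ?array
{
    $email = trim($email);
    $at = strrpos($email, '@');           // the domain follows the last "@"
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return null;                      // no "@", empty name, or empty domain
    }
    return [
        'email'  => $email,                               // original address, untouched
        'name'   => substr($email, 0, $at),
        'domain' => strtolower(substr($email, $at + 1)),  // domains are case-insensitive
    ];
}

// Made-up sample data; flipping the list turns each lookup into a hash-key check.
$throwaway = array_flip(['mailinator.example', 'trash.example']);

$row = splitEmail('Some.User@Trash.Example');
var_dump($row !== null && isset($throwaway[$row['domain']])); // bool(true)
```

Storing the lowercased domain in its own field is what makes the throwaway-domain check an exact match rather than a substring search.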

abhilashss said: "Keyword checking is one of those validation checks. Of all the checks described above, only keyword checking is slow. So I am looking for the best method to check whether a keyword is present anywhere in an email address."

Here is where I don't follow you. Are you searching for words in an email, or searching for bad words in the text of an email? The mongo text index may be able to help you.

abhilashss · September 17, 2013

I am searching for bad words in emails.

For example, say 'spam' and 'junk' are bad words.

So [email protected] - invalid

[email protected] - invalid.

I will check the Mongo text index. Thank you for your detailed reply.

AbraCadaver · September 17, 2013

abhilashss said: "I am searching for bad words in emails. For example, say 'spam' and 'junk' are bad words. So [email protected] - invalid, [email protected] - invalid. I will check the Mongo text index. Thank you for your detailed reply."

Really? The email that I use for this site as well as many others is [email protected]. Also, what about johnbass or summacumlaude? You are just asking for lots of trouble here.
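To make the false-positive problem concrete, here is a small sketch comparing a plain substring check with a stricter whole-token check on the local part. The keyword list and addresses are assumed sample data, both function names are hypothetical, and the token check is only one possible mitigation, not something proposed elsewhere in this thread. It trades false positives for misses: it will not catch run-together addresses like "myspambox".

```php
<?php
// Hypothetical helpers for illustration only.

/** Substring check: flags the address if any keyword appears anywhere in it. */
function containsKeyword(string $email, array $keywords): bool
{
    foreach ($keywords as $word) {
        if (stripos($email, $word) !== false) {
            return true;
        }
    }
    return false;
}

/** Token check: flags only when a whole token of the local part equals a keyword. */
function hasKeywordToken(string $email, array $keywords): bool
{
    $at     = strrpos($email, '@');
    $local  = strtolower($at === false ? $email : substr($email, 0, $at));
    // Split on anything that is not a letter or digit (".", "_", "-", "+", ...).
    $tokens = preg_split('/[^a-z0-9]+/', $local, -1, PREG_SPLIT_NO_EMPTY);
    return count(array_intersect($tokens, array_map('strtolower', $keywords))) > 0;
}

$keywords = ['ass', 'spam'];                                   // assumed sample list
var_dump(containsKeyword('johnbass@example.com', $keywords));  // bool(true), a false positive
var_dump(hasKeywordToken('johnbass@example.com', $keywords));  // bool(false)
var_dump(hasKeywordToken('no.spam@example.com', $keywords));   // bool(true)
```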

abhilashss · September 30, 2013

gizmola said: "The mongo text index may be able to help you."

The Mongo text index did the job. Thank you.
