
Finding bad words in a large list of email addresses


abhilashss


Hi,

 

I have a large list of email addresses from a file, around 1 million email IDs. I also have a list of bad words such as spam, junk, etc.; it consists of 20,000+ bad words.

 

I need to validate the email IDs. If a bad word is present anywhere in an email ID, that ID will be marked as invalid.

For example;

testspam@gmail.com - invalid

newuser@desspam.com - invalid

 

I would like to know which comparison method will be fastest, as looping over the array takes time.

I tried the following method, but it also takes time.

 

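// $keyword_list - array of bad words
// $check_key    - the email ID to validate

// Escape each keyword so characters such as "." or "/" are matched literally.
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $keyword_list);

// Test the email against one alternation pattern per chunk of 2000 keywords.
foreach (array_chunk($escaped, 2000) as $chunk) {
    if (preg_match('/' . implode('|', $chunk) . '/', $check_key)) {
        return 1;
    }
}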

 

 

Please help me find the best method, considering the performance of the script.

 

 

Thanks,

Abhilash
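
One thing that may be worth checking in the snippet above: if it runs once per email, the keyword list is chunked and imploded again for every one of the 1 million addresses. A minimal sketch, assuming the keywords fit in memory, builds the patterns once and reuses them. The names build_patterns and contains_keyword are hypothetical, and the case-insensitive "i" flag is an assumption, not something stated in the question.

```php
<?php
// Hypothetical helper: build the chunked alternation patterns a single time.
function build_patterns(array $keywords, int $chunkSize = 2000): array
{
    $patterns = [];
    foreach (array_chunk($keywords, $chunkSize) as $chunk) {
        // preg_quote() escapes regex metacharacters and the "/" delimiter.
        $escaped = array_map(function ($w) {
            return preg_quote($w, '/');
        }, $chunk);
        // The "i" flag (case-insensitive matching) is an assumption.
        $patterns[] = '/' . implode('|', $escaped) . '/i';
    }
    return $patterns;
}

// Hypothetical helper: true if any prebuilt pattern matches the email.
function contains_keyword(string $email, array $patterns): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $email) === 1) {
            return true;
        }
    }
    return false;
}

// Build once, then reuse for every address in the uploaded file.
$patterns = build_patterns(['spam', 'junk', 'a.b']);
$emails = ['testspam@gmail.com', 'newuser@desspam.com', 'alice@example.com', 'aXb@example.com'];
$invalid = array_values(array_filter($emails, function ($e) use ($patterns) {
    return contains_keyword($e, $patterns);
}));
```

This does not change the matching strategy, so it is only a baseline; whether it is fast enough for 20,000 keywords would need to be measured.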


Hey, here's a "bad" word: "Dick". Well, it's not really a bad word, because in English "Dick" is a short form of the first name "Richard". Dick Nixon, Dick Cavett, and Dick Butkus are just a few of the many famous people who are popularly known by that first name.

 

There is no way you have a valid list of 20k bad words. That is absolutely ridiculous. And even with a small list, there are variations of "bad" words that are part of other valid words. These are email addresses... so using phrases, nicknames and the like is common practice.

 

Practically speaking, micro optimization of something you are going to run once or twice is a complete waste of time. If it takes 3 hours to run, who cares, if you're going to run it once.

 

However, again, there is no way there are 20k bad words. Spam is not a bad word -- so there's something you're not explaining about this list, and without that information, we can't really help.


Thank you for your answers. I gave spam and junk only as examples. I am writing a PHP script to validate email addresses. (The emails are stored in a CSV file, which an administrator uploads.) This is not a one-time task.

 

We have various checks to validate these email addresses.

For example:

We have a list of bounce emails; we check our input email addresses against these bounce emails.

We have a list of throwaway domains; we check our input email addresses against these throwaway domains.

etc.

 

Keyword checking is one of those validation checks. Of all the checks described above, only keyword checking is slow. So I am looking for the best method to check whether a keyword is present anywhere in an email address.

 

You are right, there won't be 20K bad words. We are using 20K only to test how it works with 1 million email addresses. I hope I have explained the scenario. Please let me know if you have any questions.


I'd move this to a database because they can use indexes to find matches, where PHP cannot (unless you write an indexing routine, which would be interesting but a waste of time because databases already do that).

 

If you use a proper database like PostgreSQL, you can use trigram indexes to seriously speed up substring searches.

This is actually an interesting issue from a performance point of view.
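
For anyone curious what an "indexing routine" in plain PHP might look like, here is a minimal sketch using a trie (prefix tree) of the keywords. It is not what a database does internally and it is not a full Aho-Corasick implementation; it simply walks the trie from every start position in the address, so the cost per email depends on the address length and the longest keyword rather than on the number of keywords. The function names and the '#end' marker key are hypothetical, and lower-casing both sides is an assumption.

```php
<?php
// Hypothetical helper: build a nested-array trie from the keyword list.
function build_trie(array $keywords): array
{
    $root = [];
    foreach ($keywords as $word) {
        $word = strtolower($word);
        if ($word === '') {
            continue; // an empty keyword would match every address
        }
        $node = &$root;
        for ($i = 0, $n = strlen($word); $i < $n; $i++) {
            $ch = $word[$i];
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        // Multi-character key, so it cannot collide with a single-byte child key.
        $node['#end'] = true;
        unset($node); // break the reference before the next keyword
    }
    return $root;
}

// Returns the first keyword found in the email, or null if none is present.
function find_keyword(string $email, array $trie): ?string
{
    $email = strtolower($email);
    $len = strlen($email);
    for ($start = 0; $start < $len; $start++) {
        $node = $trie;
        for ($i = $start; $i < $len; $i++) {
            $ch = $email[$i];
            if (!isset($node[$ch])) {
                break; // no keyword continues with this character
            }
            $node = $node[$ch];
            if (isset($node['#end'])) {
                return substr($email, $start, $i - $start + 1);
            }
        }
    }
    return null;
}

$trie = build_trie(['spam', 'junk']);
$hit = find_keyword('testspam@gmail.com', $trie);
```

The trie is built once and reused for every address. Memory use for 20,000 keywords as nested PHP arrays may be significant, so this would need measuring before relying on it.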


Actually, we are using MongoDB to store these keywords, bounce emails, throwaway domains, etc., and we have added indexes to the tables (collections).

 

Here is the sample code we use to compare against the bounce list of emails.

// $emailds - list of uploaded emails
// $section - name of the field to match on (set elsewhere in the script)
$collection = $db->bounce;

// Find which of the uploaded values are present in the bounce collection.
$cursor_bounce = $collection->find(array($section => array('$in' => $emailds)));

 

 

The above code returns results in a fraction of a second when comparing 1 million email IDs against the 'bounce' collection (which has 30 million records).

Edited by abhilashss

Why do you use MongoDB for something that is not document-oriented?

 

Anyway, exact matches are always going to take milliseconds because it's just a comparison of two indexes.

I'd be more interested in knowing how Mongo deals with partial matches.

 

Mongo's basic index is the good old B-tree. It also has a hash index, a text index, and geospatial and "geohaystack" indexes for dimensional data. The text index is similar to MySQL's fulltext index. I think the "document" idea can be confusing in the case of Mongo. A Mongo document is essentially a JSON structure, and it's quite fine for strictly hierarchical data or for something like this, where there's no real need for a relational model. So long as you have the memory to support it, as it's memory mapped, the performance should be very good.


Thank you for your answers. I gave spam and junk only as examples. I am writing a PHP script to validate email addresses. (The emails are stored in a CSV file, which an administrator uploads.) This is not a one-time task.

 

We have various checks to validate these email addresses.

For example;

We have a list of bounce emails; we check our input email addresses against these bounce emails.

When you import the data, you will want to save the original email in one field and split the name and domain into separate fields (name, domain), for example. Then you should be able to do an exact match using an index on any combination of fields. If you maintain a separate collection of the bad addresses, this will allow you to loop through the list and query each time for an exact match. Obviously, the larger the list, the longer this will take, but each individual query will be very fast.

 

 

We have a list of throwaway domains; we check our input email addresses against these throwaway domains.

This is an exact match so long as you have separated the host into a field.
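
A minimal sketch of that split on the PHP side, assuming the address is split at the last "@" and the domain is compared case-insensitively. The helper names are hypothetical, and the domains in the example set are placeholders, since the real throwaway list is not shown in this thread. An in-memory array keyed by domain stands in here for the indexed exact-match lookup.

```php
<?php
// Hypothetical helper: split an address into the fields suggested above.
function split_email(string $email): ?array
{
    $at = strrpos($email, '@');
    if ($at === false || $at === 0 || $at === strlen($email) - 1) {
        return null; // no "@", empty name, or empty domain
    }
    return [
        'email'  => $email,                                   // original value
        'name'   => substr($email, 0, $at),                   // part before "@"
        'domain' => strtolower(substr($email, $at + 1)),      // part after "@"
    ];
}

// Placeholder domains only. array_flip() turns the list into a keyed set,
// so isset() is a hash lookup rather than a scan.
$throwaway = array_flip(['throwaway.example', 'tempmail.example']);

// Hypothetical helper: exact match of the domain field against the set.
function is_throwaway(string $email, array $throwawaySet): bool
{
    $parts = split_email($email);
    return $parts !== null && isset($throwawaySet[$parts['domain']]);
}

$parts = split_email('NewUser@Example.COM');
```

The same three fields could be stored per document at import time so that the domain check becomes an indexed exact match in the database.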

 

Keyword checking is one of those validation checks. Of all the checks described above, only keyword checking is slow. So I am looking for the best method to check whether a keyword is present anywhere in an email address.

 

Here is where I don't follow you. Are you searching for words in an email address, or searching for bad words in the text of an email? The Mongo text index may be able to help you.


I am searching for bad words in email addresses.

For example;

Say 'spam' and 'junk' are bad words.

 

So:

testspam@gmail.com - invalid

junktest@gmail.com - invalid

 

I will look into the Mongo text index. Thank you for your detailed reply.

 

Really?  The email that I use for this site as well as many others is nospam@mydomain.net.  Also, what about johnbass or summacumlaude?  You are just asking for lots of trouble here.

Edited by AbraCadaver
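
To make the false-positive problem concrete, here is a small sketch comparing the "anywhere in the address" check with a stricter whole-token check. The token variant is an assumption offered for comparison, not something proposed in this thread, and the keyword list is just an example implied by the addresses above. Note the trade-off: the token check no longer flags testspam@gmail.com, which the original requirement wants marked invalid.

```php
<?php
// Naive check: flag the address if any keyword occurs anywhere in it.
function flagged_by_substring(string $email, array $keywords): bool
{
    $email = strtolower($email);
    foreach ($keywords as $word) {
        if ($word !== '' && strpos($email, strtolower($word)) !== false) {
            return true;
        }
    }
    return false;
}

// Stricter variant (an assumption): split on anything that is not a letter
// and flag only tokens that equal a keyword exactly.
function flagged_by_token(string $email, array $keywords): bool
{
    $set = array_flip(array_map('strtolower', $keywords));
    $tokens = preg_split('/[^a-z]+/', strtolower($email), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        if (isset($set[$token])) {
            return true;
        }
    }
    return false;
}

$keywords = ['spam', 'ass']; // example keywords only
$naive  = flagged_by_substring('nospam@mydomain.net', $keywords);
$strict = flagged_by_token('nospam@mydomain.net', $keywords);
```

Neither variant is obviously right; which addresses should count as invalid is a policy decision that the keyword list alone cannot settle.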
