abhilashss Posted September 6, 2013

Hi,

I have a large list of email addresses from a file, around 1 million email IDs. I also have a list of bad words such as spam, junk, etc.; it consists of 20,000+ bad words. I need to validate the email IDs: if a bad word is present anywhere in an email ID, it is marked as invalid. For example:

    testspam@gmail.com - invalid
    newuser@desspam.com - invalid

I would like to know which comparison method is fastest, as looping over the array takes time. I tried the following method, but it is also slow.

    // $keyword_list - array of bad words
    // $check_key - the email ID that needs to be validated
    $arrays = array_chunk($keyword_list, 2000);
    foreach ($arrays as $chunk) {
        // Escape each keyword so regex metacharacters are matched literally
        $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $chunk);
        if (preg_match('/' . implode('|', $quoted) . '/', $check_key, $matches)) {
            return 1;
        }
    }

Please help me find the best method, considering the performance of the script.

Thanks,
Abhilash
Muddy_Funster Posted September 6, 2013

What possible legitimate reason could you have to be holding almost 1 million unverified email addresses?
gizmola Posted September 6, 2013

Hey, here's a "bad" word: "Dick". Well, it's not really a bad word, because in English "Dick" is a short form of the first name "Richard". Dick Nixon, Dick Cavett and Dick Butkus are just a few of the myriad famous people popularly known by that first name.

There is no way you have a valid 20k list of bad words. That is absolutely ridiculous. And even with a small list, there are variations of "bad" words that are part of other valid words. These are email addresses, so using phrases, nicknames and the like is common practice.

Practically speaking, micro-optimization of something you are going to run once or twice is a complete waste of time. If it takes 3 hours to run, who cares, if you're going to run it once.

However, again, there is no way there are 20k bad words. Spam is not a bad word, so there's something you're not explaining about this list, and without that information, we can't really help.
abhilashss (Author) Posted September 9, 2013

Thank you for your answers. I gave spam and junk as examples.

I am writing a PHP script to validate email addresses. (The emails are stored in a CSV file, which an administrator uploads.) This is not a one-time task. We have various parameters to validate these email addresses. For example:

We have a list of bounce emails; we check our input email addresses against these bounce emails.
We have a list of throwaway domains; we check our input email addresses against these throwaway domains.
And so on.

Keyword checking is one of those validation checks. Of all the methods explained above, only keyword checking is slow. So I am looking for the best method to check whether a keyword is present anywhere in an email address.

You are right, there won't be 20K bad words. We are using 20K just to check how it works with 1 million email addresses.

I hope I have explained the scenario. Please let me know if you have any questions.
vinny42 Posted September 9, 2013

I'd move this to a database, because databases can use indexes to find matches, where PHP cannot (unless you write an indexing routine, which would be interesting but a waste of time because databases already do that).

If you use a proper database like PostgreSQL, you can use trigram indexes to seriously speed up substring searches.

This is actually an interesting issue from a performance point of view.
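For readers curious what a hand-rolled "indexing routine" in PHP might look like, here is a minimal sketch. It is not a trigram index as PostgreSQL implements it; it is a simpler lookup that buckets keywords by their first three characters, so each position in an email address is only compared against keywords that could start there. The function names `buildPrefixIndex` and `findKeyword` are hypothetical, and the sketch assumes PHP 7.1+, single-byte (ASCII) text, and keywords of at least three characters.

```php
<?php
// Hypothetical helper: bucket keywords by their first three characters.
function buildPrefixIndex(array $keywords): array
{
    $index = [];
    foreach ($keywords as $word) {
        $word = strtolower($word);
        if (strlen($word) < 3) {
            continue; // shorter keywords would need a separate check; skipped in this sketch
        }
        $index[substr($word, 0, 3)][] = $word;
    }
    return $index;
}

// Returns the first keyword found inside $email, or null if none is present.
function findKeyword(array $index, string $email): ?string
{
    $email = strtolower($email);
    $last = strlen($email) - 3;
    for ($pos = 0; $pos <= $last; $pos++) {
        $prefix = substr($email, $pos, 3);
        if (!isset($index[$prefix])) {
            continue; // no keyword starts with these three characters
        }
        foreach ($index[$prefix] as $word) {
            // Compare the whole keyword against the email starting at $pos
            if (substr_compare($email, $word, $pos, strlen($word)) === 0) {
                return $word;
            }
        }
    }
    return null;
}

$index = buildPrefixIndex(['spam', 'junk']);
var_dump(findKeyword($index, 'testspam@gmail.com'));  // string(4) "spam"
var_dump(findKeyword($index, 'newuser@example.com')); // NULL
```

The index is built once and reused for every address, so the cost per email depends on its length and on how many keywords share a prefix rather than on the total size of the keyword list.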
abhilashss (Author) Posted September 9, 2013

Actually, we are using MongoDB to store these keywords, bounce emails, throwaway domains, etc., and we have indexed the collections.

Here is the sample code we use to compare against the bounce list of emails.

    $collection = $db->bounce;
    // Find which of the uploaded values are in the bounce collection
    // $emailds - list of emails uploaded
    // $section - name of the field to match against (defined elsewhere)
    $cursor_bounce = $collection->find(array($section => array('$in' => $emailds)));

The above code returns results in a fraction of a second when comparing 1 million email IDs with the 'bounce' collection (which has 30 million records).

Edited September 9, 2013 by abhilashss
vinny42 Posted September 9, 2013

Why do you use MongoDB for something that is not document-oriented?

Anyway, exact matches are always going to take milliseconds because it's just a comparison of two indexes. I'd be more interested in knowing how Mongo deals with partial matches.
gizmola Posted September 11, 2013

vinny42 said:
> Why do you use MongoDB for something that is not document-oriented?
> Anyway, exact matches are always going to take milliseconds because it's just a comparison of two indexes. I'd be more interested in knowing how Mongo deals with partial matches.

Mongo's basic indexing is the good old B-tree. It also has a hash index, a text index, and geospatial and "geohaystack" indexes for dimensional data. The text index is similar to MySQL's fulltext index.

I think the "document" idea can be confusing in the case of Mongo. A Mongo document is essentially a JSON structure, and it's quite fine for strict hierarchical data or something like this where there's no real need for a relational model. So long as you have the memory to support it, as it's memory mapped, the performance should be very good.
gizmola Posted September 11, 2013

abhilashss said:
> Thank you for your answers. I gave spam and junk as examples. I am writing a PHP script to validate email addresses. (The emails are stored in a CSV file, which an administrator uploads.) This is not a one-time task. We have various parameters to validate these email addresses. For example:
> We have a list of bounce emails; we check our input email addresses against these bounce emails.

When you import the data, you will want to save the original email in one field, and split the names and domains into separate fields (name, domain), for example. Then you should be able to exact match using an index on any combination of fields. If you maintain a separate collection of the bad addresses, this will allow you to loop through the list and query each time for an exact match. Obviously the larger the list, the longer this will take, but each individual query will be very fast.

> We have a list of throwaway domains; we check our input email addresses against these throwaway domains.

This is an exact match so long as you have separated the host into a field.

> Keyword checking is one of those validation checks. Of all the methods explained above, only keyword checking is slow. So I am looking for the best method to check whether a keyword is present anywhere in an email address.

Here is where I don't follow you. Are you searching for words in an email, or searching for bad words in the text of an email? The Mongo text index may be able to help you.
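The name/domain split described above could be sketched in PHP roughly as follows. This is only an illustration: `splitEmail` is a hypothetical helper, the field names simply mirror the ones mentioned in the post, the throwaway domains are placeholder values, and the exact-match lookup uses a plain PHP array keyed by domain rather than a MongoDB query. It assumes PHP 7.1+.

```php
<?php
// Hypothetical helper: keep the original address and split it into name and domain fields.
function splitEmail(string $email): ?array
{
    $normalized = strtolower(trim($email));
    $at = strrpos($normalized, '@'); // the last "@" separates the name from the domain
    if ($at === false || $at === 0 || $at === strlen($normalized) - 1) {
        return null; // not in name@domain form
    }
    return [
        'email'  => $email,                        // original value as uploaded
        'name'   => substr($normalized, 0, $at),
        'domain' => substr($normalized, $at + 1),
    ];
}

// Placeholder throwaway list; array_flip turns the domains into keys
// so that isset() performs an exact-match lookup.
$throwaway = array_flip(['throwaway.example', 'tempmail.example']);

$record = splitEmail('NewUser@Throwaway.example');
var_dump($record['domain']);                    // string(17) "throwaway.example"
var_dump(isset($throwaway[$record['domain']])); // bool(true)
```

Lowercasing before the split keeps the exact match from missing addresses that differ only in letter case.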
abhilashss (Author) Posted September 17, 2013

I am searching for bad words in emails. For example, say 'spam' and 'junk' are bad words. Then:

    testspam@gmail.com - invalid
    junktest@gmail.com - invalid

I will check the Mongo text index. Thank you for your detailed reply.
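As a point of comparison for the substring check described here, the following is a minimal PHP sketch that stays close to the chunked-regex approach from the first post but builds the patterns once and reuses them for every address. The names `buildPatterns` and `containsKeyword` are hypothetical, the `/i` flag (case-insensitive matching) is an assumption about the desired behaviour, and the chunk size of 2000 is simply carried over from the original snippet.

```php
<?php
// Hypothetical helper: build one case-insensitive pattern per chunk of keywords.
function buildPatterns(array $keywords, int $chunkSize = 2000): array
{
    // An empty keyword would produce a pattern that matches every address, so drop it
    $keywords = array_filter($keywords, function ($w) { return $w !== ''; });
    $patterns = [];
    foreach (array_chunk($keywords, $chunkSize) as $chunk) {
        // Escape each keyword so regex metacharacters are matched literally
        $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $chunk);
        $patterns[] = '/' . implode('|', $quoted) . '/i';
    }
    return $patterns;
}

// True if any keyword occurs anywhere in the address.
function containsKeyword(array $patterns, string $email): bool
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $email) === 1) {
            return true;
        }
    }
    return false;
}

$patterns = buildPatterns(['spam', 'junk']);
foreach (['testspam@gmail.com', 'junktest@gmail.com', 'newuser@example.com'] as $email) {
    echo $email, ' - ', containsKeyword($patterns, $email) ? 'invalid' : 'valid', "\n";
}
```

Building the pattern strings up front avoids repeating the `implode` and escaping work for each of the million addresses; how much that helps in practice would need to be measured on the real data.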
AbraCadaver Posted September 17, 2013

abhilashss said:
> I am searching for bad words in emails. For example, say 'spam' and 'junk' are bad words. Then testspam@gmail.com - invalid, junktest@gmail.com - invalid. I will check the Mongo text index. Thank you for your detailed reply.

Really? The email that I use for this site, as well as many others, is nospam@mydomain.net. Also, what about johnbass or summacumlaude? You are just asking for lots of trouble here.

Edited September 17, 2013 by AbraCadaver
abhilashss (Author) Posted September 30, 2013

gizmola said:
> The Mongo text index may be able to help you.

The Mongo text index did the job. Thank you.