rofina Posted May 13, 2009

Each time I merge "domain blacklists" provided by various sources (malwaredomains.com, forum.hosts-file.net, emergingthreats.net) into a blocklist for the DNS proxy app DNSKong (pyrenean.com), the combined list contains LOTS of overlapping / redundant patterns. Although it's easy to sort the list and remove EXACT matching lines, coding a script that can match "between the dots" has me stumped.

Within the supplied lists, each pattern might represent an entire domain (bfast.com), a complete hostname (ads.360.yahoo.com), a partial hostname (tracker.ebay), or just a non-dotted label pattern (teensxxx).

For the sake of example, suppose the existing blocklist already contains:

bfast.com
ads.360.yahoo.com
myyahoohoo
tracker.ebay
teensxxx

When comparing a new pattern against those already in the blocklist, the script would discard the to-be-merged entry teensxxx.smegerator.com because it's redundant, i.e. it would already be matched by 'teensxxx'. If the to-be-merged list contains an entry my.yahoo.com, the script WOULD accept this as a new, non-redundant pattern ('my.yahoo' is not a between-the-dots match with 'myyahoohoo').

Given that the input has been sanitized elsewhere prior to pasting, and arrives in "one blocklist pattern per line" format, here's my non-working attempt at coding the script. Initially, the input is the content of the existing list (which definitely already contains some redundant patterns, and I'm tired of weeding them out manually).
    <?php
    if( isset($_POST['mytextareacontent']) && $_POST['mytextareacontent'] != '') {
        $mystringin = trim(strtolower(stripslashes($_POST['mytextareacontent'])));
        $myorigarray = explode("\n", $mystringin);
        $myorigarray = array_unique($myorigarray);
        $howmany = count($myorigarray) + 1;
        $mytemparray = $myorigarray;

        // In order to effect "between the dots" matching, split each pattern into its label parts
        for($i=0; $i <= $howmany; $i++) {
            $orig_pieces = explode('.', $myorigarray[$i]);
            $length_outer = strlen($mytemparray[$i]);
            for($j=1; $j <= $howmany; $j++) {
                // avoid comparing an element with its mirror
                if( $i != $j && $mytemparray[$i] != ' ' && $mytemparray[$j] != ' ') {
                    $temp_pieces = explode('.', $myorigarray[$j]);
                    $length_inner = strlen($mytemparray[$j]);
                    $overlapping_pieces = sort(array_intersect($temp_pieces, $orig_pieces));
                    if( count($overlapping_pieces >= 1) && $overlapping_pieces == sort($temp_pieces) || $overlapping_pieces == sort($orig_pieces) ) {
                        // choose the shorter string for use as the strstr() needle
                        if($length_outer > $length_inner) {
                            if(strstr($mytemparray[$i], $mytemparray[$j])){
                                $mytemparray[$i] = ' ';
                            }
                        } else {
                            if(strstr($mytemparray[$j], $mytemparray[$i])){
                                $mytemparray[$j] = ' ';
                            }
                        }
                    } // END check whether we have overlapping pieces
                } // END avoid self-comparison
            } // END inner loop
        } // END outer loop

        // prepare the output for one-per-line textarea display
        $mytemparray = array_unique($mytemparray);
        $myout = trim(implode($mytemparray, "\n"));
    } // END (check if form POST data is present...)

    // build the html page output
    echo '<html><body><form method="POST" action="' . $PHP_SELF . '">'
        . 'paste list of hostname entries then click GO to remove redundant patterns<br /> '
        . '<textarea name="mytextareacontent" wrap="soft" rows="20" cols="70">' . $myout . '</textarea><br />'
        . '<input type="submit" value="go" /></form></body><html>';
    ?>

Thanks, in advance, for any enlightenment you can provide.
When I sat down to write this, I expected it would just involve straightforward array_intersect() but now my head is swimming...
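The "between the dots" rule described in the post can be sketched roughly as follows. This is only an illustration under one assumption: a pattern covers a hostname when the pattern's labels appear as a contiguous run of the hostname's labels. The function name isCoveredBy() is hypothetical, and the dot-padding trick is a swapped-in shortcut, not the array_intersect() approach attempted above.

```php
<?php
// Hypothetical helper (not from the original post): returns true when
// $pattern matches $candidate "between the dots", i.e. the pattern's labels
// occur as a contiguous run of the candidate's labels.
function isCoveredBy(string $candidate, string $pattern): bool
{
    $candidate = strtolower(trim($candidate));
    $pattern   = strtolower(trim($pattern));

    // Blank input never covers, and is never covered by, anything.
    if ($candidate === '' || $pattern === '') {
        return false;
    }

    // Padding both strings with dots forces any match to begin and end
    // exactly on a label edge, so 'ebay' cannot match inside 'my-ebay'.
    return strpos('.' . $candidate . '.', '.' . $pattern . '.') !== false;
}
```

With the examples from the post, isCoveredBy('teensxxx.smegerator.com', 'teensxxx') should be true, while isCoveredBy('my.yahoo.com', 'myyahoohoo') should be false. Because only the dot is treated as a separator, hyphens and underscores inside a label need no special handling.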
rofina Posted May 22, 2009

Ten days, 25 reads and ZERO replies later... FWIW, here's the gist of what I wound up coding. The PHP script works, but even though it doesn't echo anything to the page until matching has finished, it is TERRIBLY slow. During testing, I had to override the max_execution_time setting in php.ini to avoid a script timeout when handling only 5K patterns.

Trial execution times:

500 items = 1.7 sec
1K items = 6.9 sec
2K items = 36 sec
4K items = 136 sec
5K items = timeout (1800+ sec)

    $mystringin = trim($_POST['mytextareacontent']);
    $myarray = explode("\n", $mystringin);
    // re-index after de-duplication so that $i and $j never hit a missing key
    $myarray = array_values(array_unique($myarray));
    $mytemparray = $myarray;
    $howmany = count($myarray);
    $rejects = array();
    for($i=0; $i < $howmany; $i++) {
        for($j=0; $j < $howmany; $j++) {
            $haystack = trim($myarray[$i]);
            $needle = trim($myarray[$j]);
            // \b anchors the needle at word boundaries inside the haystack
            if( $i != $j && $haystack != '' && $needle != '' && preg_match("/\b".preg_quote($needle, '/')."\b/i", $haystack) > 0 ) {
                $rejects[] = '<b>'. $haystack .'</b> obviated by <b>'. $needle .'</b><br />';
                unset($mytemparray[$i]);
                $myarray[$i] = '';
            }
        }
    }
    $mytemparray = array_unique($mytemparray);
    $myout = trim(implode("\n", $mytemparray));
    // plus additional lines to display $myout and $rejects to the page

Remaining issues:

-- Speed. The nested loops perform on the order of n² comparisons (about 25 million regex tests for 5K patterns), so I'll need to modify the script so that it processes the patterns in batches.
-- Apparently the underscore character is valid within third-level domain labels. My reliance on the \b word-boundary assertion does not handle label characters cleanly: \b treats the underscore as a word character but treats the hyphen as a boundary, so a pattern such as 'ebay' would also match 'my-ebay.com'.
-- Even though I expect the input fed to the script will already have been sanitized, as is, the script won't properly handle any spaces and/or blank lines contained within the input.
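One possible way around the speed issue listed above, offered as a sketch and not as the poster's method: replace the pattern-against-pattern regex loop with hash lookups. Each pattern is split on dots, and every shorter contiguous run of its labels is looked up in an array used as a set. The work is then roughly proportional to the number of patterns times the square of the label count per pattern, instead of the square of the list length. The function name pruneRedundantPatterns() is hypothetical, and the trimming and lowercasing are assumptions about how the input should be normalised.

```php
<?php
// Hypothetical sketch: drop every pattern that is already covered,
// label-for-label, by a shorter pattern present in the same list.
function pruneRedundantPatterns(array $lines): array
{
    // Normalise input: trim, lowercase, skip blanks. Array keys act as a
    // hash set, which also removes exact duplicates.
    $set = [];
    foreach ($lines as $line) {
        $p = strtolower(trim((string) $line));
        if ($p !== '') {
            $set[$p] = true;
        }
    }

    $kept = [];
    foreach (array_keys($set) as $key) {
        // PHP turns all-digit string keys into integers, so cast back.
        $pattern = (string) $key;
        $labels  = explode('.', $pattern);
        $n       = count($labels);
        $redundant = false;

        // Try every contiguous run of labels shorter than the full pattern.
        for ($len = 1; $len < $n && !$redundant; $len++) {
            for ($start = 0; $start + $len <= $n; $start++) {
                $run = implode('.', array_slice($labels, $start, $len));
                if (isset($set[$run])) {
                    $redundant = true; // a shorter pattern already matches
                    break;
                }
            }
        }

        if (!$redundant) {
            $kept[] = $pattern;
        }
    }

    return $kept;
}
```

A call such as pruneRedundantPatterns(explode("\n", $_POST['mytextareacontent'])) would return the surviving patterns in their original order. Since only the dot separates labels, underscores and hyphens are compared as ordinary label characters, and blank lines or stray spaces are dropped during normalisation.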