
hostname blocklist: need to remove redundant patterns


rofina


Each time I merge "domain blacklists" provided by various sources

(malwaredomains.com , forum.hosts-file.net , emergingthreats.net)

into a blocklist for the DNS proxy app DNSKong (pyrenean.com)

the combined list contains LOTS of overlapping / redundant patterns.

 

Although it's easy to sort the list & remove EXACT matching lines,

coding a script which is able to match "between the dots" has me stumped.

 

Within the supplied lists, each given pattern might represent "an entire domain"

(bfast.com) or a "complete" hostname (ads.360.yahoo.com) or a partial hostname

(tracker.ebay) or just a non-dotted label pattern (teensxxx)

 

For the sake of example, if the existing blocklist already contains:

bfast.com

ads.360.yahoo.com

myyahoohoo

tracker.ebay

teensxxx

when comparing a new pattern against those already in the blocklist,

the script would discard the to-be-merged entry

teensxxx.smegerator.com

because it's redundant, i.e. it would already be matched by 'teensxxx'.

 

If the to-be-merged list contains an entry

my.yahoo.com

the script WOULD accept this as a new, non-redundant pattern

('my.yahoo' is not a between-the-dots match with 'myyahoohoo')
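
The rule described above can be sketched as a small helper. This is only one reading of "between the dots" matching: a pattern matches when all of its labels appear, whole and in order, as a contiguous run of the candidate's labels. The function name labels_match() is hypothetical, and whether DNSKong applies exactly these semantics is an assumption.

```php
<?php
// Hypothetical helper: true when every label of $pattern appears, in order and
// contiguously, among the dot-separated labels of $hostname.
function labels_match(string $pattern, string $hostname): bool
{
    $p = explode('.', strtolower($pattern));
    $h = explode('.', strtolower($hostname));
    $span = count($p);

    // slide a window of $span labels across the hostname's labels
    for ($start = 0; $start + $span <= count($h); $start++) {
        if (array_slice($h, $start, $span) === $p) {
            return true;
        }
    }
    return false;
}

var_dump(labels_match('teensxxx', 'teensxxx.smegerator.com')); // bool(true)
var_dump(labels_match('myyahoohoo', 'my.yahoo.com'));          // bool(false)
var_dump(labels_match('tracker.ebay', 'tracker.ebay.com'));    // bool(true)
```

Comparing whole labels avoids substring accidents: 'myyahoohoo' and 'my.yahoo.com' share characters but no complete label.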

 

 

 

Given that the input has been sanitized elsewhere, prior to pasting, and exists in "one blocklist pattern per line" format, here's my non-working attempt at coding the script. Initially, the input is the content of the existing list (which definitely already contains some redundant patterns, and I'm tired of weeding them out manually).

<?php
if( isset($_POST['mytextareacontent'])  &&  $_POST['mytextareacontent'] != '') {

$mystringin = trim(strtolower(stripslashes($_POST['mytextareacontent'])));
$myorigarray = explode("\n", $mystringin);
$myorigarray = array_unique($myorigarray);
$howmany = count($myorigarray) + 1;
$mytemparray = $myorigarray;

 

// In order to effect "between the dots" matching, split each pattern into its label parts

for($i=0; $i <= $howmany; $i++) {
$orig_pieces = explode('.', $myorigarray[$i]);
$length_outer = strlen($mytemparray[$i]);

for($j=1; $j <= $howmany; $j++) {

 

 

// avoid comparing an element with its mirror
if( $i != $j  &&  $mytemparray[$i] != ' '  &&  $mytemparray[$j] != ' ')
{

$temp_pieces = explode('.', $myorigarray[$j]);
$length_inner = strlen($mytemparray[$j]);
$overlapping_pieces = sort(array_intersect($temp_pieces, $orig_pieces));                   

if(
count($overlapping_pieces >= 1) &&
$overlapping_pieces == sort($temp_pieces)  || $overlapping_pieces == sort($orig_pieces) 
) {

// choose the shorter string for use as the strstr() needle
if($length_outer > $length_inner)
{
if(strstr($mytemparray[$i], $mytemparray[$j])){   $mytemparray[$i] = ' '; }
} else {
if(strstr($mytemparray[$j], $mytemparray[$i])){  $mytemparray[$j] = ' '; }
}

} // END check whether we have overlapping pieces

} // END avoid self-comparison

 

// prepare the output for one-per-line textarea display:

} // END inner loop
} // END outer loop

$mytemparray = array_unique($mytemparray);
$myout = trim(implode($mytemparray, "\n"));
} // END (check if form POST data is present...)

 

// build the html page output

echo
'<html><body><form method="POST" action="'
. $PHP_SELF .
'">paste list of hostname entries then click GO to remove redundant patterns<br />
<textarea name="mytextareacontent" wrap="soft" rows="20" cols="70">'
. $myout .
'</textarea><br /><input type="submit" value="go" /></form></body></html>';
?>

 

 

Thanks, in advance, for any enlightenment you can provide.

When I sat down to write this, I expected it would just involve straightforward array_intersect()

but now my head is swimming...

 

  • 2 weeks later...

Ten days, 25 reads and ZERO replies later...

 

FWIW, here's the gist of what I wound up coding.

The PHP script works... but even though it doesn't echo anything to the page until matching has finished, it is TERRIBLY slow.

 

During testing, I had to override the max_execution_time setting in php.ini to avoid script timeout when handling only 5k patterns.

 

trial execution times:

500 items = 1.7 sec

1K items = 6.9 sec

2K items = 36 sec

4K items = 136 sec

5K items = timeout (1800+ sec)

 

$mystringin = trim($_POST['mytextareacontent']);
$myarray = explode("\n", $mystringin);
// array_unique() preserves keys, so reindex to keep the loop counters valid
$myarray = array_values(array_unique($myarray));

$mytemparray = $myarray;
$howmany = count($myarray);
$rejects = array();

for($i=0; $i < $howmany; $i++) {

for($j=0; $j < $howmany; $j++) {
$haystack = trim($myarray[$i]);
$needle = trim($myarray[$j]);

// pass the delimiter to preg_quote() so a '/' in a pattern cannot break the regex
if( $i != $j  && $haystack != '' && $needle != ''
&& preg_match("/\b".preg_quote($needle, '/')."\b/i", $haystack) > 0 )
{
$rejects[] = '<b>'. $haystack .'</b> obviated by <b>'. $needle .'</b><br />';
unset($mytemparray[$i]);
$myarray[$i] = '';
}

}
}
$mytemparray = array_unique($mytemparray);
// implode() takes the separator first; the reversed order was removed in PHP 8
$myout = trim(implode("\n", $mytemparray));

// plus additional lines to display $myout and $rejects to the page
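
The preg_match() call above relies on \b, which is defined in terms of \w characters (letters, digits, underscore). A short demonstration of what that means for hostnames, followed by a possible alternative that anchors on dots or the ends of the string instead. The helper name dot_bounded_match() is hypothetical, and this is a sketch rather than a drop-in replacement.

```php
<?php
// \b treats the hyphen as a boundary, so a pattern can match part of a label
var_dump(preg_match('/\bebay\b/i', 'my-ebay.com'));   // int(1)
// the underscore is a \w character, so no boundary is found before "ebay" here
var_dump(preg_match('/\bebay\b/i', 'my_ebay.com'));   // int(0)

// Hypothetical alternative: require a dot or a string end on both sides
function dot_bounded_match(string $needle, string $haystack): bool
{
    $re = '/(?:^|\.)' . preg_quote($needle, '/') . '(?:\.|$)/i';
    return preg_match($re, $haystack) === 1;
}

var_dump(dot_bounded_match('ebay', 'my-ebay.com'));       // bool(false)
var_dump(dot_bounded_match('ebay', 'tracker.ebay.com'));  // bool(true)
```

With the dot-anchored form, a pattern only matches complete labels, whatever characters those labels contain.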

 

remaining issues:

 

-- Speed. The nested loops perform length × (length-1) comparisons, so the run time grows with the square of the list size. I'll need to modify the script so that it processes the patterns in batches.

 

 

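-- Apparently the underscore character is valid within third-level domain labels. My reliance on the \b regex assertion (which is defined in terms of \w) does not treat every label character alike: \w includes the underscore but not the hyphen, so \b also finds a "boundary" next to a hyphen inside a label.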

 

-- Even though I expect the input fed to the script will already have been sanitized, as is, the script won't properly handle any spaces and/or blank lines contained within the input.
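
One possible way to address all three remaining issues at once, offered as a sketch under assumptions rather than a tested replacement: normalize the input first, then look up each pattern's label runs in a hash of known patterns instead of comparing every pair. The function names clean_patterns() and prune_redundant() are hypothetical, and dropping entries that contain inner whitespace is an assumption about what "sanitized" should mean.

```php
<?php
// Hypothetical helper: clean raw textarea content into a list of patterns.
function clean_patterns(string $raw): array
{
    $lines = preg_split('/\R/', strtolower($raw));   // \R = any line break
    $lines = array_map('trim', $lines);
    $lines = array_filter($lines, function (string $line): bool {
        // assumption: blank lines and entries with inner whitespace are dropped
        return $line !== '' && !preg_match('/\s/', $line);
    });
    return array_values(array_unique($lines));
}

// Hypothetical helper: remove patterns already covered by a shorter pattern.
function prune_redundant(array $patterns): array
{
    $known = array_flip($patterns);   // pattern => index, for constant-time lookups
    $kept = [];
    foreach ($patterns as $pattern) {
        $labels = explode('.', $pattern);
        $n = count($labels);
        $redundant = false;
        // try every contiguous run of whole labels shorter than the pattern itself
        for ($len = 1; $len < $n && !$redundant; $len++) {
            for ($start = 0; $start + $len <= $n; $start++) {
                $run = implode('.', array_slice($labels, $start, $len));
                if (isset($known[$run])) {
                    $redundant = true;
                    break;
                }
            }
        }
        if (!$redundant) {
            $kept[] = $pattern;
        }
    }
    return $kept;
}

$raw = "bfast.com\r\nads.360.yahoo.com\n\nmyyahoohoo\n tracker.ebay \nteensxxx\n"
     . "teensxxx.smegerator.com\nmy.yahoo.com\n";
print_r(prune_redundant(clean_patterns($raw)));
```

With the example list from the first post, only teensxxx.smegerator.com is dropped. A pattern with k labels has k(k+1)/2 label runs, so the work per pattern depends on its own length and not on the size of the list, and labels are compared whole, so underscores and hyphens need no special handling.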

