
Best way to cross matching large datasets


plautzer

Recommended Posts

Hi,

 

I'm running a script that cross matches about 200,000 datasets with each other. Each dataset consists of 8 parameters, and for each dataset I want to count all datasets that have similar or identical parameters.

 

Right now I am doing the matching via a MySQL query, which I call about 200,000 times. The problem is that this is extremely expensive: it takes up to 2 hours until the script is done. So I am wondering if there is a better method to cross match datasets, and whether someone could help me find a better solution.

 

While researching I found that arrays may be a faster alternative to queries. So far I have identified 3 possible ways of cross matching:

 

1. Nested foreach() loops:

foreach ($array as $ar1) {
    foreach ($array as $ar2) {
        if ($ar1[0] == $ar2[0]) { ... }
    }
}

 

2. array_map() with a callback function, so that I would have only one "hand coded" loop:

foreach ($array as $arr) {
    if ($arr[0] == $parameter) { ... }
}

 

3. array_walk(), which would also save one "hand coded" loop.

 

Which would theoretically be the best/fastest way to go about it? Can anyone tell me what the technical difference between these 3 approaches is, which one is the better choice, and whether there are other alternatives to them?
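A fourth alternative worth considering: all three listed approaches still compare every dataset against every other one (O(n²)). A common trick is to first group the datasets into a hash index keyed on a parameter that must match exactly, so each dataset is only compared against candidates that share that key. This is a minimal sketch with made-up sample data and parameter names, not your real schema:

<?php
// Sketch: cut down the O(n^2) cross match by indexing on an exact-match
// parameter. $datasets and the parameter names are invented for illustration.

$datasets = [
    ['id' => 1, 'par2' => 'a', 'par4' => 850],
    ['id' => 2, 'par2' => 'a', 'par4' => 900],
    ['id' => 3, 'par2' => 'b', 'par4' => 950],
];

// 1) Group datasets by the parameter that must match exactly (here par2).
$index = [];
foreach ($datasets as $row) {
    $index[$row['par2']][] = $row;
}

// 2) For each dataset, scan only the group with the same key and apply
//    the remaining (range) conditions there.
$counts = [];
foreach ($datasets as $row) {
    $matches = 0;
    foreach ($index[$row['par2']] as $candidate) {
        if ($candidate['id'] === $row['id']) {
            continue; // skip self-match
        }
        if ($candidate['par4'] >= 800 && $candidate['par4'] <= 1000) {
            $matches++;
        }
    }
    $counts[$row['id']] = $matches;
}

print_r($counts); // ids 1 and 2 match each other; id 3 has no partner

With 200,000 rows, the difference between scanning all rows and scanning only one group per dataset can be dramatic, provided the exact-match key splits the data into reasonably small groups.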

 

I am thankful for any advice that helps me reduce execution time!

 

 

Greetz

plautzer

Here is an example of a query with 6 parameters.

 

In this case par1, par2 and par3 should be unequal to the parameters of the current dataset.

par4 and par5 are supposed to be in a range that varies per dataset.

 

In the end I do a simple statistical count.

 

$sql = "SELECT
    COUNT(id) AS sum,
    SUM(IF(type = 1, 1, 0)) AS type1,
    SUM(IF(type = 2, 1, 0)) AS type2,
    SUM(IF(type = 3, 1, 0)) AS type3,
    SUM(IF(type = 4, 1, 0)) AS type4,
    SUM(IF(type = 5, 1, 0)) AS type5
  FROM table1
  WHERE
    (par1 != $id1 AND par2 != $id2) AND
    (par1 != $id2 AND par3 != $id1) AND
    par2 = $par3 AND
    par4 BETWEEN 800 AND 1000 AND
    par5 BETWEEN 300 AND 400 AND
    date < '$date' AND
    date > DATE_SUB('$date', INTERVAL 5 YEAR)";
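Note that MySQL's BETWEEN requires the lower bound first; "par5 BETWEEN 400 AND 300" matches nothing, so the bounds are given as 300 and 400 above. One way to avoid issuing this query 200,000 times is to load the table once and evaluate the same conditions in PHP. A sketch under these assumptions: countMatches() is a made-up helper mirroring the WHERE clause above (range bounds hard-coded, date filter omitted for brevity), and in the real script $rows would come from a single PDO/mysqli fetchAll():

<?php
// Sketch: fetch all rows once, then count matches per type in PHP instead
// of issuing one query per dataset.

function countMatches(array $rows, int $id1, int $id2, int $par3): array
{
    $stats = ['sum' => 0, 'type1' => 0, 'type2' => 0,
              'type3' => 0, 'type4' => 0, 'type5' => 0];
    foreach ($rows as $r) {
        if ($r['par1'] != $id1 && $r['par2'] != $id2
            && $r['par1'] != $id2 && $r['par3'] != $id1
            && $r['par2'] == $par3
            && $r['par4'] >= 800 && $r['par4'] <= 1000
            && $r['par5'] >= 300 && $r['par5'] <= 400) {
            $stats['sum']++;
            $stats['type' . $r['type']]++;
        }
    }
    return $stats;
}

// Tiny made-up sample: only the first row satisfies every condition
// (the second fails the par4 range check).
$rows = [
    ['type' => 2, 'par1' => 5, 'par2' => 7, 'par3' => 9, 'par4' => 900,  'par5' => 350],
    ['type' => 2, 'par1' => 5, 'par2' => 7, 'par3' => 9, 'par4' => 1200, 'par5' => 350],
];

print_r(countMatches($rows, 1, 2, 7)); // sum = 1, type2 = 1

The one-time fetch costs memory (200,000 rows × 8 parameters), but each subsequent count is a pure in-memory loop with no query overhead.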

 

The query takes up to 0.1 seconds.

 

It depends on how complex the query is. I just ran a quick test with "SELECT * FROM table" in a loop 50,000 times, which took 0.364953 secs to complete. That is roughly 0.00000729906 secs per query (assuming there's no caching going on).
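If you want to run that kind of comparison yourself, microtime(true) is enough. A minimal harness (benchmark() is a made-up helper name; substitute your own query or array loop for the dummy workload):

<?php
// Sketch: a tiny timing harness using microtime(true).

function benchmark(callable $fn, int $iterations = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Dummy workload standing in for one matching pass.
$data = range(1, 1000);
$elapsed = benchmark(function () use ($data) {
    array_filter($data, fn ($n) => $n % 2 === 0);
});

printf("1000 iterations took %.4f secs\n", $elapsed);

Timing both the per-query approach and the in-memory approach on the same data is the only reliable way to see which wins for your dataset sizes.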

 

Edit: Perhaps start a sub-question in the MySQL board asking how to optimise that query.

Archived

This topic is now archived and is closed to further replies.
