
Best way to cross matching large datasets


plautzer

Recommended Posts

Hi,

 

I'm running a script that cross matches about 200,000 datasets with each other. Each dataset consists of 8 parameters, and for each dataset I want to count all datasets that have similar or identical parameters.

 

Right now I am doing the matching via a MySQL query, which I call about 200,000 times. The problem is that this is extremely expensive: it takes up to 2 hours until the script is done. So I am wondering if there is a better method to cross match datasets, and whether someone could help me find a better solution.

 

While researching I found that arrays may be a faster alternative to queries. So far I have identified 3 possible ways of cross matching:

 

1. Nested foreach() loops:

foreach ($array as $ar1) {
    foreach ($array as $ar2) {
        if ($ar1[0] == $ar2[0]) { ... }
    }
}

 

2. array_map() with a callback function, so that I would have only one "hand coded" loop:

foreach ($array as $arr) {
    if ($arr[0] == $parameter) { ... }
}

 

3. array_walk(), which would also save one "hand coded" loop.

 

Which would theoretically be the best/fastest way to go about it? Can anyone tell me what the technical difference between these 3 approaches is, which one is the better choice, and whether there are other alternatives to them?
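A fourth alternative worth considering: all three listed approaches still compare every dataset against every other one (O(n²)). A common trick is to first group the datasets into a hash index keyed on a parameter that must match exactly, so each dataset is only compared against candidates that share that key. This is a minimal sketch with made-up sample data and parameter names, not your real schema:

<?php
// Sketch: cut down the O(n^2) cross match by indexing on an exact-match
// parameter. $datasets and the parameter names are invented for illustration.

$datasets = [
    ['id' => 1, 'par2' => 'a', 'par4' => 850],
    ['id' => 2, 'par2' => 'a', 'par4' => 900],
    ['id' => 3, 'par2' => 'b', 'par4' => 950],
];

// 1) Group datasets by the parameter that must match exactly (here par2).
$index = [];
foreach ($datasets as $row) {
    $index[$row['par2']][] = $row;
}

// 2) For each dataset, scan only the group with the same key and apply
//    the remaining (range) conditions there.
$counts = [];
foreach ($datasets as $row) {
    $matches = 0;
    foreach ($index[$row['par2']] as $candidate) {
        if ($candidate['id'] === $row['id']) {
            continue; // skip self-match
        }
        if ($candidate['par4'] >= 800 && $candidate['par4'] <= 1000) {
            $matches++;
        }
    }
    $counts[$row['id']] = $matches;
}

print_r($counts); // ids 1 and 2 match each other; id 3 has no partner

With 200,000 rows, the difference between scanning all rows and scanning only one group per dataset can be dramatic, provided the exact-match key splits the data into reasonably small groups.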

 

I am thankful for any advice that helps me reduce execution time!

 

 

Greetz

plautzer

Here is an example of a query with 6 parameters.

 

In this case par1, par2 and par3 should be unequal to the parameters of the current dataset.

par4 and par5 are supposed to be in a range that varies per dataset.

 

In the end I do a simple statistical count.

 

$sql = "SELECT
    COUNT(id) AS sum,
    SUM(IF(type = 1, 1, 0)) AS type1,
    SUM(IF(type = 2, 1, 0)) AS type2,
    SUM(IF(type = 3, 1, 0)) AS type3,
    SUM(IF(type = 4, 1, 0)) AS type4,
    SUM(IF(type = 5, 1, 0)) AS type5
  FROM table1
  WHERE
    (par1 != $id1 AND par2 != $id2) AND
    (par1 != $id2 AND par3 != $id1) AND
    par2 = $par3 AND
    par4 BETWEEN 800 AND 1000 AND
    par5 BETWEEN 300 AND 400 AND
    date < '$date' AND
    date > DATE_SUB('$date', INTERVAL 5 YEAR)";
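Note that MySQL's BETWEEN requires the lower bound first; "par5 BETWEEN 400 AND 300" matches nothing, so the bounds are given as 300 and 400 above. One way to avoid issuing this query 200,000 times is to load the table once and evaluate the same conditions in PHP. A sketch under these assumptions: countMatches() is a made-up helper mirroring the WHERE clause above (range bounds hard-coded, date filter omitted for brevity), and in the real script $rows would come from a single PDO/mysqli fetchAll():

<?php
// Sketch: fetch all rows once, then count matches per type in PHP instead
// of issuing one query per dataset.

function countMatches(array $rows, int $id1, int $id2, int $par3): array
{
    $stats = ['sum' => 0, 'type1' => 0, 'type2' => 0,
              'type3' => 0, 'type4' => 0, 'type5' => 0];
    foreach ($rows as $r) {
        if ($r['par1'] != $id1 && $r['par2'] != $id2
            && $r['par1'] != $id2 && $r['par3'] != $id1
            && $r['par2'] == $par3
            && $r['par4'] >= 800 && $r['par4'] <= 1000
            && $r['par5'] >= 300 && $r['par5'] <= 400) {
            $stats['sum']++;
            $stats['type' . $r['type']]++;
        }
    }
    return $stats;
}

// Tiny made-up sample: only the first row satisfies every condition
// (the second fails the par4 range check).
$rows = [
    ['type' => 2, 'par1' => 5, 'par2' => 7, 'par3' => 9, 'par4' => 900,  'par5' => 350],
    ['type' => 2, 'par1' => 5, 'par2' => 7, 'par3' => 9, 'par4' => 1200, 'par5' => 350],
];

print_r(countMatches($rows, 1, 2, 7)); // sum = 1, type2 = 1

The one-time fetch costs memory (200,000 rows × 8 parameters), but each subsequent count is a pure in-memory loop with no query overhead.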

 

The query takes up to 0.1 seconds.

 

It depends on how complex the query is. I just ran a quick test with "SELECT * FROM table" in a loop 50,000 times, which took 0.364953 secs to complete. That is roughly 0.00000729906 secs per query (assuming there's no caching going on).
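If you want to run that kind of comparison yourself, microtime(true) is enough. A minimal harness (benchmark() is a made-up helper name; substitute your own query or array loop for the dummy workload):

<?php
// Sketch: a tiny timing harness using microtime(true).

function benchmark(callable $fn, int $iterations = 1000): float
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return microtime(true) - $start;
}

// Dummy workload standing in for one matching pass.
$data = range(1, 1000);
$elapsed = benchmark(function () use ($data) {
    array_filter($data, fn ($n) => $n % 2 === 0);
});

printf("1000 iterations took %.4f secs\n", $elapsed);

Timing both the per-query approach and the in-memory approach on the same data is the only reliable way to see which wins for your dataset sizes.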

 

Edit: Perhaps start a sub-question in the MySQL board asking how to optimise that query.

Archived

This topic is now archived and is closed to further replies.
