urieilam Posted November 12, 2008

Hi Everyone,

I was wondering if someone out there has experience working with large databases. I'm working on a backend which sends newsletters and maintains a database of the contacts. It should hold around 40,000 contacts, with around 6 fields per contact. The backend is already programmed and functions OK. The problem is that it is extremely slow.

The database pulls info out of .csv files. On top of just entering the contacts into the database, it compares existing information in order to avoid double entries for the same email address. I've scripted it to chop the data into batches of 1,000 and to reload the script at the end of each batch. It currently takes more than an hour to upload a new csv with about 10,000 contacts, compare the info, and insert the information. I've tried running it both with a SQL database and an XML one (I know it's not the best idea to use XML for this, but I hoped it would be quicker), and both take around the same amount of time.

Any ideas on what the right design for something like this would be? Are there any quicker ways to go about it? I'm not much of a database expert, especially when it comes to something as huge as this. Any help would be appreciated!

Thanks,
uri
premiso Posted November 12, 2008

Any time you have a script insert that much data it will be slow. I don't think there is a way around it, because essentially you will be querying the DB 10,000-plus times, which is just slow. An alternative may be to get all the emails from the csv and put them into a single query such as:

$sql = "SELECT email FROM table_contacts WHERE email IN ('email2@amil.com', 'email@aml.com')";

Now you have all the emails that already exist in the table, so you can either loop through each email and test it, or do something like this:

<?php
$bad_emails = array();
$result = mysql_query("SELECT email FROM table_contacts WHERE email IN ('email2@amil.com', 'email@aml.com')");
while ($row = mysql_fetch_assoc($result)) {
    $bad_emails[] = $row['email'];
}

// then loop through your csv emails and insert only the ones that aren't already there
foreach ($csv_emails as $email) {
    if (!in_array($email, $bad_emails)) {
        mysql_query("INSERT INTO tablename (`email`) VALUES ('" . mysql_real_escape_string($email) . "')");
    }
}
?>

Of course $csv_emails can be an array of the full row data for populating, and you would just reference the email field, e.g. if (!in_array($email[2], $bad_emails)) — where 2 assumes the email is at index 2 (the third column) of that row. Either way this should cut your queries roughly in half. Hope that gets you going and the ideas flowing.
rarebit Posted November 12, 2008

My question is: is PHP the best tool for this job? By writing the 'inserter' application in C or C++, it would run a lot faster and it wouldn't have any timeout issues. You could probably manage memory better by only accessing the required data, but that depends on how it's organised. Even a shell or batch script might be a better approach.
JonnoTheDev Posted November 12, 2008

This is not the kind of processing that should be initiated through a web browser application. PHP and MySQL are fine to use, however the script should be a shell operation. Your web app admin system could be used to upload the CSV file and set a flag in the database. The shell script could run as a cron job. When you upload the CSV the flag could be set to 1; this lets the shell script know that a new csv has been uploaded and should be processed. When the script has finished it could reset the flag to 0, send you an email, etc.
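As a minimal sketch of the flag-and-cron approach described above (not code from the thread): the csv_imports table and its columns (id, filename, pending) are hypothetical names chosen for illustration, as are the connection details and addresses.

<?php
// process_csv.php - a sketch of the cron-driven shell script described above.
// The csv_imports table (id, filename, pending) is a hypothetical schema, not from the thread.
// A cron entry might look like: */5 * * * * php /path/to/process_csv.php
mysql_connect('localhost', 'user', 'password');
mysql_select_db('newsletter');

// Look for an upload the web admin app has flagged as pending (flag set to 1 on upload)
$result = mysql_query("SELECT id, filename FROM csv_imports WHERE pending = 1 LIMIT 1");
if ($row = mysql_fetch_assoc($result)) {
    // ... process the CSV here (parse it in PHP, or hand it to LOAD DATA INFILE) ...

    // Reset the flag so the next cron run doesn't pick the same file up again
    mysql_query("UPDATE csv_imports SET pending = 0 WHERE id = " . (int) $row['id']);

    // Let the admin know the import finished
    mail('admin@example.com', 'CSV import finished', 'Processed ' . $row['filename']);
}
?>

Because the script runs from the shell rather than through Apache, it isn't subject to browser or web-server timeouts and can run until the whole file is processed.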
Mark Baker Posted November 12, 2008

Quote: "I was wondering if someone out there has experience working with large databases. I'm working on a backend which sends newsletters and maintains a database of the contacts. It should hold around 40,000 contacts, with around 6 fields per contact."

I work with several dozen databases, each of which has tens of millions of contacts with approaching 100 attributes per contact, and regular uploads of up to 500,000 entries. I use Oracle rather than MySQL, but the performance issues are basically the same.

Quote: "The database pulls info out of .csv files. On top of just entering the contacts into the database, it compares existing information in order to avoid double entries for the same email address. I've scripted it to chop the data into batches of 1,000 and to reload the script at the end of each batch."

Use LOAD DATA INFILE (or the mysqlimport utility) to load into a temporary table. If you can be sure that the data in the files won't have any duplicates, drop the indexes on the temporary table before doing the import, then add basic indexes as needed once the import is complete. If you do need indexes to factor out duplicates during the import, use only those that are absolutely necessary, and recreate any additional indexes you'll need for record matching once the import is complete. Even then, keep the number of indexes to a minimum, because you need to run a full table scan on the temp table anyway, so the indexes aren't much benefit there.

Use a DELETE on the temporary table to blast any records where there is a matching record in your target contacts table; use WHERE EXISTS against the target table (it's efficient). Finally, you can insert every remaining record from the temp table into your target table using INSERT INTO target (SELECT * FROM temp) before deleting everything from the temp table.

As other people have pointed out, don't try to do this through a browser interface; use a server-side CLI script that can be allowed to run until it completes.
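For concreteness, here is a hedged PHP/MySQL sketch of the temp-table flow just described. The table and column names (contacts, contacts_temp, status, email, lan), the CSV layout, and the file path are assumptions, not anything posted in the thread; depending on server privileges you may need LOAD DATA LOCAL INFILE instead.

<?php
// Sketch only - contacts, contacts_temp and the column list are assumed, not from the thread.
mysql_connect('localhost', 'user', 'password');
mysql_select_db('newsletter');

// 1. A bare working table with no indexes, so the bulk load is as fast as possible
mysql_query("CREATE TABLE IF NOT EXISTS contacts_temp (
                 status VARCHAR(20),
                 email  VARCHAR(255),
                 lan    VARCHAR(10)
             )");
mysql_query("TRUNCATE TABLE contacts_temp");

// 2. Bulk-load the CSV in one statement instead of thousands of individual INSERTs
mysql_query("LOAD DATA INFILE '/tmp/upload.csv'
             INTO TABLE contacts_temp
             FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
             LINES TERMINATED BY '\\n'
             (status, email, lan)");

// 3. Blast any temp rows whose email already exists in the target table
mysql_query("DELETE FROM contacts_temp
             WHERE EXISTS (SELECT 1 FROM contacts
                           WHERE contacts.email = contacts_temp.email)");

// 4. Copy the remaining new rows into the target, then empty the temp table
mysql_query("INSERT INTO contacts (status, email, lan)
             SELECT status, email, lan FROM contacts_temp");
mysql_query("TRUNCATE TABLE contacts_temp");
?>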
urieilam Posted November 13, 2008 (Author)

I've managed to cut off a lot of time by making the email column unique and using only one query, something like:

$query = "INSERT INTO addresses VALUES ('','$status','','','$email','$lan') ON DUPLICATE KEY UPDATE lan='$lan'...";

Now it's all in one, faster, and keeps the database's status and email info on duplicates! It's probably still not as fast as having something running off the shell, but it's much better. As a next step, I'm going to try to have the data load up using LOAD DATA INFILE - I think this should speed things up even more, especially if I can get it to load the data, compare it and add the info in one query.
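The statement above is truncated in the post, so purely as an illustration of the same idea, here is a hedged sketch of batching many rows into a single INSERT ... ON DUPLICATE KEY UPDATE. The column order is guessed from the values list above, and $csv_contacts (an array of parsed CSV rows) is a hypothetical variable, not from the thread.

<?php
// Sketch only: the real column list of `addresses` was never posted, and
// $csv_contacts (an array of parsed CSV rows) is hypothetical.
$rows = array();
foreach ($csv_contacts as $c) {
    $rows[] = "('', '" . mysql_real_escape_string($c['status']) . "', '', '', '"
            . mysql_real_escape_string($c['email']) . "', '"
            . mysql_real_escape_string($c['lan']) . "')";
}

// One round trip inserts every row; rows that hit the unique email key just update lan instead
mysql_query("INSERT INTO addresses VALUES " . implode(', ', $rows)
          . " ON DUPLICATE KEY UPDATE lan = VALUES(lan)");
?>

Batching rows this way cuts the number of round trips to the server even further than one INSERT per contact.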
premiso Posted November 13, 2008

Nice job, and thanks for posting the query you used; it may come in handy at a later date.
cmaunoury Posted November 24, 2008

I would like to add another question to this topic (maybe it deserves its own thread, but I don't want to be a pain). I am working on uniting contact databases that have been built up over the years, each of them having around 10,000 entries with about 20 fields. Of those 10,000, I am almost certain that 90% have common data; some entries might be for the same person but wouldn't have the same fields populated (e.g. the version from 10 years ago probably does not have an email address, but the contact has been duplicated over the years). I know it was poor data management to create new databases over the years with some parts in common and some new, but please believe this is something I have complained about enough.

I now have all those databases, with identical fields, in Excel documents, and I was thinking this was the perfect occasion to switch my contact management to "the cloud" and upload it all into a MySQL database. What I am wondering is whether there is PHP code that could merge entries with the same, say, 'name' and 'surname' fields, so that I don't lose important data, and then let me delete double (and also triple, quadruple, etc.) entries. I would be a happy man (and so would the people who have been messing with their contact data over the years) if I had this clean database in MySQL; I could then work on a back-office management system and sync features.

Thanks for your help,
Colin
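No reply to this follow-up appears in the thread. As a starting point only, here is a hedged sketch of one way the duplicate name/surname pairs could be found and merged once everything is loaded into a single MySQL table. The table and column names (all_contacts, name, surname) are assumptions, not from the thread.

<?php
// Sketch only: all_contacts / name / surname are assumed names, not from the thread.
// Find every name+surname combination that appears more than once.
$result = mysql_query("SELECT name, surname, COUNT(*) AS copies
                       FROM all_contacts
                       GROUP BY name, surname
                       HAVING COUNT(*) > 1");
while ($row = mysql_fetch_assoc($result)) {
    // For each duplicated person, pull all their rows and merge them field by field,
    // keeping the first non-empty value seen for each column.
    $dupes = mysql_query("SELECT * FROM all_contacts
                          WHERE name = '" . mysql_real_escape_string($row['name']) . "'
                          AND surname = '" . mysql_real_escape_string($row['surname']) . "'");
    $merged = array();
    while ($contact = mysql_fetch_assoc($dupes)) {
        foreach ($contact as $field => $value) {
            if (!isset($merged[$field]) || $merged[$field] === '') {
                $merged[$field] = $value;
            }
        }
    }
    // $merged now holds the combined record; write it back and delete the extra rows as needed.
}
?>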