
Hey guys,

 

I'm still at an amateur level of programming, but I have a challenging project ahead of me. Here is a brief scope of the problem:

 

We are collecting a lot of data -- essentially 53 columns in the CSV. The first thing I noticed was that certain portions of the data were very likely to repeat, which led me to the following MySQL schema to reduce redundancy:

 

TABLE Markets (Relies on nothing) (Holds data about markets)
TABLE Contacts (Relies on markets) (Contacts are associated with a market)
TABLE Events (The events belong to both a market and a contact)
TABLE Event Products (This only relies on an Event)

 

Essentially my task is to automate the process since we'll be processing thousands of rows. So the script would probably need to behave like so:

  1. Import the markets
     
  2. Then import associated contacts (as well as the unique market id assigned by MySQL)
     
  3. Then import associated events (placing the unique market and contact id assigned by MySQL as well)
     
  4. Then import associated event products (placing the unique event id assigned by MySQL as well)
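The four steps above can be sketched as a small script that inserts each parent row and carries its auto-generated key down to its children. Here is a minimal sketch using Python's sqlite3 as a stand-in for MySQL (all table and column names are invented for illustration; in PHP you would use PDO's lastInsertId() the same way):

```python
import sqlite3

# In-memory database standing in for the real MySQL schema (names are hypothetical).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE Markets       (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Contacts      (ID INTEGER PRIMARY KEY, MarketID INTEGER, Name TEXT);
CREATE TABLE Events        (ID INTEGER PRIMARY KEY, MarketID INTEGER, ContactID INTEGER, Title TEXT);
CREATE TABLE EventProducts (ID INTEGER PRIMARY KEY, EventID INTEGER, Product TEXT);
""")

# One denormalized CSV row, already parsed into a dict.
row = {"market": "Northeast", "contact": "Jane Doe",
       "event": "Spring Expo", "product": "Booth Rental"}

cur = db.cursor()

# 1. Import the market and capture the ID the database assigns.
cur.execute("INSERT INTO Markets (Name) VALUES (?)", (row["market"],))
market_id = cur.lastrowid

# 2. Import the contact, carrying the market ID along.
cur.execute("INSERT INTO Contacts (MarketID, Name) VALUES (?, ?)",
            (market_id, row["contact"]))
contact_id = cur.lastrowid

# 3. Import the event with both foreign keys.
cur.execute("INSERT INTO Events (MarketID, ContactID, Title) VALUES (?, ?, ?)",
            (market_id, contact_id, row["event"]))
event_id = cur.lastrowid

# 4. Import the event product with the event's key.
cur.execute("INSERT INTO EventProducts (EventID, Product) VALUES (?, ?)",
            (event_id, row["product"]))
db.commit()
```

A real script would also have to check whether the market (or contact, or event) already exists before inserting, since repeated CSV rows should map to the same ID rather than create duplicates.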

As you can see, this is probably really challenging. I've looked around a bit for decent tutorials or reading.

 

Sorry for the long-winded post; here are my three questions:

  1. I can't accurately estimate how long a project like this should take me. In your experience, how long does a project like this take?
     
  2. Am I going about this the right way? (Even using phpMyAdmin to import took a long time because of having to look up the appropriate IDs.)
     
  3. Do you know of any good resources, or books for this?

Thanks a million!

Edited by guinbRo

Your opening statement sounds like you have a single file of un-normalized data that you want to load into an empty normalized database, so I am answering from that perspective.

 

In this situation, I usually create a table matching the import file, with some added columns (in this case MarketID, ContactID, EventID, etc.). These added columns default to NULL. Then I import the CSV file so all of my data is in one place, and use a series of SQL statements run directly against the database (phpMyAdmin) to review and process the data.
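That staging setup might look like the following, sketched with Python's sqlite3 and the csv module (the column names are invented; in MySQL you would typically replace the Python loop with LOAD DATA INFILE):

```python
import csv
import io
import sqlite3

db = sqlite3.connect(":memory:")

# Staging table mirroring the CSV, plus NULL-defaulted ID columns
# to be back-filled after each normalized table is loaded.
db.execute("""
CREATE TABLE ImportData (
    MarketName  TEXT,
    ContactName TEXT,
    EventTitle  TEXT,
    Product     TEXT,
    MarketID    INTEGER DEFAULT NULL,
    ContactID   INTEGER DEFAULT NULL,
    EventID     INTEGER DEFAULT NULL
)""")

# Stand-in for the real 53-column CSV file.
csv_text = """MarketName,ContactName,EventTitle,Product
Northeast,Jane Doe,Spring Expo,Booth Rental
Northeast,Jane Doe,Spring Expo,Signage
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))
db.executemany(
    "INSERT INTO ImportData (MarketName, ContactName, EventTitle, Product) "
    "VALUES (?, ?, ?, ?)",
    [(r["MarketName"], r["ContactName"], r["EventTitle"], r["Product"])
     for r in rows])
db.commit()
```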

 

First, check the market data:

SELECT DISTINCT MarketName FROM ImportData

using whatever makes the market data unique. If the result looks reasonable and returns the expected number of rows, great. If there is a problem, fix the import data and re-import (or fix it in the ImportData table as well), then try again. I fix the source data (if possible) in case I have to drop the import table and start over.

 

Once I'm happy with the data, I insert it into the final table:

INSERT INTO MarketData (column, names, here) SELECT DISTINCT column, names, here FROM ImportData

Here is where the added columns come in. Once I have inserted the data, each market now has a unique ID from the AUTO_INCREMENT column of the final table, so I update the import table with this ID (we will need it to do the next level, Contacts).

UPDATE ImportData
JOIN MarketData ON ImportData.MarketName = MarketData.MarketName
SET ImportData.MarketID = MarketData.ID

 

Then move on to the next level -- Contacts.

 

As I do this, I save every SQL statement that is used, including the ones I use to review the data, along with copious comments on what it does and why. This way, if I have to start over, I can either copy/paste the SQL instead of creating it again; or I can put the conversion SQL in a file and execute the SQL script to do multiple steps quickly.

 

I would not bother to write a PHP script to handle this, unless there is a significant amount of conditional updates. "You can't automate a process that doesn't exist" --- I was told that by a guy I worked with/for many years ago, and it makes sense. Doing this manually in SQL will be faster than trying to write a process to do it, since there will be some trial and error. Once you have the process down, if it will have to be repeated --- because the file comes to you every month or whatever --- then you can automate it and you will know what you are trying to accomplish. Even then, I would likely build a single SQL script file to be run against each new file, rather than write a PHP script.
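Once the statements are saved, the whole conversion replays as one script. Here is a minimal end-to-end sketch of just the market step, using Python's sqlite3 as a stand-in for MySQL (sample data and names are invented; the UPDATE uses a correlated subquery, which is the portable spelling of MySQL's UPDATE ... JOIN):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Staging table already loaded from the CSV (two sample rows, names invented).
CREATE TABLE ImportData (MarketName TEXT, MarketID INTEGER DEFAULT NULL);
INSERT INTO ImportData (MarketName) VALUES ('Northeast'), ('Southwest');

-- Final normalized table.
CREATE TABLE MarketData (ID INTEGER PRIMARY KEY, MarketName TEXT);

-- Step 1: move the distinct markets into the final table.
INSERT INTO MarketData (MarketName)
    SELECT DISTINCT MarketName FROM ImportData;

-- Step 2: back-fill the staging rows with the auto-assigned IDs
-- (correlated subquery here; UPDATE ... JOIN in MySQL).
UPDATE ImportData
SET MarketID = (SELECT ID FROM MarketData
                WHERE MarketData.MarketName = ImportData.MarketName);
""")

# Sanity check: every staging row should now carry a market ID.
unmatched = db.execute(
    "SELECT COUNT(*) FROM ImportData WHERE MarketID IS NULL").fetchone()[0]
```

The same pattern then repeats for Contacts, Events, and Event Products, each level reading the IDs the previous level back-filled.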
