
Background:

I work on an application that tracks server information. The data comes from 10 tables in a database that are exact replicas of 10 CSV files, which are loaded daily into their corresponding tables. The database is nowhere near normalized; it is a straight dump from the CSVs into the db tables (the CSV headers match the table fields).

 

Problem:

I need to figure out a way to get the CSV data into a normalized database for the application to read, instead of a straight dump of the CSV data into the tables. I have little control over the CSVs, but I do have control over what happens with them afterwards. What would be the best way to get logical data into a normalized database from a set of CSVs?

 

Thanks!

What you describe is a common problem, typically encountered when building data marts/data warehouses, and it even has an industry-standard acronym: ETL (extract, transform, load).

 

Although there are ETL products available, if this is a small-scale project they may not be economically or logistically viable for you; still, at least you know they exist and can google and investigate a few. Keep in mind that one of the functions of an ETL tool is to fix/transform/conform data as it is loaded, and you already have all the data loaded into your RDBMS tables.

 

Depending on the database involved, ETL processes can be written in a number of ways. Typically people use stored procedures/triggers and/or procedural ETL scripts in the language of their choice. What you do depends very much on the quantity of data involved and how much work is needed to get it into the normalized structure you want.
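To make that concrete, here is a minimal, set-based sketch of the stored-procedure-style transform, driven from Python with sqlite3 purely as a stand-in engine. The table and column names (servers_raw, datacenters, servers) are hypothetical examples, not your actual schema:

```python
# A minimal sketch, assuming a hypothetical staging table
# servers_raw(hostname, datacenter_name) that mirrors one of the CSVs, and
# normalized tables datacenters(id, name) and servers(id, hostname, datacenter_id).
import sqlite3  # stand-in for whichever RDBMS/driver you actually use

conn = sqlite3.connect("servers.db")
cur = conn.cursor()

# 1. Add any datacenter names we have not seen before to the lookup table.
cur.execute("""
    INSERT INTO datacenters (name)
    SELECT DISTINCT r.datacenter_name
    FROM servers_raw r
    WHERE r.datacenter_name NOT IN (SELECT name FROM datacenters)
""")

# 2. Load servers, resolving each datacenter name to its surrogate key.
cur.execute("""
    INSERT INTO servers (hostname, datacenter_id)
    SELECT r.hostname, d.id
    FROM servers_raw r
    JOIN datacenters d ON d.name = r.datacenter_name
""")

conn.commit()
conn.close()
```

The same two statements could just as easily live inside a stored procedure in the database itself; the point is that the transform is expressed as set-based SQL against the staging tables you already have.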

 

For example, many years ago I worked on a project where we needed to load a large amount of hierarchical data into a relational database. We started by loading it into tables that matched the source data, and then wrote stored procedures that transformed the data and loaded it into our relational structure. Logically it worked, but it was far too slow, due to the transactional overhead in the database and the cost of looking up related tables as new rows were inserted.

 

We solved this by writing a Perl script that pulled the data into memory using Perl's associative arrays. We preloaded all the lookup tables into these arrays, did all the relating in Perl rather than via SQL queries, and saved the transformed rows to disk. We then bulk-copied all the data into the destination tables, which bypassed the need for transactions. A process that could not be completed in several weeks was accomplished with this Perl processing / save-to-file / bulk-copy approach and ran in several hours.
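The same idea in outline, again with the hypothetical servers_raw and datacenters tables from the earlier sketch, and sqlite3 only as a placeholder for whatever database and bulk loader you actually have:

```python
# A rough sketch of the "do the lookups in memory" approach.
import csv
import sqlite3

conn = sqlite3.connect("servers.db")
cur = conn.cursor()

# Preload the lookup table into a dictionary (the associative-array trick).
dc_ids = {name: dc_id for dc_id, name in cur.execute("SELECT id, name FROM datacenters")}

# Transform the raw rows entirely in memory and write them to a flat file.
with open("servers_transformed.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for hostname, dc_name in cur.execute("SELECT hostname, datacenter_name FROM servers_raw"):
        writer.writerow([hostname, dc_ids[dc_name]])

conn.close()

# servers_transformed.csv would then be loaded with the database's native bulk
# loader (LOAD DATA INFILE, COPY, bcp, etc.) instead of row-by-row INSERTs.
```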

 

I can only give you a general answer, but I would start with what the database can do natively (stored procs), with the caveat that the volume of data can be a limiting factor. You can also look at PHP, Ruby, Python, or whatever other scripting language you like, as most of them are well suited to this type of project and have similar features.

 

Obviously you need to build in scheduling for this activity, assuming it happens on a recurring basis.
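On most systems a cron entry is enough for that. For example (the script path and run time here are purely illustrative; schedule it for after your daily CSV load finishes):

```
# hypothetical crontab line: run the ETL script at 02:30 every day
30 2 * * * /usr/bin/python3 /path/to/etl.py >> /var/log/etl.log 2>&1
```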

 

That's about all the general advice I can think of, off the top of my head.
