I am taking an XML file and adding its values to a database. However, I only want to add the values that are new.

 

I am guessing that a script that compares the first xx chars of each XML value to the values in the database would work best.

 

The script would be run (in theory) once a day.

 

Example:

 

day1.xml

<fruits>

  <fruit>apple</fruit>

  <fruit>pear</fruit>

  <fruit>orange</fruit>

</fruits>

 

day2.xml

<fruits>

  <fruit>carrot</fruit>

  <fruit>apple</fruit>

  <fruit>pear</fruit>

  <fruit>grape</fruit>

  <fruit>orange</fruit>

</fruits>

 

Does anyone have any tips or code I could mooch off them?
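To make the goal concrete, here is a minimal sketch of the comparison using the two example files above. It is written in Python purely for illustration (the thread does not say which language the script uses), and `fruit_values` and `new_values` are hypothetical helper names. It assumes the values already stored are available as a list.

```python
import xml.etree.ElementTree as ET

DAY1 = """<fruits>
  <fruit>apple</fruit>
  <fruit>pear</fruit>
  <fruit>orange</fruit>
</fruits>"""

DAY2 = """<fruits>
  <fruit>carrot</fruit>
  <fruit>apple</fruit>
  <fruit>pear</fruit>
  <fruit>grape</fruit>
  <fruit>orange</fruit>
</fruits>"""


def fruit_values(xml_text):
    """Return the text of every <fruit> element, in document order."""
    root = ET.fromstring(xml_text)
    return [(el.text or "").strip() for el in root.iter("fruit")]


def new_values(already_stored, xml_text):
    """Return values from xml_text that are not in already_stored."""
    seen = set(already_stored)  # set lookup avoids rescanning a list per value
    fresh = []
    for value in fruit_values(xml_text):
        if value not in seen:
            seen.add(value)  # also skips repeats within the same file
            fresh.append(value)
    return fresh


stored = fruit_values(DAY1)
print(new_values(stored, DAY2))  # ['carrot', 'grape']
```

Only `carrot` and `grape` would be inserted on day 2; the three values already seen on day 1 are skipped.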

What is the total number of expected rows, to the nearest power of 10? 100, 1K, 10K, 100K, 1M? And once a value has been inserted, will it always stay the same, or can it be altered? If it is altered and the original is still in the XML file, should the original be inserted as a new value?

 

Help us out by filling in some of the blanks that you know about the data.

The # of values is always increasing (never decreasing) at a random rate. 5 a day, maybe 500 a day. Once the value is in the xml file, it will not be changed. However, it can be removed.

 

Right now, I have the code checking the complete value against that in the database, but that is pretty slow.

I would also store a SHA-1 hash of the value and the length of the value in the table. You could compute these when the query is executed, but if performance really matters, calculate them once when the data is inserted. Then you only need to compare the hash and length of the incoming data with what is in the table. If both the hash and the length match a row already in the table, there is a very high probability that it is an exact match and does not need to be inserted again.
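A rough sketch of the hash/length idea follows. Python and SQLite are assumptions made for illustration, not something the thread specifies, and the `items` table, its column names, and the helpers `fingerprint`, `open_store` and `insert_new` are all hypothetical. A unique index on (sha1, length) lets the database do the comparison, and a matching hash and length is treated as a duplicate without comparing the full value.

```python
import hashlib
import sqlite3


def fingerprint(value):
    """Return (sha1 hex digest, byte length) for a value."""
    data = value.encode("utf-8")
    return hashlib.sha1(data).hexdigest(), len(data)


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) items table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        " value TEXT NOT NULL,"
        " sha1 TEXT NOT NULL,"
        " length INTEGER NOT NULL,"
        " UNIQUE (sha1, length))"  # indexed lookup instead of full-value compare
    )
    return conn


def insert_new(conn, values):
    """Insert values whose (sha1, length) is not stored yet; return the count."""
    inserted = 0
    for value in values:
        digest, length = fingerprint(value)
        cur = conn.execute(
            "INSERT OR IGNORE INTO items (value, sha1, length) VALUES (?, ?, ?)",
            (value, digest, length),
        )
        inserted += cur.rowcount  # 1 if the row was added, 0 if it was ignored
    conn.commit()
    return inserted


conn = open_store()
day1 = ["apple", "pear", "orange"]
day2 = ["carrot", "apple", "pear", "grape", "orange"]
print(insert_new(conn, day1))  # 3
print(insert_new(conn, day2))  # 2
```

Because the hash and length are stored at insert time, each daily run only has to hash the incoming values; rows that disappear from the XML file are simply left alone in the table.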

 

Does each piece of data in the xml file contain any sort of date/time information that you could instead use to filter out things that have already been processed by storing the latest date/time each time you process the file?

 

No :(

 

Thank you for the hash/length idea, I'll work on that :)
