
Confirming data set equality


NotionCommotion


I have a MySQL database with the following tables:

 

distributed_databases

- id

- hash_or_timestamp

- bla

 

joined_rows

- id (int)

- distributed_databases_id (int)

- col_a (varchar(60))

- col_b (varchar(60))

- col_c (int)

 

 

I also have multiple SQLite databases which all include the following schema, and each will have between 0 and 300 records:

 

rows

- id (int)

- col_a (varchar(60))

- col_b (varchar(60))

- col_c (int)

 

The application with the SQLite databases will periodically poll the main server to ensure that they contain the same data sets.  How should I verify that they are the same?

 

My current thoughts are to either store a hash of the data sets or a timestamp whenever the MySQL data is changed.  Recommendations?  If a hash, how should it be performed?  Or maybe a totally different approach?
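In case it helps, here is roughly what I was picturing for the hash option. It is an untested sketch; the table and column names come from above, but everything else (the separator, the ordering, the use of MD5 and PDO) is just my assumption:

<?php
// Untested sketch: fingerprint one distributed data set by concatenating its rows
// in a fixed order and hashing the result.  I hash only col_a/col_b/col_c because
// the ids might not match between the master and the SQLite copies (assumption).

// Master side (MySQL).  GROUP_CONCAT is truncated at group_concat_max_len
// (1024 bytes by default), so raise it first for up to 300 rows.
function masterFingerprint(PDO $mysql, int $distributedDbId): string
{
    $mysql->exec('SET SESSION group_concat_max_len = 1000000');
    $stmt = $mysql->prepare(
        "SELECT MD5(GROUP_CONCAT(CONCAT_WS('|', col_a, col_b, col_c)
                                 ORDER BY col_a, col_b, col_c SEPARATOR '\n'))
         FROM joined_rows
         WHERE distributed_databases_id = ?"
    );
    $stmt->execute([$distributedDbId]);
    return (string) $stmt->fetchColumn();    // empty set => MD5(NULL) => NULL => ''
}

// Distributed side (SQLite): build the exact same string in PHP and hash it.
// Caveat: MySQL and SQLite may order strings differently (collation), which could
// cause false mismatches; a shared numeric sort key would avoid that.
function sqliteFingerprint(PDO $sqlite): string
{
    $rows = $sqlite->query(
        'SELECT col_a, col_b, col_c FROM rows ORDER BY col_a, col_b, col_c',
        PDO::FETCH_NUM
    )->fetchAll();

    if (!$rows) {
        return '';                            // mirror the empty-set case above
    }
    $lines = [];
    foreach ($rows as $row) {
        $lines[] = implode('|', $row);
    }
    return md5(implode("\n", $lines));
}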

 

Thanks


Timestamp. Versioning. Deltas. Stuff like that. Hashes can tell you that something changed but won't help you identify what it was.

 

I am okay with not knowing what was changed, only whether it was changed.  The only time I expect the data to be out of sync is if the network is temporarily disrupted, and I expect this to be fairly rare.  If they are out of sync, I will implement some script to find the differences.

 

What would you recommend to just tell if something was changed?


Oh, I see.  I don't think that will work, because that database also houses data outside of the data sets which I wish to confirm are the same as the remote data set, and this will cause the SQLite file date to change.

 

Let me explain what I think the workflow should be.  First, a bit of nomenclature.

  • "Master" refers to a server which will among other things, store the desired data in a MySQL database of multiple distributed databases.
  • "Distributed" refers to multiple computers which will use a SQLite database.

Okay, here goes...

  1. Master receives an Ajax request to change one of the distributed databases.
  2. Records in the master database are updated for the particular distributed database.  I was thinking of storing either a hash of the new data or a time stamp for that particular distributed DB at this point (see the sketch after this list).
  3. A command is sent out by the master to update the particular distributed database, either using Redis along with long polling or maybe sockets.  If the time stamp approach is taken, then this time stamp will also need to be sent.
  4. The particular distributed database hopefully gets the command, updates its database, and either hashes the new data and saves it, or saves the received time stamp.
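
For step 2, something like the following is what I have in mind (untested; masterFingerprint() is the helper from my first post, and pushToDistributed() is just a made-up stand-in for whatever transport step 3 ends up using):

<?php
// Untested sketch of step 2: apply the change and store the new fingerprint
// (hash or timestamp) in the same transaction, so the two can never disagree.
function applyChangeOnMaster(PDO $mysql, int $distributedDbId, array $newRows): void
{
    $mysql->beginTransaction();

    // Simplified: replace all rows for this distributed database.
    $mysql->prepare('DELETE FROM joined_rows WHERE distributed_databases_id = ?')
          ->execute([$distributedDbId]);
    $ins = $mysql->prepare(
        'INSERT INTO joined_rows (distributed_databases_id, col_a, col_b, col_c)
         VALUES (?, ?, ?, ?)'
    );
    foreach ($newRows as $row) {
        $ins->execute([$distributedDbId, $row['col_a'], $row['col_b'], $row['col_c']]);
    }

    // Store the fingerprint for this distributed DB alongside the data.
    $fingerprint = masterFingerprint($mysql, $distributedDbId);
    $mysql->prepare('UPDATE distributed_databases SET hash_or_timestamp = ? WHERE id = ?')
          ->execute([$fingerprint, $distributedDbId]);

    $mysql->commit();

    // Step 3: tell the distributed computer to apply the same change.
    // pushToDistributed($distributedDbId, $newRows, $fingerprint);   // hypothetical transport (Redis/sockets)
}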

There seem to be too many things that could happen which would get the master data set and one of the distributed data sets out of whack.

 

The distributed computers also periodically send data to the master.  To ensure the data sets are not out of sync for a long time, I was thinking of the following.

  1. When a distributed computer sends its data to the master, it also sends either the hash or the time stamp.
  2. When the master receives the data, it checks whether the hash or time stamp is valid.  If so, it replies okay.  If not, it sends what should be the distributed computer's data set (see the sketch after this list). If I am using Redis as a queue, I will need to somehow deal with cancelling the queue for the particular distributed computer, as the distributed data set will now be up to date.
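
For point 2, I am imagining the master-side check looking something like this (untested, and again using the masterFingerprint() helper from my first post):

<?php
// Untested sketch: the distributed computer reports its fingerprint along with its
// data; the master compares it against its own and replies accordingly.
function handleSyncCheck(PDO $mysql, int $distributedDbId, string $reportedFingerprint): array
{
    $expected = masterFingerprint($mysql, $distributedDbId);

    if (hash_equals($expected, $reportedFingerprint)) {
        return ['status' => 'ok'];
    }

    // Out of sync: reply with the full data set so the distributed copy can rebuild.
    // (If Redis is used as a queue, pending commands for this database would also
    // need to be cancelled here.)
    $stmt = $mysql->prepare(
        'SELECT col_a, col_b, col_c FROM joined_rows WHERE distributed_databases_id = ?'
    );
    $stmt->execute([$distributedDbId]);
    return ['status' => 'resync', 'rows' => $stmt->fetchAll(PDO::FETCH_ASSOC)];
}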

So assuming that my approach is not completely wrong, I need a way to concisely make a fingerprint of the data set.  As I see it, my options are either hashing the data or using microtime() to save when it was changed.  The hash seems (I don't know why) more appropriate; however, I don't know how hashing works if the input is large (say 60,000 bytes).
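
Regarding the large input, from what I have read the input size should not matter; the hash output has a fixed length either way, and PHP's hash extension can stream the rows through so the whole string never has to be built in memory. Something like this (untested):

<?php
// Untested: feed rows into the hash one at a time instead of building one big string.
function streamingFingerprint(PDO $db, string $sql): string
{
    $ctx = hash_init('sha256');
    foreach ($db->query($sql, PDO::FETCH_NUM) as $row) {
        hash_update($ctx, implode('|', $row) . "\n");
    }
    return hash_final($ctx);   // always 64 hex characters, however many rows went in
}

// e.g. streamingFingerprint($sqlite, 'SELECT col_a, col_b, col_c FROM rows ORDER BY col_a, col_b, col_c');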

 

Did I confuse you?  Hopefully, not too much. Thanks


So you have a MySQL database that contains a copy of all the data stored in SQLite databases which are themselves spread across multiple servers? And you have control over all these servers? Why such a disjointed approach? Why not have just the one database, or perhaps one writable database and multiple read-only replicas?

 

Anyway, this is basically replication so you can take a page from the MySQL playbook: a "version" identifier, such as a plain number. When master or slave has a change it increments the number locally and sends that number with the data to the other server. When the other server receives such a message, it verifies that the number is only one operation ahead of itself and then replays the data change. If the number doesn't match then the two servers are out of sync and you do a full sync (however you'd like to do that).
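
A rough sketch of what the receiving side could look like, assuming each server keeps its last applied version in a one-row table (all of the names here are made up for illustration):

<?php
// Sketch only: sync_state (one row, one "version" column) and applyChange() are
// hypothetical names used for illustration.
function receiveChange(PDO $db, int $messageVersion, array $change): string
{
    $local = (int) $db->query('SELECT version FROM sync_state')->fetchColumn();

    if ($messageVersion === $local + 1) {
        applyChange($db, $change);   // replay the change on this side (not shown)
        $db->prepare('UPDATE sync_state SET version = ?')->execute([$messageVersion]);
        return 'applied';
    }

    // Duplicate or missed message: the two servers are out of sync, do a full sync.
    return $messageVersion <= $local ? 'already-applied' : 'full-sync-needed';
}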


So you have a MySQL database that contains a copy of all the data stored in SQLite databases which are themselves spread across multiple servers?

Yes!

 

And you have control over all these servers?

Not directly.  I will give instructions to the people who have control over them, but they may make a mistake and not do so correctly.

 

Anyway, this is basically replication so you can take a page from the MySQL playbook: a "version" identifier, such as a plain number. When master or slave has a change it increments the number locally and sends that number with the data to the other server. When the other server receives such a message, it verifies that the number is only one operation ahead of itself and then replays the data change. If the number doesn't match then the two servers are out of sync and you do a full sync (however you'd like to do that).

After sleeping on it, I am sure it is impossible for a small hash to be unique for a large input.  As alternatives, I thought of either a strong random number (and not worrying about collisions) or a table in the master with an auto-increment key, using the incrementing number.

 

Your last recommendation is basically my second solution, but yours is more thought out, as it checks whether the incrementing number is only one operation ahead.  Where could I find more details about this "MySQL playbook"?

 

Thank you for your help.


Regarding detecting changes, a hash would be able to tell you something changed but without an indication of which side is newer. A time stamp could tell you which side is newer but relies on accurate in-sync time keeping at both ends. A counter would work for simple one-way replication.

 

If the slaves only ever need to download new data from the master and never upload changes then I'd just go with a counter. Have the master update a counter with each operation, then each time a slave is sync'd it'd store the current counter value. On the next sync, ask the master for changes made since the counter was at X value (or just do a full sync). For a MySQL master this could probably (just guessing here) be done relatively easily using binary logging and the mysqlbinlog tool, as suggested by Jacques1 in a previous thread.
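
Without the binary log, the same idea could be a plain change-log table (the names here are hypothetical): the master appends each operation with an auto-increment id, a slave remembers the last id it applied, and each sync asks for anything newer.

<?php
// Sketch only: change_log, operation and payload are hypothetical names.
function changesSince(PDO $mysql, int $distributedDbId, int $lastAppliedId): array
{
    $stmt = $mysql->prepare(
        'SELECT id, operation, payload
         FROM change_log
         WHERE distributed_databases_id = ? AND id > ?
         ORDER BY id'
    );
    $stmt->execute([$distributedDbId, $lastAppliedId]);
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
// The slave applies the returned operations in order and stores the highest id it saw.
// If old entries have been pruned (or the gap is huge), fall back to a full sync.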

 

If you need bi-directional syncing then things would become more complicated.

Edited by kicken

Thanks kicken,

 

One-way replication is the plan.

 

For the master, I don't understand where mysqlbinlog would come in.  Why wouldn't I use PHP to update some counter whenever it made a change to the DB?  I guess the problem with this approach is if the DB was changed directly.  To get around this, I can probably use a trigger.
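
Something like this is what I had in mind for the trigger (untested; it assumes hash_or_timestamp holds a plain number, and matching AFTER UPDATE and AFTER DELETE triggers would be needed as well):

<?php
// Untested sketch: bump the per-database counter whenever a row is inserted,
// even when the insert bypasses PHP entirely.  $mysql is a PDO connection to the master.
$mysql->exec("
    CREATE TRIGGER joined_rows_after_insert
    AFTER INSERT ON joined_rows
    FOR EACH ROW
        UPDATE distributed_databases
        SET hash_or_timestamp = hash_or_timestamp + 1
        WHERE id = NEW.distributed_databases_id
");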

