NotionCommotion Posted November 2, 2016
I have a MySQL database with the following tables:

distributed_databases
- id
- hash_or_timestamp
- bla

joined_rows
- id (int)
- distributed_databases_id (int)
- col_a (varchar(60))
- col_b (varchar(60))
- col_c (int)

I also have multiple SQLite databases which all include the following schema, and each will have between 0 and 300 records:

rows
- id (int)
- col_a (varchar(60))
- col_b (varchar(60))
- col_c (int)

The application with the SQLite databases will periodically poll the main server to ensure that they contain the same data sets. How should I verify that they are the same? My current thoughts are to store either a hash of the data sets or a timestamp whenever the MySQL data is changed. Recommendations? If a hash, how should it be performed? Or maybe a totally different approach? Thanks
requinix Posted November 2, 2016
Timestamp. Versioning. Deltas. Stuff like that. Hashes can tell you that something changed but won't help you identify what it was.
NotionCommotion Posted November 2, 2016
Timestamp. Versioning. Deltas. Stuff like that. Hashes can tell you that something changed but won't help you identify what it was.

I am okay with not knowing what was changed, only whether it was changed. The only time I expect the data to be out of sync is if the network is temporarily disrupted, and I expect this to be fairly rare. If they are out of sync, I will implement some script to find the differences. What would you recommend to just tell whether something was changed?
requinix Posted November 2, 2016
Modification time of the SQLite file?
NotionCommotion Posted November 2, 2016
Modification time of the SQLite file?

I don't understand. How would this work?
requinix Posted November 2, 2016
If you make changes to the database then that means changes to the underlying file, right? You can tell that there were changes if the timestamp changed.
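For illustration, a minimal PHP sketch of that check, under assumptions not stated in the thread (the file path and the stored mtime value are placeholders):

```php
<?php
// Hypothetical sketch of requinix's suggestion: compare the SQLite file's
// modification time against the value recorded at the last successful sync.
$dbFile = '/path/to/local.sqlite';   // placeholder path to the SQLite database file
$lastKnownMtime = 1478131200;        // example value saved at the previous sync

clearstatcache(true, $dbFile);       // avoid PHP's cached stat result in long-running scripts
if (filemtime($dbFile) !== $lastKnownMtime) {
    // the file was written to since the last sync, so something may have changed
}
```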
NotionCommotion Posted November 3, 2016
Oh, I see. I don't think that will work, because that database also houses data outside of the data sets I wish to confirm match the remote data set, and that will cause the SQLite file date to change.

Let me explain what I think the workflow should be. First, a bit on nomenclature: "master" refers to a server which will, among other things, store the desired data for multiple distributed databases in a MySQL database. "Distributed" refers to multiple computers which will each use a SQLite database. Okay, here goes...

- Master receives an Ajax request to change one of the distributed databases.
- Records in the master database are updated for that particular distributed database. I was thinking of storing either a hash of the new data or a timestamp at this time for the particular distributed DB.
- A command is sent out by the master to update the particular distributed database, either using Redis along with long polling or maybe sockets. If the timestamp approach is taken, then this timestamp will also need to be sent.
- The particular distributed database hopefully gets the command, updates its database, and either hashes the new data and saves it or saves the received timestamp.

There seem to be too many things that could happen which would get the master data set and one of the distributed data sets out of whack. The distributed computers also periodically send data to the master. To ensure the data sets are not out of sync for long, I was thinking of the following:

- When a distributed computer sends its data to the master, it also sends either the hash or the timestamp.
- When the master receives the data, it checks whether the hash or timestamp is valid. If so, it replies okay. If not, it sends what should be the distributed computer's data set.
- If I am using Redis as a queue, I will need to somehow deal with cancelling the queue for the particular distributed computer, as the distributed data set will now be up to date.

So assuming that my approach is not just completely wrong, I need a way to concisely make a fingerprint of the data set. As I see it, my options are either hashing the data or using microtime() to save when it was changed. The hash seems (I don't know why) more appropriate; however, I don't know how hashing works if the input is large (say 60,000 bytes). Did I confuse you? Hopefully not too much. Thanks
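On the question of how hashing works for a large input: PHP's incremental hashing functions process the data in chunks, so a 60,000-byte data set is not a problem. What follows is only a sketch, not an endorsed design; the table and column names come from the schema above, while the PDO connection, the separators, and the choice of sha256 are assumptions.

```php
<?php
// Hypothetical sketch: compute a deterministic fingerprint of one distributed
// database's data set on the master. The same idea works on the SQLite side
// against the rows table (without the distributed_databases_id filter).
function dataSetFingerprint(PDO $pdo, int $distributedDbId): string
{
    $stmt = $pdo->prepare(
        'SELECT id, col_a, col_b, col_c
           FROM joined_rows
          WHERE distributed_databases_id = ?
          ORDER BY id'                     // fixed ordering so both sides hash identical input
    );
    $stmt->execute([$distributedDbId]);

    $ctx = hash_init('sha256');            // incremental hashing: input size is irrelevant
    while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
        // unambiguous field and row separators so "ab"+"c" never collides with "a"+"bc"
        hash_update($ctx, implode("\x1f", $row) . "\x1e");
    }
    return hash_final($ctx);               // 64-character hex string to store and compare
}
```

Both sides would run the equivalent query and compare the resulting strings; as noted earlier in the thread, equal hashes mean the sets match, but a mismatch does not say what differs.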
requinix Posted November 3, 2016
So you have a MySQL database that contains a copy of all the data stored in SQLite databases which are themselves spread across multiple servers? And you have control over all these servers? Why such a disjointed approach? Why not have just the one database, or perhaps one writable database and multiple read-only replicas?

Anyway, this is basically replication, so you can take a page from the MySQL playbook: a "version" identifier, such as a plain number. When master or slave has a change, it increments the number locally and sends that number with the data to the other server. When the other server receives such a message, it verifies that the number is only one operation ahead of itself and then replays the data change. If the number doesn't match, then the two servers are out of sync and you do a full sync (however you'd like to do that).
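A rough sketch of the receiving side of that scheme, with the caveat that applyChange() and requestFullSync() are hypothetical placeholders rather than anything from this thread:

```php
<?php
// Hypothetical sketch of the "one operation ahead" check described above.
// $localVersion is the last version this server successfully applied.
function handleIncomingChange(int $incomingVersion, array $change, int $localVersion): int
{
    if ($incomingVersion === $localVersion + 1) {
        applyChange($change);        // placeholder: replay the single data change locally
        return $incomingVersion;     // now caught up with the sender
    }
    // Any other value means a message was missed, duplicated, or reordered:
    // the two sides are out of sync, so fall back to a full sync.
    requestFullSync();               // placeholder for whatever full-sync mechanism is chosen
    return $localVersion;
}
```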
NotionCommotion Posted November 3, 2016
So you have a MySQL database that contains a copy of all the data stored in SQLite databases which are themselves spread across multiple servers?

Yes!

And you have control over all these servers?

Not directly. I will give instructions to the people who have control over them, but they may make a mistake and not do so correctly.

Anyway, this is basically replication so you can take a page from the MySQL playbook: a "version" identifier, such as a plain number. When master or slave has a change it increments the number locally and sends that number with the data to the other server. When the other server receives such a message, it verifies that the number is only one operation ahead of itself and then replays the data change. If the number doesn't match then the two servers are out of sync and you do a full sync (however you'd like to do that).

After sleeping on it, I am sure it is impossible for a small hash to be unique for arbitrarily large input. As alternatives, I thought of a strong random number (and not worrying about collisions) or a table in the master with an auto-increment key, using the incrementing number. Your last recommendation is basically my second solution, but yours is more thought out, as it checks whether the incrementing number is only one operation ahead. Where could I find more details about this "MySQL playbook"? Thank you for your help.
kicken Posted November 3, 2016
Why do you need the local copies of the database rather than just interacting with the main database (either directly or via an API)?
NotionCommotion Posted November 3, 2016
Why do you need the local copies of the database rather than just interacting with the main database (either directly or via an API)?

Because they have a job to do collecting data regardless of whether the network is up or not. I probably shouldn't be using PHP, but I don't know anything else.
kicken Posted November 3, 2016
Regarding detecting changes: a hash would be able to tell you something changed, but without an indication of which side is newer. A timestamp could tell you which side is newer, but relies on accurate, in-sync timekeeping at both ends. A counter would work for simple one-way replication.

If the slaves only ever need to download new data from the master and never upload changes, then I'd just go with a counter. Have the master update a counter with each operation, then each time a slave is sync'd it'd store the current counter value. On the next sync, ask the master for changes made since the counter was at X value (or just do a full sync). For a MySQL master this could probably (just guessing here) be done relatively easily using binary logging and the mysqlbinlog tool, as suggested by Jacques1 in a previous thread.

If you need bi-directional syncing then things would become more complicated.
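A sketch of what the slave's periodic sync might look like under that counter scheme. The sync_state table, the master endpoint and its JSON shape, and applyChangeLocally() are all assumptions made for illustration, not part of kicken's suggestion:

```php
<?php
// Hypothetical sketch of a pull-based sync: the slave asks the master for
// changes made after its last known counter value, applies them, and records
// the master's current counter.
function syncFromMaster(PDO $sqlite, string $masterUrl): void
{
    $last = (int) $sqlite->query("SELECT value FROM sync_state WHERE name = 'counter'")
                         ->fetchColumn();

    // assumed endpoint response shape: {"counter": <int>, "changes": [...]}
    $response = json_decode(file_get_contents($masterUrl . '?since=' . $last), true);

    $sqlite->beginTransaction();
    foreach ($response['changes'] as $change) {
        applyChangeLocally($sqlite, $change);   // placeholder: replay each INSERT/UPDATE/DELETE
    }
    $sqlite->prepare("UPDATE sync_state SET value = ? WHERE name = 'counter'")
           ->execute([$response['counter']]);
    $sqlite->commit();
}
```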
NotionCommotion Posted November 3, 2016
Thanks kicken,

One-way replication is the plan. For the master, I don't understand where mysqlbinlog would come in. Why wouldn't I use PHP to update some counter whenever it makes a change to the DB? I guess the problem with this approach is if the DB is changed directly. To get around this, I can probably use a trigger.
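A sketch of how such a trigger might look on the master, assuming a hypothetical sync_counters table keyed by distributed_databases_id; only the UPDATE case is shown, and matching INSERT and DELETE triggers would be needed as well:

```php
<?php
// Hypothetical one-time setup: a MySQL trigger that bumps a per-database
// counter whenever joined_rows is updated, even if the change bypasses PHP.
$pdo = new PDO('mysql:host=localhost;dbname=master', 'user', 'password'); // placeholder credentials

$pdo->exec('
    CREATE TRIGGER joined_rows_bump_counter
    AFTER UPDATE ON joined_rows
    FOR EACH ROW
        UPDATE sync_counters
           SET counter = counter + 1
         WHERE distributed_databases_id = NEW.distributed_databases_id
');
// A DELETE trigger would reference OLD.distributed_databases_id instead of NEW.
```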
kicken Posted November 3, 2016
The binary log / mysqlbinlog is just one way to hook into MySQL and get notified of changes at a low level. You can certainly build your own system if you want.