Quickly Compare Data in Two Tables Using CHECKSUM and CHECKSUM_AGG

There are times when we need to compare two tables and figure out if the data matches. I often see a number of ways of doing this suggested, most are quite slow and inefficient. I’d quite like to share a quick and slightly dirty way of doing this using the CHECKSUM and CHECKSUM_AGG functions.

CHECKSUM()
Just a reminder that CHECKSUM() will generate a checksum for an entire row or selection of columns in the row.

CHECKSUM_AGG()
Will generate a checksum for a dataset.

Comparing Tables

With these two functions together it’s dead simple to compare the data in two tables. We need to create a checksum for the entire table, this can be done simply by first generating a checksum for each row and then using CHECKSUM_AGG() to give us an aggregated checksum for the table.

SELECT CHECKSUM_AGG(CHECKSUM(*)) 
FROM table_name

The above will return a checksum for all the data in a table, run it for two or more tables and where the checksums match, you know that the data in those tables matches.

CHECKSUM and CHECKSUM_AGG both generate their checksums through hashing, that means that there is a VERY small chance that two different datasets could produce the same checksum.

Let’s See It In Action

For some reason I’ve got two copies of the Databases tables from the Undercover Catalogue that I want to check match.

SELECT CHECKSUM_AGG(CHECKSUM(*)) 
FROM SQLUndercover.Catalogue.Databases 

811887081

SELECT CHECKSUM_AGG(CHECKSUM(*)) 
FROM SQLUndercover.Catalogue.Databases2

811887081

So from the above, we can see that the data in the two tables matches. Just for the sake of argument, let’s try deleting a row out of the table and see what happens to the checksum.

DELETE FROM SQLUndercover.Catalogue.Databases 
WHERE ServerName = 'DavidSQLTest01'

Now if we rerun the checksum code, we get…

175501893

The checksums are different so that means that the data is now also different.

So there you have a very quick way to see if the data in two tables matches.

Please do be aware of the issue with hash collisions that I mentioned above, although the risk is very low. I might do a little investigation into CHECKSUM and hash collisions in a future post and see how prolific they really are.

7 thoughts on “Quickly Compare Data in Two Tables Using CHECKSUM and CHECKSUM_AGG

Add yours

  1. The EXCEPT process takes less than 1 second on almost every table in my databases. So while it might, indeed, be “inefficient”, the inefficiency has no real-world cost, and does offer the advantage of precision. “Maybes” aren’t often very attractive. [SELECT…-EXCEPT-SELECT…] paired with [SELECT COUNT(*)…-EXCEPT-SELECT COUNT(*) ] is very fast and robust.

    Like

    1. The thing that I like about the CHECKSUM approach is that it can be used across servers. Handy if you’re checking an import or wasn’t too make sure that a development database matches prod. Like many things in SQL, there are a few ways that it can be accomplished, which is best depends on the situation.

      Like

  2. Nice article, but particularly CHECKSUM() generates VERY OFTEN the same hash for different input, so you can use it only for a first check. BINARY_CHECKSUM() is a little bit better and HASH_BYTES() much better (depending on the choosen hash function).

    In our database I created a persisted computed column with CHECKSUM over a few columns (that are relevant in comparsion) an put an index onto it. When someone wants to add data similar to the existing, it checks first, if there is an entry with the same checksum (by using the index) AND additional compares the relevant colums (INNER JOIN ON tbl1.chksum = tbl2.chksum AND tbl1.col1 = tbl2.col1 AND tbl2.col1 = tbl2.col2 …).

    Furthermore you should be aware, that CHECKSUM() is not case sensitive, so CHECKSUM(‘A’) returns the same value as CHECKSUM(‘a’), while BINARY_CHECKSUM() and HASH_BYTES() are case sensitive.

    Like

    1. A good point about the case sensitivity :). I’d always believed that hash collisions were more prevalent with BINARY_CHECKSUM. Perhaps an investigation into the three is needed šŸ™‚

      Like

  3. I’ve been burned by this approach. Making the same change to two rows can produce offsetting values, so the sum is the same. Say the CHECKSUM is positive for one row and negative for the other. Making the same change to column in both rows produces a net 0 change.

    Like

  4. I’ve been burned by this approach. Making the same change to two rows can produce offsetting values, so the sum is the same. Say the CHECKSUM is positive for one row and negative for the other. Making the same change to column in both rows produces a net 0 change.

    Like

Leave a comment

Create a website or blog at WordPress.com

Up ↑