During a recent DR scenario I came across a slight misinterpretation in how availability groups with replicas in synchronous commit behave.
How An Availability Group With A Synchronous Secondary Should Behave Under Normal Circumstances
Normally, if we’ve got a secondary replica in synchronous commit mode, our primary can’t commit any transactions until that transaction has been hardened to disk on all of the synchronous secondaries.
When you commit a transaction on the primary, SQL Server will send all changes over to it’s secondary nodes. It’ll then wait for confirmation from the synchronous nodes (it doesn’t bother waiting for those in async) that the changes have been hardened before it can commit locally.
This ensures that those synchronous secondaries are always up to date and will mean that in the event of a fail over, there is no risk of data loss. Any delay in the secondary writing down those changes will mean a delay on committing on the primary (you’ll see this in the form of HADR_SYNC_COMMIT waits on the primary).
When Is Synchronous Not Synchronous?
The primary replica is constantly monitoring the state of it’s secondaries. With the use of a continuous ping, the primary node always knows if the secondaries are up or down.
It’s when SQL detects that one of it’s synchronous replicas goes offline is when interesting things can happen.
So here’s the discussion that came up, if a synchronous replica goes offline for whatever reason, SQL won’t be able to commit any transactions and that means we can be confident that the secondary is up to date, right?
Well, not quite. Let’s try running the following code to simply update a table with a timestamp every second.
TRUNCATE TABLE SyncTest WHILE 1=1 BEGIN INSERT INTO SyncTest VALUES (GETDATE()) WAITFOR DELAY '00:00:01' END
After a short while I’m going to take down the AG’s synchronous replica.
Let’s have a look at what’s in the that table.
Interesting, we can see that there was a delay when the synchronous secondary went offline as we’d expect. But then after a short while, it looks like SQL Server decided to continue committing transactions. Is that right?
And here’s where the misinterpretation came in. Availability groups have a setting, session-timeout. This timeout controls how long a primary replica will wait to receive a ping from a secondary (by default, this is 10 seconds). Once that timeout period is exceeded, the primary will no longer wait for that replica when committing a transaction, effectively switching it to asynchronous commit.
This is the reason why, after 10 seconds of the secondary going down we start to see transactions being committed again.
So the moral here is to be careful.
If you find yourself in a DR situation where you’re contemplating forcing fail over to a secondary replica don’t immediately assume that, just because that secondary was in synchronous commit it’ll be up to date. Think about the nature of the failure and what state things are (and were) in.
If there’s a chance that the primary could have been accepting transactions while unable to communicate with the synchronous secondary you have the potential for data loss in the event of a forced fail over. If possible, in this situation you should investigate the possibility of data loss, checking the last hardened LSN is a good way to get an idea if a fail over would be safe or not.