Here’s a little issue that I came across last week and I thought I’d share my investigations with you.
Users started to report that a database was inaccessible and when I had a look, sure enough the database was in ‘recovery pending’ state.
That’s odd, a database usually only goes into ‘recovery pending’ state at start up. There hadn’t been a restart of the server so perhaps the database had Auto Close switched on, but no that wasn’t the case here either.
What Does the Error Log Say?
Time to take a look in the error log and see if we can figure out what’s going on here. There were a number of interesting entries that came up.
2017-06-09 10:34:19.91 spid258 DBCC TRACEON 3604, server process ID (SPID) 258. This is an informational message only; no user action is required.
2017-06-09 10:34:19.91 spid258 DBCC TRACEON 3213, server process ID (SPID) 258. This is an informational message only; no user action is required.
Now we use Minion Backup from Minionware and these entries indicate that there was a backup happening at the time (Minion sets those trace flags on when doing a backup). A quick check in the Minion log tables confirmed that there was indeed a transaction log backup of the affected database taken around the time that the database went into recovery pending. Surely just a coincidence?
What’s next up in the log?
2017-06-09 10:34:19.95 spid27s The transaction log for database ‘SQLUndercover’ is full due to ‘ACTIVE_TRANSACTION’.
Ah ok, the transaction log’s filled up, an easy one then. But hold on, there’s a ton of space on the disk and the transaction log for that database has no max size set… Is there something more to this?
2017-06-09 10:34:19.95 spid27s During undoing of a logged operation in database ‘SQLUndercover’, an error occurred at log record ID (4125:488:6). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
hmmmm, now this isn’t looking quite so good now. SQL is trying to roll back a transaction but for some reason or another hasn’t been able to.
017-06-09 10:34:19.95 spid27s Database SQLUndercover was shutdown due to error 3314 in routine ‘XdesRMReadWrite::RollbackToLsn’. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.
Now, earlier we were saying that a database shouldn’t go into recovery pending unless it was starting up, but as far as we knew, our database had never been shut down. Well here we are, SQL telling us that it’s really not a happy bunny and wants to try restarting our database.
2017-06-09 10:34:19.95 spid27s During undoing of a logged operation in database ‘SQLUndercover’, an error occurred at log record ID (4125:488:2). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
2017-06-09 10:34:19.95 spid27s The log for database ‘SQLUndercover’ is not available. Check the event log for related error messages. Resolve any errors and restart the database.
2017-06-09 10:34:19.95 spid27s During undoing of a logged operation in database ‘SQLUndercover’, an error occurred at log record ID (4125:488:1). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
SQL is still having trouble rolling back whatever this transaction is but now it’s telling us that the log isn’t available. This is starting to look sinister…
2017-06-09 10:34:20.63 spid391s Starting up database ‘SQLUndercover’.
Ok well here we go, c’mon SQL you can start this database, we’ve got faith in you!!!
2017-06-09 10:34:20.70 spid391s The transaction log for database ‘SQLUndercover’ is full due to ‘ACTIVE_TRANSACTION’.
But, SQL Server??? We know that transaction log isn’t full, what’s the matter with you???
2017-06-09 10:34:20.70 spid391s During undoing of a logged operation in database ‘SQLUndercover’, an error occurred at log record ID (4125:488:6). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup, or repair the database.
Still, even after a restart SQL Server is having trouble rolling back whatever this transaction is.
2017-06-09 10:34:24.71 spid1712s Database ‘SQLUndercover’ cannot be opened due to inaccessible files or insufficient memory or disk space. See the SQL Server errorlog for details.
There really seems to be something amiss with the files, once again SQL Server is having opening the transaction log because I can only assume that for some reason it seems to think that the file is full. Which of course we know isn’t the case.
2017-06-09 10:34:20.71 Backup BACKUP failed to complete the command BACKUP LOG L-VSQL06. Check the backup application log for detailed messages.
And if we were in any doubt what so ever that there was a transaction log backup going on at the time, here’s the proof.
So How Did I Recover
I’m now stuck with a database in recovery pending and as we can see from above this was caused by a serious error that occurred during a transaction log backup. Luckily for me, this database is part of an availability group which has a synchronous secondary.
Looking at the secondary node, I could see that the database was in a ‘not synchronising’ state and crossed my fingers that a simple failover would be enough to get me out of this mess. Unfortunately, things weren’t quite that simple as after the failover the AG refused to synchronise. This was soon fixed however by dropping the database out of the AG and reseeding it across the secondaries.
But what if your database isn’t in an AG? Carry on reading…
Let’s Look into Things in a Little More Detail
I wasn’t happy just leaving it at that and wanted to know what really had happened. Where to start? Well one thing that we do know is the LSNs of the transactions that SQL Server was trying to roll back. I wondering if that’s still in the transaction log…
There’s a lovely little undocumented function that’ll let you peer inside the active portion of the transaction log, fn_dblog. Remus Rusanu does a far better job of explaining the function in his blog than I can.
Let’s have a look at the LSNs that SQL Server had trouble with,
Lets plug those into a query and see what we get back….
SELECT [Current LSN], operation, [Transaction ID], [Begin Time],[Transaction Name] FROM fn_dblog('4125:488:1','4125:488:2')
This gives us a few interesting bits of information, firstly look at the begin time, that corresponds nicely with the time that we started to see our issues. Secondly we’ve got the transaction ID, we could plug that into a query and get back all the rows in the log for those particular transactions.
NOTE: If the LSNs returned by the query above look different from the ones we plugged in, that because the function takes the LSN as a decimal but returns it as hex.
SELECT [Current LSN], operation, [Transaction ID],[Page ID],[RowLog Contents 0],[RowLog Contents 1] FROM fn_dblog(NULL,NULL) WHERE [Transaction ID] IN ('0000:000104e6','0000:000104e7')
Now here’s some interesting information, we can see that the transaction starts but for some reason it’s never committed, that ties in with what we were seeing in the SQL’s logs. But what we can now see is what it was actually trying to do, we can see that it was trying to modify a row and we can also see what it was trying to modify. The Page ID of 1:9 points to page 9 on file 1, that’s the Boot Page. The Boot Page is an internal structure that you’ll always find on page 9 of file 1 in every database and it holds all sorts of interesting information configuration information for database including the LSN of the last backup. Notice the numbers in RowLogContents1, this is the data the the operation wanted in insert. Have a little closer look at the numbers and you’ll notice that it matches that first LSN.
It would seem that the transaction that never committed was trying to update the Boot Page with the first LSN after the backup just taken.
So How To Recovery If You’re Not In An Availability Group
Please, please, please make sure that you’re happy that there’s no data loss before you go forward. This next step is just based on my observations and what I believe to be the best way to recover.
Remember right back at the start when I said that I could see in the Minion logs that a transaction log backup had taken? Well, Minion reported this as a successful backup and despite what SQL Server’s telling us in it’s log, there is indeed a file for this backup.
Surely this file can’t be valid? But when running a RESTORE WITH VERIFYONLY it did come back OK. I tried restoring the backup up and the database came back without a problem. If you find yourself in this state and you also have a backup file despite SQL Server’s claims that the backup failed, you may well find that you can restore back to that last transaction log.
So while I don’t know 100% what the cause of the problem was, it would appear that the transaction log became corrupt either after marking the inactive portion of the log or during the update of the boot page.
If anyone out there has seen this before or can shed any more light on what but have been the cause, I’d love to hear from you.
Hi David, this is an interesting issue to have for sure. I’m curious what version of SQL you were running and whether you had “database level health detection” enabled for the AG(SQL 2016 and greater). If so, was automatic fail over turned on or off? Just wondering why the AG didn’t fail over to the secondary node.
Thanks for sharing!
LikeLiked by 1 person
That box is a 2012 server so no database level health detection I’m afraid. It was a strange one which I never did get to the bottom of the cause or manage to replicate elsewhere.