Many of us have to manage databases that require High Availability / Disaster Recovery. And we know about the methods and the benefits of each method, e.g. Transactional Replication, Availability Groups etc., and how to implement and make good use of them. However, the very nature of availability groups means that there are multiple copies of our precious data, with only one primary writable version.
Yes, it's the HA/DR mechanism's job to ensure this data is kept up to date, and it does a damn fine job of doing so, as long as we (the implementers) don't make a hash of designing the environment. By that I mean:
Ensuring we choose the correct synchronisation mode for the location of our servers, i.e. not choosing somewhere so remote that even migrating waterfowl wouldn't attempt to reach it.
Not being Scrooge McDuck on the specification of the secondary servers, leaving them unable to keep up with the redo load.
Either of these could bring us pain and suffering in keeping our additional copies up to date. And if we didn't realise we had latency, and then hit a situation where we actually need those additional copies? Ouch, it doesn't bear thinking about, but I hope that your CV is up to date...
Yes, we have the power to monitor, and also estimate data loss, RTO, RPO etc. And those lovely people at MS even give us procedures for Availability Group use.
There is an easier way, in my opinion (which you may already have in place without realising it), and it applies to Transactional Replication as well.
In fly the Canary tables.
Think of the canaries, taken in cages down the mines a while back. Those poor little feathered friends were the early warning system for gas detection. They'd get a whiff of gas, react horribly to it, turn up their beaks and plummet from their perch to meet their maker. This warned the miners of harmful gases in the area, thus avoiding asphyxiation and explosion risks.
This method is far nicer to our feathered friends, thankfully: it is simply logging a row of data, on a set schedule, to a table that is covered by the HA/DR mechanism. As mentioned, if you already have continual (i.e. no breaks in the day/week) inserts into a table that record the datetime of insertion, then you already have something that can be tarred and feathered for this purpose.
Create a Table.
Either in one of your databases that requires HA, or even in a separate database in the same Availability Group:
CREATE TABLE tbl_tweety (LastUpdate datetime2)
Create a SQL Agent Job.
This job inserts/updates the single row within the table:
SET NOCOUNT ON;

IF EXISTS (SELECT 1 FROM tbl_tweety)
BEGIN
    -- The canary row already exists; just refresh the timestamp
    UPDATE tbl_tweety SET LastUpdate = GETDATE();
END
ELSE
BEGIN
    -- First run: seed the table with its single row
    INSERT INTO tbl_tweety (LastUpdate)
    VALUES (GETDATE());
END
Run this job on a schedule that suits your needs, e.g. every 1 minute.
Obviously, in the case of Availability Groups, create the job on all replicas, but only run its full logic on the primary, so add code to detect this as necessary. I won't include that here, as I'm a bad old puddy cat and I'm going to write a separate blog about a method I've created to do this.
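In the meantime, as a hedged stop-gap sketch (and not the method mentioned above), the built-in sys.fn_hadr_is_primary_replica function (available from SQL Server 2014 onwards) can gate the job step; 'CanaryDB' here is a placeholder name for the database holding the canary table:

```sql
-- Sketch only: skip the update unless this replica is currently
-- primary for the canary database ('CanaryDB' is a placeholder).
IF sys.fn_hadr_is_primary_replica('CanaryDB') = 1
BEGIN
    UPDATE tbl_tweety SET LastUpdate = GETDATE();
END
-- On a secondary the function returns 0 (or NULL if the database
-- isn't in an AG), so the job runs harmlessly and does nothing.
```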
Add the table to the HA mechanism.
In the case of A.G.'s, the table must be in a database within the A.G. that you want to monitor. If you can't update the DB schema, then a new database in the same A.G. containing purely this table will suffice.
In the case of replication, add the table as an article contained within the publication.
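As a hedged sketch of that step, using the standard sp_addarticle procedure and assuming an existing transactional publication with the placeholder name 'MyPublication':

```sql
-- Sketch only: add the canary table as an article to an existing
-- transactional publication ('MyPublication' is a placeholder).
EXEC sp_addarticle
    @publication   = N'MyPublication',
    @article       = N'tbl_tweety',
    @source_object = N'tbl_tweety',
    @type          = N'logbased';
```

Remember that, depending on your setup, you may need to refresh the snapshot (or otherwise initialise the new article) before the subscribers start receiving it.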
So now, you have a table auto updating with a time value, and this is being synchronised (hopefully!) to the replicas/subscribers.
Great, now what?
Look at the data; it's as simple as that. The value gives you the last point at which the table data was updated, whether that is your existing table with timestamped continual inserts or the new one above. No calculations, no concerns, just: what is the latest time on my secondary vs. what time is it now?
A.G. secondaries will need to be read-only accessible (even if just for a brief moment) to grab the value, but it ultimately gives a simple view of how far behind the data on the secondary is. Due to this, Basic Availability Groups aren't suitable for this technique, unfortunately.
Replication subscribers have no such limitation, so just querying tbl_tweety gives you a meaningful indication of how far your data is behind.
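As a minimal sketch, the check to run on a secondary or subscriber (assuming the tbl_tweety table above) is just:

```sql
-- How far behind is this copy? Compare the canary timestamp
-- with the current time on the secondary/subscriber.
SELECT LastUpdate,
       DATEDIFF(SECOND, LastUpdate, GETDATE()) AS SecondsBehind
FROM tbl_tweety;
```

Bear in mind this assumes the clocks on primary and secondary are reasonably in sync; a few seconds of clock drift is usually noise next to the lag you actually care about.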
At this point, to simulate the canary taking a perch tumble (and subsequently the Boss entering the building), put an alert on this data. This can be done in any way that suits your environment: another job firing an alert sent to an operator, or 3rd-party tools reading this data and alerting. If the data is older than your RPO allows (or near it), then someone needs to know about it. Preferably before Sylvester the cat rips through your servers and sends you your P45 (Pink slip).
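One possible shape for such a watchdog, as a hedged sketch with an assumed 15-minute threshold (adjust to your own RPO); the RAISERROR ... WITH LOG writes to the SQL Server error log, which a SQL Agent alert can then pick up and route to an operator:

```sql
-- Sketch only: raise an error if the canary row is older than the
-- threshold, i.e. redo/replication may be lagging. Threshold of 15
-- minutes is an assumption; tune it to your RPO.
DECLARE @ThresholdMinutes int = 15;

IF EXISTS (
    SELECT 1 FROM tbl_tweety
    WHERE LastUpdate < DATEADD(MINUTE, -@ThresholdMinutes, GETDATE())
)
BEGIN
    RAISERROR ('Canary table is older than %d minutes - synchronisation may be lagging.',
               16, 1, @ThresholdMinutes) WITH LOG;
END
```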
Plainly put, if there is a synchronisation/replication bottleneck for whatever reason, our table data won't yet have been updated with the latest date, and the date it does show gives a pretty good representation of how much data you could lose.
Picture the conversation with your handlers...
You: "Boss! Our main payroll SQL database has a Redo Queue size of 204456828KB, and its redo rate is only 9540KB/sec, giving us an estimated recovery time of 21431 seconds!"
...
Boss: "What?... In English please?"
vs...
You: "Boss! Our main payroll SQL database's DR servers are 6 hours behind; the last update was 2:06am this morning. If we lost the main server, we've lost all that data."
...
Boss:"Oh ****!"
(The next line from the Boss may just be "Fix it!", but hey-ho, you get my point. )
It's wonderfully simple, I like simple, it matches my brain.
Thanks for reading.
Tweety Rod.