One of the things I'm most interested in is ensuring that we monitor the systems we're looking after. Whether that be SQL Server, Windows Server, Oracle (*shiver*), the vending machine that houses our coffee of supreme tolerance, whatever... it doesn't matter. If we are to manage and support a system, we need to know what's going on. And there's nothing worse than not being able to supply an RCA for something that happened previously, or overnight when no one was around.
Like a Doctor with their patients: if the patient can't talk or report symptoms, diagnosis becomes far more difficult.
Well, maybe not the Doctor from Cannonball Run 1, but we won't go there.
I've spent time as a Database Reliability Engineer (DRE) in the recent past, and although what that entails is slightly different from what many DBAs do, one of the key drivers in that area is Observability, and this set of posts leans heavily on those experiences.
I'm not going to go into a deep dive here of exactly what Observability is, and there are several different definitions... but isn't it just a rebrand of "Monitoring" dreamt up by a consultative thinktank charging eye-watering rates? Well, maybe... but on a broader scale.
Think of an end user, using a web browser, that goes to an application server, which then has a database backend, which sits on an ESX host, which sits on a SAN, with all the network strings in between. That's a lot of moving parts.
You then know the routine...
User screams, tickets raised, tickets bounced around, user screams, major incident declared, user's Director screams, teams either dismiss the issue as No Fault Found or focus troubleshooting as their system(s) "aren't looking normal, but can't quite put their finger on it". Incident Managers (only doing their job, I'll add) receive threats of violence for pestering stressed techies for updates every 15 minutes for comms that have to go out.
...and all while this is going on, the end user is sitting with a poorly bit of software getting grief from someone, somewhere, and the company could be losing money, reputation or both.
What if it was easier to see the downstream impact of issues? i.e. maybe there was a noisy neighbour on the same ESX host as the SQL Server? So it looks like the SQL box is processing slowly, as the same amount of requests are going to it, but more requests are timing out. The finger points at SQL, when in fact it's the underlying host subsystem. Sneaky sneakster.
VERY basically put, to the point of doing it an injustice, observability is the extension of monitoring over all of these separate areas to provide a more joined-up view of the application world, from user to end bit of data. It allows an inference of what is occurring in an application ecosystem from various inputs, to lead us all to the correct conclusion and resolution, faster. And when done well, if your DeLorean is fuelled up ready, sometimes long before the poor user is angered and the undeserving Incident Manager is threatened with bodily assault.
I know what you're saying... "But hang on there spinach chin, how is that valid for just SQL Server?"
For the bigger picture it's not; applications need to be designed to cater for observability, to gain metrics at points before and after our little relational page thrower has done its bit. However, similar principles may be applied to SQL Server monitoring, as a lot of you are probably already doing parts of this when you're troubleshooting SQL Server.
I'll mention it again: Observability needs the application to cater for it as key functionality... metrics captured for how long different steps take, i.e. rendering web pages, fetching data from the database, etc.
Sound familiar? SSRS gave us metrics like this many years ago in its ExecutionLog views. Lovely they were too.
To make matters worse, as DBAs, we can't often say "All queries have to be within 1ms, 1s etc.", as we may have servers with a mix of OLTP, Reporting and heaven knows what else. And all those servers have different load characteristics. One size very rarely fits all, we can't easily cater for every single system without some clever machine learning involved on the thresholds.
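You don't necessarily need clever machine learning to get past the one-size-fits-all problem, though. A simple statistical baseline derived from each server's own history goes a surprisingly long way. Here's a minimal sketch of the idea; the sample latencies and the three-sigma multiplier are purely illustrative assumptions, not recommendations:

```python
import statistics

def baseline_threshold(samples, sigmas=3):
    """Derive an alert threshold from a server's own history:
    mean plus a few standard deviations, rather than one fixed
    number shared by OLTP and reporting boxes alike."""
    return statistics.mean(samples) + sigmas * statistics.stdev(samples)

# Hypothetical recent query latencies (ms) for two very different servers
oltp_latency_ms = [10, 12, 11, 13, 10, 11, 12, 11]
reporting_latency_ms = [900, 1100, 950, 1050, 1000, 980, 1020, 1000]

# Each server gets its own threshold, sized to its own workload
print(round(baseline_threshold(oltp_latency_ms), 1))
print(round(baseline_threshold(reporting_latency_ms), 1))
```

The OLTP box ends up with a threshold around 14ms, the reporting box around 1,180ms, each alerting only when it departs from *its* normal, rather than some blanket figure.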
So what can we do? Well it Depends... (there... I said it. I'm glad too.)
So what is our Quest then? O' face of Gollum.
Simply put: try to monitor in a structured way, similar to other subsystems, covering what we can and what's critical, in a way that lets us easily see related information.
How often have you looked at some performance data for SQL Server, and inferred (it's that word again) that the problem lies not with the subsystem you're looking at, but somewhere else? A couple of quick examples:
IO based waits and latency - driven by buffer starvation.
CPU pressure - driven by Spinlocks
We often put 2 and 2 together following experience of the performance pain, but to be better at this, we need data to look at. It's basic, but we're inferring nonetheless.
Expand that same structure between systems in an application ecosystem (Web, App, DB, VM, SAN, etc.) and you can begin to see how powerful this can become in reducing inter-team brawls. i.e. say the disk latency is elevated at the SAN; this then feeds into IO latency at the VM, and into SQL... A metric of interest at one level can be traced through, allowing inference of why something is behaving the way it is due to downstream performance.
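To make that inference concrete: if you collect the same kind of metric at each layer, even a crude correlation check can point the finger at the right subsystem. A minimal sketch, with entirely made-up per-minute samples (the series names and numbers are just for illustration):

```python
from statistics import mean

def pearson(xs, ys):
    """Simple Pearson correlation: how closely two metric series move together."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical latency samples (ms) at two layers of the same stack
san_latency = [5, 5, 6, 14, 15, 16, 6, 5]       # SAN disk latency spikes mid-window
sql_io_latency = [8, 9, 9, 25, 27, 29, 10, 9]   # SQL IO waits follow the same shape

print(round(pearson(san_latency, sql_io_latency), 2))
```

A correlation close to 1 across layers is a strong hint that SQL's IO pain is arriving from below, rather than being home-grown.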
SQL has soooo much cool performance data available, through DMVs and DMFs, OS performance counters, logs, etc... with a bit of thought around the collection and visualisation, we can observe exactly what we're interested in and play our part in the chain.
So let's try and put some structure to our SQL Observability, using what the SRE/DRE world often treats as the key metrics, to get us started in the right direction.
Is SQL accessible? - Availability
How long is the connection/query taking? - Latency
How much work is my SQL Server currently shovelling? - Throughput
How close to capacity is my Instance/DB getting before performance / availability issues? - Saturation
Is my server generally happy with life? - Errors
The Golden Signals.
The Eagles! the Eagles are Coming!
The astute amongst you, or those still awake and reading, will notice 5 key words above. 4 of these are often known as The Golden Signals. Or maybe 'rings' if you're based in Mordor, or the Dudley Metropolitan Borough, as they're twinned.
Then we have Availability. Although technically not a Golden Signal... who doesn't want to know when their Server/DB isn't up? So this is included as well, and it is often the most crucial.
The downside is that nobody ever seems to spill the beans on how these relate to SQL Server, and all of the precious metrics we do have at our disposal.
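For what it's worth, here's one way I'd sketch that relationship. The signal-to-source mapping below is my own suggestion, not an official one, and the DMVs and counters named are just candidate starting points rather than a definitive list:

```python
# One possible mapping from Golden Signals (plus Availability) to
# SQL Server data sources. These are illustrative choices only.
golden_signals = {
    "Availability": "successful connection test / sys.databases state_desc",
    "Latency":      "query durations from sys.dm_exec_requests / Query Store",
    "Throughput":   "'Batch Requests/sec' in sys.dm_os_performance_counters",
    "Saturation":   "runnable tasks per scheduler in sys.dm_os_schedulers",
    "Errors":       "high-severity entries in the SQL Server error log",
}

for signal, source in golden_signals.items():
    print(f"{signal}: {source}")
```

Each signal gets at least one concrete, collectable source, which is the structure the rest of this series will hang off.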
Different companies may implement some, all, or more than those listed, and some are most definitely more appropriate to other systems, e.g. web servers. These are just what I've used at companies in the past (I've used SLATE previously), although on a personal level I always prefer ales, repeatedly, from a cold pint glass, until I've forgotten that SQL Server exists.
Management from Mordor and Stakeholders for Saruman.
You know what? Management and Stakeholders won't give a cranky hobbit about the above, and I'm with them completely and understand why. They often want a fast breakdown of the above, minus the chat that goes with it, and stakeholders may want similar information to confirm that their warm SLA blanket is wrapped around tightly. They may pay for it, so they are entitled to exactly that.
What's the availability of my server over period x > y? I want this as a percentage, e.g. 99.95% in the last month. Microsoft have SLAs, so they know how long something was down for.
How fast was it responding during period a > b? I want this as a % above/below a baseline average.
What capacity was my server/database at over period c > d? Do I need to upgrade?
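The good news is that those asks boil down to simple arithmetic once the raw downtime and latency numbers are being collected. A minimal sketch (all the sample figures are invented for illustration):

```python
def availability_pct(total_minutes, downtime_minutes):
    """Availability over a period, as management wants it: one percentage."""
    return 100 * (total_minutes - downtime_minutes) / total_minutes

def latency_vs_baseline_pct(current_ms, baseline_ms):
    """Positive means slower than baseline, negative means faster."""
    return 100 * (current_ms - baseline_ms) / baseline_ms

# Hypothetical month: 30 days, with 21.6 minutes of recorded downtime
print(round(availability_pct(30 * 24 * 60, 21.6), 2))   # 99.95

# Hypothetical latency: baseline average 120ms, this period averaged 150ms
print(round(latency_vs_baseline_pct(150, 120), 1))      # 25.0 (% slower)
```

The hard part isn't the maths; it's reliably capturing the downtime minutes and the baseline in the first place, which is what the rest of this series is about.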
A lot of this lovely data is supplied in cloud offerings one way or another, but what about us poor souls with estates that are On Prem?
There are some awesome monitoring tools out there, both free and licensed, and they are getting better and better with every release. But unfortunately, they still seem a little short on giving a one-page view of things like the above in a way that translates well to management.
Mr/Mrs/Ms/Miss/Mx/M Vendor(s), please start to chew your data up to give us some Golden Signal KPIs. As if you're ever going to read this bottom-of-the-shoe gibberish blog....
It may be possible to dig into those tools' data repositories and chew on the data to get the above, but that generally isn't supported, so please don't do it without express permission.
All this stems from us observing what's going on inside SQL Server, presenting it, and making it available for the Great All-Seeing Observability Eye up there.
In the next couple of posts in this mini series, I'll be going through the Golden Signals and options to create/structure/enhance our monitoring to follow them, and supply some code/processes that can help us with that. Even those with quality monitoring systems in place may find something useful to bolt on.
Rod, son of William (of the Shire)