This is the second post in the series "SQL Observation for the Nation".
Part one can be found here: https://sqlrodbloke.wixsite.com/sqlrod/post/sql-observation-for-the-nation-part-1-the-fellowship-of-the-all-seeing.
Ok, so now we know that monitoring is a good thing, observability is a buzzword for think tanks to earn a meal ticket, and the Dark Lord Sauron can't get good database support. Those pesky orcsies, you just can't trust 'em.
Mentioned in Part one were the Golden Signals from the SRE/DRE world that are used as key metrics when monitoring systems. We then embellished Availability onto this as well, as the Boss generally wants to know when something is up, or more importantly, NOT up. That will be covered in Part 3 alongside Latency.
That leaves us Errors, Saturation and Throughput for discussion and ways we can relate these to the performance data we have at our disposal in SQL Server, and how we can collect this.
Two of these are straightforward, Traffic (or Throughput) and Errors, so I'll start with those.
So let's look at some of this. Scripts as well, did you say? Lovely, we like scripts.
Gimme, Gimme, Gimli...
..a script after midnight...(Sorry)
Traffic
This one is straightforward. "How much traffic is my system serving?" i.e. what's the throughput?
OK, great - we have performance metrics that handle that, don't we? Yes, indeed we do, but hold thine horses again...we also have lots of different subsystems within our joyous SQL Server engine that don't relate to each other, but can all be monitored for Traffic/Throughput. The main one IS the most obvious one though.
SQL - Batch Requests/second
SELECT cntr_value FROM master.sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';
This will just collect the batch total, and then collect it again on the next run, so you can simply save it to a table with a DateCollected datetime2 column, work out the delta between the two runs, and then calculate the per-second value. Continue reading for a really simple method to do just that.
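As a minimal sketch of that (the Deltas section further down covers the general pattern, and the table name tbl_BatchRequests here is just an assumption for illustration):

-- Hypothetical logging table for the cumulative counter value
CREATE TABLE [MyMetricsDB].[dbo].[tbl_BatchRequests]
(cntr_value bigint, DateCollected datetime2)

-- Run on a schedule (e.g. every minute) to persist the cumulative total
INSERT INTO [MyMetricsDB].[dbo].[tbl_BatchRequests] (cntr_value, DateCollected)
SELECT cntr_value, GETDATE()
FROM master.sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec';

-- Delta between collections divided by the seconds between them = batches per second
SELECT DateCollected,
       (cntr_value - LAG(cntr_value) OVER (ORDER BY DateCollected)) * 1.0
       / NULLIF(DATEDIFF(SECOND, LAG(DateCollected) OVER (ORDER BY DateCollected), DateCollected), 0) AS BatchRequestsPerSec
FROM [MyMetricsDB].[dbo].[tbl_BatchRequests];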
That is the main throughput/traffic counter in SQL Server. If you want to take it further, then come hither you magnificent people, there's more.
What is available for other subsystems?
IO
Operations, or how often am I having a bite of lembas bread?
Splendid, we have this down to the database file level. The below aggregates that up to the database level. Obviously, you may want to keep it at the file level if you have different files for different structures, on different drives too.
SELECT
DB_NAME(fs.database_id) AS DatabaseName
, mf.type_desc
, SUM(fs.num_of_reads) AS TotalReads
, SUM(fs.num_of_writes) AS TotalWrites
, SUM(fs.num_of_reads + fs.num_of_writes) AS [TotalIO]
, GETDATE() AS CollectionDate
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON fs.database_id = mf.database_id
AND fs.[file_id] = mf.[file_id]
GROUP BY DB_NAME(fs.database_id), mf.type_desc
ORDER BY DatabaseName, type_desc DESC
Assume that we are just logging our data to tables with a timestamp so that we persist what was going on at any one collection point. So I've run the above every minute, simply inserting the data into our table, in this case [MyMetricsDB].[dbo].[tbl_IODBLatency].
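For reference, the collection step is just the query above wrapped in an INSERT; the table name is the one mentioned above, but the column types in this sketch are assumptions:

-- Hypothetical target table for the per-database IO operation counts
CREATE TABLE [MyMetricsDB].[dbo].[tbl_IODBLatency]
(DatabaseName sysname, type_desc nvarchar(60), TotalReads bigint, TotalWrites bigint, TotalIO bigint, CollectionDate datetime2)

-- Scheduled every minute: insert the cumulative figures with a timestamp
INSERT INTO [MyMetricsDB].[dbo].[tbl_IODBLatency]
(DatabaseName, type_desc, TotalReads, TotalWrites, TotalIO, CollectionDate)
SELECT
DB_NAME(fs.database_id)
, mf.type_desc
, SUM(fs.num_of_reads)
, SUM(fs.num_of_writes)
, SUM(fs.num_of_reads + fs.num_of_writes)
, GETDATE()
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON fs.database_id = mf.database_id
AND fs.[file_id] = mf.[file_id]
GROUP BY DB_NAME(fs.database_id), mf.type_desc;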
Querying this gives us our IOPS as it was at each collection point. The astute amongst you will know and see that this is also cumulative, so again we will need to perform some delta calculations to see the throughput between collections. More on that below.
Throughput
Throughput, or how much lembas bread am I actually eating with each bite? (or how big is your gob?)
Exactly the same as above, only separated here for ease of output.
SELECT
DB_NAME(fs.database_id) AS DatabaseName
, mf.type_desc
, SUM(fs.num_of_bytes_read) AS TotalReadBytes
, SUM(fs.num_of_bytes_written) AS TotalWriteBytes
, SUM(fs.num_of_bytes_read + fs.num_of_bytes_written) AS [TotalIOBytes]
, GETDATE() AS CollectionDate
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS fs
INNER JOIN sys.master_files AS mf WITH (NOLOCK)
ON fs.database_id = mf.database_id
AND fs.[file_id] = mf.[file_id]
GROUP BY DB_NAME(fs.database_id), mf.type_desc
ORDER BY DatabaseName, type_desc DESC
And in this case, I'll log it to [MyMetricsDB].[dbo].[tbl_IODBThroughput].
This is another collection that needs to have a delta calculation performed on it. Great, soon, I promise.
Deltas
Ok, I yield! Deltas... can't get sooner than that.
Turns out this is simple these days, as we have the LAG and LEAD functions in our weapons cabinet. I'm not going to go into the syntax, just have a quick google for that, but they are perfect for our needs.
Reading our table of the persisted IO operations data - [MyMetricsDB].[dbo].[tbl_IODBLatency]
SELECT [DatabaseName]
,[type_desc]
,[TotalReads]
, LAG([TotalReads]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalReadsPrevious]
,[TotalWrites]
, LAG([TotalWrites]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalWritesPrevious]
,[TotalIO]
, LAG([TotalIO]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalIOPrevious]
,[CollectionDate]
FROM [MyMetricsDB].[dbo].[tbl_IODBLatency]
WHERE DatabaseName = 'tempdb'
ORDER BY type_desc DESC, CollectionDate
Using LAG gives the previous row's data on the same row as the current...check out the diagonal colour matched fields.
Quick calculation, subtracting previous from current and removing the columns we no longer need.
SELECT [DatabaseName]
,[type_desc]
,[TotalReads] - LAG([TotalReads]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalReadsDelta]
,[TotalWrites] - LAG([TotalWrites]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalWritesDelta]
,[TotalIO] - LAG([TotalIO]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalIODelta]
,[CollectionDate]
FROM [MyMetricsDB].[dbo].[tbl_IODBLatency]
WHERE DatabaseName = 'tempdb'
ORDER BY type_desc DESC, CollectionDate
We get our Delta values.... not a chore at all. At this point you can use this in a view, a transformation to other tables, whatever you want to do.
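If the view route appeals, a minimal sketch (the view name here is just a suggestion) could be:

-- Run in MyMetricsDB: a hypothetical view exposing the delta values for all databases
CREATE VIEW [dbo].[vw_IODBDeltas]
AS
SELECT [DatabaseName]
,[type_desc]
,[TotalReads] - LAG([TotalReads]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalReadsDelta]
,[TotalWrites] - LAG([TotalWrites]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalWritesDelta]
,[TotalIO] - LAG([TotalIO]) OVER (PARTITION BY DatabaseName, type_desc ORDER BY CollectionDate) AS [TotalIODelta]
,[CollectionDate]
FROM [dbo].[tbl_IODBLatency];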
Great, we have some decent Traffic / Throughput metrics really easily. That's us legging it out of Bree without paying the bar tab.
Errors
Out there, many different systems/applications often structure logs in the same way, something like:
Datetime - Error Level (INFO, WARNING, ERROR etc) - Message.
For example, IIS will log a 500 in a similar format.
Unfortunately, SQL Server isn't one of those applications, so it's not as apparent. Thanks Mistress Microsoft of Mordor for that one.... so what do we have, and what can we do?
Well, we have the SQL ErrorLog
..oh, and a structured Severity Level, and the ability to raise any custom errors that we like in code, assigning the severity and message to them. Ooh, I wonder if there are any Airbnbs in Mordor? I like the sound of the place again.
The ErrorLog is our starting point. Although this is for the entire server, and issues shown in here quite often have no influence on our databases, it's a fine place to look.
Let's just query it and have a look for anything with the word 'Error:' in the last minute, as that's how Severity errors are logged.
CREATE TABLE SQLErrorLog (LogDate datetime2, ProcessInfo nvarchar(12), LogText nvarchar(4000), ErrorLevel varchar(15))
DECLARE @logFileType SMALLINT = 1;
DECLARE @start DATETIME = DATEADD(MINUTE, -1, GETDATE());
DECLARE @end DATETIME = GETDATE();
DECLARE @logno INT = 0;
DECLARE @searchString1 NVARCHAR(256) = 'Error:';
DECLARE @searchString2 NVARCHAR(256) = NULL;
INSERT INTO SQLErrorLog (LogDate, ProcessInfo, LogText)
EXEC master.dbo.xp_readerrorlog @logno, @logFileType, @searchString1, @searchString2, @start, @end;
--Now let's add some standardized context to the severity lines
UPDATE SQLErrorLog SET ErrorLevel =
CASE WHEN LogText LIKE '%Severity: 11%' THEN 'WARNING'
WHEN LogText LIKE '%Severity: 12%' THEN 'WARNING'
WHEN LogText LIKE '%Severity: 13%' THEN 'WARNING'
WHEN LogText LIKE '%Severity: 14%' THEN 'WARNING'
WHEN LogText LIKE '%Severity: 15%' THEN 'WARNING'
WHEN LogText LIKE '%Severity: 16%' THEN 'WARNING'
WHEN LogText LIKE '%Severity: 17%' THEN 'ERROR'
WHEN LogText LIKE '%Severity: 18%' THEN 'ERROR'
WHEN LogText LIKE '%Severity: 19%' THEN 'ERROR'
WHEN LogText LIKE '%Severity: 20%' THEN 'ERROR'
WHEN LogText LIKE '%Severity: 21%' THEN 'ERROR'
WHEN LogText LIKE '%Severity: 22%' THEN 'ERROR'
WHEN LogText LIKE '%Severity: 23%' THEN 'ERROR'
WHEN LogText LIKE '%Severity: 24%' THEN 'ERROR'
WHEN LogText LIKE '%Severity: 25%' THEN 'ERROR'
END
SELECT ErrorLevel, COUNT(ErrorLevel) AS EventCount FROM SQLErrorLog
GROUP BY ErrorLevel
And... there we go. (Don't worry, this was an example, my server isn't ablaze.)
This is just reporting a count, but you now have the interesting data too, so you can chop it up however you like.
There are other methods that can be used including...
SQL Agent Job Failures
Error_Reported Extended Events (a minimal sketch of this follows below)
Perfmon Errors counters (SQL Errors: Errors/sec)
Query and log these to give you a count of what you're particularly interested in and what is an actual error in your environment.
Even just grabbing the errors above from the error log gives you a nice net for catching issues as a starting point.
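For the Error_Reported Extended Events option in the list above, a minimal sketch might look like this (the session name and file target are just placeholders - adjust the severity filter and actions to taste):

-- Capture reported errors of severity 11 and above to an event file
CREATE EVENT SESSION [ErrorsReported] ON SERVER
ADD EVENT sqlserver.error_reported
(
    ACTION (sqlserver.sql_text, sqlserver.database_name, sqlserver.client_app_name)
    WHERE ([severity] >= 11)
)
ADD TARGET package0.event_file
(SET filename = N'ErrorsReported.xel', max_file_size = (50))
WITH (STARTUP_STATE = ON);

ALTER EVENT SESSION [ErrorsReported] ON SERVER STATE = START;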
Here's an option if your company has control over the application code, of course: configure application errors to log to standard table names, and just collect those as well. That way, what the application considers an error is reported along with the database metrics in the same format.
You can of course log errors to the SQL Error Log with RAISERROR ... WITH LOG, and the process above would sweep them up, but that requires greater privileges (ALTER TRACE or sysadmin) to do, so it's not really recommended.
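For completeness, the shape of that call is simply this (the message text and number here are made up):

-- WITH LOG writes the message to the SQL Error Log as well as returning it to the client;
-- it needs ALTER TRACE permission (or sysadmin), hence the caveat above
RAISERROR ('MyApp: order import failed for batch %d', 16, 1, 42) WITH LOG;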
But take heed!
We can capture loads of errors, but are they actually impacting anything?
When apps are designed with Observability in mind, everything that's an error gets a proper error severity and message assigned. That often isn't the case though, so this can be a bit of a best endeavours exercise; so again, start slowly, and add slowly if needed.
Nothing wrong with slow...I like slow, it embodies my thought process perfectly.
Great, the quest has started well, we're out of the Shire, over the Brandy Bridge and knocking on the door of Moria already. What's next? Ah yes, Saturation.
Saturation
So, what is Saturation?
Any system can only do so much at any one time, primarily limited by the resources that it has at its disposal. When these resources are depleted, the ability to service the sustained level of work required reduces.
Saturation is a high-level overview of the system's utilisation, so what we always care about is: how much headroom do we have?
Quite often, a system's performance will begin to struggle before 100% saturation, so it's always useful to work out the point at which performance starts to drop off.
CPU is the easy one: 0% - I'm asleep; 100% - I'm on fire, please assist. Around 80-85% and my gout is starting to act up. That's the point mentioned directly above, the start of the pain, and your system's point will vary from what I've stated. I just have a low threshold for pain.
Things like disk are a little different; although the underlying SAN may struggle at points when near full capacity, we rarely see this transferred to the tiny bit of disk we have chopped up for our system's use.
Threadpool is another, we know that SQL's performance will drop off a proverbial cliff when our sacred pool is exhausted.
Another way to think about this is when our queues start to build, we are getting more saturated. Hot-diggity, we have a load of those!
Waiting Tasks
Pending IO
Pending Network (although I've never seen this build up)
And our lovely schedulers! We can't forget those bad boys.
Grab your CPU from the system health ring buffer.
DECLARE @ts_now bigint = (SELECT cpu_ticks / (cpu_ticks / ms_ticks) FROM sys.dm_os_sys_info);
SELECT TOP(1) SQLProcessUtilization AS [SQL Server Process CPU Utilization],
       SystemIdle AS [System Idle Process],
       100 - SystemIdle - SQLProcessUtilization AS [Other Process CPU Utilization],
       DATEADD(ms, -1 * (@ts_now - [timestamp]), GETDATE()) AS [CollectionDate]
FROM (
      SELECT record.value('(./Record/@id)[1]', 'int') AS record_id,
             record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS [SystemIdle],
             record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS [SQLProcessUtilization],
             [timestamp]
      FROM (
            SELECT [timestamp], CONVERT(xml, record) AS [record]
            FROM sys.dm_os_ring_buffers WITH (NOLOCK)
            WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
              AND record LIKE N'%<SystemHealth>%') AS x
     ) AS y
ORDER BY record_id DESC;
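If you want to persist that and flag the 80-85% pain point mentioned earlier, a rough sketch (the table name and thresholds are just assumptions) could be:

-- Hypothetical table for the snapshots produced by the ring buffer query above
CREATE TABLE [MyMetricsDB].[dbo].[tbl_CPUUtilisation]
(SQLProcessUtilization int, SystemIdle int, OtherProcessUtilization int, CollectionDate datetime2)

-- After inserting each snapshot, flag anything over your chosen saturation thresholds
SELECT CollectionDate,
       SQLProcessUtilization,
       CASE WHEN SQLProcessUtilization >= 85 THEN 'SATURATED'
            WHEN SQLProcessUtilization >= 80 THEN 'WARNING'
            ELSE 'OK' END AS CPUStatus
FROM [MyMetricsDB].[dbo].[tbl_CPUUtilisation]
ORDER BY CollectionDate DESC;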
CPU done. Simples.
Other saturation values, and these AREN'T cumulative you'll be glad to know, so no Delta fiddling required.
"One Query to Bind them all..."
SELECT
    (SELECT COUNT(*) FROM sys.dm_os_waiting_tasks wt
     INNER JOIN sys.dm_exec_sessions es ON wt.session_id = es.session_id
     WHERE es.is_user_process = 1
       AND wt.wait_type NOT IN ('SP_SERVER_DIAGNOSTICS_SLEEP', 'WAITFOR')) AS WaitingTaskCount,
    (SELECT COUNT(io_type) FROM sys.dm_io_pending_io_requests WHERE io_type = 'network') AS PendingNetworkRequest,
    (SELECT COUNT(io_type) FROM sys.dm_io_pending_io_requests WHERE io_type = 'disk') AS PendingDiskRequest,
    (SELECT max_workers_count FROM sys.dm_os_sys_info) AS TotalThreads,
    SUM(active_workers_count) AS CurrentThreads,
    SUM(active_workers_count) * 1.0 / (SELECT max_workers_count FROM sys.dm_os_sys_info) AS ThreadSaturation,
    SUM(runnable_tasks_count) AS RunnableTasks,
    COUNT(runnable_tasks_count) AS SchedulersCount,
    SUM(runnable_tasks_count) * 1.0 / COUNT(runnable_tasks_count) AS CPUQueueSaturation,
    GETUTCDATE() AS CollectionDateUTC
FROM sys.dm_os_schedulers
WHERE status = 'VISIBLE ONLINE'
And would you look at that lot, a handful of good indicators for potential pain.
The CPUQueueSaturation is just a calculated value to give us a comparative number between servers with different numbers of schedulers. High bad, low good...CPU strong...strong like bull.
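To put numbers on it: 8 runnable tasks spread across 4 schedulers gives a CPUQueueSaturation of 2.0, while the same 8 tasks spread across 16 schedulers gives 0.5, so boxes of different sizes can be compared directly.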
And there we are, 3 of the signals covered already; we've rounded the corner of Moria, sped through Rohan and are now bearing down on Sauron's penthouse apartment ready to tear up the joint.
In the final part, we'll cover Availability and Latency, including a method to collect and serve that availability data up for management-friendly consumption.
Thanks for reading.
ElRod of Rivendell