Performance Metrics for SQL Server 2014

We all crave routine in our lives, from the time we are babies through adulthood. Routine and familiarity. This extends to how we do performance monitoring and troubleshooting. Everyone has their favorite set of tools to help get the job done, tools that have a familiar routine involved to make them work best for us.

Over the years I’ve added and modified the tools and performance metrics I use for SQL Server. I am certain everyone does this as their skills develop over time. But another reason to modify your tools and metrics is the release of a new version of SQL Server. With each new version of SQL Server I like to review what I consider to be the “right” metrics I want to use in addition to all of the standard metrics you can find lying all over the internet.

I often see beginner (or ‘accidental’) database administrators look to the counters provided by Perfmon (AKA, the Resource and Performance Monitor). The trouble I find with Perfmon is that the counters themselves don’t always give you enough details to make an informed decision about what actions you should take next. You need to look at a handful of the counters at any given time (and, most importantly, over a period of time) in order to better ascertain where the bottlenecks exist. Perfmon has no trending history by default, and neither do Dynamic Management Views (DMVs). You need to either build this functionality yourself or purchase a vendor product (and I can recommend one for you).
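
If you want to build that trending history yourself, here is a minimal sketch of the idea. The table dbo.CounterHistory and the short counter list are my own placeholders, not anything SQL Server ships with; pick the counters that matter to you and schedule the INSERT with a SQL Agent job:

-- Hypothetical table for storing counter samples over time
CREATE TABLE dbo.CounterHistory (
    capture_time  DATETIME2(0)  NOT NULL,
    [object_name] NVARCHAR(128) NOT NULL,
    counter_name  NVARCHAR(128) NOT NULL,
    instance_name NVARCHAR(128) NULL,
    cntr_value    BIGINT        NOT NULL
);

-- Sample the counters you care about; run this on a schedule (e.g., every five minutes)
-- RTRIM trims the fixed-width padding on the nchar columns in the DMV
INSERT INTO dbo.CounterHistory (capture_time, [object_name], counter_name, instance_name, cntr_value)
SELECT SYSDATETIME(), RTRIM([object_name]), RTRIM(counter_name), RTRIM(instance_name), cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN ('Batch Requests/sec', 'SQL Compilations/sec', 'Page splits/sec');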

Now, I’ve written before about how spending an hour of your time being proactive can save you 150 hours of reactive time. You simply can’t be proactive without some type of baselining and trend analysis. But how do you know if you are collecting the right metrics for the right version?

Here is my list of metrics that I currently find myself using to troubleshoot SQL Server performance issues. They help me get insight quickly into some of the deep recesses of SQL Server so that I can easily corroborate with some standard Perfmon counters and DMVs in order to troubleshoot issues for customers and clients. You should incorporate these into whatever monitoring solution you are using for trend analysis. By looking at these metrics in the first few minutes of troubleshooting (and then comparing them to your baseline) you will save time.

And often, time to resolution is all that you get measured by.

Before I get to the list of performance metrics for SQL Server 2014, a quick disclaimer:

Script disclaimer, for people who need to be told this sort of thing:

DISCLAIMER: Do not run code you find on the internet in your production environment without testing it first. Do not use this code if your vision becomes blurred. Seek medical attention if this code runs longer than four hours. On rare occasions this code has been known to cause one or more of the following: nausea, headaches, high blood pressure, popcorn cravings, and the impulse to reformat tabs into spaces. If this code causes your servers to smoke, seek shelter. Do not taunt this code.

So, here’s the list, in no particular order. I’ve put all the code snippets into one long script at the end of this post. You can also download the full script here. The code should work on SQL Server 2012 and SQL Server 2014 instances, let me know if you find any issues. The percentages and numbers used in the scripts below are meant to serve as a guide only; just because I say “100” and your system shows “101” doesn’t mean the sky is falling. Baseline and trend analysis is key here. Monitor frequently, look for spikes, and above all Don’t Panic.

1. Signal Waits Percentage

This metric has provided the most value to me over the years. I started using it over six years ago when troubleshooting CPU issues. The instance itself was showing signs of “slowness”, but the overall CPU utilization numbers were nominal. I came across this article by Tom Davidson and found that I could calculate the percentage of signal waits to the overall waits in order to understand whether internal CPU pressure exists. [A “signal” wait means that a session is in the “runnable” queue, waiting for a signal from an available processor.]

I filter out a handful of background wait types for this query as well. The value I like to see returned here is no more than 20-25% (see above for my comments about the numbers I have chosen) of the total waits being signal waits. Numbers greater than 20% can indicate internal CPU pressure. I can remedy the situation by reducing the number of sessions (not always likely), increasing the number of available CPUs (also not likely), or reducing the amount of time the queries need to execute (often very likely, and sometimes easily done).

SELECT (100.0 * SUM(signal_wait_time_ms)/SUM (wait_time_ms)) AS [SignalWaitPct]
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT',
'BROKER_TO_FLUSH', 'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT',
'DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT',
'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
'BROKER_RECEIVE_WAITFOR', 'ONDEMAND_TASK_QUEUE', 'DBMIRROR_EVENTS_QUEUE',
'DBMIRRORING_CMD', 'BROKER_TRANSMITTER', 'SQLTRACE_WAIT_ENTRIES',
'SLEEP_BPOOL_FLUSH', 'SP_SERVER_DIAGNOSTICS_SLEEP', 'HADR_FILESTREAM_IOMGR_IOCOMPLETION')
AND wait_time_ms <> 0

2. SQL Compilation Percentage

Another sneaky performance issue is having a high number of statements being compiled. There happens to be a performance counter for compiles but the counter itself does not have much meaning unless you also know the overall number of batch requests. Rather than look at either of those counters separately what I like to do is look at their ratio. In fact, you’ll find a handful of items here are measured as a percentage of batch requests.

The percentage of compilations I like to see is roughly 10% of the total number of batch requests (see above for my comments about the numbers I have chosen). The following code will show you the current percentage of compilations to the number of batch requests:

SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec')
AS [SQLCompilationPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'SQL Compilations/sec'

At this point someone usually asks “what about re-compilations?” Well, re-compilations should only be roughly 1% (see above for my comments about the numbers I have chosen) of the total number of batch requests, so that code looks like this:

SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec')
AS [SQLReCompilationPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'SQL Re-Compilations/sec'

3. Page Lookups Percentage

What I like most about this metric is that it often gives me a hint regarding plan cache bloating without me having to dig through the plan cache itself. It’s possible I have a handful of queries that have lots of logical I/O (not necessarily a problem, mind you), or I may have lots of little queries (probably ad-hoc) which could lead to plan cache bloat.

This metric also gives me insight on the queries hitting the instance, to see if they could use some tuning/indexing as there are likely some inefficient query plans being utilized.

What I want to see here is a value that is less than 100 (see above for my comments about the numbers I have chosen) on average, but it really depends upon the nature of your instance:

SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec')
AS [PageLookupPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page lookups/sec'

4. Page Splits Percentage

By themselves page splits are not necessarily bad. But a lot of them happening at once could indicate trouble. Monitoring for just the number of page splits by itself isn’t very reliable, as the counter includes any new page allocations as well as page splits due to fragmentation. So I like to compare the number of page splits to the number of batch requests. The number I look for here is roughly 20 page splits/sec for every 100 batch requests (see above for my comments about the numbers I have chosen).

This code gives me the page splits percentage:

SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec') 
AS [PageSplitPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page splits/sec'

5. Average Task Counts

I got this metric from Glenn Berry (blog | @GlennAlanBerry), and I use this to determine what my system has been busy doing. I like how Glenn always has a brief description for what you want (or don’t want) to see. In this case it is as follows:

  • High Avg Task Counts (>10) are often caused by blocking or other resource contention
  • High Avg Runnable Task Counts (>1) are a good sign of CPU pressure
  • High Avg Pending DiskIO Counts (>1) are a sign of disk pressure

I can use this one in conjunction with others to get some corroborating evidence as to what is happening under the covers:

SELECT AVG(current_tasks_count) AS [Avg Task Count],
AVG(runnable_tasks_count) AS [Avg Runnable Task Count],
AVG(pending_disk_io_count) AS [Avg Pending DiskIO Count]
FROM sys.dm_os_schedulers WITH (NOLOCK)
WHERE scheduler_id < 255 OPTION (RECOMPILE)

6. Buffer Pool I/O Rate

The most misunderstood counter I see is Page Life Expectancy (PLE). I know I used to pay attention to it as if it meant something. Over time I have come to realize that it is a throwback to days of yore, when having 4GB of memory was a really big deal. In other words, its intended purpose no longer has the same value as it once did. It is not totally worthless, however. You just need to know what to do with it.

What I want to see is the rate at which data pages are being cycled through the buffer pool. The following code returns the "Buffer Pool Rate" in MB/sec. I usually look for rates around 20MB/sec as a baseline. Why that number? Because if I have a "typical" server with 56GB of RAM available for the buffer pool and I want to keep my pages around for an hour or so (3600 seconds), then I come up with 56000MB/3600sec, or about 15.5 MB/sec. You can do your own math here to arrive at your own recommended throughput, but that's why I look for a sustained rate of about 20 on average. If I have a spike upward from there then I know I am having memory pressure (pressure that might otherwise fail to be seen if I only examine the PLE counter).
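
If you want to do that math against your own server rather than my 56GB example, a quick sketch is below. It uses committed_target_kb from sys.dm_os_sys_info, which is the memory manager's target for the whole instance (not just the buffer pool), so treat the result as a rough ballpark rather than an exact target:

-- Rough target rate: target instance memory (MB) divided by desired page lifetime (seconds)
-- committed_target_kb covers all memory manager consumers, so this slightly
-- overstates the buffer pool itself; good enough for a ballpark target
SELECT (committed_target_kb / 1024.0) / 3600 AS [TargetBufferPoolRateMBsec]
FROM sys.dm_os_sys_info;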

Here is the code I use:

SELECT (1.0*cntr_value/128) /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name like '%Buffer Manager%'
AND lower(counter_name) = 'Page life expectancy')
AS [BufferPoolRate]
FROM sys.dm_os_performance_counters
WHERE object_name like '%Buffer Manager%'
AND counter_name = 'Database pages' -- 'Total pages' was removed in SQL Server 2012; 'Database pages' works on 2012 and 2014

You should also be aware that if your server is NUMA aware then you will want to take that into consideration whenever you try to use PLE as a performance metric.
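
One way to check, since the instance-level PLE is an average across nodes and a single starved node can hide in the aggregate, is to look at PLE per buffer node. These counters exist on SQL Server 2012 and 2014:

-- Page life expectancy per NUMA node; the Buffer Node object reports one row per node
SELECT instance_name AS [NUMA Node], cntr_value AS [PLE (sec)]
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Buffer Node%'
AND counter_name = 'Page life expectancy'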

7. Memory grants

This counter helps me understand if I am seeing internal memory pressure. Ideally this value should be as close to 0 as possible. Sustained periods of non-zero values are worth investigating:

SELECT cntr_value AS [MemGrantPending]
FROM sys.dm_os_performance_counters 
WHERE counter_name = 'Memory Grants Pending'

If my queries are waiting for memory grants I would look at the query plans first before adding more RAM as a knee-jerk reaction. It could be the case that the new Cardinality Estimator in SQL Server 2014 has over-estimated the amount of memory needed for a particular query. You may want to try running the query with trace flag 9481 to see if the legacy CE works better.
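
Here is a sketch of what that test looks like. The table and predicate are made up for illustration; the QUERYTRACEON hint scopes the trace flag to a single statement (note that in SQL Server 2014 it requires sysadmin-level permissions):

-- dbo.Orders is a hypothetical table, used only to show the hint syntax
SELECT o.OrderID, o.OrderDate
FROM dbo.Orders AS o
WHERE o.OrderDate >= '20140101'
OPTION (QUERYTRACEON 9481) -- use the legacy (pre-2014) cardinality estimator for this statement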

Summary

I cannot stress this enough, but every workload and server is different. It is important that you understand your workload before assuming you have performance issues. Use these scripts to start collecting data over time in order to get a baseline of what is considered to be normal. I've presented some numbers and guidance above but they should not be taken as an absolute for each and every server and workload out there.

Lastly, I'm always looking to add to my toolbox, so if you have a favorite metric you want to share just leave it in the comments below.

For those of you too lazy to click the link above, here's the full script:

/*=============================================
  File: SQL_Server_2014_metrics.sql

  Author: Thomas LaRock, https://thomaslarock.com/contact-me/
  
  Title: Performance Metrics for SQL Server 2014


  Summary: Here is my list of metrics that I find myself frequently using for 
troubleshooting SQL Server performance issues. They help me get insight quickly 
into some of the deep recesses of SQL Server so that I can easily corroborate 
with some standard Perfmon counters and DMVs in order to troubleshoot issues 
for customers and clients. You should incorporate these into whatever monitoring 
solution you are using for trend analysis. By looking at these metrics in the 
first few minutes of troubleshooting (and then comparing them to your baseline) you will save time.

These queries are meant to serve as a guide only; just because I say "100" 
and your system shows "101" doesn’t mean the sky is falling. Baseline and trend 
analysis is key here. Monitor frequently, look for spikes, and above all Don’t Panic.

  Date: June 5th, 2014

  SQL Server Versions: SQL2012, SQL2014

  You may alter this code for your own purposes. You may republish
  altered code as long as you give due credit. 

  THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
  OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
  LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
  FITNESS FOR A PARTICULAR PURPOSE.

=============================================*/

/* =============================================

Signal Waits 

The value you want returned here is for no more than 20-25% of your total waits to be signal waits. If you are consistently seeing numbers greater than 20% then you are having internal CPU pressure. You can remedy the situation by reducing the number of sessions (not always likely), increasing the number of available CPUs (also not likely), or reducing the amount of time the queries need to execute (often very likely, and sometimes easily done).
=============================================*/

SELECT (100.0 * SUM(signal_wait_time_ms)/SUM (wait_time_ms)) AS [SignalWaitPct]
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (
'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT',
'BROKER_TO_FLUSH', 'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT',
'DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT',
'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
'BROKER_RECEIVE_WAITFOR', 'ONDEMAND_TASK_QUEUE', 'DBMIRROR_EVENTS_QUEUE',
'DBMIRRORING_CMD', 'BROKER_TRANSMITTER', 'SQLTRACE_WAIT_ENTRIES',
'SLEEP_BPOOL_FLUSH', 'SP_SERVER_DIAGNOSTICS_SLEEP', 'HADR_FILESTREAM_IOMGR_IOCOMPLETION')
AND wait_time_ms <> 0



/*============================================= 

SQL compilation percentage 

The recommended percentage for compilations should be roughly 10% of the total number of batch requests.
*/

SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec')
AS [SQLCompilationPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'SQL Compilations/sec'

/*
SQL re-compilations should be roughly 1% of the total number of batch requests.
*/

SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec')
AS [SQLReCompilationPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'SQL Re-Compilations/sec'


/*=============================================*/



/*=============================================

Page lookups percentage

What I want to see here is a value that is less than 100 on average, but it really depends upon the nature of your instance. Measure over a period of time and look for spikes.
=============================================*/

SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec') 
AS [PageLookupPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page lookups/sec'



/*=============================================

Page splits percentage

Monitoring for just the number of page splits by itself isn't very reliable, as the counter includes any new page allocations as well as page splits due to fragmentation. So I like to compare the number of page splits to the number of batch requests. The number I look for here is roughly 20 page splits/sec for every 100 batch requests.
=============================================*/

SELECT 1.0*cntr_value /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Batch Requests/sec') 
AS [PageSplitPct]
FROM sys.dm_os_performance_counters
WHERE counter_name = 'Page splits/sec'



/*=============================================

Average task counts

High Avg Task Counts (>10) are often caused by blocking/deadlocking or other resource contention
High Avg Runnable Task Counts (>1) are a good sign of CPU pressure
High Avg Pending DiskIO Counts (>1) are a sign of disk pressure
=============================================*/

SELECT AVG(current_tasks_count) AS [Avg Task Count],
AVG(runnable_tasks_count) AS [Avg Runnable Task Count],
AVG(pending_disk_io_count) AS [Avg Pending DiskIO Count]
FROM sys.dm_os_schedulers WITH (NOLOCK)
WHERE scheduler_id < 255 OPTION (RECOMPILE)



/*=============================================

Buffer pool I/O rate

I usually look for rates around 20MB/sec as a baseline. If there is a spike upward from there then you are 
having memory pressure (pressure that might otherwise fail to be seen if you only examine the page life expectancy counter).
=============================================*/

SELECT (1.0*cntr_value/128) /
(SELECT 1.0*cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name like '%Buffer Manager%'
AND lower(counter_name) = 'Page life expectancy')
AS [BufferPoolRate]
FROM sys.dm_os_performance_counters
WHERE object_name like '%Buffer Manager%'
AND counter_name = 'database pages'



/*=============================================

Memory grants pending

This counter helps me understand if I am seeing internal memory pressure. Ideally this value should be as close to 0 as possible.
Sustained periods of non-zero values are worth investigating.
=============================================*/

SELECT cntr_value AS [MemGrantPending]
FROM sys.dm_os_performance_counters 
WHERE counter_name = 'Memory Grants Pending'

Comments on “Performance Metrics for SQL Server 2014”

  1. The query given for page splits per second isn’t going to work too well. The problem is that the query ignores the documentation, which tells us that the batches/sec and splits/sec counters are cumulative. That is, they’re always increasing. This query, then, shows us the average rate of page splits per batch over the lifetime of the server. I can’t imagine how that value would be useful. Particularly on a server that’s been running for more than a few minutes, the average becomes continually dampened and won’t react to recent changes in any meaningful way.

    Seems better to get the counter, WAITFOR DELAY a second, get the counter again, subtract the two values, and show the number of splits that have happened in the last second. This will be independent of the number of batches processed, though similar math could be done against a second pair of samples to correctly retrieve that information.
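
    A minimal sketch of that approach (the one-second delay is arbitrary):

    DECLARE @splits1 BIGINT, @splits2 BIGINT

    -- first sample of the cumulative counter
    SELECT @splits1 = cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page splits/sec'

    -- wait one second, then sample again
    WAITFOR DELAY '00:00:01'

    SELECT @splits2 = cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page splits/sec'

    -- the difference is the number of splits over the last second
    SELECT @splits2 - @splits1 AS [PageSplitsLastSecond]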

    • Mike,

      Great feedback, thanks. I like the idea of your approach for getting something closer to a real-time diagnostic tool. As you mentioned, the approach above may not work as well for a system that has been running for a long period of time. However, it would still have some value should it reveal that there are a high number of page splits per batch requests. That’s really the goal here, to understand that page splits or batch requests by themselves may not be useful, but putting the two together would yield something better.

      I’m now wondering whether your approach would be perfect for monitoring an Azure VM, as you’d want to be alerted if your I/O was heavy as a result of page splits. Now you’ve got me thinking I need to write (yet another) blog post…someday.

      Thanks again for the comment!

  2. For the Buffer Pool I/O Rate, the query doesn’t work for me on Windows 2012 with SQL 2012 Standard. I think I have a fix for it with the following, is this right?

    SELECT cast((1.0*cntr_value/128) /
    (SELECT 1.0*cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name like '%Buffer Manager%'
    AND lower(counter_name) = 'Page life expectancy') as DECIMAL(8,2))
    AS [BufferPoolRate]
    FROM sys.dm_os_performance_counters
    WHERE
    counter_name = 'Total Server Memory (KB)'
