No matter what our profession, we all have our favorite set of tools to help us get the job done.
Database Administrators will have a set of scripts that we use in order to gain insight into the server instances under our care. I often see beginner DBAs look to the counters provided by Perfmon (AKA, the Resource and Performance Monitor). The trouble here is the counters themselves don’t necessarily give you enough details to make an informed decision about what actions you should take next. You need to look at a handful of the counters at any given time (or over a period of time) in order to better ascertain where the bottlenecks exist.
Here is a list of five metrics that I find myself frequently using when troubleshooting SQL Server performance issues. They help me get insight quickly into some of the deep recesses of SQL Server so that I can easily corroborate with standard Perfmon counters in order to troubleshoot issues for customers. Ideally these are all being collected over a period of time so that you can do some proper trend analysis.
These are the metrics, in no particular order, that you should be using during the first few minutes of analysis. They will help you save time.
1. Signal Waits Percentage
This metric has provided the most value personally to me over the years. I started using it four years ago when troubleshooting CPU issues. The instance itself was showing signs of “slowness”, but the overall CPU utilization numbers were nominal. I came across this article by Tom Davidson and found out that I could calculate the percentage of signal waits to the overall waits in order to understand internal CPU pressure exists. [A "signal" wait means that a session is in the "runnable" queue, waiting for a signal from an available processor.]
I filter out some background tasks for this query as well. The value you want returned here is for no more than 20-25% of your total waits to be signal waits. If you are consistently seeing numbers greater than 20% then you are having internal CPU pressure. You can remedy the situation by reducing the number of sessions (not always likely), increasing the number of available CPUs (also not likely), or reducing the amount of time the queries need to execute (often very likely, and sometimes easily done).
SELECT (100.0 * SUM(signal_wait_time_ms)/SUM (wait_time_ms)) AS [SignalWaitPct] FROM sys.dm_os_wait_stats WHERE wait_type NOT IN ( 'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK', 'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE', 'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH', 'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE', 'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER', 'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', 'BROKER_RECEIVE_WAITFOR', 'ONDEMAND_TASK_QUEUE', 'DBMIRROR_EVENTS_QUEUE', 'DBMIRRORING_CMD', 'BROKER_TRANSMITTER', 'SQLTRACE_WAIT_ENTRIES', 'SLEEP_BPOOL_FLUSH', 'SQLTRACE_LOCK') AND wait_time_ms <> 0
2. SQL Compilation Percentage
Another sneaky performance issue is having a high number of statements being compiled. There happens to be a performance counter for compiles but the counter itself does not have much meaning unless you also know the overall number of batch requests. Rather than look at either of those counters separately what you want to do is look at their ratio.
The recommended percentage for compilations is roughly 10% of the total number of batch requests. The following code will show you the current percentage of compilations to the number of batch requests:
SELECT 1.0*cntr_value / (SELECT 1.0*cntr_value FROM sys.dm_os_performance_counters WHERE counter_name = 'Batch Requests/sec') AS [SQLCompilationPct] FROM sys.dm_os_performance_counters WHERE counter_name = 'SQL Compilations/sec'
At this point someone usually asks “what about re-compilations”? Well, re-compilations should only be roughly 1% of the total number of batch requests, so that code would be like this:
SELECT 1.0*cntr_value / (SELECT 1.0*cntr_value FROM sys.dm_os_performance_counters WHERE counter_name = 'Batch Requests/sec') AS [SQLReCompilationPct] FROM sys.dm_os_performance_counters WHERE counter_name = 'SQL Re-Compilations/sec'
3. Page Lookups Percentage
What I like most about this metric is that it often gives me a hint regarding plan cache bloating without me having to dig through the plan cache itself. I also get insight on the queries hitting the instance, to see if they could use some tuning/indexing as there are likely some inefficient query plans being utilized.
What this metric really shows me is if I have one of two extremes. Either I have a few rather large queries hitting my server, or I have a lot of little queries (probably ad-hoc) hitting my instance and likely leading to some plan cache bloating.
What I want to see here is a value that is less than 100 on average:
SELECT 1.0*cntr_value / (SELECT 1.0*cntr_value FROM sys.dm_os_performance_counters WHERE counter_name = 'Batch Requests/sec') AS [PageLookupPct] FROM sys.dm_os_performance_counters WHERE counter_name = 'Page lookups/sec'
4. Average Task Counts
I got this metric from Glenn Berry (blog | @GlennAlanBerry), and I use this one to determine what my system has been busy doing. I like how Glenn always has a brief description for what you want (or don’t want) to see. In this case it is as follows:
- Sustained values above 10 suggest further investigation in that area
- High Avg Task Counts are often caused by blocking or other resource contention
- High Avg Runnable Task Counts are a good sign of CPU pressure
- High Avg Pending DiskIO Counts are a sign of disk pressure
I can use this one in conjunction with others to get some corroborating evidence as to what is happening under the covers:
SELECT AVG(current_tasks_count) AS [Avg Task Count], AVG(runnable_tasks_count) AS [Avg Runnable Task Count], AVG(pending_disk_io_count) AS [Avg Pending DiskIO Count] FROM sys.dm_os_schedulers WITH (NOLOCK) WHERE scheduler_id < 255 OPTION (RECOMPILE)
5. Buffer Pool I/O Rate
The most misunderstood counter I see is Page Life Expectancy. I know I used to pay attention as if it meant something. Over time I have come to realize that it is a throwback to days of yore when having 4GB of memory was a really big deal. In other words, its intended purpose no longer has the same value as it once did. It is not totally worthless, however. You just need to know what to do with it.
What you really want to see is the rate at which your pages are being cycled through the buffer pool. The following code returns the “Buffer Pool Rate” in MB/sec. I usually look for rates around 20MB/sec as a baseline. Why that number? Because if I have a “typical” server with 56GB of RAM available for the buffer pool (thank you, locked pages in memory) and I want to keep my pages around for an hour or so (3600 seconds), then I come up with 56000MB/3600sec, or about 15.5 MB/sec. That’s why I look for a sustained rate of about 20 on average, and if I have a spike upward from there then I know I am having memory pressure (pressure that might otherwise fail to be seen if I only examine the PLE counter).
Here is the code I use:
SELECT (1.0*cntr_value/128) / (SELECT 1.0*cntr_value FROM sys.dm_os_performance_counters WHERE object_name like '%Buffer Manager%' AND lower(counter_name) = 'Page life expectancy') AS [BufferPoolRate] FROM sys.dm_os_performance_counters WHERE object_name like '%Buffer Manager%' AND counter_name = 'total pages'
You should also be aware that if your server is NUMA aware then you will want to take that into consideration whenever you try to use PLE as a performance metric.
There you go, the five metrics I utilize for customers whenever they need my help in solving performance problems. I’m always looking to add to my toolbox, so if you have a favorite metric you want to share just leave it in the comments below.