You sit down at your desk. It’s 9:10AM and your coffee is still warm. There is a smell of bacon in the air.
Suddenly your phone rings. The trading system is down. The time for quick thinking is now.
Where would you begin to troubleshoot this scenario?
A lot of people will reach for the biggest hammer they can find: a tool that will trace all activity as it hits the database instance. For SQL Server, that tool is typically SQL Profiler.
The trouble here is this: you are in reactive mode right now. You have no idea what the root cause of the issue is. So you configure your trace to capture as many details as possible, making certain that when the time comes you are prepared to do as thorough a forensics job as you can, in the hope that you can fix the issue in the shortest amount of time.
This can all be avoided with a little bit of proactive monitoring. I often tell people that one hour of proactive work can save you up to 150 hours of reactive firefighting. I also point out that a junior administrator knows how to be reactive, but a senior administrator knows the value of being proactive.
I know which one of those I prefer to have around when the trading system goes down.
The Gathering
So, what data do you need to start troubleshooting the scenario above?
A better question might be this: are you already collecting performance metrics? And if so, are you certain you are looking at the right SQL Server performance metrics?
It’s a question I often hear people ask themselves. Usually, as the result of some edge-case performance issue, they end up adding in every possible metric they can find. Believe me, you *DON’T* want to walk into a war room and say the words “we didn’t see this coming because we weren’t capturing the number of widgets per second inside the flux capacitor”. Mostly because those words make no sense and you’ll sound like an idiot. But I think you get my meaning.
Here’s the thing that most people don’t understand: when it comes to gathering performance metrics, it’s not so much what you gather as how you gather it.
Knowing what to measure is an easy task. It really is. You can find lots of information on the series of tubes known as the internet that will list out all the metrics an administrator would want. Database size, free disk space, CPU utilization, page life expectancy, buffer cache hit ratio, etc. The list of available metrics seems endless and often overwhelming. Some of that information is even useful; a lot of it can be just noise, depending on the problem you are trying to solve.
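To make that concrete, here is a minimal sketch of what pulling a couple of those metrics looks like in SQL Server: a single query against the sys.dm_os_performance_counters DMV returns page life expectancy and the raw buffer cache hit ratio counters (the ratio counter has to be divided by its base counter to mean anything).

```sql
-- Quick look at two of the buffer pool counters mentioned above.
-- Note: the object_name prefix varies with the instance name, hence the LIKE.
SELECT  [object_name],
        counter_name,
        cntr_value
FROM    sys.dm_os_performance_counters
WHERE   [object_name] LIKE N'%Buffer Manager%'
        AND counter_name IN (N'Page life expectancy',
                             N'Buffer cache hit ratio',
                             N'Buffer cache hit ratio base');
```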
But when it comes to the “how” you collect those metrics you have only two options: tracing or polling. That’s it, no more.
It’s up to you to figure out which metrics are better suited for polling, and which for tracing. In order to do that, we need to make certain we understand the difference between the two.
Tracing – This method will track details and capture events as they happen. In an ironic twist this method can interfere with the performance of the queries you are trying to measure. Examples of tools that use the tracing method are Extended Events and SQL Profiler.
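As a rough sketch (not something to run as-is), an Extended Events session that traces completed statements for a single database might look something like this; the session name, file name, and database_id value are placeholders you would swap for your own:

```sql
-- A narrowly scoped trace: one event, one database, written to a file target.
CREATE EVENT SESSION [stmt_trace_example] ON SERVER
ADD EVENT sqlserver.sql_statement_completed
    (ACTION (sqlserver.sql_text, sqlserver.session_id)
     WHERE (sqlserver.database_id = 5))   -- placeholder database_id
ADD TARGET package0.event_file (SET filename = N'stmt_trace_example.xel');
GO

ALTER EVENT SESSION [stmt_trace_example] ON SERVER STATE = START;
-- ...and remember to stop it when you are done:
-- ALTER EVENT SESSION [stmt_trace_example] ON SERVER STATE = STOP;
```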
Polling – This is also known to statisticians by another name: sampling. A tool that utilizes polling will gather performance data at regular intervals. This is considered a lightweight alternative to tracing. Examples of tools that use this method are Performance Monitor (by default it samples once per second) and third-party applications that query dynamic management objects (system views known as DMVs in SQL Server, and x$ and v$ objects in Oracle).
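To illustrate the polling approach, here is one possible sketch: snapshot the wait stats DMV into a table on a schedule (a SQL Agent job every minute or five, say) so you have something to trend against later. The table name here is made up for the example.

```sql
-- Hypothetical table to hold each sample.
CREATE TABLE dbo.WaitStatsSample (
    sample_time    DATETIME2     NOT NULL,
    wait_type      NVARCHAR(60)  NOT NULL,
    wait_time_ms   BIGINT        NOT NULL,
    waiting_tasks  BIGINT        NOT NULL
);
GO

-- The poll itself: run this at a regular interval and you have the raw
-- material for trend analysis and a baseline.
INSERT INTO dbo.WaitStatsSample (sample_time, wait_type, wait_time_ms, waiting_tasks)
SELECT  SYSDATETIME(),
        wait_type,
        wait_time_ms,
        waiting_tasks_count
FROM    sys.dm_os_wait_stats;
```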
When it comes to performance monitoring you need to have a plan. Capturing every possible detail is not a plan. Well, technically it *is* a plan, just a bad one, usually. As with most decisions in my life I like to build my plan around a reliable standard: CBR (that’s my TLA for Cost, Benefit, and Risk).
Costs
Tracing will track all events as they occur. This method has more overhead than polling. The more events you include in your trace, the more you affect the performance of the server you are trying to monitor. That’s not good.
ADVANTAGE: POLLING
Benefits
Tracing will give you sequential details about events as they happen. This is a tremendous benefit when you are trying to understand a system for which you have no idea who or what is running.
ADVANTAGE: TRACING
Risks
Tracing can interfere with the performance of the instance being monitored. The biggest risk with polling is that if an issue happens between two polls (or samples), you may miss some pertinent details.
ADVANTAGE: POLLING
I haven’t even talked about baselines or the need to capture and store the details of either method for the purposes of historical trend analysis. Both of those items are probably their own blog posts waiting to happen…someday. But the fact is you need to store the information gathered from either method in order to know what is normal.
Summary
I find that many junior administrators (and developers with novice database performance troubleshooting skills) tend to rely on tracing tools for even the most routine tasks that can be done with a few simple queries against a DMV or two. I do my best to educate when I can, but it is often an uphill battle. I’ve lost track of the number of times I’ve been thrown under the bus by someone saying that they can’t fix an issue because I won’t let them run Profiler against a production box as a first attempt at figuring out what’s going on.
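For example, one of those routine tasks is simply answering “what is running right now?”, which a quick (if simplified) query against a couple of DMVs can handle without ever starting a trace:

```sql
-- Currently executing requests, excluding this session.
SELECT  r.session_id,
        r.status,
        r.wait_type,
        r.cpu_time,
        r.total_elapsed_time,
        t.[text] AS running_batch
FROM    sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE   r.session_id <> @@SPID;
```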
Rather than make people choose between one tool or the other I do my best to explain how they work well together.
Think of a surgeon who needs to operate on a patient. There’s a good chance that before the surgeon cuts into healthy skin, they will take an X-ray of the area. Once they examine the X-ray, they know more about what they need to do when they operate.
Polling tools are similar to X-rays. They help you understand which areas you need to investigate further. Then, when you need to take that deeper dive, that’s where you are likely to use a tracing tool, configured to return only the information needed to solve the problem, and only for the shortest possible duration.
I wouldn’t use a tracing tool as a first option. I rely on polling to help me understand where to go next. Sometimes that requires a trace, but oftentimes I’m able to help make positive performance improvements without ever needing to run a trace. Then again, I’m lucky that I have some really good tools to use for monitoring database servers, even ones that are running on VMWare.
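As an example of that “X-ray first” step, here is a sketch of the kind of polling-style query I might start with; it reads the plan cache statistics SQL Server already keeps, so nothing extra is being traced:

```sql
-- Top five cached statements by total CPU since the plan was cached.
SELECT TOP (5)
        qs.execution_count,
        qs.total_worker_time / qs.execution_count AS avg_cpu_microseconds,
        st.[text] AS batch_text
FROM    sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
```

If something in that list looks suspicious, that is the point where a short, tightly filtered trace earns its keep.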
There’s a lot to learn as an administrator, and it can be overwhelming for anyone, new or experienced. So it wouldn’t hurt to double-check how you are monitoring right now, to make certain you are grabbing the right things at the right frequency.