Stop Guessing About CPU Pressure

This tiny guy runs everything you do!

Ever try to troubleshoot something that doesn’t really exist?

I think this virtualization thing is here to stay, I really do, but the trouble I see people having with virtualization is that they don’t know where the bottleneck truly lies. There are so many layers involved that most standard troubleshooting techniques fall short. Just think for a minute about being a SQL Server DBA in a virtual world: you administer a piece of software, running on a guest O/S, that is running on a host, that is connected to a datastore that may or may not be shared with “noisy neighbors”.

Knowing all that, when an end-user comes to you and says “this query is running slow”, where do you begin to look?

I always start with the virtual layers. I’d rather spend 5 minutes ruling out the virtual layer as a bottleneck before I try to tune a query. It makes no sense to spend time trying to make a query run faster if the bottleneck is at the host level, for example. That’s wasted time, and I *hate* wasting time.

One example I see often these days has to do with CPU pressure. In my experience most database servers are not CPU bound. (I said “most” and “my experience”.) Usually a database server has a lot of extra CPU cycles to spare. But that is not always the case in the virtual world. So when I come across a database instance that is showing CPU pressure I like to ask a very general question: Is it internal pressure, or external?

One of my favorite metrics to examine when researching potential CPU pressure is called signal waits. You can read more about signal waits at this blog post by a member of the SQLCAT team. The default threshold I like to set is 25%, meaning once roughly 25% of total wait time is signal wait time — time sessions spend sitting in the runnable queue, simply waiting for an available CPU — I am likely to believe that the CPU pressure is internal to the instance. Of course more factors come into play before I would define 25% to be a problem, such as the volume of sessions, how long each session takes to execute, and so on.
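As a rough sketch of the arithmetic, the signal-wait percentage is just the share of total wait time that was spent waiting on a scheduler. The T-SQL in the comment below is the standard query shape against `sys.dm_os_wait_stats`; the Python rows are hypothetical sample values, not real measurements:

```python
# Signal wait percentage: the share of total wait time that sessions spent
# in the runnable queue waiting for a CPU (signal_wait_time_ms), as opposed
# to waiting on a resource. The usual T-SQL reads sys.dm_os_wait_stats:
#
#   SELECT 100.0 * SUM(signal_wait_time_ms) / SUM(wait_time_ms)
#   FROM sys.dm_os_wait_stats;
#
# The rows below are made-up example values for illustration only.
wait_stats = [
    # (wait_type, wait_time_ms, signal_wait_time_ms)
    ("PAGEIOLATCH_SH",      120_000,  4_000),
    ("CXPACKET",             80_000, 30_000),
    ("SOS_SCHEDULER_YIELD",  50_000, 46_000),
]

def signal_wait_pct(stats):
    """Percentage of all wait time that was signal (CPU) wait."""
    total_wait   = sum(wait for _, wait, _ in stats)
    total_signal = sum(sig for _, _, sig in stats)
    return 100.0 * total_signal / total_wait

pct = signal_wait_pct(wait_stats)
print(f"signal waits: {pct:.1f}%")  # 80,000 / 250,000 ms -> 32.0%
print("internal CPU pressure likely" if pct >= 25.0 else "below 25% threshold")
```

With these sample numbers the instance lands at 32%, over the 25% threshold, so I would suspect internal CPU pressure and keep digging from there.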

But what about a guest O/S that is also waiting for an available CPU from the host? Imagine that your sessions are waiting for SQLOS to give them an available CPU, while SQLOS itself is waiting for an available CPU from the host. Wouldn’t it be nice if there was a metric to tell you whether the issue is at the guest layer, and not internal to SQL?

Well, as luck would have it, such a metric does exist: CPU ready time. This is simply a metric that tells you how long your guest was ready to run but could not get scheduled onto a physical CPU.
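To judge whether a CPU ready number is bad, it is usually converted to a percentage. A minimal sketch of that conversion follows; the 20-second interval matches vSphere's real-time statistics, and the 5%-per-vCPU threshold is a common rule of thumb, not an official VMware limit:

```python
# vSphere's real-time charts report "CPU ready" as a summation in
# milliseconds accumulated over the sampling interval (20 seconds for
# real-time data). Converting to a percentage:
#
#   ready % = ready_ms / (interval_s * 1000 * vCPUs) * 100
#
# Sustained ready time above roughly 5% per vCPU is a commonly used
# warning sign that the host cannot schedule the guest promptly.
def cpu_ready_pct(ready_ms, interval_s=20, vcpus=1):
    """Convert a CPU ready summation (ms) to a per-vCPU percentage."""
    return 100.0 * ready_ms / (interval_s * 1000 * vcpus)

# Example: a 4-vCPU guest accumulating 4,000 ms of ready time in one
# 20-second real-time sample sits right at the rule-of-thumb line.
pct = cpu_ready_pct(4000, vcpus=4)
print(f"{pct:.1f}% ready per vCPU")  # 4000 / 80000 -> 5.0%
```

A guest showing high signal waits inside SQL Server *and* high CPU ready time at this layer is starved by the host, not by its own workload.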

Together the signal waits and CPU ready time help to give you a more complete picture of what is happening to your database instance. Unfortunately, you can’t get at all of this information easily from within vSphere. Here’s a quick overview of what you would need to do in vSphere to research CPU pressure:

  1. Open vSphere, navigate to the correct host, then to the guest you are interested in monitoring.
  2. On the Performance tab, add the counters you want (but no more than two!).
  3. View real-time information (too bad if your issue was last week and you no longer have the details).
  4. Ascertain whether the issue is local to the guest by going back to the host and looking at its Performance tab.
  5. Add the counters for the host Performance tab view (but no more than two!).
  6. View real-time information (again, too bad if the issue was last week and you no longer have the details).
  7. Throw up your hands when you realize that you still don’t know whether the pressure is internal or external to SQL running on that guest, on that host.
  8. Slap yourself when you realize you can’t tie anything back to specific SQL statements.

Sounds rather time consuming, right? Especially if you still won’t have any idea about a root cause or what specific actions to take after spending all that time investigating the issue. If you want to see it for yourself I made a video to show you some of the challenges that vSphere has when it comes to displaying all of the metrics necessary for you to make an informed decision.

[Sorry, the video seems to have been removed, I’m not sure where it went but it was five years ago so it isn’t relevant now anyway].

The short of it is that vSphere is never going to give you a complete view (including SQL statements, or even the number of sessions running) in one screen. But IgniteVM will do that for you, saving you time and ultimately money as well — money that can then be used to buy more bacon-related products for your home or office (don’t forget Valentine’s Day is right around the corner).

Not only is virtualization here to stay, but tools like IgniteVM will be here as well to help administrators quickly find the root cause for performance bottlenecks.

Want to chat more about monitoring your virtual database servers? Then contact me and I would be happy to set up some time to talk. Also, Confio has regular product demos: go register for the next one here.

4 thoughts on “Stop Guessing About CPU Pressure”

  1. Hi Tom, this is by far your best post in a very long time. Even though you are trying to sell a product you have provided very good information. 

    Since it takes microseconds to run a good waits query… why wouldn’t you do that first? It doesn’t seem like a waste of time if you have a properly balanced virtual environment. 

    Does IgniteVM show you the waits?

    • Yes, IgniteVM shows you the waits, same as Ignite8 would do.

      The reason I don’t look at waits first is because I want to make certain I rule out the “noisy neighbor”. If my host has disk saturation, then knowing my query is waiting for disk won’t help me make that query faster. I want to address the root cause of performance problems, and not the symptoms or effects.

