13 Jul SQL University – Advanced Troubleshooting Week
Welcome back to another week of SQL University. This week we will cover advanced troubleshooting. When I first started as a DBA “advanced troubleshooting” was something that you performed while someone else was on the other end of a phone line yelling at you for their own mistakes. Yeah, it really was that much fun. And lets not forget those times when you would be told what to do by people that had even less experience than you did, but you trusted them and then when things went even more wrong you would get all of the blame for not doing the right thing to begin with even though you were following the orders of a superior. Good times, indeed.
This post is designed to put all of that pain drama experience into a format that will provide you with some benefit at the end. I hope it makes you a better DBA, or at the very least a better person. Most importantly, it gave me something to blog about this week.
As always, the tempdb level is all about awareness. The awareness you need here is to know that there are various techniques and solutions available for you to utilize. More importantly you need to be aware that you are not someone else. Let me explain a little bit more about that.
Every IT shop is different. Every IT shop has a different configuration of systems and applications to support. The people working in one shop are not the same people in another. At some point you will come across a person that will tell you “I did this and it solved everything for me.” Stop. Remember this. You are not that person, and you are (most likely) not working in that shop. So, reading about their particular solution in a blog post or even a book does not mean that it is going to be the perfect solution for you. IN fact, it could even make things worse for you, especially if you are under a time crunch and you waste precious hours trying to implement someone else’s solution.
You are not that person. Trying to be that person means you will always be second best. You need to find out for yourself what works best for you, in your shop, as far as advanced troubleshooting techniques go.
The best example I can give to you here is this. After years as a production support DBA I started to get an idea as to when the issue was with a server, with code, or with a person. So if it was 3AM and I needed to figure out why something wasn’t working I could take a step back, assess the situation, and drill right into the most likely culprit (BTW, blaming code is usually a safe bet). Now, you aren’t going to find anyone writing a book about advanced troubleshooting that will say “anytime Fred deploys a DTS package he forgets to do x, and that is going to be what you need to fix at 3AM”. But it certainly is a technique, and one you need to be aware of as time goes on.
You can be handed a lot of technical techniques, but even a good mechanic gets by with a lot of touch and feel under the hood. You’ll need to do the same over time.
Here is where you need to be getting the job done, and you cannot get the job done unless you have a solution in place to help you gather details about your instances. It doesn’t really matter which solution you are using, just so long as that solution fills your needs. In my case what I needed at the time was a solution that allowed for me to answer the question “what happened at 2:03 last night that caused my batch load to run slower than normal?” Since I was usually asleep at that time I knew I needed something that would help me gather the necessary details to answer that exact question.
At first we had our own in-house solution, but it did not scale as we grew from a handful of instances to over 170 by the time I decided to take my talents West to Boulder. So we ended up purchasing a solution in order to help us provide better support. As good as the solution was for us initially, it was nothing more than a way for us to gather details, we still needed to do the forensics ourselves. And in time even that solution was not enough, we had to branch out in another direction eventually. We went with one tool for basic monitoring and used the other solution for gather performance metrics.
And why did we go that route? Well, because our shop is not the same as yours, that’s why.
At this level you have everything in place. You have a solution that is gathering performance metrics that allow for you to go back in time to answer questions such as “what happened”. But here is where you have to start your forensics career. While there are many solutions that allow for you to gather metrics, there are very few solutions that you would trust to perform analysis on those metrics and take action without human intervention. That’s where you come in; you’re the human.
You are going to need to understand the four major SQL Performance Food Groups: Disk, CPU, Memory, tempdb. Each one of those is a potential bottleneck for your queries. Starting looking for clues as to which one may have been the culprit. Was it Memory, in the server room, with a candlestick? Or perhaps this time it was CPU, in the library, with a rope? Or maybe there was some unknown, mysterious stranger that was blocking everyone else from trying to run away?
As you hone your detective skills you will need to work on your soft skills as well when conversing with your customers. No one wants to be told that they wrote a bad query, or if they made a mistake in general. Do your analysis, present your findings, and leave emotion out of your results. This will help give you an opportunity to continue to have people seek you out with their questions, which is vital to being at this level.
If you want to be a master of advanced troubleshooting then the answer is simple: become proactive. In most cases, the DBA is the last person to know that a problem exists. If a customer thinks an ad-hoc query is taking too long, or someone sees their batch ran ten minutes long in the middle of the night, or if a nic card has gone bad chances are you will not hear about it until after the problem has happened. And that puts you in a reactive mode, same as when you are handed a bad database design and are told it needs to go live tomorrow. Time is no longer a luxury for you as someone is waiting for you to fix something immediately. Worse yet, the issue may not even be anything you control, but you are being tasked with discovering the problem and having a solution in place.
At the master level you leave that world (mostly) behind. You will always have those reactive scenarios, but you can limit their occurrence by becoming proactive. The real trick, therefore, to becoming a master is to learn how to be proactive. You want to start troubleshooting issues (or, better yet, potential issues) before your customers are even aware of a problem on their end. This means that you start slicing and dicing your metrics and look for spikes and trends that can signal an possible issue.
The best part of being proactive? That would be the fact that if someone does your way and is yelling about the server performance you will be able to quickly rule out a handful of possible causes because (1) you have the data and (2) you know what to do with it. Why? Because when you are proactive then you will often have the time to thoroughly research the issue at hand. This often leads you to develop sound solutions to whatever issue popped up, and that sound solution helps to prevent other issues. Instead of the MacGyver approach where you keep your server running with duct tape, scissors, and a paper clip for the next 36 hours you get the opportunity to build something properly. And that research is really just extra training (or what many people just call experience), and that experience is what allows for you to become better at troubleshooting in general.
I could point you to the blogs and websites of a lot of people that I consider to be very good when it comes to advanced troubleshooting. I can also point you to more than a handful of books. However, what you really need to do here is to build your own resource list. And you can get started on that right now. Start by documenting all your servers, and the people that use them (or own them, just think high level here), and then make room for you to keep track of any oddball things that exist. Is that server a different collation? Was that server recently rebuilt for some reason? Did that server blue screen last night? Keep track of these notes and over time you will see a story emerge, and that story is going to be crucial for you when it comes time to troubleshoot some odd behavior.
Advanced troubleshooting is really just a fancy way of saying “you can find out what is wrong very quickly”. The techniques for achieving this will differ from person to person, but in the end we all need sound information for us to make our decisions. Start gathering yours, in whatever way works best for you.