Looks like I got tagged by Brent Ozar today. Apparently I need to share some mistakes I have made in my career, and then tag someone else to do the same. Okay then, here goes.
Mistake #1: Ignoring the pager. Back before I became SQLBatman, I was a new DBA. You could call me a junior DBA if you want, but a more accurate term would be application developer, because that is what I was. Through attrition I was asked to be a database administrator because I was one of the only people in the office who had ever done a backup and a restore. Or, to be more exact, because I was the only one who expressed any interest in being a DBA.
So, after being on the job for a couple of weeks, one Sunday morning my new pager goes off. I check it and find a cryptic message, something like “MTF reindex fail”. Now, this pager went off a lot, for various things, and most of those pages were not anything I needed to act on. See, we had a lot of shared systems at that time, so most of the pages went out to groups of people. I had no idea what this page meant, no idea where to start looking, and quite simply assumed it was meant for someone else. Unfortunately that was far from the truth. The page went to me, and no one else.
So, Monday morning comes and I get a call from the application folks, who are screaming about how poor performance is on their system. After about an hour, I remembered the page and realized that we needed to rebuild the index. Okay then, let’s rebuild. Hey, did you know that the reason we rebuild the index on Sunday mornings is that it is a three-hour job? No? Neither did I. To make matters worse, the rebuild was destined to fail again, because there was not enough free space in the database file for it to complete (thank you, Sybase ASE 11, and your lack of an autogrow option).
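For the curious, on a current SQL Server the free-space check I should have run first looks roughly like this (the server in this story was Sybase ASE 11, so this is a present-day sketch, not what I had in front of me at the time):

```sql
-- Rough sketch: how much free space is left in the data files of the current
-- database before kicking off a long index rebuild. Sizes come back in 8 KB
-- pages, so divide by 128 to get megabytes.
SELECT  name,
        size / 128.0 AS size_mb,
        CAST(FILEPROPERTY(name, 'SpaceUsed') AS int) / 128.0 AS used_mb,
        (size - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)) / 128.0 AS free_mb
FROM    sys.database_files
WHERE   type_desc = 'ROWS';
```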
So, after about 90 minutes of rebuilding, the application folks tell me to stop what I am doing and restore from Saturday. The restore would take about an hour, and they would rather know they would be online in an hour than wait for the reindex to finish (and fail again). Essentially: put us back to where we were right before the system tried to do the rebuild, and do it now. So I did, we were back online in an hour, and performance was back to expected levels. Of course, we were now four or five hours late on the day’s processing. Oops.
I still remember the email to my boss that said “…obviously the DBA needs to be attentive to the pager”. Yeah, I suppose so. But I did learn a lot about indexes that day! I also learned a lot about how to perform forensics on database issues, essentially recreating events to understand what really happened and, more importantly, what to do next.
Mistake #2: Runaway trace table. Did I mention I used to be a new DBA? Well, at one point I was. And as a new DBA, you need to break a few eggs before you can fully understand some possible consequences of your actions. Such is the case with Profiler. As a diagnostic tool, Profiler is adequate. It is not perfect, but it gets the job done, and you can usually configure it to capture what you need.
One of the options for Profiler is to store your results in a table inside a database. And, as we know, storing the results in a table does make it easy to sift through them later on when you are trying to solve a puzzle. I was evaluating a tool provided by a certain software vendor that shall remain nameless, and all the tool did was provide a front end for Profiler. In other words, instead of using the Profiler GUI, you used this tool. Yeah, I know, why bother, right? Did I mention I was new?
Anyway, where to store the table? Well, in a database, of course. But which database? Well, how about the one we use for our in-house monitoring? Sure, put that table right there. Okay, we know where to store the data; now we just need to figure out what we want to collect. Of course, we don’t want to collect everything, because we do not want to affect performance. Did I mention this was a production server? So, choose wisely. Okay, now for how long? Well, the customer wants to know all activity between the hours of 5 PM and 11 PM. Okay, we know three key elements: when to run, what to capture, and where to put the information.
The result was a perfect storm: we were capturing a little too much information, we were running for way too long, and we were storing it on the same drive as the production data.
Ouch.
I flipped the trace on and started to drive home, about an hour’s drive. I got thirty minutes away from the office and was paged by our in-house system. The data drive was full. Not filling, full. Nothing else would fit on the drive, it was in the middle of the nightly processing, and I was about as far from the office as I could possibly be before I could even begin to work on the problem.
I learned quite a bit that night, and it probably explains why, to this very day, I am against the use of Profiler by anyone outside of our team.
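If I had to run that same evening capture today, I would reach for something like the sketch below instead of a trace-to-table: a server-side trace capped by file size and a stop time, writing to a file rather than to a table sitting on the data drive. The path, events, and times here are made up for illustration; the caps are the point.

```sql
-- Rough sketch of a capped, server-side trace. The file path, event list, and
-- stop time are illustrative, not from the original incident.
DECLARE @TraceID int, @maxfilesize bigint, @stoptime datetime, @on bit;
SET @maxfilesize = 1024;                      -- cap each trace file at 1 GB
SET @stoptime   = '20240101 23:00';           -- trace shuts itself off at 11 PM
SET @on         = 1;

-- Option 2 = TRACE_FILE_ROLLOVER; with a file count of 5 the trace can never
-- take more than roughly 5 GB, no matter how busy the evening gets.
EXEC sp_trace_create @TraceID OUTPUT, 2, N'D:\Traces\evening_capture',
     @maxfilesize, @stoptime, 5;

-- Capture only SQL:BatchCompleted (event 12) with a handful of columns.
EXEC sp_trace_setevent @TraceID, 12, 1,  @on;   -- TextData
EXEC sp_trace_setevent @TraceID, 12, 12, @on;   -- SPID
EXEC sp_trace_setevent @TraceID, 12, 13, @on;   -- Duration
EXEC sp_trace_setevent @TraceID, 12, 14, @on;   -- StartTime

EXEC sp_trace_setstatus @TraceID, 1;            -- start the trace
```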