MORE ABOUT ME
Welcome! I’m Thomas…
Resources

March Madness – SQL Server System Tables – sysjobhistory

March Madness – SQL Server System Tables – sysjobhistory

Ever wonder about the history of the jobs that exist inside of SQL Agent? Sure, you could navigate to ‘Jobs’ folder, right-click, then select ‘View History’ and get a nice view of the history for all the jobs. But is that sufficient? Are you confident that the history viewer will give you all of the details that you need to know about the history of your jobs?

It was never enough for me, and that’s why I would always examine the details inside of the sysjobhistory table in the msdb database.

What is this table for?

This table contain details about the execution on the jobs inside of SQL Agent. An important thing to note is that these details are only inserted AFTER the job step has completed. So, you cannot find information on jobs that are currently running. If you need to know which jobs are running right now, then you need to look elsewhere (HINT: it’s named sp_help_job).

Why should you care?

I can tell you why you should care in one word: outliers.

I would dive into the sysjobhistory table to look for outliers. The outliers I wanted to know about were as follows: jobs that ran longer than expected and jobs that had at least one step fail. The job history viewer does not make it easy for you to know which jobs are were running “long” and it also may not be easy for you to find the jobs with steps that have failed (mostly because it is possible for a job step to fail, but the creator of the job could still have that step report a ‘success’, despite the failure. Don’t get me started.)

So if I wanted to find these outliers I needed to collect the information from my sysjobhistory nightly. Let’s take a look.

Say you wanted to know which jobs had steps that failed within the past 24 hours. You would run the following code:

SELECT *
FROM msdb..sysjobhistory
WHERE run_status = 0
AND DATEDIFF(hh, CONVERT(DATETIME, RTRIM(run_date)), GETDATE()) < 24

Remember, just because a job step failed does not mean the job itself would necessarily fail. I was interested to check to see if a step happened to fail as it would give me a chance to check up on the work the job was to have finished and it also gave me a chance to double check my step logic.

The code above also shows one of the biggest PITA with job details in the msdb database. Check out what I had to do with the run_date column. I had to convert (and RTRIM) that column into DATETIME in order to do the DATEDIFF. And why is that? Because the run_date column is stored as an integer. Seriously. Even with the letters “D-A-T-E” in the name of the column, someone at Microsoft decided to store this as an integer, making life a tad more difficult for folks like me that want to mine the data. From where I sit this doesn’t strike me as a flash of genius, but I am certain someone had a good reason for this part of the design.

The next outlier is a little more tricky, as you need to use some math. Not a lot, but enough to make most folks wince. The idea is to return the name of a job that has had any job step run longer than the average duration plus two standard deviations away (i.e., the mean plus two sigmas) within the past 24 hours. Well, at least that is what this code should be doing:

[Editor’s note: code has been updated and check out this post from Brad Schulz for an even better example of finding outliers.]

SELECT sj.name
FROM msdb..sysjobhistory
 INNER JOIN msdb..sysjobs sj ON msdb..sysjobhistory.job_id = sj.job_id
WHERE step_id <> 0 --we want the actual steps, not step zero
AND DATEDIFF(hh, CONVERT(varchar(10), run_date), GETDATE()) > 24
GROUP BY sj.name
HAVING
MAX(CONVERT(INT, LEFT(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),2))*60*60+
 CONVERT(INT,SUBSTRING(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),3,2))*60+
 CONVERT(INT,RIGHT(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),2)) )
>
AVG(CONVERT(INT, LEFT(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),2))*60*60+
 CONVERT(INT,SUBSTRING(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),3,2))*60+
 CONVERT(INT,RIGHT(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),2)) )
+ (2 * (STDEVP (CONVERT(INT, LEFT(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),2))*60*60+
 CONVERT(INT,SUBSTRING(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),3,2))*60+
 CONVERT(INT,RIGHT(STUFF(run_duration,1,0,REPLICATE('0',6 - LEN(run_duration))),2)) ) ))

So, why the average plus two more standard deviations? Because 95% of all the run times should lie within the range of the average +/- two standard deviations away. I am not concerned about run times that finish quickly (but you could argue I should want to know those outliers) and instead I focus on durations that are running too long. Note that if there is a steady increase in run duration (say, a few seconds each day) I might never return a hit for this code, so you may want to write something that compares two periods of time. Even with my OCD I never wanted to go that deep, but feel free to dive wherever you want.

If you look through the msdb sysjobs tables I am sure you will come up with many more uses of the data that is stored there.

Where else is this information?

The msdb database contains all of the necessary details for SQL Agent jobs.

3 Pingbacks/Trackbacks

  • I’m enjoying your series.

    However, one quibble on your last query… I don’t think it will work as expected. You’re comparing MAX(run_duration), which is an integer representation of HHMMSS against the AVG()+2*STDEVP(), which are against the number of TOTAL SECONDS.

    So you’re comparing apples to oranges… i.e., comparing HHMMSS to Seconds.

    I would suggest introducing a CROSS APPLY to get the number of seconds of each entry (This is the shortest/cutest way I know to convert the HHMMSS number to an actual time and then using DATEDIFF to get the #secs):

    CROSS APPLY (
    SELECT NumSecs=
    DATEDIFF(second,’19000101′,
    RIGHT(STUFF(STUFF(
    1000000+run_duration
    ,4,0,’:’)
    ,7,0,’:’)
    ,8))
    ) F_Secs

    Then you can use the NumSecs directly in the comparison:

    HAVING MAX(NumSecs)>
    AVG(NumSecs)+2*STDEVP(NumSecs)

    –Brad

    • Thomas LaRock

      Brad,

      You are right! I am going to pull the code from the post for now until I have time to update it properly.

      • anon

        I don’t get the reasoning for converting the HHMMSS to a string and using text processing. It’s not like you are sanity checking the string. Why not use integer arithmetic?

        declare @t int = 123456 –12:34:56

        select @t/10000*3600 + @i/100%100*60 + @i%100

        • ThomasLaRock

          When I run your example it returns a value of 45296, the number of seconds since midnight. Nice!

          I’m not sure there is any reason for using my version of the code versus an updated version using your method. The real question I have is this: why is MS not using an actual date for that field?!?

  • Hi Tom…

    I posted a followup to this post… Hope you enjoy it:

    http://bradsruminations.blogspot.com/2011/04/sysjobhistory-outliers.html

    –Brad

  • Pingback: Current Long-Running Agent Jobs « wklambert()

  • Pingback: How To Find Long Running Backups in SQL Server | SQLRockstar | Thomas LaRock()

  • Pingback: HOW TO: Find Currently Running Long SQL Agent Jobs | SQLRockstar | Thomas LaRock()