Last year I did a series on the system tables inside of SQL Server, which included this post on the sysjobhistory table in the msdb database. As much as I like (and use) that code from time to time to help people find the outliers with regards to long running jobs, something always nagged me about the concept.
See, as a DBA, I am often needed to fight fires in the here and now. I don’t always have the luxury of being in a proactive mode. Therefore at times I need to be reactive in order to be proactive.
Let me explain that a bit more.
We Didn’t Start The Fire
I want the chance to pour water on hot coals before a breeze blows the ashes onto my house and it burns down. Having a smoke detector in my house doesn’t help me prevent those coals from catching fire, all the smoke detector does is alert me to when something in my house is already on fire. By that point it may be too late.
So I need a way to be a bit more proactive. I need a way to know when things may become a problem, but aren’t necessarily a problem for the entire house. Alerts for things like blocked sessions are one way of getting that done. Another way is to alert on when queries are taking longer than expected. Unfortunately you don’t always know how long a query should take, as those details are not persisted between service restarts.
Know what is persisted between server restarts? Job history, that’s what is persisted. The job history is sitting there inside of the msdb database and just begging for you to convert it into a smoke detector that has value.
In order to do that, we want to take the jobs that are currently running, see how long they have been running for, and compare that to the historical average. By doing so I will be able to find the smoldering coals before they set my house ablaze. In other words, I can react to a situation as part of my effort to be proactive in preventing a larger disaster.
The Code
A few notes about the code below. First, you will want to poll at a decent interval, perhaps every ten or fifteen minutes. Whatever interval you choose you should understand the relation between your polling and the @MinAvgSecsDuration parameter. That parameter is the minimum threshold for job duration average. If I set it to 300, then I won’t be concerned with any jobs that have a historical average less than five minutes in duration. I probably don’t want to poll more than 300 seconds apart because there is a chance I could miss an outlier (say if I was polling every fifteen minutes) because this code is filtering for jobs that are currently running.
Second, there is a parameter (@MinHistExecutions) for the minimum number of successful job executions. I like to set this to five or seven, in order to give me a decent sample size to serve as a base. Combined with the @MinAvgSecsDuration parameter this lets me avoid having a lot of false positives for jobs that are newly added.
Third, at the end of the code is a line that checks to see if the jobs are currently running. That’s the point of the script, after all. However, if you comment this line out then this script would serve as a way for you to run a report, perhaps daily, to look for jobs that had any durations more than two standard deviations away.
The last thing I want to mention is that you can also control the amount of history you use for your average calculation. It is set to grab all history by default, but you could configure it to use a rolling window if desired. For example, you may just want to consider the job history for the previous twelve months.
OK, here is the code:
DISCLAIMER: Do not run code you find on the internet in your production environment without testing it first. Do not use this code if your vision becomes blurred. Seek medical attention if this code runs longer than four hours. On rare occasions this code has been known to cause one or more of the following: nausea, headaches, high blood pressure, popcorn cravings, and the impulse to reformat tabs into spaces. If this code causes your servers to smoke, seek shelter. Do not taunt this code.
/*=============================================
File: long_running_jobs.sql
Author: Thomas LaRock, https://thomaslarock.com/contact-me/
Summary: This script will check to see if any currently
running jobs are running long.
Variables:
@MinHistExecutions - Minimum number of job runs we want to consider
@MinAvgSecsDuration - Threshold for minimum duration we care to monitor
@HistoryStartDate - Start date for historical average
@HistoryEndDate - End date for historical average
These variables allow for us to control a couple of factors. First
we can focus on jobs that are running long enough on average for
us to be concerned with (say, 30 seconds or more). Second, we can
avoid being alerted by jobs that have run so few times that the
average and standard deviations are not quite stable yet. This script
leaves these variables at 1.0, but I would advise you alter them
upwards after testing.
Returns: One result set containing a list of jobs that
are currently running and are running longer than two standard deviations
away from their historical average. The "Min Threshold" column
represents the average plus two standard deviations.
Date: October 3rd, 2012
SQL Server Versions: SQL2005, SQL2008, SQL2008R2, SQL2012
You may alter this code for your own purposes. You may republish
altered code as long as you give due credit.
THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
FITNESS FOR A PARTICULAR PURPOSE.
=============================================*/
DECLARE @HistoryStartDate datetime
,@HistoryEndDate datetime
,@MinHistExecutions int
,@MinAvgSecsDuration int
SET @HistoryStartDate = '19000101'
SET @HistoryEndDate = GETDATE()
SET @MinHistExecutions = 1.0
SET @MinAvgSecsDuration = 1.0
DECLARE @currently_running_jobs TABLE (
job_id UNIQUEIDENTIFIER NOT NULL
,last_run_date INT NOT NULL
,last_run_time INT NOT NULL
,next_run_date INT NOT NULL
,next_run_time INT NOT NULL
,next_run_schedule_id INT NOT NULL
,requested_to_run INT NOT NULL
,request_source INT NOT NULL
,request_source_id SYSNAME NULL
,running INT NOT NULL
,current_step INT NOT NULL
,current_retry_attempt INT NOT NULL
,job_state INT NOT NULL
)
--capture details on jobs
INSERT INTO @currently_running_jobs
EXECUTE master.dbo.xp_sqlagent_enum_jobs 1,''
;WITH JobHistData AS
(
SELECT job_id
,date_executed=msdb.dbo.agent_datetime(run_date, run_time)
,secs_duration=run_duration/10000*3600
+run_duration%10000/100*60
+run_duration%100
FROM msdb.dbo.sysjobhistory
WHERE step_id = 0 --Job Outcome
AND run_status = 1 --Succeeded
)
,JobHistStats AS
(
SELECT job_id
,AvgDuration = AVG(secs_duration*1.)
,AvgPlus2StDev = AVG(secs_duration*1.) + 2*stdevp(secs_duration)
FROM JobHistData
WHERE date_executed >= DATEADD(day, DATEDIFF(day,'19000101',@HistoryStartDate),'19000101')
AND date_executed < DATEADD(day, 1 + DATEDIFF(day,'19000101',@HistoryEndDate),'19000101') GROUP BY job_id HAVING COUNT(*) >= @MinHistExecutions
AND AVG(secs_duration*1.) >= @MinAvgSecsDuration
)
SELECT jd.job_id
,j.name AS [JobName]
,MAX(act.start_execution_date) AS [ExecutionDate]
,AvgDuration AS [Historical Avg Duration (secs)]
,AvgPlus2StDev AS [Min Threshhold (secs)]
FROM JobHistData jd
JOIN JobHistStats jhs on jd.job_id = jhs.job_id
JOIN msdb..sysjobs j on jd.job_id = j.job_id
JOIN @currently_running_jobs crj ON crj.job_id = jd.job_id
JOIN msdb..sysjobactivity AS act ON act.job_id = jd.job_id
AND act.stop_execution_date IS NULL
AND act.start_execution_date IS NOT NULL
WHERE secs_duration > AvgPlus2StDev
AND DATEDIFF(SS, act.start_execution_date, GETDATE()) > AvgPlus2StDev
AND crj.job_state = 1
GROUP BY jd.job_id, j.name, AvgDuration, AvgPlus2StDev
Here is what you will see as a result set:
I hope you find this code useful, I know I have!
Thomas, this script is most appreciative, and one that I will be utilizing in the near future. Thanks for taking the time to post this.
Chris,
You are most welcome, thanks for reading!
This should be a great tool for a DBA – I bet. Thank you very much.
thanks!
sweet.
Thomas- I’ve tested this out by making a long running job…”waitfor” It took a job with a duration of 0 on the Log File Viewer to 5 minutes. Will it pick up anything if the duration shows 0 before? Or is there some other way to test this?
Shelz,
There are parameters at the top of the script:
@MinAvgSecsDuration – Threshold for minimum duration we care to monitor
You can try adjusting that and see how it works.
Tom
This seems great. Will try it out. I am a little unclear what is meant by deviation though. Say a job that runs on average 10 minutes, is running for 12 minutes when this is ran. Will the query return this job with column [Min Threshold (secs)] 120 seconds? Or will it be twice that since it’s multiplied by 2? Ideally I would want this to return something if jobs are running twice as long or more as they do on average. Thanks!
Deviation refers to standard deviation. The code will return jobs that are running for longer than 2 standard deviations away from the mean. HTH
Best Disclaimer ever 🙂
Thomas I know this is old, but was exactly what I was looking for. I made some slight changes to it. Keep up the good work.
https://gist.github.com/tcartwright/e34d1da5b93551a848066ba59c3d55dd
Thank you!