Last year I did a series on the system tables inside of SQL Server, which included this post on the sysjobhistory table in the msdb database. As much as I like (and use) that code from time to time to help people find the outliers with regards to long running jobs, something always nagged me about the concept.
See, as a DBA, I am often needed to fight fires in the here and now. I don’t always have the luxury of being in a proactive mode. Therefore at times I need to be reactive in order to be proactive.
Let me explain that a bit more.
We Didn’t Start The Fire
I want the chance to pour water on hot coals before a breeze blows the ashes onto my house and it burns down. Having a smoke detector in my house doesn’t help me prevent those coals from catching fire, all the smoke detector does is alert me to when something in my house is already on fire. By that point it may be too late.
So I need a way to be a bit more proactive. I need a way to know when things may become a problem, but aren’t necessarily a problem for the entire house. Alerts for things like blocked sessions are one way of getting that done. Another way is to alert on when queries are taking longer than expected. Unfortunately you don’t always know how long a query should take, as those details are not persisted between service restarts.
Know what is persisted between server restarts? Job history, that’s what is persisted. The job history is sitting there inside of the msdb database and just begging for you to convert it into a smoke detector that has value.
In order to do that, we want to take the jobs that are currently running, see how long they have been running for, and compare that to the historical average. By doing so I will be able to find the smoldering coals before they set my house ablaze. In other words, I can react to a situation as part of my effort to be proactive in preventing a larger disaster.
A few notes about the code below. First, you will want to poll at a decent interval, perhaps every ten or fifteen minutes. Whatever interval you choose you should understand the relation between your polling and the @MinAvgSecsDuration parameter. That parameter is the minimum threshold for job duration average. If I set it to 300, then I won’t be concerned with any jobs that have a historical average less than five minutes in duration. I probably don’t want to poll more than 300 seconds apart because there is a chance I could miss an outlier (say if I was polling every fifteen minutes) because this code is filtering for jobs that are currently running.
Second, there is a parameter (@MinHistExecutions) for the minimum number of successful job executions. I like to set this to five or seven, in order to give me a decent sample size to serve as a base. Combined with the @MinAvgSecsDuration parameter this lets me avoid having a lot of false positives for jobs that are newly added.
Third, at the end of the code is a line that checks to see if the jobs are currently running. That’s the point of the script, after all. However, if you comment this line out then this script would serve as a way for you to run a report, perhaps daily, to look for jobs that had any durations more than two standard deviations away.
The last thing I want to mention is that you can also control the amount of history you use for your average calculation. It is set to grab all history by default, but you could configure it to use a rolling window if desired. For example, you may just want to consider the job history for the previous twelve months.
OK, here is the code:
DISCLAIMER: Do not run code you find on the internet in your production environment without testing it first. Do not use this code if your vision becomes blurred. Seek medical attention if this code runs longer than four hours. On rare occasions this code has been known to cause one or more of the following: nausea, headaches, high blood pressure, popcorn cravings, and the impulse to reformat tabs into spaces. If this code causes your servers to smoke, seek shelter. Do not taunt this code.
/*============================================= File: long_running_jobs.sql Author: Thomas LaRock, http://thomaslarock.com/contact-me/ Summary: This script will check to see if any currently running jobs are running long. Variables: @MinHistExecutions - Minimum number of job runs we want to consider @MinAvgSecsDuration - Threshold for minimum duration we care to monitor @HistoryStartDate - Start date for historical average @HistoryEndDate - End date for historical average These variables allow for us to control a couple of factors. First we can focus on jobs that are running long enough on average for us to be concerned with (say, 30 seconds or more). Second, we can avoid being alerted by jobs that have run so few times that the average and standard deviations are not quite stable yet. This script leaves these variables at 1.0, but I would advise you alter them upwards after testing. Returns: One result set containing a list of jobs that are currently running and are running longer than two standard deviations away from their historical average. The "Min Threshold" column represents the average plus two standard deviations. Date: October 3rd, 2012 SQL Server Versions: SQL2005, SQL2008, SQL2008R2, SQL2012 You may alter this code for your own purposes. You may republish altered code as long as you give due credit. THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR FITNESS FOR A PARTICULAR PURPOSE. =============================================*/ DECLARE @HistoryStartDate datetime ,@HistoryEndDate datetime ,@MinHistExecutions int ,@MinAvgSecsDuration int SET @HistoryStartDate = '19000101' SET @HistoryEndDate = GETDATE() SET @MinHistExecutions = 1.0 SET @MinAvgSecsDuration = 1.0 DECLARE @currently_running_jobs TABLE ( job_id UNIQUEIDENTIFIER NOT NULL ,last_run_date INT NOT NULL ,last_run_time INT NOT NULL ,next_run_date INT NOT NULL ,next_run_time INT NOT NULL ,next_run_schedule_id INT NOT NULL ,requested_to_run INT NOT NULL ,request_source INT NOT NULL ,request_source_id SYSNAME NULL ,running INT NOT NULL ,current_step INT NOT NULL ,current_retry_attempt INT NOT NULL ,job_state INT NOT NULL ) --capture details on jobs INSERT INTO @currently_running_jobs EXECUTE master.dbo.xp_sqlagent_enum_jobs 1,'' ;WITH JobHistData AS ( SELECT job_id ,date_executed=msdb.dbo.agent_datetime(run_date, run_time) ,secs_duration=run_duration/10000*3600 +run_duration%10000/100*60 +run_duration%100 FROM msdb.dbo.sysjobhistory WHERE step_id = 0 --Job Outcome AND run_status = 1 --Succeeded ) ,JobHistStats AS ( SELECT job_id ,AvgDuration = AVG(secs_duration*1.) ,AvgPlus2StDev = AVG(secs_duration*1.) + 2*stdevp(secs_duration) FROM JobHistData WHERE date_executed >= DATEADD(day, DATEDIFF(day,'19000101',@HistoryStartDate),'19000101') AND date_executed < DATEADD(day, 1 + DATEDIFF(day,'19000101',@HistoryEndDate),'19000101') GROUP BY job_id HAVING COUNT(*) >= @MinHistExecutions AND AVG(secs_duration*1.) >= @MinAvgSecsDuration ) SELECT jd.job_id ,j.name AS [JobName] ,MAX(act.start_execution_date) AS [ExecutionDate] ,AvgDuration AS [Historical Avg Duration (secs)] ,AvgPlus2StDev AS [Min Threshhold (secs)] FROM JobHistData jd JOIN JobHistStats jhs on jd.job_id = jhs.job_id JOIN msdb..sysjobs j on jd.job_id = j.job_id JOIN @currently_running_jobs crj ON crj.job_id = jd.job_id JOIN msdb..sysjobactivity AS act ON act.job_id = jd.job_id AND act.stop_execution_date IS NULL AND act.start_execution_date IS NOT NULL WHERE secs_duration > AvgPlus2StDev AND DATEDIFF(SS, act.start_execution_date, GETDATE()) > AvgPlus2StDev AND crj.job_state = 1 GROUP BY jd.job_id, j.name, AvgDuration, AvgPlus2StDev
Here is what you will see as a result set:
I hope you find this code useful, I know I have!