HOW TO: Find Currently Running Long SQL Agent Jobs

A while ago I had posted a script to help find jobs that were currently running longer than expected. Since I’ve been sharing some scripts recently, I decided to verify that this script still works for finding long running jobs in SQL Agent for SQL Server 2014.

Why This Script Matters

Because you want the opportunity to know about issues before your end users call to complain, to be proactive. SQL Server doesn’t persist data about queries between restarts of the instance, but job history is persisted between restarts. That means we can do some data analysis on those details to find if there are jobs currently running longer than two standard deviations away from the mean (i.e., you have what is known as an outlier).

The Code

Let’s take a minute to review what the code should be doing.

First, you will want to use this script to poll your running jobs at a decent interval, perhaps every ten or fifteen minutes. Whatever interval you choose you should understand the relation between your polling and the @MinAvgSecsDuration parameter. That parameter is the minimum threshold for job duration average. If I set it to 300, then I won’t be concerned with any jobs that have a historical average less than five minutes in duration. I probably don’t want to poll more than 300 seconds apart because there is a chance I could miss an outlier (say if I was polling every fifteen minutes) because this code is filtering for jobs that are currently running.

Second, there is a parameter (@MinHistExecutions) for the minimum number of successful job executions. I like to set this to five or seven, in order to give me a decent sample size to serve as a base. Combined with the @MinAvgSecsDuration parameter this lets me avoid having a lot of false positives for jobs that are newly added.

Third, at the end of the code is a line (AND crj.job_state = 1) that checks to see if the jobs are currently running. That’s the point of the script, after all. However, if you comment this line out then this script would serve as a way for you to run a report, perhaps daily, to look for jobs that had any durations more than two standard deviations away.

The last thing I want to mention is that you can also control the amount of history you use for your average calculation. It is set to grab all history by default, but you could configure it to use a rolling window if desired. For example, you may just want to consider the job history for the previous twelve months.

As always, feel free to modify the code for your own needs.

Script

Here’s the script, which you can also download here. Also, here’s my usual disclaimer:

Script disclaimer, for people who need to be told this sort of thing:

DISCLAIMERDo not run code you find on the internet in your production environment without testing it first. Do not use this code if your vision becomes blurred. Seek medical attention if this code runs longer than four hours. On rare occasions this code has been known to cause one or more of the following: nausea, headaches, high blood pressure, popcorn cravings, and the impulse to reformat tabs into spaces. If this code causes your servers to smoke, seek shelter. Do not taunt this code.

/*=============================================
  File: long_running_jobs.sql

  Author: Thomas LaRock, https://thomaslarock.com/contact-me/
  
  
HOW TO: Find Currently Running Long SQL Agent Jobs


  Summary: This script will check to see if any currently
                        running jobs are running long. 

  Variables:
    @MinHistExecutions - Minimum number of job executions we want to consider 
    @MinAvgSecsDuration - Threshold for minimum job duration we care to monitor
    @HistoryStartDate - Start date for historical average
    @HistoryEndDate - End date for historical average

  These variables allow for us to control a couple of factors. First
  we can focus on jobs that are running long enough on average for
  us to be concerned with (say, 30 seconds or more). Second, we can
  avoid being alerted by jobs that have run so few times that the
  average and standard deviations are not quite stable yet. This script
  leaves these variables at 1.0, but I would advise you alter them
  upwards after testing.

  Returns: One result set containing a list of jobs that
  are currently running and are running longer than two standard deviations 
  away from their historical average. The "Min Threshold" column
  represents the average plus two standard deviations. 

  Date: August 19th, 2014

  SQL Server Versions: SQL2005, SQL2008, SQL2008R2, SQL2012, SQL2014

  You may alter this code for your own purposes. You may republish
  altered code as long as you give due credit. 

  THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
  OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
  LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
  FITNESS FOR A PARTICULAR PURPOSE.

=============================================*/

DECLARE   @HistoryStartDate datetime 
  ,@HistoryEndDate datetime  
  ,@MinHistExecutions int   
  ,@MinAvgSecsDuration int  

SET @HistoryStartDate = '19000101'
SET @HistoryEndDate = GETDATE()
SET @MinHistExecutions = 1.0
SET @MinAvgSecsDuration = 1.0

DECLARE @currently_running_jobs TABLE (
    job_id UNIQUEIDENTIFIER NOT NULL
    ,last_run_date INT NOT NULL
    ,last_run_time INT NOT NULL
    ,next_run_date INT NOT NULL
    ,next_run_time INT NOT NULL
    ,next_run_schedule_id INT NOT NULL
    ,requested_to_run INT NOT NULL
    ,request_source INT NOT NULL
    ,request_source_id SYSNAME NULL
    ,running INT NOT NULL
    ,current_step INT NOT NULL
    ,current_retry_attempt INT NOT NULL
    ,job_state INT NOT NULL
    ) 

--capture details on jobs
INSERT INTO @currently_running_jobs
EXECUTE master.dbo.xp_sqlagent_enum_jobs 1,''

;WITH JobHistData AS
(
  SELECT job_id
        ,date_executed=msdb.dbo.agent_datetime(run_date, run_time)
        ,secs_duration=run_duration/10000*3600
                      +run_duration%10000/100*60
                      +run_duration%100
  FROM msdb.dbo.sysjobhistory
  WHERE step_id = 0   --Job Outcome
  AND run_status = 1  --Succeeded
)
,JobHistStats AS
(
  SELECT job_id
        ,AvgDuration = AVG(secs_duration*1.)
        ,AvgPlus2StDev = AVG(secs_duration*1.) + 2*stdevp(secs_duration)
  FROM JobHistData
  WHERE date_executed >= DATEADD(day, DATEDIFF(day,'19000101',@HistoryStartDate),'19000101')
  AND date_executed < DATEADD(day, 1 + DATEDIFF(day,'19000101',@HistoryEndDate),'19000101') GROUP BY job_id HAVING COUNT(*) >= @MinHistExecutions
  AND AVG(secs_duration*1.) >= @MinAvgSecsDuration
)
SELECT jd.job_id
      ,j.name AS [JobName]
      ,MAX(act.start_execution_date) AS [ExecutionDate]
      ,AvgDuration AS [Historical Avg Duration (secs)]
      ,AvgPlus2StDev AS [Min Threshhold (secs)]
FROM JobHistData jd
JOIN JobHistStats jhs on jd.job_id = jhs.job_id
JOIN msdb..sysjobs j on jd.job_id = j.job_id
JOIN @currently_running_jobs crj ON crj.job_id = jd.job_id
JOIN msdb..sysjobactivity AS act ON act.job_id = jd.job_id
AND act.stop_execution_date IS NULL
AND act.start_execution_date IS NOT NULL
WHERE DATEDIFF(SS, act.start_execution_date, GETDATE()) > AvgPlus2StDev
AND crj.job_state = 1 -- comment this line if you want to report on all jobs
GROUP BY jd.job_id, j.name, AvgDuration, AvgPlus2StDev

6 thoughts on “HOW TO: Find Currently Running Long SQL Agent Jobs”

  1. I think you need to remove the line:
    WHERE secs_duration > AvgPlus2StDev

    If you leave it in, then the secs_duration from the job history table will need to exist – ie. there will need to be at least one historical record of the job running longer than average+2 standard deviations. So if a job is currently running longer than usual, you wont find out until next time it’s running. It works fine with the other WHERE condition.

    Also you need to update the date in the comments section to August 2014 rather than 2012.

    Reply
    • Andrew,

      Thanks! It would appear that when I tested this code I did have jobs with longer duration, and therefore my result set came back. When I just tried it without that clause, as you suggested, it worked fine. So, I’m removing it, and I updated the comments as you suggested.

      Thanks again for the comment! Much appreciated!

      Tom

      Reply
  2. Tom – Thanks for putting this together. I had written something similar ages ago, but it only considered jobs running a step type that shows up in sysprocesses. Come to find out that the code doesn’t catch job steps that run Powershell (for example).

    Reply
    • Oh, that’s interesting! I wonder why certain job steps are excluded? Now I need to go dig around to find out!

      Reply

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.