Cancel GPU Jobs at 0% Utilization

This is one of the most popular features of Jobstats. This alert automatically cancels jobs with GPUs at 0% utilization. Up to two warning emails can be sent before each job is cancelled.

Elevated Privileges

This alert differs from the others in that it must be run as a user with sufficient privileges to call scancel.

Configuration File

Below is an example configuration-file entry that applies many of the available settings:

cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  first_warning_minutes:   60  # minutes
  second_warning_minutes: 105  # minutes
  cancel_minutes:         120  # minutes
  email_file_first_warning:  "cancel_gpu_jobs_warning_1.txt"
  email_file_second_warning: "cancel_gpu_jobs_warning_2.txt"
  email_file_cancel:         "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  max_interactive_hours: 8
  max_interactive_gpus: 1
  do_not_cancel: False
  warnings_to_admin: True
  admin_emails:
    - admin@institution.edu
  excluded_users:
    - u12345
    - u23456

The settings are explained below:

  • cluster: Specify the cluster name as it appears in the Slurm database.

  • partitions: Specify one or more Slurm partitions.

  • sampling_period_minutes: Number of minutes between executions of this alert. This value must match the interval used in cron (see the cron section below) or whatever scheduler is used.

  • first_warning_minutes: (Optional) Number of minutes that the job must run before the first warning email can be sent.

  • second_warning_minutes: (Optional) Number of minutes that the job must run before the second warning email can be sent.

  • cancel_minutes: (Required) Number of minutes that the job must run before it can be cancelled.

  • email_file_first_warning: (Optional) File to be used for the first warning email.

  • email_file_second_warning: (Optional) File to be used for the second warning email.

  • email_file_cancel: (Required) File to be used for the cancellation email.

  • jobid_cache_path: (Optional) Path to a writable directory where a cache file containing the jobid of each job known to be using the GPUs is stored. This is a binary file with the name .jobid_cache.pkl. Including this setting will eliminate redundant calls to the Prometheus server.

  • max_interactive_hours: (Optional) An interactive job will only be cancelled if its run time limit is greater than max_interactive_hours and the number of allocated GPUs is less than or equal to max_interactive_gpus. Remove these lines if interactive jobs should not receive special treatment. An interactive job is one whose job name starts with either interactive or sys/dashboard. See the sketch after this list.

  • max_interactive_gpus: (Optional) See max_interactive_hours above.

  • gpu_frac_threshold: (Optional) For a given job, let g be the ratio of GPUs with non-zero utilization to the number of allocated GPUs. Jobs with g greater than or equal to gpu_frac_threshold are excluded. For example, if gpu_frac_threshold is 0.8 and a job uses 7 of its 8 allocated GPUs, then the job is excluded since 7/8 = 0.875 ≥ 0.8. Default: 1.0 (see the sketch after this list).

  • nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See the example in the Example Configurations section below.

  • excluded_users: (Optional) List of usernames to exclude from this alert.

  • do_not_cancel: (Optional) If True then scancel will not be called. This is useful for testing only. In this case, one should call the alert with --email --no-emails-to-users. Default: False

  • warnings_to_admin: (Optional) If False then warning emails will not be sent to admin_emails. Only cancellation emails will be sent. Default: True

  • admin_emails: (Optional) List of administrator email addresses that should receive the warning and cancellation emails that are sent to users.
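
The interactive-job rule and the gpu_frac_threshold rule can be expressed compactly. Below is a minimal Python sketch of these two exclusion checks, mirroring the rules exactly as stated above; the function and parameter names are hypothetical and this is not the actual Job Defense Shield implementation:

def is_interactive(jobname: str) -> bool:
    # Per the rule above, an interactive job is one whose job name
    # starts with "interactive" or "sys/dashboard".
    return jobname.startswith(("interactive", "sys/dashboard"))

def excluded_from_cancellation(jobname: str,
                               time_limit_hours: float,
                               gpus_allocated: int,
                               gpus_with_nonzero_util: int,
                               gpu_frac_threshold: float = 1.0,
                               max_interactive_hours: float = 8,
                               max_interactive_gpus: int = 1) -> bool:
    # Rule 1: exclude jobs where the fraction of busy GPUs, g, reaches
    # the threshold (e.g., g = 7/8 = 0.875 >= 0.8 is excluded).
    g = gpus_with_nonzero_util / gpus_allocated
    if g >= gpu_frac_threshold:
        return True
    # Rule 2: an interactive job is only cancelled if its run time
    # limit exceeds max_interactive_hours and it holds at most
    # max_interactive_gpus GPUs; otherwise it is excluded.
    if is_interactive(jobname):
        if not (time_limit_hours > max_interactive_hours
                and gpus_allocated <= max_interactive_gpus):
            return True
    return False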

Note that jobs are not cancelled after exactly cancel_minutes, since the alert only runs every sampling_period_minutes via cron. The same is true for the warning emails.

In Jobstats, a GPU is said to have 0% utilization if all of the measurements made by the NVIDIA exporter are zero. Measurements are typically made every 30 seconds or so; for the actual value, see SAMPLING_PERIOD in the Jobstats config.py.
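
As a minimal sketch of that definition (a hypothetical helper, not the actual Jobstats code), the idle test for a single GPU reduces to:

def gpu_at_zero_utilization(samples: list[float]) -> bool:
    # `samples` is assumed to hold the exporter's utilization readings
    # for one GPU (taken roughly every 30 seconds) over the window of
    # interest. The GPU counts as idle only if every reading is zero.
    return len(samples) > 0 and all(s == 0 for s in samples)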

Example Configurations

The example below will send one warning email and then cancel the jobs:

cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  first_warning_minutes:   60  # minutes
  cancel_minutes:         120  # minutes
  email_file_first_warning:  "cancel_gpu_jobs_warning_1.txt"
  email_file_cancel:         "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu

The example below sends no warnings and simply cancels jobs after 120 minutes:

cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15  # minutes
  cancel_minutes:         120  # minutes
  email_file_cancel:         "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu
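
The example below restricts the alert to specific nodes via nodelist; the node names shown are hypothetical placeholders:

cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
  sampling_period_minutes: 15  # minutes
  cancel_minutes:         120  # minutes
  email_file_cancel:         "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  nodelist:
    - della-l01g1
    - della-l01g2
  admin_emails:
    - admin@institution.edu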

Testing

For testing, be sure to use:

  do_not_cancel: True

Additionally, add the --no-emails-to-users flag:

$ python job_defense_shield.py --cancel-zero-gpu-jobs --email --no-emails-to-users

Learn more about email testing.

First Warning Email

Below is an example email for the first warning (see email/cancel_gpu_jobs_warning_1.txt):

Hi Alan (aturing),

You have GPU job(s) that have been running for nearly 1 hour but appear to not
be using the GPU(s):

       JobID    Cluster Partition  GPUs-Allocated  GPUs-Unused GPU-Util  Hours
     60131148    della     gpu            4             4         0%       1  

Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for 2 hours.

Please consider cancelling the job(s) listed above by using the "scancel"
command:

     $ scancel 60131148

Replying to this automated email will open a support ticket with Research
Computing.

Tags

These tags can be used to generate custom emails:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert (i.e., cluster).
  • <PARTITIONS>: The partitions listed for the alert (i.e., partitions).
  • <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
  • <MINUTES-1ST>: Number of minutes before the first warning is sent (first_warning_minutes).
  • <HOURS-1ST>: Number of hours before the first warning is sent.
  • <CANCEL-MIN>: Number of minutes a job must run before being cancelled (cancel_minutes).
  • <CANCEL-HRS>: Number of hours a job must run before being cancelled.
  • <TABLE>: Table of job data.
  • <JOBSTATS>: jobstats command for the first JobID ($ jobstats 12345678).
  • <SCANCEL>: scancel command for the first JobID ($ scancel 12345678).
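
For example, a template along the following lines (a sketch, not necessarily the exact file shipped as email/cancel_gpu_jobs_warning_1.txt) would render to the first warning email shown above:

<GREETING>

You have GPU job(s) that have been running for nearly <HOURS-1ST> hour(s) but
appear to not be using the GPU(s):

<TABLE>

Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for <CANCEL-HRS> hours.

Please consider cancelling the job(s) listed above by using the "scancel"
command:

     <SCANCEL>

Replying to this automated email will open a support ticket with Research
Computing.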

Second Warning Email

Below is an example email for the second warning (see email/cancel_gpu_jobs_warning_2.txt):

Hi Alan (aturing),

This is a second warning. The jobs below will be cancelled in about 15 minutes
unless GPU activity is detected:

       JobID    Cluster Partition  GPUs-Allocated  GPUs-Unused GPU-Util  Hours
     60131148    della     gpu           4             4          0%      1.6  

Replying to this automated email will open a support ticket with Research
Computing.

Tags

These tags can be used to generate custom emails:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert (i.e., cluster).
  • <PARTITIONS>: The partitions listed for the alert (i.e., partitions).
  • <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
  • <MINUTES-1ST>: Number of minutes before the first warning is sent (first_warning_minutes).
  • <MINUTES-2ND>: Number of minutes before the second warning is sent (second_warning_minutes).
  • <CANCEL-MIN>: Number of minutes a job must run before being cancelled (cancel_minutes).
  • <CANCEL-HRS>: Number of hours a job must run before being cancelled.
  • <TABLE>: Table of job data.
  • <JOBSTATS>: jobstats command for the first JobID ($ jobstats 12345678).
  • <SCANCEL>: scancel command for the first JobID ($ scancel 12345678).

Cancellation Email

Below is an example email (see email/cancel_gpu_jobs_scancel_3.txt):

Hi Alan (aturing),

The jobs below have been cancelled because they ran for more than 2 hours at 0% GPU
utilization:

     JobID    Cluster  Partition    State    GPUs-Allocated GPU-Util  Hours
    60131148   della      gpu     CANCELLED         4          0%      2.1

See our GPU Computing webpage for three common reasons for encountering zero GPU
utilization:

    https://your-institution.edu/knowledge-base/gpu-computing

Replying to this automated email will open a support ticket with Research
Computing.

Tags

These tags can be used to generate custom emails:

  • <GREETING>: The greeting generated by greeting-method.
  • <CLUSTER>: The cluster specified for the alert (i.e., cluster).
  • <PARTITIONS>: The partitions listed for the alert (i.e., partitions).
  • <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
  • <CANCEL-MIN>: Number of minutes a job must run before being cancelled (cancel_minutes).
  • <CANCEL-HRS>: Number of hours a job must run before being cancelled.
  • <TABLE>: Table of job data.
  • <JOBSTATS>: jobstats command for the first JobID ($ jobstats 12345678).
  • <SCANCEL>: scancel command for the first JobID ($ scancel 12345678).

cron

Below is an example crontab for this alert:

PY=/var/spool/slurm/cancel_zero_gpu_jobs/envs/jds-env/bin
JDS=/var/spool/slurm/job_defense_shield
MYLOG=/var/spool/slurm/cancel_zero_gpu_jobs/log
VIOLATION=/var/spool/slurm/job_defense_shield/violations
MAILTO=admin@institution.edu

*/15 * * * * ${PY}/python ${JDS}/job_defense_shield.py --cancel-zero-gpu-jobs --email -M della -r gpu > ${MYLOG}/zero_gpu_utilization.log 2>&1

Note that the alert runs every 15 minutes. This must also be the value of sampling_period_minutes.

Report

There is no report for this alert. To find out which users have the most GPU-hours at 0% utilization, see this alert. If GPU jobs are being automatically cancelled, then no user should be able to waste significant resources.