Cancel GPU Jobs at 0% Utilization
This is one of the most popular features of Jobstats. This alert automatically cancels jobs with GPUs at 0% utilization. Up to two warning emails can be sent before each job is cancelled.
Elevated Privileges
This alert is different from the others in that it must be run as a user with sufficient privileges to call scancel.
Configuration File
Below is an example entry for the configuration file where many of the settings are applied:
cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15   # minutes
  first_warning_minutes: 60     # minutes
  second_warning_minutes: 105   # minutes
  cancel_minutes: 120           # minutes
  email_file_first_warning: "cancel_gpu_jobs_warning_1.txt"
  email_file_second_warning: "cancel_gpu_jobs_warning_2.txt"
  email_file_cancel: "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  max_interactive_hours: 8
  max_interactive_gpus: 1
  do_not_cancel: False
  warnings_to_admin: True
  admin_emails:
    - admin@institution.edu
  excluded_users:
    - u12345
    - u23456
The settings are explained below:
- cluster: Specify the cluster name as it appears in the Slurm database (one cluster per alert).
- partitions: Specify one or more Slurm partitions.
- sampling_period_minutes: Number of minutes between executions of this alert. This number must also be used as the interval in cron (see the cron section below) or in whichever scheduler is used.
- first_warning_minutes: (Optional) Number of minutes that the job must run before the first warning email can be sent.
- second_warning_minutes: (Optional) Number of minutes that the job must run before the second warning email can be sent.
- cancel_minutes: (Required) Number of minutes that the job must run before it can be cancelled.
- email_file_first_warning: (Optional) File to be used for the first warning email.
- email_file_second_warning: (Optional) File to be used for the second warning email.
- email_file_cancel: (Required) File to be used for the cancellation email.
- jobid_cache_path: (Optional) Path to a writable directory where a cache file containing the jobid of each job known to be using the GPUs is stored. This is a binary file with the name .jobid_cache.pkl. Including this setting will eliminate redundant calls to the Prometheus server.
- max_interactive_hours: (Optional) An interactive job will only be cancelled if its run time limit is greater than max_interactive_hours and the number of allocated GPUs is less than or equal to max_interactive_gpus. Remove these lines if interactive jobs should not receive special attention. An interactive job is one with a jobname that starts with either interactive or sys/dashboard.
- max_interactive_gpus: (Optional) See the line above.
- gpu_frac_threshold: For a given job, let g be the ratio of GPUs with non-zero utilization to the number of allocated GPUs. Jobs with g greater than or equal to gpu_frac_threshold will be excluded. For example, if gpu_frac_threshold is 0.8 and a job uses 7 of the 8 allocated GPUs, then it will be excluded since 7/8 > 0.8. Default: 1.0
- nodelist: (Optional) Only apply this alert to jobs that ran on the specified nodes. See the sketch after this list for an example.
- excluded_users: (Optional) List of usernames to exclude from this alert.
- do_not_cancel: (Optional) If True then scancel will not be called. This is useful for testing only. In this case, one should call the alert with --email --no-emails-to-users. Default: False
- warnings_to_admin: (Optional) If False then warning emails will not be sent to admin_emails. Only cancellation emails will be sent. Default: True
- admin_emails: (Optional) List of administrator email addresses that should receive the warning and cancellation emails that are sent to users.
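For illustration, the entry below sketches how gpu_frac_threshold and nodelist might be combined with the required settings. The entry name and node names are placeholders, and nodelist is assumed to accept a list of node names:

cancel-zero-gpu-jobs-2:
  cluster:
    - della
  partitions:
    - gpu
  sampling_period_minutes: 15   # minutes
  cancel_minutes: 120           # minutes
  email_file_cancel: "cancel_gpu_jobs_scancel_3.txt"
  gpu_frac_threshold: 0.8
  nodelist:
    - della-gpu-01
    - della-gpu-02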
Note that jobs are not cancelled after exactly cancel_minutes since the alert is only called every N minutes via cron. The same is true for warning emails.
In Jobstats, a GPU is said to have 0% utilization if all of the measurements made by the NVIDIA exporter are zero. Measurements are typically made every 30 seconds or so. For the actual value, see SAMPLING_PERIOD in config.py for Jobstats.
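As a rough sketch (not the actual Jobstats code), this amounts to requiring that every exporter sample for a GPU be zero:

# Hypothetical utilization samples for one GPU, one value per sampling period.
samples = [0, 0, 0, 0, 0]

# The GPU counts as having 0% utilization only if every sample is zero.
gpu_at_zero_util = all(s == 0 for s in samples)
print(gpu_at_zero_util)  # True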
Example Configurations
The example below will send one warning email and then cancel the jobs:
cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15   # minutes
  first_warning_minutes: 60     # minutes
  cancel_minutes: 120           # minutes
  email_file_first_warning: "cancel_gpu_jobs_warning_1.txt"
  email_file_cancel: "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu
The example below will send no warnings and will only cancel jobs after 120 minutes:
cancel-zero-gpu-jobs-1:
  cluster:
    - della
  partitions:
    - gpu
    - llm
  sampling_period_minutes: 15   # minutes
  cancel_minutes: 120           # minutes
  email_file_cancel: "cancel_gpu_jobs_scancel_3.txt"
  jobid_cache_path: /path/to/writable/directory/
  admin_emails:
    - admin@institution.edu
Testing
For testing, be sure to use:
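do_not_cancel: True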
Additionally, add the --no-emails-to-users flag.
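For example, using the same cluster and partition flags as in the cron entry below:

$ python job_defense_shield.py --cancel-zero-gpu-jobs --email --no-emails-to-users -M della -r gpu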
Learn more about email testing.
First Warning Email
Below is an example email for the first warning (see email/cancel_gpu_jobs_warning_1.txt):
Hi Alan (aturing),
You have GPU job(s) that have been running for nearly 1 hour but appear to not
be using the GPU(s):
JobID Cluster Partition GPUs-Allocated GPUs-Unused GPU-Util Hours
60131148 della gpu 4 4 0% 1
Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for 2 hours.
Please consider cancelling the job(s) listed above by using the "scancel"
command:
$ scancel 60131148
Replying to this automated email will open a support ticket with Research
Computing.
Tags
These tags can be used to generate custom emails:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert (i.e., cluster).
- <PARTITIONS>: The partitions listed for the alert (i.e., partitions).
- <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
- <MINUTES-1ST>: Number of minutes before the first warning is sent (first_warning_minutes).
- <HOURS-1ST>: Number of hours before the first warning is sent.
- <CANCEL-MIN>: Number of minutes a job must run before being cancelled (cancel_minutes).
- <CANCEL-HRS>: Number of hours a job must run before being cancelled.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first JobID ($ jobstats 12345678).
- <SCANCEL>: The scancel command for the first JobID ($ scancel 12345678).
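As an illustration (not necessarily the shipped template), a custom version of email/cancel_gpu_jobs_warning_1.txt might use the tags like this, assuming each tag is replaced with its value when the email is generated:

<GREETING>

You have GPU job(s) that have been running for nearly <HOURS-1ST> hour(s) but
appear to not be using the GPU(s):

<TABLE>

Your jobs will be AUTOMATICALLY CANCELLED if they are found to not be using the
GPUs for <CANCEL-HRS> hours.

Please consider cancelling the job(s) listed above by using the "scancel"
command:

    <SCANCEL>

Replying to this automated email will open a support ticket with Research
Computing.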
Second Warning Email
Below is an example email for the second warning (see email/cancel_gpu_jobs_warning_2.txt):
Hi Alan (aturing),
This is a second warning. The jobs below will be cancelled in about 15 minutes
unless GPU activity is detected:
JobID Cluster Partition GPUs-Allocated GPUs-Unused GPU-Util Hours
60131148 della gpu 4 4 0% 1.6
Replying to this automated email will open a support ticket with Research
Computing.
Tags
These tags can be used to generate custom emails:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert (i.e., cluster).
- <PARTITIONS>: The partitions listed for the alert (i.e., partitions).
- <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
- <MINUTES-1ST>: Number of minutes before the first warning is sent (first_warning_minutes).
- <MINUTES-2ND>: Number of minutes before the second warning is sent (second_warning_minutes).
- <CANCEL-MIN>: Number of minutes a job must run before being cancelled (cancel_minutes).
- <CANCEL-HRS>: Number of hours a job must run before being cancelled.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first JobID ($ jobstats 12345678).
- <SCANCEL>: The scancel command for the first JobID ($ scancel 12345678).
Cancellation Email
Below is an example email (see email/cancel_gpu_jobs_scancel_3.txt):
Hi Alan (aturing),
The jobs below have been cancelled because they ran for more than 2 hours at 0% GPU
utilization:
JobID Cluster Partition State GPUs-Allocated GPU-Util Hours
60131148 della gpu CANCELLED 4 0% 2.1
See our GPU Computing webpage for three common reasons for encountering zero GPU
utilization:
https://your-institution.edu/knowledge-base/gpu-computing
Replying to this automated email will open a support ticket with Research
Computing.
Tags
These tags can be used to generate custom emails:
- <GREETING>: The greeting generated by greeting-method.
- <CLUSTER>: The cluster specified for the alert (i.e., cluster).
- <PARTITIONS>: The partitions listed for the alert (i.e., partitions).
- <SAMPLING>: The sampling period in minutes (sampling_period_minutes).
- <CANCEL-MIN>: Number of minutes a job must run before being cancelled (cancel_minutes).
- <CANCEL-HRS>: Number of hours a job must run before being cancelled.
- <TABLE>: Table of job data.
- <JOBSTATS>: The jobstats command for the first JobID ($ jobstats 12345678).
- <SCANCEL>: The scancel command for the first JobID ($ scancel 12345678).
cron
Below is an example crontab for this alert:
PY=/var/spool/slurm/cancel_zero_gpu_jobs/envs/jds-env/bin
JDS=/var/spool/slurm/job_defense_shield
MYLOG=/var/spool/slurm/cancel_zero_gpu_jobs/log
VIOLATION=/var/spool/slurm/job_defense_shield/violations
MAILTO=admin@institution.edu
*/15 * * * * ${PY}/python ${JDS}/job_defense_shield.py --cancel-zero-gpu-jobs --email -M della -r gpu > ${MYLOG}/zero_gpu_utilization.log 2>&1
Note that the alert is run every 15 minutes. This must also be the value of sampling_period_minutes.
Report
There is no report for this alert. To find out which users have the most GPU-hours at 0% utilization, see this alert. If you are automatically cancelling GPU jobs then no users should be able to waste significant resources.