Low GPU Utilization
This alert finds jobs with low GPU utilization. It ignores jobs with GPUs at 0% utilization since the zero GPU utilization alert catches those cases.
This alert is important because it enables system administrators to identify users on the cluster who are consuming large amounts of GPU-hours at low utilization.
Here is an example of the report:
$ python job_defense_shield.py --low-gpu-efficiencies
Low GPU Efficiencies
---------------------------------------------------------------------------------------------------
     user  cluster  partition  gpu-hours  proportion(%)  eff(%)  jobs  interactive  cores  coverage
---------------------------------------------------------------------------------------------------
1  u76174    della        gpu       3791             20      29    58            0    1.2       1.0
3  u64732    della        gpu       3201             17      50    43            0    1.0       1.0
4  u13301    della        gpu       2281             12      43    35            0    8.0       1.0
The table lists users from the most GPU-hours to the least. The eff(%) column shows the GPU efficiency of each user.
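One way such a table could be derived from per-job records is sketched below. It assumes, purely for illustration, that a user's efficiency is the GPU-hours-weighted mean of per-job GPU utilization and that the proportion is the user's share of all GPU-hours. The `summarize` helper and the field names are hypothetical and are not taken from Job Defense Shield.

```python
from collections import defaultdict


def summarize(jobs):
    """Aggregate per-job records into per-user rows sorted by GPU-hours.

    Each job is a dict with hypothetical keys: user, gpu_hours, gpu_util_pct.
    """
    hours = defaultdict(float)     # total GPU-hours per user
    weighted = defaultdict(float)  # sum of gpu_hours * utilization per user
    count = defaultdict(int)       # number of jobs per user
    for job in jobs:
        hours[job["user"]] += job["gpu_hours"]
        weighted[job["user"]] += job["gpu_hours"] * job["gpu_util_pct"]
        count[job["user"]] += 1

    total = sum(hours.values())
    rows = []
    for user, h in hours.items():
        if h <= 0:
            continue  # avoid dividing by zero for users with no GPU-hours
        rows.append({
            "user": user,
            "gpu_hours": h,
            "proportion_pct": 100 * h / total,  # share of all GPU-hours
            "eff_pct": weighted[user] / h,      # assumed: weighted mean utilization
            "jobs": count[user],
        })
    # Most GPU-hours first, as in the report.
    return sorted(rows, key=lambda r: r["gpu_hours"], reverse=True)


jobs = [
    {"user": "u1", "gpu_hours": 30.0, "gpu_util_pct": 20.0},
    {"user": "u1", "gpu_hours": 10.0, "gpu_util_pct": 60.0},
    {"user": "u2", "gpu_hours": 60.0, "gpu_util_pct": 50.0},
]
rows = summarize(jobs)
for row in rows:
    print(row)
```

Weighting by GPU-hours means a long, poorly utilized job lowers a user's efficiency more than a short one does. Whether the tool weights jobs this way is an assumption of this sketch.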
Configuration File
Below is an example entry for config.yaml:
Minimal configuration for generating reports only (not sending emails to users):
This configuration can be used for reports and sending emails:
low-gpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
  eff_thres_pct: 15         # percent
  absolute_thres_hours: 50  # gpu-hours
  eff_target_pct: 50        # percent
  email_file: "email/low_gpu_efficiency.txt"
  admin_emails:
    - admin@university.edu
    - admin@princeton.edu
    - sysadmin@princeton.edu
The parameters are explained below:
- cluster: Specify the cluster name as it appears in the Slurm database. One cluster name per alert. Use multiple low-gpu-efficiency alerts for multiple clusters.
- partitions: Specify one or more Slurm partitions. The number of GPU-hours is summed over all partitions. In most cases it is better to create a separate alert for each partition.
- eff_thres_pct: Efficiency threshold percentage. Users with a GPU efficiency less than or equal to this value will receive an email.
- absolute_thres_hours: A user must have allocated more than this number of GPU-hours to be considered for an email.
- eff_target_pct: The target value for GPU utilization that users should strive for. It is only used in emails. This value can be referenced as the tag <TARGET> in email messages (see email/low_gpu_efficiencies.txt).
- email_file: The file used as an email body. This file must be located in the directory given by the email-files-path setting in config.yaml. Learn more about writing custom emails.
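The two thresholds above combine into a single decision per user. The sketch below is one reading of the descriptions, not the tool's actual code: `should_email` is a hypothetical helper, and the comparisons follow the wording "less than or equal to" for eff_thres_pct and "more than" for absolute_thres_hours.

```python
def should_email(eff_pct, gpu_hours, cfg):
    """Return True when both basic thresholds described above are met."""
    low_efficiency = eff_pct <= cfg["eff_thres_pct"]        # "less than or equal to"
    enough_usage = gpu_hours > cfg["absolute_thres_hours"]  # "more than"
    return low_efficiency and enough_usage


cfg = {"eff_thres_pct": 15, "absolute_thres_hours": 50}

print(should_email(eff_pct=12, gpu_hours=800, cfg=cfg))   # low efficiency, heavy usage
print(should_email(eff_pct=29, gpu_hours=3791, cfg=cfg))  # efficiency above the threshold
print(should_email(eff_pct=10, gpu_hours=50, cfg=cfg))    # not more than 50 GPU-hours
```

Under this reading, a user at 29% efficiency would not be emailed when eff_thres_pct is 15, regardless of how many GPU-hours were consumed.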
Below is a full set of parameters:
low-gpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
  eff_thres_pct: 15         # percent
  proportion_thres_pct: 2   # percent
  absolute_thres_hours: 50  # gpu-hours
  eff_target_pct: 50        # percent
  num_top_users: 15         # count
  min_run_time: 30          # minutes
  email_file: "email/low_gpu_efficiency.txt"
  excluded_users:
    - aturing
    - einstein
  admin_emails:
    - alerts-jobs-aaaalegbihhpknikkw2fkdx6gi@princetonrc.slack.com
    - halverson@princeton.edu
    - msbc@princeton.edu
- num_top_users: After sorting all users by GPU-hours, only consider the top num_top_users for all remaining calculations and emails. This is used to limit the number of users that receive emails and appear in reports.
- min_run_time: (Optional) The number of minutes that a job must have run to be considered. This can be used to exclude test jobs and experimental jobs. The default is 0.
- proportion_thres_pct: A user must be using at least this proportion of the total GPU-hours (as a percentage) in order to be sent an email. For example, setting this to 2 will exclude all users that are using less than 2% of the total GPU-hours.
- excluded_users: (Optional) List of users to exclude from receiving emails. These users will still appear in reports for system administrators when --report is used.
- user_emails_bcc: (Optional) The emails sent to users will also be sent to these administrator emails. This applies when the --email option is used.
- report_emails: (Optional) Reports will be sent to these administrator emails. This applies when the --report option is used.
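The order in which these additional parameters narrow the set of users can be sketched as follows. This is an illustration under stated assumptions rather than the tool's implementation: `select_candidates` and the job field names are hypothetical, a job is kept when its run time is at least min_run_time, and the total GPU-hours is taken over all jobs that pass that filter.

```python
def select_candidates(jobs, cfg):
    """Return (report_users, email_users) after the optional filters.

    Each job is a dict with hypothetical keys: user, gpu_hours, elapsed_minutes.
    """
    # Drop jobs shorter than min_run_time (minutes); the default of 0 keeps all.
    min_minutes = cfg.get("min_run_time", 0)
    kept = [j for j in jobs if j["elapsed_minutes"] >= min_minutes]

    # Sum GPU-hours per user over the remaining jobs.
    hours = {}
    for j in kept:
        hours[j["user"]] = hours.get(j["user"], 0.0) + j["gpu_hours"]
    total = sum(hours.values())  # assumed: total over all remaining users

    # Sort by GPU-hours and keep only the top num_top_users.
    ranked = sorted(hours.items(), key=lambda kv: kv[1], reverse=True)
    ranked = ranked[: cfg.get("num_top_users", len(ranked))]
    report_users = [user for user, _ in ranked]

    # Emails go only to users at or above proportion_thres_pct who are
    # not listed in excluded_users; excluded users stay in the report.
    min_prop = cfg.get("proportion_thres_pct", 0)
    excluded = set(cfg.get("excluded_users", []))
    email_users = [
        user for user, h in ranked
        if total > 0 and 100 * h / total >= min_prop and user not in excluded
    ]
    return report_users, email_users


jobs = [
    {"user": "aturing", "gpu_hours": 500.0, "elapsed_minutes": 600},
    {"user": "u1", "gpu_hours": 400.0, "elapsed_minutes": 300},
    {"user": "u2", "gpu_hours": 90.0, "elapsed_minutes": 120},
    {"user": "u3", "gpu_hours": 10.0, "elapsed_minutes": 45},
    {"user": "u4", "gpu_hours": 5.0, "elapsed_minutes": 10},  # shorter than min_run_time
]
cfg = {
    "min_run_time": 30,
    "num_top_users": 4,
    "proportion_thres_pct": 2,
    "excluded_users": ["aturing"],
}
report_users, email_users = select_candidates(jobs, cfg)
print(report_users)
print(email_users)
```

In this example u3 holds 1% of the remaining GPU-hours and aturing is excluded, so neither receives an email even though both remain in the report list. The efficiency and absolute GPU-hour thresholds would still apply to the remaining users afterwards.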
How to Write the Email File
Below is the email message that is generated by the template in email/low_gpu_efficiencies.txt:
You can modify the file as you like. The tags that can be used in the email message are:
- <GREETING>: The greeting that will be generated based on the choice of greeting_method in config.yaml. An example is "Hello Alan (aturing),".
- <CLUSTER>: The name of the cluster as defined in config.yaml.
- <PARTITIONS>: A comma-separated list of partitions as defined for the alert in config.yaml.
- <TABLE>: A table of jobs for the user.
- <JOBSTATS>: A line showing how to run the jobstats command on one of the jobs of the user. An example is "$ jobstats 1234567".
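To illustrate how tags in an email file turn into a message, here is a minimal sketch of plain-text tag substitution. The `fill_template` helper, the template wording, and the sample values are all hypothetical. Only the tag names come from this page, including <TARGET>, which is described under eff_target_pct.

```python
def fill_template(template, values):
    """Replace each <TAG> in the template with its value."""
    message = template
    for tag, value in values.items():
        message = message.replace(f"<{tag}>", str(value))
    return message


# Hypothetical template text; the real email file may read differently.
template = (
    "<GREETING>\n\n"
    "Your recent jobs on <CLUSTER> (<PARTITIONS>) had low GPU efficiency. "
    "Please aim for at least <TARGET>%.\n\n"
    "<TABLE>\n\n"
    "<JOBSTATS>\n"
)

values = {
    "GREETING": "Hello Alan (aturing),",
    "CLUSTER": "della",
    "PARTITIONS": "gpu",
    "TARGET": 50,
    "TABLE": "JobID    GPU-hours  Efficiency(%)\n1234567  120        12",
    "JOBSTATS": "$ jobstats 1234567",
}

message = fill_template(template, values)
print(message)
```

A tag that is not present in the file is simply skipped, so an email file only needs to contain the tags it uses.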
Usage
Generate a report of the top users and their GPU efficiencies:
Same as above but over the past month:
Send emails to users with low GPU efficiencies over the past 7 days:
Same as above but only pull data for a specific cluster and partition:
cron
It is recommended to run this alert with a time window of 7 days:
PY=/home/sysadmin/.conda/envs/jds-env/bin
JDS=/home/sysadmin/bin/job_defense_shield
MYLOG=${JDS}/log
0 9 * * * ${PY}/python ${JDS}/job_defense_shield.py --low-gpu-efficiencies --email > ${MYLOG}/low_gpu_efficiencies.log 2>&1
Troubleshooting
You must have a low-gpu-efficiency entry (such as low-gpu-efficiency-1) in config.yaml for this alert to work.