Low CPU Utilization
This option lists the users with the most CPU-hours along with their mean CPU efficiencies. The CPU efficiency is weighted by the number of CPU-cores per job. Jobs with 0% utilization on a node are ignored since they are captured by another alert.
This alert is important because it lets system administrators identify the users who consume the most resources inefficiently.
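As a rough illustration of the weighting described above, the sketch below computes a core-weighted mean efficiency for one user. The job records, the keys `cores` and `eff_pct`, and the function `weighted_cpu_efficiency` are hypothetical and are not taken from the tool's source; the actual calculation may differ in detail.

```python
def weighted_cpu_efficiency(jobs):
    """Return the mean CPU efficiency (percent) weighted by cores per job.

    Each job is a dict with the hypothetical keys 'cores' and 'eff_pct'.
    """
    total_cores = sum(job["cores"] for job in jobs)
    if total_cores == 0:
        return 0.0
    weighted_sum = sum(job["eff_pct"] * job["cores"] for job in jobs)
    return weighted_sum / total_cores


# Hypothetical jobs for one user: a small efficient job and a large inefficient one.
jobs = [
    {"cores": 1, "eff_pct": 90.0},
    {"cores": 31, "eff_pct": 10.0},
]
print(weighted_cpu_efficiency(jobs))  # 12.5
```

With this weighting the 31-core job dominates, so the result is 12.5% rather than the unweighted mean of 50%.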
Report for System Administrators
Here is an example of the report:
$ python job_defense_shield.py --low-cpu-efficiency
                                      Low GPU Efficiencies
-------------------------------------------------------------------------------------------------
     user  cluster  partition  gpu-hours  proportion(%)  eff(%)  jobs  interactive  cores  coverage
-------------------------------------------------------------------------------------------------
1  u76174    della        gpu       3791             20      29    58            0    1.2       1.0
3  u64732    della        gpu       3201             17      50    43            0    1.0       1.0
4  u13301    della        gpu       2281             12      43    35            0    8.0       1.0
The table lists users from the most GPU-hours to the least. The eff(%) column gives the efficiency of each user.
To receive the report by email, specify your email address in admin_emails in the configuration file entry for this alert.
Configuration File
Below is an example entry for config.yaml:
Minimal configuration for generating reports only (not sending emails to users):
This configuration can be used for reports and sending emails:
low-gpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
  eff_thres_pct: 15         # percent
  absolute_thres_hours: 50  # gpu-hours
  eff_target_pct: 50        # percent
  email_file: "email/low_gpu_efficiency.txt"
  admin_emails:
    - admin@university.edu
    - admin@princeton.edu
    - sysadmin@princeton.edu
The parameters are explained below:
- cluster: Specify the cluster name as it appears in the Slurm database. One cluster name per alert. Use multiple alerts for multiple clusters.
- partitions: Specify one or more Slurm partitions. The number of GPU-hours is summed over all partitions. In most cases it is better to create a separate alert for each partition.
- eff_thres_pct: Efficiency threshold percentage. Users with an efficiency less than or equal to this value will receive an email, provided they also satisfy the other thresholds.
- absolute_thres_hours: A user must have allocated more than this number of GPU-hours to be considered for an email.
- eff_target_pct: The target value for GPU utilization that users should strive for. It is only used in emails. This value can be referenced as the tag <TARGET> in email messages (see email/low_gpu_efficiencies.txt).
- email_file: The file used as the email body. This file must be found in the directory given by the email-files-path setting in config.yaml. Learn more about writing custom emails.
Below is a full set of parameters:
low-gpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
  eff_thres_pct: 15         # percent
  proportion_thres_pct: 2   # percent
  absolute_thres_hours: 50  # gpu-hours
  eff_target_pct: 50        # percent
  num_top_users: 15         # count
  min_run_time: 30          # minutes
  email_file: "email/low_gpu_efficiency.txt"
  excluded_users:
    - aturing
    - einstein
  admin_emails:
    - alerts-jobs-aaaalegbihhpknikkw2fkdx6gi@princetonrc.slack.com
    - halverson@princeton.edu
    - msbc@princeton.edu
- num_top_users: After sorting all users by GPU-hours, only consider the top num_top_users for all remaining calculations and emails. This is used to limit the number of users that receive emails and appear in reports.
- min_run_time: (Optional) The number of minutes that a job must have run to be considered. This can be used to exclude test jobs and experimental jobs. The default is 0.
- proportion_thres_pct: A user must be using at least this proportion of the total GPU-hours (as a percentage) in order to be sent an email. For example, setting this to 2 will exclude all users that are using less than 2% of the total GPU-hours.
- excluded_users: (Optional) List of users to exclude from receiving emails. These users will still appear in reports for system administrators when --report is used.
- user_emails_bcc: (Optional) The emails sent to users will also be sent to these administrator emails. This applies when the --email option is used.
- report_emails: (Optional) Reports will be sent to these administrator emails. This applies when the --report option is used.
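To make the interaction of these thresholds concrete, here is a minimal sketch of how the per-user tests described above could be combined. It is an illustration only: the function `select_users_to_email`, the per-user keys `user`, `hours` and `eff_pct`, and the order of the checks are assumptions, not the tool's actual implementation. The per-job min_run_time filter is not modeled.

```python
def select_users_to_email(users, cfg):
    """Return the usernames that pass every threshold in cfg.

    `users` is a list of dicts with hypothetical keys 'user', 'hours', 'eff_pct'.
    `cfg` uses the parameter names from the configuration entry.
    """
    # Assumption: the total is taken over all users, before the top-N cut.
    total_hours = sum(u["hours"] for u in users)
    ranked = sorted(users, key=lambda u: u["hours"], reverse=True)
    top = ranked[: cfg["num_top_users"]]
    excluded = set(cfg.get("excluded_users", []))
    selected = []
    for u in top:
        proportion_pct = 100 * u["hours"] / total_hours if total_hours else 0
        if (
            u["eff_pct"] <= cfg["eff_thres_pct"]
            and u["hours"] > cfg["absolute_thres_hours"]
            and proportion_pct >= cfg["proportion_thres_pct"]
            and u["user"] not in excluded
        ):
            selected.append(u["user"])
    return selected


cfg = {
    "eff_thres_pct": 15,
    "proportion_thres_pct": 2,
    "absolute_thres_hours": 50,
    "num_top_users": 15,
    "excluded_users": ["aturing", "einstein"],
}
# Hypothetical usage data.
users = [
    {"user": "u1", "hours": 1000, "eff_pct": 10},      # passes every test
    {"user": "u2", "hours": 800, "eff_pct": 60},       # efficiency above threshold
    {"user": "aturing", "hours": 500, "eff_pct": 5},   # excluded user
    {"user": "u3", "hours": 40, "eff_pct": 5},         # too few hours
]
print(select_users_to_email(users, cfg))  # ['u1']
```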
How to Write the Email File
Below is the email message that is generated by the template in email/low_gpu_efficiencies.txt:
You can modify the file as you like. The tags that can be used in the email message are:
- <GREETING>: The greeting that will be generated based on the choice of greeting_method in config.yaml. An example is "Hello Alan (aturing),".
- <CLUSTER>: The name of the cluster as defined in config.yaml.
- <PARTITIONS>: A comma-separated list of partitions as defined for the alert in config.yaml.
- <TABLE>: A table of jobs for the user.
- <JOBSTATS>: A line showing how to run the jobstats command on one of the jobs of the user. An example is "$ jobstats 1234567".
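The tag mechanism can be pictured as plain text substitution. The sketch below is only an illustration under that assumption: the function `fill_template`, the template wording, and the sample values are hypothetical and do not reproduce the shipped email file or the tool's own code.

```python
def fill_template(template, values):
    """Replace each <TAG> in the template with its corresponding value."""
    message = template
    for tag, text in values.items():
        message = message.replace(f"<{tag}>", text)
    return message


# Hypothetical template text that uses the tags listed above.
template = (
    "<GREETING>\n\n"
    "Your recent jobs on <CLUSTER> (<PARTITIONS>) have low efficiency:\n\n"
    "<TABLE>\n\n"
    "To inspect one of these jobs, run:\n\n"
    "    <JOBSTATS>\n"
)
# Hypothetical values; the tool would generate these for each user.
values = {
    "GREETING": "Hello Alan (aturing),",
    "CLUSTER": "della",
    "PARTITIONS": "gpu",
    "TABLE": "(table of jobs)",
    "JOBSTATS": "$ jobstats 1234567",
}
print(fill_template(template, values))
```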
Usage
Generate a report of the top users and their CPU efficiencies:
Same as above but over the past month:
Send emails to users with low CPU efficiencies over the past 7 days:
Same as above but only pull data for a specific cluster and partition:
cron
It is recommended to run this alert with a time window of 7 days:
PY=/home/sysadmin/.conda/envs/jds-env/bin
JDS=/home/sysadmin/bin/job_defense_shield
MYLOG=${JDS}/log
0 9 * * * ${PY}/python ${JDS}/job_defense_shield.py --low-cpu-efficiency --email > ${MYLOG}/low_cpu_efficiency.log 2>&1
Troubleshooting
You must have a low-cpu-efficiency entry in config.yaml for this alert to work.
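A quick way to check for this is to look at the top-level keys of the parsed configuration. The sketch below assumes that config.yaml has already been loaded into a dictionary (for example with a YAML library) and that an entry counts if its name begins with the alert name, as with the numbered entries shown earlier. The function `has_alert_entry` is hypothetical and is not part of the tool.

```python
def has_alert_entry(config, alert_name="low-cpu-efficiency"):
    """Return True if any top-level key of the parsed config starts with alert_name."""
    return any(str(key).startswith(alert_name) for key in config)


# Hypothetical parsed configuration with one numbered entry for this alert.
config = {"low-cpu-efficiency-1": {"cluster": "della"}}
print(has_alert_entry(config))  # True
```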