Low CPU Utilization

This option lists the users with the most CPU-hours along with their mean CPU efficiencies. The CPU efficiency is weighted by the number of CPU-cores per job. Jobs with 0% utilization on a node are ignored since they are captured by another alert.
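
As a rough illustration of the weighting described above, the following sketch computes a core-weighted mean efficiency for one user. The function name, the (cores, efficiency) data layout and the exact formula are assumptions made for this example, and the 0% filter is simplified to whole jobs rather than individual nodes. It is not the code used by Job Defense Shield.

```python
def weighted_cpu_efficiency(jobs):
    """Return the core-weighted mean CPU efficiency (percent) of a list of jobs.

    Each job is a (cores, efficiency_pct) pair. Both the data layout and the
    weighting formula are assumptions made for illustration.
    """
    total_cores = sum(cores for cores, _ in jobs)
    if total_cores == 0:
        return 0.0
    return sum(cores * eff for cores, eff in jobs) / total_cores


# Hypothetical jobs for one user: (cores, efficiency in percent).
jobs = [(1, 90.0), (32, 20.0), (8, 50.0), (16, 0.0)]

# Drop jobs with 0% efficiency, mirroring the rule in the text above
# (simplified here to whole jobs instead of per-node utilization).
jobs = [(cores, eff) for cores, eff in jobs if eff > 0]

print(round(weighted_cpu_efficiency(jobs), 1))
```

Because the 32-core job dominates the weights, the result sits much closer to 20% than the unweighted mean of the three remaining jobs would.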

This alert is important because it enables system administrators to identify users that are using the most resources in an inefficient way.

Report for System Administrators

Here is an example of the report:

$ python job_defense_shield.py --low-cpu-efficiency

                                       Low CPU Efficiencies
-------------------------------------------------------------------------------------------------
    user  cluster partition  cpu-hours  proportion(%)  eff(%)  jobs  interactive  cores  coverage
-------------------------------------------------------------------------------------------------
1  u76174   della     gpu      3791          20          29     58        0        1.2      1.0  
3  u64732   della     gpu      3201          17          50     43        0        1.0      1.0  
4  u13301   della     gpu      2281          12          43     35        0        8.0      1.0

The table lists users from the most CPU-hours to the least. The mean CPU efficiency of each user is given in the eff(%) column.

To receive the report by email:

$ python job_defense_shield.py --low-cpu-efficiency --report

List the recipient addresses under admin_emails in the configuration file entry for this alert.

Configuration File

Below is an example entry for config.yaml:

Minimal configuration for generating reports only (not sending emails to users):

low-cpu-efficiency-1:
  cluster: della
  partitions:
    - gpu

This configuration can be used for reports and sending emails:

low-cpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
  eff_thres_pct: 15         # percent
  absolute_thres_hours: 50  # cpu-hours
  eff_target_pct: 50        # percent
  email_file: "email/low_cpu_efficiency.txt"
  admin_emails:
    - admin@university.edu
    - admin@princeton.edu
    - sysadmin@princeton.edu

The parameters are explained below:

cluster: Specify the cluster name as it appears in the Slurm database. One cluster name per alert. Use multiple low-cpu-efficiency alerts for multiple clusters.
partitions: Specify one or more Slurm partitions. The number of CPU-hours is summed over all partitions. In most cases it is better to create a separate alert for each partition.
eff_thres_pct: Efficiency threshold percentage. Users with a mean CPU efficiency less than or equal to this value will receive an email, provided they also satisfy the other thresholds of the alert (for example, absolute_thres_hours).
absolute_thres_hours: A user must have allocated more than this number of CPU-hours to be considered for an email.
eff_target_pct: The target value for CPU efficiency that users should strive for. It is only used in emails. This value can be referenced as the tag <TARGET> in email messages (see email/low_cpu_efficiency.txt).
email_file: The file used as the email body. This file must be located in the directory given by the email-files-path setting in config.yaml. Learn more about writing custom emails.

Below is a full set of parameters:

low-cpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
  eff_thres_pct: 15         # percent
  proportion_thres_pct: 2   # percent
  absolute_thres_hours: 50  # cpu-hours
  eff_target_pct: 50        # percent
  num_top_users: 15         # count
  min_run_time: 30          # minutes
  email_file: "email/low_cpu_efficiency.txt"
  excluded_users:
    - aturing
    - einstein
  admin_emails:
    - alerts-jobs-aaaalegbihhpknikkw2fkdx6gi@princetonrc.slack.com
    - halverson@princeton.edu
    - msbc@princeton.edu

num_top_users: After sorting all users by CPU-hours, only consider the top num_top_users for all remaining calculations and emails. This is used to limit the number of users that receive emails and appear in reports.
min_run_time: (Optional) The number of minutes that a job must have run to be considered. This can be used to exclude test jobs and experimental jobs. The default is 0.
proportion_thres_pct: A user must be using at least this proportion of the total CPU-hours (as a percentage) in order to be sent an email. For example, setting this to 2 will exclude all users that are using less than 2% of the total CPU-hours.
excluded_users: (Optional) List of users to exclude from receiving emails. These users will still appear in reports for system administrators when --report is used.
user_emails_bcc: (Optional) The emails sent to users will also be sent to these administrator emails. This applies when the --email option is used.
report_emails: (Optional) Reports will be sent to these administrator emails. This applies when the --report option is used.
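
To make the interplay of these parameters concrete, here is a minimal sketch of how the thresholds could be combined to decide who receives an email. The function name, the dictionary keys ("user", "cpu_hours", "eff_pct") and the order of the checks are assumptions for illustration. The actual implementation may differ.

```python
def users_to_email(users, eff_thres_pct, absolute_thres_hours,
                   proportion_thres_pct, num_top_users, excluded_users=()):
    """Return the usernames that pass every threshold described above.

    `users` is a list of dicts with the hypothetical keys "user",
    "cpu_hours" and "eff_pct". The order of the checks is an assumption.
    """
    total_hours = sum(u["cpu_hours"] for u in users)
    # Sort by CPU-hours (descending) and keep only the top num_top_users.
    top = sorted(users, key=lambda u: u["cpu_hours"], reverse=True)[:num_top_users]
    selected = []
    for u in top:
        proportion_pct = 100 * u["cpu_hours"] / total_hours if total_hours else 0.0
        if (u["eff_pct"] <= eff_thres_pct
                and u["cpu_hours"] > absolute_thres_hours
                and proportion_pct >= proportion_thres_pct
                and u["user"] not in excluded_users):
            selected.append(u["user"])
    return selected


# Hypothetical per-user totals over the time window.
users = [
    {"user": "u1", "cpu_hours": 5000, "eff_pct": 10},
    {"user": "u2", "cpu_hours": 3000, "eff_pct": 60},
    {"user": "u3", "cpu_hours": 40, "eff_pct": 5},
    {"user": "aturing", "cpu_hours": 2000, "eff_pct": 12},
]
print(users_to_email(users, eff_thres_pct=15, absolute_thres_hours=50,
                     proportion_thres_pct=2, num_top_users=15,
                     excluded_users=["aturing"]))
```

In this example only u1 is selected: u2 is above the efficiency threshold, u3 has too few CPU-hours, and aturing is excluded.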

How to Write the Email File

Below is an example of the email message generated from the template in email/low_cpu_efficiency.txt:

Dear Alan,

You have a low CPU efficiency.

You can modify the file as you like. The tags that can be used in the email message are:

<GREETING>: The greeting that will be generated based on the choice of greeting_method in config.yaml. An example is "Hello Alan (aturing),".
<CLUSTER>: The name of the cluster as defined in config.yaml.
<PARTITIONS>: A comma-separated list of partitions as defined for the alert in config.yaml.
<TABLE>: A table of jobs for the user.
<JOBSTATS>: A line showing how to run the jobstats command on one of the jobs of the user. An example is $ jobstats 1234567.
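
The tags above behave like placeholders that are replaced with text when the email is generated. The sketch below shows one simple way such a substitution could work. The render_email function, the template wording and the sample values are hypothetical and are not taken from the Job Defense Shield source.

```python
def render_email(template, values):
    """Replace each <TAG> in the template with the matching value."""
    for tag, value in values.items():
        template = template.replace(f"<{tag}>", value)
    return template


# Hypothetical template text using the tags listed above.
template = (
    "<GREETING>\n\n"
    "Your jobs on <CLUSTER> (<PARTITIONS>) have a low CPU efficiency:\n\n"
    "<TABLE>\n\n"
    "To inspect one of these jobs, run:\n\n"
    "<JOBSTATS>\n"
)

# Sample values; the software would fill these in from the job data.
values = {
    "GREETING": "Hello Alan (aturing),",
    "CLUSTER": "della",
    "PARTITIONS": "gpu",
    "TABLE": "(table of jobs)",
    "JOBSTATS": "$ jobstats 1234567",
}

print(render_email(template, values))
```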

Usage

Generate a report of the top users and their CPU efficiencies:

$ python job_defense_shield.py --low-cpu-efficiency

Same as above but over the past month:

$ python job_defense_shield.py --low-cpu-efficiency --days=30

Send emails to users with low CPU efficiencies over the past 7 days:

$ python job_defense_shield.py --low-cpu-efficiency --email

Same as above but only pull data for a specific cluster and partition:

$ python job_defense_shield.py --low-cpu-efficiency --email -M traverse -r gpu

cron

It is recommended to run this alert with a time window of 7 days:

PY=/home/sysadmin/.conda/envs/jds-env/bin
JDS=/home/sysadmin/bin/job_defense_shield
MYLOG=${JDS}/log

0 9 * * * ${PY}/python ${JDS}/job_defense_shield.py --low-cpu-efficiency --email > ${MYLOG}/low_cpu_efficiency.log 2>&1

Troubleshooting

You must have a low-cpu-efficiency entry in config.yaml for this alert to work.
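
A quick way to check this requirement is to scan config.yaml for a top-level key that begins with low-cpu-efficiency. The helper below is a hypothetical convenience script, not part of Job Defense Shield. It uses a plain text scan instead of a YAML parser and assumes that alert entries are named like low-cpu-efficiency-1, as in the examples above.

```python
import re


def has_alert_entry(config_text, alert="low-cpu-efficiency"):
    """Return True if a top-level key starting with `alert` is present.

    This is a plain text scan for an unindented "name:" line. It does not
    validate the YAML itself.
    """
    pattern = re.compile(rf"^{re.escape(alert)}[\w-]*\s*:", re.MULTILINE)
    return bool(pattern.search(config_text))


# Sample configuration text; in practice, read the contents of config.yaml.
sample = """\
low-cpu-efficiency-1:
  cluster: della
  partitions:
    - gpu
"""
print(has_alert_entry(sample))
```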

See low CPU utilization