Configuration
Job Defense Shield is configured using a YAML file, which specifies global settings as well as the individual alerts to run.
Below is a minimal configuration file (config.yaml
) with one alert:
%YAML 1.1
---
#####################
## GLOBAL SETTINGS ##
#####################
jobstats-module-path: /path/to/jobstats/
violation-logs-path: /path/to/violations/
email-files-path: /path/to/email/
email-domain-name: "@institution.edu"
sender: support@institution.edu
reply-to: support@institution.edu
greeting-method: getent
workday-method: file
holidays-file: /path/to/holidays.txt
report-emails:
- admin1@institution.edu
- admin2@institution.edu
##################################
## ZERO CPU UTILIZATION (ALERT) ##
##################################
zero-cpu-utilization-1:
cluster: stellar
partitions:
- cpu
- bigmem
min_run_time: 61 # minutes
email_file: "zero_cpu_utilization.txt"
admin_emails:
- job-alerts-aaaalegbihhpknikkw2fkdx6gi@institution.slack.com
- admin@institution.edu
See a full example of config.yaml in the GitHub repository.
Global Settings
Jobstats Files
The first entry in config.yaml
is the path to jobstats.py
and config.py
:
These two files are needed for getting the summary statistics of actively running jobs. To be clear, /path/to/jobstats
should contain these files:
The value of PROM_SERVER
is taken from config.py
.
If you are not interested in addressing the underutilization of actively running jobs then just set this to a dummy path (e.g., /tmp
).
Violation Logs
One must specifiy the path to a writable directory to store the underutilization history of each user:
The files stored in this directory are read when deciding whether or not sufficient time has passed to send the user another email. These files are important and should be backed up.
Email Settings
Set the path to your email files. These are the files that are read and enhanced by the software. A set of examples in found in the email
directory of the job_defense_shield
GitHub repository.
Specify the email domain for your institution:
Usernames will be concatenated with the email domain to make email addresses.
Specify the sender
and reply-to
values for sending emails:
Tip
By using a reply-to
that is different from sender
, one can prevent auto-reply or out-of-office emails from creating new support tickets.
Use the greeting-method
to determine the first line of the email that users receive:
If you find that getent
is not working properly during testing then use basic
.
Lastly, one can create multiple reports and have those sent to administrators by email when the --report
flag is used:
Workdays
Email alerts are only sent to users on workdays. Pick a method to distinguish the workdays from weekends and holidays. The most flexible method is file
:
The file holidays.txt
should be a list of dates with the format YYYY-MM-DD:
If you only want to avoid weekends and U.S. Federal holidays then use:
If every day is a workday then:
The cron
setting can be used to avoid weekends so really this section is about dealing with holidays.
Other Settings
Partition names can be renamed:
If a partition is renamed then the new name must be used throughout the configuration file.
For users that do not use their institutional email address, one can specify external addresses:
Specifying a Custom Configuration File
By default the software will look for config.yaml
in the same directory as job_defense_shield.py
. One can override this behavior by using the --config-file
option:
The ability to use different configuration files provides additional flexibility. For instance, for some institutions it may make sense to have a different configuration file for each cluster or for different alerts.
Each Alert Must Have a Different Name
Consider the following two alerts (pay attention to the alert names):
zero-cpu-utilization-1:
cluster: stellar
partitions:
- cpu
email_file: "zero_cpu_utilization.txt"
admin_emails:
- admin@institution.edu
zero-cpu-utilization-1:
cluster: della
partitions:
- physics
email_file: "zero_cpu_utilization.txt"
admin_emails:
- admin@institution.edu
While the two alerts above are written for different clusters, only the second one will run since both alerts have the same name (zero-cpu-utilization-1
).
Danger
Make sure each alert name has a different number at the end. An alert with the same name as one previously defined will override the previous alert.
The corrected version would be:
zero-cpu-utilization-1:
cluster: stellar
partitions:
- cpu
email_file: "zero_cpu_utilization.txt"
admin_emails:
- admin@institution.edu
zero-cpu-utilization-2:
cluster: della
partitions:
- physics
email_file: "zero_cpu_utilization.txt"
admin_emails:
- admin@institution.edu
The second alert now has the name zero-cpu-utilization-2
.
Include or Fully Remove a Setting
There are many optional settings for each alert. If you do not want to use an optional setting then fully remove the line.
The following is incorrect for min_run_time
:
zero-cpu-utilization-1:
cluster: stellar
partitions:
- cpu
min_run_time:
email_file: "zero_cpu_utilization.txt"
admin_emails:
- admin@institution.edu
If the default value for min_run_time
should be used then completely remove the line. Here is the correct entry:
zero-cpu-utilization-1:
cluster: stellar
partitions:
- cpu
email_file: "zero_cpu_utilization.txt"
admin_emails:
- admin@institution.edu
Another correct way is to specify the value:
zero-cpu-utilization-1:
cluster: stellar
partitions:
- cpu
min_run_time: 0 # minutes
email_file: "zero_cpu_utilization.txt"
admin_emails:
- admin@institution.edu
Full Example Configuration File
For more examples see config.yaml in the GitHub repository.
Writing and Testing Custom Emails
See the next section to learn about sending custom emails to your users.