The checker works with both Python 2 and Python 3 and requires the PyYAML package. See the exact versions tested below.
To check a log file for compliance:
python -m mlperf_logging.compliance_checker [--config YAML] [--usage training/hpc] [--ruleset MLPERF_EDITION] FILENAME
By default, the 4.1.0 training edition rules are used, and the default config is set to 4.1.0/common.yaml.
This config checks all common keys and enqueues the benchmark-specific config to be checked as well.
Older training editions that are still supported are 4.0.0, 3.1.0, 3.0.0, 2.1.0, 2.0.0, 1.1.0, 1.0.0, 0.7.0 and 0.6.0.
To check hpc compliance rules (only 1.0.0 ruleset is supported), set --usage hpc --ruleset 1.0.0.
The checker prints SUCCESS when no issues are found; otherwise it prints error details.
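
The invocation above can also be scripted. The following is a minimal sketch, assuming the mlperf_logging package is importable by the current interpreter and that Python 3.7 or newer is used (for subprocess.run with capture_output). The helper names build_checker_command and check_log are hypothetical and not part of the checker; only the flags shown in the usage line are used. Because this document does not say how the exit code relates to compliance, the sketch looks for the SUCCESS line in the output instead.

```python
import subprocess
import sys


def build_checker_command(filename, usage="training", ruleset="4.1.0", config=None):
    """Build the argument list for the compliance checker CLI (hypothetical helper)."""
    cmd = [sys.executable, "-m", "mlperf_logging.compliance_checker",
           "--usage", usage, "--ruleset", ruleset]
    if config is not None:
        cmd += ["--config", config]
    cmd.append(filename)
    return cmd


def check_log(filename, **kwargs):
    """Run the checker and return (passed, combined_output).

    Assumption: SUCCESS is printed to stdout or stderr on a clean run.
    """
    result = subprocess.run(build_checker_command(filename, **kwargs),
                            capture_output=True, text=True)
    output = result.stdout + result.stderr
    return "SUCCESS" in output, output
```

For example, check_log("run.log", usage="hpc", ruleset="1.0.0") would run the hpc rules against a file named run.log.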
For log examples, use NVIDIA's training logs.
4.1.0/common.yaml - currently the default config file, checks common fields compliance and enqueues the benchmark-specific config file
4.1.0/closed_common.yaml - the common rules file for closed submissions. These rules apply to all benchmarks
4.1.0/open_common.yaml - the common rules file for open submissions. These rules apply to all benchmarks
4.1.0/closed_ssd.yaml - Per-benchmark rules, closed submissions.
4.1.0/closed_bert.yaml
4.1.0/closed_dlrm_dcnv2.yaml
4.1.0/closed_gpt3.yaml
4.1.0/closed_gnn.yaml
4.1.0/closed_llama2_70b_lora.yaml
4.1.0/closed_stable_diffusion.yaml
4.1.0/open_ssd.yaml - Per-benchmark rules, open submissions.
4.1.0/open_bert.yaml
4.1.0/open_dlrm_dcnv2.yaml
4.1.0/open_gpt3.yaml
4.1.0/open_gnn.yaml
4.1.0/open_llama2_70b_lora.yaml
4.1.0/open_stable_diffusion.yaml
Compliance checking follows the algorithm below.
- The parser converts the log into a list of records; each record corresponds to an MLLOG line and contains all relevant extracted information
- The set of rules to be checked is loaded from the provided config YAML file
- Process the optional BEGIN rule, if present, by executing the provided CODE section
- Remove messages for rules that are overridden
- Loop through the records of the log
    - If the key in the record is defined in the rules, process the rule:
        - If present, execute the PRE section
        - If present, evaluate the CHECK section, and store a warning message if the result is false
        - If present, execute the POST section
    - Increment the occurrences counter
- Store a warning message if any occurrence requirements were violated
- Process the optional END rule, if present:
    - If present, execute PRE
    - If present, evaluate the CHECK section, and raise an exception if the result is false
- Print all warning messages
Executing YAML sections can have side effects, such as printing output or enqueueing additional YAML files to be verified.
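
The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the checker's actual implementation: the Record type, the run_rules function and the dictionary layout of rules (BEGIN, KEYS, END) are hypothetical, only two REQ values are handled, and the removal of overridden messages is omitted.

```python
import collections

# Hypothetical record type; the real parser output may differ.
Record = collections.namedtuple("Record", ["key", "timestamp", "value"])


def run_rules(rules, records):
    """Apply rules to records and return the collected warning messages."""
    s = {}                              # global state for this config
    warnings = []
    counts = collections.Counter()      # occurrences per key
    key_rules = rules.get("KEYS", {})

    # Optional BEGIN rule: execute its CODE before anything else.
    begin = rules.get("BEGIN")
    if begin and "CODE" in begin:
        exec(begin["CODE"].strip(), {"s": s})

    for ll in records:
        rule = key_rules.get(ll.key)
        if rule is not None:
            env = {"s": s, "ll": ll, "v": ll.value}
            if "PRE" in rule:
                exec(rule["PRE"].strip(), env)
            if "CHECK" in rule and not eval(rule["CHECK"].strip(), env):
                warnings.append("CHECK failed for key '%s'" % ll.key)
            if "POST" in rule:
                exec(rule["POST"].strip(), env)
        counts[ll.key] += 1

    # Occurrence requirements (only two REQ values are sketched here).
    for name, rule in key_rules.items():
        req = rule.get("REQ")
        if req == "EXACTLY_ONE" and counts[name] != 1:
            warnings.append("'%s' must appear exactly once" % name)
        elif req == "AT_LEAST_ONE" and counts[name] < 1:
            warnings.append("'%s' must appear at least once" % name)

    # Optional END rule: a failed CHECK raises instead of warning.
    end = rules.get("END")
    if end:
        env = {"s": s}
        if "PRE" in end:
            exec(end["PRE"].strip(), env)
        if "CHECK" in end and not eval(end["CHECK"].strip(), env):
            raise ValueError("END CHECK failed")
    return warnings


rules = {
    "BEGIN": {"CODE": " s.update({'run_start': None}) "},
    "KEYS": {
        "run_start": {"REQ": "EXACTLY_ONE",
                      "POST": " s['run_start'] = ll.timestamp "},
        "run_stop": {"REQ": "EXACTLY_ONE",
                     "CHECK": " v['metadata']['status'] == 'success' "},
    },
    "END": {"CHECK": " s['run_start'] is not None "},
}
records = [
    Record("run_start", 1000, {"value": None, "metadata": {}}),
    Record("run_stop", 5000, {"value": None, "metadata": {"status": "aborted"}}),
]
for message in run_rules(rules, records):
    print(message)
```

The code strings are stripped before execution because the YAML examples in this document pad them with spaces, and exec rejects leading indentation.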
Rules to be checked are provided in a YAML (config) file. A config file can contain the following records: BEGIN, KEY and END.
The BEGIN record defines CODE to be executed before any other rules defined in the current file. This record is optional, and there can be at most one BEGIN record per config file.
Example:
- BEGIN:
    CODE: " s.update({'run_start':None}) "
The KEY record defines the actions to be triggered while processing a specific key. The name of the key is specified in the NAME field.
The following fields are optional:
- REQ - specifies the requirement regarding occurrence. Possible values:
    - EXACTLY_ONE - current key has to appear exactly once
    - AT_LEAST_ONE - current key has to appear at least once
    - AT_LEAST(n) - current key has to appear at least n times
    - AT_LEAST_ONE_OR(alternatives) - current key or one of the alternatives has to appear at least once; alternatives is a comma-separated list of keys
- PRE - code to be executed before performing checks
- CHECK - expression to be evaluated as part of checking this key. A false result means a failure.
- POST - code to be executed after performing checks
Example:
- KEY:
    NAME: epoch_start
    REQ: AT_LEAST_ONE
    CHECK: " s['run_started'] and not s['in_epoch'] and ( v['metadata']['epoch_num'] == (s['last_epoch']+1) ) and not s['run_stopped']"
    POST: " s['in_epoch'] = True; s['last_epoch'] = v['metadata']['epoch_num'] "
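
The four REQ values listed above could be evaluated roughly as follows. This is a sketch under assumptions: the function name occurrences_ok is hypothetical, and the exact spelling accepted for the parenthesized forms (whitespace, separators) is inferred from the field descriptions rather than from the checker's source.

```python
import re


def occurrences_ok(req, key, counts):
    """Return True if the occurrence requirement `req` holds for `key`.

    counts maps each key seen in the log to the number of times it appeared.
    """
    n = counts.get(key, 0)
    if req == "EXACTLY_ONE":
        return n == 1
    if req == "AT_LEAST_ONE":
        return n >= 1
    match = re.fullmatch(r"AT_LEAST\((\d+)\)", req)
    if match:
        return n >= int(match.group(1))
    match = re.fullmatch(r"AT_LEAST_ONE_OR\((.*)\)", req)
    if match:
        # The current key or any one of the alternatives satisfies the rule.
        alternatives = [a.strip() for a in match.group(1).split(",") if a.strip()]
        return n >= 1 or any(counts.get(a, 0) >= 1 for a in alternatives)
    raise ValueError("unknown REQ value: %r" % req)
```

With counts of {"epoch_start": 3}, the requirement AT_LEAST(3) holds for epoch_start while EXACTLY_ONE does not.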
The END record specifies the actions to be taken after processing all the lines in the log file. This record is optional, and there can be at most one END record per config file.
The following fields are optional:
- PRE - code to be executed before performing checks
- CHECK - expression to be evaluated as part of checking this key. A false result means a failure.
During processing of the records, a global state s is maintained and is accessible from the code provided in the YAML. In addition, rules can access the information fields (values) v of the record, as well as the timestamp and the original line string, as part of the record ll.
Global state s can be used to enforce any cross-key rules by updating the global state in POST (or PRE) of one KEY and using that information in CHECK of another KEY.
For each config file, s starts as an empty dictionary, so tracking global state requires first adding an entry to s.
Example:
- BEGIN:
    CODE: " s.update({'run_start':None}) "
ll is a structure representing the current log line that triggered the KEY record. ll has the following fields that can be accessed:
- full_string - the complete line as a string
- timestamp - milliseconds as an integer
- key - the string key
- value - the parsed value associated with the key, or None if no value
- lineno - line number in the original file of the current key
v is a shortcut for ll.value
Example:
- KEY:
    NAME: run_stop
    CHECK: " ( v['metadata']['status'] == 'success' )"
    POST: " print('score [sec]:' , ll.timestamp - s['run_start']) "
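
To make the shape of ll concrete, here is a sketch of how one log line might be turned into such a structure. The field names come from the list above; everything else is an assumption made for illustration: the ":::MLLOG" line marker, the JSON payload with time_ms, key, value and metadata entries, and the parse_line helper itself.

```python
import collections
import json

LogLine = collections.namedtuple(
    "LogLine", ["full_string", "timestamp", "key", "value", "lineno"])

PREFIX = ":::MLLOG"  # assumed line marker


def parse_line(line, lineno):
    """Return a LogLine for an MLLOG line, or None for any other line."""
    start = line.find(PREFIX)
    if start < 0:
        return None
    payload = json.loads(line[start + len(PREFIX):])
    # Assumed layout: v['value'] and v['metadata'], as used in the examples.
    value = {"value": payload.get("value"), "metadata": payload.get("metadata")}
    return LogLine(full_string=line.rstrip("\n"),
                   timestamp=payload["time_ms"],
                   key=payload["key"],
                   value=value,
                   lineno=lineno)


line = ':::MLLOG {"time_ms": 1500, "key": "run_stop", "value": null, "metadata": {"status": "success"}}'
ll = parse_line(line, 7)
v = ll.value  # the shortcut described above
```

With this layout, the CHECK expression from the run_stop example, v['metadata']['status'] == 'success', evaluates to true for the sample line.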
To enqueue additional rule config files to be verified, use the enqueue_config(YAML) function.
Config files in the queue are processed independently, meaning that they do not share state or any rules.
Each config file may define its own BEGIN and END records, as well as any other KEY rules.
Example:
- KEY:
    NAME: submission_benchmark
    REQ: EXACTLY_ONE
    CHECK: " v['value'] in ['resnet', 'ssd', 'maskrcnn', 'transformer', 'gnmt'] "
    POST: " enqueue_config('1.0.0/{}.yaml'.format(v['value'])) "
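
The independence of queued config files can be illustrated with a small sketch. It is not the checker's implementation: process_configs is a hypothetical function, each config is reduced to a list of code strings held in memory instead of a YAML file, and first-in, first-out processing order is an assumption.

```python
import collections


def process_configs(configs, first):
    """Run each queued config with its own fresh state.

    configs maps a config name to a list of code strings (a stand-in for
    the rules of a YAML file). Returns a list of (name, final_state) pairs.
    """
    queue = collections.deque([first])
    results = []
    while queue:
        name = queue.popleft()
        s = {}  # fresh state: nothing is shared with other config files
        env = {"s": s, "enqueue_config": queue.append}
        for code in configs[name]:
            exec(code.strip(), env)
        results.append((name, dict(s)))
    return results


configs = {
    "common.yaml": [
        " s['bench'] = 'bert' ",
        " enqueue_config('closed_{}.yaml'.format(s['bench'])) ",
    ],
    "closed_bert.yaml": [" s['checked'] = True "],
}
results = process_configs(configs, "common.yaml")
```

The state recorded for closed_bert.yaml contains only its own checked entry; the bench entry set while processing common.yaml is not visible to it.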
The CODE, PRE, and POST fields are executed using Python's exec function. CHECK is evaluated using an eval call. As such, any legal Python code (a single expression, in the case of CHECK) is suitable for use.
For instance, a rule can print out information, as shown in the run_stop example above.
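
One practical consequence of the exec and eval split is that statements such as assignments belong in CODE, PRE or POST, while CHECK must be an expression. The sketch below uses Python's built-in compile to catch this before a rule is run. The validate_rule_code helper is hypothetical and is not something the checker is known to provide.

```python
def validate_rule_code(rule):
    """Return a list of syntax problems found in a rule's code fields.

    CODE, PRE and POST must compile as statements; CHECK must compile
    as a single expression.
    """
    problems = []
    for field, mode in (("CODE", "exec"), ("PRE", "exec"),
                        ("POST", "exec"), ("CHECK", "eval")):
        if field not in rule:
            continue
        try:
            compile(rule[field].strip(), "<%s>" % field, mode)
        except SyntaxError as err:
            problems.append("%s: %s" % (field, err.msg))
    return problems


good_rule = {
    "CHECK": " v['metadata']['status'] == 'success' ",
    "POST": " s['in_epoch'] = True; s['last_epoch'] = 3 ",
}
bad_rule = {"CHECK": " s['in_epoch'] = True "}  # assignment is not an expression
```

Here validate_rule_code(good_rule) returns an empty list, while validate_rule_code(bad_rule) reports one problem for the CHECK field.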
Tested and confirmed working using the following software versions:
- Python 2.7.12 + PyYAML 3.11
- Python 3.6.8 + PyYAML 5.1
- Python 3.9.2 + PyYAML 5.3.1
- Python 3.9.10 + PyYAML 5.4.1
To install PyYAML:
pip install pyyaml