General MLPerf™ Submission Rules

Table of Contents

1. Basics
2. Review committee
3. Operating principles
4. Schedule
5. Submission
6. Review
7. Publication
8. After publication
- 8.1. Terms of use
- 8.2. Issues discovered after publication
9. Appendices

1. Basics

These rules describe the submission, review, and publication process for all MLPerf benchmark suites. There are separate rules that govern what to submit for each MLPerf benchmark suite, including:

Training Rules
Inference Rules

Unless otherwise stated, all rules apply to the closed, network, and open divisions.

2. Review committee

The MLPerf review and publication process is overseen by a review committee.

2.1. Structure

The review committee consists of the following people:

The relevant working group chairs
The relevant power working group chairs
Optionally, the review chair of relevant working group
The executive director
The president of the board
Representatives of each submitter

If representatives of a single organization, its parents, subsidiaries, or affiliates hold multiple positions on this list, at most one such representative is eligible to vote in the review committee.

The review committee is chaired by the relevant results working group chairs unless a review chair is selected.

2.2. Meetings

The review committee makes decisions during properly scheduled meetings. The base meeting schedule is dictated by the review process. The review committee chair may schedule additional meetings during the review process in the course of any meeting or by emailing all committee members at least 24 hours in advance.

Review committee meetings are typically held virtually.

Review committee meetings are open only to the review committee.

2.3. Agenda and decisions

The review committee's agenda is set, and its decisions are tracked, using issues filed against the GitHub submission repo. In general, issues must be filed as dictated by the review process schedule. Exceptions are discouraged but may be allowed during a meeting by a vote of the review committee.

The review committee should attempt to decide issues through discussion and grudging consensus whenever possible. However, if the review committee is unable to reach a grudging consensus, the submitters will vote to decide the issue. Each submitting organization will cast at most one vote per referendum. Non-submitters (including non-submitting chairs, the executive director, and the president of the board) may not vote, except in the case of a tie. In the case of a tie, exactly one vote is cast by the first non-submitter from the following list who is able and willing to vote:

The non-submitting review chair
The non-submitting chairs of the relevant working group collectively
The non-submitting chairs of the relevant power working group if invited by the non-submitting chairs of the relevant working group
A random number generator

If there are two outcomes, voting proceeds by simple majority. If there are more than two outcomes, voting proceeds by Condorcet poll with the minimax completion rule. Votes are initiated by the chair and are cast openly. Votes may be cast verbally or using a shared spreadsheet or other voting software.
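As an illustration of the multi-outcome case, the following is a minimal sketch of a Condorcet poll resolved with a minimax rule. The rules do not say which minimax variant applies, so this sketch assumes the "margins" variant (an outcome's score is its largest pairwise losing margin) and assumes every ballot ranks every outcome. The function name `minimax_winners` is hypothetical and is not part of any MLCommons tooling.

```python
from itertools import permutations


def minimax_winners(ballots, outcomes):
    """Return the outcome(s) whose worst pairwise defeat margin is smallest.

    ballots:  list of rankings, each a list of all outcomes, most preferred first.
    outcomes: list of all outcomes being voted on.
    Assumption: the "margins" minimax variant; the rules do not name a variant.
    """
    # pairwise[a][b] counts the ballots that rank a above b.
    pairwise = {a: {b: 0 for b in outcomes if b != a} for a in outcomes}
    for ballot in ballots:
        position = {outcome: rank for rank, outcome in enumerate(ballot)}
        for a, b in permutations(outcomes, 2):
            if position[a] < position[b]:
                pairwise[a][b] += 1

    # For each outcome, find the largest margin by which any rival beats it.
    worst_defeat = {
        a: max(pairwise[b][a] - pairwise[a][b] for b in outcomes if b != a)
        for a in outcomes
    }
    best = min(worst_defeat.values())
    return sorted(a for a in outcomes if worst_defeat[a] == best)


# Seven submitting organizations, one ballot each, three possible outcomes.
example_ballots = (
    [["A", "B", "C"]] * 3 + [["B", "C", "A"]] * 2 + [["C", "B", "A"]] * 2
)
print(minimax_winners(example_ballots, ["A", "B", "C"]))
```

If more than one outcome is returned, the poll itself is tied, and the tie-breaking list above would presumably apply.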

The review committee operates on balance of interests rather than by avoiding conflict of interest. Members may cast votes on all matters, including those directly affecting benchmark submissions made by their organization, as a practical response to the fact that competitors are also on the review committee.

2.4. Confidential and not precedent setting

The review committee agenda, deliberations, referenda, votes, and specific decisions are confidential and shared only with committee members and submitters for that round. The general nature of decisions may be shared outside the review process because such decisions may expose the need for rules changes. A submitter may publicly or privately share the specific changes necessary to bring their submission into compliance with their suppliers, contractors, and other partners.

The private submission repository will be deleted when the next relevant MLPerf submission is made public or discontinued.

Review committee decisions do not create precedents. Instead, the decisions should be explicitly incorporated into the rules through the normal process.

3. Operating principles

MLPerf’s purpose is to produce fair and useful benchmark results.

The MLPerf review committee reserves the right to depart from these rules and/or exclude submissions that conflict with this purpose with a two-thirds (rounded up) vote by the submitters. For instance, if the schedule is discovered to be untenable in practice, it may be amended. If a submission is judged to be deceptive or not of interest to the community, it may be excluded.
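The "two-thirds (rounded up)" threshold can be computed as in the short sketch below. It assumes the threshold is taken over the number of votes cast by submitters; the rules do not say how abstentions are counted, and the function name is hypothetical.

```python
def votes_needed(num_votes_cast: int) -> int:
    """Smallest whole number of votes that is at least two-thirds of the votes cast."""
    # Integer form of ceil(2 * n / 3); avoids floating-point rounding.
    return (2 * num_votes_cast + 2) // 3


# With 10 submitters voting, 7 votes are needed; with 9, 6 are needed.
print(votes_needed(10), votes_needed(9))
```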

The role of the review process is to ensure fairness of submissions, not to litigate details in an effort to disqualify competitors. For example:

Reviewing submitters should discuss issues with owning submitters after filing objections, and attempt to resolve the issue if possible.
If an objection is supported by the review committee, the objecting submitter should communicate with the owning submitter to ensure a satisfactory fix.
Issues in a submission that are agreed to require correction, but that do not meaningfully impact performance (less than 2% cumulative performance difference) or competitive ordering, may be waived by the review committee, subject to its discretion, and with the understanding that the submitter will correct the issue in future submissions.

4. Schedule

MLPerf has several submission rounds each year. Each submission round follows a detailed schedule.

4.1. Schedule of submission rounds

The submission schedule will be set yearly, and must be approved by both the inference and training submitters meetings. The goal is to have two Inference and two Training submissions each year, and to not have them overlap each other. MLCommons attempts to avoid major international holidays, and accommodate relevant conferences.

The MLCommons yearly calendars are located in the MLCommons Members shared drive here. For example, the 2022 calendar is located in this google sheet. Access is limited to MLCommons members for now, since the calendars contain sensitive information. We will look into making public versions of these calendars, but in the meantime, the 2022 calendar is:

Submission Round	Submission Date	Publication Date
Inference v2.1	August 5, 2022	August 31, 2022
Training v2.1	October 14, 2022	November 9, 2022

4.2. Single submission round schedule

Each submission round has the following detailed schedule, which has three major phases:

Submission
Review
1. Objection filing
2. Objection review
3. Objection revision
Publication

Each of these phases is described in more detail later in this document.

The exact review period schedule needs to be agreed upon 4 weeks in advance of submission. The following table is an example of the level of detail that the schedule needs to have:

Day	Meeting or deadline (all deadlines are 11:59pm Pacific Time unless otherwise specified)
Week -2	Presubmission
Monday
Tuesday
Wednesday	Submitters must sign CLA and provide primary and secondary POCs with Github handles and email addresses
Thursday
Friday	Submitters WG chair creates submission repo. Gives all submitters access. Sends submitter POCs test email requesting they make a test submission to confirm access.
Week -1	Presubmission
Monday
Tuesday
Wednesday
Thursday
Friday	All “due in advance” writeups due (e.g. for inference calibration / weight transformation)
	Submitters WG chair distributes random seed(s) for load generation (inference only)
Week 0	Submission
Monday
Tuesday
Wednesday
Thursday	Last opportunity to notify chair that you will not submit
Friday	1:00pm Pacific Time: Submit all required artifacts to the Github repo
	1:30pm Pacific Time: Results summary distributed by the Submitters working group chair
Week 1	Review: objection filing
Monday	Begin drafting neutral press release [general chair until org, then executive director]
Tuesday	Review committee meeting, discuss objections
Wednesday
Thursday	Review committee meeting, discuss objections
Friday	Objections due in Github, audit results due in GitHub for open, closed and network
Week 2	Review: objection review
Monday	Submitter response to objections
Tuesday	Review committee meeting, makes easy decisions and requests information about difficult ones
Wednesday	Requested information due
	Distribute neutral press release for comment by [general chair until org, then executive director]
Thursday	Review committee meeting, makes any remaining decisions
Friday
Week 3	Review: objection revision
Monday	Must declare all intended hyperparameter borrowing (training only)
Tuesday	Review committee meeting, finalize all scores.
Wednesday	1:00pm Pacific Time: Final code due
	1:00pm Pacific Time: Final results in human readable form due
	1:00pm Pacific Time: Final opportunity to withdraw some or all results
	1:30pm Pacific Time: Results summary distributed by Results chair
	Approve final draft of press release
Thursday	Review committee meeting, review results presentations.
Friday
Week 4	Publication
Monday	Press and analyst pre-briefings allowed under embargo, all briefings to include neutral press release
	9:30am Pacific Time: submitters can start pre-briefing press under embargo
	1:00pm Pacific Time: Draft of results page available for comment
Tuesday	1:00pm Pacific Time: Corrections to results page due
	5:00pm Pacific Time: Results page and press release live on staging site
Wednesday	10:00am Pacific Time: results and PR public, press embargo ends

4.3. Benchmark Roadmap schedule

Each Working Group decides what benchmarks they want in each round. This is a pipelined process, with the following steps:

Carrying Capacity Decision - Each working group decides how many benchmarks they can handle for this round.
Domain Identification - Working groups review proposals for domain adds/removals from members. The working group will attempt to come to majority consensus in 1 or 2 meetings. If consensus cannot be had, this will go to a vote according to the MLCommons voting rules. Working groups may add up to 2 benchmarks max per round, but will strive for 1 or 0 as the typical case.
Sync Domains across working groups (e.g. Inference, Training, and HPC)
Identify PIC (person in charge) to drive this domain addition across all working groups
For each domain addition, do the two following steps, possibly in parallel:
1. Advisory Board Formation for the domain.
2. Task Force(s) create benchmark proposals. The task force(s) will consider all working groups that might consume this benchmark (e.g. Inference and Training). Ideally benchmark proposals will take around 2 months or less.
Review benchmark proposals with the Advisory Board. The board approves, rejects, or suggests changes. If changes are needed, the Task Force(s) iterate on making the changes and getting Advisory Board approval until the Advisory Board signs off.
Formal working group acceptance - the working group needs to come to consensus on whether or not to accept the benchmark proposal that has now been approved by the Advisory Board. If consensus cannot be had, this will go to a vote according to the MLCommons voting rules.

Some working groups such as HPC may choose to replace the Advisory Board formation with working group consensus, but in general working groups will try their hardest to get third party opinions from non-submitters.

4.3.1. Benchmark roadmap update recommended schedule for Training and HPC

Time	Event
T-28 weeks	Carrying Capacity Decision
T-26 weeks	Domain identification, then sync domains across working groups
T-24 weeks	PIC identified. Task forces iterating on benchmark proposals. Advisory Board formation
T-20 weeks	Advisory Board signs off, Model Frozen, Working groups sync on final benchmark. Finishing touches on benchmark can commence.
T-16 weeks	Benchmark code complete. Only bug fixes allowed beyond this point.
T-12 weeks	RCPs due (for Training and HPC)
T-4 weeks	No more bug fixes. Benchmark code now final.
T-0 weeks	Submission

Working groups are not required to follow the timeline above for every round, but are required to complete the process steps. For example, Domain Identification could cover multiple rounds at once, so that step could be accelerated for the next round. Also, the Inference WG has different model freeze and code freeze expectations from the table above (14 weeks and 9 weeks, respectively).

Note that this schedule requires starting 7 months early, which means it needs to be pipelined with prior rounds, given rounds are typically 6 months apart per working group. Working groups are free to start even earlier.
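To see how the recommended Training and HPC timeline maps onto calendar dates, the sketch below subtracts each T-minus offset from a submission date. The milestone labels are shortened from the table above, and the helper name `roadmap_dates` is hypothetical. The example uses the Training v2.1 submission date from Section 4.1.

```python
from datetime import date, timedelta

# Weeks before submission, taken from the recommended Training/HPC table.
MILESTONE_WEEKS_BEFORE = {
    "Carrying Capacity Decision": 28,
    "Domain identification": 26,
    "PIC identified": 24,
    "Advisory Board sign-off, model frozen": 20,
    "Benchmark code complete": 16,
    "RCPs due": 12,
    "Benchmark code final": 4,
    "Submission": 0,
}


def roadmap_dates(submission: date) -> dict:
    """Map each milestone to the calendar date it falls on."""
    return {
        name: submission - timedelta(weeks=weeks)
        for name, weeks in MILESTONE_WEEKS_BEFORE.items()
    }


for name, when in roadmap_dates(date(2022, 10, 14)).items():
    print(f"{when.isoformat()}  {name}")
```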

5. Submission

The submission process defines how to submit code and results for review and eventual publication.

5.1. Registration

Submitters must register with the submitters working group and begin attending meetings at least eight weeks before the deadline. In order to register, a submitter or their org must sign the relevant CLA and provide primary and secondary GitHub handles and primary and secondary POC email addresses.

5.2. How to Submit

The goal of the submission process is to ensure a successful submission for as many submitters as possible in a fair manner. Therefore, the submission process is structured to ensure that submissions are well formed.

A submission is made by placing an encrypted tarball in a MLCommons-provided cloud storage bucket and confirming the submission using an MLCommons web UI.

MLCommons provides a cloud storage bucket [TODO: URL] for submitting encrypted tarballs up to fourteen days before the deadline. Submitters are encouraged to submit as early as possible during this period, since their results will not be visible to others and they will have a chance to fix any issues.

MLCommons provides a web UI [TODO: URL] for verifying scores contained in the tarball. When provided { private key, file name }, the UI decrypts, untars, and runs a submission verifier then displays results or errors. The submitter may confirm results as final and receive an email receipt. All submissions must be confirmed in this manner or they will be disregarded.

Documentation of the web UI usage can be found in this document.

5.3. Late Submissions

5.3.1. Post-submission grace period (submission deadline + 60 minutes):

MLPerf will allow submissions for up to 60 minutes after the published deadline without explanation or penalty. This grace period will be advertised as little as possible. The 60 minute limit will be strictly enforced.

5.3.2. Post-submission extension for extraordinary circumstances (submission deadline + 72 hours):

If a submitter notifies the submission chair that their submission will be delayed due to force-majeure-type circumstances (e.g. blizzards, hurricanes, terrorism, etc.), the submission chair will delay sharing results for up to 72 hours to allow that submitter more time to make their submission. The extraordinary nature of the circumstances must be approved by the review committee at the first committee meeting or the submission will be disregarded.
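The two late-submission windows can be expressed as a simple classification, sketched below. This is an illustration only, not MLCommons tooling: the function name and return labels are hypothetical, the window boundaries are treated as inclusive (an assumption), and `extension_approved` stands in for the review committee's approval of the extraordinary circumstances.

```python
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(minutes=60)  # Section 5.3.1
EXTENSION = timedelta(hours=72)       # Section 5.3.2


def classify_submission(deadline, received, extension_approved=False):
    """Classify a submission by when it arrived relative to the deadline.

    Both datetimes must be timezone-aware so they compare correctly.
    """
    if received <= deadline:
        return "on time"
    if received <= deadline + GRACE_PERIOD:
        return "accepted (grace period)"
    if extension_approved and received <= deadline + EXTENSION:
        return "accepted (extraordinary circumstances)"
    return "disregarded"


# Example: a 1:00pm Pacific deadline on August 5, 2022 (UTC-7 in summer).
pacific_daylight = timezone(timedelta(hours=-7))
deadline = datetime(2022, 8, 5, 13, 0, tzinfo=pacific_daylight)
print(classify_submission(deadline, deadline + timedelta(minutes=45)))
```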

5.4. Licensing

All submissions of code must be made under the MLCommons CLA. Per the CLA, all submissions of code will be Apache 2 compatible. Third party libraries need not be Apache 2 licensed.

5.4.1. Training

TODO: Fix this section

python3 -m pip install https://github.com/mlcommons/logging/archive/0.7.1.zip
python3 -m mlperf_logging.package_checker <YOUR SUBMISSION_FOLDER> training 0.7.0
python3 -m mlperf_logging.result_summarizer <YOUR SUBMISSION_FOLDER> training 0.7.0

5.4.2. Inference

# from the top of the mlperf inference repository
python3 tools/submission/submission-checker.py --input <YOUR_SUBMISSION_FOLDER> --submitter <YOUR_ORGANIZATION>

5.5. Submission content

A submission must contain the following:

Metadata for the systems under test
Code that implements the benchmarks
Metadata that describes each system-implementation combination tested
Scripts that set up and execute each system-implementation tested
Result logs for each system-implementation tested

5.6. Directory structure

A submission is for one code base for the benchmarks submitted. An org may make multiple submissions. A submission should take the form of a directory with the following structure. The structure must be followed regardless of where the code actually resides, e.g. in the MLPerf repo or on an external code host site.

5.6.1. Training

<submitting_organization>/
- systems/
  - <system_desc_id>.json
- benchmarks/
  - <benchmark_name per reference>/ [TODO: rename the reference directories]
    - implementations/
      - <implementation_id>/
        - <arbitrary stuff>
      - <system_desc_id>/
        - <system_desc_id>_<implementation_id>.json
        - README.md
        - setup.sh (one-time configuration script)
        - init_datasets.sh (one-time dataset init script)
        - run_and_time.sh (run the benchmark and produce a result)
        - (include any post-processing scripts used to make changes to result logs)
- results/
  - <system_desc_id>/
    - <benchmark>/
      - result_.txt # log file
      - power # optional power logs
        - result_
          - node_<j>.txt
          - sw_<k>.txt

System names and implementation names may be arbitrary.

Training benchmark directory names must be one of { resnet, ssd, bert, unet3d, gpt3, dlrm_dcnv2, gnn, llama2_70b_lora, stable_diffusion } for v4.0. Benchmark directory names are determined from the benchmark keywords used in logging repository for compliance checks.
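A quick pre-submission check of benchmark directory names might look like the sketch below. It is not the official package checker from the logging repository; the function name is hypothetical, and the allowed set is copied from the v4.0 list above.

```python
# Allowed training benchmark directory names for v4.0, as listed above.
V4_0_TRAINING_BENCHMARKS = frozenset({
    "resnet", "ssd", "bert", "unet3d", "gpt3", "dlrm_dcnv2",
    "gnn", "llama2_70b_lora", "stable_diffusion",
})


def invalid_benchmark_dirs(dir_names):
    """Return the directory names that are not valid v4.0 benchmark names."""
    return sorted(set(dir_names) - V4_0_TRAINING_BENCHMARKS)


# "resnet50" is an inference benchmark name, so it is flagged here.
print(invalid_benchmark_dirs(["resnet50", "bert", "gpt3"]))
```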

5.6.2. HPC

HPC training submissions follow the above Training directory structure except for the results folder which is adjusted to allow for time-to-train measurements as well as throughput measurements (and pruned throughput logs):

results/
- <system_desc_id>/
  - strong/
    - <benchmark>/
      - result_.txt # log file for time-to-train measurement
  - weak/
    - <benchmark>/
      - result_.txt # log file for throughput measurement
      - pruned_results/
        - result_.txt # log file for pruned throughput measurement

5.6.3. Inference

<submitting_organization>/
- systems/
  - <system_desc_id>.json # combines hardware and software stack information
- code/
  - <benchmark_name per reference>/
    - <implementation_id>/
      - <Code interface with loadgen and other arbitrary stuff>
- measurements/
  - <system_desc_id>/
    - <benchmark>/
      - <scenario>
        - <system_desc_id>_<implementation_id>_<scenario>.json
        - README.md
        - user.conf
        - mlperf.conf
        - calibration_process.adoc
- results/
  - <system_desc_id>/
    - <benchmark>/
      - <scenario>
        - performance/
          - run_x/ # 1 run for all scenarios
            - mlperf_log_summary.txt
            - mlperf_log_detail.txt
        - accuracy/
          - mlperf_log_summary.txt
          - mlperf_log_detail.txt
          - mlperf_log_accuracy.json # truncated by truncate_accuracy_log.py if too large
          - accuracy.txt # stdout of reference accuracy scripts
- compliance/
  - <system_desc_id>/
    - <benchmark>/
      - <scenario>
        - <test_id>
          - performance/
            - run_1/ # 1 run for every scenario
              - mlperf_log_summary.txt
              - mlperf_log_detail.txt
          - accuracy/
            - accuracy.txt # for TEST01 only, generated from truncate_accuracy_log.py
            - mlperf_log_accuracy.json # only necessary for TEST01
            - baseline_accuracy.txt # only for TEST01 if accuracy check fails
            - compliance_accuracy.txt # only for TEST01 if accuracy check fails
          - verify_performance.txt
          - verify_accuracy.txt # for TEST01 only

System names and implementation names may be arbitrary.

<benchmark> must be one of {resnet50, retinanet, rnnt, bert-99, bert-99.9, dlrm-99, dlrm-99.9, 3d-unet-99, 3d-unet-99.9}. The suffixes '-99' and '-99.9' indicate that the accuracy must be >= 99% or 99.9% of the target accuracy, respectively.

<scenario> must be one of {Offline, Server, SingleStream, MultiStream}.

<test_id> must be one of {TEST01, TEST04, TEST05, TEST06}.
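The naming constraints above can be checked against a path inside a submission, as in the following sketch. It is not the official submission checker. It assumes paths are given relative to the <submitting_organization> directory and follow the <top-level>/<system_desc_id>/<benchmark>/<scenario>/... layout shown above; the function name and messages are hypothetical.

```python
from pathlib import PurePosixPath

BENCHMARKS = {"resnet50", "retinanet", "rnnt", "bert-99", "bert-99.9",
              "dlrm-99", "dlrm-99.9", "3d-unet-99", "3d-unet-99.9"}
SCENARIOS = {"Offline", "Server", "SingleStream", "MultiStream"}
TEST_IDS = {"TEST01", "TEST04", "TEST05", "TEST06"}


def check_path(path):
    """Return a list of naming problems found in one submission-relative path."""
    parts = PurePosixPath(path).parts
    if len(parts) < 4 or parts[0] not in ("measurements", "results", "compliance"):
        return ["expected <top-level>/<system_desc_id>/<benchmark>/<scenario>/..."]
    problems = []
    if parts[2] not in BENCHMARKS:
        problems.append(f"unknown benchmark: {parts[2]}")
    if parts[3] not in SCENARIOS:
        problems.append(f"unknown scenario: {parts[3]}")
    if parts[0] == "compliance":
        # Compliance paths carry a <test_id> directory below the scenario.
        test_id = parts[4] if len(parts) > 4 else "<missing>"
        if test_id not in TEST_IDS:
            problems.append(f"unknown test_id: {test_id}")
    return problems


print(check_path("results/sys1/bert/offline/accuracy/accuracy.txt"))
```

Names are case-sensitive in this sketch, so "offline" is reported even though "Offline" is valid.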

Here is the list of mandatory files for all submissions in any division/category. However, your submission should still include all software information and related information for results replication.

mlperf_log_summary.txt
mlperf_log_detail.txt
mlperf_log_accuracy.json
user.conf
calibration or weight transformation related code if the original MLPerf models are not used
actual models if the models are not deterministically generated
READMEs to enable users to replicate performance results
code which interfaces with the loadgen
<system_desc_id>_<implementation_id>_<scenario>.json
<system_desc_id>.json

For some models, mlperf_log_accuracy.json can get very large. Because of this, mlperf_log_accuracy.json is truncated in submissions using a tool. A submitter runs the tool before submitting to MLPerf and submits the truncated mlperf_log_accuracy.json files inside their organization. Run the tool as follows, assuming <SOURCE> is your local submission tree and <DEST> is the location of the GitHub submission repo:

# from top of the inference source tree
python3 tools/submission/truncate_accuracy_log.py --input <SOURCE> --output <DEST>

5.7. <system_desc_id>.json metadata

The file <system_desc_id>.json should contain the following metadata describing the system:

Field	Meaningful response required	Cloud example	On-premise example1	On-premise example2
submitter	Yes	Google	David Kanter	David Kanter
division	Yes	closed	Closed	Open
system_type	Yes	datacenter	datacenter	edge
system_type_detail	²	cloud	edge-server	edge-device
status	Yes	available	available	available

system_name	Yes	tpu-v3	8ball	8ball
number_of_nodes	Yes	1	1	1
host_processors_per_node	Yes	1	2	2
host_processor_model_name	Yes	Intel Skylake	Intel Xeon Platinum 8164	Intel Xeon Platinum 8164
host_processor_core_count	Yes¹, or vcpu		26	26
host_processor_vcpu_count	Yes¹, or core	96
host_processor_frequency			2000MHz	2000MHz
host_processor_caches			L1: 32KB I + 32KB D per core, L2: 1MB I+D per core, L3: 37.75MB I+D per chip	L1: 32KB I + 32KB D per core, L2: 1MB I+D per core, L3: 37.75MB I+D per chip
host_processor_interconnect			3x 10.6GT/s UPI	3x 10.6GT/s UPI
host_memory_capacity	Yes	128GB	384GB	384GB
host_storage_type	Yes	SSD	SSD	SSD
host_storage_capacity	Yes	1 200 GB + 1 50 GB	800GB	800GB
host_networking	Yes		Gig Ethernet	Infiniband
host_network_card_count	Yes		1 100Gbe + 1 10Gbe	1 Integrated
host_networking_topology	Yes		N/A	N/A
host_memory_configuration	Yes		12 x 32GB 2Rx4 PC4-2666V-R	12 x 32GB 2Rx4 PC4-2666V-R
accelerators_per_node	Yes	16	4	4
accelerator_model_name	Yes	tpu-v3	Nvidia Tesla V100	Nvidia Tesla V100
accelerator_host_interconnect	Yes		PCIe 3.0 x16	PCIe 3.0 x16
accelerator_frequency			1230MHz	1230MHz
accelerator_on-chip_memories			L1: 80x 128KB, L2: 6MB per chip	L1: 80x 128KB, L2: 6MB per chip
accelerator_memory_configuration	Yes	HBM	HBM2	HBM2
accelerator_memory_capacity	Yes	32 GB	32GB	32GB
accelerator_interconnect	Yes		6x 25GT/s NVLink	6x 25GT/s NVLink
accelerator_interconnect_topology			Direct	Mesh
cooling	Yes		Liquid	Air-cooled
hw_notes			I overclocked it!	Miscellaneous notes

framework	Yes	TensorFlow 1.14 commit hash = faf9db515c4bf550daacc1c3a22fedf3ff5dde63	PyTorch, NGC19.05	PyTorch, NGC19.05
other_software_stack	Yes	TPU stack 1.14.1.dev20190518, python 3.6, sacrebleu 1.2.11	cuda 10.2.0.163, cudnn 7.6.0.64, cublas 10.2.0.163, gcc 5.4.0	cuda 10.2.0.163, cudnn 7.6.0.64, cublas 10.2.0.163, gcc 5.4.0
operating_system	Yes	Ubuntu 16.04	Ubuntu 18.04.1 LTS	Ubuntu 18.04.1 LTS
sw_notes			extra notes here	extra notes here

¹ Optional for preview system submission. These fields must be updated in the system description json upon the public availability of the processor.

² Optional for submitters to add more specific system type. Some possible values for system_type_detail are cloud and on-premise for datacenter category and edge-server and edge-device for edge category.
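As a rough aid, the sketch below lists which fields marked "Yes" in the table above are absent or blank in a parsed <system_desc_id>.json. It is not the official checker, and the function name is hypothetical. It treats the core-count and vCPU-count fields as alternatives and skips that check for preview systems, following footnote ¹.

```python
# Fields marked "Yes" in the "Meaningful response required" column above.
REQUIRED_FIELDS = [
    "submitter", "division", "system_type", "status", "system_name",
    "number_of_nodes", "host_processors_per_node", "host_processor_model_name",
    "host_memory_capacity", "host_storage_type", "host_storage_capacity",
    "host_networking", "host_network_card_count", "host_networking_topology",
    "host_memory_configuration", "accelerators_per_node",
    "accelerator_model_name", "accelerator_host_interconnect",
    "accelerator_memory_configuration", "accelerator_memory_capacity",
    "accelerator_interconnect", "cooling", "framework",
    "other_software_stack", "operating_system",
]
COUNT_FIELDS = ("host_processor_core_count", "host_processor_vcpu_count")


def _is_blank(value):
    """True for a missing value or one that is empty after stripping."""
    return value is None or not str(value).strip()


def missing_fields(system_desc, preview=False):
    """Return required fields that are absent or blank in a system description dict."""
    missing = [f for f in REQUIRED_FIELDS if _is_blank(system_desc.get(f))]
    # Either a core count or a vCPU count satisfies the requirement.
    if not preview and all(_is_blank(system_desc.get(f)) for f in COUNT_FIELDS):
        missing.append(" or ".join(COUNT_FIELDS))
    return missing


example = {field: "example value" for field in REQUIRED_FIELDS}
example["cooling"] = ""
print(missing_fields(example))
```

A dict loaded with `json.load` from the system description file can be passed in directly.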

In the Network division for the inference datacenter the file <system_desc_id>.json should also contain:

Field	Example
is_network	True
network_type	Ethernet
network_media	Copper
network_rate	100G
nic_loadgen	NVIDIA CX7
number_nic_loadgen	1
net_software_stack_loadgen	Linux Kernel TCP stack v.XXX
network_protocol	TCP/IPv4 over Ethernet
number_connections	1
nic_sut	NVIDIA CX7
number_nic_sut	1
net_software_stack_sut	Linux Kernel TCP stack v.XXX
network_topology	Loadgen System connected to SUT through Switch and Load Balancer

5.8. <system_desc_id>_<implementation_id>_<scenario>.json metadata

The file <system_desc_id>_<implementation_id>.json should contain metadata describing use of the specified implementation on the specified system.

Field	Meaningful response required	DK_Example_1	DK_Example_2
Starting weights filename?	Yes	https://zenodo.org/record/2269307/files/mobilenet_v1_1.0_224.tgz	https://zenodo.org/record/2269307/files/mobilenet_v1_1.0_224.tgz
Weight transformations?	Yes	No	Yes (URL_to_calibration_writeup)
Weight data type(s)	Yes	fp32	bf16
Input data type(s)	Yes	fp32	bf16
Retraining	Yes	No	Yes (URL_to_writeup)

5.9. Logging requirements

For Training, the results logs must be verified and stamped by the training log verification script [TODO log]. The easiest way to produce such a log is to use the

For Inference, the results logs must have been produced by the [standard load generator](https://github.com/mlperf/inference/tree/master/loadgen). Power information may be appended using the standard power information appending script [TODO link or remove].

5.10. Source code requirements for replication

The following section applies to all submissions in all divisions.

The source code must be sufficient to reproduce the results of the submission, given all source components specified in Section 5.11 (for Inference) or Section 5.12 (for Training) are provided in the submission repo for all Categories, including Available, Preview, and RDI. In addition, any software component that would be required to substantially reproduce the submission must be uniquely identified using one of the following methods:

Possible methods to provide Software (meet at least 1 criteria)	Methods for replication	"Available" Category	"Preview" Category	"RDI" Category
Source code or binary included in the submission repo	---	Yes	Optional	Optional
Depends only on public Github repo	Commit hash or tag	Yes	Optional	Optional
Depends only on public Github repo plus one or more PRs	Commit hash or tag, and PR number(s)	Yes	Optional	Optional
Depends only on an available binary (could be free to download or for purchase / customers only)	Name and version, or url	Yes, if the binary is a Beta or Production release	Optional	Optional
Depends on private source code from an internal source control system	Unique source identifier [i.e., gitlab hash, p4 CL, etc]	No	Yes. Should be made "Available" in the next submission after 140 days of the submission date, or by the next MLPerf submission date, whichever is longer	Yes
Depends on partially redacted source code from an internal source control system (line numbers logged in result files should comply with redacted source code for easy review)	Unique source identifier [i.e., gitlab hash, p4 CL, etc]	No	Yes. Should be made "Available" in the next submission after 140 days of the submission date, or by the next MLPerf submission date, whichever is longer	Yes
Private binary	Checksum	No	Yes. Should be made "Available" in the next submission after 140 days of the submission date, or by the next MLPerf submission date, whichever is longer	Yes

5.11. Source code requirements for inference inspection

The following section applies to all submissions in the Closed and Network divisions. We encourage Open division submissions to be as transparent as possible.

For inference, the source code, pseudo-code, or prose description must be sufficient to determine:

Readme detailing run command with command line flags, if any
The connection to the loadgen
Preprocessing
The architecture of the model, and the operations performed
Weights (please notify results chair if > 2 GB combined)
Weight transformations
- If weight transformations are non-deterministic, then any randomness seeds used must be included in the submission.

For the inference server scenario, the source code, pseudo-code, or prose must be sufficient to determine:

Online batching, meaning how the server batches queries for processing

5.12. Source code requirements for training inspection

For training, the source code must be sufficient to verify all aspects of a Closed submission including but not limited to:

Readme detailing run command with command line flags, if any
Data preprocessing
Data traversal order
Model
Model initialization
Optimizer used
Hyperparameters used
Evaluation frequency
Evaluation method

This requirement applies even to Open submissions, though the aspects do not need to match the reference.

5.13. Compliance Testing

5.13.1. Training

This section in progress [TODO].

5.13.2. Inference

Submitters must run the compliance tests for their closed and network division submissions to verify that their submission achieves a basic level of compliance with a subset of the MLPerf rules. If compliance testing identifies a potential issue with the submission, the onus is on the submitter to provide an adequate explanation to the results review committee.

Refer to the documentation found under https://github.com/mlperf/inference/tree/master/compliance/nvidia

The following compliance tests are required for each of the following benchmarks:

model	Required Compliance Tests
resnet50-v1.5	TEST01, TEST04, TEST05
retinanet 800x800	TEST01, TEST05
bert	TEST01, TEST05
dlrm-v2	TEST01, TEST05
3d-unet	TEST01, TEST05
rnnt	TEST01, TEST05
gpt-j	-
stable-diffusion-xl	TEST01, TEST04, TEST05
Llama2-70b	TEST06
mixtral-8x7b	TEST06
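The table above can be used as a lookup to find compliance tests that still need to be run for a model. The sketch below copies the table into a dictionary; the function name is hypothetical, and the model keys are the labels used in the table.

```python
# Required compliance tests per model, copied from the table above.
REQUIRED_COMPLIANCE_TESTS = {
    "resnet50-v1.5": {"TEST01", "TEST04", "TEST05"},
    "retinanet 800x800": {"TEST01", "TEST05"},
    "bert": {"TEST01", "TEST05"},
    "dlrm-v2": {"TEST01", "TEST05"},
    "3d-unet": {"TEST01", "TEST05"},
    "rnnt": {"TEST01", "TEST05"},
    "gpt-j": set(),  # no compliance tests listed
    "stable-diffusion-xl": {"TEST01", "TEST04", "TEST05"},
    "Llama2-70b": {"TEST06"},
    "mixtral-8x7b": {"TEST06"},
}


def missing_compliance_tests(model, tests_run):
    """Return the required tests for a model that are not in tests_run."""
    return sorted(REQUIRED_COMPLIANCE_TESTS[model] - set(tests_run))


print(missing_compliance_tests("resnet50-v1.5", ["TEST01", "TEST05"]))
```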

6. Review

6.1. Visibility of results and code during review

During the review process, only certain groups are allowed to inspect results and code.

Group	Can Inspect
Review committee	All results, all code
Submitters	All results, all code
Public	No results, no code

6.2. Required reviews

Each submitter is required to review at least one other submission. Required reviews are assigned as follows:

Stack rank submissions by number of results.
Assign reviewers in pairs walking down the stack rank
If there is an odd number of reviewers, the bottom 3 in the stack rank will review each other.
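The assignment steps above can be sketched as follows. The rules do not state the direction of the stack rank or how ties are broken, so this sketch assumes ranking from most to fewest results, with ties broken alphabetically by organization name. The function name is hypothetical.

```python
def assign_review_groups(result_counts):
    """Group submitters for required reviews.

    result_counts: dict mapping organization name to its number of results.
    Returns a list of tuples; members of each tuple review each other.
    Assumptions: rank is descending by result count, ties broken by name.
    """
    ranked = sorted(result_counts, key=lambda org: (-result_counts[org], org))
    if len(ranked) < 2:
        return []  # nobody else to review
    odd = len(ranked) % 2 == 1
    # With an odd count, hold back the bottom three to form one group.
    pair_end = len(ranked) - 3 if odd else len(ranked)
    groups = [(ranked[i], ranked[i + 1]) for i in range(0, pair_end, 2)]
    if odd:
        groups.append(tuple(ranked[-3:]))
    return groups


print(assign_review_groups({"A": 10, "B": 8, "C": 5, "D": 3, "E": 1}))
```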

6.3. Auditing

6.4. Filing objections

Submitters must officially file objections to other submitters’ code by creating a GitHub issue prior to the “Filing objections” deadline that cites the offending lines, the rules section violated, and, if pertinent, corresponding lines of the reference implementation that are not equivalent.

Each submitter must file objections with a “by <org>” tag and an “against <org>” tag. Multiple organizations may append their “by <org>” tag to an existing objection if desired. If an objector comes to believe the objection is in error, they may remove their “by <org>” tag. All objections with no “by <org>” tags at the end of the filing deadline will be closed.

Submitters should file an objection, then discuss with the submitter to verify if the objection is correct. Following filing of an issue but before resolution, both objecting submitter and owning submitter may add comments to help the review committee understand the problem.

If the owning submitter acknowledges the problem, they may append the “fix_required” tag and begin to fix the issue.

6.5. Resolving objections

The review committee will review each objection, and either establish consensus or vote. If the committee votes to support an objection, it will provide some basic guidance on an acceptable fix and append the “fix_required” tag. If the committee votes against an objection, it will close the issue.

6.6. Fixing objections

Code should be updated via a pull request prior to the “fixing objections” deadline. Following submission of all fixes, the owning submitter should confirm with the objector(s) that the objection has been addressed and ask them to remove their “by <org>” tags.

If the objector is not satisfied by the fix, then the review committee will decide the issue at its final review meeting. The review committee may vote to accept a fix and close the issue, or reject a fix and request the submission be moved to open or withdrawn.

6.7. Hyperparameter borrowing (training only)

Hyperparameters may be updated in accordance with the training rules prior to the final code due date.

6.8. Withdrawing results or changing division

At any time up until the final human readable deadline, an entry may be withdrawn by amending the pull request. Alternatively, an entry may be voluntarily moved from the closed or network divisions to the open division.

7. Publication

MLCommons will publish all results simultaneously via an update to the results page. After publication, code and results are public and free for use under the MLPerf Terms of Use.

7.1. Results tables

For Inference datacenter, three results tables will be published: one for Closed, one for Network, and one for Open. Otherwise, two results tables will be published: one for Closed and one for Open.

7.2. Results table content

Each results table will contain the following information:

Field	Description
TBD	TBD

7.3. Results categories

Results will be divided into categories based on the availability of the hardware and software components. Availability rules apply to Closed, Network and Open division submissions.

Category	Hardware	Software
Available in cloud	Available for rent in the cloud	Available
Available on premise	Available for purchase	Available
Preview	Must be available for rent or purchase in time for the next submission or in the next submission after 140 days, whichever is longer	Available except for software required to support substantially new hardware
Research, Development, or Internal	Does not meet the above requirements	Does not meet the above requirements

7.3.1. Available Systems

Available cloud systems must (1) have available pricing (either publicly advertised or available by request), (2) have been rented by at least one third party, (3) have public evidence of availability (web page saying the product is available, statement by the company, etc.), and (4) be “reasonably available” for rent by additional third parties by the submission date.

An on-premise system is Available if all of its components that substantially determine ML performance are Available either individually or in aggregate (development boards that meet the “substantially determine” clause are allowed). An Available component or system must (1) have available pricing (either publicly advertised or available by request), (2) have been shipped to at least one third party, (3) have public evidence of availability (web page saying the product is available, statement by the company, etc.), and (4) be “reasonably available” for purchase by additional third parties by the submission date. In addition, submissions for on-premise systems must describe the system and its components in sufficient detail to enable third parties to build a similar system.

In both cases, “reasonably available” means:

Supply and lead times are appropriate for system scale, i.e. on-demand and in quantity for the smallest systems and a few months and with limited supply for the largest systems.
Access to rent or purchase may be subject to conditions that are common to generally available products (such as financial qualifications, size of customer, support burden, export restrictions, etc.) but is not otherwise restricted (i.e. no “early access” approval requirements).

However, it is allowed for the qualifying pre-submission rentals/purchases to have been made with restrictions such as “early access” approval.
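The four numbered criteria above lend themselves to a simple checklist. The following is a minimal sketch, assuming a submitter records each criterion as a yes/no answer; the `AvailabilityEvidence` class and its field names are hypothetical and only mirror the wording of the rules. Whether a criterion is actually met remains a judgment for the review committee.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityEvidence:
    """Hypothetical yes/no record of the four Available criteria."""
    has_pricing: bool               # (1) advertised or available by request
    delivered_to_third_party: bool  # (2) rented by or shipped to a third party
    public_evidence: bool           # (3) web page, company statement, etc.
    reasonably_available: bool      # (4) by the submission date


def missing_criteria(evidence: AvailabilityEvidence) -> list[str]:
    """Return the criteria that are not yet satisfied, in rule order."""
    checks = {
        "(1) available pricing": evidence.has_pricing,
        "(2) rented by or shipped to a third party": evidence.delivered_to_third_party,
        "(3) public evidence of availability": evidence.public_evidence,
        "(4) reasonably available by the submission date": evidence.reasonably_available,
    }
    return [name for name, ok in checks.items() if not ok]


complete = AvailabilityEvidence(True, True, True, True)
early_access_only = AvailabilityEvidence(True, True, True, False)
gaps = missing_criteria(early_access_only)
```

An empty list means all four criteria are claimed. In the second example only criterion (4) is outstanding, which matches a product still gated behind “early access” approval at the submission date.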

Available systems must use an Available software stack. A software stack consists of the set of software components that substantially determine ML performance but are not in the uploaded source code. For instance, for training this includes at a minimum any required ML framework (e.g. TensorFlow, PyTorch) and ML accelerator library (e.g. cuDNN, MKL). An Available software stack consists of only Available software components.

An Available software component must be well supported for general use. For open source software, the software may be based on any commit in an "official" repo plus optionally any PRs to support a particular architecture. For binaries, the binary must be made available as a release, or as a "beta" release with the requirement that optimizations will be included in a future "official" release. The beta must be made available to customers as a clear part of the release sequence. The software must be available at the time of submission.

7.3.2. Preview Systems

A Preview system is a system that did not qualify as an Available system as of the previous MLPerf submission date, but will qualify by the next MLPerf submission date or by the next submission after 140 days from the current submission date, whichever is later, and that the submitter commits to submitting as an Available system by that time. If it is not submitted in that submission round with equal or better performance (allowing for noise), the Preview benchmark will be marked as invalid.

A Preview submission must include performance on at least one benchmark that will be considered MLPerf Compatible (see the MLPerf Compatibility Table) in the upcoming round where the transition to Available is made (consult the SWG for the Benchmark Roadmap). On each of the benchmarks that are previewed and are Compatible, the Available submission must show equal or better performance (allowing for noise and for any changes to the benchmark definition). For Inference, this applies on all systems. For Training, it applies across at least the smallest and the largest scale of the systems used for the Preview submission on that benchmark (e.g. Available Training submissions can be on scales smaller than the smallest and larger than the largest scale used for the Preview submission).

For submissions accompanied by power measurements, "equal or better" must use power-normalized performance rather than absolute performance.

Training: For an Available system that is larger than the Preview system, absolute performance must be better. For an Available system that is smaller than the Preview system, efficiency (time-to-train * number of chips) must be better.
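The Training comparison can be written out as a small check. This is a sketch under stated assumptions, not an official validation tool: the function name and parameters are hypothetical, `noise_tolerance` is an illustrative stand-in for "allowing for noise" (the rules do not define a numeric value), and systems of equal size are assumed to be compared on time-to-train.

```python
def training_preview_validated(
    preview_chips: int,
    preview_minutes: float,
    available_chips: int,
    available_minutes: float,
    noise_tolerance: float = 0.0,  # assumption: no numeric noise bound is given in the rules
) -> bool:
    """Check an Available Training result against its Preview result."""
    slack = 1.0 + noise_tolerance
    if available_chips >= preview_chips:
        # Same size (assumed) or larger: compare absolute time-to-train.
        return available_minutes <= preview_minutes * slack
    # Smaller system: compare efficiency, i.e. time-to-train * number of chips.
    preview_cost = preview_minutes * preview_chips
    available_cost = available_minutes * available_chips
    return available_cost <= preview_cost * slack


# Preview result: 64 chips, 10 minutes (640 chip-minutes).
larger_ok = training_preview_validated(64, 10.0, 128, 6.0)    # faster on more chips
smaller_ok = training_preview_validated(64, 10.0, 32, 18.0)   # 576 chip-minutes
smaller_bad = training_preview_validated(64, 10.0, 32, 22.0)  # 704 chip-minutes
```

Lower time-to-train and a lower chip-minute product both count as better here, so the 32-chip run at 18 minutes passes while the one at 22 minutes does not.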

In the rare event that none of the Preview benchmarks is MLPerf Compatible in the upcoming round where the transition to Available is made, a submitter may get their performance validated in that round by making a submission on the old/retired benchmark to the Results WG during the review period. Such a submission will not appear in the Results table and will be used by the Results WG only to validate the past Preview submission.

For a Preview submission only, the "Available software stack" requirement is waived for software that is necessary to support newly developed hardware component(s) that are substantial contributors to the determination of ML performance (e.g. a new ML accelerator or CPU or NIC). A "newly developed" component is one that was not Available as of the submission date of the previous MLPerf submission round, and was not submitted in a Preview system in that previous round. Other parts of the software stack must still meet the same Available software stack requirements as an Available system.

Examples and counterexamples:

All SKUs of a new chip can be considered "newly developed" as long as the first shipping SKU qualifies as "newly developed". Once the first shipping SKU no longer qualifies, no existing or future SKUs of the chip can be considered "newly developed".
A chip that was Available prior to the submission date of the previous MLPerf round but was never used before for an MLPerf submission does not qualify as "newly developed."
At this point in time, a hardware component that is not an ML accelerator, CPU, or NIC is presumed not to meet the "substantial contributor to the determination of ML performance" criterion. Other possible cases must be brought to the relevant working group for consideration.

7.3.3. Research, Development, or Internal Systems

A research, development, or internal (RDI) component does not meet the requirements for an Available or Preview component. An RDI system is a system containing one or more RDI components. RDI components may not be submitted as Available components until the submission cycle after next or 221 days, whichever is longer.
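Both the Preview deadline (next submission or 140 days, whichever is longer) and the RDI waiting period (submission cycle after next or 221 days, whichever is longer) combine a round count with a day count. The sketch below shows one reading of these rules: pick the first future round that is at least a given number of rounds ahead and at least a given number of days after the current submission date. The function name and the dates are hypothetical, and whether the day bound is inclusive is an assumption.

```python
from datetime import date, timedelta
from typing import Optional


def earliest_qualifying_round(
    current: date,
    future_rounds: list[date],
    min_days: int,
    rounds_ahead: int,
) -> Optional[date]:
    """Return the first round at least `rounds_ahead` rounds ahead and at
    least `min_days` days after `current`, or None if none is listed."""
    threshold = current + timedelta(days=min_days)
    upcoming = sorted(d for d in future_rounds if d > current)
    for index, round_date in enumerate(upcoming, start=1):
        # Assumption: a round exactly `min_days` days later qualifies.
        if index >= rounds_ahead and round_date >= threshold:
            return round_date
    return None


# Hypothetical calendar, not an actual MLPerf schedule.
current_round = date(2030, 1, 15)
calendar = [date(2030, 4, 1), date(2030, 7, 1), date(2030, 10, 1)]

preview_deadline = earliest_qualifying_round(current_round, calendar, 140, 1)
rdi_earliest = earliest_qualifying_round(current_round, calendar, 221, 2)
```

With this calendar, the April round is fewer than 140 days away, so the Preview commitment falls on the July round. The July round is the cycle after next but fewer than 221 days away, so RDI components would first be eligible in October.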

8. After publication

8.1. Terms of use

Any use of published results in connection with the MLPerf trademark must follow the MLPerf Results Messaging Guidelines and any relevant policies found at https://mlcommons.org/en/policies/.

8.2. Issues discovered after publication

8.2.1. Scenario

Results posted on mlperf.org have been generated from a non-compliant submission, and the fix results in >5% cumulative reduction in performance.
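The >5% condition is a relative comparison between the published result and the result after the fix. A minimal sketch, assuming a higher-is-better metric such as throughput (a time-based metric would need to be inverted first); the function names and sample numbers are hypothetical.

```python
def cumulative_reduction(published: float, corrected: float) -> float:
    """Relative drop in performance once the fix is applied.

    Assumes a higher-is-better metric (e.g. samples per second).
    """
    return (published - corrected) / published


def meets_scenario(published: float, corrected: float, threshold: float = 0.05) -> bool:
    """True when the fix reduces performance by more than the threshold."""
    return cumulative_reduction(published, corrected) > threshold


six_percent_drop = meets_scenario(1000.0, 940.0)
four_percent_drop = meets_scenario(1000.0, 960.0)
```

A drop from 1000 to 940 is a 6% reduction and falls within the scenario, while a drop to 960 is 4% and does not.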

8.2.2. Process flow

Any MLCommons member may raise an objection to any published results via email to any MLCommons WG chair. An objection review committee (minimally four MLPerf chairs) will screen the objection. If rejected at this stage, the committee chair will respond to the objector with the reasoning.

Otherwise, the committee will designate an investigator with no conflict of interest to produce a brief (e.g. 1 page) report confidential to the committee, which will include a response from the submitter of the disputed result. Based on the report, the committee will respond to the objector or start further investigation on a case-by-case basis.

8.2.3. Possible investigation outcomes

The objection is not valid.
The result in question is moved to open for noncompliance with the rules.
The result in question is removed due to intentional cheating.

9. Appendices

The appendices contain additional information.

9.1. Committee non-disclosure

This section in progress [TODO].

9.2. Submitter non-disclosure

This section in progress [TODO].

9.3. Submission checklist

This section in progress [TODO].

9.4. Power

This section in progress [TODO].

9.5. Review chair checklist

This section in progress [TODO].
