Unified interface to run MLPerf inference benchmarks

Running the MLPerf inference benchmarks and preparing valid submissions is not trivial.

This guide explains how to automate all the steps required to prepare, customize, run and extend MLPerf inference benchmarks across diverse models, datasets, software and hardware using the MLCommons Collective Mind automation framework (CM).

CM makes it possible to compose modular benchmarks from portable and reusable automation recipes (CM scripts) with a common interface and a human-friendly GUI. Such benchmarks attempt to automatically adapt to any software and hardware natively or inside a container with any Operating System.

CM automation for MLPerf benchmarks is being developed by the MLCommons Task Force on Automation and Reproducibility based on the feedback from MLCommons organizations while automating >90% of all performance and power submissions in the v3.1 round.

Don't hesitate to get in touch via public Discord server to get free help to run MLPerf benchmarks and submit valid results.

Table of Contents:

How to run existing MLPerf inference benchmarks?
How to measure power?
How to submit results?
How CM automation works?
How to debug CM automation recipes?
How to add new implementations (models, frameworks, hardware)?
How to run MLPerf inference benchmarks with non-reference models?
How to run MLPerf inference benchmark via Docker?
How to automate MLPerf experiments?
How to visualize and compare MLPerf results?
Current developments
Acknowledgments
Questions? Suggestions?

How to run existing MLPerf inference benchmarks?

Install MLCommons CM framework with automation recipes for AI benchmarks.
Use this GUI to generate CM commands to customize and run MLPerf inference benchmarks.
Use some ready-to-use CM commands for the following models:
- ResNet50
- RetinaNet
- 3D Unet
- RNNT
- Bert
- GPT-J
- LLAMA2 70B
- Stable Diffusion XL
Check on-going reproducibility studies for MLPerf benchmarks.
Participate in open submission and reproducibility challenges.

How to measure power?

Power measurement is optional for MLPerf inference benchmark submissions and is known to be very difficult to set up and run. However, if your system have a good power efficiency, it is great to showcase it and compare against other systems. That's why we fully automated power measurements for MLPerf inference benchmark in CM.

You can follow this tutorial to set up your power analyzer and connect it with your host platform.

Note that the cTuning foundation and cKnowledge.org have several power analyzers and can help you test your MLPerf benchmark implementations.

How to submit results?

We provided a unified CM interface to run the following MLPerf inference benchmarks:

Language processing using Bert-Large model and Squad v1.1 dataset
Language processing using GPT-J model and CNN Daily Mail dataset
Image Classification using ResNet50 model and Imagenet-2012 dataset
Image Classification using variations of MobileNets and EfficientNets and Imagenet-2012 dataset
Object Detection using Retinanet model and OpenImages dataset
Speech Recognition using RNNT model and LibriSpeech dataset
Medical Imaging using 3d-unet model and KiTS19 dataset
Recommendation using DLRMv2 model and Criteo multihot dataset

All seven benchmarks can participate in the datacenter category. All seven benchmarks except Recommendation can participate in the edge category.

Note that language processing and medical imaging benchmarks must achieve a higher accuracy of at least 99.9% of the FP32 reference model in comparison with 99% default accuracy requirement for all other models.

The recommendation benchmark has a high-accuracy variant only. Currently, we are not supporting the recommendation benchmark in CM because we did not have a required high-end server for testing.

After running MLPerf inference benchmarks and collecting results via CM, you can follow this guide to prepare your submission.

How CM automation works?

Collective Mind was developed based on the feedback from MLCommons organizations

it simply wraps numerous native scripts for all steps required to prepare, build and run applications and benchmarks into unified and reusable automation recipes with human-friendly tags, a common API, YAML/JSON meta descriptions and simple Python code. CM makes it easy to chain together different automation recipes into powerful workflows that automatically prepare all environment variables and commands lines on any software, hardware and operating system without the need for users to learn new tools and languages.

We suggest you to explore this automation recipe and check this CM README and CM Getting Started Guide for more details about CM.

Common CM interface and automation for MLPerf inference benchmark is implemented using the "run-mlperf-inference-app" CM script described by this YAML meta-description and customize.py.

This script can be configured using this GUI and will run other CM scripts that set up different MLPerf inference implementations from different vendors:

CM script "app-mlperf-inference-reference" to run MLCommons reference implementation
CM script "app-mlperf-inference-nvidia" to run Nvidia implementation
CM script "reproduce-mlperf-inference-intel" to run Intel implementation
CM script "reproduce-mlperf-inference-qualcomm" to run Qualcomm implementation
CM script "app-mlperf-inference-cpp" to run MLCommons ONNX C++ implementation
CM script "app-mlperf-inference-tflite-cpp" to run TFLite C++ implementation

When running above scripts, CM will cache the output (MLPerf loadgen, downloaded models, preprocessed data sets, installed tools) that will be reused across different scripts. You can see the content of the cache at any time as follows:

cm show cache

You can clean the cache and start from scratch as follows:

cm rm cache -f

How to debug CM automation recipes?

Since CM language uses native OS scripts with python wrappers, it is relatively straightforward to debug it using your existing tools.

You can add --debug flag to your CM command line when running MLPerf benchmarks to open a shell with all MLPerf environment variables prepared to run and debug the final MLPerf loadgen tool manually.

You can also use GDB by adding environment variable --env.CM_RUN_PREFIX="gdb --args " to the CM command line.

Please check this documentation for more details.

How to add new implementations (models, frameworks, hardware)?

If you do not yet have your own implementation, we suggest you to run already existing implementation via CM and then modify loadgen and inference sources in CM cache to develop your own implementation:

cm show cache --tags=mlperf,loadgen
cm show cache --tags=get,git,inference,repo

You can then push your changes to your own clone of the MLPerf inference repo; copy any of above CM scripts for similar implementation; update tags in _cm.yaml; and add your implementation tags to the meta description of the main CM interface for the MLPerf inference benchmark here.

If you need help, don't hesitate to contact us via public Discord server.

It is in our plans to add a tutorial how to develop MLPerf inference benchmarks and add your implementations to CM.

How to run MLPerf inference benchmarks with non-reference models?

If you want to benchmark some ML models using MLPerf loadgen without accuracy, you can use our universal Python loadgen automation for ONNX models. You can benchmark Hugging Face ONNX models or your own local models.

If you want to benchmark ML models with the MLPerf inference benchmark and submit results to open division, you need to make sure that they are trained on the same data sets as reference MLPerf models and that their input/output is the same as MLPerf reference models. In such case, you can use the following CM flags to substitute reference model in MLPerf.

--env.CM_MLPERF_CUSTOM_MODEL_PATH = {full path to the local model}
--env.CM_ML_MODEL_FULL_NAME = {some user-friendly model name for submission}

Check these 2 examples for more details:

Run custom Bert-family ONNX models with MLPerf reference implementation
Run multiple DeepSparse Zoo BERT models via MLPerf

How to run MLPerf inference benchmark via Docker?

If a given vendor implementation uses Docker (Intel, Nvidia, Qualcomm), CM will build required container and run MLPerf inference benchmark automatically.

CM also has an option to run native MLPerf inference benchmark implementations inside automatically-generated container by substituting cm run script command with cm docker script command.

We plan to share snapshots of different MLPerf inference benchmarks via Docker Hub during our reproducibility studies to help the community benchmark their own systems using MLPerf inference benchmark containers.

How to automate MLPerf experiments?

We have developed experiment automation in CM to run multiple experiments, automatically explore multiple parameters, record results and reproduce them by the workgroup.

Please check this documentation for more details.

How to visualize and compare MLPerf results?

You can pull all past MLPerf results in the CM format, import your current experiments under preparation and visualize results with derived metrics on your system using the Collective Knowledge Playground as follows:

cm pull repo mlcommons@cm4mlperf-results
cmr "get git repo _repo.https://github.com/ctuning/mlperf_inference_submissions_v3.1" \
    --env.CM_GIT_CHECKOUT=main \
    --extra_cache_tags=mlperf-inference-results,community,version-3.1
cmr "gui _graph"

You can see example of this visualization GUI online.

Current developments

Current reproducibility studies for MLPerf benchmarks.
Current CM coverage to run and reproduce MLPerf inference benchmarks.
Development version of the modular MLPerf C++ inference implementation.
Development version of the the reference network implementation with CM interface for BERT model.

Acknowledgments

Collective Mind is an open community project to modularize AI benchmarks and provide a common interface to run them across diverse models, data sets, software and hardware - we would like to thank all our great contributors for their feedback, support and extensions!

Questions? Suggestions?

Please check the MLCommons Task Force on Automation and Reproducibility and get in touch via public Discord server.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Unified interface to run MLPerf inference benchmarks

How to run existing MLPerf inference benchmarks?

How to measure power?

How to submit results?

How CM automation works?

How to debug CM automation recipes?

How to add new implementations (models, frameworks, hardware)?

How to run MLPerf inference benchmarks with non-reference models?

How to run MLPerf inference benchmark via Docker?

How to automate MLPerf experiments?

How to visualize and compare MLPerf results?

Current developments

Acknowledgments

Questions? Suggestions?

Files

README.md

Latest commit

History

README.md

File metadata and controls

Unified interface to run MLPerf inference benchmarks

How to run existing MLPerf inference benchmarks?

How to measure power?

How to submit results?

How CM automation works?

How to debug CM automation recipes?

How to add new implementations (models, frameworks, hardware)?

How to run MLPerf inference benchmarks with non-reference models?

How to run MLPerf inference benchmark via Docker?

How to automate MLPerf experiments?

How to visualize and compare MLPerf results?

Current developments

Acknowledgments

Questions? Suggestions?