[Suggestion] MLPerf reproducibility/repeatability methodology from ACM/IEEE/NeurIPS? #1080

gfursin · 2024-01-30T13:57:01Z

Following many recent discussions at MLCommons about improving the repeatability and reproducibility of MLPerf inference benchmarks, we suggest looking at similar initiatives at computer systems conferences (artifact evaluation and reproducibility initiatives) and possibly adopting their methodology and badges:

Our repeatability study for MLPerf inference v3.1 highlights repeatability issues similar to those we have already seen at compiler, systems and ML conferences:

some submissions are missing critical files, which makes it impossible to rebuild and rerun their experiments
some submissions worked fine at the time of submission but became outdated due to newer dependencies
manual modifications of submissions, configurations, paths, environment variables, etc. are often required, making repeatability very error-prone
containers were useful as a snapshot, but they are not guaranteed to work on new hardware or with new software

A potential solution is to improve the repeatability of MLPerf submissions (full reproducibility is probably too costly and not feasible at this stage) by introducing MLPerf reproducibility badges similar to the ACM reproducibility badges:

An "MLPerf submission available" badge is published alongside a submission only if all artifacts needed to rebuild it (code, data, configurations, workflows, etc.) are publicly available to external users.
An "MLPerf submission functional/repeatable" badge is awarded if anyone can perform a short valid run of a given submission in a fully automated way. The MLCommons Automation and Reproducibility TaskForce can then extend the MLCommons CM workflow for MLPerf to run that submission via a common interface in a unified way.

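To make the two badge criteria above concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `Submission` record, the `REQUIRED_ARTIFACTS` tuple, and the `assign_badges` function are not part of MLPerf or the CM workflow. It also assumes that the "functional/repeatable" badge is only granted on top of "available", which the proposal does not state explicitly.

```python
from dataclasses import dataclass, field

# Artifact kinds listed in the proposal; treating exactly these four
# as the required set is an assumption made for this sketch.
REQUIRED_ARTIFACTS = ("code", "data", "configurations", "workflows")


@dataclass
class Submission:
    """Hypothetical record describing one MLPerf submission."""
    name: str
    public_artifacts: set = field(default_factory=set)
    # True if a fully automated short valid run succeeded for this submission.
    short_run_ok: bool = False


def assign_badges(sub: Submission) -> list:
    """Return the badges a submission would earn under the sketched rules."""
    badges = []
    if all(kind in sub.public_artifacts for kind in REQUIRED_ARTIFACTS):
        badges.append("available")
        # Assumption: a run cannot be repeated by anyone unless the
        # artifacts are public, so this badge is nested under "available".
        if sub.short_run_ok:
            badges.append("functional/repeatable")
    return badges


if __name__ == "__main__":
    complete = Submission("example-a", set(REQUIRED_ARTIFACTS), short_run_ok=True)
    partial = Submission("example-b", {"code"}, short_run_ok=True)
    print(complete.name, assign_badges(complete))
    print(partial.name, assign_badges(partial))
```

In practice the `short_run_ok` flag would be set by whatever automation performs the short valid run, and the required artifact list would be agreed by the task force rather than hard-coded.
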
We can evaluate results after the submission deadline and before the publication deadline, and assign badges to all results in the final table that is officially published. This may motivate everyone to improve the quality of their submissions and earn all such badges in the future, rather than leaving the community to discover such issues after the MLPerf results are published.

gfursin · 2024-02-20T16:50:28Z

We have developed a prototype infrastructure to track MLPerf configurations and give ACM badges:

gfursin assigned arjunsuresh and gfursin Jan 30, 2024

This was referenced Jan 30, 2024

[automation and reproducibility taskforce] progress report for 20240130 #1082

Closed

[automation and reproducibility taskforce] progress report for 20240130 mlcommons/inference#1595

Open

Automation and reproducibility for MLPerf Inference v3.1 #1052

Closed

gfursin unassigned arjunsuresh and gfursin Oct 7, 2024
