
Submission checker: better modularize the code and improve documentation of expected results #1670

Open
nvzhihanj opened this issue Apr 2, 2024 · 7 comments

Comments

@nvzhihanj
Contributor

nvzhihanj commented Apr 2, 2024

With the increasing number of benchmarks and checks, we have found several issues with the submission checker (https://github.com/mlcommons/inference/blob/master/tools/submission/submission_checker.py):

  • The file itself is too long (>3,700 lines of code), which makes it hard to read and interpret
  • The checking logic is not well structured, which makes it hard to onboard new submitters
  • Some corner cases are hard to debug due to the lack of clear warning/error messages and exception handling

We would like to propose several changes to the submission checker, including but not limited to:

  • Avoid duplicate code, and keep configs for old submissions in separate files (since we do not need backward compatibility)
  • Improve code style and structure; @nv-alicheng can provide suggestions on adopting modern Python in the submission checker
  • Add thorough documentation and exception/error handling, so that debugging is easier for users

One concrete example of an uncaught error: if a user submits runs shorter than 600 seconds, the submission checker reports ERROR in the log and returns a non-zero exit code, but the end of the log still says "the results look OK".
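For illustration, one way to keep the final verdict consistent with the log is to count ERROR-level records with a logging handler and gate both the closing message and the exit code on that count. This is only a minimal sketch with hypothetical names (ErrorCounter, run_checks), not the existing implementation:

import logging

log = logging.getLogger("main")


class ErrorCounter(logging.Handler):
    """Counts ERROR-level records so the final verdict can reflect them."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.count = 0

    def emit(self, record):
        self.count += 1


def run_checks(check_fns):
    counter = ErrorCounter()
    log.addHandler(counter)
    for check in check_fns:
        check()  # each check reports problems via log.error(...)
    if counter.count == 0:
        log.info("The results look OK.")
        return 0
    log.error("Found %d errors; results are NOT valid.", counter.count)
    return 1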

@pgmpablo157321 FYI. We are discussing how to improve the submission_checker and will continue to update this thread.

@arjunsuresh
Contributor

Agreed on the old configs: when running the submission checker on old submissions, it makes sense to use the corresponding old release of the submission checker. We probably only need to maintain compatibility with the latest and the upcoming submission rounds, which by itself will reduce the size of the submission checker considerably; this PR does a first iteration of that.

This PR adds a GitHub Action for the submission checker changes and runs it against the v4.0 results repository; once it is merged, we can safely make any change to the submission checker.

One concrete example of uncaught error is: if users submit runs lower than 600 seconds, the submission checker will report ERROR in the log and return non-zero, but at the end of the log it will still say "the results look OK".

This issue should already be fixed in the master branch by this PR

@nv-alicheng
Contributor

Proposal for Submission Checker Refactor

  1. MODEL_CONFIG is well over 1,000 lines long and is stored as an enormous nested dictionary. It is extremely hard to figure out the schema of this dictionary without reading its contents.

Proposal:

  • MODEL_CONFIG should be stored in separate directories (each corresponding to an MLPINF version), each with a common file structure

Proposed structure:

schema.py
constants.py  # Maybe contains scenarios enum?
vX.Y/
    config.py
    models.py
  • MODEL_CONFIG should have a definitive schema that is easily understandable. The simplest way to do this I believe is with Python dataclasses
  • MODEL_CONFIG must be flexible enough to allow for small variations that may occur between MLPINF submission rounds
  • Proposed design:

schema.py

from dataclasses import dataclass
from typing import Dict, List, Optional

# Enum types (Scenario, SystemCategory, AuditTest, AccuracyMetric) are assumed to be
# defined in constants.py, per the proposed structure above.

@dataclass
class MLPInferenceSubmission:
    version: str
    models: List[Model]  # Alternatively could be a Dict[str, Model]
    seeds: Seeds
    audit_test_seeds: Dict[AuditTest, Seeds] 
    ignored_errors: List[str]

@dataclass
class Model:
    name: str
    aliases: List[str]
    required_in: Dict[SystemCategory, List[Scenario]]
    optional_in: Dict[SystemCategory, List[Scenario]]
    accuracy_metric: AccuracyMetric  # Defined in constants.py, has regex field
    accuracy_target: float
    performance_sample_count: int
    scenario_configs: Dict[Scenario, ScenarioConfig]  # With this, it is easy to do things like override the result field for Llama2 since you can use a different string.
    audit_tests: List[AuditTest]

@dataclass
class ScenarioConfig:
    latency_constraint: int = 0
    min_query_count: int = 1
    result_field: Optional[ResultField] = None
    
@dataclass
class Seeds:
    qsl: int
    sample_index: int
    schedule: int

@dataclass
class ResultField:
    id_: str
    raw_string: str

Example: v4.0:

models.py:

from ..schema import *
from ..scenarios import *

ResNet50 = Model(
    name="resnet",
    aliases=["resnet50"],
    required_in={SystemCategory.Datacenter: [Scenario.Offline, Scenario.Server],
                 SystemCategory.Edge: [Scenario.Offline, Scenario.MultiStream, Scenario.SingleStream]},
    optional_in=None,  # Can be omitted if dataclass definition uses purely kwargs
    accuracy_metric=AccuracyMetric.Accuracy,
    accuracy_target=76.46 * 0.99,
    performance_sample_count=1024,
    scenario_configs={
        Scenario.Offline: ScenarioConfig(min_query_count=1,
                                         result_field=ResultField("result_samples_per_second", "Samples per second")),
        Scenario.Server: ScenarioConfig(latency_constraint=15000000,
                                        min_query_count=270336,
                                        result_field=ResultField("result_scheduled_samples_per_sec", "Scheduled samples per second")),
        Scenario.SingleStream: ScenarioConfig(min_query_count=1024,
                                              result_field=...),
        Scenario.MultiStream: ScenarioConfig(min_query_count=270336,
                                             result_field=...)},
    audit_tests=[AuditTest.Test01, AuditTest.Test04, AuditTest.Test05]
)

... etc for all models

config.py:

from ..schema import *
from ..scenarios import *
from .models import *

v4_0_SubmissionConfig = MLPInferenceSubmission(
    version="v4.0",
    models=[ResNet50, Retinanet, ...],
    seeds=Seeds(13281865557512327830, 198141574272810017, 7575108116881280410),
    audit_test_seeds={AuditTest.Test05: Seeds(2376919268182438552, 11176391829184272374, 3911940905271271337)},
    ignored_errors=[])
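For illustration, the per-version configs could then be exposed through a small registry that replaces today's MODEL_CONFIG dictionary lookups. Everything below (registry.py, SUBMISSION_CONFIGS, get_model, and spelling the vX.Y directory as an importable v4_0 package) is hypothetical and only sketches how the schema would be consumed:

# registry.py (hypothetical)
from .v4_0.config import v4_0_SubmissionConfig  # "vX.Y/" spelled v4_0 so it is importable

SUBMISSION_CONFIGS = {"v4.0": v4_0_SubmissionConfig}


def get_model(version, name_or_alias):
    """Resolve a model by canonical name or alias for a given MLPerf version."""
    submission = SUBMISSION_CONFIGS[version]
    for model in submission.models:
        if name_or_alias == model.name or name_or_alias in (model.aliases or []):
            return model
    raise KeyError(f"Unknown model '{name_or_alias}' in {version}")


# e.g. the Server latency constraint for ResNet50 in v4.0 (Scenario from constants.py):
#   get_model("v4.0", "resnet50").scenario_configs[Scenario.Server].latency_constraint
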
  1. There is an enormous number of checks and conditions that need to be satisfied, and the main loop is quite hard to read and interpret given its size. I don't have a definite proposal for this, since the checking code is quite long and I haven't fully grokked the whole thing, but it could probably be made much simpler and more readable by splitting each "phase" of the checker into its own file and making each individual fail condition its own function (similar to a pytest file structure).

For example:

performance_dir_check.py:

class PerformanceDirCheck(Checker):
    def fail_if_missing_files(self, ctx):
        ...
    def fail_if_contains_loadgen_error(self, ctx):
        ...
    def fail_if_insufficient_performance_sample_count(self, ctx):
        ...

Where Checker is a class that has a method that runs all member methods that begin with "fail_if_", and ctx contains the necessary information about the workload being checked.
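A minimal sketch of such a Checker base class (Checker, CheckFailure, and run_all are illustrative names, not an existing API) could use introspection to collect and run every fail_if_* method and aggregate the failures:

import inspect


class CheckFailure(Exception):
    """Raised by a fail_if_* method when its condition is violated."""


class Checker:
    """Base class whose run_all() executes every member method named fail_if_*."""

    def run_all(self, ctx):
        failures = []
        for name, method in inspect.getmembers(self, predicate=inspect.ismethod):
            if name.startswith("fail_if_"):
                try:
                    method(ctx)
                except CheckFailure as e:
                    failures.append(f"{name}: {e}")
        return failures

Each subclass like PerformanceDirCheck then only needs to raise CheckFailure with a descriptive message; whether failures are signalled via exceptions or return values is an open design choice.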

@mrmhodak
Contributor

@pgmpablo157321 to have a look

@arjunsuresh
Contributor

PR removing the backward compatibility of the submission checker: #1677

@nv-ananjappa
Contributor

We should get back to this issue after #1677 is merged. That PR would simplify the submission checker a lot.

@pgmpablo157321
Contributor

@nv-ananjappa @nvzhihanj @arjunsuresh @mrmhodak
Proposal:
Main modules:

  • Package checker: check that all the required files are present, and check for forbidden files
  • Log checkers:
    • Performance checker: check the content of the performance logs
    • Accuracy checker: check the content of the accuracy logs
    • Compliance checker: check the compliance tests
  • Result summarizer: Compute the summary of the results

Helper modules:

  • Log Parser
  • Configuration parser

Additional items:

  • Configuration files
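To make the flow between these modules concrete, a top-level driver could be wired roughly as in the sketch below. Every module and function name here is a placeholder matching the list above, not existing code:

# hypothetical top-level driver for the proposed module layout
from submission_checker.package_checker import check_package_layout
from submission_checker.log_checkers.performance import check_performance_logs
from submission_checker.log_checkers.accuracy import check_accuracy_logs
from submission_checker.log_checkers.compliance import check_compliance_tests
from submission_checker.result_summarizer import summarize_results
from submission_checker.config_parser import load_config


def check_submission(input_dir, version):
    config = load_config(version)                        # configuration parser
    errors = []
    errors += check_package_layout(input_dir, config)    # required / forbidden files
    errors += check_performance_logs(input_dir, config)  # performance log contents
    errors += check_accuracy_logs(input_dir, config)     # accuracy log contents
    errors += check_compliance_tests(input_dir, config)  # compliance tests
    summary = summarize_results(input_dir, config)       # result summarizer
    return errors, summary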

@arjunsuresh
Contributor

From chatgpt :)

To modularize this script, I’ll refactor the file into logical components, grouping functions, constants, and class definitions into separate modules. Here's how the code can be broken down:

Folder Structure:

submission_checker/
│
├── __init__.py
├── main.py
├── config.py
├── utils.py
├── constants.py
├── checkers/
│   ├── __init__.py
│   ├── accuracy.py
│   ├── performance.py
│   ├── power.py
│   ├── system_desc.py
│   └── compliance.py

1. config.py

This module will contain the Config class that handles configuration for different versions of MLPerf and submission types.

# config.py

from .constants import MODEL_CONFIG


class Config:
    """Select config value by mlperf version and submission type."""

    def __init__(self, version, extra_model_benchmark_map, ignore_uncommited=False, skip_power_check=False):
        self.base = MODEL_CONFIG.get(version)
        self.extra_model_benchmark_map = extra_model_benchmark_map
        self.version = version
        self.models = self.base["models"]
        self.seeds = self.base["seeds"]
        self.test05_seeds = self.base["test05_seeds"]
        self.accuracy_target = self.base["accuracy-target"]
        self.accuracy_delta_perc = self.base["accuracy-delta-perc"]
        self.accuracy_upper_limit = self.base.get("accuracy-upper-limit", {})
        self.performance_sample_count = self.base["performance-sample-count"]
        self.latency_constraint = self.base.get("latency-constraint", {})
        self.min_queries = self.base.get("min-queries", {})
        self.required = None
        self.optional = None
        self.ignore_uncommited = ignore_uncommited
        self.skip_power_check = skip_power_check

    def set_type(self, submission_type):
        """Set required and optional scenarios based on the submission type."""
        if submission_type == "datacenter":
            self.required = self.base["required-scenarios-datacenter"]
            self.optional = self.base["optional-scenarios-datacenter"]
        elif submission_type == "edge":
            self.required = self.base["required-scenarios-edge"]
            self.optional = self.base["optional-scenarios-edge"]
        elif submission_type in ["datacenter,edge", "edge,datacenter"]:
            self.required = self.base["required-scenarios-datacenter-edge"]
            self.optional = self.base["optional-scenarios-datacenter-edge"]
        else:
            raise ValueError("Invalid system type")

    # Other methods of Config go here...

2. constants.py

This module will contain all constants such as MODEL_CONFIG, VALID_DIVISIONS, and REQUIRED_FILES.

# constants.py

MODEL_CONFIG = {
    "v4.0": {
        # Same structure you had in your original script
    },
    "v4.1": {
        # Another version configuration
    }
}

VALID_DIVISIONS = ["open", "closed", "network"]
VALID_AVAILABILITIES = ["available", "preview", "rdi"]

REQUIRED_PERF_FILES = ["mlperf_log_summary.txt", "mlperf_log_detail.txt"]
OPTIONAL_PERF_FILES = ["mlperf_log_accuracy.json"]

# Add all other constants from the original script here...

3. utils.py

This module will hold utility functions that are used throughout the codebase, such as directory operations and file listing.

# utils.py

import os


def list_dir(*path):
    path = os.path.join(*path)
    return [f for f in os.listdir(path) if os.path.isdir(os.path.join(path, f))]


def list_files(*path):
    path = os.path.join(*path)
    return [f for f in os.listdir(path) if os.path.isfile(os.path.join(path, f))]


def list_files_recursively(*path):
    path = os.path.join(*path)
    return [
        os.path.join(dirpath, file)
        for dirpath, dirs, files in os.walk(path)
        for file in files
    ]

# Add any other file/directory-related utility functions here...

4. checkers/accuracy.py

This module will contain the functions related to checking accuracy.

# checkers/accuracy.py

import os
import re
from ..utils import list_files
from ..constants import ACC_PATTERN, MAX_ACCURACY_LOG_SIZE


def check_accuracy_dir(config, model, path, verbose):
    """Check accuracy files and logs."""
    is_valid = False
    all_accuracy_valid = True
    acc = None
    result_acc = {}
    hash_val = None
    target = config.get_accuracy_target(model)
    acc_upper_limit = config.get_accuracy_upper_limit(model)
    
    # Parsing and checking the accuracy
    # The rest of the check_accuracy_dir function remains unchanged

    return is_valid, result_acc

5. checkers/performance.py

This module will handle performance checking functions.

# checkers/performance.py

import os

from ..constants import MS_TO_NS, S_TO_MS
from log_parser import MLPerfLog


def check_performance_dir(config, model, path, scenario_fixed, division, system_json, has_power=False):
    """Check performance directory and logs."""
    is_valid = False
    rt = {}

    # Load MLPerf log and validate
    mlperf_log = MLPerfLog(os.path.join(path, "mlperf_log_detail.txt"))

    # Continue with performance check logic...

    return is_valid, rt

6. checkers/power.py

This module will deal with checking power directories and logs.

# checkers/power.py

from log_parser import MLPerfLog
import os


def check_power_dir(power_path, ranging_path, testing_path, scenario_fixed, power_res_ranging, power_res_testing, config):
    """Check power metrics from power logs."""
    is_valid = True
    power_metric = 0

    # Check and validate power files...

    return is_valid, power_metric

7. checkers/system_desc.py

This module will include functions for checking system descriptions.

# checkers/system_desc.py

import os
import json
from ..constants import SYSTEM_DESC_REQUIRED_FIELDS


def check_system_desc_id(fname, systems_json, submitter, division, version, skip_meaningful_fields_emptiness_check):
    """Check system description file."""
    is_valid = True

    # Perform system description validation...

    return is_valid

8. checkers/compliance.py

This module will handle compliance-related checks.

# checkers/compliance.py

import os
from ..utils import list_files


def check_compliance_dir(compliance_dir, model, scenario, config, division, system_json, name):
    """Check compliance test directories and logs."""
    compliance_perf_pass = True

    # Perform compliance checks...

    return compliance_perf_pass

9. main.py

Finally, the main module will orchestrate the checking process by utilizing all the modularized functions and components.

# main.py

import argparse

# Relative imports require running the checker as a module, e.g.
#   python -m submission_checker.main --input <submission_dir>
from .config import Config
from .utils import list_dir, list_files
from .checkers.accuracy import check_accuracy_dir
from .checkers.performance import check_performance_dir
from .checkers.power import check_power_dir
from .checkers.system_desc import check_system_desc_id
from .checkers.compliance import check_compliance_dir


def get_args():
    """Parse commandline arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="submission directory")
    parser.add_argument("--version", default="v4.1", choices=["v4.0", "v4.1"], help="mlperf version")
    parser.add_argument("--submitter", help="filter to submitter")
    parser.add_argument("--csv", default="summary.csv", help="csv file with results")
    args = parser.parse_args()
    return args


def main():
    args = get_args()
    config = Config(args.version, "model_mapping.json")

    # Rest of the logic for running the checks using modularized functions...

if __name__ == "__main__":
    main()

Explanation:

  • Modularization: The file has been split into logical units for configuration, utilities, constants, and specific checkers (accuracy, performance, power, etc.).
  • Reuse: Each checker module focuses on a specific aspect, making the code easier to maintain and extend in the future.
  • Main Orchestration: main.py coordinates everything by calling the appropriate functions from the modules.

This modularization makes the codebase easier to manage, debug, and extend for future versions of the MLPerf submission checker.
