Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Pydantic to verify Test schema #145

Merged
merged 71 commits into from
Sep 25, 2024
Merged

Introduce Pydantic to verify Test schema #145

merged 71 commits into from
Sep 25, 2024

Conversation

amaslenn
Copy link
Contributor

@amaslenn amaslenn commented Jul 11, 2024

Summary

Use Pydantic model for Test Template and Test TOMLs.

Test Plan

  1. Extended CI.
  2. Dry-run and compare generated scripts for the following runs. All found issues were covered with unit tests. Compared vs main state:
cloudai --mode dry-run --system-config slurm.toml --test-templates-dir conf/common/test_template/ --tests conf/common/test/ --test-scenario conf/common/test_scenario/chakra_replay.toml
cloudai --mode dry-run --system-config slurm.toml --test-templates-dir conf/common/test_template/ --tests conf/common/test/ --test-scenario conf/common/test_scenario/nccl_test.toml
cloudai --mode dry-run --system-config slurm.toml --test-templates-dir conf/common/test_template/ --tests conf/common/test/ --test-scenario conf/common/test_scenario/sleep.toml
cloudai --mode dry-run --system-config slurm.toml --test-templates-dir conf/common/test_template/ --tests conf/common/test/ --test-scenario conf/common/test_scenario/ucc_test.toml

Additional Notes

Highlights for Release notes

We are working on schema improvements to simplify configs management and make them verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for Test configs. This is a continuation of #158.

  1. Test Template TOML files were replaced with Pydantic models. That ensures mandatory arguments as well as its types and requires less code to maintain.
  2. --test-templates-dir option was removed for all commands. All supported tests are registered in code using Registry().add_test_definition(...) and Registry().add_test_template(...). Documentation was updated to reflect this change.
  3. Test TOML files now take advantage of standard TOML format for all know arguments.
    Before:
    [cmd_args]
    "training" = "llama/llama2_70b"
    "training.trainer.max_steps" = "120"
    "training.model.global_batch_size" = "256"
    "training.model.pipeline_model_parallel_size" = "1"
    Now:
    [cmd_args]
      [cmd_args.training]
      values = "llama/llama2_70b"
        [cmd_args.training.trainer]
        max_steps = "120"
        [cmd_args.training.model]
        global_batch_size = "256"
        pipeline_model_parallel_size = "2"
  4. extra_cmd_args converted from str to dict[str, str]:
    Before:
    extra_cmd_args = "--stepfactor 2"
    Now:
    [extra_cmd_args]
    "--stepfactor" = "2"
  5. Add a new mode to verify if Tests TOMLs are valid: cloudai --mode verify-tests --system-config conf/common/system/standalone_system.toml --tests-dir conf/common/test/chakra_replay.toml

@amaslenn amaslenn marked this pull request as ready for review July 17, 2024 12:54
@TaekyungHeo
Copy link
Member

If we are going to accept this pydantic PR, then we should update USER_GUIDE.md as well to inform contributors how to add a new test template. Before updating USER_GUIDE.md, let's ensure @srinivas212 is on the same page.

@TaekyungHeo
Copy link
Member

It's good to know that you have tested this PR with CloudAI test templates!

Copy link
Member

@TaekyungHeo TaekyungHeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. However, since it is a major change, let's wait for @srinivas212.

@TaekyungHeo TaekyungHeo added the enhancement New feature or request label Jul 17, 2024
@amaslenn
Copy link
Contributor Author

LGTM. However, since it is a major change, let's wait for @srinivas212.

Do you think we should keep all the existing field validators as is? This might constraint some of our users, I think.

@srinivas212 srinivas212 added Oct24 Oct'24 release feature and removed Oct24 Oct'24 release feature labels Jul 28, 2024
Copy link
Member

@TaekyungHeo TaekyungHeo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amaslenn, let's discuss this PR in the next meeting. This PR might break other test templates as it changes how we retrieve arguments.

@srinivas212 srinivas212 added the Oct24 Oct'24 release feature label Aug 2, 2024
@amaslenn amaslenn added feature and removed enhancement New feature or request labels Aug 27, 2024
@TaekyungHeo
Copy link
Member

TaekyungHeo commented Sep 20, 2024

@amaslenn, the README and USER_GUIDE should be further updated as they contain keywords such as "test template." I suggest running grep -ir "test template", grep -ir "test_template", and grep -ir "TestTemplate" to replace them appropriately. We should also do the same for CloudAIX.

@amaslenn
Copy link
Contributor Author

... to replace them appropriately.

Great catch, I've missed that, thank you! Yet for TestTemplate it is a bit more complicated as we still have such python object it is used.

I've updated docs, please let me know if you see more issues with it.

Copy link
Contributor

@srivatsankrishnan srivatsankrishnan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Met with Andrei and call to discuss this PR. LGTM.

@amaslenn amaslenn merged commit 953f04b into main Sep 25, 2024
2 checks passed
@amaslenn amaslenn deleted the am/schema branch September 25, 2024 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Oct24 Oct'24 release feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants