Introduce Pydantic to verify Test schema #145

amaslenn · 2024-07-11T11:30:25Z

Summary

Use Pydantic model for Test Template and Test TOMLs.

Test Plan

Extended CI.
Ran a dry-run for each of the scenarios below and compared the generated scripts against the state of main. Every issue found was covered with a unit test:

cloudai --mode dry-run --system-config slurm.toml --test-templates-dir conf/common/test_template/ --tests conf/common/test/ --test-scenario conf/common/test_scenario/chakra_replay.toml
cloudai --mode dry-run --system-config slurm.toml --test-templates-dir conf/common/test_template/ --tests conf/common/test/ --test-scenario conf/common/test_scenario/nccl_test.toml
cloudai --mode dry-run --system-config slurm.toml --test-templates-dir conf/common/test_template/ --tests conf/common/test/ --test-scenario conf/common/test_scenario/sleep.toml
cloudai --mode dry-run --system-config slurm.toml --test-templates-dir conf/common/test_template/ --tests conf/common/test/ --test-scenario conf/common/test_scenario/ucc_test.toml

Additional Notes

—

Highlights for Release notes

We are working on schema improvements to simplify config management and make configs verifiable. This will help ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for Test configs. This is a continuation of #158.

Test Template TOML files were replaced with Pydantic models. This enforces mandatory arguments and their types, and requires less code to maintain.
The --test-templates-dir option was removed from all commands. All supported tests are registered in code using Registry().add_test_definition(...) and Registry().add_test_template(...). Documentation was updated to reflect this change.
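
For readers unfamiliar with the approach, here is a minimal sketch of how a Pydantic model can stand in for a Test Template TOML. The class and field names (ExampleCmdArgs, ExampleTestDefinition, seconds) are hypothetical and are not the models added in this PR. The sketch assumes Pydantic v2.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class ExampleCmdArgs(BaseModel):
    """Hypothetical command-line arguments of an illustrative test."""

    model_config = ConfigDict(extra="forbid")  # unknown keys are rejected

    seconds: int = 5  # optional, with a default


class ExampleTestDefinition(BaseModel):
    """Hypothetical schema for one Test TOML file."""

    model_config = ConfigDict(extra="forbid")

    name: str  # mandatory: no default
    description: str  # mandatory
    test_template_name: str  # mandatory
    cmd_args: ExampleCmdArgs
    extra_cmd_args: dict[str, str] = Field(default_factory=dict)


# A well-formed config is accepted; the string "5" is coerced to an int.
valid = ExampleTestDefinition.model_validate(
    {
        "name": "sleep_5",
        "description": "Sleep for five seconds",
        "test_template_name": "Sleep",
        "cmd_args": {"seconds": "5"},
    }
)

# A config with missing mandatory fields and a badly typed value is rejected.
problems: list[str] = []
try:
    ExampleTestDefinition.model_validate(
        {"name": "sleep_5", "cmd_args": {"seconds": "five"}}
    )
except ValidationError as err:
    # Each error carries the path ("loc") of the offending field.
    problems = [".".join(str(part) for part in e["loc"]) for e in err.errors()]
```

In this sketch the mandatory fields and their types live in one place, which is the maintenance benefit described above: there is no separate template file to keep in sync with the parsing code.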

Test TOML files now take advantage of the standard TOML format for all known arguments.
Before:

[cmd_args]
"training" = "llama/llama2_70b"
"training.trainer.max_steps" = "120"
"training.model.global_batch_size" = "256"
"training.model.pipeline_model_parallel_size" = "1"

Now:

[cmd_args]
  [cmd_args.training]
  values = "llama/llama2_70b"
    [cmd_args.training.trainer]
    max_steps = "120"
    [cmd_args.training.model]
    global_batch_size = "256"
    pipeline_model_parallel_size = "1"
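
To make the relationship between the two layouts concrete, here is a small sketch that flattens the nested form back into the dotted keys of the old form. It is not code from this PR. The helper name flatten_cmd_args is hypothetical, and the rule that a "values" key maps to its parent's path is an assumption inferred from the before/after example above. The input dict is what a TOML parser would produce for a subset of the "Now" snippet.

```python
from typing import Any


def flatten_cmd_args(node: dict[str, Any], prefix: str = "") -> dict[str, str]:
    """Flatten nested cmd_args tables into dotted-key form.

    Assumption: a "values" key holds the value of the enclosing table itself,
    so it maps to the parent's dotted path rather than to "<parent>.values".
    """
    flat: dict[str, str] = {}
    for key, value in node.items():
        if key == "values" and prefix:
            path = prefix
        else:
            path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-tables such as [cmd_args.training.trainer].
            flat.update(flatten_cmd_args(value, path))
        else:
            flat[path] = str(value)
    return flat


# Parsed form of a subset of the nested [cmd_args] example.
nested = {
    "training": {
        "values": "llama/llama2_70b",
        "trainer": {"max_steps": "120"},
        "model": {"global_batch_size": "256"},
    }
}

flat = flatten_cmd_args(nested)
```

The resulting keys are "training", "training.trainer.max_steps" and "training.model.global_batch_size", matching the quoted dotted keys of the old layout.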

extra_cmd_args was converted from str to dict[str, str]:
Before:

extra_cmd_args = "--stepfactor 2"

Now:

[extra_cmd_args]
"--stepfactor" = "2"

Added a new mode that verifies whether Test TOMLs are valid: cloudai --mode verify-tests --system-config conf/common/system/standalone_system.toml --tests-dir conf/common/test/chakra_replay.toml

TaekyungHeo · 2024-07-17T13:00:20Z

If we are going to accept this pydantic PR, then we should update USER_GUIDE.md as well to inform contributors how to add a new test template. Before updating USER_GUIDE.md, let's ensure @srinivas212 is on the same page.

TaekyungHeo · 2024-07-17T13:01:27Z

It's good to know that you have tested this PR with CloudAI test templates!

TaekyungHeo

LGTM. However, since it is a major change, let's wait for @srinivas212.

amaslenn · 2024-07-17T13:42:13Z

LGTM. However, since it is a major change, let's wait for @srinivas212.

Do you think we should keep all the existing field validators as is? I think this might constrain some of our users.

TaekyungHeo

@amaslenn, let's discuss this PR in the next meeting. This PR might break other test templates as it changes how we retrieve arguments.

src/cloudai/schema/test_template/jax_toolbox/slurm_command_gen_strategy.py

TaekyungHeo · 2024-09-20T15:26:43Z

@amaslenn, the README and USER_GUIDE should be further updated as they contain keywords such as "test template." I suggest running grep -ir "test template", grep -ir "test_template", and grep -ir "TestTemplate" to replace them appropriately. We should also do the same for CloudAIX.

amaslenn · 2024-09-24T11:40:22Z

... to replace them appropriately.

Great catch, I missed that, thank you! For TestTemplate it is a bit more complicated, though, as we still have a Python object with that name and it is still in use.

I've updated the docs. Please let me know if you see more issues with them.

src/cloudai/schema/test_template/jax_toolbox/slurm_command_gen_strategy.py

srivatsankrishnan

Met with Andrei on a call to discuss this PR. LGTM.

amaslenn added 7 commits July 11, 2024 12:36

Use Pydantic as a schema validator for Tests

651b456

Cleanup

362ba9b

Remove pprint

54807c5

Make pyright happy

daab5ab

Merge branch 'main' into am/schema

56cab57

Cleanup and use Registry

e5805c0

Merge branch 'main' into am/schema

8a59902

amaslenn marked this pull request as ready for review July 17, 2024 12:54

TaekyungHeo approved these changes Jul 17, 2024

TaekyungHeo requested review from artemry-nv and srinivas212 July 17, 2024 13:03

TaekyungHeo added the enhancement New feature or request label Jul 17, 2024

srinivas212 added Oct24 Oct'24 release feature and removed Oct24 Oct'24 release feature labels Jul 28, 2024

TaekyungHeo requested changes Aug 2, 2024

srinivas212 added the Oct24 Oct'24 release feature label Aug 2, 2024

amaslenn added 4 commits August 27, 2024 12:51

Merge branch 'main' into am/schema

94703ae

Format llama.toml

13acc95

Fix header tests

2c1b700

Update doc and exported symbols

2fd8863

amaslenn added feature and removed enhancement New feature or request labels Aug 27, 2024

amaslenn added 5 commits August 27, 2024 13:22

Remove needless method

66e8397

Fix tests collection, make it stricter

426c837

Make ruff happy

240e839

Make extra_cmd_args a dict[str, str]

ed3349e

Add --mode verify-tests

7cf1324

amaslenn added 6 commits September 18, 2024 21:35

Fix XLA flags formatting

8950b82

Set default value for Grok.fdl_config

564af34

Remove unused code

8516dc2

Make xla boolean values lowcase

b65cb85

Fix tests

e047ace

Add --mode verify-tests example

dbdf4ac

srivatsankrishnan reviewed Sep 19, 2024

src/cloudai/schema/test_template/jax_toolbox/slurm_command_gen_strategy.py

Merge branch 'main' into am/schema

bfcaf6b

amaslenn added 3 commits September 23, 2024 12:30

More alignments with original code

7587937

Remove Jax and related test definitions

8877f54

Update USER_GUIDE.md on test template vs test, etc.

693ca32

amaslenn and others added 5 commits September 24, 2024 15:34

Improve UI when there are parsing issues in Test configs

57ee40f

Fix imports

938df4e

Revert accidential change

b547094

Path absolute path + bug fix for pre_test and load container flags

0513ba9

fix pre_test flag

4dda839

srivatsankrishnan reviewed Sep 25, 2024

src/cloudai/schema/test_template/jax_toolbox/slurm_command_gen_strategy.py (outdated)

srivatsankrishnan and others added 4 commits September 25, 2024 03:11

revert changes

7611288

ruff fixes

1fe9311

more ruff fixes

5594cf2

Use load_container directly

a3e6d27

TaekyungHeo approved these changes Sep 25, 2024

srivatsankrishnan approved these changes Sep 25, 2024

Merge branch 'main' into am/schema

7dcf8a7

amaslenn merged commit 953f04b into main Sep 25, 2024
2 checks passed

amaslenn deleted the am/schema branch September 25, 2024 16:35

amaslenn mentioned this pull request Sep 26, 2024

Pydantic for Test Scenario #205

Merged

TaekyungHeo mentioned this pull request Oct 15, 2024

Update NeMo launcher commit hash and image tag #265

Merged
