Releases: NVIDIA/cloudai
v0.9.beta21
v0.9.beta20
What's Changed
- Update NeMo launcher commit hash and image tag by @TaekyungHeo in #265
Full Changelog: v0.9.beta19...v0.9.beta20
v0.9.beta19
What's Changed
- Refactor SlurmCommandGenStrategy (_write_sbatch_script) by @TaekyungHeo in #253
- Refactor JaxToolboxSlurmCommandGenStrategy unit tests by @TaekyungHeo in #259
- Handle node allocation errors gracefully, log details, and exit on failure by @TaekyungHeo in #264
Full Changelog: v0.9.beta18...v0.9.beta19
v0.9.beta18
What's Changed
- Cleanup docs from mentioning --mode option by @amaslenn in #260
- Improve verify modes by @amaslenn in #262
Full Changelog: v0.9.beta17...v0.9.beta18
v0.9.beta17
What's Changed
- Refactor JaxToolboxSlurmCommandGenStrategy by @TaekyungHeo in #254
- Move JaxToolbox-related test definitions to CloudAI by @TaekyungHeo in #257
Full Changelog: v0.9.beta16...v0.9.beta17
v0.9.beta16
Highlights
Use subcommands instead of --mode <value> by @amaslenn in #194
The new help message looks like this:
    > cloudai --help
    usage: cloudai [-h] [--log-file LOG_FILE] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                   {uninstall,install,dry-run,run,generate-report,verify-systems,verify-tests,verify-test-scenarios} ...

    Cloud AI

    optional arguments:
      -h, --help            show this help message and exit
      --log-file LOG_FILE   The name of the log file (default: debug.log).
      --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                            Set the logging level (default: INFO).

    modes:
      {uninstall,install,dry-run,run,generate-report,verify-systems,verify-tests,verify-test-scenarios}
        uninstall           Remove the installed dependencies.
        install             Prepare execution by setting up env and dependencies for the tests to run.
        dry-run             Perform a dry-run of the test scenarios without executing them.
        run                 Execute the test scenarios.
        generate-report     Generate a report based on the test results.
        verify-systems      Verify the system configurations.
        verify-tests        Verify the test configurations.
        verify-test-scenarios
                            Verify the test scenario configurations.
- Each command (a.k.a. mode) has its own help message.
- Each command also has a unique set of required and optional arguments. Many commands share the same options, but others differ considerably, for example:
    > cloudai run --help
    usage: cloudai run [-h] --system-config SYSTEM_CONFIG --tests-dir TESTS_DIR --test-scenario TEST_SCENARIO [--output-dir OUTPUT_DIR]

    optional arguments:
      -h, --help            show this help message and exit
      --system-config SYSTEM_CONFIG
                            Path to the system configuration file.
      --tests-dir TESTS_DIR
                            Path to the test configuration directory.
      --test-scenario TEST_SCENARIO
                            Path to the test scenario file.
      --output-dir OUTPUT_DIR
                            Path to the output directory.

    > cloudai verify-tests --help
    usage: cloudai verify-tests [-h] test_configs

    positional arguments:
      test_configs  Path to the test configuration file or directory.

    optional arguments:
      -h, --help    show this help message and exit
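For readers unfamiliar with the subcommand pattern, here is a minimal sketch of how such a CLI can be laid out with Python's standard argparse module. It is not the CloudAI implementation: the function name build_parser is hypothetical, and only two of the modes from the help output above are reproduced.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with per-mode subcommands (illustrative sketch only)."""
    parser = argparse.ArgumentParser(prog="cloudai", description="Cloud AI")
    parser.add_argument(
        "--log-file",
        default="debug.log",
        help="The name of the log file (default: %(default)s).",
    )
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Set the logging level (default: %(default)s).",
    )

    # Each mode becomes a subcommand with its own arguments and help message.
    subparsers = parser.add_subparsers(dest="mode", title="modes", required=True)

    run = subparsers.add_parser("run", help="Execute the test scenarios.")
    run.add_argument("--system-config", required=True, help="Path to the system configuration file.")
    run.add_argument("--tests-dir", required=True, help="Path to the test configuration directory.")
    run.add_argument("--test-scenario", required=True, help="Path to the test scenario file.")
    run.add_argument("--output-dir", help="Path to the output directory.")

    verify = subparsers.add_parser("verify-tests", help="Verify the test configurations.")
    verify.add_argument("test_configs", help="Path to the test configuration file or directory.")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.mode)
```

Because every subparser owns its arguments, options that only make sense for one mode (such as --test-scenario for run) no longer have to be accepted and ignored by the others.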
What's Changed
- Refactor NeMoLauncherSlurmCommandGenStrategy unit tests by @TaekyungHeo in #252
- Refactor JaxToolboxSlurmCommandGenStrategy by @TaekyungHeo in #249
Full Changelog: v0.9.beta15...v0.9.beta16
v0.9.beta15
What's Changed
- Remove assigning null when the value is null (NeMo launcher) by @TaekyungHeo in #250
Full Changelog: v0.9.beta14...v0.9.beta15
v0.9.beta14
What's Changed
- Fix bug in violating Kubernetes naming rules by @TaekyungHeo in #244
- Add unit tests for SlurmCommandGenStrategy by @TaekyungHeo in #247
- Fix missing 'output_path' in cmd_args by @amaslenn in #251
Full Changelog: v0.9.beta13...v0.9.beta14
v0.9.beta13
What's Changed
- Update Sleep to ensure implementation consistency by @TaekyungHeo in #234
- Update USER_GUIDE.md and README.md by @TaekyungHeo in #235
- Remove duplicated _format_env_vars calls by @TaekyungHeo in #233
- Rename test definitions by @TaekyungHeo in #237
- Remove unnecessary arg from generate_test_command by @TaekyungHeo in #238
- Spin-off cmd_args validation logic for SlurmCommandGenStrategy by @TaekyungHeo in #236
- Expect SlurmSystem in respective cmd_gen and installer classes by @amaslenn in #239
- Move more fields from Test to TestRun by @amaslenn in #240
- Make TestDefinition a part of Test by @amaslenn in #241
- Minor refactoring on SlurmCommandGenStrategy by @TaekyungHeo in #246
- Break down test_slurm_command_gen_strategy into smaller tests by @TaekyungHeo in #245
- Resolve K8s Comments (Part 1) by @TaekyungHeo in #242
- Fix race condition during docker images caching by @amaslenn in #248
Full Changelog: v0.9.beta12...v0.9.beta13
v0.9.beta12
Highlights
We are working on schema improvements to simplify config management and make configs verifiable. This helps ensure that configs are correct before expensive runs on real hardware. Today we are enabling it for Test Scenario configs. This is a continuation of #145.
- Tests becomes an array. This helps make case names more expressive:
    before:
        [Tests.1]
        # ...
    now:
        [[Tests]]
        id = "any-name.you_want" # before it was just "1"
- The id field is mandatory, must be unique, and is used to specify dependencies:
        [[Tests]]
        id = "Tests.1"
        # ...

        [[Tests]]
        id = "Tests.2"
        # ...
        [[Tests.dependencies]]
        id = "Tests.1"
        # ...
- name (under the list of tests) is renamed to test_name to better reflect its meaning. It still references a test defined in a separate TOML file.
- Dependencies are converted to a list to support multiple dependencies of the same type.
    before:
        # ...
        [Tests.2]
        name = "ucc_test_alltoall"
        [Tests.2.dependencies]
        start_post_comp = { name = "Tests.1", time = 0 } # only one dependency of this type is allowed
    now:
        # ...
        [[Tests]]
        id = "Tests.3"
        test_name = "ucc_test_alltoall"
        # ...
        [[Tests.dependencies]]
        type = "start_post_comp"
        id = "Tests.1"
        [[Tests.dependencies]]
        type = "start_post_comp"
        id = "Tests.2"
What's Changed
- Cover wrong python bin path in exec script bug by @amaslenn in #232
- Pydantic for Test Scenario by @amaslenn in #205
Full Changelog: v0.9.beta11...v0.9.beta12