Releases: NVIDIA/cloudai

v0.9.beta21

17 Oct 10:27
284c6a8
Pre-release

What's Changed

Full Changelog: v0.9.beta20...v0.9.beta21

v0.9.beta20

15 Oct 16:08
771e08c
Pre-release

What's Changed

Full Changelog: v0.9.beta19...v0.9.beta20

v0.9.beta19

15 Oct 14:39
622141f
Pre-release

What's Changed

  • Refactor SlurmCommandGenStrategy (_write_sbatch_script) by @TaekyungHeo in #253
  • Refactor JaxToolboxSlurmCommandGenStrategy unit tests by @TaekyungHeo in #259
  • Handle node allocation errors gracefully, log details, and exit on failure by @TaekyungHeo in #264

Full Changelog: v0.9.beta18...v0.9.beta19

v0.9.beta18

14 Oct 15:30
0af4cd9
Pre-release

What's Changed

Full Changelog: v0.9.beta17...v0.9.beta18

v0.9.beta17

11 Oct 15:28
ae67c34
Pre-release

What's Changed

Full Changelog: v0.9.beta16...v0.9.beta17

v0.9.beta16

11 Oct 06:51
156ccf9
Pre-release

Highlights

Use subcommands instead of --mode <value> by @amaslenn in #194

The new help message looks like this:

> cloudai --help
usage: cloudai [-h] [--log-file LOG_FILE] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
               {uninstall,install,dry-run,run,generate-report,verify-systems,verify-tests,verify-test-scenarios} ...

Cloud AI

optional arguments:
  -h, --help            show this help message and exit
  --log-file LOG_FILE   The name of the log file (default: debug.log).
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level (default: INFO).

modes:
  {uninstall,install,dry-run,run,generate-report,verify-systems,verify-tests,verify-test-scenarios}
    uninstall           Remove the installed dependencies.
    install             Prepare execution by setting up env and dependencies for the tests to run.
    dry-run             Perform a dry-run of the test scenarios without executing them.
    run                 Execute the test scenarios.
    generate-report     Generate a report based on the test results.
    verify-systems      Verify the system configurations.
    verify-tests        Verify the test configurations.
    verify-test-scenarios
                        Verify the test scenario configurations.
  1. Each command (a.k.a. mode) has its own help message.
  2. Each command also has a unique set of required and optional arguments. Many commands share the same options, but others differ considerably, for example:
    > cloudai run --help
    usage: cloudai run [-h] --system-config SYSTEM_CONFIG --tests-dir TESTS_DIR --test-scenario TEST_SCENARIO [--output-dir OUTPUT_DIR]
    
    optional arguments:
      -h, --help            show this help message and exit
      --system-config SYSTEM_CONFIG
                            Path to the system configuration file.
      --tests-dir TESTS_DIR
                            Path to the test configuration directory.
      --test-scenario TEST_SCENARIO
                            Path to the test scenario file.
      --output-dir OUTPUT_DIR
                            Path to the output directory.
    
    > cloudai verify-tests --help
    usage: cloudai verify-tests [-h] test_configs
    
    positional arguments:
      test_configs  Path to the test configuration file or directory.
    
    optional arguments:
      -h, --help    show this help message and exit
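
The subcommand layout shown in the help output above can be approximated with argparse sub-parsers. This is a minimal sketch and not the actual cloudai implementation: the function name build_parser, the dest name "mode", and the example path are assumptions made for illustration, and only two of the eight commands are included.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with global options and per-command arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="cloudai", description="Cloud AI")
    # Global options come before the subcommand on the command line.
    parser.add_argument("--log-file", default="debug.log", help="The name of the log file.")
    parser.add_argument(
        "--log-level",
        default="INFO",
        choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
        help="Set the logging level.",
    )

    # Each subcommand gets its own parser, so it has its own help and arguments.
    subparsers = parser.add_subparsers(dest="mode", required=True, title="modes")

    run = subparsers.add_parser("run", help="Execute the test scenarios.")
    run.add_argument("--system-config", required=True, help="Path to the system configuration file.")
    run.add_argument("--tests-dir", required=True, help="Path to the test configuration directory.")
    run.add_argument("--test-scenario", required=True, help="Path to the test scenario file.")
    run.add_argument("--output-dir", help="Path to the output directory.")

    verify = subparsers.add_parser("verify-tests", help="Verify the test configurations.")
    verify.add_argument("test_configs", help="Path to the test configuration file or directory.")

    return parser


# "conf/tests" is a made-up path used only to show how arguments are parsed.
args = build_parser().parse_args(["verify-tests", "conf/tests"])
print(args.mode, args.test_configs)
```

Because each sub-parser declares its own arguments, `run` can require `--system-config` while `verify-tests` takes a single positional path, which matches the difference visible in the two help messages.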

What's Changed

Full Changelog: v0.9.beta15...v0.9.beta16

v0.9.beta15

09 Oct 13:56
2b6181b
Pre-release

What's Changed

  • Remove assigning null when the value is null (NeMo launcher) by @TaekyungHeo in #250

Full Changelog: v0.9.beta14...v0.9.beta15

v0.9.beta14

09 Oct 13:23
7455f42
Pre-release

What's Changed

Full Changelog: v0.9.beta13...v0.9.beta14

v0.9.beta13

09 Oct 09:53
e54f4c1
Pre-release

What's Changed

Full Changelog: v0.9.beta12...v0.9.beta13

v0.9.beta12

07 Oct 17:38
c40e92c
Pre-release

Highlights

We are working on schema improvements to simplify config management and make configs verifiable. This helps ensure that configs are correct before expensive runs on real hardware. Today we are enabling this for Test Scenario configs. This is a continuation of #145.

  1. Tests becomes an array. This helps make case names more expressive:
    before:
    [Tests.1]
    # ...
    now:
    [[Tests]]
    id = "any-name.you_want" # before it was just "1"
  2. The id field is mandatory, must be unique, and is used to specify dependencies:
    [[Tests]]
    id = "Tests.1"
    # ...
    
    [[Tests]]
    id = "Tests.2"
    # ...
      [[Tests.dependencies]]
      id = "Tests.1"
      # ...
  3. name (under the list of tests) is renamed to test_name to better reflect its meaning. It still references a test defined in a separate TOML file.
  4. Dependencies are converted to a list to support multiple dependencies of the same type.
    before:
    # ...
    
    [Tests.2]
    name = "ucc_test_alltoall"
      [Tests.2.dependencies]
      start_post_comp = { name = "Tests.1", time = 0 }  # only one dependency of this type is allowed
    now:
    # ...
    
    [[Tests]]
    id = "Tests.3"
    test_name = "ucc_test_alltoall"
    # ...
      [[Tests.dependencies]]
      type = "start_post_comp"
      id = "Tests.1"
    
      [[Tests.dependencies]]
      type = "start_post_comp"
      id = "Tests.2"

What's Changed

Full Changelog: v0.9.beta11...v0.9.beta12