Initial DSE parameters for llama_8b #359

Merged
27 commits merged into NVIDIA:main on Feb 12, 2025

Conversation

@srivatsankrishnan (Contributor) commented Feb 5, 2025

Summary

Updated config plus the relevant CloudAI changes for the configs discussed in today's meeting with Malay. This is the first of many PRs to get base NemoRun support into CloudAI.

  • Updated the test definition with the parameters needed for the Llama_8b model.
  • Modified report generation to average train step timing over steps 80-100 and to drop the first step (initial overhead); see the sketch after this list.
  • Updated the relevant unit tests.
  • Added support for mixed precision.
  • Added constraint checkers based on the discussion with Malay. More details here.
  • Added a trajectory writer plus log generation for storing gym interface data.
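
For the report-generation change above, here is a minimal sketch of the intended averaging. The log pattern, file layout, and function name are assumptions for illustration, not the actual CloudAI implementation:

```python
import re
from pathlib import Path
from statistics import mean


def average_train_step_timing(stdout_path: Path) -> float:
    """Average train_step_timing over steps 80-100 of a NeMo run log.

    Sketch only: the regex below assumes a 'train_step_timing in s: <value>'
    line per step, which may not match the real log format exactly.
    The first step is dropped because of start-up overhead; since it falls
    outside the 80-100 window, slicing that window covers both rules.
    """
    pattern = re.compile(r"train_step_timing in s:?\s*([\d.]+)")
    timings = [
        float(m.group(1))
        for line in stdout_path.read_text().splitlines()
        if (m := pattern.search(line)) is not None
    ]
    # Steps are 1-indexed, so steps 80-100 map to indices 79..99.
    window = timings[79:100]
    if not window:
        raise ValueError(f"No train_step_timing found in {stdout_path}")
    return mean(window)
```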

Update: DP size is auto-computed and set by NeMo, so there is no need to expose or sweep it, based on the discussion with Malay.

Update: how to handle the CloudAI num_nodes versus the NeMo-Run trainer.num_nodes.
Context: CloudAI controls generation of the sbatch script and, by extension, the node allocation from Slurm. However, the srun step uses the NeMo-Run CLI, which internally has its own Slurm executor. The prior approach in the NemoRun CloudAI support mixed these concepts: the flag was not exposed at all, and during command generation it was overwritten based on the test scenario.

Based on the design meeting discussion on 2/10, we agreed that:

  • From a benchmarking point of view, trainer.num_nodes will not be exposed in the test TOML; the CloudAI test scenario will be used to set this flag.
  • In the DSE job, trainer.num_nodes will be defined as a list, and we only need to check for the case where trainer.num_nodes exceeds the num_nodes in the test scenario. That case means the job asks for more nodes than were allocated and would fail. We address it by catching the case, logging a warning message, and then clamping trainer.num_nodes to the num_nodes defined in the test scenario.

Based on the design meeting discussion on 2/12, we agreed instead to exit when trainer.num_nodes exceeds the test-scenario node count. The error message is logged rather than raising a ValueError, which avoids a try/except (reason: code readability); see the sketch after the example below.

Example

[Error] Mismatch in num_nodes: 1 vs 4. trainer.num_nodes should be less than or equal to the number of nodes specified in the test scenario.
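
A minimal sketch of this check, assuming hypothetical argument names; the real CloudAI command-generation code and exact call site may differ:

```python
import logging
import sys


def check_trainer_num_nodes(trainer_num_nodes: int, scenario_num_nodes: int) -> None:
    """Exit instead of raising ValueError when the DSE value over-asks for nodes.

    Sketch only: in CloudAI the two values would come from the DSE parameter
    list and the test-scenario definition, respectively.
    """
    if trainer_num_nodes > scenario_num_nodes:
        logging.error(
            "Mismatch in num_nodes: %d vs %d. trainer.num_nodes should be less than "
            "or equal to the number of nodes specified in the test scenario.",
            scenario_num_nodes,
            trainer_num_nodes,
        )
        sys.exit(1)
```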

Things To Do (not part of this PR)

  • Tokenizer for this model (Will be done by Taekyung's rework on Nemo2.0 for CloudAI)
  • Nsys (Will be done by Taekyung's rework on Nemo2.0 for CloudAI)

Test Plan

  • CI
  • Real system
cloudai run --system-config ../cloudaix/conf/common/system/coreweave.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml

[INFO] System Name: coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: dse_nemo_run_llama3_8b_1
  Test Name: dse_nemo_run_llama3_8b
  Description: dse_nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step 1: Observation: [xxxx], Reward: xxxx
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step 2: Observation: [xxxx], Reward: xxxx
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step 3: Observation: [xxx], Reward: xxx

More testing on a real system

cloudai run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/xxxxx.toml 
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: xxxx
  Test Name: xxx
  Description: xxxx
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[ERROR] No train_step_timing found in results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/x/stdout.txt
[INFO] Step x: Observation: [-x.x], Reward: -x.x
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step x: Observation: [xx.xxxx], Reward: x.xxxxxxxxxxxxxxx
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[ERROR] No train_step_timing found in results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/x/stdout.txt
[INFO] Step x: Observation: [-x.x], Reward: -x.x
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step x: Observation: [xx.xxxxxxxx], Reward: x.xxxxxxxxxxxxxxx
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step x: Observation: [xx.xxxxxxxx], Reward: x.xxxxxxxxxxxxxxx
[INFO] Constraint check failed. Skipping step.
[INFO] Step x: Observation: [-x.x], Reward: -x.x
[INFO] Constraint check failed. Skipping step.
[INFO] Step x: Observation: [-x.x], Reward: -x.x
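
The constraint-check skips above come from the DSE loop: when a sampled action violates a constraint, the job is not submitted and a penalty observation and reward are returned. A rough sketch of that flow, with the class, method, and reward details invented purely for illustration (this is not the actual CloudAI gym interface):

```python
import logging
from typing import Callable, Dict, List, Tuple


class DseEnvSketch:
    """Illustrative stand-in for a gym-style DSE environment."""

    PENALTY = -1.0

    def __init__(
        self,
        constraints: List[Callable[[Dict], bool]],
        run_job: Callable[[Dict], float],
    ) -> None:
        self.constraints = constraints  # each maps an action dict to pass/fail
        self.run_job = run_job          # submits the job, returns avg step timing

    def step(self, action: Dict) -> Tuple[List[float], float]:
        if not all(check(action) for check in self.constraints):
            logging.info("Constraint check failed. Skipping step.")
            return [self.PENALTY], self.PENALTY
        timing = self.run_job(action)
        # Hypothetical reward: faster steps (smaller timing) give a larger reward.
        reward = 1.0 / timing if timing > 0 else self.PENALTY
        return [timing], reward
```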


@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review February 6, 2025 05:36
Resolved review threads:
  • tests/test_cloudaigym.py (outdated)
  • conf/common/test/dse_nemo_run_llama3_8b.toml
  • src/cloudai/test_definitions/nemo_run.py (outdated, two threads)
srinivas212 previously approved these changes Feb 12, 2025
@srivatsankrishnan merged commit ba115d7 into NVIDIA:main Feb 12, 2025
2 checks passed