Initial DSE parameters for llama_8b #359

Merged
27 commits merged into NVIDIA:main on Feb 12, 2025

Conversation

@srivatsankrishnan (Contributor) commented Feb 5, 2025

Summary

Updated config plus the relevant CloudAI changes for the configs discussed in today's meeting with Malay. This is the first of many PRs to get base NemoRun support into CloudAI.

  • Updated the test definition with the parameters needed for the Llama_8b model.
  • Modified report generation to average train step timing over steps 80-100 and to drop the first step (initial overhead); see the sketch after this list.
  • Updated the relevant unit tests.
  • Added support for mixed precision.
  • Added constraint checkers based on the discussion with Malay. More details here.
  • Added a trajectory writer plus log generation for storing gym interface data.
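
For the report-generation change above, here is a minimal sketch of the intended averaging. The log pattern, file layout, and function name are assumptions for illustration, not the actual CloudAI implementation:

```python
import re
from pathlib import Path
from statistics import mean


def average_train_step_timing(stdout_path: Path) -> float:
    """Average train_step_timing over steps 80-100 of a NeMo run log.

    Sketch only: the regex below assumes a 'train_step_timing in s: <value>'
    line per step, which may not match the real log format exactly.
    The first step is dropped because of start-up overhead; since it falls
    outside the 80-100 window, slicing that window covers both rules.
    """
    pattern = re.compile(r"train_step_timing in s:?\s*([\d.]+)")
    timings = [
        float(m.group(1))
        for line in stdout_path.read_text().splitlines()
        if (m := pattern.search(line)) is not None
    ]
    # Steps are 1-indexed, so steps 80-100 map to indices 79..99.
    window = timings[79:100]
    if not window:
        raise ValueError(f"No train_step_timing found in {stdout_path}")
    return mean(window)
```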

Update: DP size is auto-computed and set by NeMo, so there is no need to expose or sweep it, based on the discussion with Malay.

Update: how to handle the CloudAI num_nodes versus the NeMo-Run trainer.num_nodes.
Context: CloudAI controls generation of the sbatch script and, by extension, the node allocation from Slurm. However, the srun step uses the NeMo-Run CLI, which internally has its own Slurm executor. The prior approach in the NemoRun CloudAI support mixed these concepts: the flag was not exposed at all, and during command generation it was overwritten based on the test scenario.

Based on the design meeting discussion on 2/10, we agreed that:

  • From a benchmarking point of view, trainer.num_nodes will not be exposed in the test TOML; the CloudAI test scenario will be used to set this flag.
  • In the DSE job, trainer.num_nodes will be defined as a list, and we only need to check for the case where trainer.num_nodes exceeds the num_nodes in the test scenario. That case means the job asks for more nodes than were allocated and would fail. We address it by catching the case, logging a warning message, and then clamping trainer.num_nodes to the num_nodes defined in the test scenario.

Based on the design meeting discussion on 2/12, we agreed instead to exit when trainer.num_nodes exceeds the test-scenario node count. The error message is logged rather than raising a ValueError, which avoids a try/except (reason: code readability); see the sketch after the example below.

Example

[Error] Mismatch in num_nodes: 1 vs 4. trainer.num_nodes should be less than or equal to the number of nodes specified in the test scenario.
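
A minimal sketch of this check, assuming hypothetical argument names; the real CloudAI command-generation code and exact call site may differ:

```python
import logging
import sys


def check_trainer_num_nodes(trainer_num_nodes: int, scenario_num_nodes: int) -> None:
    """Exit instead of raising ValueError when the DSE value over-asks for nodes.

    Sketch only: in CloudAI the two values would come from the DSE parameter
    list and the test-scenario definition, respectively.
    """
    if trainer_num_nodes > scenario_num_nodes:
        logging.error(
            "Mismatch in num_nodes: %d vs %d. trainer.num_nodes should be less than "
            "or equal to the number of nodes specified in the test scenario.",
            scenario_num_nodes,
            trainer_num_nodes,
        )
        sys.exit(1)
```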

Things To Do (not part of this PR)

  • Tokenizer for this model (Will be done by Taekyung's rework on Nemo2.0 for CloudAI)
  • Nsys (Will be done by Taekyung's rework on Nemo2.0 for CloudAI)

Test Plan

  • CI
  • Real system
cloudai run --system-config ../cloudaix/conf/common/system/coreweave.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/dse_nemo_run_llama3_8b.toml

[INFO] System Name: coreweave
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: dse_nemo_run_llama3_8b_1
  Test Name: dse_nemo_run_llama3_8b
  Description: dse_nemo_run_llama3_8b
  No dependencies
[INFO] Initializing Runner [DRY-RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step 1: Observation: [xxxx], Reward: xxxx
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step 2: Observation: [xxxx], Reward: xxxx
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: 0
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step 3: Observation: [xxx], Reward: xxx

More testing on a real system

cloudai run --system-config ../cloudaix/conf/common/system/xxxx.toml --tests-dir conf/common/test --test-scenario conf/common/test_scenario/xxxxx.toml 
[INFO] System Name: xxx
[INFO] Scheduler: slurm
[INFO] Test Scenario Name: nemo_run_llama3_8b
[INFO] Checking if test templates are installed.
[INFO] Test Scenario: nemo_run_llama3_8b

Section Name: xxxx
  Test Name: xxx
  Description: xxxx
  No dependencies
[INFO] Initializing Runner [RUN] mode
[INFO] Creating SlurmRunner
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[ERROR] No train_step_timing found in results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/x/stdout.txt
[INFO] Step x: Observation: [-x.x], Reward: -x.x
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step x: Observation: [xx.xxxx], Reward: x.xxxxxxxxxxxxxxx
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[ERROR] No train_step_timing found in results/nemo_run_llama3_8b/dse_nemo_run_llama3_8b_1/0/x/stdout.txt
[INFO] Step x: Observation: [-x.x], Reward: -x.x
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step x: Observation: [xx.xxxxxxxx], Reward: x.xxxxxxxxxxxxxxx
[INFO] Starting test scenario execution.
[INFO] Starting test: dse_nemo_run_llama3_8b_1
[INFO] Running test: dse_nemo_run_llama3_8b_1
[INFO] Submitted slurm job: xxxxxxx
[INFO] Job completed: dse_nemo_run_llama3_8b_1
[INFO] Step x: Observation: [xx.xxxxxxxx], Reward: x.xxxxxxxxxxxxxxx
[INFO] Constraint check failed. Skipping step.
[INFO] Step x: Observation: [-x.x], Reward: -x.x
[INFO] Constraint check failed. Skipping step.
[INFO] Step x: Observation: [-x.x], Reward: -x.x
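
The constraint-check skips above come from the DSE loop: when a sampled action violates a constraint, the job is not submitted and a penalty observation and reward are returned. A rough sketch of that flow, with the class, method, and reward details invented purely for illustration (this is not the actual CloudAI gym interface):

```python
import logging
from typing import Callable, Dict, List, Tuple


class DseEnvSketch:
    """Illustrative stand-in for a gym-style DSE environment."""

    PENALTY = -1.0

    def __init__(
        self,
        constraints: List[Callable[[Dict], bool]],
        run_job: Callable[[Dict], float],
    ) -> None:
        self.constraints = constraints  # each maps an action dict to pass/fail
        self.run_job = run_job          # submits the job, returns avg step timing

    def step(self, action: Dict) -> Tuple[List[float], float]:
        if not all(check(action) for check in self.constraints):
            logging.info("Constraint check failed. Skipping step.")
            return [self.PENALTY], self.PENALTY
        timing = self.run_job(action)
        # Hypothetical reward: faster steps (smaller timing) give a larger reward.
        reward = 1.0 / timing if timing > 0 else self.PENALTY
        return [timing], reward
```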


@srivatsankrishnan srivatsankrishnan marked this pull request as ready for review February 6, 2025 05:36
Resolved review threads:
  • tests/test_cloudaigym.py (outdated)
  • conf/common/test/dse_nemo_run_llama3_8b.toml
  • src/cloudai/test_definitions/nemo_run.py (outdated, two threads)
srinivas212 previously approved these changes Feb 12, 2025
@srivatsankrishnan merged commit ba115d7 into NVIDIA:main Feb 12, 2025
2 checks passed