Initial dse parameters for llama_8b #359

Merged: 27 commits, Feb 12, 2025
2588afd
intial dse parameters for llama_8b
srivatsankrishnan Feb 5, 2025
a75cfd5
update pytest
srivatsankrishnan Feb 6, 2025
9b1599d
update the golden ref sbatch script
srivatsankrishnan Feb 6, 2025
0324ce5
more fixes
srivatsankrishnan Feb 6, 2025
f005f35
update report generation to take average from 80 to 100 iteration
srivatsankrishnan Feb 6, 2025
b086eb9
add mixed preicsion support in CLoudAI
srivatsankrishnan Feb 7, 2025
fec79b8
Add checks to ensure the test scenario num_nodes matches with NemoRUn…
srivatsankrishnan Feb 7, 2025
5ee5da7
more fixes
srivatsankrishnan Feb 7, 2025
dd72d7f
constrain check property in test definition
srivatsankrishnan Feb 8, 2025
d21523b
trajectory writer for summary
srivatsankrishnan Feb 8, 2025
833a1d8
store it at top-level directory
srivatsankrishnan Feb 8, 2025
30ca47d
remove None from command (nemo runtime errors) + log actions
srivatsankrishnan Feb 8, 2025
4b47f4f
fix andrei comments
srivatsankrishnan Feb 10, 2025
c3df263
add model config
srivatsankrishnan Feb 10, 2025
5b78579
Merge branch 'main' into llama3_8b_dse
srivatsankrishnan Feb 10, 2025
463f099
handle the cloudai vs nemo run case for num_nodes
srivatsankrishnan Feb 10, 2025
b3577ae
Merge remote-tracking branch 'origin/llama3_8b_dse' into llama3_8b_dse
srivatsankrishnan Feb 10, 2025
4a74eb1
address andrei's comments
srivatsankrishnan Feb 10, 2025
844313d
add more flags for bf16/fp8
srivatsankrishnan Feb 10, 2025
ee20ef4
ruff fix
srivatsankrishnan Feb 10, 2025
4f6b257
remove capsys
srivatsankrishnan Feb 12, 2025
18d040c
Merge branch 'main' into llama3_8b_dse
srivatsankrishnan Feb 12, 2025
92b4f7c
remove valuerror exception
srivatsankrishnan Feb 12, 2025
b1272fa
constaint check property in test definition
srivatsankrishnan Feb 12, 2025
629636b
use property in cloudai_gym
srivatsankrishnan Feb 12, 2025
3663485
fix the constrain_check
srivatsankrishnan Feb 12, 2025
03e298e
fix
srivatsankrishnan Feb 12, 2025
49 changes: 49 additions & 0 deletions conf/common/test/dse_nemo_run_llama3_8b_fp8.toml
@@ -0,0 +1,49 @@
# SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES
# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name = "dse_nemo_run_llama3_8b_fp8"
description = "dse_nemo_run_llama3_8b"
test_template_name = "NeMoRun"

[cmd_args]
docker_image_url = "nvcr.io/nvidia/nemo:24.12.rc3"
task = "pretrain"
recipe_name = "llama3_8b"

[cmd_args.data]
micro_batch_size = [1]
global_batch_size = [128, 256, 512]

[cmd_args.trainer]
max_steps = 100
val_check_interval = 1000
num_nodes = 1

[cmd_args.trainer.strategy]
tensor_model_parallel_size = [1]
pipeline_model_parallel_size = [1]
context_parallel_size = [2]

[cmd_args.trainer.plugins]
fp8 = "hybrid"
fp8_margin = 0
fp8_amax_history_len = 1024
fp8_amax_compute_algo = "max"
fp8_params = true

[cmd_args.log.ckpt]
save_on_train_epoch_end = false
save_last = false
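The list-valued entries in this file (for example `global_batch_size = [128, 256, 512]`) read as the swept dimensions of the design-space exploration, while scalar entries such as `max_steps = 100` stay fixed. The sketch below shows one way such a file could expand into concrete configurations. It assumes a plain Cartesian product over the list-valued keys; the `expand_search_space` helper and the dotted key names are hypothetical and are not taken from CloudAI.

```python
from itertools import product

# Values copied from the TOML above. Dotted keys are an illustrative
# flattening of the nested tables, not a CloudAI convention.
cmd_args = {
    "data.micro_batch_size": [1],
    "data.global_batch_size": [128, 256, 512],
    "trainer.max_steps": 100,
    "trainer.strategy.tensor_model_parallel_size": [1],
    "trainer.strategy.pipeline_model_parallel_size": [1],
    "trainer.strategy.context_parallel_size": [2],
}


def expand_search_space(args):
    """Return one dict per combination of the list-valued entries (hypothetical helper)."""
    swept = {k: v for k, v in args.items() if isinstance(v, list)}
    fixed = {k: v for k, v in args.items() if not isinstance(v, list)}
    keys = list(swept)
    return [
        {**fixed, **dict(zip(keys, combo))}
        for combo in product(*(swept[k] for k in keys))
    ]


configs = expand_search_space(cmd_args)
for cfg in configs:
    print(cfg["data.global_batch_size"], cfg["trainer.strategy.context_parallel_size"])
```

Under that assumption the file describes three candidate configurations, one per `global_batch_size` value, because every other swept key holds a single value.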
12 changes: 12 additions & 0 deletions src/cloudai/test_definitions/nemo_run.py
@@ -22,6 +22,17 @@
from cloudai.installer.installables import DockerImage, Installable


class Plugin(BaseModel):
    """Plugin configuration for NeMoRun."""

    fp8: Optional[str] = "hybrid"
    fp8_margin: Optional[int] = 0
    fp8_amax_history_len: Optional[int] = 1
    fp8_amax_compute_algo: Optional[str] = "most_recent"
    fp8_wgrad: Optional[bool] = True
    fp8_params: Optional[bool] = True


class Data(BaseModel):
    """Data configuration for NeMoRun."""

@@ -45,6 +56,7 @@ class Trainer(BaseModel):
    val_check_interval: Union[int, List[int]] = 1000
    num_nodes: Union[int, List[int]] = 1
    strategy: TrainerStrategy = Field(default_factory=TrainerStrategy)
    plugins: Plugin = Field(default_factory=Plugin)


class LogCkpt(BaseModel):
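The `Plugin` defaults differ from the values set in `dse_nemo_run_llama3_8b_fp8.toml`, so it can help to see which fields the test file actually overrides. The sketch below uses a standard-library dataclass as a stand-in for the Pydantic model (an assumption made to keep the example dependency-free) and a hand-written dict in place of the parsed `[cmd_args.trainer.plugins]` table.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Plugin:
    """Dataclass stand-in for the Pydantic Plugin model, with the same defaults."""

    fp8: Optional[str] = "hybrid"
    fp8_margin: Optional[int] = 0
    fp8_amax_history_len: Optional[int] = 1
    fp8_amax_compute_algo: Optional[str] = "most_recent"
    fp8_wgrad: Optional[bool] = True
    fp8_params: Optional[bool] = True


# Values copied from the [cmd_args.trainer.plugins] table in the TOML file.
toml_plugins = {
    "fp8": "hybrid",
    "fp8_margin": 0,
    "fp8_amax_history_len": 1024,
    "fp8_amax_compute_algo": "max",
    "fp8_params": True,
}

plugin = Plugin(**toml_plugins)
defaults = asdict(Plugin())
effective = asdict(plugin)

# Keep only the fields whose effective value differs from the model default.
overridden = {k: v for k, v in effective.items() if defaults[k] != v}
print(overridden)
```

Only `fp8_amax_history_len` and `fp8_amax_compute_algo` change; `fp8_wgrad` is absent from the TOML table and therefore keeps its default of `True`.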