
add auto configurator to NeMo #10270

Merged · 66 commits · Sep 7, 2024
cd3bb0b
add base configs
dimapihtar Aug 27, 2024
28d3c02
add auto configurator functionality
dimapihtar Aug 27, 2024
cf50b14
Apply isort and black reformatting
dimapihtar Aug 27, 2024
e89ed25
add runner
dimapihtar Aug 27, 2024
0a2a6a0
add end-to-end example for auto configurator
dimapihtar Aug 27, 2024
b0d8478
add unit tests for auto configurator
dimapihtar Aug 27, 2024
8189de9
add GPT configs
dimapihtar Aug 27, 2024
a551658
add GPT configs
dimapihtar Aug 27, 2024
a28f77b
Apply isort and black reformatting
dimapihtar Aug 27, 2024
35522ab
switch to dataclass
dimapihtar Aug 27, 2024
399385b
Apply isort and black reformatting
dimapihtar Aug 27, 2024
b616b41
switch to dataclass
dimapihtar Aug 27, 2024
80054d7
Apply isort and black reformatting
dimapihtar Aug 27, 2024
227a738
fix dataclasses usage
dimapihtar Aug 27, 2024
d0acbca
Apply isort and black reformatting
dimapihtar Aug 27, 2024
9a26476
remove unused imports
dimapihtar Aug 28, 2024
1315031
remove extra function
dimapihtar Aug 28, 2024
1aafc20
fix docstring style
dimapihtar Aug 28, 2024
bda0100
Apply isort and black reformatting
dimapihtar Aug 28, 2024
6d5305e
take Config object as input for model
dimapihtar Aug 28, 2024
bb86c39
Apply isort and black reformatting
dimapihtar Aug 28, 2024
a2099af
add nemotron support
dimapihtar Aug 28, 2024
86694e6
Apply isort and black reformatting
dimapihtar Aug 28, 2024
0b896b7
remove search_config.py
dimapihtar Sep 2, 2024
2d062b0
Apply isort and black reformatting
dimapihtar Sep 2, 2024
1e30118
move configs creation to Basic class
dimapihtar Sep 3, 2024
1f8dde6
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 3, 2024
14b9549
Apply isort and black reformatting
dimapihtar Sep 3, 2024
e1ccec1
move to common basic class
dimapihtar Sep 3, 2024
c641b7d
Apply isort and black reformatting
dimapihtar Sep 3, 2024
71b0420
rename main config
dimapihtar Sep 3, 2024
4103009
remove base configs for models
dimapihtar Sep 3, 2024
e3793ad
Apply isort and black reformatting
dimapihtar Sep 3, 2024
5815586
Apply isort and black reformatting
artbataev Sep 3, 2024
4d03be0
change auto conf functionality
dimapihtar Sep 3, 2024
f812e2b
Apply isort and black reformatting
dimapihtar Sep 3, 2024
97f9e61
fix docstring
dimapihtar Sep 4, 2024
d2fed7a
Apply isort and black reformatting
dimapihtar Sep 4, 2024
eb9bae5
remove unused imports
dimapihtar Sep 4, 2024
4606ef3
add changes
dimapihtar Sep 4, 2024
a4e8128
remove activations_checkpoint_num_layers
dimapihtar Sep 4, 2024
b853a83
remove gbs from config
dimapihtar Sep 4, 2024
7040056
fix logs
dimapihtar Sep 4, 2024
ae744ae
Apply isort and black reformatting
dimapihtar Sep 4, 2024
eda32ce
fix performance calculation
dimapihtar Sep 4, 2024
ae46957
fix end-to-end example
dimapihtar Sep 5, 2024
1fe46ed
Apply isort and black reformatting
dimapihtar Sep 5, 2024
25b8e3f
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 5, 2024
38082d9
fix model config
dimapihtar Sep 5, 2024
0ce1672
Apply isort and black reformatting
dimapihtar Sep 5, 2024
41c9f29
minor changes
dimapihtar Sep 5, 2024
3fdcc83
minor changes
dimapihtar Sep 5, 2024
3fbcc16
Apply isort and black reformatting
dimapihtar Sep 5, 2024
010e0de
fix unit tests
dimapihtar Sep 6, 2024
7fd82cf
Apply isort and black reformatting
dimapihtar Sep 6, 2024
3a345e8
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 6, 2024
83e537d
add README
dimapihtar Sep 6, 2024
1aa3636
fix README
dimapihtar Sep 6, 2024
649eb44
fix README
dimapihtar Sep 6, 2024
c642281
fix readme
dimapihtar Sep 6, 2024
7309603
fix readme
dimapihtar Sep 6, 2024
df1dcb8
remove extra arg
dimapihtar Sep 6, 2024
25a148a
remove unused imports
dimapihtar Sep 6, 2024
f006372
add nemo-run installation
dimapihtar Sep 6, 2024
9dda193
fix unit tests
dimapihtar Sep 7, 2024
c4c5ecb
fix unit tests
dimapihtar Sep 7, 2024
4 changes: 4 additions & 0 deletions Dockerfile.ci
@@ -31,6 +31,10 @@ EOF

WORKDIR /workspace

RUN pip install hatchling # needed to install nemo-run
ARG NEMO_RUN_TAG=34259bd3e752fef94045a9a019e4aaf62bd11ce2
RUN pip install nemo_run@git+https://github.com/NVIDIA/NeMo-Run.git@${NEMO_RUN_TAG}

# Install NeMo requirements
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG MODELOPT_VERSION=0.15.0
85 changes: 85 additions & 0 deletions examples/llm/auto_configurator/README.md
@@ -0,0 +1,85 @@
> [!IMPORTANT]
> This is an early version of Auto Configurator; the code base may change as it is integrated into the CLI.

Use Auto Configurator to Find the Optimal Configuration
-------------------------------------------------------

Auto Configurator searches for the hyperparameters (HPs) that achieve the highest training throughput for Large Language Models (LLMs) trained with the NeMo Framework.

> [!NOTE]
> Auto Configurator currently supports only GPT-based models: GPT-3, Llama, Mixtral, Mistral, Gemma, and Nemotron.

Auto Configurator Capabilities
------------------------------

Auto Configurator quickly iterates over different model configurations to find the best one, that is, the configuration that minimizes both training time and cost. It offers a range of features to facilitate this, as detailed in the list below.

- **Model size recommendation**: finds the optimal model size if the parameter is not specified.
- **Training time estimation**: estimates model training time based on input parameters.
- **Base configuration generation**: returns a basic model configuration.
- **Hyperparameters recommendation**: finds the optimal set of hyperparameters for training.
- **Optimal configuration recommendation**: runs short training jobs for the candidate configurations, measures their performance, and selects the optimal model configuration.

Model Size Recommendation
-------------------------

If you have not decided what model size you want to train, Auto Configurator can recommend a model size for your use case. If you know the number of GPUs, TFLOPS per GPU, the maximum time to train, and the number of tokens to train for, it can recommend a model size that can be trained with the specified hardware and time constraints.

For example, if you had 20 NVIDIA DGX nodes available (with 80 GB GPU memory) and wanted to train a GPT model for a maximum of 5 days, Auto Configurator would recommend a 5B parameter GPT model.
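A minimal sketch of the kind of compute-budget heuristic this involves (the function name and the constants are illustrative, not the exact code in this PR): training a model of `P` parameters on `T` tokens costs roughly `8 * T * P` FLOPs, so the largest trainable size is the available FLOPs budget divided by `8 * T`.

```python
def recommend_model_size_b(num_gpus: int, tflops_per_gpu: float, max_training_days: float, num_tokens_in_b: float) -> float:
    """Rough heuristic: training takes ~8 FLOPs per parameter per token
    (forward + backward + activation recomputation), so the largest model
    that fits the compute budget is budget / (8 * tokens)."""
    flops_budget = max_training_days * 24 * 3600 * num_gpus * tflops_per_gpu * 1e12
    return round(flops_budget / (8 * num_tokens_in_b * 1e9) / 1e9, 2)

# 20 DGX nodes x 8 GPUs, ~140 achievable TFLOPS per GPU, 5 days, 240B tokens -> ~5B parameters
print(recommend_model_size_b(160, 140, 5, 240))
```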

Training Time Estimation
------------------------

Auto Configurator calculates the estimated training time for your model. It provides a projection of the training time in days, based on the input dataset and parameters you provide.
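Under the same 8-FLOPs-per-parameter-per-token assumption as above, the projection is just the inverse calculation (a sketch, not necessarily the exact formula the tool uses):

```python
def estimate_training_days(model_size_b: float, num_tokens_in_b: float, num_gpus: int, tflops_per_gpu: float) -> float:
    """Invert the compute budget: time = 8 * tokens * params / aggregate throughput."""
    total_flops = 8 * (num_tokens_in_b * 1e9) * (model_size_b * 1e9)
    seconds = total_flops / (num_gpus * tflops_per_gpu * 1e12)
    return seconds / (24 * 3600)

print(round(estimate_training_days(5, 240, 160, 140), 1))  # ~5.0 days
```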

Base Configuration Generation
-----------------------------

When you provide the model size, or Auto Configurator suggests one, it generates a base configuration for the target model. The base configuration is a valid training configuration in NeMo 2.0 format; the throughput optimization, however, is conducted in the next step.
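For example (this mirrors the runner API used in the end-to-end script later in this PR; the paths are placeholders):

```python
import nemo_run as run

from nemo.collections.llm import GPTConfig126M
from nemo.collections.llm.tools.auto_configurator import AutoConfigurator, generate_configs

runner = AutoConfigurator(
    model=run.Config(GPTConfig126M),
    num_nodes=1,
    gpus_per_node=1,
    seq_length=512,
    data_paths="/path/to/dataset",  # placeholder
    path_to_logs="/path/to/logs",   # placeholder
)

# base_cfg is the base configuration in NeMo 2.0 format;
# configs maps candidate names to their generated configurations.
base_cfg, configs = generate_configs(runner)
```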

Hyperparameters Recommendation
------------------------------

After Auto Configurator generates the base configuration, it searches over several critical hyperparameters that have a large impact on training throughput but do not affect model convergence: Tensor Parallelism (TP), Pipeline Parallelism (PP), Context Parallelism (CP), Expert Parallelism (EP), Micro Batch Size (MBS), and Activation Checkpointing Layers (ActCkpt). Auto Configurator will also find an optimal Global Batch Size (GBS) if one is not specified.

Auto Configurator first applies heuristics to identify suitable candidate values for these key parameters, then generates a grid of candidate configurations from them. All candidate configurations are returned in NeMo 2.0 format.
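Because the candidates come back as a name-to-config mapping (reusing the ``runner`` from the sketch above), the grid can be inspected directly; the naming scheme described in the comment is an assumption:

```python
base_cfg, configs = generate_configs(runner)

print(f"Generated {len(configs)} candidate configurations:")
for name in configs:
    # Each name is assumed to encode the candidate's parallelism and batch-size choices.
    print(" ", name)
```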

> [!NOTE]
> Some of the candidate configurations may not work due to high memory usage or other issues.

Once the candidate configurations are generated, you can use the NeMo Framework to launch the most promising ones.

When running the candidates on the cluster, you can limit the runtime and the maximum number of steps of each job with the ``max_minutes_per_run`` and ``max_steps_per_run`` parameters, as sketched below. During this search, each job runs with the number of nodes specified in its configuration file via the ``num_nodes`` parameter. Once all of the jobs have finished running, run ``compare_throughput.py`` to get a ``.csv`` table with performance results for each job that succeeded.
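A sketch of setting these limits on the runner (``max_steps_per_run`` appears in the end-to-end script in this PR; the ``max_minutes_per_run`` value here is illustrative):

```python
runner = AutoConfigurator(
    model=run.Config(GPTConfig126M),
    num_nodes=1,
    gpus_per_node=1,
    seq_length=512,
    data_paths="/path/to/dataset",  # placeholder
    path_to_logs="/path/to/logs",   # placeholder
    max_steps_per_run=25,           # stop each candidate job after 25 training steps
    max_minutes_per_run=20,         # illustrative wall-clock cap per candidate job
)
```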

Optimal Configuration Recommendation
------------------------------------

After all of the candidate jobs are done, Auto Configurator calculates performance metrics for each candidate and generates two ``.csv`` files: one detailing the performance measures of the candidates, and another listing the candidates that failed due to out-of-memory errors.
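This step is a single call, matching the end-to-end script later in this PR (given the ``base_cfg`` and ``runner`` from the sketches above):

```python
from nemo.collections.llm.tools.auto_configurator import get_results

# Writes the performance .csv files for the finished candidate jobs to the logs directory.
get_results(base_cfg, runner, "/path/to/logs")  # placeholder logs path
```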

End-To-End Example
------------------

The following list shows the required input parameters for the Auto Configurator runner:

- ``model``: model configuration in NeMo 2.0 format.
- ``num_nodes``: number of nodes to use for training.
- ``seq_length``: sequence length to use for training.
- ``data_paths``: path(s) to the dataset(s) to use for training.
- ``tokenizer_path``: path to the tokenizer model, if a custom tokenizer is used.

The following list shows the optional parameters for the Auto Configurator runner:

- ``global_batch_size``: global batch size to be used.
- ``tensor_parallel_sizes``: a list, such as ``[1, 2, 4]``.
- ``pipeline_parallel_sizes``: a list, such as ``[1, 2, 4]``.
- ``context_parallel_sizes``: a list, such as ``[1, 2, 4]``.
- ``expert_parallel_sizes``: a list, such as ``[1, 2, 4]``.
- ``micro_batch_sizes``: a list, such as ``[1, 2, 4]``.
- ``min_model_parallel_size``: a value for the minimum desired parallelism.
- ``max_model_parallel_size``: a value for the maximum desired parallelism.

For each of the optional parameters, Auto Configurator will find the optimal value if the parameter is not specified. To view the full list of parameters, please visit [this page](https://github.com/NVIDIA/NeMo/blob/dpykhtar/nemo_autoconf/nemo/collections/llm/tools/auto_configurator/runner.py#L51).

To view an end-to-end example of how to generate candidate configs, train them, and calculate the performance using Auto Configurator with NeMo Framework, please visit [this page](https://github.com/NVIDIA/NeMo/blob/dpykhtar/nemo_autoconf/examples/llm/auto_configurator/auto_config.py).

81 changes: 81 additions & 0 deletions examples/llm/auto_configurator/auto_config.py
@@ -0,0 +1,81 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os

import fiddle as fdl
import nemo_run as run

from nemo.collections.llm import GPTConfig126M
from nemo.collections.llm.tools.auto_configurator import AutoConfigurator, generate_configs, get_results


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_number", type=int, help="Number of the config to run (1-based)")
    parser.add_argument("--logs_dir", type=str, help="Path where to save training logs")
    parser.add_argument("--data_path", type=str, help="Path to the dataset")
    parser.add_argument("--get_results", action="store_true")

    return parser.parse_args()


def train_config(args):
    # GPT-3 126M
    # This example generates 3 configs.
    # The script is expected to be run 3 times, changing the --run_number flag from 1 to 3 for each run.
    # After all configurations are trained, trigger the script with the --get_results flag.
    runner = AutoConfigurator(
        model=run.Config(GPTConfig126M),
        num_nodes=1,
        gpus_per_node=1,
        gpu_memory_gb=40,
        global_batch_size=16,
        seq_length=512,
        tensor_parallel_sizes=[1],
        pipeline_parallel_sizes=[1],
        micro_batch_sizes=[1, 2, 4],
        max_training_days=1,
        max_steps_per_run=25,
        num_tokens_in_b=10,
        vocab_size=51200,
        data_paths=args.data_path,
        path_to_logs=args.logs_dir,
    )

    base_cfg, configs = generate_configs(runner)
    if not args.get_results:
        # Get generated configs
        partials = list(configs.values())
        names = list(configs.keys())

        # Run pre-training for the selected candidate (--run_number is 1-based)
        partial = partials[args.run_number - 1]
        partial.log.dir = os.path.join(args.logs_dir, names[args.run_number - 1])
        pretrain = fdl.build(partial)
        pretrain()
    else:
        # Get Auto Configurator results
        get_results(base_cfg, runner, args.logs_dir)
        print(f"The results were successfully saved to {args.logs_dir}.")


def main():
    args = get_args()
    train_config(args)


if __name__ == '__main__':
    main()
6 changes: 6 additions & 0 deletions nemo/collections/llm/__init__.py
@@ -46,6 +46,12 @@
    GemmaConfig7B,
    GemmaModel,
    GPTConfig,
    GPTConfig5B,
    GPTConfig7B,
    GPTConfig20B,
    GPTConfig40B,
    GPTConfig126M,
    GPTConfig175B,
    GPTModel,
    Llama2Config7B,
    Llama2Config13B,
6 changes: 6 additions & 0 deletions nemo/collections/llm/gpt/model/__init__.py
@@ -15,6 +15,12 @@
from nemo.collections.llm.gpt.model.baichuan import Baichuan2Config, Baichuan2Config7B, Baichuan2Model
from nemo.collections.llm.gpt.model.base import (
    GPTConfig,
    GPTConfig5B,
    GPTConfig7B,
    GPTConfig20B,
    GPTConfig40B,
    GPTConfig126M,
    GPTConfig175B,
    GPTModel,
    MaskedTokenLossReduction,
    gpt_data_step,
54 changes: 54 additions & 0 deletions nemo/collections/llm/gpt/model/base.py
@@ -182,6 +182,60 @@ def configure_model(self, tokenizer) -> "MCoreGPTModel":
)


@dataclass
class GPTConfig126M(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 12
    hidden_size: int = 768
    ffn_hidden_size: int = 3072
    num_attention_heads: int = 12


@dataclass
class GPTConfig5B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 24
    hidden_size: int = 4096
    ffn_hidden_size: int = 16384
    num_attention_heads: int = 32


@dataclass
class GPTConfig7B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 32
    hidden_size: int = 4096
    ffn_hidden_size: int = 10880
    num_attention_heads: int = 32


@dataclass
class GPTConfig20B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 44
    hidden_size: int = 6144
    ffn_hidden_size: int = 24576
    num_attention_heads: int = 48


@dataclass
class GPTConfig40B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 48
    hidden_size: int = 8192
    ffn_hidden_size: int = 32768
    num_attention_heads: int = 64


@dataclass
class GPTConfig175B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 96
    hidden_size: int = 12288
    ffn_hidden_size: int = 49152
    num_attention_heads: int = 96


class GPTModel(L.LightningModule, io.IOMixin, io.ConnectorMixin, fn.FNMixin):
    def __init__(
        self,
2 changes: 2 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/__init__.py
@@ -0,0 +1,2 @@
from nemo.collections.llm.tools.auto_configurator.core.calculate_performance import get_results
from nemo.collections.llm.tools.auto_configurator.runner import AutoConfigurator, generate_configs
13 changes: 13 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/core/__init__.py
@@ -0,0 +1,13 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.