
add auto configurator to NeMo #10270

Merged · 66 commits · Sep 7, 2024
cd3bb0b
add base configs
dimapihtar Aug 27, 2024
28d3c02
add auto configurator functionality
dimapihtar Aug 27, 2024
cf50b14
Apply isort and black reformatting
dimapihtar Aug 27, 2024
e89ed25
add runner
dimapihtar Aug 27, 2024
0a2a6a0
add end-to-end example for auto configurator
dimapihtar Aug 27, 2024
b0d8478
add unit tests for auto configurator
dimapihtar Aug 27, 2024
8189de9
add GPT configs
dimapihtar Aug 27, 2024
a551658
add GPT configs
dimapihtar Aug 27, 2024
a28f77b
Apply isort and black reformatting
dimapihtar Aug 27, 2024
35522ab
switch to dataclass
dimapihtar Aug 27, 2024
399385b
Apply isort and black reformatting
dimapihtar Aug 27, 2024
b616b41
switch to dataclass
dimapihtar Aug 27, 2024
80054d7
Apply isort and black reformatting
dimapihtar Aug 27, 2024
227a738
fix dataclasses usage
dimapihtar Aug 27, 2024
d0acbca
Apply isort and black reformatting
dimapihtar Aug 27, 2024
9a26476
remove unused imports
dimapihtar Aug 28, 2024
1315031
remove extra function
dimapihtar Aug 28, 2024
1aafc20
fix docstring style
dimapihtar Aug 28, 2024
bda0100
Apply isort and black reformatting
dimapihtar Aug 28, 2024
6d5305e
take Config object as input for model
dimapihtar Aug 28, 2024
bb86c39
Apply isort and black reformatting
dimapihtar Aug 28, 2024
a2099af
add nemotron support
dimapihtar Aug 28, 2024
86694e6
Apply isort and black reformatting
dimapihtar Aug 28, 2024
0b896b7
remove search_config.py
dimapihtar Sep 2, 2024
2d062b0
Apply isort and black reformatting
dimapihtar Sep 2, 2024
1e30118
move configs creation to Basic class
dimapihtar Sep 3, 2024
1f8dde6
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 3, 2024
14b9549
Apply isort and black reformatting
dimapihtar Sep 3, 2024
e1ccec1
move to common basic class
dimapihtar Sep 3, 2024
c641b7d
Apply isort and black reformatting
dimapihtar Sep 3, 2024
71b0420
rename main config
dimapihtar Sep 3, 2024
4103009
remove base configs for models
dimapihtar Sep 3, 2024
e3793ad
Apply isort and black reformatting
dimapihtar Sep 3, 2024
5815586
Apply isort and black reformatting
artbataev Sep 3, 2024
4d03be0
change auto conf functionality
dimapihtar Sep 3, 2024
f812e2b
Apply isort and black reformatting
dimapihtar Sep 3, 2024
97f9e61
fix docstring
dimapihtar Sep 4, 2024
d2fed7a
Apply isort and black reformatting
dimapihtar Sep 4, 2024
eb9bae5
remove unused imports
dimapihtar Sep 4, 2024
4606ef3
add changes
dimapihtar Sep 4, 2024
a4e8128
remove activations_checkpoint_num_layers
dimapihtar Sep 4, 2024
b853a83
remove gbs from config
dimapihtar Sep 4, 2024
7040056
fix logs
dimapihtar Sep 4, 2024
ae744ae
Apply isort and black reformatting
dimapihtar Sep 4, 2024
eda32ce
fix performance calculation
dimapihtar Sep 4, 2024
ae46957
fix end-to-end example
dimapihtar Sep 5, 2024
1fe46ed
Apply isort and black reformatting
dimapihtar Sep 5, 2024
25b8e3f
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 5, 2024
38082d9
fix model config
dimapihtar Sep 5, 2024
0ce1672
Apply isort and black reformatting
dimapihtar Sep 5, 2024
41c9f29
minor changes
dimapihtar Sep 5, 2024
3fdcc83
minor changes
dimapihtar Sep 5, 2024
3fbcc16
Apply isort and black reformatting
dimapihtar Sep 5, 2024
010e0de
fix unit tests
dimapihtar Sep 6, 2024
7fd82cf
Apply isort and black reformatting
dimapihtar Sep 6, 2024
3a345e8
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 6, 2024
83e537d
add README
dimapihtar Sep 6, 2024
1aa3636
fix README
dimapihtar Sep 6, 2024
649eb44
fix README
dimapihtar Sep 6, 2024
c642281
fix readme
dimapihtar Sep 6, 2024
7309603
fix readme
dimapihtar Sep 6, 2024
df1dcb8
remove extra arg
dimapihtar Sep 6, 2024
25a148a
remove unused imports
dimapihtar Sep 6, 2024
f006372
add nemo-run installation
dimapihtar Sep 6, 2024
9dda193
fix unit tests
dimapihtar Sep 7, 2024
c4c5ecb
fix unit tests
dimapihtar Sep 7, 2024
4 changes: 4 additions & 0 deletions Dockerfile.ci
@@ -31,6 +31,10 @@ EOF

WORKDIR /workspace

RUN pip install hatchling # needed to install nemo-run
ARG NEMO_RUN_TAG=34259bd3e752fef94045a9a019e4aaf62bd11ce2
RUN pip install nemo_run@git+https://github.com/NVIDIA/NeMo-Run.git@${NEMO_RUN_TAG}

# Install NeMo requirements
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG MODELOPT_VERSION=0.15.0
85 changes: 85 additions & 0 deletions examples/llm/auto_configurator/README.md
@@ -0,0 +1,85 @@
> [!IMPORTANT]
> This is an early version of Auto Configurator; the code base may change as it is integrated into the CLI.

Use Auto Configurator to Find the Optimal Configuration
-------------------------------------------------------

Auto Configurator searches for the hyperparameters (HPs) that achieve the highest training throughput for Large Language Models (LLMs) trained with the NeMo Framework.

> [!NOTE]
> Auto Configurator currently supports only GPT-based models: GPT-3, Llama, Mixtral, Mistral, Gemma, and Nemotron.

Auto Configurator Capabilities
------------------------------

Auto Configurator quickly iterates over different model configurations to find the best one, that is, the configuration that minimizes both training time and cost. It offers a range of features to facilitate this, as detailed in the list below.

- **Model size recommendation**: finds the optimal model size if the parameter is not specified.
- **Training time estimation**: estimates model training time based on input parameters.
- **Base configuration generation**: returns a basic model configuration.
- **Hyperparameters recommendation**: finds the optimal set of hyperparameters for training.
- **Optimal configuration recommendation**: runs short training jobs for the candidate configurations, measures their performance, and selects the optimal model configuration.

Model Size Recommendation
-------------------------

If you have not decided what model size you want to train, Auto Configurator can recommend a model size for your use case. If you know the number of GPUs, TFLOPS per GPU, the maximum time to train, and the number of tokens to train for, it can recommend a model size that can be trained with the specified hardware and time constraints.

For example, if you had 20 NVIDIA DGX nodes available (with 80 GB GPU memory) and wanted to train a GPT model for a maximum of 5 days, Auto Configurator would recommend a 5B parameter GPT model.
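A minimal sketch of the kind of compute-budget heuristic this involves (the function name and the constants are illustrative, not the exact code in this PR): training a model of `P` parameters on `T` tokens costs roughly `8 * T * P` FLOPs, so the largest trainable size is the available FLOPs budget divided by `8 * T`.

```python
def recommend_model_size_b(num_gpus: int, tflops_per_gpu: float, max_training_days: float, num_tokens_in_b: float) -> float:
    """Rough heuristic: training takes ~8 FLOPs per parameter per token
    (forward + backward + activation recomputation), so the largest model
    that fits the compute budget is budget / (8 * tokens)."""
    flops_budget = max_training_days * 24 * 3600 * num_gpus * tflops_per_gpu * 1e12
    return round(flops_budget / (8 * num_tokens_in_b * 1e9) / 1e9, 2)

# 20 DGX nodes x 8 GPUs, ~140 achievable TFLOPS per GPU, 5 days, 240B tokens -> ~5B parameters
print(recommend_model_size_b(160, 140, 5, 240))
```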

Training Time Estimation
------------------------

Auto Configurator calculates the estimated training time for your model. It provides a projection of the training time in days, based on the input dataset and parameters you provide.
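Under the same 8-FLOPs-per-parameter-per-token assumption as above, the projection is just the inverse calculation (a sketch, not necessarily the exact formula the tool uses):

```python
def estimate_training_days(model_size_b: float, num_tokens_in_b: float, num_gpus: int, tflops_per_gpu: float) -> float:
    """Invert the compute budget: time = 8 * tokens * params / aggregate throughput."""
    total_flops = 8 * (num_tokens_in_b * 1e9) * (model_size_b * 1e9)
    seconds = total_flops / (num_gpus * tflops_per_gpu * 1e12)
    return seconds / (24 * 3600)

print(round(estimate_training_days(5, 240, 160, 140), 1))  # ~5.0 days
```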

Base Configuration Generation
-----------------------------

When you provide the model size, or Auto Configurator suggests one, it generates a base configuration for the target model. The base configuration is a valid training configuration in NeMo 2.0 format; the throughput optimization, however, is conducted in the next step.
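For example (this mirrors the runner API used in the end-to-end script later in this PR; the paths are placeholders):

```python
import nemo_run as run

from nemo.collections.llm import GPTConfig126M
from nemo.collections.llm.tools.auto_configurator import AutoConfigurator, generate_configs

runner = AutoConfigurator(
    model=run.Config(GPTConfig126M),
    num_nodes=1,
    gpus_per_node=1,
    seq_length=512,
    data_paths="/path/to/dataset",  # placeholder
    path_to_logs="/path/to/logs",   # placeholder
)

# base_cfg is the base configuration in NeMo 2.0 format;
# configs maps candidate names to their generated configurations.
base_cfg, configs = generate_configs(runner)
```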

Hyperparameters Recommendation
------------------------------

After Auto Configurator generates the base configuration, it searches over several critical hyperparameters that have a large impact on training throughput but do not affect model convergence: Tensor Parallelism (TP), Pipeline Parallelism (PP), Context Parallelism (CP), Expert Parallelism (EP), Micro Batch Size (MBS), and Activation Checkpointing Layers (ActCkpt). Auto Configurator will also find an optimal Global Batch Size (GBS) if one is not specified.

Auto Configurator first applies heuristics to identify suitable candidate values for these key parameters, then generates a grid of candidate configurations from them. All candidate configurations are returned in NeMo 2.0 format.
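Because the candidates come back as a name-to-config mapping (reusing the ``runner`` from the sketch above), the grid can be inspected directly; the naming scheme described in the comment is an assumption:

```python
base_cfg, configs = generate_configs(runner)

print(f"Generated {len(configs)} candidate configurations:")
for name in configs:
    # Each name is assumed to encode the candidate's parallelism and batch-size choices.
    print(" ", name)
```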

> [!NOTE]
> Some of the candidate configurations may not work due to high memory usage or other issues.

Once the candidate configurations are generated, you can use the NeMo Framework to launch the most promising ones.

When running the candidates on the cluster, you can limit the runtime and the maximum number of steps of each job with the ``max_minutes_per_run`` and ``max_steps_per_run`` parameters, as sketched below. During this search, each job runs with the number of nodes specified in its configuration file via the ``num_nodes`` parameter. Once all of the jobs have finished running, run ``compare_throughput.py`` to get a ``.csv`` table with performance results for each job that succeeded.
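A sketch of setting these limits on the runner (``max_steps_per_run`` appears in the end-to-end script in this PR; the ``max_minutes_per_run`` value here is illustrative):

```python
runner = AutoConfigurator(
    model=run.Config(GPTConfig126M),
    num_nodes=1,
    gpus_per_node=1,
    seq_length=512,
    data_paths="/path/to/dataset",  # placeholder
    path_to_logs="/path/to/logs",   # placeholder
    max_steps_per_run=25,           # stop each candidate job after 25 training steps
    max_minutes_per_run=20,         # illustrative wall-clock cap per candidate job
)
```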

Optimal Configuration Recommendation
------------------------------------

After all of the candidate jobs are done, Auto Configurator calculates performance metrics for each candidate and generates two ``.csv`` files: one detailing the performance measures of the candidates, and another listing the candidates that failed due to out-of-memory errors.
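This step is a single call, matching the end-to-end script later in this PR (given the ``base_cfg`` and ``runner`` from the sketches above):

```python
from nemo.collections.llm.tools.auto_configurator import get_results

# Writes the performance .csv files for the finished candidate jobs to the logs directory.
get_results(base_cfg, runner, "/path/to/logs")  # placeholder logs path
```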

End-To-End Example
------------------

The following list shows the required input parameters for the Auto Configurator runner:

- ``model``: model configuration in NeMo 2.0 format.
- ``num_nodes``: number of nodes to use for training.
- ``seq_length``: sequence length to use for training.
- ``data_paths``: path(s) to the dataset(s) to use for training.
- ``tokenizer_path``: path to the tokenizer model, if a custom tokenizer is used.

The following list shows the optional parameters for the Auto Configurator runner:

- ``global_batch_size``: global batch size to be used.
- ``tensor_parallel_sizes``: a list, such as ``[1, 2, 4]``.
- ``pipeline_parallel_sizes``: a list, such as ``[1, 2, 4]``.
- ``context_parallel_sizes``: a list, such as ``[1, 2, 4]``.
- ``expert_parallel_sizes``: a list, such as ``[1, 2, 4]``.
- ``micro_batch_sizes``: a list, such as ``[1, 2, 4]``.
- ``min_model_parallel_size``: a value for the minimum desired parallelism.
- ``max_model_parallel_size``: a value for the maximum desired parallelism.

For each of the optional parameters, Auto Configurator will find the optimal value if the parameter is not specified. To view the full list of parameters, please visit [this page](https://github.com/NVIDIA/NeMo/blob/dpykhtar/nemo_autoconf/nemo/collections/llm/tools/auto_configurator/runner.py#L51).

To view an end-to-end example of how to generate candidate configs, train them, and calculate the performance using Auto Configurator with NeMo Framework, please visit [this page](https://github.com/NVIDIA/NeMo/blob/dpykhtar/nemo_autoconf/examples/llm/auto_configurator/auto_config.py).

81 changes: 81 additions & 0 deletions examples/llm/auto_configurator/auto_config.py
@@ -0,0 +1,81 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os

import fiddle as fdl
import nemo_run as run

from nemo.collections.llm import GPTConfig126M
from nemo.collections.llm.tools.auto_configurator import AutoConfigurator, generate_configs, get_results


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_number", type=int, help="Number of the config to run (1-based)")
    parser.add_argument("--logs_dir", type=str, help="Path where to save training logs")
    parser.add_argument("--data_path", type=str, help="Path to the dataset")
    parser.add_argument("--get_results", action="store_true")

    return parser.parse_args()


def train_config(args):
    # GPT-3 126M
    # This example generates 3 configs.
    # The script is expected to be run 3 times, changing the --run_number flag from 1 to 3 for each run.
    # After all configurations are trained, trigger the script with the --get_results flag.
    runner = AutoConfigurator(
        model=run.Config(GPTConfig126M),
        num_nodes=1,
        gpus_per_node=1,
        gpu_memory_gb=40,
        global_batch_size=16,
        seq_length=512,
        tensor_parallel_sizes=[1],
        pipeline_parallel_sizes=[1],
        micro_batch_sizes=[1, 2, 4],
        max_training_days=1,
        max_steps_per_run=25,
        num_tokens_in_b=10,
        vocab_size=51200,
        data_paths=args.data_path,
        path_to_logs=args.logs_dir,
    )

    base_cfg, configs = generate_configs(runner)
    if not args.get_results:
        # Get generated configs
        partials = list(configs.values())
        names = list(configs.keys())

        # Run pre-training for the selected candidate (--run_number is 1-based)
        partial = partials[args.run_number - 1]
        partial.log.dir = os.path.join(args.logs_dir, names[args.run_number - 1])
        pretrain = fdl.build(partial)
        pretrain()
    else:
        # Get Auto Configurator results
        get_results(base_cfg, runner, args.logs_dir)
        print(f"The results were successfully saved to {args.logs_dir}.")


def main():
    args = get_args()
    train_config(args)


if __name__ == '__main__':
    main()
6 changes: 6 additions & 0 deletions nemo/collections/llm/__init__.py
@@ -46,6 +46,12 @@
    GemmaConfig7B,
    GemmaModel,
    GPTConfig,
    GPTConfig5B,
    GPTConfig7B,
    GPTConfig20B,
    GPTConfig40B,
    GPTConfig126M,
    GPTConfig175B,
    GPTModel,
    Llama2Config7B,
    Llama2Config13B,
6 changes: 6 additions & 0 deletions nemo/collections/llm/gpt/model/__init__.py
@@ -15,6 +15,12 @@
from nemo.collections.llm.gpt.model.baichuan import Baichuan2Config, Baichuan2Config7B, Baichuan2Model
from nemo.collections.llm.gpt.model.base import (
    GPTConfig,
    GPTConfig5B,
    GPTConfig7B,
    GPTConfig20B,
    GPTConfig40B,
    GPTConfig126M,
    GPTConfig175B,
    GPTModel,
    MaskedTokenLossReduction,
    gpt_data_step,
54 changes: 54 additions & 0 deletions nemo/collections/llm/gpt/model/base.py
@@ -182,6 +182,60 @@ def configure_model(self, tokenizer) -> "MCoreGPTModel":
)


@dataclass
class GPTConfig126M(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 12
    hidden_size: int = 768
    ffn_hidden_size: int = 3072
    num_attention_heads: int = 12


@dataclass
class GPTConfig5B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 24
    hidden_size: int = 4096
    ffn_hidden_size: int = 16384
    num_attention_heads: int = 32


@dataclass
class GPTConfig7B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 32
    hidden_size: int = 4096
    ffn_hidden_size: int = 10880
    num_attention_heads: int = 32


@dataclass
class GPTConfig20B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 44
    hidden_size: int = 6144
    ffn_hidden_size: int = 24576
    num_attention_heads: int = 48


@dataclass
class GPTConfig40B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 48
    hidden_size: int = 8192
    ffn_hidden_size: int = 32768
    num_attention_heads: int = 64


@dataclass
class GPTConfig175B(GPTConfig):
    seq_length: int = 2048
    num_layers: int = 96
    hidden_size: int = 12288
    ffn_hidden_size: int = 49152
    num_attention_heads: int = 96


class GPTModel(L.LightningModule, io.IOMixin, io.ConnectorMixin, fn.FNMixin):
    def __init__(
        self,
2 changes: 2 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/__init__.py
@@ -0,0 +1,2 @@
from nemo.collections.llm.tools.auto_configurator.core.calculate_performance import get_results
from nemo.collections.llm.tools.auto_configurator.runner import AutoConfigurator, generate_configs
13 changes: 13 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/core/__init__.py
@@ -0,0 +1,13 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.