add auto configurator to NeMo #10270

Merged
merged 66 commits into main from dpykhtar/autoconf on Sep 7, 2024
Changes from 4 commits
Commits (66)
cd3bb0b
add base configs
dimapihtar Aug 27, 2024
28d3c02
add auto configurator functionality
dimapihtar Aug 27, 2024
cf50b14
Apply isort and black reformatting
dimapihtar Aug 27, 2024
e89ed25
add runner
dimapihtar Aug 27, 2024
0a2a6a0
add end-to-end example for auto configurator
dimapihtar Aug 27, 2024
b0d8478
add unit tests for auto configurator
dimapihtar Aug 27, 2024
8189de9
add GPT configs
dimapihtar Aug 27, 2024
a551658
add GPT configs
dimapihtar Aug 27, 2024
a28f77b
Apply isort and black reformatting
dimapihtar Aug 27, 2024
35522ab
switch to dataclass
dimapihtar Aug 27, 2024
399385b
Apply isort and black reformatting
dimapihtar Aug 27, 2024
b616b41
switch to dataclass
dimapihtar Aug 27, 2024
80054d7
Apply isort and black reformatting
dimapihtar Aug 27, 2024
227a738
fix dataclasses usage
dimapihtar Aug 27, 2024
d0acbca
Apply isort and black reformatting
dimapihtar Aug 27, 2024
9a26476
remove unused imports
dimapihtar Aug 28, 2024
1315031
remove extra function
dimapihtar Aug 28, 2024
1aafc20
fix docstring style
dimapihtar Aug 28, 2024
bda0100
Apply isort and black reformatting
dimapihtar Aug 28, 2024
6d5305e
take Config object as input for model
dimapihtar Aug 28, 2024
bb86c39
Apply isort and black reformatting
dimapihtar Aug 28, 2024
a2099af
add nemotron support
dimapihtar Aug 28, 2024
86694e6
Apply isort and black reformatting
dimapihtar Aug 28, 2024
0b896b7
remove search_config.py
dimapihtar Sep 2, 2024
2d062b0
Apply isort and black reformatting
dimapihtar Sep 2, 2024
1e30118
move configs creation to Basic class
dimapihtar Sep 3, 2024
1f8dde6
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 3, 2024
14b9549
Apply isort and black reformatting
dimapihtar Sep 3, 2024
e1ccec1
move to common basic class
dimapihtar Sep 3, 2024
c641b7d
Apply isort and black reformatting
dimapihtar Sep 3, 2024
71b0420
rename main config
dimapihtar Sep 3, 2024
4103009
remove base configs for models
dimapihtar Sep 3, 2024
e3793ad
Apply isort and black reformatting
dimapihtar Sep 3, 2024
5815586
Apply isort and black reformatting
artbataev Sep 3, 2024
4d03be0
change auto conf functionality
dimapihtar Sep 3, 2024
f812e2b
Apply isort and black reformatting
dimapihtar Sep 3, 2024
97f9e61
fix docstring
dimapihtar Sep 4, 2024
d2fed7a
Apply isort and black reformatting
dimapihtar Sep 4, 2024
eb9bae5
remove unused imports
dimapihtar Sep 4, 2024
4606ef3
add changes
dimapihtar Sep 4, 2024
a4e8128
remove activations_checkpoint_num_layers
dimapihtar Sep 4, 2024
b853a83
remove gbs from config
dimapihtar Sep 4, 2024
7040056
fix logs
dimapihtar Sep 4, 2024
ae744ae
Apply isort and black reformatting
dimapihtar Sep 4, 2024
eda32ce
fix performance calculation
dimapihtar Sep 4, 2024
ae46957
fix end-to-end example
dimapihtar Sep 5, 2024
1fe46ed
Apply isort and black reformatting
dimapihtar Sep 5, 2024
25b8e3f
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 5, 2024
38082d9
fix model config
dimapihtar Sep 5, 2024
0ce1672
Apply isort and black reformatting
dimapihtar Sep 5, 2024
41c9f29
minor changes
dimapihtar Sep 5, 2024
3fdcc83
minor changes
dimapihtar Sep 5, 2024
3fbcc16
Apply isort and black reformatting
dimapihtar Sep 5, 2024
010e0de
fix unit tests
dimapihtar Sep 6, 2024
7fd82cf
Apply isort and black reformatting
dimapihtar Sep 6, 2024
3a345e8
Merge branch 'main' into dpykhtar/autoconf
dimapihtar Sep 6, 2024
83e537d
add README
dimapihtar Sep 6, 2024
1aa3636
fix README
dimapihtar Sep 6, 2024
649eb44
fix README
dimapihtar Sep 6, 2024
c642281
fix readme
dimapihtar Sep 6, 2024
7309603
fix readme
dimapihtar Sep 6, 2024
df1dcb8
remove extra arg
dimapihtar Sep 6, 2024
25a148a
remove unused imports
dimapihtar Sep 6, 2024
f006372
add nemo-run installation
dimapihtar Sep 6, 2024
9dda193
fix unit tests
dimapihtar Sep 7, 2024
c4c5ecb
fix unit tests
dimapihtar Sep 7, 2024
2 changes: 2 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/__init__.py
@@ -0,0 +1,2 @@
from nemo.collections.llm.tools.auto_configurator.core.calculate_performance import get_results
from nemo.collections.llm.tools.auto_configurator.runner import AutoConfigurator
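The package `__init__.py` above exposes the two public entry points, `AutoConfigurator` (the runner) and `get_results` (performance reporting). Their call signatures are defined elsewhere in this PR and are not part of this diff, so the sketch below shows only the import path; anything beyond that would be an assumption.

```python
# Only the import path comes from the __init__.py above; how AutoConfigurator and
# get_results are invoked is defined in other files of this PR and not shown here.
from nemo.collections.llm.tools.auto_configurator import AutoConfigurator, get_results
```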
20 changes: 20 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/base_configs/__init__.py
@@ -0,0 +1,20 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from nemo.collections.llm.tools.auto_configurator.base_configs.custom import custom
from nemo.collections.llm.tools.auto_configurator.base_configs.gemma import Gemma
from nemo.collections.llm.tools.auto_configurator.base_configs.gpt import GPT
from nemo.collections.llm.tools.auto_configurator.base_configs.llama import Llama
from nemo.collections.llm.tools.auto_configurator.base_configs.mistral import Mistral
from nemo.collections.llm.tools.auto_configurator.base_configs.mixtral import Mixtral
144 changes: 144 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/base_configs/basic.py
@@ -0,0 +1,144 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
from megatron.core.optimizer import OptimizerConfig

from nemo.collections.llm.utils import Config


class Basic:
def __init__(
self,
name: str = None,
version: int = None,
size: int = None,
measure: str = "B",
cfg: dict = {},
):
"""
:param str name: model name.
:param int version: model version.
:param int size: model size.
:param str measure: measure of model size. "M" if the model size is in millions, "B" if in billions.
:param dict cfg: auto configurator runner config.
"""

self.name = name
self.version = version
self.size = size
self.measure = measure
self.cfg = cfg
self.num_nodes = cfg.get("num_nodes")
self.num_gpus = cfg.get("num_gpus")
self.max_steps = cfg.get("max_steps_per_run")
self.seq_length = cfg.get("seq_length")
self.global_batch_size = cfg.get("global_batch_size")
self.tokenizer_path = cfg.get("tokenizer_path")
self.data_paths = cfg.get("data_paths")
self.nemo_run = cfg.get("nemo_run")
self.max_minutes_per_run = cfg.get("max_minutes_per_run")

def model_config(self):
"""Function that returns model config."""

return None

def get_optim_config(self) -> OptimizerConfig:
"""
Function that returns optimizer config.
:return: optim config.
:rtype: OptimizerConfig.
"""
optim_params = {
"optimizer": "adam",
"lr": 1e-4,
"min_lr": 1e-5,
"use_distributed_optimizer": True,
"bf16": True,
"adam_beta1": 0.9,
"adam_beta2": 0.95,
"overlap_grad_reduce": False,
"overlap_param_gather": True,
}

if self.nemo_run:
optim_config = Config(
OptimizerConfig,
**optim_params,
)
else:
optim_config = OptimizerConfig(
**optim_params,
)

return optim_config

def get_trainer_config(self) -> dict:
"""
Function that returns config for PTL trainer.
:return: trainer config.
:rtype: dict.
"""

trainer_config = {
"accelerator": "gpu",
"enable_checkpointing": False,
"use_distributed_sampler": False,
"max_epochs": None,
"log_every_n_steps": 1,
"limit_val_batches": 1,
"limit_test_batches": 1,
"accumulate_grad_batches": 1,
"gradient_clip_val": 1.0,
"num_nodes": self.num_nodes,
"devices": self.num_gpus,
"max_steps": self.max_steps,
"val_check_interval": self.max_steps,
}

return trainer_config

def get_data_config(self) -> dict:
"""
Function that returns dataset config.
:return: data config.
:rtype: dict.
"""

data_config = {
"paths": self.data_paths,
"seq_length": self.seq_length,
"global_batch_size": self.global_batch_size,
"num_workers": 2,
# "split": "99990,8,2",
"index_mapping_dir": None,
}

return data_config

def get_run_config(self) -> dict:
"""
Function that returns config for cluster job.
:return: cluster job config.
:rtype: dict.
"""

run_config = {
"name": f"{self.name}_{self.size}{self.measure}",
"results_dir": None,
"time_limit": f"0-00:{self.max_minutes_per_run}:00",
}

return run_config
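A minimal sketch (not taken from the PR) of how a base config built on `Basic` consumes the runner cfg dict. The keys below are exactly those read in `Basic.__init__` above; the paths are placeholders.

```python
# Illustrative only: the cfg keys mirror the cfg.get(...) calls in Basic.__init__;
# the tokenizer and data paths are hypothetical placeholders.
from nemo.collections.llm.tools.auto_configurator.base_configs.basic import Basic

cfg = {
    "num_nodes": 1,
    "num_gpus": 8,
    "max_steps_per_run": 50,
    "max_minutes_per_run": 30,
    "seq_length": 2048,
    "global_batch_size": 256,
    "tokenizer_path": "/path/to/tokenizer.model",          # placeholder
    "data_paths": ["/path/to/my_dataset_text_document"],   # placeholder
    "nemo_run": False,
}

base = Basic(name="GPT", version=3, size=5, measure="B", cfg=cfg)
print(base.get_trainer_config())  # PTL trainer kwargs built from cfg
print(base.get_data_config())     # dataset kwargs: paths, seq_length, global_batch_size
print(base.get_run_config())      # {'name': 'GPT_5B', 'results_dir': None, 'time_limit': '0-00:30:00'}
```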
42 changes: 42 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/base_configs/custom.py
@@ -0,0 +1,42 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import copy
import os

from nemo.collections.llm.tools.auto_configurator import base_configs

from .basic import Basic


def custom(name, cfg):
"""
Function that returns a custom model class.
:param str name: model name.
:param dict cfg: auto configurator runner config.
:return: Custom class object.
"""
basic_class = getattr(base_configs, name)

class Custom(basic_class):
def __init__(self, name, cfg):
"""
:param str name: model name.
:param dict cfg: auto configurator runner config.
"""

super().__init__(name=name, cfg=cfg)

custom_class = Custom(name, cfg)

return custom_class
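A hedged example of what `custom()` does with one of the classes exported from `base_configs/__init__.py` above: it resolves the class by name via `getattr` and returns a `Custom` instance that inherits from it. The cfg dict here is a partial, illustrative runner config.

```python
# Sketch (assumptions noted): "GPT" is exported from base_configs/__init__.py, so
# getattr(base_configs, "GPT") resolves it and Custom subclasses it dynamically.
from nemo.collections.llm.tools.auto_configurator.base_configs.custom import custom

cfg = {"num_nodes": 1, "num_gpus": 8, "nemo_run": False}  # partial runner cfg for illustration
model = custom("GPT", cfg)
print(type(model).__mro__[1].__name__)  # -> "GPT"
```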
64 changes: 64 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/base_configs/gemma.py
@@ -0,0 +1,64 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import copy
import os
import torch

from nemo.collections import llm
from nemo.collections.llm.utils import Config

from .basic import Basic


class Gemma(Basic):
def __init__(
self,
name: str = "Gemma",
version: int = None,
size: int = 2,
measure: str = "B",
cfg: dict = {},
):
"""
:param str name: model name.
:param int version: model version.
:param int size: model size.
:param str measure: measure of model size. "M" if the model size is in millions, "B" if in billions.
:param dict cfg: auto configurator runner config.
"""

super().__init__(name=name, version=version, size=size, measure=measure, cfg=cfg)
self.config_name = f"{self.name}Config{self.size}{self.measure}"

def get_model_config(self) -> Config:
"""
Function that returns model config.
:return: model config.
:rtype: Config.
"""

model_class = getattr(llm, self.config_name)
kwargs = self.cfg.get("model_args", {})

if self.nemo_run:
model_config = Config(model_class, **kwargs)
else:
model_config = model_class(**kwargs)

model_config.global_batch_size = self.global_batch_size
model_config.seq_length = self.seq_length
model_config.pipeline_dtype = torch.bfloat16

return model_config
62 changes: 62 additions & 0 deletions nemo/collections/llm/tools/auto_configurator/base_configs/gpt.py
@@ -0,0 +1,62 @@
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import copy
import os

from nemo.collections import llm
from nemo.collections.llm.utils import Config

from .basic import Basic


class GPT(Basic):
def __init__(
self,
name: str = "GPT",
version: int = 3,
size: int = 5,
measure: str = "B",
cfg: dict = {},
):
"""
:param str name: model name.
:param int version: model version.
:param int size: model size.
:param str measure: measure of model size. "M" if the model size is in millions, "B" if in billions.
:param dict cfg: auto configurator runner config.
"""

super().__init__(name=name, version=version, size=size, measure=measure, cfg=cfg)
self.config_name = f"{self.name}Config{self.size}{self.measure}"

def get_model_config(self) -> Config:
"""
Function that returns model config.
:return: model config.
:rtype: Config.
"""

model_class = getattr(llm, self.config_name)
kwargs = self.cfg.get("model_args", {})

if self.nemo_run:
model_config = Config(model_class, **kwargs)
else:
model_config = model_class(**kwargs)

model_config.global_batch_size = self.global_batch_size
model_config.seq_length = self.seq_length

return model_config
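A minimal sketch of how the GPT base config resolves its model class from the defaults above: `f"{name}Config{size}{measure}"` becomes `GPTConfig5B`, which is assumed here to be exposed by `nemo.collections.llm` (it is not shown in this diff).

```python
# Illustrative only: assumes nemo.collections.llm exposes a GPTConfig5B config class,
# which is what f"{name}Config{size}{measure}" resolves to for the GPT defaults.
from nemo.collections.llm.tools.auto_configurator.base_configs import GPT

cfg = {
    "seq_length": 2048,
    "global_batch_size": 256,
    "nemo_run": False,   # return a plain config object instead of a nemo_run Config
    "model_args": {},    # extra kwargs forwarded to the GPTConfig* constructor
}

gpt = GPT(cfg=cfg)                     # defaults: name="GPT", version=3, size=5, measure="B"
model_config = gpt.get_model_config()  # instantiates llm.GPTConfig5B(**model_args)
print(model_config.seq_length, model_config.global_batch_size)
```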