[Feature] Support finetune Deepseek v2 (InternLM#663)
* support deepseek v2

* fix dispatch

* refactor deepseek v2

* fix lint

* fix bugs

* fix bugs

* delete useless codes

* refactor deepseek config

* rewrite DeepseekV2PreTrainedModel.from_pretrained

* revert sft.py to main

* delete useless codes

* add deepseek v2 config

* add deepseek readme

* add HFCheckpointHook

* optimize mixtral moe

* fix bugs

* delete useless codes

* delete evalchathook

* fix bugs

* fix bugs

* add moe SUPPORT_MODELS and fix HFCheckpointHook

* add moe SUPPORT_MODELS and fix HFCheckpointHook

* fix bugs

* refactor modeling_deepseek

* update deepseek readme

* support deepseek v2 lite

* fix bugs
HIT-cwh authored Jun 13, 2024
1 parent 83de829 commit 35fdf40
Showing 21 changed files with 5,518 additions and 2 deletions.
1 change: 1 addition & 0 deletions .pre-commit-config-zh-cn.yaml
@@ -4,6 +4,7 @@ repos:
    rev: 5.0.4
    hooks:
      - id: flake8
        args: ["--exclude=xtuner/model/transformers_models/*"]
  - repo: https://gitee.com/openmmlab/mirrors-isort
    rev: 5.11.5
    hooks:
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -4,6 +4,7 @@ repos:
    rev: 5.0.4
    hooks:
      - id: flake8
        args: ["--exclude=xtuner/model/transformers_models/*"]
  - repo: https://github.com/PyCQA/isort
    rev: 5.11.5
    hooks:
59 changes: 59 additions & 0 deletions xtuner/configs/deepseek/README.md
@@ -0,0 +1,59 @@
# DeepSeek V2

## Install

```bash
# Git clone the latest xtuner
git clone https://github.com/InternLM/xtuner.git

# Install the latest xtuner
cd xtuner
pip install -e '.[all]'

# DeepSeek V2 requires flash-attn
pip install flash-attn

# Install the latest transformers
pip install -U transformers
```

## Full Parameter Fine-tune

Full-parameter fine-tuning of DeepSeek V2 236B requires at least 64 A100-80G GPUs. The fully fine-tuned model is saved to `${WORK_DIRS}/hf_model` by `HFCheckpointHook`.
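
As a rough sanity check on the 64-GPU figure, the sketch below estimates the memory taken by model states alone. It assumes the common mixed-precision Adam accounting of about 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), sharded evenly across all GPUs as with ZeRO-3. These byte counts are assumptions for illustration, not measurements from XTuner, and activations, communication buffers, and fragmentation are ignored.

```python
# Back-of-envelope estimate; every constant below is an assumption.
PARAMS = 236e9         # DeepSeek V2 parameter count (236B)
BYTES_PER_PARAM = 16   # assumed: 2 (fp16 weights) + 2 (fp16 grads) + 12 (fp32 Adam states)
GPU_MEMORY_GB = 80     # A100-80G, treated here as 80e9 bytes
NUM_GPUS = 64

# Total memory for weights, gradients, and optimizer states.
model_state_gb = PARAMS * BYTES_PER_PARAM / 1e9
# Share held by each GPU when the states are sharded evenly.
per_gpu_gb = model_state_gb / NUM_GPUS
# What remains on each GPU for activations and buffers.
headroom_gb = GPU_MEMORY_GB - per_gpu_gb

print(f"total model states: {model_state_gb:.0f} GB")
print(f"per GPU across {NUM_GPUS} GPUs: {per_gpu_gb:.1f} GB")
print(f"headroom per GPU: {headroom_gb:.1f} GB")
```

Under these assumptions, model states occupy roughly three quarters of each 80 GB GPU, which is consistent with 64 GPUs being close to the minimum.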

### slurm

Note: `$PARTITION` is the name of the Slurm partition to submit the job to.

```bash
srun -p $PARTITION --job-name=mixtral --nodes=8 --gres=gpu:8 --ntasks-per-node=8 xtuner train deepseek_v2_chat_full_alpaca_e3 --deepspeed deepspeed_zero3 --launcher slurm
```

### torchrun

Note: `$NODE_0_ADDR` is the IP address of node 0.

```bash
# execute on node 0
NPROC_PER_NODE=8 NNODES=8 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train deepseek_v2_chat_full_alpaca_e3 --deepspeed deepspeed_zero3 --launcher pytorch

# execute on node 1
NPROC_PER_NODE=8 NNODES=8 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train deepseek_v2_chat_full_alpaca_e3 --deepspeed deepspeed_zero3 --launcher pytorch

# execute on nodes 2, 3, ..., 7 (only NODE_RANK changes)
```
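
The per-node commands above differ only in `NODE_RANK`, so they can be generated rather than typed by hand. The sketch below does that; `torchrun_commands` is a hypothetical helper written for this README, not part of XTuner, and its defaults simply mirror the values used above (8 nodes with 8 processes each, i.e. 64 processes in total).

```python
from typing import List


def torchrun_commands(config: str,
                      nnodes: int = 8,
                      nproc_per_node: int = 8,
                      port: int = 29600,
                      addr: str = "$NODE_0_ADDR") -> List[str]:
    """Return one launch command per node; only NODE_RANK differs."""
    return [
        f"NPROC_PER_NODE={nproc_per_node} NNODES={nnodes} PORT={port} "
        f"ADDR={addr} NODE_RANK={rank} xtuner train {config} "
        "--deepspeed deepspeed_zero3 --launcher pytorch"
        for rank in range(nnodes)
    ]


commands = torchrun_commands("deepseek_v2_chat_full_alpaca_e3")
for command in commands:
    print(command)
```

Each printed line is meant to be run on the node whose rank it names, with `$NODE_0_ADDR` set in that node's shell.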

### Speed

128 * A100 80G:

| Model | Sequence Length | Use Varlen Attn | Sequence Parallel World Size | Tokens per Second |
| :--------------------: | :-------------: | :-------------: | :--------------------------: | :---------------: |
| deepseek v2 hf | 8k | False | 1 | 60 |
| **deepseek v2 XTuner** | **8k** | **False** | **1** | **120 (2x)** |
| deepseek v2 hf | 8k | True | 1 | 60 |
| **deepseek v2 XTuner** | **8k** | **True** | **1** | **130 (2.2x)** |
| deepseek v2 hf | 16k | False | 1 | OOM |
| **deepseek v2 XTuner** | **16k** | **False** | **1** | **148** |
| deepseek v2 hf | 16k | True | 1 | 95 |
| **deepseek v2 XTuner** | **16k** | **True** | **1** | **180 (1.9x)** |
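
The speedup factors in the table are the XTuner throughput divided by the Hugging Face throughput for the same setting. The sketch below recomputes them from the table's numbers; the variable and function names are illustrative only, and an OOM baseline is represented as `None`, so no ratio is reported for it.

```python
# Tokens per second copied from the table above; None marks an OOM run.
results = [
    # (sequence length, use varlen attn, HF, XTuner)
    ("8k", False, 60, 120),
    ("8k", True, 60, 130),
    ("16k", False, None, 148),
    ("16k", True, 95, 180),
]


def speedup(hf_tps, xtuner_tps):
    """Ratio of XTuner to HF throughput, or None without a baseline."""
    if hf_tps is None:
        return None
    return round(xtuner_tps / hf_tps, 1)


speedups = {(seq_len, varlen): speedup(hf_tps, xtuner_tps)
            for seq_len, varlen, hf_tps, xtuner_tps in results}

for (seq_len, varlen), ratio in speedups.items():
    label = "n/a (HF baseline OOM)" if ratio is None else f"{ratio}x"
    print(f"seq_len={seq_len} varlen={varlen}: {label}")
```
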
@@ -0,0 +1,198 @@
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from transformers import AutoTokenizer

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, HFCheckpointHook,
                                 ThroughputHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.model.transformers_models.deepseek_v2 import DeepseekV2ForCausalLM
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE

#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/DeepSeek-V2-Chat'
use_varlen_attn = False

# Data
data_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.deepseek_v2
max_length = 2048
pack_to_max_length = True

# parallel
sequence_parallel_size = 1

# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1 # bs per device 1 * acc 1 * 128 gpus = 128 total bs
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03

# Save
save_steps = 50
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Saving the optimizer states of DeepSeek V2 236B requires a lot of
# storage space. It is recommended to set `save_optimizer` to False
# (training can then no longer be resumed from the checkpoint).
save_optimizer = True

# Evaluate the generation performance during the training
evaluation_freq = 25
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
    '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]

#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        # XTuner's `DeepseekV2ForCausalLM` only supports full fine-tuning.
        # Please use `AutoModelForCausalLM` for LoRA or QLoRA fine-tuning.
        type=DeepseekV2ForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        moe_implementation='shard',
        expert_in_one_shard=10,
        trust_remote_code=True))

#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path=data_path),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=alpaca_map_fn,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

sampler = SequenceParallelSampler \
    if sequence_parallel_size > 1 else DefaultSampler

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=train_dataset,
    sampler=dict(type=sampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=0,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: log dataset info, report throughput, and save the model in
# Hugging Face format.
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(type=ThroughputHook),
    dict(type=HFCheckpointHook)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every iteration.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False, window_size=1)