update configs #1954
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1954
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 0fd16d1 with merge base 24d3579. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Major nit, since this must've been very annoying to put together already, but for flags like activation checkpointing/offloading it would be nice to say `# True reduces memory but reduces speed`.
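For illustration, the suggested comment style might look like this in one of the configs (a sketch only; both flags exist in these configs, but the exact wording and placement are the reviewer's suggestion, not the merged diff):

```yaml
# Memory management
enable_activation_checkpointing: True  # True reduces memory but reduces speed
enable_activation_offloading: True     # True reduces memory but reduces speed
```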
@@ -55,20 +55,20 @@ shuffle: True
epochs: 1
max_steps_per_epoch: null
batch_size: 2
gradient_accumulation_steps: 1
gradient_accumulation_steps: 1  # Use to increase virtual batch size
optimizer:
  _component_: bitsandbytes.optim.PagedAdamW
  lr: 2e-5
optimizer_in_bwd: True
Do you need a comment that this requires grad accum = 1? A lot of configs don't have this comment.
Good catch. I will add the comment.
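Presumably something along these lines (a sketch; the final comment wording wasn't settled in this thread):

```yaml
optimizer_in_bwd: True  # True saves memory; requires gradient_accumulation_steps: 1
gradient_accumulation_steps: 1
```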
@@ -100,7 +102,7 @@ log_peak_memory_stats: False
# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: False
enable_activation_checkpointing: False  # True reduces memory
Does KD support optimizer in bwd, activation offloading?
No :/
I don't know if it could. The answer is probably yes and we just didn't add it.
Yeah I believe it should be able to
@@ -7,7 +7,6 @@
# tune download meta-llama/Meta-Llama-3.1-8B-Instruct --output-dir /tmp/Meta-Llama-3.1-8B-Instruct --ignore-patterns "original/consolidated.00.pth"
#
# You get better results using KD if the teacher model has already been fine-tuned on the target dataset:
packed: False  # Set to true for great speed ups
wut
# Training env
device: cuda

# Memory management
enable_activation_checkpointing: True
enable_activation_checkpointing: True  # True reduces memory
custom_sharded_layers: ['decoder.tok_embeddings']
A comment explaining this would be nice.
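Perhaps something like this (the comment text is illustrative, not the merged wording):

```yaml
# Shard these layers independently of their parent module;
# useful for very large layers such as the vocab embedding
custom_sharded_layers: ['decoder.tok_embeddings']
```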
@@ -179,3 +180,28 @@ metric_logger:
log_dir: ${output_dir}
log_every_n_steps: 1
We don't currently have a profiler implemented for the PPO recipe. I'll be adding it soon, so I can update the config when I do.
@@ -47,6 +47,7 @@ save_adapter_weights_only: False
# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.stack_exchange_paired_dataset
  packed: False  # True increases speed
This should be removed
@@ -46,6 +46,7 @@ save_adapter_weights_only: False
# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.stack_exchange_paired_dataset
  packed: False  # True increases speed
Suggested change:
packed: False  # True increases speed
@@ -47,6 +47,7 @@ save_adapter_weights_only: False
# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.stack_exchange_paired_dataset
  packed: False  # True increases speed
Suggested change:
packed: False  # True increases speed
@@ -33,6 +33,7 @@ tokenizer:
# Dataset
dataset:
  _component_: torchtune.datasets.text_completion_dataset
  packed: False  # True increases speed
Suggested change:
packed: False  # True increases speed
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 1
  warmup_steps: 8
Suggested change:
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False
  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs
  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True
  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False
  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 1
  warmup_steps: 8
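For reference, these YAML keys correspond to the standard `torch.profiler` API roughly as follows (a minimal sketch using plain `torch.profiler`, not the internals of `setup_torch_profiler`; values taken from the config above):

```python
from torch.profiler import ProfilerActivity, profile, schedule

# wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],  # cpu: True, cuda: True
    schedule=schedule(wait=1, warmup=8, active=2, repeat=1),
    profile_memory=False,
    with_stack=False,
    record_shapes=True,
    with_flops=False,
)
```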
# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 1
  warmup_steps: 8
  active_steps: 2
  num_cycles: 1
The DPO recipe also doesn't support a profiler; we can raise an issue to implement it.
Suggested change:
# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False
  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs
  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True
  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False
  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 1
  warmup_steps: 8
  active_steps: 2
  num_cycles: 1
# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False

  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs

  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True

  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False

  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 1
  warmup_steps: 8
  active_steps: 2
  num_cycles: 1
Suggested change:
# Profiler (disabled)
profiler:
  _component_: torchtune.training.setup_torch_profiler
  enabled: False
  # Output directory of trace artifacts
  output_dir: ${output_dir}/profiling_outputs
  # `torch.profiler.ProfilerActivity` types to trace
  cpu: True
  cuda: True
  # trace options passed to `torch.profiler.profile`
  profile_memory: False
  with_stack: False
  record_shapes: True
  with_flops: False
  # `torch.profiler.schedule` options:
  # wait_steps -> wait, warmup_steps -> warmup, active_steps -> active, num_cycles -> repeat
  wait_steps: 1
  warmup_steps: 8
  active_steps: 2
  num_cycles: 1
Force-pushed from 67ba093 to c24c62a (Compare)
@ebsmothers mind taking a look at the test changes?
@@ -59,7 +59,7 @@ shuffle: True
epochs: 1
max_steps_per_epoch: null
batch_size: 2
gradient_accumulation_steps: 16
gradient_accumulation_steps: 8  # Use to increase virtual batch size
Sorry to be a pain, but I feel like "effective batch size" is the more common term?
:D
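For context, the standard definition (a general note, not text from this PR): effective batch size = batch_size × gradient_accumulation_steps × number of devices. With the values in this hunk, on a single device:

```yaml
batch_size: 2
gradient_accumulation_steps: 8  # optimizer sees an effective batch of 2 * 8 = 16
```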
lora_attn_modules: ['q_proj', 'v_proj']
apply_lora_to_mlp: False
lora_attn_modules: ['q_proj', 'v_proj', 'output_proj']
apply_lora_to_mlp: True
apply_lora_to_output: False
I know it's not from your PR, but I thought Qwen 2.5 has tied word embeddings for smaller model sizes. In that case we should not even be exposing apply_lora_to_output here?
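For anyone following along, a minimal illustration of why tied embeddings make `apply_lora_to_output` moot (plain PyTorch with toy sizes, not torchtune's or Qwen 2.5's actual classes):

```python
import torch.nn as nn

embed = nn.Embedding(1000, 64)             # token embedding table
output = nn.Linear(64, 1000, bias=False)   # output projection
output.weight = embed.weight               # weight tying: one shared parameter

# With tying there is no independent output matrix for a LoRA adapter to
# attach to, so exposing the flag for these model sizes is misleading.
assert output.weight is embed.weight
```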
@@ -73,6 +73,8 @@ def test_training_state_on_resume(
tune run lora_dpo_single_device \
  --config llama2/7B_lora_dpo_single_device \
  output_dir={tmpdir} \
  model.lora_attn_modules=['q_proj','v_proj'] \
I think the better way to do this would be to just modify the source-of-truth test model definition here:
torchtune/tests/recipes/utils.py, lines 205 to 227 (at 3f15030):
"llama2_lora": lora_llama2_test_config( | |
lora_attn_modules=["q_proj", "k_proj", "v_proj", "output_proj"], | |
apply_lora_to_mlp=False, | |
apply_lora_to_output=False, | |
lora_rank=8, | |
lora_alpha=16, | |
), | |
"llama2_dora": lora_llama2_test_config( | |
lora_attn_modules=["q_proj", "k_proj", "v_proj", "output_proj"], | |
apply_lora_to_mlp=False, | |
apply_lora_to_output=False, | |
lora_rank=8, | |
lora_alpha=16, | |
use_dora=True, | |
), | |
"llama2_qlora": lora_llama2_test_config( | |
lora_attn_modules=["q_proj", "k_proj", "v_proj", "output_proj"], | |
apply_lora_to_mlp=True, | |
apply_lora_to_output=False, | |
lora_rank=8, | |
lora_alpha=16, | |
quantize_base=True, | |
), |
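Concretely, the suggestion would amount to something like this (a hypothetical edit, assuming the DPO test resolves to the `llama2_lora` entry; not the patch that actually landed):

```python
# tests/recipes/utils.py -- hypothetical edit, not the merged change
"llama2_lora": lora_llama2_test_config(
    lora_attn_modules=["q_proj", "v_proj"],  # match the updated recipe defaults
    apply_lora_to_mlp=False,
    apply_lora_to_output=False,
    lora_rank=8,
    lora_alpha=16,
),
```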
Co-authored-by: ebsmothers <ebs@meta.com>
Co-authored-by: Felipe Mello <felipemello@fb.com>
Co-authored-by: ebsmothers <ebs@meta.com>