
Introduce Padding-Free Plugin to FMS-Acceleration #57

Merged

Conversation

@achew010 (Contributor) commented Jul 29, 2024

Description

This PR introduces support for a new padding-free plugin in FMS-Acceleration, which allows users to speed up fine-tuning by performing attention computation without padding. It can be activated through the sft_trainer CLI by passing the plugin argument --padding_free, e.g. --padding_free huggingface.

This PR currently relies on a fork of fms-hf-tuning to:

  • Access the plugin through an sft_trainer argument
  • Load a pre-tokenized dataset

Note

  • Transformers natively supports padding-free attention from v4.44.0; if the installed Transformers version is lower, the plugin will use an internal implementation instead (a minimal sketch of this version check follows the list).
  • Currently only datasets that are already pre-tokenized are supported.
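
The version gate described above could look roughly like the following. This is a minimal sketch for illustration only, not the plugin's actual code; the helper name is hypothetical, and the only facts assumed are that Transformers exposes `__version__` and that v4.44.0 is the cutoff quoted in this PR.

```python
# Sketch of the fallback decision described above (not the plugin's code).
from packaging import version

import transformers

NATIVE_PADDING_FREE_MIN_VERSION = "4.44.0"  # cutoff quoted in this PR


def use_native_padding_free() -> bool:
    """True if the installed Transformers can handle padding-free attention
    natively, False if the plugin should fall back to its internal version."""
    return version.parse(transformers.__version__) >= version.parse(
        NATIVE_PADDING_FREE_MIN_VERSION
    )
```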

Test

The following comparison is between a padded run and a padding-free run.

  • We observe roughly a 27% reduction in train runtime with the padding-free plugin while processing the same training examples (100 steps at per-device batch size 4).

  • The improvement is dataset-dependent: we see different gains across datasets (see the referenced PR), possibly due to the varying sequence-length distributions of each dataset (longer sequences lead to higher throughput and larger improvements).

Note:
The throughput metrics reported by SFTTrainer (e.g. train_tokens_per_second) include padding tokens when padding=True (see here), so we compare train_runtime instead.
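
For reference, the 27% figure follows directly from the two train_runtime values logged below (79.4 s padded vs 57.8 s padding-free); the snippet simply reproduces that arithmetic and assumes nothing beyond those numbers.

```python
# Reproduces the 27% figure from the train_runtime values logged below.
padded_runtime = 79.4079        # secs, padded run
padding_free_runtime = 57.805   # secs, padding-free run

reduction = (padded_runtime - padding_free_runtime) / padded_runtime
print(f"train_runtime reduction: {reduction:.1%}")  # -> 27.2%
```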

Alpaca

| Implementation | Dataset | Model | Max Steps | Num Devices | Batch Size Per Device | Train Runtime (secs) | % Runtime Reduction |
|---|---|---|---|---|---|---|---|
| Padded | Alpaca | Mistral7B | 100 | 1 | 4 | 79.4 | - |
| Padding-Free | Alpaca | Mistral7B | 100 | 1 | 4 | 57.8 | 27 |
Reproduce

Padded Experiment

export DATASET=alpaca-pretokenized-mistral.json
python -m tuning.sft_trainer --model_name_or_path mistralai/Mistral-7B-v0.1 --packing False --max_seq_len 4096 --learning_rate 2e-5 --torch_dtype float16 --use_flash_attn True --response_template '### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path $DATASET --per_device_train_batch_size 4 --output_dir benchmark_outputs/ilab --skip_memory_metrics False
Result
{'loss': 1.0213, 'grad_norm': 49.03125, 'learning_rate': 2e-05, 'epoch': 0.04}                                                       
{'loss': 1.0554, 'grad_norm': 49.90625, 'learning_rate': 1.7777777777777777e-05, 'epoch': 0.08}                                      
{'loss': 0.9129, 'grad_norm': 41.65625, 'learning_rate': 1.555555555555556e-05, 'epoch': 0.12}                                       
{'loss': 1.1889, 'grad_norm': 71.875, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.16}                                        
{'loss': 1.5754, 'grad_norm': 59.78125, 'learning_rate': 1.1111111111111113e-05, 'epoch': 0.2}                                       
{'loss': 1.0262, 'grad_norm': 42.25, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.24}                                          
{'loss': 1.0137, 'grad_norm': 35.03125, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.28}                                       
{'loss': 1.066, 'grad_norm': 65.6875, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.32}                                         
{'loss': 1.3277, 'grad_norm': 37.4375, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.36}                                        
{'loss': 1.1137, 'grad_norm': 48.28125, 'learning_rate': 0.0, 'epoch': 0.4}                                                          
{'train_runtime': 79.4079, 'train_samples_per_second': 5.037, 'train_steps_per_second': 1.259, 'train_tokens_per_second': 2100.547, 'train_loss': 1.130120143890381, 'init_mem_cpu_alloc_delta': -14388334592, 'init_mem_gpu_alloc_delta': 14483611648, 'init_mem_cpu_peaked_delta': 14483914752, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 665673728, 'train_mem_gpu_alloc_delta': 28984274432, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 28999681024, 'before_init_mem_cpu': 15135694848, 'before_init_mem_gpu': 0, 'epoch': 0.4}

Padding-Free Experiment

Reproduce
export DATASET=alpaca-pretokenized-mistral.json
python -m tuning.sft_trainer --model_name_or_path mistralai/Mistral-7B-v0.1 --padding_free huggingface --packing False --max_seq_len 4096 --learning_rate 2e-5 --torch_dtype float16 --use_flash_attn True --response_template '### Response:' --dataset_text_field output --include_tokens_per_second True --num_train_epochs 1 --gradient_accumulation_steps 1 --gradient_checkpointing True --evaluation_strategy no --save_strategy no --weight_decay 0.01 --warmup_steps 10 --adam_epsilon 1e-4 --lr_scheduler_type linear --logging_strategy steps --logging_steps 10 --max_steps 100 --training_data_path $DATASET --per_device_train_batch_size 4 --output_dir benchmark_outputs/ilab --skip_memory_metrics False
Result
{'loss': 1.7849, 'grad_norm': 165.0, 'learning_rate': 2e-05, 'epoch': 0.0}
{'loss': 1.433, 'grad_norm': 158.25, 'learning_rate': 1.7777777777777777e-05, 'epoch': 0.0}
{'loss': 1.2872, 'grad_norm': 60.90625, 'learning_rate': 1.555555555555556e-05, 'epoch': 0.0}
{'loss': 1.2817, 'grad_norm': 93.625, 'learning_rate': 1.3333333333333333e-05, 'epoch': 0.0}
{'loss': 1.1573, 'grad_norm': 41.65625, 'learning_rate': 1.1111111111111113e-05, 'epoch': 0.0}
{'loss': 1.0525, 'grad_norm': 42.03125, 'learning_rate': 8.888888888888888e-06, 'epoch': 0.0}
{'loss': 1.9564, 'grad_norm': 125.1875, 'learning_rate': 6.666666666666667e-06, 'epoch': 0.01}
{'loss': 1.0277, 'grad_norm': 44.40625, 'learning_rate': 4.444444444444444e-06, 'epoch': 0.01}
{'loss': 0.9661, 'grad_norm': 31.546875, 'learning_rate': 2.222222222222222e-06, 'epoch': 0.01}
{'loss': 0.9497, 'grad_norm': 27.140625, 'learning_rate': 0.0, 'epoch': 0.01}
{'train_runtime': 57.805, 'train_samples_per_second': 6.92, 'train_steps_per_second': 1.73, 'train_tokens_per_second': 2383.876, 'train_loss': 1.2896488857269288, 'init_mem_cpu_alloc_delta': -14387732480, 'init_mem_gpu_alloc_delta': 14483611648, 'init_mem_cpu_peaked_delta': 14483365888, 'init_mem_gpu_peaked_delta': 0, 'train_mem_cpu_alloc_delta': 652550144, 'train_mem_gpu_alloc_delta': 28984245248, 'train_mem_cpu_peaked_delta': 0, 'train_mem_gpu_peaked_delta': 28990169600, 'before_init_mem_cpu': 15090880512, 'before_init_mem_gpu': 0, 'epoch': 0.01}
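
For completeness, a pre-tokenized dataset like the alpaca-pretokenized-mistral.json used in both runs could be produced along the lines below. This is a hypothetical sketch: the record schema (one JSON object per line with input_ids and labels), the Hugging Face dataset id tatsu-lab/alpaca, and the prompt formatting are assumptions for illustration, not taken from this PR or the plugin's contract.

```python
# Hypothetical pre-tokenization sketch; the output schema and prompt format
# are assumptions, not the plugin's contract.
import json

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
dataset = load_dataset("tatsu-lab/alpaca", split="train")

with open("alpaca-pretokenized-mistral.json", "w") as f:
    for example in dataset:
        # Mirror the '### Response:' template passed to sft_trainer above.
        text = (
            f"{example['instruction']}\n{example['input']}\n"
            f"### Response: {example['output']}"
        )
        ids = tokenizer(text, truncation=True, max_length=4096)["input_ids"]
        f.write(json.dumps({"input_ids": ids, "labels": ids}) + "\n")
```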

@achew010 achew010 changed the title Refactor/ilab plugin Introduce Padding-Free Plugin to FMS-Acceleration Jul 29, 2024
@achew010 achew010 marked this pull request as ready for review July 29, 2024 09:46
@achew010 achew010 requested a review from fabianlim as a code owner July 29, 2024 09:46
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 3f08e09 to decc009 Compare July 29, 2024 10:21
@fabianlim (Contributor) commented Jul 29, 2024

Make sure to go through this checklist: https://github.com/foundation-model-stack/fms-acceleration/tree/main/plugins/framework#adding-new-plugins

For the benches, maybe we can think about how to make a separate set from the current set, since this plugin is completely separate from the other plugins, so that we do not have to rerun all the benches every time. This will require some changes to the benchmarking. Maybe one simple solution is to just have a different scenarios-ilab.yaml for

@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 71321a1 to 3238801 Compare August 1, 2024 06:12
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 915ba17 to bff3128 Compare August 1, 2024 08:21
achew010 and others added 9 commits August 1, 2024 09:35
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
@achew010 achew010 force-pushed the refactor/ilab-plugin branch from 66f9cc2 to c9e355a Compare August 1, 2024 09:36
@fabianlim fabianlim merged commit a6f6ef0 into foundation-model-stack:main Aug 1, 2024
6 checks passed
fabianlim added a commit that referenced this pull request Aug 2, 2024
* edits to readme

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

* Apply suggestions from code review

Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com>
Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

* more readme changes

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>

---------

Signed-off-by: 1000960000 user <aaron.chew1@ibm.com>
Co-authored-by: Yu Chin Fabian Lim <fabianlim@users.noreply.github.com>