Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove SharedDDP as it is deprecated #25702

Merged
merged 8 commits into from
Oct 6, 2023

Conversation

ji-huazhong
Copy link
Contributor

@ji-huazhong ji-huazhong commented Aug 24, 2023

What does this PR do?

As mentioned previously(see), fairscale's ShardedDDP is deprecated, and PyTorch FSDP is the recommended method for scaling to large NN models. Now it's time to say goodbye to this library👋.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@muellerz Good day. Could you please review this PR? Thanks😄

@ji-huazhong ji-huazhong marked this pull request as draft August 24, 2023 01:55
@ji-huazhong ji-huazhong changed the title remove SharedDDP as it is drepracated remove SharedDDP as it is deprecated Aug 24, 2023
@ji-huazhong ji-huazhong marked this pull request as ready for review August 24, 2023 03:15
@ji-huazhong
Copy link
Contributor Author

ji-huazhong commented Aug 24, 2023

@sgugger sorry for bothering you, but would you mind taking a look at this PR?

@sgugger
Copy link
Collaborator

sgugger commented Aug 24, 2023

cc @muellerzr and @pacman100 who will take over the Trainer.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ji-huazhong ji-huazhong force-pushed the drop_sharde_ddp branch 4 times, most recently from b797639 to a56008a Compare August 27, 2023 09:27
@ji-huazhong
Copy link
Contributor Author

Rebase my commits to master HEAD to fix the merge conflict.
BTW, Is this PR still under reviewing? Any review suggestions? Please let me know if there is anything else that needs to be done.

Copy link
Contributor

@pacman100 pacman100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @statelesshz for removing the FairScale support as it is deprecated. The changes LGTM except for the TPU-related bits which Zach can review. Left a comment.

src/transformers/trainer.py Show resolved Hide resolved
src/transformers/trainer.py Show resolved Hide resolved
src/transformers/trainer.py Show resolved Hide resolved
@ji-huazhong ji-huazhong force-pushed the drop_sharde_ddp branch 2 times, most recently from 2447a3b to 959cd5d Compare September 1, 2023 14:17
Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This is looking great! Just a small note in terms of gradient scaling. We need to either keep this for now in this PR and merge, or coordinate a PR in accelerate with the right logic before this merges. I don't particularly have a leaning one way or another :)

src/transformers/trainer.py Outdated Show resolved Hide resolved
@ji-huazhong ji-huazhong closed this Sep 1, 2023
@ji-huazhong ji-huazhong reopened this Sep 1, 2023
@ji-huazhong
Copy link
Contributor Author

@muellerzr Could you please take a second look at this PR?
Some modifications were made based on the code review comments.

Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This all looks great to me, do you have access to a multi-gpu system by chance to run the sharded DDP yourself/test? Probably also good to have @pacman100 give it one more look as well :)

@ji-huazhong
Copy link
Contributor Author

I've tested this PR using 4xA100-80G on FastChat with the following scripts

CUDA_VISIBLE_DEVICES=4,5,6,7  torchrun --nproc_per_node=4 --master_port=20001 fastchat/train/train_mem.py \
    --model_name_or_path lmsys/vicuna-7b-v1.5  \
    --data_path data/dummy_conversation.json \
    --bf16 True \
    --output_dir output_vicuna \
    --num_train_epochs 1 \
    --max_steps 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard offload auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing False \
    --lazy_preprocess True

@ji-huazhong
Copy link
Contributor Author

@pacman100 Could you please take a look once again :-) I think it's ready to be merged.

@ji-huazhong
Copy link
Contributor Author

ji-huazhong commented Sep 7, 2023

Resolve merge conflicts by rebasing to main branch

@ji-huazhong
Copy link
Contributor Author

Rebasing my commits to master HEAD and resolving merge conflicts

@ji-huazhong
Copy link
Contributor Author

Hi there. This PR is approved and the tests are green :D
cc @muellerzr and @pacman100

@ji-huazhong
Copy link
Contributor Author

This PR is approved and the tests are green. @muellerzr Could you help to merge it?

@LysandreJik
Copy link
Member

Thank you @statelesshz!

@LysandreJik
Copy link
Member

Let me rebase quickly and merge if tests are green

Copy link
Member

@LysandreJik LysandreJik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, LGTM! Thanks @statelesshz

@LysandreJik LysandreJik merged commit 27597fe into huggingface:main Oct 6, 2023
@ji-huazhong ji-huazhong deleted the drop_sharde_ddp branch October 7, 2023 07:32
helboukkouri pushed a commit to helboukkouri/transformers that referenced this pull request Oct 16, 2023
* remove SharedDDP as it was drepracated

* apply review suggestion

* make style

* Oops,forgot to remove the compute_loss context manager in Seq2SeqTrainer.

* remove the unnecessary conditional statement

* keep the logic of IPEX

* clean code

* mix precision setup & make fixup

---------

Co-authored-by: statelesshz <jihuazhong1@huawei.com>
lenglaender added a commit to lenglaender/adapters that referenced this pull request Nov 7, 2023
blbadger pushed a commit to blbadger/transformers that referenced this pull request Nov 8, 2023
* remove SharedDDP as it was drepracated

* apply review suggestion

* make style

* Oops,forgot to remove the compute_loss context manager in Seq2SeqTrainer.

* remove the unnecessary conditional statement

* keep the logic of IPEX

* clean code

* mix precision setup & make fixup

---------

Co-authored-by: statelesshz <jihuazhong1@huawei.com>
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request Nov 18, 2023
* remove SharedDDP as it was drepracated

* apply review suggestion

* make style

* Oops,forgot to remove the compute_loss context manager in Seq2SeqTrainer.

* remove the unnecessary conditional statement

* keep the logic of IPEX

* clean code

* mix precision setup & make fixup

---------

Co-authored-by: statelesshz <jihuazhong1@huawei.com>
XuehaoSun pushed a commit to intel/intel-extension-for-transformers that referenced this pull request Jan 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants