
Add finetuning strategies for DeepSpeed #1377

Conversation

@ar90n (Contributor) commented Jul 3, 2022

What does this PR do?

This PR provides workarounds for using DeepSpeed during finetuning. DeepSpeed does not fully work with pytorch-lightning's finetuning callbacks because their parameter loading and storing fail under DeepSpeed, so this PR adds finetuning strategies that omit parameter loading and storing.

Fixes #1249
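
To illustrate the approach, here is a minimal sketch (illustrative only, not the exact diff of this PR): a DeepSpeed variant of the base finetuning callback whose parameter-storing hook is a no-op. It assumes pytorch-lightning's BaseFinetuning exposes a _store hook and that Flash provides a FlashBaseFinetuning base class.

from flash.core.finetuning import FlashBaseFinetuning


class FlashDeepSpeedFinetuning(FlashBaseFinetuning):
    """Finetuning callback variant that works with DeepSpeed.

    DeepSpeed shards parameters across processes, so the base callback's
    parameter loading and storing fail; this variant simply skips them.
    """

    def _store(self, pl_module, epoch, optimizers, opt_idx) -> None:
        # Intentionally a no-op: do not record parameter metadata under DeepSpeed.
        pass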

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests? [not needed for typos/docs]
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@codecov (bot) commented Jul 3, 2022

Codecov Report

Merging #1377 (acf3ae3) into master (0253d71) will decrease coverage by 0.02%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master    #1377      +/-   ##
==========================================
- Coverage   92.90%   92.88%   -0.02%     
==========================================
  Files         286      286              
  Lines       12874    12891      +17     
==========================================
+ Hits        11960    11974      +14     
- Misses        914      917       +3     
Flag        Coverage Δ
unittests   92.88% <77.77%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                            Coverage Δ
flash/core/finetuning.py                  88.23% <76.47%> (-2.47%) ⬇️
flash/core/utilities/imports.py           91.47% <100.00%> (+0.04%) ⬆️
flash/text/question_answering/model.py    93.87% <0.00%> (-0.69%) ⬇️
flash/core/serve/dag/task.py              97.88% <0.00%> (+1.05%) ⬆️


@ar90n marked this pull request as ready for review July 3, 2022 11:14
@ar90n (Contributor, Author) commented Jul 3, 2022

I don't know how to fix the Lightning-AI.lightning-flash (Examples) jobs. Could you give me some help?

@ethanwharris added this to the 0.8.0 milestone Jul 11, 2022
@krshrimali self-requested a review as a code owner July 22, 2022 07:28
@krshrimali (Contributor) left a comment

Hi, @ar90n - Thank you so much for working on this PR. I really appreciate that you added the documentation and relevant tests. 🎉

LGTM (just added a couple of minor suggestions)!

Regarding the Example failures, please don't worry about them; the CI has been fixed now, so it should be all good.

Resolved (outdated) review suggestions on docs/source/general/finetuning.rst and flash/core/finetuning.py.
Comment on lines +250 to +266
# Imports added here for a self-contained excerpt; the registry lives in
# flash/core/finetuning.py (see the impacted files above).
import pytest

from flash.core.finetuning import _FINETUNING_STRATEGIES_REGISTRY


@pytest.mark.parametrize(
    "strategy_key, strategy_metadata",
    [
        ("no_freeze", None),
        ("freeze", None),
        ("freeze_unfreeze", 2),
        ("unfreeze_milestones", ((5, 10), 15)),
    ],
)
def test_deepspeed_finetuning_strategy_key(strategy_key, strategy_metadata):
    # Each DeepSpeed variant is registered under the "<strategy>_deepspeed" key.
    deepspeed_strategy_key = f"{strategy_key}_deepspeed"

    # Both keys should resolve to the same underlying strategy value.
    strategy = _FINETUNING_STRATEGIES_REGISTRY.get(key=strategy_key)(strategy_metadata=strategy_metadata).strategy
    deepspeed_strategy = _FINETUNING_STRATEGIES_REGISTRY.get(key=deepspeed_strategy_key)(
        strategy_metadata=strategy_metadata
    ).strategy
    assert strategy == deepspeed_strategy
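
For reference, a hedged usage sketch of how one of these "*_deepspeed" keys would be selected by a user (not part of this PR's diff; the image task and data paths are illustrative placeholders):

import flash
from flash.image import ImageClassificationData, ImageClassifier

datamodule = ImageClassificationData.from_folders(
    train_folder="data/train",  # placeholder path
    val_folder="data/val",  # placeholder path
    batch_size=4,
)
model = ImageClassifier(backbone="resnet18", num_classes=datamodule.num_classes)

# Run under DeepSpeed and pick the matching DeepSpeed-aware finetuning strategy.
trainer = flash.Trainer(max_epochs=3, accelerator="gpu", devices=2, strategy="deepspeed")
trainer.finetune(model, datamodule=datamodule, strategy="freeze_deepspeed")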
@krshrimali (Contributor) commented
Really nice tests! Thanks for adding them.

ar90n and others added 3 commits July 22, 2022 20:21
Co-authored-by: Kushashwa Ravi Shrimali <kushashwaravishrimali@gmail.com>
Co-authored-by: Kushashwa Ravi Shrimali <kushashwaravishrimali@gmail.com>
@ar90n (Contributor, Author) commented Jul 22, 2022

Hi @krshrimali,
Thanks for your review and suggestions! They were very helpful, since my English isn't strong. I'm glad this PR was approved.

@ar90n (Contributor, Author) commented Jul 22, 2022

I checked the reason for the Lightning-AI.lightning-flash (Examples) failures and found the following:

RuntimeError: CUDA error: the launch timed out and was terminated
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I'm not familiar with these errors. It seems the test process was terminated by a timeout.
The test passes in my local environment. Could someone help me solve this issue?
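
(A hedged debugging sketch, not from this PR: as the error message suggests, setting CUDA_LAUNCH_BLOCKING before CUDA initializes makes kernel launches synchronous, so the stack trace points at the real failing call.)

import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch  # imported after setting the env var so it takes effect

x = torch.randn(8, device="cuda")
print(x.sum())  # a faulty kernel now raises at the actual launch site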

@krshrimali (Contributor) commented


Hi, @ar90n - Please don't worry about it. I'll have to check on my personal GPU; sometimes these failures can be flaky (because resources weren't available, or for other reasons). I'll merge it once I'm done testing, but this is good to go! 🎉

@krshrimali (Contributor) commented

The test passes locally on my GPU; let's merge this and monitor the CI. If an alarm is raised, I'll attempt to fix it. Thanks, @ar90n, for your hard work and patience with this PR. 🎉

@krshrimali enabled auto-merge (squash) July 27, 2022 04:02
@krshrimali (Contributor) commented

Just added the CHANGELOG entry; let's wait for the CI and push it ASAP. <3

@krshrimali (Contributor) commented

@ar90n - FYI, it took us some time to fix the CI; sorry for that. @ethanwharris is currently OOO this week, so when he is back, he'll help merge this. 🎉 Thank you for your contribution and patience.

@mergify (bot) removed the "has conflicts" label Aug 31, 2022
@krshrimali merged commit 0e9fdc0 into Lightning-Universe:master Aug 31, 2022
Successfully merging this pull request may close: Flash DeepSpeedPlugin error (#1249).