FullyShardedDataParallel wrapped models not being unwrapped, leading to incorrect checkpoints. #13500
I should note this issue occurs when applying the FSDP strategy to an otherwise normal model, not a model that is manually sharded at initialization time.
We are experiencing lower accuracy of models trained with fsdp as opposed to ddp. Do you think that could be the same underlying issue?
See if switching to ddp_sharded significantly improves things. That strategy does not have the same bug but should otherwise be very similar, from my understanding; it's basically the previous version of fsdp.
The symptom to look for is poor-quality checkpoints. I saw loss curves that looked the same or similar during training, but when I continued a run from the last checkpoint, the loss basically started from where it was at step 0 rather than at a later step. Epoch and other state would reload properly, though. ddp_sharded didn't have that behavior with otherwise identical code. Also, manual inspection of results from saved checkpoints looked bad even though the models appeared to be training.
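As an aside for anyone else debugging this, one rough way to inspect a suspect checkpoint directly is to look at the raw `state_dict` on disk. This is illustrative only: the `last.ckpt` path is a placeholder, and the wrapper-prefix naming (e.g. `_fsdp_wrapped_module.`) is an assumption about how FSDP names parameters:

```python
# Rough checkpoint inspection (illustrative; "last.ckpt" is a placeholder).
import torch

ckpt = torch.load("last.ckpt", map_location="cpu")
for name, tensor in ckpt["state_dict"].items():
    # Red flags: keys still carrying a wrapper prefix such as
    # "_fsdp_wrapped_module." (assumed FSDP naming), or weight norms that
    # look like a fresh random init rather than trained values.
    print(f"{name}: norm={tensor.float().norm():.4f}")
```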
Thanks for the information, I'll try ddp_sharded. So far we have definitely seen a big difference between fsdp and ddp; however, I believe this difference was also visible in the training curves, so it might be a different issue.
This solution looks good, hope it fixes the problem! #13502
Did that fix your issue @zlenyk, out of curiosity?
Sorry for the late response, a queue of experiments and so on...
Hey @jstjohn |
🐛 Bug
When training in parallel with the `fsdp` strategy, the saved checkpoints are somehow messed up. When I try to resume training from those, the epoch number is properly resumed, but the loss spikes dramatically, as if the model went back to an initial/random state. When I do the same train/checkpoint/resume loop with `ddp_sharded`, I do not have this issue and the checkpoint resumes with a loss similar to where it left off. I further saw that when I point a model with strategy `fsdp` at a checkpoint saved with `ddp_sharded`, it also resumes with a reasonable loss that is roughly at the previous level. This suggests that `fsdp` loads a checkpoint fine, but there is something wrong with how it saves checkpoints in parallel. Conversely, when I resume using `ddp_sharded` from an `fsdp`-saved checkpoint, the loss is dramatically worse, as if the weights were randomly initialized, further suggesting that the issue is with how weights are saved in `fsdp`. Knowing all of this, I am able to just switch to using `ddp_sharded`, but this seems like a really nasty bug that could cause other people headaches, so I wanted to make sure it was known.

The fix seems to be to make sure to unwrap the `FullyShardedDataParallel` wrapper. One key difference between the `fsdp` strategy implementation and the `ddp_sharded` strategy implementation is that `ddp_sharded` overrides `self.lightning_module` and calls a custom `unwrap_...` function which unwraps the `ShardedDataParallel` layer prior to calling the shared `unwrap_lightning_module(...)` function. `fsdp` does none of this; it defaults to the method implemented in `ParallelStrategy.lightning_module`, which only calls the `unwrap_lightning_module(...)` function.

I am going to open a PR and link it here which makes `unwrap_lightning_module(...)` aware of `FullyShardedDataParallel` (both flavors) as well as `ShardedDataParallel`, so that all of the strategies that use one of those wrappers benefit. Hopefully that will also make this a noticeable piece of code that needs to be modified as new wrappers are added.
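For reference, here is a rough sketch of what an FSDP-aware `unwrap_lightning_module(...)` could look like. This is an illustration rather than the actual patch; the FairScale/native-PyTorch import paths and the internal `_LightningModuleWrapperBase` are assumptions based on Lightning around v1.6:

```python
# Hypothetical sketch of an FSDP-aware unwrap helper; not the actual patch.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as NativeFSDP  # PyTorch >= 1.11
from fairscale.nn import FullyShardedDataParallel as FairscaleFSDP
from fairscale.nn.data_parallel import ShardedDataParallel
from pytorch_lightning.overrides.base import _LightningModuleWrapperBase  # internal, assumption


def unwrap_lightning_module(wrapped_model: nn.Module) -> nn.Module:
    model = wrapped_model
    # Peel off the classic data-parallel wrappers.
    if isinstance(model, (nn.DataParallel, nn.parallel.DistributedDataParallel)):
        model = unwrap_lightning_module(model.module)
    # The missing piece: also peel off both FSDP flavors and ShardedDataParallel.
    if isinstance(model, (NativeFSDP, FairscaleFSDP, ShardedDataParallel)):
        model = unwrap_lightning_module(model.module)
    # Finally unwrap Lightning's own forward-redirection wrapper.
    if isinstance(model, _LightningModuleWrapperBase):
        model = unwrap_lightning_module(model.module)
    return model
```

The point is just that every wrapper in play gets stripped before the bare LightningModule is handed to checkpointing; which exact isinstance checks belong there is what the PR settles.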
To Reproduce

1. Train a model with `--strategy fsdp`. Note the loss at the beginning and make sure it drops. Then resume from the saved checkpoint and note that the loss spikes back to its initial value. (The `fsdp native` strategy, whatever that is called, is also broken. Maybe others.)
2. Repeat with `--strategy ddp_sharded` and note that the loss resumes from where it left off. (A minimal script sketch follows below.)
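Something along these lines should surface the difference. This is a sketch rather than a verified repro; the `BoringModel` import path, strategy names, and `best_model_path` usage are assumptions based on Lightning around v1.6/1.7:

```python
# Sketch of the train/checkpoint/resume loop; assumes 2 GPUs and
# Lightning ~1.6/1.7 (BoringModel path and strategy names are assumptions).
import pytorch_lightning as pl
from pytorch_lightning.demos.boring_classes import BoringModel


def train_then_resume(strategy: str) -> None:
    trainer = pl.Trainer(strategy=strategy, accelerator="gpu", devices=2, max_epochs=1)
    trainer.fit(BoringModel())  # the loss should drop during this first run

    # Resume from the checkpoint written above and watch the first loss values.
    resumed = pl.Trainer(strategy=strategy, accelerator="gpu", devices=2, max_epochs=2)
    resumed.fit(BoringModel(), ckpt_path=trainer.checkpoint_callback.best_model_path)


train_then_resume("fsdp")         # buggy: loss jumps back to its initial value
train_then_resume("ddp_sharded")  # ok: loss continues from where it left off
```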
Expected behavior
Model training continues from where it left off when resuming.
Environment
Additional context
cc @SeanNaren @awaelchli @rohitgr7 @akihironitta