
[FSDP] Ensure that customized non-tensor optimizer state can be saved #99214

Closed

fegin wants to merge 3 commits

Conversation

@fegin (Contributor) commented on Apr 15, 2023

Stack from ghstack (oldest at bottom):

The current logic does not correctly handle all kinds of non-tensor optimizer state. This PR fixes the issue and adds a test.

This PR solves #99079.

Differential Revision: D45021331

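To make "customized non-tensor optimizer state" concrete: an optimizer's `state_dict()["state"]` maps each parameter id to a per-parameter dict, and nothing requires every entry in that dict to be a tensor. The sketch below is illustrative only (the model, keys, and values are assumptions, not code from this PR); it shows that a plain optimizer already round-trips such values, which is the behavior FSDP's optimizer state dict APIs must preserve.

```python
# Minimal runnable sketch (no FSDP/distributed setup needed) showing that
# optimizer state may legally contain non-tensor Python values.
import torch

model = torch.nn.Linear(4, 4)
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
model(torch.randn(2, 4)).sum().backward()
optim.step()

osd = optim.state_dict()
for param_id, state in osd["state"].items():
    # Hypothetical customized, non-tensor values:
    state["my_flag"] = True
    state["my_note"] = "warmup"
    state["my_none"] = None

# A plain optimizer round-trips these fine; the bug was that FSDP's
# optimizer state dict utilities did not handle all such values correctly.
optim.load_state_dict(osd)
assert optim.state_dict()["state"][0]["my_note"] == "warmup"
```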
@pytorch-bot bot commented on Apr 15, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/99214

Note: Links to docs will display an error until the doc builds have completed.

✅ No Failures

As of commit fd264bd:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot bot added the `release notes: distributed (fsdp)` label on Apr 15, 2023
fegin added a commit that referenced this pull request on Apr 15, 2023 (ghstack-source-id: 186204536; Pull Request resolved: #99214)
@awgu (Contributor) left a comment


Thanks for the super fast fix! This looks good to me. I just left a few nits.

```python
osd = FSDP.optim_state_dict(model, optim, optim_state_dict=original_osd)
osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd )
for param_id, state in osd_to_load["state"].items():
    # Addd customized value
```

nit: typo 😄

Suggested change:
```diff
-# Addd customized value
+# Add customized value
```

```python
step()
original_osd = deepcopy(optim.state_dict())
for param_id, state in original_osd["state"].items():
    # Addd customized value
```

Suggested change:
```diff
-# Addd customized value
+# Add customized value
```

```python
    state["value2"] = None

osd = FSDP.optim_state_dict(model, optim, optim_state_dict=original_osd)
osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd )
```

nit: I wonder if lint will complain about this one

Suggested change:
```diff
-osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd )
+osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd)
```
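Taken together, the quoted hunks with the suggested fixes applied sketch the round trip the new test exercises. The fragment below is a reconstruction, not verbatim test code: `step`, `model`, and `optim` come from the surrounding test body (not shown), and the imports are added for context.

```python
from copy import deepcopy

from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumed from the surrounding test: `model` is FSDP-wrapped, `optim` is its
# optimizer, and `step()` runs one forward/backward/optimizer step.
step()
original_osd = deepcopy(optim.state_dict())
for param_id, state in original_osd["state"].items():
    # Add customized value
    state["value2"] = None

# Consolidate to an FSDP optimizer state dict, then convert it back into a
# loadable form; the customized non-tensor entries should survive the trip.
osd = FSDP.optim_state_dict(model, optim, optim_state_dict=original_osd)
osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd)
```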

fegin added a commit that referenced this pull request on Apr 15, 2023 (ghstack-source-id: 186225950; Pull Request resolved: #99214)
fegin added a commit that referenced this pull request on Apr 16, 2023 (ghstack-source-id: 186239769; Pull Request resolved: #99214)
@fegin (Contributor, Author) commented on Apr 17, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator):

Merge failed

Reason: 1 job has failed: trunk / macos-12-py3-arm64 / build

Details for Dev Infra team · Raised by workflow job

@awaelchli (Contributor) left a comment

Thanks! 🎉

@fegin (Contributor, Author) commented on Apr 17, 2023

@pytorchbot merge

@pytorchmergebot (Collaborator):

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.
