
[Distributed] Support index + multi-bin loading #1275

Merged — 1 commit, Oct 7, 2024
Conversation

kwen2501 (Contributor) commented Oct 5, 2024

Some HF checkpoint formats use index + multiple binaries instead of index + multiple safetensors.
Distributed workflows already support the latter; this PR adds support for the former too.
(The reason is that quantized checkpoints cannot yet be saved in safetensors format.)
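For context, both formats share the same index-file layout: a JSON file maps each parameter name to the shard file that holds it. A minimal sketch of index-driven shard loading is below; it is not the PR's actual code, and the file names and the `load_shard` hook are illustrative assumptions (for `.bin` shards one would pass `torch.load`, for `.safetensors` shards `safetensors.torch.load_file`). A fake loader is used in the demo so the sketch runs without torch installed.

```python
# Sketch (assumed, not the PR's implementation): load an HF
# "index + multi-bin" checkpoint by reading the index JSON and
# loading each referenced shard exactly once.
import json
import os
import tempfile

def load_indexed_checkpoint(ckpt_dir, load_shard):
    """Read the index file and merge every referenced shard.

    `load_shard` is a callable taking a file path and returning a
    dict of tensors, e.g. `torch.load` for .bin shards (the case
    this PR adds) or `safetensors.torch.load_file` for safetensors.
    """
    index_path = os.path.join(ckpt_dir, "pytorch_model.bin.index.json")
    with open(index_path) as f:
        index = json.load(f)
    weight_map = index["weight_map"]  # param name -> shard file name
    state_dict = {}
    # Load each distinct shard once, not once per parameter.
    for shard_name in sorted(set(weight_map.values())):
        shard = load_shard(os.path.join(ckpt_dir, shard_name))
        state_dict.update(shard)
    return state_dict

# Tiny demo with a fake loader so the sketch runs without torch.
with tempfile.TemporaryDirectory() as d:
    index = {"metadata": {}, "weight_map": {
        "layer.0.weight": "pytorch_model-00001-of-00002.bin",
        "layer.1.weight": "pytorch_model-00002-of-00002.bin",
    }}
    with open(os.path.join(d, "pytorch_model.bin.index.json"), "w") as f:
        json.dump(index, f)
    # Fake loader: pretend each shard holds one entry keyed by its name.
    fake_loader = lambda path: {os.path.basename(path): "tensor"}
    sd = load_indexed_checkpoint(d, fake_loader)
    print(sorted(sd))
```

In a distributed setting, each rank would consult `weight_map` to open only the shards that contain parameters it owns, rather than materializing the full state dict as the demo does.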

pytorch-bot (bot) commented Oct 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1275

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9ef2171 with merge base 766bee9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 5, 2024
@kwen2501 kwen2501 requested a review from lessw2020 October 5, 2024 04:36
@kwen2501 kwen2501 changed the base branch from main to dtensor_shard October 5, 2024 04:37
mikekg commented Oct 5, 2024

Does this compose with #1255 ?

kwen2501 (Contributor, Author) commented Oct 6, 2024

> Does this compose with #1255 ?

Thanks for the review. I hope so.
Distributed already has support for the "index + multi-safetensors" format (thanks to @lessw2020).
This PR only adds support for the "index + multi-bin" format. (The reason is that quantized checkpoints cannot yet be saved in safetensors format.)

mikekg commented Oct 6, 2024

> Does this compose with #1255 ?
>
> Thanks for the review. I hope so. Distributed already has support for the "index + multi-safetensors" format (thanks to @lessw2020). This PR only adds support for the "index + multi-bin" format. (The reason is that quantized checkpoints cannot yet be saved in safetensors format.)

How much of this can be shared with local-only weight loading? Should distributed weight loading just be a general case of local loading (or vice versa, local just a special case of distributed)? Not suggesting abandoning this PR at all, but something similar to what you've done merging model.py and model_dist.py (yay! Awesome work!!!!) could be a real value here.

lessw2020 (Contributor) left a review


lgtm!
Thanks for adding this!

@kwen2501 kwen2501 merged commit 75f3a35 into dtensor_shard Oct 7, 2024
52 checks passed
Labels: CLA Signed
4 participants