Safetensors #1255

Merged: 3 commits into pytorch:main from gabe-l-hart:Safetensors-1249 on Oct 4, 2024
Conversation

@gabe-l-hart (Contributor)

Description

Closes #1249

This PR implements support for downloading and converting model checkpoints from Hugging Face that use the safetensors format rather than the .pth binary (pickle) format.

Changes

  • Allow the tensor map file to be found under different *.index.json names
  • Allow loading model files with safetensors.torch.load when needed (see the sketch after this list)
  • Allow downloading safetensors files when they exist without pth files, or when the model explicitly prefers them
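For illustration, here is a minimal sketch of what format-aware loading can look like; it is not the PR's actual code, and the helper name load_checkpoint_file is hypothetical:

```python
from pathlib import Path

import torch
from safetensors.torch import load_file


def load_checkpoint_file(path: Path) -> dict:
    """Load one checkpoint file, dispatching on the file extension."""
    if path.suffix == ".safetensors":
        # safetensors files contain raw tensors, no pickled code
        return load_file(str(path))
    # .pth / .bin checkpoints are pickle-based, so torch.load is required
    return torch.load(str(path), map_location="cpu", weights_only=True)
```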

Testing

  • I have tested that the download and load for llama3.1 are unchanged with these changes (no safetensors are downloaded)
  • I have verified (on my WIP branch for Granite Code) that models with only safetensors are not ignored and can be cleanly converted


pytorch-bot bot commented Oct 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1255

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 9db974d with merge base d8c0aaf:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

Hi @gabe-l-hart!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@gabe-l-hart (Contributor Author)

NOTE: The CLA is in process since I'm contributing through the IBM corporate CLA.

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Oct 2, 2024
@Jack-Khuu self-requested a review on October 2, 2024 23:53
@Jack-Khuu added the enhancement (New feature or request) label on Oct 2, 2024
@byjlw (Contributor)

byjlw commented Oct 3, 2024

Thanks @gabe-l-hart, I appreciate you contributing. Code looks good; please rebase when you're ready and I'll approve and merge.
Also, since it looks like you're already adding another model to the list and have tested it, feel free to bring in the definition so others can benefit by downloading it via the CLI.

@gabe-l-hart (Contributor Author)

Thanks @byjlw! I'll get the rebase done shortly. As you noted, this is part of my work to add Granite Code support. I have a single branch pointer for all of that work, but I have also tried to organize the commits so that they can be merged as bite-sized features rather than one big overhaul. Right now, the addition of the configs for Granite Code is at the end of the chain, since the model won't work at all until the other features are present (it also currently only works in the Python runtime). I'm totally open to whatever convention the torchchat team prefers in terms of contribution chunking, so just let me know what you prefer!

@byjlw (Contributor)

byjlw commented Oct 3, 2024

Chunks are definitely better. I'd love to learn more about the overall goals for the Granite Code support.

@gabe-l-hart (Contributor Author)

> I'd love to learn more about the overall goals for the Granite Code support.

Good point, I didn't ever spell this out! Let me open a top-level issue about that support that I can use to track it all.

@gabe-l-hart (Contributor Author)

Top-level Granite Code support issue: #1262

@byjlw (Contributor)

byjlw commented Oct 3, 2024

Actually though, when I tested it by changing the model.json to use safetensors for the 11B base model, the download errored out. It looks like this code works as long as it's not using definitions that use the torchtune format. Do you mind testing this case and resolving the issue?

"meta-llama/Llama-3.2-11B-Vision": {
        "aliases": ["llama3.2-11B-base", "Llama-3.2-11B-Vision-base"],
        "distribution_channel": "HuggingFaceSnapshot",
        "distribution_path": "meta-llama/Llama-3.2-11B-Vision",
        "prefer_safetensors": true
    },
Fetching 19 files: 100%|█████████████████████████████████████████████████████████████████████| 19/19 [1:06:30<00:00, 210.04s/it]
Converting meta-llama/Llama-3.2-11B-Vision to torchtune format...
Traceback (most recent call last):
  File "/Users/byjlw/Documents/source/working/torchchat/torchchat.py", line 85, in <module>
    check_args(args, "generate")
  File "/Users/byjlw/Documents/source/working/torchchat/torchchat/cli/cli.py", line 52, in check_args
    download_and_convert(args.model, args.model_directory, args.hf_token)
  File "/Users/byjlw/Documents/source/working/torchchat/torchchat/cli/download.py", line 123, in download_and_convert
    _download_hf_snapshot(model_config, temp_dir, hf_token)
  File "/Users/byjlw/Documents/source/working/torchchat/torchchat/cli/download.py", line 82, in _download_hf_snapshot
    convert_hf_checkpoint_to_tune( model_dir=artifact_dir, model_name=model_config.name)
  File "/Users/byjlw/Documents/source/working/torchchat/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/byjlw/Documents/source/working/torchchat/torchchat/cli/convert_hf_checkpoint.py", line 184, in convert_hf_checkpoint_to_tune
    raise RuntimeError(f"Could not find {consolidated_pth}")
RuntimeError: Could not find /Users/byjlw/.torchchat/model-cache/downloads/meta-llama/Llama-3.2-11B-Vision/original/consolidated.pth

@gabe-l-hart (Contributor Author)

Ah, good catch! I guess N == 2 is a small sample size. I'll dig into your error and see how far I can get.

@byjlw (Contributor)

byjlw commented Oct 3, 2024

@ebsmothers can also help

@gabe-l-hart (Contributor Author)

Great. I'll report progress or blockers as they come up. @ebsmothers let me know if you dig in and get anywhere!

@ebsmothers

@byjlw, I'm not super familiar with how torchchat handles checkpoint conversion, but if you're switching from the .pth format to the .safetensors format, you will no longer be able to just do the simple move that's happening here. The safetensors format for Llama 3.2 11B Vision distributes the model weights across multiple files (see here), so they will need to be loaded and merged into a single state dict, as we do in torchtune here.
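For context, a rough sketch of that load-and-merge step, assuming all shards sit as *.safetensors files in one directory; the function name is illustrative, and the real torchtune code handles more cases:

```python
from pathlib import Path

from safetensors.torch import load_file


def merge_safetensors_shards(model_dir: Path) -> dict:
    """Merge every *.safetensors shard in model_dir into one state dict."""
    merged: dict = {}
    for shard in sorted(model_dir.glob("*.safetensors")):
        # each shard holds a disjoint subset of the model's tensors
        merged.update(load_file(str(shard)))
    return merged
```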

@gabe-l-hart (Contributor Author)

That makes sense. A similar approach is being taken in convert_hf_checkpoint to load the sharded weights and convert them into the .pth format needed by torchchat. If I'm understanding the logic in convert_hf_checkpoint_to_tune correctly, it looks like it's really just a simpler version of the logic in convert_hf_checkpoint that doesn't require the tensor renaming or permuting. If that's the case, I think the fix should be to hoist out the shard-loading logic into a helper and then only do the post-processing in convert_hf_checkpoint before resaving as model.pth.

I finally have the safetensors downloaded locally, so I'll see how far I can get with this approach.
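Sketched concretely, the proposed refactor could look something like the following; the signatures are simplified stand-ins for illustration, not torchchat's actual API:

```python
from pathlib import Path

import torch
from safetensors.torch import load_file


def load_sharded_state_dict(model_dir: Path) -> dict:
    """Hoisted helper: merge all weight shards, whichever format is present."""
    state_dict: dict = {}
    shards = sorted(model_dir.glob("*.safetensors")) or sorted(model_dir.glob("*.bin"))
    for shard in shards:
        if shard.suffix == ".safetensors":
            state_dict.update(load_file(str(shard)))
        else:
            state_dict.update(torch.load(str(shard), map_location="cpu", weights_only=True))
    return state_dict


def convert_hf_checkpoint(model_dir: Path) -> None:
    state_dict = load_sharded_state_dict(model_dir)
    # torchchat path: rename and permute tensors here before resaving
    torch.save(state_dict, model_dir / "model.pth")


def convert_hf_checkpoint_to_tune(model_dir: Path) -> None:
    # torchtune path: no renaming or permuting needed
    state_dict = load_sharded_state_dict(model_dir)
    torch.save(state_dict, model_dir / "original" / "consolidated.pth")
```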

@gabe-l-hart (Contributor Author)

I have the mechanics of this working for the safetensors weights with Llama-3.2-11B-Vision-Instruct, but it appears that there is a naming difference between the tensors in original/consolidated.pth and the safetensors weights. This is likely similar to the name remapping needed in the standard conversion logic.

Given that, and given that these models already have checkpoints matching the target naming scheme, I think it makes sense to leave the PR as-is for now and not switch prefer_safetensors to true for these models. I could also see an argument for using safetensors all the way through to avoid the known pickle vulnerabilities with .pth, but this PR doesn't address that anyway, since the models are converted to .pth during the conversion process.
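As an aside, the kind of key remapping the standard conversion path performs looks roughly like this; the mapping shown is a single illustrative entry, not the full table in convert_hf_checkpoint:

```python
def remap_keys(state_dict: dict, key_map: dict) -> dict:
    """Rename tensors from one naming scheme to another, leaving unmapped keys alone."""
    return {key_map.get(name, name): tensor for name, tensor in state_dict.items()}


# a single illustrative entry; the real table covers every layer's tensors
EXAMPLE_KEY_MAP = {"model.embed_tokens.weight": "tok_embeddings.weight"}
```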

…r names

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
…h or safetensors

The logic here will prefer pth over safetensors unless the model's config
explicitly states a preference for safetensors over pth. If only one of the
two is found, the download will use whichever is present.

Branch: GraniteCodeSupport

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
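As a small sketch, the preference rule this commit message describes could be expressed as follows; the prefer_safetensors flag matches the model.json snippet earlier in the thread, while the function name is illustrative:

```python
def choose_download_format(has_pth: bool, has_safetensors: bool, prefer_safetensors: bool) -> str:
    """Pick the checkpoint format to download, per the rule in the commit message."""
    if has_pth and has_safetensors:
        return "safetensors" if prefer_safetensors else "pth"
    if has_safetensors:
        return "safetensors"
    if has_pth:
        return "pth"
    raise ValueError("no checkpoint files found in either format")
```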
@byjlw byjlw merged commit 766bee9 into pytorch:main Oct 4, 2024
52 checks passed
@gabe-l-hart gabe-l-hart deleted the Safetensors-1249 branch October 4, 2024 20:02
@@ -41,7 +42,12 @@ def convert_hf_checkpoint(
     print(f"Model config {config.__dict__}")
 
     # Load the json file containing weight mapping
-    model_map_json = model_dir / "pytorch_model.bin.index.json"
+    model_map_json_matches = [Path(m) for m in glob.glob(str(model_dir / "*.index.json"))]
+    assert len(model_map_json_matches) <= 1, "Found multiple weight mapping files"
Contributor
Why is this an error? Thanks!

@gabe-l-hart (Contributor Author) commented Nov 6, 2024
Good catch. See my response over on your PR: #1346 (comment)

@gabe-l-hart mentioned this pull request on Nov 12, 2024
Labels

CLA Signed · enhancement

Development

Successfully merging this pull request may close these issues:

Support Huggingface models from safetensors (#1249)

6 participants