
Improve loading errors and add docs on HF repos #256

Merged: 3 commits merged into main on Sep 27, 2023

Conversation

jonatanklosko (Member):

Also detects if a repository has only safetensors parameters and picks that up automatically.

Comment on lines +513 to +515:

    cond do
      Map.has_key?(repo_files, @pytorch_params_filename) ->
        {@pytorch_params_filename, false}
jonatanklosko (Member Author):

@josevalim we are still defaulting to PyTorch, because I tried a couple repos and noticed that the safetensors parameters don't necessarily match the model architecture. I will investigate further and check the expected behaviour with HF folks.
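
For reference, here is a minimal runnable sketch of the selection logic implied by the description and the excerpts in this review. The concrete filenames, the safetensors condition, and the final clause are assumptions rather than the PR's exact code:

```elixir
# Assumed standard Hugging Face filenames; the actual code refers to them via
# module attributes whose values are not shown in the excerpts.
pytorch_params_filename = "pytorch_model.bin"
safetensors_params_filename = "model.safetensors"

# Toy stand-in for the map of files present in the repository.
repo_files = %{"model.safetensors" => %{}}

# Prefer the PyTorch params when present (the default discussed above), fall
# back to safetensors when that is the only format in the repo, and raise
# otherwise. The second tuple element flags whether the chosen file is in the
# safetensors format.
{filename, safetensors?} =
  cond do
    Map.has_key?(repo_files, pytorch_params_filename) ->
      {pytorch_params_filename, false}

    Map.has_key?(repo_files, safetensors_params_filename) ->
      {safetensors_params_filename, true}

    true ->
      raise "none of the expected parameters files found in the repository"
  end

IO.inspect({filename, safetensors?})
#=> {"model.safetensors", true}
```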

Contributor:

cool, thank you!

grzuy (Contributor):

@jonatanklosko Nice!

Just curious, do you remember which specific HF repos you experienced that with?

jonatanklosko (Member Author):

@grzuy I tried a couple repos and it was often the case, for example bert-base-cased. I figured out what it is: many models are configured with tie_word_embeddings=True, which means that two specific layers share the same parameters. In such cases the PyTorch .bin files still include the parameters for both layers, so loading works, but the .safetensors files only include the parameters for one layer, and those are then set for the other layer.

We haven't added explicit support for tied embeddings yet; once we do, I think it will work as expected.
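
For illustration only, a toy sketch of what "setting the parameters for the other layer" could look like after loading, assuming the params load as a plain nested map of layer name to parameter name to tensor. The layer names are hypothetical and vary per model, and this is not Bumblebee's planned implementation of tied embeddings:

```elixir
Mix.install([:nx])

# Toy params map. "embeddings.word_embeddings" follows the naming in the TODO
# quoted in the next comment; "language_modeling_head.output" is a made-up
# name for the layer whose parameters the safetensors file omits.
params = %{
  "embeddings.word_embeddings" => %{"kernel" => Nx.iota({4, 2})},
  "language_modeling_head.output" => %{"kernel" => Nx.broadcast(0.0, {4, 2})}
}

# Reuse the word-embedding kernel for the other layer, so the two layers
# effectively share the same weights.
shared = params["embeddings.word_embeddings"]["kernel"]

params
|> put_in(["language_modeling_head.output", "kernel"], shared)
|> IO.inspect()
```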

grzuy (Contributor):

Oh, I see.

I guess it relates to this comment:

    # TODO: use a shared parameter with embeddings.word_embeddings.kernel
    # if spec.tie_word_embeddings is true (relevant for training)

Cool, thanks for clarifying!

Contributor:

Is this partially what is needed? #263


First, if the repository is clearly a fine-tuned version of another model, you can look for `tokenizer.json` in the original model repository. For example, [`textattack/bert-base-uncased-yelp-polarity`](https://huggingface.co/textattack/bert-base-uncased-yelp-polarity) only includes `tokenizer_config.json`, but it is a fine-tuned version of [`bert-base-uncased`](https://huggingface.co/bert-base-uncased), which does include `tokenizer.json`. Consequently, you can safely load the model from `textattack/bert-base-uncased-yelp-polarity` and the tokenizer from `bert-base-uncased`.
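
Concretely, using the repositories mentioned above:

```elixir
# Load the fine-tuned model itself...
{:ok, model_info} =
  Bumblebee.load_model({:hf, "textattack/bert-base-uncased-yelp-polarity"})

# ...but take the tokenizer from the original repository, which does include
# tokenizer.json.
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})
```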

Otherwise, the Transformers library includes conversion rules to load a "slow tokenizer" and convert it to a corresponding "fast tokenizer", which is possible in most cases. You can generate the `tokenizer.json` file using [this tool](https://jonatanklosko-bumblebee-tools.hf.space/apps/tokenizer-generator). Once successful, you can follow the steps to submit a PR adding `tokenizer.json` to the model repository. Note that you do not have to wait for the PR to be merged; instead, you can copy the commit SHA from the PR and load the tokenizer with `Bumblebee.load_tokenizer({:hf, "model-repo", revision: "..."})`.
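
For example, with a placeholder in place of the actual commit SHA:

```elixir
# The revision below is a placeholder; substitute the commit SHA copied from
# the pull request on the Hugging Face Hub.
{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, "model-repo", revision: "<commit-sha-from-the-pr>"})
```
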
Contributor:

Beautifully written docs. I am wondering if we should include the README as our front-page in hexdocs.pm/bumblebee as well? Or maybe use the same trick we use in Livebook to include part of the README in the moduledocs.

jonatanklosko (Member Author):

Mirroring part of the README into docs sounds good!

lib/bumblebee.ex (Outdated):

      {@safetensors_params_filename, true}

    true ->
      raise "none of the expected parameters files found in the repository." <>
Contributor:

I think those should be `raise ArgumentError`, because it is often something wrong with the input or something that can be fixed by changing the input. Anyway, it is up to you.

jonatanklosko (Member Author):

Oh yeah, for these it makes sense.

I was also second-guessing the fact that in some cases we return :ok/:error, which happens mostly for HTTP errors (repo not found, not authenticated, etc.). But if we raise in all cases then the function should have a bang 🤷‍♂️
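
Sketched against the outdated excerpt above, the suggested change amounts to the following (only the visible part of the message is shown, since the rest is elided in the excerpt):

```elixir
# Raise ArgumentError rather than a bare RuntimeError, since the problem lies
# with the given repository or input rather than with the library itself.
raise ArgumentError,
      "none of the expected parameters files found in the repository."
```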

josevalim (Contributor) left a comment:

The new docs and error messages are 🔥.

jonatanklosko merged commit 4623083 into main on Sep 27, 2023
2 checks passed
jonatanklosko deleted the jk-load-errors branch on September 27, 2023 at 16:10
grzuy mentioned this pull request on Oct 11, 2023