Improve loading errors and add docs on HF repos #256
Conversation
cond do
  Map.has_key?(repo_files, @pytorch_params_filename) ->
    {@pytorch_params_filename, false}
@josevalim we are still defaulting to PyTorch, because I tried a couple repos and noticed that the safetensors parameters don't necessarily match the model architecture. I will investigate further and check the expected behaviour with HF folks.
cool, thank you!
@jonatanklosko Nice!
Just curious, do you remember which specific HF repos you experienced that with?
@grzuy I tried a couple of repos and it was often the case, for example `bert-base-cased`. I figured out what it is: many models are configured with `tie_word_embeddings=True`, which means that two specific layers share the same parameters. In that case the PyTorch `.bin` files still include parameters for both layers, so loading works, but the `.safetensors` files only include parameters for one layer, and those are then set for the other layer.
We haven't added explicit support for tied embeddings yet; once we do, I think it will work as expected.
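To make the setting concrete, here is a minimal sketch (not part of this PR) of checking whether a repository's model is configured with tied word embeddings; it assumes the loaded spec exposes a `tie_word_embeddings` field, as hinted at by the `bert.ex` TODO quoted later in this thread.

```elixir
# Sketch only: inspect the model spec to see whether word embeddings are tied.
# The :tie_word_embeddings field is an assumption based on the TODO in bert.ex.
{:ok, spec} = Bumblebee.load_spec({:hf, "bert-base-cased"})

if Map.get(spec, :tie_word_embeddings, false) do
  IO.puts("Tied embeddings: the .safetensors file may omit the duplicated layer")
end
```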
Oh, I see.
I guess it relates to this comment:
bumblebee/lib/bumblebee/text/bert.ex, lines 575 to 576 in 8867021:
# TODO: use a shared parameter with embeddings.word_embeddings.kernel
# if spec.tie_word_embeddings is true (relevant for training)
Cool, thanks for clarifying!
Is this partially what is needed? #263
First, if the repository is clearly a fine-tuned version of another model, you can look for `tokenizer.json` in the original model repository. For example, [`textattack/bert-base-uncased-yelp-polarity`](https://huggingface.co/textattack/bert-base-uncased-yelp-polarity) only includes `tokenizer_config.json`, but it is a fine-tuned version of [`bert-base-uncased`](https://huggingface.co/bert-base-uncased), which does include `tokenizer.json`. Consequently, you can safely load the model from `textattack/bert-base-uncased-yelp-polarity` and the tokenizer from `bert-base-uncased`.
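A minimal sketch of that approach, using the repositories named above:

```elixir
# Load the model from the fine-tuned repository, but the tokenizer from the
# original base repository that still ships tokenizer.json.
{:ok, model_info} =
  Bumblebee.load_model({:hf, "textattack/bert-base-uncased-yelp-polarity"})

{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})
```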
Otherwise, the Transformers library includes conversion rules to load a "slow tokenizer" and convert it to a corresponding "fast tokenizer", which is possible in most cases. You can generate the `tokenizer.json` file using [this tool](https://jonatanklosko-bumblebee-tools.hf.space/apps/tokenizer-generator). Once successful, you can follow the steps to submit a PR adding `tokenizer.json` to the model repository. Note that you do not have to wait for the PR to be merged; instead, you can copy the commit SHA from the PR and load the tokenizer with `Bumblebee.load_tokenizer({:hf, "model-repo", revision: "..."})`.
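For instance, a sketch of pinning the tokenizer to the PR revision (the repository name and commit SHA below are placeholders, not real values):

```elixir
# Load the tokenizer from an unmerged PR by pinning the revision to its commit SHA.
{:ok, tokenizer} =
  Bumblebee.load_tokenizer({:hf, "model-repo", revision: "commit-sha-from-the-pr"})
```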
Beautifully written docs. I am wondering if we should include the README as our front page on hexdocs.pm/bumblebee as well? Or maybe use the same trick we use in Livebook to include part of the README in the moduledocs.
Mirroring part of the README into docs sounds good!
lib/bumblebee.ex
{@safetensors_params_filename, true}

true ->
  raise "none of the expected parameters files found in the repository." <>
I think those should be `raise ArgumentError`, because it is often something wrong with the input or fixed by changing the input? Anyway, it is up to you.
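Something along these lines (a sketch of the suggestion, not the final code; the message is the one already present in the diff):

```elixir
# Raise ArgumentError so the failure clearly points at the given input
# (the repository) rather than an internal error.
raise ArgumentError, "none of the expected parameters files found in the repository."
```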
Oh yeah, for these it makes sense.
I was also second-guessing that in some cases we return `:ok`/`:error`, which happens mostly for HTTP errors (repo not found, not authenticated, etc.). But if we raise in all cases then it should have a bang 🤷♂️
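For context, the convention being referred to is the usual Elixir one: a plain function returns `{:ok, result}`/`{:error, reason}` and lets the caller decide, while a bang variant raises on failure. A generic sketch, not Bumblebee's actual API:

```elixir
# Illustrative only: tuple-returning function plus a bang variant that raises.
defmodule Loader do
  def load(path) do
    if File.exists?(path), do: {:ok, File.read!(path)}, else: {:error, :not_found}
  end

  def load!(path) do
    case load(path) do
      {:ok, contents} -> contents
      {:error, reason} -> raise ArgumentError, "could not load #{path}: #{inspect(reason)}"
    end
  end
end
```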
The new docs and error messages are 🔥.
Also detects if a repository has only safetensors parameters and picks that up automatically.
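Putting the diff fragments quoted above together, the parameter-file selection presumably ends up roughly like this (a sketch, not the exact code from the PR; the safetensors condition and the `ArgumentError` are inferred from the snippets and the review discussion):

```elixir
# Prefer PyTorch parameters when present, fall back to safetensors when that is
# all the repository has, and raise for anything else.
cond do
  Map.has_key?(repo_files, @pytorch_params_filename) ->
    {@pytorch_params_filename, false}

  Map.has_key?(repo_files, @safetensors_params_filename) ->
    {@safetensors_params_filename, true}

  true ->
    raise ArgumentError, "none of the expected parameters files found in the repository."
end
```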