Support tie_word_embeddings #263

Closed
grzuy wants to merge 2 commits

Conversation

@grzuy (Contributor) commented Oct 16, 2023

Is something along these lines what's needed to support tie_word_embeddings for models with language modeling heads?

If so, I can continue updating the rest of the missing models.

@jonatanklosko (Member) commented Oct 17, 2023

Hey @grzuy :) This is more of a workaround that works for loading, but in order for parameter sharing to work during model training we actually need to share them at the Axon level. Since PyTorch checkpoints generally include the tensors for both shared layers, we've been deferring this until we address the actual sharing.

There is Axon.block on Axon main (PR), but I think we still haven't converged on what the best solution is in the Bumblebee case. cc @seanmor5
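
For readers following along, here is a minimal PyTorch-side sketch (not code from this PR or from Bumblebee) of what tie_word_embeddings=True means in the Hugging Face models being discussed: the LM head and the input embedding reference the same tensor, so training updates a single shared parameter. Sizes and variable names below are illustrative assumptions.

```python
# Minimal sketch of weight tying on the PyTorch side; sizes and names are
# illustrative, not taken from this PR.
import torch.nn as nn

vocab_size, hidden_size = 30522, 768

embedding = nn.Embedding(vocab_size, hidden_size)
lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

# Tie the weights: both modules now hold the same Parameter, so a gradient
# step through either layer updates the one shared matrix.
lm_head.weight = embedding.weight

assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()
```

Copying the embedding tensor into the head at load time reproduces the initial values but not this coupling, which is why the sharing has to happen in the Axon graph itself for training to behave the same way.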

@grzuy (Contributor, Author) commented Oct 18, 2023

Right, hehe 🤦‍♂️

Thanks!

@grzuy closed this Oct 18, 2023
@jonatanklosko (Member) commented:

No worries :) If this was needed to load models, it could be worth the workaround (we actually had to do it for T5), but I think it's fine to default to .bin parameters until we do the actual sharing, because we would need to rewrite the workaround anyway :)

@grzuy (Contributor, Author) commented Oct 18, 2023

So, this is clearly not actually supporting tie_word_embeddings, I now realize :-)

Despite that, wouldn't these changes still be valuable under a title like "Support loading param files without repeated shared params" or similar?

I might be missing something else here, though.

@grzuy (Contributor, Author) commented Oct 18, 2023

We almost posted our comments at the same moment 😄

@grzuy (Contributor, Author) commented Oct 18, 2023

> If this was needed to load models, it could be worth the workaround

It indeed fixed loading of a few model.safetensors files for me locally.

@jonatanklosko (Member) commented Oct 18, 2023

> It indeed fixed loading of a few model.safetensors files for me locally.

Are these models you saved? Does it work if you save in the .bin format?

@grzuy (Contributor, Author) commented Oct 18, 2023

> > It indeed fixed loading of a few model.safetensors files for me locally.
>
> Are these models you saved?

Yes, it fixed loading model.safetensors for the models I updated in this PR.

> Does it work if you save in the .bin format?

As far as I tested, it did load the same LM head params from .bin files.
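
A quick way to see the difference being discussed, sketched with standard Hugging Face tooling (the file names below are the usual defaults and are assumptions, not paths from this PR): list which tensors each checkpoint format actually stores. Keys present in the .bin state dict but absent from the .safetensors file are typically the tied weights, which the safetensors format stores only once.

```python
# Hedged sketch: compare which tensor names the two checkpoint formats
# contain for the same model. File names are the usual HF defaults and are
# assumptions here, not paths taken from this PR.
import torch
from safetensors import safe_open

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    safetensors_keys = set(f.keys())

bin_keys = set(torch.load("pytorch_model.bin", map_location="cpu").keys())

# Names present only in the .bin checkpoint usually correspond to tied
# (aliased) parameters that the safetensors file keeps a single copy of.
print(sorted(bin_keys - safetensors_keys))
```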

@grzuy (Contributor, Author) commented Oct 24, 2023

> Hey @grzuy :) This is more of a workaround that works for loading, but in order for parameter sharing to work during model training we actually need to share them at the Axon level. Since PyTorch checkpoints generally include the tensors for both shared layers, we've been deferring this until we address the actual sharing.

@jonatanklosko Is it accurate to say, then, that fine-tuning any HF model that has param sharing in its architecture via tie_word_embeddings=True (like BERT, BART, ALBERT) will not work properly in Bumblebee as of now?

@jonatanklosko (Member) commented:

@grzuy yeah, if you train with Axon then the weights are not going to be shared and they would likely diverge to some extent. I don't know how much impact it has for fine-tuning (as opposed to training from scratch). That's precisely why we need actual sharing.
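
To make the divergence point concrete, here is a tiny hedged illustration (plain PyTorch, unrelated to Axon internals): two copies of an initially identical weight that are not tied receive different gradients and drift apart after a single update.

```python
# Hedged illustration: untied copies of the "same" weight diverge as soon
# as they receive different gradients. Shapes and the toy loss are arbitrary.
import torch

w_embed = torch.randn(4, 3, requires_grad=True)
w_head = w_embed.detach().clone().requires_grad_(True)  # untied copy

x = torch.randn(2, 3)
# The two copies enter the loss differently, so their gradients differ.
loss = (x @ w_embed.t()).sum() + 2 * (x @ w_head.t()).sum()
loss.backward()

with torch.no_grad():
    w_embed -= 0.1 * w_embed.grad
    w_head -= 0.1 * w_head.grad

print(torch.allclose(w_embed, w_head))  # False: the copies are no longer equal
```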

@grzuy deleted the tie-word-embeddings branch February 29, 2024 13:39