Add OLMoE #32406
Conversation
Hey! Feel free to ping me for a review once ready!
It's ready! :) I will update the README & double-check the slow tests once the model is released, if that's fine!
@ArthurZucker would be great if we could get it reviewed soon 😇
We'll release the model on Tuesday, would be amazing if we could have this approved by then!
Oops, sorry, reviewing now
Sorry for the late review!
Mostly missing "Copied from" statements; should be good to go otherwise.
Could go the extra mile and have a compile-compatible version of the MoE block!
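For readers unfamiliar with the request above, here is a minimal sketch of what a torch.compile-friendly MoE block could look like: every expert runs on all tokens with static shapes and routing weights mask out the non-selected tokens, so there is no data-dependent indexing to cause graph breaks. This is an illustration, not the PR's implementation, and all names and sizes here are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompileFriendlyMoeBlock(nn.Module):
    """Sketch of a compile-friendly sparse MoE block (illustrative only)."""

    def __init__(self, hidden_size=64, intermediate_size=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is a small two-layer MLP, a stand-in for the real expert MLP.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states):
        batch, seq_len, hidden = hidden_states.shape
        flat = hidden_states.view(-1, hidden)                    # (tokens, hidden)
        router_logits = self.gate(flat)                          # (tokens, num_experts)
        probs = F.softmax(router_logits, dim=-1)
        topk_probs, topk_idx = torch.topk(probs, self.top_k, dim=-1)
        # Dense routing-weight matrix: zero everywhere except the top-k experts per token.
        routing_weights = torch.zeros_like(probs).scatter(-1, topk_idx, topk_probs)
        out = torch.zeros_like(flat)
        # Static loop over experts: shapes never depend on the routing decision,
        # which avoids graph breaks, at the cost of running every expert densely.
        for e, expert in enumerate(self.experts):
            out = out + routing_weights[:, e:e + 1] * expert(flat)
        return out.view(batch, seq_len, hidden), router_logits
```

The trade-off is extra compute, since every expert processes every token; whether that is acceptable depends on the expert count and expert size.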
# Licensed under the Apache License, Version 2.0 (the "License");
date is missing here!
Is that mandatory? Maybe the entire header can just be removed? It seems redundant to have this header in every file when the license is clear from the repo...
😄 yeah it is redundant, but AFAIK the license header is still expected; it's a nit though, don't worry
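For reference, the missing "date" refers to the copyright year in the standard Apache 2.0 header used across the repo's model files; the year and copyright holder below are placeholders, not values from this PR.

```python
# coding=utf-8
# Copyright 2024 <copyright holder>. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```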
return final_hidden_states, router_logits


class OlmoeDecoderLayer(nn.Module):
Same here
Different from the others because we have no shared expert
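For readers following along, here is a rough sketch of the dispatch pattern that the `return final_hidden_states, router_logits` line above belongs to. This is a paraphrase under assumptions, not the PR's exact code: tokens are sent only to their top-k experts, the weighted expert outputs are scattered back, and the block returns the combined hidden states together with the raw router logits, with no shared-expert branch added on top. The `norm_topk_prob` flag name is an assumption.

```python
import torch
import torch.nn.functional as F


def sparse_moe_forward(hidden_states, gate, experts, top_k, norm_topk_prob=False):
    batch, seq_len, hidden_dim = hidden_states.shape
    hidden_states = hidden_states.view(-1, hidden_dim)
    router_logits = gate(hidden_states)                           # (tokens, num_experts)
    routing_weights = F.softmax(router_logits, dim=-1)
    routing_weights, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    if norm_topk_prob:
        # Optionally renormalize the top-k weights so they sum to 1 per token.
        routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)

    final_hidden_states = torch.zeros_like(hidden_states)
    # (num_experts, top_k, tokens) mask of which tokens picked which expert.
    expert_mask = F.one_hot(selected_experts, num_classes=len(experts)).permute(2, 1, 0)
    for expert_idx, expert in enumerate(experts):
        idx, top_x = torch.where(expert_mask[expert_idx])
        if top_x.numel() == 0:
            continue
        current_state = hidden_states[top_x]
        current_hidden = expert(current_state) * routing_weights[top_x, idx, None]
        final_hidden_states.index_add_(0, top_x, current_hidden.to(hidden_states.dtype))

    final_hidden_states = final_hidden_states.view(batch, seq_len, hidden_dim)
    return final_hidden_states, router_logits
```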
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Could you make sure you rebased and the CIs are green? Will review again after that!
LGTM, let's rebase and fix the CIs 🤗 (there is also a TODO left in the .md)
Great, all passing! 🙌
Thanks for your contribution! 🔥
Is it a problem on my end if I notice that a higher native batch size during training results in higher losses? (With DeepSpeed.) The purple line is full fine-tuning with the higher native batch size; the orange line uses gradient accumulation instead of a higher native batch size. I'm guessing expert routing parallelism is maybe not properly handled for bs > 1?
cc @Muennighoff, who would have a lot more clues than me 😉
Hm, not sure about this - are other models the same for you in both scenarios? How about other MoEs?
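One thing that may be worth ruling out (my assumption, not something established in this thread): batch-level MoE auxiliary losses are nonlinear in the routing statistics, so a load-balancing term computed once over a large native batch is generally not identical to the average of the same term over gradient-accumulation micro-batches. A toy check using a generic switch-style balancing loss, not this model's exact implementation:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, tokens = 8, 1024


def load_balancing_loss(router_logits):
    probs = F.softmax(router_logits, dim=-1)                # (tokens, num_experts)
    top1 = probs.argmax(dim=-1)
    f = F.one_hot(top1, num_experts).float().mean(dim=0)    # fraction of tokens per expert
    p = probs.mean(dim=0)                                    # mean router probability per expert
    return num_experts * torch.sum(f * p)


logits = torch.randn(tokens, num_experts)
full_batch = load_balancing_loss(logits)
micro_batches = torch.stack([load_balancing_loss(c) for c in logits.chunk(4)]).mean()
# The two values are close but generally not identical, because the loss
# multiplies two batch-level averages.
print(full_batch.item(), micro_batches.item())
```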
* Add OLMoE * Add OLMoE * Updates * Make norm optional; add keys * Add output * Add * Fix dtype * Fix eos config * Update * Add OLMoE * Fix OLMoE path * Format * Format * Rmv copy statement * Rmv copy statement * Format * Add copies * Cp rotary * Fix aming * Fix naming * Update RoPE integration; num_logits_to_keep; Add copy statements * Add eps to config * Format * Add aux loss * Adapt router_aux_loss_coef * Update md * Adapt * adapt tests
What does this PR do?
Who can review?
The model will be released in ~1 week - can we already review this so that we can merge right upon release?
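For context, a minimal usage sketch of what this PR enables once the checkpoint is out; the model id below is a placeholder, since the released name is not stated in this thread.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id, not an official checkpoint name from this PR.
model_id = "allenai/OLMoE-placeholder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Mixture-of-experts models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```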