-
Notifications
You must be signed in to change notification settings - Fork 27.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ModernBERT to Transformers #35158
Conversation
cc @Cyrilvallez ! 🤗 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beyond the obvious (sdpa, eager, flex attention, and documentation), I haven't seen anything outrageous or very unexpected in my first scroll-through.
I recognize that this implementation goes a bit beyond our "usual" with unpadding/padding when possible, but I personally don't mind. Beyond this change (and the other obvious upgrades like RoPE), I quite like how this still mirrors the original BERT rather closely.
I'll have to actually start running this to get a better feel, but so far so good.
Also, the SequenceClassification and TokenClassification classes don't exist yet.
@ArthurZucker @Cyrilvallez
class ModernBertTokenizerFast(PreTrainedTokenizerFast):
model_input_names = ["input_ids", "attention_mask"]
|
With all of these changes in place, I was able to confirm that the output to one of the trained models using the original research implementation nearly matches the output of the Do we allow something like this to get an exact 1-1 match? @torch.compile(dynamic=True)
def compiled_mlp(self, hidden_states: torch.Tensor) -> torch.Tensor:
return self.mlp(self.mlp_norm(hidden_states)) Here's an indication of the difference between with and without:
|
cc @warner-benjamin let me know if the two should remain separate!
Please confirm @warner-benjamin
the model with FA2 and the RoPE kernel are not torch.compile compatible, we can't compile the whole model while using these. |
FA2 is compatible now, but the FA RoPE kernel isn't yet. I have a in progress fix I need to get merged into the FA repo. |
Users can make custom heads if they feel like it Also removes the unnecessary pool parameter
Doublechecked with Benjamin that it's correct/what we used for pretraining
We want mean pooling as an option for classification because Local Attention means unlike BERT the CLS token doesn't see all the output in all the attention layers, so mean pooling could outperform CLS on greater than 128 token sequences. Also, I added the pooling head to TokenClassification because otherwise we are throwing away one pretrained linear layer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super cool! 2 things left: 1. remove the gradient thing: padding and unpadding is not model weight dependant, and should never have gradients.
2. remove the 2 functions, and just call the Head ClsPooling or MeanPooling depending on the one that was release / most common cf our offline discussion @tomaarsen
This reverts commit 99c38ba. There was no real motivation, no info on whether having this bigger head does anything useful.
Great addition! Thanks all for your hard work! 🤗 |
This PR will add ModernBERT to Transformers.