Add ModernBERT to Transformers #35158
Merged
Changes from 1 commit
Commits (91 commits)
6b5a823 initial cut of modernbert for transformers (warner-benjamin)
dafb203 small bug fixes (warner-benjamin)
df13def fixes (warner-benjamin)
d09eabf Update import (tomaarsen)
8c3afea Use compiled mlp->mlp_norm to match research implementation (tomaarsen)
a40aaa9 Propagate changes in modular to modeling (tomaarsen)
9f0b8ca Replace duplicate attn_out_dropout in favor of attention_dropout (tomaarsen)
900d8ec Update BOS to CLS and EOS to SEP (tomaarsen)
caf8901 Set default classifier bias to False, matching research repo (tomaarsen)
8276602 Update tie_word_embeddings description (tomaarsen)
79e4bbb Fix _init_weights for ForMaskedLM (tomaarsen)
b59bad9 Match base_model_prefix (tomaarsen)
e7bef53 Add compiled_head to match research repo outputs (tomaarsen)
120578b Fix imports for ModernBertForMaskedLM (tomaarsen)
142ff11 Just use "gelu" default outright for classifier (tomaarsen)
b44abdc Fix config name typo: initalizer -> initializer (tomaarsen)
3de8ebf Remove some unused parameters in docstring. Still lots to edit there! (tomaarsen)
7a05b3f Compile the embeddings forward (tomaarsen)
88b0ecf Add drafts for ForSequenceClassification/ForTokenClassification (tomaarsen)
5e3d61d Add initial SDPA support (not exactly equivalent to FA2 yet!) (tomaarsen)
2a3d378 Only use attention dropout if training (tomaarsen)
a2051d6 Add initial eager attention support (also not equivalent to FA2 yet!) (tomaarsen)
124f1fd Add initial tests, output_attentions, output_hidden_states, prune_heads (tomaarsen)
38f959b Remove kwargs from ModernBertForMaskedLM (tomaarsen)
f716943 Remove/adjust/skip improper tests; warn if padding but no attn mask (tomaarsen)
f41adaa Run formatting etc. (tomaarsen)
d06654a Run python utils/custom_init_isort.py (tomaarsen)
f9301f4 FlexAttention with unpadded sequences (matches FA2 within bf16 numerics) (staghado)
a356708 Reformat init_weights based on review (tomaarsen)
f83fdc0 self -> module in attention forwards (tomaarsen)
b444c15 Remove if config.tie_word_embeddings (tomaarsen)
5aaf273 Reformat output projection on a different line (tomaarsen)
0a8d044 Remove pruning (tomaarsen)
382e481 Remove assert (tomaarsen)
5d05e8e Call contiguous() to simplify paths (tomaarsen)
98508c7 Remove prune_qkv_linear_layer (tomaarsen)
2c076c8 Format code (tomaarsen)
986c6fe Keep as kwargs, only use if needed (tomaarsen)
5cd39ad Remove unused codepaths & related config options (tomaarsen)
2d606b9 Remove 3d attn_mask test; fix token classification tuple output (tomaarsen)
8eb87e8 Reorder: attention_mask above position_ids, fixes gradient checkpointing (tomaarsen)
5d83c56 Merge branch 'main' into pr-35158 (tomaarsen)
3a24af4 Fix usage if no FA2 or torch v2.5+ (tomaarsen)
37a6030 Make torch.compile/triton optional (tomaarsen)
b3b4028 Separate pooling options into separate functions (cls, mean) - cls as default (tomaarsen)
b241a7e Simplify _pad_modernbert_output, remove unused labels path (tomaarsen)
66f4603 Update tied weights to remove decoder.weight, simplify decoder loading (tomaarsen)
3eb786b Adaptively set config.compile based on hf_device_map/device/resize, etc. (tomaarsen)
093b601 Merge branch 'main' of https://github.com/huggingface/transformers in… (tomaarsen)
28fc79e Update ModernBertConfig docstring (tomaarsen)
612befa Satisfy some consistency checks, add unfinished docs (tomaarsen)
ae32e8b Merge branch 'main' of https://github.com/huggingface/transformers in… (tomaarsen)
f4e280a Only set compile to False if there's more than 1 device (tomaarsen)
bc14967 Add docstrings for public ModernBert classes (tomaarsen)
0f17fb9 Dont replace docstring returns - ends up being duplicate (tomaarsen)
25b12b4 Fix mistake in toctree (tomaarsen)
f312eef Reformat toctree (tomaarsen)
1e367df Patched FlexAttention, SDPA, Eager with Local Attention (tomaarsen)
fb748ce Implement FA2 -> SDPA -> Eager attn_impl defaulting, crucial (tomaarsen)
051233f Patch test edge case with Idefics3 not working with 'attn_implementat… (tomaarsen)
6c01711 Repad all_hidden_states as well (tomaarsen)
5f7c566 rename config.compile to reference_compile (warner-benjamin)
c8a80e7 disable flex_attention since it crashes (warner-benjamin)
8962f05 Update modernbert.md (bclavie)
7e89f4d Using dtype min to mask in eager (NohTow)
0742a1d Fully remove flex attention for now (tomaarsen)
6c6cddb Call contiguous to allow for .view() (tomaarsen)
e37e4ec Copyright 2020 -> 2024 (tomaarsen)
9afc480 Update/simplify __init__ structure (tomaarsen)
aa1bdb4 Remove "... if dropout_prob > 0 else identity" (tomaarsen)
659807f re-use existing pad/unpad functions instead of creating new ones (staghado)
7955e39 remove flexattention method (staghado)
4145119 Compute attention_mask and local_attention_mask once in modeling (tomaarsen)
0e572d5 Simplify sequence classification prediction heads, only CLS now (tomaarsen)
e5dca63 Simplify module.training in eager attn (tomaarsen)
bf11173 Also export ModernBertPreTrainedModel (tomaarsen)
54ed5db Update the documentation with links to finetuning scripts (tomaarsen)
a1bfae8 Explain local_attention_mask parameter in docstring (tomaarsen)
df7658a Simplify _autoset_attn_implementation, rely on super() (tomaarsen)
b3404ed Keep "in" to initialize Prediction head (tomaarsen)
e057bc2 add back mean pooling (warner-benjamin)
99c38ba Use the pooling head in TokenClassification (warner-benjamin)
5114ed7 update copyright (warner-benjamin)
175fb95 Reset config._attn_implementation_internal on failure (tomaarsen)
8cedfc5 Allow optional attention_mask in ForMaskedLM head (warner-benjamin)
2380729 fix failing run_slow tests (warner-benjamin)
7686134 Add links to the paper (tomaarsen)
44275fd Remove unpad_no_grad, always pad/unpad without gradients (tomaarsen)
d799d65 local_attention_mask -> sliding_window_mask (tomaarsen)
ed77867 Revert "Use the pooling head in TokenClassification" (tomaarsen)
92e17c6 Simplify pooling, 2 options via if-else (tomaarsen)
Separate pooling options into separate functions (cls, mean) - cls as default
commit b3b4028e826d14b623bc6c35f3541ab51a67b234
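For context, this commit splits the pooling logic into two small functions, with CLS pooling as the default. The sketch below is only an illustration of that shape under assumed names and signatures; it is not the PR's actual code, and the `pooling` argument is a hypothetical stand-in for however the config selects the mode.

```python
import torch

# Illustrative sketch only: the real helpers live inside the ModernBERT
# modeling code; names and signatures here are assumptions.

def cls_pooling(hidden_states: torch.Tensor) -> torch.Tensor:
    # Use the first ([CLS]) token's hidden state as the sequence representation.
    return hidden_states[:, 0]

def mean_pooling(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Average the hidden states of non-padding tokens only.
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def pool(hidden_states, attention_mask, pooling="cls"):
    # CLS stays the default per this commit's title; mean is opt-in.
    if pooling == "cls":
        return cls_pooling(hidden_states)
    return mean_pooling(hidden_states, attention_mask)
```

A later commit in the list ("Simplify pooling, 2 options via if-else") keeps this same two-branch structure.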
Correct me if I'm wrong, but the inputs are the same as for any other LLM, no? In that case, if you want to unpad, you should be using:
transformers/src/transformers/modeling_flash_attention_utils.py (lines 63 to 69 in 1b6cb1e)
I'm afraid not quite. We're unpadding much earlier and repadding much later, so that e.g. even the MaskedLM head can take advantage of it. As a result, _upad_input (and _flash_attention_forward) aren't viable here.
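To make the unpad-early/repad-late idea concrete, here is a rough sketch. The helper names echo _pad_modernbert_output from the commit list, but the signatures, bodies, and the layer/head call pattern below are assumptions for illustration, not the PR's actual code.

```python
import torch

# Sketch only: unpad once at model entry and repad once at exit, instead of
# unpadding inside every attention call (as the per-call _upad_input path does).

def _unpad_modernbert_input(inputs: torch.Tensor, attention_mask: torch.Tensor):
    # Flatten (batch, seq, hidden) -> (total_tokens, hidden), keeping only
    # non-padding tokens, and record the indices needed to restore the layout.
    batch, seqlen = attention_mask.shape
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    unpadded = inputs.flatten(0, 1)[indices]
    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)
    cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    return unpadded, indices, cu_seqlens, int(seqlens.max())

def _pad_modernbert_output(unpadded: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int):
    # Scatter the unpadded token states back into a (batch, seq, ...) tensor.
    padded = unpadded.new_zeros((batch * seqlen, *unpadded.shape[1:]))
    padded[indices] = unpadded
    return padded.view(batch, seqlen, *unpadded.shape[1:])

def forward_sketch(model_layers, lm_head, input_embeds, attention_mask):
    # Conceptual forward pass: every encoder layer and even the MaskedLM head
    # operate on the unpadded token stream; padding is restored only at the end.
    batch, seqlen = attention_mask.shape
    hidden, indices, cu_seqlens, max_seqlen = _unpad_modernbert_input(input_embeds, attention_mask)
    for layer in model_layers:
        hidden = layer(hidden, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen)
    logits = lm_head(hidden)  # still unpadded, so no compute is spent on padding tokens
    return _pad_modernbert_output(logits, indices, batch, seqlen)
```

The point of the comment above is the placement: because unpadding happens once before the first layer and repadding once after the head, padding tokens are skipped everywhere in between, which a per-attention-call unpad/repad cannot provide.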