Allow passing 2D attention mask #27640
Comments
Hey, the model's forward already supports passing a 2D attention mask; it is just expanded to 4D because that is the format required by the attention implementation.
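For context, here is a minimal sketch of what that expansion does. This is illustrative only, not the library's exact implementation: a binary padding mask of shape `(batch, seq_len)` is combined with the causal mask into an additive 4D mask of shape `(batch, 1, seq_len, seq_len)`.

```python
import torch

def expand_2d_to_4d(attention_mask: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    """Expand a (batch, seq_len) padding mask into a (batch, 1, seq_len, seq_len)
    additive causal mask. Illustrative only, not the transformers implementation."""
    bsz, seq_len = attention_mask.shape
    # Lower-triangular causal mask: query i may attend to keys j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Broadcast the padding mask over the query dimension.
    padding = attention_mask.bool()[:, None, None, :]
    allowed = causal[None, None, :, :] & padding  # (bsz, 1, seq_len, seq_len)
    # Additive mask: 0 where attention is allowed, a large negative value elsewhere.
    mask = torch.full((bsz, 1, seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    return mask.masked_fill(allowed, 0.0)

# Example: batch of 2 rows, second row has one padding token at the end.
mask_4d = expand_2d_to_4d(torch.tensor([[1, 1, 1], [1, 1, 0]]))
```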
Yeah, I might not have made it clear. The current "2D"s are
Just chiming in, here is some more context (also very interested in this feature). From what I understand, this is not trivial to implement in general. As one current example, the axolotl finetuning harness implements efficient sample packing with correct block-diagonal attention masking through a series of monkey patches of the underlying Hugging Face model definitions for a few very popular models like Llama and Mistral. Though I have not looked through the code in detail, I believe it leverages the fact that the flash attention API supports the masking required to implement this scheme. It is relevant for efficient finetuning (the reason it's incorporated into axolotl), and general wisdom (and whispers from inside large corps) suggests that this type of block-diagonal masking is better for large-scale training code. (#27539 is relevant, but it looks like the focus may be on the beam search/speculative decoding use case, not this slightly more general use case. Also here's a relevant HF forum post: https://discuss.huggingface.co/t/the-correct-attention-mask-for-examples-packing/52909/2)
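To make the block-diagonal idea concrete, here is a hedged sketch (not taken from axolotl or transformers) of how such a mask could be built for one packed row from the per-document lengths: each token may only attend to earlier tokens of its own document.

```python
import torch

def block_diagonal_causal_mask(segment_lengths, dtype=torch.float32) -> torch.Tensor:
    """Illustrative block-diagonal causal mask for a single packed row: every
    token may only attend to earlier tokens of the same packed segment."""
    seq_len = sum(segment_lengths)
    # Assign a segment id to every position in the packed sequence.
    segment_ids = torch.repeat_interleave(
        torch.arange(len(segment_lengths)), torch.tensor(segment_lengths)
    )
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    allowed = same_segment & causal
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    return mask.masked_fill(allowed, 0.0)

# Example: three documents of lengths 3, 2, and 4 packed into one sequence of length 9.
mask = block_diagonal_causal_mask([3, 2, 4])
```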
Packing is indeed a good use case for supporting a 2D attention mask in Hugging Face models.
Packing is planned.
Hello, is there a detailed schedule for supporting this feature? Many thanks.
Most probably not the next release, but the one after that!
Looking forward to this feature!
#31446 for packing
Hi @ArthurZucker, does #31446 include packing? It seems to just refactor flash attention, which is a prerequisite for packing rather than packing itself.
Yep, it's planned but not done yet. I was gonna do both but ended up splitting!
cc @Cyrilvallez
#33932 is relevant for the packing as well!
Feature request
Allow passing a 2D attention mask in `model.forward`.
Motivation
With this feature, it would be much easier to avoid cross-context contamination during pretraining and supervised finetuning when packing the sequences together for more efficient training.
Here is an example use case, discussed in huggingface/trl#805:
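The original snippet from that discussion is not reproduced here; as a hedged illustration of the requested behaviour (passing segment information as the attention mask is not an existing transformers API, the names below are hypothetical):

```python
import torch

# Hypothetical usage sketch -- passing segment ids as the attention mask is
# NOT an existing transformers API; it only illustrates the requested behaviour.
# Two documents of lengths 3 and 2 packed into one row of length 5.
input_ids = torch.tensor([[101, 102, 103, 201, 202]])

# The currently supported "2D" mask is a binary padding mask and cannot
# express document boundaries: every real token is simply marked with 1.
padding_mask = torch.tensor([[1, 1, 1, 1, 1]])

# What the issue asks for: a 2D mask (or equivalent per-token segment ids)
# from which the model builds a block-diagonal causal mask, so tokens of the
# second document never attend to tokens of the first.
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])
# outputs = model(input_ids, attention_mask=segment_ids)  # desired behaviour
```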
Your contribution
Upon investigating the source code, I found that the current logic for initializing attention masks is mostly a fixed code snippet duplicated in each model:
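The snippet referred to above is not quoted here; as a rough illustration of the pattern (based on the Llama-style models around transformers v4.35; this helper is internal and its name and signature may differ between versions and models):

```python
import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

batch_size, seq_length, hidden_size = 2, 5, 8
attention_mask = torch.ones(batch_size, seq_length, dtype=torch.long)  # 2D padding mask
inputs_embeds = torch.zeros(batch_size, seq_length, hidden_size)

# Each model's forward typically calls a helper like this to turn the 2D mask
# into the 4D additive causal mask consumed by the attention layers.
mask_4d = _prepare_4d_causal_attention_mask(
    attention_mask,
    (batch_size, seq_length),
    inputs_embeds,
    past_key_values_length=0,
)
print(mask_4d.shape)  # torch.Size([2, 1, 5, 5])
```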
Enabling this behavior may require modifying each model. I should be able to handle some of them and submit a draft PR, but before that, I want to know whether this feature request is reasonable.