Allow passing 2D attention mask #27640
Comments
Hey, the model's forward already supports passing a 2D attention mask; it is just expanded to 4D because that is the format required by the attention implementation.
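For context, here is a minimal sketch of what that expansion does. This is illustrative only, not the library's exact implementation: a binary padding mask of shape `(batch, seq_len)` is combined with the causal mask into an additive 4D mask of shape `(batch, 1, seq_len, seq_len)`.

```python
import torch

def expand_2d_to_4d(attention_mask: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    """Expand a (batch, seq_len) padding mask into a (batch, 1, seq_len, seq_len)
    additive causal mask. Illustrative only, not the transformers implementation."""
    bsz, seq_len = attention_mask.shape
    # Lower-triangular causal mask: query i may attend to keys j <= i.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Broadcast the padding mask over the query dimension.
    padding = attention_mask.bool()[:, None, None, :]
    allowed = causal[None, None, :, :] & padding  # (bsz, 1, seq_len, seq_len)
    # Additive mask: 0 where attention is allowed, a large negative value elsewhere.
    mask = torch.full((bsz, 1, seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    return mask.masked_fill(allowed, 0.0)

# Example: batch of 2 rows, second row has one padding token at the end.
mask_4d = expand_2d_to_4d(torch.tensor([[1, 1, 1], [1, 1, 0]]))
```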
Yeah, I might not have made it clear. The current "2D"s are
Just chiming in, here is some more context (also very interested in this feature). From what I understand, this is not trivial to implement in general. As one current example, the axolotl finetuning harness implements efficient sample packing with correct block-diagonal attention masking through a series of monkey patches of the underlying Hugging Face model definitions for a few very popular models like Llama and Mistral. Though I have not looked through the code in detail, I believe it leverages the fact that the flash attention API supports the masking required to implement this scheme. It is relevant for efficient finetuning (the reason it's incorporated into axolotl), and general wisdom (and whispers from inside large corps) suggests that this type of block-diagonal masking is better for large-scale training code. (#27539 is relevant, but it looks like the focus may be on the beam search/speculative decoding use case, not this slightly more general use case. Also here's a relevant HF forum post: https://discuss.huggingface.co/t/the-correct-attention-mask-for-examples-packing/52909/2)
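To make the block-diagonal idea concrete, here is a hedged sketch (not taken from axolotl or transformers) of how such a mask could be built for one packed row from the per-document lengths: each token may only attend to earlier tokens of its own document.

```python
import torch

def block_diagonal_causal_mask(segment_lengths, dtype=torch.float32) -> torch.Tensor:
    """Illustrative block-diagonal causal mask for a single packed row: every
    token may only attend to earlier tokens of the same packed segment."""
    seq_len = sum(segment_lengths)
    # Assign a segment id to every position in the packed sequence.
    segment_ids = torch.repeat_interleave(
        torch.arange(len(segment_lengths)), torch.tensor(segment_lengths)
    )
    same_segment = segment_ids[:, None] == segment_ids[None, :]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    allowed = same_segment & causal
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    return mask.masked_fill(allowed, 0.0)

# Example: three documents of lengths 3, 2, and 4 packed into one sequence of length 9.
mask = block_diagonal_causal_mask([3, 2, 4])
```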
Packing is indeed a good use case for supporting a 2D attention mask in Hugging Face models.
Packing is planned.
Hello, is there a detailed schedule for supporting this feature? Many thanks.
Most probably not the next release, but the one after that!
Looking forward to this feature!
#31446 for packing
Hi @ArthurZucker, does #31446 include packing? It seems to just refactor flash attention, which is a prerequisite for packing rather than packing itself.
Yep, it's planned but not done yet. I was gonna do both but ended up splitting!
cc @Cyrilvallez
#33932 is relevant for the packing as well!
Feature request
Allow passing a 2D attention mask in `model.forward`.
Motivation
With this feature, it would be much easier to avoid cross-context contamination during pretraining and supervised finetuning when packing the sequences together for more efficient training.
Here is an example use case, discussed in huggingface/trl#805:
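The original snippet from that discussion is not reproduced here; as a hedged illustration of the requested behaviour (passing segment information as the attention mask is not an existing transformers API, the names below are hypothetical):

```python
import torch

# Hypothetical usage sketch -- passing segment ids as the attention mask is
# NOT an existing transformers API; it only illustrates the requested behaviour.
# Two documents of lengths 3 and 2 packed into one row of length 5.
input_ids = torch.tensor([[101, 102, 103, 201, 202]])

# The currently supported "2D" mask is a binary padding mask and cannot
# express document boundaries: every real token is simply marked with 1.
padding_mask = torch.tensor([[1, 1, 1, 1, 1]])

# What the issue asks for: a 2D mask (or equivalent per-token segment ids)
# from which the model builds a block-diagonal causal mask, so tokens of the
# second document never attend to tokens of the first.
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])
# outputs = model(input_ids, attention_mask=segment_ids)  # desired behaviour
```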
Your contribution
Upon investigating the source code, I found that the current logic for initializing attention masks is mostly a fixed code snippet duplicated in each model:
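The snippet referred to above is not quoted here; as a rough illustration of the pattern (based on the Llama-style models around transformers v4.35; this helper is internal and its name and signature may differ between versions and models):

```python
import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

batch_size, seq_length, hidden_size = 2, 5, 8
attention_mask = torch.ones(batch_size, seq_length, dtype=torch.long)  # 2D padding mask
inputs_embeds = torch.zeros(batch_size, seq_length, hidden_size)

# Each model's forward typically calls a helper like this to turn the 2D mask
# into the 4D additive causal mask consumed by the attention layers.
mask_4d = _prepare_4d_causal_attention_mask(
    attention_mask,
    (batch_size, seq_length),
    inputs_embeds,
    past_key_values_length=0,
)
print(mask_4d.shape)  # torch.Size([2, 1, 5, 5])
```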
Enabling this behavior may require modifying each model. I should be able to handle some of them and submit a draft PR, but before that, I want to know whether this feature request is reasonable.