Refactor flash attention implementation in transformers #31446

ArthurZucker · 2024-06-17T08:55:27Z

What does this PR do?

EDIT: just a refactor for now.

Enables us to run transformers models with ragged tensors:
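For readers unfamiliar with the term, here is a minimal pure-Python sketch of what a ragged (padding-free) layout looks like. It is not code from this PR: the helper name `pack_ragged` and the variable names are assumptions chosen for illustration. The idea is that padded rows are concatenated into one flat sequence, and the boundaries between sequences are kept as cumulative lengths, which is the kind of metadata variable-length attention kernels typically consume.

```python
from itertools import accumulate


def pack_ragged(input_ids, attention_mask):
    """Drop padded positions and concatenate all rows into one flat list.

    Returns the flat tokens, the cumulative sequence lengths (one more
    entry than there are rows) and the length of the longest row.
    `pack_ragged` is a hypothetical helper, not a transformers function.
    """
    flat_tokens = []
    seq_lens = []
    for row, mask in zip(input_ids, attention_mask):
        # Keep only the positions the mask marks as real tokens.
        kept = [tok for tok, keep in zip(row, mask) if keep]
        flat_tokens.extend(kept)
        seq_lens.append(len(kept))
    # Sequence i occupies flat_tokens[cu_seq_lens[i]:cu_seq_lens[i + 1]].
    cu_seq_lens = [0, *accumulate(seq_lens)]
    return flat_tokens, cu_seq_lens, max(seq_lens, default=0)


# Made-up example batch: three rows padded to length 4.
input_ids = [[5, 6, 7, 0], [8, 9, 0, 0], [1, 2, 3, 4]]
attention_mask = [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]]

flat, cu_seq_lens, max_len = pack_ragged(input_ids, attention_mask)
print(flat)         # [5, 6, 7, 8, 9, 1, 2, 3, 4]
print(cu_seq_lens)  # [0, 3, 5, 9]
print(max_len)      # 4
```

With this layout no compute is spent on padding tokens, at the cost of passing the boundary information alongside the tokens.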

One of the goals is also to make it easy for people to redefine the `ExtraKwargs` TypedDict when building on top of transformers.
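As a rough illustration of that goal, the sketch below types `**kwargs` with a `TypedDict` through `Unpack` and then extends it downstream. Only the name `ExtraKwargs` and the use of `Unpack` come from this PR; every field name, the subclass `MyExtraKwargs`, and the function `attention_forward` are hypothetical and do not reflect the actual definitions in transformers.

```python
from typing import List, TypedDict

try:  # Python 3.11+
    from typing import Unpack
except ImportError:  # older interpreters; assumes typing_extensions is installed
    from typing_extensions import Unpack


class ExtraKwargs(TypedDict, total=False):
    """Optional arguments forwarded to an attention backend (hypothetical fields)."""

    cu_seq_lens: List[int]
    max_seq_len: int


class MyExtraKwargs(ExtraKwargs, total=False):
    """A downstream redefinition that adds one more hypothetical key."""

    softcap: float


def attention_forward(hidden: List[float], **kwargs: Unpack[MyExtraKwargs]) -> dict:
    # A real layer would dispatch to a kernel here. This stub only records
    # which optional arguments it received, so the typing pattern is visible.
    return {"num_tokens": len(hidden), "extra": sorted(kwargs)}


result = attention_forward([0.1, 0.2], cu_seq_lens=[0, 2], max_seq_len=2)
print(result)  # {'num_tokens': 2, 'extra': ['cu_seq_lens', 'max_seq_len']}
```

Because the extra arguments are declared in one `TypedDict`, a type checker can validate call sites, and a library built on top only needs to subclass the dictionary to thread new keys through.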

ArthurZucker · 2024-06-17T08:55:51Z

cc @fxmarty, @LysandreJik and @OlivierDehaene

HuggingFaceDocBuilderDev · 2024-06-26T08:01:44Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Thanks so much @fxmarty for going to the end of this!

fxmarty · 2024-07-10T13:43:17Z

No additional flash attention tests fail here compared to main (on H100).

================================================================== short test summary info ==================================================================
FAILED tests/models/bark/test_modeling_bark.py::BarkSemanticModelTest::test_flash_attn_2_from_config - ValueError: Unrecognized configuration class <class 'transformers.models.bark.configuration_bark.BarkSemanticConfig'> for this kind of AutoModel: AutoMo...
FAILED tests/models/bark/test_modeling_bark.py::BarkCoarseModelTest::test_flash_attn_2_from_config - ValueError: Unrecognized configuration class <class 'transformers.models.bark.configuration_bark.BarkCoarseConfig'> for this kind of AutoModel: AutoMode...
FAILED tests/models/dpr/test_modeling_dpr.py::DPRModelTest::test_sdpa_can_dispatch_on_flash - RuntimeError: No available kernel. Aborting execution.
FAILED tests/models/gemma/test_modeling_gemma.py::GemmaIntegrationTest::test_model_2b_flash_attn - OSError: You are trying to access a gated repo.
FAILED tests/models/gemma2/test_modeling_gemma2.py::Gemma2ModelTest::test_flash_attn_2_equivalence - AssertionError: assert False
FAILED tests/models/gemma2/test_modeling_gemma2.py::Gemma2ModelTest::test_sdpa_can_dispatch_on_flash - RuntimeError: No available kernel. Aborting execution.
FAILED tests/models/gpt_neox/test_modeling_gpt_neox.py::GPTNeoXModelTest::test_flash_attn_2_generate_padding_right - AssertionError: False is not true
FAILED tests/models/idefics2/test_modeling_idefics2.py::Idefics2ModelTest::test_flash_attn_2_inference_equivalence_right_padding - ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version...
FAILED tests/models/idefics2/test_modeling_idefics2.py::Idefics2ForConditionalGenerationModelTest::test_flash_attn_2_inference_equivalence_right_padding - ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version...
FAILED tests/models/jamba/test_modeling_jamba.py::JambaModelTest::test_flash_attn_2_generate_padding_right - AssertionError: ValueError not raised
FAILED tests/models/jamba/test_modeling_jamba.py::JambaModelTest::test_sdpa_can_dispatch_on_flash - RuntimeError: No available kernel. Aborting execution.
FAILED tests/models/m2m_100/test_modeling_m2m_100.py::M2M100ModelTest::test_flash_attn_2_from_config - ValueError: Unrecognized configuration class <class 'transformers.models.m2m_100.configuration_m2m_100.M2M100Config'> for this kind of AutoModel: AutoMo...
FAILED tests/models/m2m_100/test_modeling_m2m_100.py::M2M100ModelIntegrationTests::test_flash_attn_2_seq_to_seq_generation - RuntimeError: FlashAttention only support fp16 and bf16 data type
FAILED tests/models/mixtral/test_modeling_mixtral.py::MixtralModelTest::test_flash_attn_2_generate_padding_right - AssertionError: ValueError not raised
FAILED tests/models/olmo/test_modeling_olmo.py::OlmoModelTest::test_flash_attn_2_generate_padding_right - AssertionError: False is not true
FAILED tests/models/phi3/test_modeling_phi3.py::Phi3ModelTest::test_flash_attn_2_generate_padding_right - AssertionError: False is not true
FAILED tests/models/qwen2/test_modeling_qwen2.py::Qwen2ModelTest::test_flash_attn_2_generate_padding_right - AssertionError: ValueError not raised
FAILED tests/models/qwen2/test_modeling_qwen2.py::Qwen2ModelTest::test_flash_attn_2_inference_equivalence - AssertionError: assert False
FAILED tests/models/qwen2_moe/test_modeling_qwen2_moe.py::Qwen2MoeModelTest::test_flash_attn_2_generate_padding_right - AssertionError: ValueError not raised
FAILED tests/models/qwen2_moe/test_modeling_qwen2_moe.py::Qwen2MoeModelTest::test_flash_attn_2_inference_equivalence - AssertionError: assert False
FAILED tests/models/stablelm/test_modeling_stablelm.py::StableLmModelTest::test_flash_attn_2_generate_padding_right - AssertionError: False is not true
FAILED tests/models/starcoder2/test_modeling_starcoder2.py::Starcoder2ModelTest::test_flash_attn_2_generate_padding_right - AssertionError: ValueError not raised
FAILED tests/models/unispeech/test_modeling_unispeech.py::UniSpeechRobustModelTest::test_flash_attn_2_inference_equivalence - RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16
FAILED tests/models/unispeech/test_modeling_unispeech.py::UniSpeechRobustModelTest::test_flash_attn_2_inference_equivalence_right_padding - RuntimeError: mat1 and mat2 must have the same dtype, but got Float and BFloat16
FAILED tests/models/unispeech/test_modeling_unispeech.py::UniSpeechRobustModelTest::test_sdpa_can_dispatch_on_flash - RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_flash_attn_2_from_config - IndexError: too many indices for tensor of dimension 2
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelTest::test_flash_attn_2_inference_equivalence_right_padding - IndexError: too many indices for tensor of dimension 2
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_flash_attn_2_from_config - IndexError: too many indices for tensor of dimension 2
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_flash_attn_2_generate_left_padding - IndexError: too many indices for tensor of dimension 2
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_flash_attn_2_inference_equivalence - AssertionError: assert False
FAILED tests/models/whisper/test_modeling_whisper.py::WhisperStandaloneDecoderModelTest::test_flash_attn_2_inference_equivalence_right_padding - AssertionError: assert False
================================= 31 failed, 1064 passed, 1789 skipped, 62671 deselected, 166 warnings in 331.97s (0:05:31) =================================

Testing on MI250 for extra safety, then this is good to merge.

edit: all good, can be merged
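The comparison against main described above amounts to a set difference over failing test IDs. A minimal sketch of one way to check it, assuming both runs were saved as pytest short summaries: the helper `failed_tests` is hypothetical, and the two inputs are illustrative excerpts that reuse lines from the summary above rather than the full logs of either run.

```python
def failed_tests(summary: str) -> set:
    """Collect test node IDs from the FAILED lines of a pytest short summary."""
    failed = set()
    for line in summary.splitlines():
        if line.startswith("FAILED "):
            # The node ID is the token right after "FAILED"; the reason follows " - ".
            # Parametrized IDs containing spaces would need a stricter parser.
            failed.add(line.split()[1])
    return failed


# Illustrative excerpts only, not the complete output of either run.
main_summary = """\
FAILED tests/models/dpr/test_modeling_dpr.py::DPRModelTest::test_sdpa_can_dispatch_on_flash - RuntimeError: No available kernel. Aborting execution.
FAILED tests/models/gemma/test_modeling_gemma.py::GemmaIntegrationTest::test_model_2b_flash_attn - OSError: You are trying to access a gated repo.
"""
branch_summary = main_summary

# Tests that fail on the branch but not on main; empty means no regressions.
new_failures = failed_tests(branch_summary) - failed_tests(main_summary)
print(sorted(new_failures))  # []
```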

…31446) * dumb commit * nit * update * something like this * unpack in modeling utils * safe import * oups * update * nits * diff convert gemma * update * start propagating * udpate other modeling code as well * update for sliding window models * nits * more init cleanups * styling * fixup * noice * pass fixup * typo typing_extension -> typing_extensions * torch.nn.functionnal -> torch.nn.functional * add to import structure * unpack * simplify a bit more for this first version * nut * update * update * nit * ease the import of `Unpack` * remove useless `use_sliding_window` * no qua please * protect import? * style * [run-slow] * [run slow] llama,gemma,mistral,mixtral * remove extra kwargs * fix llama * address review comments * apply diff_model_converter to modeling_gemma.py * remove cache_position 1 * remove cache_position 2 * some cleaning * refactor gemma2 as well * apply review comments * rename file to modeling_flash_attention_utils.py * siglip refactor * remove dead code * is the hub down? * still down? * fix siglip * fix gemma2 * fatal: Could not read from remote repository. * fix typo in softcap implem * flacky * Failed: Timeout >120.0s --------- Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>

ArthurZucker added 5 commits May 31, 2024 13:24

dumb commit

b66fdb0

nit

029ee11

Merge branch 'main' into backend-compatible

9a7885d

update

ac3e5b5

something like this

a7c48bd

ArthurZucker added 2 commits June 17, 2024 11:16

unpack in modeling utils

682f221

safe import

2201178

This was referenced Jun 18, 2024

Reducing memory usage: removing useless logits computation in generate() #31292

Merged

Allow passing 2D attention mask #27640

Open

ArthurZucker added 19 commits June 21, 2024 09:58

oups

55a3503

update

b5cbaef

nits

7c6fdd7

diff convert gemma

08d7e1e

update

27044da

start propagating

ca316a0

udpate other modeling code as well

ea93267

update for sliding window models

4b67223

nits

d59ac0c

more init cleanups

a1d3866

styling

aea7f03

fixup

f1bedd0

noice

86e2edc

pass fixup

e90a944

typo typing_extension -> typing_extensions

093fbf5

torch.nn.functionnal -> torch.nn.functional

a1c56d2

add to import structure

1aad4a2

unpack

10bc1fa

simplify a bit more for this first version

9f08ddb

fxmarty added 4 commits July 8, 2024 13:14

siglip refactor

c92028a

remove dead code

7243993

is the hub down?

8b077d8

still down?

a9796bc

ArthurZucker commented Jul 9, 2024


ArthurZucker mentioned this pull request Jul 10, 2024

Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs #31629

Merged

5 tasks

fix siglip

6752a9c

fabianlim mentioned this pull request Jul 11, 2024

PaddingFree instructlab/training#113

Closed

fxmarty added 6 commits July 11, 2024 13:29

Merge branch 'main' into backend-compatible

3a9cf1b

fix gemma2

b4d1df5

fatal: Could not read from remote repository.

1e1bc2f

fix typo in softcap implem

c79ca83

flacky

30dc123

Failed: Timeout >120.0s

fae6843

fxmarty merged commit e314395 into main Jul 11, 2024
26 checks passed

fxmarty deleted the backend-compatible branch July 11, 2024 12:37

fxmarty mentioned this pull request Jul 11, 2024

[fix] AttributeError in is_flash_attn_greater_or_equal #31908

Closed

This was referenced Jul 11, 2024

Add Nemotron HF Support #31699

Merged

Granite language models #31502

Merged

vasqu mentioned this pull request Jul 15, 2024

Flash Attention Refactor Vasqu-Adibvafa/Mamba2#2

Merged

casper-hansen mentioned this pull request Jul 19, 2024

Migrate multipack to refactored flash attention axolotl-ai-cloud/axolotl#1774

Open

5 tasks

ArthurZucker mentioned this pull request Jul 22, 2024

Add GLM-4 and Later GLM Model (Draft) #31977

Closed

3 tasks

ArthurZucker mentioned this pull request Jul 24, 2024

Llama 3 - RuntimeError: shape '[-1, 0]' is invalid for input of size 41041920 #32170

Closed

4 tasks

amyeroberts mentioned this pull request Jul 28, 2024

[Idefics2] - Fix FA2 call for Perceiver layer #32275

Merged

TJ-Solergibert mentioned this pull request Jul 30, 2024

Adding SFT training swiss-ai/nanotron#14

Open

7 tasks

ArthurZucker mentioned this pull request Oct 4, 2024

Remove graph breaks for torch.compile() in flash_attention_forward when Lllama Model is padding free tuned #33932

Merged

5 tasks
