
[WIP] Add Mamba2 #32027

Closed
wants to merge 93 commits into from

Conversation

@vasqu (Contributor) commented Jul 17, 2024

What does this PR do?

As per title:
Paper: https://arxiv.org/abs/2405.21060
Repo: https://github.com/state-spaces/mamba

Mamba2 is a successor to Mamba that rethinks SSMs as a special type of attention (i.e. structured attention, analogous to causal attention in decoder-only models). This implementation allows all architecture types: pure Mamba2, hybrid Mamba2-Attention, and pure Attention (we mostly follow the Llama attention implementation where possible). Maybe there's more interest after Mistral released their code model yesterday: https://mistral.ai/news/codestral-mamba/ :)

There are still some TODOs left, but the overall architecture and functionality should be there:

  • Caching with RoPE (unsure if it is even cached); also check whether any transformations to the weights are necessary for RoPE, as done in Llama.
  • Additional warning about AMD compatibility (has been released after some time).
  • Update the causal mask --> there have been changes around static caches in Llama; I doubt it affects us with the hybrid cache, but just to be sure.
  • Hardware differences make it hard to gauge whether some tolerance limits should be this high (see test_left_padding_compatibility).
  • Flash attention tests in general.
  • Integration tests overall.
  • Possibly allow outputting the last SSM state of each block/layer, similar to outputting attention weights.
  • Possibly allow passing initial SSM states for the layers.

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @amyeroberts @gante @Adibvafa @pglorio

@ruipeterpan commented Jul 19, 2024

Thanks for the great work 🫡! Quick question about the causal convolution part in triton_kernels_forward. During prefill (generating the first token), cached_forward is False, so causal_conv1d_fn() is invoked. During autoregressive decoding, cached_forward is True, so causal_conv1d_update() is invoked. At that point, xBC has shape (batch_size, 1, dim), but causal_conv1d_update() requires x=xBC to have shape (batch_size, dim) or (batch, dim, seqlen). Are we missing a reshape operation on xBC ("b l d -> b d l"), like in https://github.com/state-spaces/mamba/blob/main/mamba_ssm/modules/mamba2.py#L293? A similar issue occurs when passing past_key_values to model.generate(): cached_forward is True, so causal_conv1d_update() is invoked, whereas xBC has shape (batch_size, seq_len, dim).
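
To make the shape mismatch concrete, here is a tiny standalone sketch (plain PyTorch with illustrative shapes only, not the actual PR code):

import torch

# Illustrative shapes for a single autoregressive decode step.
batch_size, seq_len, dim = 2, 1, 256

# xBC comes out of the projection as (batch_size, seq_len, dim) ...
xBC = torch.randn(batch_size, seq_len, dim)

# ... but causal_conv1d_update() wants channels-first, i.e. (batch_size, dim)
# or (batch, dim, seqlen), so a "b l d -> b d l" rearrange seems to be needed:
xBC_channels_first = xBC.transpose(1, 2)
print(xBC_channels_first.shape)  # torch.Size([2, 256, 1])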

Thanks in advance!

@vasqu (Contributor, Author) commented Jul 19, 2024

@ruipeterpan That's a great catch!

I think a simple transpose(1, 2) should fix it on more recent causal-conv1d versions (>= 1.4), whereas a squeeze will mostly work on older versions too. We would likely need a shape check, since the versions handle it differently and output different shapes (i.e. (bsz, dim) vs (bsz, dim, 1)). Nevermind, I missed a squeeze in the new code release.
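
Roughly what the fix could look like on our side, as an untested sketch (the version argument here is just a placeholder for however we end up detecting the installed causal-conv1d):

from packaging import version

import torch


def to_conv_update_layout(xBC: torch.Tensor, causal_conv1d_version: str) -> torch.Tensor:
    # xBC arrives as (batch_size, 1, dim) during single-step decoding.
    xBC = xBC.transpose(1, 2)  # -> (batch_size, dim, 1), channels-first
    if version.parse(causal_conv1d_version) < version.parse("1.4.0"):
        # Older releases only accept (batch_size, dim) on the update path.
        xBC = xBC.squeeze(-1)
    return xBC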

As for the generate issue: if you pass an initial cache, it should have the attribute has_previous_state set to False, so the first pass is a non-cached forward call and we then re-enter the first scenario (which is bugged :D). Or is that not the case, and even the first call already has the flag set to True? (Can't execute code atm.)

@ruipeterpan

Thanks for the clarification -- for using past_key_values, do you mean we need to manually set past_key_values.has_previous_state to False before passing it in? My usage is as follows; not sure if I'm doing this correctly:

out = model.generate(input_ids, return_dict_in_generate=True)
past_key_values = out.past_key_values  # reuse the cache from the first generate call
# past_key_values.has_previous_state = False  # adding this LOC resolves the issue, thanks!
out = model.generate(other_input_ids, past_key_values=past_key_values)

@vasqu (Contributor, Author) commented Jul 20, 2024

Yup, that's how you would do it. The problem here is that Mamba can only decode on a one-by-one basis once a cache is in use, so a first pass with seq_len > 1 is incompatible with a cache that already claims to have previous state. You basically have to reset the cache.

I do admit that it's rather unintuitive, though, to reset such a flag by hand. It would be cleaner with a separate method that handles it.
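
Something like this rough sketch is what I have in mind (class and method names here are placeholders, not necessarily what the PR will end up with):

class HybridMamba2Cache:  # placeholder name, not the actual cache class of this PR
    def __init__(self):
        # conv states, ssm states and the attention key/value cache would live here as well
        self.has_previous_state = False

    def reset_for_new_prompt(self):
        # The next forward pass then takes the full (non-cached) prefill path
        # instead of the single-step update path.
        self.has_previous_state = False

Your snippet above would then call past_key_values.reset_for_new_prompt() instead of flipping the flag by hand.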


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@vasqu closed this Aug 17, 2024