[Flamingo][multimodal] Vision encoder + text decoder #1150
Conversation
Co-authored-by: Kartikay Khandelwal <47255723+kartikayk@users.noreply.github.com>
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1150
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit f5d72b4 with merge base 069b12b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# projection
x = x.view(bsz, n_ims, n_tiles, n_tokens, embed_dim)
x = torch.cat([x, hidden_states], dim=-1)
Add shape comment here
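For example, the requested shape comments might look roughly like this; the exact layout of hidden_states is an assumption based on the concatenation along the last dim:

```python
# x: [bsz, n_ims, n_tiles, n_tokens, embed_dim] after regrouping the image/tile dims
x = x.view(bsz, n_ims, n_tiles, n_tokens, embed_dim)

# hidden_states is assumed to share the leading dims [bsz, n_ims, n_tiles, n_tokens, ...],
# so after concatenating along the last dim:
# x: [bsz, n_ims, n_tiles, n_tokens, embed_dim + hidden_dim]
x = torch.cat([x, hidden_states], dim=-1)
```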
@@ -227,3 +227,233 @@ def forward(
# reshape the output to be the same shape as the input
output = output.transpose(1, 2).contiguous().view(bsz, seq_len, -1)
return self.output_proj(output)


class GroupedQueryAttention(nn.Module):
It's not clear to me why this is in modules/attention.py but other stuff goes in models/flamingo. I would suggest moving this under models/flamingo too to be consistent unless there's a clear reason not to.
GQA isn't specific to Flamingo; for example, it's referenced in Meta's MobileLLM paper: https://arxiv.org/abs/2402.14905
So our "primary" attention implementation already supports GQA and MQA [ref]. I guess the question here is: if this entire module is specific to Flamingo, should it just reside within that folder until we figure out how to merge it back? That said, I'll raise the same concern as I did in Philip's PR: GQA isn't a great name for this module. I'd prefer something like MMMultiHeadAttention or similar.
Yeah, our naming here is kind of unclear. We already support GQA, but via our CausalSelfAttention class, so any new functionality here is mainly to support cross-attention. As is, the naming of these two classes doesn't actually convey how they differ. My two cents is that we should just call them MultiHeadAttention and MultiHeadCrossAttention to get this point across, but I think the renaming of CausalSelfAttention can be saved for another day.
will rename it after we are 100% sure :P
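For readers following along, here is a minimal sketch of what grouped-query attention refers to in this thread: num_heads query heads share a smaller set of num_kv_heads key/value heads, which are repeated per query group. The names and signature below are illustrative only, not torchtune's actual API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GQASketch(nn.Module):
    """Illustrative grouped-query attention: num_heads query heads share num_kv_heads K/V heads."""

    def __init__(self, embed_dim: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, num_kv_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(num_heads * self.head_dim, embed_dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = x.shape
        q = self.q_proj(x).view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(bsz, seq_len, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each K/V head so every group of query heads has a matching K/V head
        repeat = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)  # [bsz, num_heads, seq_len, head_dim]
        out = out.transpose(1, 2).contiguous().view(bsz, seq_len, -1)
        return self.out_proj(out)
```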
attn_scale: Optional[nn.Module] = None,
mlp_scale: Optional[nn.Module] = None,
Sorry to keep harping on this, but I still don't understand why we don't just provide separate versions of the self-attention and MLP modules with scaling (see e.g. here for what the MLP would look like). Then we can provide different builders for TransformerSelfAttentionLayer with and without scaling, and users don't have to figure out what attn_scale and mlp_scale mean.
I think both ways work. I like your idea, but I also think it's convenient if the module already provides it. One benefit of keeping this logic in the transformer module is that we don't need 2x the implementation of every attention and MLP module, one gated and one that isn't. The con is that this module gets a bit bloated.
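To make the trade-off concrete, here is a minimal sketch (hypothetical names, not the exact torchtune API) of the pattern under discussion: the layer accepts optional attn_scale / mlp_scale modules and applies them to the branch outputs, with a zero-initialized tanh gate as the Flamingo-style scale. Passing None keeps the layer unscaled, which is what avoids duplicating gated and ungated variants of every module:

```python
from typing import Optional

import torch
import torch.nn as nn


class TanhGate(nn.Module):
    """Learnable tanh gate, initialized to zero so the gated branch starts as a no-op."""

    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.tanh(self.scale)


class GatedTransformerLayerSketch(nn.Module):
    """Residual layer where attn_scale / mlp_scale default to identity when not provided."""

    def __init__(
        self,
        attn: nn.Module,
        mlp: nn.Module,
        attn_scale: Optional[nn.Module] = None,
        mlp_scale: Optional[nn.Module] = None,
    ):
        super().__init__()
        self.attn = attn
        self.mlp = mlp
        self.attn_scale = attn_scale if attn_scale is not None else nn.Identity()
        self.mlp_scale = mlp_scale if mlp_scale is not None else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn_scale(self.attn(x))  # scaled attention branch
        x = x + self.mlp_scale(self.mlp(x))    # scaled MLP branch
        return x
```

The alternative in the comment above would instead bake a TanhGate into dedicated gated attention/MLP modules and expose two builders, keeping the layer itself scale-free.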
from torchtune.modules import GroupedQueryAttention


class TransformerSelfAttentionLayer(nn.Module):
Would be good to just rename this to have "Multimodal" in the prefix.
This is also used in CLIP. I think we should only use the MM prefix for modules that touch text and image at the same time, which is not the case for this module. What do you think?
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff            @@
##             main    #1150       +/-   ##
===========================================
+ Coverage   26.76%   69.14%   +42.38%
===========================================
  Files         205      225       +20
  Lines        9301    10096      +795
===========================================
+ Hits         2489     6981     +4492
+ Misses       6812     3115     -3697

☔ View full report in Codecov by Sentry.
@tarun292 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Context
What is the purpose of this PR? Is it to
- add a new feature
- fix a bug
- update tests and/or documentation
- other (please add here)

- Added flamingo vision encoder
- Ported text decoder from: Fused SelfAttention and Cross Attention Decoder #1146
- Updated CLIP attention module

TODO:
Changelog
torchtune/models/clip/_component_builders.py
torchtune/models/flamingo/_encoders.py
torchtune/models/flamingo/_component_builders.py
torchtune/modules/feedforward.py
torchtune/modules/multimodal_transformer.py
torchtune/modules/attention.py
torchtune/modules/model_fusion.py
torchtune/modules/tanh_gate.py
Test plan
CLIP tests pass, except one regression test.
Flamingo vision encoder shape tests pass; regression tests are still needed.
Flamingo text decoder instantiates, but has no shape or regression tests yet.
Need tests for the individual new modules, but testing the models as a whole should work for now.
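As a rough illustration of the distinction above: a shape test only asserts the output layout, while a regression test also pins numerics against a value captured from a trusted run. The module below is a hypothetical stand-in; the real Flamingo builders and signatures may differ.

```python
import torch


class DummyEncoder(torch.nn.Module):
    """Hypothetical stand-in for one of the new modules; not the real torchtune API."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def test_dummy_encoder_shape():
    bsz, n_ims, n_tiles, n_tokens, embed_dim = 2, 1, 4, 16, 32
    model = DummyEncoder(embed_dim)
    out = model(torch.randn(bsz, n_ims, n_tiles, n_tokens, embed_dim))
    # Shape test: only the output layout is checked
    assert out.shape == (bsz, n_ims, n_tiles, n_tokens, embed_dim)


def test_dummy_encoder_regression():
    # Regression test: fix the seed and compare against a constant captured from a
    # reference run, e.g. torch.testing.assert_close(out.mean(), torch.tensor(EXPECTED_MEAN))
    torch.manual_seed(0)
    model = DummyEncoder(4)
    out = model(torch.ones(2, 4))
    assert out.shape == (2, 4)  # placeholder; EXPECTED_MEAN would be hard-coded from a trusted run
```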
- run pre-commit hooks and linters (make sure you've first installed them via pre-commit install)
- add unit tests for any new functionality
- update docstrings for any new or updated methods or classes
- run unit tests via pytest tests
- run recipe tests via pytest tests -m integration_test
- manually run any new or modified recipes with sufficient proof of correctness