Flamingo Model Components #1357
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1357
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit ce16f38 with merge base f437639.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #1357       +/-   ##
===========================================
- Coverage   72.72%   27.59%    -45.14%
===========================================
  Files          271      277         +6
  Lines        12811    13040       +229
===========================================
- Hits          9317     3598      -5719
- Misses        3494     9442      +5948

☔ View full report in Codecov by Sentry.
if key.startswith("layer"):
    new_key = key.replace("layer.", "")
local_key = key[len(prefix) :]
if local_key.startswith("layer"):
I think this should be:
if local_key.startswith("layer.layer"):
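For readers following the thread, here is a minimal, self-contained sketch of the kind of prefix and key remapping being discussed; the function name and the exact key layout are illustrative assumptions, not the actual torchtune state-dict hook.

```python
# Hypothetical sketch of the key-remapping discussion above; names and the
# exact key layout are assumptions, not the actual torchtune hook.
def _remap_keys(state_dict: dict, prefix: str) -> dict:
    remapped = {}
    for key, value in state_dict.items():
        local_key = key[len(prefix):]
        # Checking "layer.layer" (rather than just "layer") avoids rewriting
        # keys such as "layernorm.weight" that merely begin with "layer".
        if local_key.startswith("layer.layer"):
            local_key = local_key.replace("layer.layer.", "layer.", 1)
        remapped[prefix + local_key] = value
    return remapped

sd = {"decoder.layer.layer.attn.q_proj.weight": 0, "decoder.layernorm.weight": 1}
print(_remap_keys(sd, "decoder."))
# {'decoder.layer.attn.q_proj.weight': 0, 'decoder.layernorm.weight': 1}
```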
) -> Tensor:
    """
    Args:
        x (Tensor): input tensor with shape [b x i x t x e x d]
sorry this hurt my tiny brain
Suggested change:
-        x (Tensor): input tensor with shape [b x i x t x e x d]
+        x (Tensor): input tensor with shape [b, i, t, e, d]
            from the encoder. Each hidden state has the same shape as x.

    Returns:
        Tensor: output tensor of a sequence of embedings [b x s x d]
Suggested change:
-        Tensor: output tensor of a sequence of embedings [b x s x d]
+        Tensor: output tensor of a sequence of embedings [b, s, d]
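To make the notation concrete, here is a small illustrative sketch of how a [b, i, t, e, d] input collapses to a [b, s, d] sequence with s = i * t * e; the dimension sizes below are made up.

```python
import torch

# Made-up sizes: batch, images, tiles, embeds per tile, embed dim
b, i, t, e, d = 2, 1, 4, 257, 512
x = torch.randn(b, i, t, e, d)

s = i * t * e                 # sequence length is just the product of i, t, e
out = x.reshape(b, s, d)      # [b, s, d]
assert out.shape == (b, 1 * 4 * 257, d)
```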
Mostly a bunch of nits for now, will give a more thorough pass tomorrow though
@@ -1,6 +1,6 @@
 from torchtune.models.clip._transforms import CLIPImageTransform

-def _clip_vit_224_transform():
+def clip_vit_224_transform():
Should be added to public API now?
There is zero reason for a builder to exist and not be public. It's not a utility but a model. The only reason it's not added to the docs yet is because there should be a model builder as well, not just the transform.
stitch these building blocks into higher-level components. This design has
two benefits:
- The building blocks themselves are very flexible. For example, ``GroupedQueryAttention``
  can take either nn.Linear or nn.LoRALinear for ``q_proj``.
- Builder functions expose a set of configurable params which keep the constructors of
  the building blocks simple.
imo this is not the place to sell the design
num_heads (int): The number of attention heads in each transformer layer.
clip_embed_dim (int): The dimensionality of each patch embedding in CLIP.
clip_num_layers (int): The number of transformer layers.
clip_hidden_states (Optional[List[int]]): The indices of CLIP hidden layers to return
In the function it's List[int], not Optional.
@@ -14,7 +16,7 @@ class FeedForward(nn.Module):
     gate_proj (nn.Module): Projection from input dim to hidden dim, fed through activation
         and multiplied by up_proj.
     down_proj (nn.Module): Final projection to output dim.
-    up_proj (nn.Module): Projection from input dim to hidden dim, multiplied by
+    up_proj (Optional[nn.Module]): Projection from input dim to hidden dim, multiplied by
Good thing this class wasn't super well-defined anyways. But might be worth updating the docstring to explain the case of no up_proj
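For context, a simplified sketch (an assumption based on the docstring above, not the exact torchtune class) of how an optional up_proj changes the forward pass:

```python
import torch
from torch import nn
from typing import Optional

class FeedForwardSketch(nn.Module):
    """Simplified stand-in for torchtune's FeedForward to show the no-up_proj case."""

    def __init__(self, gate_proj: nn.Module, down_proj: nn.Module,
                 activation: nn.Module, up_proj: Optional[nn.Module] = None):
        super().__init__()
        self.gate_proj, self.down_proj, self.up_proj = gate_proj, down_proj, up_proj
        self.activation = activation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.activation(self.gate_proj(x))
        if self.up_proj is not None:
            # SwiGLU-style MLP: gated branch multiplied by the up projection
            h = h * self.up_proj(x)
        # With up_proj=None this degenerates to a plain two-layer MLP
        # (e.g. a CLIP-style GELU MLP).
        return self.down_proj(h)

mlp = FeedForwardSketch(nn.Linear(16, 64), nn.Linear(64, 16), nn.GELU())
print(mlp(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```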
class TestFlamingoVisionEncoder:
    def test_flamingo_text_decoder_initialization(self, decoder_config):
        # Attempt to instantiate the Flamingo text decoder
        try:
            decoder = flamingo_decoder(**decoder_config)
            print("Flamingo text decoder instantiated successfully.")
        except Exception as e:
            pytest.fail(f"Failed to instantiate Flamingo text decoder: {str(e)}")
Maybe I'm being dense but what is the point of this test?
It was a placeholder that's been updated.
def clip_mlp(in_dim: int, out_dim: int, hidden_dim: int, activation: nn.Module, quantize_base: bool = False) -> FeedForward:
    """
    Build the MLP layer associated with the clip model.
Not the CLIP model, right? I feel like we can be a bit more explicit here
You mean CLIP ViT? I feel this is in line with the naming for all our other model-specific sub-builders.
torchtune/modules/transformer.py (outdated)
@@ -338,6 +339,7 @@ def __init__(
     self.num_heads = num_heads
     self.head_dim = head_dim
     self.causal_mask = None
+    self.cur_pos = None
Can you explain what this is about?
This tracks position for the TransformerDecoder since the layers also track it during decoding. This is already obsolete as Salaman is updating how caching handles position, but I'll leave this in until that is updated. This is basically a fix for the previous update around input_pos and the kv-cache.
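For readers unfamiliar with the kv-cache flow, here is a toy sketch of why a decoder might track a current position internally during generation; this illustrates the idea only and is not the actual TransformerDecoder code.

```python
import torch
from torch import nn

class PositionTracker(nn.Module):
    """Toy illustration: with a kv-cache, each incremental forward call needs to
    know which absolute positions the newly decoded tokens occupy."""

    def __init__(self):
        super().__init__()
        self.cur_pos = None  # reset whenever caches are set up again

    def next_positions(self, num_new_tokens: int) -> torch.Tensor:
        start = 0 if self.cur_pos is None else self.cur_pos
        self.cur_pos = start + num_new_tokens
        return torch.arange(start, self.cur_pos)

tracker = PositionTracker()
print(tracker.next_positions(4))  # prompt of 4 tokens -> positions [0, 1, 2, 3]
print(tracker.next_positions(1))  # next decoded token  -> position [4]
```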
attn = MultiHeadAttention(
    embed_dim=embed_dim,
    num_heads=num_heads,
    num_kv_heads=num_kv_heads,
    head_dim=head_dim,
    q_proj=nn.Linear(embed_dim, num_heads * head_dim, bias=False),
    k_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),
    v_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),
    output_proj=nn.Linear(embed_dim, embed_dim, bias=False),
    q_norm=RMSNorm(dim=head_dim, eps=1e-05),
    k_norm=RMSNorm(dim=head_dim, eps=1e-05),
    pos_embeddings=None,
    max_seq_len=max_seq_len,
    is_causal=False,
    attn_dropout=0.0,
)
nbd yet but come LoRA time might want a builder for this kinda stuff
I'm not sure what you're referring to
Ignore me
Sorry all I did was correct typos. But that just means the design is solid through and through, I have no major complaints there. Please do address my old comments too though. Preemptive stamp so that you're unblocked
    in_dim=embed_dim,
    hidden_dim=4 * embed_dim,
    out_dim=embed_dim,
    activation=activation(),
nit: if activation is truly a Callable as you've typed it, this won't work
What would this be called? A Class?
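A small sketch of the typing point (the builder name here is hypothetical): if the builder calls activation(), the annotation should describe a class or zero-argument factory, e.g. Callable[[], nn.Module], rather than an instantiated nn.Module.

```python
from typing import Callable
from torch import nn

def build_mlp(embed_dim: int, activation: Callable[[], nn.Module] = nn.GELU) -> nn.Module:
    # Hypothetical builder: activation is a class (or any zero-arg factory),
    # so calling activation() constructs a fresh module instance here.
    return nn.Sequential(
        nn.Linear(embed_dim, 4 * embed_dim),
        activation(),
        nn.Linear(4 * embed_dim, embed_dim),
    )

mlp_gelu = build_mlp(64)            # default nn.GELU
mlp_silu = build_mlp(64, nn.SiLU)   # passing a different class also works
```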
    num_layers (Optional[int]): Number of Transformer Decoder layers, only define when
        layers is not a list.
What's our long-term plan for this? Will we continue to support both cases?
This was in the TransformerLayer refactor. I left it in to support both because otherwise the builders get a lot less clean, but we could update that in the future.
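A hedged sketch of what supporting both call patterns can look like (simplified; not the actual TransformerDecoder logic):

```python
import copy
from typing import List, Optional, Union
from torch import nn

def build_layers(
    layers: Union[nn.Module, List[nn.Module]],
    num_layers: Optional[int] = None,
) -> nn.ModuleList:
    if isinstance(layers, list):
        # Explicit list (e.g. interleaved self-attention / cross-attention
        # layers); num_layers is ignored in this case.
        return nn.ModuleList(layers)
    # Single template layer: num_layers must be given and the layer is cloned.
    if num_layers is None:
        raise ValueError("num_layers must be set when a single layer is passed")
    return nn.ModuleList([copy.deepcopy(layers) for _ in range(num_layers)])
```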
    output.shape == expected_shape
), f"Expected shape {expected_shape}, but got {output.shape}"

assert_expected(output.mean(), torch.tensor(-9.47548e-5), atol=1e-3, rtol=1e-3)
Is it expected to be so close to zero? Whenever the tolerance is a couple orders of magnitude larger than the actual value it makes me a bit nervous
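To spell out the concern with numbers: because the expected mean is on the order of 1e-4, an atol of 1e-3 dominates the check and lets much larger values slip through.

```python
import torch
from torch.testing import assert_close

expected = torch.tensor(-9.47548e-5)
# Passes even though the "actual" value is off by roughly 5x in magnitude
# (and has the wrong sign), since |actual - expected| < atol + rtol * |expected| ~= 1e-3.
assert_close(torch.tensor(5e-4), expected, atol=1e-3, rtol=1e-3)
```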
- i: number of images
- t: number of tiles (where a single image is broken into multiple tiles)
- e: number of embeds per tile (e.g. CLS embed + patch embeds, etc.)
- s: sequence length computed by i*t*e
nit: it just equals i*t*e, right?
    clip (nn.Module): CLIP encoder vision model
    projection_head (nn.Module): projection_head that takes embeddings
        with dimension encoder_dim as input and outputs embeddings of
        size decoder_dim.
nbd to leave these both as general nn.Modules but they do have very specific signatures that make it hard to plug in any old nn.Module. Maybe point to the relevant CLIP and projection classes as an example or something
This being inside of the flamingo folder, it's not really meant to be reused for other purposes. I can just set the type to be the specific modules.
def __call__(self, sample: Mapping[str, Any]) -> Mapping[str, Any]:
    """
    Apply image decoding and transformations to the "images" field in the sample
    and tokenizization to the "messages" field in the sample. Also returns the
Suggested change:
-    and tokenizization to the "messages" field in the sample. Also returns the
+    and tokenization to the "messages" field in the sample. Also returns the
        The extra text will still get tokenized as normal text, not as special tokens. Default is None.

Examples:
    >>> model_transform = FlamingoTransform("/path/to/tokenizer.model", tile_size=256)
nit: I think you also need to provide patch_size in this example
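i.e. something along the lines of the example below, which mirrors the docstring example under discussion; the patch_size value shown is purely illustrative, not necessarily what the docstring intends.

```python
>>> model_transform = FlamingoTransform(
...     "/path/to/tokenizer.model",
...     tile_size=256,
...     patch_size=40,  # illustrative value only
... )
```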
from torchtune.modules.transformer import _get_clones


class FlamingoProjectionHead(nn.Module):
nit: is Head the right term? I would think a Head is something that attaches on top of the hidden states of a transformer and not a full model. I thought we had referred to this as an Adapter?
Projection heads have been around a long time and can vary a lot in architecture. The main point here is that it's learning a projection from the pretrained encoder to the pretrained decoder.
class FlamingoProjectionHead(nn.Module):
    """Projection transformer to adapt the output of a
Would like to see more details here, specifically on how this is used to map from encoder hidden dim to the decoder hidden dim in the cross attention layer
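A rough, hedged sketch (not the actual implementation) of the shape flow being asked about: the projection layers operate at the encoder dimension, and a final linear maps encoder_dim to decoder_dim so the decoder's cross-attention layers can consume the result.

```python
import copy
import torch
from torch import nn

class ProjectionHeadSketch(nn.Module):
    def __init__(self, layer: nn.Module, num_layers: int,
                 encoder_dim: int, decoder_dim: int):
        super().__init__()
        self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(num_layers)])
        self.output = nn.Linear(encoder_dim, decoder_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [b, s, encoder_dim] image embeddings from the vision encoder
        for layer in self.layers:
            x = layer(x)
        # [b, s, decoder_dim]: ready to be cross-attended to by the text decoder
        return self.output(x)

head = ProjectionHeadSketch(nn.Identity(), num_layers=2, encoder_dim=512, decoder_dim=1024)
print(head(torch.randn(2, 1028, 512)).shape)  # torch.Size([2, 1028, 1024])
```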
    self,
    layer: nn.Module,
    num_layers: int,
    output: nn.Module,
should we just hardcode this to nn.Linear? I'm wondering if it makes more sense for a user to configure encoder_dim -> decoder_dim rather than the output module directly
I can update this; I was just following our typical pattern, but this would never need to be customized by a user anyway.
def forward(
    self,
    x: Tensor,
nit: we've been using torch.Tensor everywhere
I lost that vote :(
Args:
    sample (Mapping[str, Any]): A sample with a "tokens", "mask",
        "encoder_input" and "encoder_mask" field to feed directly into the model.
we don't expect encoder_input, encoder_mask to already be in sample, we should expect "images" though
We do expect encoder_input to be in sample by this point, look at FlamingoTransform. I agree there is some questionable generalization here, but I think we should address that when it comes up. In summary, to allow unpacking of any arbitrary input for the encoder, we treat it as a dictionary.
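To illustrate the pattern being described (field names and shapes here are illustrative assumptions): because encoder_input is a dict, the fusion model can unpack whatever kwargs a given encoder needs.

```python
import torch

# Illustrative batch produced by the transform; shapes are made up.
batch = {
    "tokens": torch.randint(0, 128_256, (2, 32)),
    "encoder_input": {
        "images": torch.randn(2, 1, 4, 3, 224, 224),
        "aspect_ratio": torch.tensor([[2, 2], [2, 2]]),
    },
    "encoder_mask": torch.ones(2, 32, 4 * 257, dtype=torch.bool),
}

def toy_encoder(images: torch.Tensor, aspect_ratio: torch.Tensor) -> torch.Tensor:
    # Stand-in for the vision encoder: any callable whose signature matches the
    # keys of encoder_input works, which is the point of packaging them as a dict.
    return torch.randn(images.shape[0], 4 * 257, 512)

encoder_embed = toy_encoder(**batch["encoder_input"])
print(encoder_embed.shape)  # torch.Size([2, 1028, 512])
```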
@@ -25,7 +25,7 @@
     "<|eom_id|>": 128008,
     "<|eot_id|>": 128009,
     "<|python_tag|>": 128010,
-    "<|image|>": 128011,
+    "<|image|>": 128256,
have we verified that this is the official image token id...? When I first added this it was 128011
Long story, but both numbers were supposed to be correct. One for finetuning and one for inference, but now it's just the one.
@@ -338,6 +339,7 @@ def __init__(
     self.num_heads = num_heads
     self.head_dim = head_dim
     self.causal_mask = None
+    self.pos = None
is this for generation? if so, a quick comment would be helpful
            where length of list == number of images in sample
        - tokens (List[int]): original tokens
        - images (List[torch.Tensor]): original images
    Mapping[str, Any]: sample with a new key encoder_mask, with a mask per image with shape
why are we packaging multiple keys into encoder_input? also, you mention encoder_mask here but use encoder_input below. Would also be good to keep the bullets about tokens and images
Co-authored-by: Rafi Ayub <33648637+RdoubleA@users.noreply.github.com>
Co-authored-by: ebsmothers <ebs@meta.com>
Context
What is the purpose of this PR? Is it to
Reimplementation of #1150 based on refactor
Changelog
Adds new flamingo folder
Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.)
- pre-commit install
- pytest tests
- pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Example of docstring:
torchtune/torchtune/modules/vision_transformer.py, line 285 in 6a7951f
Example in our docs: https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#applying-qat-to-llama3-models