Gemma #630
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/630
Note: Links to docs will display an error until the docs builds have been completed.
❌ 8 New Failures as of commit 321f59e with merge base aacaadd. The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This is great - thanks for the contribution! I left a couple comments, but generally looks good.
Can you add a screenshot of running a distributed full finetune with Gemma to confirm it works?
return FeedForward(gate_proj=gate_proj, down_proj=down_proj, up_proj=up_proj, activation=activation)

def lora_gemma(
Looks like you included the LoRA version of Gemma for this PR. Are you planning on including LoRA, as well, or just starting with the full fine-tuning version?
It may take a longer time to complete the LoRA version of Gemma. Could I PR the full fine-tuning version first?
I think it should be fine to start with just full fine-tune for now
@solitude-alive Can you remove all the LoRA code since we won't be addressing it in this PR?
Yeah, I removed them in the latest version.
):
    super().__init__()
    self.w1 = gate_proj
    self.w2 = down_proj
    self.w3 = up_proj
    self.activation = F.silu
Good abstraction!
@@ -11,14 +11,17 @@
import torch
import torch.nn as nn
import torch.optim as optim
from safetensors import safe_open
Can you add this to requirements.txt?
Yeah, thank you for your suggestion, I added it in the latest version.
        TransformerDecoder: Instantiation of Gemma 2B model
    """
    return gemma(
        vocab_size=256_000,
Still shocked by this vocab size - so large!
Yeah, 😂
Does this mean the embedding(/output projection since they're tied) constitutes a full 25% of their params?!
I calculated it with count_trainable_parameters; the params of embed_tokens are about 21%.
def count_trainable_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
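As a rough sanity check (assuming vocab_size=256_000 and embed_dim=2048 from the builder above, plus an approximate ~2.5B total parameter count for Gemma 2B, which is not stated in this PR), the embedding share works out to roughly 21%:

embed_params = 256_000 * 2048          # tok_embeddings: ~524M parameters
total_params = 2.5e9                   # approximate total parameter count for Gemma 2B
print(f"embedding share: {embed_params / total_params:.1%}")  # roughly 21%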
recipes/full_finetune_distributed.py
Outdated
@@ -203,6 +210,7 @@ def _setup_model(
    cfg_model: DictConfig,
    enable_activation_checkpointing: bool,
    model_state_dict: Dict[str, Any],
    mode_tie: bool = False,
Suggested change:
- mode_tie: bool = False,
+ model_tie: bool = False,
Thank you for your suggestion, I have fixed it in the latest version.
recipes/full_finetune_distributed.py
Outdated
@@ -259,6 +267,10 @@ def _setup_model(
        ),
    )

    if mode_tie:  # Tie the weights of the model if required
Suggested change:
- if mode_tie:  # Tie the weights of the model if required
+ if model_tie:  # Tie the weights of the model if required
Thank you for your suggestion, I have fixed it in the latest version.
Thanks for this PR! Really excited to see how nicely this is shaping up.
Re testing, aside from making sure training runs, let's try to get a sanity check that the model forward here lines up with the one from the original implementation on some dummy data (assuming you haven't done so already). We have a bunch of scripts we've used in the past for this with various components in the library, so you can use these as a reference if it helps. For example (Note: you do not have to actually write a script like this and check it in, this is meant more as a reference if it helps you)
recipes/configs/gemma/2B_full.yaml
Outdated
#   --config gemma/2B_full \
#   checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config works best when the model is being fine-tuned on 2+ GPUs.
nit: I think if we are running with the full_finetune_distributed recipe it will only work on 2+ GPUs.
Yeah, I updated it in the latest version.
def gemma_2b() -> TransformerDecoder:
    """
    Builder for creating a Gemma 2B model initialized w/ the default 2b parameter values
nit: add pointer to the paper or blog post here
Yeah, I added it in the latest version.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

from torchtune.utils._distributed import contains_fsdp
from transformers.utils import is_safetensors_available
I think we shouldn't import from transformers here as it's not in our core dependencies. If you've added safetensors to our core dependencies (based on the above comment) probably don't need to do this check anyways.
Yeah, I removed it in the latest version.
torchtune/modules/feed_forward.py
Outdated
@@ -25,12 +26,13 @@ def __init__(
    gate_proj: nn.Module,
    down_proj: nn.Module,
    up_proj: nn.Module,
    activation: nn.Module = F.silu,
nit: technically F.silu is a Callable, not an nn.Module
Yeah, I replaced it with nn.SiLU() in the latest version.
        Default: False

    Returns:
        FeedForward: instantiation of the MLP module with LoRA applied to
I think the second line of this docstring got lost somewhere along the way
Yeah, I fixed it in the latest version.
recipes/full_finetune_distributed.py
Outdated
if cfg.checkpointer.model_type == "GEMMA":
    model_tie = True
else:
    model_tie = False
We could also consider associating a weight tying config explicitly with the model type and using that in the checkpointer. E.g.
@dataclass
class ModelType:
    name: str
    weight_tying_config: Dict[str, str] = field(default_factory=dict)
Then Gemma would be ModelType(name="GEMMA", weight_tying_config={"tok_embeddings.weight": "output.weight"}).
(Anyways, not a blocker for this PR as it's more of a design question)
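For illustration, a hedged sketch of how a checkpointer utility might consume such a config (the apply_weight_tying helper and its exact shape are hypothetical, not existing torchtune code):

from dataclasses import dataclass, field
from typing import Dict

import torch

@dataclass
class ModelType:
    name: str
    weight_tying_config: Dict[str, str] = field(default_factory=dict)

GEMMA = ModelType(name="GEMMA", weight_tying_config={"tok_embeddings.weight": "output.weight"})

def apply_weight_tying(state_dict: Dict[str, torch.Tensor], model_type: ModelType) -> None:
    # For each (source, target) pair, point the target key at the source tensor,
    # e.g. output.weight <- tok_embeddings.weight for Gemma.
    for src_key, dst_key in model_type.weight_tying_config.items():
        state_dict[dst_key] = state_dict[src_key]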
Thank you for your suggestion, I updated this in gemma_full_finetune.py.
recipes/full_finetune_distributed.py
Outdated
@@ -259,6 +267,10 @@ def _setup_model(
        ),
    )

    if model_tie:  # Tie the weights of the model if required
        model.output.weight = model.tok_embeddings.weight
It was pointed out by @rohan-varma that this may not actually do what we expect because FSDP has already sharded the params, so let's double-confirm via testing that the weights are tied correctly here.
Thank you for pointing it out. I checked the model weights after training and they are not the same. Is there any solution? I'm not familiar with that. This can cause some problems if the weights are tied before FSDP (issue).
OK sorry for the back and forth on this. Confirmed with @rohan-varma that we should not tie weights after FSDP wrapping after all. The main issue was not FSDP but the initialization on meta device. Unfortunately, weight tying + meta device is tricky because the usage of to_empty breaks existing references.
Instead, for Gemma we can do everything on CPU without using meta device at all, basically initializing the model on CPU for every rank and then defining a more vanilla FSDP without the param_init_fn we currently have. This should work fine for smaller models (at least up to 7B). @kartikayk put together a snippet on what this can look like, you can find it here.
We need to decide what the best way to expose this is, but for now feel free to create a separate recipe for Gemma, e.g. gemma_full_finetune.py. It should look pretty much the same as the existing full_finetune_distributed.py, but with the changes needed to initialize everything on CPU and perform weight tying there before wrapping with FSDP.
Thanks also to @awgu for helping debug this.
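For reference, a minimal sketch of what that recipe change could look like (assuming the gemma_2b() builder from this PR and a plain FSDP wrap; the function name, import path, and exact arguments here are illustrative, not the final recipe code):

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

from torchtune.models.gemma import gemma_2b

def setup_gemma_model(model_state_dict):
    # Build and load the model on CPU on every rank (no meta device), so the
    # tied parameters share a real reference before FSDP shards them.
    model = gemma_2b()
    model.load_state_dict(model_state_dict)

    # Tie the output projection to the token embedding *before* FSDP wrapping.
    model.output.weight = model.tok_embeddings.weight

    # Vanilla FSDP wrap: no param_init_fn, since params are already materialized.
    model = FSDP(model, device_id=torch.cuda.current_device())
    return model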
Thank you for your suggestion, it works well.
Co-authored-by: ebsmothers <ebs@meta.com>
Thanks for adding this PR @solitude-alive! This would be an awesome contribution to the repo!
Similar to @joecummings I have some questions in the code. My biggest question though is correctness. The loss from the screenshot seems to be much higher than what we've seen with Mistral/Llama2. Have you compared this loss for Gemma with the official implementation or some other implementation? Or have you seen some issues/blogs which showcase the loss value during training that we can compare against?
Also, when adding models we provide some evidence of model numerical correctness - this is really important to build confidence with our users. Please see how we did this for Llama2 13B and Mistral 7B in the context section of this PR: #571. It would be great if you can add a similar check for Gemma 2B. This check would look something like (see the sketch after this list):
- Load official implementation of Gemma2B and take a random tensor, run forward and get output
- Load torchtune implementation, take same tensor, run forward and get output
- Compare outputs with torch.allclose and make sure this returns a True.
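A rough sketch of such a parity check (assuming the Hugging Face google/gemma-2b checkpoint as the reference and the gemma_2b() builder from this PR; loading the converted state dict into the torchtune model is omitted here):

import torch
from transformers import AutoModelForCausalLM

from torchtune.models.gemma import gemma_2b

# Reference implementation (Hugging Face), in eval mode.
hf_model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
hf_model.eval()

# torchtune implementation; loading the converted state dict is omitted in this sketch.
tune_model = gemma_2b()
tune_model.eval()

tokens = torch.randint(0, 256_000, (1, 16))
with torch.no_grad():
    hf_logits = hf_model(tokens).logits
    tune_logits = tune_model(tokens)

# The two outputs should match within a small tolerance.
print(torch.allclose(hf_logits, tune_logits, atol=1e-4, rtol=1e-4))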
recipes/full_finetune_distributed.py
Outdated
@@ -259,6 +267,10 @@ def _setup_model(
        ),
    )

    if model_tie:  # Tie the weights of the model if required
I find the use of model_tie to be a bit unintuitive. Can we rename this to something like share_weights or share_embed since I don't think we'll have other modules we share?
Yeah, I updated them in the latest version.
@@ -1,6 +1,7 @@
# Hugging Face Integration Reqs
datasets
huggingface_hub
safetensors
@joecummings do we need to explicitly add this if it's a part of huggingface_hub? I guess it's good practice to explicitly call it out?
Tie the weights of the output embeddings and the token embeddings in the model.

Args:
    model (TransformerDecoder): The to tie the weights of the output embeddings and the token embeddings.
This sentence is missing some info: "the to tie" reads a bit weird
Sorry, I modified it in the latest version.
        num_kv_heads=1,
        embed_dim=2048,
        intermediate_dim=16384,
        max_seq_len=32768,
Is this right? I thought this was 8192 for Gemma 2B
Sorry, that was my mistake; I fixed it in the latest version.
@@ -383,6 +383,14 @@ def load_checkpoint(self) -> Dict[str, Any]:
    dim=self._config["hidden_size"],
)

if (
    self._model_type == "GEMMA"
Hmm so I have a question about this code. hf_to_tune makes an assumption that head_dim * num_heads = dim (see here). But this isn't true for Gemma 7B, where num_heads=16 and head_dim=256 but dim=3072 and not 4096. So we will need to differentiate between Gemma 2B and 7B here.
Also, please move to a utility function in checkpointer_utils so we can keep this code clean.
If I allow explicitly passing the parameter num_heads to the function hf_to_tune, is this allowed?
And I moved them to a utility function in checkpointer_utils.
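For context, a tiny illustration of the head_dim assumption being discussed (the resolve_head_dim helper is hypothetical; the numbers are the Gemma configs mentioned above):

from typing import Optional

def resolve_head_dim(num_heads: int, dim: int, head_dim: Optional[int] = None) -> int:
    # Default assumption in hf_to_tune: head_dim * num_heads == dim.
    # Gemma 7B breaks this, so the caller needs a way to pass head_dim explicitly.
    return head_dim if head_dim is not None else dim // num_heads

# Gemma 2B: the default assumption holds (2048 // 8 == 256).
assert resolve_head_dim(num_heads=8, dim=2048) == 256
# Gemma 7B: explicit override needed, since 3072 // 16 == 192, not 256.
assert resolve_head_dim(num_heads=16, dim=3072, head_dim=256) == 256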
print(f"======={self._model_type}==========") | ||
if ( | ||
self._model_type == "GEMMA" |
Can we move this to a separate utility function in checkpointer_utils? We should keep the checkpointer as clean as possible.
Yeah, I did it in the latest version.
"because it is the same as the model embed_tokens weight" | ||
) | ||
else: | ||
self._weight_map["lm_head.weight"] = "0002" |
When will this else block hit? If we know the checkpoints don't contain this key, let's just work with that assumption? Anyways we're hard coding a bunch of stuff like the name of the key etc.
It was because the parameters were not really tied before. Now I removed the else block.
state_dict = torch.load(
    str(checkpoint_path), map_location="cpu", mmap=True, weights_only=True
)
if str(checkpoint_path).endswith(".safetensors") and is_safetensors_available():
Not a fan of this approach. Can we just add a key to the config, something like is_safetensors_file, and then based on the value determine if we use torch.load or not? Also please break this down into a sub-function (e.g. load_from_safetensor or something similar).
@joecummings WDYT?
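A hedged sketch of what that sub-function might look like (the names load_from_safetensor and load_checkpoint_file just follow the suggestion above and are illustrative; the safe_open usage follows the safetensors API):

from pathlib import Path
from typing import Any, Dict

import torch
from safetensors import safe_open

def load_from_safetensor(checkpoint_path: Path) -> Dict[str, Any]:
    # Load a .safetensors checkpoint into a plain state dict on CPU.
    state_dict: Dict[str, Any] = {}
    with safe_open(str(checkpoint_path), framework="pt", device="cpu") as f:
        for key in f.keys():
            state_dict[key] = f.get_tensor(key)
    return state_dict

def load_checkpoint_file(checkpoint_path: Path, weights_only: bool = True) -> Dict[str, Any]:
    # Dispatch on the file extension instead of branching inline in the checkpointer.
    if str(checkpoint_path).endswith(".safetensors"):
        return load_from_safetensor(checkpoint_path)
    return torch.load(
        str(checkpoint_path), map_location="cpu", mmap=True, weights_only=weights_only
    )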
…the same as model embed_tokens weight
A huge thank you @solitude-alive for adding this functionality and also patiently addressing the many review comments. This functionality makes TorchTune better and we really appreciate all of your hard work. I'll merge this into MAIN, make a few small changes to the core recipes based on some upcoming changes and then add this to our README and cite you as the author. Thanks so much for all of the hard work!
Co-authored-by: ebsmothers <ebs@meta.com>
    state_dict = result
else:
    state_dict = torch.load(
        str(checkpoint_path), map_location="cpu", mmap=True, weights_only=True
Looks like the weights_only arg is not passed around here?
Oh yes, but I just looked at the latest version and it has been updated.
state_dict = torch.load(
    str(checkpoint_path),
    map_location="cpu",
    mmap=True,
    weights_only=weights_only,
)
    mmap=True,
    weights_only=weights_only,
is_safetensors_file = (
    True if str(checkpoint_path).endswith(".safetensors") else False
nit: btw this seems to be the same as:
is_safetensors_file = str(...).endswith(".safetensors")
Yeah.
Context
Changelog
Test plan