Feature/pytorch tpu container #14

Merged: 22 commits into main from feature/pytorch-tpu-container on Mar 26, 2024

Conversation

shub-kris (Contributor)

This PR adds a PyTorch TPU Dockerfile.

shub-kris (Contributor, Author) commented on Feb 16, 2024

It is based on the image mentioned in the PyTorch_XLA GitHub repo. So far, I have tested it on the example mentioned in the Google Cloud TPU docs.

shub-kris (Contributor, Author)

Added notebook, as it was not installed, and removed libraries that don't support TPU.

ARG DATASETS='2.16.1'
ARG ACCELERATE='0.27.0'
ARG EVALUATE='0.4.1'
ARG SENTENCE_TRANSFORMERS='2.3.1'
Member

not sure if sentence transformers support tpu

ARG TRANSFORMERS='4.37.2'
ARG DIFFUSERS='0.26.1'
ARG PEFT='0.8.2'
ARG TRL='0.7.10'
Member

not sure if sentence transformers support tpu

Contributor Author

You are right, it doesn't.

@@ -0,0 +1 @@
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.1.0_3.10_tpuvm
Collaborator

as discussed, maybe delete this file

shub-kris (Contributor, Author) commented on Feb 23, 2024

@philschmid let's merge this PR. I will test with the new transformers version 4.38.1 in a different branch and do a PR then.

philschmid (Member) left a comment

Wouldn't it make sense to move (#17) into this one, so that we have a closed "artifact" with a test/script that works? That way we don't need to make changes in #17 that are not represented here.

@@ -0,0 +1,104 @@
# Finetune Facebook OPT-350M on Dolly using Hugging Face PyTorch TPU DLC on Google Cloud TPU(v5e)

This example demonstrates how to finetune [Facebook OPT-350M](https://huggingface.co/facebook/opt-350m) using Hugging Face's DLCs on Google Cloud single-host TPU(v5e) VM. We use the [transformers](https://huggingface.co/docs/transformers/), [TRL](https://huggingface.co/docs/trl/en/index), and [PEFT](https://huggingface.co/docs/peft/index) library to fine-tune. The dataset used for this example is the [Doly-15k](databricks/databricks-dolly-15k) dataset which can be easily accessed from Hugging Face's [Datasets](https://huggingface.co/datasets) Hub.
Collaborator

Doly-15k -> Dolly-15k
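
As a rough illustration of the recipe the README excerpt above describes (fine-tuning with transformers, TRL, and PEFT on Dolly-15k), a minimal sketch might look like the following. This is not the PR's actual script; the LoRA settings, sequence length, and output directory are illustrative assumptions, while the epoch count, batch size, and learning rate mirror the README's example command.

```python
# Minimal sketch (not the PR's script) of LoRA fine-tuning with TRL's SFTTrainer
# on the Dolly-15k dataset. Arguments below follow the TRL 0.7.x-style API,
# matching the versions pinned in this PR; hyperparameters are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Collapse each Dolly record into a single prompt/response string.
def to_text(sample):
    sample["text"] = f"### Instruction\n{sample['instruction']}\n\n### Answer\n{sample['response']}"
    return sample

dataset = load_dataset("databricks/databricks-dolly-15k", split="train").map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,                                                  # assumed
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),   # assumed LoRA settings
    args=TrainingArguments(
        output_dir="output",              # assumed path
        num_train_epochs=3,
        per_device_train_batch_size=32,
        learning_rate=3e-4,
    ),
)
trainer.train()
trainer.save_model("output")  # saves the LoRA adapter weights
```

On a TPU VM with torch_xla installed (as in the image this PR builds), the transformers Trainer should pick up the XLA device automatically.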

shub-kris (Contributor, Author) commented on Feb 28, 2024

@philschmid @tengomucho according to the discussions in Slack, I have updated the Dockerfile to use the nightly version. I have also run finetune-gemma-lora-dolly.py; it runs successfully and saves the checkpoints too.

philschmid (Member)

Did we validate if the results are good?

tengomucho (Collaborator) left a comment

For future reviews, I would recommend adding more descriptive commit messages (check https://www.conventionalcommits.org/)

from trl import SFTTrainer


def train_gemma(args):
Collaborator

I think it would be better to rename this to train_llama or, if you want to keep the naming more generic, just train_model.

Contributor Author

Good catch @tengomucho. I missed it because I just copied the stuff from another example.

shub-kris (Contributor, Author)

@tengomucho I tried to improve the commit message.

philschmid (Member) left a comment

Left some minor comments. Mostly: did we ever validate that a trained model generates no garbage afterward?

@@ -0,0 +1,72 @@
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20240229
Member

Do we know when the next official XLA release will be?

Contributor Author

Somewhere early next month.

Contributor Author

Will update the Dockerfile once there is a new release; for now I think we can merge the PR.

Comment on lines 9 to 16
ARG TRANSFORMERS='4.38.1'
ARG DIFFUSERS='0.26.3'
ARG PEFT='0.8.2'
ARG TRL='0.7.11'
ARG DATASETS='2.17.1'
ARG ACCELERATE='0.27.2'
ARG EVALUATE='0.4.1'
ARG NOTEBOOK='7.1.1'
Member

check all versions

Contributor Author

Updated the versions.

Comment on lines 50 to 51
RUN pip install --upgrade git+https://github.com/huggingface/transformers.git \
git+https://github.com/huggingface/trl.git
Member

why?

Contributor Author

Some of my fixes related to the bugs "Fix save_pretrained to make sure adapter weights are also saved on TPU" and "Fix TPU checkpointing inside Trainer" will be released with 4.39.0, I guess. And I had to update trl because of this issue.

Contributor Author

There is still a fix that needs to land in transformers; once it's released, we can remove the additional pip install of transformers from git.

@@ -0,0 +1,104 @@
# Finetune Gemma-2B using Hugging Face PyTorch TPU DLC on Google Cloud TPU(v5e)
Member

Can we try Gemma 7B?

Contributor Author

Changed it to Gemma 7B.

--num_epochs 3 \
--train_batch_size 32 \
--lr 3e-4
```
Member

Can we include some steps to test or evaluate the results, to make sure the model is really learning and not generating garbage?

shub-kris (Contributor, Author) commented on Mar 20, 2024

Yes, I will do that. Someone else and I encountered a similar error, which I am working on. Once that is fixed (it also has to do with evaluation and saving), I will add an evaluation too.

shub-kris (Contributor, Author) commented on Mar 25, 2024

Added a small inference function to verify that the model, after being trained, is working as expected.

Tested the trained model and the inference generated:

### Instruction
Why can camels survive for long without water?

### Answer
Camels have a special adaptation that allows them to store large amounts of fat in their humps. This fat can be converted into water, which is essential for survival. Additionally, camels have thick fur that helps them to conserve body


### Instruction
Are the following items candy bars or gum: trident, Twix, hubba bubba, snickers, three musketeers, and wrigleys.

### Answer
Trident is gum, Twix is a candy bar, hubba bubba is gum, snickers is a candy bar, three musketeers is a candy bar, and wrigleys is gum.
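
A rough sketch of what such a post-training sanity check might look like (this is an assumption, not the exact function added in the PR; the checkpoint path and generation settings are placeholders):

```python
# Sketch of a post-training sanity check: load the saved checkpoint and generate
# an answer for one prompt. Path and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

def sanity_check(checkpoint_dir, prompt, max_new_tokens=64):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
    # If only LoRA adapter weights were saved, peft.AutoPeftModelForCausalLM
    # can be used instead to load the base model plus the adapter.
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(sanity_check(
    "output",  # placeholder checkpoint directory
    "### Instruction\nWhy can camels survive for long without water?\n\n### Answer\n",
))
```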

Contributor Author

I also did it for the llama-example. Here are the results:

### Human: From now on, you will act as a nutritionist. I will ask questions about nutrition and you will reply with an explanation on how I can apply it to my daily basis. My first request: What is the main benefit of doing intermittent fastening regularly?### Assistant: Intermittent fasting is a popular dietary pattern that involves alternating between periods of fasting and eating. It involves restricting calorie intake for a certain period of time, typically 12-


### Human: Was kannst Du im Vergleich zu anderen Large Language Models? ### Assistant: Ich kann nicht nur besser als andere Large Language Models, aber ich kann auch in vielen Bereichen besser sein. Als andere Modelle. Ich bin ein allumfassender Modell, das in viel


shub-kris (Contributor, Author) commented on Mar 25, 2024

Also updated the README and changed the TPU to v3-8, as it was easier for me to spin up and use. With v5e, one has to delete the instance when not using it, else it keeps running all the time.

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

device = xm.xla_device()
Member

device is never used? Why is that here?

Contributor Author

It can be removed; it was just there to check whether the TPU was accessible. But I removed the print statement, so it doesn't make sense to have it anymore.
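
For context, such a check presumably looked roughly like this (an assumption, not the original code):

```python
# Quick check that a TPU is reachable through torch_xla; roughly what the
# removed code did (an assumption, not the original line-for-line).
import torch_xla.core.xla_model as xm

device = xm.xla_device()
print(device)  # prints e.g. "xla:0" when a TPU core is available
```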


def train_llama(args):
    raw_dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
    model_id = "meta-llama/Llama-2-7b-hf"
Member

why hard coded?

Contributor Author

I guess because the file name is finetune-llama2-lora-guanaco.py, but okay, I will include it in the args, which is better.
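
A small sketch of moving the model id into the CLI arguments, as suggested (the flag name and defaults here are assumptions, not necessarily the PR's final choice; the other flags mirror the README's example command):

```python
# Sketch: expose model_id as a CLI argument instead of hard-coding it.
# Flag names/defaults are assumptions; --num_epochs, --train_batch_size and
# --lr follow the README's example command.
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("--num_epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=3e-4)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # train_llama(args) would then read args.model_id instead of the literal string.
```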

philschmid merged commit 224d974 into main on Mar 26, 2024
philschmid deleted the feature/pytorch-tpu-container branch on March 26, 2024 at 06:31