
Add example with PyTorch_XLA TPU DLC #17

Closed
wants to merge 6 commits

Conversation

shub-kris
Contributor

This PR adds an example for our PyTorch TPU container. The README will be updated once the DLCs are released; for now, it documents the steps I followed to build and test the container.

The example trains BERT for emotion classification and is based on a pytorch-xla test.

Collaborator

@tengomucho tengomucho left a comment


As discussed, mention that this works on single-host TPU VM setups (the steps would be different in a multi-host setup).


- Designed to scale cost-efficiently for a wide range of AI workloads, spanning training, fine-tuning, and inference.
- Optimized for TensorFlow, PyTorch, and JAX, and are available in a variety of form factors, including edge devices, workstations, and cloud-based infrastructure.
- TPUs are available in [Google Cloud](https://cloud.google.com/tpu/docs/intro-to-tpu), and has been integrated with [Vertex AI](https://cloud.google.com/vertex-ai/docs/training/training-with-tpu-vm), and [Google Kubernetes Engine (GKE)](https://cloud.google.com/tpu?hl=en#cloud-tpu-in-gke).
Collaborator


and has been -> and have been

Member

@philschmid philschmid Feb 22, 2024


We shouldn't only test BERT; nobody is interested in encoder models. Can we try Gemma or Llama? Additionally, we should use our libraries for training, e.g. Trainer from transformers or accelerate.
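For context, a Trainer-based run would be configured along these lines. This is a hypothetical sketch, not the PR's actual code; all hyperparameter values are illustrative placeholders:

```python
# Hypothetical sketch of a transformers Trainer configuration for
# causal-LM fine-tuning; every value below is a placeholder, not
# taken from this PR.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",          # where checkpoints are written
    num_train_epochs=3,             # placeholder value
    per_device_train_batch_size=8,  # per XLA device when run on TPU
    learning_rate=3e-4,             # placeholder value
)
# A Trainer(model=..., args=args, train_dataset=...) would then drive
# the training loop; on a TPU VM the script is typically launched so
# that each XLA device runs one worker process.
```

This is a config fragment only; the model, dataset, and launch mechanics are deliberately left out.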

Contributor Author

@shub-kris shub-kris Feb 22, 2024


I tried running with the Trainer, but the training time was far too long, and when I asked in the internal chat, the team said no one seems to be actively working on this.

Contributor Author


Managed to run it; I was doing something very wrong earlier. I will update the example with a CLM task or something similar and use Gemma.

Member


How long does it currently take?

Contributor Author


With BERT:

| Batch size | Time taken (min:sec) |
| --- | --- |
| 32x8 | ~2:30 |
| 16x8 | ~4:45 |
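Reading the "32x8" notation as per-core batch size times number of TPU cores (an interpretation, not stated explicitly in this thread), the global batch size is the product of the two:

```python
# Assumed interpretation of the "32x8" notation above:
# global batch size = per-core batch size x number of TPU cores.
per_core_batch = 32
num_cores = 8
global_batch = per_core_batch * num_cores
print(global_batch)  # → 256
```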

@shub-kris
Contributor Author

shub-kris commented Feb 23, 2024

I have added TRL, PEFT and used Dolly-15k.

With the setup mentioned in the README, I was able to run the training in 2 minutes and 30 seconds:

```bash
cd /workspace
python google-partnership/Google-Cloud-Containers/examples/google-cloud-tpu-vm/causal-language-modeling/peft_lora_trl_dolly_clm.py \
  --model_id facebook/opt-350m \
  --num_epochs 3 \
  --train_batch_size 8 \
  --num_cores 8 \
  --lr 3e-4
```
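For readers unfamiliar with PEFT, a minimal LoRA configuration for a causal-LM task generally looks like the following. This is a sketch with assumed hyperparameters, not the values used by the script above:

```python
# Sketch: a typical PEFT LoRA configuration for causal-LM fine-tuning.
# All hyperparameter values below are assumptions, not taken from the
# script referenced in this PR.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                    # LoRA rank (adapter dimensionality)
    lora_alpha=16,          # scaling factor for the adapter updates
    lora_dropout=0.05,      # dropout applied to LoRA layers
    bias="none",            # leave bias terms frozen
    task_type="CAUSAL_LM",  # causal language modeling
)
# get_peft_model(model, lora_config) would then wrap a base model so
# that only the low-rank adapter weights are trained.
```

This is a config fragment only; the base model and the TRL training loop around it are omitted.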

@philschmid running with Llama 7B will require a bigger machine; I am testing that currently, since it runs OOM on a TPU v5-litepod-8.

So for now, we can merge this PR along with the Dockerfile from the other PR: #14.

I will open a separate PR adding an example for Llama 7B, as it requires setting up a VM with multiple hosts (v5-litepod-16), for which the execution steps are different.

@shub-kris
Contributor Author

Merged into PR #14

@shub-kris shub-kris deleted the feature/pytorch-tpu-transformers-example branch March 27, 2024 12:47