
Add example with PyTorch_XLA TPU DLC #17

Closed
wants to merge 6 commits

Conversation

shub-kris
Contributor

This PR adds an example for our PyTorch TPU container. The README will be updated once the DLCs are released; for now, it documents the steps I followed to build and test the container.

The example trains BERT for emotion classification and is based on a pytorch-xla test.

Collaborator

@tengomucho tengomucho left a comment


As discussed, mention that this works on single-host TPU VM setups (the steps would be different in a multi-host setup).


- Designed to scale cost-efficiently for a wide range of AI workloads, spanning training, fine-tuning, and inference.
- Optimized for TensorFlow, PyTorch, and JAX, and are available in a variety of form factors, including edge devices, workstations, and cloud-based infrastructure.
- TPUs are available in [Google Cloud](https://cloud.google.com/tpu/docs/intro-to-tpu), and has been integrated with [Vertex AI](https://cloud.google.com/vertex-ai/docs/training/training-with-tpu-vm), and [Google Kubernetes Engine (GKE)](https://cloud.google.com/tpu?hl=en#cloud-tpu-in-gke).
Collaborator


and has been -> and have been

Member

@philschmid philschmid Feb 22, 2024


We shouldn't only test BERT; nobody is interested in encoder models. Can we try Gemma or Llama? Additionally, we should use our libraries for training, e.g. Trainer from transformers or accelerate.
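For context, a Trainer-based run would be configured along these lines. This is a hypothetical sketch, not the PR's actual code; all hyperparameter values are illustrative placeholders:

```python
# Hypothetical sketch of a transformers Trainer configuration for
# causal-LM fine-tuning; every value below is a placeholder, not
# taken from this PR.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output",          # where checkpoints are written
    num_train_epochs=3,             # placeholder value
    per_device_train_batch_size=8,  # per XLA device when run on TPU
    learning_rate=3e-4,             # placeholder value
)
# A Trainer(model=..., args=args, train_dataset=...) would then drive
# the training loop; on a TPU VM the script is typically launched so
# that each XLA device runs one worker process.
```

This is a config fragment only; the model, dataset, and launch mechanics are deliberately left out.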

Contributor Author

@shub-kris shub-kris Feb 22, 2024


I tried running with the Trainer, but the training time was far too long, and when I asked in the internal chat, the team said no one seems to be actively working on this.

Contributor Author


Managed to run it; I was doing something very wrong earlier. I will update the example with a CLM task or something similar and use Gemma.

Member


How long does it currently take?

Contributor Author


With BERT:

| Batch size | Time taken (min:sec) |
| --- | --- |
| 32x8 | ~2:30 |
| 16x8 | ~4:45 |
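Reading the "32x8" notation as per-core batch size times number of TPU cores (an interpretation, not stated explicitly in this thread), the global batch size is the product of the two:

```python
# Assumed interpretation of the "32x8" notation above:
# global batch size = per-core batch size x number of TPU cores.
per_core_batch = 32
num_cores = 8
global_batch = per_core_batch * num_cores
print(global_batch)  # → 256
```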

@shub-kris
Contributor Author

shub-kris commented Feb 23, 2024

I have added TRL, PEFT and used Dolly-15k.

With the setup mentioned in the README, I was able to run the training in 2 minutes and 30 seconds:

```bash
cd /workspace
python google-partnership/Google-Cloud-Containers/examples/google-cloud-tpu-vm/causal-language-modeling/peft_lora_trl_dolly_clm.py \
  --model_id facebook/opt-350m \
  --num_epochs 3 \
  --train_batch_size 8 \
  --num_cores 8 \
  --lr 3e-4
```
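For readers unfamiliar with PEFT, a minimal LoRA configuration for a causal-LM task generally looks like the following. This is a sketch with assumed hyperparameters, not the values used by the script above:

```python
# Sketch: a typical PEFT LoRA configuration for causal-LM fine-tuning.
# All hyperparameter values below are assumptions, not taken from the
# script referenced in this PR.
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                    # LoRA rank (adapter dimensionality)
    lora_alpha=16,          # scaling factor for the adapter updates
    lora_dropout=0.05,      # dropout applied to LoRA layers
    bias="none",            # leave bias terms frozen
    task_type="CAUSAL_LM",  # causal language modeling
)
# get_peft_model(model, lora_config) would then wrap a base model so
# that only the low-rank adapter weights are trained.
```

This is a config fragment only; the base model and the TRL training loop around it are omitted.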

@philschmid running with Llama 7B will require a bigger machine; I am testing that currently, since it runs OOM on a TPU v5-litepod-8.

So for now, we can merge this PR along with the Dockerfile from the other PR: #14.

I will open a separate PR adding an example for Llama 7B, as it requires setting up a VM with multiple hosts (v5-litepod-16), for which the execution steps are different.

@shub-kris
Contributor Author

Merged into PR #14

@shub-kris shub-kris deleted the feature/pytorch-tpu-transformers-example branch March 27, 2024 12:47