Add example with PyTorch_XLA TPU DLC #17
Conversation
As discussed, mention that this works in single-host TPU VM setups (the steps would be different in a multi-host setup).
- Designed to scale cost-efficiently for a wide range of AI workloads, spanning training, fine-tuning, and inference.
- Optimized for TensorFlow, PyTorch, and JAX, and are available in a variety of form factors, including edge devices, workstations, and cloud-based infrastructure.
- TPUs are available in [Google Cloud](https://cloud.google.com/tpu/docs/intro-to-tpu), and has been integrated with [Vertex AI](https://cloud.google.com/vertex-ai/docs/training/training-with-tpu-vm), and [Google Kubernetes Engine (GKE)](https://cloud.google.com/tpu?hl=en#cloud-tpu-in-gke).
and has been -> and have been
We shouldn't only test BERT; nobody is interested in encoder models. Can we try Gemma or Llama? Additionally, we should use our libraries for training, e.g. the Trainer from transformers, or accelerate.
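For reference, using the Trainer would look roughly like the minimal sketch below. This is not code from this PR: the model, dataset, and hyperparameters are placeholder assumptions, and on a TPU VM it would typically be launched through a multi-core spawner (e.g. the `xla_spawn.py` helper shipped with the transformers examples) rather than run directly.

```python
# Minimal sketch of using transformers' Trainer, as suggested above -- not the
# PR's code. Model, dataset, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "bert-base-uncased"  # the review asks to swap this for Gemma/Llama with a CLM task
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=6)

# Assumption: the emotion-classification dataset the current example uses.
dataset = load_dataset("emotion")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-emotion",
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```

When torch_xla is installed, the Trainer places the model on the XLA device automatically, so the same script can run on GPU or TPU.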
I tried running with the Trainer, but somehow the training time was way too long, and when I asked in the internal chat, the team said no one seems to be actively working on this.
Managed to run it; I was doing something super stupid earlier. I will update the example with a CLM task or something else and use Gemma.
How long does it currently take?
With BERT:

Batch size | Time taken (min:sec)
---|---
32x8 | ~2:30
16x8 | ~4:45
I have added TRL and PEFT and used Dolly-15k. With the setup mentioned in the README, I was able to run the training in 2 minutes and 30 seconds:

```bash
cd /workspace
python google-partnership/Google-Cloud-Containers/examples/google-cloud-tpu-vm/causal-language-modeling/peft_lora_trl_dolly_clm.py \
    --model_id facebook/opt-350m \
    --num_epochs 3 \
    --train_batch_size 8 \
    --num_cores 8 \
    --lr 3e-4
```

@philschmid running with

So for now, we can merge this PR along with the Dockerfile mentioned in the other PR: #14

I will open a separate PR where I will add an example to work with Llama-7B, as it requires setting up a VM with multiple hosts:
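For context, the general pattern behind such a script is a TRL `SFTTrainer` wrapped around a PEFT LoRA config. The sketch below only illustrates that pattern; it is not the contents of `peft_lora_trl_dolly_clm.py`, the prompt template and hyperparameters are assumptions, exact `SFTTrainer` arguments differ between TRL versions, and the TPU multi-core spawning driven by `--num_cores` is omitted.

```python
# Illustrative sketch only -- NOT the contents of peft_lora_trl_dolly_clm.py.
# Shows the general PEFT-LoRA + TRL SFT pattern on Dolly-15k; exact argument
# names differ between TRL versions, and TPU multi-core spawning is omitted.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# databricks/databricks-dolly-15k has "instruction", "context", and "response" columns.
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_text(sample):
    # Assumed prompt template; the real script may format samples differently.
    return {"text": f"### Instruction:\n{sample['instruction']}\n\n### Response:\n{sample['response']}"}

dataset = dataset.map(to_text)

peft_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # moved into SFTConfig in newer TRL releases
    peft_config=peft_config,
    args=TrainingArguments(
        output_dir="opt-dolly-lora",
        per_device_train_batch_size=8,
        num_train_epochs=3,
        learning_rate=3e-4,
    ),
)
trainer.train()
```

Training only the LoRA adapters keeps the memory footprint small enough to fit comfortably on a single-host TPU slice, which is consistent with the short runtimes reported above.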
Merged into PR #14
This PR adds an example for our PyTorch TPU container. The README will be updated later once the DLCs are released; for now, it describes the steps I followed to build and test it.
The example trains BERT for emotion classification and is based on the pytorch-xla test.
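For readers who have not seen the upstream pytorch-xla tests, they train with one process per TPU core via `xmp.spawn`, which is the single-host pattern this example follows. The sketch below illustrates that structure under assumed model, dataset, and hyperparameter choices; it is not the actual example code.

```python
# Illustrative sketch of the PyTorch/XLA single-host, multi-core training pattern
# (one process per TPU core via xmp.spawn) -- not the actual example code; the
# model, dataset, and hyperparameters here are assumptions.
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          default_data_collator)


def make_dataloader(batch_size):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    dataset = load_dataset("emotion", split="train")
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], padding="max_length",
                                truncation=True, max_length=128),
        batched=True,
        remove_columns=["text"],
    )
    dataset.set_format("torch")
    # Shard the data so each TPU core sees a distinct slice.
    sampler = DistributedSampler(dataset, num_replicas=xm.xrt_world_size(),
                                 rank=xm.get_ordinal(), shuffle=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler,
                      collate_fn=default_data_collator)


def train_fn(index, flags):
    device = xm.xla_device()  # one TPU core per spawned process
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=6).to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=flags["lr"])
    loader = pl.MpDeviceLoader(make_dataloader(flags["batch_size"]), device)

    model.train()
    for _ in range(flags["epochs"]):
        for batch in loader:
            optimizer.zero_grad()
            loss = model(**batch).loss
            loss.backward()
            xm.optimizer_step(optimizer)  # all-reduce gradients across cores, then step


if __name__ == "__main__":
    flags = {"lr": 3e-5, "epochs": 3, "batch_size": 32}
    # Spawns one process per TPU core on a single-host TPU VM (e.g. v3-8 / v4-8).
    xmp.spawn(train_fn, args=(flags,))
```

This single-host layout is why the multi-host Llama-7B example is deferred to a separate PR: with multiple hosts, the launcher and data sharding work differently.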