Feature/pytorch tpu container #14
Conversation
It is based on the image mentioned in the PyTorch_XLA GitHub repo. So far, I have tested it on the example mentioned in the Google Cloud TPU docs.
Added notebook, as it was not installed; also removed libraries that don't support TPU.
containers/pytorch/training/tpu/2.1/transformers/4.37.2/py310/Dockerfile
ARG DATASETS='2.16.1'
ARG ACCELERATE='0.27.0'
ARG EVALUATE='0.4.1'
ARG SENTENCE_TRANSFORMERS='2.3.1'
not sure if sentence transformers support tpu
ARG TRANSFORMERS='4.37.2'
ARG DIFFUSERS='0.26.1'
ARG PEFT='0.8.2'
ARG TRL='0.7.10'
not sure if sentence transformers support tpu
You are right, it doesn't.
# Versions
ARG TRANSFORMERS='4.37.2'
ARG DIFFUSERS='0.26.1'
ARG PEFT='0.8.2'
not sure if sentence transformers support tpu
@@ -0,0 +1 @@
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.1.0_3.10_tpuvm
as discussed, maybe delete this file
@philschmid let's merge this PR. I will test with the new transformers version 4.38.1 in a different branch and do a PR then.
Wouldn't it make sense to move #17 into this one, so that we have a closed "artifact" with a test/script that works? That way we don't need to make changes in #17 which are not represented here.
…pytorch-tpu-container
@@ -0,0 +1,104 @@
# Finetune Facebook OPT-350M on Dolly using Hugging Face PyTorch TPU DLC on Google Cloud TPU(v5e)

This example demonstrates how to finetune [Facebook OPT-350M](https://huggingface.co/facebook/opt-350m) using Hugging Face's DLCs on a Google Cloud single-host TPU(v5e) VM. We use the [transformers](https://huggingface.co/docs/transformers/), [TRL](https://huggingface.co/docs/trl/en/index), and [PEFT](https://huggingface.co/docs/peft/index) libraries to fine-tune. The dataset used for this example is the [Doly-15k](databricks/databricks-dolly-15k) dataset, which can be easily accessed from Hugging Face's [Datasets](https://huggingface.co/datasets) Hub.
Doly-15k -> Dolly-15k
@philschmid @tengomucho according to the discussions in Slack, I have updated the Dockerfile to use the nightly version. I have also run finetune-gemma-lora-dolly.py; it runs successfully and saves the checkpoints too.
Did we validate whether the results are good?
For future reviews, I would recommend adding more descriptive commit messages (check https://www.conventionalcommits.org/)
from trl import SFTTrainer

def train_gemma(args):
I think it would be better to rename this to train_llama or, if you want to keep the naming more generic, just train_model.
Good catch @tengomucho. I missed it because I just copied the stuff from another example.
@tengomucho tried to improve the commit message
Left some minor comments. Mostly: did we ever validate that a trained model generates no garbage afterward?
@@ -0,0 +1,72 @@
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20240229
Do we know when there is the next official XLA release?
Somewhere early next month.
Will update the Dockerfile once there is a new release; for now I think we can merge the PR.
ARG TRANSFORMERS='4.38.1'
ARG DIFFUSERS='0.26.3'
ARG PEFT='0.8.2'
ARG TRL='0.7.11'
ARG DATASETS='2.17.1'
ARG ACCELERATE='0.27.2'
ARG EVALUATE='0.4.1'
ARG NOTEBOOK='7.1.1'
check all versions
Updated the versions.
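A quick way to sanity-check the pins above against what actually ends up in the image is to compare installed versions with the Dockerfile ARGs. This is a hypothetical helper (the `PINNED` mapping mirrors the ARG values in the diff; the function names are assumptions, not part of the PR):

```python
# Hypothetical helper: compare installed package versions against the pins
# declared as ARGs in the Dockerfile. Pure string/tuple comparison, no TPU needed.
PINNED = {
    "transformers": "4.38.1",
    "diffusers": "0.26.3",
    "peft": "0.8.2",
    "trl": "0.7.11",
    "datasets": "2.17.1",
    "accelerate": "0.27.2",
    "evaluate": "0.4.1",
    "notebook": "7.1.1",
}

def version_tuple(v: str) -> tuple:
    """Turn '4.38.1' into (4, 38, 1) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))

def check_pins(installed: dict) -> list:
    """Return names of packages whose installed version differs from the pin."""
    return [
        name
        for name, pinned in PINNED.items()
        if name in installed and version_tuple(installed[name]) != version_tuple(pinned)
    ]
```

In practice one could feed `check_pins` a dict built from `importlib.metadata.version(...)` inside the container to catch drift between the ARGs and what pip actually resolved.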
RUN pip install --upgrade git+https://github.com/huggingface/transformers.git \
    git+https://github.com/huggingface/trl.git
why?
Some of my fixes related to the bugs "Fix save_pretrained to make sure adapter weights are also saved on TPU" and "Fix TPU checkpointing inside Trainer" will be released with 4.39.0, I guess. And I had to update trl because of this issue.
There is still a fix that needs to land in transformers, and then when it's released, we can remove the additional pip install for transformers from git.
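Once those fixes are in a tagged release, the git installs discussed above could be swapped back to pinned versions. A sketch of what that Dockerfile change might look like (the 4.39.0 pin is an assumption based on the thread, not a confirmed release):

```dockerfile
# Temporary workaround (current state of the PR): install from git until the
# TPU checkpointing/save_pretrained fixes are in a tagged release.
# RUN pip install --upgrade git+https://github.com/huggingface/transformers.git \
#     git+https://github.com/huggingface/trl.git

# After the release, pin instead (version numbers are assumptions):
ARG TRANSFORMERS='4.39.0'
ARG TRL='0.7.11'
RUN pip install transformers==${TRANSFORMERS} trl==${TRL}
```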
@@ -0,0 +1,104 @@
# Finetune Gemma-2B using Hugging Face PyTorch TPU DLC on Google Cloud TPU(v5e)
Can we try Gemma 7B?
Changed it to Gemma 7B.
--num_epochs 3 \
--train_batch_size 32 \
--lr 3e-4
```
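The flags in the command above imply an argument parser roughly like the following. This is a sketch: only the three flags shown in the thread are taken from the PR, and their defaults here are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags taken from the launch command in the thread; defaults are assumptions.
    parser = argparse.ArgumentParser(description="Finetune on TPU (sketch)")
    parser.add_argument("--num_epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=3e-4)
    return parser

args = build_parser().parse_args(["--num_epochs", "3", "--train_batch_size", "32", "--lr", "3e-4"])
```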
Can we include some steps to test or evaluate the results? To make sure the model is really learning and not generating garbage?
Yes, I will do that. Someone else and I encountered a similar error, which I am working on. Once that is fixed (it also has to do with evaluation and saving), I will add an evaluation too.
Added a small inference function to verify that the model works as expected after being trained.
Tested the trained model and the inference generated:
### Instruction
Why can camels survive for long without water?
### Answer
Camels have a special adaptation that allows them to store large amounts of fat in their humps. This fat can be converted into water, which is essential for survival. Additionally, camels have thick fur that helps them to conserve body
### Instruction
Are the following items candy bars or gum: trident, Twix, hubba bubba, snickers, three musketeers, and wrigleys.
### Answer
Trident is gum, Twix is a candy bar, hubba bubba is gum, snickers is a candy bar, three musketeers is a candy bar, and wrigleys is gum.
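The "### Instruction" / "### Answer" transcripts above follow a Dolly-style prompt template. A hypothetical helper that assembles such prompts (the function name and exact template are assumptions based on the sample outputs, not the PR's code):

```python
def build_prompt(instruction: str, answer: str = "") -> str:
    """Assemble a Dolly-style prompt; with an empty answer this is the
    generation prefix that would be fed to the model at inference time."""
    return f"### Instruction\n{instruction}\n### Answer\n{answer}"
```

At inference time one would pass only the instruction, let the model continue after "### Answer", and compare the continuation against expectations like the camel and candy-bar answers shown above.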
I also did it for the llama-example. Here are the results:
### Human: From now on, you will act as a nutritionist. I will ask questions about nutrition and you will reply with an explanation on how I can apply it to my daily basis. My first request: What is the main benefit of doing intermittent fastening regularly?### Assistant: Intermittent fasting is a popular dietary pattern that involves alternating between periods of fasting and eating. It involves restricting calorie intake for a certain period of time, typically 12-
### Human: Was kannst Du im Vergleich zu anderen Large Language Models? ### Assistant: Ich kann nicht nur besser als andere Large Language Models, aber ich kann auch in vielen Bereichen besser sein. Als andere Modelle. Ich bin ein allumfassender Modell, das in viel
Also, I updated the README and changed the TPU to v3-8, as it was easier for me to spin up and use. With v5e, one has to delete the instance when not using it, else it keeps running all the time.
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

device = xm.xla_device()
device is never used? Why is that here?
It can be removed; it was just there to check whether TPU access was available. But I removed the print statement, so it doesn't make sense to keep it anymore.

def train_llama(args):
    raw_dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
    model_id = "meta-llama/Llama-2-7b-hf"
why hard coded?
I guess because the file name is finetune-llama2-lora-guanaco.py, but okay, I will include it in the args, which is better.
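Moving the hard-coded model id into the args could look like the following sketch (the `--model_id` flag name is an assumption; the default keeps the previously hard-coded value so existing invocations keep working):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="LoRA finetuning (sketch)")
    # Default preserves the value that used to be hard-coded in the script;
    # pass --model_id to finetune a different base model.
    parser.add_argument("--model_id", type=str, default="meta-llama/Llama-2-7b-hf")
    return parser
```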
This PR adds a PyTorch TPU Dockerfile.