Feature/pytorch tpu container #14

Merged: 22 commits into main from feature/pytorch-tpu-container on Mar 26, 2024

Conversation

shub-kris (Contributor)

This PR adds a PyTorch TPU Dockerfile.

shub-kris (Contributor, Author) commented on Feb 16, 2024

It is based on the image mentioned in the PyTorch_XLA GitHub repo. So far, I have tested it on the example mentioned in the Google Cloud TPU docs.

shub-kris (Contributor, Author)

Added notebook, as it was not installed, and removed libraries that don't support TPU.

ARG DATASETS='2.16.1'
ARG ACCELERATE='0.27.0'
ARG EVALUATE='0.4.1'
ARG SENTENCE_TRANSFORMERS='2.3.1'
Member

not sure if sentence transformers support tpu

ARG TRANSFORMERS='4.37.2'
ARG DIFFUSERS='0.26.1'
ARG PEFT='0.8.2'
ARG TRL='0.7.10'
Member

not sure if sentence transformers support tpu

Contributor Author

You are right, it doesn't.

@@ -0,0 +1 @@
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.1.0_3.10_tpuvm
Collaborator

as discussed, maybe delete this file

shub-kris (Contributor, Author) commented on Feb 23, 2024

@philschmid let's merge this PR. I will test with the new transformers version 4.38.1 in a different branch and do a PR then.

philschmid (Member) left a comment

Wouldn't it make sense to move (#17) into this one, so that we have a closed "artifact" with a test/script that works? That way we don't need to make changes in #17 that are not represented here.

@@ -0,0 +1,104 @@
# Finetune Facebook OPT-350M on Dolly using Hugging Face PyTorch TPU DLC on Google Cloud TPU(v5e)

This example demonstrates how to finetune [Facebook OPT-350M](https://huggingface.co/facebook/opt-350m) using Hugging Face's DLCs on Google Cloud single-host TPU(v5e) VM. We use the [transformers](https://huggingface.co/docs/transformers/), [TRL](https://huggingface.co/docs/trl/en/index), and [PEFT](https://huggingface.co/docs/peft/index) library to fine-tune. The dataset used for this example is the [Doly-15k](databricks/databricks-dolly-15k) dataset which can be easily accessed from Hugging Face's [Datasets](https://huggingface.co/datasets) Hub.
Collaborator

Doly-15k -> Dolly-15k
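
As a rough illustration of the recipe the README excerpt above describes (fine-tuning with transformers, TRL, and PEFT on Dolly-15k), a minimal sketch might look like the following. This is not the PR's actual script; the LoRA settings, sequence length, and output directory are illustrative assumptions, while the epoch count, batch size, and learning rate mirror the README's example command.

```python
# Minimal sketch (not the PR's script) of LoRA fine-tuning with TRL's SFTTrainer
# on the Dolly-15k dataset. Arguments below follow the TRL 0.7.x-style API,
# matching the versions pinned in this PR; hyperparameters are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_id = "facebook/opt-350m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Collapse each Dolly record into a single prompt/response string.
def to_text(sample):
    sample["text"] = f"### Instruction\n{sample['instruction']}\n\n### Answer\n{sample['response']}"
    return sample

dataset = load_dataset("databricks/databricks-dolly-15k", split="train").map(to_text)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,                                                  # assumed
    peft_config=LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"),   # assumed LoRA settings
    args=TrainingArguments(
        output_dir="output",              # assumed path
        num_train_epochs=3,
        per_device_train_batch_size=32,
        learning_rate=3e-4,
    ),
)
trainer.train()
trainer.save_model("output")  # saves the LoRA adapter weights
```

On a TPU VM with torch_xla installed (as in the image this PR builds), the transformers Trainer should pick up the XLA device automatically.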

shub-kris (Contributor, Author) commented on Feb 28, 2024

@philschmid @tengomucho according to the discussions in Slack, I have updated the Dockerfile to use the nightly version. I have also run finetune-gemma-lora-dolly.py; it runs successfully and saves the checkpoints too.

philschmid (Member)

Did we validate if the results are good?

tengomucho (Collaborator) left a comment

For future reviews, I would recommend adding more descriptive commit messages (check https://www.conventionalcommits.org/)

from trl import SFTTrainer


def train_gemma(args):
Collaborator

I think it would be better to rename this to train_llama or, if you want to keep the naming more generic, just train_model.

Contributor Author

Good catch @tengomucho. I missed it because I just copied the stuff from another example.

shub-kris (Contributor, Author)

@tengomucho I tried to improve the commit message.

philschmid (Member) left a comment

Left some minor comments. Mostly: did we ever validate that a trained model generates no garbage afterward?

@@ -0,0 +1,72 @@
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20240229
Member

Do we know when the next official XLA release will be?

Contributor Author

Somewhere early next month.

Contributor Author

Will update the Dockerfile once there is a new release; for now I think we can merge the PR.

Comment on lines 9 to 16
ARG TRANSFORMERS='4.38.1'
ARG DIFFUSERS='0.26.3'
ARG PEFT='0.8.2'
ARG TRL='0.7.11'
ARG DATASETS='2.17.1'
ARG ACCELERATE='0.27.2'
ARG EVALUATE='0.4.1'
ARG NOTEBOOK='7.1.1'
Member

check all versions

Contributor Author

Updated the versions.

Comment on lines 50 to 51
RUN pip install --upgrade git+https://github.com/huggingface/transformers.git \
git+https://github.com/huggingface/trl.git
Member

why?

Contributor Author

Some of my fixes related to the bugs "Fix save_pretrained to make sure adapter weights are also saved on TPU" and "Fix TPU checkpointing inside Trainer" will be released with 4.39.0, I guess. And I had to update trl because of this issue.

Contributor Author

There is still a fix that needs to land in transformers; once it's released, we can remove the additional pip install of transformers from git.

@@ -0,0 +1,104 @@
# Finetune Gemma-2B using Hugging Face PyTorch TPU DLC on Google Cloud TPU(v5e)
Member

Can we try Gemma 7B?

Contributor Author

Changed it to Gemma 7B.

--num_epochs 3 \
--train_batch_size 32 \
--lr 3e-4
```
Member

Can we include some steps to test or evaluate the results, to make sure the model is really learning and not generating garbage?

shub-kris (Contributor, Author) commented on Mar 20, 2024

Yes, I will do that. Someone else and I encountered a similar error, which I am working on. Once that is fixed (it also has to do with evaluation and saving), I will add an evaluation too.

shub-kris (Contributor, Author) commented on Mar 25, 2024

Added a small inference function to verify that the model, after being trained, is working as expected.

Tested the trained model and the inference generated:

### Instruction
Why can camels survive for long without water?

### Answer
Camels have a special adaptation that allows them to store large amounts of fat in their humps. This fat can be converted into water, which is essential for survival. Additionally, camels have thick fur that helps them to conserve body


### Instruction
Are the following items candy bars or gum: trident, Twix, hubba bubba, snickers, three musketeers, and wrigleys.

### Answer
Trident is gum, Twix is a candy bar, hubba bubba is gum, snickers is a candy bar, three musketeers is a candy bar, and wrigleys is gum.
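
A rough sketch of what such a post-training sanity check might look like (this is an assumption, not the exact function added in the PR; the checkpoint path and generation settings are placeholders):

```python
# Sketch of a post-training sanity check: load the saved checkpoint and generate
# an answer for one prompt. Path and generation settings are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

def sanity_check(checkpoint_dir, prompt, max_new_tokens=64):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)
    # If only LoRA adapter weights were saved, peft.AutoPeftModelForCausalLM
    # can be used instead to load the base model plus the adapter.
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(sanity_check(
    "output",  # placeholder checkpoint directory
    "### Instruction\nWhy can camels survive for long without water?\n\n### Answer\n",
))
```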

Contributor Author

I also did it for the llama-example. Here are the results:

### Human: From now on, you will act as a nutritionist. I will ask questions about nutrition and you will reply with an explanation on how I can apply it to my daily basis. My first request: What is the main benefit of doing intermittent fastening regularly?### Assistant: Intermittent fasting is a popular dietary pattern that involves alternating between periods of fasting and eating. It involves restricting calorie intake for a certain period of time, typically 12-


### Human: Was kannst Du im Vergleich zu anderen Large Language Models? ### Assistant: Ich kann nicht nur besser als andere Large Language Models, aber ich kann auch in vielen Bereichen besser sein. Als andere Modelle. Ich bin ein allumfassender Modell, das in viel


shub-kris (Contributor, Author) commented on Mar 25, 2024

Also updated the README and changed the TPU to v3-8, as it was easier for me to spin up and use. With v5e, one has to delete the instance when not using it, else it keeps running all the time.

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

device = xm.xla_device()
Member

device is never used? Why is that here?

Contributor Author

It can be removed; it was just there to check whether the TPU was accessible. But I removed the print statement, so it doesn't make sense to have it anymore.
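
For context, such a check presumably looked roughly like this (an assumption, not the original code):

```python
# Quick check that a TPU is reachable through torch_xla; roughly what the
# removed code did (an assumption, not the original line-for-line).
import torch_xla.core.xla_model as xm

device = xm.xla_device()
print(device)  # prints e.g. "xla:0" when a TPU core is available
```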


def train_llama(args):
    raw_dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")
    model_id = "meta-llama/Llama-2-7b-hf"
Member

why hard coded?

Contributor Author

I guess because the file name is finetune-llama2-lora-guanaco.py, but okay, I will include it in the args, which is better.
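
A small sketch of moving the model id into the CLI arguments, as suggested (the flag name and defaults here are assumptions, not necessarily the PR's final choice; the other flags mirror the README's example command):

```python
# Sketch: expose model_id as a CLI argument instead of hard-coding it.
# Flag names/defaults are assumptions; --num_epochs, --train_batch_size and
# --lr follow the README's example command.
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("--num_epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--lr", type=float, default=3e-4)
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    # train_llama(args) would then read args.model_id instead of the literal string.
```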

philschmid merged commit 224d974 into main on Mar 26, 2024
philschmid deleted the feature/pytorch-tpu-container branch on March 26, 2024 at 06:31