[no_early_kickoff][Train] ray.train.huggingface restructure #33278

Merged
45 commits merged on May 8, 2023

Commits
4bcb4e3
WIP
Yard1 Mar 13, 2023
673cd80
WIP
Yard1 Mar 13, 2023
f4469e1
WIP
Yard1 Mar 13, 2023
e838585
Merge branch 'ray-project:master' into accelerate_trainer_2
Yard1 Mar 13, 2023
5595dd3
WIP
Yard1 Mar 13, 2023
6bf6603
Fix
Yard1 Mar 13, 2023
fb333c9
Restructure `ray.train.huggingface`
Yard1 Mar 13, 2023
1685593
Merge branch 'master' into huggingface_restructure
Yard1 Mar 25, 2023
812cc08
Fix
Yard1 Mar 25, 2023
a4dc374
Fix
Yard1 Mar 25, 2023
c6ebc40
Fix
Yard1 Mar 25, 2023
cfad9cb
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 Mar 25, 2023
5cfd904
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 Mar 25, 2023
dacf5c6
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 Mar 25, 2023
00d4242
Fix
Yard1 Mar 25, 2023
a6fcd01
Fix docs
Yard1 Mar 27, 2023
cd9df03
Merge branch 'master' into huggingface_restructure
Yard1 Mar 27, 2023
e2b84eb
Fix
Yard1 Mar 27, 2023
73d1bff
Apply feedback from code review
Yard1 Mar 28, 2023
6263184
Merge branch 'master' into huggingface_restructure
Yard1 Mar 28, 2023
e9eee1e
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 Mar 31, 2023
bf9d662
Merge branch 'master' into huggingface_restructure
Yard1 Apr 17, 2023
ca88b8c
Merge branch 'huggingface_restructure' of https://github.com/Yard1/ra…
Yard1 Apr 17, 2023
3d3e1ac
Allow local datasets in HuggingFaceTrainer
Yard1 Apr 17, 2023
1fd2c53
Merge branch 'hf_trainer_allow_non_ray_datasets' into huggingface_res…
Yard1 Apr 17, 2023
56289d6
Clarify
Yard1 Apr 17, 2023
daa8e66
Merge branch 'hf_trainer_allow_non_ray_datasets' into huggingface_res…
Yard1 Apr 17, 2023
e785d2e
Update
Yard1 Apr 17, 2023
c9c5203
Merge branch 'master' into huggingface_restructure
Yard1 Apr 28, 2023
4696da2
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 May 1, 2023
34cf027
Tweak docs
Yard1 May 1, 2023
ac8e23b
Merge branch 'master' into huggingface_restructure
Yard1 May 2, 2023
39507c9
Change paths
Yard1 May 2, 2023
cfe66a9
Add alias
Yard1 May 2, 2023
cf9b54b
Remove transformers alias
Yard1 May 2, 2023
969bde1
Rename to hf_transformers
Yard1 May 2, 2023
2680826
Fix
Yard1 May 2, 2023
076bec6
Fix
Yard1 May 3, 2023
9ac55dd
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 May 3, 2023
3392072
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 May 3, 2023
17f18bf
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 May 3, 2023
264e3f6
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 May 4, 2023
e37ffd9
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 May 4, 2023
8f7e745
Merge branch 'master' into huggingface_restructure
Yard1 May 4, 2023
9e75a43
Merge branch 'ray-project:master' into huggingface_restructure
Yard1 May 8, 2023
2 changes: 1 addition & 1 deletion doc/source/ray-air/api/predictor.rst
@@ -92,6 +92,6 @@ Built-in Predictors for Library Integrations
~lightgbm.LightGBMPredictor
~tensorflow.TensorflowPredictor
~torch.TorchPredictor
~huggingface.HuggingFacePredictor
~hf_transformers.TransformersPredictor
~sklearn.SklearnPredictor
~rl.RLPredictor
2 changes: 1 addition & 1 deletion doc/source/ray-air/doc_code/accelerate_trainer.py
@@ -5,7 +5,7 @@

import ray
from ray.air import session, Checkpoint
from ray.train.huggingface.accelerate import AccelerateTrainer
from ray.train.hf_accelerate import AccelerateTrainer
from ray.air.config import ScalingConfig


4 changes: 2 additions & 2 deletions doc/source/ray-air/doc_code/hf_trainer.py
@@ -9,7 +9,7 @@
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

import ray
from ray.train.huggingface import HuggingFaceTrainer
from ray.train.hf_transformers import TransformersTrainer
from ray.air.config import ScalingConfig


@@ -81,7 +81,7 @@ def trainer_init_per_worker(train_dataset, eval_dataset, **config):


scaling_config = ScalingConfig(num_workers=3, use_gpu=use_gpu)
trainer = HuggingFaceTrainer(
trainer = TransformersTrainer(
trainer_init_per_worker=trainer_init_per_worker,
scaling_config=scaling_config,
datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds},
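The hunk above is truncated by the diff view, so here is a compact, hedged sketch of what a complete script against the renamed module might look like. The model choice, placeholder data, and hyperparameters below are illustrative assumptions for the sketch, not values taken from this PR; only the import path, class name, and constructor arguments mirror the diff.

```python
# Hedged sketch: TransformersTrainer under the new ray.train.hf_transformers path.
# Model choice, placeholder data, and hyperparameters are assumptions.
import ray
from ray.air.config import ScalingConfig
from ray.train.hf_transformers import TransformersTrainer


def trainer_init_per_worker(train_dataset, eval_dataset, **config):
    # Build and return an ordinary 🤗 Transformers Trainer; Ray wraps it in DDP.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")
    args = TrainingArguments(
        output_dir="/tmp/hf_sketch",
        per_device_train_batch_size=config.get("batch_size", 16),
        max_steps=5,   # required here because Ray feeds an iterable dataset
        no_cuda=True,  # assumption: CPU-only for the sketch
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )


# Placeholder, pre-tokenized rows; a real run would tokenize raw text first.
row = {"input_ids": [101, 2023, 102], "attention_mask": [1, 1, 1], "labels": 1}
train_ds = ray.data.from_items([row] * 32)
eval_ds = ray.data.from_items([row] * 8)

trainer = TransformersTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={"batch_size": 16},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
    datasets={"train": train_ds, "evaluation": eval_ds},
)
result = trainer.fit()
```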
26 changes: 14 additions & 12 deletions doc/source/ray-air/examples/gptj_deepspeed_fine_tuning.ipynb
@@ -402,16 +402,17 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Fine-tuning the model with Ray AIR <a name=\"train\"></a>\n",
"\n",
"We can now configure Ray AIR's {class}`~ray.train.huggingface.huggingface_trainer.HuggingFaceTrainer` to perform distributed fine-tuning of the model. In order to do that, we specify a `trainer_init_per_worker` function, which creates a 🤗 Transformers `Trainer` that will be distributed by Ray using Distributed Data Parallelism (using PyTorch Distributed backend internally). This means that each worker will have its own copy of the model, but operate on different data, At the end of each step, all the workers will sync gradients.\n",
"We can now configure Ray AIR's {class}`~ray.train.hf_transformers.TransformersTrainer` to perform distributed fine-tuning of the model. In order to do that, we specify a `trainer_init_per_worker` function, which creates a 🤗 Transformers `Trainer` that will be distributed by Ray using Distributed Data Parallelism (using PyTorch Distributed backend internally). This means that each worker will have its own copy of the model, but operate on different data, At the end of each step, all the workers will sync gradients.\n",
"\n",
"Because GPT-J is a relatively large model, it may not be possible to fit it on smaller GPU types (<=16 GB GRAM). To deal with that issue, we can use [DeepSpeed](https://github.com/microsoft/DeepSpeed), a library to optimize the training process and allow us to (among other things) offload and partition optimizer and parameter states, reducing GRAM usage. Furthermore, DeepSpeed ZeRO Stage 3 allows us to load large models without running out of memory.\n",
"\n",
"🤗 Transformers and Ray AIR's integration ({class}`~ray.train.huggingface.huggingface_trainer.HuggingFaceTrainer`) allow you to easily configure and use DDP and DeepSpeed. All you need to do is specify the DeepSpeed configuration in the [`TrainingArguments`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments) object.\n",
"🤗 Transformers and Ray AIR's integration ({class}`~ray.train.hf_transformers.TransformersTrainer`) allow you to easily configure and use DDP and DeepSpeed. All you need to do is specify the DeepSpeed configuration in the [`TrainingArguments`](https://huggingface.co/docs/transformers/en/main_classes/trainer#transformers.TrainingArguments) object.\n",
"\n",
"```{tip}\n",
"There are many DeepSpeed settings that allow you to trade-off speed for memory usage. The settings used below are tailored to the cluster setup used (16 g4dn.4xlarge nodes) and per device batch size of 16. Some things to keep in mind:\n",
@@ -564,7 +565,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"With our `trainer_init_per_worker` complete, we can now instantiate the {class}`~ray.train.huggingface.huggingface_trainer.HuggingFaceTrainer`. Aside from the function, we set the `scaling_config`, controlling the amount of workers and resources used, and the `datasets` we will use for training and evaluation.\n",
"With our `trainer_init_per_worker` complete, we can now instantiate the {class}`~ray.train.hf_transformers.TransformersTrainer`. Aside from the function, we set the `scaling_config`, controlling the amount of workers and resources used, and the `datasets` we will use for training and evaluation.\n",
"\n",
"We pass the preprocessors we have defined earlier as an argument, wrapped in a {class}`~ray.data.preprocessors.chain.Chain`. The preprocessor will be included with the returned {class}`~ray.air.checkpoint.Checkpoint`, meaning it will also be applied during inference.\n",
"\n",
@@ -579,12 +580,12 @@
"metadata": {},
"outputs": [],
"source": [
"from ray.train.huggingface import HuggingFaceTrainer\n",
"from ray.train.hf_transformers import TransformersTrainer\n",
"from ray.air.config import ScalingConfig\n",
"from ray.data.preprocessors import Chain\n",
"\n",
"\n",
"trainer = HuggingFaceTrainer(\n",
"trainer = TransformersTrainer(\n",
" trainer_init_per_worker=trainer_init_per_worker,\n",
" trainer_init_config={\n",
" \"batch_size\": 16, # per device\n",
@@ -601,10 +602,11 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we call the {meth}`~ray.train.huggingface.huggingface_trainer.HuggingFaceTrainer.fit` method to start training with Ray AIR. We will save the {class}`~ray.air.Result` object to a variable so we can access metrics and checkpoints."
"Finally, we call the {meth}`~ray.train.hf_transformers.TransformersTrainer.fit` method to start training with Ray AIR. We will save the {class}`~ray.air.Result` object to a variable so we can access metrics and checkpoints."
]
},
{
@@ -642,7 +644,7 @@
"<tr><th>Trial name </th><th>status </th><th>loc </th><th style=\"text-align: right;\"> iter</th><th style=\"text-align: right;\"> total time (s)</th><th style=\"text-align: right;\"> loss</th><th style=\"text-align: right;\"> learning_rate</th><th style=\"text-align: right;\"> epoch</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>HuggingFaceTrainer_f623d_00000</td><td>TERMINATED</td><td>10.0.30.196:30861</td><td style=\"text-align: right;\"> 85</td><td style=\"text-align: right;\"> 2579.3</td><td style=\"text-align: right;\">0.0715</td><td style=\"text-align: right;\"> 4.70588e-07</td><td style=\"text-align: right;\"> 1</td></tr>\n",
"<tr><td>TransformersTrainer_f623d_00000</td><td>TERMINATED</td><td>10.0.30.196:30861</td><td style=\"text-align: right;\"> 85</td><td style=\"text-align: right;\"> 2579.3</td><td style=\"text-align: right;\">0.0715</td><td style=\"text-align: right;\"> 4.70588e-07</td><td style=\"text-align: right;\"> 1</td></tr>\n",
"</tbody>\n",
"</table>\n",
" </div>\n",
@@ -979,7 +981,7 @@
{
"data": {
"text/plain": [
"HuggingFaceCheckpoint(local_path=/home/ray/ray_results/HuggingFaceTrainer_2023-03-06_16-35-29/HuggingFaceTrainer_f623d_00000_0_2023-03-06_16-35-30/checkpoint_000000)"
"TransformersCheckpoint(local_path=/home/ray/ray_results/TransformersTrainer_2023-03-06_16-35-29/TransformersTrainer_f623d_00000_0_2023-03-06_16-35-30/checkpoint_000000)"
]
},
"execution_count": 18,
@@ -998,13 +1000,13 @@
"source": [
"### Generate text from prompt\n",
"\n",
"We can use the {class}`~ray.train.huggingface.huggingface_predictor.HuggingFacePredictor` to generate predictions from our fine-tuned model.\n",
"We can use the {class}`~ray.train.hf_transformers.huggingface_predictor.TransformersPredictor` to generate predictions from our fine-tuned model.\n",
"\n",
"```{tip}\n",
"For large scale batch inference, consider configuring cloud checkpointing and then pass the cloud-backed {class}`~ray.air.checkpoint.Checkpoint` to {class}`~ray.train.batch_predictor.BatchPredictor`. More information [here](air-predictors).\n",
"```\n",
"\n",
"Because the {class}`~ray.train.huggingface.huggingface_predictor.HuggingFacePredictor` uses a 🤗 Transformers [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) under the hood, we disable the tokenizer AIR Preprocessor we have used for training and let the `pipeline` to tokenize the data itself."
"Because the {class}`~ray.train.hf_transformers.huggingface_predictor.TransformersPredictor` uses a 🤗 Transformers [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines) under the hood, we disable the tokenizer AIR Preprocessor we have used for training and let the `pipeline` to tokenize the data itself."
]
},
{
@@ -1030,13 +1032,13 @@
"metadata": {},
"outputs": [],
"source": [
"from ray.train.huggingface import HuggingFacePredictor\n",
"from ray.train.hf_transformers import TransformersPredictor\n",
"import pandas as pd\n",
"\n",
"prompts = pd.DataFrame([\"Romeo and Juliet\", \"Romeo\", \"Juliet\"], columns=[\"text\"])\n",
"\n",
"# Predict on the head node.\n",
"predictor = HuggingFacePredictor.from_checkpoint(\n",
"predictor = TransformersPredictor.from_checkpoint(\n",
" checkpoint=checkpoint,\n",
" task=\"text-generation\",\n",
" torch_dtype=torch.float16 if use_gpu else None,\n",
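The tip earlier in this cell recommends `BatchPredictor` with a cloud-backed checkpoint for large-scale inference. Below is a hedged sketch of that path under the renamed module; the checkpoint URI, prompts, and worker resources are placeholder assumptions.

```python
# Hedged sketch: batch inference with the renamed TransformersPredictor.
import pandas as pd
import ray
from ray.air.checkpoint import Checkpoint
from ray.train.batch_predictor import BatchPredictor
from ray.train.hf_transformers import TransformersPredictor

# Placeholder URI; in practice, point this at the checkpoint written by training.
checkpoint = Checkpoint.from_uri(
    "s3://my-bucket/TransformersTrainer_example/checkpoint_000000"
)

prompts_ds = ray.data.from_pandas(
    pd.DataFrame(["Romeo and Juliet", "Romeo", "Juliet"], columns=["text"])
)

batch_predictor = BatchPredictor.from_checkpoint(
    checkpoint,
    TransformersPredictor,
    task="text-generation",
)
predictions = batch_predictor.predict(
    prompts_ds,
    num_gpus_per_worker=1,  # assumption: one GPU per prediction worker
)
predictions.show()
```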