Integrate OrpoTrainer with PyTorchXLA for faster step time on TPUs #2001
Conversation
trl/trainer/orpo_trainer.py
Outdated
        pad_value = self.padding_value
    elif k.endswith("_attention_mask"):
        pad_value = 0
    batch[k] = pad_list_to_length(batch[k], self.max_length, pad_value=pad_value)
Suggested change:
-    batch[k] = pad_list_to_length(batch[k], self.max_length, pad_value=pad_value)
+    batch[k] = batch[k] + [pad_value] * (self.max_length - len(batch[k]))
way faster and does not require a new helper func
thanks for the suggestion!
@@ -533,7 +536,17 @@ def tokenize_row(self, feature, model: Optional[Union[PreTrainedModel, nn.Module
             batch["chosen_decoder_input_ids"] = model.prepare_decoder_input_ids_from_labels(
                 labels=torch.tensor(batch["chosen_labels"])
             )

+        if is_torch_xla_available():
Why do you need this only when is_torch_xla_available?
PyTorch XLA doesn't support dynamic shape compilation, so we pad all sequences to a fixed maximum length. Dynamic shapes may not be a problem on GPU, so I kept the original algorithm there, which pads to the longest sequence length in the batch.
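For illustration, a rough sketch of that idea (not the exact PR code; the helper name and argument names here are hypothetical):

```python
# Rough sketch of the idea, not the exact PR code: under XLA, pad every
# list-valued feature of a tokenized row to a fixed max_length so compiled
# graphs always see static shapes; on other backends, keep padding to the
# longest sequence in the batch at collation time.
from transformers.utils import is_torch_xla_available


def pad_row_for_xla(batch, max_length, padding_value, label_pad_token_id):
    """Hypothetical helper: pad list-valued features in a tokenized row to max_length."""
    if not is_torch_xla_available():
        return batch  # non-XLA backends can pad dynamically later
    for k in batch:
        if k.endswith("_input_ids"):
            pad_value = padding_value
        elif k.endswith("_labels"):
            pad_value = label_pad_token_id
        elif k.endswith("_attention_mask"):
            pad_value = 0
        else:
            continue
        batch[k] = batch[k] + [pad_value] * (max_length - len(batch[k]))
    return batch
```

In the PR itself, the corresponding padding happens inside tokenize_row, behind the is_torch_xla_available() check shown in the diff above.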
@@ -35,7 +35,7 @@
 from transformers import AutoModelForCausalLM, DataCollator, PreTrainedModel, PreTrainedTokenizerBase, Trainer
 from transformers.trainer_callback import TrainerCallback
 from transformers.trainer_utils import EvalLoopOutput
-from transformers.utils import is_torch_fx_proxy
+from transformers.utils import is_torch_fx_proxy, is_torch_xla_available
This has been added with transformers v4.39. We should probably set this version as the new minimum requirement.
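As an aside, if one did not want to bump the minimum version, a defensive import along these lines would also work (just a sketch of an alternative; the PR instead raises the minimum transformers requirement, as discussed below):

```python
# Sketch of an alternative to bumping the minimum transformers version:
# fall back to a no-op check when is_torch_xla_available is missing (< v4.39).
try:
    from transformers.utils import is_torch_xla_available
except ImportError:
    def is_torch_xla_available() -> bool:
        # Older transformers: assume no XLA support and keep the original code path.
        return False
```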
thanks for catching this!
@@ -659,7 +672,7 @@ def get_batch_logps(
         loss_mask = labels != label_pad_token_id

         # dummy token; we'll ignore the losses on these tokens later
-        labels[labels == label_pad_token_id] = 0
+        labels = torch.where(labels == label_pad_token_id, 0, labels)
Is this change necessary? Personal opinion: I find it a bit less intuitive to read.
this is necessary, because the previous code calls torch.nonzero under the hood, which produces a dynamically shaped output and triggers graph recompilation.
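A minimal standalone illustration of the two variants (hypothetical tensors, outside the trainer):

```python
# Minimal standalone illustration (hypothetical tensors, outside the trainer).
# Masked in-place assignment relies on a nonzero-style index computation whose
# size depends on the data, which is a dynamic shape under XLA; torch.where
# produces the same result while keeping all shapes static.
import torch

label_pad_token_id = -100
labels = torch.tensor([[5, 7, -100, -100],
                       [3, -100, -100, -100]])

# variant that can trigger recompilation under XLA
labels_inplace = labels.clone()
labels_inplace[labels_inplace == label_pad_token_id] = 0

# XLA-friendly variant used in the PR
labels_where = torch.where(labels == label_pad_token_id, 0, labels)

assert torch.equal(labels_inplace, labels_where)
```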
Thank you very much for this addition @wenxindongwork! Unfortunately we can't test this with GitHub CI, so I'm relying on you for the fact that it works and runs faster.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
addressed comments, thanks for the quick review!
Hello @wenxindongwork can you please fix the code quality issues with
should work now!
Can you also set the min transformers version in
just did, thanks for pointing this out!
Thanks for iterating @wenxindongwork - LGTM!
OrpoTrainer currently runs very slowly on TPU because the code is not integrated with PyTorch XLA. There are too many dynamic shapes and device data transfers in the code, which trigger graph recompilation and slow down step time. This PR improves the step time of OrpoTrainer on TPUs by more than 300x. Tested on Llama3-8b with LoRA on all linear modules, the step time is now 2s, compared to the 10 minutes we started with.
The changes should not impact performance on other backends since we have guarded the changes with is_torch_xla_available.