AssertionError: The dataloader must be a torch_xla.distributed.parallel_loader.MpDeviceLoader #30091

Closed
moficodes opened this issue Apr 6, 2024 · 4 comments

@moficodes

System Info

transformers: v4.39.3
torch: 2.3.0
torch_xla: 2.3.0+gite385c2f
peft: 0.10.0
trl: 0.8.1

Following the discussion in #29659, where @shub-kris provided a script in #29659 (comment), I ran into this issue:

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1712436980.191572       7 pjrt_api.cc:100] GetPjrtApi was found for tpu at /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1712436980.191657       7 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1712436980.191663       7 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py:104: UserWarning: `devkind` argument is deprecated and will be removed in a future release.
  warnings.warn("`devkind` argument is deprecated and will be removed in a "
Generating train split: 102 examples [00:00, 452.42 examples/s]
/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
Traceback (most recent call last):
  File "//demo.py", line 81, in <module>
    train()
  File "//demo.py", line 63, in train
    trainer.train()
  File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 360, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1811, in _inner_training_loop
    train_dataloader = tpu_spmd_dataloader(train_dataloader)
  File "/usr/local/lib/python3.10/site-packages/transformers/integrations/tpu.py", line 24, in tpu_spmd_dataloader
    assert isinstance(
AssertionError: The dataloader must be a `torch_xla.distributed.parallel_loader.MpDeviceLoader`.

The warning asks me to pass an `accelerate.DataLoaderConfiguration`, but I am not sure where to do that; Accelerate is being invoked internally by transformers somewhere.
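
For reference, the deprecation warning in the log refers to constructing an Accelerator directly. A minimal sketch of what that would look like, assuming you control the Accelerator yourself (the Trainer builds its own internally, so this does not apply as-is here):

from accelerate import Accelerator, DataLoaderConfiguration

# Mirrors the defaults printed in the FutureWarning above.
dataloader_config = DataLoaderConfiguration(
    dispatch_batches=None,
    split_batches=False,
    even_batches=True,
    use_seedable_sampler=True,
)
accelerator = Accelerator(dataloader_config=dataloader_config)

Since the Trainer instantiates its own Accelerator, the warning appears to be informational only; the AssertionError above is the actual failure.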

Who can help?

@ArthurZucker
@muellerz
@shub-kris

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I ran the example on a TPU v4 on Kubernetes, with a 2x2x4 TPU topology.

The Dockerfile used to build the image is as follows, where demo.py is the script from the issue comment:

FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20240229

RUN pip install -U transformers trl datasets peft

COPY . .

CMD python demo.py

Expected behavior

The model fine-tunes and prints output.

@amyeroberts
Collaborator

Hi @moficodes, thanks for raising this issue!

I think you've tagged the wrong Mu(e)ller - cc'ing in @muellerzr @pacman100

@jarokaz

jarokaz commented Apr 9, 2024

I have encountered a similar issue. I think it is caused by accelerate:

https://github.com/huggingface/accelerate/blob/b8c85839531ded28efb77c32e0ad85af2062b27a/src/accelerate/state.py#L257

This code resets `self.distributed_type` to `DistributedType.NO`, and as a result an `MpDeviceLoader` is not returned by Accelerate:

https://github.com/huggingface/accelerate/blob/b8c85839531ded28efb77c32e0ad85af2062b27a/src/accelerate/data_loader.py#L976
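
Paraphrasing the interaction as a hedged sketch (illustrative only, not the verbatim accelerate source; prepare_sketch is a made-up name):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from accelerate.utils import DistributedType

def prepare_sketch(accelerator, dataloader):
    # accelerate wraps the dataloader only while it still reports an
    # XLA/TPU distributed type; after the reset described above the
    # state is DistributedType.NO, so the plain DataLoader comes back
    # and the tpu_spmd_dataloader assertion in transformers fires.
    if accelerator.state.distributed_type != DistributedType.NO:
        return pl.MpDeviceLoader(dataloader, xm.xla_device())
    return dataloader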

@muellerzr
Contributor

Yes indeed, this should be fixed by the patch release we just made. Can you try upgrading accelerate and see? (0.29.2)
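
For the Dockerfile in the reproduction above, one way to make sure the patched release is picked up is to pin it in the install step (a sketch; 0.29.2 is the patch release mentioned above):

RUN pip install -U transformers trl datasets peft "accelerate>=0.29.2"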


github-actions bot commented May 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
