AssertionError: The dataloader must be a torch_xla.distributed.parallel_loader.MpDeviceLoader #30091

Closed
moficodes opened this issue Apr 6, 2024 · 4 comments

@moficodes

System Info

transformers: v4.39.3
torch: 2.3.0
torch_xla: 2.3.0+gite385c2f
peft: 0.10.0
trl: 0.8.1

Following the discussion in #29659, where @shub-kris provided a script in #29659 (comment), I ran into this issue:

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1712436980.191572       7 pjrt_api.cc:100] GetPjrtApi was found for tpu at /usr/local/lib/python3.10/site-packages/torch_xla/lib/libtpu.so
I0000 00:00:1712436980.191657       7 pjrt_api.cc:79] PJRT_Api is set for device type tpu
I0000 00:00:1712436980.191663       7 pjrt_api.cc:146] The PJRT plugin has PJRT API version 0.40. The framework PJRT API version is 0.40.
/usr/local/lib/python3.10/site-packages/torch_xla/core/xla_model.py:104: UserWarning: `devkind` argument is deprecated and will be removed in a future release.
  warnings.warn("`devkind` argument is deprecated and will be removed in a "
Generating train split: 102 examples [00:00, 452.42 examples/s]
/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py:317: UserWarning: You passed a tokenizer with `padding_side` not equal to `right` to the SFTTrainer. This might lead to some unexpected behaviour due to overflow issues when training a model in half-precision. You might consider adding `tokenizer.padding_side = 'right'` to your code.
  warnings.warn(
/usr/local/lib/python3.10/site-packages/accelerate/accelerator.py:436: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches', 'even_batches', 'use_seedable_sampler']). Please pass an `accelerate.DataLoaderConfiguration` instead: 
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)
  warnings.warn(
Traceback (most recent call last):
  File "//demo.py", line 81, in <module>
    train()
  File "//demo.py", line 63, in train
    trainer.train()
  File "/usr/local/lib/python3.10/site-packages/trl/trainer/sft_trainer.py", line 360, in train
    output = super().train(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 1811, in _inner_training_loop
    train_dataloader = tpu_spmd_dataloader(train_dataloader)
  File "/usr/local/lib/python3.10/site-packages/transformers/integrations/tpu.py", line 24, in tpu_spmd_dataloader
    assert isinstance(
AssertionError: The dataloader must be a `torch_xla.distributed.parallel_loader.MpDeviceLoader`.

The warning asks me to pass an `accelerate.DataLoaderConfiguration`, but I am not sure where to do that; Accelerate is being invoked internally by transformers somewhere.
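
For reference, the deprecation warning in the log refers to constructing an Accelerator directly. A minimal sketch of what that would look like, assuming you control the Accelerator yourself (the Trainer builds its own internally, so this does not apply as-is here):

from accelerate import Accelerator, DataLoaderConfiguration

# Mirrors the defaults printed in the FutureWarning above.
dataloader_config = DataLoaderConfiguration(
    dispatch_batches=None,
    split_batches=False,
    even_batches=True,
    use_seedable_sampler=True,
)
accelerator = Accelerator(dataloader_config=dataloader_config)

Since the Trainer instantiates its own Accelerator, the warning appears to be informational only; the AssertionError above is the actual failure.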

Who can help?

@ArthurZucker
@muellerz
@shub-kris

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I ran the example on a TPU v4 on Kubernetes, with a 2x2x4 TPU topology.

The Dockerfile used to build the image is as follows, where demo.py is the script from the issue comment:

FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:nightly_3.10_tpuvm_20240229

RUN pip install -U transformers trl datasets peft

COPY . .

CMD python demo.py

Expected behavior

The model fine-tunes and prints output.

@amyeroberts
Collaborator

Hi @moficodes, thanks for raising this issue!

I think you've tagged the wrong Mu(e)ller - cc'ing in @muellerzr @pacman100

@jarokaz

jarokaz commented Apr 9, 2024

I have encountered a similar issue. I think it is caused by accelerate:

https://github.com/huggingface/accelerate/blob/b8c85839531ded28efb77c32e0ad85af2062b27a/src/accelerate/state.py#L257

This code resets `self.distributed_type` to `DistributedType.NO`, and as a result an `MpDeviceLoader` is not returned by Accelerate:

https://github.com/huggingface/accelerate/blob/b8c85839531ded28efb77c32e0ad85af2062b27a/src/accelerate/data_loader.py#L976
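
Paraphrasing the interaction as a hedged sketch (illustrative only, not the verbatim accelerate source; prepare_sketch is a made-up name):

import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
from accelerate.utils import DistributedType

def prepare_sketch(accelerator, dataloader):
    # accelerate wraps the dataloader only while it still reports an
    # XLA/TPU distributed type; after the reset described above the
    # state is DistributedType.NO, so the plain DataLoader comes back
    # and the tpu_spmd_dataloader assertion in transformers fires.
    if accelerator.state.distributed_type != DistributedType.NO:
        return pl.MpDeviceLoader(dataloader, xm.xla_device())
    return dataloader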

@muellerzr
Contributor

Yes indeed, this should be fixed by the patch release we just made. Can you try upgrading accelerate and see? (0.29.2)
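
For the Dockerfile in the reproduction above, one way to make sure the patched release is picked up is to pin it in the install step (a sketch; 0.29.2 is the patch release mentioned above):

RUN pip install -U transformers trl datasets peft "accelerate>=0.29.2"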


github-actions bot commented May 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
