all_gather raises NotImplementedError when no Accelerator defined in Trainer #5181

Closed
8greg8 opened this issue Dec 18, 2020 · 7 comments
Labels
bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on)

Comments

@8greg8
Contributor

8greg8 commented Dec 18, 2020

🐛 Bug

When no Accelerator is defined in the Trainer, the all_gather function in LightningModule raises NotImplementedError.

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1VPEIaQ-aN5KVA70VtvGk24AVkTYPvMoY?usp=sharing

To Reproduce
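Since the Colab notebook is not reproduced inline, here is a minimal BoringModel-style sketch of the setup (the model, dataset, and Trainer arguments below are illustrative assumptions, not the exact notebook contents):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        loss = self.layer(batch[0]).sum()
        # With no accelerator backend configured (plain CPU Trainer),
        # this call raised NotImplementedError instead of returning a tensor.
        self.all_gather(loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    train = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
    trainer = pl.Trainer(max_epochs=1, limit_train_batches=2)  # no accelerator/gpus set
    trainer.fit(BoringModel(), train)
```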

Expected behavior

  • all_gather should return a value instead of raising NotImplementedError.

Environment

  • CUDA:
    • GPU:
      • Tesla P100-PCIE-16GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.19.4
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.2rc1
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

@8greg8 8greg8 added the bug (Something isn't working) and help wanted (Open to be worked on) labels Dec 18, 2020
@awaelchli
Contributor

FYI, your Colab notebook is not public; it can't be accessed :)

@awaelchli awaelchli added the distributed Generic distributed-related topic label Dec 20, 2020
@awaelchli awaelchli added this to the 1.1.x milestone Dec 20, 2020
@8greg8
Contributor Author

8greg8 commented Dec 21, 2020

@awaelchli sorry, my bad. I corrected the notebook and changed the link in the bug description. You should be able to access it now.

@tchaton tchaton self-assigned this Dec 21, 2020
@tchaton tchaton mentioned this issue Dec 21, 2020
@awaelchli
Contributor

awaelchli commented Jan 15, 2021

@tchaton we should provide implementations of all_gather on CPU and single GPU to make code device-agnostic. Does that make sense?
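For context, the suggested single-process fallback would look roughly like the sketch below (a hypothetical illustration, not the actual Lightning implementation): with a world size of 1 there is nothing to collect from other processes, so "gathering" just adds a leading world-size dimension.

```python
import torch


def all_gather_single_process(tensor: torch.Tensor) -> torch.Tensor:
    # Hypothetical CPU / single-GPU fallback: return the tensor with a
    # leading dimension of size 1, matching the shape a DDP all_gather
    # would produce with world_size == 1.
    return tensor.unsqueeze(0)
```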

@Borda Borda modified the milestones: 1.1.x, 1.2 Feb 8, 2021
@edenlightning edenlightning modified the milestones: 1.2, 1.2.x Feb 8, 2021
@awaelchli awaelchli self-assigned this Feb 25, 2021
@tchaton
Contributor

tchaton commented Mar 2, 2021

Hey @awaelchli,

Yes, we should.

Best,
T.C

@tchaton
Contributor

tchaton commented Mar 2, 2021

Dear @8greg8,

I checked the notebook and couldn't reproduce the bug.

@awaelchli I have checked the code, and it seems we already support all_gather on CPU and single GPU.
However, we don't support gradients for TPU. I opened an issue for it.

Best,
T.C
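For reference, LightningModule.all_gather exposes a sync_grads flag for the gradient-preserving gather mentioned above. A minimal sketch, assuming the BoringModel-style module from the reproduction above (TPU support for this was tracked in the separate issue):

```python
# Inside a LightningModule (e.g. the BoringModel sketch above):
def validation_step(self, batch, batch_idx):
    preds = self.layer(batch[0])
    # sync_grads=True keeps the gather differentiable on supported backends;
    # per the comment above, this did not yet work on TPU at the time.
    gathered = self.all_gather(preds, sync_grads=True)
    return gathered.mean()
```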

@awaelchli
Contributor

Yes, because that came automatically with the accelerator refactor, and the discussion here predates its introduction.
Today, the Trainer always has an accelerator defined. This issue can be closed if you agree.
