Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PoC: Accelerator refactor [wip] [skip ci] #5616

Closed
wants to merge 168 commits into from
Closed

PoC: Accelerator refactor [wip] [skip ci] #5616

wants to merge 168 commits into from

Conversation

justusschock
Copy link
Member

@justusschock justusschock commented Jan 22, 2021

What does this PR do?

This PR is a revamp of #5385 since I messed up with git.
Closes #5385

Fixes #4510

This PR separates the Accelerator (Hardware Part) from the actual different training routines.

Workflow actions:

  • Move current accelerators to legacy - change imports (still use them for training) >> Refactor: 1/n legacy accelerators and plugins  #5645
  • Move current accelerator connector to legacy (still use it for training) and add new connector (necessary for tests)
  • Add new accelerators (still use legacy for training)
  • Add new plugins one by one (still use legacy for training)
  • Add missing logic
  • Drop Legacy for new backends. (Not sure if keeping them for next release).

Remaining TODOs:

  • TPUAccelerator (Justus)
  • DDP2 Plugin (Adrian)
  • Shared Training Plugin (Sean/Justus)
  • RPC Plugin (Justus)
  • Add Layers for parallel: SingleProcessMultipleDevice (SPMD), SingleProcessSingleDevice (SPSD) (Justus)
  • Rebase on release branch
  • Testing DDP2 and DDP Slurm
  • Remove old plugins (pl/plugins/old)
  • Make Tuner work (requires setting some attrs through trainer) (Adrian)
  • Port the left-over functions (like block_backward_sync and prepare_for_backward) from old DDPPlugin to new to avoid performance hits (Adrian/Justus)

So far this PR was co-authored with @awaelchli !

cc @awaelchli @tchaton @SeanNaren who will likely work on this!

Slides for motivation and high-level overview

List of PRs to look out for (when rebasing, code we need to manually copy over to new files):
#5221, #5195, #5300, #5388

@justusschock justusschock added feature Is an improvement or enhancement refactor labels Jan 22, 2021
@justusschock justusschock added this to the 1.2 milestone Jan 22, 2021
@Borda Borda changed the title Accelerator refactor sharded Accelerator refactor [wip] Jan 22, 2021
@Borda
Copy link
Member

Borda commented Jan 22, 2021

@justusschock took me almost two hours to rebase and resolve conflicts on the way, but still, can you check that all is fine?

@awaelchli awaelchli changed the title Accelerator refactor [wip] Accelerator refactor [wip] [skip ci] Jan 25, 2021
@awaelchli
Copy link
Contributor

@Borda Thank you.
I added some commits to make the tests run again.
What is missing is the device_type and distrib_type refactor changes that you recently made. Could you help integrate them?
I can also assist

@Borda
Copy link
Member

Borda commented Jan 25, 2021

What is missing is the device_type and distrib_type refactor changes that you recently made. Could you help integrate them?
I can also assist

Happy to help =)

@Lightning-AI Lightning-AI deleted a comment from pep8speaks Jan 25, 2021
@justusschock justusschock changed the base branch from release/1.2-dev to accelerator-refactor-poc February 2, 2021 10:25
@justusschock justusschock marked this pull request as ready for review February 2, 2021 10:26
@mergify mergify bot removed the has conflicts label Feb 2, 2021
@justusschock justusschock changed the base branch from accelerator-refactor-poc to release/1.2-dev February 2, 2021 10:30
@justusschock justusschock mentioned this pull request Feb 2, 2021
16 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Is an improvement or enhancement let's do it! approved to implement refactor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants