
feat: Distributed data parallel training support #2464

Merged: 31 commits into master from distributed on Aug 19, 2024

Conversation

@askorupka (Collaborator) commented Jun 22, 2024

Support for distributed data-parallel training, inspired by LuxDL/Lux.jl#500.
This PR is still a work in progress; the PR checklist below will be completed as the work proceeds.

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable

Both MPIBackend and NCCLBackend are supported.

The module can be used as in the example below. For distributed runs, execute mpiexecjl --project=@. -n 3 julia distributed_MPI.jl from your terminal, where distributed_MPI.jl contains the script below (feel free to also use NCCLBackend):

using Flux, MPI, NCCL, CUDA
using Random
using Optimisers
using Zygote
using Statistics

CUDA.allowscalar(false)

# Initialise the MPI backend and query this process's rank.
DistributedUtils.initialize(MPIBackend)
backend = DistributedUtils.get_distributed_backend(MPIBackend)
rank = DistributedUtils.local_rank(backend)

model = Chain(Dense(1 => 256, tanh), Dense(256 => 1)) |> gpu

# Broadcast the model parameters from the root rank to all workers.
model = DistributedUtils.synchronize!!(backend, DistributedUtils.FluxDistributedModel(model); root=0)

x = rand(Float32, 1, 16) |> gpu
y = x .^ 3

# Wrap the optimiser for distributed training and synchronize its state from the root rank.
opt = DistributedUtils.DistributedOptimizer(backend, Optimisers.Adam(0.001f0))
st_opt = Optimisers.setup(opt, model)
st_opt = DistributedUtils.synchronize!!(backend, st_opt; root=0)

loss(model) = mean((model(x) .- y) .^ 2)

# One warm-up gradient step before the training loop.
g_ = gradient(loss, model)[1]
Optimisers.update!(st_opt, model, g_)

for epoch in 1:100
  global model, st_opt
  l, back = Zygote.pullback(loss, model)
  println("Epoch $epoch: Loss $l")
  g = back(one(l))[1]
  st_opt, model = Optimisers.update(st_opt, model, g)
end
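
To use the NCCL backend instead, only the backend initialization changes; the rest of the script stays the same. A minimal sketch, assuming the same API as above:

# Hedged sketch: initialise the NCCL backend instead of MPI.
DistributedUtils.initialize(NCCLBackend)
backend = DistributedUtils.get_distributed_backend(NCCLBackend)
rank = DistributedUtils.local_rank(backend)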

@CarloLucibello (Member)

I suggest removing NCCL from this PR and just focusing on MPI

Review threads on distributed.jl (resolved)
@askorupka (Collaborator, Author) commented Jun 30, 2024

@CarloLucibello I was able to move this forward according to your suggestions, and the MPI training example works 🎉 (I still need to do some cleanup, though).
Details are in the comments above; it may be useful for you to have a look.

@askorupka (Collaborator, Author) commented Jul 7, 2024

Update: both MPI and NCCL work.
Please run mpiexecjl --project=@. -n 3 julia distributed_MPI.jl or mpiexecjl --project=@. -n 3 julia distributed_NCCL.jl, respectively, from your terminal.

Still in draft state; some work remains, though it should be easier from now on:

  • examples
  • docs

Tests added, conflicts resolved.

@ToucheSir (Member) left a comment

This PR is a real tour de force, great work!

Assuming #2464 (comment) means you're starting to wrap things up, here are a couple of heads-ups so they don't come as a surprise during the non-draft PR review.

Review threads on Project.toml and ext/FluxMPINCCLExt/FluxMPINCCLExt.jl (resolved)
@CarloLucibello marked this pull request as ready for review July 24, 2024 07:24
@askorupka (Collaborator, Author)

Docs added, so the PR checklist is complete. Ready for review 🚀

@askorupka requested a review from CarloLucibello July 24, 2024 20:42
Review threads on docs/src/guide/gpu.md (resolved)
askorupka and others added 7 commits August 3, 2024 21:43 (each co-authored by Carlo Lucibello <carlo.lucibello@unibocconi.it>)
@CarloLucibello (Member)

The new test files should be included in test/runtests.jl.
As with other extensions, we should define flags in that file, such as

ENV["FLUX_TEST_DISTRIBUTED_MPI"] = "true"
ENV["FLUX_TEST_DISTRIBUTED_NCCL"] = "true"

and run the tests only conditionally on those flags. We should test the MPI and NCCL backends separately.

For the time being, we won't run these tests on CI because we would have to set up the MPI and NCCL infrastructure. We can figure out how to test on CI in a follow-up PR.
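
A minimal sketch of how this conditional inclusion could look in test/runtests.jl; the test file paths and testset names below are hypothetical, and the flags are assumed to default to "false":

using Test

if get(ENV, "FLUX_TEST_DISTRIBUTED_MPI", "false") == "true"
    # Only run the MPI tests when explicitly enabled, since CI has no MPI setup yet.
    @testset "DistributedUtils: MPI" begin
        include("ext_distributed/mpi.jl")   # hypothetical test file
    end
end

if get(ENV, "FLUX_TEST_DISTRIBUTED_NCCL", "false") == "true"
    # Likewise for NCCL, which additionally needs CUDA-capable hardware.
    @testset "DistributedUtils: NCCL" begin
        include("ext_distributed/nccl.jl")  # hypothetical test file
    end
end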

Review threads on Project.toml (resolved)
@CarloLucibello (Member)

The test failure is unrelated and likely due to Enzyme. I opened an issue at EnzymeAD/Enzyme.jl#1738.

@CarloLucibello merged commit d1ff714 into master Aug 19, 2024
3 of 9 checks passed
@pxl-th (Member) commented Aug 19, 2024

Since Enzyme explicitly installs CUDA when running tests, we should avoid running them on the AMDGPU/Metal CI jobs until Enzyme gains support for those backends or can switch to them properly.

@mcabbott deleted the distributed branch September 20, 2024 14:51
@kishore-nori commented Sep 26, 2024

When updating to the latest Flux.jl, v0.14.20, I get the following error, which wasn't there for v0.14.19. I am on Julia 1.10.5. I have tested it, and this is a precompilation error:

ERROR: LoadError: ArgumentError: Package FluxMPIExt does not have CUDA in its dependencies:
- You may have a partially installed environment. Try `Pkg.instantiate()`
  to ensure all packages in the environment are installed.
- Or, if you have FluxMPIExt checked out for development and have
  added CUDA as a dependency but haven't updated your primary
  environment's manifest file, try `Pkg.resolve()`.
- Otherwise you may need to report an issue with FluxMPIExt 

I think CUDA and AMDGPU should be mentioned here:

FluxMPIExt = "MPI"

cc @mcabbott

@kishore-nori commented Oct 3, 2024

Update to the above: the precompilation error happens only when both Flux.jl (0.14.20) and MPI.jl are in the environment; if MPI.jl is not there, there is no problem. The precompilation error complains about the absence of CUDA, as shown above, even if CUDA is not in the environment. So I think it is due to the missing deps for the FluxMPIExt extension.

And CUDA (and AMDGPU) are being used in FluxMPIExt.
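
For context, a hedged sketch of the kind of extension module header that would trigger this error when the GPU packages are not declared as dependencies of the extension (the module contents are assumed, not copied from the repository):

module FluxMPIExt

using Flux, MPI
using CUDA    # not declared for the extension -> "does not have CUDA in its dependencies"
using AMDGPU  # likewise

# ... extension code ...

end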

@askorupka (Collaborator, Author) commented Oct 3, 2024

Hi @kishore-nori, I've managed to replicate the issue; thanks for reporting it.
I'm working on a fix so that FluxMPIExt no longer requires CUDA, as we want to avoid adding too many deps.
