Megablocks-based MoE #1197

Closed

DayOfThePenguin
Contributor

This PR adds dropless MoE support using the Grouped GEMM implementation in megablocks.
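
To illustrate what "dropless" means here, below is a minimal pure-Python sketch (not the megablocks code). It assumes each token has already been assigned an expert; the helper names `group_tokens_by_expert` and `drop_over_capacity` are hypothetical. A grouped GEMM can consume variable-sized per-expert batches, so no token has to be discarded, whereas a capacity-based scheme drops the overflow.

```python
def group_tokens_by_expert(expert_ids, n_experts):
    """Dropless grouping: return (order, counts).

    order  - token indices sorted by assigned expert
    counts - number of tokens per expert (the variable batch sizes
             a grouped GEMM would be given)
    """
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1
    # sorted() is stable, so token order is preserved within each expert
    order = sorted(range(len(expert_ids)), key=lambda t: expert_ids[t])
    return order, counts


def drop_over_capacity(expert_ids, n_experts, capacity):
    """Token-dropping baseline: keep at most `capacity` tokens per expert."""
    kept, used = [], [0] * n_experts
    for t, e in enumerate(expert_ids):
        if used[e] < capacity:
            used[e] += 1
            kept.append(t)
    return kept


expert_ids = [0, 2, 0, 0, 1, 0]  # hypothetical routing result for 6 tokens
order, counts = group_tokens_by_expert(expert_ids, n_experts=3)
kept = drop_over_capacity(expert_ids, n_experts=3, capacity=2)
```

In this example the dropless path keeps all six tokens (four of them in expert 0's batch), while the capacity-2 baseline discards two of the tokens routed to expert 0.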

Features

Unlike the legacy DeepSpeed MoE implementation, which uses the data parallel groups for expert parallelism, this implementation uses the model parallel group to parallelize the experts. This avoids the following problems:

  • Distributing the experts across data parallel groups incurs inter-node communication for the forward pass through a single layer.
  • MoE + pipeline parallelism is very complicated to reason about when the MoE weights are distributed across data parallel groups, and DeepSpeed doesn't natively support it.
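
The inter-node point can be made concrete with a small sketch. It assumes a Megatron-style rank layout in which consecutive ranks form a model parallel group, no pipeline parallelism, and 8 GPUs per node. These are assumptions for illustration, not something this PR specifies, and the function names are hypothetical.

```python
def model_parallel_groups(world_size, mp_size):
    """Consecutive ranks form one model parallel group (assumed layout)."""
    return [list(range(start, start + mp_size))
            for start in range(0, world_size, mp_size)]


def data_parallel_groups(world_size, mp_size):
    """Ranks holding the same model shard form one data parallel group."""
    return [list(range(offset, world_size, mp_size))
            for offset in range(mp_size)]


def nodes_spanned(group, gpus_per_node):
    """Number of distinct nodes that the ranks of a group live on."""
    return len({rank // gpus_per_node for rank in group})


WORLD_SIZE, MP_SIZE, GPUS_PER_NODE = 16, 8, 8  # two nodes, assumed sizes
mp_groups = model_parallel_groups(WORLD_SIZE, MP_SIZE)
dp_groups = data_parallel_groups(WORLD_SIZE, MP_SIZE)

mp_span = [nodes_spanned(g, GPUS_PER_NODE) for g in mp_groups]
dp_span = [nodes_spanned(g, GPUS_PER_NODE) for g in dp_groups]
```

Under these assumptions every model parallel group stays on a single node, while every data parallel group spans both nodes, so routing tokens to experts over the data parallel group would cross the node boundary.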

Clarified the arguments to show which ones are required only for the token-dropping DeepSpeed MoE.

Uses Sinkhorn routing by default and supports k >= 1.
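
For readers unfamiliar with Sinkhorn routing, here is a minimal pure-Python sketch of the general technique, not the code in this PR. It alternately rescales the columns and rows of `exp(logits)` so that every expert receives roughly the same total weight, then picks the top-k experts per token from the balanced matrix. The function names, the iteration count, and the example logits are assumptions for illustration.

```python
import math


def sinkhorn(logits, n_iters=50):
    """Balance a [tokens x experts] score matrix by alternating normalization."""
    n_tokens, n_experts = len(logits), len(logits[0])
    # Subtract the per-row max before exponentiating for numerical stability.
    scores = [[math.exp(x - max(row)) for x in row] for row in logits]
    target = n_tokens / n_experts  # weight each expert should receive
    for _ in range(n_iters):
        # Column step: rescale so every expert's column sums to `target`.
        col_sums = [sum(scores[t][e] for t in range(n_tokens))
                    for e in range(n_experts)]
        scores = [[scores[t][e] * target / col_sums[e]
                   for e in range(n_experts)] for t in range(n_tokens)]
        # Row step: rescale so every token's row sums to 1.
        scores = [[v / sum(row) for v in row] for row in scores]
    return scores


def route_top_k(logits, k=1):
    """Pick the k highest-scoring experts per token after balancing."""
    balanced = sinkhorn(logits)
    return [sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:k]
            for row in balanced]


# Hypothetical router logits: every token prefers expert 0 before balancing.
logits = [[2.0, 1.0], [2.0, 0.5], [1.5, 1.0], [3.0, 0.0]]
balanced = sinkhorn(logits)
assignments = route_top_k(logits, k=1)
```

A plain argmax over these logits would send all four tokens to expert 0; after balancing, both experts receive tokens.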

Testing

Tested with pipeline parallel (PP) sizes [3, 2, 1] and model parallel (MP) sizes [1, 2, 4, 8] on Ampere GPUs.

Notes

Added megablocks and grouped_gemm to the dependencies. It might be desirable to pull some of the kernels in directly, as NVIDIA megatron-core does.

Contributor

yang commented Mar 28, 2024

👍

Just wanted to jump in with some quick clarifications! DS doesn't actually necessitate using DP groups for expert parallelism / as your EP groups. You can choose to do so if you want - these are just different configurations of the parallelism, which generalizes over these arrangements.

So if you want to use the model/tensor parallel groups to parallelize your experts (and avoid DP shuffling), you can do so just by setting the DEP size to be equal to your DP size (rather than DEP > DP). Then set EP groups = TP groups exactly. It's one option (you have the degrees of freedom to choose).

(You can also choose whether you want expert tensor parallelism or not, which is another degree of freedom.)


3 participants