This PR adds dropless MoE support using the Grouped GEMM implementation in megablocks.
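For context, a grouped GEMM applies a separate weight matrix to each expert's contiguous block of tokens in a single fused kernel. Below is an unoptimized PyTorch reference of that computation only; the names and shapes are illustrative and are not the megablocks or grouped_gemm API.

```python
import torch

def grouped_gemm_reference(x, expert_weights, tokens_per_expert):
    """Reference semantics of a grouped GEMM: one matmul per expert,
    applied to that expert's contiguous slice of (pre-sorted) tokens.

    x:                 [num_tokens, hidden]    tokens sorted by expert
    expert_weights:    [num_experts, hidden, ffn]
    tokens_per_expert: [num_experts]           row counts per expert
    """
    outputs, start = [], 0
    for e, count in enumerate(tokens_per_expert.tolist()):
        outputs.append(x[start:start + count] @ expert_weights[e])
        start += count
    return torch.cat(outputs, dim=0)
```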
Features
Unlike the legacy DeepSpeed MoE implementation, which uses the data-parallel groups for expert parallelism, this implementation parallelizes the experts over the model-parallel group and so avoids the problems that come with tying expert parallelism to the data-parallel groups (a rough sketch of the partitioning is shown below).
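For illustration only, here is a minimal sketch of sharding experts over a model-parallel process group, assuming torch.distributed and an even split; the function name and layout are hypothetical and not taken from this PR.

```python
import torch.distributed as dist

def local_expert_slice(num_experts, mp_group):
    """Which experts this rank owns when experts are sharded over the
    model-parallel group (hypothetical layout, for illustration only)."""
    world_size = dist.get_world_size(group=mp_group)
    rank = dist.get_rank(group=mp_group)
    assert num_experts % world_size == 0, "experts must divide the MP world size"
    per_rank = num_experts // world_size
    return range(rank * per_rank, (rank + 1) * per_rank)
```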
Clarified the arguments so it is obvious which ones are only required for the token-dropping DeepSpeed MoE.
Uses Sinkhorn routing by default and supports top-k routing with k >= 1 (see the sketch below).
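As a reference for the routing default, here is a minimal sketch of Sinkhorn normalization of router logits followed by top-k selection; it is illustrative and not necessarily the exact routine used in this PR.

```python
import torch

def sinkhorn(logits, n_iters=8, eps=1e-8):
    """Sinkhorn normalization of router logits (illustrative sketch):
    alternately rescale the token and expert marginals of exp(logits)
    so the scores are roughly balanced across experts."""
    cost = torch.exp(logits)                       # [tokens, experts]
    d_tok = torch.ones(cost.size(0), device=cost.device, dtype=cost.dtype)
    d_exp = torch.ones(cost.size(1), device=cost.device, dtype=cost.dtype)
    for _ in range(n_iters):
        d_exp = 1.0 / (eps + torch.einsum("t,te->e", d_tok, cost))
        d_tok = 1.0 / (eps + torch.einsum("e,te->t", d_exp, cost))
    return d_tok.unsqueeze(1) * cost * d_exp.unsqueeze(0)

# Usage sketch: balance the scores, then pick experts for any k >= 1.
# scores = sinkhorn(router_logits)
# top_vals, top_experts = torch.topk(scores, k, dim=-1)
```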
Testing
Tested pipeline-parallel sizes [3, 2, 1] and model-parallel sizes [1, 2, 4, 8] on Ampere GPUs.
Notes
Added megablocks and grouped_gemm to the dependencies. It might be desirable to pull some of the kernels in directly, as is done in NVIDIA Megatron-Core.