Mixtral FastGen Support #4828
Conversation
This is amazing, I was also working on this, but I think most of what I have you already added in this PR. Thanks @cmikeh2 :)
Diff context:

} // namespace scatter

- template <typename T, int copyUnroll>
+ template <typename T, int copyUnroll, int N_TOP_K>
@cmikeh2, I know you like to generalize this function, but I was wondering if we could have two kernels here, one for top-1 and one for top-k, just so that we can remove some of the complexity added for top-1. What do you think?
Are you observing any slowdown with top-1? The re-org intention here was primarily to simplify things. Previously, block 0 did a cumsum for the GEMM kernel while the rest of the thread blocks did max reductions. The max reduction is of similar complexity to the cumsum anyway (log(n) steps), and since it's necessary on all blocks for the top-N case anyway, I thought it made sense to remove the branch from the code and have a unified path.
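For illustration, here is a minimal sketch of the unified path being described. This is hypothetical code, not the PR's actual kernel: the helper name `warp_top_k`, its signature, and the warp-shuffle approach are assumptions, and it assumes at most 32 experts (true for Mixtral's 8).

```cuda
#include <cfloat>

// Hypothetical device helper (not from the DeepSpeed source): one warp selects
// the N_TOP_K highest expert scores for a single token by running N_TOP_K
// argmax reductions, masking out each winner before the next round.
template <int N_TOP_K>
__device__ void warp_top_k(const float* scores,  // [n_experts] gate scores for one token
                           int n_experts,        // assumed <= 32
                           int* top_experts,     // [N_TOP_K] output expert ids
                           float* top_scores)    // [N_TOP_K] output scores
{
    const int lane = threadIdx.x & 31;

    // Each lane owns at most one expert score.
    float local_score = (lane < n_experts) ? scores[lane] : -FLT_MAX;
    int local_expert = (lane < n_experts) ? lane : -1;

#pragma unroll
    for (int k = 0; k < N_TOP_K; ++k) {
        float best_score = local_score;
        int best_expert = local_expert;

        // log2(32) = 5 shuffle steps: the same O(log n) cost whether this is
        // the old top-1 max reduction or one round of top-k selection.
        for (int offset = 16; offset > 0; offset >>= 1) {
            float other_score = __shfl_down_sync(0xffffffff, best_score, offset);
            int other_expert = __shfl_down_sync(0xffffffff, best_expert, offset);
            if (other_score > best_score) {
                best_score = other_score;
                best_expert = other_expert;
            }
        }

        // Lane 0 now holds the k-th largest remaining score; broadcast it.
        best_score = __shfl_sync(0xffffffff, best_score, 0);
        best_expert = __shfl_sync(0xffffffff, best_expert, 0);
        top_scores[k] = best_score;
        top_experts[k] = best_expert;

        // Mask out the winner so the next iteration selects the runner-up.
        if (local_expert == best_expert) {
            local_score = -FLT_MAX;
            local_expert = -1;
        }
    }
}
```

With `N_TOP_K == 1` the selection loop runs exactly once and reduces to the warp max reduction the old top-1 path performed, which is why the unified path shouldn't add meaningful work for top-1.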
I haven't, but I will try to do some profiling of this in the next few days. Thanks for the clarification on the changes :)
Thanks @cmikeh2!
Resolved review threads:
- deepspeed/inference/v2/kernels/ragged_ops/linear_blocked_kv_rotary/blocked_kv_rotary.cpp
- deepspeed/inference/v2/kernels/ragged_ops/moe_gather/moe_gather.cu (outdated)
The Mixtral PR #4828 introduced the positional embedding config class, which is a required argument of the `make_attn_layer()` function. This forced users to override and duplicate the `make_attn_layer()` call for new model implementations using RoPE (and also broke the Falcon model implementation). This PR:
- refactors the inference transformer base class to avoid code duplication by adding a new abstract `positional_embedding_config` property
- fixes the Falcon model implementation to use the positional embedding config.

The models llama_v2, OPT, Mistral 7B, Mixtral, Falcon, and Phi-2 were tested with this PR.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Adds support for Mixtral with FastGen. Key features implemented:
1. Top-2 MoE support
2. Better support for RoPE thetas
3. The Mixtral model implementation

Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
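As a reference for feature 1 above, here is a small host-side sketch of Mixtral-style top-2 routing. It is hypothetical code, not taken from the FastGen kernels: the struct and function names are made up, and the exact normalization in the real implementation may differ. The idea is that the router picks the two largest expert logits per token and softmaxes over just those two to produce the mixing weights.

```cpp
#include <cmath>
#include <utility>

// Per-token routing decision: two expert ids and their mixing weights.
struct Top2Route {
    int expert[2];
    float weight[2];  // sums to 1
};

// Select the two highest router logits (requires n_experts >= 2) and apply a
// softmax over just those two logits.
inline Top2Route route_top2(const float* logits, int n_experts)
{
    int e0 = 0, e1 = 1;
    if (logits[e1] > logits[e0]) std::swap(e0, e1);
    for (int e = 2; e < n_experts; ++e) {
        if (logits[e] > logits[e0]) { e1 = e0; e0 = e; }
        else if (logits[e] > logits[e1]) { e1 = e; }
    }

    // Softmax over the two selected logits only (shift by the max for stability).
    float m = logits[e0];
    float w0 = std::exp(logits[e0] - m);
    float w1 = std::exp(logits[e1] - m);
    float denom = w0 + w1;
    return {{e0, e1}, {w0 / denom, w1 / denom}};
}
```

Conceptually, the fused gating/scatter kernels discussed in the review thread above produce the same per-token (expert id, weight) pairs on the device.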