Mixtral FastGen Support #4828

Merged
merged 13 commits into master from cholmes/mixtral-fastgen-support on Dec 21, 2023

Conversation

@cmikeh2 (Contributor) commented Dec 16, 2023

Adds support for Mixtral with FastGen. Key features implemented:

  1. Top-2 MoE support
  2. Better support for RoPE thetas (see the sketch below)
  3. The Mistral model implementation
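To make the RoPE-theta item concrete, below is a minimal, hypothetical CUDA sketch of rotary position embedding with a configurable theta base. It is illustrative only and is not the FastGen kernel: the kernel name, the interleaved pairing convention, and the 1e6 Mixtral-style base are assumptions.

// A minimal, hypothetical CUDA sketch of rotary position embedding (RoPE) with a
// configurable theta base. This is NOT the FastGen kernel; the kernel name, the
// interleaved pairing convention, and the 1e6 Mixtral-style base are assumptions.
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>
#include <vector>

__global__ void rope_sketch(float* x, int seq_len, int head_dim, float theta_base)
{
    // One block per token, one thread per rotary pair (x[2i], x[2i+1]).
    // (Some implementations pair x[i] with x[i + head_dim/2] instead.)
    const int token = blockIdx.x;
    const int pair = threadIdx.x;  // 0 .. head_dim/2 - 1
    if (token >= seq_len || pair >= head_dim / 2) return;

    // inv_freq_i = theta_base^(-2i / head_dim); a larger base rotates more slowly.
    const float inv_freq = powf(theta_base, -2.0f * pair / head_dim);
    const float angle = token * inv_freq;
    const float c = cosf(angle), s = sinf(angle);

    float* v = x + token * head_dim + 2 * pair;
    const float v0 = v[0], v1 = v[1];
    v[0] = v0 * c - v1 * s;  // standard 2D rotation of the pair
    v[1] = v0 * s + v1 * c;
}

int main()
{
    const int seq_len = 4, head_dim = 8;
    std::vector<float> h(seq_len * head_dim, 1.0f);

    float* d = nullptr;
    cudaMalloc(&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    // 1e6 is the base Mixtral is generally reported to use; swap in 10000.f
    // for the more common default.
    rope_sketch<<<seq_len, head_dim / 2>>>(d, seq_len, head_dim, 1000000.0f);

    cudaMemcpy(h.data(), d, h.size() * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("token 1, first pair after RoPE: %f %f\n", h[head_dim], h[head_dim + 1]);
    cudaFree(d);
    return 0;
}

With a larger base the per-dimension rotation frequencies drop, which is the knob the "RoPE thetas" item refers to.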

@RezaYazdaniAminabadi (Contributor)

This is amazing. I was also working on this, but I think most of what I have you already added in this PR. Thanks @cmikeh2 :)


} // namespace scatter

- template <typename T, int copyUnroll>
+ template <typename T, int copyUnroll, int N_TOP_K>
Contributor

@cmikeh2, I know you like to generalize this function, but I was wondering if we can have two kernels here, one for top-1 and one for top-k, just so that we can remove some of the complexity added for top-1. What do you think?

@cmikeh2 (Contributor, Author)

Are you observing any slowdown with top-1? The reorganization here was primarily intended to simplify things. Previously, block 0 did a cumsum for the GEMM kernel while the rest of the thread blocks did max reductions. The max reduction is of similar complexity to the cumsum anyway (log(n) steps), and since it's needed on all blocks for the top-N case, it made sense to remove the branch and have a unified code path.
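To illustrate the complexity argument, here is a small, hypothetical CUDA sketch (not the DeepSpeed gating kernel) showing that a warp-wide argmax reduction and a warp-wide inclusive prefix sum both finish in log2(32) = 5 shuffle steps; the kernel name and the single-warp layout are assumptions made for brevity.

// Hypothetical sketch: warp-wide argmax and inclusive cumsum each take
// log2(32) = 5 shuffle steps, which is the complexity point made above.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void warp_argmax_and_cumsum(const float* scores, int* argmax_out, float* cumsum_out)
{
    const int lane = threadIdx.x;  // assumes a single warp of 32 threads
    float val = scores[lane];
    int idx = lane;

    // Butterfly max reduction: 5 steps for 32 lanes.
    for (int offset = 16; offset > 0; offset >>= 1) {
        const float other_val = __shfl_xor_sync(0xffffffff, val, offset);
        const int other_idx = __shfl_xor_sync(0xffffffff, idx, offset);
        if (other_val > val) { val = other_val; idx = other_idx; }
    }

    // Kogge-Stone inclusive prefix sum: also 5 steps.
    float sum = scores[lane];
    for (int offset = 1; offset < 32; offset <<= 1) {
        const float up = __shfl_up_sync(0xffffffff, sum, offset);
        if (lane >= offset) sum += up;
    }

    if (lane == 0) *argmax_out = idx;
    cumsum_out[lane] = sum;
}

int main()
{
    float h_scores[32];
    for (int i = 0; i < 32; ++i) h_scores[i] = float((i * 7) % 32);  // arbitrary expert scores

    float *d_scores, *d_cumsum;
    int* d_argmax;
    cudaMalloc(&d_scores, sizeof(h_scores));
    cudaMalloc(&d_cumsum, sizeof(h_scores));
    cudaMalloc(&d_argmax, sizeof(int));
    cudaMemcpy(d_scores, h_scores, sizeof(h_scores), cudaMemcpyHostToDevice);

    warp_argmax_and_cumsum<<<1, 32>>>(d_scores, d_argmax, d_cumsum);

    int h_argmax = -1;
    cudaMemcpy(&h_argmax, d_argmax, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("argmax lane: %d\n", h_argmax);

    cudaFree(d_scores);
    cudaFree(d_cumsum);
    cudaFree(d_argmax);
    return 0;
}

Both loops run a fixed log2(warp size) iterations, which is why sharing one reduction path between top-1 and top-N costs little extra work.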

Contributor

I haven't, but I will try to do some profiling of this in the next few days. Thanks for the clarification on the changes :)

@mrwyattii (Contributor) left a comment

Thanks @cmikeh2!

@cmikeh2 cmikeh2 requested a review from loadams as a code owner December 19, 2023 23:08
@mrwyattii mrwyattii merged commit c00388a into master Dec 21, 2023
12 of 13 checks passed
@mrwyattii mrwyattii deleted the cholmes/mixtral-fastgen-support branch December 21, 2023 00:05
mrwyattii pushed a commit that referenced this pull request Jan 10, 2024
The Mixtral PR #4828 introduced the positional embedding config class, which is a required
argument of the `make_attn_layer()` function. This forced users to override and duplicate
the `make_attn_layer()` call for new model implementations using RoPE (and it also broke
the Falcon model implementation). This PR:

- refactors the inference transformer base class to avoid code duplication by adding a new
abstract `positional_embedding_config` property (see the sketch after this commit message)
- fixes the Falcon model implementation to use the positional embedding config

The models `llama_v2`, `OPT`, `Mistral 7B`, `Mixtral`, `Falcon`, and `Phi-2` were tested
with this PR.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
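As a rough illustration of the refactor described above, the sketch below shows the pattern in C++ terms: the base class keeps `make_attn_layer()` and pulls positional embedding settings from one abstract `positional_embedding_config` accessor that each model overrides. The class and member names mirror the commit message, but the code is hypothetical and does not reproduce DeepSpeed's actual (Python) implementation.

// Hypothetical C++ analogue of the refactor: the base class owns make_attn_layer()
// and asks subclasses for their positional embedding config through one abstract
// getter, so model implementations no longer override/duplicate make_attn_layer().
#include <cstdio>
#include <memory>

struct PositionalEmbeddingConfig {
    float rope_theta;  // e.g. 10000.0 as a common RoPE default, 1e6 for Mixtral
};

class InferenceTransformerBase {
public:
    virtual ~InferenceTransformerBase() = default;

    // Shared by every model; no longer overridden per model.
    void make_attn_layer() const {
        const PositionalEmbeddingConfig cfg = positional_embedding_config();
        std::printf("building attention layer with rope_theta=%.1f\n", cfg.rope_theta);
    }

protected:
    // The new abstract property: each model only supplies its config.
    virtual PositionalEmbeddingConfig positional_embedding_config() const = 0;
};

class MixtralModel : public InferenceTransformerBase {
protected:
    PositionalEmbeddingConfig positional_embedding_config() const override {
        return {1000000.0f};
    }
};

class FalconModel : public InferenceTransformerBase {
protected:
    PositionalEmbeddingConfig positional_embedding_config() const override {
        return {10000.0f};
    }
};

int main() {
    std::unique_ptr<InferenceTransformerBase> models[] = {
        std::make_unique<MixtralModel>(), std::make_unique<FalconModel>()};
    for (const auto& m : models) m->make_attn_layer();
    return 0;
}

Under this pattern each new model only supplies its own config, and a fix like the Falcon one amounts to returning the right settings from that single accessor.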
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this pull request Feb 17, 2024