
Cherry pick LLaMA or SDXL to rel-1.16.2 (round 2) #18245

Merged
2 commits merged into rel-1.16.2 from tlwu/rel-1.16.2_llama on Nov 3, 2023

Conversation

tianleiwu
Contributor

Description

Second round of cherry-picking LLaMA and SDXL related changes into the 1.16.2 release.

Motivation and Context

aciddelgado and others added 2 commits November 2, 2023 13:23
Implement the Cutlass memory-efficient attention kernel in the Group Query
Attention operator.

### Motivation and Context
Before this change, the Group Query Attention operator was supported only by
Flash Attention. While that is the most efficient kernel for the operation, it
only supports sm >= 80. The Cutlass memory-efficient attention kernel supports
sm >= 53, allowing us to support a broader range of GPU hardware.
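
As a rough illustration of the dispatch this enables (a minimal sketch, not the operator's internal selection logic; the kernel names and the helper function are hypothetical, and the sm thresholds are the ones quoted above), one could choose a kernel based on the GPU's compute capability:

```python
import torch  # assumes PyTorch with CUDA support is installed


def pick_gqa_kernel(device_index: int = 0) -> str:
    """Choose an attention kernel from the GPU's SM (compute capability).

    Per the description above: Flash Attention requires sm >= 80, while the
    Cutlass memory-efficient attention kernel works down to sm >= 53.
    """
    major, minor = torch.cuda.get_device_capability(device_index)
    sm = major * 10 + minor
    if sm >= 80:
        return "flash_attention"
    elif sm >= 53:
        return "cutlass_memory_efficient_attention"
    return "unsupported"


print(pick_gqa_kernel())
```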
### Description
Support llama-70b model fusion and sharding.

### Motivation and Context
This change enables sharding the llama-70b model and exporting it to ONNX,
since the model is too large for a single GPU. It also fuses the llama-70b
model, whose repeat_kv pattern differs from that of llama-7b and llama-13b
(see the sketch below).
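
The repeat_kv pattern expands the shared key/value heads of grouped-query attention so that every query head has a matching KV head. A minimal sketch of that computation (modeled on the Hugging Face LLaMA implementation, not the exact subgraph this fusion matches; the head counts in the comment are an assumption about llama-70b's configuration):

```python
import torch


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each key/value head n_rep times.

    Input shape:  (batch, num_kv_heads, seq_len, head_dim)
    Output shape: (batch, num_kv_heads * n_rep, seq_len, head_dim)

    llama-70b uses grouped-query attention (fewer KV heads than query heads),
    so n_rep > 1 there, whereas llama-7b/13b use full multi-head attention
    (n_rep == 1), which is why the fusion pattern differs.
    """
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)
```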
tianleiwu merged commit 70b8cda into rel-1.16.2 on Nov 3, 2023
97 of 99 checks passed
tianleiwu deleted the tlwu/rel-1.16.2_llama branch on November 3, 2023 00:16