
Cherry pick LLaMA or SDXL to rel-1.16.2 (round 2) #18245

Merged
2 commits merged into rel-1.16.2 from tlwu/rel-1.16.2_llama on Nov 3, 2023

Conversation

tianleiwu
Contributor

Description

Second round of cherry-picking LLaMA and SDXL related changes into the 1.16.2 release.

Motivation and Context

aciddelgado and others added 2 commits November 2, 2023 13:23
Implement the Cutlass memory-efficient attention kernel in the Group Query
Attention operator.

### Motivation and Context
Before this change, the Group Query Attention operator was supported only by
Flash Attention. While that is the most efficient kernel for the operation, it
only supports sm >= 80. The Cutlass memory-efficient attention kernel supports
sm >= 53, allowing us to support a broader range of GPU hardware.
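
As a rough illustration of the dispatch this enables (a minimal sketch, not the operator's internal selection logic; the kernel names and the helper function are hypothetical, and the sm thresholds are the ones quoted above), one could choose a kernel based on the GPU's compute capability:

```python
import torch  # assumes PyTorch with CUDA support is installed


def pick_gqa_kernel(device_index: int = 0) -> str:
    """Choose an attention kernel from the GPU's SM (compute capability).

    Per the description above: Flash Attention requires sm >= 80, while the
    Cutlass memory-efficient attention kernel works down to sm >= 53.
    """
    major, minor = torch.cuda.get_device_capability(device_index)
    sm = major * 10 + minor
    if sm >= 80:
        return "flash_attention"
    elif sm >= 53:
        return "cutlass_memory_efficient_attention"
    return "unsupported"


print(pick_gqa_kernel())
```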
### Description
Support llama-70b model fusion and sharding.

### Motivation and Context
This change enables sharding the llama-70b model and exporting it to ONNX,
since the model is too large for a single GPU. It also fuses the llama-70b
model, whose repeat_kv pattern differs from that of llama-7b and llama-13b
(see the sketch below).
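
The repeat_kv pattern expands the shared key/value heads of grouped-query attention so that every query head has a matching KV head. A minimal sketch of that computation (modeled on the Hugging Face LLaMA implementation, not the exact subgraph this fusion matches; the head counts in the comment are an assumption about llama-70b's configuration):

```python
import torch


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Repeat each key/value head n_rep times.

    Input shape:  (batch, num_kv_heads, seq_len, head_dim)
    Output shape: (batch, num_kv_heads * n_rep, seq_len, head_dim)

    llama-70b uses grouped-query attention (fewer KV heads than query heads),
    so n_rep > 1 there, whereas llama-7b/13b use full multi-head attention
    (n_rep == 1), which is why the fusion pattern differs.
    """
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    expanded = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)
```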
tianleiwu merged commit 70b8cda into rel-1.16.2 on Nov 3, 2023
97 of 99 checks passed
tianleiwu deleted the tlwu/rel-1.16.2_llama branch on November 3, 2023 00:16