
Llama 3 ring attention implementation for inference #21

Open
joshpopelka20gmail opened this issue Jul 25, 2024 · 1 comment

@joshpopelka20gmail

Hope you can help with this. I'm trying to implement ring attention on top of the Llama 3 architecture, and I'm starting with the blockwise parallel transformer piece. My question is when the input sequence should be broken into chunks: (1) after the input is projected to Q, K, and V, or (2) before self-attention in the block?

Any feedback would be much appreciated :)
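For reference, the Q, K, and V projections are per-token linear maps, so they commute with chunking: projecting the full sequence and then splitting into blocks (option 1) gives the same result as projecting each chunk independently. Below is a minimal single-head JAX sketch of the blockwise attention step over already-projected tensors. It assumes `block_size` divides `seq_len` evenly and omits the causal mask and multi-head handling; it is an illustration of the blockwise technique, not the implementation in this repo or in LWM.

```python
import jax.numpy as jnp

def blockwise_attention(q, k, v, block_size):
    """Blockwise attention over already-projected q, k, v of shape
    [seq_len, head_dim]. Each query block iterates over all key/value
    blocks, accumulating softmax statistics so the full attention
    matrix is never materialized."""
    seq_len, dim = q.shape
    scale = 1.0 / jnp.sqrt(dim)
    num_blocks = seq_len // block_size  # assumes even division
    q_blocks = q.reshape(num_blocks, block_size, dim)
    k_blocks = k.reshape(num_blocks, block_size, dim)
    v_blocks = v.reshape(num_blocks, block_size, dim)

    outputs = []
    for i in range(num_blocks):
        qi = q_blocks[i] * scale
        m = jnp.full((block_size,), -jnp.inf)  # running row max
        l = jnp.zeros((block_size,))           # running normalizer
        acc = jnp.zeros((block_size, dim))     # unnormalized output
        for j in range(num_blocks):
            s = qi @ k_blocks[j].T                      # [bs, bs] scores
            m_new = jnp.maximum(m, s.max(axis=-1))
            p = jnp.exp(s - m_new[:, None])
            correction = jnp.exp(m - m_new)             # rescale old stats
            l = l * correction + p.sum(axis=-1)
            acc = acc * correction[:, None] + p @ v_blocks[j]
            m = m_new
        outputs.append(acc / l[:, None])
    return jnp.concatenate(outputs, axis=0)

# Toy usage with hypothetical already-projected tensors:
q = k = v = jnp.ones((8, 4))
out = blockwise_attention(q, k, v, block_size=4)
```

In ring attention the inner loop has the same structure, but each device holds one query block and the key/value blocks are rotated around a ring of devices rather than indexed from local memory.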

@haoliuhl
Owner

haoliuhl commented Aug 4, 2024

Hi, ring attention inference is supported in LWM.
