Integrate chunked prefill into t3k Llama3-70B #15921

cglagovichTT · 2024-12-11T18:42:13Z

Ticket

Problem description

Llama3-70B on T3K supports up to 128k context length with batch=1. However, the maximum prompt length is limited to 32k. This means the full 128k context (prefill len + decode len) can be used only if prefill len <= 32k.

This constraint is caused by the fact that prefill activations for long sequences become large, leading to OOM on device DRAM. For example, the activations are 128k * 8k * 2B = 2 GB for input len 128k.
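The arithmetic above can be checked with a short sketch. It assumes "8k" means a hidden dimension of 8192 and "2B" means two bytes per element; the function name `activation_bytes` is hypothetical and is not part of the codebase.

```python
GIB = 1024 ** 3


def activation_bytes(seq_len: int, hidden_dim: int = 8192, bytes_per_elem: int = 2) -> int:
    """Size of one [seq_len, hidden_dim] activation tensor in bytes.

    hidden_dim=8192 and bytes_per_elem=2 are assumptions taken from the
    "128k * 8k * 2B" figure in the description.
    """
    return seq_len * hidden_dim * bytes_per_elem


full_prompt = activation_bytes(128 * 1024)  # whole 128k prompt prefilled at once
one_chunk = activation_bytes(32 * 1024)     # a single 32k chunk

print(f"128k prompt: {full_prompt / GIB} GiB")
print(f"32k chunk:   {one_chunk / GIB} GiB")
```

Under these assumptions a 32k chunk needs a quarter of the activation memory of the full 128k prompt.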

What's changed

The solution to this problem is to implement chunked prefill. Given some prompt_len and a chunk_size (<= 32k), the prompt is split into prompt_len / chunk_size chunks, which are prefilled one after another. Because only one chunk is processed at a time, the activations are bounded by chunk_size rather than prompt_len, which should solve the OOM error.
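As a rough illustration of the splitting step, the sketch below computes the token ranges that each prefill iteration would cover. The helper `chunk_ranges` is hypothetical, and handling a shorter final chunk is an assumption: the description does not say whether the real implementation pads the prompt or requires prompt_len to be a multiple of chunk_size.

```python
from typing import List, Tuple


def chunk_ranges(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token ranges covering the prompt.

    Hypothetical helper: the last range may be shorter than chunk_size,
    which is an assumption rather than documented behavior.
    """
    if prompt_len < 0 or chunk_size <= 0:
        raise ValueError("prompt_len must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(start + chunk_size, prompt_len))
        for start in range(0, prompt_len, chunk_size)
    ]


# A 128k prompt with 32k chunks is prefilled in four iterations.
for start, end in chunk_ranges(128 * 1024, 32 * 1024):
    print(f"prefill tokens [{start}, {end})")
```

Each iteration would attend to the KV cache entries written by earlier chunks, which is the role of the chunked attention kernel mentioned in the PR.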

This PR uses the new chunked_scaled_dot_product_attention kernel in Llama to enable chunked prefill. It modifies the attention and model tests, and creates a new test in the T3K demo pipeline which tests chunked prefill as used by llama_generation, which is the entrypoint that vLLM will use.

Note that this functionality was added to the "old" Llama codebase since it was an urgent request. I expect we will be adding chunked prefill to the llama family folder sometime soon.

Checklist

All post commit https://github.com/tenstorrent/tt-metal/actions/runs/12303936776
T3K pipelines https://github.com/tenstorrent/tt-metal/actions/runs/12303156746
- ttnn test fails but this PR should not have affected it
- Kicking T3K pipelines again after some changes https://github.com/tenstorrent/tt-metal/actions/runs/12354567605
- Many T3K pipelines are failing, but these changes only affect old-codebase llama3-70b, whose pipelines are green.

models/demos/t3000/llama2_70b/tt/llama_attention_optimized.py

models/demos/t3000/llama2_70b/tt/llama_model_optimized.py

models/demos/t3000/llama2_70b/tests/test_llama_model_t3000.py

models/demos/t3000/llama2_70b/tests/test_llama_attention.py

models/demos/t3000/llama2_70b/tests/test_llama_model.py

models/demos/t3000/llama2_70b/tt/llama_generation.py

skhorasganiTT self-requested a review December 11, 2024 19:26

skhorasganiTT reviewed Dec 12, 2024

cglagovichTT added 7 commits December 12, 2024 18:48

#0: Add chunked prefill to llama attention

5b9d314

#0: Add chunked prefill to model and llama_generation

f76bed7

#0: Apply fix for paged KV fill cache by unpadding page table

e68a011

#0: Create test for chunked prefill through llama_generation

fca742f

#0: Address PR comments, minor test changes

e7b3a33

#0: Remove TODOs

48e9da1

#0: Add chunked generation test to t3k demo

813e5c2

cglagovichTT force-pushed the cglagovich/model_chunked_prefill branch from d0f1c15 to 813e5c2 Compare December 12, 2024 19:09

#0: Remove custom vllm input processor for Llama since constraint has been lifted

d7f443e

cglagovichTT marked this pull request as ready for review December 12, 2024 19:54

cglagovichTT requested review from uaydonat, johanna-rock-tt, djordje-tt, kpaigwar and a team as code owners December 12, 2024 19:54

skhorasganiTT approved these changes Dec 12, 2024

#0: Remove log statement

3c73ce2

cglagovichTT changed the title ~~Cglagovich/model chunked prefill~~ Integrate chunked prefill into t3k Llama3-70B Dec 12, 2024

tt-rkim approved these changes Dec 12, 2024

#0: Increase SDPA program config k-chunk-size to improve correctness for long sequences

f66a5f8

cglagovichTT requested review from ayerofieiev-tt, dmakoviichuk-tt, cfjchu and TT-BrianLiu as code owners December 13, 2024 20:43

cglagovichTT force-pushed the cglagovich/model_chunked_prefill branch from 4b6878a to f66a5f8 Compare December 16, 2024 14:18

cglagovichTT removed request for TT-BrianLiu, cfjchu and ayerofieiev-tt December 16, 2024 14:20

cglagovichTT removed the request for review from dmakoviichuk-tt December 16, 2024 14:20

johanna-rock-tt approved these changes Dec 16, 2024

djordje-tt approved these changes Dec 16, 2024

cglagovichTT merged commit 8d01f5d into main Dec 16, 2024
35 of 44 checks passed

cglagovichTT deleted the cglagovich/model_chunked_prefill branch December 16, 2024 18:03

cglagovichTT mentioned this pull request Dec 16, 2024

Llama3 Chunked Prefill #15873

Closed
