
Integrate chunked prefill into t3k Llama3-70B #15921

Merged: 10 commits merged into main on Dec 16, 2024

Conversation

cglagovichTT (Contributor) commented Dec 11, 2024

Ticket

#15873

Problem description

Llama3-70B on T3K supports up to 128k context length with batch=1. However, the maximum prompt length is limited to 32k, which means the full 128k context (prefill len + decode len) can only be used if prefill len <= 32k.

This constraint exists because prefill activations for long sequences become large and cause OOM in device DRAM. For example, at input length 128k the activations are 128k * 8k * 2B = 2 GB.
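For reference, a quick back-of-the-envelope check of that number (a sketch only, assuming bfloat16 activations and the 8k hidden dimension from the example above):

```python
# Back-of-the-envelope activation size for a single 128k-token prefill.
# Assumes bfloat16 (2 bytes/element) and an 8k hidden dimension, per the example above.
seq_len = 128 * 1024          # 128k tokens
hidden_dim = 8 * 1024         # 8k hidden dimension
bytes_per_elem = 2            # bfloat16

activation_bytes = seq_len * hidden_dim * bytes_per_elem
print(f"{activation_bytes / 2**30:.1f} GiB")  # -> 2.0 GiB
```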

What's changed

The solution to this problem is to implement chunked prefill. Given some prompt_len and a chunk_size (<= 32k), the prompt is split into prompt_len / chunk_size chunks which are prefilled iteratively, with each chunk attending over the KV cache entries written by earlier chunks. This bounds the activation size by the chunk size rather than the full prompt length, which resolves the OOM error.
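A minimal sketch of the driver loop (not the actual tt-metal code; `prefill_chunk` is a hypothetical helper standing in for one model forward pass over a chunk):

```python
def chunked_prefill(tokens, chunk_size, prefill_chunk):
    """Prefill a long prompt in fixed-size chunks to bound activation memory.

    `prefill_chunk(chunk, chunk_start)` is a hypothetical helper that runs one
    prefill pass for `chunk` starting at absolute position `chunk_start`,
    attending over KV cache entries written by earlier chunks (e.g. via a
    chunked attention kernel) and appending this chunk's K/V to the cache.
    It returns the logits for the chunk.
    """
    prompt_len = len(tokens)
    assert prompt_len % chunk_size == 0, "pad the prompt to a multiple of chunk_size"

    logits = None
    for chunk_start in range(0, prompt_len, chunk_size):
        chunk = tokens[chunk_start : chunk_start + chunk_size]
        logits = prefill_chunk(chunk, chunk_start)

    # The last chunk's logits yield the prediction for the first decode token.
    return logits
```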

This PR uses the new chunked_scaled_dot_product_attention kernel in Llama to enable chunked prefill. It modifies the attention and model tests, and adds a new test to the T3K demo pipeline that exercises chunked prefill as used by llama_generation, the entrypoint that vLLM will use.

Note that this functionality was added to the "old" Llama codebase since it was an urgent request. I expect we will be adding chunked prefill to the llama family folder sometime soon.

@skhorasganiTT skhorasganiTT self-requested a review December 11, 2024 19:26
@cglagovichTT cglagovichTT force-pushed the cglagovich/model_chunked_prefill branch from d0f1c15 to 813e5c2 on December 12, 2024 19:09
@cglagovichTT cglagovichTT marked this pull request as ready for review December 12, 2024 19:54
@cglagovichTT cglagovichTT changed the title from "Cglagovich/model chunked prefill" to "Integrate chunked prefill into t3k Llama3-70B" on Dec 12, 2024
@cglagovichTT cglagovichTT removed the request for review from dmakoviichuk-tt December 16, 2024 14:20
@cglagovichTT cglagovichTT merged commit 8d01f5d into main Dec 16, 2024
35 of 44 checks passed
@cglagovichTT cglagovichTT deleted the cglagovich/model_chunked_prefill branch December 16, 2024 18:03