Handling Prefill Lengths Exceeding 2k in TinyLlama_v1.1 #4

Open
bettybaii opened this issue Oct 24, 2024 · 4 comments

@bettybaii

Thank you very much for your remarkable efforts and significant contributions to the open-source community.
I noticed that the TinyLlama_v1.1 model supports a maximum context length of only 2k. How does TinyLlama_v1.1 propose tokens for the target model when the requested prefill length exceeds this 2k limit?

@ranonrkm
Contributor

Hi, thanks for your interest in our work. We use sparse KV for drafting. Even though the sequence length goes beyond 2k, the sparse KV window length will be upper bounded by 2k for TinyLlama.
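
To illustrate the idea in a few lines (a rough sketch only, not the actual MagicDec code; the real drafting path uses a StreamingLLM-style cache rather than a plain recency window, as discussed further down the thread):

```python
import torch

# Rough sketch: the draft model's KV window is capped at a fixed budget
# (2048 for TinyLlama's 2k context) no matter how long the prompt is.
# For simplicity this keeps only the most recent positions and ignores
# attention sinks.
KV_BUDGET = 2048

def draft_window_positions(seq_len: int, budget: int = KV_BUDGET) -> torch.Tensor:
    """Positions of the tokens the draft model attends to."""
    start = max(0, seq_len - budget)
    return torch.arange(start, seq_len)

print(draft_window_positions(1500).numel())  # 1500 -> fits within the budget
print(draft_window_positions(8192).numel())  # 2048 -> capped at 2k
```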

@bettybaii
Author


Hi @ranonrkm, thank you very much for your detailed response. I’m curious about the specific “sparse KV” method you use to keep the sparse KV window length bounded by 2k even when the sequence length exceeds 2k. Are you retaining only the last 2k KV entries? If possible, could you point me to the relevant section of the code? Also, since the draft model’s accuracy during speculative decoding significantly impacts inference efficiency, would this method affect the output quality of the draft model (TinyLlama)?

@ranonrkm
Contributor

ranonrkm commented Nov 1, 2024

The simplest sparsification technique we tried is StreamingLLM. During prefilling, we ensure that the KV cache size does not exceed the KV budget by sliding the local window.
You can refer to this:
https://github.com/Infini-AI-Lab/MagicDec/blob/82999d621569a785587138cfb8ed896ca0e3daa0/Engine/backend_draft.py#L68C1-L72C44
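
A rough sketch of what such a budgeted, StreamingLLM-style cache can look like during chunked prefill (illustrative only: the class, parameter names such as `num_sink_tokens`, and the chunk loop are assumptions, not the linked backend_draft.py code):

```python
import torch

class StreamingKVCache:
    """Toy StreamingLLM-style KV cache: keep a few attention-sink tokens plus a
    sliding local window, so the cache never exceeds `budget` entries.

    Illustrative only -- shapes and parameter names are assumptions, not
    MagicDec's actual implementation.
    """

    def __init__(self, budget: int = 2048, num_sink_tokens: int = 4):
        self.budget = budget
        self.num_sink = num_sink_tokens
        self.keys = None    # [num_heads, cached_len, head_dim]
        self.values = None

    def append(self, k: torch.Tensor, v: torch.Tensor):
        """Append new KV entries, then evict the oldest non-sink entries if the
        cache would exceed the budget."""
        if self.keys is None:
            self.keys, self.values = k, v
        else:
            self.keys = torch.cat([self.keys, k], dim=1)
            self.values = torch.cat([self.values, v], dim=1)

        overflow = self.keys.shape[1] - self.budget
        if overflow > 0:
            # Keep the first `num_sink` tokens (attention sinks) and the most
            # recent tokens; drop the oldest tokens in between.
            sink_k = self.keys[:, :self.num_sink]
            sink_v = self.values[:, :self.num_sink]
            rest_k = self.keys[:, self.num_sink + overflow:]
            rest_v = self.values[:, self.num_sink + overflow:]
            self.keys = torch.cat([sink_k, rest_k], dim=1)
            self.values = torch.cat([sink_v, rest_v], dim=1)


# Prefilling a 4k-token prompt in chunks: the cached length never exceeds the
# budget, so the draft's effective window stays within TinyLlama's 2k limit.
cache = StreamingKVCache(budget=2048, num_sink_tokens=4)
num_heads, head_dim = 32, 64
for chunk_len in [1024, 1024, 1024, 1024]:
    k = torch.randn(num_heads, chunk_len, head_dim)
    v = torch.randn(num_heads, chunk_len, head_dim)
    cache.append(k, v)
    print(cache.keys.shape[1])  # 1024, 2048, 2048, 2048
```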

@bettybaii
Author


I understand now. Thank you for your patience and detailed explanation!
