Handling Prefill Lengths Exceeding 2k in TinyLlama_v1.1 #4
Comments
Hi, thanks for your interest in our work. We use sparse KV for drafting. Even though the sequence length goes beyond 2k, the sparse KV window length will be upper-bounded by 2k for TinyLlama.
Hi, thank you very much for your detailed response. @ranonrkm I'm curious about the specific "sparse KV" method you used to keep the sparse KV window length upper-bounded by 2k even when the sequence length exceeds 2k. Are you retaining only the last 2k KV entries? If possible, could you point me to the relevant section of the code? Since the draft model's accuracy during speculative decoding significantly impacts inference efficiency, would this method affect the output quality of the draft model (TinyLlama)?
The simplest sparsification technique we tried is StreamingLLM. During prefilling, we ensure that the KV cache size does not exceed the KV budget by sliding the local window.
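For reference, a StreamingLLM-style eviction keeps a few initial "attention sink" tokens plus a sliding window of the most recent tokens, so the cache never grows past the budget. Below is a minimal sketch of that idea; the function name and the `kv_budget` / `num_sink` parameters are illustrative assumptions, not code from this repository.

```python
import torch

def evict_kv_streaming(keys, values, kv_budget=2048, num_sink=4):
    """Keep the first `num_sink` "attention sink" positions plus the most
    recent tokens so the KV cache never exceeds `kv_budget` entries.

    keys, values: tensors of shape [batch, heads, seq_len, head_dim]
    """
    seq_len = keys.size(2)
    if seq_len <= kv_budget:
        # Cache is still within budget; nothing to evict.
        return keys, values

    num_recent = kv_budget - num_sink
    # Always retain the initial sink tokens.
    sink_k, sink_v = keys[:, :, :num_sink], values[:, :, :num_sink]
    # Retain only the most recent tokens in the local window.
    recent_k, recent_v = keys[:, :, -num_recent:], values[:, :, -num_recent:]
    return (torch.cat([sink_k, recent_k], dim=2),
            torch.cat([sink_v, recent_v], dim=2))
```

Under this scheme the draft model's KV cache stays at most 2k entries for TinyLlama regardless of the prompt length, at the cost of discarding the middle of the context during drafting.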
I understand now. Thank you for your patience and detailed explanation!
Thank you for your remarkable efforts and significant contributions to the open-source community.
I noticed that the TinyLlama_v1.1 model supports a maximum context length of only 2k. How does TinyLlama_v1.1 propose tokens for the target model when the requested prefill length exceeds this 2k limit?