Handling Prefill Lengths Exceeding 2k in TinyLlama_v1.1 #4

Open
bettybaii opened this issue Oct 24, 2024 · 4 comments

@bettybaii

Thank you very much for your remarkable efforts and significant contributions to the open-source community.
I noticed that the TinyLlama_v1.1 model supports a maximum context length of only 2k. How does TinyLlama_v1.1 propose tokens for the target model when the requested prefill length exceeds this 2k limit?

@ranonrkm
Contributor

Hi, thanks for your interest in our work. We use sparse KV for drafting. Even though the sequence length goes beyond 2k, the sparse KV window length will be upper bounded by 2k for TinyLlama.
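
To illustrate the idea in a few lines (a rough sketch only, not the actual MagicDec code; the real drafting path uses a StreamingLLM-style cache rather than a plain recency window, as discussed further down the thread):

```python
import torch

# Rough sketch: the draft model's KV window is capped at a fixed budget
# (2048 for TinyLlama's 2k context) no matter how long the prompt is.
# For simplicity this keeps only the most recent positions and ignores
# attention sinks.
KV_BUDGET = 2048

def draft_window_positions(seq_len: int, budget: int = KV_BUDGET) -> torch.Tensor:
    """Positions of the tokens the draft model attends to."""
    start = max(0, seq_len - budget)
    return torch.arange(start, seq_len)

print(draft_window_positions(1500).numel())  # 1500 -> fits within the budget
print(draft_window_positions(8192).numel())  # 2048 -> capped at 2k
```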

@bettybaii
Author


Hi @ranonrkm, thank you very much for your detailed response. I’m curious about the specific “sparse KV” method you use to keep the sparse KV window length bounded by 2k even when the sequence length exceeds 2k. Are you retaining only the last 2k KV entries? If possible, could you point me to the relevant section of the code? Also, since the draft model’s accuracy during speculative decoding significantly impacts inference efficiency, would this method affect the output quality of the draft model (TinyLlama)?

@ranonrkm
Contributor

ranonrkm commented Nov 1, 2024

The simplest sparsification technique we tried is StreamingLLM. During prefilling, we ensure that the KV cache size does not exceed the KV budget by sliding the local window.
You can refer to this:
https://github.com/Infini-AI-Lab/MagicDec/blob/82999d621569a785587138cfb8ed896ca0e3daa0/Engine/backend_draft.py#L68C1-L72C44
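
A rough sketch of what such a budgeted, StreamingLLM-style cache can look like during chunked prefill (illustrative only: the class, parameter names such as `num_sink_tokens`, and the chunk loop are assumptions, not the linked backend_draft.py code):

```python
import torch

class StreamingKVCache:
    """Toy StreamingLLM-style KV cache: keep a few attention-sink tokens plus a
    sliding local window, so the cache never exceeds `budget` entries.

    Illustrative only -- shapes and parameter names are assumptions, not
    MagicDec's actual implementation.
    """

    def __init__(self, budget: int = 2048, num_sink_tokens: int = 4):
        self.budget = budget
        self.num_sink = num_sink_tokens
        self.keys = None    # [num_heads, cached_len, head_dim]
        self.values = None

    def append(self, k: torch.Tensor, v: torch.Tensor):
        """Append new KV entries, then evict the oldest non-sink entries if the
        cache would exceed the budget."""
        if self.keys is None:
            self.keys, self.values = k, v
        else:
            self.keys = torch.cat([self.keys, k], dim=1)
            self.values = torch.cat([self.values, v], dim=1)

        overflow = self.keys.shape[1] - self.budget
        if overflow > 0:
            # Keep the first `num_sink` tokens (attention sinks) and the most
            # recent tokens; drop the oldest tokens in between.
            sink_k = self.keys[:, :self.num_sink]
            sink_v = self.values[:, :self.num_sink]
            rest_k = self.keys[:, self.num_sink + overflow:]
            rest_v = self.values[:, self.num_sink + overflow:]
            self.keys = torch.cat([sink_k, rest_k], dim=1)
            self.values = torch.cat([sink_v, rest_v], dim=1)


# Prefilling a 4k-token prompt in chunks: the cached length never exceeds the
# budget, so the draft's effective window stays within TinyLlama's 2k limit.
cache = StreamingKVCache(budget=2048, num_sink_tokens=4)
num_heads, head_dim = 32, 64
for chunk_len in [1024, 1024, 1024, 1024]:
    k = torch.randn(num_heads, chunk_len, head_dim)
    v = torch.randn(num_heads, chunk_len, head_dim)
    cache.append(k, v)
    print(cache.keys.shape[1])  # 1024, 2048, 2048, 2048
```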

@bettybaii
Author


I understand now. Thank you for your patience and detailed explanation!
