
Fix LongRoPE KV Cache #1

Conversation


@zelinms zelinms commented Aug 1, 2024

Issue

Before this fix: if generation starts from a prompt shorter than 4K but the total sequence length exceeds 4K during generation, the tokens generated beyond 4K are garbage.

Root Cause

LongRoPE uses different scaling factors for sequences of length <=4K and >4K. When generation crosses the 4K switch point, the KV cache of the prefill tokens was computed with the short factors, but the new tokens are generated with the long factors. This inconsistency between cached keys and new queries produces the garbage output described above.
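A minimal sketch of the bug (function and variable names are illustrative, not the actual model code): if the scaling factors are chosen per step from the current sequence length, the choice flips once generation crosses the 4K switch point, so cached keys no longer match new queries.

```python
# Illustrative sketch of the per-step factor selection before the fix.
# SWITCH_POINT and factors_for_length are hypothetical names.

SWITCH_POINT = 4096  # the "4K" boundary in LongRoPE

def factors_for_length(seq_len, short_factors, long_factors):
    # Buggy behavior: select factors from the *current* length only.
    return long_factors if seq_len > SWITCH_POINT else short_factors

# Prefill a 3K prompt: short factors are used to build the KV cache.
prefill_factors = factors_for_length(3000, "short", "long")

# Later, generation has pushed the length past 4K: long factors are used.
decode_factors = factors_for_length(4100, "short", "long")

# prefill_factors == "short" but decode_factors == "long":
# the cached keys and the new queries are scaled inconsistently.
```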

Solution

Pass prompt_length + max_generation_tokens to the model.
If this total exceeds 4K, use the long factors for all computations over the sequence, including prefill.
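The fix can be sketched as choosing the factors once, up front, from the total expected length rather than the current length (again, all names here are illustrative assumptions, not the PR's actual code):

```python
# Hypothetical sketch of the fix: select LongRoPE scaling factors once,
# from prompt_length + max_generation_tokens, so prefill and decode
# agree even when generation crosses the 4K switch point.

SWITCH_POINT = 4096

def select_rope_factors(prompt_length, max_generation_tokens,
                        short_factors, long_factors):
    """Pick one set of factors for the entire sequence.

    If the sequence will ever exceed the switch point, use the long
    factors from the start so the KV cache stays consistent.
    """
    total_len = prompt_length + max_generation_tokens
    return long_factors if total_len > SWITCH_POINT else short_factors

# A 3K prompt with a 2K generation budget crosses 4K, so the long
# factors are applied to the prefill as well.
factors = select_rope_factors(3000, 2048,
                              short_factors="short", long_factors="long")
```

The trade-off is that a sequence which could have crossed 4K but stops early still uses the long factors for its whole length; the PR accepts this in exchange for a consistent KV cache.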

@zelinms zelinms changed the title Fit cluster tests fp8 longrope Fix LongRoPE KV Cache Aug 1, 2024
@zelinms zelinms marked this pull request as ready for review August 1, 2024 05:05
@zelinms zelinms marked this pull request as draft August 1, 2024 10:45
@zelinms zelinms closed this Aug 1, 2024
@zelinms zelinms deleted the fit-cluster-tests-fp8-longrope branch August 1, 2024 14:24