
Fix LongRoPE KV Cache #1

Conversation


@zelinms zelinms commented Aug 1, 2024

Issue

Before this fix: if generation starts from a prompt shorter than 4K but the total sequence length exceeds 4K during generation, the tokens generated beyond 4K are garbage.

Root Cause

LongRoPE uses different scaling factors for sequences of length <=4K and >4K. When generation crosses the 4K switch point, the KV cache of the prefill tokens was computed with the short factors, but the new tokens are generated with the long factors. This inconsistency between cached keys and new queries produces the garbage output described above.
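A minimal sketch of the bug (function and variable names are illustrative, not the actual model code): if the scaling factors are chosen per step from the current sequence length, the choice flips once generation crosses the 4K switch point, so cached keys no longer match new queries.

```python
# Illustrative sketch of the per-step factor selection before the fix.
# SWITCH_POINT and factors_for_length are hypothetical names.

SWITCH_POINT = 4096  # the "4K" boundary in LongRoPE

def factors_for_length(seq_len, short_factors, long_factors):
    # Buggy behavior: select factors from the *current* length only.
    return long_factors if seq_len > SWITCH_POINT else short_factors

# Prefill a 3K prompt: short factors are used to build the KV cache.
prefill_factors = factors_for_length(3000, "short", "long")

# Later, generation has pushed the length past 4K: long factors are used.
decode_factors = factors_for_length(4100, "short", "long")

# prefill_factors == "short" but decode_factors == "long":
# the cached keys and the new queries are scaled inconsistently.
```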

Solution

Pass prompt_length + max_generation_tokens to the model.
If this total exceeds 4K, use the long factors for all computations over the sequence, including prefill.
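The fix can be sketched as choosing the factors once, up front, from the total expected length rather than the current length (again, all names here are illustrative assumptions, not the PR's actual code):

```python
# Hypothetical sketch of the fix: select LongRoPE scaling factors once,
# from prompt_length + max_generation_tokens, so prefill and decode
# agree even when generation crosses the 4K switch point.

SWITCH_POINT = 4096

def select_rope_factors(prompt_length, max_generation_tokens,
                        short_factors, long_factors):
    """Pick one set of factors for the entire sequence.

    If the sequence will ever exceed the switch point, use the long
    factors from the start so the KV cache stays consistent.
    """
    total_len = prompt_length + max_generation_tokens
    return long_factors if total_len > SWITCH_POINT else short_factors

# A 3K prompt with a 2K generation budget crosses 4K, so the long
# factors are applied to the prefill as well.
factors = select_rope_factors(3000, 2048,
                              short_factors="short", long_factors="long")
```

The trade-off is that a sequence which could have crossed 4K but stops early still uses the long factors for its whole length; the PR accepts this in exchange for a consistent KV cache.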

@zelinms zelinms changed the title Fit cluster tests fp8 longrope Fix LongRoPE KV Cache Aug 1, 2024
@zelinms zelinms marked this pull request as ready for review August 1, 2024 05:05
@zelinms zelinms marked this pull request as draft August 1, 2024 10:45
@zelinms zelinms closed this Aug 1, 2024
@zelinms zelinms deleted the fit-cluster-tests-fp8-longrope branch August 1, 2024 14:24