Currently the rotating KV cache can end up holding either `max_size` or `max_size - 1` entries, depending on whether it was filled by generating or by prompt token processing. This PR makes sure the max is always `max_size` and never `max_size - 1`.

The following two simple repros exemplify the problem:
1. Generate past max size
2. Prompt process past max size
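The repro scripts themselves are not reproduced here; the following is a minimal sketch of what they could look like. It assumes `RotatingKVCache` from `mlx_lm.models.base` takes `(head_dim, n_kv_heads, max_size)` and exposes `update_and_fetch(keys, values)`; the shapes and sizes are illustrative only and may differ from the actual tests.

```python
# Hypothetical repro sketch -- not the PR's actual test scripts.
# Assumes RotatingKVCache(head_dim, n_kv_heads, max_size) and an
# update_and_fetch(keys, values) method; check your mlx_lm version.
import mlx.core as mx
from mlx_lm.models.base import RotatingKVCache

B, H, D, MAX = 1, 4, 8, 16  # illustrative batch / heads / head_dim / max_size


def kv(n):
    """Random keys/values for n tokens with shape (B, H, n, D)."""
    return (mx.random.uniform(shape=(B, H, n, D)),
            mx.random.uniform(shape=(B, H, n, D)))


# Repro 1: generate past max size (one token at a time).
cache = RotatingKVCache(D, H, MAX)
for _ in range(2 * MAX):
    k, v = cache.update_and_fetch(*kv(1))
print("generate path, cached tokens:", k.shape[2])  # expected: MAX

# Repro 2: prompt-process past max size (one big chunk), then generate once.
cache = RotatingKVCache(D, H, MAX)
cache.update_and_fetch(*kv(2 * MAX))
k, v = cache.update_and_fetch(*kv(1))
print("prompt path, cached tokens:", k.shape[2])  # should also be MAX
```

On main, the two paths report different sizes, which is the inconsistency described above.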
The first breaks on main. Changing line 45 of `base.py` then breaks the second test, and also changing the trim size makes both pass. These issues occur in `mlx_lm.chat` and `mlx_lm.cache_prompt` as well; they are just more explicit with the tests above.
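To make the off-by-one concrete, here is a tiny standalone illustration of the trim arithmetic (plain Python, not the actual `base.py` change): keeping the last `max_size` entries means trimming `len - max_size`, and trimming one more than that is exactly what leaves the cache at `max_size - 1`.

```python
# Standalone illustration of the off-by-one; not the actual base.py patch.
def remaining_after_trim(current_len: int, max_size: int, extra: int = 0) -> int:
    """Entries left after trimming `current_len - max_size + extra` of them."""
    trim = max(current_len - max_size + extra, 0)
    return current_len - trim

print(remaining_after_trim(20, 16))           # 16: trim exactly len - max_size
print(remaining_after_trim(20, 16, extra=1))  # 15: one extra leaves max_size - 1
```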