PR: Add KV-cache creation capability to mlx_lm.generate for after a text completion #1001
While mlx_lm.cache_prompt lets you encode a prompt ahead of time and save the key-value pairs to a kv-cache.safetensors file, there's currently no way to save the KV cache after a text completion by an LLM.
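For context, here's a rough sketch of what saving the cache after a completion could look like from Python. The helper names (make_prompt_cache and save_prompt_cache in mlx_lm.models.cache) and the prompt_cache argument to generate are assumptions based on later mlx_lm releases, not necessarily what this PR implements:

```python
from mlx_lm import load, generate
# Assumed helpers from later mlx_lm releases; names may differ from this PR.
from mlx_lm.models.cache import make_prompt_cache, save_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# The cache is filled in place as the prompt is encoded and tokens are
# generated, so after generate() it covers the completion as well as the prompt.
cache = make_prompt_cache(model)
response = generate(
    model,
    tokenizer,
    prompt="Give me a short history of the Macintosh.",
    max_tokens=256,
    prompt_cache=cache,
)

# Persist the prompt + completion KV pairs for the next turn.
save_prompt_cache("kv-cache.safetensors", cache)
```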
Adding this brings mlx_lm more in line with Llama.cpp in terms of reducing latency in multi-turn scenarios, for example using an MLX-served model as a chatbot and having a drawn-out discussion about a given topic. Saving the KV cache after each turn by the LLM means that even as the conversation history grows, no latency is introduced by re-encoding the entire chat log; only the most recent user prompt needs to be processed.
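To illustrate the multi-turn savings, the next turn could then reload that file and only pay for encoding the new user message (again assuming the load_prompt_cache / save_prompt_cache helpers from mlx_lm.models.cache):

```python
from mlx_lm import load, generate
# Assumed helpers from later mlx_lm releases; names may differ from this PR.
from mlx_lm.models.cache import load_prompt_cache, save_prompt_cache

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

# Reload the KV pairs from the previous turn; only the new user message
# has to be encoded, not the whole chat history.
cache = load_prompt_cache("kv-cache.safetensors")
reply = generate(
    model,
    tokenizer,
    prompt="And how did that lead to the iMac?",
    max_tokens=256,
    prompt_cache=cache,
)

# Save again so the following turn also starts from a warm cache.
save_prompt_cache("kv-cache.safetensors", cache)
```

In a real chat loop you'd typically build the prompt with the tokenizer's chat template rather than raw strings; they're used here only to keep the sketch short.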
I'm not sure I went about it the best way, but it seems to work in my testing! There's one superfluous edit to the step generator (line 357) that can probably be left out, but otherwise I think I kept this as streamlined as I could.