
More cache improvements #1015

Merged 12 commits into main from more_cache_improvements on Oct 8, 2024
Conversation

@awni (Member) commented Oct 5, 2024

Sorry for the large diff. There's a lot of boilerplate / moving stuff around, which accounts for most of it.

The main bits are:

  • Fix RotatingKVCache for alternating chat, response use case
  • Enable prompt caching for all types (not just KVCache)
  • Unify APIs and cache types in a single file for ease of use / consistency.
  • Chat mode allows prompt caching for efficiency (see the chat example and the sketch below).
  • Add a bunch of tests.

Closes #1000
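
For context, here is a minimal sketch of how the unified prompt-cache API can be used to pre-compute a prompt cache, save it, and reuse it later. The names make_prompt_cache, save_prompt_cache, load_prompt_cache, and the prompt_cache argument to generate follow this PR's cache module, but the exact signatures here are assumptions, not the merged implementation:

    # Sketch only: function names and signatures are assumed, not definitive.
    from mlx_lm import load, generate
    from mlx_lm.models.cache import (
        make_prompt_cache,
        save_prompt_cache,
        load_prompt_cache,
    )

    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

    # Build a cache matching the model's layers / cache types and fill it once.
    prompt_cache = make_prompt_cache(model)
    generate(model, tokenizer, prompt="You are a terse assistant.",
             prompt_cache=prompt_cache, max_tokens=1)

    # Works for any cache type, not just KVCache, since each cache serializes
    # its own state.
    save_prompt_cache("prefix_cache.safetensors", prompt_cache)

    # Later: reload and keep generating without recomputing the shared prefix.
    prompt_cache = load_prompt_cache("prefix_cache.safetensors")
    answer = generate(model, tokenizer, prompt="What's the tallest mountain?",
                      prompt_cache=prompt_cache, max_tokens=128)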

@awni (Member, Author) commented Oct 7, 2024

I also added a chat command to MLX LM, which is a good use case for prompt cache reuse. The example is kind of fun to play with:

mlx_lm.chat

Then you can just chat with the model; it preserves the history and doesn't do any prompt recomputation.

[INFO] Starting chat session with mlx-community/Llama-3.2-3B-Instruct-4bit. To exit, enter 'q'.
>> Hi, my name is Awni!
Hi Awni! It's nice to meet you. Is there something I can help you with or would you like to chat?
>> What's the tallest mountain in the world?
The tallest mountain in the world is Mount Everest, which is located in the Himalayas on the border between Nepal and Tibet, China. It stands at an elevation of 8,848 meters (29,029 feet) above sea level.
>> Do you remember my name?
Yes, your name is Awni.
>> Nice talking with you!
It's great to chat with you too, Awni! Is there anything else you'd like to talk about or ask about?
>> 
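
Under the hood, a chat loop like this can keep a single prompt cache alive across turns so each message only processes the new tokens. A simplified, hypothetical sketch (not the actual mlx_lm.chat code; make_prompt_cache and the prompt_cache argument are assumed as above):

    # Simplified multi-turn loop (sketch); the real chat command also applies
    # the model's chat template when building each prompt.
    from mlx_lm import load, generate
    from mlx_lm.models.cache import make_prompt_cache

    model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
    prompt_cache = make_prompt_cache(model)  # persists across turns

    while True:
        user = input(">> ")
        if user == "q":
            break
        # The cache already holds the earlier turns, so only the new tokens
        # are processed and the model still "remembers" the conversation.
        print(generate(model, tokenizer, prompt=user, prompt_cache=prompt_cache))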

@angeloskath (Member) left a comment

This is a fantastic refactoring! Especially love the tests.

I am wondering what is the point of the extra state in the KV cache? Is anybody using it now? Is there any reason it is set to the empty string instead of None?

I may have missed a discussion, but I couldn't find its use in the code either.

Diff context:

        ) -> mx.array:
    -       r = self.self_attn(self.input_layernorm(x.astype(mx.float32)), mask, cache)
    +       r = self.self_attn(self.input_layernorm(x), mask, cache)

Review comment (Member): ❤️

Diff context:

    @@ -198,20 +197,12 @@ def __call__(
            self,
            x: mx.array,
            mask: mx.array = None,
    -       cache: mx.array = None,
    +       cache=None,
        ) -> Tuple[mx.array, mx.array]:

Review comment (Member): I don't personally care, but in most cases this is written as cache: Optional[Any] = None.

@awni (Member, Author): Good catch, will fix.

"""
cache_data, cache_info = zip(*(c.state for c in cache))
cache_data = dict(tree_flatten(cache_data))
cache_classes = [type(c).__name__ for c in cache]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@awni (Member, Author) commented Oct 7, 2024

> I am wondering what is the point of the extra state in the KV cache? Is anybody using it now? Is there any reason it is set to the empty string instead of None?

It's my least favorite thing in this diff, but I haven't thought of a cleaner solution yet (if you have ideas, I'm all 👂).

It is used only for the RotatingKVCache so that we can save cache.offset and cache._idx; otherwise we don't get the same behavior when serializing and deserializing that cache.

The reason I made it a string and not None is that it simplified saving it in the safetensors metadata. Downstream code just does something like dict(tree_flatten([c for c in cache.state[1]])).
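
Concretely, safetensors metadata is a flat string-to-string mapping, which is what pushes the extra state towards "" rather than None. A hedged sketch of the save path implied by the snippet above (mx.save_safetensors and tree_flatten are real MLX utilities; the surrounding function and the cache_classes key are illustrative):

    import mlx.core as mx
    from mlx.utils import tree_flatten

    def save_cache_sketch(file_name, cache):
        # Each cache's .state is assumed to be (arrays, extra_string_state).
        cache_data, cache_info = zip(*(c.state for c in cache))
        arrays = dict(tree_flatten(cache_data))    # tensor payload
        metadata = dict(tree_flatten(cache_info))  # must map str -> str
        metadata["cache_classes"] = ",".join(type(c).__name__ for c in cache)
        mx.save_safetensors(file_name, arrays, metadata=metadata)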

@angeloskath (Member):
Huh, not sure how I managed to miss it in the RotatingKVCache...

I think this should be changed because state is what we evaluate from the caches, which is why the string confused me. The smallest change that would, imho, be significantly better is to split it into state and serialization_state. It is a bit verbose, but at least it separates the two types of information cleanly.

Nothing much changes. Line 47 would do something like

cache_data = [c.state for c in cache]
cache_info = [c.serialization_state for c in cache] # do we also want type(self).__name__ here?

and line 75 would change to

for c, state, serialization_state in zip(cache, arrays, info):
    c.state = state
    c.serialization_state = serialization_state

The rest remains the same... Wdyt?
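
To make the proposal concrete, a RotatingKVCache exposing both properties might look roughly like this (a hypothetical sketch of the suggestion, not the merged code):

    class RotatingKVCacheSketch:
        def __init__(self):
            self.keys = None
            self.values = None
            self.offset = 0
            self._idx = 0

        @property
        def state(self):
            # Array payload only: what gets evaluated and saved as tensors.
            return self.keys, self.values

        @state.setter
        def state(self, v):
            self.keys, self.values = v

        @property
        def serialization_state(self):
            # String-only extras needed to restore behavior after loading.
            return str(self.offset), str(self._idx)

        @serialization_state.setter
        def serialization_state(self, v):
            self.offset, self._idx = map(int, v)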

@awni (Member, Author) commented Oct 7, 2024

Yea, I thought about a separate property and/or overriding __getstate__ and __setstate__. The main downside I didn't like is that all the caches would need to implement it, but maybe the right call is to check if the attribute exists to avoid that. I think you're right that it could be cleaner, even if a little more verbose.

@angeloskath (Member):
I added a small base class that implements the empty meta state and makes the load/save code a tad cleaner. Should I push it on top, or are we avoiding base classes for some reason?

Also, I just played with the chat command; it is an absolute joy to use :-)

@awni (Member, Author) commented Oct 7, 2024

> I added a small base class that implements the empty meta state and makes the load/save code a tad cleaner. Should I push it on top, or are we avoiding base classes for some reason?

No reason at this point, please send the diff!
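
For illustration, such a base class might look roughly like the following; this is a hedged sketch of the idea (class and property names assumed), not the diff that was actually pushed:

    class _BaseCacheSketch:
        # Default empty meta state so save/load code can treat every cache
        # uniformly; "" keeps the string-only safetensors metadata simple.
        @property
        def state(self):
            return []

        @state.setter
        def state(self, v):
            if v:
                raise ValueError("This cache has no state to set.")

        @property
        def meta_state(self):
            return ""

        @meta_state.setter
        def meta_state(self, v):
            if v:
                raise ValueError("This cache has no meta_state to set.")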

@awni (Member, Author) commented Oct 8, 2024

Ok, I tested prompt caching with a few different models / cache types and it seems to work well. I'm going to merge this.

As a follow up we should consider:

  • Making a way to serialize the chat context
  • Adding a chat endpoint to mlx_lm.server with prompt caching

@awni merged commit fca087b into main on Oct 8, 2024 (2 checks passed), and deleted the more_cache_improvements branch on Oct 8, 2024 at 03:45.
@zcbenz (Contributor) commented Oct 8, 2024

Awesome work, thanks for fixing it!

@mark-lord:
So excited this got merged 😄

Linked issue closed by this PR: RotatingKVCache: Problem when reusing cache between multiple generations (#1000)