
Clear cache every now and then #1081

Merged
awni merged 2 commits into main from clear_cache on Nov 1, 2024
Conversation

awni (Member) commented on Nov 1, 2024

As we step the KV cache, the buffer cache fills up, which causes the machine to use more RAM than is really needed. This PR clears the cache every now and then during generation (see the sketch below).
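For context, here is a minimal sketch of the idea, not the exact change in this PR: a simplified greedy decode loop that periodically releases cached buffers. The function name, the greedy sampling, and the 256-step interval are assumptions for illustration; mx.metal.clear_cache() is the MLX call that frees cached Metal buffers back to the system (newer MLX versions also expose it as mx.clear_cache()).

import mlx.core as mx

def generate_step_sketch(model, tokens, max_tokens, clear_every=256):
    # Hypothetical, simplified decode loop; the real loop in
    # mlx_lm.utils.generate_step also manages the KV cache and sampler.
    for n in range(max_tokens):
        logits = model(tokens[None])[:, -1, :]
        next_token = mx.argmax(logits, axis=-1)  # greedy stand-in for sampling
        tokens = mx.concatenate([tokens, next_token])
        yield next_token.item()
        # Periodically drop cached Metal buffers so the buffer cache does
        # not keep growing alongside the KV cache as generation proceeds.
        if (n + 1) % clear_every == 0:
            mx.metal.clear_cache()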

For example:

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit -m 2048 --prompt - < prompt.txt

Pre:

Prompt: 32188 tokens, 430.339 tokens-per-sec
Generation: 892 tokens, 32.480 tokens-per-sec
Peak memory: 11.496 GB
Cache memory: 22.795 GB

Post:

Prompt: 32188 tokens, 424.646 tokens-per-sec
Generation: 892 tokens, 32.211 tokens-per-sec
Peak memory: 11.496 GB
Cache memory: 4.034 GB
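As a side note, the peak and cache memory lines above presumably come from MLX's memory introspection. A minimal sketch of how to print the same stats, assuming the mx.metal namespace in use at the time of this PR (newer versions expose mx.get_peak_memory() / mx.get_cache_memory() at the top level):

import mlx.core as mx

def report_memory():
    # Both calls return bytes; convert to GB for display.
    gb = 1 << 30
    print(f"Peak memory: {mx.metal.get_peak_memory() / gb:.3f} GB")
    print(f"Cache memory: {mx.metal.get_cache_memory() / gb:.3f} GB")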

No difference on M2 Ultra with:

mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "Write a story about Einstein"  --temp 0.0 --max-tokens 512

Generation still hits about 120.5 tokens-per-sec.

angeloskath (Member) left a comment

Nice :-)

awni merged commit e510987 into main on Nov 1, 2024
2 checks passed
awni deleted the clear_cache branch on Nov 1, 2024