
Clear cache every now and then #1081

Merged
awni merged 2 commits into main from clear_cache on Nov 1, 2024
Conversation

awni (Member) commented on Nov 1, 2024

As we step the KV cache, the buffer cache fills up, which causes the machine to use more RAM than is really needed. This PR clears the cache every now and then during generation (see the sketch below).
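For context, here is a minimal sketch of the idea, not the exact change in this PR: a simplified greedy decode loop that periodically releases cached buffers. The function name, the greedy sampling, and the 256-step interval are assumptions for illustration; mx.metal.clear_cache() is the MLX call that frees cached Metal buffers back to the system (newer MLX versions also expose it as mx.clear_cache()).

import mlx.core as mx

def generate_step_sketch(model, tokens, max_tokens, clear_every=256):
    # Hypothetical, simplified decode loop; the real loop in
    # mlx_lm.utils.generate_step also manages the KV cache and sampler.
    for n in range(max_tokens):
        logits = model(tokens[None])[:, -1, :]
        next_token = mx.argmax(logits, axis=-1)  # greedy stand-in for sampling
        tokens = mx.concatenate([tokens, next_token])
        yield next_token.item()
        # Periodically drop cached Metal buffers so the buffer cache does
        # not keep growing alongside the KV cache as generation proceeds.
        if (n + 1) % clear_every == 0:
            mx.metal.clear_cache()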

For example:

mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit -m 2048 --prompt - < prompt.txt

Pre:

Prompt: 32188 tokens, 430.339 tokens-per-sec
Generation: 892 tokens, 32.480 tokens-per-sec
Peak memory: 11.496 GB
Cache memory: 22.795 GB

Post:

Prompt: 32188 tokens, 424.646 tokens-per-sec
Generation: 892 tokens, 32.211 tokens-per-sec
Peak memory: 11.496 GB
Cache memory: 4.034 GB
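As a side note, the peak and cache memory lines above presumably come from MLX's memory introspection. A minimal sketch of how to print the same stats, assuming the mx.metal namespace in use at the time of this PR (newer versions expose mx.get_peak_memory() / mx.get_cache_memory() at the top level):

import mlx.core as mx

def report_memory():
    # Both calls return bytes; convert to GB for display.
    gb = 1 << 30
    print(f"Peak memory: {mx.metal.get_peak_memory() / gb:.3f} GB")
    print(f"Cache memory: {mx.metal.get_cache_memory() / gb:.3f} GB")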

No difference on M2 Ultra with:

mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit --prompt "Write a story about Einstein"  --temp 0.0 --max-tokens 512

Generation still hits about 120.5 tokens-per-sec.

angeloskath (Member) left a comment

Nice :-)

awni merged commit e510987 into main on Nov 1, 2024
2 checks passed
awni deleted the clear_cache branch on Nov 1, 2024