
Mixtral 8x7b models using more memory while loading #6652

Closed
RyenNelsen opened this issue Apr 13, 2024 · 4 comments

Comments


RyenNelsen commented Apr 13, 2024

There appears to be a regression between release versions b2586 and b2589. When loading Mixtral 8x7b models with any version newer than b2586, the system uses an abnormal amount of memory compared to previous versions. Manually disabling mmap resolves the issue.

Platform:
Windows 11 Pro
64GB RAM
Nvidia 3080

Example command:
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"
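
For reference, the same command with mmap disabled (the workaround mentioned above; --no-mmap is the flag assumed here for turning mmap off):
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' --no-mmap -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"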

Versions I tested:
b2586 cuda cu12.2.0 & openblas
[screenshot: memory usage while loading, b2586]

b2589 cuda cu12.2.0 & openblas
[screenshot: memory usage while loading, b2589 cuda & openblas]

b2589 avx512
[screenshot: memory usage while loading, b2589 avx512]

Diffing log output from b2586 cuda cu12.2.0 and b2589 cuda cu12.2.0 shows the following:
b2586: llm_load_tensors: CPU buffer size = 30735.50 MiB
b2589: llm_load_tensors: CUDA_Host buffer size = 30735.50 MiB

RyenNelsen changed the title from "Mixtral 8x7b models using greater memory while loading" to "Mixtral 8x7b models using more memory while loading" on Apr 13, 2024
compilade (Collaborator) commented Apr 13, 2024

This may be related to #6387. If I'm understanding this correctly, a solution would be to re-convert the model to GGUF from the original model files (@slaren might want to clarify).

I haven't yet found a recent re-conversion of dolphin-mixtral-8x7b on Hugging Face, but someone will probably upload one eventually. There are already finetunes that use the new format, though, such as https://huggingface.co/bartowski/Tess-2.0-Mixtral-v0.2-GGUF.
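
A rough sketch of that re-conversion workflow, assuming the original Hugging Face model files are available locally and using the convert.py and quantize tools shipped with llama.cpp around these releases (the local paths and the Q5_0 quant type are placeholders mirroring the report above):
python convert.py C:\models\dolphin-2.7-mixtral-8x7b --outtype f16 --outfile dolphin-2.7-mixtral-8x7b.f16.gguf
.\quantize.exe dolphin-2.7-mixtral-8x7b.f16.gguf dolphin-2.7-mixtral-8x7b.Q5_0.gguf Q5_0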

LostRuins (Collaborator) commented
I can confirm this happens for me too; I'm getting OOM with configurations that worked fine previously.

phymbert (Collaborator) commented
Yes, that's why #6387 is a breaking change. You need to convert to GGUF again to get the merged expert tensors per layer, or disable mmap.

It is clearly stated here:
#6387 (comment)
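
A quick way to check whether a particular GGUF already uses the merged expert tensors is to dump its tensor names; this is a sketch that assumes the gguf-dump.py helper from llama.cpp's gguf-py/scripts is available, and that the post-#6387 layout names the merged tensors with an _exps suffix (e.g. blk.0.ffn_gate_exps.weight) while the old layout stores one tensor per expert (e.g. blk.0.ffn_gate.0.weight):
python .\gguf-py\scripts\gguf-dump.py 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' | findstr "ffn_gate"
If only the per-expert names show up, the file predates the change and is affected by the extra memory use described in this issue.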

phymbert closed this as not planned on Apr 17, 2024
aleksusklim commented
@slaren, @phymbert, could you please take a look?
LostRuins#786
