Mixtral 8x7b models using more memory while loading #6652
This may be related to #6387. If I'm understanding this correctly, a solution would be to re-convert the model to GGUF from the original model files (@slaren might want to clarify). I haven't yet found a recent re-conversion of dolphin-mixtral-8x7b on HuggingFace, but someone might do it eventually.
I can confirm this seems to happen for me too. I'm getting OOM with configurations that previously worked fine.
Yes, that's why #6387 is a breaking change. You need to convert to GGUF again to get merged expert tensors per layer, or disable mmap. It is clearly stated here:
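For anyone hitting this, a re-conversion would look roughly like the following, assuming the usual llama.cpp convert.py + quantize workflow; the model directory and output file names are illustrative, not taken from this thread:

python convert.py C:\models\dolphin-2.7-mixtral-8x7b --outtype f16 --outfile dolphin-2.7-mixtral-8x7b.f16.gguf
.\quantize.exe dolphin-2.7-mixtral-8x7b.f16.gguf dolphin-2.7-mixtral-8x7b.Q5_0.gguf Q5_0

The re-converted GGUF stores the experts as merged per-layer tensors, which is what the post-#6387 loader expects for mmap to work without the extra memory usage.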
@slaren, @phymbert, could you please take a look?
There appears to be a regression between release versions b2586 and b2589. When attempting to load Mixtral 8x7b models with any version greater than b2586, the system utilizes an abnormal amount of memory compared to previous versions. Manually disabling mmap does resolve the issue.

Platform:
Windows 11 Pro
64GB RAM
Nvidia 3080
Example command:
.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"
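For reference, the mmap workaround mentioned above amounts to passing --no-mmap; this variant is my reconstruction of the command, not one taken from the original report:

.\main.exe -m 'C:\models\dolphin-2.7-mixtral-8x7b.Q5_0.gguf' --no-mmap -p "<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\n"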
Versions I tested:
b2586 cuda cu12.2.0 & openblas
b2589 cuda cu12.2.0 & openblas
b2589 avx512
Diffing log output from b2586 cuda cu12.2.0 and b2589 cuda cu12.2.0 shows the following:
b2586:
llm_load_tensors: CPU buffer size = 30735.50 MiB
b2589:
llm_load_tensors: CUDA_Host buffer size = 30735.50 MiB