Fix for cohere plus #650
I was about to submit a PR, glad I checked first 😄. I've already uploaded the model to the hub.
@Blaizzy Thank you! How much RAM does it require to run the 4-bit quantized model?
It needs about 65GB to generate with 4-bit. Generation is slow right now, though; I'm trying to debug the performance issue.
@DenisSergeevitch, as @awni said 👆🏽. I can't run it myself, I use an M1 Air 16GB :)
Thank you, I will wait for i_q1 then |
Btw to get this to run reasonably fast on an M2 Ultra you need to set the wired GPU memory lower limit appropriately. Something like:
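The original command did not survive on this page. As a rough sketch only: the helper below builds (but does not run) a `sysctl` command string for a wired-memory lower limit sized to the model plus some headroom. The key name `iogpu.wired_lwm_mb`, the 8GB headroom, and the function name `wired_limit_command` are all assumptions for illustration, not the values posted in the thread.

```python
def wired_limit_command(model_gb: float, headroom_gb: float = 8.0,
                        key: str = "iogpu.wired_lwm_mb") -> str:
    """Build (but do not execute) a sysctl command string.

    model_gb: approximate memory the model needs (about 65 for 4-bit, per the thread).
    headroom_gb: extra room for activations and cache (an arbitrary assumption).
    key: sysctl key name (assumed; check `sysctl iogpu` on your machine).
    """
    limit_mb = int((model_gb + headroom_gb) * 1024)  # the key takes a value in MB
    return f"sudo sysctl {key}={limit_mb}"


if __name__ == "__main__":
    # Print the command so it can be reviewed before running it by hand.
    print(wired_limit_command(65))
```

The limit should stay comfortably below the machine's total unified memory so the OS keeps room for itself.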
Thanks!
@awni 👆🏽
@jeanromainroy did you set the memory limits? You could try making it larger:
Would you mind opening an issue and including the command, the versions of MLX / MLX LM, the OS, etc.?
Use the QK norm parameter so the model works with Cohere Plus.
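For readers unfamiliar with the term, here is a minimal sketch of what QK norm does, assuming it means layer-normalizing each head's query and key vectors before the attention dot product. This is not the code from this PR: the names `layer_norm` and `attention_weights` and the `use_qk_norm` flag are illustrative, and the learned scale that a real model would apply after normalization is omitted.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def attention_weights(queries, keys, use_qk_norm=False):
    """Scaled dot-product attention weights for a single head.

    queries, keys: lists of per-token vectors, all of the same length.
    use_qk_norm: when True, normalize every query and key vector first.
    """
    if use_qk_norm:
        queries = [layer_norm(q) for q in queries]
        keys = [layer_norm(k) for k in keys]
    scale = 1.0 / math.sqrt(len(queries[0]))
    weights = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With normalization on, rescaling a query vector barely changes the attention weights, which is the practical point: the logits stay in a bounded range regardless of how large the raw projections grow.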
Machine setting:
Command for generation:
Command for QLoRA: