# I keep running out of memory. What's the biggest model, and most context, I can run on a 3060 12GB with decent speed? : r/LocalLLaMA #457


# GPU and Model Recommendations

**GPU Only:**

- You can run 7B models at 8 bpw with 8K context, or maybe up to 12K context.
- If you want to use 13B models, you have to drop to 4 bpw and limit yourself to 2K context (a sketch of a GPU-only setup follows this list).
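For concreteness, here is a minimal sketch of a GPU-only load using llama-cpp-python, one common way to run `.gguf` models locally. The thread itself doesn't name a library, and the model filename is a placeholder:

```python
# Minimal sketch, assuming llama-cpp-python built with CUDA support.
# The model path is hypothetical; any ~8 bpw (Q8_0) 7B gguf works the same way.
from llama_cpp import Llama

llm = Llama(
    model_path="./openchat-3.5-7b.Q8_0.gguf",  # placeholder filename
    n_gpu_layers=-1,  # -1 = offload every layer to VRAM (GPU-only)
    n_ctx=8192,       # 8K context; try up to 12288 if VRAM allows
)

out = llm("Q: What fits on a 3060 12GB? A:", max_tokens=64)
print(out["choices"][0]["text"])
```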

**GPU + CPU:**

- Use `.gguf` files to offload part of the model to VRAM.
- Watch disk usage while inferencing in the activity monitor app (or whatever it is called in your OS). If disk usage sits at 100%, the disk is swapping: the model cannot fit in RAM + VRAM, and tokens per second will be very low.
- In that case, reduce the context size and the bpw. The best models you can probably run now are:
  - OpenChat 3.5 7B at 8 bpw (use the latest version)
  - <https://huggingface.co/vicgalle/solarized-18B-dpo-GGUF> at 4 bpw and 4K context.
- If you want to run Nous-Capybara-34B, switch to the 3 bpw version and try offloading 35 layers to the GPU (see the sketch after this list). If you want to run bigger models, upgrade your RAM to 64GB.
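Here is a rough sketch of what that partial offload looks like with llama-cpp-python, plus a quick swap check along the lines of the activity-monitor advice above. The filename is a placeholder and the layer count is illustrative (35 is the figure suggested in the thread; the right number depends on quant and context):

```python
# Minimal sketch: partial CPU+GPU offload plus a swap sanity check.
# Filenames are placeholders; llama-cpp-python and psutil are assumed installed.
import psutil
from llama_cpp import Llama

llm = Llama(
    model_path="./nous-capybara-34b.Q3_K_S.gguf",  # placeholder ~3 bpw-class quant
    n_gpu_layers=35,  # layers kept in VRAM; the rest stay in system RAM
    n_ctx=4096,
)

# If swap usage climbs during inference, the model doesn't fit in RAM + VRAM,
# so reduce the context size or drop to a smaller quant.
swap = psutil.swap_memory()
print(f"swap used: {swap.used / 2**30:.1f} GiB ({swap.percent}%)")
```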

**Tip from /u/Working-Flatworm-531:**

- Just don't load the kv cache into VRAM; you can disable that in `ooba`.
- Also try lower quants; Q4_K_S, for example, is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something.
- You can also check your BIOS and maybe increase the RAM frequency. After that, you'd be able to run ~20B models at ~2 t/s with 8K–12K context (a sketch follows this tip).
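The thread disables kv-cache offloading through ooba's UI; for reference, here is a hedged sketch of the same underlying llama.cpp option via llama-cpp-python (the `offload_kqv` flag in recent versions), with a simple tokens-per-second measurement to check the ~2 t/s estimate. The model path and layer count are placeholders:

```python
# Minimal sketch: keep the kv cache in system RAM, spend the freed VRAM on
# extra layers, and measure generation speed. Assumes a recent llama-cpp-python
# that exposes offload_kqv; the filename is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./some-20b-model.Q4_K_S.gguf",  # placeholder ~20B Q4_K_S quant
    n_gpu_layers=50,    # illustrative: more layers fit once the kv cache moves out
    offload_kqv=False,  # kv cache stays in system RAM instead of VRAM
    n_ctx=8192,
)

start = time.perf_counter()
out = llm("Once upon a time", max_tokens=128)
elapsed = time.perf_counter() - start
print(f"{out['usage']['completion_tokens'] / elapsed:.2f} t/s")
```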

**Recommended List from /u/Working-Flatworm-531:**

- Use Linux.
- Overclock RAM (if possible).
- Overclock CPU (if possible).
- Overclock GPU.
- Don't load the kv cache into VRAM; load more layers into VRAM instead.
- Use smaller quants.
- Use a fast interface (the poster hasn't tried Kobold and uses Ooba).
- Check that your RAM runs in dual channel (one way to verify is sketched below).
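One way to check the dual-channel point on Linux is to count populated DIMM slots with `dmidecode` (requires root). This is only a heuristic sketch: two or more modules usually means dual channel on desktop boards, but it ultimately depends on which slots they occupy:

```python
# Heuristic sketch: count populated memory modules via dmidecode (requires root).
import subprocess

out = subprocess.run(
    ["sudo", "dmidecode", "-t", "memory"],
    capture_output=True, text=True, check=True,
).stdout

populated = [
    line.split(":", 1)[1].strip()
    for line in out.splitlines()
    if line.strip().startswith("Size:") and "No Module Installed" not in line
]
print(f"{len(populated)} module(s) populated: {populated}")
```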

**Suggested labels:**

`{ "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on a 3060 12GB GPU.", "confidence": 94.88 }`
