# I keep running out of memory. What's the biggest model, and most context, I can run on a 3060 12GB with decent speed? : r/LocalLLaMA #457


# GPU and Model Recommendations

**GPU Only:**

- You can run 7B models at 8 bpw with 8K context, or maybe up to 12K context.
- If you want to use 13B models, you have to drop to 4 bpw and limit yourself to 2K context (a sketch of a GPU-only setup follows this list).
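For concreteness, here is a minimal sketch of a GPU-only load using llama-cpp-python, one common way to run `.gguf` models locally. The thread itself doesn't name a library, and the model filename is a placeholder:

```python
# Minimal sketch, assuming llama-cpp-python built with CUDA support.
# The model path is hypothetical; any ~8 bpw (Q8_0) 7B gguf works the same way.
from llama_cpp import Llama

llm = Llama(
    model_path="./openchat-3.5-7b.Q8_0.gguf",  # placeholder filename
    n_gpu_layers=-1,  # -1 = offload every layer to VRAM (GPU-only)
    n_ctx=8192,       # 8K context; try up to 12288 if VRAM allows
)

out = llm("Q: What fits on a 3060 12GB? A:", max_tokens=64)
print(out["choices"][0]["text"])
```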

**GPU + CPU:**

- Use `.gguf` files to offload part of the model to VRAM.
- Watch disk usage while inferencing in the activity monitor app (or whatever it is called in your OS). If disk usage sits at 100%, the disk is swapping: the model cannot fit in RAM + VRAM, and tokens per second will be very low.
- In that case, reduce the context size and the bpw. The best models you can probably run now are:
  - OpenChat 3.5 7B at 8 bpw (use the latest version)
  - <https://huggingface.co/vicgalle/solarized-18B-dpo-GGUF> at 4 bpw and 4K context.
- If you want to run Nous-Capybara-34B, switch to the 3 bpw version and try offloading 35 layers to the GPU (see the sketch after this list). If you want to run bigger models, upgrade your RAM to 64GB.
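Here is a rough sketch of what that partial offload looks like with llama-cpp-python, plus a quick swap check along the lines of the activity-monitor advice above. The filename is a placeholder and the layer count is illustrative (35 is the figure suggested in the thread; the right number depends on quant and context):

```python
# Minimal sketch: partial CPU+GPU offload plus a swap sanity check.
# Filenames are placeholders; llama-cpp-python and psutil are assumed installed.
import psutil
from llama_cpp import Llama

llm = Llama(
    model_path="./nous-capybara-34b.Q3_K_S.gguf",  # placeholder ~3 bpw-class quant
    n_gpu_layers=35,  # layers kept in VRAM; the rest stay in system RAM
    n_ctx=4096,
)

# If swap usage climbs during inference, the model doesn't fit in RAM + VRAM,
# so reduce the context size or drop to a smaller quant.
swap = psutil.swap_memory()
print(f"swap used: {swap.used / 2**30:.1f} GiB ({swap.percent}%)")
```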

**Tip from /u/Working-Flatworm-531:**

- Just don't load the kv cache into VRAM; you can disable that in `ooba`.
- Also try lower quants; Q4_K_S, for example, is still good. You still wouldn't be able to run 34B models with good speed, but at least it's something.
- You can also check your BIOS and maybe increase the RAM frequency. After that, you'd be able to run ~20B models at ~2 t/s with 8K–12K context (a sketch follows this tip).
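The thread disables kv-cache offloading through ooba's UI; for reference, here is a hedged sketch of the same underlying llama.cpp option via llama-cpp-python (the `offload_kqv` flag in recent versions), with a simple tokens-per-second measurement to check the ~2 t/s estimate. The model path and layer count are placeholders:

```python
# Minimal sketch: keep the kv cache in system RAM, spend the freed VRAM on
# extra layers, and measure generation speed. Assumes a recent llama-cpp-python
# that exposes offload_kqv; the filename is a placeholder.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./some-20b-model.Q4_K_S.gguf",  # placeholder ~20B Q4_K_S quant
    n_gpu_layers=50,    # illustrative: more layers fit once the kv cache moves out
    offload_kqv=False,  # kv cache stays in system RAM instead of VRAM
    n_ctx=8192,
)

start = time.perf_counter()
out = llm("Once upon a time", max_tokens=128)
elapsed = time.perf_counter() - start
print(f"{out['usage']['completion_tokens'] / elapsed:.2f} t/s")
```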

**Recommended List from /u/Working-Flatworm-531:**

- Use Linux.
- Overclock RAM (if possible).
- Overclock CPU (if possible).
- Overclock GPU.
- Don't load the kv cache into VRAM; load more layers into VRAM instead.
- Use smaller quants.
- Use a fast interface (the poster hasn't tried Kobold and uses Ooba).
- Check that your RAM runs in dual channel (one way to verify is sketched below).
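One way to check the dual-channel point on Linux is to count populated DIMM slots with `dmidecode` (requires root). This is only a heuristic sketch: two or more modules usually means dual channel on desktop boards, but it ultimately depends on which slots they occupy:

```python
# Heuristic sketch: count populated memory modules via dmidecode (requires root).
import subprocess

out = subprocess.run(
    ["sudo", "dmidecode", "-t", "memory"],
    capture_output=True, text=True, check=True,
).stdout

populated = [
    line.split(":", 1)[1].strip()
    for line in out.splitlines()
    if line.strip().startswith("Size:") and "No Module Installed" not in line
]
print(f"{len(populated)} module(s) populated: {populated}")
```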

**Suggested labels:**

`{ "label-name": "memory-optimization", "description": "Strategies for optimizing memory usage when running models on a 3060 12GB GPU.", "confidence": 94.88 }`
