Running LLMs Locally with Ollama
I wanted to run AI models on my own machine — no API keys, no usage limits, no sending my code to someone else’s server. Same philosophy as the home server: own your tools. Ollama makes this surprisingly easy if you have a decent GPU.
What Actually Runs Well on 8GB VRAM
After pulling way too many models and benchmarking them, here’s what I actually kept:
The daily drivers:
- `qwen2.5-coder:7b` — This thing is scary good at Python. Fits entirely in GPU memory, responds fast, and the code quality rivals cloud models for everyday tasks.
- `qwen2.5:14b` — My go-to for writing, documentation, and anything that needs good language. Slower than the 7B models but the quality jump is worth it.
- `llama3.2:3b` — When I just need a quick answer and don’t want to wait. Two seconds and done.
For when you need to think harder:
- `deepseek-r1-distill-qwen:14b` — Reasoning + code. When the 7B coder gives you nonsense on a complex problem, this one usually gets it.
Pull the q4_K_M quantized builds for the best quality-to-size balance: `ollama pull qwen2.5:7b-q4_K_M`
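You don’t have to stay in the CLI either — the local Ollama server listens on port 11434 and exposes a small HTTP API. Here’s a minimal sketch of calling one of the models above from Python (stdlib only; swap in whatever tag `ollama list` shows on your machine):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local endpoint


def generate(model: str, prompt: str) -> str:
    """Send one non-streaming prompt to a local Ollama model and return the text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(generate("qwen2.5-coder:7b", "Write a Python function that reverses a string."))
```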
The Real Question: Local or Cloud?
This took me a while to figure out. The answer isn’t “always local” — it’s about matching the task to the tool.
| Task | Local | Cloud | Why |
|---|---|---|---|
| Code completion | ✅ | | You do this 100 times a day. Privacy matters. |
| Refactoring | ✅ | | Repetitive and standardized |
| Bug explanations | ✅ | | 7B handles this fine |
| Architecture decisions | | ✅ | You need the big brain for this |
| Documentation | | ✅ | One-time cost, quality matters |
| Writing tests | ✅ | ✅ | Draft locally, polish with cloud |
The pattern: if it’s frequent, private, or “good enough” works — run it locally. If it’s a one-time thing where quality is everything — pay for the cloud model.
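If you want to encode that rule somewhere instead of deciding by feel, here’s a hedged sketch. It reuses the `generate()` helper from the earlier snippet; `call_cloud()` is a hypothetical placeholder for whichever paid API you wire up, not a real client:

```python
# Hypothetical router: frequent / private / "good enough" tasks stay local,
# one-off quality-critical tasks go to a paid cloud model.
LOCAL_TASKS = {"completion", "refactor", "explain_bug", "test_draft"}


def call_cloud(prompt: str) -> str:
    # Placeholder: swap in your cloud provider's client here.
    raise NotImplementedError("wire up your cloud API of choice")


def route(task: str, prompt: str) -> str:
    if task in LOCAL_TASKS:
        return generate("qwen2.5-coder:7b", prompt)  # generate() from the earlier sketch
    return call_cloud(prompt)
```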
Performance Reality Check
With an RTX 4050/5050 (8GB VRAM):
- 3B–7B models: 30–50 tokens/sec. Feels instant. This is your sweet spot for daily use.
- 12B–14B models: 15–25 tokens/sec. Noticeable pause but totally usable. Better quality.
- 34B+ models: Spills into RAM, gets slow. Only use these when you really need the reasoning power and don’t mind waiting.
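Don’t take my numbers on faith — the non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so you can measure tokens/sec on your own card. A rough sketch, using the same model tags as above:

```python
import json
import urllib.request


def tokens_per_sec(model: str, prompt: str = "Explain what a mutex is.") -> float:
    """Rough generation-speed check using Ollama's own timing fields."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # eval_count = tokens generated, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)


for m in ["llama3.2:3b", "qwen2.5-coder:7b", "qwen2.5:14b"]:
    print(m, round(tokens_per_sec(m), 1), "tok/s")
```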
WSL2 GPU Setup
If you’re on Windows, Ollama runs great through WSL. The GPU passthrough just works now — verify with `nvidia-smi` inside WSL. If it fails, run `wsl --update && wsl --shutdown` and restart.
No extra CUDA setup needed. Ollama detects it automatically.
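If you want a second opinion beyond `nvidia-smi`, recent Ollama versions also expose a `/api/ps` endpoint that lists loaded models and (in my understanding) a `size_vram` field showing how much of each model actually sits in GPU memory. A quick sketch — assumes a model has already been run at least once so something is loaded:

```python
import json
import urllib.request

# List currently loaded models and estimate how much of each is in VRAM.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    for m in json.loads(resp.read()).get("models", []):
        total = m.get("size", 0)
        vram = m.get("size_vram", 0)
        pct = 100 * vram / total if total else 0
        print(f'{m["name"]}: ~{pct:.0f}% of weights in VRAM')
```

If that prints something close to 100%, the GPU is doing the work; a low number means the model spilled into system RAM.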
See Also
- Open WebUI Setup — give your local models a proper interface
- WSL for Development — the environment where Ollama runs
- Self-Hosting a Server on an Old Laptop — same philosophy, different rabbit hole
- Claude Prompts — structured prompts that work with local models too