Running LLMs Locally with Ollama
I wanted to run AI models on my own machine — no API keys, no usage limits, no sending my code to someone else’s server. Same philosophy as the home server: own your tools. Ollama makes this surprisingly easy if you have a decent GPU.
What Actually Runs Well on 8GB VRAM
After pulling way too many models and benchmarking them, here’s what I actually kept:
The daily drivers:
- `qwen2.5-coder:7b` — This thing is scary good at Python. Fits entirely in GPU memory, responds fast, and the code quality rivals cloud models for everyday tasks.
- `qwen2.5:14b` — My go-to for writing, documentation, and anything that needs good language. Slower than the 7B models but the quality jump is worth it.
- `llama3.2:3b` — When I just need a quick answer and don’t want to wait. Two seconds and done.
For when you need to think harder:
- `deepseek-r1-distill-qwen:14b` — Reasoning + code. When the 7B coder gives you nonsense on a complex problem, this one usually gets it.
Pull the q4_K_M quantized builds for the best quality-to-size balance: `ollama pull qwen2.5:7b-q4_K_M`
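You don’t have to stay in the CLI either — the local Ollama server listens on port 11434 and exposes a small HTTP API. Here’s a minimal sketch of calling one of the models above from Python (stdlib only; swap in whatever tag `ollama list` shows on your machine):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local endpoint


def generate(model: str, prompt: str) -> str:
    """Send one non-streaming prompt to a local Ollama model and return the text."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    print(generate("qwen2.5-coder:7b", "Write a Python function that reverses a string."))
```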
The Real Question: Local or Cloud?
This took me a while to figure out. The answer isn’t “always local” — it’s about matching the task to the tool.
| Task | Local | Cloud | Why |
|---|---|---|---|
| Code completion | ✅ | | You do this 100 times a day. Privacy matters. |
| Refactoring | ✅ | | Repetitive and standardized |
| Bug explanations | ✅ | | 7B handles this fine |
| Architecture decisions | | ✅ | You need the big brain for this |
| Documentation | | ✅ | One-time cost, quality matters |
| Writing tests | ✅ | ✅ | Draft locally, polish with cloud |
The pattern: if it’s frequent, private, or “good enough” works — run it locally. If it’s a one-time thing where quality is everything — pay for the cloud model.
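If you want to encode that rule somewhere instead of deciding by feel, here’s a hedged sketch. It reuses the `generate()` helper from the earlier snippet; `call_cloud()` is a hypothetical placeholder for whichever paid API you wire up, not a real client:

```python
# Hypothetical router: frequent / private / "good enough" tasks stay local,
# one-off quality-critical tasks go to a paid cloud model.
LOCAL_TASKS = {"completion", "refactor", "explain_bug", "test_draft"}


def call_cloud(prompt: str) -> str:
    # Placeholder: swap in your cloud provider's client here.
    raise NotImplementedError("wire up your cloud API of choice")


def route(task: str, prompt: str) -> str:
    if task in LOCAL_TASKS:
        return generate("qwen2.5-coder:7b", prompt)  # generate() from the earlier sketch
    return call_cloud(prompt)
```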
Performance Reality Check
With an RTX 4050/5050 (8GB VRAM):
- 3B–7B models: 30–50 tokens/sec. Feels instant. This is your sweet spot for daily use.
- 12B–14B models: 15–25 tokens/sec. Noticeable pause but totally usable. Better quality.
- 34B+ models: Spills into RAM, gets slow. Only use these when you really need the reasoning power and don’t mind waiting.
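Don’t take my numbers on faith — the non-streaming `/api/generate` response includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds), so you can measure tokens/sec on your own card. A rough sketch, using the same model tags as above:

```python
import json
import urllib.request


def tokens_per_sec(model: str, prompt: str = "Explain what a mutex is.") -> float:
    """Rough generation-speed check using Ollama's own timing fields."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    # eval_count = tokens generated, eval_duration = generation time in nanoseconds
    return data["eval_count"] / (data["eval_duration"] / 1e9)


for m in ["llama3.2:3b", "qwen2.5-coder:7b", "qwen2.5:14b"]:
    print(m, round(tokens_per_sec(m), 1), "tok/s")
```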
WSL2 GPU Setup
If you’re on Windows, Ollama runs great through WSL. The GPU passthrough just works now — verify with `nvidia-smi` inside WSL. If it fails, run `wsl --update && wsl --shutdown` and restart.
No extra CUDA setup needed. Ollama detects it automatically.
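If you want a second opinion beyond `nvidia-smi`, recent Ollama versions also expose a `/api/ps` endpoint that lists loaded models and (in my understanding) a `size_vram` field showing how much of each model actually sits in GPU memory. A quick sketch — assumes a model has already been run at least once so something is loaded:

```python
import json
import urllib.request

# List currently loaded models and estimate how much of each is in VRAM.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    for m in json.loads(resp.read()).get("models", []):
        total = m.get("size", 0)
        vram = m.get("size_vram", 0)
        pct = 100 * vram / total if total else 0
        print(f'{m["name"]}: ~{pct:.0f}% of weights in VRAM')
```

If that prints something close to 100%, the GPU is doing the work; a low number means the model spilled into system RAM.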
See Also
- Open WebUI Setup — give your local models a proper interface
- WSL for Development — the environment where Ollama runs
- Self-Hosting a Server on an Old Laptop — same philosophy, different rabbit hole
- Claude Prompts — structured prompts that work with local models too