Running AI locally in April 2026 has never been more practical: Qwen3.5-9B at Q4_K_M runs at 55+ tokens per second on an 8GB GPU, the RTX 4090 delivers 80–110 tok/s for 8B models, and the M3 Max 96GB runs 70B+ models on a single chip, something no single consumer NVIDIA GPU can match. The key insight for 2026: let your VRAM determine the model, then let your hardware determine the tool — MLX on Apple Silicon delivers 6–10× the throughput of Ollama on the same hardware, making tool selection as important as model selection.
Hardware Tier Overview
| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (Q4_K_M) |
|---|---|---|---|---|
| CPU-Only | 8–16GB RAM | Any modern laptop/desktop | Phi-4 Mini, Qwen3 1.5B, Gemma 3 2B | ~3B |
| Entry GPU | 4–8GB VRAM | RTX 3060, GTX 1080 Ti, M1/M2 base | Qwen3.5-9B, Phi-4 Mini, Llama 3.2 8B | ~9B |
| Mid GPU | 8–16GB VRAM | RTX 3080, RTX 4070, M2 Pro/Max 16GB | GPT-OSS 20B, Qwen3 14B, Mistral Nemo | ~20B |
| High-End GPU | 24GB VRAM | RTX 4090, RTX 3090, A5000 | Nemotron 3 Nano 30B, Qwen3 VL 32B | ~34B |
| Workstation | 48GB+ VRAM | A100, H100, 2× RTX 4090, M3 Max 96GB | Llama 3.3 70B, Qwen2.5 72B | ~80B |
| Apple Silicon | Unified 8–192GB | M1–M4 (all variants), M2/M3/M4 Ultra | Varies by chip; MLX recommended | Up to ~180B on M3 Ultra |
CPU-Only & Low-End (≤8GB RAM)
Running LLMs on CPU-only systems is viable in 2026 thanks to aggressive quantization and purpose-built small models. Expect 2–8 tokens per second — usable for batch tasks, not real-time chat.
- Best overall: Phi-4 Mini (Microsoft) — purpose-built for edge/CPU deployment; exceptional reasoning per parameter count; Q4_K_M fits in ~2.5GB RAM.
- Best for coding: Qwen3 1.5B or Qwen3 4B — strong code completion despite small size; Q4_K_M versions run comfortably on 8GB RAM at 4–6 tok/s.
- Best for chat: Gemma 3 2B (Google) — optimized instruction-following in a tiny footprint; best conversational quality at ≤2B parameters.
- Performance expectations: a modern 8-core CPU manages ~4–8 tok/s on 3B models; an older 4-core CPU, ~2–4 tok/s. Sufficient for local summarization, offline assistants, and edge IoT applications.
- Tip: Use llama.cpp with AVX2/AVX-512 enabled for maximum CPU throughput. On modern Intel/AMD CPUs, CPU inference has improved 3–4× since 2024.
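For a concrete starting point on CPU, a minimal llama.cpp build-and-run might look like the sketch below; the GGUF filename is a placeholder, so substitute whichever Q4_K_M file you actually downloaded.

```bash
# Build llama.cpp with native CPU optimizations (AVX2/AVX-512 are detected automatically)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run a small Q4_K_M model entirely on CPU: 8 threads, 4K context
./build/bin/llama-cli -m phi-4-mini-Q4_K_M.gguf -t 8 -c 4096 -p "Summarize the following text: ..."
```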
Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 base)
The 8GB VRAM sweet spot enables running 7–9B parameter models fully in GPU memory at 40–60+ tokens per second — a qualitative leap over CPU inference. This tier covers the vast majority of consumer gaming GPUs.
- Best overall: Qwen3.5-9B (Q4_K_M) — best 8GB model in 2026; 55+ tok/s fully in VRAM; strong across coding, reasoning, and multilingual tasks.
- Best for speed: Phi-4 Mini — leads at 28 tok/s on 8GB systems with a smaller footprint (~5GB Q4_K_M); best choice when raw speed matters most.
- Best for coding: Llama 3.2 8B or Qwen3 8B — strong HumanEval scores; both fit comfortably in ~6GB at Q4_K_M, leaving room for 8K+ context.
- Best for multimodal: Gemma 3 9B (vision) — supports image understanding within 8GB VRAM at Q4_K_M; useful for screenshot analysis and document OCR.
- Models that fit (Q4_K_M, ~5–6GB): Llama 3.2 8B, Qwen3 8B, Qwen3.5-9B, Mistral 7B v0.3, Gemma 3 9B, Phi-4 Mini.
- Context note: At 8GB VRAM, leave ~2GB of headroom for the KV cache. Aim for 8K–16K context rather than pushing to the maximum; it keeps generation speed high (see the launch sketch below).
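Using the llama.cpp build from the CPU section, a hedged launch for this tier looks like the sketch below: full GPU offload with context capped at 8K so the KV cache stays inside an 8GB card. The filename is a placeholder, and -ngl 99 simply means "offload every layer".

```bash
# Fully offload a ~9B Q4_K_M model and cap context at 8K to preserve KV-cache headroom
./build/bin/llama-server -m qwen3.5-9b-Q4_K_M.gguf -ngl 99 -c 8192 --port 8080
# The server exposes an OpenAI-compatible API at http://localhost:8080/v1
```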
Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)
The 16GB tier unlocks 14–20B parameter models — a substantial quality jump over 8B that rivals GPT-3.5-era cloud models. This is the sweet spot for power users and developers running local AI daily.
- Best overall: Qwen3 14B (Q4_K_M, ~10–11GB) — the community consensus "best 16GB model" for 2026. Runs at 60–70 tok/s on RTX 4080/4090; dramatically better than 8B models on reasoning and coding while leaving VRAM for 8–16K context.
- Best for AI benchmarks: GPT-OSS 20B (Q4_K_M, ~15GB) — 52.1% AI index while using ~15GB at 60K context; top-performing open model for this VRAM range.
- Best with vision: Apriel 1.5 (~9.9GB VRAM) — 51.6% AI index with native vision support; uniquely packs strong reasoning and image understanding into a 16GB budget.
- Best for creative writing: Mistral Nemo 12B — strong instruction-following and creative generation; Q4_K_M fits in ~8GB, leaving half the VRAM for long contexts.
- RTX 4070 (12GB) tip: Qwen3 14B Q4_K_M at ~10–11GB is the ideal fit — 1–2GB of headroom for the KV cache, ~50–60 tok/s. Avoid Q5 unless you are willing to sacrifice context length (see the Modelfile sketch below for pinning context).
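If you run Ollama on a 12GB card, one way to pin the context window (and with it the KV-cache footprint) is a custom Modelfile. A minimal sketch, assuming the library tag is qwen3:14b (check ollama.com/library for the exact name):

```bash
# Modelfile: derive a variant with a fixed 16K context window
cat > Modelfile <<'EOF'
FROM qwen3:14b
PARAMETER num_ctx 16384
EOF

# Build and run the pinned-context variant
ollama create qwen3-14b-16k -f Modelfile
ollama run qwen3-14b-16k
```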
High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)
At 24GB VRAM, you can run 30B-class models fully in GPU memory — delivering quality that approaches Claude Haiku / GPT-4o Mini territory from local hardware. The RTX 4090 is the consumer king; it delivers ~1,008 GB/s memory bandwidth, roughly 2.5× the M3 Max's 400 GB/s.
- Best 30B model: NVIDIA Nemotron 3 Nano 30B (24.3GB) — 91% Math 500 score, the highest in its parameter class; exceptional STEM and reasoning performance.
- Best 32B multimodal: Qwen3 VL 32B (24.5GB Q4_K_M) — leading vision-language model at this tier; strong for document understanding, chart analysis, and screenshot-based workflows.
- RTX 4090 real-world performance: Llama 3 8B Q4_K_M: 80–110 tok/s. Qwen3 14B Q4_K_M: 60–80 tok/s. Nemotron 30B Q4_K_M: 30–40 tok/s. Qwen3 VL 32B Q4_K_M: 25–35 tok/s.
- Best for 48GB+: Two RTX 4090s in tensor parallel, or a single A100 80GB, unlocks Llama 3.3 70B and Qwen2.5 72B at full quality — performance comparable to GPT-4-class hosted models.
- Coding recommendation for RTX 4090: Qwen3 14B Q4_K_M for daily use; step up to Nemotron 30B or Qwen3 32B for complex tasks. The 14B → 30B quality jump is meaningful; the 30B → 70B jump is incremental for most tasks.
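For the dual-GPU route mentioned above, llama.cpp can split a 70B model across two cards. A sketch, assuming the filename below and two 24GB GPUs; --split-mode row approximates tensor parallelism, while the default layer split is simpler and often fast enough:

```bash
# Split a 70B Q4_K_M model evenly across two 24GB GPUs
CUDA_VISIBLE_DEVICES=0,1 ./build/bin/llama-server \
  -m llama-3.3-70b-instruct-Q4_K_M.gguf \
  -ngl 99 --split-mode row --tensor-split 1,1 -c 8192 --port 8080
```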
Apple Silicon & MLX
Apple Silicon's unified memory architecture is transformative for local AI: an M3 Max with 96GB can run 70B-class models on a single chip, something no single consumer NVIDIA GPU can do. The key is using MLX — Apple's machine learning framework optimized for Apple Silicon — which delivers 6–10× the throughput of Ollama on identical hardware.
- M1/M2 base (8–16GB unified): Equivalent to entry GPU tier; use Qwen3.5-9B or Llama 3.2 8B Q4_K_M. MLX gives 30–50 tok/s vs. Ollama's 5–15 tok/s on the same chip.
- M2 Pro/Max (16–32GB): Strong mid-tier; MLX sustains ~230 tok/s on Llama 3 8B vs. Ollama's 20–40 tok/s. Run Qwen3 14B or GPT-OSS 20B Q4_K_M comfortably.
- M3 Max 96GB: The local AI flagship. Best 2026 model: Qwen3.6-35B-A3B MoE — only 3B parameters active per token, so it runs fast and leaves room for long context. Can also run Llama 3.3 70B Q4_K_M (~45GB) — impossible on any single consumer NVIDIA GPU.
- M4 Ultra (192GB): Can run Llama 3 405B only at sub-4-bit quantization (the ~230GB Q4_K_M build exceeds 192GB of unified memory), but it is the closest any single box gets to the largest open-weight model. Niche but remarkable for researchers and enterprises needing maximum local quality.
- MLX vs Ollama on Apple Silicon: MLX achieves up to 6–10× the throughput. For any Apple Silicon user, MLX is non-negotiable: use mlx-lm or LM Studio (which has an MLX backend) instead of standard Ollama.
- Best MoE for Mac (32–48GB): Qwen3.6-35B-A3B MoE — 2026 community default on M2/M3 Pro/Max; only 3B active per token means quality of a 35B dense model at 8B-class inference speed.
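On any of these Macs, the mlx-lm package provides both a CLI and an OpenAI-compatible server. A minimal sketch; the mlx-community repo id is illustrative, so browse the hub for a 4-bit conversion of the model you actually want:

```bash
# Install Apple's MLX LM tooling
pip install mlx-lm

# One-off generation with a 4-bit MLX conversion (repo id is an example, not a guarantee it exists)
mlx_lm.generate --model mlx-community/Qwen3.5-9B-Instruct-4bit \
  --prompt "Explain KV caching in two sentences." --max-tokens 200

# Or serve an OpenAI-compatible endpoint locally
mlx_lm.server --model mlx-community/Qwen3.5-9B-Instruct-4bit --port 8080
```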
Quantization Guide
| Format | Bits per Weight | Quality vs FP16 | VRAM vs FP16 | Best Use Case |
|---|---|---|---|---|
| Q4_K_M | ~4.5 bits avg | ~97–98% | ~75% reduction | Gold standard for consumer hardware; default recommendation. |
| Q5_K_M | ~5.5 bits avg | ~98–99% | ~65% reduction | When you have spare VRAM and want closer to FP16 quality. |
| Q8_0 | 8 bits | ~99.5% | ~50% reduction | Near-lossless; use for professional/research tasks where quality is paramount and VRAM allows. |
| Q2_K | ~2.6 bits avg | ~85–90% | ~87% reduction | Extreme memory savings only; noticeable quality degradation. Avoid for important tasks. |
| IQ4_XS | ~4.25 bits avg | ~97% | ~76% reduction | Slightly smaller than Q4_K_M with similar quality; good for tight 8GB VRAM systems. |
| F16 / BF16 | 16 bits | 100% (baseline) | Baseline | Fine-tuning and research only; impractical for consumer inference. |
For most users: stick with Q4_K_M. It uses ~75% less VRAM than FP16 while retaining 97–98% of the model's quality — the best tradeoff available. Step up to Q5_K_M if you have VRAM headroom and are running math-heavy or coding tasks where the extra 1–2% quality matters. Q8_0 is only worth it for research workflows where you need provably near-lossless inference.
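If a model is only published in F16/BF16, you can produce your own Q4_K_M build with llama.cpp's quantize tool. A minimal sketch, assuming you already have the F16 GGUF on disk:

```bash
# Convert an F16 GGUF to Q4_K_M (roughly 4x smaller, ~97-98% of quality retained)
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```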
Recommended Tools
Ollama
- Best for: Easy model management, one-command setup, OpenAI-compatible API endpoint. The default starting point for most users.
- Command: `ollama run qwen3.5:9b` — automatically downloads and runs the recommended quantization.
- Caveat: Not optimal for Apple Silicon — use MLX-backed tools for 6–10× better performance on Mac.
- Models library: ollama.com/library — thousands of models with one-line install.
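The OpenAI-compatible endpoint mentioned above lives under /v1 on Ollama's default port (11434). A quick smoke test, assuming the model tag has already been pulled:

```bash
# Chat completion against a locally running Ollama instance
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Give me a one-line summary of quantization."}]
  }'
```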
LM Studio
- Best for: GUI-first users; built-in MLX backend for Apple Silicon (6–10× faster than Ollama on Mac); model discovery and GGUF management. Also runs an OpenAI-compatible local server.
- Standout feature: Visual VRAM usage estimator shows exactly how much memory a model+quantization combo will use before you download.
llama.cpp
- Best for: Maximum performance and control; C++ CLI with CUDA, Metal, Vulkan, and CPU backends. The foundation that most other tools are built on.
- Best use cases: Custom server deployments, fine-tuned models, batch inference, and embedding generation. AVX2/AVX-512 support gives best-in-class CPU performance.
- Tip: Use the `-ngl` (`--n-gpu-layers`) flag to offload a chosen number of layers to the GPU while the rest run on CPU — useful for models that almost, but don't quite, fit in VRAM.
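A rough sketch of that partial-offload pattern; the layer count and filename are illustrative, and in practice you lower -ngl until the model stops overflowing VRAM:

```bash
# Offload roughly two-thirds of the layers to a 24GB GPU and keep the rest on CPU
./build/bin/llama-cli -m nemotron-3-nano-30b-Q4_K_M.gguf -ngl 32 -c 8192 \
  -p "Walk through this proof step by step: ..."
```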