Running AI locally in April 2026 has never been more practical: Qwen3.5-9B at Q4_K_M runs at 55+ tokens per second on an 8GB GPU, the RTX 4090 delivers 80–110 tok/s for 8B models, and the M3 Max 96GB runs 70B+ models on a single chip, something no single consumer NVIDIA GPU can match. The key insight for 2026: let your VRAM determine the model, then let your hardware determine the tool — MLX on Apple Silicon delivers 6–10× the throughput of Ollama on the same hardware, making tool selection as important as model selection.
Hardware Tier Overview
| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (Q4_K_M) |
|---|---|---|---|---|
| CPU-Only | 8–16GB RAM | Any modern laptop/desktop | Phi-4 Mini, Qwen3 1.5B, Gemma 3 2B | ~3B |
| Entry GPU | 4–8GB VRAM | RTX 3060, GTX 1080 Ti, M1/M2 base | Qwen3.5-9B, Phi-4 Mini, Llama 3.2 8B | ~9B |
| Mid GPU | 8–16GB VRAM | RTX 3080, RTX 4070, M2 Pro/Max 16GB | GPT-OSS 20B, Qwen3 14B, Mistral Nemo | ~20B |
| High-End GPU | 24GB VRAM | RTX 4090, RTX 3090, A5000 | Nemotron 3 Nano 30B, Qwen3 VL 32B | ~34B |
| Workstation | 48GB+ VRAM | A100, H100, 2× RTX 4090, M3 Max 96GB | Llama 3.3 70B, Qwen2.5 72B | ~80B |
| Apple Silicon | Unified 8–192GB | M1–M4 (all variants), M2/M3/M4 Ultra | Varies by chip; MLX recommended | Up to ~180B on M3 Ultra |
CPU-Only & Low-End (≤8GB RAM)
Running LLMs on CPU-only systems is viable in 2026 thanks to aggressive quantization and purpose-built small models. Expect 2–8 tokens per second — usable for batch tasks, not real-time chat.
- Best overall: Phi-4 Mini (Microsoft) — purpose-built for edge/CPU deployment; exceptional reasoning per parameter count; Q4_K_M fits in ~2.5GB RAM.
- Best for coding: Qwen3 1.5B or Qwen3 4B — strong code completion despite small size; Q4_K_M versions run comfortably on 8GB RAM at 4–6 tok/s.
- Best for chat: Gemma 3 2B (Google) — optimized instruction-following in a tiny footprint; best conversational quality at ≤2B parameters.
- Performance expectations: a modern 8-core CPU manages ~4–8 tok/s on 3B models; an older 4-core CPU, ~2–4 tok/s. Sufficient for local summarization, offline assistants, and edge IoT applications.
- Tip: Use llama.cpp with AVX2/AVX-512 enabled for maximum CPU throughput. On modern Intel/AMD CPUs, CPU inference has improved 3–4× since 2024.
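For a concrete starting point on CPU, a minimal llama.cpp build-and-run might look like the sketch below; the GGUF filename is a placeholder, so substitute whichever Q4_K_M file you actually downloaded.

```bash
# Build llama.cpp with native CPU optimizations (AVX2/AVX-512 are detected automatically)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Run a small Q4_K_M model entirely on CPU: 8 threads, 4K context
./build/bin/llama-cli -m phi-4-mini-Q4_K_M.gguf -t 8 -c 4096 -p "Summarize the following text: ..."
```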
Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 base)
The 8GB VRAM sweet spot enables running 7–9B parameter models fully in GPU memory at 40–60+ tokens per second — a qualitative leap over CPU inference. This tier covers the vast majority of consumer gaming GPUs.
- Best overall: Qwen3.5-9B (Q4_K_M) — best 8GB model in 2026; 55+ tok/s fully in VRAM; strong across coding, reasoning, and multilingual tasks.
- Best for speed: Phi-4 Mini — leads at 28 tok/s on 8GB systems with a smaller footprint (~5GB Q4_K_M); best choice when raw speed matters most.
- Best for coding: Llama 3.2 8B or Qwen3 8B — strong HumanEval scores; both fit comfortably in ~6GB at Q4_K_M, leaving room for 8K+ context.
- Best for multimodal: Gemma 3 9B (vision) — supports image understanding within 8GB VRAM at Q4_K_M; useful for screenshot analysis and document OCR.
- Models that fit (Q4_K_M, ~5–6GB): Llama 3.2 8B, Qwen3 8B, Qwen3.5-9B, Mistral 7B v0.3, Gemma 3 9B, Phi-4 Mini.
- Context note: At 8GB VRAM, leave ~2GB of headroom for the KV cache. Aim for 8K–16K context rather than pushing to the maximum; it keeps generation speed high (see the launch sketch below).
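Using the llama.cpp build from the CPU section, a hedged launch for this tier looks like the sketch below: full GPU offload with context capped at 8K so the KV cache stays inside an 8GB card. The filename is a placeholder, and -ngl 99 simply means "offload every layer".

```bash
# Fully offload a ~9B Q4_K_M model and cap context at 8K to preserve KV-cache headroom
./build/bin/llama-server -m qwen3.5-9b-Q4_K_M.gguf -ngl 99 -c 8192 --port 8080
# The server exposes an OpenAI-compatible API at http://localhost:8080/v1
```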
Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)
The 16GB tier unlocks 14–20B parameter models — a substantial quality jump over 8B that rivals GPT-3.5-era cloud models. This is the sweet spot for power users and developers running local AI daily.
- Best overall: Qwen3 14B (Q4_K_M, ~10–11GB) — the community consensus "best 16GB model" for 2026. Runs at 60–70 tok/s on RTX 4080/4090; dramatically better than 8B models on reasoning and coding while leaving VRAM for 8–16K context.
- Best for AI benchmarks: GPT-OSS 20B (Q4_K_M, ~15GB) — 52.1% AI index while using ~15GB at 60K context; top-performing open model for this VRAM range.
- Best with vision: Apriel 1.5 (~9.9GB VRAM) — 51.6% AI index with native vision support; uniquely packs strong reasoning and image understanding into a 16GB budget.
- Best for creative writing: Mistral Nemo 12B — strong instruction-following and creative generation; Q4_K_M fits in ~8GB, leaving half the VRAM for long contexts.
- RTX 4070 (12GB) tip: Qwen3 14B Q4_K_M at ~10–11GB is the ideal fit — 1–2GB of headroom for the KV cache, ~50–60 tok/s. Avoid Q5 unless you are willing to sacrifice context length (see the Modelfile sketch below for pinning context).
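If you run Ollama on a 12GB card, one way to pin the context window (and with it the KV-cache footprint) is a custom Modelfile. A minimal sketch, assuming the library tag is qwen3:14b (check ollama.com/library for the exact name):

```bash
# Modelfile: derive a variant with a fixed 16K context window
cat > Modelfile <<'EOF'
FROM qwen3:14b
PARAMETER num_ctx 16384
EOF

# Build and run the pinned-context variant
ollama create qwen3-14b-16k -f Modelfile
ollama run qwen3-14b-16k
```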
High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)
At 24GB VRAM, you can run 30B-class models fully in GPU memory — delivering quality that approaches Claude Haiku / GPT-4o Mini territory from local hardware. The RTX 4090 is the consumer king; it delivers ~1,008 GB/s memory bandwidth, roughly 2.5× the M3 Max's 400 GB/s.
- Best 30B model: NVIDIA Nemotron 3 Nano 30B (24.3GB) — 91% Math 500 score, the highest in its parameter class; exceptional STEM and reasoning performance.
- Best 32B multimodal: Qwen3 VL 32B (24.5GB Q4_K_M) — leading vision-language model at this tier; strong for document understanding, chart analysis, and screenshot-based workflows.
- RTX 4090 real-world performance: Llama 3 8B Q4_K_M: 80–110 tok/s. Qwen3 14B Q4_K_M: 60–80 tok/s. Nemotron 30B Q4_K_M: 30–40 tok/s. Qwen3 VL 32B Q4_K_M: 25–35 tok/s.
- Best for 48GB+: Two RTX 4090s in tensor parallel, or a single A100 80GB, unlocks Llama 3.3 70B and Qwen2.5 72B at full quality — performance comparable to GPT-4-class hosted models.
- Coding recommendation for RTX 4090: Qwen3 14B Q4_K_M for daily use; step up to Nemotron 30B or Qwen3 32B for complex tasks. The 14B → 30B quality jump is meaningful; the 30B → 70B jump is incremental for most tasks.
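For the dual-GPU route mentioned above, llama.cpp can split a 70B model across two cards. A sketch, assuming the filename below and two 24GB GPUs; --split-mode row approximates tensor parallelism, while the default layer split is simpler and often fast enough:

```bash
# Split a 70B Q4_K_M model evenly across two 24GB GPUs
CUDA_VISIBLE_DEVICES=0,1 ./build/bin/llama-server \
  -m llama-3.3-70b-instruct-Q4_K_M.gguf \
  -ngl 99 --split-mode row --tensor-split 1,1 -c 8192 --port 8080
```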
Apple Silicon & MLX
Apple Silicon's unified memory architecture is transformative for local AI: an M3 Max with 96GB can run 70B-class models on a single chip, something no single consumer NVIDIA GPU can do. The key is using MLX — Apple's machine learning framework optimized for Apple Silicon — which delivers 6–10× the throughput of Ollama on identical hardware.
- M1/M2 base (8–16GB unified): Equivalent to entry GPU tier; use Qwen3.5-9B or Llama 3.2 8B Q4_K_M. MLX gives 30–50 tok/s vs. Ollama's 5–15 tok/s on the same chip.
- M2 Pro/Max (16–32GB): Strong mid-tier; MLX sustains ~230 tok/s on Llama 3 8B vs. Ollama's 20–40 tok/s. Run Qwen3 14B or GPT-OSS 20B Q4_K_M comfortably.
- M3 Max 96GB: The local AI flagship. Best 2026 model: Qwen3.6-35B-A3B MoE — only 3B parameters active per token, so it runs fast and leaves room for long context. Can also run Llama 3.3 70B Q4_K_M (~45GB) — impossible on any single consumer NVIDIA GPU.
- M4 Ultra (192GB): Can run Llama 3 405B only at sub-4-bit quantization (the ~230GB Q4_K_M build exceeds 192GB of unified memory), but it is the closest any single box gets to the largest open-weight model. Niche but remarkable for researchers and enterprises needing maximum local quality.
- MLX vs Ollama on Apple Silicon: MLX achieves up to 6–10× the throughput. For any Apple Silicon user, MLX is non-negotiable: use mlx-lm or LM Studio (which has an MLX backend) instead of standard Ollama.
- Best MoE for Mac (32–48GB): Qwen3.6-35B-A3B MoE — 2026 community default on M2/M3 Pro/Max; only 3B active per token means quality of a 35B dense model at 8B-class inference speed.
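On any of these Macs, the mlx-lm package provides both a CLI and an OpenAI-compatible server. A minimal sketch; the mlx-community repo id is illustrative, so browse the hub for a 4-bit conversion of the model you actually want:

```bash
# Install Apple's MLX LM tooling
pip install mlx-lm

# One-off generation with a 4-bit MLX conversion (repo id is an example, not a guarantee it exists)
mlx_lm.generate --model mlx-community/Qwen3.5-9B-Instruct-4bit \
  --prompt "Explain KV caching in two sentences." --max-tokens 200

# Or serve an OpenAI-compatible endpoint locally
mlx_lm.server --model mlx-community/Qwen3.5-9B-Instruct-4bit --port 8080
```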
Quantization Guide
| Format | Bits per Weight | Quality vs FP16 | VRAM vs FP16 | Best Use Case |
|---|---|---|---|---|
| Q4_K_M | ~4.5 bits avg | ~97–98% | ~75% reduction | Gold standard for consumer hardware; default recommendation. |
| Q5_K_M | ~5.5 bits avg | ~98–99% | ~65% reduction | When you have spare VRAM and want closer to FP16 quality. |
| Q8_0 | 8 bits | ~99.5% | ~50% reduction | Near-lossless; use for professional/research tasks where quality is paramount and VRAM allows. |
| Q2_K | ~2.6 bits avg | ~85–90% | ~87% reduction | Extreme memory savings only; noticeable quality degradation. Avoid for important tasks. |
| IQ4_XS | ~4.25 bits avg | ~97% | ~76% reduction | Slightly smaller than Q4_K_M with similar quality; good for tight 8GB VRAM systems. |
| F16 / BF16 | 16 bits | 100% (baseline) | Baseline | Fine-tuning and research only; impractical for consumer inference. |
For most users: stick with Q4_K_M. It uses ~75% less VRAM than FP16 while retaining 97–98% of the model's quality — the best tradeoff available. Step up to Q5_K_M if you have VRAM headroom and are running math-heavy or coding tasks where the extra 1–2% quality matters. Q8_0 is only worth it for research workflows where you need provably near-lossless inference.
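If a model is only published in F16/BF16, you can produce your own Q4_K_M build with llama.cpp's quantize tool. A minimal sketch, assuming you already have the F16 GGUF on disk:

```bash
# Convert an F16 GGUF to Q4_K_M (roughly 4x smaller, ~97-98% of quality retained)
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```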
Recommended Tools
Ollama
- Best for: Easy model management, one-command setup, OpenAI-compatible API endpoint. The default starting point for most users.
- Command: `ollama run qwen3.5:9b` — automatically downloads and runs the recommended quantization.
- Caveat: Not optimal for Apple Silicon — use MLX-backed tools for 6–10× better performance on Mac.
- Models library: ollama.com/library — thousands of models with one-line install.
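The OpenAI-compatible endpoint mentioned above lives under /v1 on Ollama's default port (11434). A quick smoke test, assuming the model tag has already been pulled:

```bash
# Chat completion against a locally running Ollama instance
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5:9b",
    "messages": [{"role": "user", "content": "Give me a one-line summary of quantization."}]
  }'
```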
LM Studio
- Best for: GUI-first users; built-in MLX backend for Apple Silicon (6–10× faster than Ollama on Mac); model discovery and GGUF management. Also runs an OpenAI-compatible local server.
- Standout feature: Visual VRAM usage estimator shows exactly how much memory a model+quantization combo will use before you download.
llama.cpp
- Best for: Maximum performance and control; C++ CLI with CUDA, Metal, Vulkan, and CPU backends. The foundation that most other tools are built on.
- Best use cases: Custom server deployments, fine-tuned models, batch inference, and embedding generation. AVX2/AVX-512 support gives best-in-class CPU performance.
- Tip: Use the `-ngl` (`--n-gpu-layers`) flag to offload a chosen number of layers to the GPU while the rest run on CPU — useful for models that almost, but don't quite, fit in VRAM.
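A rough sketch of that partial-offload pattern; the layer count and filename are illustrative, and in practice you lower -ngl until the model stops overflowing VRAM:

```bash
# Offload roughly two-thirds of the layers to a 24GB GPU and keep the rest on CPU
./build/bin/llama-cli -m nemotron-3-nano-30b-Q4_K_M.gguf -ngl 32 -c 8192 \
  -p "Walk through this proof step by step: ..."
```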