Running large language models locally has never been more accessible. In April 2026, the combination of highly optimized quantization techniques, mature inference frameworks like Ollama and llama.cpp, and increasingly capable open-weight models means that consumer hardware from an 8GB RAM laptop to an RTX 4090 desktop can deliver genuinely useful AI at near-zero marginal cost. The key decision is matching model size and quantization level to your hardware tier.
Hardware Tier Overview
| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Practical Params |
|---|---|---|---|---|
| CPU-Only / Low-End | ≤8GB RAM | Older laptops, Raspberry Pi 5 | Phi-4 Mini, Gemma 3 2B, Qwen3 1.7B | ~4B (Q4) |
| Entry GPU | 4–8GB VRAM | RTX 3060, RTX 4060, M1/M2 base (8GB) | Mistral 7B, Llama 3.1 8B, Gemma 3 9B | ~9B (Q4) |
| Mid GPU | 8–16GB VRAM | RTX 3080, RTX 4070, M2 Pro/Max (16–32GB) | Qwen3 14B, GPT-OSS 20B, Mistral Large 3 | ~20B (Q4) |
| High-End GPU | 24GB+ VRAM | RTX 3090, RTX 4090, A100 40GB | Qwen3 32B, Llama 4 Scout 17B, DeepSeek V3 (Q2) | ~70B (Q2–Q3) |
| Professional / Multi-GPU | 48GB+ VRAM | A100 80GB, H100, 2× RTX 4090 | Llama 4 Maverick 70B, DeepSeek V3.2 (Q4) | ~70B (Q4+) |
| Apple Silicon | 16–192GB unified | M2/M3/M4 Pro, Max, Ultra | Qwen3 32B (M3 Max), Llama 4 Scout (M2 Pro) | ~70B+ (M3 Ultra) |
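To find your tier, start by checking how much GPU (or unified) memory you actually have. The commands below are standard system tools, not part of any inference framework:

```bash
# NVIDIA: report GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

# Apple Silicon: unified memory is simply the system RAM figure
sysctl -n hw.memsize | awk '{print $1/1073741824 " GB unified memory"}'
```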
CPU-Only & Low-End (≤8GB RAM)
CPU inference is slow by GPU standards but fully functional for lightweight tasks. Expect 1–8 tokens per second depending on CPU generation and model size. The sweet spot is models under roughly 4B parameters.
- Phi-4 Mini (Microsoft) — The standout choice for CPU-only or very RAM-constrained systems. Remarkably capable at 3.8B parameters; achieves 28 tok/s even on modest hardware. Strong at reasoning and instruction following for its size.
- Gemma 3 2B (Google) — Excellent for devices with as little as 4GB RAM. Quantized to Q4_K_M it fits easily and delivers coherent responses for chat and summarization. Free and open-weight.
- Qwen3 1.7B (Alibaba) — Smallest capable model in the Qwen3 family. Strong multilingual support and good instruction following. Available on Ollama as `qwen3:1.7b`.
- Llama 3.2 3B (Meta) — Meta's compact model with strong English-language performance. A solid default for hobbyists on CPU-only machines who want a well-known, widely-supported model.
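Getting started on a CPU-only machine takes two commands with Ollama. A minimal session using the `qwen3:1.7b` tag mentioned above (any model in this list works the same way):

```bash
# Download the quantized model (roughly 1–2GB for a 1.7B model at Q4)
ollama pull qwen3:1.7b

# One-shot prompt; omit the quoted prompt for an interactive chat
ollama run qwen3:1.7b "Summarize the plot of Hamlet in three sentences."
```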
Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 Base)
At 8GB VRAM, the 7–9B parameter class runs fully on-GPU at comfortable speeds (25–60 tok/s). This tier represents the most popular local AI setup in 2026.
- Qwen3.5-9B (Q4_K_M) — Best overall for 8GB VRAM in 2026. Delivers 55+ tokens per second fully resident in GPU memory. Exceptional instruction following, coding ability, and multilingual support. Pull with `ollama pull qwen3.5:9b`.
- Mistral 7B v0.3 — The reliable workhorse. Runs at 50–60 tok/s on an RTX 3060, handles most general-purpose tasks competently, and has wide tooling support across all major inference frameworks.
- Llama 3.1 8B (Meta) — Excellent English-language chat and reasoning at 8GB. Strong community support and many fine-tuned variants available on Ollama and HuggingFace.
- Gemma 3 9B (Google) — Google's 9B model fits in 8GB at Q4 and delivers surprisingly strong performance on coding and math tasks for its size.
- Phi-4 Mini — For cards in the 4–6GB range (RTX 3060 6GB, older laptop GPUs), Phi-4 Mini remains the best option: it fits with headroom and offers the best quality per VRAM byte at this constraint.
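The speeds quoted in this tier assume the model is fully resident in VRAM; if Ollama silently spills layers to system RAM, throughput drops sharply. You can verify the split after loading a model:

```bash
# Load the model, then inspect where it ended up
ollama run mistral "hello" > /dev/null
ollama ps
# The PROCESSOR column should read "100% GPU"; a CPU/GPU split means
# the model or its context overflowed VRAM
```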
Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)
The 16GB VRAM tier unlocks the 13–20B parameter class, which provides a substantial quality jump over 7–9B models while still running at usable speeds.
- Qwen3 14B (Q4_K_M) — Best balance of speed and capability in the 16GB class. Strong instruction following, coding, and reasoning. Runs at 35–50 tok/s on an RTX 4070. Pull with `ollama pull qwen3:14b`.
- GPT-OSS 20B — Achieves 52.1% on the AI index benchmark while using approximately 15GB of total memory at 60K context with Q4_K_M quantization (see the Modelfile sketch after this list for raising the context window). Runs at around 30–40 tok/s on an RTX 3080. Impressive performance-per-VRAM ratio.
- Mistral Large 3 (Q4) — At ~16GB for the Q4 weights, this sits at the very top of the tier and needs a full 16GB card; it delivers near-frontier quality for complex reasoning and multilingual tasks. Strong alternative for European-language use cases.
- DeepSeek-V2.5 (Q4) — For developers focused on coding tasks, DeepSeek V2.5 in the 16B quantized form is the strongest coding model available at this tier — competitive benchmark results and excellent diff-format output for Aider-style workflows.
- Apple M2 Pro/Max (16–32GB unified) — Apple Silicon handles this tier particularly well due to the unified memory architecture eliminating CPU-GPU bandwidth bottlenecks. Qwen3 14B runs at 35–45 tok/s on M2 Pro — comparable to an RTX 4070 while using a fraction of the power.
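The 60K-context figure quoted for GPT-OSS 20B assumes you raise Ollama's default context window, which ships small on most models. A Modelfile is the cleanest way to bake in a larger window; note that KV-cache memory grows with context, so leave VRAM headroom. A minimal sketch:

```bash
# Derive a long-context variant from the base model
cat > Modelfile <<'EOF'
FROM qwen3:14b
PARAMETER num_ctx 32768
EOF

ollama create qwen3-14b-32k -f Modelfile
ollama run qwen3-14b-32k
```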
High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)
With 24GB VRAM, the 30–32B class runs fully on-GPU with Q4 quantization. This tier delivers near-frontier quality for most tasks. The RTX 4090 is the best consumer GPU for local AI in 2026; the RTX 3090 offers the same 24GB at 30–40% lower speeds for $500–700 used.
- Qwen3 32B (Q4_K_M) — The top recommendation for RTX 4090 owners in 2026. 24GB VRAM fits it comfortably; delivers excellent quality across chat, coding, and reasoning at ~20 tok/s. Pull with `ollama pull qwen3:32b`.
- Qwen3 30B (Q4) — Maintains 33.38 tok/s even at 48K context with a 100% on-GPU workload. Exceptional for long-context tasks where speed matters.
- Qwen 2.5 Coder 32B — The specialist choice for coding. Outperforms general-purpose 32B models on coding benchmarks while fitting in 24GB at Q4. Best local option for code generation, review, and debugging workflows.
- DeepSeek R1 14B (Q8_0) — For reasoning-heavy tasks, running the 14B R1 distill at Q8_0 (near-lossless quality) fits comfortably within 24GB and delivers the strongest local reasoning capability currently available.
- Llama 4 Scout 17B — Meta's latest Scout model brings competitive quality in a compact form factor. Runs at 40–55 tok/s on an RTX 4090 at Q4, making it the fastest capable model at this tier.
- NVIDIA Nemotron 3 Nano 30B — NVIDIA's own optimized model trades blows with Qwen3 VL 32B at the top of the 24GB class. Particularly strong for structured output and enterprise document tasks.
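Ollama's default tags are typically Q4_K_M; higher-quality quantizations are published as separate tags for many models. Exact tag names vary by model, so the second pull below is illustrative — verify it on the model's ollama.com page before running:

```bash
# Default tag (usually Q4_K_M)
ollama pull qwen3:32b

# Explicit quantization tag, where the library publishes one
# (check the model's ollama.com page for the exact name)
ollama pull qwen3:32b-q8_0
```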
Apple Silicon & MLX
Apple Silicon's unified memory architecture makes it uniquely competitive for local LLM inference. The same memory is shared between CPU and GPU with high-bandwidth access, eliminating the VRAM ceiling that constrains discrete GPU systems.
- Mac Mini M4 Pro ($1,399) — The best entry point for serious local AI work in 2026. 24GB unified memory, silent operation, ~$2.60/month electricity cost running 24/7. Runs Qwen3 32B at Q4 competently. Exceptional value.
- MacBook Pro M3 Max (48GB) — Handles 40–70B parameter models at Q4 in unified memory — territory that requires a multi-GPU setup on the discrete GPU side. Generates text at 15–25 tok/s for 70B models, fast enough for most interactive use cases.
- Mac Studio / Mac Pro M3 Ultra (128–192GB) — Enables 120B+ parameter models locally. Even the previous-generation M2 Ultra generates text at 15–20 tok/s on very large models. The only consumer platform where true 70B+ Q8 inference is practical.
- MLX Framework — Apple's MLX library is purpose-built for Apple Silicon inference and consistently outperforms Ollama's Metal backend by 10–30% on M-series chips. For maximum performance on Mac, use MLX directly (a minimal invocation follows this list) or tools like LM Studio, which integrate it natively.
- Ollama on Mac — Still the easiest setup for most users. Supports all major models via simple `ollama pull` commands and manages Metal acceleration automatically.
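To try MLX directly rather than through LM Studio, the mlx-lm package provides a simple CLI. A minimal sketch, assuming a 4-bit community conversion from the mlx-community organization on HuggingFace (the repo name below is illustrative — browse mlx-community for current conversions):

```bash
pip install mlx-lm

# Generate with a 4-bit MLX conversion; the model downloads on first run
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Explain unified memory in one paragraph."
```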
Quantization Guide (Q4_K_M vs Q5_K_M vs Q8_0)
Quantization compresses model weights from 16-bit floating point to lower bit-widths, dramatically reducing VRAM requirements and increasing inference speed. Understanding the tradeoffs between quantization levels is essential for matching models to hardware.
| Format | Bits per Weight | Memory vs FP16 | Quality Loss | Speed | Best For |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5 bits | ~75% reduction | Minimal (~1–2%) | Fastest | Default choice; best memory/quality tradeoff |
| Q5_K_M | ~5.5 bits | ~65% reduction | Near-lossless | Fast | When you have headroom and want better quality |
| Q6_K | ~6.5 bits | ~60% reduction | Negligible | Moderate | Maximum quality within VRAM constraints |
| Q8_0 | 8 bits | ~50% reduction | Imperceptible | Moderate | Reasoning models (R1) where quality is critical |
| Q2_K | ~2.5 bits | ~85% reduction | Significant | Very Fast | 70B+ models on 24GB VRAM (last resort) |
| FP16 / BF16 | 16 bits | Baseline | None | Slowest | Fine-tuning only; too large for most consumer hardware |
Recommendation: Start with Q4_K_M — it is the community standard for a reason. Move to Q5_K_M if you have spare VRAM headroom and the task is quality-sensitive. Use Q8_0 only for reasoning models (DeepSeek R1, QwQ) where the chain-of-thought quality difference is measurable.
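A useful rule of thumb for sizing: weight memory ≈ parameter count × bits per weight ÷ 8, plus roughly 1–3GB of overhead for the KV cache and runtime buffers (more at long context). For example, Qwen3 14B at Q4_K_M needs about 14 × 10⁹ × 4.5 ÷ 8 ≈ 7.9GB for weights — comfortable on a 12GB card — while the same model at Q8_0 needs ~14GB and wants a 16GB+ card.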
Recommended Tools
Ollama
- The easiest way to run local models. Single command install, automatic model downloads, and a REST API compatible with many frontends.
- Supports Mac (Metal), Linux (CUDA/ROCm), and Windows (CUDA/ROCm).
- Model library at ollama.com/search covers Llama, Qwen, Mistral, Gemma, Phi, DeepSeek, and 100+ others.
- Best for: getting started quickly, serving models to local apps via OpenAI-compatible API.
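Because the API is OpenAI-compatible, any client library or frontend that speaks the OpenAI API can point at Ollama by changing the base URL. A minimal check from the command line (Ollama's server listens on port 11434 by default):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'
```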
LM Studio
- GUI-based desktop application with model browser, chat interface, and local server mode.
- Excellent for non-technical users. Integrates MLX on Apple Silicon for best Mac performance.
- Supports GGUF models from HuggingFace and has built-in model management.
- Best for: desktop users who prefer a GUI; Apple Silicon users (MLX integration gives 10–30% speed boost).
llama.cpp
- The foundational C++ inference engine that underpins Ollama, LM Studio, and dozens of other tools.
- Maximum performance and customization at the cost of a steeper setup curve. Compile-time flags for CUDA, Metal, OpenCL, and BLAS backends.
- Supports the widest range of quantization formats and experimental features (speculative decoding, continuous batching) before they reach higher-level wrappers.
- Best for: power users, server deployments, and developers who need fine-grained control over inference parameters.
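A representative build-and-run sketch, assuming an NVIDIA GPU (Metal is enabled by default on Mac builds; flag names track the project's current CMake setup, so check the repo's build docs if they have moved). The GGUF filename below is illustrative:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# -ngl 99 offloads all layers to the GPU; -c sets the context size
./build/bin/llama-cli -m qwen3-14b-Q4_K_M.gguf -ngl 99 -c 8192 \
  -p "Write a haiku about quantization."
```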