Running large language models locally has never been more accessible. In April 2026, the combination of highly optimized quantization techniques, mature inference frameworks like Ollama and llama.cpp, and increasingly capable open-weight models means that consumer hardware from an 8GB RAM laptop to an RTX 4090 desktop can deliver genuinely useful AI at near-zero marginal cost. The key decision is matching model size and quantization level to your hardware tier.
Hardware Tier Overview
| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Practical Params |
|---|---|---|---|---|
| CPU-Only / Low-End | ≤8GB RAM | Older laptops, Raspberry Pi 5 | Phi-4 Mini, Gemma 3 2B, Qwen3 1.7B | ~4B (Q4) |
| Entry GPU | 4–8GB VRAM | RTX 3060, RTX 4060, M1/M2 base (8GB) | Mistral 7B, Llama 3.1 8B, Gemma 3 9B | ~9B (Q4) |
| Mid GPU | 8–16GB VRAM | RTX 3080, RTX 4070, M2 Pro/Max (16–32GB) | Qwen3 14B, GPT-OSS 20B, Mistral Large 3 | ~20B (Q4) |
| High-End GPU | 24GB+ VRAM | RTX 3090, RTX 4090, A100 40GB | Qwen3 32B, Llama 4 Scout 17B, DeepSeek V3 (Q2) | ~70B (Q2–Q3) |
| Professional / Multi-GPU | 48GB+ VRAM | A100 80GB, H100, 2× RTX 4090 | Llama 4 Maverick 70B, DeepSeek V3.2 (Q4) | ~70B (Q4+) |
| Apple Silicon | 16–192GB unified | M2/M3/M4 Pro, Max, Ultra | Qwen3 32B (M3 Max), Llama 4 Scout (M2 Pro) | ~70B+ (M3 Ultra) |
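To find your tier, start by checking how much GPU (or unified) memory you actually have. The commands below are standard system tools, not part of any inference framework:

```bash
# NVIDIA: report GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv

# Apple Silicon: unified memory is simply the system RAM figure
sysctl -n hw.memsize | awk '{print $1/1073741824 " GB unified memory"}'
```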
CPU-Only & Low-End (≤8GB RAM)
CPU inference is slow by GPU standards but fully functional for lightweight tasks. Expect 1–8 tokens per second depending on CPU generation and model size. The sweet spot is models under roughly 4B parameters.
- Phi-4 Mini (Microsoft) — The standout choice for CPU-only or very RAM-constrained systems. Remarkably capable at 3.8B parameters; achieves 28 tok/s even on modest hardware. Strong at reasoning and instruction following for its size.
- Gemma 3 2B (Google) — Excellent for devices with as little as 4GB RAM. Quantized to Q4_K_M it fits easily and delivers coherent responses for chat and summarization. Free and open-weight.
- Qwen3 1.7B (Alibaba) — Smallest capable model in the Qwen3 family. Strong multilingual support and good instruction following. Available on Ollama as `qwen3:1.7b`.
- Llama 3.2 3B (Meta) — Meta's compact model with strong English-language performance. A solid default for hobbyists on CPU-only machines who want a well-known, widely-supported model.
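Getting started on a CPU-only machine takes two commands with Ollama. A minimal session using the `qwen3:1.7b` tag mentioned above (any model in this list works the same way):

```bash
# Download the quantized model (roughly 1–2GB for a 1.7B model at Q4)
ollama pull qwen3:1.7b

# One-shot prompt; omit the quoted prompt for an interactive chat
ollama run qwen3:1.7b "Summarize the plot of Hamlet in three sentences."
```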
Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 Base)
At 8GB VRAM, the 7–9B parameter class runs fully on-GPU at comfortable speeds (25–60 tok/s). This tier represents the most popular local AI setup in 2026.
- Qwen3.5-9B (Q4_K_M) — Best overall for 8GB VRAM in 2026. Delivers 55+ tokens per second fully resident in GPU memory. Exceptional instruction following, coding ability, and multilingual support. Pull with `ollama pull qwen3.5:9b`.
- Mistral 7B v0.3 — The reliable workhorse. Runs at 50–60 tok/s on an RTX 3060, handles most general-purpose tasks competently, and has wide tooling support across all major inference frameworks.
- Llama 3.1 8B (Meta) — Excellent English-language chat and reasoning at 8GB. Strong community support and many fine-tuned variants available on Ollama and HuggingFace.
- Gemma 3 9B (Google) — Google's 9B model fits in 8GB at Q4 and delivers surprisingly strong performance on coding and math tasks for its size.
- Phi-4 Mini — For cards in the 4–6GB range (RTX 3060 6GB, older laptop GPUs), Phi-4 Mini remains the best option: it fits with headroom and offers the best quality per VRAM byte at this constraint.
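The speeds quoted in this tier assume the model is fully resident in VRAM; if Ollama silently spills layers to system RAM, throughput drops sharply. You can verify the split after loading a model:

```bash
# Load the model, then inspect where it ended up
ollama run mistral "hello" > /dev/null
ollama ps
# The PROCESSOR column should read "100% GPU"; a CPU/GPU split means
# the model or its context overflowed VRAM
```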
Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)
The 16GB VRAM tier unlocks the 13–20B parameter class, which provides a substantial quality jump over 7–9B models while still running at usable speeds.
- Qwen3 14B (Q4_K_M) — Best balance of speed and capability in the 16GB class. Strong instruction following, coding, and reasoning. Runs at 35–50 tok/s on an RTX 4070. Pull with `ollama pull qwen3:14b`.
- GPT-OSS 20B — Achieves 52.1% on the AI index benchmark while using approximately 15GB of total memory at 60K context with Q4_K_M quantization (see the Modelfile sketch after this list for raising the context window). Runs at around 30–40 tok/s on an RTX 3080. Impressive performance-per-VRAM ratio.
- Mistral Large 3 (Q4) — At ~16GB for the Q4 weights, this sits at the very top of the tier and needs a full 16GB card; it delivers near-frontier quality for complex reasoning and multilingual tasks. Strong alternative for European-language use cases.
- DeepSeek-V2.5 (Q4) — For developers focused on coding tasks, DeepSeek V2.5 in the 16B quantized form is the strongest coding model available at this tier — competitive benchmark results and excellent diff-format output for Aider-style workflows.
- Apple M2 Pro/Max (16–32GB unified) — Apple Silicon handles this tier particularly well due to the unified memory architecture eliminating CPU-GPU bandwidth bottlenecks. Qwen3 14B runs at 35–45 tok/s on M2 Pro — comparable to an RTX 4070 while using a fraction of the power.
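The 60K-context figure quoted for GPT-OSS 20B assumes you raise Ollama's default context window, which ships small on most models. A Modelfile is the cleanest way to bake in a larger window; note that KV-cache memory grows with context, so leave VRAM headroom. A minimal sketch:

```bash
# Derive a long-context variant from the base model
cat > Modelfile <<'EOF'
FROM qwen3:14b
PARAMETER num_ctx 32768
EOF

ollama create qwen3-14b-32k -f Modelfile
ollama run qwen3-14b-32k
```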
High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)
With 24GB VRAM, the 30–32B class runs fully on-GPU with Q4 quantization. This tier delivers near-frontier quality for most tasks. The RTX 4090 is the best consumer GPU for local AI in 2026; the RTX 3090 offers the same 24GB at 30–40% lower speeds for $500–700 used.
- Qwen3 32B (Q4_K_M) — The top recommendation for RTX 4090 owners in 2026. 24GB VRAM fits it comfortably; delivers excellent quality across chat, coding, and reasoning at ~20 tok/s. Pull with `ollama pull qwen3:32b`.
- Qwen3 30B (Q4) — Maintains 33.38 tok/s even at 48K context with a 100% on-GPU workload. Exceptional for long-context tasks where speed matters.
- Qwen 2.5 Coder 32B — The specialist choice for coding. Outperforms general-purpose 32B models on coding benchmarks while fitting in 24GB at Q4. Best local option for code generation, review, and debugging workflows.
- DeepSeek R1 14B (Q8_0) — For reasoning-heavy tasks, running the 14B R1 distill at Q8_0 (near-lossless quality) fits comfortably within 24GB and delivers the strongest local reasoning capability currently available.
- Llama 4 Scout 17B — Meta's latest Scout model brings competitive quality in a compact form factor. Runs at 40–55 tok/s on an RTX 4090 at Q4, making it the fastest capable model at this tier.
- NVIDIA Nemotron 3 Nano 30B — NVIDIA's own optimized model trades blows with Qwen3 VL 32B at the top of the 24GB class. Particularly strong for structured output and enterprise document tasks.
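Ollama's default tags are typically Q4_K_M; higher-quality quantizations are published as separate tags for many models. Exact tag names vary by model, so the second pull below is illustrative — verify it on the model's ollama.com page before running:

```bash
# Default tag (usually Q4_K_M)
ollama pull qwen3:32b

# Explicit quantization tag, where the library publishes one
# (check the model's ollama.com page for the exact name)
ollama pull qwen3:32b-q8_0
```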
Apple Silicon & MLX
Apple Silicon's unified memory architecture makes it uniquely competitive for local LLM inference. The same memory is shared between CPU and GPU with high-bandwidth access, eliminating the VRAM ceiling that constrains discrete GPU systems.
- Mac Mini M4 Pro ($1,399) — The best entry point for serious local AI work in 2026. 24GB unified memory, silent operation, ~$2.60/month electricity cost running 24/7. Runs Qwen3 32B at Q4 competently. Exceptional value.
- MacBook Pro M3 Max (48GB) — Handles 40–70B parameter models at Q4 in unified memory — territory that requires a multi-GPU setup on the discrete GPU side. Generates text at 15–25 tok/s for 70B models, fast enough for most interactive use cases.
- Mac Studio / Mac Pro M3 Ultra (128–192GB) — Enables 120B+ parameter models locally. Even the previous-generation M2 Ultra generates text at 15–20 tok/s on very large models. The only consumer platform where true 70B+ Q8 inference is practical.
- MLX Framework — Apple's MLX library is purpose-built for Apple Silicon inference and consistently outperforms Ollama's Metal backend by 10–30% on M-series chips. For maximum performance on Mac, use MLX directly (a minimal invocation follows this list) or tools like LM Studio, which integrate it natively.
- Ollama on Mac — Still the easiest setup for most users. Supports all major models via simple `ollama pull` commands and manages Metal acceleration automatically.
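To try MLX directly rather than through LM Studio, the mlx-lm package provides a simple CLI. A minimal sketch, assuming a 4-bit community conversion from the mlx-community organization on HuggingFace (the repo name below is illustrative — browse mlx-community for current conversions):

```bash
pip install mlx-lm

# Generate with a 4-bit MLX conversion; the model downloads on first run
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Explain unified memory in one paragraph."
```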
Quantization Guide (Q4_K_M vs Q5_K_M vs Q8_0)
Quantization compresses model weights from 16-bit floating point to lower bit-widths, dramatically reducing VRAM requirements and increasing inference speed. Understanding the tradeoffs between quantization levels is essential for matching models to hardware.
| Format | Bits per Weight | Memory vs FP16 | Quality Loss | Speed | Best For |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5 bits | ~75% reduction | Minimal (~1–2%) | Fastest | Default choice; best memory/quality tradeoff |
| Q5_K_M | ~5.5 bits | ~65% reduction | Near-lossless | Fast | When you have headroom and want better quality |
| Q6_K | ~6.5 bits | ~60% reduction | Negligible | Moderate | Maximum quality within VRAM constraints |
| Q8_0 | 8 bits | ~50% reduction | Imperceptible | Moderate | Reasoning models (R1) where quality is critical |
| Q2_K | ~2.5 bits | ~85% reduction | Significant | Very Fast | 70B+ models on 24GB VRAM (last resort) |
| FP16 / BF16 | 16 bits | Baseline | None | Slowest | Fine-tuning only; too large for most consumer hardware |
Recommendation: Start with Q4_K_M — it is the community standard for a reason. Move to Q5_K_M if you have spare VRAM headroom and the task is quality-sensitive. Use Q8_0 only for reasoning models (DeepSeek R1, QwQ) where the chain-of-thought quality difference is measurable.
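A useful rule of thumb for sizing: weight memory ≈ parameter count × bits per weight ÷ 8, plus roughly 1–3GB of overhead for the KV cache and runtime buffers (more at long context). For example, Qwen3 14B at Q4_K_M needs about 14 × 10⁹ × 4.5 ÷ 8 ≈ 7.9GB for weights — comfortable on a 12GB card — while the same model at Q8_0 needs ~14GB and wants a 16GB+ card.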
Recommended Tools
Ollama
- The easiest way to run local models. Single command install, automatic model downloads, and a REST API compatible with many frontends.
- Supports Mac (Metal), Linux (CUDA/ROCm), and Windows (CUDA/ROCm).
- Model library at ollama.com/search covers Llama, Qwen, Mistral, Gemma, Phi, DeepSeek, and 100+ others.
- Best for: getting started quickly, serving models to local apps via OpenAI-compatible API.
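Because the API is OpenAI-compatible, any client library or frontend that speaks the OpenAI API can point at Ollama by changing the base URL. A minimal check from the command line (Ollama's server listens on port 11434 by default):

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'
```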
LM Studio
- GUI-based desktop application with model browser, chat interface, and local server mode.
- Excellent for non-technical users. Integrates MLX on Apple Silicon for best Mac performance.
- Supports GGUF models from HuggingFace and has built-in model management.
- Best for: desktop users who prefer a GUI; Apple Silicon users (MLX integration gives 10–30% speed boost).
llama.cpp
- The foundational C++ inference engine that underpins Ollama, LM Studio, and dozens of other tools.
- Maximum performance and customization at the cost of a steeper setup curve. Compile-time flags for CUDA, Metal, OpenCL, and BLAS backends.
- Supports the widest range of quantization formats and experimental features (speculative decoding, continuous batching) before they reach higher-level wrappers.
- Best for: power users, server deployments, and developers who need fine-grained control over inference parameters.
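A representative build-and-run sketch, assuming an NVIDIA GPU (Metal is enabled by default on Mac builds; flag names track the project's current CMake setup, so check the repo's build docs if they have moved). The GGUF filename below is illustrative:

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# -ngl 99 offloads all layers to the GPU; -c sets the context size
./build/bin/llama-cli -m qwen3-14b-Q4_K_M.gguf -ngl 99 -c 8192 \
  -p "Write a haiku about quantization."
```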