Running AI models locally has never been more practical. In April 2026, a consumer RTX 4090 can serve Qwen3-Coder 32B at 20–30 tokens per second — rivaling cloud API responsiveness at zero marginal cost. Apple Silicon Macs with 32GB+ unified memory have emerged as the premier local AI platform, with Ollama's optional MLX backend now pushing throughput to 70–80 tok/s on M4 Max hardware. Whether you're protecting sensitive data, eliminating API costs, or experimenting offline, this guide covers the right model for every hardware tier.
Hardware Tier Overview
| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (Practical) |
|---|---|---|---|---|
| CPU Only | 8–16GB RAM | Any modern laptop/desktop | Llama 3.2 3B, Qwen3-1.7B, Phi-3.5 Mini | 3–4B |
| Low-End GPU | 4–6GB VRAM | GTX 1060, GTX 1650, older laptops | Llama 3.2 3B Q4, Gemma 2 2B, Phi-3.5 Mini | 3–4B |
| Entry GPU | 6–8GB VRAM | RTX 3060 (8GB), RTX 3060 Laptop (6GB), M1/M2 base (8GB) | Llama 3.1 8B Q4, Mistral 7B, Gemma 2 9B Q4 | 7–9B |
| Mid GPU | 8–16GB VRAM | RTX 3080, RTX 4070, M2 Pro/Max (16GB) | Llama 3.3 14B, Mistral Small 22B Q4, DeepSeek Coder V2 16B | 14–22B |
| High-End GPU | 24GB+ VRAM | RTX 4090, A100, RTX 5090, M3/M4 Max/Ultra | Qwen3-32B Q4, DeepSeek-R1 32B, Llama 3.3 70B Q2 | 32–70B |
| Multi-GPU | 48GB+ VRAM | 2× RTX 4090, 2× A100, M3 Ultra (192GB) | Llama 3.3 70B Q4, Qwen3-72B, DeepSeek-R2 | 70–235B |
CPU-Only & Low-End GPU (≤16GB RAM / ≤6GB VRAM)
CPU-only inference is slow but functional for occasional use. Expect 2–8 tokens per second on a modern 8-core CPU. These models punch above their weight for specific tasks like local search, document Q&A, and simple code help.
- Phi-3.5 Mini 3.8B (Q4_K_M): Microsoft's model is the undisputed king of sub-4B performance. Its small footprint keeps it responsive even on CPU, and it achieves coherent reasoning on tasks that would embarrass older 7B models. Best all-rounder for CPU-only systems.
- Qwen3-1.7B: Alibaba's smallest model in the Qwen3 family. Remarkably capable for its size — handles coding help, translation, and summarization. Runs comfortably in 2GB RAM at Q4.
- Llama 3.2 3B: Meta's entry-level model with strong instruction following. 3B at Q4_K_M uses ~2.5GB RAM; a good baseline for testing Ollama setups before committing to larger models.
- Gemma 2 2B: Google's 2B model is surprisingly good at structured tasks (JSON extraction, classification). Fastest inference on CPU of any quality model in this tier.
Key tip: on CPU-only, set llama.cpp's --threads to your physical core count; on a low-end GPU, offload as many layers as fit with -ngl. Use Q4_K_M quantization and avoid Q8 — the quality gain doesn't justify 2x slower inference at this tier.
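If you want to script against a model rather than chat with it, a minimal CPU-only setup with llama-cpp-python looks like the sketch below. The GGUF filename, thread count, and prompt are placeholders; adapt them to your system.

```python
# Minimal CPU-only inference with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./phi-3.5-mini-Q4_K_M.gguf",  # placeholder: any Q4_K_M GGUF you've downloaded
    n_ctx=4096,       # context window; smaller saves RAM
    n_threads=8,      # match your physical core count, not logical threads
    n_gpu_layers=0,   # 0 = pure CPU; raise this on a low-end GPU to offload layers
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize: local inference trades speed for privacy."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```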
Entry GPU Tier (6–8GB VRAM: RTX 3060, M1/M2 Base)
The 6–8GB tier hits a sweet spot: fast enough for interactive use (20–40 tok/s), capable enough for real work. The key is staying within VRAM limits; models that spill to CPU RAM become dramatically slower.
- Llama 3.1 8B (Q4_K_M): The safe default for this tier. ~4.7GB VRAM, 25–35 tok/s on RTX 3060. Strong instruction following, good at coding basics, and the most widely tested model in Ollama's ecosystem.
- Mistral 7B Instruct v0.3: Slightly smaller than Llama 3.1 8B and better at structured output tasks. Excellent for RAG pipelines and document Q&A (see the JSON-extraction sketch after this list). Fits in 5GB VRAM at Q4.
- Gemma 2 9B (Q4_K_M): Google's 9B model delivers quality noticeably above 7–8B models while still fitting in 8GB VRAM at Q4. Best reasoning quality in this tier.
- DeepSeek-Coder-V2 Lite 16B (Q2_K): For coding specifically, the 16B Lite model at aggressive Q2 quantization squeezes into ~8GB. Quality is compromised but still strong for code completion at this VRAM level.
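The structured-output work this tier handles well is easy to script. Below is a minimal JSON-extraction sketch using the official ollama Python client (pip install ollama); the model tag and extraction schema are illustrative, not prescribed.

```python
# Constrain a local model to emit valid JSON via Ollama's format option.
import json
import ollama

response = ollama.chat(
    model="mistral",  # illustrative tag; use any model you've pulled locally
    messages=[{
        "role": "user",
        "content": "Return JSON with keys invoice_number and total for: "
                   "Invoice #4821, total due $312.50",
    }],
    format="json",  # tells Ollama to constrain output to valid JSON
)

data = json.loads(response["message"]["content"])
print(data["invoice_number"], data["total"])
```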
Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)
The mid tier is the sweet spot for productive local AI in 2026. Models in the 14–22B range deliver quality that's genuinely competitive with older frontier models, at 30–50 tok/s speeds.
- Qwen3-14B (Q5_K_M): The best general-purpose model for 12–16GB VRAM. Qwen3 excels at coding, multilingual tasks, and structured reasoning. Q5_K_M quality is very close to full-precision while using ~10GB VRAM.
- Mistral Small 22B (Q4_K_M): For 16GB VRAM systems, Mistral Small 22B at Q4 (~14GB) is a significant quality jump over 14B models. Excellent for code review, writing, and analysis tasks.
- DeepSeek-Coder-V2 Lite 16B: The dedicated coding model for this tier. Outperforms general-purpose models on code completion and debugging tasks. Fits in 12GB VRAM at Q4_K_M.
- Llama 3.3 14B: Meta's 14B is notably better than 13B predecessors. Excellent at instruction following and chain-of-thought reasoning. Good all-rounder when you don't need specialized coding or multilingual capability.
High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)
The RTX 4090 is the local AI sweet spot in 2026 — it runs every important model at speeds that feel like a cloud API. Its 24GB of VRAM fits 32B models at Q4_K_M with room to spare for long context.
- Qwen3-32B (Q4_K_M) — Top Pick: ~19GB VRAM. 20–30 tok/s on RTX 4090. The best balance of quality and speed in the 24GB tier. Outperforms DeepSeek-V2.5 on coding and matches it on reasoning. Near-GPT-4-class output for zero per-token cost.
- DeepSeek-R1 32B (Q4_K_M): DeepSeek's reasoning model at 32B delivers near-70B reasoning quality on math and coding problems. ~19GB VRAM, 20–25 tok/s. Best choice when chain-of-thought reasoning quality matters most.
- Qwen3-Coder 32B (Q4_K_M): For development workflows, Qwen3-Coder 32B is the specialist option. Optimized specifically for code generation, editing, and debugging across all major languages.
- Llama 3.3 70B (Q2_K): Aggressive quantization allows 70B to fit in ~24GB. Quality is noticeably below Q4, but the massive parameter count still delivers stronger reasoning than any 32B model. Best for occasional complex tasks where speed is secondary.
- Mistral Small 22B (Q8_0): At Q8_0 (~23GB, a tight fit on a 24GB card; keep context modest), Mistral Small delivers near-full-precision quality — a compelling option when you want the best possible output from a 22B model without quantization compromise.
RTX 4090 performance expectations: 7B Q4 = 80–100 tok/s, 13B Q4 = 50–70 tok/s, 30B Q4 = 20–35 tok/s, 70B Q2 = 15–25 tok/s.
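To see where your own card lands against these figures, you don't need a benchmark suite: Ollama returns token counts and timings with every response. A minimal sketch, with the model tag as a placeholder:

```python
# Compute tok/s from the timing fields Ollama attaches to each response:
# eval_count is tokens generated, eval_duration is in nanoseconds.
import ollama

MODEL = "llama3.1:8b"  # placeholder: benchmark whatever you have pulled

r = ollama.generate(model=MODEL, prompt="Write a 200-word story about a robot.")
tok_per_s = r["eval_count"] / (r["eval_duration"] / 1e9)
print(f"{MODEL}: {tok_per_s:.1f} tok/s over {r['eval_count']} tokens")
```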
Apple Silicon & MLX
Apple Silicon's unified memory architecture eliminates the VRAM bottleneck that constrains discrete GPUs. A Mac with 32GB unified memory effectively has a "32GB GPU" (in practice macOS reserves a slice for the system, so budget roughly three-quarters of it for model weights) — and 64GB/128GB/192GB Macs unlock model sizes impossible on any single consumer GPU.
- M1/M2 base (8GB): Llama 3.1 8B or Qwen3-8B at Q4. Similar to the entry GPU tier. ~15–25 tok/s via Ollama. The integrated GPU is less efficient than a dedicated NVIDIA for pure inference speed, but the thermal and power efficiency is excellent.
- M2 Pro / M3 (16–32GB): Qwen3-14B to Qwen3-32B at Q4. This is where Apple Silicon starts to shine — Qwen3-32B Q4 runs at comfortable interactive speeds on an M3 Pro with 32GB, slower than an RTX 4090 but with no GPU fan noise.
- M3/M4 Max (36–64GB): Qwen3-32B at Q8, or Llama 3.3 70B at Q4. The April 2026 sweet spot. Qwen3-Coder 32B Q4_K_M delivers ~30–35 tok/s even on an M4 Pro, and the Max chips are faster still. High quality and near-silent, with fans rarely audible outside sustained load.
- M3/M4 Ultra (128–192GB): Llama 3.3 70B at Q8, Qwen3-72B at Q5, or even 235B models at Q2. Apple's Ultra chips essentially eliminate local model size limits for consumer hardware.
MLX vs Ollama on Apple Silicon: MLX typically beats Ollama's default llama.cpp Metal backend by 15–30% in throughput while using ~10% less memory, and the gap is larger on some models. Ollama 0.19 (March 2026) added an optional MLX backend for Apple Silicon Macs with 32GB+ unified memory, lifting Qwen3-32B from ~45 tok/s to ~70–80 tok/s on M4 Max. For maximum speed: use MLX-LM. For easiest setup and model management: use Ollama. For the newest model support when MLX lags: llama.cpp.
Quantization Guide
Quantization reduces model precision to shrink VRAM requirements. The right level depends on your VRAM budget and quality requirements.
| Format | Bits per Weight | Quality vs Full | VRAM vs Full | Best Use Case |
|---|---|---|---|---|
| Q8_0 | 8-bit | ~99% — imperceptible loss | ~50% | When VRAM allows; premium quality at reduced footprint |
| Q6_K | 6-bit | ~97% — negligible loss | ~38% | Good balance; worth choosing over Q4 when VRAM allows |
| Q5_K_M | 5-bit | ~95% — minimal loss | ~31% | Sweet spot for mid-tier GPUs; excellent quality per GB |
| Q4_K_M | 4-bit | ~90–93% — small but noticeable | ~25% | Most popular; best VRAM-to-quality tradeoff for 7B–32B |
| Q3_K_M | 3-bit | ~82–85% — noticeable degradation | ~19% | Only when Q4 won't fit; significant quality hit on reasoning |
| Q2_K | 2-bit | ~70–75% — significant loss | ~13% | Last resort; enables 70B on 24GB; reasoning suffers |
Recommendation for most users: Use Q4_K_M as the default. If your GPU has room to spare after loading at Q4, upgrade to Q5_K_M for a noticeable improvement. Reserve Q8 for models you use heavily where quality is paramount. Never use Q2 unless it's the only way to run a target model size.
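To sanity-check whether a quant will fit before downloading, note that weights take roughly parameters × bits-per-weight / 8 bytes. The sketch below uses approximate bits-per-weight figures; leave headroom beyond the weights for KV cache and runtime overhead.

```python
# Back-of-envelope weight size for a quantized model. Real GGUF files land
# close to params * bpw / 8; budget extra VRAM for KV cache and overhead.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    return params_billions * bits_per_weight / 8

for name, params, bpw in [
    ("Llama 3.1 8B @ Q4_K_M", 8, 4.85),   # Q4_K_M averages ~4.85 bits/weight
    ("Qwen3-32B @ Q4_K_M", 32, 4.85),
    ("Llama 3.3 70B @ Q2_K", 70, 2.6),    # Q2_K averages ~2.6 bits/weight
]:
    print(f"{name}: ~{weight_gb(params, bpw):.1f} GB of weights")
```

This lands close to the figures quoted above: roughly 4.9GB for an 8B at Q4, 19GB for a 32B at Q4, and 23GB for a 70B at Q2.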
Recommended Tools
Ollama
The easiest entry point for local AI in 2026. One-command install, automatic model downloads, and a clean REST API compatible with OpenAI's format. Manages multiple models, handles GPU detection automatically, and supports macOS, Linux, and Windows. The Ollama model library covers 200+ models. Version 0.19 (March 2026) added optional MLX backend for Apple Silicon. Best for: developers who want to get running in under 5 minutes and use Ollama-served models from existing OpenAI-compatible apps.
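Because the API mirrors OpenAI's, existing client code can target a local model by swapping the base URL. A minimal sketch (Ollama ignores the API key value, but the client library requires one to be set):

```python
# Point the standard OpenAI client at a local Ollama server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1:8b",  # any model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Explain quantization in one paragraph."}],
)
print(resp.choices[0].message.content)
```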
LM Studio
A polished desktop GUI for running local models. Point-and-click model downloads from HuggingFace, built-in chat interface, and a local server that exposes an OpenAI-compatible API. Best for: non-technical users, professionals who want a clean chat interface, and teams evaluating models side-by-side without writing code.
llama.cpp
The foundational inference engine underneath most local AI tools (including Ollama). Direct llama.cpp access offers the most control: custom quantization, server configuration, Metal/CUDA/Vulkan backend selection, and support for the newest models before they appear in higher-level tools. Best for: power users who want maximum control, researchers benchmarking new model releases, and deployments requiring custom server configuration.
MLX-LM (Apple Silicon Only)
MLX-LM is the language-model package built on MLX, Apple's machine learning framework for Apple Silicon. It runs 15–30% faster than llama.cpp on M-series chips with ~10% better memory efficiency. Install via pip, load models from HuggingFace. Increasingly the preferred backend for Apple Silicon users who prioritize throughput. Best for: M-series Mac users who want maximum speed and are comfortable with Python command-line tools.
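A minimal MLX-LM sketch; the HuggingFace repo id is illustrative, so browse the mlx-community organization for current conversions:

```python
# Load a pre-converted MLX model from HuggingFace and generate text
# (pip install mlx-lm; Apple Silicon only).
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in two sentences.",
    max_tokens=128,
)
print(text)
```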