Running AI models locally has never been more practical. In April 2026, a consumer RTX 4090 can serve Qwen3-Coder 32B at 20–30 tokens per second — rivaling cloud API responsiveness at zero marginal cost. Apple Silicon Macs with 32GB+ unified memory have emerged as the premier local AI platform, with Ollama's optional MLX backend now pushing throughput to 70–80 tok/s on M4 Max hardware. Whether you're protecting sensitive data, eliminating API costs, or experimenting offline, this guide covers the right model for every hardware tier.

Hardware Tier Overview

| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (Practical) |
|------|------------|------------------|--------------------|------------------------|
| CPU Only | 8–16GB RAM | Any modern laptop/desktop | Llama 3.2 3B, Qwen3-1.7B, Phi-3.5 Mini | 3–4B |
| Low-End GPU | 4–6GB VRAM | GTX 1060, GTX 1650, older laptops | Llama 3.2 3B Q4, Gemma 2 2B, Phi-3.5 Mini | 3–4B |
| Entry GPU | 6–8GB VRAM | RTX 3060 (6GB/8GB), M1/M2 base (8GB) | Llama 3.1 8B Q4, Mistral 7B, Gemma 2 9B Q4 | 7–9B |
| Mid GPU | 8–16GB VRAM | RTX 3080, RTX 4070, M2 Pro/Max (16GB) | Llama 3.3 14B, Mistral Small 22B Q4, DeepSeek Coder V2 16B | 14–22B |
| High-End GPU | 24GB+ VRAM | RTX 4090, A100, RTX 5090, M3/M4 Max/Ultra | Qwen3-32B Q4, DeepSeek-R1 32B, Llama 3.3 70B Q2 | 32–70B |
| Multi-GPU | 48GB+ VRAM | 2× RTX 4090, 2× A100, M3 Ultra (192GB) | Llama 3.3 70B Q4, Qwen3-72B, DeepSeek-R2 | 70–235B |

CPU-Only & Low-End GPU

CPU-only inference is slow but functional for occasional use. Expect 2–8 tokens per second on a modern 8-core CPU. These models punch above their weight for specific tasks like local search, document Q&A, and simple code help.

  • Phi-3.5 Mini 3.8B (Q4_K_M): Microsoft's model is the undisputed king of sub-4B performance. Optimized specifically for CPU inference, it achieves coherent reasoning on tasks that would embarrass older 7B models. Best all-rounder for CPU-only systems.
  • Qwen3-1.7B: Alibaba's smallest model in the Qwen3 family. Remarkably capable for its size — handles coding help, translation, and summarization. Runs comfortably in 2GB RAM at Q4.
  • Llama 3.2 3B: Meta's entry-level model with strong instruction following. 3B at Q4_K_M uses ~2.5GB RAM; a good baseline for testing Ollama setups before committing to larger models.
  • Gemma 2 2B: Google's 2B model is surprisingly good at structured tasks (JSON extraction, classification). Fastest inference on CPU of any quality model in this tier.

Key tip: Tune llama.cpp's thread count to match your physical cores (and on low-end GPUs, offload as many layers as fit with -ngl), and use Q4_K_M quantization. Avoid Q8 on CPU-only systems: the quality gain doesn't justify 2x slower inference at this tier.
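If you drive llama.cpp from Python, those same knobs live on the constructor of the llama-cpp-python bindings. A minimal CPU-only sketch; the model path is a placeholder for whatever Q4_K_M GGUF you've downloaded:

```python
# Minimal CPU-only setup via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="./models/phi-3.5-mini-Q4_K_M.gguf",  # placeholder path
    n_ctx=4096,      # context window; larger contexts cost more RAM
    n_threads=8,     # match physical cores, not logical/hyperthreaded ones
    n_gpu_layers=0,  # CPU-only; on a low-end GPU, raise until VRAM is full
)

out = llm("Summarize the Pareto principle in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```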

Entry GPU Tier (6–8GB VRAM: RTX 3060, M1/M2 Base)

The 6–8GB tier hits a sweet spot: fast enough for interactive use (20–40 tok/s), capable enough for real work. The key is staying within VRAM limits; models that spill to CPU RAM become dramatically slower (the sketch after this list shows one way to check).

  • Llama 3.1 8B (Q4_K_M): The safe default for this tier. ~4.7GB VRAM, 25–35 tok/s on RTX 3060. Strong instruction following, good at coding basics, and the most widely tested model in Ollama's ecosystem.
  • Mistral 7B Instruct v0.3: Slightly smaller than Llama 3.1 8B, better at structured output tasks. Excellent for RAG pipelines and document Q&A. Fits in 5GB VRAM at Q4.
  • Gemma 2 9B (Q4_K_M): Google's 9B model delivers quality noticeably above 7–8B models while still fitting in 8GB VRAM at Q4. Best reasoning quality in this tier.
  • DeepSeek-Coder-V2 Lite 16B (Q2_K): For coding specifically, the 16B Lite model at aggressive Q2 quantization squeezes into ~8GB. Quality is compromised but still strong for code completion at this VRAM level.
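Ollama's /api/ps endpoint reports how much of each loaded model is resident in GPU memory, which makes spillover easy to spot. A rough sketch, assuming a local Ollama server on its default port:

```python
# Check whether loaded Ollama models are fully resident in VRAM.
import requests

resp = requests.get("http://localhost:11434/api/ps", timeout=5)
for m in resp.json().get("models", []):
    total, vram = m["size"], m.get("size_vram", 0)
    pct = 100 * vram / total if total else 0
    print(f"{m['name']}: {vram / 1e9:.1f} / {total / 1e9:.1f} GB in VRAM ({pct:.0f}%)")
    if pct < 100:
        print("  -> partially offloaded to CPU RAM; expect much slower generation")
```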

Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)

The mid tier is the sweet spot for productive local AI in 2026. Models in the 14–22B range deliver quality that's genuinely competitive with older frontier models, at 30–50 tok/s speeds.

  • Qwen3-14B (Q5_K_M): The best general-purpose model for 12–16GB VRAM. Qwen3 excels at coding, multilingual tasks, and structured reasoning. Q5_K_M quality is very close to full-precision while using ~10GB VRAM.
  • Mistral Small 22B (Q4_K_M): For 16GB VRAM systems, Mistral Small 22B at Q4 (~14GB) is a significant quality jump over 14B models. Excellent for code review, writing, and analysis tasks.
  • DeepSeek Coder V2 16B: The dedicated coding model for this tier. Outperforms general-purpose models on code completion and debugging tasks. Fits in 12GB VRAM at Q4_K_M.
  • Llama 3.3 14B: Meta's 14B is notably better than 13B predecessors. Excellent at instruction following and chain-of-thought reasoning. Good all-rounder when you don't need specialized coding or multilingual capability.

High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)

The RTX 4090 is the local AI workhorse in 2026: it runs every important model at speeds that feel like a cloud API. Its 24GB of VRAM fits 32B models at Q4_K_M with room to spare for long context.

  • Qwen3-32B (Q4_K_M) — Top Pick: ~19GB VRAM. 20–30 tok/s on RTX 4090. The best balance of quality and speed in the 24GB tier. Outperforms DeepSeek-V2.5 on coding and matches it on reasoning. Near-GPT-4-class output for zero per-token cost.
  • DeepSeek-R1 32B (Q4_K_M): DeepSeek's reasoning model at 32B delivers near-70B reasoning quality on math and coding problems. ~19GB VRAM, 20–25 tok/s. Best choice when chain-of-thought reasoning quality matters most.
  • Qwen3-Coder 32B (Q4_K_M): For development workflows, Qwen3-Coder 32B is the specialist option. Optimized specifically for code generation, editing, and debugging across all major languages.
  • Llama 3.3 70B (Q2_K): Aggressive quantization allows 70B to fit in ~24GB. Quality is noticeably below Q4, but the massive parameter count still delivers stronger reasoning than any 32B model. Best for occasional complex tasks where speed is secondary.
  • Mistral Small 22B (Q8_0): At Q8_0 (~24GB), Mistral Small delivers near-full-precision quality — a compelling option when you want the best possible output from a 22B model without quantization compromise.

RTX 4090 performance expectations: 7B Q4 = 80–100 tok/s, 13B Q4 = 50–70 tok/s, 30B Q4 = 20–35 tok/s, 70B Q2 = 15–25 tok/s.
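Throughput depends heavily on context length, drivers, and quantization, so it's worth measuring on your own hardware rather than trusting tables. A rough sketch against Ollama's REST API; the model tag is just an example:

```python
# Rough tokens-per-second measurement using Ollama's /api/generate endpoint.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen3:32b",  # example tag; use any model you've pulled
        "prompt": "Explain KV caching in one paragraph.",
        "stream": False,
    },
    timeout=600,
).json()

tokens = resp["eval_count"]        # generated tokens
secs = resp["eval_duration"] / 1e9  # Ollama reports duration in nanoseconds
print(f"{tokens} tokens in {secs:.1f}s -> {tokens / secs:.1f} tok/s")
```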

Apple Silicon & MLX

Apple Silicon's unified memory architecture eliminates the VRAM bottleneck that constrains discrete GPUs. A Mac with 32GB unified memory effectively has a "32GB GPU" — and 64GB/128GB/192GB Macs unlock model sizes impossible on any single consumer GPU.

  • M1/M2 base (8GB): Llama 3.1 8B or Qwen3-8B at Q4. Similar to the entry GPU tier. ~15–25 tok/s via Ollama. The integrated GPU can't match a dedicated NVIDIA card for raw inference speed, but the thermal and power efficiency is excellent.
  • M2 Pro / M3 (16–32GB): Qwen3-14B to Qwen3-32B at Q4. This is where Apple Silicon starts to shine: 30–45 tok/s on Qwen3-32B Q4 on an M3 Pro with 32GB, comparable to an RTX 4090 but with no GPU fan noise.
  • M3/M4 Max (36–64GB): Qwen3-32B at Q8, or Llama 3.3 70B at Q4. The April 2026 sweet spot. Qwen3-Coder 32B Q4_K_M delivers ~30–35 tok/s on M4 Pro. High quality, and effectively silent except under sustained load.
  • M3/M4 Ultra (128–192GB): Llama 3.3 70B at Q8, Qwen3-72B at Q5, or even 235B models at Q2. Apple's Ultra chips essentially eliminate local model size limits for consumer hardware.

MLX vs Ollama on Apple Silicon: MLX beats Ollama's default llama.cpp Metal backend by 15–30% in throughput while using ~10% less memory. As of version 0.19 (March 2026), Ollama ships an optional MLX backend for Apple Silicon Macs with 32GB+ unified memory, lifting Qwen3-Coder 32B from ~45 tok/s to ~70–80 tok/s on M4 Max. For maximum speed, use MLX-LM. For the easiest setup and model management, use Ollama. For the newest model support when MLX lags, use llama.cpp.

Quantization Guide

Quantization reduces model precision to shrink VRAM requirements. The right level depends on your VRAM budget and quality requirements.

| Format | Bits per Weight | Quality vs Full | VRAM vs Full | Best Use Case |
|--------|-----------------|-----------------|--------------|---------------|
| Q8_0 | 8-bit | ~99% (imperceptible loss) | ~50% | When VRAM allows; premium quality at reduced footprint |
| Q6_K | 6-bit | ~97% (negligible loss) | ~38% | Good balance; worth choosing over Q4 when VRAM allows |
| Q5_K_M | 5-bit | ~95% (minimal loss) | ~31% | Sweet spot for mid-tier GPUs; excellent quality per GB |
| Q4_K_M | 4-bit | ~90–93% (small but noticeable loss) | ~25% | Most popular; best VRAM-to-quality tradeoff for 7B–32B |
| Q3_K_M | 3-bit | ~82–85% (noticeable degradation) | ~19% | Only when Q4 won't fit; significant quality hit on reasoning |
| Q2_K | 2-bit | ~70–75% (significant loss) | ~13% | Last resort; enables 70B on 24GB; reasoning suffers |

Recommendation for most users: Use Q4_K_M as the default. If your GPU has room to spare after loading at Q4, upgrade to Q5_K_M for a noticeable improvement. Reserve Q8 for models you use heavily where quality is paramount. Never use Q2 unless it's the only way to run a target model size.
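The VRAM figures quoted throughout this guide follow from simple arithmetic: parameter count times average bits per weight, divided by 8 to get bytes, plus headroom for the KV cache and runtime buffers. A back-of-the-envelope sketch (the ~4.7 bits-per-weight average for Q4_K_M is an approximation, since K-quants mix precisions across tensors):

```python
# Back-of-the-envelope weight footprint: params * bits-per-weight / 8 bytes.
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Q4_K_M averages roughly 4.5-4.8 bits per weight; budget an extra
# 1-3 GB on top for the KV cache and runtime buffers.
for params in (8, 14, 32, 70):
    print(f"{params}B @ Q4_K_M weights ~ {weight_gb(params, 4.7):.1f} GB")
# 8B -> ~4.7 GB and 32B -> ~18.8 GB, matching the Llama 3.1 8B and
# Qwen3-32B figures quoted above (before KV cache overhead).
```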

Ollama

The easiest entry point for local AI in 2026. One-command install, automatic model downloads, and a clean REST API compatible with OpenAI's format. Manages multiple models, handles GPU detection automatically, and supports macOS, Linux, and Windows. The Ollama model library covers 200+ models. Version 0.19 (March 2026) added optional MLX backend for Apple Silicon. Best for: developers who want to get running in under 5 minutes and use Ollama-served models from existing OpenAI-compatible apps.
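Because the endpoint is OpenAI-compatible, existing SDK code can be pointed at a local model by changing only the base URL. A minimal sketch; the model tag is an example, and the api_key value is a dummy that Ollama ignores:

```python
# Point the official OpenAI Python client at a local Ollama server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # any non-empty string works
)

resp = client.chat.completions.create(
    model="llama3.1:8b",  # example tag; use any model you've pulled
    messages=[{"role": "user", "content": "Write a haiku about unified memory."}],
)
print(resp.choices[0].message.content)
```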

LM Studio

A polished desktop GUI for running local models. Point-and-click model downloads from HuggingFace, built-in chat interface, and a local server that exposes an OpenAI-compatible API. Best for: non-technical users, professionals who want a clean chat interface, and teams evaluating models side-by-side without writing code.

llama.cpp

The foundational inference engine underneath most local AI tools (including Ollama). Direct llama.cpp access offers the most control: custom quantization, server configuration, Metal/CUDA/Vulkan backend selection, and support for the newest models before they appear in higher-level tools. Best for: power users who want maximum control, researchers benchmarking new model releases, and deployments requiring custom server configuration.
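Those knobs are also reachable from Python through the llama-cpp-python bindings. A sketch of a fully offloaded, long-context configuration; the path is a placeholder, and flash_attn support depends on your build and backend:

```python
# GPU-offloaded llama.cpp via llama-cpp-python. The backend (CUDA, Metal,
# or Vulkan) is selected when the package is built/installed; these
# runtime options control layer placement and context size.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen3-32b-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # -1 offloads every layer to the GPU
    n_ctx=16384,      # long context; KV cache VRAM grows with this
    flash_attn=True,  # trims KV cache memory where the backend supports it
)
print(llm("def quicksort(arr):", max_tokens=256)["choices"][0]["text"])
```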

MLX-LM (Apple Silicon Only)

Apple's machine learning framework optimized specifically for Apple Silicon. 15–30% faster than llama.cpp on M-series chips with ~10% better memory efficiency. Install via pip, load models from HuggingFace. Increasingly becoming the preferred backend for Apple Silicon users who prioritize throughput. Best for: M-series Mac users who want maximum speed and are comfortable with Python command-line tools.
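A minimal sketch of that workflow; the mlx-community repo name below is an example, not a specific recommendation:

```python
# Minimal MLX-LM usage on Apple Silicon (pip install mlx-lm).
from mlx_lm import load, generate

# Downloads from HuggingFace on first use; the repo name is an example.
model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

text = generate(
    model,
    tokenizer,
    prompt="Explain unified memory in two sentences.",
    max_tokens=128,
)
print(text)
```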