Best Local AI by Hardware — April 22, 2026

The local LLM landscape in April 2026 has matured dramatically: Qwen3 32B runs comfortably on a 24GB GPU at Q4_K_M quantization, Apple's MLX backend delivers 20–40% higher throughput than llama.cpp on M3 and M4 chips, and models like DeepSeek-R1 32B bring near-70B reasoning quality to consumer hardware. If you have an RTX 3060 or an M2 Mac, there has never been a better time to run AI entirely locally — with no API costs, no data leaving your machine, and throughput competitive with cloud-hosted models from two years ago.

Hardware Tier Overview

Tier VRAM/RAM Example GPUs/Chips Recommended Models Max Params
CPU-Only / Low-End ≤8GB RAM Older laptops, Raspberry Pi 5 Phi-4 Mini (Q4), Gemma 3 2B, Llama 3.2 3B ~3–4B
Entry GPU 4–8GB VRAM RTX 3060, M1/M2 base (16GB) Qwen3 8B Q4, Phi-4 Q4, Llama 3.2 8B 8B
Mid GPU 8–16GB VRAM RTX 3080, RTX 4070, M2 Pro/Max (32GB) Qwen3 14B Q5_K_M, Mistral Small 3 24B, Llama 3.3 13B 14–24B
High-End GPU 24GB+ VRAM RTX 4090, A100, M3 Max/Ultra (64–192GB) DeepSeek-R1 32B Q4_K_M, Qwen3 32B, Llama 3.3 70B Q2 32–70B

CPU-Only & Low-End (≤8GB RAM)

CPU-only inference is slow but viable for light tasks. Expect 3–8 tokens per second on a modern laptop CPU with a well-optimized small model.

  • Phi-4 Mini (3.8B, Q4_K_M) — Microsoft's efficiency-optimized model punches well above its weight class. Fits in under 3GB RAM and handles basic coding, summarization, and Q&A tasks well. Best-in-class quality for the parameter count in April 2026.
  • Gemma 3 2B (Q4) — Google's 2B model is faster than Phi-4 Mini on CPU and acceptable for simple chat tasks. Limited reasoning ability but excellent for quick lookup-style queries.
  • Llama 3.2 3B (Q4_K_M) — Meta's 3B Llama variant offers multilingual support and solid instruction following. Good general-purpose choice if ecosystem compatibility and broad tool support matter.

For CPU-only use, prioritize models under 4B parameters at Q4 quantization. Anything larger will produce unacceptably low throughput (under 2 tok/s) on most modern CPUs and make interactive use frustrating.

Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 base)

With 4–8GB of VRAM, you can run 7–8B models entirely on-GPU — the most impactful threshold for local AI usability, where responses feel genuinely interactive.

  • Qwen3 8B (Q4_K_M) — The top recommendation at this tier in 2026. Strong reasoning, excellent multilingual support, and best-in-class instruction following at 8B scale. Fits in approximately 5.5GB VRAM. Offers a /think mode for complex multi-step tasks that would otherwise require larger models.
  • Phi-4 (Q4_K_M) — Microsoft's 14B model performs best at 10GB VRAM; at 8GB, use the smaller Phi-4 Mini variant. Strong on STEM and coding tasks relative to size.
  • Llama 3.2 8B (Q4_K_M) — Meta's well-rounded 8B model covers coding, chat, and tool use. Slightly below Qwen3 8B on recent benchmarks but has the widest ecosystem support across Ollama, llama.cpp, and LM Studio.
  • Speeds: 7B Q4 = 80–100 tok/s on RTX 3060; 50–60 tok/s on M1 base Mac via Ollama.

Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)

The mid GPU tier unlocks 13–24B models — a significant quality jump over 7B that makes local AI genuinely competitive with cloud API quality for most everyday tasks.

  • Qwen3 14B (Q5_K_M) — Best overall pick for this tier. At Q5_K_M quantization, fits in approximately 12GB VRAM with near-full quality. Outperforms older 70B models from 2024 on several benchmarks. Excellent for coding, analysis, and complex reasoning tasks.
  • Mistral Small 3 (24B, Q4_K_M) — The sweet spot model for 16GB VRAM users. State-of-the-art benchmark performance in the 24B class with strong long-context handling. Widely regarded as one of the best open-weight releases of early 2026 for general use.
  • Llama 3.3 13B (Q5_K_M) — Meta's optimized 13B with improved instruction following over prior versions. Best baseline if you need maximum ecosystem compatibility and third-party integration support.
  • Speeds: 13B Q4 = 50–70 tok/s on RTX 3080; 30B Q4 = 20–35 tok/s on RTX 4070.

High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)

With 24GB of VRAM you can run 32B models at Q4_K_M quantization with excellent quality — the threshold where local AI becomes indistinguishable from mid-tier cloud models for most tasks.

  • DeepSeek-R1 32B (Q4_K_M, ~19GB VRAM) — The top pick for RTX 4090 owners in April 2026. Delivers near-70B reasoning quality at 32B scale due to DeepSeek's chain-of-thought training. Exceptional for multi-step reasoning, mathematics, and coding tasks that require sustained logical chains.
  • Qwen3 32B (Q4_K_M) — Fits perfectly in 24GB VRAM with top benchmark scores in the 32B class. Strong multilingual and coding performance. The /think mode handles complex reasoning without needing to step up to a larger model.
  • Mistral Small 3 (24B, Q5_K_M) — At higher quantization, fully utilizes 24GB VRAM for near-lossless inference. Best quality-per-VRAM option if you want a safety margin over the 32B models.
  • Llama 3.3 70B (Q2_K, ~24GB VRAM) — For maximum capability, 70B at Q2 quantization fits the RTX 4090. The quality compromise from Q2 is noticeable but acceptable for tasks where scale matters more than precision.
  • Qwen3.5 35B-A3B (MoE) — Mixture-of-Experts architecture with only 3B active parameters achieves 80 tok/s on an RTX 4090 using just 22GB VRAM. Excellent throughput-to-quality ratio for interactive use.
  • Speeds on RTX 4090: 7B Q4 = 80–100 tok/s; 13B Q4 = 50–70 tok/s; 30B Q4 = 20–35 tok/s; 70B Q2 = 15–25 tok/s.

Apple Silicon & MLX

Apple Silicon (M1–M5) uses unified memory, meaning the GPU and CPU share the same RAM pool. A MacBook Pro with 64GB RAM can run 70B models at Q4_K_M — something no standalone consumer GPU can match at any price.

  • MLX framework — Apple's purpose-built ML framework for Apple Silicon delivers 20–40% higher token throughput than llama.cpp on M3 and M4 chips, with the gap widening on longer context windows. Use Ollama 0.19+ with the MLX backend enabled for 93% faster decode with zero configuration changes.
  • 16GB Mac (M1/M2 base): Qwen3 8B and Phi-4 at Q4_K_M are the best picks. Expect 40–60 tok/s via MLX.
  • 32GB Mac (M1/M2 Pro): Qwen3 14B gives excellent quality. Qwen3 32B Q4_K_M is the top recommendation at this tier, offering expert-level responses on complex topics, strong coding, and good creative writing — with /think mode handling multi-step reasoning.
  • 64GB Mac (M2 Max, M3 Max, M4 Max): Run Llama 3.3 70B at Q4_K_M (~43GB) for maximum quality, or Qwen3 32B at Q8 for near-lossless inference at 32B scale.
  • 128GB Mac (M5 Max/Ultra): Run full Q8 70B models or experimental 100B+ models. Represents the state of the art for local-only inference without data center hardware.
  • Agentic use caveat: Qwen3.5's tool-calling reliability degrades after 5–10 rounds when using MLX quantization. For agentic workflows on Apple Silicon, GGUF quantization via llama.cpp maintains stability longer through longer sessions.

Quantization Guide

Format Bits/Weight Quality Loss Size vs FP16 Best Use Case
Q4_K_M ~4.5 bits ~3.3% 75% reduction Daily driver; best balance of size and quality. ~0.6GB per billion parameters.
Q5_K_M ~5.5 bits ~1.5% 65% reduction Step up when VRAM allows — noticeably better on reasoning and code tasks
Q8_0 8 bits <0.5% 50% reduction Near-lossless; use when VRAM is generous and quality is paramount
Q2_K ~2.6 bits ~8–12% 87% reduction Only for fitting very large models (70B+) on limited VRAM; noticeable quality drop
IQ4_XS ~4.3 bits ~3.5% 73% reduction Newer importance-weighted quant; often matches Q4_K_M quality at slightly smaller size

Rule of thumb: Q4_K_M is the sweet spot for most users. Step up to Q5_K_M if your VRAM allows — the quality improvement on coding and reasoning tasks is noticeable. Only use Q2 when you absolutely need to fit an oversized model and accept a meaningful quality tradeoff.

Ollama

  • The easiest way to get started with local LLMs. Single-command installation, an extensive model library with one-line downloads, and a built-in OpenAI-compatible API server.
  • Ollama 0.19+ includes MLX backend support on Apple Silicon for 93% faster decode at zero extra configuration cost.
  • Best for: beginners, rapid model switching, integration with Open WebUI, Continue.dev, and other ecosystem tools.

LM Studio

  • GUI-first desktop application for discovering, downloading, and running GGUF models locally. Includes a built-in chat UI, model comparison mode, and local server mode.
  • Excellent for users who prefer a visual interface. Supports NVIDIA CUDA and Apple Silicon Metal backends with no configuration required.
  • Best for: non-technical users, model exploration, local ChatGPT replacement without any terminal usage.

llama.cpp

  • The foundational C++ inference engine powering most local LLM tools. Maximum control, lowest overhead, and the widest quantization format support (GGUF, IQ-series, all K-quants).
  • Runs on CPU, CUDA, Metal, ROCm, Vulkan, and SYCL — the most hardware-portable option available, including embedded and server deployments.
  • Best for: advanced users, embedding inference in custom applications, maximum performance tuning, and agentic workflows where GGUF quantization stability is required over MLX.

Subscribe to Carlos Marten

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
[email protected]
Subscribe