Best Local AI by Hardware — April 25, 2026

Running a capable LLM entirely on local hardware is now a realistic option for most developers — not just enthusiasts with server racks. In 2026, Qwen3 14B fits comfortably on a 16GB VRAM GPU with near-GPT-4-class quality, a 70B model runs in real time on Apple Silicon M3 Max, and even a 9B model on an 8GB VRAM card handles most chat and coding tasks at 40+ tokens per second. The key to matching hardware to model is choosing the right quantization level: Q4_K_M remains the sweet spot for nearly every tier, delivering ~75% memory savings with only 2–5% quality loss.

Hardware Tier Overview

Tier VRAM / RAM Example Hardware Recommended Models Max Params (Q4_K_M)
CPU-Only ≤8GB RAM Older laptops, basic desktops Gemma 3 2B, Qwen3 1.7B, Phi-4 Mini ~3B
Entry GPU 4–8GB VRAM RTX 3060, M1/M2 base (8GB) Qwen3.5-9B, Llama 3.2 8B, Phi-4 Mini, Gemma 3 9B ~9B
Mid GPU 8–16GB VRAM RTX 3080/4070, M2 Pro/Max (16–32GB) Qwen3 14B, Llama 3.1 14B, Mistral Nemo 12B ~14B
High-End GPU 24GB VRAM RTX 4090, RTX 3090, A10G Qwen3 30B, GLM-4.7-Flash MoE, Qwen2.5 Coder 32B ~34B
Multi-GPU / Workstation 48GB+ VRAM 2× RTX 4090, A100, H100 Llama 3.3 70B, Qwen2.5 72B, DeepSeek V3.2 (partial) ~72B
Apple Silicon High-End 48–192GB Unified M3 Max (48GB), M3 Ultra (192GB) Llama 3.3 70B, Qwen2.5 72B, Mixtral, DeepSeek R1 ~72B (48GB) / ~200B+ (192GB)

CPU-Only & Low-End (≤8GB RAM)

Running a model on CPU only is slow (typically 2–8 tokens per second), but entirely feasible for simple tasks, offline chatbots, and privacy-sensitive use cases where no GPU is available.

  • Qwen3 1.7B (Q4_K_M) — Impressively capable for its size. Handles basic Q&A, summarization, and simple code editing. Fits in 1.2GB RAM. Best choice when RAM is extremely constrained.
  • Phi-4 Mini (Q4_K_M) — Microsoft's compact model punches above its weight on reasoning and instruction following. ~2.4GB footprint; 28 tok/s on a mid-range CPU with AVX2 support.
  • Gemma 3 2B (Q4_K_M) — Google's entry-level Gemma excels at multilingual tasks and short-form generation. Strong knowledge base for a 2B model. ~1.6GB footprint.

For CPU-only inference, use llama.cpp directly or Ollama, which bundles llama.cpp under the hood. Enabling AVX2/AVX-512 and using 4-bit quantization are the two biggest performance levers.

Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 Base)

An 8GB VRAM card is the most popular tier for local AI hobbyists. With Q4_K_M quantization, you can comfortably fit a 9B model and achieve conversational speeds.

  • Qwen3.5-9B Q4_K_M — Best Overall — The top recommendation for 8GB VRAM. Excellent instruction following, strong coding, and multilingual support. Runs at 40+ tokens/sec on an RTX 3060. At 5.5GB, it leaves room for context and KV cache.
  • Llama 3.2 8B Q4_K_M — Meta's highly optimized architecture. Great tool-use support and compatible with most frameworks. 4.9GB footprint; widely supported in LM Studio and Ollama.
  • Phi-4 Mini Q4_K_M — Best for reasoning-heavy tasks at this tier. Achieves remarkable results on math and logical puzzles given its size. 2.4GB — can even fit alongside a browser tab in 8GB.
  • Gemma 3 9B Q4_K_M — Best for image + text tasks (multimodal). Google's vision-language model fits in 6.5GB with the vision encoder loaded.

Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)

16GB VRAM is the sweet spot for developers who want near-cloud quality without cloud costs. Qwen3 14B is the standout model at this tier.

  • Qwen3 14B Q4_K_M — Best Overall — 12GB footprint; 61.85 tokens/sec on an RTX 4070. Leads the 16GB tier on instruction following, reasoning, and code. Close to GPT-4-Turbo quality on many tasks. The go-to recommendation for this tier in 2026.
  • Llama 3.1 14B Q4_K_M — Excellent general-purpose model with strong community support and extensive fine-tune availability. 8.7GB; slightly faster than Qwen3 14B at 68 tok/s due to smaller KV heads.
  • Mistral Nemo 12B Q4_K_M — Best for European language tasks and GDPR-conscious deployments. Apache 2.0 license; strong at function calling and structured output.
  • Qwen2.5 Coder 14B Q4_K_M — Best coding model at the 16GB tier. Specialized on code generation, debugging, and code completion. 9.2GB; recommended for developers using local AI in their IDE.

High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)

With 24GB of VRAM you enter a different quality tier. 30B-class models and MoE (Mixture of Experts) architectures become practical, and coding specialists like Qwen2.5 Coder 32B are genuinely competitive with cloud models.

RTX 4090 (24GB VRAM)

  • Qwen3 30B-A3B MoE Q4_K_M — Best Speed — An MoE model that activates only 3B parameters per forward pass while storing 30B of knowledge. Runs at a remarkable 196 tokens/sec on the RTX 4090. Best for high-throughput local inference.
  • GLM-4.7-Flash (30B-A3B MoE) — Best Reasoning — Leads at the 24GB tier on the Intelligence Index (30.1). In autonomous agent testing it was the only 24GB model to successfully complete a complex multi-step game-building task. Supports extended reasoning mode.
  • Qwen2.5 Coder 32B Q4_K_M — Best Coding — 15–25 tok/s on RTX 4090; competitive with Claude Sonnet on many coding benchmarks. The top local coding model if you have 24GB VRAM.
  • Llama 3.1 70B Q4_K_M (partial offload) — At 52 tok/s with CPU offload for layers that don't fit in VRAM, a 70B model is viable on the 4090 with Ollama or llama.cpp's GPU layer splitting. Quality jumps significantly over 30B models.

A100 / H100 (40–80GB VRAM)

  • Llama 3.3 70B Q4_K_M / Q5_K_M — Fully fits in 40GB at Q5_K_M. Delivers genuine GPT-4-class performance on-premise. Recommended for enterprise self-hosted deployments.
  • DeepSeek V3.2 (full precision shards) — Requires an 8× GPU setup at full precision, but individual 13B shards can be run on a single A100 for testing purposes.

Apple Silicon & MLX

Apple Silicon's unified memory architecture fundamentally changes the economics of local AI. The GPU and CPU share the same high-bandwidth memory pool, so a 96GB M3 Max can fully load a 70B model — something impossible on discrete GPUs without multi-GPU setups.

  • M3 Max (48–96GB unified) — The Portability King — Llama 3.3 70B at Q4_K_M requires ~40GB and runs at 18–25 tok/s on a 96GB M3 Max via Ollama or MLX. The M3 Max wins decisively over RTX 4090 for large models (70B+) while consuming 60–80W vs the 4090's 450W TDP.
  • M3 Ultra (192GB unified) — Can run DeepSeek R1 671B at reduced quantization, or any open-weight model currently available. The ultimate local AI workstation for research teams.
  • M2 Pro / M3 (16–32GB unified) — Excellent for 14B-class models; Qwen3 14B and Llama 3.1 14B run smoothly at 35–50 tok/s via Ollama. Better sustained throughput than discrete 16GB cards due to higher memory bandwidth.

MLX Framework

  • Apple's MLX framework provides native Metal-optimized inference that typically outperforms llama.cpp on Apple Silicon by 15–30%.
  • The mlx-community on Hugging Face publishes pre-quantized MLX models for all major architectures.
  • Use LM Studio (which supports both MLX and llama.cpp backends) for a GUI interface on Mac, or run mlx_lm.generate directly from the terminal for scripting.

Quantization Guide

Format Bits per Weight Size vs FP16 Quality Loss Best For
Q4_K_M 4-bit (K-quant) ~25% of FP16 2–5% perplexity increase Default choice for all hardware tiers; best speed/quality ratio
Q5_K_M 5-bit (K-quant) ~31% of FP16 1–2% perplexity increase When you have extra VRAM headroom and want slightly better quality
Q8_0 8-bit ~50% of FP16 <0.5% perplexity increase Near-lossless quality on Apple Silicon with large unified memory
IQ2_XXS 2-bit (imatrix) ~12% of FP16 10–15% perplexity increase CPU-only or extremely limited RAM; large models on tiny hardware
FP16 / BF16 16-bit full 100% None Fine-tuning, training, reference inference; requires server GPU

Practical rule of thumb: Start with Q4_K_M. If you have 25% more VRAM headroom than the model requires at Q4_K_M, upgrade to Q5_K_M. If you are on Apple Silicon with large unified memory and need the highest fidelity for coding or structured output tasks, use Q8_0.

Ollama

  • The easiest way to get started. One command downloads and runs a model: ollama run qwen3:14b
  • Manages model storage, GPU layer allocation, and serves an OpenAI-compatible REST API at localhost:11434
  • Supports model library browsing at ollama.com/search; includes 300+ pre-quantized GGUF models
  • Best for: developers who want quick setup, CI/CD integration, and REST API access without GUI overhead

LM Studio

  • Cross-platform GUI (Mac, Windows, Linux) with a model discovery browser connected to Hugging Face
  • Supports llama.cpp and MLX backends; automatic GPU detection and layer splitting
  • Built-in chat interface and OpenAI-compatible local server
  • Best for: non-developers, researchers, and Mac users who prefer a polished UI and easy model switching

llama.cpp

  • The original C++ inference engine for GGUF models. Runs on CPU, CUDA, Metal, ROCm, Vulkan, and SYCL backends
  • Maximum control over quantization, prompt formatting, context size, and batch settings
  • Use llama-server for an OpenAI-compatible HTTP server; llama-cli for direct terminal inference
  • Best for: power users, embedded deployments, custom quantization workflows, and situations where Ollama's overhead is undesirable

Subscribe to Carlos Marten

Don’t miss out on the latest issues. Sign up now to get access to the library of members-only issues.
[email protected]
Subscribe