Running a capable LLM entirely on local hardware is now a realistic option for most developers — not just enthusiasts with server racks. In 2026, Qwen3 14B fits comfortably on a 16GB VRAM GPU with near-GPT-4-class quality, a 70B model runs in real time on Apple Silicon M3 Max, and even a 9B model on an 8GB VRAM card handles most chat and coding tasks at 40+ tokens per second. The key to matching hardware to model is choosing the right quantization level: Q4_K_M remains the sweet spot for nearly every tier, delivering ~75% memory savings with only 2–5% quality loss.
## Hardware Tier Overview
| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (Q4_K_M) |
|---|---|---|---|---|
| CPU-Only | ≤8GB RAM | Older laptops, basic desktops | Gemma 3 2B, Qwen3 1.7B, Phi-4 Mini | ~3B |
| Entry GPU | 4–8GB VRAM | RTX 3060, M1/M2 base (8GB) | Qwen3.5-9B, Llama 3.2 8B, Phi-4 Mini, Gemma 3 9B | ~9B |
| Mid GPU | 8–16GB VRAM | RTX 3080/4070, M2 Pro/Max (16–32GB) | Qwen3 14B, Llama 3.1 14B, Mistral Nemo 12B | ~14B |
| High-End GPU | 24GB VRAM | RTX 4090, RTX 3090, A10G | Qwen3 30B, GLM-4.7-Flash MoE, Qwen2.5 Coder 32B | ~34B |
| Multi-GPU / Workstation | 48GB+ VRAM | 2× RTX 4090, A100, H100 | Llama 3.3 70B, Qwen2.5 72B, DeepSeek V3.2 (partial) | ~72B |
| Apple Silicon High-End | 48–192GB Unified | M3 Max (48GB), M3 Ultra (192GB) | Llama 3.3 70B, Qwen2.5 72B, Mixtral, DeepSeek R1 | ~72B (48GB) / ~200B+ (192GB) |
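The per-tier parameter ceilings above can be sanity-checked with a rough sizing formula. The sketch below assumes Q4_K_M averages about 0.57 bytes per weight and reserves a flat ~2 GB for KV cache and runtime buffers; both constants are assumptions for illustration, not measurements:

```python
BYTES_PER_PARAM_Q4KM = 0.57  # ~4.5 bits/weight average (assumption)
RUNTIME_HEADROOM_GB = 2.0    # KV cache + runtime buffers (assumption)

def footprint_gb(params_b: float) -> float:
    """Rough total memory needed to run a dense model at Q4_K_M."""
    return params_b * BYTES_PER_PARAM_Q4KM + RUNTIME_HEADROOM_GB

def fits(params_b: float, vram_gb: float) -> bool:
    """Does a model of this size fit a given VRAM/unified-memory budget?"""
    return footprint_gb(params_b) <= vram_gb

for params, vram in [(9, 8), (14, 16), (34, 24), (72, 48)]:
    print(f"{params}B on {vram}GB: {'fits' if fits(params, vram) else 'no'}")
```

Running the loop reproduces the table's ceilings: ~9B on 8 GB, ~14B on 16 GB, ~34B on 24 GB, ~72B on 48 GB, with the next size class up failing in each case.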
## CPU-Only & Low-End (≤8GB RAM)
Running a model on CPU only is slow (typically 2–8 tokens per second), but entirely feasible for simple tasks, offline chatbots, and privacy-sensitive use cases where no GPU is available.
- Qwen3 1.7B (Q4_K_M) — Impressively capable for its size. Handles basic Q&A, summarization, and simple code editing. Fits in 1.2GB RAM. Best choice when RAM is extremely constrained.
- Phi-4 Mini (Q4_K_M) — Microsoft's compact model punches above its weight on reasoning and instruction following. ~2.4GB footprint; 28 tok/s on a mid-range CPU with AVX2 support.
- Gemma 3 2B (Q4_K_M) — Google's entry-level Gemma excels at multilingual tasks and short-form generation. Strong knowledge base for a 2B model. ~1.6GB footprint.
For CPU-only inference, use llama.cpp directly or Ollama, which bundles llama.cpp under the hood. Enabling AVX2/AVX-512 and using 4-bit quantization are the two biggest performance levers.
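Why are those the two levers? Single-stream decoding on CPU is memory-bandwidth-bound: generating each token streams the entire weight file through RAM once, so the throughput ceiling is roughly bandwidth divided by model size — which is why halving the weights with 4-bit quantization roughly doubles speed. A minimal sketch (the ~40 GB/s dual-channel DDR4 figure is an assumed typical value):

```python
def decode_ceiling_tok_s(model_gb: float, mem_bw_gb_s: float) -> float:
    """Upper bound on single-stream decode speed: every generated token
    reads all weights from memory once, so tok/s <= bandwidth / size."""
    return mem_bw_gb_s / model_gb

# Qwen3 1.7B at Q4_K_M (~1.2 GB) on assumed ~40 GB/s DDR4:
print(decode_ceiling_tok_s(1.2, 40.0))   # roughly 33 tok/s upper bound
# The same model at FP16 (~3.4 GB) would cap near 12 tok/s.
print(decode_ceiling_tok_s(3.4, 40.0))
```

Real throughput lands below this bound (prompt processing, cache misses, thread overhead), but the model explains why small quantized models stay conversational on CPU while larger ones do not.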
## Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 Base)
An 8GB VRAM card is the most popular tier for local AI hobbyists. With Q4_K_M quantization, you can comfortably fit a 9B model and achieve conversational speeds.
- Qwen3.5-9B Q4_K_M — Best Overall — The top recommendation for 8GB VRAM. Excellent instruction following, strong coding, and multilingual support. Runs at 40+ tokens/sec on an RTX 3060. At 5.5GB, it leaves room for context and KV cache.
- Llama 3.2 8B Q4_K_M — Meta's highly optimized architecture. Great tool-use support and compatible with most frameworks. 4.9GB footprint; widely supported in LM Studio and Ollama.
- Phi-4 Mini Q4_K_M — Best for reasoning-heavy tasks at this tier. Achieves remarkable results on math and logical puzzles given its size. 2.4GB — can even fit alongside a browser tab in 8GB.
- Gemma 3 9B Q4_K_M — Best for image + text tasks (multimodal). Google's vision-language model fits in 6.5GB with the vision encoder loaded.
## Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)
16GB VRAM is the sweet spot for developers who want near-cloud quality without cloud costs. Qwen3 14B is the standout model at this tier.
- Qwen3 14B Q4_K_M — Best Overall — 12GB footprint; 61.85 tokens/sec on an RTX 4070. Leads the 16GB tier on instruction following, reasoning, and code. Close to GPT-4-Turbo quality on many tasks. The go-to recommendation for this tier in 2026.
- Llama 3.1 14B Q4_K_M — Excellent general-purpose model with strong community support and extensive fine-tune availability. 8.7GB; slightly faster than Qwen3 14B at 68 tok/s due to smaller KV heads.
- Mistral Nemo 12B Q4_K_M — Best for European language tasks and GDPR-conscious deployments. Apache 2.0 license; strong at function calling and structured output.
- Qwen2.5 Coder 14B Q4_K_M — Best coding model at the 16GB tier. Specialized on code generation, debugging, and code completion. 9.2GB; recommended for developers using local AI in their IDE.
## High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)
With 24GB of VRAM you enter a different quality tier. 30B-class models and MoE (Mixture of Experts) architectures become practical, and coding specialists like Qwen2.5 Coder 32B are genuinely competitive with cloud models.
### RTX 4090 (24GB VRAM)
- Qwen3 30B-A3B MoE Q4_K_M — Best Speed — An MoE model that activates only 3B parameters per forward pass while storing 30B of knowledge. Runs at a remarkable 196 tokens/sec on the RTX 4090. Best for high-throughput local inference.
- GLM-4.7-Flash (30B-A3B MoE) — Best Reasoning — Leads at the 24GB tier on the Intelligence Index (30.1). In autonomous agent testing it was the only 24GB model to successfully complete a complex multi-step game-building task. Supports extended reasoning mode.
- Qwen2.5 Coder 32B Q4_K_M — Best Coding — 15–25 tok/s on RTX 4090; competitive with Claude Sonnet on many coding benchmarks. The top local coding model if you have 24GB VRAM.
- Llama 3.1 70B Q4_K_M (partial offload) — At 52 tok/s with CPU offload for layers that don't fit in VRAM, a 70B model is viable on the 4090 with Ollama or llama.cpp's GPU layer splitting. Quality jumps significantly over 30B models.
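The partial-offload arithmetic can be sketched as follows: llama.cpp's `--n-gpu-layers` flag (and Ollama's `num_gpu` option) place as many transformer layers as fit in VRAM and run the rest on CPU. The helper below assumes roughly equal-sized layers and holds back ~3 GB for KV cache and CUDA buffers — both simplifying assumptions:

```python
def gpu_layers(total_layers: int, model_gb: float, vram_gb: float,
               reserve_gb: float = 3.0) -> int:
    """Estimate how many layers to offload to the GPU.

    Assumes layers are equal-sized; reserve_gb is held back for the
    KV cache and runtime buffers (assumption, tune per setup).
    """
    per_layer_gb = model_gb / total_layers
    n = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n, total_layers))

# Llama 3.1 70B at Q4_K_M: ~40 GB of weights across 80 layers,
# squeezed onto a 24 GB RTX 4090:
print(gpu_layers(80, 40.0, 24.0))  # → 42
```

You would then pass the result as `--n-gpu-layers 42` to `llama-cli`/`llama-server`; Ollama performs an equivalent split automatically.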
### A100 / H100 (40–80GB VRAM)
- Llama 3.3 70B Q4_K_M / Q5_K_M — Fully fits in 40GB at Q5_K_M. Delivers genuine GPT-4-class performance on-premise. Recommended for enterprise self-hosted deployments.
- DeepSeek V3.2 (full precision shards) — Requires an 8× GPU setup at full precision, but individual 13B shards can be run on a single A100 for testing purposes.
## Apple Silicon & MLX
Apple Silicon's unified memory architecture fundamentally changes the economics of local AI. The GPU and CPU share the same high-bandwidth memory pool, so a 96GB M3 Max can fully load a 70B model — something impossible on discrete GPUs without multi-GPU setups.
- M3 Max (48–96GB unified) — The Portability King — Llama 3.3 70B at Q4_K_M requires ~40GB and runs at 18–25 tok/s on a 96GB M3 Max via Ollama or MLX. The M3 Max wins decisively over RTX 4090 for large models (70B+) while consuming 60–80W vs the 4090's 450W TDP.
- M3 Ultra (192GB unified) — Can run DeepSeek R1 671B with aggressive low-bit quantization, or any smaller open-weight model currently available. The ultimate local AI workstation for research teams.
- M2 Pro / M3 (16–32GB unified) — Excellent for 14B-class models; Qwen3 14B and Llama 3.1 14B run smoothly at 35–50 tok/s via Ollama. Better sustained throughput than discrete 16GB cards due to higher memory bandwidth.
### MLX Framework
- Apple's MLX framework provides native Metal-optimized inference that typically outperforms llama.cpp on Apple Silicon by 15–30%.
- The mlx-community on Hugging Face publishes pre-quantized MLX models for all major architectures.
- Use LM Studio (which supports both MLX and llama.cpp backends) for a GUI interface on Mac, or run `mlx_lm.generate` directly from the terminal for scripting.
## Quantization Guide
| Format | Bits per Weight | Size vs FP16 | Quality Loss | Best For |
|---|---|---|---|---|
| Q4_K_M | 4-bit (K-quant) | ~25% of FP16 | 2–5% perplexity increase | Default choice for all hardware tiers; best speed/quality ratio |
| Q5_K_M | 5-bit (K-quant) | ~31% of FP16 | 1–2% perplexity increase | When you have extra VRAM headroom and want slightly better quality |
| Q8_0 | 8-bit | ~50% of FP16 | <0.5% perplexity increase | Near-lossless quality on Apple Silicon with large unified memory |
| IQ2_XXS | 2-bit (imatrix) | ~12% of FP16 | 10–15% perplexity increase | CPU-only or extremely limited RAM; large models on tiny hardware |
| FP16 / BF16 | 16-bit full | 100% | None | Fine-tuning, training, reference inference; requires server GPU |
Practical rule of thumb: Start with Q4_K_M. If you have 25% more VRAM headroom than the model requires at Q4_K_M, upgrade to Q5_K_M. If you are on Apple Silicon with large unified memory and need the highest fidelity for coding or structured output tasks, use Q8_0.
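The "Size vs FP16" column follows directly from nominal bit widths (4/16 = 25%, 5/16 ≈ 31%, and so on). Real GGUF files run slightly larger because K-quant blocks carry scale metadata, but the nominal figure is close enough for capacity planning:

```python
def size_vs_fp16(bits: float) -> float:
    """Fraction of FP16 size at a given nominal bits-per-weight."""
    return bits / 16

def model_size_gb(params_b: float, bits: float) -> float:
    """Weights-only size in GB: parameters (billions) x bits / 8."""
    return params_b * bits / 8

# A 14B model across the table's formats (nominal, weights only):
for name, bits in [("Q4_K_M", 4), ("Q5_K_M", 5), ("Q8_0", 8), ("FP16", 16)]:
    print(f"{name}: {model_size_gb(14, bits):.1f} GB "
          f"({size_vs_fp16(bits):.0%} of FP16)")
```

At 14B this gives 7.0 GB at Q4_K_M versus 28.0 GB at FP16 — the 4× gap that makes mid-tier GPUs viable in the first place.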
## Recommended Tools
### Ollama
- The easiest way to get started. One command downloads and runs a model: `ollama run qwen3:14b`
- Manages model storage, GPU layer allocation, and serves an OpenAI-compatible REST API at `localhost:11434`
- Supports model library browsing at ollama.com/search; includes 300+ pre-quantized GGUF models
- Best for: developers who want quick setup, CI/CD integration, and REST API access without GUI overhead
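Because Ollama exposes an OpenAI-compatible endpoint, any HTTP client works against it. A minimal stdlib-only sketch (assumes a local Ollama server with `qwen3:14b` already pulled):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    """POST to the local Ollama server and return the reply text."""
    body = json.dumps(chat_payload(model, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running server (`ollama run qwen3:14b`):
# print(chat("qwen3:14b", "Summarize Q4_K_M in one sentence."))
```

Swapping `OLLAMA_URL` for a cloud endpoint is the only change needed to move between local and hosted inference, which is the main appeal of the OpenAI-compatible API.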
### LM Studio
- Cross-platform GUI (Mac, Windows, Linux) with a model discovery browser connected to Hugging Face
- Supports llama.cpp and MLX backends; automatic GPU detection and layer splitting
- Built-in chat interface and OpenAI-compatible local server
- Best for: non-developers, researchers, and Mac users who prefer a polished UI and easy model switching
### llama.cpp
- The original C++ inference engine for GGUF models. Runs on CPU, CUDA, Metal, ROCm, Vulkan, and SYCL backends
- Maximum control over quantization, prompt formatting, context size, and batch settings
- Use `llama-server` for an OpenAI-compatible HTTP server; `llama-cli` for direct terminal inference
- Best for: power users, embedded deployments, custom quantization workflows, and situations where Ollama's overhead is undesirable