Running a capable LLM entirely on local hardware is now a realistic option for most developers, not just enthusiasts with server racks. In 2026, Qwen3 14B fits comfortably on a 16GB VRAM GPU with near-GPT-4-class quality, a 70B model runs in real time on an Apple Silicon M3 Max, and even a 9B model on an 8GB VRAM card handles most chat and coding tasks at 40+ tokens per second. The key to matching a model to your hardware is choosing the right quantization level: Q4_K_M remains the sweet spot at nearly every tier, delivering ~75% memory savings with only a 2–5% quality loss.

Hardware Tier Overview

| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (Q4_K_M) |
|---|---|---|---|---|
| CPU-Only | ≤8GB RAM | Older laptops, basic desktops | Gemma 3 2B, Qwen3 1.7B, Phi-4 Mini | ~3B |
| Entry GPU | 4–8GB VRAM | RTX 3060, M1/M2 base (8GB) | Qwen3.5-9B, Llama 3.2 8B, Phi-4 Mini, Gemma 3 9B | ~9B |
| Mid GPU | 8–16GB VRAM | RTX 3080/4070, M2 Pro/Max (16–32GB) | Qwen3 14B, Llama 3.1 14B, Mistral Nemo 12B | ~14B |
| High-End GPU | 24GB VRAM | RTX 4090, RTX 3090, A10G | Qwen3 30B, GLM-4.7-Flash MoE, Qwen2.5 Coder 32B | ~34B |
| Multi-GPU / Workstation | 48GB+ VRAM | 2× RTX 4090, A100, H100 | Llama 3.3 70B, Qwen2.5 72B, DeepSeek V3.2 (partial) | ~72B |
| Apple Silicon High-End | 48–192GB unified | M3 Max (48GB), M3 Ultra (192GB) | Llama 3.3 70B, Qwen2.5 72B, Mixtral, DeepSeek R1 | ~72B (48GB) / ~200B+ (192GB) |

CPU-Only & Low-End (≤8GB RAM)

Running a model on CPU only is slow (typically 2–8 tokens per second), but entirely feasible for simple tasks, offline chatbots, and privacy-sensitive use cases where no GPU is available.

  • Qwen3 1.7B (Q4_K_M) — Impressively capable for its size. Handles basic Q&A, summarization, and simple code editing. Fits in 1.2GB RAM. Best choice when RAM is extremely constrained.
  • Phi-4 Mini (Q4_K_M) — Microsoft's compact model punches above its weight on reasoning and instruction following. ~2.4GB footprint; 28 tok/s on a mid-range CPU with AVX2 support.
  • Gemma 3 2B (Q4_K_M) — Google's entry-level Gemma excels at multilingual tasks and short-form generation. Strong knowledge base for a 2B model. ~1.6GB footprint.

For CPU-only inference, use llama.cpp directly or Ollama, which bundles llama.cpp under the hood. Enabling AVX2/AVX-512 and using 4-bit quantization are the two biggest performance levers.
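
As a concrete starting point, here is a minimal llama.cpp invocation for CPU-only inference; the GGUF filename is a placeholder for whichever model you downloaded, and pinning the thread count to your physical core count usually beats the default:

```sh
# CPU-only inference with llama.cpp (model filename is a placeholder).
# AVX2/AVX-512 kernels are selected automatically on supported CPUs.
# -t: thread count, set to your physical core count; -n: tokens to generate.
llama-cli -m ./qwen3-1.7b-q4_k_m.gguf -t 8 -n 128 --ctx-size 4096 \
  -p "Summarize the trade-offs of running LLMs on a CPU."
```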

Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 Base)

8GB of VRAM is the most popular tier among local AI hobbyists. With Q4_K_M quantization, you can comfortably fit a 9B model and still reach conversational speeds.

  • Qwen3.5-9B Q4_K_M — Best Overall — The top recommendation for 8GB VRAM. Excellent instruction following, strong coding, and multilingual support. Runs at 40+ tokens/sec on an RTX 3060. At 5.5GB, it leaves room for context and KV cache.
  • Llama 3.2 8B Q4_K_M — Meta's highly optimized architecture. Great tool-use support and compatible with most frameworks. 4.9GB footprint; widely supported in LM Studio and Ollama.
  • Phi-4 Mini Q4_K_M — Best for reasoning-heavy tasks at this tier. Achieves remarkable results on math and logical puzzles given its size. 2.4GB — can even fit alongside a browser tab in 8GB.
  • Gemma 3 9B Q4_K_M — Best for image + text tasks (multimodal). Google's vision-language model fits in 6.5GB with the vision encoder loaded.
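
Whichever model you choose at this tier, confirm it actually fits in VRAM instead of silently spilling into system RAM. With Ollama, ollama ps shows the resident size and the GPU/CPU split (the model tag below is illustrative; check ollama.com/search for exact names):

```sh
ollama pull phi4-mini
ollama run phi4-mini "What is quantization?"

# In a second terminal: PROCESSOR should read "100% GPU".
# A mixed CPU/GPU split means the model spilled out of VRAM.
ollama ps
```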

Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)

16GB VRAM is the sweet spot for developers who want near-cloud quality without cloud costs. Qwen3 14B is the standout model at this tier.

  • Qwen3 14B Q4_K_M — Best Overall — 12GB footprint; 61.85 tokens/sec on an RTX 4070. Leads the 16GB tier on instruction following, reasoning, and code. Close to GPT-4-Turbo quality on many tasks. The go-to recommendation for this tier in 2026.
  • Llama 3.1 14B Q4_K_M — Excellent general-purpose model with strong community support and extensive fine-tune availability. 8.7GB; slightly faster than Qwen3 14B at 68 tok/s thanks to fewer KV heads, which shrink the KV cache.
  • Mistral Nemo 12B Q4_K_M — Best for European language tasks and GDPR-conscious deployments. Apache 2.0 license; strong at function calling and structured output.
  • Qwen2.5 Coder 14B Q4_K_M — Best coding model at the 16GB tier. Specialized on code generation, debugging, and code completion. 9.2GB; recommended for developers using local AI in their IDE.
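
Throughput figures like those above depend heavily on context length, drivers, and quantization, so it is worth measuring your own setup. Ollama's --verbose flag prints timing statistics after each response:

```sh
# Prints prompt eval rate and eval rate (tokens/s) after the reply.
ollama run qwen3:14b "Explain the KV cache in two sentences." --verbose
```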

High-End GPU (24GB+ VRAM: RTX 4090, A100/H100)

With 24GB of VRAM you enter a different quality tier. 30B-class models and MoE (Mixture of Experts) architectures become practical, and coding specialists like Qwen2.5 Coder 32B are genuinely competitive with cloud models.

RTX 4090 (24GB VRAM)

  • Qwen3 30B-A3B MoE Q4_K_M — Best Speed — An MoE model that activates only 3B parameters per forward pass while storing 30B of knowledge. Runs at a remarkable 196 tokens/sec on the RTX 4090. Best for high-throughput local inference.
  • GLM-4.7-Flash (30B-A3B MoE) — Best Reasoning — Leads at the 24GB tier on the Intelligence Index (30.1). In autonomous agent testing it was the only 24GB model to successfully complete a complex multi-step game-building task. Supports extended reasoning mode.
  • Qwen2.5 Coder 32B Q4_K_M — Best Coding — 15–25 tok/s on RTX 4090; competitive with Claude Sonnet on many coding benchmarks. The top local coding model if you have 24GB VRAM.
  • Llama 3.1 70B Q4_K_M (partial offload) — With CPU offload for the layers that don't fit in VRAM, a 70B model is viable on the 4090 via Ollama or llama.cpp's GPU layer splitting, though throughput drops well below the fully GPU-resident models above. Quality jumps significantly over 30B models; see the sketch after this list.
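
A sketch of the layer split with llama.cpp and Ollama (the GGUF filename and the layer count are placeholders; the right -ngl value depends on the model's layer count and your free VRAM):

```sh
# Offload as many transformer layers as fit on the 24GB card; the remainder
# run on the CPU. Lower -ngl on out-of-memory errors, raise it if VRAM is free.
llama-cli -m ./llama-3.1-70b-q4_k_m.gguf -ngl 40 -c 4096 \
  -p "Outline a migration plan from cloud to local inference."

# Ollama equivalent: cap the GPU layer count from inside a session.
ollama run llama3.1:70b
# >>> /set parameter num_gpu 40
```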

A100 / H100 (40–80GB VRAM)

  • Llama 3.3 70B Q4_K_M / Q5_K_M — Roughly 43GB at Q4_K_M and 50GB at Q5_K_M, so it fits fully on an 80GB A100/H100 with headroom for context (a 40GB card needs partial offload or a lower-bit quant). Delivers genuine GPT-4-class performance on-premise. Recommended for enterprise self-hosted deployments.
  • DeepSeek V3.2 (full precision shards) — Requires an 8× GPU setup at full precision, but individual 13B shards can be run on a single A100 for testing purposes.

Apple Silicon & MLX

Apple Silicon's unified memory architecture fundamentally changes the economics of local AI. The GPU and CPU share the same high-bandwidth memory pool, so a 96GB M3 Max can fully load a 70B model — something impossible on discrete GPUs without multi-GPU setups.

  • M3 Max (48–96GB unified) — The Portability King — Llama 3.3 70B at Q4_K_M requires ~40GB and runs at 18–25 tok/s on a 96GB M3 Max via Ollama or MLX. The M3 Max wins decisively over RTX 4090 for large models (70B+) while consuming 60–80W vs the 4090's 450W TDP.
  • M3 Ultra (192GB unified) — Can run DeepSeek R1 671B at aggressive low-bit quantization, or any other open-weight model currently available. The ultimate local AI workstation for research teams.
  • M2 Pro / M3 (16–32GB unified) — Excellent for 14B-class models; Qwen3 14B and Llama 3.1 14B run smoothly at 35–50 tok/s via Ollama. Better sustained throughput than discrete 16GB cards due to higher memory bandwidth.

MLX Framework

  • Apple's MLX framework provides native Metal-optimized inference that typically outperforms llama.cpp on Apple Silicon by 15–30%.
  • The mlx-community on Hugging Face publishes pre-quantized MLX models for all major architectures.
  • Use LM Studio (which supports both MLX and llama.cpp backends) for a GUI interface on Mac, or run mlx_lm.generate directly from the terminal for scripting.
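
A minimal MLX run from the terminal, assuming mlx-lm is installed via pip (the model repo below follows mlx-community's naming convention; verify the exact name on Hugging Face):

```sh
pip install mlx-lm

# Downloads the pre-quantized 4-bit weights from Hugging Face on first use.
mlx_lm.generate \
  --model mlx-community/Llama-3.3-70B-Instruct-4bit \
  --prompt "Write a unit test for a Fibonacci function." \
  --max-tokens 256
```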

Quantization Guide

| Format | Bits per Weight | Size vs FP16 | Quality Loss | Best For |
|---|---|---|---|---|
| Q4_K_M | 4-bit (K-quant) | ~25% | 2–5% perplexity increase | Default choice for all hardware tiers; best speed/quality ratio |
| Q5_K_M | 5-bit (K-quant) | ~31% | 1–2% perplexity increase | When you have extra VRAM headroom and want slightly better quality |
| Q8_0 | 8-bit | ~50% | <0.5% perplexity increase | Near-lossless quality on Apple Silicon with large unified memory |
| IQ2_XXS | 2-bit (imatrix) | ~12% | 10–15% perplexity increase | CPU-only or extremely limited RAM; large models on tiny hardware |
| FP16 / BF16 | 16-bit (full) | 100% | None | Fine-tuning, training, reference inference; requires server GPU |

Practical rule of thumb: Start with Q4_K_M. If you have 25% more VRAM headroom than the model requires at Q4_K_M, upgrade to Q5_K_M. If you are on Apple Silicon with large unified memory and need the highest fidelity for coding or structured output tasks, use Q8_0.
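
To put numbers on that rule: weight memory is roughly params × bits-per-weight / 8, plus 1–3GB for KV cache and runtime overhead. A small helper makes the arithmetic explicit (the effective bit rates are approximations, since K-quants mix bit widths across tensors):

```sh
# Rough weight-memory estimate. Approximate effective rates:
# Q4_K_M ~4.8 bits/weight, Q5_K_M ~5.7, Q8_0 ~8.5.
vram_estimate() {  # usage: vram_estimate <params_in_billions> <bits_per_weight>
  awk -v p="$1" -v b="$2" \
    'BEGIN { printf "%.1f GB weights (+1-3 GB KV cache/overhead)\n", p * b / 8 }'
}

vram_estimate 14 4.8   # Qwen3 14B at Q4_K_M   -> ~8.4 GB
vram_estimate 70 4.8   # Llama 3.3 70B, Q4_K_M -> ~42.0 GB
```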

Ollama

  • The easiest way to get started. One command downloads and runs a model: ollama run qwen3:14b
  • Manages model storage, GPU layer allocation, and serves an OpenAI-compatible REST API at localhost:11434
  • Supports model library browsing at ollama.com/search; includes 300+ pre-quantized GGUF models
  • Best for: developers who want quick setup, CI/CD integration, and REST API access without GUI overhead
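
For example, with a model running you can call the OpenAI-compatible endpoint directly, and any OpenAI SDK works the same way if you point its base URL at http://localhost:11434/v1:

```sh
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:14b",
    "messages": [{"role": "user", "content": "Explain GGUF in one paragraph."}]
  }'
```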

LM Studio

  • Cross-platform GUI (Mac, Windows, Linux) with a model discovery browser connected to Hugging Face
  • Supports llama.cpp and MLX backends; automatic GPU detection and layer splitting
  • Built-in chat interface and OpenAI-compatible local server
  • Best for: non-developers, researchers, and Mac users who prefer a polished UI and easy model switching
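
The local server works the same way as Ollama's, just on a different default port; the model name must match the identifier shown in LM Studio's server tab:

```sh
# LM Studio's local server (default port 1234) speaks the same OpenAI API.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello from LM Studio."}]
  }'
```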

llama.cpp

  • The original C++ inference engine for GGUF models. Runs on CPU, CUDA, Metal, ROCm, Vulkan, and SYCL backends
  • Maximum control over quantization, prompt formatting, context size, and batch settings
  • Use llama-server for an OpenAI-compatible HTTP server; llama-cli for direct terminal inference
  • Best for: power users, embedded deployments, custom quantization workflows, and situations where Ollama's overhead is undesirable
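
Two typical invocations (the model path is a placeholder):

```sh
# Direct terminal inference.
llama-cli -m ./qwen3-14b-q4_k_m.gguf -n 64 \
  -p "Write a limerick about quantization."

# OpenAI-compatible HTTP server on port 8080. -ngl 99 offloads all layers
# to the GPU; -c sets the context window.
llama-server -m ./qwen3-14b-q4_k_m.gguf --port 8080 -ngl 99 -c 8192
```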