Running a capable LLM entirely on local hardware is now a realistic option for most developers — not just enthusiasts with server racks. In 2026, Qwen3 14B fits comfortably on a 16GB VRAM GPU with near-GPT-4-class quality, a 70B model runs in real time on Apple Silicon M3 Max, and even a 9B model on an 8GB VRAM card handles most chat and coding tasks at 40+ tokens per second. The key to matching hardware to model is choosing the right quantization level: Q4_K_M remains the sweet spot for nearly every tier, delivering ~75% memory savings with only 2–5% quality loss.
## Hardware Tier Overview
| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (Q4_K_M) |
|---|---|---|---|---|
| CPU-Only | ≤8GB RAM | Older laptops, basic desktops | Gemma 3 2B, Qwen3 1.7B, Phi-4 Mini | ~3B |
| Entry GPU | 4–8GB VRAM | RTX 3060, M1/M2 base (8GB) | Qwen3.5-9B, Llama 3.2 8B, Phi-4 Mini, Gemma 3 9B | ~9B |
| Mid GPU | 8–16GB VRAM | RTX 3080/4070, M2 Pro/Max (16–32GB) | Qwen3 14B, Llama 3.1 14B, Mistral Nemo 12B | ~14B |
| High-End GPU | 24GB VRAM | RTX 4090, RTX 3090, A10G | Qwen3 30B, GLM-4.7-Flash MoE, Qwen2.5 Coder 32B | ~34B |
| Multi-GPU / Workstation | 48GB+ VRAM | 2× RTX 4090, A100, H100 | Llama 3.3 70B, Qwen2.5 72B, DeepSeek V3.2 (partial) | ~72B |
| Apple Silicon High-End | 48–192GB Unified | M3 Max (48GB), M3 Ultra (192GB) | Llama 3.3 70B, Qwen2.5 72B, Mixtral, DeepSeek R1 | ~72B (48GB) / ~200B+ (192GB) |
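The per-tier parameter ceilings above can be sanity-checked with a rough sizing formula. The sketch below assumes Q4_K_M averages about 0.57 bytes per weight and reserves a flat ~2 GB for KV cache and runtime buffers; both constants are assumptions for illustration, not measurements:

```python
BYTES_PER_PARAM_Q4KM = 0.57  # ~4.5 bits/weight average (assumption)
RUNTIME_HEADROOM_GB = 2.0    # KV cache + runtime buffers (assumption)

def footprint_gb(params_b: float) -> float:
    """Rough total memory needed to run a dense model at Q4_K_M."""
    return params_b * BYTES_PER_PARAM_Q4KM + RUNTIME_HEADROOM_GB

def fits(params_b: float, vram_gb: float) -> bool:
    """Does a model of this size fit a given VRAM/unified-memory budget?"""
    return footprint_gb(params_b) <= vram_gb

for params, vram in [(9, 8), (14, 16), (34, 24), (72, 48)]:
    print(f"{params}B on {vram}GB: {'fits' if fits(params, vram) else 'no'}")
```

Running the loop reproduces the table's ceilings: ~9B on 8 GB, ~14B on 16 GB, ~34B on 24 GB, ~72B on 48 GB, with the next size class up failing in each case.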
## CPU-Only & Low-End (≤8GB RAM)
Running a model on CPU only is slow (typically 2–8 tokens per second), but entirely feasible for simple tasks, offline chatbots, and privacy-sensitive use cases where no GPU is available.
- Qwen3 1.7B (Q4_K_M) — Impressively capable for its size. Handles basic Q&A, summarization, and simple code editing. Fits in 1.2GB RAM. Best choice when RAM is extremely constrained.
- Phi-4 Mini (Q4_K_M) — Microsoft's compact model punches above its weight on reasoning and instruction following. ~2.4GB footprint; 28 tok/s on a mid-range CPU with AVX2 support.
- Gemma 3 2B (Q4_K_M) — Google's entry-level Gemma excels at multilingual tasks and short-form generation. Strong knowledge base for a 2B model. ~1.6GB footprint.
For CPU-only inference, use llama.cpp directly or Ollama, which bundles llama.cpp under the hood. Enabling AVX2/AVX-512 and using 4-bit quantization are the two biggest performance levers.
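Why are those the two levers? Single-stream decoding on CPU is memory-bandwidth-bound: generating each token streams the entire weight file through RAM once, so the throughput ceiling is roughly bandwidth divided by model size — which is why halving the weights with 4-bit quantization roughly doubles speed. A minimal sketch (the ~40 GB/s dual-channel DDR4 figure is an assumed typical value):

```python
def decode_ceiling_tok_s(model_gb: float, mem_bw_gb_s: float) -> float:
    """Upper bound on single-stream decode speed: every generated token
    reads all weights from memory once, so tok/s <= bandwidth / size."""
    return mem_bw_gb_s / model_gb

# Qwen3 1.7B at Q4_K_M (~1.2 GB) on assumed ~40 GB/s DDR4:
print(decode_ceiling_tok_s(1.2, 40.0))   # roughly 33 tok/s upper bound
# The same model at FP16 (~3.4 GB) would cap near 12 tok/s.
print(decode_ceiling_tok_s(3.4, 40.0))
```

Real throughput lands below this bound (prompt processing, cache misses, thread overhead), but the model explains why small quantized models stay conversational on CPU while larger ones do not.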
## Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 Base)
An 8GB VRAM card is the most popular tier for local AI hobbyists. With Q4_K_M quantization, you can comfortably fit a 9B model and achieve conversational speeds.
- Qwen3.5-9B Q4_K_M — Best Overall — The top recommendation for 8GB VRAM. Excellent instruction following, strong coding, and multilingual support. Runs at 40+ tokens/sec on an RTX 3060. At 5.5GB, it leaves room for context and KV cache.
- Llama 3.2 8B Q4_K_M — Meta's highly optimized architecture. Great tool-use support and compatible with most frameworks. 4.9GB footprint; widely supported in LM Studio and Ollama.
- Phi-4 Mini Q4_K_M — Best for reasoning-heavy tasks at this tier. Achieves remarkable results on math and logical puzzles given its size. 2.4GB — can even fit alongside a browser tab in 8GB.
- Gemma 3 9B Q4_K_M — Best for image + text tasks (multimodal). Google's vision-language model fits in 6.5GB with the vision encoder loaded.
## Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)
16GB VRAM is the sweet spot for developers who want near-cloud quality without cloud costs. Qwen3 14B is the standout model at this tier.
- Qwen3 14B Q4_K_M — Best Overall — 12GB footprint; 61.85 tokens/sec on an RTX 4070. Leads the 16GB tier on instruction following, reasoning, and code. Close to GPT-4-Turbo quality on many tasks. The go-to recommendation for this tier in 2026.
- Llama 3.1 14B Q4_K_M — Excellent general-purpose model with strong community support and extensive fine-tune availability. 8.7GB; slightly faster than Qwen3 14B at 68 tok/s due to smaller KV heads.
- Mistral Nemo 12B Q4_K_M — Best for European language tasks and GDPR-conscious deployments. Apache 2.0 license; strong at function calling and structured output.
- Qwen2.5 Coder 14B Q4_K_M — Best coding model at the 16GB tier. Specialized on code generation, debugging, and code completion. 9.2GB; recommended for developers using local AI in their IDE.
## High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)
With 24GB of VRAM you enter a different quality tier. 30B-class models and MoE (Mixture of Experts) architectures become practical, and coding specialists like Qwen2.5 Coder 32B are genuinely competitive with cloud models.
### RTX 4090 (24GB VRAM)
- Qwen3 30B-A3B MoE Q4_K_M — Best Speed — An MoE model that activates only 3B parameters per forward pass while storing 30B of knowledge. Runs at a remarkable 196 tokens/sec on the RTX 4090. Best for high-throughput local inference.
- GLM-4.7-Flash (30B-A3B MoE) — Best Reasoning — Leads at the 24GB tier on the Intelligence Index (30.1). In autonomous agent testing it was the only 24GB model to successfully complete a complex multi-step game-building task. Supports extended reasoning mode.
- Qwen2.5 Coder 32B Q4_K_M — Best Coding — 15–25 tok/s on RTX 4090; competitive with Claude Sonnet on many coding benchmarks. The top local coding model if you have 24GB VRAM.
- Llama 3.1 70B Q4_K_M (partial offload) — At 52 tok/s with CPU offload for layers that don't fit in VRAM, a 70B model is viable on the 4090 with Ollama or llama.cpp's GPU layer splitting. Quality jumps significantly over 30B models.
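The partial-offload arithmetic can be sketched as follows: llama.cpp's `--n-gpu-layers` flag (and Ollama's `num_gpu` option) place as many transformer layers as fit in VRAM and run the rest on CPU. The helper below assumes roughly equal-sized layers and holds back ~3 GB for KV cache and CUDA buffers — both simplifying assumptions:

```python
def gpu_layers(total_layers: int, model_gb: float, vram_gb: float,
               reserve_gb: float = 3.0) -> int:
    """Estimate how many layers to offload to the GPU.

    Assumes layers are equal-sized; reserve_gb is held back for the
    KV cache and runtime buffers (assumption, tune per setup).
    """
    per_layer_gb = model_gb / total_layers
    n = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(n, total_layers))

# Llama 3.1 70B at Q4_K_M: ~40 GB of weights across 80 layers,
# squeezed onto a 24 GB RTX 4090:
print(gpu_layers(80, 40.0, 24.0))  # → 42
```

You would then pass the result as `--n-gpu-layers 42` to `llama-cli`/`llama-server`; Ollama performs an equivalent split automatically.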
### A100 / H100 (40–80GB VRAM)
- Llama 3.3 70B Q4_K_M / Q5_K_M — Fully fits in 40GB at Q5_K_M. Delivers genuine GPT-4-class performance on-premise. Recommended for enterprise self-hosted deployments.
- DeepSeek V3.2 (full precision shards) — Requires an 8× GPU setup at full precision, but individual 13B shards can be run on a single A100 for testing purposes.
## Apple Silicon & MLX
Apple Silicon's unified memory architecture fundamentally changes the economics of local AI. The GPU and CPU share the same high-bandwidth memory pool, so a 96GB M3 Max can fully load a 70B model — something impossible on discrete GPUs without multi-GPU setups.
- M3 Max (48–96GB unified) — The Portability King — Llama 3.3 70B at Q4_K_M requires ~40GB and runs at 18–25 tok/s on a 96GB M3 Max via Ollama or MLX. The M3 Max wins decisively over RTX 4090 for large models (70B+) while consuming 60–80W vs the 4090's 450W TDP.
- M3 Ultra (192GB unified) — Can run DeepSeek R1 671B with aggressive low-bit quantization, or any smaller open-weight model currently available. The ultimate local AI workstation for research teams.
- M2 Pro / M3 (16–32GB unified) — Excellent for 14B-class models; Qwen3 14B and Llama 3.1 14B run smoothly at 35–50 tok/s via Ollama. Better sustained throughput than discrete 16GB cards due to higher memory bandwidth.
### MLX Framework
- Apple's MLX framework provides native Metal-optimized inference that typically outperforms llama.cpp on Apple Silicon by 15–30%.
- The mlx-community on Hugging Face publishes pre-quantized MLX models for all major architectures.
- Use LM Studio (which supports both MLX and llama.cpp backends) for a GUI interface on Mac, or run `mlx_lm.generate` directly from the terminal for scripting.
## Quantization Guide
| Format | Bits per Weight | Size vs FP16 | Quality Loss | Best For |
|---|---|---|---|---|
| Q4_K_M | 4-bit (K-quant) | ~25% of FP16 | 2–5% perplexity increase | Default choice for all hardware tiers; best speed/quality ratio |
| Q5_K_M | 5-bit (K-quant) | ~31% of FP16 | 1–2% perplexity increase | When you have extra VRAM headroom and want slightly better quality |
| Q8_0 | 8-bit | ~50% of FP16 | <0.5% perplexity increase | Near-lossless quality on Apple Silicon with large unified memory |
| IQ2_XXS | 2-bit (imatrix) | ~12% of FP16 | 10–15% perplexity increase | CPU-only or extremely limited RAM; large models on tiny hardware |
| FP16 / BF16 | 16-bit full | 100% | None | Fine-tuning, training, reference inference; requires server GPU |
Practical rule of thumb: Start with Q4_K_M. If you have 25% more VRAM headroom than the model requires at Q4_K_M, upgrade to Q5_K_M. If you are on Apple Silicon with large unified memory and need the highest fidelity for coding or structured output tasks, use Q8_0.
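The "Size vs FP16" column follows directly from nominal bit widths (4/16 = 25%, 5/16 ≈ 31%, and so on). Real GGUF files run slightly larger because K-quant blocks carry scale metadata, but the nominal figure is close enough for capacity planning:

```python
def size_vs_fp16(bits: float) -> float:
    """Fraction of FP16 size at a given nominal bits-per-weight."""
    return bits / 16

def model_size_gb(params_b: float, bits: float) -> float:
    """Weights-only size in GB: parameters (billions) x bits / 8."""
    return params_b * bits / 8

# A 14B model across the table's formats (nominal, weights only):
for name, bits in [("Q4_K_M", 4), ("Q5_K_M", 5), ("Q8_0", 8), ("FP16", 16)]:
    print(f"{name}: {model_size_gb(14, bits):.1f} GB "
          f"({size_vs_fp16(bits):.0%} of FP16)")
```

At 14B this gives 7.0 GB at Q4_K_M versus 28.0 GB at FP16 — the 4× gap that makes mid-tier GPUs viable in the first place.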
## Recommended Tools
### Ollama
- The easiest way to get started. One command downloads and runs a model: `ollama run qwen3:14b`
- Manages model storage, GPU layer allocation, and serves an OpenAI-compatible REST API at `localhost:11434`
- Supports model library browsing at ollama.com/search; includes 300+ pre-quantized GGUF models
- Best for: developers who want quick setup, CI/CD integration, and REST API access without GUI overhead
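Because Ollama exposes an OpenAI-compatible endpoint, any HTTP client works against it. A minimal stdlib-only sketch (assumes a local Ollama server with `qwen3:14b` already pulled):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def chat_payload(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    """POST to the local Ollama server and return the reply text."""
    body = json.dumps(chat_payload(model, prompt)).encode()
    req = request.Request(OLLAMA_URL, data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Requires a running server (`ollama run qwen3:14b`):
# print(chat("qwen3:14b", "Summarize Q4_K_M in one sentence."))
```

Swapping `OLLAMA_URL` for a cloud endpoint is the only change needed to move between local and hosted inference, which is the main appeal of the OpenAI-compatible API.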
### LM Studio
- Cross-platform GUI (Mac, Windows, Linux) with a model discovery browser connected to Hugging Face
- Supports llama.cpp and MLX backends; automatic GPU detection and layer splitting
- Built-in chat interface and OpenAI-compatible local server
- Best for: non-developers, researchers, and Mac users who prefer a polished UI and easy model switching
### llama.cpp
- The original C++ inference engine for GGUF models. Runs on CPU, CUDA, Metal, ROCm, Vulkan, and SYCL backends
- Maximum control over quantization, prompt formatting, context size, and batch settings
- Use `llama-server` for an OpenAI-compatible HTTP server; `llama-cli` for direct terminal inference
- Best for: power users, embedded deployments, custom quantization workflows, and situations where Ollama's overhead is undesirable