Local AI inference has never been more capable: in April 2026, a $1,500 RTX 4090 runs 32B-parameter models at usable speeds, Apple Silicon M3 Max handles 70B models via MLX with impressive efficiency, and tools like Ollama and LM Studio make deployment trivially easy. The key to choosing the right model isn't raw parameter count — it's matching quantization level and architecture to your available hardware, then picking the model family that best fits your use case.
Hardware Tier Overview
| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (usable) |
|---|---|---|---|---|
| CPU-Only / Low-End | ≤8GB RAM | Older laptops, Raspberry Pi 5 | Phi-3 Mini, Gemma 3 2B, Qwen3 1.5B | ~3B |
| Entry GPU | 4–8GB VRAM | RTX 3060, M1/M2 base (8GB) | Llama 3.1 8B Q4, Qwen3 8B Q4, Gemma 3 9B Q4 | ~9B |
| Mid GPU | 8–16GB VRAM | RTX 3080/4070, M2 Pro/Max (16GB) | GPT-OSS 20B Q4, Mistral 22B Q4, Qwen3 14B Q5 | ~22B |
| High-End GPU | 24GB VRAM | RTX 3090/4090, M3 Max (24GB) | Qwen3 32B Q4, Nemotron 3 Nano 30B, Qwen 2.5 Coder 32B | ~35B |
| Workstation / Multi-GPU | 48GB+ VRAM | 2×RTX 4090, A100, H100, M3 Ultra | Llama 3.3 70B Q4, Qwen2.5 72B Q4, DeepSeek V3 34B Q8 | ~72B |
| Apple Silicon (MLX) | Unified 16–192GB | M1/M2/M3/M4 all variants | Tier-dependent (see Apple section) | Up to 405B on Ultra |
CPU-Only & Low-End (≤8GB RAM)
CPU inference is slow — expect 3–10 tokens/second — but it works, and small models have improved dramatically. These recommendations prioritize models that are genuinely useful despite hardware constraints.
- Phi-3 Mini 3.8B (Q4_K_M) — Microsoft's Phi-3 Mini punches well above its weight class. At 3.8B parameters, it fits in ~2.5GB RAM and handles basic coding, Q&A, and summarization competently. Best small model for pure CPU use.
- Gemma 3 2B (Q4_K_M) — Google's 2B model is the lightest genuinely useful option, fitting in ~1.5GB. Recommended for embedded use cases, low-power devices, and always-on assistant applications where power consumption matters.
- Qwen3 1.5B (Q4_K_M) — Alibaba's tiny Qwen3 variant is surprisingly capable for its size, with strong multilingual support. Best choice for non-English CPU-only deployments.
- Llama 3.2 3B (Q4_K_M) — Meta's 3B model offers good all-around performance for general assistant tasks, with the benefit of the broad Llama ecosystem and community support.
Tip: On CPU, quantization level matters more for speed than for quality. Use Q4_K_M: the quality drop versus Q8 is minimal, but generation is roughly 2× faster.
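A quick way to check that speed gap on your own machine is to time the same prompt against two quantizations of the same model. A minimal sketch with llama-cpp-python, assuming both GGUF files are already downloaded (the file names are placeholders):

```python
# Time the same prompt against a Q4_K_M and a Q8_0 GGUF of the same model
# on CPU; per the tip above, expect roughly a 2x gap in tokens/second.
# File names are placeholders for whatever GGUFs you have downloaded.
import time
from llama_cpp import Llama

def cpu_tok_per_s(gguf_path: str, prompt: str = "Explain DNS in one paragraph.") -> float:
    llm = Llama(model_path=gguf_path, n_ctx=2048, n_threads=8, verbose=False)
    start = time.time()
    out = llm(prompt, max_tokens=128)
    return out["usage"]["completion_tokens"] / (time.time() - start)

print("Q4_K_M:", round(cpu_tok_per_s("phi-3-mini-Q4_K_M.gguf"), 1), "tok/s")
print("Q8_0:  ", round(cpu_tok_per_s("phi-3-mini-Q8_0.gguf"), 1), "tok/s")
```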
Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 base)
This tier is where local AI becomes genuinely practical. At 40–80 tokens/second, responses feel nearly real-time, and 7–9B models handle the majority of everyday tasks well.
- Qwen3.5 9B (Q4_K_M) — Top Pick — Fits in ~6GB VRAM and delivers 55+ tokens/second fully on GPU. The best model for an 8GB VRAM daily assistant. Strong coding, reasoning, and chat quality for its size tier.
- Llama 3.1 8B (Q4_K_M) — Meta's workhorse model. Broad community support, wide tool compatibility, and reliable quality for general tasks at ~50 tok/s on RTX 3060. The safe default if you're unsure what to run.
- Qwen3 8B (Q4_K_M) — Excellent multilingual support and particularly strong on reasoning and math tasks for the 8B tier. Recommended over Llama 3.1 8B for non-English use cases.
- Gemma 3 9B (Q4_K_M) — Google's 9B model fits in ~6GB and excels at instruction following and structured output generation. Good for applications requiring reliable JSON or formatted responses.
- Mistral 7B v0.3 (Q4_K_M) — Still a strong performer, especially for European language tasks and document processing. The veteran pick with years of community tooling.
Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)
The 16GB tier unlocks 20B+ models where quality noticeably surpasses the 8B tier on complex reasoning, coding, and long-form generation tasks.
- GPT-OSS 20B (Q4_K_M) — Daily Driver — OpenAI's open-weight 20B model at Q4_K_M is the recommended default for 16GB systems, handling 70–80% of tasks that would previously require API calls. Delivers ~40 tok/s on RTX 3080.
- Qwen3 14B (Q5_K_M) — At 14B parameters, Qwen3 fits comfortably in 16GB at Q5 quantization (better quality than Q4 with manageable size). Excellent for coding and reasoning; strong multilingual support.
- Mistral 22B (Q4_K_M) — Fits in ~14GB and delivers a significant quality jump over 7B models. Strong at creative writing and nuanced instruction following.
- DeepSeek Coder V2 16B (Q4_K_M) — The best local coding model in the 8–16GB tier. If your primary use case is code generation or review, this is the pick over general-purpose models.
- Phi-4 14B (Q4_K_M) — Microsoft's Phi-4 achieves remarkable quality at 14B parameters thanks to synthetic data training. Particularly strong on STEM tasks and structured reasoning.
High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)
The 24GB tier is where local inference becomes production-grade. Qwen3 32B at Q4 fits with headroom left for context, delivering near-frontier quality at 20+ tokens/second.
- Qwen3 32B (Q4_K_M) — Top Overall Pick — The 2026 community consensus best model for RTX 4090. Fits in ~24GB comfortably, delivers excellent quality across all task types, and runs at ~20 tok/s. The Q4_K_M quantization preserves 97%+ of full-precision quality.
- Qwen 2.5 Coder 32B (Q4_K_M) — Best for Coding — The specialist coding choice. Outperforms Qwen3 32B on code generation and debugging tasks specifically. Use this if 80%+ of your workload is code-related.
- Nemotron 3 Nano 30B (Q4_K_M) — NVIDIA's own model, optimized for its hardware. Trades blows with Qwen3 32B at the top of the 24GB tier. Particularly strong on instruction following and structured output.
- Qwen3 VL 32B (Q4_K_M) — The multimodal pick. If you need vision capabilities (image understanding, document OCR, screenshot analysis) alongside text, this is the 24GB-tier choice, though at ~24.5GB of loaded weights it slightly exceeds a 24GB card and will need a small amount of offloading.
- Llama 3.3 70B (Q2_K) — 70B at Q2 quantization fits in ~24GB but quality is noticeably degraded versus Qwen3 32B at Q4. Only recommended if you specifically need Llama's fine-tuned ecosystem compatibility.
RTX 4090 speed reference: 7B Q4 ≈ 80–100 tok/s, 13B Q4 ≈ 50–70 tok/s, 30B Q4 ≈ 20–35 tok/s, 70B Q2 ≈ 15–25 tok/s.
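To see where your own card lands against these numbers, Ollama's native /api/generate endpoint reports eval_count (tokens generated) and eval_duration (nanoseconds) in its non-streamed response, which is enough to compute throughput. A minimal sketch, assuming a local Ollama server and a model you have already pulled (the tag below is an example):

```python
# Measure generation throughput against a local Ollama server.
# eval_count / eval_duration come straight from the API response;
# the model tag is an example -- use whichever model you have pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3:32b", "prompt": "Explain KV caching in two sentences.", "stream": False},
    timeout=600,
)
stats = resp.json()
tok_per_s = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"{stats['eval_count']} tokens generated at {tok_per_s:.1f} tok/s")
```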
Apple Silicon & MLX
Apple Silicon's unified memory architecture makes it uniquely suited for local LLM inference — CPU and GPU share the same memory pool, so a 64GB M3 Max can load a 40GB model without the VRAM bottleneck that limits NVIDIA cards. MLX (Apple's ML framework) further optimizes inference for Apple hardware.
| Chip | Unified RAM | Best Models | Approx Speed (largest listed model) |
|---|---|---|---|
| M1/M2 base | 8–16GB | Llama 3.1 8B, Qwen3 8B, Phi-4 (Q4) | N/A (8B tier only) |
| M2/M3 Pro | 16–36GB | Qwen3 14B Q5, GPT-OSS 20B Q4, Mistral 22B Q4 | ~25 tok/s (20B) |
| M2/M3 Max | 32–96GB | Qwen3 32B Q4, Llama 3.3 70B Q4 (on 64GB+) | ~18–22 tok/s (32B Q4) |
| M3 Ultra / M4 Ultra | 96–192GB | Llama 3.3 70B Q8, Qwen2.5 72B Q8, 405B models (Q2) | ~12–15 tok/s (70B Q8) |
- Use MLX-LM for Apple Silicon — Apple's MLX framework outperforms llama.cpp on Apple hardware by 20–40% for most model architectures. Install via `pip install mlx-lm` (a minimal usage sketch follows this list).
- Ollama now uses the MLX backend automatically on Apple Silicon — you get MLX performance without any extra configuration.
- Thermal throttling warning: Mac laptops will throttle under sustained inference load. M3 Max MacBook Pro sustains ~70% of peak performance over 30-minute sessions; M3 Ultra Mac Studio maintains full performance indefinitely thanks to active cooling.
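For reference, a minimal mlx-lm sketch. The repo id is just one example of a pre-converted 4-bit model from the mlx-community organization on Hugging Face, and the exact generate() keyword arguments vary somewhat between mlx-lm releases:

```python
# Load a pre-converted 4-bit model and generate with mlx-lm on Apple Silicon.
# Run `pip install mlx-lm` first; the repo id is an example, not a recommendation.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Write a haiku about unified memory.",
    max_tokens=100,
    verbose=True,  # prints generation speed stats
)
print(text)
```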
Quantization Guide
Quantization reduces model weight precision to shrink file size and VRAM requirements. Choosing the right level is the single most impactful decision for local inference.
| Quantization | Bits per Weight | Size vs FP16 | Quality Loss | Speed | Recommendation |
|---|---|---|---|---|---|
| Q2_K | ~2.6 bits | ~83% smaller | Significant (5–15%) | Fastest | Only if VRAM is the absolute constraint |
| Q4_K_M | ~4.5 bits | ~72% smaller | Minimal (2–5%) | Very Fast | Gold standard — use this by default |
| Q5_K_M | ~5.5 bits | ~65% smaller | Negligible (1–2%) | Fast | Use when you have 10–15% extra VRAM headroom |
| Q6_K | ~6.6 bits | ~58% smaller | Near-lossless (<1%) | Moderate | High-end GPUs where quality matters most |
| Q8_0 | 8 bits | ~50% smaller | Virtually none | Slower | When quality is paramount and VRAM is ample |
| FP16 | 16 bits | Baseline | None (reference) | Slowest | Research/fine-tuning only — impractical for inference |
Rule of thumb: An 8B model in Q4_K_M uses ~5–6GB VRAM (vs ~16GB in FP16). A 32B model in Q4_K_M uses ~20–22GB. Always choose Q4_K_M unless you have a specific reason to go higher or lower.
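The rule of thumb is just parameters × bits-per-weight, plus runtime overhead. A rough calculator sketch using the approximate bits-per-weight figures from the table above; the 1.2× overhead multiplier for KV cache and runtime buffers is an assumption, not a measured constant:

```python
# Back-of-the-envelope memory estimate: params x bits-per-weight / 8,
# scaled by an assumed 1.2x overhead for KV cache and runtime buffers.
BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.5, "Q5_K_M": 5.5, "Q6_K": 6.6, "Q8_0": 8.0, "FP16": 16.0}

def estimated_gb(params_billion: float, quant: str, overhead: float = 1.2) -> float:
    weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8  # 1e9 params and 1e9 bytes/GB cancel out
    return weights_gb * overhead

for params, quant in [(8, "Q4_K_M"), (32, "Q4_K_M"), (70, "Q4_K_M")]:
    print(f"{params}B {quant}: ~{estimated_gb(params, quant):.1f} GB")
```

The 8B and 32B outputs land on the ~5–6GB and ~20–22GB figures above, and the 70B Q4 estimate shows why that model belongs in the workstation tier.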
Recommended Tools
Ollama
- Best for: Getting started, CLI usage, API server, macOS/Linux/Windows support
- Install: Single-command install at ollama.com. Run any model with `ollama run qwen3:32b`
- Strengths: Automatic model download, OpenAI-compatible REST API at `localhost:11434` (usage example below), automatic MLX backend on Apple Silicon, background service mode
- Library: 1,000+ models available in the Ollama registry covering every major family
- 2026 update: Ollama now supports multi-GPU inference natively — models too large for a single GPU automatically split across available GPUs
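Because the local server speaks the OpenAI API, existing OpenAI SDK code can be pointed at Ollama by swapping the base URL. A minimal sketch, assuming the openai Python package is installed and the model tag (an example) has been pulled:

```python
# Point the standard OpenAI client at a local Ollama server.
# The api_key is ignored locally, but the client requires some value.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen3:32b",  # example tag -- any model you have pulled
    messages=[{"role": "user", "content": "Give me three uses for a local LLM."}],
)
print(resp.choices[0].message.content)
```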
LM Studio
- Best for: GUI users, model discovery and management, non-technical users
- Strengths: Visual model browser with hardware compatibility filter, chat UI, local server mode, GGUF and MLX model support
- Best feature: Hardware compatibility checker tells you exactly which quantization levels fit your specific GPU/RAM combination before downloading
- Platform: macOS, Windows, Linux (available at lmstudio.ai)
llama.cpp
- Best for: Maximum performance, custom integrations, embedded deployment, fine-grained control
- Strengths: The reference implementation that most other local tools build on. Supports every quantization format, CUDA/Metal/Vulkan/CPU backends, and speculative decoding for a 2–3× speed boost
- Speculative decoding: Pair a large target model (e.g., 32B) with a small draft model (e.g., 3B); the draft proposes tokens and the 32B model verifies them, so you keep the 32B model's output quality while generating at up to 2–3× the speed of running the 32B alone (see the sketch after this list)
- Best for power users who want to squeeze every token/second from their hardware
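To make the verification idea concrete, here is a toy greedy-decoding sketch using Hugging Face transformers rather than llama.cpp itself. The small Qwen checkpoints are stand-ins for whatever target/draft pair you actually run, and real implementations add sampling support and KV-cache reuse that this omits. The point it illustrates: every emitted token is one the target model itself would have chosen, which is why output quality is preserved.

```python
# Toy greedy speculative decoding: the draft proposes k tokens, the target
# verifies them in one forward pass, and only tokens the target itself would
# have produced are kept -- so output equals plain greedy target decoding.
# Model ids are small stand-ins; llama.cpp's implementation is more complete.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_ID = "Qwen/Qwen2.5-1.5B-Instruct"  # stands in for your 32B target
DRAFT_ID = "Qwen/Qwen2.5-0.5B-Instruct"   # stands in for your 3B draft (same tokenizer)

tok = AutoTokenizer.from_pretrained(TARGET_ID)
target = AutoModelForCausalLM.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID)

@torch.no_grad()
def speculative_generate(prompt: str, max_new: int = 48, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    prompt_len = ids.shape[1]
    while ids.shape[1] - prompt_len < max_new:
        # 1) Draft model proposes k tokens greedily.
        drafted = ids
        for _ in range(k):
            next_id = draft(drafted).logits[:, -1, :].argmax(-1, keepdim=True)
            drafted = torch.cat([drafted, next_id], dim=-1)
        proposed = drafted[:, ids.shape[1]:]

        # 2) Target scores the whole drafted sequence in one forward pass;
        #    logits at position i predict token i+1.
        tgt_pred = target(drafted).logits[:, ids.shape[1] - 1:-1, :].argmax(-1)

        # 3) Accept drafted tokens while they match the target's own greedy choice.
        n_ok = 0
        while n_ok < k and proposed[0, n_ok] == tgt_pred[0, n_ok]:
            n_ok += 1

        # 4) Keep accepted tokens; at the first mismatch, substitute the target's token.
        ids = torch.cat([ids, proposed[:, :n_ok]], dim=-1)
        if n_ok < k:
            ids = torch.cat([ids, tgt_pred[:, n_ok:n_ok + 1]], dim=-1)
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("The fastest way to speed up local inference is"))
```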