Running AI models locally has never been more practical. In 2026, consumer and prosumer hardware can run models that rival last year's frontier APIs — with no data ever leaving your machine, no per-token costs, and sub-100 ms time-to-first-token on fast hardware. The key is matching the right model and quantization level to your specific hardware. This guide tells you exactly which models to run on which hardware, from 8GB RAM laptops to 24GB VRAM RTX 4090 workstations to Apple Silicon Macs with unified memory up to 192GB.

Hardware Tier Overview

| Tier | VRAM / RAM | Example Hardware | Recommended Models | Max Params (Practical) |
| --- | --- | --- | --- | --- |
| CPU-only / Low-end | ≤8GB RAM | Older laptops, Raspberry Pi 5 | Phi-4 Mini, Qwen3 1.7B, Gemma 3 2B | 4B (Q4_K_M) |
| Entry GPU | 4–8GB VRAM | RTX 3060 (8GB), GTX 1080, M1/M2 base (8GB) | Llama 3.1 8B, Qwen3.5 9B, Mistral 7B | 9B (Q4_K_M) |
| Mid GPU | 8–16GB VRAM | RTX 3080, RTX 4070, M2 Pro (16GB), M3 (16GB) | Qwen3 14B, GPT-OSS 20B, Gemma 3 12B | 20B (Q4_K_M) |
| High-end GPU | 24GB VRAM | RTX 4090, RTX 3090, A5000 | Qwen3 30B-A3B (MoE), Nemotron 3 Nano 30B, Qwen3 VL 32B | 32B (Q4_K_M) |
| Enthusiast GPU | 32GB+ VRAM | RTX 5090 (32GB), dual RTX 3090, A100 | Llama 3.3 70B, Qwen2.5 72B | 70B (Q4_K_M) |
| Apple Silicon — High | 64–192GB unified | M2 Ultra, M3 Max (96GB), M4 Ultra | Llama 3.3 70B, Qwen2.5 72B (via MLX) | 70B+ (Q4–Q8) |
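
The table above can be collapsed into a toy lookup helper. A minimal sketch — the tier boundaries and picks below are simplified from the table (entry and mid GPU tiers are merged, and ≤8GB is treated as CPU/low-end), so adjust them to your own setup:

```python
# Toy lookup mapping available memory (GB) to a simplified tier
# from the table above. Boundaries and picks are approximations.
def recommend_tier(mem_gb: float) -> dict:
    tiers = [
        (8,            "CPU-only / Low-end",         "Qwen3 1.7B / Phi-4 Mini",  "4B @ Q4_K_M"),
        (16,           "Entry/Mid GPU",              "Qwen3 14B / GPT-OSS 20B",  "20B @ Q4_K_M"),
        (24,           "High-end GPU",               "Qwen3 30B-A3B (MoE)",      "32B @ Q4_K_M"),
        (float("inf"), "Enthusiast / Apple Silicon", "Llama 3.3 70B",            "70B+ @ Q4-Q8"),
    ]
    for limit, name, picks, max_params in tiers:
        if mem_gb <= limit:
            return {"tier": name, "picks": picks, "max": max_params}

print(recommend_tier(24)["tier"])  # High-end GPU
```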

CPU-Only & Low-End (≤8GB RAM)

Running LLMs on CPU-only or severely RAM-constrained machines is viable for small models with aggressive quantization. Expect 2–10 tokens per second — slow but functional for offline use.

  • Qwen3 1.7B (Q4_K_M) — Best quality-per-byte at this size. Surprisingly capable for summarization, Q&A, and simple coding tasks. Pull via Ollama: ollama pull qwen3:1.7b.
  • Phi-4 Mini (Q4_K_M) — Microsoft's 3.8B model punches above its weight on reasoning and instruction following. Fits in 8GB RAM with room to spare. Strong choice for offline assistant use.
  • Gemma 3 2B — Google's smallest open model; decent for text classification and extraction at minimal resource cost.
  • Practical tips: Use Q4_K_M quantization, not Q8_0 — halves memory at minimal quality loss at this size. Limit context to 2K tokens to reduce RAM pressure. Close all other applications before loading.
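
Claims like "fits in 8GB RAM with room to spare" are easy to sanity-check yourself. A back-of-envelope sketch, assuming ~4.5 effective bits per weight for Q4_K_M and a flat overhead allowance for context and runtime — both rough approximations, not exact GGUF file sizes:

```python
# Back-of-envelope RAM estimate for a quantized model.
# Rule of thumb only: real GGUF files add metadata, and the runtime
# needs extra room for KV cache and activations (lumped into overhead).
def model_ram_gb(params_billions: float, bits_per_weight: float,
                 overhead_gb: float = 1.5) -> float:
    weights_gb = params_billions * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

print(model_ram_gb(3.8, 4.5))  # Phi-4 Mini (3.8B) at Q4_K_M: ~3.6 GB
print(model_ram_gb(1.7, 4.5))  # Qwen3 1.7B at Q4_K_M: ~2.5 GB
```

Both picks land comfortably under 8GB, which is why this tier works at all.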

Entry GPU Tier (4–8GB VRAM: RTX 3060, M1/M2 Base)

This tier unlocks genuinely useful models. 7–9B parameter models at Q4_K_M quantization run at 30–55 tokens per second on 8GB VRAM — fast enough for real-time chat.

  • Qwen3.5-9B (Q4_K_M) — Top pick — 55+ tokens per second fully in GPU memory on 8GB VRAM. Best chatbot and local assistant quality in this tier. ollama pull qwen3.5:9b.
  • Phi-4 Mini (Q4_K_M) — 28 tok/s; best reasoning-per-VRAM ratio. Strong for coding assistance and logical tasks where raw chat fluency matters less.
  • Llama 3.1 8B (Q4_K_M) — Meta's well-documented, widely supported model. Large community, abundant fine-tunes. ollama pull llama3.1:8b.
  • Mistral 7B v0.3 (Q4_K_M) — Efficient instruction-following; good for structured output generation and function calling.
  • Gemma 3 9B (Q4_K_M) — Google's multimodal model; also handles image inputs on supported runners. Good for document parsing workflows.
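
"Fully in GPU memory" depends on context length as much as on the weights: the KV cache competes with the model for VRAM. A rough sizing sketch, assuming a hypothetical Llama-3-style 8B configuration (32 layers, 8 KV heads via GQA, head dim 128, fp16 cache) — check your model's actual config:

```python
# Rough KV-cache footprint at a given context length, assuming a
# Llama-3-style 8B config. Shows why long contexts eat the margin
# on an 8GB card that already holds ~5 GB of Q4 weights.
def kv_cache_gb(context_len: int, n_layers: int = 32, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return round(per_token * context_len / 1024**3, 2)

print(kv_cache_gb(8192))   # 8K context: ~1 GB on top of the weights
print(kv_cache_gb(32768))  # 32K context: 4x that
```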

Mid GPU Tier (8–16GB VRAM: RTX 3080/4070, M2 Pro/Max)

The 14–20B range is where local AI becomes genuinely competitive with API-quality output for many tasks. At 30–50 tok/s, these models feel fast and produce results comparable to last-generation frontier APIs on most everyday tasks.

  • Qwen3 14B (Q4_K_M) — Top pick — If you only run one model on a 16GB VRAM system, make it Qwen3 14B. Excellent instruction following, strong coding, multilingual. ~40 tok/s on RTX 3080. ollama pull qwen3:14b.
  • GPT-OSS 20B — OpenAI's open-weight 20B model (a Mixture-of-Experts design, so it decodes quickly for its size). A dependable daily driver for most tasks on 16GB systems; strong on structured tasks and dialogue.
  • Gemma 3 12B (Q4_K_M) — Multimodal; processes images. Well-suited for document understanding and visual Q&A workflows on mid-range hardware.
  • Phi-4 14B (Q4_K_M) — Microsoft's reasoning-focused 14B. Strong at math and code generation at this parameter count; useful for teams without 24GB cards.

High-End GPU (24GB+ VRAM: RTX 4090, A100, M3 Max/Ultra)

24GB of VRAM (RTX 4090: 1,008 GB/s bandwidth) is the current sweet spot for enthusiast local inference. Models up to 32B parameters run entirely on-device at usable speeds, and the new RTX 5090 (32GB) pushes practical limits to around 40B.

  • Qwen3 30B-A3B MoE — Speed winner — A Mixture-of-Experts model that generates 196 tok/s on an RTX 4090 — faster than the 8B dense model — while delivering quality closer to the 14B class. The architectural breakthrough for local inference in 2026. ollama pull qwen3:30b-a3b.
  • Nemotron 3 Nano 30B — Math/reasoning winner — NVIDIA's entry scores 91% on MATH-500, the highest among models that fit in 24GB. Top choice for scientific computing, data analysis, and STEM-heavy use cases.
  • Qwen3 VL 32B — Vision winner — Best multimodal model fitting in 24GB. Processes images, charts, and documents natively. 33 tok/s at 48K context. ollama pull qwen3-vl:32b.
  • RTX 5090 upgrade path — With 32GB of GDDR7 at 1,792 GB/s (roughly 78% more bandwidth than the 4090), the 5090 runs Qwen3 8B at 185 tok/s, 14B at 124 tok/s, and 32B at 61 tok/s — a 25–35% improvement over the 4090 across model sizes.
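
The MoE speedup and the GPU-to-GPU comparisons both follow from the same rule of thumb: token generation is memory-bandwidth bound, because every generated token streams the (active) weights through memory once. That puts a hard ceiling on tokens per second: bandwidth divided by the quantized weight footprint. A rough roofline sketch; real throughput lands well below the ceiling due to kernel overhead and KV-cache traffic:

```python
# Upper-bound decode speed: bandwidth / quantized weight size.
# For MoE models, use the *active* parameter count, not the total.
# Assumes ~4.5 effective bits per weight for Q4_K_M.
def tok_s_ceiling(bandwidth_gb_s: float, params_billions: float,
                  bits_per_weight: float = 4.5) -> int:
    weight_gb = params_billions * bits_per_weight / 8
    return int(bandwidth_gb_s / weight_gb)

print(tok_s_ceiling(1008, 32))  # RTX 4090 ceiling for a 32B dense model
print(tok_s_ceiling(1008, 3))   # ~3B active weights: why the 30B-A3B MoE flies
```

The second number is why a 30B MoE with ~3B active parameters can outrun an 8B dense model: per token, it moves far fewer bytes.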

Apple Silicon & MLX

Apple Silicon's unified memory architecture fundamentally changes local LLM economics: CPU, GPU, and memory share the same pool, so a Mac Studio with 192GB unified memory can run models that would require two A100s on a PC. The catch is bandwidth: even the fastest Apple Silicon tops out around 800 GB/s (M2 Ultra), below a single RTX 4090's 1,008 GB/s, so tokens-per-second lags dedicated GPUs on models that fit in VRAM.

  • Use MLX, not Ollama, for maximum performance — Apple's MLX framework is built specifically for Apple Silicon and consistently outperforms Ollama for the same model on the same hardware, with the gap widening at long contexts. Install via pip install mlx-lm.
  • M1/M2 Base (8GB) — Same as Entry GPU tier above. 7–9B models at Q4_K_M.
  • M2 Pro / M3 (16–24GB) — Qwen3 14B or GPT-OSS 20B at Q4_K_M. Run mlx_lm.generate for best speed.
  • M3 Max (64–96GB) / M4 Max — Run Qwen3 30B-A3B at full Q8_0 quality, or Llama 3.3 70B / Qwen2.5 72B at Q4_K_M (8–15 tok/s). Slow but the only single-device consumer option for 70B-class models without spending $10K+ on datacenter GPUs.
  • M2 Ultra / M3 Ultra (128–192GB) — Run 70B models at Q8_0 for near-full quality. Sustained inference at 15–25 tok/s. Best fully-offline option for teams needing 70B quality without cloud API costs.
  • Top model picks for Apple Silicon — Qwen3 14B (daily driver), Llama 3.3 70B (max quality on 96GB+), DeepSeek-R1 distills (reasoning on any tier).

Quantization Guide

Quantization reduces model precision to fit larger models in less memory. The naming convention (Q4_K_M, Q5_K_M, Q8_0) refers to bits per weight and the quantization algorithm. Here's what each level means in practice:

| Format | Bits/Weight | Size vs FP16 | Quality Loss | Best Use Case |
| --- | --- | --- | --- | --- |
| Q4_K_M | 4-bit (mixed) | ~25% | Low (<2% on most benchmarks) | Default for most users; best memory-quality tradeoff |
| Q5_K_M | 5-bit (mixed) | ~31% | Very low (<1%) | When you have ~25% more VRAM to spare; coding and math tasks |
| Q6_K | 6-bit | ~38% | Minimal | Near-FP16 quality when memory allows; good for Apple Silicon |
| Q8_0 | 8-bit | ~50% | Negligible | Maximum local quality; use when you have the memory |
| IQ2_XXS | ~2.5-bit | ~16% | Significant | Extreme RAM constraints only; quality noticeably degraded |

Recommendation: Start with Q4_K_M — it is the memory efficiency gold standard and what Ollama pulls by default. Move to Q5_K_M for coding or math tasks if you have headroom. Use Q8_0 only on Apple Silicon or when running a model well within your VRAM limit.
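
This recommendation can be automated: given a memory budget, walk the table from highest quality down and take the first format that fits, leaving headroom for KV cache and activations. A sketch using approximate effective bits per weight for each GGUF format (the exact figures vary slightly by model):

```python
# Pick the highest-quality quantization that fits in a memory budget.
# Effective bits/weight are approximations for each GGUF format.
FORMATS = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7),
           ("Q4_K_M", 4.8), ("IQ2_XXS", 2.1)]

def best_quant(params_billions: float, mem_gb: float,
               headroom_gb: float = 2.0):
    budget = mem_gb - headroom_gb
    for name, bits in FORMATS:  # ordered best quality first
        if params_billions * bits / 8 <= budget:
            return name
    return None  # model doesn't fit at any quantization

print(best_quant(14, 16))  # 14B on a 16GB card
print(best_quant(32, 24))  # 32B on a 24GB card
```

With headroom to spare the chooser climbs from Q4_K_M to Q6_K, consistent with the table's "near-FP16 quality when memory allows" guidance; at a tight fit it falls back to Q4_K_M.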

Ollama

  • The easiest way to get started. One command installs and runs any model: ollama run qwen3:14b.
  • Runs a local REST API on port 11434 — compatible with OpenAI SDK (set base URL to http://localhost:11434/v1).
  • Model library at ollama.com; pulls a Q4_K_M build by default, with other quantizations available as tag variants (e.g. qwen3:14b-q8_0).
  • Best for: beginners, quick prototyping, running models as a background service.
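
Because the local server speaks the OpenAI wire format, any HTTP client works against it. A minimal sketch of the request body for the chat completions endpoint — the payload is built and printed but not sent, so this runs without a live server; the model tag is just an example:

```python
# Sketch of a chat request for Ollama's OpenAI-compatible endpoint.
# POST the payload to OLLAMA_URL with any HTTP client, or point the
# official OpenAI SDK at base_url="http://localhost:11434/v1".
import json

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "qwen3:14b",  # example tag; use any model you've pulled
    "messages": [{"role": "user", "content": "Summarize this README."}],
    "temperature": 0.2,
}

print(json.dumps(payload, indent=2))
```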

LM Studio

  • GUI application for Windows, Mac, and Linux. Download models from Hugging Face with a visual browser; no terminal required.
  • Built-in chat interface plus local OpenAI-compatible server mode.
  • Best for: non-technical users, teams wanting a polished UI, Windows users who prefer GUI tools.

llama.cpp

  • The underlying inference engine used by both Ollama and LM Studio. Running it directly gives maximum control: custom quantization, KV cache settings, speculative decoding, and server configuration.
  • Supports CUDA, Metal (Apple), ROCm (AMD), and Vulkan backends.
  • Best for: advanced users, production deployments, maximizing performance on specific hardware.

MLX / mlx-lm (Apple Silicon only)

  • Apple's open-source framework for GPU-accelerated ML on Apple Silicon. Typically faster than Ollama for the same model on the same Mac, especially at long contexts.
  • Install: pip install mlx-lm. Run: mlx_lm.generate --model mlx-community/Qwen3-14B-4bit --prompt "Hello".
  • Best for: any Mac user serious about local inference speed.