Self-hosting a large language model on a VPS is increasingly practical in 2026, but the hardware requirements vary wildly depending on what you want to run. Pick the wrong plan and you get either bill shock or a server that generates tokens slower than you can type.
This guide covers which model sizes need what hardware, where VRAM fits in, and which two providers offer the best value for most people exploring self-hosted AI.
The Real Constraint: VRAM, Then RAM
Before comparing providers, get the hardware math right. VRAM is the gating factor for GPU inference. If your model doesn’t fit in VRAM, the overflow spills to system RAM and generation speed drops to CPU levels or worse.
Rule of thumb for quantized models (Q4):
- 1-3B model: ~1-2 GB VRAM or RAM
- 7B model: ~4-6 GB VRAM
- 13B model: ~8-10 GB VRAM
- 34B model: ~20+ GB VRAM
For CPU-only inference, RAM replaces VRAM as the limit, but throughput is dramatically lower. Small models (1-3B) are usable on CPU. A 7B model on CPU generates tokens slowly enough to test your patience, and a 13B model on CPU is generally a last resort.
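The rule of thumb above can be approximated with simple arithmetic: weight memory is roughly parameter count times bits per weight, plus headroom for the KV cache and runtime buffers. The sketch below assumes about 4.5 effective bits per weight for a Q4 quantization and 20% overhead. Both figures are illustrative assumptions rather than measurements, and real usage grows with context length.

```python
def estimate_memory_gb(params_billion: float,
                       bits_per_weight: float = 4.5,  # assumed effective size of a Q4 quant
                       overhead: float = 0.20) -> float:  # assumed KV cache / runtime headroom
    """Rough memory estimate in GB for a quantized model."""
    # 1 billion parameters at 8 bits per weight is about 1 GB.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + overhead)


def fits(params_billion: float, available_gb: float) -> bool:
    """True if the estimate fits in the given VRAM (or RAM for CPU inference)."""
    return estimate_memory_gb(params_billion) <= available_gb


if __name__ == "__main__":
    for size in (3, 7, 13, 34):
        print(f"{size}B: ~{estimate_memory_gb(size):.1f} GB")
```

Treat the result as a lower bound when sizing a plan: if the estimate lands within a gigabyte or two of a GPU's VRAM, step up a tier or test on hourly billing first.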
Model Size vs. Hardware: Quick Reference
| Model Size | Min Hardware | Usable Speed? | Recommended Setup |
|---|---|---|---|
| 1-3B (quantized) | 4 GB RAM, any CPU | Yes, even on CPU | Hetzner CX22 or similar CPU VPS |
| 7B (quantized) | 6 GB VRAM or 16 GB RAM (slow) | GPU yes; CPU marginal | Vultr GPU (hourly) |
| 13B (quantized) | 10 GB VRAM | Yes on GPU | Vultr A100/L40S or similar |
| 34B+ (quantized) | 24+ GB VRAM | Needs serious GPU | Dedicated GPU server |
Tokens-per-second figures depend too heavily on the specific hardware configuration to quote reliably here. Before committing to a plan, check the Ollama community benchmarks thread for real user numbers on specific models and GPUs.
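You can also measure throughput on your own instance. The sketch below assumes Ollama is running on its default local port and uses the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that its `/api/generate` endpoint returns for a non-streaming request. The function names are ours, and a single run is only a rough indicator.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama's default local port


def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def benchmark(model: str, prompt: str) -> float:
    """Run one non-streaming generation and return tokens per second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.load(reply))
```

Calling `benchmark("llama3.2:1b", "Explain VRAM in one paragraph.")` returns a single float. The model tag is only an example, so substitute whichever model you have pulled, and run it a few times since the first call includes model load time in the wall-clock wait.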
Top Picks
Vultr — Best for GPU Instances (Hourly Billing)
Vultr’s Cloud GPU instances are the most accessible on-ramp for 7B+ model experimentation. The key advantage is hourly billing: spin up a GPU node, run your tests, and tear it down. You’re not locked into a monthly commitment while you figure out which model size actually fits your workflow.
Vultr offers NVIDIA GPU instances across multiple tiers. Their global data center spread also means you can pick a region close to your users or yourself for lower latency.
Best for: Anyone who wants to experiment with 7B+ models without committing to monthly GPU costs upfront. Also solid for building a personal AI assistant with Open WebUI.
See our Ollama + Open WebUI setup guide for the exact steps to get running on a Vultr instance.
Hetzner — Best CPU Value for Small Models
Hetzner’s VPS lineup punches well above its price in RAM per euro, particularly the CX and CAX (Arm) series. For running small quantized models such as Phi-3 mini, Gemma 3 1B, or Qwen 1.5B, a Hetzner box with 8-16 GB RAM is genuinely capable and costs a fraction of comparable plans elsewhere.
Hetzner is CPU-only on their standard VPS tiers (dedicated servers with GPU add-ons exist but require more setup). That limits you to smaller models, but for a personal assistant, summarization tool, or coding helper running a 3B model, it’s a legitimate production setup.
Best for: Budget-conscious users running small models full-time, European users who need data residency in Germany or Finland, or anyone who wants a persistent always-on LLM without GPU costs.
What About DigitalOcean?
DigitalOcean has GPU Droplets, but availability has historically been limited. Their CPU Droplets are solid, but Hetzner offers better RAM density for the price in most configurations. DigitalOcean is worth checking if you need its broader ecosystem (Managed DBs, App Platform, etc.) alongside your LLM work. See our Vultr vs DigitalOcean comparison for a deeper look.
Practical Cost Control Tips
Rent GPU hourly while testing. Don’t buy a monthly GPU plan until you’ve validated that your chosen model actually runs at acceptable speed on a given GPU tier. Vultr’s hourly billing exists precisely for this.
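To put numbers on this, compare what a handful of test sessions cost against leaving the same instance running all month. The hourly rate below is a placeholder, not a quoted price from any provider, and 730 is the usual approximation for hours in a month.

```python
HOURS_PER_MONTH = 730  # common approximation: 365 days * 24 hours / 12 months


def session_cost(hours: float, hourly_rate: float) -> float:
    """Cost of running an instance for a given number of hours."""
    return hours * hourly_rate


def always_on_cost(hourly_rate: float) -> float:
    """Cost of leaving the same instance running for a full month."""
    return session_cost(HOURS_PER_MONTH, hourly_rate)


if __name__ == "__main__":
    rate = 1.50  # placeholder hourly rate; check your provider's pricing page
    print(f"20 hours of testing: ${session_cost(20, rate):.2f}")
    print(f"Running all month:   ${always_on_cost(rate):.2f}")
```

Remember to destroy the instance, not just stop it, if your provider keeps billing for stopped machines.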
Quantize aggressively. A Q4_K_M quantized 7B model needs roughly 4-6 GB of VRAM and gives you most of the quality of the full-precision model at a fraction of the hardware cost. Ollama handles quantized GGUF files natively.
Match model to use case. A 1-3B model running locally on a cheap CPU VPS is perfectly capable for summarization, simple Q&A, or code completion hints. You don’t need a 70B model for everything.
Next Steps
Once you’ve picked your VPS, the Ollama + Open WebUI self-hosting guide walks through the full setup: installing Ollama, pulling your first model, and wiring up a browser-based chat interface.
For general VPS selection beyond AI workloads, see our best VPS for self-hosting roundup.