Can I run a local LLM on a CPU-only VPS?

Yes, but only for small quantized models in the 1-3B parameter range. Expect slow generation speeds—usable for testing or low-traffic personal use, but not for production. A 7B or larger model on a CPU-only box will be painfully slow.

How much VRAM do I need to run a 7B model?

A quantized 7B model (Q4 format) needs roughly 4-6 GB of VRAM to fit fully on the GPU. For full-precision 7B you need 14+ GB. VRAM is the real bottleneck—if the model doesn't fit, it spills to RAM and generation speed collapses.

Is renting a GPU VPS worth it for personal LLM use?

For occasional testing, hourly GPU billing (like Vultr offers) makes it affordable. You spin up, run your experiments, and shut down. For daily personal use the costs add up fast—compare monthly GPU rates against a CPU box running a smaller quantized model.

Which model sizes can Ollama run on a standard VPS?

Ollama handles anything from tiny 1B models to 70B+ given enough hardware. On a standard 8 GB RAM CPU VPS, stick to 1-3B quantized models. On a GPU instance with 16 GB VRAM you can comfortably run 13B quantized models.

Is Hetzner a good choice for running Ollama?

Hetzner's CPU-only VPS plans offer exceptional RAM-per-dollar in Europe, making them a strong pick for small quantized models. Their dedicated servers also support GPU add-ons. For pure GPU cloud instances, Vultr has more flexible hourly options.

Best VPS for Local LLM in 2026: Run Ollama Without Breaking the Bank

Some links below are affiliate links: if you buy through them I may earn a commission at no extra cost to you. I only recommend what I have actually tested, and it never changes my verdict.

Running a large language model locally on a VPS is increasingly practical in 2026—but the hardware requirements vary wildly depending on what you want to run. The wrong plan turns into either a bill shock or a server that generates tokens slower than you can type.

This guide covers which model sizes need what hardware, where VRAM fits in, and which two providers offer the best value for most people exploring self-hosted AI.

The Real Constraint: VRAM, Then RAM

Before comparing providers, get the hardware math right. VRAM is the gating factor for GPU inference. If your model doesn’t fit in VRAM, it spills to system RAM and generation speed collapses to CPU speeds—or worse.

Rule of thumb for quantized models (Q4):

1-3B model: ~1-2 GB VRAM or RAM
7B model: ~4-6 GB VRAM
13B model: ~8-10 GB VRAM
34B model: ~20+ GB VRAM

For CPU-only inference, RAM replaces VRAM as the limit, but throughput is dramatically lower. Small models (1-3B) on CPU are usable. A 7B model on CPU will generate tokens at a speed that will test your patience. A 13B model on CPU is generally a last resort.

Model Size vs. Hardware: Quick Reference

Model Size	Min Hardware	Usable Speed?	Recommended Setup
1-3B (quantized)	4 GB RAM, any CPU	Yes, even on CPU	Hetzner CX22 or similar CPU VPS
7B (quantized)	6 GB VRAM or 16 GB RAM (slow)	GPU yes; CPU marginal	Vultr GPU (hourly)
13B (quantized)	10 GB VRAM	Yes on GPU	Vultr A100/L40S or similar
34B+ (quantized)	24+ GB VRAM	Needs serious GPU	Dedicated GPU server

Tokens-per-second benchmarks depend too much on specific hardware configurations to quote reliably here—check the Ollama community benchmarks thread for real user numbers on specific models and GPUs before committing to a plan.

Top Picks

Vultr — Best for GPU Instances (Hourly Billing)

Vultr’s Cloud GPU instances are the most accessible on-ramp for 7B+ model experimentation. The key advantage is hourly billing: spin up a GPU node, run your tests, pull it down. You’re not locked into a monthly commitment while you figure out what model size actually fits your workflow.

Vultr offers NVIDIA GPU instances across multiple tiers. Their global data center spread also means you can pick a region close to your users or yourself for lower latency.

Best for: Anyone who wants to experiment with 7B+ models without committing to monthly GPU costs upfront. Also solid for building a personal AI assistant with Open WebUI.

See our Ollama + Open WebUI setup guide for the exact steps to get running on a Vultr instance.

Hetzner — Best CPU Value for Small Models

Hetzner’s VPS lineup punches well above its price in RAM per euro, particularly their CX and CAX (Arm) series. For running small quantized models—think Phi-3 mini, Gemma 3 1B, or Qwen 1.5B—a Hetzner box with 8-16 GB RAM is genuinely capable and costs a fraction of comparable plans elsewhere.

Hetzner is CPU-only on their standard VPS tiers (dedicated servers with GPU add-ons exist but require more setup). That limits you to smaller models, but for a personal assistant, summarization tool, or coding helper running a 3B model, it’s a legitimate production setup.

Best for: Budget-conscious users running small models full-time, European users who need data residency in Germany or Finland, or anyone who wants a persistent always-on LLM without GPU costs.

What About DigitalOcean?

DigitalOcean has GPU Droplets but availability has historically been limited. Their CPU Droplets are solid but Hetzner offers better RAM density for the price in most configurations. Worth checking if you need DigitalOcean’s broader ecosystem (Managed DBs, App Platform, etc.) alongside your LLM work. See our Vultr vs DigitalOcean comparison for a deeper look.

Practical Cost Control Tips

Rent GPU hourly while testing. Don’t buy a monthly GPU plan until you’ve validated that your chosen model actually runs at acceptable speed on a given GPU tier. Vultr’s hourly billing exists precisely for this.

Quantize aggressively. A Q4_K_M quantized 7B model at ~4 GB VRAM gives you most of the quality of a full-precision model at a fraction of the hardware cost. Ollama handles quantized GGUF files natively.

Match model to use case. A 1-3B model running locally on a cheap CPU VPS is perfectly capable for summarization, simple Q&A, or code completion hints. You don’t need a 70B model for everything.

Next Steps

Once you’ve picked your VPS, the Ollama + Open WebUI self-hosting guide walks through the full setup: installing Ollama, pulling your first model, and wiring up a browser-based chat interface.

For general VPS selection beyond AI workloads, see our best VPS for self-hosting roundup.

The Real Constraint: VRAM, Then RAM

Model Size vs. Hardware: Quick Reference

Top Picks

Vultr — Best for GPU Instances (Hourly Billing)

Hetzner — Best CPU Value for Small Models

What About DigitalOcean?

Practical Cost Control Tips

Next Steps

Frequently asked questions

Related

Cheapest GPU VPS for AI in 2026: How to Pay Less

BandwagonHost vs Vultr (2026): China Access or Global Reach?

Best Control Panel for Self-Hosting in 2026: RunCloud, Coolify, CloudPanel & More