sharpbyte.dev
← LLMs
LLMs · topic 10 of 11

4 ways to run LLMs locally

Run models on your laptop for privacy, offline demos, and fast prompt iteration—without sending every test to the cloud.

Why bother running locally?

Cloud APIs are convenient, but local inference has real advantages.

Your prompts and documents stay on your machine—useful for NDAs, healthcare, and regulated data. You avoid network latency while iterating on prompts, and you can hack on a plane or in a locked-down network.

Local does not mean “free unlimited quality.” You are limited by RAM, GPU VRAM, and which quantized model sizes actually fit.

Ollama — fastest way to get a model running

Install Ollama, then run something like ollama run llama3 (exact model names change over time). Models download like container images; you get a CLI chat loop plus HTTP and Python APIs for scripting.

Install → pull a model → chat in terminal or call the local API.
Install → pull a model → chat in terminal or call the local API.

LM Studio — GUI for exploring weights

Desktop app with a ChatGPT-like interface. Browse Hugging Face models, load and unload them, tweak generation settings visually. Great when designers or PMs want to try a model without touching a terminal.

vLLM — local server that feels like production

Python library optimized for throughput. Spin up an OpenAI-compatible HTTP server on localhost so your existing SDK code points at http://127.0.0.1:8000 instead of a cloud endpoint—useful for integration tests.

llama.cpp — lean runtime for CPU and edge

C++ inference engine with strong quantized GGUF support. Runs on machines without a discrete GPU; popular for Raspberry Pi–class devices when you accept smaller models.

Pick tooling based on whether you want CLI speed, GUI exploration, or production-like serving.
Pick tooling based on whether you want CLI speed, GUI exploration, or production-like serving.

Practical tips for beginners

Start with a smaller quantized model that fits your RAM, then upgrade if answers lack quality. Match context length to your documents—longer context costs memory linearly. When you outgrow local hardware, the same prompts often port to cloud APIs with minimal changes.

Key takeaways

  • Local runs protect privacy and speed up prompt experiments.
  • Ollama/LM Studio favor ease of use; vLLM/llama.cpp favor throughput and control.
  • Quantization and model size determine whether your machine can keep up.