LiteLLM: one gateway for every AI model

May 23, 2026

ailitellmservice-spotlight

If you’ve started experimenting with AI in your homelab, you’ve probably hit the same wall: every model speaks a slightly different dialect. OpenAI has its API, Anthropic has theirs, Google has another, and your locally-hosted model has its own endpoint and quirks. Connecting a tool to one of them is fine. Connecting a tool to all of them, switching between them, or swapping one out when it goes down — that gets complicated fast.

LiteLLM is the fix. It sits in front of all your models and speaks one language to the rest of your stack.

What is LiteLLM, in plain language?

LiteLLM is an open-source proxy that presents a single OpenAI-compatible API endpoint, no matter what models are actually behind it. You point your apps at LiteLLM. LiteLLM handles routing to the right backend — whether that’s Anthropic’s Claude, Google’s Gemini, a locally-running model, or a cloud inference service.

From the outside, every model looks the same. Your app sends an OpenAI-style chat completion request, and LiteLLM translates it into whatever format the target model actually needs.

It also adds a layer of management that raw API calls don’t give you: virtual keys for different clients, usage tracking, spend limits, model fallbacks, and a web dashboard to see what’s happening.

The problem it actually solves

Here’s the situation without a gateway. You have an automation workflow (say, email triage). You hardcode it to call a local model. Then that local model gets slow, or the server runs out of memory, or you want to try a different model. Now you’re hunting through your workflow, updating credentials and endpoints, hoping you didn’t miss anything. Multiply that across five different tools — a code assistant, a chat interface, several automations — and you have a maintenance headache.

Or consider the key management side. Every model provider gives you an API key. That key goes in your tool’s configuration, or worse, gets pasted into multiple places. When you need to rotate it — or when a key leaks — you’re updating every tool that holds a copy.

A gateway solves both problems. The tools only ever know about one endpoint and one key (the gateway’s key). Backend credentials live in one place. Swapping the underlying model from “local Qwen” to “Gemini Flash” is a one-line config change in the gateway. The tools don’t know or care.

The commercial alternatives

Before we get into self-hosted options, it’s worth knowing what exists in the managed world.

Calling providers directly is always an option, and for simple setups it’s completely reasonable. OpenAI’s API is extremely well-documented, and most tools have native integrations for it. The downside is everything above: per-provider keys, no unified logging, no easy switching, and every tool that calls a different provider needs its own credential configuration.

OpenRouter is a managed multi-model gateway service. You get one API key, they handle routing to hundreds of models from dozens of providers, and you pay per-token with their markup on top. It’s genuinely useful if you want model variety without running infrastructure. The tradeoff is that your requests pass through their servers, their pricing adds a layer on top of provider rates, and you’re dependent on their availability and model catalog decisions. For privacy-sensitive workloads, that pass-through is the dealbreaker.

AWS Bedrock, Azure OpenAI, Google Vertex AI are enterprise-tier managed gateways. They offer compliance features, private networking, and deep integration with their respective clouds. They’re not aimed at hobbyists, and the cost structure reflects that. But it’s worth knowing they exist: if you’re already deep in one of those ecosystems for other reasons, they might save you the operational overhead of running LiteLLM yourself.

Portkey, Helicone, and similar SaaS observability layers are another category: they sit in front of your LLM calls and add logging, cost tracking, and caching, without replacing the provider’s endpoint. They’re useful, but they’re yet another external service that sees your traffic. LiteLLM’s built-in logging covers most of the same ground if you’re already running the proxy.

Why self-host the gateway?

The honest answer is: you don’t have to. If OpenRouter covers your use case and you’re comfortable with the tradeoff, use it. But there are a few reasons to run your own:

Your traffic stays on your network. When your AI assistant is summarizing your email or helping triage your documents, those contents are passing through the gateway. A self-hosted gateway means that traffic never leaves your infrastructure.

You control the model catalog. Want to add a locally-running model alongside cloud APIs? Done. Want to retire a model without waiting for a provider to deprecate it? Done. You’re not constrained by what a managed service decided to offer this week.

Unified observability. With LiteLLM writing to a database and optionally integrating with tools like Langfuse, you get a single view of every request, latency, token count, and cost across all your models. That’s useful for understanding what your automations are actually doing and what they’re costing.

Virtual keys for isolation. Different tools and services can get their own keys, each scoped to specific models and with spend limits. If one automation starts hammering a model, you can see it and cap it. If you want to give a service access to cheap/fast models but not expensive ones, you configure that at the gateway.

Self-hosted gateway options worth considering

LiteLLM isn’t the only choice if you want to run your own.

LiteLLM is the most widely deployed open-source option. It supports a large number of providers and models, has a solid dashboard, virtual key management, and can back itself with a database for persistence. The proxy server is the free part; there’s also a paid enterprise tier, but the open-source version is genuinely capable.

litellm.js / llm-proxy alternatives: Several smaller projects have emerged in this space, but they tend to lag on provider support and lack the battle-tested stability of LiteLLM for production use.

⚠️ Unverified: Specific alternative proxy projects and their current feature parity relative to LiteLLM — this is a fast-moving space and any detailed comparison could be stale within months.

Rolling your own is tempting, and for a single-model setup it’s genuinely simple. An nginx proxy plus a small translation script works. But once you’re dealing with multiple providers, token counting, key management, and fallback logic, you’ve basically rebuilt a limited version of what LiteLLM already does. Start with the open-source tool.

How it fits a homelab — the actual experience

In this lab, LiteLLM runs as a Docker container alongside other AI-stack services. All the services that need to call a language model — the chat interface, the automation platform, the AI coding assistant — talk to LiteLLM and only LiteLLM. They each have their own virtual key.

The model catalog has shifted considerably over time. Early on, most inference ran locally on a dedicated server with significant RAM allocated. Local CPU inference on large models is slow — think single-digit tokens per second for a 30 billion parameter model. That’s workable for non-interactive tasks like email triage, where you submit a batch and wait. For anything interactive, it’s frustrating.

Eventually the routing shifted: locally-running models are off, and everything flows through a cloud inference service behind the same LiteLLM aliases. The tools didn’t change. The automations didn’t change. Just the config.yaml in LiteLLM.

That’s the whole point. The gateway decouples what model from where it runs. When local inference becomes practical again — a GPU shows up, or a fast enough CPU model exists — flipping back is a one-file change.

The model aliases matter here. If your email triage workflow hardcodes qwen-30b as the model name, you can quietly point that alias to a faster cloud model or a better local one without touching the workflow. “qwen-30b” becomes a logical name, not a physical one.

One hard-learned lesson: API key hygiene matters a lot when a gateway is involved. A key leak in a gateway is worse than a key leak in a single app, because the gateway key grants access to everything behind it. Keep gateway credentials in a secrets manager, not in environment variable files you might accidentally commit. Rotate keys when backends change. The gateway makes key management easier, but only if you use the virtual-key system properly — hand each client service its own scoped key, not the master key.

It’s also worth noting that LiteLLM itself takes a non-trivial amount of time to start up. On the first launch after a restart it runs database migrations before it begins accepting traffic, which means a hard restart is a brief outage for anything depending on it. For a homelab this is usually fine — a minute of downtime while an automation waits is acceptable. In a more critical setup you’d want a health check on anything that depends on it, so it doesn’t start sending requests before the gateway is ready.

Another lesson: logging and observability at the gateway catches problems you’d otherwise miss. A local model server can hang silently — requests pile up, timeouts cascade, and your automation just looks “slow” for days before anyone notices. With gateway-level logging and a health check, you can see latency spike and respond before it becomes an outage.

Should you bother?

Run LiteLLM if:

You use more than one AI model or provider.
Multiple tools or automations call AI APIs.
You want to swap models without touching every client.
You care about logging and cost visibility across your AI usage.
You’re running local models alongside cloud APIs.

Skip it if:

You have exactly one tool calling exactly one model. The overhead isn’t worth it.
You’re just experimenting with a single provider’s API. Go direct; add the gateway when you outgrow it.
You need zero-infrastructure simplicity. OpenRouter is a reasonable managed alternative for that case.

The short version

LiteLLM is a thin layer that punches above its weight. It does one thing — present a consistent API in front of messy, diverse model backends — and it does it well. For a homelab that’s getting serious about AI automation, it’s the piece that keeps everything else from becoming a tangled mess of hardcoded endpoints and scattered API keys.

The real payoff is what you don’t have to do: you don’t update six tools when you switch models, you don’t hunt for credentials across a dozen config files, and you don’t lose visibility into what’s running up your bill. That’s exactly the kind of boring infrastructure win that makes a homelab feel like it’s actually designed rather than just accumulated.

← all posts