Prometheus, Loki, and Grafana: seeing everything at once

May 15, 2026

Tags: observability, monitoring, grafana, service-spotlight

The first time something silently broke in my homelab, I found out the hard way — a service had been down for hours and I only noticed because I tried to use it. No alert, no log, no warning. Just a quiet failure in the dark.

That’s the problem observability solves. Not the dramatic crash where your terminal fills with red text. The subtle stuff: the disk that fills up at 3am, the container that restarts six times before stabilizing, the network path that starts dropping packets two weeks before anything visibly breaks. A good monitoring stack sees all of it, whether you’re watching or not.

What we’re talking about: three tools, one picture

The stack covered here is the combination of Prometheus, Loki, and Grafana. They’re separate projects that happen to work extremely well together — each does one thing well, and together they give you a nearly complete picture of what’s happening across your infrastructure.

Prometheus is a time-series metrics database. It works by “scraping” — periodically pulling numbers from your hosts and services (CPU usage, memory, disk I/O, container stats, custom application metrics) and storing them with timestamps. Those numbers are then queryable with a purpose-built language called PromQL. Want to know the average memory usage of every host over the last six hours? One query. Want to alert when any disk exceeds 85% full? One rule.
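To make “scraping” concrete: each exporter serves its current numbers as plain text over HTTP, and Prometheus fetches and parses that text on every scrape. Here’s a minimal sketch of that parsing step in Python. The metric names are real node_exporter metrics, but the sample values and the `parse_exposition` helper are made up for illustration, and the parser only covers the simple cases (no escaped characters in label values, no timestamps).

```python
def parse_exposition(text):
    """Parse a small subset of the Prometheus text exposition format.

    Returns a list of (metric_name, labels, value) tuples. Simplified on
    purpose: label values may not contain commas or escaped quotes, and
    trailing timestamps are not supported.
    """
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank lines, HELP and TYPE comments
            continue
        series, value = line.rsplit(" ", 1)
        labels = {}
        if "{" in series:
            name, rest = series.split("{", 1)
            for pair in rest.rstrip("}").split(","):
                if pair:
                    key, val = pair.split("=", 1)
                    labels[key.strip()] = val.strip().strip('"')
        else:
            name = series
        samples.append((name, labels, float(value)))
    return samples


# Illustrative scrape body; the numbers are invented.
SCRAPE = """\
# HELP node_filesystem_avail_bytes Filesystem space available to non-root users in bytes.
# TYPE node_filesystem_avail_bytes gauge
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 2.5e+10
node_load1 0.42
"""

for name, labels, value in parse_exposition(SCRAPE):
    print(name, labels, value)
```

Prometheus attaches a timestamp to each parsed sample at scrape time and stores it; PromQL then works over those stored series.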

Loki does for logs what Prometheus does for metrics. Rather than indexing every word in every log line (which gets expensive fast), Loki stores log streams with lightweight labels and lets you query them with LogQL. It’s designed to be cheap to run and pairs naturally with Prometheus — if you’re already shipping metrics, adding log aggregation with Loki is mostly a configuration exercise.
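The “streams with lightweight labels” idea is easiest to see in the shape of the data a log shipper sends. Below is a rough sketch of a request body for Loki’s push endpoint (`/loki/api/v1/push`): one stream, identified by its labels, carrying a list of timestamped lines. The `build_loki_push` helper, the label values, and the log lines are assumptions for illustration; a real shipper also handles batching, retries, and authentication.

```python
import json
import time


def build_loki_push(labels, lines, now_ns=None):
    """Build a JSON-serializable body for Loki's push endpoint.

    Each stream is a label set plus [timestamp, line] pairs, where the
    timestamp is a string of Unix nanoseconds.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Offset each line by 1 ns so the entries keep their order.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_loki_push(
    {"host": "lab-node-1", "job": "systemd-journal"},  # hypothetical labels
    ["sshd: session opened", "sshd: session closed"],
    now_ns=1_700_000_000_000_000_000,
)
body = json.dumps(payload)

# A LogQL query selects streams by label, then filters lines by content.
LOGQL = '{host="lab-node-1", job="systemd-journal"} |= "session"'
```

Only the labels are indexed. The line content is searched at query time, which is where the cost savings come from.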

Grafana is the front end for both. It connects to Prometheus and Loki as data sources, and from there you build dashboards: panels that chart metrics over time, panels that stream live logs, alert status boards, anything you can query. The community maintains thousands of pre-built dashboards you can import with a single ID — there’s no reason to build a node metrics dashboard from scratch when a well-maintained one already exists.

Rounding out the stack: Alertmanager handles alert routing for Prometheus. You define alert rules in Prometheus (a host is down, disk is critical, a container keeps restarting); when a rule fires, Prometheus hands the alert to Alertmanager, which routes it wherever you want: push notification, email, a webhook. It also handles grouping, deduplication, and silencing so you don’t get spammed when one thing causes fifty alerts.
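The anti-spam behavior is mostly grouping: alerts that share chosen label values are bundled into one notification. The sketch below is a toy model of that idea in Python, not Alertmanager’s actual implementation. The `group_by` parameter mirrors the name of the routing option, while the alert names and instances are invented.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts that share the same values for the group_by labels.

    Simplified model: each bucket would become one notification instead
    of one notification per alert.
    """
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert["alertname"])
    return dict(groups)


# Hypothetical alerts: one dead host drags three symptoms with it.
alerts = [
    {"alertname": "HostDown", "instance": "nas"},
    {"alertname": "ProbeFailed", "instance": "nas"},
    {"alertname": "BackupStale", "instance": "nas"},
    {"alertname": "DiskAlmostFull", "instance": "media"},
]
notifications = group_alerts(alerts, group_by=["instance"])
print(len(alerts), "alerts ->", len(notifications), "notifications")
```

Four alerts collapse into two notifications here, one per affected instance.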

The problem it solves

Running infrastructure without monitoring is flying blind. You know what’s supposed to be running, but you don’t know the state of anything unless you go check. And checking is a manual process you’ll inevitably skip.

With a monitoring stack, the question flips: instead of you asking “is anything wrong?” and investigating by hand, the stack tells you when something is wrong, proactively, while there’s still time to fix it before a real outage.

There are a few specific problems this stack solves well:

Disk space creep. Logs, media, backups — things fill up. A single alert rule catches any volume approaching critical before it causes a crash.

Flapping services. A container that restarts twice a day is annoying but easy to miss. Container restart alerts surface this immediately, before a flapping service turns into a full failure.

Slow degradation. Memory leaks, growing latency, increasing error rates — these don’t look like incidents at first. But on a time-series chart they’re obvious: a line that’s been climbing for two weeks.

Post-incident diagnosis. When something does break, correlated logs and metrics let you answer “what actually happened?” much faster. You have timestamps, you have context, and you can replay events rather than reconstructing them from memory.

Uptime visibility. Active probing with Blackbox Exporter extends Prometheus to test HTTP endpoints and ICMP reachability — you get notified when a service stops responding from the network’s perspective, not just when the process exits.
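The disk space and slow degradation cases share one trick: fit a line through recent samples and ask when it crosses a limit. PromQL’s `predict_linear` function does this kind of linear-regression extrapolation for you. The Python below is my own sketch of the underlying arithmetic, with made-up sample data, rather than a copy of how Prometheus implements it.

```python
def seconds_until_full(samples, capacity):
    """Estimate seconds from the last sample until usage reaches capacity.

    samples: list of (unix_timestamp, used_bytes) pairs.
    Fits a least-squares line through the samples. Returns None when
    there is too little data or usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


# Invented data: usage grows by 10 units per hour on a 200-unit volume.
history = [(0, 100), (3600, 110), (7200, 120)]
eta = seconds_until_full(history, capacity=200)
print(f"full in about {eta / 3600:.1f} hours")
```

Alerting on the predicted time to full, rather than on a fixed percentage, is what turns a 3am outage into a daytime warning.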

The commercial alternative

If you’re running workloads in a cloud environment, you’ve probably encountered Datadog, New Relic, Splunk, or similar SaaS observability platforms. These tools are genuinely excellent — polished dashboards, automatic discovery, ML-powered anomaly detection, and enterprise support.

They’re also expensive at scale. Datadog’s pricing is per host per month, plus charges for log events and custom metrics. For a large production environment this can run into tens of thousands of dollars a month. For a homelab, or even a small business, those economics don’t make sense.

⚠️ Unverified: Specific Datadog or New Relic pricing figures. Treat these as qualitative — pricing varies by plan, contract, and usage tier. Check vendor sites for current numbers.

There are also cloud-native options: AWS CloudWatch, Azure Monitor, Google Cloud Operations. These work well if your entire stack lives in one cloud provider, but they don’t help much for hybrid or on-premises infrastructure, and they carry the same lock-in concerns as any cloud service.

Self-hosted options worth knowing

If you’ve decided to run your own monitoring, there are a few stacks worth considering. They have different tradeoffs:

Prometheus + Grafana (this stack): The most widely deployed open-source monitoring stack. Well-documented, enormous community, thousands of pre-built dashboards and exporters. The learning curve is real: PromQL takes time, and the pull-based scrape model works differently from push-based agents that ship data to a central server. But for a homelab or small infrastructure, this is the gold standard for a reason.

Zabbix: A mature, all-in-one monitoring platform. It does metrics, alerts, and some log collection in a single package. Historically heavier to operate than the Prometheus stack, and the UI is more “enterprise configuration tool” than “beautiful dashboard.” Good choice if you want one system that does everything without composition.

Netdata: Excellent for fast, per-host metrics with near-zero configuration. It auto-discovers almost everything on a host and produces real-time charts immediately. The tradeoff is that it’s primarily a per-node tool — correlating across many hosts or doing deep log analysis requires more work or the commercial tier.

The TIG stack (Telegraf + InfluxDB + Grafana): A push-based alternative to the Prometheus pull model. Telegraf is a general-purpose agent that collects and ships metrics; InfluxDB stores them with its own query language (Flux or InfluxQL). Some people prefer the push model because agents don’t require firewall rules to allow inbound scrapes. InfluxDB has also historically been easier to set up for high-cardinality workloads, though Prometheus has caught up considerably.

The reason to choose Prometheus + Grafana over the alternatives comes down to ecosystem breadth and community momentum. There are Prometheus exporters for almost everything — virtualization platforms, network devices, databases, container runtimes, even hardware. Whatever you’re running, someone has probably already written the exporter and published a dashboard for it. That leverage saves enormous amounts of time.

How it fits a homelab (high level)

In my setup, the stack runs as a Docker Compose deployment on a dedicated VM, one place where all monitoring data flows. Every other machine in the lab runs a small collection of agents, either exporters that Prometheus scrapes on a schedule or a log shipper that pushes to Loki:

node_exporter reports system-level metrics: CPU, memory, disk, network, load average. It’s a tiny stateless binary that runs as a systemd service.
Promtail ships log streams to Loki — systemd journal logs, application logs, whatever’s relevant per host.
cAdvisor (on Docker hosts) exposes container-level metrics: per-container CPU, memory, and network stats.

Ansible deploys and configures all of these across every host automatically. Adding a new machine to monitoring means adding it to the inventory and running the playbook — the agents are installed, configured, and registered as scrape targets in one pass.
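One way the “registered as scrape targets” step can work is Prometheus’s file-based service discovery: Prometheus watches a JSON file listing targets, and automation rewrites that file from the inventory. The sketch below assumes that approach and isn’t necessarily what the playbook described above does. The hostnames and the `env` label are hypothetical, and 9100 is node_exporter’s default port.

```python
import json


def file_sd_targets(hosts, port=9100, labels=None):
    """Render an inventory as a Prometheus file_sd target list.

    The file_sd format is a JSON list of groups, each with a "targets"
    list of host:port strings and an optional "labels" mapping.
    """
    return [
        {
            "targets": [f"{host}:{port}" for host in sorted(hosts)],
            "labels": dict(labels or {}),
        }
    ]


inventory = ["nas.lab", "docker1.lab", "pve1.lab"]  # hypothetical hosts
groups = file_sd_targets(inventory, labels={"env": "homelab"})
print(json.dumps(groups, indent=2))
```

Adding a machine then means adding one hostname to the inventory and regenerating the file. Prometheus reloads file_sd files when they change, so no restart is needed.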

For the virtualization layer, a dedicated Proxmox exporter collects cluster and VM-level metrics — CPU and memory allocation per VM, storage pool usage, migration events — and feeds them into the same Prometheus instance.

Blackbox Exporter handles the “is it reachable?” question by actively probing services via HTTP and ICMP on a schedule. This catches a different class of failure than passive metrics do: a service can appear healthy to its own metrics while being unreachable from the network because of a firewall change, a DNS misconfiguration, or a proxy failure.
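Conceptually, an HTTP probe is just a timed request from the outside with a verdict attached. Here’s a bare-bones Python sketch of that idea using only the standard library. It’s an illustration of the concept, not how Blackbox Exporter is implemented, and the `probe_http` name and result fields are my own.

```python
import http.client
import time
import urllib.error
import urllib.request

# Probe directly, ignoring any proxy settings in the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_http(url, timeout=5.0):
    """Send one GET request and report reachability, status, and duration."""
    start = time.monotonic()
    try:
        with _opener.open(url, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        status = None  # DNS failure, refused connection, timeout, bad reply
    return {
        "up": status is not None and 200 <= status < 300,
        "status": status,
        "seconds": time.monotonic() - start,
    }
```

Calling `probe_http("http://service.lab:8080/health")` (a hypothetical URL) returns a small dict you could log or alert on. The real exporter publishes comparable results as metrics that Prometheus scrapes.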

Alertmanager receives firing alerts from Prometheus and routes the critical ones as push notifications in real time. The routing setup includes grouping and deduplication: if five things go wrong in sequence from one cause, you get one notification, not five.

Grafana ties it together with SSO authentication, imported community dashboards for each component type, and a custom overview dashboard for the lab as a whole. The community dashboards cover the common cases well; the custom work is mainly about having one place to see the overall health picture at a glance.

Retention is a deliberate choice: metrics for 30 days, logs for two weeks. That covers the “what happened last Tuesday?” post-incident window while keeping the storage footprint predictable.
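For the metrics side, the footprint can be estimated with the rule of thumb from the Prometheus documentation: retention time in seconds, times ingested samples per second, times bytes per sample (typically 1 to 2 bytes after compression). The sketch below plugs in the 30-day retention from this setup; the series count and scrape interval are assumptions picked for illustration, not measurements from my lab.

```python
def prometheus_disk_bytes(retention_days, active_series, scrape_interval_s,
                          bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds * samples/second * bytes/sample."""
    samples_per_second = active_series / scrape_interval_s
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


# Assumed workload: 20,000 active series scraped every 15 seconds.
estimate = prometheus_disk_bytes(
    retention_days=30, active_series=20_000, scrape_interval_s=15
)
print(f"about {estimate / 1e9:.1f} GB")  # about 6.9 GB
```

Doubling retention or halving the scrape interval doubles the estimate, which makes the trade-off easy to reason about before you commit disk to it.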

Who should run this stack

Run it if:

You have more than a handful of services and want to stop finding out things are broken by trying to use them.
You want to understand how your hardware is actually being used — which hosts have headroom, which are consistently stressed.
You enjoy learning PromQL and LogQL. They’re genuinely powerful, and knowing them has broad applicability beyond this stack.
You want the same observability primitives production engineers use, in your own home.

Maybe think twice if:

You’re just running two or three services and a single monitoring dashboard would do the job. Something like Uptime Kuma (which is part of this lab too, as a lightweight availability monitor) might be the right starting point.
The ops overhead feels like a burden rather than a learning opportunity. A monitoring stack that breaks while nobody investigates is worse than no monitoring stack, because it creates false confidence.
You’re primarily running managed cloud services where vendor-native monitoring already exists and works well.

The honest answer is that this stack has meaningful complexity. Prometheus’s pull model requires your hosts to be reachable on their scrape ports. Loki’s label schema requires upfront thought to avoid making queries painful later. Alertmanager’s routing configuration is expressive but not simple. None of this is insurmountable, but it’s real work.
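The Loki label point deserves a number. Every unique combination of label values is its own stream, so the worst-case stream count is the product of how many values each label can take. A quick Python sketch, with hypothetical label names and counts:

```python
from math import prod


def worst_case_streams(label_values):
    """Upper bound on streams: one per unique combination of label values."""
    return prod(len(values) for values in label_values.values())


# Hypothetical schema: 10 hosts and 3 jobs.
sane = {"host": range(10), "job": ["journal", "docker", "nginx"]}
# Same schema plus a per-request label with 1,000 distinct values.
risky = dict(sane, request_id=range(1000))

print(worst_case_streams(sane))   # 30
print(worst_case_streams(risky))  # 30000
```

One unbounded label multiplies everything else, which is why values like request IDs are better left in the log line and filtered at query time.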

Closing thought

There’s something quietly satisfying about having a wall of dashboards that shows you, at a glance, that everything is fine. Disks healthy, services up, no containers flapping, memory holding steady. It turns “I hope nothing’s wrong” into “I can see nothing is wrong.”

More than that, it builds understanding. After a few months of watching your own metrics, you develop an intuition for what “normal” looks like — which makes “abnormal” obvious. That intuition is hard to get any other way, and it transfers everywhere: to cloud platforms, to production on-call, to debugging infrastructure you’ve never seen before.

The stack is free. The time investment is real. For a homelab that’s meant to be a learning environment, that trade-off is almost always worth it.
