#Observability
The Prometheus ecosystem ships an exporter for nearly everything — node, kube-state, blackbox, redis, you name it. But sooner or later you’ll hit something the world hasn’t written for you yet: an internal API, a vendor’s HTTP endpoint, a CSV file on a server, a binary’s --stats flag.
At Convotis I needed a Longhorn exporter that exposed metrics the upstream chart wasn’t surfacing — replica health per volume, snapshot churn, and backup latency to S3. I wrote it in Go, shipped it in 200 lines, and it ran across 51 clusters without ever paging me. Here’s the boilerplate I keep coming back to.
A Prometheus exporter is just an HTTP server. Prometheus scrapes GET /metrics, and your exporter answers in the text exposition format. The prometheus/client_golang library does the formatting; you write the collector.
Two ways to expose metrics:
Direct instrumentation: call counter.Inc() from your app code. Best when your service produces the data itself.
Custom collector (the pull model): implement prometheus.Collector and gather metrics on every scrape. Best for exporters that wrap an external system you have to query.
The pull model is what you want for an exporter, because scraping the upstream system on a fixed timer and caching is a footgun: Prometheus’s own scrape interval already paces you.
go mod init github.com/abdelfattah-hilmi/longhorn-exporter
go get github.com/prometheus/client_golang
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// longhornCollector implements prometheus.Collector on top of the Longhorn API client.
type longhornCollector struct {
	client *LonghornClient

	volumeHealthy   *prometheus.Desc
	replicaCount    *prometheus.Desc
	backupLatencyMs *prometheus.Desc
	scrapeErrors    prometheus.Counter
}

func newLonghornCollector(c *LonghornClient) *longhornCollector {
	return &longhornCollector{
		client: c,
		volumeHealthy: prometheus.NewDesc(
			"longhorn_volume_healthy",
			"1 if the volume is in 'healthy' state, 0 otherwise.",
			[]string{"volume", "namespace"}, nil,
		),
		replicaCount: prometheus.NewDesc(
			"longhorn_volume_replicas",
			"Number of replicas currently attached to a volume.",
			[]string{"volume", "namespace"}, nil,
		),
		backupLatencyMs: prometheus.NewDesc(
			"longhorn_backup_last_latency_ms",
			"Latency in ms of the last backup to remote storage.",
			[]string{"volume", "target"}, nil,
		),
		scrapeErrors: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "longhorn_exporter_scrape_errors_total",
			Help: "Number of failed scrapes against the Longhorn API.",
		}),
	}
}

// Describe sends every descriptor this collector can emit.
func (c *longhornCollector) Describe(ch chan<- *prometheus.Desc) {
	ch <- c.volumeHealthy
	ch <- c.replicaCount
	ch <- c.backupLatencyMs
	c.scrapeErrors.Describe(ch)
}

// Collect queries Longhorn on every scrape and emits one sample per volume.
func (c *longhornCollector) Collect(ch chan<- prometheus.Metric) {
	vols, err := c.client.ListVolumes()
	if err != nil {
		// Still report the error counter so failed scrapes are visible.
		c.scrapeErrors.Inc()
		log.Printf("scrape failed: %v", err)
		c.scrapeErrors.Collect(ch)
		return
	}
	for _, v := range vols {
		healthy := 0.0
		if v.State == "healthy" {
			healthy = 1
		}
		ch <- prometheus.MustNewConstMetric(
			c.volumeHealthy, prometheus.GaugeValue, healthy,
			v.Name, v.Namespace,
		)
		ch <- prometheus.MustNewConstMetric(
			c.replicaCount, prometheus.GaugeValue, float64(len(v.Replicas)),
			v.Name, v.Namespace,
		)
		// Volumes that have never been backed up have no latency to report.
		if v.LastBackup != nil {
			ch <- prometheus.MustNewConstMetric(
				c.backupLatencyMs, prometheus.GaugeValue,
				float64(v.LastBackup.LatencyMs),
				v.Name, v.LastBackup.Target,
			)
		}
	}
	c.scrapeErrors.Collect(ch)
}

func main() {
	// A dedicated registry keeps the default Go runtime metrics out of /metrics.
	reg := prometheus.NewRegistry()
	reg.MustRegister(newLonghornCollector(NewLonghornClient()))

	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	log.Println("listening on :9265")
	log.Fatal(http.ListenAndServe(":9265", nil))
}
That’s a complete exporter, apart from the LonghornClient type that wraps the Longhorn API. Run it, hit http://localhost:9265/metrics, and Prometheus will scrape it happily.
Collect()
Collect() runs on every scrape, typically every 15 to 60 seconds. If your upstream API takes longer to answer than the scrape interval, say 6 seconds against a 5-second interval, scrapes overlap and goroutines pile up. Two options:
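// Needs "sync" and "time" in the import block.
type cache struct {
	mu      sync.RWMutex
	data    []Volume
	fetched time.Time
}

// get returns the cached volumes if they are younger than maxAge and
// otherwise calls fresh. If fresh fails, it serves the stale copy.
func (c *cache) get(maxAge time.Duration, fresh func() ([]Volume, error)) []Volume {
	// Fast path: a read lock is enough while the cache is fresh.
	c.mu.RLock()
	if time.Since(c.fetched) < maxAge {
		defer c.mu.RUnlock()
		return c.data
	}
	c.mu.RUnlock()

	c.mu.Lock()
	defer c.mu.Unlock()
	// Re-check: another scrape may have refreshed while we waited for the lock.
	if time.Since(c.fetched) < maxAge {
		return c.data
	}
	v, err := fresh()
	if err != nil {
		return c.data // serve stale
	}
	c.data, c.fetched = v, time.Now()
	return v
}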
MustNewConstMetric with user-supplied labels
MustNewConstMetric panics when the label values don’t match the descriptor, for example a wrong number of values or a value that isn’t valid UTF-8. If your label values come from an external system, such as Kubernetes resources or third-party APIs, use NewConstMetric instead and log the error. A panic in Collect() kills the whole scrape, not just the bad metric.
Every unique combination of label values creates a new time series. A label like pod_name in a 1000-pod cluster gives you 1000 series per metric. Multiply by 10 metrics — you’ve just created 10,000 series. Multiply by your retention — you’re holding gigabytes of TSDB blocks.
The rule of thumb: never put unbounded values in labels. UUIDs, timestamps, full URL paths — these will eat your cluster. Bucket them or drop them.
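A short sketch of both points: the series arithmetic from above, and one way to bucket an unbounded value before it becomes a label. The helper names normalizePath and seriesCount are hypothetical, and the ":id" placeholder is just one possible convention.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// idSegment matches path segments that look like numeric IDs or UUIDs.
var idSegment = regexp.MustCompile(
	`^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$`)

// normalizePath drops the query string and replaces ID-like segments with
// ":id", so the result has a bounded set of values and is safe as a label.
func normalizePath(p string) string {
	if i := strings.IndexByte(p, '?'); i >= 0 {
		p = p[:i]
	}
	parts := strings.Split(p, "/")
	for i, s := range parts {
		if idSegment.MatchString(s) {
			parts[i] = ":id"
		}
	}
	return strings.Join(parts, "/")
}

// seriesCount is the worst case: one series per combination of label
// values, for every metric that carries those labels.
func seriesCount(metrics int, labelValues ...int) int {
	n := metrics
	for _, v := range labelValues {
		n *= v
	}
	return n
}

func main() {
	fmt.Println(seriesCount(10, 1000))                                          // 10000
	fmt.Println(normalizePath("/api/v1/volumes/42/snapshots?limit=10"))         // /api/v1/volumes/:id/snapshots
	fmt.Println(normalizePath("/backups/3f2b8c1e-9d4a-4c6b-8f1e-2a7d5b9c0e11")) // /backups/:id
}
```

Adding a second label multiplies rather than adds: seriesCount(10, 1000, 5) is 50,000 series.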
_up and scrape error counters
Every exporter should answer two questions Prometheus can alert on:
longhorn_up{} 1
longhorn_exporter_scrape_errors_total{}
The first is trivial:
upDesc := prometheus.NewDesc("longhorn_up", "1 if Longhorn API is reachable, 0 otherwise.", nil, nil)
// ...
up := 1.0
if _, err := c.client.Ping(); err != nil {
	up = 0
}
ch <- prometheus.MustNewConstMetric(upDesc, prometheus.GaugeValue, up)
Now a longhorn_up == 0 for 5m alert tells you the exporter is alive but the upstream is sick — far more actionable than “the scrape timed out.”
A tiny Dockerfile:
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/exporter .
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /out/exporter /exporter
USER nonroot:nonroot
EXPOSE 9265
ENTRYPOINT ["/exporter"]
A 4 MB image, no shell, no package manager. The kind of thing security scanners stop yelling about.
For the Prometheus side, drop a ServiceMonitor (kube-prometheus-stack) or scrape config:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-exporter
  namespace: longhorn-system
spec:
  selector:
    matchLabels: { app: longhorn-exporter }
  endpoints:
    - port: metrics
      interval: 30s
      scrapeTimeout: 10s
The temptation when “you need metrics” is to reach for a fat agent or a SaaS sidecar. Don’t. A 200-line Go binary + the Prometheus client library is durable, debuggable, and one kubectl logs away from telling you exactly what went wrong. The best exporter is one a new team member can read top-to-bottom in 5 minutes.
— Abdel