Building a Prometheus exporter in Go — from zero to /metrics

Abdelfattah Hilmi · Nov 12, 2025

#Prometheus #Go #Observability #Kubernetes

Why you’ll eventually write one

The Prometheus ecosystem ships an exporter for nearly everything — node, kube-state, blackbox, redis, you name it. But sooner or later you’ll hit something the world hasn’t written for you yet: an internal API, a vendor’s HTTP endpoint, a CSV file on a server, a binary’s --stats flag.

At Convotis I needed a Longhorn exporter that exposed metrics the upstream chart wasn’t surfacing — replica health per volume, snapshot churn, and backup latency to S3. I wrote it in Go, shipped it in 200 lines, and it ran across 51 clusters without ever paging me. Here’s the boilerplate I keep coming back to.

The mental model

A Prometheus exporter is just an HTTP server. Prometheus sends GET /metrics on every scrape, and your exporter answers in the text exposition format. The prometheus/client_golang library does the formatting; you write the collector.

Two ways to expose metrics:

Direct instrumentation: you call counter.Inc() from your app code. Best when your service produces the data itself.
Custom collector: you implement prometheus.Collector and gather metrics on every scrape. Best for exporters that wrap an external system you have to query.

Both are served from the same /metrics endpoint that Prometheus pulls; the difference is when the values get computed. The custom collector is what you want for an exporter, because polling the upstream system on your own timer and caching the result is a footgun: Prometheus’s own scrape interval already paces you.

The skeleton

go mod init github.com/abdelfattah-hilmi/longhorn-exporter
go get github.com/prometheus/client_golang

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

type longhornCollector struct {
    client *LonghornClient

    volumeHealthy   *prometheus.Desc
    replicaCount    *prometheus.Desc
    backupLatencyMs *prometheus.Desc
    scrapeErrors    prometheus.Counter
}

func newLonghornCollector(c *LonghornClient) *longhornCollector {
    return &longhornCollector{
        client: c,
        volumeHealthy: prometheus.NewDesc(
            "longhorn_volume_healthy",
            "1 if the volume is in 'healthy' state, 0 otherwise.",
            []string{"volume", "namespace"}, nil,
        ),
        replicaCount: prometheus.NewDesc(
            "longhorn_volume_replicas",
            "Number of replicas currently attached to a volume.",
            []string{"volume", "namespace"}, nil,
        ),
        backupLatencyMs: prometheus.NewDesc(
            "longhorn_backup_last_latency_ms",
            "Latency in ms of the last backup to remote storage.",
            []string{"volume", "target"}, nil,
        ),
        scrapeErrors: prometheus.NewCounter(prometheus.CounterOpts{
            Name: "longhorn_exporter_scrape_errors_total",
            Help: "Number of failed scrapes against the Longhorn API.",
        }),
    }
}

func (c *longhornCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- c.volumeHealthy
    ch <- c.replicaCount
    ch <- c.backupLatencyMs
    c.scrapeErrors.Describe(ch)
}

func (c *longhornCollector) Collect(ch chan<- prometheus.Metric) {
    vols, err := c.client.ListVolumes()
    if err != nil {
        c.scrapeErrors.Inc()
        log.Printf("scrape failed: %v", err)
        c.scrapeErrors.Collect(ch)
        return
    }

    for _, v := range vols {
        healthy := 0.0
        if v.State == "healthy" {
            healthy = 1
        }
        ch <- prometheus.MustNewConstMetric(
            c.volumeHealthy, prometheus.GaugeValue, healthy,
            v.Name, v.Namespace,
        )
        ch <- prometheus.MustNewConstMetric(
            c.replicaCount, prometheus.GaugeValue, float64(len(v.Replicas)),
            v.Name, v.Namespace,
        )
        if v.LastBackup != nil {
            ch <- prometheus.MustNewConstMetric(
                c.backupLatencyMs, prometheus.GaugeValue,
                float64(v.LastBackup.LatencyMs),
                v.Name, v.LastBackup.Target,
            )
        }
    }

    c.scrapeErrors.Collect(ch)
}

func main() {
    reg := prometheus.NewRegistry()
    reg.MustRegister(newLonghornCollector(NewLonghornClient()))

    http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
    http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("ok"))
    })
    log.Println("listening on :9265")
    log.Fatal(http.ListenAndServe(":9265", nil))
}

That’s the whole exporter, minus the LonghornClient that wraps the Longhorn API. Run it, hit http://localhost:9265/metrics, and you’ll see exactly what Prometheus will scrape.

The four rules I learned the hard way

1. Never block in `Collect()`

Collect() runs on every scrape, typically every 15 to 60 seconds. If your upstream API takes 4 seconds and a scrape comes in every 5, you’re one slow response away from timed-out scrapes, and Collect() calls that outlive their scrape pile up inside the exporter. Two options:

Set an HTTP client timeout shorter than your scrape interval.
For genuinely slow sources, cache the last successful result with a TTL and serve stale rather than block:

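import (
    "sync"
    "time"
)

// cache keeps the last successful result so a slow or failing upstream
// does not stall every scrape.
type cache struct {
    mu      sync.RWMutex
    data    []Volume
    fetched time.Time
}

// get returns the cached data while it is younger than maxAge, otherwise it
// calls fresh. If fresh fails, the stale data is returned (nil before the
// first successful fetch).
func (c *cache) get(maxAge time.Duration, fresh func() ([]Volume, error)) []Volume {
    // Fast path: a read lock is enough while the cache is still fresh.
    c.mu.RLock()
    data, age := c.data, time.Since(c.fetched)
    c.mu.RUnlock()
    if age < maxAge {
        return data
    }

    // Slow path: concurrent callers queue on the write lock, so only one
    // of them hits the upstream.
    c.mu.Lock()
    defer c.mu.Unlock()
    if time.Since(c.fetched) < maxAge {
        return c.data // another caller refreshed while we waited
    }
    v, err := fresh()
    if err != nil {
        return c.data // serve stale
    }
    c.data, c.fetched = v, time.Now()
    return v
}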

2. Never use `MustNewConstMetric` with user-supplied labels

MustNewConstMetric panics when the number of label values doesn’t match the Desc, or when a value isn’t valid UTF-8. If your label values come from an external system (Kubernetes resources, third-party APIs), use NewConstMetric instead and log the error. A panic in Collect() takes out the whole scrape, and potentially the exporter process, not just the bad metric.

3. Label cardinality is a footgun

Every unique combination of label values creates a new time series. A label like pod_name in a 1000-pod cluster gives you 1000 series per metric. Multiply by 10 metrics and you’ve just created 10,000 series. Every replaced pod then adds fresh series on top, and retention keeps the dead ones around, so you end up holding gigabytes of TSDB blocks.

The rule of thumb: never put unbounded values in labels. UUIDs, timestamps, full URL paths — these will eat your cluster. Bucket them or drop them.
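Bucketing can be as simple as normalizing the value before it becomes a label. Here is a sketch with two hypothetical helpers: routeLabel collapses ID-like path segments, and statusClass reduces status codes to five classes. The ID pattern is an assumption; adjust it to whatever your paths actually contain.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// idSegment matches path segments that look like numeric IDs or UUIDs.
var idSegment = regexp.MustCompile(`^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$`)

// routeLabel drops the query string and replaces ID-like segments with ":id".
func routeLabel(path string) string {
	if i := strings.IndexByte(path, '?'); i >= 0 {
		path = path[:i]
	}
	parts := strings.Split(path, "/")
	for i, p := range parts {
		if idSegment.MatchString(p) {
			parts[i] = ":id"
		}
	}
	return strings.Join(parts, "/")
}

// statusClass buckets an HTTP status code into "2xx", "4xx", and so on.
func statusClass(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	return strconv.Itoa(code/100) + "xx"
}

func main() {
	fmt.Println(routeLabel("/v1/volumes/1234/snapshots?limit=10"))
	fmt.Println(routeLabel("/v1/backups/3f2b8c1e-9d4a-4c6b-8e2f-1a2b3c4d5e6f"))
	fmt.Println(statusClass(404))
}
```

This only bounds the segments the pattern recognizes. If arbitrary strings can show up in a path, an allowlist of known routes with an "other" fallback is the safer choice.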

4. Expose `_up` and scrape error counters

Every exporter should answer two questions Prometheus can alert on:

Did your last scrape succeed? — longhorn_up{} 1
How many scrapes have failed historically? — longhorn_exporter_scrape_errors_total{}

The first is trivial:

// Built once, next to the other descriptors (and sent from Describe too):
upDesc := prometheus.NewDesc("longhorn_up", "1 if Longhorn API is reachable, 0 otherwise.", nil, nil)
// ...
// Inside Collect():
up := 1.0
if _, err := c.client.Ping(); err != nil {
    up = 0
}
ch <- prometheus.MustNewConstMetric(upDesc, prometheus.GaugeValue, up)

Now a longhorn_up == 0 for 5m alert tells you the exporter is alive but the upstream is sick — far more actionable than “the scrape timed out.”

Packaging it for Kubernetes

A tiny Dockerfile:

FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/exporter .

FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /out/exporter /exporter
USER nonroot:nonroot
EXPOSE 9265
ENTRYPOINT ["/exporter"]

A 4 MB image, no shell, no package manager. The kind of thing security scanners stop yelling about.

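For the Prometheus side, add a ServiceMonitor (if you run kube-prometheus-stack) or a plain scrape config. The ServiceMonitor below assumes a Service labelled app: longhorn-exporter that exposes a port named metrics: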

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-exporter
  namespace: longhorn-system
spec:
  selector:
    matchLabels: { app: longhorn-exporter }
  endpoints:
    - port: metrics
      interval: 30s
      scrapeTimeout: 10s

Closing thought

The temptation when “you need metrics” is to reach for a fat agent or a SaaS sidecar. Don’t. A 200-line Go binary + the Prometheus client library is durable, debuggable, and one kubectl logs away from telling you exactly what went wrong. The best exporter is one a new team member can read top-to-bottom in 5 minutes.

— Abdel