The Three Pillars of Observability: Completing the Picture with Distributed Tracing
You have a mesh running in production. Peers in three regions. Shared S3 storage that your services depend on. Everything was fine yesterday. Today, uploads are slow and you don't know why.
Where do you even start?
This is the hard problem of distributed systems: the symptom and the cause are often on different machines. The cause might be on a machine you haven't looked at yet, or it might be invisible until you trace the path a single request took through the whole system. You need a way to see from the outside, even when you can't SSH into every node at once.
The industry calls this observability. And the consensus, built up over a decade of running large distributed systems at places like Google, Netflix, and Cloudflare, is that observability rests on three pillars.
Logs, Metrics, and Traces
Think of it like diagnosing a car problem.
Logs are the warning lights and dashboard messages: "engine temperature high", "check tyre pressure". They record discrete events in sequence, and they're invaluable for knowing what happened. But they only record what someone thought to write down in advance, and when something goes wrong you're searching through thousands of them to find the relevant three.
Metrics are the gauges: speed, RPM, temperature, fuel level. Numbers that change over time, great for spotting trends and anomalies. "Temperature has been climbing for twenty minutes" is a metric telling you something is wrong before you see smoke. But a gauge can't tell you why the temperature is climbing.
Traces are the mechanic's diagnostic report: "we followed the coolant circuit and found a blockage in the pipe between the pump and the radiator; it takes 3 seconds for heat to reach the sensor from that point." A trace follows one specific request, connection, or operation through every step it takes, with timing for each step.
Logs ── What happened? (events, errors, state changes)
Metrics ── How often, how bad? (rates, durations, counts)
Traces ── Where did the time go? (causal chain for one request)
Each pillar answers a different question. Each has blind spots the others cover. The real power comes when they're connected to each other.
Pillar One: Logs
TunnelMesh has had structured logging since day one. Every subsystem emits structured log lines via zerolog: JSON-shaped entries with typed fields, easy to filter, easy to forward.
Configure a Loki endpoint and your logs are shipped there automatically, in batches, without blocking anything:
logging:
  loki:
    url: http://10.42.0.1:3100
    batch_size: 100
    flush_interval: 5s
Alongside the operational logs, TunnelMesh also writes a structured audit log — a separate record of every authentication event, authorisation decision, S3 operation, and NFS access, with user ID, source IP, operation, and result. If you ever need to answer "who accessed this bucket and when", the audit log has it in a queryable form.
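Because the audit log is structured JSON lines, answering that question is a filter rather than a grep. A minimal sketch of what that looks like; the exact field names (`user`, `src_ip`, `op`, `bucket`, `result`) and the sample events are assumptions for illustration, not the documented TunnelMesh schema:

```python
import json

# Illustrative audit-log lines; field names are assumed, not the real schema.
AUDIT_LINES = [
    '{"user": "alice", "src_ip": "10.42.0.7", "op": "s3.GetObject", '
    '"bucket": "backups", "result": "allow"}',
    '{"user": "bob", "src_ip": "10.42.0.9", "op": "s3.PutObject", '
    '"bucket": "media", "result": "deny"}',
]

def accesses_to(bucket, lines):
    """Yield every audit event that touched the given bucket."""
    for line in lines:
        event = json.loads(line)
        if event.get("bucket") == bucket:
            yield event

hits = list(accesses_to("backups", AUDIT_LINES))
assert hits[0]["user"] == "alice"
```

In practice you would run the equivalent query in Loki rather than in a script, but the point is the same: typed fields make the question answerable without parsing free-form text.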
Pillar Two: Metrics
Every TunnelMesh peer exposes a Prometheus endpoint at /metrics on its admin interface, with more than fifty metrics across four subsystems.
Networking metrics cover packet rates, drop reasons (no route, no tunnel, packet filter), tunnel health state, UDP latency per peer, and connection setup duration histograms. When two peers lose their direct path and fall back to a relay, you see it immediately: tunnelmesh_active_tunnels drops, tunnelmesh_reconnects_total ticks up.
Storage metrics cover S3 request rates and durations, transfer bytes, deduplication ratio, chunk counts, rebalancer activity, and replication bytes exchanged between coordinators. A sudden widening in the tunnelmesh_s3_request_duration_seconds histogram is the first signal that something is wrong with storage performance.
Coordinator metrics track peer RTT, online peer count, and heartbeat rate. Docker metrics expose CPU, memory, and status for every container on every node.
These are what power the visualiser, map, alerts panel, and peer cards in the dashboard. When something goes wrong, a metric usually shows it first.
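The /metrics endpoint serves the standard Prometheus text exposition format, so any scraper can read it. A minimal sketch of parsing label-free samples from a scrape; the metric names come from this article, but the sample payload and values are illustrative:

```python
# Illustrative scrape output; values are made up, names are from the article.
SCRAPE = """\
# HELP tunnelmesh_active_tunnels Number of direct tunnels currently up.
# TYPE tunnelmesh_active_tunnels gauge
tunnelmesh_active_tunnels 11
# TYPE tunnelmesh_reconnects_total counter
tunnelmesh_reconnects_total 3
"""

def parse_metrics(text):
    """Parse simple (label-free) Prometheus exposition lines into a dict."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        name, value = line.rsplit(None, 1)
        out[name] = float(value)
    return out

metrics = parse_metrics(SCRAPE)
assert metrics["tunnelmesh_active_tunnels"] == 11.0
```

A relay fallback shows up here as the gauge dropping while the counter ticks up between two scrapes.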
Pillar Three: Traces (New)
Metrics tell you that your S3 operations got slower. Logs tell you what operations were attempted. Neither tells you where the time went inside a specific request: which coordinator handled it, whether it had to fetch a chunk from a remote peer, how long each step took.
That's what distributed tracing gives you, and today we're shipping it.
TunnelMesh now supports OpenTelemetry (OTel) distributed tracing with OTLP export. Enable it with a single flag:
tunnelmesh join --otlp-endpoint http://127.0.0.1:4318 coordinator.example.com
Point it at Grafana Tempo, Jaeger, or any OTLP-compatible collector and spans start flowing immediately.
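Under the hood, OTLP/HTTP export is just JSON POSTed to the collector's /v1/traces endpoint. A sketch of the minimal payload shape per the OTLP spec; the span name and trace ID here are illustrative, not real TunnelMesh output:

```python
import json
import time

# Minimal OTLP/HTTP trace payload (JSON encoding). IDs are hex strings:
# traceId is 16 bytes (32 hex chars), spanId is 8 bytes (16 hex chars).
span_payload = {
    "resourceSpans": [{
        "resource": {"attributes": [
            {"key": "service.name", "value": {"stringValue": "tunnelmesh"}},
        ]},
        "scopeSpans": [{
            "spans": [{
                "traceId": "b3f2a91c" + "0" * 24,  # illustrative ID
                "spanId": "1a2b3c4d5e6f7a8b",
                "name": "transport.negotiate",
                "startTimeUnixNano": str(time.time_ns()),
                "endTimeUnixNano": str(time.time_ns()),
            }],
        }],
    }],
}
body = json.dumps(span_payload)
assert "transport.negotiate" in body
```

You never construct this by hand — the OTel SDK does it — but seeing the shape makes it clear why any OTLP-capable backend can ingest the same stream.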
What Gets Traced
Spans are instrumented across the subsystems where latency questions actually matter.
Routing: when a peer connects, TunnelMesh walks the transport fallback chain — UDP hole-punch, SSH tunnel, WebSocket relay. The trace captures each attempt:
Span: peer.discover (peer_id=b3f2a9...)
└─ Span: transport.negotiate
├─ Span: udp.hole_punch [timeout after 3s — failed]
├─ Span: ssh.tunnel [1.2s — success]
└─ Span: transport.promote [watching for better path]
If UDP hole-punching is timing out consistently for peers in a specific region, that pattern surfaces in the trace view immediately rather than buried in log lines you'd have to grep for.
Storage: when a client writes to S3, the trace follows the request through chunk splitting, local storage, and replication:
Span: s3.PutObject (bucket=backups, key=snapshot.tar.gz)
├─ Span: s3.chunk.store × 48 [avg 3ms each]
└─ Span: s3.replication.enqueue
└─ Span: s3.replication.fetch [coordinator=peer-eu-1, 340ms]
That 340ms replication fetch stands out immediately. Without a trace, you'd only see the total PutObject duration and wonder why it was slow. The trace shows the chunk wasn't available locally and had to be fetched from a remote coordinator. That's actionable: the data might not be replicated widely enough, or the rebalancer hasn't run since a new coordinator joined.
The Part That Makes It Click: Exemplars
The three pillars become more than the sum of their parts when they're linked to each other.
TunnelMesh's histogram metrics — tunnelmesh_connection_setup_duration_seconds and tunnelmesh_s3_request_duration_seconds — now record Prometheus exemplars: a trace ID embedded directly in the metric data point. In Grafana, this means you can click on a spike in a histogram bucket and jump directly to the trace that caused it.
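On the wire, an exemplar is a small annotation after the sample value in OpenMetrics format: a `#` followed by labels (the trace ID) and the observed value. A sketch of pulling the trace ID back out; the sample line is illustrative, built from the metric name in this article:

```python
import re

# Illustrative OpenMetrics bucket line carrying an exemplar.
LINE = (
    'tunnelmesh_s3_request_duration_seconds_bucket{le="0.5"} 42 '
    '# {trace_id="b3f2a91c..."} 0.34 1700000000.0'
)

# The exemplar rides after "#": labels, then observed value and timestamp.
match = re.search(r'#\s*\{trace_id="([^"]+)"\}', LINE)
assert match and match.group(1).startswith("b3f2a91c")
```

Grafana does exactly this extraction for you, which is what makes the histogram-to-trace click possible.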
Grafana: latency histogram spike at 14:32
↓ click exemplar
Tempo: trace b3f2a91c... → udp.hole_punch [timeout] → ssh.tunnel [8.2s]
↓ click "View logs"
Loki: peer b3f2 reconnected via ssh tunnel, udp failed on network change
Metrics catch the anomaly. One click jumps you to the trace. Another brings up the logs for that peer at that exact moment. You've gone from "something is slow" to "this specific peer fell back to SSH tunnel because its network changed" in three clicks, without guessing which machine to SSH into or which log file to tail.
Enabling the Full Stack
To get all three pillars connected, you need four things running:
- Loki for logs (self-hosted or Grafana Cloud)
- Prometheus for metrics (each peer exposes /metrics; a Prometheus service discovery generator is available to automate scrape config)
- Tempo for traces (or Jaeger, or any OTLP-capable backend)
- Grafana to tie them together (proxied through the coordinator dashboard at /grafana/)
The Cloud Deployment Guide includes a Terraform configuration that stands all four up on DigitalOcean alongside the mesh. If you're already running a Grafana stack, adding --otlp-endpoint to your TunnelMesh nodes is all you need to start seeing traces in Tempo.
If you want to try it locally first, a single docker compose with Loki, Tempo, and Grafana is enough to explore. Check the Benchmarking article for the kind of connection and storage events that generate the most interesting traces.
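A minimal compose file for that local stack might look like the sketch below. The image names and ports are the upstream defaults, not something TunnelMesh prescribes, and Tempo additionally needs its own config file mounted (omitted here for brevity):

```yaml
services:
  loki:
    image: grafana/loki
    ports: ["3100:3100"]     # log push + query API
  tempo:
    image: grafana/tempo
    ports: ["4318:4318"]     # OTLP/HTTP ingest; needs a tempo.yaml mounted
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]     # add Loki and Tempo as data sources here
```

Point --otlp-endpoint and the Loki URL from your logging config at the mapped ports and all three pillars land in one Grafana.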
TunnelMesh is released under the AGPL-3.0 License.