Observability
Diminuendo provides four observability surfaces: structured logging via Effect’s built-in logger, distributed tracing via OpenTelemetry, in-memory metrics, and deep health checks that probe upstream dependencies. Each is designed for zero-configuration local development and opt-in production instrumentation.
Logging
Diminuendo uses Effect’s built-in logging system, which integrates directly with the Effect runtime’s fiber scheduler. Every Effect.log* call captures the current fiber’s context (span, annotations) and routes through the configured logger implementation.
Logger Configuration
The logger is configured by two environment variables:
| Variable | Effect |
|---|---|
| LOG_LEVEL | Minimum severity: trace, debug, info, warning, error, fatal. Default: info |
| DEV_MODE / NODE_ENV | Format selection: pretty-print in dev, JSON in production |
```ts
const loggerLayer = config.devMode
  ? Logger.replace(Logger.defaultLogger, Logger.prettyLoggerDefault)
  : Logger.json
```
Production: JSON Logger
In production (NODE_ENV=production or DEV_MODE not set), logs are emitted as structured JSON, one object per line. This format is optimized for ingestion by log aggregators (Datadog, Grafana Loki, CloudWatch Logs):
```
{"level":"INFO","message":"Gateway listening on 0.0.0.0:8080","timestamp":"2024-03-01T12:00:00.000Z","fiber":"#1"}
{"level":"DEBUG","message":"PodiumClient: POST /api/v1/instances","timestamp":"2024-03-01T12:00:01.234Z","fiber":"#5"}
```
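These newline-delimited records are straightforward to post-process. As a sketch (the helper names are illustrative, and any record fields beyond those shown above are assumptions):

```typescript
// Sketch: parsing newline-delimited JSON log records as emitted by the
// production logger. The record shape mirrors the examples above.
interface LogRecord {
  level: string
  message: string
  timestamp: string
  fiber?: string
}

function parseLogLines(raw: string): LogRecord[] {
  return raw
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as LogRecord)
}

// Keep only records at or above a given severity, using the level
// order from the LOG_LEVEL table (trace through fatal).
const severity = ["TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "FATAL"]

function atLeast(records: LogRecord[], min: string): LogRecord[] {
  const threshold = severity.indexOf(min)
  return records.filter((r) => severity.indexOf(r.level) >= threshold)
}
```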
Development: Pretty Logger
In development, logs use Effect’s prettyLoggerDefault, which renders human-readable output with color coding:
```
12:00:00.000 INFO Gateway listening on 0.0.0.0:8080 (dev mode - auth bypassed)
12:00:01.234 DEBUG PodiumClient: POST /api/v1/instances body={...}
```
Log Level Recommendations
| Level | Use Case |
|---|---|
| error | Unrecoverable failures, data corruption, service crashes |
| warning | Recoverable issues: stale session recovery failures, missing optional config, degraded dependencies |
| info | Service lifecycle events: startup, shutdown, configuration summary, connection events |
| debug | Request/response details: Podium API calls, WebSocket frame details, SQL queries |
| trace | Fiber scheduling, Effect runtime internals (rarely needed) |
In production, use info as the default log level. Switch to debug temporarily when diagnosing issues — the additional output includes every Podium API call, every SQLite worker command, and every WebSocket message type.
OpenTelemetry Tracing
Distributed tracing is opt-in. Set OTEL_EXPORTER_OTLP_ENDPOINT to enable it. If the variable is unset, the tracing subsystem is completely inert — no spans are created, no overhead is incurred.
Initialization
Tracing is initialized once at startup via initTracing(). The function is idempotent and safe to call multiple times:
```ts
await initTracing(process.env.OTEL_SERVICE_NAME ?? "diminuendo-gateway")
```
Initialization dynamically imports the OpenTelemetry packages:
- @opentelemetry/api
- @opentelemetry/sdk-trace-node
- @opentelemetry/exporter-trace-otlp-http
- @opentelemetry/sdk-trace-base
The OpenTelemetry packages are optional dependencies. If they are not installed, initTracing() catches the import error and silently disables tracing. The gateway runs identically with or without these packages in node_modules.
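The catch-and-disable pattern can be sketched as follows; loadOptional and detectTracing are illustrative names, not part of the gateway's API:

```typescript
// Sketch of the optional-dependency pattern: attempt a dynamic import
// and treat failure as "feature disabled" rather than a crash.
async function loadOptional(specifier: string): Promise<unknown | null> {
  try {
    return await import(specifier)
  } catch {
    return null // package not installed: caller disables the feature
  }
}

// Usage: tracing stays off when the OTel API package is absent.
async function detectTracing(): Promise<boolean> {
  const api = await loadOptional("@opentelemetry/api")
  return api !== null
}
```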
Configuration
| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTLP HTTP endpoint (e.g., http://localhost:4318) |
| OTEL_SERVICE_NAME | diminuendo-gateway | Service name in trace metadata |
The exporter sends traces to {OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces using the OTLP HTTP protocol. A BatchSpanProcessor batches spans for efficient network transmission.
withSpan() Helper
The withSpan() function wraps any Effect in an OpenTelemetry span. If tracing is disabled, it passes the Effect through unchanged (zero overhead):
```ts
export function withSpan<A, E, R>(
  name: string,
  effect: Effect.Effect<A, E, R>,
  attributes?: Record<string, string | number | boolean>,
): Effect.Effect<A, E, R>
```
Span lifecycle is managed correctly even under fiber interruption:
- On success: span status is set to OK and the span is ended
- On failure or interruption: span status is set to ERROR with a diagnostic message, and the span is ended
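The same lifecycle can be sketched with plain Promises (the real helper operates on Effects and also handles fiber interruption, which Promises cannot express; the Span interface here is a minimal stand-in, not the OTel API):

```typescript
// Minimal span stand-in for illustration.
interface Span {
  setStatus(status: "OK" | "ERROR", message?: string): void
  end(): void
}

// Run an async operation inside a span: status reflects the outcome,
// and the span is always ended, success or failure.
async function withSpanSketch<A>(span: Span, run: () => Promise<A>): Promise<A> {
  try {
    const result = await run()
    span.setStatus("OK")
    return result
  } catch (err) {
    span.setStatus("ERROR", String(err))
    throw err
  } finally {
    span.end() // ended on every path
  }
}
```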
Trace ID Propagation
The currentTraceId() function returns the active span’s trace ID if OTel is enabled, or a random 32-character hex string otherwise. This ID is propagated through event envelopes, enabling correlation between client-visible events and server-side traces:
```ts
export function currentTraceId(): string {
  if (tracingEnabled && otelApi) {
    const span = otelApi.trace.getActiveSpan?.()
    if (span) {
      const ctx = span.spanContext()
      if (ctx?.traceId) return ctx.traceId
    }
  }
  return crypto.randomUUID().replace(/-/g, "").slice(0, 32)
}
```
Graceful Degradation
The tracing subsystem is designed for complete graceful degradation:
| Condition | Behavior |
|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT not set | Tracing disabled; withSpan() is a pass-through |
| OTel packages not installed | initTracing() catches import error; tracing disabled |
| Collector unreachable | BatchSpanProcessor buffers and retries; no impact on gateway |
| initTracing() called multiple times | Idempotent; second call is a no-op |
Metrics
Diminuendo includes an in-memory metrics system with counters, gauges, and histograms. Metrics support labels for multi-dimensional queries.
Endpoints
| Mode | Path | Format |
|---|---|---|
| Dev | GET /api/metrics | JSON |
| Production | GET /metrics | Prometheus text (when METRICS_PROMETHEUS=true) |
Built-in Metrics
- ws_connections_active (gauge) — current WebSocket connections
- ws_messages_received_total (counter) — total inbound messages by type
- ws_messages_sent_total (counter) — total outbound events by type
- turns_started_total (counter) — total turns initiated
- turns_completed_total (counter) — total turns completed
- turns_errored_total (counter) — total turns that errored
- podium_request_duration_ms (histogram) — Podium API call latencies
- ensemble_request_duration_ms (histogram) — Ensemble API call latencies
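As a sketch of how a labeled counter and its Prometheus text rendering might fit together (class and method names are illustrative; the real system also implements gauges and histograms):

```typescript
// Minimal labeled counter with Prometheus text exposition output.
// Labels are keyed by their JSON encoding for multi-dimensional counts.
class Counter {
  private values = new Map<string, number>()
  constructor(readonly name: string) {}

  inc(labels: Record<string, string> = {}, by = 1): void {
    const key = JSON.stringify(labels)
    this.values.set(key, (this.values.get(key) ?? 0) + by)
  }

  // Render in the Prometheus text format, e.g.:
  //   ws_messages_received_total{type="chat"} 3
  toPrometheus(): string {
    const lines = [`# TYPE ${this.name} counter`]
    this.values.forEach((value, key) => {
      const labels = Object.entries(JSON.parse(key) as Record<string, string>)
        .map(([k, v]) => `${k}="${v}"`)
        .join(",")
      lines.push(labels ? `${this.name}{${labels}} ${value}` : `${this.name} ${value}`)
    })
    return lines.join("\n")
  }
}
```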
Health Endpoint
The gateway exposes a GET /health endpoint that performs deep health checks against upstream dependencies.
```json
{
  "status": "ok",
  "uptime": 3600000,
  "connections": 42,
  "memory": {
    "rss": 67108864,
    "heapUsed": 41943040,
    "heapTotal": 50331648
  },
  "dependencies": [
    {
      "name": "podium",
      "status": "ok",
      "latencyMs": 12
    },
    {
      "name": "ensemble",
      "status": "ok",
      "latencyMs": 8
    },
    {
      "name": "litestream",
      "status": "ok"
    }
  ],
  "version": "0.1.0"
}
```
Health Check Logic
The endpoint probes each configured upstream service by sending a GET request to {service_url}/health with a 2-second timeout:
1. Probe dependencies. Podium and Ensemble (if configured) are probed in parallel; each probe measures latency and captures the HTTP status.
2. Classify each dependency:
   - 200 OK within the timeout: ok
   - Non-200 HTTP status: degraded (with error detail)
   - Timeout or connection error: unhealthy (with error message)
3. Compute the overall status:
   - If Podium is unhealthy: overall status is unhealthy (Podium is critical)
   - If any dependency is not ok but Podium is available: overall status is degraded
   - If all dependencies are ok: overall status is ok
4. Return the response:
   - 200 for ok or degraded status
   - 503 for unhealthy status
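The classification and rollup rules can be sketched as follows (type and function names are illustrative):

```typescript
// Per-dependency classification and overall-status rollup, following
// the rules described above.
type Health = "ok" | "degraded" | "unhealthy"

interface ProbeResult {
  name: string
  httpStatus?: number // undefined on timeout or connection error
}

function classify(p: ProbeResult): Health {
  if (p.httpStatus === undefined) return "unhealthy" // timed out or refused
  return p.httpStatus === 200 ? "ok" : "degraded"
}

function overall(results: ProbeResult[]): Health {
  const statuses = new Map(results.map((r) => [r.name, classify(r)]))
  if (statuses.get("podium") === "unhealthy") return "unhealthy" // Podium is critical
  if (Array.from(statuses.values()).some((s) => s !== "ok")) return "degraded"
  return "ok"
}
```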
Response Fields
| Field | Type | Description |
|---|---|---|
| status | "ok" \| "degraded" \| "unhealthy" | Overall gateway health |
| uptime | number | Milliseconds since gateway started |
| connections | number | Number of active session subscriptions |
| memory | object | Process memory usage (rss, heapUsed, heapTotal) in bytes |
| dependencies | DependencyStatus[] | Per-dependency health details |
| version | string | Gateway version |
Dependency Criticality
Podium is the only critical dependency. If Podium is unreachable, the gateway cannot create or manage agent sessions, so the overall status is unhealthy (503). Ensemble is non-critical — if it is unreachable, the gateway reports degraded (200) because agent sessions can still function without gateway-level inference.
Litestream Health Check
When LITESTREAM_ENABLED=true, the health endpoint includes a Litestream dependency check. This is a filesystem-only probe (no process check, no HTTP call):
- If DATA_DIR has no databases yet → ok (nothing to replicate)
- If any .litestream/generation or .litestream/generations directory exists → ok
- If databases exist but no Litestream metadata → degraded (never unhealthy)
Litestream going down makes the status degraded, not unhealthy — the gateway can still serve requests, but replication is stalled.
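The probe can be sketched as a pure filesystem check (the .db suffix for database detection and the exact location of the Litestream metadata directories are assumptions beyond what the rules above state):

```typescript
import * as fs from "node:fs"
import * as path from "node:path"

// Filesystem-only Litestream probe: no process check, no HTTP call.
function litestreamHealth(dataDir: string): "ok" | "degraded" {
  const entries = fs.existsSync(dataDir) ? fs.readdirSync(dataDir) : []
  const hasDatabases = entries.some((f) => f.endsWith(".db"))
  if (!hasDatabases) return "ok" // nothing to replicate yet

  // Look for Litestream generation metadata alongside the databases.
  const hasMetadata =
    fs.existsSync(path.join(dataDir, ".litestream", "generation")) ||
    fs.existsSync(path.join(dataDir, ".litestream", "generations"))
  return hasMetadata ? "ok" : "degraded" // never "unhealthy"
}
```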
The health endpoint does not require authentication. It is designed for load balancer health checks and monitoring systems. Do not expose sensitive information in the response.
Load Balancer Integration
Configure your load balancer to probe GET /health periodically:
- Health check path: /health
- Expected status: 200
- Interval: 10s
- Timeout: 5s
- Unhealthy threshold: 3 consecutive 503 responses
An instance returning 503 (Podium unhealthy) should be removed from the load balancer pool. An instance returning 200 with degraded status should remain in the pool — it can still serve requests, but operators should investigate the degraded dependency.