Observability
Diminuendo provides four observability surfaces: structured logging via Effect’s built-in logger, distributed tracing via OpenTelemetry, in-memory metrics, and deep health checks that probe upstream dependencies. Each is designed for zero-configuration local development and opt-in production instrumentation.
Logging
Diminuendo uses Effect’s built-in logging system, which integrates directly with the Effect runtime’s fiber scheduler. Every Effect.log* call captures the current fiber’s context (span, annotations) and routes through the configured logger implementation.
Logger Configuration
The logger is configured by two environment variables:
| Variable | Effect |
|---|---|
| LOG_LEVEL | Minimum severity: trace, debug, info, warning, error, fatal. Default: info |
| DEV_MODE / NODE_ENV | Format selection: pretty-print in dev, JSON in production |
```ts
const loggerLayer = config.devMode
  ? Logger.replace(Logger.defaultLogger, Logger.prettyLoggerDefault)
  : Logger.json
```
Production: JSON Logger
In production (NODE_ENV=production or DEV_MODE not set), logs are emitted as structured JSON, one object per line. This format is optimized for ingestion by log aggregators (Datadog, Grafana Loki, CloudWatch Logs):
```
{"level":"INFO","message":"Gateway listening on 0.0.0.0:8080","timestamp":"2024-03-01T12:00:00.000Z","fiber":"#1"}
{"level":"DEBUG","message":"PodiumClient: POST /api/v1/instances","timestamp":"2024-03-01T12:00:01.234Z","fiber":"#5"}
```
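These newline-delimited records are straightforward to post-process. As a sketch (the helper names are illustrative, and any record fields beyond those shown above are assumptions):

```typescript
// Sketch: parsing newline-delimited JSON log records as emitted by the
// production logger. The record shape mirrors the examples above.
interface LogRecord {
  level: string
  message: string
  timestamp: string
  fiber?: string
}

function parseLogLines(raw: string): LogRecord[] {
  return raw
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as LogRecord)
}

// Keep only records at or above a given severity, using the level
// order from the LOG_LEVEL table (trace through fatal).
const severity = ["TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "FATAL"]

function atLeast(records: LogRecord[], min: string): LogRecord[] {
  const threshold = severity.indexOf(min)
  return records.filter((r) => severity.indexOf(r.level) >= threshold)
}
```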
Development: Pretty Logger
In development, logs use Effect’s prettyLoggerDefault, which renders human-readable output with color coding:
```
12:00:00.000 INFO Gateway listening on 0.0.0.0:8080 (dev mode - auth bypassed)
12:00:01.234 DEBUG PodiumClient: POST /api/v1/instances body={...}
```
Log Level Recommendations
| Level | Use Case |
|---|---|
| error | Unrecoverable failures, data corruption, service crashes |
| warning | Recoverable issues: stale session recovery failures, missing optional config, degraded dependencies |
| info | Service lifecycle events: startup, shutdown, configuration summary, connection events |
| debug | Request/response details: Podium API calls, WebSocket frame details, SQL queries |
| trace | Fiber scheduling, Effect runtime internals (rarely needed) |
In production, use info as the default log level. Switch to debug temporarily when diagnosing issues — the additional output includes every Podium API call, every SQLite worker command, and every WebSocket message type.
OpenTelemetry Tracing
Distributed tracing is opt-in. Set OTEL_EXPORTER_OTLP_ENDPOINT to enable it. If the variable is unset, the tracing subsystem is completely inert — no spans are created, no overhead is incurred.
Initialization
Tracing is initialized once at startup via initTracing(). The function is idempotent and safe to call multiple times:
```ts
await initTracing(process.env.OTEL_SERVICE_NAME ?? "diminuendo-gateway")
```
Initialization dynamically imports the OpenTelemetry packages:
- @opentelemetry/api
- @opentelemetry/sdk-trace-node
- @opentelemetry/exporter-trace-otlp-http
- @opentelemetry/sdk-trace-base
The OpenTelemetry packages are optional dependencies. If they are not installed, initTracing() catches the import error and silently disables tracing. The gateway runs identically with or without these packages in node_modules.
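The catch-and-disable pattern can be sketched as follows; loadOptional and detectTracing are illustrative names, not part of the gateway's API:

```typescript
// Sketch of the optional-dependency pattern: attempt a dynamic import
// and treat failure as "feature disabled" rather than a crash.
async function loadOptional(specifier: string): Promise<unknown | null> {
  try {
    return await import(specifier)
  } catch {
    return null // package not installed: caller disables the feature
  }
}

// Usage: tracing stays off when the OTel API package is absent.
async function detectTracing(): Promise<boolean> {
  const api = await loadOptional("@opentelemetry/api")
  return api !== null
}
```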
Configuration
| Variable | Default | Description |
|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | (none) | OTLP HTTP endpoint (e.g., http://localhost:4318) |
| OTEL_SERVICE_NAME | diminuendo-gateway | Service name in trace metadata |
The exporter sends traces to {OTEL_EXPORTER_OTLP_ENDPOINT}/v1/traces using the OTLP HTTP protocol. A BatchSpanProcessor batches spans for efficient network transmission.
withSpan() Helper
The withSpan() function wraps any Effect in an OpenTelemetry span. If tracing is disabled, it passes the Effect through unchanged (zero overhead):
```ts
export function withSpan<A, E, R>(
  name: string,
  effect: Effect.Effect<A, E, R>,
  attributes?: Record<string, string | number | boolean>,
): Effect.Effect<A, E, R>
```
Span lifecycle is managed correctly even under fiber interruption:
- On success: span status is set to OK and the span is ended
- On failure or interruption: span status is set to ERROR with a diagnostic message, and the span is ended
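The same lifecycle can be sketched with plain Promises (the real helper operates on Effects and also handles fiber interruption, which Promises cannot express; the Span interface here is a minimal stand-in, not the OTel API):

```typescript
// Minimal span stand-in for illustration.
interface Span {
  setStatus(status: "OK" | "ERROR", message?: string): void
  end(): void
}

// Run an async operation inside a span: status reflects the outcome,
// and the span is always ended, success or failure.
async function withSpanSketch<A>(span: Span, run: () => Promise<A>): Promise<A> {
  try {
    const result = await run()
    span.setStatus("OK")
    return result
  } catch (err) {
    span.setStatus("ERROR", String(err))
    throw err
  } finally {
    span.end() // ended on every path
  }
}
```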
Trace ID Propagation
The currentTraceId() function returns the active span’s trace ID if OTel is enabled, or a random 32-character hex string otherwise. This ID is propagated through event envelopes, enabling correlation between client-visible events and server-side traces:
```ts
export function currentTraceId(): string {
  if (tracingEnabled && otelApi) {
    const span = otelApi.trace.getActiveSpan?.()
    if (span) {
      const ctx = span.spanContext()
      if (ctx?.traceId) return ctx.traceId
    }
  }
  return crypto.randomUUID().replace(/-/g, "").slice(0, 32)
}
```
Graceful Degradation
The tracing subsystem is designed for complete graceful degradation:
| Condition | Behavior |
|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT not set | Tracing disabled; withSpan() is a pass-through |
| OTel packages not installed | initTracing() catches import error; tracing disabled |
| Collector unreachable | BatchSpanProcessor buffers and retries; no impact on gateway |
| initTracing() called multiple times | Idempotent; second call is a no-op |
Metrics
Diminuendo includes an in-memory metrics system with counters, gauges, and histograms. Metrics support labels for multi-dimensional queries.
Endpoints
| Mode | Path | Format |
|---|---|---|
| Dev | GET /api/metrics | JSON |
| Production | GET /metrics | Prometheus text (when METRICS_PROMETHEUS=true) |
Built-in Metrics
- ws_connections_active (gauge) — current WebSocket connections
- ws_messages_received_total (counter) — total inbound messages by type
- ws_messages_sent_total (counter) — total outbound events by type
- turns_started_total (counter) — total turns initiated
- turns_completed_total (counter) — total turns completed
- turns_errored_total (counter) — total turns that errored
- podium_request_duration_ms (histogram) — Podium API call latencies
- ensemble_request_duration_ms (histogram) — Ensemble API call latencies
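As a sketch of how a labeled counter and its Prometheus text rendering might fit together (class and method names are illustrative; the real system also implements gauges and histograms):

```typescript
// Minimal labeled counter with Prometheus text exposition output.
// Labels are keyed by their JSON encoding for multi-dimensional counts.
class Counter {
  private values = new Map<string, number>()
  constructor(readonly name: string) {}

  inc(labels: Record<string, string> = {}, by = 1): void {
    const key = JSON.stringify(labels)
    this.values.set(key, (this.values.get(key) ?? 0) + by)
  }

  // Render in the Prometheus text format, e.g.:
  //   ws_messages_received_total{type="chat"} 3
  toPrometheus(): string {
    const lines = [`# TYPE ${this.name} counter`]
    this.values.forEach((value, key) => {
      const labels = Object.entries(JSON.parse(key) as Record<string, string>)
        .map(([k, v]) => `${k}="${v}"`)
        .join(",")
      lines.push(labels ? `${this.name}{${labels}} ${value}` : `${this.name} ${value}`)
    })
    return lines.join("\n")
  }
}
```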
Health Endpoint
The gateway exposes a GET /health endpoint that performs deep health checks against upstream dependencies.
```json
{
  "status": "ok",
  "uptime": 3600000,
  "connections": 42,
  "memory": {
    "rss": 67108864,
    "heapUsed": 41943040,
    "heapTotal": 50331648
  },
  "dependencies": [
    {
      "name": "podium",
      "status": "ok",
      "latencyMs": 12
    },
    {
      "name": "ensemble",
      "status": "ok",
      "latencyMs": 8
    },
    {
      "name": "litestream",
      "status": "ok"
    }
  ],
  "version": "0.1.0"
}
```
Health Check Logic
The endpoint probes each configured upstream service by sending a GET request to {service_url}/health with a 2-second timeout:
1. Probe dependencies. Podium and Ensemble (if configured) are probed in parallel; each probe measures latency and captures the HTTP status.
2. Classify each dependency:
   - 200 OK within the timeout: ok
   - Non-200 HTTP status: degraded (with error detail)
   - Timeout or connection error: unhealthy (with error message)
3. Compute the overall status:
   - If Podium is unhealthy: overall status is unhealthy (Podium is critical)
   - If any dependency is not ok but Podium is available: overall status is degraded
   - If all dependencies are ok: overall status is ok
4. Return the response:
   - 200 for ok or degraded status
   - 503 for unhealthy status
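The classification and rollup rules can be sketched as follows (type and function names are illustrative):

```typescript
// Per-dependency classification and overall-status rollup, following
// the rules described above.
type Health = "ok" | "degraded" | "unhealthy"

interface ProbeResult {
  name: string
  httpStatus?: number // undefined on timeout or connection error
}

function classify(p: ProbeResult): Health {
  if (p.httpStatus === undefined) return "unhealthy" // timed out or refused
  return p.httpStatus === 200 ? "ok" : "degraded"
}

function overall(results: ProbeResult[]): Health {
  const statuses = new Map(results.map((r) => [r.name, classify(r)]))
  if (statuses.get("podium") === "unhealthy") return "unhealthy" // Podium is critical
  if (Array.from(statuses.values()).some((s) => s !== "ok")) return "degraded"
  return "ok"
}
```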
Response Fields
| Field | Type | Description |
|---|---|---|
| status | "ok" \| "degraded" \| "unhealthy" | Overall gateway health |
| uptime | number | Milliseconds since gateway started |
| connections | number | Number of active session subscriptions |
| memory | object | Process memory usage (rss, heapUsed, heapTotal) in bytes |
| dependencies | DependencyStatus[] | Per-dependency health details |
| version | string | Gateway version |
Dependency Criticality
Podium is the only critical dependency. If Podium is unreachable, the gateway cannot create or manage agent sessions, so the overall status is unhealthy (503). Ensemble is non-critical — if it is unreachable, the gateway reports degraded (200) because agent sessions can still function without gateway-level inference.
Litestream Health Check
When LITESTREAM_ENABLED=true, the health endpoint includes a Litestream dependency check. This is a filesystem-only probe (no process check, no HTTP call):
- If DATA_DIR has no databases yet → ok (nothing to replicate)
- If any .litestream/generation or .litestream/generations directory exists → ok
- If databases exist but no Litestream metadata → degraded (never unhealthy)
Litestream going down makes the status degraded, not unhealthy — the gateway can still serve requests, but replication is stalled.
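The probe can be sketched as a pure filesystem check (the .db suffix for database detection and the exact location of the Litestream metadata directories are assumptions beyond what the rules above state):

```typescript
import * as fs from "node:fs"
import * as path from "node:path"

// Filesystem-only Litestream probe: no process check, no HTTP call.
function litestreamHealth(dataDir: string): "ok" | "degraded" {
  const entries = fs.existsSync(dataDir) ? fs.readdirSync(dataDir) : []
  const hasDatabases = entries.some((f) => f.endsWith(".db"))
  if (!hasDatabases) return "ok" // nothing to replicate yet

  // Look for Litestream generation metadata alongside the databases.
  const hasMetadata =
    fs.existsSync(path.join(dataDir, ".litestream", "generation")) ||
    fs.existsSync(path.join(dataDir, ".litestream", "generations"))
  return hasMetadata ? "ok" : "degraded" // never "unhealthy"
}
```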
The health endpoint does not require authentication. It is designed for load balancer health checks and monitoring systems. Do not expose sensitive information in the response.
Load Balancer Integration
Configure your load balancer to probe GET /health periodically:
- Health check path: /health
- Expected status: 200
- Interval: 10s
- Timeout: 5s
- Unhealthy threshold: 3 consecutive 503 responses
An instance returning 503 (Podium unhealthy) should be removed from the load balancer pool. An instance returning 200 with degraded status should remain in the pool — it can still serve requests, but operators should investigate the degraded dependency.