Observability

All ShopSTAR3 services must be observable by default. Observability is a build-time requirement enforced at the platform level, not something added operationally after deployment.

The stack is built on OpenTelemetry (OTel) to keep the instrumentation layer vendor-neutral. Datadog is the first export backend wired up, but it can be replaced or extended by changing collector configuration alone, with no application code changes.

Signals

All three observability signals are required across every service:

| Signal | Purpose |
| --- | --- |
| Traces | Distributed request tracing across service boundaries. |
| Metrics | JVM internals, HTTP layer stats, and custom business metrics. |
| Logs | Structured log emission correlated to active trace and span IDs. |

Collector Architecture

The OTel Collector is deployed in an agent + gateway topology:

flowchart TD
    S1([Service]) -->|OTLP| A
    S2([Service]) -->|OTLP| A
    S3([Service]) -->|OTLP| A

    subgraph host["Host / Node"]
        A[Host Agent Collector]
    end

    A -->|OTLP batched| G

    subgraph gateway["Central Gateway"]
        G[Gateway Collector]
        G --> SP[Tail-based Sampler]
        G --> EN[Enrichment]
        G --> BA[Batcher]
    end

    BA -->|Export| DD[(Datadog)]
    BA -.->|Future backend| FB[(Backend)]

Host agent: A collector instance runs on each host/node. Services emit OTLP (gRPC or HTTP) to the local agent. This keeps the service-to-collector path fast and local, and decouples services from the central gateway’s availability.

Central gateway: Receives from all agents. Responsible for tail-based sampling decisions, attribute enrichment, batching, and final export to Datadog. All backend endpoint configuration lives here — services never reference a backend directly.

Sampling

Tail-based sampling is used for traces. The sampling decision is deferred until the full trace has been assembled at the gateway collector. This ensures that traces containing errors or latency outliers are always retained, regardless of overall request volume.

Head-based (probabilistic) sampling is explicitly not used. Its keep-or-drop decision is made when a trace starts, before the outcome is known, so it would drop a random subset of traces, including failures.
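The decision rule described above can be illustrated with a minimal sketch. It is not the gateway's actual sampler, which is configured in the collector rather than written in application code. The `Span` class, the 2-second threshold, and the `baselineKeep` flag are all assumptions made for illustration.

```java
import java.util.List;

final class TailSamplingSketch {
    // Assumed latency threshold for an "outlier"; the real value is a policy choice.
    static final long SLOW_SPAN_MILLIS = 2_000;

    // Hypothetical, stripped-down view of a finished span.
    static final class Span {
        final long durationMillis;
        final boolean error;

        Span(long durationMillis, boolean error) {
            this.durationMillis = durationMillis;
            this.error = error;
        }
    }

    // Called once every span of a trace has reached the gateway.
    // baselineKeep stands in for whatever policy applies to healthy traces.
    static boolean keep(List<Span> trace, boolean baselineKeep) {
        for (Span span : trace) {
            if (span.error || span.durationMillis >= SLOW_SPAN_MILLIS) {
                return true; // errors and latency outliers are always retained
            }
        }
        return baselineKeep;
    }
}
```

Because `keep` sees the whole trace, a single failed or slow span anywhere in it is enough to retain the trace. A head-based sampler would have to decide before any of those spans existed.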

Application Requirements

Quarkus Services

All services must include the quarkus-opentelemetry extension. This provides automatic instrumentation for:

  • Inbound/outbound HTTP requests
  • Database queries (JDBC/reactive)
  • Messaging (Kafka producers and consumers)
  • JVM and system metrics via the Micrometer bridge

No manual span creation is required for standard I/O. Custom spans should only be added for significant business operations that are not covered by automatic instrumentation.

Structured Logging

All services must emit logs in structured JSON format. Every log entry must include:

  • trace_id — the active OTel trace ID
  • span_id — the active OTel span ID
  • service.name — the emitting service
  • service.version — the deployed artifact version

This allows log entries to be correlated with traces in Datadog without any post-processing.
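To make the required shape concrete, here is a minimal JDK-only sketch of one log entry carrying the four fields. In a real service a JSON logging extension and the OTel context would normally supply these values. The `message` field, the class name, and the `checkout` / `1.4.2` values in the usage note are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class StructuredLogSketch {
    // Builds one JSON log line carrying the four required correlation fields.
    static String logLine(String message, String traceId, String spanId,
                          String serviceName, String serviceVersion) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("message", message); // assumed field name for the log text
        fields.put("trace_id", traceId);
        fields.put("span_id", spanId);
        fields.put("service.name", serviceName);
        fields.put("service.version", serviceVersion);

        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (json.length() > 1) json.append(',');
            json.append('"').append(escape(e.getKey())).append("\":\"")
                .append(escape(e.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    // Minimal escaping of quotes, backslashes, and control characters.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') out.append('\\').append(c);
            else if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        return out.toString();
    }
}
```

Calling `StructuredLogSketch.logLine("order placed", "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7", "checkout", "1.4.2")` yields a single-line JSON object whose `trace_id` and `span_id` can be matched directly against the corresponding trace.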

Configuration

Services must not hardcode any OTel endpoint or backend credentials. All export configuration is injected at deploy time via environment variables:

| Variable | Purpose |
| --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | Local agent OTLP endpoint (e.g. http://localhost:4318) |
| OTEL_SERVICE_NAME | Service name as it appears in traces and metrics |
| OTEL_RESOURCE_ATTRIBUTES | Additional resource tags (environment, version, region) |
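The OTel SDK normally reads these variables itself, so services rarely need code like the following. As a rough sketch of the contract, the helper below fails fast when the endpoint is missing instead of falling back to a hardcoded value. It also parses the comma-separated `key=value` format of OTEL_RESOURCE_ATTRIBUTES. The class name, method names, and attribute keys shown are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OtelEnvSketch {
    // Returns the agent endpoint or fails fast; deliberately no hardcoded fallback.
    static String requireEndpoint(Map<String, String> env) {
        String endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT");
        if (endpoint == null || endpoint.trim().isEmpty()) {
            throw new IllegalStateException("OTEL_EXPORTER_OTLP_ENDPOINT is not set");
        }
        return endpoint;
    }

    // Parses the comma-separated key=value list used by OTEL_RESOURCE_ATTRIBUTES.
    static Map<String, String> parseResourceAttributes(String raw) {
        Map<String, String> attrs = new LinkedHashMap<>();
        if (raw == null || raw.trim().isEmpty()) return attrs;
        for (String pair : raw.split(",")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) continue; // skip malformed entries in this sketch
            attrs.put(pair.substring(0, eq).trim(), pair.substring(eq + 1).trim());
        }
        return attrs;
    }
}
```

For example, `OtelEnvSketch.requireEndpoint(System.getenv())` returns the injected endpoint at startup. `parseResourceAttributes("deployment.environment=staging,service.version=1.4.2,cloud.region=eu-west-1")` returns a three-entry map in insertion order.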