Skip to content

feat(dist): add structured logging to dist backend (Phase A.1)#108

Merged
hyp3rd merged 3 commits intomainfrom
feat/dist-mem-cache
May 5, 2026
Merged

feat(dist): add structured logging to dist backend (Phase A.1)#108
hyp3rd merged 3 commits intomainfrom
feat/dist-mem-cache

Conversation

@hyp3rd
Copy link
Copy Markdown
Owner

@hyp3rd hyp3rd commented May 5, 2026

Introduce WithDistLogger(*slog.Logger) option for DistMemory. When supplied, the dist backend emits structured slog records for all operational events across its background loops and error surfaces:

  • HTTP listener bind success and failure
  • Serve-goroutine exits (previously silent in serveErr only)
  • Peer state transitions: suspect-on-timeout, suspect-on-probe-failure, and dead-peer pruning
  • Merkle sync fetch failures
  • Rebalance migration forward failures
  • Hint dropped after replay error

All records are pre-bound with component=dist_memory and node_id=<id> at construction so operators can grep/filter without each call site re-attaching the node identity.

Default behaviour is unchanged: the zero-value path installs a slog.DiscardHandler so library code never writes to stderr unless the caller opts in.

Add tests/dist_logging_test.go with two contract tests:

  • TestDistMemory_LoggerEmitsListenerStart — asserts the listener-start record is emitted with the expected level and attributes.
  • TestDistMemory_LoggerDefaultIsSilent — asserts no-option construction still succeeds without any side effects.

Update .mdl_style.rb to allow MD024 (allow_different_nesting) so the Keep-a-Changelog heading structure passes linting.

Update CHANGELOG.md: add [Unreleased] entry for this work; correct version numbers from 2.0.x to 0.5.0/0.4.3; fix comparison links to match actual tags.

Introduce `WithDistLogger(*slog.Logger)` option for `DistMemory`. When
supplied, the dist backend emits structured `slog` records for all
operational events across its background loops and error surfaces:

- HTTP listener bind success and failure
- Serve-goroutine exits (previously silent in `serveErr` only)
- Peer state transitions: suspect-on-timeout, suspect-on-probe-failure,
  and dead-peer pruning
- Merkle sync fetch failures
- Rebalance migration forward failures
- Hint dropped after replay error

All records are pre-bound with `component=dist_memory` and
`node_id=<id>` at construction so operators can grep/filter without
each call site re-attaching the node identity.

Default behaviour is unchanged: the zero-value path installs a
`slog.DiscardHandler` so library code never writes to stderr unless the
caller opts in.

Add `tests/dist_logging_test.go` with two contract tests:
- `TestDistMemory_LoggerEmitsListenerStart` — asserts the listener-start
  record is emitted with the expected level and attributes.
- `TestDistMemory_LoggerDefaultIsSilent` — asserts no-option construction
  still succeeds without any side effects.

Update `.mdl_style.rb` to allow `MD024` (`allow_different_nesting`) so
the Keep-a-Changelog heading structure passes linting.

Update `CHANGELOG.md`: add `[Unreleased]` entry for this work; correct
version numbers from `2.0.x` to `0.5.0`/`0.4.3`; fix comparison links
to match actual tags.
Copilot AI review requested due to automatic review settings May 5, 2026 06:44
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces opt-in structured logging for the distributed backend (DistMemory) via a new WithDistLogger(*slog.Logger) option, keeping default behavior silent while surfacing operational events (listener lifecycle, peer liveness transitions, merkle sync errors, rebalance forwarding failures, hint replay drops) when a logger is provided.

Changes:

  • Add WithDistLogger option and wire a pre-bound *slog.Logger through DistMemory and its internal HTTP server to emit structured operational logs.
  • Add contract tests validating that listener-start logging is emitted when configured and that default construction remains silent.
  • Update changelog and markdownlint style to reflect the new feature and accommodate Keep-a-Changelog heading structure.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/backend/dist_memory.go Adds logger field/option, binds default logger, and emits structured log events across dist operations.
pkg/backend/dist_http_server.go Adds inherited logger to surface serve-goroutine exit errors via structured logs.
pkg/backend/dist_memory_test_helpers.go Propagates the dist logger into the HTTP server in test helpers.
tests/dist_logging_test.go Adds tests + a custom slog.Handler to assert emitted structured records.
CHANGELOG.md Documents the feature and adjusts release versioning/links (but currently breaks reference links).
.mdl_style.rb Configures markdownlint to allow repeated nested headings (MD024) for Keep-a-Changelog format.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/backend/dist_memory.go Outdated
Comment on lines +710 to +712
dm.logger = dm.logger.With(
slog.String("component", "dist_memory"),
slog.String("node_id", dm.nodeID),
Comment thread CHANGELOG.md
Comment on lines +201 to +202
Unreleased: <https://github.com/hyp3rd/hypercache/compare/v0.5.0...HEAD>
Released: [0.5.0](https://github.com/hyp3rd/hypercache/releases/tag/v0.5.0)
hyp3rd added 2 commits May 5, 2026 09:07
Introduces per-operation spans on the DistMemory backend as the second
phase of the production-readiness work.

Public API:
- WithDistTracerProvider(trace.TracerProvider) option enables tracing;
  nil-safe, no-op by default so existing callers are unaffected.
- ConsistencyLevel.String() renders consistency levels as ONE /
  QUORUM / ALL for use in log and span attributes.

Spans emitted:
- dist.get  — attributes: cache.key.length, dist.consistency, cache.hit
- dist.set  — attributes: cache.key.length, dist.consistency,
              dist.owners.count, dist.acks; errors recorded on span
- dist.remove — attribute: dist.keys.count; errors recorded on span
- dist.replicate.set / dist.replicate.remove — child spans per peer
  during fan-out carrying peer.id

Design notes:
- Cache key values are intentionally omitted from all spans (keys can
  be PII such as user IDs or session tokens); only cache.key.length
  is recorded.
- Business logic extracted into getImpl / setImpl / removeImpl so the
  public wrappers remain thin tracing + latency shells.
- replicateSetWithSpan / replicateRemoveWithSpan helpers ensure the
  trace-tree shape is identical regardless of which path drives fan-out.
- Library default installs noop.NewTracerProvider so the hot path
  incurs zero allocation overhead when tracing is not opted into.

Deps: adds go.opentelemetry.io/otel/sdk v1.43.0 and
      go.uber.org/goleak v1.3.0 for goroutine-leak detection in tests.

Tests: new tests/dist_tracing_test.go covers Set/Get span emission,
       hit-vs-miss cache.hit attribute, replication child-span shape,
       and the default no-op behaviour.
Introduce `WithDistMeterProvider(metric.MeterProvider)` option for the
DistMemory backend. When set, NewDistMemory registers 40+ observable
instruments covering every field on DistMetrics:

- ObservableCounters for cumulative totals: write attempts/acks/quorum
  failures, forwarded ops, replica fan-out, hinted-handoff lifecycle
  (queued/replayed/expired/dropped), Merkle syncs/keys pulled,
  heartbeat probes, membership transitions, versioning, rebalance keys
  and batches.
- ObservableGauges for current state: alive/suspect/dead member counts,
  active tombstones, hinted-handoff queue bytes, last-operation
  latencies (merkle build/diff/fetch, rebalance, auto-sync) in
  nanoseconds.

A single registered callback drives all instruments: on each collection
cycle it takes one Metrics() snapshot and observes every instrument,
keeping per-operation overhead to zero beyond the existing atomics.
Instrument names use the `dist.` prefix so a Prometheus exporter renders
them under a single subsystem.

Stop() unregisters the callback so the OTel SDK does not invoke it
against a half-stopped backend. Library default remains a no-op meter;
metrics cost nothing unless the caller opts in.

Refactor: extract `installTelemetryDefaults()` from NewDistMemory to
wire no-op logger/tracer/meter fallbacks and register the OTel callback,
keeping the constructor under the function-length lint cap.

Also fix README version references from v2.0.0 → v0.5.0 in the sharded
eviction, transport hardening, and security sections.

Tests: add dist_metrics_test.go (Phase A.3 contract tests):
- TestDistMemory_MetricsExportsObservableCounters: 3 Sets yield
  dist.write.attempts=3 via a ManualReader.
- TestDistMemory_MetricsExportsObservableGauges: dist.members.alive=1
  for a standalone node.
- TestDistMemory_MetricsDefaultIsNoop: no meter provider → cache still
  functional, no panic.
@hyp3rd hyp3rd merged commit 2347c33 into main May 5, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants