feat(dist): Phase B production-readiness — SWIM indirect probes, hint-queue retry, wire compression, and queueHint race fix#109

Open
hyp3rd wants to merge 7 commits into main from feat/dist-mem-cache

hyp3rd (Owner) commented May 5, 2026

Phase B.1 — SWIM-style indirect heartbeat probes:

  • Add WithDistIndirectProbes(k, timeout) option; when a direct probe fails, up to k random alive relays probe the target on the caller's behalf — target is only marked suspect if every relay also fails
  • Add /internal/probe HTTP endpoint and IndirectHealth() transport method
  • Refuted direct failures now refresh LastSeen rather than escalating
  • Expose dist.heartbeat.indirect_probe.{success,failure,refuted} metrics

Phase B.2 — migration failure retry via hint queue:

  • migrateIfNeeded queues a hint on ForwardSet failure instead of logging-and-dropping silently
  • replicateTo hint enqueue broadened from ErrBackendNotFound-only to any transport error (timeouts, 5xx, connection resets)
  • Fix race in queueHint: snapshot hintBytes under hintsMu before unlock to prevent concurrent adjustHintAccounting in the replay loop from racing the metric write

Phase B.3 — on-wire gzip compression for the dist HTTP transport:

  • Add DistHTTPLimits.CompressionThreshold; ForwardSet gzip-compresses Set request bodies exceeding the threshold; server decompresses transparently via fiber v3 Content-Encoding auto-decoding

Refactor: extract membershipSnapshot() helper from Metrics() to keep function length under the lint cap

Add contract tests for all three phases and the queueHint race fix

hyp3rd added 7 commits May 5, 2026 12:17
…-queue retry, wire compression, and queueHint race fix

…book (Phase C.1–C.3)

Phase C.1 — Drain endpoint:
- Add DistMemory.Drain(ctx) and POST /dist/drain HTTP endpoint; marks the
  node for graceful shutdown in a one-way, idempotent transition
- /health returns 503 while draining so load balancers stop routing
- Set/Remove reject with sentinel.ErrDraining; Get continues to serve
- Add IsDraining() accessor and dist.drains metric (CAS ensures it fires
  exactly once per transition)
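
The one-way, idempotent drain transition can be modeled with a compare-and-swap on a boolean flag. The sketch below is an illustration under assumptions: the `node` type, its in-memory map, and `healthStatus` are hypothetical, and `ErrDraining` only stands in for `sentinel.ErrDraining`.

```go
package main

import (
	"errors"
	"sync"
	"sync/atomic"
)

// ErrDraining stands in for the project's sentinel.ErrDraining.
var ErrDraining = errors.New("node is draining")

// node is a hypothetical, reduced model of a cache node.
type node struct {
	draining atomic.Bool
	drains   atomic.Int64 // stand-in for the dist.drains metric

	mu   sync.RWMutex
	data map[string]string
}

// Drain flips the node into draining mode. The CAS succeeds for exactly one
// caller, so the counter increments once no matter how often Drain is called.
func (n *node) Drain() {
	if n.draining.CompareAndSwap(false, true) {
		n.drains.Add(1)
	}
}

// IsDraining reports whether Drain has been called.
func (n *node) IsDraining() bool { return n.draining.Load() }

// Set rejects writes once the node is draining.
func (n *node) Set(key, value string) error {
	if n.IsDraining() {
		return ErrDraining
	}
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.data == nil {
		n.data = make(map[string]string)
	}
	n.data[key] = value
	return nil
}

// Get keeps serving reads while draining.
func (n *node) Get(key string) (string, bool) {
	n.mu.RLock()
	defer n.mu.RUnlock()
	v, ok := n.data[key]
	return v, ok
}

// healthStatus returns the HTTP status a health endpoint would report.
func (n *node) healthStatus() int {
	if n.IsDraining() {
		return 503 // Service Unavailable: load balancers stop routing here
	}
	return 200
}
```

There is deliberately no way to clear the flag in this sketch, which mirrors the one-way transition described above.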

Phase C.2 — Cursor-based key enumeration:
- Replace the naive full-set /internal/keys response with shard-level
  cursor pagination (next_cursor token per page)
- Add optional ?limit=<n> param; truncated=true in the response flags a
  partially-read shard and returns the same cursor for re-request
- DistHTTPTransport.ListKeys now walks pages internally with a 1024-page
  safety cap; all existing callers (anti-entropy fallback, tests) are
  unchanged
- Extract listKeysPage helper and keysPageResp wire type
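
Walking cursor pages with a safety cap could be sketched as below. The names `keysPage`, `pageFetcher`, and `listAllKeys` are hypothetical and are not the PR's `keysPageResp` or `listKeysPage`; treating an empty next cursor as "end of enumeration" is an assumption, and the `truncated` flag is left out because the description does not spell out its retry semantics.

```go
package main

import "errors"

// keysPage is a hypothetical, simplified page of keys.
type keysPage struct {
	Keys       []string
	NextCursor string // empty means no further pages (assumption)
}

// pageFetcher fetches one page starting at cursor with at most limit keys.
type pageFetcher func(cursor string, limit int) (keysPage, error)

// errPageCap is returned when the walk exceeds maxPages.
var errPageCap = errors.New("key enumeration exceeded page cap")

// listAllKeys follows next cursors until the server reports the end,
// giving up after maxPages so a misbehaving peer cannot loop forever.
func listAllKeys(fetch pageFetcher, limit, maxPages int) ([]string, error) {
	var all []string
	cursor := ""
	for page := 0; page < maxPages; page++ {
		p, err := fetch(cursor, limit)
		if err != nil {
			return nil, err
		}
		all = append(all, p.Keys...)
		if p.NextCursor == "" {
			return all, nil
		}
		cursor = p.NextCursor
	}
	return nil, errPageCap
}
```

Because callers receive the fully assembled slice, code that previously consumed a single full-set response would not need to change, which matches the note that existing callers are unchanged.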

Phase C.3 — Operations runbook:
- Add docs/operations.md covering split-brain, hint-queue overflow,
  rebalance under load, and replica-loss failure modes; each mode maps to
  the metrics that surface it
- Document observability wiring (logger/tracer/meter), drain procedure,
  and capacity-planning notes

Tests: dist_drain_test.go (3 cases) and dist_keys_cursor_test.go (2 cases)
…r binary and Docker cluster

Critical fixes in the DistMemory layer:

- factory.go: forward cfg.DistMemoryOptions to NewDistMemory; pre-fix all
  WithDistNode/WithDistSeeds/WithDistReplication calls were silent no-ops,
  leaving every node with a standalone default configuration.
- dist_memory.go: accept `id@addr` seed syntax via parseSeedSpec so the
  consistent-hash ring is built with real peer IDs; pre-fix, seeds were
  upserted with empty IDs — every node treated itself as sole owner and
  writes never propagated across the cluster.
- dist_memory.go: route removeImpl to owners[0] (primary), mirroring
  setImpl; pre-fix, replica-initiated removes skipped the primary and
  the value lingered until TTL.
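
Parsing the `id@addr` seed syntax is small enough to sketch. The helper below is a hypothetical `parseSeed`, not the project's `parseSeedSpec`; returning an empty ID for a bare address is an assumption about how legacy seeds are handled.

```go
package main

import "strings"

// seed is a hypothetical pairing of a peer ID with its address.
type seed struct {
	ID   string
	Addr string
}

// parseSeed splits "id@addr" at the first "@". A spec without "@" is
// treated as a bare address with an empty ID (assumption).
func parseSeed(spec string) seed {
	spec = strings.TrimSpace(spec)
	if id, addr, ok := strings.Cut(spec, "@"); ok {
		return seed{ID: id, Addr: addr}
	}
	return seed{Addr: spec}
}
```

For example, `parseSeed("node-2@10.0.0.2:7946")` yields ID `node-2` and address `10.0.0.2:7946`, so the ring can be built from real peer IDs rather than empty ones.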

New features:

- Add HyperCache.DistDrain(ctx) convenience method for graceful shutdown
  without type-asserting through the unexported backend field.
- Add production server binary at cmd/hypercache-server with multi-stage
  distroless Dockerfile and 12-factor HYPERCACHE_* env configuration.
- Add 5-node Docker Compose cluster (docker-compose.cluster.yml,
  replication=3, host ports 8081–8085 / 9081–9085).
- Add Makefile targets: start-dev-cluster / stop-dev-cluster.
- Add integration regression test for id@addr seed-spec propagation
  (tests/integration/dist_seed_spec_test.go).
- Add cluster smoke-test script (scripts/tests/10-test-cluster-api.sh).
- Add `.gitleaksconfig.toml` extending the default gitleaks ruleset with
  a global allowlist for config and test shell files; wire it into the
  `gitleaks.yml` workflow via `GITLEAKS_CONFIG`.
- Completely rewrite `scripts/tests/10-test-cluster-api.sh` to be a
  proper regression suite for the Phase D cluster bugs:
  - Replaces raw `curl` one-liners with reusable helper functions
    (`put_value`, `expect_value`, `expect_404`, `delete_key`) that assert
    both HTTP status codes and response body fields.
  - Collects all failures before exiting (non-short-circuit) so operators
    get a full report in one run.
  - Adds configurable `PORTS`, `WRITE_PORT`, and `DELETE_PORT` env vars
    for flexible local/CI overrides.
  - Phases cover: cluster propagation, wire-encoding fidelity for
    non-owner GETs, and cross-node DELETE propagation.
Replace the custom `.gitleaksconfig.toml` (which extended the default
gitleaks config and defined path-based allowlists) with a `.gitleaksignore`
file that allowlists specific fingerprints for known curl auth header
occurrences in docker-compose and test scripts.

Remove the `GITLEAKS_CONFIG` env var from the GitHub Actions workflow,
allowing gitleaks to use its built-in defaults and pick up the new
`.gitleaksignore` automatically.
Add a GitHub Actions workflow, Makefile target, and supporting
scripts to catch cross-node bugs that in-process unit tests miss.

- .github/workflows/cluster.yml: new CI job that boots the 5-node
  docker-compose stack, waits for all /healthz endpoints, runs the
  assertion script, and dumps container logs on failure
- Makefile: add `test-cluster` target mirroring the CI flow for
  local development; the smoke test's exit code is propagated after teardown
- scripts/tests/wait-for-cluster.sh: polling helper that blocks until
  every node's /healthz returns 200, configurable via PORTS /
  TIMEOUT_SECS / POLL_INTERVAL env vars
- CHANGELOG.md: document all additions under [Unreleased]
- cspell.config.yaml: add healthz to the word list

This specifically guards against the class of regressions that
escaped Phase D review: factory dropping DistMemoryOptions, seeds
without node IDs producing broken rings, and json.RawMessage
mis-encoding on non-owner GET requests.
Add .github/workflows/image.yml to build and publish the
hypercache-server Docker image for linux/amd64 and linux/arm64
via buildx + QEMU.

Trigger behaviour:
- pull_request: build-only (no push) to catch Dockerfile regressions
- push to main: publish :main and :sha-<short>
- semver tag push (v*.*.*): publish :v1.2.3, :1.2.3, :1.2, :1, :latest

:latest is intentionally restricted to semver tag pushes so
production deployments pinning :latest always resolve to a stable
release rather than an in-flight main commit. GHA layer caching
keeps re-builds fast when only Go source has changed.
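
The tag fan-out for a semver push can be illustrated in Go. This only models the mapping listed above and says nothing about how the workflow computes it; `imageTags` is a hypothetical helper, and rejecting anything that is not exactly `vMAJOR.MINOR.PATCH` (pre-release suffixes, for instance) is an assumption.

```go
package main

import "regexp"

// semverTag matches plain vMAJOR.MINOR.PATCH tags only (assumption:
// pre-release and build-metadata suffixes are not published).
var semverTag = regexp.MustCompile(`^v(\d+)\.(\d+)\.(\d+)$`)

// imageTags returns the image tags for a git tag, or nil when the git tag
// is not a plain semver release and therefore must not move :latest.
func imageTags(gitTag string) []string {
	m := semverTag.FindStringSubmatch(gitTag)
	if m == nil {
		return nil
	}
	major, minor, patch := m[1], m[2], m[3]
	return []string{
		"v" + major + "." + minor + "." + patch, // exact git tag
		major + "." + minor + "." + patch,       // full version
		major + "." + minor,                     // minor line
		major,                                   // major line
		"latest",                                // only on release tags
	}
}
```

Keeping `latest` inside the semver-only branch is what ensures a push to main can never move it.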

Also replace stdlib encoding/json with github.com/goccy/go-json in
dist_memory.go and integration tests, update CHANGELOG.md, and add
buildx to the cspell allow-list.