feat(observability): OTel instrumentation + HTTP idle-pool options#28
Open
feat(observability): OTel instrumentation + HTTP idle-pool options#28
Conversation
The default MaxIdleConnsPerHost of 10 caps the keep-alive pool far below the per-worker goroutine count set by settings.Workers, forcing TCP re-dial on most completions and exhausting ephemeral source ports under load. Bumping the defaults alone still leaves the pool static for larger runs, so parameterize the client via an options pattern: - newHttpClient(opts ...HttpClientOption) with defaults 500 / 50 - WithMaxIdleConns(n) overrides the global pool - WithMaxIdleConnsPerHost(n) overrides the per-host pool The sender Worker shares one client across settings.Workers goroutines targeting a single endpoint, so it sizes both to the worker count. Unit tests cover defaults, each option in isolation, and composition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
bdchatham
commented
Apr 21, 2026
Per review: pulling w.workers into the HTTP client at the call site is implicit tuning. Use the raised defaults (500 global / 50 per-host) for now; future follow-up can expose MaxIdleConns[PerHost] as explicit settings.* fields if callers need to tune beyond defaults. Options API is retained — WithMaxIdleConns / WithMaxIdleConnsPerHost remain available for any consumer that wants to opt in from a config path later.
Author
|
Addressed in bb74d4d — dropped the Worker.Run call-site override so we rely on the 500/50 defaults. Kept the WithMaxIdleConns/WithMaxIdleConnsPerHost options available for explicit opt-in later, and the tests still validate the options API so a follow-up that plumbs settings.HttpMaxIdleConns[PerHost] through to newHttpClient has test coverage ready. |
Implements the design in sei-protocol/platform#128 (docs/designs/ sei-load-observability.md). Wires OTel run-scope via Resource (not per-sample labels), full tracing with exemplar support, W3C propagation for future cross-process stitching with seid, and a dual-exporter topology (Prometheus scrape + optional OTLP push). New package: observability - Setup(ctx, Config) installs MeterProvider + TracerProvider, Resource from SEILOAD_* env (RunID, ChainID, CommitID, CommitIDShort, Workload, InstanceID), Prometheus reader, optional OTLP gRPC exporter gated on OTEL_EXPORTER_OTLP_ENDPOINT, and the composite W3C TraceContext+Baggage propagator. - RunScopeFromEnv reads SEILOAD_* env vars. - service.instance.id defaults to hostname when SEILOAD_INSTANCE_ID unset — lets cluster-of-seiload deployments disambiguate instances sharing a RunID/ChainID via a stable per-pod attribute. - Meter(name) / Tracer(name) are thin wrappers around otel.Meter / otel.Tracer so callers can use sync.OnceValue(...) at package scope for lazy acquisition. Fixes the existing NoOp-capture bug where package-init var meter = otel.Meter(...) grabs the NoOp provider before main runs Setup, silently dropping emissions. Migrations: - sender/metrics.go and stats/metrics.go now build their instrument bundle via sync.OnceValue(...) — no more capture-at-init. - sender/worker.go wraps newHttpClient's Transport with otelhttp.NewTransport so outbound requests inject W3C traceparent and auto-emit http.client.* metrics. newHttpTransport split out so tests can inspect the unwrapped transport directly. - sender.sendTransaction, sender.waitForReceipt, sender.watchTransactions (Dial) now emit spans (sender.send_tx, sender.check_receipt, sender.dial_endpoint). Metrics recorded inside these spans automatically get trace-id exemplars under OTel's trace_based exemplar filter (default since SDK v1.28). - main.go wraps runLoadTest in a seiload.run root span; per-request spans inherit from it so a full run is a single trace. New metrics (registered; producers wired where applicable): - seiload.txs.accepted (counter) — per successful submission - seiload.txs.rejected (counter, reason label) — per failed submission - seiload.http.errors (counter, status_code label) — non-200 responses - seiload.tps.achieved (observable gauge) — registered; sample pump is a follow-up - seiload.run.tps.final (gauge) — emitted once at run end via Collector.EmitRunSummary - seiload.run.duration.seconds (gauge) — emitted once at run end - seiload.run.txs.accepted.total (gauge) — emitted once at run end Run-summary metrics are Gauges intentionally: single emission at run end produces exactly one series per metric per run after the Resource join, which is the right shape for cross-run comparison dashboards in the benchmarks namespace. Tests: - observability/setup_test.go covers Resource population from env, hostname fallback for service.instance.id, propagator install, endpoint scheme stripping, and short-commit derivation. - sender/worker_test.go split into TestNewHttpTransport_* (inspects the bare transport factory) and TestNewHttpClient_Smoke (verifies the otelhttp wrap is present without depending on its internals). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lity
Addresses four findings from the tri-specialist review (Kubernetes,
Platform, OpenTelemetry) on this PR:
1. Enable OpenMetrics on the /metrics handler. promhttp.Handler()'s
default opts leave EnableOpenMetrics off, which silently strips
exemplars from responses even when the OTel SDK computed them and
the Prometheus server is configured for exemplar-storage. Switch
to promhttp.HandlerFor(..., HandlerOpts{EnableOpenMetrics: true}).
Pairs with platform's enableFeatures: [exemplar-storage] for the
end-to-end exemplar path to actually deliver trace IDs.
2. Fix shutdown ordering. Provider shutdown is the only path that
flushes the OTLP PeriodicReader and BatchSpanProcessor batches;
shutting the exporters down first made that flush fail silently
and lost the final batch. New order: MeterProvider and
TracerProvider shut down first, cascading their flushes into the
still-live exporters. The Prometheus pull-reader is immune but we
keep the uniform ordering invariant.
3. Drop worker_id from send_latency / receipt_latency histogram
labels. Per-worker fan-out multiplied series by ~50 at realistic
autobake scale, bumping against the 50k cardinality guardrail
before adding any new dimensions. Worker-level investigation is
preserved via the span attributes (worker_id flows through the
trace, accessible from any exemplar pivot).
4. Fix defer cancel() inside the receipt loop. Each transaction's
context deferred its cancel until watchTransactions returned,
accumulating unbounded deferred state across a long run. Cancel
explicitly at end-of-iteration.
All tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The sync.OnceValue bundles and observability.Meter/Tracer indirections were defensive scaffolding — added on the assumption that package-level var meter = otel.Meter(...) would capture the NoOp provider forever. Not true for OTel Go v1.19+: the internal/global package uses a delegation mechanism where meters and instruments created before SetMeterProvider are transparently rebound when the real provider is installed (internal/global/meter.go setDelegate). Current SDK is v1.37, well past that fix. Revert to the standard OTel Go idiom: package-level var meter + instruments as package vars, acquired with otel.Meter / otel.Tracer directly. Matches every upstream OTel Go tutorial and example. - Delete observability/meter.go (wrappers only). - sender/metrics.go: var meter = otel.Meter(...); instruments as package vars; drop the metricsBundle + senderMetrics OnceValue. - sender/worker.go: var tracer = otel.Tracer(...); drop senderTracer OnceValue. - stats/metrics.go: same simplification. - stats/block_collector.go: rename local gasUsed -> gas to avoid shadowing the package-level metric. - stats/run_summary.go: direct metric refs. observability.Setup remains the single-responsibility bootstrap — it still owns Resource construction, propagator install, exporter configuration, and shutdown orchestration. That's where the value actually is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the current stable HTTP semconv attribute names (http.request.method, server.address, etc.) on the auto-generated http.client.* metrics and client spans emitted by otelhttp.NewTransport. Our bespoke seiload_* metrics + span attributes don't depend on these names so no query/dashboard impact; cosmetic alignment with current OTel HTTP semconv. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review feedback on comment density. Principle applied: keep only the non-obvious WHY, delete anything that restates the code. Multi-line rationale moves to observability/README.md where it belongs (invariants on run-scope cardinality, shutdown order, exemplar requirements, cluster-of-seiload extensibility). Also drops the stale Setup docstring paragraph that still referenced the long-deleted observability.Meter/Tracer accessors. No behavior change. Net ~80 lines removed from code comments, 33 lines added to README. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OTel SDK deps (v1.43+) declare go 1.25.0 in their own go.mod, which bumped this module's go directive via `go mod tidy`. The pinned CI Go 1.24.5 and golangci-lint v1.64.8 (built on Go 1.24) can't load configs targeting Go 1.25, so the lint step fails at startup. - .github/workflows/build-and-test.yml: go-version matrix 1.24.5 -> 1.25.1; golangci-lint-action @V3 -> @v6 with v2.1.6 binary (built on Go 1.25). - Dockerfile: golang:1.24.5-bullseye -> golang:1.25.1-bullseye to match. No source changes — pure toolchain bump.
Three in my changes + two pre-existing on main unmasked by the newer default ruleset. Mine: - Move observable-gauge registration (worker_queue_length, tps_achieved) from package vars into init(). The instrument handles are never read by app code — the callbacks run via OTel on each scrape — so holding them in named vars reads as dead code to the unused analyzer. init() with discarded returns makes the side-effect-registration intent explicit. Pre-existing (caught now because we moved from golangci-lint v1.64 to v2.11 whose default enables errcheck + staticcheck ST1005): - stats/logger.go: guard `reportFile.Close()` with `_ =` via deferred closure (errcheck). - sender/ramper.go: lowercase the error-string first letter to match Go idiom (ST1005).
amir-deris
approved these changes
Apr 22, 2026
- Rename OTEL_SERVICE_VERSION → SEILOAD_SERVICE_VERSION (matches the SEILOAD_* run-scope namespace; OTEL_SERVICE_VERSION was never an OTel spec env var anyway, spec uses OTEL_RESOURCE_ATTRIBUTES). - Distinguish receipt_latency description from send_latency. - Document OTel delegation pattern at package-level meter acquisition so the next reader doesn't re-derive why this works before Setup runs. - ethclient.DialContext(dialCtx, ...) so the dial honors the span ctx. - go.mod 1.25.0 → 1.25.1 to match CI toolchain. - Graceful metrics-server shutdown via http.Server + Shutdown, with a 5s timeout ordered after obsShutdown so the final Prometheus scrape path stays up during provider flush. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author
|
@amir-deris thanks for the careful pass. All addressed in 95fe37c:
Lint + tests clean locally. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements the observability design specified in sei-protocol/platform#128 (
docs/designs/sei-load-observability.md), consumed by autobake and the forthcoming benchmarks namespace on harbor.Two bundled pieces of scope, captured in one PR because they share the same code path and dependency set:
newHttpClientvia a functional-options pattern with raised defaults (MaxIdleConns: 500,MaxIdleConnsPerHost: 50) so high-concurrency callers can tune without churning the constructor.observabilitypackage, run-scope Resource, full tracing with exemplar support, W3C context propagation, optional OTLP push exporter, and new application metrics.OTel scope — what's in
New
observabilitypackage (observability/{setup.go,setup_test.go,README.md}):Setup(ctx, Config)installs MeterProvider + TracerProvider with a Resource built fromSEILOAD_RUN_ID,SEILOAD_CHAIN_ID,SEILOAD_COMMIT_ID(+ derivedcommit_id_short),SEILOAD_WORKLOAD,SEILOAD_SERVICE_VERSION, andSEILOAD_INSTANCE_ID(falls back toos.Hostname()so cluster-of-seiload deployments disambiguate by pod name without explicit wiring).OTEL_EXPORTER_OTLP_ENDPOINT).TraceContext + Baggagepropagator installed globally.Package-level meters work before Setup.
sender/metrics.goandstats/metrics.goacquire their meter viavar meter = otel.Meter(...)at package-init time, beforeobservability.Setupinstalls the realMeterProvider. This is safe because OTel Go's global is a delegating provider — meters and instruments acquired against it forward to the real provider onceSetMeterProviderruns. Documented inline so the next reader doesn't have to re-derive it.Transport instrumentation (
sender/worker.go):newHttpClientwraps the base transport withotelhttp.NewTransport(W3C traceparent injection + autohttp.client.*metrics).Tracing (full scope — one-shot):
seiload.runtop-level span inmain.gowrapping the entire invocation.sender.send_tx,sender.check_receipt,sender.dial_endpointchild spans insender/worker.go.New metrics:
seiload.txs.acceptedseiload.txs.rejectedreasonseiload.http.errorsstatus_codeseiload.tps.achievedseiload.run.tps.finalseiload.run.duration.secondsseiload.run.txs.accepted.totalRun-summary gauges are 1-series-per-run-per-metric after the Resource join — the correct cardinality shape for cross-run benchmark dashboards.
Review rounds addressed
Tri-specialist pre-merge sanity check (platform-engineer + kubernetes-specialist + OpenTelemetry-expert):
promhttp.Handler()silently drops exemplars. Default opts leaveEnableOpenMetricsoff → handler never negotiates OpenMetrics regardless of scraper Accept header → SDK-computed exemplars get stripped. Fixed withpromhttp.HandlerFor(..., HandlerOpts{EnableOpenMetrics: true}). Highest-impact finding — without it, the entire exemplar path was inert.send_latency/receipt_latency. Droppedworker_idfrom labels — per-worker fan-out multiplied series by ~50 at realistic autobake scale, risking the 50k guardrail in platform PR #128'sSeiLoadCardinalityHighalert. Worker-level attribution lives on span attributes (accessible via exemplar pivot).defer cancel()inside aforloop inWorker.watchTransactionsaccumulated unbounded deferred state. Explicit cancel at end-of-iteration.@amir-deris review feedback:
OTEL_SERVICE_VERSION→SEILOAD_SERVICE_VERSION(the env var is sei-load's own version, distinct fromseiload.commit_idwhich names what's under test;OTEL_SERVICE_VERSIONisn't an OTel spec env var anyway — the spec path isOTEL_RESOURCE_ATTRIBUTES=service.version=…).receipt_latencydescription distinguished fromsend_latency.sender/metrics.goso the pattern is self-explaining.ethclient.DialContext(dialCtx, ...)instead ofethclient.Dialso the dial honors the span context.go.modbumped to 1.25.1 to match CI toolchain./metricsserver moved tohttp.Server{}+ deferredShutdown(ctx)so the scrape endpoint stays available through the OTel shutdown window.HttpClientOption/WithMaxIdleConns/WithMaxIdleConnsPerHostkept exported intentionally as the opt-in tuning API for a follow-up that plumbssettings.HttpMaxIdleConns[PerHost]through.Test plan
go build ./...cleango test ./...passes across all packagesgolangci-lint run ./...cleanobservability/setup_test.gocovers Resource population from env, hostname fallback forservice.instance.id, propagator install, OTLP gating, scheme stripping, short-commit derivationnewHttpTransport(inner, inspectable) andnewHttpClient(outer, otelhttp-wrapped)Post-merge validation on harbor (via autobake's first run after both PRs land):
/metricsscrape response includes# {trace_id=...}exemplar lines (OpenMetrics content-type negotiation)target_info{seiload_run_id="...",seiload_commit_id_short="..."}series present in harbor Prometheusseiload_run_duration_secondsemitted at run end (AutobakeRunFailed alert clears)Known follow-ups (tracked separately, NOT in this PR)
RecordTPSSampleproducer fromstats.Collectorsoseiload.tps.achievedemits datapoints (currently registered + inert).SEILOAD_INSTANCE_IDvia downward APImetadata.namein autobake Job template (works today via hostname fallback, but explicit downward API is cleaner for cluster-of-seiload).otlpmetricgrpc.WithEndpointURLvs thestripSchemehelper; bumpotelhttpto latest for fresher HTTP semconv.HttpClientOptionthrough fromsettingsso the exported tuning API has a caller.🤖 Generated with Claude Code