Skip to content

refactor(change-buffer)!: replace slot index with span_id, fix segment isolation#2105

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 9 commits into
mainfrom
rochdev/change-buffer-cleanup
Jun 18, 2026
Merged

refactor(change-buffer)!: replace slot index with span_id, fix segment isolation#2105
gh-worker-dd-mergequeue-cf854d[bot] merged 9 commits into
mainfrom
rochdev/change-buffer-cleanup

Conversation

@rochdev

@rochdev rochdev commented Jun 11, 2026

Copy link
Copy Markdown
Member

Summary

  • Segment isolation bug: When service A → B → A creates two segments sharing the same trace_id, trace-level state (SetTraceOrigin, SetTraceMetaAttr, SetTraceMetricsAttr) was keyed by trace_id, so segment 2's writes overwrote segment 1's values before it was flushed. Fixed by adding segment_id (u64) to Create ops and keying SmallSegmentMap by segment_id instead.
  • Slot index removal: Operations carried a u32 slot_index — a JS-managed array index into four parallel Vecs. JS had to allocate/free slot numbers; reusing a slot prematurely would silently drop a live span; ensure_slot had to grow all four Vecs in lockstep. Replaced with span_id (u64) as the natural key into HashMaps, removing the slot allocation contract entirely.
  • Opcode size: Was u64 because JS wrote via setBigUint64, wasting 6 bytes per op with only 17 opcodes. Reduced to u16.
  • SpanHeader removal: materialize_header was an unused direct-write path where JS wrote span fields into a #[repr(C)] struct in WASM memory, bypassing the change buffer protocol. It had no segment_id field and would have reproduced the isolation bug. Deleted along with span_header.rs.
  • Rename TraceSegment: SmallTraceMap/Trace renamed to SmallSegmentMap/Segment since the map holds per-segment state, not per-trace state.

Test plan

  • cargo test -p libdd-trace-utils --lib --features change-buffer -- change_buffer passes (16 tests including 3 new segment isolation tests)
  • cargo check -p libdd-trace-utils --features change-buffer clean

🤖 Generated with Claude Code

@rochdev rochdev marked this pull request as ready for review June 11, 2026 10:22
@rochdev rochdev requested review from a team as code owners June 11, 2026 10:22
@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented Jun 11, 2026

Copy link
Copy Markdown

Tests

🎉 All green!

🧪 All tests passed
❄️ No new flaky tests detected

🎯 Code Coverage (details)
Patch Coverage: 84.59%
Overall Coverage: 73.51% (+0.44%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 953dc2e | Docs | Datadog PR Page | Give us feedback!

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Clippy Allow Annotation Report

Comparing clippy allow annotations between branches:

  • Base Branch: origin/main
  • PR Branch: origin/rochdev/change-buffer-cleanup

Summary by Rule

Rule Base Branch PR Branch Change
unwrap_used 1 0 ✅ -1 (-100.0%)
Total 1 0 ✅ -1 (-100.0%)

Annotation Counts by File

File Base Branch PR Branch Change
libdd-trace-utils/src/change_buffer/mod.rs 1 0 ✅ -1 (-100.0%)

Annotation Stats by Crate

Crate Base Branch PR Branch Change
clippy-annotation-reporter 5 5 No change (0%)
datadog-ffe-ffi 1 1 No change (0%)
datadog-ipc 21 21 No change (0%)
datadog-live-debugger 4 4 No change (0%)
datadog-live-debugger-ffi 10 10 No change (0%)
datadog-profiling-replayer 4 4 No change (0%)
datadog-sidecar 46 46 No change (0%)
libdd-common 13 13 No change (0%)
libdd-common-ffi 12 12 No change (0%)
libdd-data-pipeline 5 5 No change (0%)
libdd-ddsketch 2 2 No change (0%)
libdd-dogstatsd-client 1 1 No change (0%)
libdd-profiling 13 13 No change (0%)
libdd-remote-config 3 3 No change (0%)
libdd-telemetry 20 20 No change (0%)
libdd-tinybytes 4 4 No change (0%)
libdd-trace-normalization 2 2 No change (0%)
libdd-trace-obfuscation 3 3 No change (0%)
libdd-trace-stats 1 1 No change (0%)
libdd-trace-utils 12 11 ✅ -1 (-8.3%)
Total 182 181 ✅ -1 (-0.5%)

About This Report

This report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29b0d7d5bd

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread libdd-trace-utils/src/change_buffer/mod.rs Outdated
Comment thread libdd-trace-utils/src/change_buffer/mod.rs Outdated
@rochdev rochdev changed the title refactor(change-buffer): replace slot index with span_id, fix segment isolation refactor(change-buffer)!: replace slot index with span_id, fix segment isolation Jun 11, 2026
@dd-octo-sts

dd-octo-sts Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Artifact Size Benchmark Report

aarch64-alpine-linux-musl
Artifact Baseline Commit Change
/aarch64-alpine-linux-musl/lib/libdatadog_profiling.a 83.67 MB 83.67 MB -0% (-3.97 KB) 👌
/aarch64-alpine-linux-musl/lib/libdatadog_profiling.so 7.70 MB 7.70 MB 0% (0 B) 👌
aarch64-unknown-linux-gnu
Artifact Baseline Commit Change
/aarch64-unknown-linux-gnu/lib/libdatadog_profiling.so 10.34 MB 10.34 MB +0% (+96 B) 👌
/aarch64-unknown-linux-gnu/lib/libdatadog_profiling.a 94.77 MB 94.77 MB +0% (+1.95 KB) 👌
libdatadog-x64-windows
Artifact Baseline Commit Change
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.dll 24.83 MB 24.82 MB --.03% (-10.00 KB) 💪
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.lib 87.33 KB 87.33 KB 0% (0 B) 👌
/libdatadog-x64-windows/debug/dynamic/datadog_profiling_ffi.pdb 180.86 MB 180.82 MB --.02% (-40.00 KB) 💪
/libdatadog-x64-windows/debug/static/datadog_profiling_ffi.lib 925.02 MB 924.79 MB --.02% (-239.83 KB) 💪
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.dll 8.09 MB 8.09 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.lib 87.33 KB 87.33 KB 0% (0 B) 👌
/libdatadog-x64-windows/release/dynamic/datadog_profiling_ffi.pdb 23.94 MB 23.94 MB 0% (0 B) 👌
/libdatadog-x64-windows/release/static/datadog_profiling_ffi.lib 47.78 MB 47.78 MB +0% (+1.73 KB) 👌
libdatadog-x86-windows
Artifact Baseline Commit Change
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.dll 21.52 MB 21.52 MB +.02% (+5.00 KB) 🔍
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.lib 88.71 KB 88.71 KB 0% (0 B) 👌
/libdatadog-x86-windows/debug/dynamic/datadog_profiling_ffi.pdb 184.87 MB 184.90 MB +.01% (+32.00 KB) 🔍
/libdatadog-x86-windows/debug/static/datadog_profiling_ffi.lib 917.97 MB 917.95 MB -0% (-24.69 KB) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.dll 6.24 MB 6.24 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.lib 88.71 KB 88.71 KB 0% (0 B) 👌
/libdatadog-x86-windows/release/dynamic/datadog_profiling_ffi.pdb 25.66 MB 25.66 MB 0% (0 B) 👌
/libdatadog-x86-windows/release/static/datadog_profiling_ffi.lib 45.41 MB 45.41 MB -0% (-342 B) 👌
x86_64-alpine-linux-musl
Artifact Baseline Commit Change
/x86_64-alpine-linux-musl/lib/libdatadog_profiling.a 74.59 MB 74.59 MB +0% (+2.03 KB) 👌
/x86_64-alpine-linux-musl/lib/libdatadog_profiling.so 8.58 MB 8.58 MB 0% (0 B) 👌
x86_64-unknown-linux-gnu
Artifact Baseline Commit Change
/x86_64-unknown-linux-gnu/lib/libdatadog_profiling.a 90.01 MB 90.02 MB +0% (+1.50 KB) 👌
/x86_64-unknown-linux-gnu/lib/libdatadog_profiling.so 10.44 MB 10.44 MB +.03% (+4.09 KB) 🔍

Comment thread libdd-trace-utils/src/change_buffer/mod.rs Outdated
Comment thread libdd-trace-utils/src/change_buffer/mod.rs Outdated
Comment thread libdd-trace-utils/src/change_buffer/mod.rs Outdated
let (mut span, segment_id) = self
.spans
.remove(span_id)
.ok_or(ChangeBufferError::SpanNotFound(*span_id))?;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this fails, we stop removing the other spans. Is that what we want?

Comment thread libdd-trace-utils/src/change_buffer/mod.rs Outdated
Comment on lines +871 to +877
// Segment 1 — span_id=1: origin → "origin-A" (string id 0)
w.op(OP_CREATE, 1); // span_id=1 in header
w.u128(TRACE_ID);
w.u64(1); // segment_id=1
w.u64(0); // parent_id
w.op(OP_SET_TRACE_ORIGIN, 1);
w.u32(0); // string_id → "origin-A"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this have a helper function?

Comment thread libdd-trace-utils/src/change_buffer/mod.rs
Comment thread libdd-trace-utils/src/change_buffer/mod.rs Outdated
deferred_meta_at(&mut self.deferred_meta, slot)?.insert(key_id, val_id);
if !cache.deferred_meta.is_null() {
// SAFETY: pointer into self.deferred_meta, valid per function contract.
unsafe { &mut *cache.deferred_meta }.insert(key_id, val_id);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we guaranteed to enforce stacked-borrows here?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because nothing else is touching self.spans in any way (creating mutable or immutable references) while we materialize a mutable reference to some span. Basically we could use HashMap::get_mut for the lifetime of the materialized mutable span reference instead, but of course the whole point of the pointer is to avoid having to make a hashmap fetch.

Comment thread libdd-trace-utils/src/change_buffer/mod.rs Outdated
Base automatically changed from yannham/change-buffer-state to main June 12, 2026 14:00
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot requested a review from a team as a code owner June 12, 2026 14:00
@yannham yannham force-pushed the rochdev/change-buffer-cleanup branch from a5fe934 to d278eae Compare June 15, 2026 13:51
rochdev and others added 8 commits June 15, 2026 17:11
… isolation

Segment isolation bug
When service A calls B which calls back to A, both of A's segments
share the same trace_id. The segment-level state map (origin, meta,
metrics) was keyed by trace_id, so SetTraceOrigin/SetTraceMetaAttr/
SetTraceMetricsAttr written for segment 2 would overwrite segment 1's
values before it was flushed. Fix: add segment_id (u64) to the Create
op so each segment gets its own isolated entry in SmallSegmentMap.

Slot index removal
Operations previously carried a u32 slot_index — a JS-managed array
index into parallel Vecs (spans, deferred_meta, deferred_metrics,
segment_ids). JS had to allocate and free slot numbers, silently
reusing a slot would drop a live span, and ensure_slot had to grow all
four Vecs in lockstep on every Create. Replaced with span_id (u64) as
the natural key into HashMaps, removing the slot allocation contract
entirely.

Opcode size
The opcode was encoded as u64 because JS was writing via setBigUint64.
With only 17 opcodes this wastes 6 bytes per op. Reduced to u16.

SpanHeader removal
materialize_header was a direct-write path where JS wrote span fields
into a repr(C) struct in WASM memory, bypassing the change buffer
protocol. It had no segment_id field and would have reproduced the
isolation bug. It was not in use, so deleted along with span_header.rs.

Rename Trace→Segment
SmallTraceMap/Trace renamed to SmallSegmentMap/Segment since the map
holds per-segment state, not per-trace state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…buffer corruption

Co-locate segment_id with its span in spans: HashMap<u64, (Span<T>, u64)>,
removing the separate segment_ids: HashMap<u64, u64>. This eliminates a
second HashMap lookup on every op that needs the segment_id.

Group the 5 cache state variables in flush_change_buffer into a SpanCache<T>
struct, fixing the too_many_arguments clippy lint on interpret_operation_cached.

Fix a buffer-corruption bug (P1) where BatchSetMeta/BatchSetMetric in the
cached path read the count but skipped the payload loop when
cached_deferred_meta/cached_deferred_metrics was null (e.g. after
materialize_span was called between flushes). The unread bytes were then
decoded as the next op's opcode, corrupting all subsequent ops in the buffer.
The fix moves the null-check inside the loop so bytes are always consumed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@yannham yannham force-pushed the rochdev/change-buffer-cleanup branch from f18ba51 to e85b3fd Compare June 15, 2026 15:12

@yannham yannham left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with Roch, but since I also rebased/amended the PR, it'd better to have another pair of eyes (and a stamp).

@paullegranddc paullegranddc left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No objection on principle, but did you benchmark this on dd-trace-js vs the previous implementation?

@rochdev

rochdev commented Jun 17, 2026

Copy link
Copy Markdown
Member Author

No objection on principle, but did you benchmark this on dd-trace-js vs the previous implementation?

The previous implementation was incorrect, so it doesn't matter if it was faster. Also, we shouldn't be relying on people benchmarking on their machines, they should be automated in this repo.

@yannham

yannham commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

The previous implementation was incorrect, so it doesn't matter if it was faster. Also, we shouldn't be relying on people benchmarking on their machines, they should be automated in this repo.

Hey, this is precisely what I'm trying to do this week 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants