perf: append-only segmented novelty — O(batch) write path #1302
Open
bplatz wants to merge 42 commits into
Replace the single-arena novelty (FlakeStore + per-graph merged index vectors) with per-graph append-only Arc<Segment> lists. apply_commit now builds one immutable segment per touched graph and appends it instead of merging the batch into four growing sorted vectors, dropping per-commit write cost from O(accumulated novelty) to O(batch log batch).
- FlakeId becomes a read-scoped packed newtype (g_id|seg|local); get_flake decodes it. Segments are Arc-wrapped so Novelty clone / Arc::make_mut on the commit path copy only pointers, not flakes.
- Reads go through one lazy comparator-ordered k-way merge primitive (GraphMergeIter). slice_for_range returns an owned Vec shim; iter_index and the new iter_flakes concatenate per-graph merges; for_each_overlay_flake drives the merge zero-copy. The merge reproduces the exact pre-segmentation index order, so the novelty equivalence golden digest is unchanged.
- bulk_apply_commits folds existing segments and emits one consolidated segment per graph; clear_up_to drops/keeps/rebuilds segments by t-cutoff.
- runtime_stats reads iter_flakes(POST); the k-way merge's full comparator order keeps its delta-log + graph_subject_classes side table correct across segment boundaries (covered by a new multi-segment-vs-bulk test).
- staged.rs gets its own local FlakeId=u32 for its single-shot arena, decoupled from novelty's now-packed FlakeId.
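The write and read paths described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `Flake` tuple, the `GraphNovelty` type, and `iter_merged` are hypothetical stand-ins, and the natural tuple ordering replaces the real per-index comparators.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::sync::Arc;

/// Hypothetical stand-in for a flake: (subject, predicate, t).
pub type Flake = (u32, u32, i64);

/// An immutable, sorted batch of flakes produced by one commit.
pub struct Segment {
    pub sorted: Vec<Flake>,
}

/// Per-graph novelty. Cloning copies Arc pointers, not flakes.
#[derive(Clone, Default)]
pub struct GraphNovelty {
    pub segments: Vec<Arc<Segment>>,
}

impl GraphNovelty {
    /// O(batch log batch): sort the batch once, then append it.
    /// Existing segments are never touched.
    pub fn apply_commit(&mut self, mut batch: Vec<Flake>) {
        batch.sort_unstable();
        self.segments.push(Arc::new(Segment { sorted: batch }));
    }

    /// Lazy k-way merge: a min-heap holds one cursor per non-empty segment.
    pub fn iter_merged(&self) -> impl Iterator<Item = Flake> + '_ {
        let mut heap = BinaryHeap::new();
        for (seg_idx, seg) in self.segments.iter().enumerate() {
            if let Some(&first) = seg.sorted.first() {
                heap.push(Reverse((first, seg_idx, 0usize)));
            }
        }
        std::iter::from_fn(move || {
            // Pop the smallest head, then advance that segment's cursor.
            let Reverse((flake, seg_idx, pos)) = heap.pop()?;
            if let Some(&next) = self.segments[seg_idx].sorted.get(pos + 1) {
                heap.push(Reverse((next, seg_idx, pos + 1)));
            }
            Some(flake)
        })
    }
}
```

Because segments are immutable and shared through `Arc`, a snapshot clone and the live novelty point at the same allocations; only the `Vec` of pointers is copied.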
Append-only growth (no compaction yet) is the mode that grows segment count, and the packed FlakeId's segment field was only checked by debug_assert! — a release build could silently alias bits past 2^24 segments in one graph. Add check_segment_capacity and enforce it in can_apply and (before any mutation, so it stays atomic) apply_commit, returning an overflow error that triggers a reindex. The local-index field needs no runtime guard: per-segment counts are already bounded by the MAX_SEGMENT_FLAKES check and bulk chunking. Also correct the dedup complexity claim: NoveltyFactState is an imbl::OrdMap, so is_asserted/record are O(log novelty), not O(1) (only clone is O(1)). Document the imbl::HashMap option and its caveat — switching would move key equality from Ord to FlakeValue's Hash/Eq, which treats cross-representation numerics as equal, a semantic change the Long-only equivalence harness would not catch.
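A rough sketch of the packed id and the runtime capacity guard follows. Only the 24-bit segment field comes from the text above; the 16-bit graph id, the 24-bit local index, the `u64` backing type, and the `SegmentOverflow` error are assumptions made for illustration.

```rust
/// Assumed layout for illustration: 16-bit graph id | 24-bit segment | 24-bit local.
const SEG_BITS: u32 = 24;
const LOCAL_BITS: u32 = 24;
const MAX_SEGMENTS: usize = 1 << SEG_BITS;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FlakeId(u64);

/// Hypothetical error type; the real one is said to trigger a reindex.
#[derive(Debug, PartialEq, Eq)]
pub struct SegmentOverflow;

/// Runtime guard (unlike debug_assert!, this also runs in release builds):
/// refuse to append a segment whose index would not fit in the segment field.
pub fn check_segment_capacity(current_segments: usize) -> Result<(), SegmentOverflow> {
    if current_segments >= MAX_SEGMENTS {
        Err(SegmentOverflow)
    } else {
        Ok(())
    }
}

impl FlakeId {
    pub fn pack(g_id: u16, seg: u32, local: u32) -> Self {
        debug_assert!(seg < (1u32 << SEG_BITS) && local < (1u32 << LOCAL_BITS));
        FlakeId(
            ((g_id as u64) << (SEG_BITS + LOCAL_BITS))
                | ((seg as u64) << LOCAL_BITS)
                | local as u64,
        )
    }

    pub fn unpack(self) -> (u16, u32, u32) {
        let local = (self.0 & ((1u64 << LOCAL_BITS) - 1)) as u32;
        let seg = ((self.0 >> LOCAL_BITS) & ((1u64 << SEG_BITS) - 1)) as u32;
        let g_id = (self.0 >> (SEG_BITS + LOCAL_BITS)) as u16;
        (g_id, seg, local)
    }
}
```

Calling the check before any mutation is what keeps the commit atomic: either the segment fits and is appended, or nothing changes and the caller sees the overflow error.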
Measures query latency as a function of novelty segment count (1..40k) over an identical dataset, to quantify the read-side cost of append-only segmentation and inform the compaction strategy (synchronous vs tiered vs reindex-triggered). Same txn pool is grouped into S commits == S segments; auto-indexing off + empty base index means reads are served entirely from the overlay merge. Queries use small/limited result sets so the timed cost is the per-read fan-out setup, not row streaming. Results stream to a file (flushed per cell) so progress is readable mid-run, and repeats are adaptive so a pathological cell (e.g. a join over 40k segments) can't blow the run up.
Times the novelty read primitive (for_each_overlay_flake) directly against a Novelty built with a controlled segment count at fixed total size, isolating the pure k-way-merge fan-out (one binary-search probe per segment per range read) from the O(novelty) query/stats cost that dominates the query-level profiler. point/narrow/full read shapes; same data every config so only fragmentation varies.
Each segment's per-order min/max key (order[0]/order[last]) gives a cheap overlap test: skip a segment in the k-way merge when its key range can't intersect the requested (first, rhs] without paying the two partition_point searches. Correctness-neutral — only segments that would binary-search to empty are skipped (novelty equivalence golden digest unchanged). Big win for point/disjoint-key reads, no effect on overlapping ranges or full scans (nothing to skip), as expected. Measured (m7i.xlarge, 120k subjects, vary segment count): a single-subject SPOT read at 1000 segments 1019us -> 29us (35x); at 40000 segments 47ms -> 2.6ms (18x). A POST read whose predicate spans every segment is unchanged, and full scans are unchanged. Cuts the fan-out constant ~18x but not the O(segments) asymptote (still one cheap probe per segment) — compaction is still needed for the overlapping/full cases.
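The overlap test can be illustrated with plain integer keys. This is a simplified sketch: `overlaps` and `range_slice` are hypothetical names, and `u64` keys stand in for comparator-ordered flakes. The range is half-open on the left, matching the (first, rhs] convention above.

```rust
/// Cheap pre-check: can any key of this sorted segment fall in (first, last]?
/// Only looks at the segment's min and max key.
pub fn overlaps(sorted: &[u64], first: u64, last: u64) -> bool {
    match (sorted.first(), sorted.last()) {
        (Some(&min), Some(&max)) => first < last && max > first && min <= last,
        _ => false,
    }
}

/// Sub-slice of a sorted segment with keys in (first, last].
/// The two binary searches run only when the min/max check passes.
pub fn range_slice(sorted: &[u64], first: u64, last: u64) -> &[u64] {
    if !overlaps(sorted, first, last) {
        return &sorted[..0];
    }
    let start = sorted.partition_point(|&k| k <= first);
    let end = sorted.partition_point(|&k| k <= last);
    &sorted[start..end]
}
```

The pre-check is correctness-neutral in this sketch as well: every segment it rejects would have produced an empty slice from the two `partition_point` calls. A segment whose key range straddles the request still pays both searches, even if the result turns out to be empty.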
…akes When a graph holds exactly one segment (the common steady state after a reindex/compaction), reads bypass the k-way-merge heap entirely and iterate that segment's range slice directly. Split the read primitive into two thin iterator enums sharing the heap for K>1: FlakeRead (yields &Flake, the hot path: iter_flakes / for_each_overlay_flake) materializes NO FlakeId at all, and IdRead (yields packed FlakeId, compat: slice_for_range / iter_index) packs only because the caller wants ids. K=1 recovers ~old single-vector read speed; K>1 still merges. Migrate the hot full-scan callers (runtime_stats already; now the ledger dict-novelty / runtime-small-dict rebuilds and the api dict-novelty population) off iter_index().map(get_flake) onto iter_flakes so they never allocate ids. Golden equivalence digest unchanged; ledger tests pass.
Add Novelty::compact_over(threshold) / compact_all(): rewrite each graph whose segment count exceeds the threshold into one consolidated immutable segment, preserving EVERY flake (each assert and retraction with original t/op/m/g). Only the representation changes — reads collapse to the K=1 fast path; the live multiset, size, t, time-travel semantics, fact_state and the assert/retract log are all unchanged, so it is safe for immutability and never reopens stats semantics (no dedup, no tombstone collapsing). epoch bumps so layout-scoped FlakeIds / epoch-keyed caches refresh. DEFAULT_COMPACTION_THRESHOLD = 128. Policy-query API (segment_count / max_segment_count / needs_compaction) lets callers decide when to compact; nothing is auto-wired into the write path (conservative until compaction timing is measured). Cold/bulk load already yields K=1 per graph (locked by a test). Tests: compaction is observationally transparent (identical reads across all four orders, every graph, and time-travel to_t bounds — before vs after), preserves dedup fact_state, respects the threshold, and bulk load is K=1. Plus a compaction-cost bench example (480k/1M/2M flakes).
Update the segmented-novelty design's compaction section to the decided and implemented approach: structural compact-all, segment-count triggered, preserving every flake. Because nothing is dropped, it is stats-safe by construction; the earlier stats-aware/effective-state prerequisite applied only to a collapsing compaction, which we are not doing. Records measured compaction cost (~1.1-1.4 us/flake; 525ms/1.25s/2.8s at 480k/1M/2M), the policy matrix (cold-load/query-Lambda compact; long-lived background; transactor skip / cost-gated), the amortization caveat (O(N/K), reindex-bounded), and tiered compaction as the cliff-free follow-up. Marks the phased plan steps done and corrects dedup to Option A (NoveltyFactState/imbl).
Wire the compact-all primitive to a real call site: LedgerHandle::snapshot (the query/read path) now runs compact_if_needed first — if any graph exceeds the configured segment threshold it consolidates novelty before serving, so the read and subsequent reads avoid fan-out. Per the agreed policy this lives only on the read/maintenance path: insert-only commits go through LedgerWriteGuard and never trigger it. Threshold is per-handle (AtomicUsize, default DEFAULT_COMPACTION_THRESHOLD=128) and settable via set_compaction_threshold (0 disables — e.g. a latency-sensitive transactor opts out). Common case (K<=threshold) only pays a brief shared-lock check; compaction escalates to the write lock and is idempotent + re-checked across the lock. Tests (white-box, ledger_manager): snapshot keeps segment count bounded at the threshold across 20 incremental commits without losing flakes; threshold 0 disables it. Plus novelty_compaction_lifecycle bench showing the sawtooth — reads creep up with segment count, the threshold-crossing read pays compaction (the spike), reads drop back to K=1, repeating every ~threshold commits. Note: get_or_load bulk-loads to K=1, so fragmentation accumulates in the in-memory cached handle of a long-lived server committing incrementally — which is exactly where this trigger applies.
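The shared-lock check followed by a re-check under the write lock might look like the sketch below. The `Handle` type, its fields, and the `u64` flakes are hypothetical simplifications; only the overall shape (atomic threshold, 0 disables, cheap read-lock check, re-check after escalating) follows the description above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::RwLock;

/// Hypothetical stand-in for a ledger handle holding one graph's segments.
pub struct Handle {
    segments: RwLock<Vec<Vec<u64>>>,
    threshold: AtomicUsize,
}

impl Handle {
    pub fn new(threshold: usize) -> Self {
        Handle { segments: RwLock::new(Vec::new()), threshold: AtomicUsize::new(threshold) }
    }

    pub fn set_compaction_threshold(&self, t: usize) {
        self.threshold.store(t, Ordering::Relaxed);
    }

    /// Write path: append a sorted segment, never compacts.
    pub fn append(&self, mut batch: Vec<u64>) {
        batch.sort_unstable();
        self.segments.write().unwrap().push(batch);
    }

    pub fn segment_count(&self) -> usize {
        self.segments.read().unwrap().len()
    }

    /// Read path: returns true if this call performed a compaction.
    pub fn compact_if_needed(&self) -> bool {
        let threshold = self.threshold.load(Ordering::Relaxed);
        if threshold == 0 {
            return false; // compaction disabled
        }
        // Common case: a brief shared-lock check (guard dropped after this statement).
        let needs = self.segments.read().unwrap().len() > threshold;
        if !needs {
            return false;
        }
        // Escalate, then re-check: another reader may have compacted already.
        let mut segs = self.segments.write().unwrap();
        if segs.len() <= threshold {
            return false;
        }
        let mut merged: Vec<u64> = segs.drain(..).flatten().collect();
        merged.sort_unstable();
        segs.push(merged);
        true
    }
}
```

The re-check under the write lock is what makes concurrent callers idempotent: whichever reader wins the lock compacts, and the rest see a segment count at or below the threshold and return.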
Replace the read-path compact-all trigger with tiered compaction, which removes the growing cliff. size_class(count, T) = floor(log_T(count)) buckets segments by size (derived from flake count — no per-segment level stored/threaded); merging T segments of class K yields one of class K+1. Novelty::tier_compact(T) cascade-merges the lowest full class upward, preserving every flake (structural, no dedup/collapse — stats-safe, observationally transparent). Bounds read fan-out to ~T·log_T(N) with only per-class merge work, not a full rewrite. needs_tier_compaction(T) is the cheap policy check; DEFAULT_TIER_WIDTH = 16. compact_all/compact_over remain as maintenance primitives. LedgerHandle: the read-side trigger (compact_if_needed in snapshot) now runs tier_compact(tier_width); the per-handle knob is tier_width (default DEFAULT_TIER_WIDTH, set_tier_width, 0/1 disables). Still read/maintenance path only; insert-only commits never trigger it. Tests: size_class +1 invariant; tier_compact bounds K logarithmically and preserves every flake; tier compaction is observationally transparent across all orders/graphs/time-travel; snapshot triggers tiered compaction (K stays sub-linear) and tier_width 0 disables. Lifecycle bench updated to tiered — shows bounded K and small, roughly-constant per-read merges vs compact-all's growing 13→29→49 ms cliff.
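The size-class formula can be written with integer division alone, which sidesteps floating-point rounding at exact powers of T. The sketch below is an assumption about shape, not the PR's implementation; in particular the `u32` return type is chosen for illustration.

```rust
/// floor(log_T(count)), computed by repeated integer division.
/// Counts below T (including 0) fall in class 0. T must be at least 2.
pub fn size_class(count: usize, tier_width: usize) -> u32 {
    assert!(tier_width >= 2, "tier width must be at least 2");
    let mut class = 0;
    let mut n = count;
    while n >= tier_width {
        n /= tier_width;
        class += 1;
    }
    class
}
```

The +1 invariant follows from the bucket bounds: T segments with counts in [T^K, T^(K+1)) sum to a count in [T^(K+1), T^(K+2)), which is class K+1. For example, with T = 16, sixteen segments of 20 flakes (class 1) merge into one segment of 320 flakes (class 2).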
Add segmented-novelty section 4.5.1: size-leveled tiered compaction is the read-path strategy (size_class = floor(log_T count); merge T of class K -> one class K+1; tier_compact cascades; bounds fan-out to ~T*log_T(N) with bounded per-merge work). compact-all kept as maintenance. Records measured results (K=31 vs 128, mean 3 ms/merge, one 30.85 ms cascade vs compact-all's growing 13->49 ms) and the inherent size-tiered caveat. Marks phased-plan steps done; config plumbing is the remaining follow-up.
is_multiple_of over manual %==0 in novelty_read_fanout; fix overindented doc list in novelty_segment_microbench.
tier_compact_graph merged ALL segments in a full size class, so the first read after a long insert-only burst could compact a huge class-0 backlog in one stall. Now each call merges exactly tier_width segments of the lowest full class and moves strictly upward (each class processed at most once per call), bounding per-merge AND per-call work to one cascade chain. Steady state still converges immediately (compaction keeps pace with commits); a large backlog drains across subsequent reads instead of stalling one. Preserves every flake and is still observationally transparent. Test: a 100-segment insert-only backlog compacts in bounded steps (first call does one cascade chain, not 25 merges), repeated calls converge to ~log K, no flakes lost.
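One bounded step of the cascade can be modelled on segment flake counts alone. This is a sketch under assumptions: `tier_compact_step` is a hypothetical name, segments are reduced to their counts, and the class helper is repeated here so the block stands alone.

```rust
/// floor(log_T(count)) by repeated integer division (T >= 2).
fn size_class(count: usize, t: usize) -> u32 {
    let (mut n, mut class) = (count, 0);
    while n >= t {
        n /= t;
        class += 1;
    }
    class
}

/// One bounded compaction call over per-segment flake counts.
/// Merges exactly `t` segments of the lowest full class, then only looks
/// at strictly higher classes, so each class is merged at most once.
/// Returns the number of merges performed.
pub fn tier_compact_step(counts: &mut Vec<usize>, t: usize) -> usize {
    assert!(t >= 2, "tier width must be at least 2");
    let mut merges = 0;
    let mut min_class = 0u32;
    loop {
        let max_class = counts.iter().map(|&c| size_class(c, t)).max().unwrap_or(0);
        // Lowest class at or above min_class that holds at least t segments.
        let lowest_full = (min_class..=max_class)
            .find(|&k| counts.iter().filter(|&&c| size_class(c, t) == k).count() >= t);
        let Some(class) = lowest_full else { break };
        // Remove exactly t segments of that class and sum their flakes.
        let (mut taken, mut merged) = (0, 0);
        counts.retain(|&c| {
            if taken < t && size_class(c, t) == class {
                taken += 1;
                merged += c;
                false
            } else {
                true
            }
        });
        counts.push(merged);
        merges += 1;
        min_class = class + 1; // move strictly upward
    }
    merges
}
```

With a backlog of 100 single-flake segments and a tier width of 4, the first call performs one merge and leaves 97 segments; repeated calls drain the rest until four segments remain (64 + 16 + 16 + 4 flakes), with the total flake count unchanged.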
A BinaryRangeProvider attached to the ledger snapshot owns Arc clones of dict_novelty and runtime_small_dicts. Once an index is published, the per-commit Arc::make_mut on those dicts saw strong_count >= 2 and deep-cloned the (growing) dictionaries on every commit -- an O(accumulated-novelty) cost that reappeared right after the first reindex. Detach the provider before the make_mut calls so the dicts are uniquely owned and mutate in place, then rebuild + reattach with the updated dicts. Covers the threaded commit path (commit.rs) and the cached-handle catch-up loop (ledger_manager.rs). Adds a debug-level strong_count probe (target "fluree::cow_probe") at the make_mut point. apply_single_commit sits in fluree-db-ledger (below fluree-db-query) so it cannot reference BinaryRangeProvider; the detach/reattach for that path lives in its only caller, the LedgerManager commit loop, and apply_single_commit gets the probe only.
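The copy-on-write behaviour behind this fix is standard `Arc::make_mut` semantics and can be reproduced in isolation. In the sketch below a `Vec<u64>` stands in for the dictionaries and `push_in_place` is a hypothetical probe, loosely analogous to the strong_count probe mentioned above.

```rust
use std::sync::Arc;

/// Pushes a value through Arc::make_mut and reports whether the mutation
/// happened in place (true) or forced a deep clone into a new allocation (false).
pub fn push_in_place(dict: &mut Arc<Vec<u64>>, value: u64) -> bool {
    let before = Arc::as_ptr(dict);
    // make_mut clones the inner value when another Arc still points at it.
    Arc::make_mut(dict).push(value);
    before == Arc::as_ptr(dict)
}
```

While a second owner (the provider in the description above) holds a clone of the `Arc`, every `make_mut` copies the whole vector; dropping that owner first lets the same call mutate in place.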
resolve_ledger_config ran a '?s rdf:type f:LedgerConfig' query on every transaction (via enforce-unique staging) and every policy/SHACL query. The config object is absent from the dictionary on ledgers without config, so the bound-object filter is dropped and the rdf:type predicate is full-scanned over the novelty overlay -- making each call O(accumulated-novelty) once an index is attached. This was the dominant post-reindex per-commit cost (staging ramped to ~27ms while commit stayed flat). Short-circuit to None when CONFIG_GRAPH_ID has no data in either the novelty overlay (segment_count == 0) or the base index (per-graph stats report no flakes for the config graph). graph_registry always registers the reserved config graph via new_for_ledger, so registry presence is not a data signal -- use stats.graphs flake counts. Falls through to the scan for non-Novelty overlays and when per-graph stats are absent, preserving correctness. K-reindex bench: post-reindex staging drops from ~27ms (ramping O(novelty)) to flat ~0.5ms; with the earlier commit-side fix the indexed-write path is now flat across reindex cycles.
Rebasing the segmented-novelty work off the query-perf layer dropped that layer's 'cargo fmt across the branch' pass, leaving formatting drift. Normalize it.
clear_up_to drops whole segments (no dead-flake arena retained), so a full drain genuinely empties novelty — is_empty() and is_effectively_empty() now agree. The upstream test (from the drained-fast-path fix) asserted is_empty() stays false under the old single-vector arena model; update it to pin the real invariant: drained novelty reads as effectively-empty despite the bumped epoch, which is the gate the post-indexing batched-star-join fast path relies on.
In-process harness for the occasional-burst workload: measures burst absorption (O(N) append vs O(N^2) full-resort-per-commit emulation), read latency vs segment count K by read shape (point/narrow/full/join) against the K=1 baseline, a tier_width mitigation sweep, the adversarial first-read-after-silent-burst case, and a base+overlay dilution sweep approximating the production base-index <-> novelty-overlay merge.
Measures the same queries through the real query engine at four stages — drained baseline, burst peak (K overlay segments), after compact_all (K=1), and after re-draining via reindex — over a real published on-disk base index. Isolates whether the burst read penalty is O(K) (fixed by compaction) or O(overlay-size) (fixed only by draining). Companion to novelty_burst_profile (pure novelty) and novelty_read_fanout (empty base).
Mixed query+update throughput with a live background indexer: indexed BSBM base, then a timed loop interleaving update txns (new product + review) with a realistic Explore mix (50% point lookups, 25% scan, 25% join). Reports aggregate ops/sec + per-type latency + indexer lag/novelty. For comparing main vs the segmented-novelty branch under a production-shape mixed load.
Read-side complement to segmented-novelty: make query-time overlay translation segment-aware (so a write burst stops re-translating the whole overlay per cold query) plus a LIMIT row-budget for eager join lanes. Incorporates review feedback: process-global seg-id + reindex-scoped cache binding, phase-1-raw-Novelty with reasoning overlays as an explicit phase 2, ad-hoc-ephemeral predicate ids made uncacheable, whole-segment cache with window-after-merge, per-segment {ops, untranslated}, byte-bounded LRU.
Add an advisory Operator::set_row_budget (default ABSORB). Row- and order-preserving operators forward it: Offset (+offset, saturating), Project (pass-through), Limit (seeds min(limit, inherited) before opening its child). NestedLoopJoinOperator absorbs it and caps its batched accumulator's first flush at budget.clamp(1024, 100_000) instead of always BATCHED_JOIN_SIZE, growing geometrically back to full (first-flush-only) — so a small LIMIT no longer buffers ~100k left rows before producing a row. Advisory only: a fully drained join yields the identical multiset and order, and Sort/Distinct/GroupAggregate/Bind/Filter/hash-build absorb, so the budget never leaks past a row-dropping or reordering boundary. Validated by 528 query integration tests (SPARQL + JSON-LD), join/limit/offset/project unit tests, and clippy.
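The first-flush cap can be sketched as a sequence of flush sizes. Two things here are assumptions rather than facts from the description: `BATCHED_JOIN_SIZE` is taken to be 100_000 (the text only says "~100k"), and the geometric growth is modelled as doubling. `flush_sizes` and `forward_through_offset` are hypothetical helper names.

```rust
/// Assumed value; the description only says the join buffers "~100k" rows.
const BATCHED_JOIN_SIZE: usize = 100_000;

/// Flush thresholds for the first `n` flushes of a batched join accumulator.
/// With a budget, the first flush is capped at budget.clamp(1024, 100_000);
/// later flushes grow geometrically (doubling is an assumption) back to full.
pub fn flush_sizes(budget: Option<usize>, n: usize) -> Vec<usize> {
    let mut size = match budget {
        Some(b) => b.clamp(1024, BATCHED_JOIN_SIZE),
        None => BATCHED_JOIN_SIZE,
    };
    let mut out = Vec::with_capacity(n);
    for _ in 0..n {
        out.push(size);
        size = size.saturating_mul(2).min(BATCHED_JOIN_SIZE);
    }
    out
}

/// An Offset operator needs `offset` extra rows from its child, so it
/// forwards budget + offset, saturating instead of overflowing.
pub fn forward_through_offset(budget: usize, offset: usize) -> usize {
    budget.saturating_add(offset)
}
```

Because the budget only changes when the accumulator flushes, not which rows it eventually emits, a fully drained join produces the same rows in the same order with or without it.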
Foundation for segment-aware overlay translation. Give each novelty Segment a stable, process-unique seg_id (a process-global AtomicU64 assigned in Segment::build) so a per-segment translation cache keyed on it never collides across ledgers, reloads, diverged novelty, or derived overlays; Arc-clones share it and compaction turns it over. Expose segments to the query layer via two new OverlayProvider methods (overlay_segments + for_each_overlay_segment_flake, the latter unfiltered by to_t) plus OverlaySegmentMeta; default impls report one synthetic whole-overlay segment so non-segmented overlays are unchanged. Observationally transparent: the novelty equivalence harness and both compaction-transparency tests stay byte-identical.
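A process-global counter of this kind is a small amount of code. The sketch below assumes a simplified `Segment` holding `u64` flakes; only the pattern (a static `AtomicU64` bumped in `Segment::build`) is taken from the description.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// One counter for the whole process, shared by every ledger and overlay.
static NEXT_SEG_ID: AtomicU64 = AtomicU64::new(1);

pub struct Segment {
    seg_id: u64,
    pub flakes: Vec<u64>,
}

impl Segment {
    /// Every built segment gets a fresh id, even if its contents match another's.
    pub fn build(flakes: Vec<u64>) -> Arc<Self> {
        // Relaxed suffices: only uniqueness is needed, not ordering with other memory.
        let seg_id = NEXT_SEG_ID.fetch_add(1, Ordering::Relaxed);
        Arc::new(Segment { seg_id, flakes })
    }

    pub fn seg_id(&self) -> u64 {
        self.seg_id
    }
}
```

An `Arc` clone shares the id because it shares the segment, while a compaction that builds a new segment naturally gets a new id, so cache entries keyed on the old ids simply stop being hit.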
Wire BinaryScanOperator::open() to assemble overlay ops from per-segment translations for raw Novelty overlays: translate each immutable segment once (detecting ad-hoc ephemeral ids -> uncacheable), apply to_t after translation, then merge + resolve. Falls back to the whole-graph translate for non-segment-native overlays (one synthetic segment) or any uncacheable segment, so reasoning overlays are unaffected. Slots into the global-cache MISS path so same-epoch repeats stay free. Validated by 528 query integration tests (byte-identical results). Known follow-ups: route the per-segment cache through the in-memory byte-budgeted LeafletCache (currently a temporary entry-count LRU); strengthen the SegmentOpsKey translation binding (store/dict identity beyond store_max_t); k-way merge the per-segment runs instead of concat+sort to make the post-commit cost O(new segment) rather than O(total overlay).
SegmentOpsKey (ledger_id, store_max_t, seg_id, index) was unsound: a new-namespace commit triggers refresh_index, which rebuilds dict_novelty from scratch in POST order (re-ranking subject/string ids) at an UNCHANGED store_max_t while novelty segments (and seg_ids) are preserved -> a cached segment's ops carry pre-rebuild ids while the live dict re-ranked them, yielding wrong/missing rows. store_max_t is a data-coverage watermark, not an identity of the dictionary id-assignment translation reads. Stamp a process-unique store_id on each BinaryIndexStore at construction (the seg_id pattern) and key the segment cache on store_id instead. Every from-scratch dict rebuild constructs a fresh store (new id -> cache bypass -> re-translate); ordinary commits reuse the same store Arc (stable id -> cross-commit reuse preserved, only new segments translate). Verified by a 4-area lifecycle investigation + adversarial review; 528 query integration tests pass.
Replace the placeholder entry-count lru::LruCache(8192) with a byte-weighted moka cache (the same TinyLFU mechanism as the index LeafletCache), bounded by a 128 MiB budget, so the per-segment translation cache is governed by byte pressure rather than a fixed entry count. Dedicated budget for now; folding into the LeafletCache shared 'one pool, one budget' is a noted follow-up. Also reconcile the design doc (review Medium finding): the segment cache accelerates the global-cache MISS path (preserving warm-repeat), it does not bypass it; the bounded overlap of the two caches for one hot epoch is intentional.
The segment-aware overlay path concatenated K already-sorted per-segment runs and re-sorted them with sort_unstable (review High #3). Add sort_overlay_ops_stable (stable, run-adaptive) and use it for the merge: Rust's stable sort detects the K runs and merges them in ~O(n log k) instead of re-sorting. The remaining O(n) copy is inherent; true O(new-segment) needs an incremental/persistent merge, deferred until the (now integer-only) merge is shown to dominate the cached per-segment translation. Also key GlobalTranslationKey on store_id instead of store_max_t, closing the same latent per-view-vs-live divergence the SegmentOpsKey fix addressed (a same-index_t store rebuild re-ranks dict ids at an unchanged store_max_t).
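The concat-then-stable-sort merge can be shown on a toy op type. In this sketch `(key, segment index)` pairs stand in for translated overlay ops and `merge_sorted_runs` is a hypothetical name; the standard library documents its stable slice sort as fast on slices made of concatenated sorted sequences, which is the property relied on here.

```rust
/// Merge K already-sorted runs by concatenating them and applying the
/// standard library's stable sort, which detects pre-sorted runs.
/// Each op is (key, originating segment index) for illustration.
pub fn merge_sorted_runs(runs: &[Vec<(u64, u32)>]) -> Vec<(u64, u32)> {
    // The O(n) copy into one buffer is unavoidable with this approach.
    let mut out: Vec<(u64, u32)> = runs.iter().flatten().copied().collect();
    // Stable: ops with equal keys keep their run (segment) order.
    out.sort_by_key(|&(key, _)| key);
    out
}
```

Stability matters beyond speed: when two segments carry ops with the same key, the earlier segment's op stays first, which an unstable sort does not guarantee.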
Replace the dedicated 128 MiB moka cache for per-segment translated overlay ops with an entry type in the index LeafletCache ('one pool, one budget', TinyLFU), so the translations compete for memory with decoded leaflets / stats views / etc. under a single byte budget instead of a separate pool. Key = xxh3_128(store_id, seg_id, index) via LeafletCache::segment_ops_key (store_id keeps it sound). When no leaflet cache is attached, segments translate fresh (correct, no cross-query reuse). 528 query integration tests pass + clippy clean.
CachedOverlaySegment::byte_size now counts the HashMap table capacity and each ephemeral predicate Sid's Arc<str> name heap (dominant for novelty-only-predicate-heavy overlays) plus the Arc<[T]> control blocks, instead of a flat len*size_of that under-counted them - tightening the shared-budget weighing. Reconcile the design doc: the cursor key window is applied AFTER the merge (at cursor attach), not per-segment during assembly - the full merged product is cached per epoch for warm-repeat reuse and must be window-independent. The selective-cold-query O(total) copy is a known cost; eliminating it needs per-predicate-scoped assembly (deferred gap #1), not window-during-assembly.
…stage + live indexer)
Reworks the in-memory novelty write buffer (the uncommitted-changes overlay that sits between index publishes) from a single merged, sorted index vector, rebuilt on every commit, into an append-only list of immutable per-commit segments. Each commit appends one `Arc<Segment>` per touched graph instead of re-merging the whole novelty index, dropping per-commit write cost from O(accumulated novelty) to O(batch). Reads merge across segments with a comparator-ordered k-way merge that reproduces the exact pre-change ordering.

Performance (before → after)

† Two identical m7i.xlarge boxes, `/dev/shm`, release: before-vs-after segmentation (original write-slope measurement).

‡ This change set, replicated, on a BSBM-shaped simulation (in-repo generator + representative lookup/scan/join + an insert update), driven through the real server path (cached `LedgerHandle` + `stage()` commits + warm `snapshot()` queries + a live background indexer). Not the official BSBM harness.

What changed

- Novelty becomes a per-graph `Vec<Arc<Segment>>`; `apply_commit` builds and appends one immutable segment, with no re-merge of existing data (O(batch) commits). The whole-novelty deep clone for snapshot isolation becomes an O(#segments) pointer copy (immutable segments are never mutated under readers).
- Detach the range provider before `make_mut` (it co-held the dictionary `Arc`s, forcing an O(novelty) deep clone on every commit once an index was published); skip the config-graph resolve when that graph is empty (it full-scanned the `rdf:type` predicate over the growing overlay on every request).

Tradeoffs

- `main` also degrades on broad reads with a non-empty overlay (it shares the overlay); this adds only a small fan-out increment on top. The remaining flat per-query broad-scan cost is a pre-existing base-index issue (untyped-string predicate scans), not introduced here.

Testing

- `novelty_equivalence_contract` unchanged: byte-identical results across all four index orders, graphs, time-travel, and edge cases (reassert-after-retract, duplicate metadata, named graphs). Compaction observational-transparency tests (compact-all + tiered) across all orders/graphs.
- `fluree-db-novelty` / `fluree-db-ledger` / `fluree-db-transact` unit suites green; API integration suites green (RDF 1.2 edge annotations, indexing workflow, enforce-unique-after-index, namespace-after-index, config-graph).