
Stratified GC #6

Open
vaidas-shopify wants to merge 32 commits into master from stratified-gc

vaidas-shopify commented Apr 20, 2026

Previously: Anti-Cruft Generational GC (#5)

What changed vs. the prior PR

The earlier "anti-cruft generational GC" iteration has been reworked
around an explicit stratum model, with the design now mirrored in
Documentation/technical/stratified-gc.adoc. The notable deltas:

  • "Generations" are gone; objects are sorted into a base stratum
    (long-lived, kept-pack-bounded by sidecars) and an active stratum
    (the unstratified remainder that surface-gc walks).
  • Three maintenance tasks instead of two: stratify, surface-gc, and
    a new stratify-prune for retiring anchor refs that have been
    removed from configuration.
  • The closed-set property is now per-repo, not per-anchor. The
    cross-anchor filter dedupes shared history (main + release branches
    + tags pointing at old commits) into the union of all base-stratum
    packs.
  • Cascade-by-timestamp validation has been replaced with independent
    validation: each pack stands or falls on its own
    anchor_commit/ref ancestry check. The cascade design relied on
    stratified_timestamp order matching commit-graph order, which clock
    skew or manual edits can break.
  • stratify-prune relabels orphan packs onto a surviving anchor
    (preserving the kept-pack boundary for surface-gc's walk) rather
    than unlinking sidecars outright, which would have silently broken
    the union.
  • Smaller items: annotated-tag anchors are peeled before commit
    lookup; a companion .keep is written alongside .base-stratum for
    older Git versions; sidecar writes are atomic; batch-size accepts
    unit suffixes; cruft-expiration falls back to gc.pruneExpire;
    both tasks emit trace2 data.

Stratified Garbage Collection with Base-Stratum Packs

Overview

Traditional Git garbage collection walks all objects in the repository to
determine reachability, then repacks everything. This cost scales with the
total repository size and becomes prohibitive for large repositories.

Stratified GC takes a different approach: stratify objects that are known
to be reachable into long-lived base-stratum packs, then scope garbage
collection to only the unstratified remainder (the active stratum). This
bounds GC cost to the size of the active stratum, not the total repository.

....
Traditional GC:
Walk ALL objects -> classify -> repack everything
Cost: O(total repository)

Stratified GC:
Stratify objects reachable from stable refs -> base-stratum pack
GC walks only unstratified objects -> classify -> repack remainder
Cost: O(unstratified objects) << O(total repository)
....

Object classification

Objects in a repository are classified into three tiers:

Tier 1: Base-stratum packs (base stratum)::
Objects reachable from configured anchor refs, identified by an
.base-stratum sidecar file. GC treats these as reachable without
walking into them.

Tier 2: Regular packs (active stratum)::
Objects not yet stratified: recent commits, feature branches, fetched
objects. Managed by geometric repack and subject to surface-gc
reachability walks.

Tier 3: Cruft packs (unreachable, awaiting expiration)::
Objects proven unreachable during surface GC. Same .mtimes mechanism
as today. Expired by cruft expiration.
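
The three tiers can be told apart purely from the sidecar files next to each packfile. The following is a minimal sketch of that classification, not the actual pack-discovery code; `classify_pack` is a hypothetical helper name, and the check order (base stratum before cruft) is an assumption.

```python
from pathlib import Path


def classify_pack(pack_path: Path) -> str:
    """Return the tier of a packfile based on which sidecar files exist.

    Hypothetical helper: mirrors the three tiers described above.
    """
    # Tier 1: a .base-stratum sidecar marks the pack as part of the base stratum.
    if pack_path.with_suffix(".base-stratum").exists():
        return "base-stratum"
    # Tier 3: a .mtimes sidecar marks a cruft pack.
    if pack_path.with_suffix(".mtimes").exists():
        return "cruft"
    # Tier 2: everything else is a regular pack in the active stratum.
    return "regular"
```

For example, `classify_pack(Path("objects/pack/pack-1234.pack"))` would report `"base-stratum"` only if `pack-1234.base-stratum` sits next to the pack.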

Base-stratum packs

A base-stratum pack is a standard packfile accompanied by a .base-stratum
sidecar file that records the reachability proof. A companion .keep
file is written alongside it so older Git versions (without
.base-stratum awareness) still treat the pack as kept and exclude it
from their repacks.

Anchor refs are user-configured refs that represent stable, long-lived
history:


....
maintenance.stratified.anchor = refs/heads/main
maintenance.stratified.anchor = refs/heads/release/v1
maintenance.stratified.anchor = refs/tags/v1.0
....

Anchors may be branches or annotated tags; tag anchors are peeled to
their target commit before lookup.

Only objects reachable from commits older than maintenance.stratified.min-age
(default: 2.weeks.ago) are stratified. This avoids stratifying objects from recent
commits that might still be rewritten.

.base-stratum file format


The `.base-stratum` file uses the following binary format (version 1):

....
Bytes 0-3:    signature (0x53545241 = "STRA", network byte order)
Bytes 4-7:    version (1, network byte order)
Bytes 8-11:   hash_id (1=SHA1, 2=SHA256, network byte order)
Next N bytes: anchor_commit OID (raw, 20 or 32 bytes)
Next 4 bytes: stratified_timestamp (network byte order)
Next:         anchor_ref name (NUL-terminated string)
Next N bytes: trailing checksum of all preceding data
....

Sidecars are written atomically (write to temp, fsync, rename) so a
concurrent reader never sees a torn file.
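
The write-to-temp, fsync, rename sequence can be illustrated with a short sketch. This is not the code from the patch; `write_atomically` is a hypothetical name, and the temporary file is assumed to live in the same directory as the target so the rename stays on one filesystem.

```python
import os
import tempfile


def write_atomically(path, data):
    """Write data to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file shares the target's directory, keeping the rename atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap into place
    except BaseException:
        # On any failure, remove the temp file and leave the target untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```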

The anchor commit and ref are recorded so the pack's reachability guarantee
can be validated: if the anchor ref no longer points to a descendant of the
anchor commit, the pack must be demoted back to a regular pack.
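
To make the layout concrete, here is a hedged sketch of an encoder and parser for the version 1 format. The function names are hypothetical, and one detail is an assumption: the trailing checksum is taken to be a digest computed with the hash algorithm named by `hash_id` (20 bytes for SHA-1, 32 for SHA-256), which the format description does not state explicitly.

```python
import hashlib
import struct

SIGNATURE = 0x53545241  # "STRA" in network byte order
# hash_id -> (hashlib name, OID length). Using the same algorithm for the
# trailing checksum is an assumption made for this sketch.
HASH_ALGOS = {1: ("sha1", 20), 2: ("sha256", 32)}


def encode_sidecar(hash_id, anchor_commit, timestamp, anchor_ref):
    """Serialize the fields in the documented order and append the checksum."""
    algo, oid_len = HASH_ALGOS[hash_id]
    if len(anchor_commit) != oid_len:
        raise ValueError("anchor_commit has the wrong length for hash_id")
    body = struct.pack(">III", SIGNATURE, 1, hash_id)
    body += anchor_commit
    body += struct.pack(">I", timestamp)
    body += anchor_ref.encode() + b"\0"
    return body + hashlib.new(algo, body).digest()


def parse_sidecar(data):
    """Parse and verify a version 1 sidecar; raise ValueError on any problem."""
    if len(data) < 12:
        raise ValueError("file too short")
    signature, version, hash_id = struct.unpack_from(">III", data, 0)
    if signature != SIGNATURE or version != 1 or hash_id not in HASH_ALGOS:
        raise ValueError("not a version 1 .base-stratum file")
    algo, oid_len = HASH_ALGOS[hash_id]
    if len(data) < 12 + oid_len + 4 + 1 + oid_len:
        raise ValueError("file too short")
    body, checksum = data[:-oid_len], data[-oid_len:]
    if hashlib.new(algo, body).digest() != checksum:
        raise ValueError("checksum mismatch")
    pos = 12
    anchor_commit = body[pos:pos + oid_len]
    pos += oid_len
    (timestamp,) = struct.unpack_from(">I", body, pos)
    pos += 4
    end = body.index(b"\0", pos)  # ValueError if the NUL terminator is missing
    return {
        "hash_id": hash_id,
        "anchor_commit": anchor_commit,
        "stratified_timestamp": timestamp,
        "anchor_ref": body[pos:end].decode(),
    }
```

A reader that gets a `ValueError` here would treat the sidecar as unreadable, which under the validation rules means the pack is demoted.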

Detection
~~~~~~~~~

Pack discovery checks for the `.base-stratum` sidecar the same way it checks for
`.mtimes` to set `is_cruft`. The `packed_git` struct carries an `in_base_stratum`
bit flag.

Base-stratum packs are included in the multi-pack-index like any other pack.
No MIDX format changes are needed.


The stratify maintenance task
-------------------------------

The `stratify` task incrementally stratifies objects from regular packs into
base-stratum packs. For each configured anchor ref, the task:

1. Resolves the ref to a commit (peeling annotated tags).
2. Finds the most recent `anchor_commit` from any existing base-stratum
   pack that is a strict ancestor of this ref's tip
   (`find_stratified_ancestor`) — across all anchors, not just this one.
3. Runs `git rev-list --objects --reverse --before=<min-age> <tip> [^<bound>]`
   to enumerate unstratified reachable objects.
4. Feeds the object list to `git pack-objects` to create a new base-stratum pack,
   filtering against every existing base-stratum pack so shared history
   between anchors is stratified exactly once.
5. Writes the `.base-stratum` sidecar and companion `.keep`, recording
   the last commit in the output as the anchor commit (not the ref tip).

The incremental walk bounded by `^<bound>` means subsequent runs only
process objects newer than what was previously stratified, and the
cross-anchor lookup means anchors sharing history don't re-walk the
shared prefix. The walk cost is proportional to the new objects, not
the entire history.
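
Steps 2 and 3 can be sketched on a toy commit graph. This is an illustration under stated assumptions, not the patch's `find_stratified_ancestor`: the graph is a plain dict of parent lists, `pick_bound` is a hypothetical name, and "most recent" is interpreted here as the latest commit date.

```python
def ancestors_of(tip, parents):
    """Return every commit reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def pick_bound(tip, pack_anchor_commits, parents, commit_date):
    """Pick the newest recorded anchor_commit that is a strict ancestor of tip.

    pack_anchor_commits comes from all base-stratum packs, across all anchors.
    """
    strict_ancestors = ancestors_of(tip, parents) - {tip}
    candidates = [c for c in pack_anchor_commits if c in strict_ancestors]
    return max(candidates, key=commit_date.__getitem__, default=None)


def rev_list_argv(tip, min_age, bound=None):
    """Build the rev-list command line from step 3."""
    argv = ["git", "rev-list", "--objects", "--reverse",
            f"--before={min_age}", tip]
    if bound is not None:
        argv.append(f"^{bound}")  # exclude history that is already stratified
    return argv
```

On the first run there is no bound, so the walk covers everything older than min-age; on later runs the `^<bound>` argument trims the walk to the new commits.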

Anchor refs are deduplicated at both the config level (string-equal
entries) and at the OID level (different refs pointing at the same
commit within a single run).
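
The two deduplication levels can be sketched as follows. `unique_anchors` and the `resolve` callback are hypothetical; skipping anchors that fail to resolve is an assumption of this sketch, not something the text specifies.

```python
def unique_anchors(configured, resolve):
    """Deduplicate anchors by ref name, then by resolved commit OID.

    configured: ref names in config order.
    resolve:    callable mapping a ref name to an OID, or None if missing.
    """
    seen_names, seen_oids, result = set(), set(), []
    for name in configured:
        if name in seen_names:  # config-level duplicate (string-equal)
            continue
        seen_names.add(name)
        oid = resolve(name)
        if oid is None or oid in seen_oids:  # unresolved, or same commit
            continue
        seen_oids.add(oid)
        result.append((name, oid))
    return result
```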

Batch size control
~~~~~~~~~~~~~~~~~~

The `maintenance.stratified.batch-size` option (default: 0, unlimited) limits
the number of objects stratified per anchor ref per run. It accepts
plain integers and unit suffixes (`100k`, `2m`). With `--reverse`, oldest
objects come first, so truncation stratifies the oldest batch and defers newer
objects to subsequent runs.
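
As a rough illustration of the suffix handling, the sketch below parses values such as `100k` or `2m`. It assumes Git's usual config convention that `k` and `m` are multiples of 1024 and case-insensitive; `parse_batch_size` is a hypothetical name, not the function used by the patch.

```python
def parse_batch_size(value):
    """Parse a batch-size string such as "0", "100k" or "2m" into an int."""
    units = {"k": 1024, "m": 1024 ** 2}  # assumed 1024-based, as in Git config
    text = value.strip().lower()
    factor = 1
    if text and text[-1] in units:
        factor = units[text[-1]]
        text = text[:-1]
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"invalid batch size: {value!r}")
    return int(text) * factor
```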

The `.base-stratum` sidecar records the last commit actually included in the
batch (not the ref tip), so the next run's `^<bound>` exclusion
correctly continues from where the previous batch ended.

This is important for the sliding time window: even without batch truncation,
the anchor commit is always the last commit from the `--before=<min-age>`
bounded output. As time passes and the min-age window advances, newly
eligible commits appear between the recorded frontier and the new cutoff,
and are picked up on the next run.
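
The interaction between batch truncation and the recorded frontier can be sketched with a toy model. Two simplifications are assumptions of this sketch: truncation falls on commit boundaries, and a first commit larger than the batch is still taken so that every run makes progress. `take_batch` is a hypothetical name.

```python
def take_batch(commits_oldest_first, batch_size):
    """Select the oldest commits whose objects fit within batch_size.

    commits_oldest_first: list of (commit_id, [new object ids]), oldest first.
    batch_size:           maximum object count, 0 meaning unlimited.
    Returns (objects, frontier), where frontier is the last commit included.
    """
    objects, frontier = [], None
    for commit, new_objects in commits_oldest_first:
        over_limit = batch_size and len(objects) + len(new_objects) > batch_size
        if over_limit and frontier is not None:
            break  # defer this commit and everything newer to a later run
        objects.extend(new_objects)
        frontier = commit
    return objects, frontier
```

The returned frontier is what the sidecar would record as the anchor commit, so the next run resumes with the commits after it.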

Validation
~~~~~~~~~~

Before creating new base-stratum packs, the task validates existing ones.
For each base-stratum pack:

1. Load `.base-stratum` to get `anchor_ref` and `anchor_commit`.
2. Resolve the anchor ref. If deleted, demote the pack.
3. Check if `anchor_commit` is an ancestor of the current ref tip.
   If not (history was rewritten), demote the pack.

Demotion means removing the `.base-stratum` and `.keep` sidecars. The
`.pack` and `.idx` files stay on disk and are absorbed by the next
geometric repack on its normal cadence.
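
The validation rules can be sketched on a toy graph. This is an illustration only: the real check goes through `repo_in_merge_bases()`, whereas `is_ancestor` and `packs_to_demote` below are hypothetical helpers operating on plain dicts.

```python
def is_ancestor(parents, ancestor, tip):
    """True if ancestor is reachable from tip (a commit counts as its own ancestor)."""
    stack, seen = [tip], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return False


def packs_to_demote(packs, refs, parents):
    """Return the names of base-stratum packs that fail validation.

    packs: list of dicts with "name", "anchor_ref" and "anchor_commit".
    refs:  mapping of ref name to current tip; a missing key means deleted.
    """
    demote = []
    for pack in packs:
        tip = refs.get(pack["anchor_ref"])
        # Demote if the ref is gone or history no longer contains the anchor.
        if tip is None or not is_ancestor(parents, pack["anchor_commit"], tip):
            demote.append(pack["name"])
    return demote
```

Each pack is judged on its own sidecar, so one failing pack does not drag others down with it.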

Independent validation
^^^^^^^^^^^^^^^^^^^^^^

Each base-stratum pack is validated independently against the current
ref tip. There is no cascade-by-timestamp ordering: it relied on a
fragile assumption that `stratified_timestamp` order matches
commit-graph order, which clock skew, manual sidecar edits, or future
timestamp-inheriting code paths can break.

After demotion, the stratify task re-stratifies from scratch on the
next run. The cross-anchor coverage lookup ensures any prefix of the
ref's history that's still pinned by some other anchor's pack is not
re-walked.

This is cheap: one ancestor check per base-stratum pack via
`repo_in_merge_bases()`, accelerated to near-constant time by the
commit graph.

Cross-anchor coverage and the closed-set property
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When two anchors share history (a release branch that branched from
main, a tag pointing at an old main commit, etc.) the shared objects
are stratified once into whichever pack lands first; later packs in
the same run, and packs created by subsequent runs, filter against
the union of *all* base-stratum packs and only carry their own
unique-to-anchor deltas.

The closed-set property therefore holds at the *repository* level —
the union of all base-stratum packs is closed under reachability from
any configured anchor — rather than per anchor. Maintaining the
*union* as the closed set across an anchor retirement is the
responsibility of `stratify-prune` (see below).

The same cross-anchor lookup powers `find_stratified_ancestor()`,
which the per-anchor stratify loop uses to bound its rev-list: if some
other configured anchor's pack already covers a prefix of this
anchor's history, walking that prefix again is wasted work.
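
A toy model shows why the union, rather than each pack, is the closed set. The sketch treats packs as sets of object IDs; `new_pack_contents` is a hypothetical stand-in for the cross-anchor filter.

```python
def new_pack_contents(reachable, existing_packs):
    """Keep only the objects not already present in any base-stratum pack.

    reachable:      object ids reachable from one anchor, in walk order.
    existing_packs: iterable of sets, one per existing base-stratum pack.
    """
    covered = set()
    for objects in existing_packs:
        covered |= objects
    return [oid for oid in reachable if oid not in covered]


# main and a release branch share objects A and B.
main_reachable = ["A", "B", "C"]
release_reachable = ["A", "B", "D"]

packs = []
packs.append(set(new_pack_contents(main_reachable, packs)))     # {A, B, C}
packs.append(set(new_pack_contents(release_reachable, packs)))  # {D}
```

The second pack holds only `D` and is not self-contained, yet the union of both packs still covers everything reachable from either anchor.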


The surface-gc maintenance task
-------------------------------

The `surface-gc` task performs lightweight garbage collection scoped to the
active stratum:

1. Collects all base-stratum packs and passes them as `--keep-pack` with
   `--kept-pack-boundary` to `git repack`. The reachability walk treats
   kept packs as traversal boundaries: when the walk hits a commit, tree,
   or blob in a kept pack, it stops traversing — it does not process
   parents or recurse into trees. This is safe because of the closed-set
   property (see below).
2. Runs `git repack -d -l --cruft --cruft-expiration=<expiration>` on the
   remaining (unstratified) objects.
3. Reachable unstratified objects are repacked into a new regular pack.
   Unreachable objects go into a cruft pack. Expired unreachable objects
   are dropped.
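
The resulting command line can be sketched as below. `surface_gc_argv` is a hypothetical helper, and `--kept-pack-boundary` is the option proposed by this series rather than an existing `git repack` flag; the pack names are assumed to be file names without a directory, as `--keep-pack` expects.

```python
def surface_gc_argv(base_pack_names, expiration):
    """Build the repack invocation used by surface-gc (sketch)."""
    argv = ["git", "repack", "-d", "-l", "--cruft",
            f"--cruft-expiration={expiration}"]
    if base_pack_names:
        # Proposed by this series: stop the walk at objects in kept packs.
        argv.append("--kept-pack-boundary")
        argv.extend(f"--keep-pack={name}" for name in base_pack_names)
    return argv
```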

Readiness check
~~~~~~~~~~~~~~~

Before running, surface-gc verifies that the stratify task has caught up
sufficiently. For each configured anchor ref, the commit date of the
last-stratified commit must be newer than the combined threshold of
`maintenance.stratified.min-age` plus `maintenance.stratified.grace-period`
(default: `1.week.ago`).

The readiness lookup (`stratified_frontier_date`) considers both
ancestors *and* descendants of the ref tip: a pack whose
`anchor_commit` descends from this tip fully contains the tip's
reachables, so the frontier on this anchor clamps at the tip's own
committer date (relevant for tags whose entire history is already
covered by main's pack).

If any anchor's stratification frontier is older than this cutoff,
surface-gc is skipped, because the active stratum would still be too large
for surface-gc to save meaningful work over a full repack. This is particularly relevant
when `batch-size` is set and stratify needs multiple runs to fully pin
an anchor's history.
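
The basic comparison behind the readiness check can be sketched with plain dates. This covers only the simple case: the descendant clamp described above is not modeled, treating an anchor with no frontier as "not ready" is an assumption, and `surface_gc_ready` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def surface_gc_ready(frontier_dates, now,
                     min_age=timedelta(weeks=2),
                     grace_period=timedelta(weeks=1)):
    """Return True if every anchor's frontier is newer than the cutoff.

    frontier_dates: mapping of anchor ref to the commit date of its
                    last-stratified commit, or None if nothing is stratified.
    """
    cutoff = now - (min_age + grace_period)
    return all(date is not None and date > cutoff
               for date in frontier_dates.values())


now = datetime(2026, 4, 20, tzinfo=timezone.utc)
caught_up = {"refs/heads/main": datetime(2026, 4, 5, tzinfo=timezone.utc)}
lagging = dict(caught_up)
lagging["refs/heads/release/v1"] = datetime(2026, 3, 1, tzinfo=timezone.utc)
```

With the defaults the cutoff is three weeks before `now`, so `caught_up` passes while `lagging` causes surface-gc to skip its run.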

Why surface-gc is cheap
~~~~~~~~~~~~~~~~~~~~~~~

Surface-gc passes `--kept-pack-boundary` to `git repack`, which tells
`git pack-objects` to treat objects in kept (base-stratum) packs as
traversal boundaries. When the revision walk encounters a commit in a
kept pack, it skips parent processing entirely — no ancestors are queued.
Similarly, when tree or blob traversal encounters an object in a kept
pack, it marks it seen and does not recurse further.

This relies on the *closed-set property*: the union of all base-stratum
packs is closed under object reachability. Every object transitively
reachable from any configured anchor is guaranteed to be in some
base-stratum pack, because base-stratum packs are created by
`rev-list --objects` which enumerates the full transitive closure, and
the cross-anchor filter ensures shared history is stratified exactly
once into the union. The `^<bound>` exclusion means individual packs
are not self-contained, but the union is. `stratify-prune` preserves
this invariant when retiring an anchor by relabeling the orphan pack's
sidecar onto a surviving anchor instead of unlinking it.

If main's history is stratified up to 2 weeks ago, the walk only needs to
traverse approximately 2 weeks of commits before hitting stratified objects
and stopping. Both enumeration and writing are bounded by the active
stratum size.

....
  Full GC walk depth:        entire history (years)
  Surface GC walk depth:     min-age window (weeks)

  Full GC enumeration:       all objects
  Surface GC enumeration:    only unstratified objects

  Full GC rewrite:           all objects
  Surface GC rewrite:        only unstratified objects
....
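
The boundary behavior can be illustrated with a small commit-only walk. This is a sketch on a toy graph, not the pack-objects traversal; `walk_active_stratum` is a hypothetical name, and trees and blobs are left out for brevity.

```python
def walk_active_stratum(tips, parents, kept):
    """Walk from tips, treating commits in kept packs as boundaries.

    Returns the commits visited outside the kept set, in visit order.
    """
    visited, seen, stack = [], set(), list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        if commit in kept:
            continue  # boundary: do not queue parents of a kept commit
        visited.append(commit)
        stack.extend(parents.get(commit, ()))
    return visited


# Linear history c1 <- c2 <- ... <- c6 with c1..c4 already stratified.
parents = {"c1": [], "c2": ["c1"], "c3": ["c2"],
           "c4": ["c3"], "c5": ["c4"], "c6": ["c5"]}
kept = {"c1", "c2", "c3", "c4"}
```

Here a walk from `c6` touches `c6` and `c5`, meets the boundary at `c4`, and never queues `c3` or older commits.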


The stratify-prune maintenance task
-----------------------------------

`stratify-prune` retires base-stratum packs whose anchor has been
removed from configuration. Auto-demoting orphans inside
`stratify`'s validation pass would amplify a config typo into hours
of work and tens of GB of churn (every pack for the mistyped anchor
demoted in one run, then re-stratified from scratch under the
correctly-named anchor), so detection and demotion are split:

* `stratify` (and `surface-gc`, when it runs) emits a trace2
  `orphan-anchor` event and a one-line warning naming the affected
  pack count and the prune task to invoke.
* `stratify-prune`, wired into `geometric_strategy` at
  `SCHEDULE_WEEKLY`, performs the actual demotion. It carries no
  `auto_condition`, so `git maintenance run --auto` never triggers it.
* If no anchors are configured at all and base-stratum packs exist on
  disk, `stratify-prune` refuses (rather than demoting every pack in
  one shot). If no base-stratum packs exist either, it is silent — so
  it can sit in the default flow without spamming non-stratified
  repos.

Relabeling, not unlinking

Because the closed-set property is per-repo (not per-anchor), an
orphan pack may exclusively hold objects that some surviving anchor's
pack references as ancestors. Unlinking the orphan's sidecars
outright would break surface-gc's --kept-pack-boundary walk — it
stops at surviving kept packs and never reaches into a now-regular
orphan pack — and the next cruft repack would misclassify the shared
ancestors as cruft and prune them at expiration.

stratify-prune therefore relabels each orphan pack's
.base-stratum sidecar to claim a surviving anchor as its
anchor_ref. The pack stays kept, joins the survivor's group, and is
later folded into the survivor's pack(s) by consolidate-stratum's
geometric merge.

The relabel target is chosen per orphan pack (not once per run):
the recorded anchor_commit must be reachable from the chosen
anchor_ref's tip, because the next stratify validation pass demotes
any pack whose anchor_commit is not an ancestor of its ref. Three
strategies are tried in order:

  1. anchor_commit already on a surviving anchor's history — the
    commit is preserved verbatim; the frontier date is unchanged.
  2. shares some history with a surviving anchor — the sidecar is
    rewritten with the merge-base of the orphan's anchor_commit and
    the chosen surviving tip. The merge-base lies on both histories,
    and the orphan pack — which covers everything reachable from its
    original anchor_commit — necessarily covers the merge-base. The
    recorded frontier date drops to the merge-base's date, which is
    harmless: future stratify runs find their own packs' later
    anchor_commits via find_stratified_ancestor and use those as
    the rev-list bound.
  3. no shared history at all (independently-rooted anchors,
    orphan branches, docs/gh-pages branches, vendored subtrees with
    their own root) — the sidecar borrows a surviving anchor's own
    anchor_commit from one of its existing base-stratum packs. The
    cross-anchor filter operates at the OID level, so any tree or
    blob with a colliding hash (empty tree, identical LICENSE,
    .gitignore, lockfile blob, .gitkeep) may be deduplicated
    across unrelated histories. Borrowing keeps validation passing
    without poisoning find_stratified_ancestor's frontier.

Only when no surviving anchor has any base-stratum pack of its own is
outright demotion taken: in that case there is no kept set whose
closure could be broken, and the orphan's objects are reachable from
its still-live ref (or already unreachable, in which case normal GC
handles them correctly).
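
The fallback order can be sketched on a toy graph. Everything here is illustrative: `choose_relabel` is a hypothetical name, the merge-base is approximated as the common ancestor with the latest commit date, and survivors are tried in dict order, none of which is taken from the patch.

```python
def ancestors(commit, parents):
    """Return every commit reachable from commit, including itself."""
    seen, stack = set(), [commit]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, ()))
    return seen


def choose_relabel(orphan_commit, survivors, parents, commit_date):
    """Pick (strategy, anchor_ref, anchor_commit) for one orphan pack.

    survivors: mapping of surviving ref -> {"tip": oid,
               "pack_anchor_commits": [oids recorded by its own packs]}.
    """
    orphan_history = ancestors(orphan_commit, parents)
    # Strategy 1: the recorded commit is already on a survivor's history.
    for ref, info in survivors.items():
        if orphan_commit in ancestors(info["tip"], parents):
            return ("verbatim", ref, orphan_commit)
    # Strategy 2: shared history, so record a merge-base instead.
    for ref, info in survivors.items():
        common = orphan_history & ancestors(info["tip"], parents)
        if common:
            base = max(common, key=lambda c: commit_date[c])
            return ("merge-base", ref, base)
    # Strategy 3: unrelated roots, so borrow a survivor's own anchor_commit.
    for ref, info in survivors.items():
        if info["pack_anchor_commits"]:
            return ("borrow", ref, info["pack_anchor_commits"][0])
    # No surviving anchor has a base-stratum pack: demote outright.
    return ("demote", None, None)
```

In every relabel case the chosen commit is an ancestor of the chosen ref's tip, which is what the next validation pass requires.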

Integration with geometric repack

Geometric repack skips base-stratum packs the same way it already skips cruft
packs. Base-stratum packs are excluded from the geometric merge sequence and
survive geometric repack unchanged.

Configuration

maintenance.stratified.anchor::
Multi-valued. Refs to use as anchors for base-stratum packs. Each
value must be an exact ref name; branches and annotated tags are
both accepted. Duplicate entries are deduplicated. The stratify
task is a no-op if no anchors are configured. No default.

maintenance.stratified.min-age::
Approxidate. Only stratify objects from commits older than this.
Default: 2.weeks.ago.

maintenance.stratified.batch-size::
Integer with optional unit suffix (k, m). Maximum objects to
stratify per anchor per run. 0 means unlimited. Default: 0.

maintenance.stratified.cruft-expiration::
Approxidate. Cruft expiration threshold for surface-gc. Passed as
--cruft-expiration to linkgit:git-repack[1]. Falls back to
gc.pruneExpire if unset. Default: 2.weeks.ago.

maintenance.stratified.grace-period::
Approxidate. How far the stratify task can lag behind before
surface-gc skips its run. The actual cutoff is min-age plus this
value. Default: 1.week.ago.

Steady-state behavior

....
Week 1, Day 1:
geometric-repack merges small regular packs (same as today)
stratify task stratifies objects reachable from main (>2 weeks old)
-> first base-stratum pack created

Week 1, Day 2-6:
geometric-repack continues managing regular packs
stratify task incrementally stratifies new objects crossing min-age
-> base-stratum packs grow (or new small ones created)

Week 1, Day 7:
surface-gc runs:
-> walks only unstratified objects (recent history)
-> unreachable unstratified objects -> cruft pack
-> expired cruft -> dropped
-> cost: proportional to ~2 weeks of history, not years

stratify-prune runs (if any anchor was dropped from config):
-> relabels orphan packs onto a surviving anchor
-> consolidate-stratum folds them into the survivor on a later run

Week 2+:
base-stratum packs ~ all of main's reachable history
regular packs ~ last 2 weeks of objects
cruft packs ~ 0-2 weeks of unreachable objects
surface-gc cost ~ constant (bounded by min-age window)
....

Safety properties

The system is safe as long as validation is conservative (demote on any
doubt) and the per-repo closed-set property is preserved across anchor
retirements:

  • A wrongly-demoted base-stratum pack becomes a regular pack. Its objects
    are subject to normal GC, which correctly classifies them.
  • Each base-stratum pack is trusted only as far as its own
    anchor_commit/anchor_ref validation allows.
  • Force-push on main demotes affected base-stratum packs for main. The
    next GC correctly identifies newly-unreachable objects.
  • Anchor removal from config does not silently demote: stratify-prune
    is required, and relabels orphans onto a surviving anchor instead of
    breaking the closed-set union.
  • A periodic full GC can serve as a safety net to catch any objects that
    might have been incorrectly retained.

Risks and trade-offs

Delta compression across strata::
Objects in base-stratum packs are delta-compressed within the pack.
Objects in regular packs cannot delta against base-stratum pack objects
(different pack-objects invocation). This may increase total size
slightly vs. a single full repack. The cross-pack delta loss is bounded
by the active stratum size.

Base-stratum pack proliferation::
Each anchor ref per run could create a new small base-stratum pack.
Mitigation: consolidate-stratum merges small base-stratum packs
using the same geometric progression as the active stratum.

Correctness of the skip optimization::
When surface-gc keeps base-stratum packs via --keep-pack, it assumes
every object in those packs is reachable. If validation misses a case
where this is not true, unreachable objects could survive indefinitely.
Mitigation: periodic full GC as a safety net.

vaidas-shopify force-pushed the stratified-gc branch 3 times, most recently from e251214 to baf21c7 on April 20, 2026 13:29

Introduce the .base-stratum sidecar file format that identifies a pack
as belonging to the base stratum — one containing objects known to be
reachable from configured anchor refs. This is the foundation for
Stratified GC, where base-stratum (stable, archival) objects are
skipped during GC walks over the active stratum.

The .base-stratum file stores:
  - The anchor commit OID from which reachability was proven
  - The anchor ref name used for validation
  - A stratified timestamp recording when the pack was created

Detection follows the same pattern as .mtimes for cruft packs: during
pack discovery in add_packed_git(), the presence of a .base-stratum
file sets the in_base_stratum bit on the packed_git struct.

Also add .base-stratum to the list of extensions cleaned up by
unlink_pack_path().

Exclude packs with the in_base_stratum flag from geometric repack, same
as cruft packs are already excluded. Base-stratum packs are not subject
to routine repack churn and should not be merged or reorganized by the
regular geometric progression.

Add a new "stratify" maintenance task that incrementally moves objects
reachable from configured anchor refs into base-stratum packs. This is
the core of Stratified GC: objects in base-stratum packs form the
archival layer and can be skipped during future GC walks.

The task:
  - Reads anchor refs from maintenance.stratified.anchor (multi-valued)
  - Respects maintenance.stratified.min-age (default: 2.weeks.ago)
  - For each anchor ref, finds the last-stratified commit from existing
    .base-stratum packs to avoid re-walking already-stratified history
  - Uses rev-list --objects --before=<min-age> to find eligible objects
  - Packs them via pack-objects and writes a .base-stratum sidecar

The auto-condition triggers whenever anchor refs are configured,
making this a no-op when the feature is not in use.

Add validation that runs at the start of each stratify maintenance
task. For each existing base-stratum pack, verify:

  1. The .base-stratum file can be loaded
  2. The anchor ref still exists
  3. The recorded anchor commit is an ancestor of the current ref tip

If any check fails, the pack is demoted to a regular pack by removing
the .base-stratum file. This handles force-pushes and ref deletions
gracefully — demoted packs re-enter the normal geometric repack
pipeline and their objects will be correctly classified by the next GC.

Uses "git merge-base --is-ancestor" for the ancestry check, which is
near-constant-time when a commit-graph exists.

Add a new "surface-gc" maintenance task that performs lightweight
garbage collection scoped to the active stratum only. Base-stratum
packs (the base stratum) are passed to repack via --keep-pack,
so the reachability walk and object rewrite only cover unstratified
objects.

This achieves the key benefit of Stratified GC: GC cost is
proportional to the active stratum size (recent objects), not the
total repository size. When main's history has been stratified by the
stratify task, surface-gc only walks a few weeks of history.

The task uses "git repack -d -l --cruft --cruft-expiration=<exp>"
with --keep-pack for each base-stratum pack. Unreachable objects in
the active stratum are moved to cruft packs; expired ones are
dropped.

The auto-condition requires both base-stratum packs and regular packs
to exist, making this a no-op before the first stratify run.

Configurable via maintenance.stratified.cruft-expiration (default: 2.weeks.ago).

Add the new Stratified GC tasks to the geometric maintenance
strategy schedule:

  - stratify: daily (after geometric-repack, stratifies old objects)
  - surface-gc:  weekly (prunes unreachable from active stratum)

The resulting geometric strategy schedule is:

  hourly:  commit-graph
  daily:   geometric-repack, pack-refs, stratify
  weekly:  rerere-gc, reflog-expire, worktree-prune, surface-gc

Tasks execute in enum order, so geometric-repack runs before
stratify (which needs consolidated packs to stratify efficiently),
and surface-gc runs last (after reflog-expire has made objects
unreachable).

Both tasks are no-ops when maintenance.stratified.anchor is not
configured, so existing users see no behavioral change.

Add documentation for the new Stratified GC maintenance tasks
and their configuration options:

  - maintenance.stratified.anchor
  - maintenance.stratified.min-age
  - maintenance.stratified.batch-size
  - maintenance.stratified.cruft-expiration
  - maintenance.stratified.grace-period

The task descriptions are added to git-maintenance.adoc and the
config entries to config/maintenance.adoc.

Base-stratum packs for the same anchor ref form an incremental chain
via ^<last_stratified> exclusion — each pack depends on earlier packs
for completeness. When a pack fails validation (e.g., force-push on
the anchor ref), demote all packs for that ref with equal or later
stratified_timestamp to preserve the closed-set property.

Surface-gc currently enumerates all objects in the repository even
though it only rewrites the active stratum. This is because
--keep-pack prevents rewriting kept-pack objects but the reachability
walk still traverses into them.

Add --kept-pack-boundary (internal, hidden) to pack-objects and
repack. When set, the revision walk stops at commits in kept packs
(skipping parent processing), and tree/blob traversal skips objects
found in kept packs. This is safe because the union of all stratify
packs is closed under reachability — cascade demotion ensures this
invariant holds after force-pushes.

Surface-gc passes --kept-pack-boundary when base-stratum packs exist,
bounding both enumeration and rewrite cost to the active stratum.

A repository may use both full gc and surface-gc, so having two
independent configs for the same cruft expiration threshold is
error-prone. Make surface-gc fall back to gc.pruneExpire when
maintenance.stratified.cruft-expiration is unset, so that a single
config controls both paths by default.

Add trace2 instrumentation to the three Stratified GC
maintenance tasks to aid debugging and performance analysis:

- stratify: log anchor count, per-anchor regions with objects
  stratified/skipped/already-packed counts, and batch truncation status
- consolidate-stratify: log group count, per-group regions with
  pack totals and merge counts
- surface-gc: log stratifying readiness, expiration value, and kept pack
  count

Change batch-size from int to unsigned long and use
repo_config_get_ulong() so that users can specify values like
"10k" or "1m" in maintenance.stratified.batch-size.

Commit <abbrev> ("<subject>") made add_object_entry_from_pack() call
add_pending_oid() for every commit in a --stdin-packs pack before
want_object_in_pack(). That is necessary for STDIN_PACK_EXCLUDE_OPEN
('!') packs: their commits fail the want check but must still seed
the revision walk so we discover objects reachable from open-excluded
packs that are not present in closed-excluded ('^') packs.

But the change applied unconditionally, regressing the default
--stdin-packs (INCLUDE) path. When a commit in an included pack also
lives in an excluded ('^') pack, it now becomes a revision walk tip
even though want_object_in_pack() would reject it. In a geometric
repack, where excluded packs hold most of history, the walk no longer
stops at the excluded-pack boundary and instead descends through the
entire repository. On large repos this stretched the walk from
seconds to hours. The output was unchanged: in the default mode the
extra traversal only affects delta-reuse hints (show_object_pack_hint);
the packed object set is identical.

Thread a stdin_pack_cb_data through the for_each_object_in_pack()
callback so add_object_entry_from_pack() knows whether to seed
pending objects eagerly. Set eager_pending only for EXCLUDE_OPEN
packs, restoring the pre-regression behavior for INCLUDE packs:
commits become walk tips only after passing want_object_in_pack().

The buffer for pack auxiliary file paths is sized for ".promisor",
which used to be the longest suffix written via xsnprintf. After
".base-stratum" was added in 5c5a585 (pack-base-stratum: introduce
.base-stratum sidecar for stratified packs), the four extra bytes
overflow the buffer and trip the xsnprintf length check, aborting any
command that enumerates packs (e.g. "git count-objects -v", "git log")
with:

    BUG: wrapper.c:678: attempt to snprintf into too-small buffer

Size the allocation for ".base-stratum" instead, and update the
adjacent comment.

When the same gitconfig file is pulled in by two overlapping
includeIf rules (e.g. one matching gitdir, another matching
hasconfig:remote.*.url), every multi-valued key in it is appended
twice. maintenance.stratified.anchor then comes back from
repo_config_get_string_multi() with duplicates, and both
maintenance_task_stratify() and stratify_stratifying_caught_up()
walk the raw list — re-running rev-list/pack-objects against an
already-stratified frontier and emitting redundant trace2 regions.

Add load_unique_stratify_anchors() to copy the list into a
caller-owned string_list, skipping entries already present, and
use it from both call sites. Restructure the readiness check
to a single exit so the list is always released.

Both maintenance_task_stratify() and maintenance_task_consolidate_stratum()
invoke pack-objects with a fixed basename of "pack/base-stratum",
relying on pack-objects to append "-<pack_hash>.pack". The pack hash is
derived from pack contents, so two anchors that produce identical
contents — e.g. two refs configured under maintenance.stratified.anchor
that point at the same commit, or two histories whose pre-divergence
reachable object sets coincide once trimmed by --before=<min-age> —
collide on the same pack path and the same .base-stratum sidecar path.

write_pack_base_stratum() opens the sidecar with O_TRUNC, so the second
run silently overwrites the first anchor's sidecar. find_last_stratified_commit()
and filter_already_packed_oids() both key off the sidecar's recorded
anchor_ref, so the displaced anchor sees no prior frontier and re-walks
its full history on every subsequent run, producing duplicate packs and
defeating incrementality.

Introduce format_base_stratum_pack_basename(), which appends an 8-byte
digest of anchor_ref to the basename. Each anchor gets its own filename
namespace, so identical pack contents no longer share a path. Use the
helper from both call sites and reuse the same prefix to locate the
produced pack for sidecar writing.

The fix is on-disk compatible. Pack discovery in add_packed_git() keys
on .base-stratum sidecar existence, not filename pattern, and every
lookup path matches by adata.anchor_ref rather than by basename — so
pre-fix packs continue to be recognized and looked up. New packs and
old packs coexist; consolidate-stratum naturally migrates old names to
the new scheme during its next merge. Repos that already suffered a
collision self-heal: the displaced anchor finds no matching sidecar,
falls through to a full stratification, and writes under the new
anchor-scoped name.

Add t7901-maintenance-stratify.sh covering two anchors at the same
commit (each must produce its own sidecar with the correct anchor_ref)
and a re-run case (a run with no new commits must not produce additional
packs). Both tests fail without this change.

validate_stratify_packs() demotes a base-stratum pack only when the
recorded anchor_ref no longer resolves or the recorded anchor_commit
is no longer an ancestor of the current ref. It does not check whether
the anchor_ref is still listed under maintenance.stratified.anchor.
A pack whose anchor was removed from config but whose ref still exists
therefore persists indefinitely: stratify never updates it (its anchor
is not in the configured list and so is never walked), and validate
never demotes it. Surface-gc continues honoring it as a kept-pack
boundary, the geometric repack continues skipping it, and disk creeps
up across each anchor rotation.

Auto-demoting on validate would close the gap but would amplify a
config typo. At the scale this feature targets (millions of commits,
hundreds of GB), a single mistyped ref name in
maintenance.stratified.anchor would silently demote every pack for
that anchor in one run; the next stratify would then walk the full
reachable history of the (correctly-named) ref from scratch and emit
a fresh pack. That is hours of work and tens to hundreds of GB of
churn for a typo. So detection and demotion are split.

Detection is folded into validate_stratify_packs(). Before the
existing per-group cascade pass, each pack group whose anchor_ref is
not in the configured list emits a trace2 "orphan-anchor" data event
and (when not quiet) a one-line warning that names the count of
affected packs and the prune task to invoke. Cascade validation is
skipped for orphan groups: their anchor_ref may have been deleted
along with the config entry, in which case the existing logic would
demote anyway, and pruning is the right tool for that.

Demotion lives in a new stratify-prune task. It loads the configured
anchor list, iterates collect_base_stratum_pack_groups() output, and
calls remove_pack_base_stratum() on every pack whose group anchor_ref
is not configured. Only the .base-stratum and .keep sidecars go away
— the .pack and .idx files stay on disk and become regular packs that
the next geometric-repack absorbs on its normal cadence. The
demotion is therefore reversible until that next repack.

The task is wired into geometric_strategy at SCHEDULE_WEEKLY so
fleets running scheduled maintenance pick up orphans on their own
without any operator action. It carries no auto_condition, so
"git maintenance run --auto" never triggers it; explicit invocation
via --task=stratify-prune or a configured weekly schedule is
required.

Two safety properties matter when no anchors are configured at all:

  * a repo that has never used stratification (no base-stratum packs
    on disk) must produce no output, since stratify-prune is now in
    the default manual flow via geometric_strategy and would
    otherwise spam every "git maintenance run" invocation;

  * a repo that has base-stratum packs but lost its anchor config
    must not have every pack demoted in one shot.

repo_has_base_stratum_packs() distinguishes these. The first case
returns silently; the second emits a warning and refuses.
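
The empty-config decision can be sketched as a three-way precheck. The
enum and function names below are hypothetical; only the three outcomes
come from the description above.

```c
#include <stddef.h>

enum prune_action {
	PRUNE_SILENT_NOOP, /* never stratified: return without output */
	PRUNE_REFUSE,      /* packs exist but no anchors: warn, demote nothing */
	PRUNE_PROCEED      /* anchors are configured: go look for orphans */
};

/*
 * Decide what stratify-prune does before it inspects any pack group.
 * has_base_stratum_packs plays the role of repo_has_base_stratum_packs().
 */
enum prune_action prune_precheck(size_t n_configured_anchors,
				 int has_base_stratum_packs)
{
	if (n_configured_anchors)
		return PRUNE_PROCEED;
	return has_base_stratum_packs ? PRUNE_REFUSE : PRUNE_SILENT_NOOP;
}
```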

Document the task alongside stratify and surface-gc in
git-maintenance(1), describing the warn-but-don't-demote semantics
of stratify and the empty-config refusal of stratify-prune.

Cover with five new cases in t7901-maintenance-stratify.sh: orphan
detection without demotion, prune demoting only orphan sidecars
(pack count preserved, configured anchor untouched), prune as no-op
when nothing is orphaned, prune refusing when no anchors configured,
and trace2 verification that --schedule=daily excludes the task and
--schedule=weekly includes it.

load_unique_stratify_anchors() collapses duplicate entries in
maintenance.stratified.anchor by ref name only. Two refs that resolve
to the same commit therefore both get processed: each runs its own
git rev-list --objects, each invokes pack-objects, each writes a
sidecar. With the recent anchor-scoped basename fix the resulting
packs no longer collide on disk, but the work is still duplicated:
identical rev-list output, identical (modulo basename) pack-objects
output, identical reachable object sets sitting in two places. At
the scale this feature targets (millions of commits, hundreds of GB)
that doubles every overlap.

The case is realistic: a release branch cut from main and added to
the anchor list points at HEAD, so both anchors resolve to the same
commit until main moves; tags-as-anchors aligned with their branch
do the same; mirror configurations carry the same ref under multiple
names indefinitely.

Track resolved tip OIDs in a struct oidset across the per-anchor
loop. After resolving each anchor, check the set; if the OID is
already present from a prior anchor in this run, log
skipped/reason=duplicate-oid via trace2 and continue. Otherwise
insert and proceed with stratification as before. The first-listed
anchor wins; followers are no-ops for that run only.
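
A minimal sketch of the first-listed-wins rule follows. It uses plain
strings for OIDs and a linear scan where the real code uses a struct
oidset; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct anchor {
	const char *ref;     /* configured ref name */
	const char *tip_oid; /* hex OID the ref resolved to */
};

/*
 * Set do_run[i] to 1 for the first anchor resolving to a given tip and
 * to 0 for every later anchor resolving to the same tip. Returns the
 * number of skipped anchors.
 */
size_t mark_duplicate_tips(const struct anchor *a, size_t n, int *do_run)
{
	size_t skipped = 0;

	for (size_t i = 0; i < n; i++) {
		do_run[i] = 1;
		for (size_t j = 0; j < i; j++) {
			if (do_run[j] && !strcmp(a[j].tip_oid, a[i].tip_oid)) {
				do_run[i] = 0; /* duplicate-oid: skip this run */
				skipped++;
				break;
			}
		}
	}
	return skipped;
}
```

The skip only lasts for one run: once the tips diverge, the follower no
longer matches any earlier entry and is processed normally.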

Naively skipping a follower would break stratify_stratifying_caught_up().
The readiness check looks up the most-recent stratified commit by
matching adata.anchor_ref against the anchor name, finds nothing for
a deduped follower, returns "no stratified commits yet", and surface-gc
gets skipped indefinitely. Extend the lookup with an OID-equivalence
fallback: when no sidecar's anchor_ref matches, also accept a sidecar
whose anchor_commit equals the follower's current tip. The primary's
pack covers the identical reachable history, so it serves as evidence
that the follower is also caught up.

Once the anchors diverge in OID (e.g., main advances past the release
branch), the follower is no longer in the seen set and stratifies on
its next run. There is one cycle of "stratify from scratch" while the
follower has no own sidecar and the OID-fallback no longer matches
anything — that is the unavoidable cost of having previously skipped
its first run, and is bounded.

Update t7901-maintenance-stratify.sh to assert the new semantics:
two anchors at the same commit produce a single pack with the
first-listed anchor recorded in the sidecar, and the
duplicate-skip message is emitted. Restructure the orphan/prune
tests to use anchors at *different* commits via a new
setup_two_distinct_anchors helper, since dedup would otherwise
collapse them and remove the preconditions those tests rely on.

Document the behavior in git-maintenance(1) under the stratify task.
…writes

write_pack_base_stratum() opens both the .base-stratum sidecar and
its companion .keep with xopen(O_WRONLY | O_CREAT | O_TRUNC, 0444).
The 0444 mode is intentional — these files are pack metadata and
should not be casually edited — but it makes the function fragile
under any rewrite scenario. Once such a file exists, a fresh
xopen(O_WRONLY) on the same path fails with EACCES because the
file's mode forbids writes regardless of the mode argument to
open() (which only applies to newly-created files).

That EACCES is reachable in practice. With maintenance.strategy
set to "geometric", "git commit" calls run_auto_maintenance(), which
spawns a detached "git maintenance run --auto --detach" that runs
the stratify task in the background. A subsequent explicit
"git maintenance run --task=stratify" also runs the stratify task.
Both feed the same anchor through the same rev-list and the same
pack-objects, both produce the same deterministic pack hash, and
both target the same <basename>.base-stratum and <basename>.keep
paths. The first writer creates the files at mode 0444; the second
writer dies with:

  fatal: could not open '...base-stratum' for writing: Permission denied

leaving the explicit invocation as a hard failure on every workflow
that pairs commit-driven auto-maintenance with manual stratify runs.
The same failure mode applies to any sequential rerun where the new
content happens to map to a hash that already has a sidecar on disk.

Switch to atomic temp + rename for both files. Two small helpers
isolate the pattern:

  - begin_atomic_write() opens "<path>.tmp-<pid>" with O_CREAT|O_EXCL
    after a best-effort unlink of any stale leftover from a prior
    crash, and returns the fd plus the temp name.
  - finish_atomic_write() rename(2)s the temp over the final path;
    rename overwrites atomically and does not consult the
    destination's mode, so neither concurrent writers nor sequential
    reruns trip over the existing 0444 file.
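
The temp-plus-rename pattern can be sketched in plain POSIX as below.
This is an illustration under assumptions, not the helpers from this
commit: it folds begin and finish into one hypothetical function,
atomic_write_sketch(), and omits the hashfd() checksum handling.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `data` by writing a temp file
 * and renaming it into place. Returns 0 on success, -1 on error.
 * A short write is treated as an error.
 */
int atomic_write_sketch(const char *path, const void *data, size_t len)
{
	char tmp[4096];
	int n, fd;

	n = snprintf(tmp, sizeof(tmp), "%s.tmp-%ld", path, (long)getpid());
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	unlink(tmp); /* best effort: drop a stale temp from a prior crash */

	/* Mode 0444 limits later opens; this fd itself stays writable. */
	fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0444);
	if (fd < 0)
		return -1;

	if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	/* rename(2) needs directory write access, not a writable target. */
	if (close(fd) < 0 || rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;
}
```

Calling it twice on the same path succeeds both times, even though the
first call leaves a mode-0444 file behind.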

The .base-stratum path uses the helpers via hashfd(); the trailing
checksum and CSUM_CLOSE/CSUM_FSYNC behaviour is unchanged. The .keep
path goes through the same helpers — its content is empty, but the
existence is what matters and the same EACCES bug applied to it.

Both writers landing on the same path produce byte-identical content
apart from the stratified_timestamp (a 4-byte field updated each
run); whichever rename lands last wins, and the result is still a
valid sidecar pointing at the correct anchor_commit and anchor_ref.
The closed-set property and lookup paths are unaffected.

Add a regression test in t7901-maintenance-stratify.sh that runs
stratify, removes the .base-stratum sidecar (leaving its mode-0444
.keep behind), and reruns stratify. The new run re-stratifies and
must overwrite the existing 0444 .keep alongside the missing
.base-stratum. Without this change the second run dies on .keep
with EACCES; with it the rewrite succeeds and the sidecar is
restored.

validate_stratify_packs() sorts each anchor's pack group by
stratified_timestamp ascending and, on the first failure, cascade-
demotes every remaining entry in the sorted order. The optimisation
is sound only under an unwritten precondition: that the timestamp
order matches the commit-graph order of each pack's anchor_commit,
i.e. that a later-timestamped pack always has an anchor_commit
descended from any earlier-timestamped pack's anchor_commit.

That precondition is fragile. It holds for the natural flow of one
maintenance.stratify task per machine producing monotone-clock
timestamps, but it is broken by any of:

  - a system clock rollback between stratify runs (NTP correction,
    manual `date -s`, container clock skew),
  - a manually-touched .base-stratum file (the format is documented
    and there is no signature gating off-line edits),
  - any future code path that synthesises a timestamp on a merged
    pack — consolidate-stratum already inherits the *latest* input
    timestamp today, but a benign change to inherit the median or
    earliest value would silently invert the assumption,
  - a future per-pack format change that allows two packs in the
    same group to land within the same wall-clock second (the
    timestamp granularity).

When the precondition breaks, cascade-by-timestamp validates the
out-of-order pack first, finds it invalid against the current ref,
and demotes every entry "after" it in the sort — which can include
packs whose anchor_commit *is* still an ancestor of the ref. The
per-anchor loop then re-stratifies from scratch (because
find_last_stratified_commit returns NULL) and silently resurrects
the demoted-but-valid pack at the same path. From the outside the
repository looks fine; internally the closed-set property has been
re-derived rather than preserved, an unnecessary rev-walk has run
across the whole anchor history, and any clients holding open file
handles to the original .pack file are racing with its replacement.

The cost of getting the cascade wrong scales with the size of the
anchor's history. The cost of avoiding it is one ancestor lookup
per pack in the group — bounded by consolidate-stratum's geometric
split, typically two to four packs. That is unconditionally cheap.

Drop the cascade. Validate every pack in a configured anchor's
group independently against the current ref tip via
validate_single_base_stratum_pack(). The orphan-anchor branch is
unchanged (orphans are still surfaced via warning + trace2 but not
demoted; stratify-prune is the explicit cleanup path). Remove the
QSORT call and its comparator cmp_base_stratum_entry_timestamp(),
which now have no remaining users in this file. Rewrite the
function-level comment to spell out the new contract and the
reasoning for not relying on stratified_timestamp ordering.
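
To make the difference concrete, here is a toy model of the two
regimes. The struct and function names are hypothetical, and the
ancestry check is reduced to a precomputed flag; only the control flow
mirrors the description above.

```c
#include <stddef.h>
#include <stdlib.h>

struct pack_entry {
	unsigned long timestamp; /* stratified_timestamp from the sidecar */
	int anchor_is_ancestor;  /* 1 if anchor_commit is an ancestor of the ref */
	int demoted;             /* output: 1 when the pack gets demoted */
};

static int cmp_timestamp(const void *a, const void *b)
{
	const struct pack_entry *x = a, *y = b;

	return (x->timestamp > y->timestamp) - (x->timestamp < y->timestamp);
}

/* Old regime: sort by timestamp, demote from the first failure onward. */
size_t validate_cascade(struct pack_entry *p, size_t n)
{
	size_t demoted = 0;
	int failed = 0;

	qsort(p, n, sizeof(*p), cmp_timestamp);
	for (size_t i = 0; i < n; i++) {
		if (!p[i].anchor_is_ancestor)
			failed = 1;
		p[i].demoted = failed;
		demoted += failed;
	}
	return demoted;
}

/* New regime: every pack stands or falls on its own ancestry check. */
size_t validate_independent(struct pack_entry *p, size_t n)
{
	size_t demoted = 0;

	for (size_t i = 0; i < n; i++) {
		p[i].demoted = !p[i].anchor_is_ancestor;
		demoted += p[i].demoted;
	}
	return demoted;
}
```

With a valid pack at timestamp 100 and an invalid pack whose timestamp
was rewritten to 1, the cascade demotes both while independent
validation demotes only the invalid one.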

Add a regression test in t7901-maintenance-stratify.sh: produce two
packs P1 and P2 for the same anchor in natural timestamp order,
overwrite P2's stratified_timestamp with a tiny value so it sorts
ahead of P1, rewind the ref so that P2's anchor_commit is no longer
an ancestor while P1's is, and run stratify. Independent validation
demotes P2 only and emits one per-pack warning; the buggy cascade
demotes both and emits the unique "cascade-demoting" message.
Path-existence and sidecar-count assertions cannot tell the regimes
apart on their own, because the per-anchor re-stratification rebuilds
P1 at the same path with byte-identical content and fools those
checks, so the test pins the diagnostic by grepping err for the
cascade message instead.

Add a small Perl helper set_sidecar_timestamp() to the test
script that flips the timestamp field and rewrites the trailing
checksum, so the manipulation cannot be confused for a corrupted
sidecar.

filter_already_packed_oids() consulted only base-stratum packs whose
anchor_ref string matched the anchor being stratified. Two anchors
that shared history therefore packed the shared objects twice — once
each. The doubling scales with overlap: at the scale this feature
targets (millions of commits, hundreds of GB), N anchors retaining
overlapping release history multiply the shared prefix N-fold on
disk. The shape recurs in mundane fleet topologies — main plus a
handful of long-lived release branches, main plus tags pointing at
older commits — not in pathological setups.

The per-anchor restriction existed to make per-anchor cascade
demotion safe: if every anchor's pack set was self-contained, you
could demote one anchor's packs without leaving another anchor's
coverage with dangling references. This is a stronger guarantee than
the system actually needs. Demotion only removes the .base-stratum
and .keep sidecars; the .pack and .idx files remain on disk and are
folded into the active stratum by the next geometric repack. Objects
never disappear in the window between demotion and absorption, so
the safety property holds at the per-repo level even without
per-anchor closure.

Drop the anchor_ref restriction in filter_already_packed_oids() and
let it consult every base-stratum pack. Each shared object is now
written into whichever anchor's pack lands first; later anchors in
the same run, and any subsequent run, carry only their unique-to-
anchor deltas relative to the union.

For this to actually save work, the helpers stratify uses to bound
its rev-list and to gauge surface-gc readiness must learn that
coverage from another anchor counts:

  - find_stratified_ancestor(r, tip) replaces the old
    find_last_stratified_commit(anchor_ref). It returns the most
    recent anchor_commit (regardless of which anchor recorded it)
    that is a strict ancestor of tip, suitable as a `^bound` for
    rev-list. Strict ancestors only — descendants would over-
    exclude (B_tip is reachable from a descendant, so ^descendant
    drops B_tip itself), and unrelated commits would have no
    effect. The check uses repo_in_merge_bases(), which is
    in-process and accelerated by the commit graph.

  - stratified_frontier_date(r, tip) is a separate helper for the
    surface-gc readiness check. Unlike find_stratified_ancestor()
    it accepts both ancestors *and* descendants of tip: a pack
    whose anchor_commit is a descendant of tip fully contains tip's
    reachables, so the frontier on this anchor's history clamps at
    tip's own committer date. Without this, a release tag whose
    history is entirely covered by main's pack would appear
    "uncovered" because main's anchor_commit is younger than the
    tag's tip; surface-gc would skip indefinitely.
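
The asymmetry between the two helpers can be shown on a toy commit
graph. Everything below is a hypothetical model: commits are array
indices, reachability is a naive recursive walk rather than
repo_in_merge_bases(), and letting a strict ancestor contribute its own
date to the frontier is an assumption of this sketch.

```c
#include <stddef.h>

struct toy_commit {
	int parent[2];      /* indices of parents, -1 when absent */
	unsigned long date; /* committer date */
};

/* Return 1 if `anc` is reachable from `desc` (a commit reaches itself). */
int toy_reaches(const struct toy_commit *g, int desc, int anc)
{
	if (desc < 0)
		return 0;
	if (desc == anc)
		return 1;
	return toy_reaches(g, g[desc].parent[0], anc) ||
	       toy_reaches(g, g[desc].parent[1], anc);
}

/*
 * Most recent recorded anchor commit that is a strict ancestor of tip,
 * or -1. Only such a commit is safe to use as a ^bound.
 */
int toy_find_stratified_ancestor(const struct toy_commit *g,
				 const int *recorded, size_t n, int tip)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		int c = recorded[i];

		if (c == tip || !toy_reaches(g, tip, c))
			continue;
		if (best < 0 || g[c].date > g[best].date)
			best = c;
	}
	return best;
}

/*
 * Frontier date for the readiness check. A recorded commit equal to or
 * descended from tip clamps at tip's date; a strict ancestor counts
 * with its own date. Returns 0 when nothing covers tip.
 */
unsigned long toy_frontier_date(const struct toy_commit *g,
				const int *recorded, size_t n, int tip)
{
	unsigned long best = 0;

	for (size_t i = 0; i < n; i++) {
		int c = recorded[i];
		unsigned long d;

		if (toy_reaches(g, c, tip))
			d = g[tip].date;
		else if (toy_reaches(g, tip, c))
			d = g[c].date;
		else
			continue;
		if (d > best)
			best = d;
	}
	return best;
}
```

For a release tag at c1 covered only by main's pack at c2, the bound
lookup finds nothing (c2 is a descendant), yet the frontier date is
c1's own date, so the readiness check does not report the tag as
uncovered.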

The newly-written packs must also be visible to the cross-anchor
filter when the next anchor in the same run reads them. Replace the
add_packed_git() calls in stratify and consolidate-stratum with
packfile_store_load_pack() so the pack is registered in the
repository's packfile store and shows up via repo_for_each_pack()
on subsequent iterations. Without this, the filter never observes
the pack the previous iteration just wrote, and the cross-anchor
saving disappears whenever multiple anchors are processed in one
maintenance run.

Document the new design in Documentation/technical/stratified-gc.adoc:
replace the "Cascade demotion" section with "Independent validation"
(it already runs that way after the prior commit), add a
"Cross-anchor coverage and the closed-set property" section, and
update the surface-gc explanation to spell out that the closed set
is the per-repo union rather than per anchor — and that demoted
packs preserve reachability via their on-disk .pack files until the
next geometric repack absorbs them.

Add a regression test in t7901-maintenance-stratify.sh
("cross-anchor filter dedupes shared history") with master at c2
and release at c1 (release ancestor of master). After stratify, only
one sidecar exists — master's — with the trace2 line confirming the
filter skipped release's already-packed objects. The test then
forces min-age and grace-period to "now" and runs surface-gc,
asserting release is not flagged as "no stratified commits yet"
— that exercises the descendant fallback in stratified_frontier_date().
The test fails without this commit (release would write its own
pack containing the shared objects) and passes with it.

setup_two_distinct_anchors in the test helpers now creates sibling
commits (master and release diverging from a common base) instead
of an ancestor relationship. Without the change the cross-anchor
filter would collapse the second anchor's pack to empty and break
every orphan/prune test that depends on each anchor having its own
pack on disk.

The batch-size truncation in maintenance_task_stratify() had two bugs
that compounded into "stratification silently skips objects forever"
or "stratification can never converge", depending on how small the
limit was.

First, an anchor-vs-buffer mismatch. When the parsing loop crossed
into a new commit C with batch_exceeded already set, it truncated
the buffer back to the start of C and broke out — but batch_anchor
had been updated to C's OID one iteration earlier, before C's
objects were scanned. The resulting pack contained the *previous*
commit's closure, but the .base-stratum sidecar recorded C as the
frontier. The next run's `^C` exclusion then permanently hid C's
unique objects from base-stratum packs.

Second, no progress on a single oversized commit. If the very first
eligible commit alone exceeded batch-size, the loop truncated to
offset 0 and the empty-buffer guard at the top of the pack-objects
path returned without writing a sidecar. The frontier never
advanced; every subsequent run repeated the same failed attempt.

A third issue surfaced while writing tests: the loop assumed
rev-list emits commits interleaved with their trees and blobs
(C1 T1 B1 C2 T2 B2 ...), but `rev-list --objects --reverse` without
`--in-commit-order` emits all commit OIDs first and then all
referenced trees and blobs. With that ordering there are no inline
commit boundaries to truncate at: either the limit is large enough
that truncation never fires (batch-size is silently ignored), or
the truncation fires while still in the commit prefix and produces
a commits-only buffer with no tree/blob closure. Both regimes
violate the documented "Maximum objects to stratify per anchor per
run" contract.

The fix has three parts:

  - Pass --in-commit-order to rev-list. With it, each commit is
    immediately followed by the trees and blobs reached through
    that commit, so the buffer prefix preserved by truncation has
    full object closure for the commits it includes.

  - Track the last *fully-included* commit separately from the
    commit currently being scanned. pending_anchor records the
    new commit on entry; it is only promoted to batch_anchor
    once we cross into the next commit (or finish the buffer)
    without exceeding the batch. Truncation now records the
    previous fully-included commit, so anchor and pack contents
    agree.

  - Handle the single-commit-overflow case explicitly. When the
    first eligible commit alone exceeds batch-size, there is no
    prior fully-included commit to fall back on; truncating
    would leave the batch empty. Promote the pending commit to
    anchor and overshoot the batch by the remainder of its
    objects. This costs at most one commit's worth of overshoot
    per run but guarantees that stratification always advances.
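
The pending/promoted anchor bookkeeping can be sketched over a toy
object stream. The types and the function name are hypothetical, the
stream is assumed to already be in --in-commit-order, and treating a
limit of 0 as "unlimited" is an assumption of this sketch.

```c
#include <stddef.h>

struct rev_entry {
	int is_commit; /* 1 for a commit line, 0 for a tree or blob */
	int id;        /* toy object id, non-negative */
};

struct batch_result {
	size_t n_included; /* number of leading entries kept for the pack */
	int anchor;        /* last fully included commit, -1 if none */
};

struct batch_result select_batch(const struct rev_entry *e, size_t n,
				 size_t limit)
{
	struct batch_result r;
	int batch_anchor = -1, pending_anchor = -1;
	size_t commit_start = 0;

	for (size_t i = 0; i < n; i++) {
		if (e[i].is_commit) {
			/* Crossing a boundary: the pending commit is complete. */
			if (pending_anchor != -1)
				batch_anchor = pending_anchor;
			pending_anchor = e[i].id;
			commit_start = i;
		}
		/*
		 * Over the limit: truncate to the start of the pending
		 * commit, but only if some commit is fully included.
		 * Otherwise keep going and overshoot by one commit.
		 */
		if (limit && i + 1 > limit && batch_anchor != -1) {
			r.n_included = commit_start;
			r.anchor = batch_anchor;
			return r;
		}
	}
	if (pending_anchor != -1)
		batch_anchor = pending_anchor;
	r.n_included = n;
	r.anchor = batch_anchor;
	return r;
}
```

With three commits of three objects each, a limit of 4 keeps the first
commit and records it as the anchor; a limit of 2 overshoots to the
same result instead of producing an empty batch.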

Add t7901-maintenance-stratify.sh coverage for both regimes:

  - "batch-size truncation records the last fully-included
    commit" exercises the multi-commit path. Three commits at
    3 objects each, batch-size=4: the sidecar must record c1
    (not c2 — the old bug) and a follow-up unlimited run must
    successfully stratify the rest, proving ^c2 did not
    permanently hide c2 from the walk.

  - "batch-size smaller than a single commit still makes
    progress" covers the case the reviewer flagged as
    untested. batch-size=2 cannot hold a single commit's
    closure (3 objects); the run must overshoot rather than
    return empty-handed.

Add an extract_sidecar_anchor helper in the test file alongside the
existing extract_sidecar_refs so tests can assert on the recorded
anchor_commit OID directly.

Both stratify call sites of write_pack_base_stratum() ignored the
return value and unconditionally set new_pack->in_base_stratum = 1.
The function can fail after the .pack itself is written but before
.base-stratum or .keep are durably installed — a rename(2) error, a
full filesystem, or any of the failure paths the recent atomic-write
conversion now reports cleanly via -1.

When that happens, the maintenance process keeps an in-memory pack
flagged as base-stratum coverage with no matching on-disk metadata.
Two concrete bad outcomes follow.

In maintenance_task_stratify(), the cross-anchor filter consults
in_base_stratum on every subsequent iteration of the same run. A
later anchor sharing history with the failed one will skip objects
on the strength of a pack whose .base-stratum sidecar does not exist
on disk, so the next maintenance process — which only sees what is
on disk — does not consider that pack base-stratum at all and loses
coverage of those objects entirely.

In maintenance_task_consolidate_stratum() the failure mode is worse:
the very next block runs repack_remove_redundant_pack() on every
source pack that fed the merge. Those source packs do have valid
.base-stratum sidecars; the merged pack does not. Deleting them
under the false belief that the merged pack now covers the anchor
permanently strands the anchor's objects outside any base-stratum
pack, exactly the state stratification exists to prevent.

A transient metadata write failure should be a clean task failure,
not silent inconsistency. Check the return value at both call sites:

  - stratify: warn, set result = 1, and skip the in_base_stratum
    assignment so subsequent iterations' filter does not treat the
    pack as covering anything. The orphaned .pack/.idx files remain
    on disk and are harmless — a later run will retry the anchor.

  - consolidate-stratum: warn, set result = 1, and continue past
    the source-pack removal block. The merged pack stays on disk
    as an orphan (the existing comment about orphans being harmless
    already covers this case); the source packs and their sidecars
    are preserved, so the anchor's coverage is unchanged from before
    the consolidation attempt.

find_stratified_ancestor() and stratified_frontier_date() each call
lookup_commit() directly on the tip_oid argument supplied by their
callers. Both callers obtain that OID from refs_resolve_ref_unsafe(),
which returns whatever the ref points at — for an annotated tag,
the tag object's OID, not the commit it dereferences to. lookup_commit()
rejects non-commit OIDs and the helpers return NULL/0, so any anchor
configured as refs/tags/<v> with an annotated tag silently degrades
both code paths the helpers exist to support.

Two visible regressions follow.

In maintenance_task_stratify(), find_stratified_ancestor() runs
before rev-list and is meant to supply ^last_stratified as a bound,
so the walk only sees commits past the previous frontier. With the
helper returning NULL, rev-list runs unbounded on every invocation
and re-walks the anchor's full history. The cross-anchor filter on
the rev-list output then strips the already-packed objects and the
resulting pack is empty, so the bug does not produce a duplicate
sidecar — it just burns CPU and IO proportional to history depth on
every stratify run, indefinitely.

In stratify_stratifying_caught_up(), stratified_frontier_date()
returning 0 is interpreted as "no base-stratum pack covers this
anchor yet" and the surface-gc readiness check refuses to engage:

  surface-gc: anchor 'refs/tags/v1' has no stratified commits yet

That message is sticky — it keeps firing on every run for as long
as the anchor stays a tag. A repository using release tags as
anchors (the conventional shape: annotated or signed tags marking
release boundaries) therefore never gets surface-gc savings, even
once stratification has fully caught up.

Sidecar contents are not affected. rev-list peels its own commit
arguments, the batch loop in maintenance_task_stratify() reads
commit OIDs out of `rev-list --objects --in-commit-order` output
into pending_anchor/batch_anchor, and tip_oid is overwritten with
the last fully-included commit before write_pack_base_stratum().
The recorded anchor_commit is always a commit OID regardless of
whether the anchor ref is a branch or an annotated tag, so the
inner lookup_commit(r, &adata.anchor_commit) inside both helpers is
correct as written and is not changed by this commit.

validate_single_base_stratum_pack() is also unaffected: it passes
the resolved ref OID to `git merge-base --is-ancestor`, which peels
its arguments through the standard rev-parse machinery.

Switch the two affected lookups to lookup_commit_reference_gently(),
which dereferences tags before the lookup and returns NULL silently
on a non-commit. The fix is at the helper boundary rather than at
each call site so any future caller that hands an unpeeled ref OID
in is handled the same way.
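
What "peel before lookup" means can be shown with a toy object model.
The types and the function below are hypothetical and say nothing
about Git's internals beyond the behaviour described above: tags are
followed to their target, and anything that does not end at a commit
yields NULL without an error.

```c
#include <stddef.h>

enum toy_type { TOY_COMMIT, TOY_TAG, TOY_TREE, TOY_BLOB };

struct toy_object {
	enum toy_type type;
	const struct toy_object *target; /* what a tag points at, else NULL */
};

/* Follow tag links until a non-tag is reached; return it if a commit. */
const struct toy_object *toy_peel_to_commit(const struct toy_object *o)
{
	while (o && o->type == TOY_TAG)
		o = o->target;
	return (o && o->type == TOY_COMMIT) ? o : NULL;
}
```

A lookup that skips the loop sees the tag object itself, rejects it as
a non-commit, and produces the NULL/0 results described earlier.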

Add a regression test in t7901-maintenance-stratify.sh
("annotated-tag anchor: incremental detection and surface-gc
gating"). It creates a single commit, an annotated tag pointing at
it, and asserts up front that the tag's ref OID differs from the
commit's OID — a lightweight tag would resolve straight to the
commit and silently bypass the bug under test, so this guard is
load-bearing. It then verifies (a) the sidecar records the peeled
commit OID, (b) a second stratify run is a no-op, and (c) surface-gc
readiness with min-age and grace-period both set to "now" does not
emit "no stratified commits yet". Without this commit the test
fails on (c); with it the test passes. (b) does not fail without
the fix in this minimal repo because the cross-anchor filter masks
the wasted walk, but pinning it documents the intended behaviour
and would catch future regressions on larger histories where the
duplicate work becomes observable.

Cross-anchor dedup (e16dc47) made base-stratum packs share contents:
the union of all base-stratum packs is closed under reachability from
configured anchors, but each pack on its own is not. stratify-prune
(a7e93d6) demoted orphaned packs by unlinking .base-stratum and
.keep, which breaks the union as soon as the orphan held objects a
surviving pack referenced as ancestors.

The next surface-gc invocation passes
--keep-pack=<surviving> --kept-pack-boundary to repack -d --cruft.
revision.c's kept-pack-boundary check makes the walk skip parent
expansion at any object already in a kept pack, so when it reaches a
surviving anchor's tip it stops without ever traversing into the
orphan's now-non-kept pack. Objects that exist only in the orphan pack
are then classified as cruft, moved into a cruft pack with mtime=now,
and pruned at expiration. The result is silent corruption: surviving
refs lose parent commits, trees, or blobs after a routine maintenance
cycle.

The argument in e16dc47's commit message that the demoted .pack
"remains on disk until the next geometric repack absorbs it" does not
rescue this. The cruft repack happens before any geometric repack, and
geometric_repack uses ref reachability — it cannot recover an object
whose only ref path has been deleted.

Fix this in stratify-prune: instead of unlinking sidecars, rewrite the
orphan pack's .base-stratum sidecar to claim a surviving configured
anchor as its anchor_ref (preserving the orphan's anchor_commit and
stratified_timestamp). The pack stays in_base_stratum=1, keeps its
.keep file, and joins the surviving anchor's group via
collect_base_stratum_pack_groups. surface-gc then enumerates it in
--keep-pack= and the kept-pack boundary covers the same union it did
before retirement. consolidate-stratum's per-anchor geometric merge
absorbs the relabeled pack into the survivor's pack(s) over time.

Target-anchor selection picks the first ref-group whose anchor is in
the configured list. collect_base_stratum_pack_groups only emits groups
for anchors that have packs, so this guarantees a survivor with a pack.
If no surviving anchor has any base-stratum pack of its own, no kept
pack exists whose closure could be broken, and the orphan is demoted as
before; trace2 distinguishes the cases via packs/relabeled and
packs/demoted counters.

Document the contract in stratified-gc.adoc — both the "Cross-anchor
coverage and the closed-set property" section (which now explains why
sidecar unlink is unsafe and what relabel does instead) and the "Why
surface-gc is cheap" section (which previously claimed demotion was
absorbed safely by the next geometric repack).

Add a regression test in t7901 ("shared-history anchors survive donor
demotion and surface-gc"): build sibling anchors via
setup_two_distinct_anchors, stratify, drop master from config and
delete its ref, expire reflogs (real-world parity — without this the
release@{1} branch-creation reflog pins c1 until expiration runs),
stratify-prune, then surface-gc with cruft-expiration set past now to
force immediate pruning. Without this commit c1 (the shared base)
moves to a cruft pack and is deleted; "git cat-file -e" and "git fsck
--strict" both fail. With this commit c1 stays in master's relabeled
pack and the repository remains consistent.

The existing "stratify-prune demotes orphan anchor packs" test is
renamed and updated: it now asserts the relabel path (count_sidecars
stays at 2, both sidecars name the survivor, stderr contains
"relabeled") rather than the old unlink-and-drop behavior.
…nfig order

The relabel path added in 3abf001 picked a single target_anchor up
front — the first configured anchor whose ref-group already had a
base-stratum pack — and reused it for every orphan pack in the run.
That selection ignored whether the orphan's recorded anchor_commit
was actually reachable from the chosen target's tip. Two problems
follow.

First, the next stratify validation pass demotes any pack whose
anchor_commit is not an ancestor of its recorded anchor_ref's tip
(validate_single_base_stratum_pack shells out to `merge-base
--is-ancestor`). A relabel that writes an incompatible (anchor_ref,
anchor_commit) pairing therefore postpones demotion by exactly one
stratify cycle — and at that point the closed-set property the
relabel was meant to preserve breaks anyway.

Second, the choice depends on config order rather than repository
topology. When two anchors survive and the orphan's anchor_commit
lies on the history of one but not the other, swapping the order of
`maintenance.stratified.anchor` entries changes whether the pack
survives. The existing single-survivor regression test does not
exercise this; the bug surfaces as soon as a second surviving anchor
sits ahead of the compatible one in config order.

The "shared-history anchors survive donor demotion and surface-gc"
test in 3abf001 happens to fall into this case too: master's
anchor_commit is c2, the only survivor is release at r1, and c2 and
r1 are siblings off c1. The relabel writes (release, c2) which is
not on release's history. The test passes only because nothing
between stratify-prune and the subsequent surface-gc invokes
validation — it would have broken the very next stratify run.

Fix this by choosing the target per pack inside the orphan loop. A
new helper, find_relabel_target(), iterates configured anchors in
two passes:

  1. Preferred path: if any surviving anchor's tip has the orphan's
     anchor_commit as an ancestor, use that anchor and preserve the
     orphan's anchor_commit verbatim. The frontier date is unchanged
     and the relabeled sidecar is a no-op for future validation.

  2. Fallback: substitute the merge-base of the orphan's
     anchor_commit and the surviving anchor's tip as the new
     anchor_commit. The merge-base lies on both histories, and the
     orphan pack — which covers everything reachable from its
     original anchor_commit, a descendant of the merge-base — covers
     the merge-base too, so the new claim "anchor_ref X covers
     anchor_commit Y" is true. The frontier date drops to the
     merge-base's date, which is harmless: future stratify runs on
     the new anchor find their own packs' later anchor_commits via
     find_stratified_ancestor and use those as the rev-list bound.
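
The two passes can be sketched on a toy single-parent history. This is
an illustration under stated assumptions, not the helper itself:
pick_relabel_target(), the integer commit ids and the parent table are
hypothetical, and real histories with merges need a proper merge-base
computation instead of the linear walk used here.

```c
#include <assert.h>
#include <stddef.h>

/* Toy history: each commit has at most one parent; -1 marks a root. */
enum { C1, C2, R1, U1, NCOMMITS };
static const int parent[NCOMMITS] = {
	[C1] = -1, /* shared base */
	[C2] = C1, /* orphan pack's anchor_commit */
	[R1] = C1, /* tip of a surviving anchor sharing c1 */
	[U1] = -1, /* tip of a surviving anchor on an unrelated root */
};

/* Return 1 if `anc` is `tip` or one of its ancestors, else 0. */
static int is_ancestor(int anc, int tip)
{
	for (int c = tip; c != -1; c = parent[c])
		if (c == anc)
			return 1;
	return 0;
}

/* Return the nearest common ancestor of a and b, or -1 if none. */
static int merge_base(int a, int b)
{
	for (int c = a; c != -1; c = parent[c])
		if (is_ancestor(c, b))
			return c;
	return -1;
}

/*
 * Hypothetical model of the per-pack choice. Returns the index of the
 * chosen surviving tip, or -1 if no tip shares history with the orphan.
 * On success *new_anchor_commit holds the commit to record.
 */
static int pick_relabel_target(int orphan_commit, const int *tips,
			       size_t nr, int *new_anchor_commit)
{
	assert(new_anchor_commit != NULL);

	/* Pass 1: a tip that already contains the orphan's anchor_commit. */
	for (size_t i = 0; i < nr; i++)
		if (is_ancestor(orphan_commit, tips[i])) {
			*new_anchor_commit = orphan_commit;
			return (int)i;
		}

	/* Pass 2: substitute the merge-base with the first related tip. */
	for (size_t i = 0; i < nr; i++) {
		int mb = merge_base(orphan_commit, tips[i]);
		if (mb != -1) {
			*new_anchor_commit = mb;
			return (int)i;
		}
	}
	return -1;
}
```

With tips listed as {U1, R1} or as {R1, U1}, the orphan at c2 is
relabeled onto the r1 anchor with c1 as its anchor_commit in both
cases, so the outcome in this model follows topology rather than list
order.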

If no surviving anchor shares any history with the orphan's
anchor_commit (no merge-base), the helper returns failure and the
caller demotes the pack. This is safe: with no common ancestry there
are no shared objects for surviving packs to be referencing via
cross-anchor dedup, so the closure cannot be broken by dropping the
orphan's sidecar.

The up-front "is there any surviving anchor with a pack" check is
preserved but its meaning is narrowed: it now only gates whether we
attempt relabel at all (closure-at-risk), not which target gets
chosen. The trace2 counters (packs/relabeled, packs/demoted) and the
stderr messages keep their existing semantics.

Update the two existing relabel tests to assert the recorded
anchor_commit after relabel and to run a follow-up stratify that
exercises validation. Both now verify the merge-base substitution
explicitly: master's pack ends up labeled (release, c1) in
cross-anchor-survive, and release's pack ends up labeled (master,
c1) in orphan-prune.

Add a new test, "stratify-prune target selection is independent of
anchor config order", that builds three anchors — master/c2 and
feature/f1 sharing base c1, plus other/u2 on an independent root —
stratifies, retires feature, and runs stratify-prune twice with
{master, other} in opposite orders. Both orders must pick master:
master is the only survivor merge-base-compatible with feature's
anchor_commit, because other shares no history at all. Under the
config-order heuristic, the order [other, master] would have written
(other, f1) and the next stratify run would have demoted.

Document the per-pack target rule and the merge-base substitution
strategy in stratified-gc.adoc's "Cross-anchor coverage and the
closed-set property" section.
find_relabel_target() returns -1 when no surviving anchor's tip shares
any commit history with the orphan's anchor_commit, and the caller
falls back to demotion. d66db68 argued this is safe:

    If no surviving anchor shares any history with the orphan's
    anchor_commit (no merge-base), the helper returns failure and the
    caller demotes the pack. This is safe: with no common ancestry
    there are no shared objects for surviving packs to be referencing
    via cross-anchor dedup, so the closure cannot be broken by
    dropping the orphan's sidecar.

The premise is wrong. filter_already_packed_oids() filters by raw OID
via find_pack_entry_one(), not by commit ancestry. Independently-
rooted anchors that happen to share any tree or blob — the empty tree
4b825dc..., an identical LICENSE / .gitignore / lockfile blob, a
.gitkeep file, common build-output stubs — get deduplicated against
each other regardless of whether their commit histories ever meet.
Whichever pack lands first in a stratify run absorbs the shared OID;
later packs in the same run filter it out. A surviving pack can
therefore depend on an orphan's pack for tree/blob objects even when
the two anchors share no commit ancestor at all.
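
A small model makes the dependency visible. The sketch assumes nothing
about the real filter beyond what is stated above (comparison by raw
OID, earlier packs win); struct toy_pack, write_filtered_pack() and the
string labels used in place of OIDs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_OBJS 8

/* Hypothetical pack: a flat list of object labels standing in for OIDs. */
struct toy_pack {
	const char *oids[MAX_OBJS];
	size_t nr;
};

/* Return 1 if any of the first nr_packs packs stores `oid`. The check
 * compares raw ids only and knows nothing about commit ancestry. */
static int already_packed(const struct toy_pack *packs, size_t nr_packs,
			  const char *oid)
{
	for (size_t p = 0; p < nr_packs; p++)
		for (size_t i = 0; i < packs[p].nr; i++)
			if (!strcmp(packs[p].oids[i], oid))
				return 1;
	return 0;
}

/* Fill packs[nr_packs] with the wanted objects no earlier pack holds. */
static void write_filtered_pack(struct toy_pack *packs, size_t nr_packs,
				const char **wanted, size_t nr_wanted)
{
	struct toy_pack *out = &packs[nr_packs];

	out->nr = 0;
	for (size_t i = 0; i < nr_wanted; i++) {
		if (already_packed(packs, nr_packs, wanted[i]))
			continue;
		assert(out->nr < MAX_OBJS);
		out->oids[out->nr++] = wanted[i];
	}
}

/* Return 1 if this single pack stores `oid`. */
static int pack_has(const struct toy_pack *pack, const char *oid)
{
	return already_packed(pack, 1, oid);
}
```

If an orphan-root anchor and a mainline anchor both want an identical
license blob and the orphan-root anchor is packed first, the mainline
pack in this model comes out without the blob and depends on the other
pack for it, even though the two share no commit.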

The corruption shape is the same as the unconditional-unlink case
3abf001 fixed: surface-gc runs with --keep-pack on the surviving
packs and --kept-pack-boundary, so the reachability walk stops at the
survivor's kept-pack tips without opening the trees that depend on
the shared objects. The orphan pack, after demotion, is no longer in
the kept set; once nothing else walks its commits — its anchor_ref
deleted, its reflog entries expired, the same trigger that opens the
window in 3abf001's test — the shared objects look unreachable to
the cruft repack and get classified as cruft, then pruned at
expiration.

Two configurations make this concrete in practice: dedicated
docs/gh-pages branches kept alongside main via `git checkout --orphan`,
and vendored subtrees imported with their own roots. Both retire as
"normal" anchors when the project shifts away from them, both share
nothing on the commit graph with main, and both are nearly guaranteed
to share at least the empty tree (and usually a few license/config
blobs) with the survivors.

Fix find_relabel_target() with a third strategy: when neither merge-
base path picks a target, borrow a surviving anchor's own recorded
anchor_commit from one of its existing base-stratum packs. By
construction this commit is

  - an ancestor of the chosen anchor_ref's tip — the survivor's own
    pack was just validated against that very pair, so validation on
    the relabeled sidecar passes on the next stratify run; and

  - no newer than any other survivor anchor_commit, so it cannot push
    find_stratified_ancestor()'s frontier past the survivor's true
    coverage and starve future stratify runs of progress.

Strategy 3 succeeds whenever some surviving anchor has a base-stratum
pack of its own, so the function now returns -1 only in the state the
up-front any_survivor_has_pack gate used to test for: no surviving
anchor has a pack at all. That gate is folded into the function's
return value and dropped from the caller.

Add 'no-merge-base orphan is relabeled, not demoted' to t7901,
mirroring 'shared-history anchors survive donor demotion and surface-
gc' but with truly independent roots. Both anchors commit an
identical shared.txt blob; configured order puts the to-be-orphaned
anchor first so its pack absorbs the shared blob and the survivor's
pack omits it via cross-anchor dedup. The test asserts the orphan is
relabeled (not demoted), that the relabeled sidecar's anchor_commit
equals the survivor's own, that a follow-up stratify validation pass
does not demote, and that surface-gc with cruft-expiration=tomorrow
does not prune the shared blob.

Update stratified-gc.adoc's "Cross-anchor coverage and the closed-set
property" section to describe the three strategies and drop the
incorrect "no merge-base implies safe to demote" claim.
repo_in_merge_bases() returns three values — 1 (ancestor), 0 (not
ancestor), -1 (walk failed). The -1 case fires from the non-generation-
number path in commit-reach.c when repo_parse_commit() fails on a
parent commit during paint_down_to_common — i.e. a parent is missing
from the object store and there is no usable commit-graph to short-
circuit the walk. In a repo with a populated commit-graph the
gen-number path runs and -1 cannot happen; without it, prune runs and
fetches both create transient windows where a referenced parent is
absent. gc itself is one of the things that can produce that window.

Four call sites in builtin/gc.c, spread across three functions,
collapsed the tri-state into a plain truthy test and so treated -1 as
"ancestor":

  - find_stratified_ancestor() at the !repo_in_merge_bases() guard:
    on error we fell through and the anchor was kept as a candidate
    for `^bound` even though we have no idea whether it is actually
    on tip_oid's history. Picking a non-ancestor as ^bound either
    over-excludes (descendants drag too much along) or has no effect.

  - stratified_frontier_date() at both the ancestor and descendant
    probes: on error we picked a candidate date from a walk that
    failed, then folded it into the max. The resulting frontier date
    can be older or newer than truth, and stratify decisions key off
    it.

  - find_relabel_target() in the first (preferred-path) loop: on
    error we picked that ref as the relabel target and wrote a
    sidecar pairing (anchor_ref, orig_anchor_commit) that we never
    confirmed is on the ref's ancestry. The next stratify validation
    pass would then demote the pack, defeating the relabel.

Of these, find_relabel_target() is the most consequential: a transient
walk failure silently commits us to a wrong sidecar.

Match the existing error discipline of the surrounding loops — every
other failure mode (unresolvable ref, unparseable commit, missing
pack sidecar) already skips the candidate and continues — and do the
same on -1:

  - find_stratified_ancestor(): test `<= 0` so both 0 and -1 skip.
  - stratified_frontier_date(): capture the int, `continue` on -1
    for the ancestor probe; fold "error or unrelated" into a single
    `<= 0` skip for the descendant probe.
  - find_relabel_target(): capture the int, `continue` on -1, only
    accept the ref when the return is strictly positive.
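
The difference between the two test shapes is easy to miss, so here is
a reduced illustration. It does not call repo_in_merge_bases(); struct
candidate carries a stubbed walk result, and pick_truthy() and
pick_strict() are hypothetical names for the old and new loop shapes.

```c
#include <assert.h>
#include <stddef.h>

/* Candidate ref with a stubbed tri-state walk result:
 * 1 = ancestor, 0 = not an ancestor, -1 = walk failed. */
struct candidate {
	const char *ref;
	int walk_result;
};

/* Old shape: `!ret` is false for -1, so a failed walk is accepted. */
static const char *pick_truthy(const struct candidate *c, size_t nr)
{
	assert(c != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		if (!c[i].walk_result)
			continue;
		return c[i].ref;
	}
	return NULL;
}

/* New shape: capture the int and accept only a strictly positive
 * result, so both 0 and -1 skip the candidate. */
static const char *pick_strict(const struct candidate *c, size_t nr)
{
	assert(c != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		int ret = c[i].walk_result;

		if (ret <= 0)
			continue;
		return c[i].ref;
	}
	return NULL;
}
```

Given candidates whose results are -1, 0 and 1 in that order, the old
shape returns the first (failed) candidate while the new shape returns
the third.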

No behavior change in repos with a healthy commit-graph (the
gen-number path returns 0/1 only). In degraded states the conservative
treatment matches what we already do for every other transient
failure in these loops.

Signed-off-by: Vaidas Pilkauskas <vaidas.pilkauskas@shopify.com>
…base orphans

f0d2041 added a third relabel strategy: when no surviving anchor
shares commit history with the orphan, borrow a surviving anchor's
own anchor_commit from one of its base-stratum packs. The safety
argument hinges on the borrowed commit being "no newer than any other
survivor anchor_commit," which keeps find_stratified_ancestor() from
claiming coverage the survivor doesn't actually have:

    The borrowed anchor_commit is, by construction, an ancestor of
    the chosen anchor_ref's tip (the survivor's own pack was just
    validated against it), and no newer than any other survivor
    anchor_commit, so it cannot push find_stratified_ancestor past
    the survivor's true frontier.

The "no newer" half is not actually enforced. The code reaches for
`group->entries[0].anchor_commit`, and entries are appended by
collect_base_stratum_pack_groups() in repo_for_each_pack() order
(pack mtime, newest first) with no subsequent sort. The cascade
comment a few lines above is explicit that stratified_timestamp is
not a trustworthy total order either, so even sorting on that field
wouldn't carry the argument. In practice entries[0] is whichever pack
the filesystem walk returned first.

If that pack's anchor_commit is newer than another pack's in the same
group, the borrowed sidecar overstates the survivor's frontier:
find_stratified_ancestor() will pick the newer commit as `^bound` on
the next stratify run, and objects the survivor actually covers
between the older and newer anchor_commits get re-walked and re-
packed. Worse, the orphan pack itself only really covers objects up
to its original anchor_commit, so the kept-set closure that the
relabel was meant to preserve is silently weakened.

All entries in a single group share an anchor_ref and have been
validated as ancestors of its tip, so they lie on one linear ancestry
chain and committer date is a faithful total order on them. Add a
helper that parses each entry's anchor_commit and returns the one
with the smallest ->date; call it from strategy 3 in place of the
entries[0] reference. If no entry parses (corrupt store), the helper
returns failure and strategy 3 falls through to the next configured
anchor, matching the existing error discipline elsewhere in
find_relabel_target().
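
A minimal sketch of such a helper, assuming each entry's parse outcome
and committer date are already at hand. struct toy_entry and
oldest_anchor_entry() are hypothetical; the real helper would parse the
commit from the object store rather than read a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical group entry. */
struct toy_entry {
	const char *anchor_commit; /* label standing in for an OID */
	long date;                 /* committer date, seconds since epoch */
	int parses;                /* 0 models a commit that fails to parse */
};

/* Return the index of the parseable entry with the smallest date, or
 * -1 if no entry parses. Ties keep the earliest index. */
static int oldest_anchor_entry(const struct toy_entry *entries, size_t nr)
{
	int best = -1;

	assert(entries != NULL || nr == 0);
	for (size_t i = 0; i < nr; i++) {
		if (!entries[i].parses)
			continue;
		if (best < 0 || entries[i].date < entries[best].date)
			best = (int)i;
	}
	return best;
}
```

A caller in this model treats -1 as "skip this group and try the next
configured anchor", and otherwise borrows
entries[result].anchor_commit regardless of where that entry sits in
the list.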

Signed-off-by: Vaidas Pilkauskas <vaidas.pilkauskas@shopify.com>