
fix(optuna): harden literal parsing across both desubstitution paths #7

Open
GrigoryEvko wants to merge 800 commits into FusionBrainLab:main from GrigoryEvko:fix/optuna-literal-parsing

Conversation

@GrigoryEvko

The AST-level _EvalCleaner (add_tuned_comment=False path) checks isinstance(arg, (Constant, List, Tuple, Set, Dict)) to decide whether an eval(...) call wraps a literal that can be inlined. This misses UnaryOp(USub, Constant), so eval(-5) survives in the AST path while the source-level _clean_eval_in_source strips it correctly via ast.literal_eval. The two desubstitution modes silently produced different output for the same input.

Reproducer against current main:

import ast
from gigaevo.programs.stages.optimization.optuna.desubstitution import (
    _EvalCleaner, _clean_eval_in_source,
)

src = "x = eval(-5)"
tree = ast.parse(src)
_EvalCleaner().visit(tree)
ast.fix_missing_locations(tree)
print(ast.unparse(tree))           # before: 'x = eval(-5)' ;  after: 'x = -5'
print(_clean_eval_in_source(src))  # always: 'x = -5'

Fix: in _EvalCleaner.visit_Call, replace the isinstance check with ast.literal_eval(arg) used as a try/except predicate, so both paths share the same literal-detection logic. Verified across 12 sample inputs: the AST and source paths now produce identical output. The (ValueError, SyntaxError) catches around literal_eval in _coerce_param_value and _clean_eval_in_source are also broadened to catch MemoryError, RecursionError, and TypeError.
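A minimal sketch of the shared predicate, assuming only the standard library. The names `is_literal`, `EvalInliner`, and `inline_literal_evals` are hypothetical and are not the project's `_EvalCleaner`; like the isinstance check described above, the sketch treats any `Constant` argument (strings included) as inlinable.

```python
import ast

# Exceptions the description lists for literal_eval on pathological input.
_LITERAL_ERRORS = (ValueError, SyntaxError, TypeError, MemoryError, RecursionError)


def is_literal(node: ast.AST) -> bool:
    """Return True when ast.literal_eval accepts the node as a literal."""
    try:
        ast.literal_eval(node)  # accepts an AST node as well as a string
    except _LITERAL_ERRORS:
        return False
    return True


class EvalInliner(ast.NodeTransformer):
    """Hypothetical transformer: replace eval(<literal>) with the literal."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # handle nested eval(...) calls first
        is_eval = isinstance(node.func, ast.Name) and node.func.id == "eval"
        if is_eval and len(node.args) == 1 and not node.keywords:
            if is_literal(node.args[0]):
                return node.args[0]
        return node


def inline_literal_evals(src: str) -> str:
    tree = EvalInliner().visit(ast.parse(src))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

Because `UnaryOp(USub, Constant)` is accepted by `ast.literal_eval`, `eval(-5)` is inlined here, while `eval(y)` is left alone since a bare name is not a literal.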

PetrAnokhin and others added 30 commits April 1, 2026 16:00
short id will generate based on full id when required
KhrulkovV and others added 24 commits April 6, 2026 13:21
…cstrings

Phase 1 cleanup (completed):
- Move A_mem/, GAM_root/ → _vendor/ (vendored MIT libs)
- Move contrib licenses → _vendor/
- Move 3 example scripts → examples/
- Fix 15 broken vendored library imports (A_mem/GAM_root bare imports)
- Update 8 consumer import paths to _vendor/
- Add _vendor/__init__.py docstring (vendored libs notice)
- Add examples/__init__.py docstring (not production code)
- Update shared_memory/__init__.py docstring
- Update pyproject.toml (ruff/mypy exclude paths for vendors)

Tests: 770 passed
Lint: clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): Phase 1 — directory reorg (vendor/examples/docstrings)
- Delete 5 duplicated usage-merge functions from memory_write_example.py
  (_to_float, _median_or_none, _extract_usage_task_deltas,
  _build_usage_payload_from_task_deltas, _merge_usage_payloads)
  → import from card_update_dedup.py (canonical home)
- Delete duplicate dedupe_keep_order from card_update_dedup.py
  → import from shared_memory/utils.py
- Remove deprecated _apply_update_actions() from memory.py (dead wrapper)
- Make memory_to_card private (_memory_to_card) — only used internally
- Simplify single-iteration loop in _extract_json_object

Net: ~120 lines deleted, zero behavior change.
Tests: 770 passed | Lint: clean

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): deduplicate code and delete dead paths
Replace hand-written RetrievalWeights.from_mapping() and
CardUpdateDedupConfig.from_mapping() dict parsers (~87 lines) with
Pydantic v2 @model_validator(mode="before") — same behavior, idiomatic.

Also: add docstrings to all functions in card_update_dedup.py, fix
stale test references to deleted _apply_update_actions wrapper, add
test for flat config format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): Pydantic-idiomatic config parsing
- memory_write_example.py → write_pipeline.py (it's production, not an example)
- memory_write_config.py → write_pipeline_config.py
- selected_ideas_6.py → origin_analysis.py (remove versioned filename)
- Delete test_memory_write_example_extended.py (duplicate of test_write_pipeline.py)
- Update all 8 import sites + 1 dynamic importlib.import_module call

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): rename write pipeline and analysis files
Extract from card_conversion.py (554 → 420 lines):
- base.py: GigaEvoMemoryBase abstract class (20 lines)
- card_search.py: format_search_results, search_cards_by_keyword,
  synthesize_search_results (115 lines)

Update 4 import sites directly (no re-exports).
card_conversion.py retains: normalization, conversion, GAM config,
constants, protocols.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): split card_conversion into focused modules
Define MemoryError, MemoryRetrieverError, MemorySearchError, and
MemoryStorageError in gigaevo/exceptions.py following the existing
GigaEvoError hierarchy.

Wire them into the memory subsystem:
- gam_search.build() wraps all failures in MemoryRetrieverError
- memory.py narrows two gam.build() catches from bare Exception
- card_store._load() narrows to (json.JSONDecodeError, OSError)
- card_dedup import block narrows to (ImportError, OSError)

Resilience-critical catches (search fallback, merge loop, __exit__)
remain broad by design.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): custom exception hierarchy and narrowed catches
…t base to ABC

- concept_api.py: all 5 RuntimeError raises → MemoryStorageError
  (matches gigaevo/database pattern of wrapping I/O errors)
- base.py: GigaEvoMemoryBase now uses ABC + @abstractmethod
  (matches MutationOperator, Stage, LangGraphAgent pattern)
- card_dedup.py: narrow two broad catches:
  - JSONL read fallback: except Exception → (json.JSONDecodeError, OSError)
  - GAM store build: except Exception → (MemoryRetrieverError, OSError)
- Update 6 test assertions from RuntimeError to MemoryStorageError

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ormity

refactor(memory): exception conformity + ABC base class
When write_pipeline.py passes MemoryCard/ProgramCard Pydantic models to
memory_platform.save_card(), the dict() call on a Pydantic model doesn't
properly flatten nested Pydantic objects like ConnectedIdea. This caused
TypeError in _persist_index() when json.dumps() tried to serialize.

Root cause: write_pipeline returns list[AnyCard] (Pydantic models) and both
backends (memory_platform and memory/shared_memory) consume these cards via
save_card(). memory_platform's normalize_memory_card() must explicitly call
.model_dump() on Pydantic inputs to flatten nested objects.

Fix verified: all 788 memory + integration tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests the exact bug path: Pydantic MemoryCard/ProgramCard with nested
ConnectedIdea and MemoryCardExplanation objects must be properly
flattened to plain dicts before JSON serialization.

6 tests covering: ProgramCard with ConnectedIdea, MemoryCard with
MemoryCardExplanation, plain dict passthrough, JSON round-trips, None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add gigaevo-memory Git dependency to pyproject.toml
- Remove sys.path manipulation from memory_platform/memory.py and
  remote_gam_retriever.py (no longer needed with proper install)
- Simplify test file to use direct imports instead of module mocking

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expands from 6 to 11 tests covering the complete save_card → _persist_index
flow with Pydantic inputs. Tests verify:

- normalize_memory_card: ConnectedIdea/MemoryCardExplanation → dict
- save_card: Pydantic ProgramCard/MemoryCard → JSON-serializable index
- _card_to_backend_content: API payload is clean dict
- persist/reload roundtrip: index file survives write→read cycle

Uses _make_platform_memory() factory with mocked API client to test
memory_platform in isolation without network dependencies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add docstrings to 15 public methods across 5 files (memory.py,
  concept_api.py, card_dedup.py, openai_inference.py, write_pipeline.py)
- Add return type annotations to 4 functions in amem_gam_retriever.py
- Fix 2 mypy errors: annotate retrievers dict, rename variable in api_sync.py
- Extract magic numbers: _MAX_SUMMARY_CHARS, _MAX_DESCRIPTION_CHARS,
  _ENTITY_NAME_MAX_LENGTH, _MAX_CONNECTED_DESCRIPTIONS

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): type annotations, docstrings, constants, platform bug fix
The AST-level `_EvalCleaner` checked `isinstance(arg, (Constant, List, Tuple, Set, Dict))` to decide whether an `eval(...)` argument is a literal. This missed `UnaryOp(USub, Constant)`, so `eval(-5)` survived in the AST path while the source-level `_clean_eval_in_source` stripped it correctly. Switch to `ast.literal_eval` as the predicate so both paths share the same logic.

Also broaden the `(ValueError, SyntaxError)` except clauses around `literal_eval` in `_coerce_param_value` and `_clean_eval_in_source` — it can also raise `MemoryError`, `RecursionError`, and `TypeError` on pathological inputs.

Verified against the source-level path on 12 sample inputs; both produce identical output post-fix.
KhrulkovV added a commit that referenced this pull request May 26, 2026
Address 2 critical + 2 major issues from methodology expert audit:

C1: Add required_prefix constraint enforcement in PromptExecutionStage
    and GigaEvoArchivePromptFetcher — prevents mutation LLM from
    evolving away frozen SYSTEM_CONSTRAINTS.

C2: Change default prior from Beta(1,1) to Beta(1,3) — untested prompts
    start at fitness=0.25 instead of 0.50, preventing archive churn.

M1: Track metrics_count separately from total_trials — use as denominator
    for per-metric means (fixes bias from REJECTED_ACCEPTOR trials).

M4: prompt_text_to_id() now hashes both system and user text — prevents
    stats conflation when user prompts differ.

All prior data invalidated (fitness computation + ID hashing changed).
DBs 4-7 flushed for clean restart.
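
The starting fitness values quoted in C2 match the mean of a Beta distribution, a / (a + b). A small sketch, assuming fitness is the posterior mean after adding success and failure counts to the prior; `beta_mean_fitness` is a hypothetical helper, not the project's code.

```python
def beta_mean_fitness(
    successes: int,
    failures: int,
    prior_a: float = 1.0,
    prior_b: float = 3.0,
) -> float:
    """Posterior mean of Beta(prior_a + successes, prior_b + failures)."""
    a = prior_a + successes
    b = prior_b + failures
    return a / (a + b)


# An untested prompt (no trials) sits at the prior mean.
untested_new = beta_mean_fitness(0, 0)            # Beta(1, 3)
untested_old = beta_mean_fitness(0, 0, 1.0, 1.0)  # Beta(1, 1)
```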

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request May 26, 2026
Addresses chaos-hacker adversarial review findings:
- HIGH #1: Test _await_idle's actual ghost cleanup branch via time.monotonic patch
- HIGH #2: Test generation_timeout on real step() with ghost IDs + stuck RUNNING
- HIGH #3: Verify snapshot data correctness (not just no-hang) after bump()
- MEDIUM #4: Truly concurrent writes via asyncio.Barrier + serialization proof
- MEDIUM #5: Stuck RUNNING program triggers generation_timeout
- MEDIUM #6: Write serialization assertion (max_concurrent == 1)
- MEDIUM #7: Lock eviction race (concurrent reuse after terminal pop)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request May 26, 2026
* prompts: kill few-shot fabrication leak in insights + lineage

The GOOD examples themselves contained invented magnitudes
("rejects 60% of viable candidates", "-2.3% runtime"), training the
LLM that fabricated effect estimates are valid output. Live judge
eval on 5 parent->child pairs across heilbron + hover, audited
against actual Redis program metrics + task_description, shows:

- ungrounded-number rate: 20.2% -> 6.9% (3x reduction)
- lineage rubric subscore: 17.35 -> 17.40
- 4-pair rubric avg (excl. known Gemini Pro structured-output flake):
  16.97 -> 17.12

Edits:
- insights: remove fabricated "60%" from numeric GOOD example;
  add "Quote, don't estimate" rule naming specific fabrication
  patterns (% rejection rates, speedup factors, iteration budgets).
- lineage: remove "-2.3% runtime" from Quantification example;
  spell out that cited numbers must come from diff, code, metrics,
  or task description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(evolution-stats): iteration-window aggregation + snapshot bump

Evolutionary Statistics section in the mutation prompt was empty or
showed stale numbers under the steady-state engine. Two root causes:

1. Stale population snapshot — `bump()` was only called once at seed
   drain. After that, every collector saw a frozen snapshot, so the
   focal program's iteration was rarely in scope. Added
   `bump(incremental=True)` in `poll_and_ingest` after every commit
   pass so the snapshot tracks ingestion progress without flushing
   cached program objects.

2. Per-generation aggregation is meaningless under JIT — generations
   are an output of the schedule, not a fixed input. Replaced the
   `generation_history` / per-gen fields with a symmetric iteration
   window ([iter-R, iter+R], R=15) around the focal program:
   window count/valid, best-in-window + iter, focal rank in window,
   median-before / median-after horizons, trend via median-of-thirds
   (5% multiplicative threshold, direction-aware via
   `metrics_context.is_higher_better`), max invalid streak, and a
   global running-best plateau marker (`iters_since_last_new_best`).

`EvolutionaryStatisticsMutationContext.format()` emits the locked
10-line "E_augmented" block; design doc lives at
`docs/superpowers/specs/2026-05-14-evolutionary-stats-redesign.md`.
Validated via 3-round LLM extraction eval: E_augmented scored 44/45
vs the old per-gen layout's 15/45.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(monitoring): file emit target writes frontier_<metric>.png each tick

start_live_frontier_compare gains an output_dir param and a new "file"
emit target that re-renders a frontier-trajectory PNG (best-so-far +
per-iter mean) in the Hydra run output directory on every tick, sibling
to live_profiler's profile_live.html. Default emit_targets now includes
"file". run.py threads the Hydra output_dir through to the daemon.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(memory): DAG-native intra+extra memory pipeline (per-parent lineage card + live global cards)

Adds the `intra_extra_memory` pipeline variant on top of the default builder:

* IntraMemoryStage (strong LLM, structured output) renders a per-parent
  lineage card from DescendantProgramIds + MemoryContextStage as named inputs;
  framework InputHashCache skips the LLM when neither changes. Output is
  attached to the parent's metadata['intra_memory_card'] and concatenated
  with the global memory cards block via ConcatMemoryStage.

* LiveMemoryRefreshHook wraps IdeaTracker.run_increment as a post_step_hook,
  surfacing freshly evolved ideas to MemoryContextStage's reload-on-read
  selector during the same run (no need to wait for end-of-run flush).

* New ExtraMemoryStage class (currently dormant in the wired pipeline) kept
  as opt-in infra with its own caching test, pinning the structured-output
  contract for future re-wiring.

* Bug fix bundled: invalid-child fitness sentinel (e.g. -1000 in heilbron)
  no longer pollutes delta_distribution.min/median/max or per-cluster
  mean_delta. Invalid children route to dedicated n_failed counters; the
  rendered card shows "n_failed=N (excluded from stats above)" and
  "mean delta n/a" for all-failed clusters. System prompt rule 3 now
  instructs the LLM to exclude is_valid=false from delta math.
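
The sentinel fix amounts to filtering invalid children out before any delta statistic is computed. A rough sketch, assuming each child is a dict with hypothetical `fitness` and `is_valid` keys; `summarise_deltas` is an illustrative name, not the stage's actual function.

```python
from statistics import median


def summarise_deltas(parent_fitness: float, children: list[dict]) -> dict:
    """Delta stats over valid children only; invalid ones are counted apart."""
    deltas = [c["fitness"] - parent_fitness for c in children if c["is_valid"]]
    n_failed = sum(1 for c in children if not c["is_valid"])
    if not deltas:
        # All children failed: nothing to aggregate.
        return {"n_failed": n_failed, "min": None, "median": None, "max": None}
    return {
        "n_failed": n_failed,
        "min": min(deltas),
        "median": median(deltas),
        "max": max(deltas),
    }
```

With a sentinel such as -1000 on an invalid child, the sentinel never reaches `min`, `median`, or `max`; it only increments `n_failed`.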

Legacy lineage stages stripped from the builder (AncestorProgramIds,
LineageStage, LineagesToDescendants, LineagesFromAncestors, InsightsStage)
— DescendantProgramIds is kept and rewidened (max_selected=24) to feed
IntraMemoryStage instead of LineagesToDescendants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: intra/extra memory mode guide + USAGE / MEMORY_ARCHITECTURE cross-links

Adds docs/INTRA_EXTRA_MEMORY.md covering the pipeline introduced in 89f01be5:
architecture diagram, intra-card schema (with the n_failed sentinel-handling
contract), live external-memory refresh hook, caching invalidation triggers,
required co-overrides (ideas_tracker=default, memory=local), smoke / full /
nohup launch commands, tuning knobs, verification checklist, and a
troubleshooting matrix.

USAGE.md: adds `intra_extra_memory` to the `pipeline` config-group table and
a launch example under "Examples".

MEMORY_ARCHITECTURE.md: top-of-file pointer to the new mode guide so the
in-run / live-memory entry point is discoverable from the store-side docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(intra-memory): ship unified diff (not full code) per child + soften mutator "untried directions" preference

IntraMemoryStage payload now carries either a unified diff (change_form="diff")
or full child source (change_form="full_code") per child. Diff is the default;
full code is the fallback when (a) is_valid=False so error_summary line refs
stay readable against the same buffer the analyst sees, (b) the diff is
empty (identical sources), or (c) the diff is no smaller than the file
(structural rewrites where every line differs). Expected 50-80% prompt-size
reduction on the typical small-mutation regime, where the parent's boilerplate
was previously repeated N times across children.
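
The diff-or-full-code choice can be sketched with the standard library's `difflib`. This is an illustration under assumptions: `child_change_payload` is a hypothetical name, "the file" in fallback (c) is taken to mean the child source, and the returned keys mirror the `change_form`, `diff`, and `code` fields named in this commit message.

```python
import difflib


def child_change_payload(parent_src: str, child_src: str, is_valid: bool) -> dict:
    """Pick a unified diff when it is useful, else fall back to full code."""
    diff = "".join(
        difflib.unified_diff(
            parent_src.splitlines(keepends=True),
            child_src.splitlines(keepends=True),
            fromfile="parent",
            tofile="child",
        )
    )
    use_full_code = (
        not is_valid                      # (a) keep error line refs readable
        or not diff                       # (b) identical sources, empty diff
        or len(diff) >= len(child_src)    # (c) diff is no smaller than the file
    )
    if use_full_code:
        return {"change_form": "full_code", "code": child_src}
    return {"change_form": "diff", "diff": diff}
```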

The intra system prompt's user-message-structure table is updated to document
both children[i].diff and children[i].code, plus the change_form discriminator,
so the analyst knows how to read either form.

Mutator system prompt: softened the "Untried directions" rule. Previously
"prefer it over inventing a new direction from scratch" — a hard preference
that let speculative hints dominate archetype selection. Now framed as
candidates to weigh alongside the model's own ideas, with explicit licence
to skip any whose mechanism does not actually fit the parent's code.

Tests: 4 new payload-shape tests on IntraMemoryStage (diff for small mutation,
full-code for structural rewrite, full-code for invalid child, system prompt
documents change_form/diff/full_code), plus a new pin on the mutator prompt
wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(prescriptive): MutationSuggestionStage + EvolutionaryStatistics wiring

Architectural split between descriptive (IntraMemoryStage) and prescriptive
(MutationSuggestionStage) memory: the intra stage now ONLY summarises lineage
history into a per-parent card; the new MutationSuggestionStage consumes
intra card + cross-population memory cards + ancestral momentum trail +
EvolutionaryStatistics population snapshot and emits structured
ProgramInsights into MutationContextStage's insights slot (same shape as the
legacy InsightsStage, so the mutator's PROGRAM INSIGHTS section renders
unchanged).

Key wiring (lineage_memory_pipeline.py):
* DescendantProgramIds → IntraMemoryStage.children_ids
* IntraMemoryStage → MutationSuggestionStage.intra_card
* MemoryContextStage → MutationSuggestionStage.memory_cards
* EvolutionaryStatisticsCollector → MutationSuggestionStage.evolutionary_statistics
* MutationSuggestionStage → MutationContextStage.insights
* IntraMemoryStage + MemoryContextStage → ConcatMemoryStage → MutationContextStage.memory

Both strong-LLM stages (Intra + Suggestion) gate on validator success and
(when enabled) archive acceptance, mirroring the legacy InsightsStage
skip-cascade so paid LLM tokens are never spent on a program that won't
enter the archive.

Other:
* Intra card delta-distribution + mean_delta now formatted using primary
  metric's decimals from metrics.yaml (was rendering raw 16-sig-fig floats).
* PopulationSnapshot.refresh: refetch programs in INCOMPLETE_STATES so
  QUEUED/RUNNING entries get up-to-date metrics on each snapshot.

* fix(memory): pick OPENROUTER_API_KEY when LLM_BASE_URL targets OpenRouter

Previously gigaevo.memory.config.OPENAI_API_KEY preferred $OPENAI_API_KEY
over $OPENROUTER_API_KEY unconditionally. In intra_extra_memory smokes we
export both — $OPENAI_API_KEY=sk-gigaevo (LiteLLM proxy) for the main Qwen
pipeline and $OPENROUTER_API_KEY=sk-or-v1-... for the GAM/A-Mem cheap path
(Gemini Flash via OpenRouter). The wrong-key-for-endpoint combination made
every GAM research_agent and IdeaTracker LLM call fail silently with a 401,
killing the extra-memory channel without any pipeline error.

Two-line fix:
- config.OPENAI_API_KEY now resolves OpenRouter key first when LLM_BASE_URL
  contains "openrouter.ai" (e.g. settings.yaml default).
- ideas_tracker.llm._init_clients picks the right key for the effective
  base_url (OPENROUTER_API_KEY for openrouter, OPENAI_API_KEY otherwise).
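
A sketch of endpoint-aware key selection, assuming the environment is passed in as a mapping. `resolve_api_key` is a hypothetical helper, and the fallback to the other key when the preferred one is unset is an assumption of this sketch rather than something the commit states.

```python
from typing import Mapping, Optional


def resolve_api_key(base_url: Optional[str], env: Mapping[str, str]) -> Optional[str]:
    """Prefer the key that matches the endpoint the client will call."""
    openrouter_key = env.get("OPENROUTER_API_KEY")
    openai_key = env.get("OPENAI_API_KEY")
    if "openrouter.ai" in (base_url or ""):
        # OpenRouter endpoint: its own key first (fallback is an assumption).
        return openrouter_key or openai_key
    # Any other endpoint, e.g. a LiteLLM proxy: the OpenAI-style key first.
    return openai_key or openrouter_key
```

In real use the mapping would be `os.environ`; taking it as a parameter keeps the function easy to test with both keys exported at once.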

Verified: with both keys exported and settings.yaml's OpenRouter base_url,
client.api_key now starts with "sk-or-". With base_url set to the LiteLLM
proxy, client.api_key falls back to "sk-gigaevo".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(suggester): rank-aware ambition rule in mutation_suggestions/system.txt

Adds a sub-bullet under the Evolutionary Statistics input description that
calibrates the suggester's parametric-vs-structural mix to the rank of the
parent in the window (already reported as `rank X/Y in window`):

* Top quartile -> at least one suggestion must be structurally orthogonal
  (different algorithm family, init scheme, or objective), not a parameter
  tweak. Parametric refinements alone are insufficient when the parent
  already tops its window.
* Bottom half -> at least one structural change required; fragile/harmful
  tags take precedence over rigid-parameter tweaks.
* Middle band -> mix exploitation with at least one orthogonal axis.

Rationale: smoke #3 (cycle 1 at max_mutants=20) showed gen-3 105901c4
(rank=1/Y) receiving 5 of 6 suggestions tagged `rigid` (pure parameter
tweaks), producing a plateau at 0.01142 (32.6% of 0.035). The breakthrough
to 0.01885 (53.9%) came from a SIBLING program a35a0f72 whose suggester
happened to find a structural harm (asymmetric_extra_points / symmetry
restoration). The new rule makes that structural pivot a stable
expectation at top-of-window, not an accident.

Generic — uses only the existing rank-in-window signal already in
EvolutionaryStatistics. No new fields, no new code, no heilbron-specific
tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(stats): rank line dropout when focal missing from snapshot

The iter-window rank computation in `_compute_iter_window_fields`
called `sorted(fits).index(focal_fit)` to find the focal's rank.
When the population snapshot lagged behind the pipeline view (or the
snapshot contained a stale `is_valid=0` view of the focal), `focal_fit`
was not in `fits`, the ValueError was swallowed silently, and
`iter_window_rank` became None.

Downstream renderer in `evolution/mutation/context.py:168` gates the
rank line on `iter_window_rank is not None`, so the entire
"rank X/Y in window" segment disappeared for top-of-window programs.
The mutation_suggestions/system.txt rank-aware ambition rule relies on
that text; with the rank line missing the rule was DORMANT throughout
the cycle-2 run (struct counter stayed at 0 for 100 mutations).

Verified on production program 4578cea1 from
output/cycle2_rankambition_20260518_022450 (fit=0.01509 at iter=49):
window valid=10, best in window=0.01455 — focal excluded, rank=None.

Fix:
- When focal is valid and not already in `valid_with_fit` (snapshot
  lag), include it explicitly using the up-to-date metrics passed by
  the pipeline. Downstream best/median/trend/valid_count then reflect
  reality.
- Replace `sorted.index` with a count-based rank (better+1). Robust to
  tied fitness values, which previously got under-counted by `index`'s
  first-match semantics.
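
The count-based rank can be sketched in a few lines. `count_based_rank` is a hypothetical name; the point is that the focal fitness does not need to be present in the list, so snapshot lag cannot raise `ValueError`, and tied values share the same rank.

```python
def count_based_rank(
    focal_fit: float,
    fits: list[float],
    higher_is_better: bool = True,
) -> int:
    """Rank = (number of strictly better fitness values) + 1."""
    if higher_is_better:
        better = sum(1 for f in fits if f > focal_fit)
    else:
        better = sum(1 for f in fits if f < focal_fit)
    return better + 1
```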

Tests:
- test_iter_window_rank_when_focal_missing_from_snapshot (RED→GREEN)
- test_iter_window_rank_when_focal_in_snapshot_but_stale_metrics (RED→GREEN)
- existing test_iter_window_rank_none_when_focal_invalid still passes
  (invalid focal correctly yields rank=None)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(suggester): lineage-exhaustion override in rank-aware ambition

Cycle-3 (rank rule LIVE) plateaued at 0.02614 (74.7%). Forensics:
- Top-5 fitness: 4/5 are structural-pivot archetypes (Guided Innovation
  / Approach Synthesis). Mean fit by archetype: Guided Innovation 0.01735
  vs Exploitation 0.01016. Structural pivots win.
- But ~50% of all mutations chose Exploitation. Among the 6 programs
  whose parent's intra card flagged "all valid children regressed or
  failed" (lineage exhaustion), the mutator still chose
  Exploitation/Proven Pattern Extension in 3/6 cases — wasting budget
  re-tweaking failed clusters.

The existing rank-aware rule says "at least one orthogonal-axis
suggestion" — too soft when local gradient is empirically dead.

New sub-bullet: when intra card shows ≥2 failed/regressed tried_strategy
clusters (or delta distribution catastrophic+failed ≥ 2 with improving=0),
EVERY suggestion must propose a structural axis NOT in tried_strategies.
Parametric tweaks of failed clusters are explicitly rejected in this
regime. Forces the suggester to leave exhausted local basins.

Generic, no task-specific tokens. +11 lines in
mutation_suggestions/system.txt under the rank-aware ambition block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(suggester): escape literal {} braces in lineage-exhaustion sub-bullet

673b9fb6 introduced `{regressed, failed}` as a literal phrase in the
mutation_suggestions/system.txt template. The prompt loader passes this
template through `str.format()` (factories.py:200), so `{regressed, failed}`
was parsed as a placeholder named 'regressed, failed' — raising KeyError
at every DAG build during cycle-4 startup. All 5 seed-eval DAGs failed,
the engine spun in an idle "no parents" loop, and the process exited at
t=07:53:04 with progs=5/scored=0/mut_done=0 — zero useful work done.

Fix: escape the literal braces as `{{regressed, failed}}`. Verified via
str.format() round-trip — only the three intended placeholders
({task_description}, {metrics_description}, {max_insights}) remain.

Lesson: any literal `{` or `}` in `.txt` prompts that flow through
.format() must be doubled. See feedback memory for hardening guide.
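
One way to catch this class of bug before a run is to list the placeholders a template actually contains, using the standard library's `string.Formatter`. A hedged sketch: `placeholder_names` and `unexpected_placeholders` are hypothetical helpers, and the template strings are shortened stand-ins for the real prompt.

```python
from string import Formatter


def placeholder_names(template: str) -> set[str]:
    """Field names str.format() would try to substitute in the template."""
    return {
        field_name
        for _, field_name, _, _ in Formatter().parse(template)
        if field_name is not None  # None marks literal text / escaped braces
    }


def unexpected_placeholders(template: str, allowed: set[str]) -> set[str]:
    """Placeholders that are not in the allowed set (likely unescaped braces)."""
    return placeholder_names(template) - allowed


broken = "clusters in {regressed, failed} for {task_description}"
fixed = "clusters in {{regressed, failed}} for {task_description}"
```

Running `unexpected_placeholders` over every `.txt` prompt with the intended placeholder set would flag the unescaped phrase at test time rather than at DAG build.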

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(suggester): server-computed EXHAUSTION ALERT banner overrides soft LEX

Cycle-4b shipped a soft "Lineage exhaustion override" sub-bullet in the
mutation-suggester system prompt. Qwen-3-235B-A22B-Thinking-2507 ignored
it: 3 parents in cycle-4b had ≥2 regressed/failed intra clusters yet
their children received parametric refinements of already-tried clusters
(plateau at 0.02630, +0.6% vs cycle-3 baseline 0.02614).

Replace the soft text with a deterministic server-side banner prepended
to the user message — most salient location, no LLM judgement on the
trigger condition.

Trigger (computed in MutationSuggestionAgent._format_exhaustion_block):
  - cond_a: ≥2 distinct clusters in {regressed, failed}, OR
  - cond_b: catastrophic + n_failed ≥ 2 AND improving = 0
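
The trigger can be read as a plain boolean predicate. A sketch under assumptions: the input shapes (a cluster-to-verdict mapping plus three counters) and the name `exhaustion_triggered` are hypothetical, and cond_b is parsed as `(catastrophic + n_failed) >= 2 and improving == 0`.

```python
NEGATIVE_VERDICTS = {"regressed", "failed"}


def exhaustion_triggered(
    cluster_verdicts: dict[str, str],
    catastrophic: int,
    n_failed: int,
    improving: int,
) -> bool:
    """Deterministic trigger: no LLM judgement involved."""
    negative_clusters = {
        name for name, verdict in cluster_verdicts.items()
        if verdict in NEGATIVE_VERDICTS
    }
    cond_a = len(negative_clusters) >= 2
    cond_b = (catastrophic + n_failed) >= 2 and improving == 0
    return cond_a or cond_b
```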

When triggered, emit `## EXHAUSTION ALERT — strict structural-pivot mode`
header + OVERRIDES sentence + explicit AVOID-LIST of negative-verdict
clusters + full tried-strategies context + `---` separator. The system
prompt now references the banner as a HARD CONSTRAINT that overrides the
rank-aware ambition mix.

Banner is task-agnostic (no heilbron/triangle leak — covered by test).

Tests: 13 new in tests/llm/test_mutation_suggestion_exhaustion.py cover
empty intra, single-cluster non-triggers, cond_a/cond_b paths, mixed
verdicts, override/AVOID-LIST language, task-agnosticism, and the
trailing separator. All pass. Lint clean. No new regressions in
tests/llm/ (371 pass) or tests/stages/ (938 pass; pre-existing 3
failures unrelated).

Pure context-building change — schema unchanged, pipeline unchanged,
launch command unchanged. Stays within the 0.035-sprint allowed-knobs
envelope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(suggester): revert rank+LEX to 9cca4344 baseline for cycle-6 A/B

Drops 24 lines from mutation_suggestions/system.txt — the rank-aware ambition
sub-bullet (commit 4caeb1b9) and the lineage-exhaustion banner clause
(commit 0ebd405c built on 73ed1207's brace-fix of 673b9fb6).

Net effect: the suggester prompt is now identical to its 9cca4344 state.
Empirical motivation:

  | Run                          | Best fitness | system.txt        |
  |------------------------------|--------------|-------------------|
  | sprint cycle-2 (2026-05-17)  | 0.0315       | pre-prescriptive  |
  | cycle3-from-scratch (uncomm) | ~0.030       | 9cca4344 baseline |
  | cycle-3 today (rank rule)    | 0.02614      | +13 rank          |
  | cycle-4b today (LEX soft)    | 0.02630      | +24 rank + LEX    |
  | cycle-5 today (LEX hard)     | 0.02536      | +24 rank + banner |

Today's three runs cluster at 0.025-0.026 (~17% below the 9cca4344
baseline). The collector.py rank-line bugfix + EXHAUSTION ALERT formatter
remain in place — only the LLM-facing prompt content is reverted. The
formatter just becomes dormant since the prompt no longer references its
output.

Cycle-6 will A/B this against the cycle-5 state to confirm whether the
+24 lines of guidance were net-destructive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(stats): R2 — MAD-based trend noise floor + archive_valid_fitnesses field

Replace legacy 5%·|t1| trend threshold in `_trend_from_thirds` with a
nonparametric MAD (median absolute deviation) over the recent valid-fitness
window. The fixed 5% ratio reads as "flat" on low-fitness regimes where real
regressions are present at sub-5% absolute magnitude — the cycle-6 audit
showed parent contexts with medians falling 0.00165 → 0.00082 (clear
regression) still labelled `flat` and feeding the consumer's flat-trend
condition into Exploitation.

MAD adapts to the run's empirical noise scale: no chosen constant. Bootstrap
fallback to legacy 5%·|t1| ratio when fewer than `N_MIN_FOR_MAD=4` valid
samples in the window — pre-existing framework behaviour preserved during
the initial iterations.

Also exposes `archive_valid_fitnesses: tuple[float, ...]` as a transient
field on `EvolutionaryStatistics` (not persisted; rebuilt per emission).
This is the source-of-truth distribution that R1 (archive-quartile regime)
will read in a follow-up commit.

Constants introduced are all data-availability gates, not regime thresholds:
- `N_MIN_FOR_MAD = 4` — minimum sample size for MAD to be meaningful
- `_TREND_EPSILON = 1e-12` — numerical safety against degenerate MAD=0
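
A rough sketch of a MAD-based noise floor with the bootstrap fallback described above. Only the two constants come from this commit message; `mad`, `trend_from_thirds`, the way the thirds are sliced, and the reading of `rising` as "improving in the metric's good direction" are assumptions of this sketch.

```python
from statistics import median

N_MIN_FOR_MAD = 4        # minimum sample size for MAD to be meaningful
_TREND_EPSILON = 1e-12   # guards against a degenerate MAD of 0


def mad(values: list[float]) -> float:
    """Median absolute deviation around the median."""
    centre = median(values)
    return median(abs(v - centre) for v in values)


def trend_from_thirds(fits: list[float], higher_is_better: bool = True) -> str:
    """Compare the median of the first third with that of the last third."""
    n = len(fits)
    if n < 3:
        return "flat"
    k = n // 3
    t1, t3 = median(fits[:k]), median(fits[-k:])
    if n >= N_MIN_FOR_MAD:
        threshold = max(mad(fits), _TREND_EPSILON)
    else:
        threshold = 0.05 * abs(t1)  # legacy ratio while data is scarce
    delta = (t3 - t1) if higher_is_better else (t1 - t3)
    if delta > threshold:
        return "rising"
    if delta < -threshold:
        return "falling"
    return "flat"
```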

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(context): R1+R3 — archive-quartile regime in mutation_context render

Adds two new tokens to every rendered parent context once the archive holds
≥ `N_MIN_ARCHIVE=4` valid programs:

  Archive: N=N  median=…  q75=…  best=…
  Regime: BAD/MIDDLE/GOOD (Q? of archive)

And appends `archive-quartile Q?` inline to the existing rank line:

  rank 2/8 in window, archive-quartile Q1)

Both signals are derived from the same archive distribution emitted by R2's
`archive_valid_fitnesses` field on `EvolutionaryStatistics`. Quartile
boundaries are universal statistical convention (Q1=25%, Q2=50%, Q3=75%) —
not chosen thresholds. The mapping Q1→BAD, Q2/Q3→MIDDLE, Q4→GOOD is the
framework's editorial choice with no numeric constants.
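
The quartile-to-regime mapping can be sketched as follows. The Q1/Q2/Q3/Q4 to BAD/MIDDLE/GOOD mapping and `N_MIN_ARCHIVE = 4` come from this commit message; the quartile assignment itself (fraction of the archive strictly worse than the focal program) and the name `archive_regime` are assumptions, since the actual boundary handling is not shown here.

```python
from typing import Optional

N_MIN_ARCHIVE = 4
_REGIME_BY_QUARTILE = {1: "BAD", 2: "MIDDLE", 3: "MIDDLE", 4: "GOOD"}


def archive_regime(
    focal_fit: float,
    archive_fits: list[float],
    higher_is_better: bool = True,
) -> Optional[tuple[str, str]]:
    """Return (quartile label, regime), or None while the archive is too small."""
    if len(archive_fits) < N_MIN_ARCHIVE:
        return None  # bootstrap: no Regime token is emitted
    if higher_is_better:
        worse = sum(1 for f in archive_fits if f < focal_fit)
    else:
        worse = sum(1 for f in archive_fits if f > focal_fit)
    fraction_worse = worse / len(archive_fits)
    quartile = min(4, int(fraction_worse * 4) + 1)
    return f"Q{quartile}", _REGIME_BY_QUARTILE[quartile]
```

Using one function for both the quartile label and the regime keeps a single source of truth, which is the property R3 relies on for the rank line.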

R1 v2 design properties:
- No dependency on `MetricSpec.upper_bound` — regime is derived from the
  run's empirical archive distribution, so the bundle is task-agnostic.
  Tasks declaring `upper_bound` additionally get an informational
  `Target: …  focal_gap=…` line; R6's archetype gate does NOT read it.
- Direction-aware via `MetricSpec.higher_is_better` — works identically for
  loss-style metrics where small = good.
- Bootstrap-safe: no token emitted when archive < 4 valid; rule falls back
  to original Step-6 logic.
- O(N log N) per render on archive size bounded by `max_mutants=100`.

R3 reuses R1's `quartile_str` so there is a single source of truth and the
rank line cannot drift from the Regime line.
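
A rough sketch of how a direction-aware quartile and the Q→Regime mapping could be computed. It assumes the quartile is taken from the fraction of archive members strictly worse than the focal; the tie handling and the function names `quartile_of_focal` and `regime_of_focal` are illustrative assumptions, not the framework's implementation.

```python
N_MIN_ARCHIVE = 4  # minimum valid programs before any regime token is emitted

# Editorial mapping stated in the commit message.
REGIME_BY_QUARTILE = {"Q1": "BAD", "Q2": "MIDDLE", "Q3": "MIDDLE", "Q4": "GOOD"}


def quartile_of_focal(focal, archive, higher_is_better=True):
    """Return 'Q1'..'Q4' for the focal, or None during bootstrap."""
    if len(archive) < N_MIN_ARCHIVE:
        return None
    # Flip the sign for loss-style metrics so that larger is always better.
    sign = 1.0 if higher_is_better else -1.0
    # Fraction of archive members strictly worse than the focal
    # (this tie handling is an assumption).
    worse = sum(1 for v in archive if sign * v < sign * focal)
    frac = worse / len(archive)
    if frac < 0.25:
        return "Q1"
    if frac < 0.50:
        return "Q2"
    if frac < 0.75:
        return "Q3"
    return "Q4"


def regime_of_focal(focal, archive, higher_is_better=True):
    """Map the quartile to BAD/MIDDLE/GOOD, or None when no token is emitted."""
    q = quartile_of_focal(focal, archive, higher_is_better)
    return None if q is None else REGIME_BY_QUARTILE[q]
```

Deriving both the rank-line annotation and the Regime line from one quartile function is what keeps the two from drifting apart.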

Tests: 10 new `TestArchiveQuartileRegime` cases cover Q1/Q2/Q3/Q4 placement,
archive < 4 (no regime emission), `higher_is_better=False` direction,
archive-quartile inclusion in rank line, ties at quartile boundaries, target
decoration on/off.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(prompts): R6 — archive-quartile archetype gate + suggester tag-bias

Two-layer defense against wasted-budget Exploitation on weak parents.

mutation/system.txt (consumer):
- Adds a Selection Rule that makes `Regime: BAD` (focal in Q1 of archive)
  a HARD GATE: Exploitation archetypes (1-3) FORBIDDEN. Choose Exploration
  (4-6) or, if intra has an "improved" verdict with an untried extension,
  Hybrid (7-8). MIDDLE (Q2/Q3) gates Exploitation on intra-improved +
  untried-extension; otherwise prefer Hybrid. GOOD (Q4) opens all
  archetypes per other rules. Bootstrap (Regime line absent) falls back
  to original logic.
- Adds an Evolutionary Statistics descriptive paragraph explaining the
  new `Archive:` and `Regime:` lines so the LLM knows the rendered
  tokens.
- Trend label vocab synced to code emit: `rising / flat / falling` (the
  legacy `improving / regressing` words drifted from collector.py:126
  and broke the consumer's match logic).

mutation_suggestions/system.txt (producer):
- Adds an Archive-quartile awareness rule: in BAD regime (Q1) do NOT tag
  patterns as `beneficial` based only on local intra-card "improved"
  verdicts. Prefer `fragile` or `rigid`. Reserve `beneficial` for MIDDLE
  (Q2/Q3) or GOOD (Q4) regimes.
- Disambiguates earlier informal "low-fitness regime" wording (which
  collided with the formal `Regime:` tag) — the metric-scale heuristic
  is now explicitly called out as SEPARATE from the formal Regime tag.
- Trend vocab synced to `rising / flat / falling`.

Defense-in-depth: the producer suppresses `beneficial` tag at source for
Q1 parents; the consumer additionally forbids the Exploitation archetype
the tag would have biased toward. The two rules layer — they don't
duplicate. If the suggester slips and emits `beneficial`, the mutator's
hard gate still routes the mutation to Exploration.

Tests: TestR6ArchiveQuartileGate (consumer) + TestR6SuggesterTagBias
(producer) + TestRegimeAndQuartileVocabularySynergy (cross-prompt vocab
consistency for tag, verdict, quartile, regime scales).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(tools): trajectory_shape.py — log-based closeout analyzer for cycle comparisons

Computes the 6 trajectory metrics from plan §Verification on any
output/cycle*_*/evolution_*.log:

  - best_at_end (frontier final)
  - monotonicity_pct (cohort-mean signal, NOT running-max which would always be 100%)
  - per_stage_best (early/mid/late thirds)
  - longest_stagnation_min (gap between consecutive frontier-bumps)
  - rtail_ge_020 / rtail_ge_030 (right-tail mass — # cells reached past 0.020 / 0.030)
  - cells_filled

Two modes:
  python tools/trajectory_shape.py <log_file>                  # single report
  python tools/trajectory_shape.py --compare a.log b.log c.log # variance-floor verdict

Variance-floor rule (1.5×spread): with N≥3 baseline runs, the bar is
mean(baselines)+1.5×spread; treatment > bar → CONFIRMED, otherwise NOISE.

Works on any log file regardless of Redis state — logs are permanent, Redis dbs
get flushed. Built during cycle-7's runtime, smoke-tested retroactively on
cycles 3/4b/5/6 to establish n=4 baseline (mean=0.02596, spread=0.00094,
zero breakouts past 0.030).
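
The variance-floor rule can be sketched as below. The commit message does not define "spread"; this sketch assumes it is the range (max minus min) of the baselines, and the function name `variance_floor_verdict` is hypothetical.

```python
from statistics import mean


def variance_floor_verdict(baselines, treatment, k=1.5):
    """Return (verdict, bar) for a treatment run against baseline runs.

    Assumption: spread = max(baselines) - min(baselines).
    """
    if len(baselines) < 3:
        raise ValueError("variance-floor rule needs at least 3 baseline runs")
    spread = max(baselines) - min(baselines)
    bar = mean(baselines) + k * spread
    verdict = "CONFIRMED" if treatment > bar else "NOISE"
    return verdict, bar
```

Plugging in the quoted n=4 baseline (mean=0.02596, spread=0.00094) gives a bar of 0.02596 + 1.5 × 0.00094 = 0.02737.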

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(context): R7+R8 v3.1 — archive distribution with worst/median/best + archive-percentile token

Renders the archive's distribution as `Archive: N=…  worst=…  median=…  best=…` plus an
`archive-percentile pXX of N=Y` annotation on the existing rank line. Both lines are direction-
aware via `MetricSpec.higher_is_better`. No `Regime:` or `Target:` token is rendered — the LLM
reads the task's target from the task description and judges the focal against the rendered
distribution itself.

Why v3.1:
- v3's `Regime: BAD/MIDDLE/GOOD` was a pre-baked classifier. Trust-the-model-synthesis
  principle: render data, let the LLM judge.
- v3's `Target:`/`focal_gap` was likewise pre-baked. The task description already states the
  target; rendering it twice (and adding a derived `focal_gap`) introduces a hardcoded
  interpretation channel the LLM doesn't need.
- Only deterministic gate kept: archive-percentile (a single direction-aware quality
  percentile, 100=best). Quartile boundaries 25/75 are statistical convention, not magic.

Bootstrap-mislead defense: the rendered `worst=… median=… best=…` triplet makes archive
compression visible. A compressed bootstrap (N=7, all <0.002) lands the focal at p100 but the
distribution itself shows the LLM the archive is far from the task's stated target. The
qualitative target-awareness clause in the prompt instructs the model to apply that judgment.

Tests:
- test_archive_line_includes_worst_higher_is_better_true
- test_archive_line_worst_inverts_for_higher_is_better_false
- test_compressed_bootstrap_renders_rich_archive_no_target_line
- test_target_line_never_rendered_when_upper_bound_declared
- test_target_line_never_rendered_when_upper_bound_none

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(prompts): R9 v3.1 — archive-percentile gate + qualitative target awareness

Mutator and suggester prompts now reference the v3.1 context surface: `Archive: …` and
`archive-percentile pXX of N=Y`. The only deterministic gate is the archive-percentile gate
(focal in bottom quartile → Exploitation FORBIDDEN; focal in top quartile → all archetypes
eligible). Quartile boundaries 25/75 are statistical convention, not magic numbers.
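
A minimal sketch of the gate, assuming the percentile is the share of archive members that are not better than the focal (so the best member lands at p100). The repository's own helper is `_archive_percentile_of_focal`; the functions below are hypothetical stand-ins and may differ from it, for example in tie handling.

```python
def archive_percentile(focal, archive, higher_is_better=True):
    """Direction-aware quality percentile, 100 = best (assumed definition)."""
    # Flip the sign for loss-style metrics so that larger is always better.
    sign = 1.0 if higher_is_better else -1.0
    not_better = sum(1 for v in archive if sign * v <= sign * focal)
    return 100.0 * not_better / len(archive)


def gate_bucket(percentile):
    """Bucket by the 25/75 quartile boundaries."""
    if percentile < 25.0:
        return "bottom"
    if percentile >= 75.0:
        return "top"
    return "middle"


def exploitation_forbidden(percentile):
    """Only the bottom quartile hard-forbids Exploitation archetypes."""
    return gate_bucket(percentile) == "bottom"
```

Under this definition a compressed bootstrap archive still puts its best member at p100, which is why the rendered `worst=… median=… best=…` triplet is needed alongside the percentile.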

Target awareness is qualitative: the task description states the problem's target/bound; the
prompt instructs the LLM to compare the rendered `worst=… median=… best=…` distribution
against that target and apply judgment. No numeric threshold is imposed because fitness scale
is typically non-linear — small absolute distances at low fitness are structurally harder than
the same absolute distance at high fitness.

Removed from previous v3:
- `Regime: BAD/MIDDLE/GOOD` pre-baked classifier (replaced by archive-percentile + prose)
- `Target:`/`focal_gap` rendered tokens (LLM reads target from task description)
- `half the distance` magic-constant compound rule

Removed in this v3.1 cleanup pass:
- Legacy "framework does NOT render a separate `Target:` line" mentions — negating-by-mention
  is noise; prompts now positively instruct reading the target from the task description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(v3.1): archive-percentile gate, archive distribution, no Target/Regime tokens

test_mutation_context.py:
- test_target_line_never_rendered_when_upper_bound_declared: asserts `Target:` and `focal_gap`
  are absent even when MetricSpec.upper_bound is set.
- test_target_line_never_rendered_when_upper_bound_none: parallel assertion for tasks without
  declared upper bound.
- test_compressed_bootstrap_renders_rich_archive_no_target_line: documents the bootstrap-
  mislead defense — N=7 compressed archive with focal at p100 still renders worst/median/best
  so the LLM can judge the gap against the task target itself.
- test_archive_line_includes_worst_higher_is_better_true: asserts `worst=…` and `best=…` are
  rendered with the strongest at `best` for fitness-style metrics.
- test_archive_line_worst_inverts_for_higher_is_better_false: parallel for loss-style metrics
  (worst = highest value, best = lowest).

test_prompts.py (TestV31* replacing TestR6*):
- TestV31ArchivePercentileGate: archive-percentile referenced, no Regime/archive-quartile/
  Target/focal_gap/half-distance vocab, FORBIDDEN keyword on Q1 Exploitation, 25/75 cited,
  qualitative target awareness via task description, non-linear scale acknowledged, trend
  vocab matches collector (rising/flat/falling).
- TestV31SuggesterTagBias: parallel for mutation_suggestions prompt.
- TestV31VocabularySynergy: cross-prompt consistency for tag scale, verdict scale, quartile
  boundaries (25, 75), archive distribution vocab (worst/median/best).

364 tests pass. Lint clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(audit): v3.1 mutation decision tree — channels, gate, decision table, worked examples

Spells out exactly how the LLM mutator selects an archetype (Exploitation 1–3 / Exploration
4–6 / Hybrid 7–8) under the v3.1 surface:

1. Six context channels (C1 Metrics, C2 Insights, C3 Intra Memory, C4 Memory Cards, C5
   Evolutionary Statistics, C6 Ancestral Momentum) — producer / carrier / consumer.
2. Two decision components: one deterministic gate (archive-percentile, only constants
   are quartile boundaries 25/75) and one qualitative target-awareness clause (LLM reads
   task description, judges against rendered Archive distribution; no numeric threshold
   because fitness scale is non-linear).
3. Exhaustive 18-row decision table covering (archive-percentile bucket × intra verdict
   × trend × invalid streak) → archetype.
4. Four worked heilbron examples: bootstrap-mislead p100 case (Hybrid 7 override),
   normal mid-run (Hybrid), late-run refinement (Exploitation 1), plateau exit
   (Exploration).
5. Cycle-9 mid-run invariants: 0% Exploitation on archive-percentile<25 focals; no
   Regime/Target/archive-quartile tokens; archive-percentile rendered ≥95% post-bootstrap.
6. Universal-across-tasks proof: only per-task input is higher_is_better flag; 25/75 are
   statistical-convention quartile boundaries, not chosen values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tools(v31-validator): read-only sampler — recompute archive-percentile + decision-tree predictions vs LLM choices

Non-mutating Redis sampler that walks DBs 13/14/15 program by program,
parses `Program.metadata.mutation_context` (the rendered prompt) for the
parent's state (focal fitness, valid sibling fitnesses, trend, intra
verdicts), and recomputes the v3.1 archive-percentile from valid
siblings. Emits JSONL with fields: tree_bucket, tree_eligible_archetypes,
archetype_chosen, fitness_delta, cf_tags (which of CF-A..CF-E cells the
sample lands in), match (tree-prediction vs LLM-choice).

Reuses:
- gigaevo.programs.stages.collector.N_MIN_ARCHIVE
- gigaevo.evolution.mutation.context._archive_percentile_of_focal
  (direction-aware)
- gigaevo.database.redis_program_storage RedisProgramStorage.get_all
- gigaevo.programs.program.Program.get_metadata (base64 deserialization)

Output drives docs/audits/MUTATION_DECISION_TREE_V3_1_COUNTERFACTUAL_AUDIT_2026-05-18.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(audit): v3.1 decision tree — counterfactual audit on 289 prior-run programs (DBs 13/14/15)

Recomputes v3.1 archive-percentile and tree-predicted archetype bucket
on every program with mutation_context metadata across DBs 13/14/15
(n=289), compares to the LLM's actual archetype_choice and the child's
fitness_delta, then groups by the 5 candidate counterfactual cells
identified in the audit plan:

  CF-A empty intra (first child)                 n=225 — modal Hybrid 7 (57), pos-rate 42.1%
  CF-B invalid focal with archive line           n=0   — unreachable under OLD prompt; documented as gap
  CF-C low-N flat trend (noise-dominated)        n=32
  CF-D middle-band percentile + falling trend    n=19
  CF-E top-quartile + spread wide + far-target   n=53  — pos-rate 92.5%, Exploitation 32/35 = 91% wins

Headline finding: v3.1's target-awareness override, which demotes top-quartile
parents to Hybrid, is empirically TOO RESTRICTIVE. CF-E shows Exploitation
beats Hybrid 32/35 when the gate permits both. The audit also found 18
counterfactual-A samples: gate violations that improved fitness anyway.

Recommendations applied in next 2 commits:
- REV-1 soften row 13 (target-awareness no longer forbids Exploitation)
- REV-3 add row 19  — empty intra middle-band → Hybrid 7 default
- REV-4 add row 19a — invalid focal → Exploration with corrective
- REV-2 (row 11 softening for CF-D) documented as PROPOSAL — n=19 too thin

Observational only — contexts rendered under OLD prompt surface; the
sampler recomputes v3.1 tokens from the same underlying numerics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(audit): v3.1 tree — soften row 13 target-awareness + add rows 19 / 19a per counterfactual audit

Three additive/softening edits to MUTATION_DECISION_TREE_V3_1_2026-05-18.md,
all driven by empirical findings in the counterfactual audit
(MUTATION_DECISION_TREE_V3_1_COUNTERFACTUAL_AUDIT_2026-05-18.md):

REV-1 — row 13 (top-quartile + spread wide + far-target):
  OLD: "demote to Hybrid 7 even though gate permits Exploitation"
  NEW: "prefer Hybrid 7 OR Exploitation 1 — gate does NOT forbid
       Exploitation. Choose by C2 insight severity and C6 ancestral_step_delta."
  Evidence: CF-E n=53, pos-rate 92.5%; Exploitation wins 32/35 in this cell.

REV-3 — new row 19 (empty intra, middle-band 25≤p<75):
  Add explicit default: Hybrid 7 (Guided Innovation).
  Evidence: CF-A n=225, modal choice Hybrid 7 (57 picks), pos-rate 42.1%.
  Closes a gap where the tree relied on the LLM inferring a default.

REV-4 — new row 19a (invalid focal with archive line):
  Defensive rule — force Exploration with corrective mechanism regardless
  of archive-percentile (focal cannot be refined when it is invalid).
  Evidence: CF-B was unreachable in prior data; rule is forward-looking.

REV-2 (row 11 middle-band+falling softening) NOT applied — CF-D n=19 too
thin; documented in audit as proposal pending more data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(prompts): soften target-awareness clause + add noise-dominated & empty-intra rules

Mirrors REV-1/REV-3 from the decision-tree counterfactual audit into the
two system prompts the LLM actually sees.

mutation/system.txt:
- Softened target-awareness paragraph: target awareness shapes priority
  WITHIN the gate-permitted set; it never forbids an archetype the gate
  allows. When archive.best far below target AND focal in top quartile,
  BOTH Hybrid 7 and Exploitation 1 are legitimate.
- Added noise-dominated-trends bullet: falling/flat with
  `[only N valid — too few for trend]` (or iter_window_valid < 9) is
  inconclusive — do NOT force Exploration on it alone.
- Added empty-intra default bullet: first child of a fresh parent in
  middle band defaults to Hybrid 7 (Guided Innovation).

mutation_suggestions/system.txt:
- Softened target-awareness paragraph (same wording philosophy).
- Added noise-dominated-trends bullet so the analyst does not raise
  severity to `high` on a noisy signal alone.

Total: ≤10 lines net per file, all additive or replacement softening.
Empirical justification: CF-E audit shows Exploitation wins 32/35 when
the gate permits both; CF-A shows Hybrid 7 is the modal LLM choice
(57/225) and best per-pick pos-rate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tools(pseudo-evo-bench): single-shot mutation A/B harness with archetype-distribution analysis

Tests prompt-wrapper changes (mutation/system.txt + user.txt) on a fixed cohort of
50 stratified parents from DBs 13/14/15. Each parent's mutation_context blob is
frozen from its original search-time run; only the system.txt/user.txt wrappers
are re-rendered from the working tree at sample time.

Components:
- sample_parents.py: stratify 50 parents (17/18/15 by fitness bucket), render
  current HEAD prompts, write parents.json (idempotent under SEED=20260518)
- run_qwen.py: 1 LLM call per parent at concurrency=6 via LiteLLM proxy
- eval_mutants.py + eval_runner.py: parallel heilbron validate() at concurrency=4
- analyze.py: PRIMARY signal is archetype/strategy distribution (coverage,
  entropy, group balance, v3.1 gate compliance, archetype shift matrix).
  Fitness delta is reported as SECONDARY since single-shot is noise-dominated
  by parent quality and validity.
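
A sketch of what an idempotent stratified draw might look like. The seed and the 17/18/15 split come from the text above; the assignment of those quotas to `low`/`mid`/`high`, the function signature, and the input shape are assumptions for illustration only.

```python
import random

SEED = 20260518
# Assumption: the 17/18/15 split maps to low/mid/high in this order.
QUOTAS = {"low": 17, "mid": 18, "high": 15}


def sample_parents(pool_by_bucket, quotas=QUOTAS, seed=SEED):
    """Draw a fixed, reproducible cohort of parent ids per fitness bucket."""
    rng = random.Random(seed)  # local RNG, independent of global state
    chosen = []
    for bucket in sorted(quotas):
        # Sort candidates so the draw does not depend on input ordering.
        candidates = sorted(pool_by_bucket[bucket])
        chosen.extend(rng.sample(candidates, quotas[bucket]))
    return chosen
```

Seeding a local `random.Random` and sorting the candidates first means two runs over the same pool return the same 50 parents, which is what the paired A/B comparison relies on.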

Scope (per README): tests prompt-wrapper changes only. Stage-internal
mutation_context build (collector, intra/extra memory, lineage) is NOT
exercised because contexts are pre-rendered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tools(pseudo-evo-bench): iter0 (HEAD) vs iter1 (pre-softening v3.1) — archetype-distribution A/B

Same 50 parents (verified parent_id-identical), frozen mutation_context blocks
(verified byte-identical per parent). Only difference: rendered system_prompt
(iter0 = 14,375 chars / HEAD with +14-line softening; iter1 = 13,415 chars at
commit 11eb4d4b before softening).

PRIMARY: archetype/strategy distribution
=========================================
                       iter0 (HEAD softened)    iter1 (v3.1 sharp)
Coverage               7/8 archetypes           7/8 archetypes
Entropy                2.583 bits               2.565 bits
Group balance E/X/H    28% / 41% / 30%          20% / 52% / 28%
Group skew (max-min)   13.0%  ← fairer          32.6%
Low anti-Exploit       87.5% (n=16)             93.8% (n=16)
High Exploit rate      50.0%                    42.9%

Per-bucket group shift (where softening actually changed behavior):
  low:   ~unchanged (gate respected in both)
  mid:   iter0 56% Hybrid → iter1 50% Explore  (softening pushed mid TOWARD Hybrid)
  high:  iter0 14% Hybrid → iter1 36% Hybrid   (softening pushed high TOWARD Explore)

Archetype shift matrix: 16/45 decisive pairs (36%) cross GROUP boundary
between iter0 and iter1 — the +14 lines DO steer the LLM, the question is
whether the steering is desirable.

Common finding across BOTH iterations:
  archetype #8 "Conservative Exploration" picked ZERO times (0/92 mutations)
  → strong signal the prompt does NOT surface this archetype effectively
  archetype #7 "Guided Innovation" dominates Hybrid (14/14 + 13/13)

SECONDARY: fitness (noise-dominated, but directionally informative)
==================================================================
Sign test on paired Δ: iter1 wins 20, iter0 wins 7, ties 23 → p≈0.012
Validity: iter0 26/50, iter1 32/50

Interpretation: HEAD softening trades 6pp of single-shot fitness loss for
group-balance fairness. Neither prompt surfaces archetype #8.
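
For readers who want to re-check the sign test, here is a small sketch that drops ties and computes both an exact two-sided binomial p-value and a normal approximation without continuity correction. Which variant the harness used is not stated. For 20 vs 7 wins the normal approximation comes out near the quoted 0.012, while the exact figure is somewhat larger, about 0.019.

```python
from math import comb, erfc, sqrt


def sign_test_p(wins_a, wins_b):
    """Two-sided sign test on paired wins (ties already dropped).

    Returns (exact_binomial_p, normal_approx_p).
    """
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Exact: double the lower tail of Binomial(n, 0.5), capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    exact = min(1.0, 2.0 * tail)
    # Normal approximation, no continuity correction:
    # z = |wins_a - n/2| / sqrt(n/4) = |wins_a - wins_b| / sqrt(n)
    z = abs(wins_a - wins_b) / sqrt(n)
    approx = erfc(z / sqrt(2.0))
    return exact, approx
```

Either way the paired difference is unlikely to be pure chance at n=27 decisive pairs.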

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* prompts(mutation/system.txt): rewrite archetype #8 as Component Substitution + sharpen #5 separator from #4

Empirical motivation: pseudo-evo bench iter0+iter1 picked archetype #8 (Conservative Exploration) ZERO times across 92 valid responses. Diagnosis — "explore within structural / interface constraints" describes properties any valid mutation already has, not an edit pattern the model can operationalise.

Archetype #8 → Component Substitution
Replace ONE named subroutine or building block (scoring function, sampler, init scheme, distance metric, update rule, post-processor) with an alternative of the same kind, occupying the same slot — same inputs, same output shape — so surrounding control flow, interfaces, and hyper-parameters remain unchanged. Distinct from #4 (changes algorithm family) and from #7 (adds a component alongside an existing one rather than replacing).

Archetype #5 → sharpened separator from #4
"Change the SET of admissible solutions (relax/tighten a constraint, drop a parity or symmetry rule, allow rotations, switch from discrete to continuous parameterisation) without changing the search algorithm itself. #4 changes HOW the search runs; #5 changes WHAT set is searched."

Net effect — eight distinct edit verbs:
Exploit: tune (#1) / extend-scope (#2) / remove (#3)
Explore: reinvent-algorithm (#4) / change-feasibility-set (#5) / synthesise-from-memory (#6)
Hybrid:  add-alongside (#7) / substitute-component (#8)

Raises realised entropy ceiling above today's log2(5)≈2.32 bits toward the 3.0-bit max for 8 archetypes. Pseudo-evo iter2 will verify the model actually picks #8 and discriminates #5 from #4.
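
The entropy figures here are Shannon entropy in bits over the archetype picks. A minimal sketch, with a hypothetical function name and archetypes represented as plain labels:

```python
from collections import Counter
from math import log2


def archetype_entropy_bits(picks):
    """Shannon entropy (bits) of the empirical archetype distribution."""
    counts = Counter(picks)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

A uniform spread over 5 archetypes gives log2(5) ≈ 2.32 bits and a uniform spread over all 8 gives exactly 3.0 bits, so an archetype that is never picked lowers the reachable maximum.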

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* schema(mutation): enforce archetype names via Literal + add drift-detection tests

Changes:
1. Added ARCHETYPE_NAMES constant and ArchetypeName Literal type to constants.py (single source of truth for canonical names)
2. Updated MutationStructuredOutput schema to use ArchetypeName Literal (strict validation, rejects unknown archetype strings at parse time)
3. Added 4 new test functions to test_mutator_system_prompt.py:
   - test_archetype_names_appear_in_system_prompt() — catches drift between schema Literal set and prompt's archetype menu
   - test_archetype_count_is_eight() — asserts len(ARCHETYPE_NAMES)==8
   - test_mutation_output_accepts_canonical_archetypes() — validates each canonical name
   - test_mutation_output_rejects_unknown_archetype() — rejects out-of-set strings with ValidationError
4. Fixed test_defaults in test_mutation_agent.py to use canonical archetype "Precision Optimization" instead of invalid "test"

Motivation: Prevent silent LLM output rejection when system.txt and schema drift (e.g., if archetype #8 is rewritten in prompt but not updated in Literal).
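
A standard-library sketch of the idea: one `Literal` is the source of truth, and both the parse-time check and the prompt drift check derive from it. Only three of the eight canonical names appear in this log, so the `Literal` below is deliberately partial. The real schema presumably uses a validation library that raises `ValidationError`; the `ValueError` and the helper names here are illustrative assumptions.

```python
from typing import Literal, get_args

# Partial set for illustration: only names that appear in this log.
# The real ARCHETYPE_NAMES constant holds eight entries.
ArchetypeName = Literal[
    "Precision Optimization",
    "Guided Innovation",
    "Component Substitution",
]
ARCHETYPE_NAMES = get_args(ArchetypeName)


def parse_archetype(value):
    """Reject any archetype string outside the canonical set."""
    if value not in ARCHETYPE_NAMES:
        raise ValueError(f"unknown archetype: {value!r}")
    return value


def names_missing_from_prompt(prompt_text):
    """Drift check: canonical names that the prompt text never mentions."""
    return [name for name in ARCHETYPE_NAMES if name not in prompt_text]
```

A drift test then asserts that `names_missing_from_prompt(system_prompt)` is empty, so a prompt rewrite that renames an archetype fails CI instead of silently rejecting LLM output.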

All 104 tests pass. No changes to mutation/system.txt (archetype #5/#8 redesign committed separately 2026-05-18 at 7a52f45d).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(auto-optimize-loop): spec + reference schemas + history/patterns scaffolds

Adds the auto-optimize-loop task spec that drives autonomous cycles tuning
ONLY the mutation operator's context factory graph and the archetype
framework. Primary success criterion is healthy trajectory + healthy
mutants; 0.03 is the "real improvement" floor with 0.035 aspirational.

All future auto-loop cycle commits land linearly on r7-r8-r9-v3-bundle
and are identified by their commit SHA captured at LAUNCH time. Each
cycle writes a Reconstruction MD and Analytics MD; the Analytics MD's
retroactive invariant audit hard-gates the next cycle's PROPOSE step.

Files:
- docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md (primary spec)
- docs/audits/references/AUTO_OPTIMIZE_RECONSTRUCTION_MD_SCHEMA.md
- docs/audits/references/AUTO_OPTIMIZE_ANALYTICS_MD_SCHEMA.md
- docs/audits/AUTO_OPTIMIZE_CYCLE_HISTORY.md (append-only ledger)
- docs/audits/AUTO_OPTIMIZE_PATTERNS.md (evidence ledger)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: gitignore output/runs/tool-caches + capture pre-loop audit MDs

- Add .gitignore rules for output/ runs/ problems/heilbron_pro/
  rotated litellm sampler logs, and throwaway tools/ subdirs
  (benchmark_gemini_ab, insights_ablation, lineage_card_scaffold).
- Capture pre-loop audit MDs under docs/audits/ that informed the
  cycle-9 redesign + auto-optimize-loop spec (insights, lineage memory
  plans, mutation guidance rubric, cycle-8 prelaunch, prompt redesign,
  etc.). These are read-only history; future PRs cite them.
- Collapse multi-line attach_inputs({...}) calls in
  tests/stages/test_intra_memory_cache.py to single-line form
  (pure formatting, no behavior change).
- Add docs/audits/AUTO_OPTIMIZE_CYCLE_0_ANALYTICS.md (cycle-0 = cycle-9
  baseline) pre-drafted at T+1h17m with <TBD-FINALIZE> markers;
  end-of-run values will be filled in once PID 2008891 exits.

This is a non-loop chore commit. It becomes cycle-1's PARENT_SHA
so the §8.1 invariant
  git rev-list --count $PARENT_SHA..HEAD == 1
will be satisfiable when cycle-1's single IMPLEMENT commit lands on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: ruff fix + format on tools/pseudo_evo_bench (pre-loop)

Applies `ruff check --fix` + `ruff format` to the pseudo_evo_bench
A/B harness scripts. All changes are cosmetic:
- 11 I001 import-order errors fixed (stdlib imports merged into
  alphabetic order with site-package imports).
- Long json.dumps(...) and dict literals reflowed by the formatter.

These files have been failing `ruff check .` since they landed
(commits 893113c1 / a9b4bee5). The §8.1 pre-launch lint invariant
in the auto-optimize loop spec requires a clean `ruff check .` +
`ruff format --check .`, so this clean-up is a prerequisite for
cycle-1 launch.

Safety:
- pseudo_evo_bench is NOT imported by gigaevo/ or tests/ — grep
  across both trees returns zero hits. The currently-running
  cycle-9 (PID 2008891) does not touch these files.
- Only formatting changes; no semantic edits, no API surface change.
- `pytest tests/prompts/test_mutator_system_prompt.py` continues
  to pass (archetype-drift detection).

Verified:
  ruff check .           → All checks passed!
  ruff format --check .  → 1172 files already formatted

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(auto-optimize-loop): finalize cycle-0 Analytics + cycle-1 PROPOSE + scope-expansion note

Finalizes cycle-0 (= pre-loop cycle-9 archetype redesign + R-bundle) baseline analytics
against the full evolution log (20043 lines, exit at T+1h55m41s).

- AUTO_OPTIMIZE_CYCLE_0_ANALYTICS.md: substitute all TBD-FINALIZE markers with
  extracted numbers. Final outcome: HEALTHY-NEUTRAL (S2.1 trajectory PASS narrow;
  S2.2 fitness FAIL with best_fitness=0.02620 < 0.03 floor, below baseline 0.02788).
  Trajectory has 6 strictly-increasing best-fitness peaks with one mid-run plateau
  of ~46 mutants strict (peaks #4->#5) followed by fast late rescue (peak #6).
  Strict stagnation_interval_max=46 NARROW-FAILS <=40 gate; inclusive frontier-event
  defn PASSES at <=5 mutants. valid_rate=52%, frontier_new_cell_events=42,
  right_tail_mass=42.3%, per_parent_advance_rate=6% strict / 10% inclusive. Component Substitution
  (new archetype #8) NOT dead - rose 0->18 picks (6%) by run end; vindicates the
  feedback_archetype_distribution_not_a_goal user reframe.

- AUTO_OPTIMIZE_CYCLE_HISTORY.md: cycle-0 row now shows HEALTHY-NEUTRAL decision,
  best_fitness=0.02620, S2.1 PASS narrow, S2.2 FAIL.

- AUTO_OPTIMIZE_PATTERNS.md: cycle-0 entry as NEUTRAL ceiling evidence; documents
  surface scope (R-bundle), numbers (S2.1 components), caveats (n=1 variance not
  yet measured; below baseline by 6% within plausible n=1 variance).

- AUTO_OPTIMIZE_CYCLE_1_PROPOSE.md: cycle-1 = variance-floor replicate of baseline
  (NO EDIT per S7). Decision rationale cites feedback_variance_floor_first +
  feedback_consistent_improvement_all_stages + feedback_auto_optimize_trajectory_first.
  S4 citations rewritten as trajectory-shape-only signals (plateau duration,
  per_parent_advance_rate, stagnation_interval_max) per
  feedback_archetype_distribution_not_a_goal. Updated parent-SHA reference to current operational HEAD.

- AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md: S3 prelude blockquote captures the 2026-05-19
  user verbal directive expanding cycle-3+ scope to the entire mutation context harness
  (feedback_mutation_context_harness_in_scope). S3.2 SLIGHTLY cap on mutation/system.txt
  preserved as engineering constraint only.

Non-cycle docs commit - the cycle-1 commit will follow as a separate --allow-empty
commit per the variance-floor protocol.

* auto-loop cycle 1: variance-floor replicate of baseline (no edit)

* auto-loop meta cycle 1: variance-floor replicate (HEALTHY-NEUTRAL)

Cycle-1 ran 2026-05-19 02:33→04:53 MSK on db=12, identical config to
cycle-0 baseline (a4925a90). Cycle commit a527b256 is the --allow-empty
IMPLEMENT SHA; zero code/prompt/config diff.

Result: HEALTHY-NEUTRAL (variance-floor; informational §2.2 PASS).
- best_fitness = 0.031187 (cycle-0 was 0.02620; Δ=+0.00499)
- |Δ| < §7 variance threshold 0.01310 → within variance floor
- §2.1 trajectory: PASS all 5 gates strict (frontier_new_cell 52,
  right_tail_mass 57.4%, advance_rate 6% strict / 12% inclusive,
  stagnation_interval_max 14 strict within-active, valid_rate 64%)
- §2.2 fitness floor: PASS (0.031187 ≥ 0.03) — INFORMATIONAL
  flip vs cycle-0 (which failed by 0.004); NO-EDIT cycles cannot
  be WIN-CANDIDATE by spec.
- Trajectory shape: rapidly ascending for first 29 mutants (6 peaks
  compressed), then 70-mutant trailing plateau. INVERSE of cycle-0's
  mid-run plateau + late rescue. Two equally consistent interpretations
  (A: baseline mean ~0.029±0.003, B: cycle-1 high-side outlier).

Cycle-2 (db=13) will disambiguate. If cycle-2 lands within Δ=0.006 of
either prior cycle, baseline mean ≈ mean(0.026, 0.031, cycle-2);
loop may be near a structural Heilbron ceiling per
feedback_auto_optimize_trajectory_first ("task may be unsolvable in
knob scope — that's valid").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(auto-optimize-loop): cycle-2 PROPOSE — variance-floor replicate #2 (db=13)

Continuation of §7 variance-floor methodology. Cycle-2 is NO-EDIT; db=13 is
the only delta vs cycle-0/1. It adds the third sample point to lock the
baseline mean + std before cycle-3's first real intervention.

After cycle-1, n=2: best=[0.02620, 0.03119], midpoint 0.02870,
sample std 0.00353, §7 variance threshold 0.01435 (50% of midpoint).

Cycle-2 decision tree:
- best ∈ [0.026, 0.034] AND §2.1 PASS → cycle-3 PROPOSE proceeds
- best outside [0.020, 0.040] OR §2.1 FAIL → §7 STOP
- best in marginal bands → optional cycle-2.5 NO-EDIT before cycle-3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* auto-loop cycle 2: variance-floor replicate of baseline (no edit)

Per §7 variance-floor methodology. Cycle-2 = second NO-EDIT replicate
on db=13. Cycle identity SHA = this commit's HEAD.

PARENT_SHA = previous commit (cycle-2 PROPOSE meta).
No code/prompt/config diff vs cycle-0 baseline (a4925a90).

§8.1 invariants verified pre-launch:
- branch r7-r8-r9-v3-bundle
- working tree clean
- LiteLLM proxy 10.232.30.185:4000 reachable (/health/readiness)
- archetype-schema drift tests 51 passed

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* auto-loop meta cycle 2: variance-floor replicate (HEALTHY-NEUTRAL; marginal → cycle-2.5)

Cycle-2 IMPLEMENT commit (57442737) was a --allow-empty NO-EDIT replicate on db=13.
This meta commit captures the cycle-2 ANALYZE-post artifacts:

- AUTO_OPTIMIZE_CYCLE_2_RECONSTRUCTION.md: best_fitness=0.021266 at mutant ~38;
  §2.1 trajectory PASS (5/5 gates strict); §2.2 fitness floor FAIL (< 0.03 by 0.00873)
- AUTO_OPTIMIZE_CYCLE_2_ANALYTICS.md: n=3 baseline mean=0.02622, stdev=0.00496;
  cycle-1 reclassified as high-side outlier; trailing-plateau shape dominates 2/3 cycles
- AUTO_OPTIMIZE_CYCLE_HISTORY.md: cycle-2 row appended
- AUTO_OPTIMIZE_PATTERNS.md: cycle-2 NEUTRAL evidence entry
- AUTO_OPTIMIZE_CYCLE_3_SURFACE_MENU_DRAFT.md: surface menu for cycle-3 PROPOSE
  (gated by cycle-2.5 4th NO-EDIT replicate per cycle-2 PROPOSE marginal-band rule)

Decision: cycle-2 best_fitness 0.021266 ∈ [0.020, 0.025] MARGINAL band per cycle-2
PROPOSE §5 decision tree → next cycle (2.5) is another NO-EDIT replicate on db=14
to add a 4th variance sample before cycle-3 first real intervention.

Per spec §8.1: git rev-list --count 57442737..HEAD == 1 (one commit per cycle step).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(auto-optimize-loop): cycle-2.5 PROPOSE — 4th NO-EDIT variance-floor replicate (db=14)

Triggered by cycle-2 PROPOSE §5 decision-tree marginal-band rule:
cycle-2 best_fitness 0.021266 ∈ [0.020, 0.025] → need 4th sample.

n=3 stats: mean=0.02622, stdev=0.00496; §7 STOP NOT triggered.
n=4 will tighten baseline mean variance ~2.6× and disambiguate
the bimodal-suspicious distribution (cycle-1 1σ above mean).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* auto-loop cycle 2.5: variance-floor replicate of baseline (4th sample, no edit)

Identical config to cycle-2 except redis.db=14 (cycle-2 was 13).
Per cycle-2 PROPOSE §5 decision-tree: cycle-2 best 0.021266 in MARGINAL
band [0.020, 0.025] required this 4th NO-EDIT sample.

After cycle-2.5 closes:
- if best ∈ [0.015, 0.035] AND §2.1 PASS → cycle-3 PROPOSE proceeds
- if outside band OR §2.1 FAIL → STOP per §7
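
The closing rule above can be sketched as a small predicate. This is only an illustration: the function name, the return labels, and the boolean `trajectory_pass` argument are assumptions; the band [0.015, 0.035] and the two outcomes come from the commit message.

```python
def cycle_2_5_decision(best_fitness: float, trajectory_pass: bool) -> str:
    """Return the next loop step after cycle-2.5 closes (hypothetical helper)."""
    # Band quoted in the commit message: best must lie in [0.015, 0.035].
    in_band = 0.015 <= best_fitness <= 0.035
    if in_band and trajectory_pass:
        return "PROCEED_TO_CYCLE_3_PROPOSE"
    # Outside the band OR a §2.1 trajectory failure stops the loop per §7.
    return "STOP"


if __name__ == "__main__":
    print(cycle_2_5_decision(0.025709, trajectory_pass=True))
```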

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* auto-loop meta cycle 2.5: variance-floor replicate (HEALTHY-NEUTRAL; n=4 baseline LOCKED; proceed to cycle-3)

cycle-2.5 best_fitness = 0.025709 at mutant 80/100 on db=14.

n=4 baseline LOCKED:
- mean = 0.02609, stdev (sample) = 0.00406
- range [0.02127, 0.03119], CV = 15.6%
- §7 STOP NOT triggered (0.00406 << 0.5 × 0.03119 = 0.01559)
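
As a quick arithmetic check of the figures above, the sketch below recomputes the coefficient of variation and the STOP comparison from the reported summary statistics. The reading of §7 as "stop when the sample stdev reaches half of the best observed fitness" is inferred from the parenthetical and is an assumption, as are the function names.

```python
def coefficient_of_variation(mean: float, stdev: float) -> float:
    """CV = stdev / mean, as a fraction (0.156 means 15.6%)."""
    return stdev / mean


def stop_triggered(stdev: float, best_max: float, factor: float = 0.5) -> bool:
    """Assumed reading of the §7 rule: stop when stdev >= factor * best_max."""
    return stdev >= factor * best_max


# Summary statistics quoted in the commit message (n=4 baseline).
MEAN, STDEV, BEST_MAX = 0.02609, 0.00406, 0.03119

if __name__ == "__main__":
    print(f"CV = {coefficient_of_variation(MEAN, STDEV):.1%}")
    print(f"STOP triggered: {stop_triggered(STDEV, BEST_MAX)}")
```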

§2.1 trajectory: PASS lenient (4/5 strict + stagnation NARROW-FAIL at 55 mutants,
same shape as cycle-0's 46). §2.2 fitness floor: FAIL (0.025709 < 0.03 by 14%).

Trajectory-shape census (n=4):
- 2/4 cycles: mid-run plateau + late rescue (cycle-0, cycle-2.5)
- 2/4 cycles: leading sprint + trailing plateau (cycle-1, cycle-2)
Bimodal 2/2 — cycle-3 intervention must address BOTH shapes.

Decision tree (cycle-2.5 PROPOSE §5):
  best 0.02571 in [0.015, 0.035] AND §2.1 PASS lenient
  -> PROCEED TO CYCLE-3 PROPOSE (first real intervention)

Cycle-3 WIN-CAND threshold = mean+1sigma = 0.03015.
Cycle-3 STRONG-WIN threshold = mean+2sigma = 0.03421.
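
Both thresholds follow from the locked baseline (mean = 0.02609, σ = 0.00406). A minimal sketch of the classification, assuming a simple "greater than or equal" comparison; the function name and the `BELOW-WIN-CAND` label are illustrative and not taken from the spec.

```python
BASELINE_MEAN = 0.02609   # n=4 baseline mean from the commit message
BASELINE_SIGMA = 0.00406  # n=4 sample stdev from the commit message


def classify_best(best_fitness: float,
                  mean: float = BASELINE_MEAN,
                  sigma: float = BASELINE_SIGMA) -> str:
    """Bucket a cycle's best_fitness against the mean+1σ and mean+2σ thresholds."""
    win_cand = mean + sigma        # 0.03015
    strong_win = mean + 2 * sigma  # 0.03421
    if best_fitness >= strong_win:
        return "STRONG-WIN"
    if best_fitness >= win_cand:
        return "WIN-CAND"
    return "BELOW-WIN-CAND"  # hypothetical label for everything under mean+1σ
```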

First cycle with DIRECT live /proc/<pid>/environ verification of all four
section 8.1 environment invariants (OPENROUTER_API_KEY len=73, OPENAI_API_KEY=sk-gigaevo,
HTTP_PROXY+HTTPS_PROXY unset). Strengthens n=4 baseline vs cycle-1/2 INFERRED env.
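
A live check of this kind could look roughly like the sketch below. It relies on the Linux convention that `/proc/<pid>/environ` holds NUL-separated `KEY=value` entries; the helper names are hypothetical and the code actually used in the loop is not shown in this commit.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict[str, str]:
    """Parse the NUL-separated KEY=value blob from /proc/<pid>/environ."""
    env: dict[str, str] = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if key and sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def check_invariants(env: dict[str, str]) -> dict[str, bool]:
    """Evaluate the four environment invariants named in the commit message."""
    return {
        "OPENROUTER_API_KEY len=73": len(env.get("OPENROUTER_API_KEY", "")) == 73,
        "OPENAI_API_KEY=sk-gigaevo": env.get("OPENAI_API_KEY") == "sk-gigaevo",
        "HTTP_PROXY unset": "HTTP_PROXY" not in env,
        "HTTPS_PROXY unset": "HTTPS_PROXY" not in env,
    }


def read_process_environ(pid: int) -> dict[str, str]:
    """Read a live process environment (Linux only; needs read permission)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())
```

Calling `check_invariants(read_process_environ(pid))` on the run's PID would then report each invariant as `True` or `False` without printing the key itself.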

Files:
- AUTO_OPTIMIZE_CYCLE_2_5_RECONSTRUCTION.md (FINAL; sections 9-13 filled)
- AUTO_OPTIMIZE_CYCLE_2_5_ANALYTICS.md (created; sections 0-6)
- AUTO_OPTIMIZE_PATTERNS.md (append cycle-2.5 entry; n=4 stats)
- AUTO_OPTIMIZE_CYCLE_HISTORY.md (append cycle-2.5 row)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* auto-loop cycle 3: intra-memory saturation detection (K=3 narrative streak → header inject)

INT 1 from cycle-3 surface-menu draft. First REAL intervention after n=4
NO-EDIT variance-floor baseline (mean=0.02609, σ=0.00406; locked at f225e1db).

Change: IntraMemoryStage now tracks a SHA1 hash of each rendered
intra-card's narrative signature (summary + tried_strategies' label/verdict/notes,
excluding monotonic counters n_attempts/mean_delta/delta_distribution per
chaos-hacker CRITICAL #1). When the same hash is observed for K=3 consecutive
renders on the same parent, the next render is prepended with a
"[STAGNATION DETECTED]" header plus a child-delta archetype histogram.
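
The mechanism described above can be sketched as follows. This is not the code in lineage_memory.py: the card is modelled as a plain dict, and `narrative_signature`, `StreakTracker`, and the field layout are assumptions made for illustration. Only the hashed fields, the excluded counters, and K=3 come from the commit message.

```python
import hashlib
import json
from collections import defaultdict

K = 3  # consecutive identical renders required before the header is injected


def narrative_signature(card: dict) -> str:
    """SHA1 over summary + (label, verdict, notes) of each tried strategy.

    Monotonic counters such as n_attempts, mean_delta, and delta_distribution
    are deliberately left out so they cannot break a streak.
    """
    payload = {
        "summary": card.get("summary", ""),
        "tried_strategies": [
            {key: s.get(key) for key in ("label", "verdict", "notes")}
            for s in card.get("tried_strategies", [])
        ],
    }
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(blob.encode("utf-8")).hexdigest()


class StreakTracker:
    """Tracks, per parent, how many consecutive renders share one signature."""

    def __init__(self, k: int = K) -> None:
        self.k = k
        self._last: dict[str, str] = {}
        self._streak: defaultdict[str, int] = defaultdict(int)

    def observe(self, parent_id: str, card: dict) -> bool:
        """Record a render; return True once the streak has reached k."""
        sig = narrative_signature(card)
        if self._last.get(parent_id) == sig:
            self._streak[parent_id] += 1
        else:
            self._last[parent_id] = sig
            self._streak[parent_id] = 1
        return self._streak[parent_id] >= self.k
```

A `True` return marks the parent as saturated, so the next render for that parent would carry the "[STAGNATION DETECTED]" header.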

Hypothesis: the K=3 stagnation header makes the parent's saturation visible
to the MutationSuggestionAgent → encourages archetype shift OR genuinely new
strategies in identical-narrative branches → reduces stagnation_interval_max
and/or trailing plateau without changing model/prompt template/problem.

Scope: lineage_memory.py (+90 lines) + 6 unit tests bypassing InputHashCache
to exercise the new code path directly.

Frozen invariants (unchanged): problem.heilbron, validator, fitness fn,
Pydantic Literal archetype enforcement, num_parents=1, max_mutants=100,
model=Qwen3-235B-A22B-Thinking-2507, prompts (all 4 SHAs unchanged).

WIN-CAND threshold: best_fitness ≥ 0.03015 (n=4 baseline mean+1σ).
STRONG-WIN: ≥ 0.03421 (mean+2σ).

Spec: docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md
PROPOSE: docs/audits/AUTO_OPTIMIZE_CYCLE_3_PROPOSE.md
RECONSTRUCTION: docs/audits/AUTO_OPTIMIZE_CYCLE_3_RECONSTRUCTION.md (skeleton; populated post-run)
ANALYTICS: docs/audits/AUTO_OPTIMIZE_CYCLE_3_ANALYTICS.md (skeleton; populated post-run)

* auto-loop meta cycle 3: K=3 stagnation detector — HEALTHY-NEUTRAL (mechanism never fired)

Cycle-3 INT 1 (commit 45255a3d, db=15, 1.85h run) outcome class HEALTHY-NEUTRAL.

Headline:
- best_fitness = 0.024369 (mutant cc09c637, frontier event #38 of 55 valid mints)
- Δ vs n=4 baseline mean (0.02609) = -0.00172 (|Δ| < 1σ = 0.00406, WITHIN noise band)
- §2.1 trajectory: 5/5 strict PASS (valid_rate ~0.55, frontier_new_cell=46,
  right_tail_mass=0.344, advance 0.06 strict / 0.55 inclusive, stagnation_interval_max=14)
- §2.2 fitness floor: FAIL (0.024369 < 0.030 by 0.00563)
- §7 STOP threshold: NOT triggered (loop continues)

Key empirical finding: STAGNATION DETECTED header activations = 0/100.

The K=3 narrative-streak SHA1 detector never fired in 100 mutants. The bucketed
memory representation evolves enough between consecutive renders that
SHA1(narrative_signature) changes before K=3 is reached, even with num_parents=1
and many sibling renders per parent.

This vindicates the user's preference (feedback_llm_rules_over_hardcoded):
hardcoded Python predicates are too brittle to fire at this run scale.
Cycle-3 is empirically a NO-EDIT replicate of cycle-2.5 from the mutator's
perspective. Trajectory improvements (stagnation_interval_max=14 vs cycle-0's
46 / cycle-2.5's 55) are not causally attributable to the mechanism that
never fired — they are sampling variance.

Dual-axis verification per project_fat_context_direction.md (§1):
- signature: STAGNATION activations = 0 → ✗
- metric of interest: stagnation_interval_max = 14 (< 46 cycle-0) → ✓
- quadrant ✗✓ → "noise/lucky; replicate before claiming win" → cannot ship as a
  trajectory-shape win because the mechanism never fired

Archetype distribution shifted dramatically from cycle-0 (informational only):
- cycle-0: Guided Innovation 25%, Computational Reinvention 21%
- cycle-3: Guided Innovation 53% (mode-collapse), Computational Reinvention 2%
ARCHETYPE-EFFICIENCY MISMATCH persists: highest hit-rate archetypes
(Computational Reinvention 100% n=2, Harmful Pattern Removal 100% n=1,
Precision Optimization 66.7% n=6) are under-sampled, while highest-pick
archetype (Guided Innovation 53%) has below-average hit-rate (35.8%).

Decision: HEALTHY-NEUTRAL. Next: cycle-4 PROPOSE per fat-context methodology
will target a measurable failure mode (candidate: pick-rate-vs-hit-rate
mismatch) with LLM-side fat context, NOT hardcoded Python predicates.

Spec: docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md
RECONSTRUCTION: docs/audits/AUTO_OPTIMIZE_CYCLE_3_RECONSTRUCTION.md (FINAL)
ANALYTICS: docs/audits/AUTO_OPTIMIZE_CYCLE_3_ANALYTICS.md (FINAL)
HISTORY: docs/audits/AUTO_OPTIMIZE_CYCLE_HISTORY.md (row appended)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* auto-loop cycle 4: surface per-archetype yield in evolutionary statistics block

INT 2 (cycle-4): FIRST LLM-side fat-context intervention. After the cycle-3
closeout (commit 08cbd5b2, HEALTHY-NEUTRAL; the K=3 narrative-streak detector
NEVER FIRED in 100 mutants), the user's verbal directive on 2026-05-19 expanded
the loop scope to the entire mutation-context harness
(feedback_mutation_context_harness_in_scope) and re-affirmed fat informative
context over hardcoded Python predicates (project_fat_context_direction).

Change: per-archetype yield (picks, valid_hits, hit_rate, mean_delta_to_parent)
is aggregated over the whole-run population in EvolutionaryStatisticsCollector,
attached to the EvolutionaryStatistics StageIO, and rendered as a markdown
table inside the existing "## Evolutionary Statistics" block of the mutation
suggester's prompt. gigaevo/prompts/mutation_suggestions/system.txt is also
extended with PRIORITY-reshape guidance, NOT invention.

Three additive surface touches, all in scope per §3.1 of LOOP_TASK +
feedback_mutation_context_harness_in_scope:
- gigaevo/programs/stages/collector.py: new _compute_archetype_yield()
  helper + archetype_yield field on EvolutionaryStatistics + cache wiring
  in _ensure_population_cache. Also flips
  EvolutionaryStatisticsCollector._EXCLUDE from EXCLUDE_FOR_ANALYTICS (strips
  metadata) to EXCLUDE_STAGE_RESULTS (keeps metadata), which is REQUIRED for
  the helper to read
  program.metadata[MutationSpec.META_OUTPUT]["archetype"]. The metadata
  cost is bounded by N=100 programs per snapshot, well under 1% of cycle
  wall-time. Other collectors continue to exclude metadata.
- gigaevo/evolution/mutation/context.py: extended
  EvolutionaryStatisticsMutationContext.format() to render the yield table
  when total picks >= 5 (suppresses bootstrap noise). Sorted by hit_rate
  desc, picks desc tie-break.
- gigaevo/prompts/mutation_suggestions/system.txt: extended the existing
  "Evolutionary Statistics" bullet with explicit guidance on how to read
  the new table (UNDER-UTILIZED vs OVER-RELIED-ON cells). Reaffirms
  PRIORITY-reshape, NOT invention.
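
The aggregation and rendering steps listed above might look roughly like this. It is a sketch under stated assumptions, not the code in collector.py or context.py: the input is a list of plain dicts with hypothetical keys `archetype`, `hit`, and `delta`, and only a missing archetype (not an unknown name) is bucketed under `(other)`.

```python
from collections import defaultdict

MIN_TOTAL_PICKS = 5  # table is suppressed below this many picks (bootstrap noise)


def compute_archetype_yield(records: list[dict]) -> dict[str, dict]:
    """Aggregate picks, valid hits, and parent deltas per archetype."""
    stats: dict[str, dict] = defaultdict(
        lambda: {"picks": 0, "valid_hits": 0, "deltas": []}
    )
    for rec in records:
        row = stats[rec.get("archetype") or "(other)"]
        row["picks"] += 1
        if rec.get("hit"):
            row["valid_hits"] += 1
        if rec.get("delta") is not None:
            row["deltas"].append(rec["delta"])
    return dict(stats)


def hit_rate(row: dict) -> float:
    return row["valid_hits"] / row["picks"] if row["picks"] else 0.0


def render_yield_table(stats: dict[str, dict]) -> str:
    """Render a markdown table sorted by hit_rate desc, then picks desc."""
    if sum(row["picks"] for row in stats.values()) < MIN_TOTAL_PICKS:
        return ""
    ordered = sorted(
        stats.items(), key=lambda kv: (-hit_rate(kv[1]), -kv[1]["picks"])
    )
    lines = [
        "| archetype | picks | valid_hits | hit_rate | mean_delta_to_parent |",
        "|---|---|---|---|---|",
    ]
    for name, row in ordered:
        deltas = row["deltas"]
        mean_delta = f"{sum(deltas) / len(deltas):+.5f}" if deltas else "n/a"
        lines.append(
            f"| {name} | {row['picks']} | {row['valid_hits']} "
            f"| {hit_rate(row):.1%} | {mean_delta} |"
        )
    return "\n".join(lines)
```

Sorting on the negated pair keeps the ordering stable and puts under-used, high-yield archetypes at the top of the table, which is where the suggester is meant to notice them.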

Hypothesis: surfacing per-archetype yield (cycle-3 ANALYTICS Signal #1:
Computational Reinvention 100% hit-rate at 2% pick-share; Guided Innovation
35.8% hit-rate at 53% pick-share) to the suggester lets the LLM reshape
priority toward higher-yield archetypes. Cycle-4 takes the OPPOSITE design
to cycle-3: NO Python threshold, NO if-streak-K predicate, NO header
inject. The LLM decides.

CRITICIZE-pre v1 returned REVISE with one CRITICAL + two HIGH findings,
all mitigated:
- CRITICAL: PROPOSE v1 specified wrong metadata key (metadata["mutation"]
  vs canonical MutationSpec.META_OUTPUT = "mutation_output"). Fixed.
- HIGH: TDD fixtures echoed wrong key. Fixed + new test #7 regression-guards
  the dead key (test_rejects_dead_metadata_key).
- HIGH: integration smoke step #4 only checked header presence. Tightened to
  require >=1 canonical-named row + (other) share <= 20%.

8 RED tests in tests/stages/test_archetype_yield.py cover: empty population,
per-archetype aggregation with delta-to-parent, canonical ordering with zero
picks, attachment to EvolutionaryStatistics, format() rendering (sorted),
threshold suppression, defensive bucketing of unknown archetypes + missing
mutation_output, regression guard against the dead "mutation" key. All 8
pass post-GREEN. Adjacent suites (tests/stages/test_collector.py +
tests/stages/test_mutation_context.py) pass 85/85 — _EXCLUDE flip has no
regressions.

Scope discipline: NO change to gigaevo/llm/agents/mutation.py (archetype
Literal preserved), NO change to gigaevo/prompts/mutation/system.txt (the
SLIGHTLY rule does not apply), NO change to problems/heilbron/, num_parents,
max_mutants, model_name, llm_base_url. Nothing in any touched file is
Heilbron-specific. The 8 canonical archetype names are loaded from
gigaevo/evolution/mutation/constants.py — problem-agnostic.

Frozen invariants (unchanged): problem.heilbron, validator, fitness fn,
Pydantic Literal archetype enforcement, num_parents=1, max_mutants=100,
model=Qwen3-235B-A22B-Thinking-2507.

WIN-CAND threshold: best_fitness >= 0.03015 (n=4 baseline mean+1sigma).
STRONG-WIN: >= 0.03421 (mean+2sigma). Cycle-4 prediction at n=1: best
0.0275 +/- 0.005 (~30% chance of WIN-CAND given baseline variance).
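
The ~30% figure is consistent with a normal tail probability, as the sketch below shows. Treating the prediction as a normal distribution with mean 0.0275 and standard deviation 0.005 is an assumption; the commit message does not say how the estimate was derived.

```python
import math


def normal_tail(threshold: float, mean: float, sd: float) -> float:
    """P(X >= threshold) for X ~ Normal(mean, sd), via the complementary erf."""
    z = (threshold - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))


# WIN-CAND threshold 0.03015 against the predicted 0.0275 +/- 0.005.
p_win_cand = normal_tail(0.03015, mean=0.0275, sd=0.005)

if __name__ == "__main__":
    print(f"P(best_fitness >= 0.03015) = {p_win_cand:.3f}")
```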

Riskiest link: the suggester must actually read the yield table and
reshape priority. Dual-axis verification (PROPOSE §11) detects ignore
vs. follow via (DIAGNOSTIC) archetype-efficiency CV halving signature
+ (PRIMARY) metric of interest (best_fitness, trajectory gates).
Outcome quadrants per project_fat_context_direction 4-quadrant matrix.

Followup captured (NOT in this commit, future cycle): lineage_memory.py:711
reads metadata.get("mutation", {}), which is the dead key. As a result, the
TransitionAnalysis archetype field is always None.

* Revert "auto-loop cycle 4: surface per-archetype yield in evolutionary statistics block"

This reverts commit e6cfe6eed30fc81da752af950a4a32a45b2352f4.

* auto-loop meta cycle 4: archetype-yield prompt bloat — LOSE-REVERT

Cycle-4 INT 2 (commit e6cfe6ee, db=12, 2.16h run) outcome class LOSE-REVERT.
Reverted by 69a5a708 per feedback_auto_optimize_branch_policy (no reset).

Headline:
- best_fitness = 0.01686 (run high-water mark; SEED-level, no post-seed mints)
- Δ vs n=4 baseline mean (0.02609) = -0.00923 = -2.27σ → INSIDE LOSE band (baseline-2σ=0.01797)
- 0/37 ACCEPTED…