fix(acceptor): reject NaN and +inf is_valid #3
Open
GrigoryEvko wants to merge 800 commits into
short id will generate based on full id when required
…regression feature weights
… and improve docstring clarity
refactor(memory): split card_conversion into focused modules
Define MemoryError, MemoryRetrieverError, MemorySearchError, and MemoryStorageError in gigaevo/exceptions.py following the existing GigaEvoError hierarchy. Wire them into the memory subsystem:
- gam_search.build() wraps all failures in MemoryRetrieverError
- memory.py narrows two gam.build() catches from bare Exception
- card_store._load() narrows to (json.JSONDecodeError, OSError)
- card_dedup import block narrows to (ImportError, OSError)
Resilience-critical catches (search fallback, merge loop, __exit__) remain broad by design.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): custom exception hierarchy and narrowed catches
…t base to ABC
- concept_api.py: all 5 RuntimeError raises → MemoryStorageError (matches gigaevo/database pattern of wrapping I/O errors)
- base.py: GigaEvoMemoryBase now uses ABC + @abstractmethod (matches MutationOperator, Stage, LangGraphAgent pattern)
- card_dedup.py: narrow two broad catches:
  - JSONL read fallback: except Exception → (json.JSONDecodeError, OSError)
  - GAM store build: except Exception → (MemoryRetrieverError, OSError)
- Update 6 test assertions from RuntimeError to MemoryStorageError
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ormity refactor(memory): exception conformity + ABC base class
When write_pipeline.py passes MemoryCard/ProgramCard Pydantic models to memory_platform.save_card(), the dict() call on a Pydantic model doesn't properly flatten nested Pydantic objects like ConnectedIdea. This caused TypeError in _persist_index() when json.dumps() tried to serialize.
Root cause: write_pipeline returns list[AnyCard] (Pydantic models) and both backends (memory_platform and memory/shared_memory) consume these cards via save_card(). memory_platform's normalize_memory_card() must explicitly call .model_dump() on Pydantic inputs to flatten nested objects.
Fix verified: all 788 memory + integration tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests the exact bug path: Pydantic MemoryCard/ProgramCard with nested ConnectedIdea and MemoryCardExplanation objects must be properly flattened to plain dicts before JSON serialization.
6 tests covering: ProgramCard with ConnectedIdea, MemoryCard with MemoryCardExplanation, plain dict passthrough, JSON round-trips, None.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add gigaevo-memory Git dependency to pyproject.toml
- Remove sys.path manipulation from memory_platform/memory.py and remote_gam_retriever.py (no longer needed with proper install)
- Simplify test file to use direct imports instead of module mocking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expands from 6 to 11 tests covering the complete save_card → _persist_index flow with Pydantic inputs. Tests verify:
- normalize_memory_card: ConnectedIdea/MemoryCardExplanation → dict
- save_card: Pydantic ProgramCard/MemoryCard → JSON-serializable index
- _card_to_backend_content: API payload is clean dict
- persist/reload roundtrip: index file survives write→read cycle
Uses _make_platform_memory() factory with mocked API client to test memory_platform in isolation without network dependencies.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add docstrings to 15 public methods across 5 files (memory.py, concept_api.py, card_dedup.py, openai_inference.py, write_pipeline.py)
- Add return type annotations to 4 functions in amem_gam_retriever.py
- Fix 2 mypy errors: annotate retrievers dict, rename variable in api_sync.py
- Extract magic numbers: _MAX_SUMMARY_CHARS, _MAX_DESCRIPTION_CHARS, _ENTITY_NAME_MAX_LENGTH, _MAX_CONNECTED_DESCRIPTIONS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): type annotations, docstrings, constants, platform bug fix
`is_valid <= 0` is False for both NaN and +inf, so a crashed validity stage emitting NaN or an unbounded-sentinel +inf was silently accepted as an elite. Add an isfinite() guard.
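The comparison behaviour described above can be reproduced in isolation. The following is a minimal sketch, not the acceptor's actual code: `naive_reject` and `guarded_reject` are hypothetical names, and the only assumption is that the acceptor receives `is_valid` as a float.

```python
import math


def naive_reject(is_valid: float) -> bool:
    """Pre-fix style check: reject only when is_valid <= 0."""
    return is_valid <= 0


def guarded_reject(is_valid: float) -> bool:
    """Reject non-finite values first, then non-positive ones."""
    if not math.isfinite(is_valid):
        return True
    return is_valid <= 0


# NaN compares False against everything and +inf is greater than 0,
# so the naive check lets both through; the guarded check does not.
for value in (float("nan"), float("inf"), 0.0, 1.0):
    print(value, naive_reject(value), guarded_reject(value))
```

`math.isfinite` returns False for NaN and for both infinities, so a single guard covers the crashed-stage and unbounded-sentinel cases.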
GrigoryEvko added a commit to GrigoryEvko/gigaevo-core that referenced this pull request on May 15, 2026
If ``_bandit.select()`` raises before ``record_pull`` runs, no pull is
recorded and the try/except inside ``invoke``/``ainvoke`` never engages
to inject a zero reward — the ledger invariant ("pulls and rewards
stay in step") therefore holds vacuously. Codify this so a future
refactor that moves ``record_pull`` outside ``_select`` (or hoists the
try/except above ``_select``) doesn't break the invariant silently.
Audit item FusionBrainLab#3 from the PR FusionBrainLab#13 bug hunt — verification, no code change.
KhrulkovV added a commit that referenced this pull request on May 26, 2026
…endment #3)
Framework expects 'specs:' not 'metrics:'; P1 crashed at startup with MetricSpec validation error. Fixed before any data was generated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
Addresses chaos-hacker adversarial review findings:
- HIGH #1: Test _await_idle's actual ghost cleanup branch via time.monotonic patch
- HIGH #2: Test generation_timeout on real step() with ghost IDs + stuck RUNNING
- HIGH #3: Verify snapshot data correctness (not just no-hang) after bump()
- MEDIUM #4: Truly concurrent writes via asyncio.Barrier + serialization proof
- MEDIUM #5: Stuck RUNNING program triggers generation_timeout
- MEDIUM #6: Write serialization assertion (max_concurrent == 1)
- MEDIUM #7: Lock eviction race (concurrent reuse after terminal pop)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
1. still_active==0 cleanup now retries _ingest_batch for DONE programs that failed the first ingestion due to state change between mgets, instead of blindly force-releasing slots (chaos-hacker finding #3).
2. _processed_since_epoch carries forward programs ingested during drain instead of resetting to 0 (was systematically delaying next epoch).
All 55 steady-state tests + 960 evolution/integration tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
_card_type() called .get() on Pydantic models from load_memory_cards(). Fixed with isinstance(card, ProgramCard) check. Added full main() loop simulation tests and comprehensive E2E pipeline tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
_card_type() called .get() on cards from load_memory_cards(), which now returns Pydantic AnyCard models instead of dicts. Fixed by using isinstance(card, ProgramCard) check before dict-only .get() calls.
Added tests:
- _card_type with MemoryCard and ProgramCard inputs
- Full main() loop simulation: load_memory_cards → _card_type for each card (both ideas-only and mixed ideas+programs scenarios)
All 105 memory tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
…+ feedback K=3
Two protocol amendments applied before restart:
1. Fix GAN resistance hard-cap (Issue #4, pre-registration deviation)
   - pop_a_gan/evaluate.py: `delta = max(post_q - raw_quality, 0)` → signed delta
   - Bug made resistance always ≤ 0.5; GAN cells had non-functional fitness signal
   - Fix: signed delta allows resistance ∈ (0, 1) as intended
2. Add opponent code feedback K=3 to all cells (Issue #3)
   - All 8 runs: adversarial_coevo_ss → adversarial_coevo_feedback
   - opponent_feedback_k=3, population_role=constructor/improver per run
   - Feedback confirmed positive in heilbron/adversarial-v2; applied uniformly
   - Factorial design (re-eval × fitness-type) preserved across all cells
Restart: killed PIDs 692104-692111+778161, flushed DBs 1-8, relaunched
New PIDs: 818029-818036, watchdog 818810
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
…treatment (gen-0)
Pre-launch D prompt told the Improver to "focus on GENERAL improvement
strategies rather than exploiting specific point arrangements" and stated
"Constructor programs change EVERY generation". Both contradict the
controlled variables actually deployed:
- d_sees_g_source=true injects the SAME Constructor's source code
D is being scored against this turn (white-box access)
- FetchOpponentIdsStage has cache_handler=NO_CACHE; opponents are
re-sampled per mutation, not per generation
Replaced lines 17-21 of pop_b/task_description.txt with a block that:
- States per-mutation re-sampling cadence correctly
- Acknowledges the source-code injection block explicitly
- Adds a true anchoring statement: source D sees has already
survived competitive selection in G's archive
- Instructs D to reason from source about the underlying assumption,
then craft an improvement general enough for the next opponent
- Method-neutral (no SLSQP/basin-hopping prescriptions)
G prompt (pop_a) intentionally untouched — telling G "your code is read"
risks obfuscation strategies that distort deep-basin-search.
Scientific impact: zero. All 8 runs at gen 0 at amendment time. Pre-reg
hypothesis, conditions, controlled variables, decision matrix, dataset
checksums all unchanged.
Documented as Amendment #3 in 03_plan.md and an event in 04_issues_log.md.
Restart via /experiment-restart heilbron/k5-budget-loose follows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
* docs(specs): steady-state engine audit + true-JIT-refresh redesign
Captures the current overlapping concepts in gigaevo/evolution/engine/
(epoch vs generation, two flags gating one loop, three drain paths, two
ingestion paths, multi-pass refresh) and proposes a redesign where the
only post-seed DONE->QUEUED flip happens for the parents picked for a
single mutation. Counter consolidates to total_mutants; epoch concept
goes away entirely; file split brings each module to ~250 LOC with a
single responsibility.
Draft for user review on refactor/steady-state-true-jit-refresh.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(specs): refine steady-state redesign — async stream, multi-parent, iteration axis
- Recast §3.2 as a continuous async stream: dispatcher + per-mutant tasks
+ ingestor (spawn-and-forget), not a sequential loop.
- Generalise refresh path for num_parents > 1 (RandomParentSelector and
AllCombinationsParentSelector both take num_parents); per-parent lock
to prevent double-flip on overlapping selections.
- Pin Program.iteration semantics as total_mutants_at_production (denser
plot axis) and flag *_in_iteration cohort aggregates in collector.py
as a plan-level migration item.
- Rename module split to dispatcher.py / mutant_task.py / ingestor.py.
- Add risks for multi-parent backpressure starvation and cohort aggregate
collapse.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(plans): true-JIT-refresh steady-state engine — 21-task implementation plan
TDD-sequenced refactor of gigaevo/evolution/engine/steady_state.py per
docs/superpowers/specs/2026-05-12-steady-state-engine-audit-and-redesign.md:
- Delete epoch concept, gate, drain barrier
- Single total_mutants counter (rename total_generations)
- Refresh only selected parents JIT, not whole archive
- Continuous async stream: dispatcher + mutant_task + ingestor
- Module split: engine.py / dispatcher.py / mutant_task.py / ingestor.py / refresh.py
- Drop refresh_passes / refresh_order / refresh_pass / epoch_trigger_count
- Keep MaxGenerationsStopper as deprecated alias of MaxMutantsStopper
- Migrate config/evolution/default.yaml to SteadyStateEvolutionEngine
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(plans): paranoia tasks 19A-19F + hard-rename stopper (Option A)
Two clarifications to the steady-state JIT refactor plan:
1. Add Tasks 19A-19F before the smoke + PR tasks:
- 19A: concurrency stress + load/async simulation suite
- 19B: cancellation invariants + resume-after-kill
- 19C: real-Redis integration smoke
- 19D: ParentRefresher failure-mode resilience
- 19E: chaos-hacker adversarial review pass
- 19F: counter monotonicity invariant
2. Stopper rename is hard, not aliased. The old MaxGenerationsStopper
counted *epochs* (~8 mutants each); MaxMutantsStopper counts mutants.
An alias would silently shrink runs ~8x. Delete the old class, delete
the old config files, rename the global default from
max_generations: 100 to max_mutants: 800 (preserves prior effective
run length). Old configs fail loudly at Hydra compose time.
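The arithmetic behind the hard rename can be checked with a few lines. This is only an illustration: the helper names are hypothetical, and the factor of 8 is the approximate mutants-per-epoch figure quoted above, not a value read from the codebase.

```python
# Approximate figure quoted in the commit message; real epochs may vary.
MUTANTS_PER_EPOCH = 8


def epochs_to_mutants(max_generations: int) -> int:
    """Translate an epoch-based cap into an equivalent mutant-based cap."""
    return max_generations * MUTANTS_PER_EPOCH


def epochs_covered(max_mutants: int) -> float:
    """How many old-style epochs a mutant-based cap corresponds to."""
    return max_mutants / MUTANTS_PER_EPOCH


# Old default of 100 epochs maps to the new default of 800 mutants.
print(epochs_to_mutants(100))
# A silently aliased cap of 100 would cover far fewer epochs than before.
print(epochs_covered(100))
```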
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): single-counter total_mutants; drop refresh_pass; hard-rename stopper
Foundational refactor for JIT-refresh steady-state engine (plan task 2+3+4+7+8):
Engine
- EngineMetrics.total_generations -> total_mutants (single-counter progress)
- EngineSnapshot.total_generations -> total_mutants
- EngineSnapshot.refresh_pass field DELETED (multi-pass refresh removed)
- SteadyStateEngineConfig.refresh_passes field REMOVED
- steady_state._refresh_archive_programs: inlined one-pass body; multi-pass
loop + per-pass snapshot bumps gone
Stopper (hard rename, no back-compat alias per Option A)
- MaxGenerationsStopper(max_generations=N) -> MaxMutantsStopper(max_mutants=N)
- config/stopper/max_generations*.yaml -> max_mutants*.yaml
- config/constants/evolution.yaml: max_generations: 100 -> max_mutants: 800
(preserves prior run length: 100 epochs x 8 mutants/epoch under steady-state)
- config/config.yaml stopper default: max_generations -> max_mutants
Manifest boundary preserved
- launch_generator.py: emits max_mutants={contract.max_generations} Hydra override
- Contract.max_generations stays (experiment-level concept)
- CMA-ES max_generations (optimizer hyperparam) unchanged
- watchdog/monitoring max_generations (experiment progress display) unchanged
Adversarial
- SharedBenchmarkFilteredLineageStage.compute_hash override DELETED
(refresh_pass-aware cache invariant obsolete under JIT-refresh)
Tests
- Deleted: test_snapshot_refresh_pass.py, test_lineage_cache_invalidation.py,
test_two_pass_mutation_context.py
- Vestigial "removed feature" assertion classes deleted per user directive
358 targeted tests pass. ruff clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(progress): migrate MainRunSyncHook + monitoring to programs_processed
Task 5: MainRunSyncHook polls snap.programs_processed (was total_mutants).
_last_main_gen -> _last_main_progress; _get_min_gen -> _get_min_progress.
Module docstring + log strings updated.
Task 6: redis_queries.get_generation -> get_programs_processed reading
snap.programs_processed. collect_snapshot.gen now sourced from
programs_processed; RunSnapshot.generation field name preserved for
display compatibility.
programs_processed is the canonical cross-run progress signal under JIT-
refresh: it counts mutants actually ingested into the archive (post-validation),
not total mutants emitted. Prompt-coevo sync needs the former to ensure the
main run has produced something usable before the prompt run advances.
Tests pass: tests/prompts/test_coevolution_sync.py (14), tests/monitoring/test_redis_queries.py (17).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(engine): ParentRefresher + ParentRefreshSelector ABC for JIT refresh
Adds the JIT DONE->QUEUED->DONE refresh helper that producer tasks call
before mutating selected parents. Replaces the multi-pass _refresh_archive
sweep removed in the prior commit.
Architecture (user directive 2026-05-12):
- ParentRefreshSelector: ABC choosing which programs to refresh given the
producer's parent pick. DirectParentsSelector is the canonical default
(refresh only the parents themselves). Future implementations may walk
lineage to depth-k and order refresh in depth-batched waves so deepest
ancestors finish before nearest parents flip.
- ParentRefresher: per-parent-id asyncio.Lock serialises overlapping
concurrent refreshers. Batch transition flips all DONE targets to QUEUED
atomically (no producer sees a half-flipped bundle), then polls mget()
until every target is DONE. DISCARDED-on-input or DISCARDED-during-wait
raises ValueError; vanished parents raise ValueError; absence-of-progress
raises TimeoutError. Caller aborts the mutant and releases its in-flight
slot rather than falling back to stale state.
Tests: 11/11 pass (single/empty/batch/overlap/discarded/timeout/selector
ABC contract/custom-selector-adds-targets/empty-selector-noop). FakeDag
test helper provides QUEUED -> RUNNING -> DONE auto-promotion to exercise
the refresh without a real DagRunner.
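The poll-until-DONE contract described in this commit could look roughly like the sketch below. It is an assumption-laden illustration, not the ParentRefresher implementation: the store is a plain dict standing in for storage, `wait_until_done` is a hypothetical name, and a simple wall-clock deadline stands in for the "absence of progress" detection.

```python
import asyncio
import time
from typing import Optional


async def wait_until_done(
    store: dict,
    parent_ids: list,
    poll_interval: float = 0.01,
    timeout_seconds: Optional[float] = 1.0,
) -> None:
    """Poll a dict-backed store (hypothetical) until every parent is DONE."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while True:
        states = [store.get(pid) for pid in parent_ids]
        if any(state is None for state in states):
            raise ValueError("parent vanished during refresh")
        if any(state == "DISCARDED" for state in states):
            raise ValueError("parent discarded during refresh")
        if all(state == "DONE" for state in states):
            return
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("refresh did not finish before the deadline")
        await asyncio.sleep(poll_interval)


async def demo() -> str:
    store = {"p1": "QUEUED"}

    async def fake_dag() -> None:
        # Stand-in for the DAG promoting QUEUED -> RUNNING -> DONE.
        await asyncio.sleep(0.02)
        store["p1"] = "RUNNING"
        await asyncio.sleep(0.02)
        store["p1"] = "DONE"

    dag_task = asyncio.create_task(fake_dag())
    await wait_until_done(store, ["p1"])
    await dag_task
    return store["p1"]
```

Raising instead of returning stale state matches the caller contract in the message: the mutant is aborted and its in-flight slot released.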
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): SteadyStateEvolutionEngine composes dispatcher + ingestor + ParentRefresher
Replaces the 935-LOC epoch-driven engine with a thin composition of
three new modules:
- gigaevo/evolution/engine/mutant_task.py — run_one_mutant: one mutant
per async task; explicit slot-ownership invariant (try/finally guards
the semaphore against partial-failure and cancellation)
- gigaevo/evolution/engine/dispatcher.py — dispatcher_loop: continuous
spawn-and-forget producer; backpressure via _in_flight_sema only
- gigaevo/evolution/engine/ingestor.py — ingestor_loop + poll_and_ingest:
long-lived ingestion loop with adaptive interval, batch DONE handling,
leaked-id sweep, slot-release on ingest
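The slot-ownership invariant mentioned for mutant_task.py can be sketched as follows. This is a simplified illustration under stated assumptions: `run_one_mutant_sketch` is a hypothetical stand-in, the `fail` flag replaces real mutation work, and the `slot_transferred` flag mirrors the name used elsewhere in this PR.

```python
import asyncio


async def run_one_mutant_sketch(sema, in_flight, mutant_id, fail):
    """Acquire a slot, then either hand it to the ingestor or give it back."""
    await sema.acquire()
    slot_transferred = False
    try:
        if fail:
            raise RuntimeError("mutation failed")
        in_flight.add(mutant_id)
        # From here on the ingestor owns the slot and releases it on ingest.
        slot_transferred = True
    finally:
        # Runs on failure and on cancellation alike.
        if not slot_transferred:
            sema.release()


async def demo() -> tuple:
    sema = asyncio.Semaphore(1)
    in_flight = set()
    try:
        await run_one_mutant_sketch(sema, in_flight, "m0", fail=True)
    except RuntimeError:
        pass
    free_after_failure = not sema.locked()
    await run_one_mutant_sketch(sema, in_flight, "m1", fail=False)
    held_after_success = sema.locked()
    return free_after_failure, held_after_success, in_flight
```

The point of the flag is that exactly one party releases each slot: the task itself on any early exit, the ingestor otherwise.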
Deletes (from steady_state.py): _mutation_loop, _produce_one_mutant,
_get_cached_elites, _create_single_mutant, _ingestion_loop,
_poll_and_ingest, _ingest_batch, _should_trigger_epoch, _epoch_refresh,
_drain_in_flight, _drain_scoped, _refresh_archive_programs,
_mutation_gate, _cached_elites, _elite_cache_lock, _processed_since_epoch,
_epoch_mutants, _epoch_eligible_since (~800 LOC).
Config: drop refresh_passes + refresh_order from EngineConfig; hoist
max_in_flight to the parent; SteadyStateEngineConfig now a Hydra alias.
steady_state.yaml drops refresh_order + refresh_passes.
Tests: rewrite test_steady_state.py (736 → ~165 LOC) to cover
construction (incl. _parent_refresher wiring), backpressure semaphore,
generation cap stopping dispatcher_loop, restore from snapshot. Skip
modules pinned to deleted machinery: test_steady_state_determinism.py
(epoch determinism — to be rewritten against new tick site),
test_generation_boundary_emit.py (step() removal pending in Task 14).
See spec §3, plan §13.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): delete generational EvolutionEngine.step() / run() loop
evolution=default now wires SteadyStateEvolutionEngine. EvolutionEngine
becomes an abstract base of shared helpers (snapshot, metrics, idle wait,
hooks, stop context). BusedEvolutionEngine migrated to subclass
SteadyStateEvolutionEngine with a periodic bus-drain background task.
Also persists total_mutants in the engine snapshot after each mutant
production so resume picks up the correct generation counter — previously
this happened inside step() which is now gone.
See spec docs/superpowers/specs/2026-05-12-steady-state-engine-audit-and-redesign.md §3.6.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(collector): set *_in_iteration aggregates to None under JIT engine
Each mutant has a unique iteration (= total_mutants_at_production), so cohort
aggregates collapse to single-program windows. Schema field retained for
plot/exporter compatibility; consumers needing windowed aggregates should
compute them at plot time. See spec §3.5 + §6.5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): JIT-refresh polish — empty-archive backoff, metric wiring, vestigial GenerationBoundary
Wraps up Tasks 16-18 of the JIT-refresh refactor (plan:
docs/superpowers/plans/2026-05-12-steady-state-true-jit-refresh.md).
gigaevo/evolution/engine/mutant_task.py
- Add asyncio.sleep(loop_interval) backoff when select_elites returns
empty (population seeding / all rejected). Prevents dispatcher
hot-spinning when the archive is empty.
- Wire submitted_for_refresh metric: record_reprocess_metrics(len(refreshed))
after ParentRefresher.refresh() succeeds. Previously the metric was
orphaned (defined but never incremented under JIT-refresh).
gigaevo/monitoring/events.py
- Mark GenerationBoundary vestigial with explanatory docstring. The class
schema is kept so legacy run logs still parse, but nothing in gigaevo/
emits this event under steady-state JIT-refresh.
config/constants/evolution.yaml, config/evolution/{default,steady_state}.yaml
gigaevo/evolution/engine/config.py
gigaevo/experiment/launch_generator.py
- Drop max_mutations_per_generation — under JIT-refresh there is no
per-generation mutation cap; max_in_flight controls parallelism.
Tests adjusted for JIT-refresh floor-trigger semantics:
- Strict total_mutants == N replaced with >= N at ~12 sites across
tests/integration/{test_mini_run,test_multigen_e2e,test_memory_e2e,
test_acceptor_engine,test_advanced_scenarios,test_brittleness,
test_complex_scenarios,test_engine_regression,test_ingest_regression,
test_evolution_engine_edge_cases}.py and tests/concurrency/
test_deadlock_prevention.py. JIT cap is a floor trigger — concurrent
in-flight mutants may bring total_mutants slightly above max.
- Skip class-level on TestEmptyArchiveEngine, TestAllMutationsReturnNone,
TestAllMutationsRaise, TestTransientMutationFailure (empty/zero-success
scenarios cannot reach the cap under JIT-refresh).
- Skip class-level on TestEngineStepIntegration — the generational
engine.step() entry point was deleted; deadlock-prevention under
JIT-refresh is covered by the paranoia suite (Task 19A).
- Skip two engine.run() wiring tests in TestEnginePostRunHookWiring
that hung on AsyncMock empty archive; the wiring is still covered by
test_none_hook_defaults_to_null + test_custom_hook_is_stored.
- New tests/config/test_stopper_configs.py pins the MaxMutantsStopper
Hydra targets and rejects MaxGenerationsStopper imports.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(specs): record JIT engine dry-run smoke results
§9 added with the Hydra config resolution table showing the new schema
is canonical: SteadyStateEvolutionEngine + MaxMutantsStopper +
max_in_flight, with no max_mutations_per_generation / refresh_pass /
total_generations references. Closed experiment configs intentionally
left unchanged.
Live-cluster run deferred to post-merge follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): concurrency stress + simulation suite (load × async patterns)
36-combo parametrised matrix exercising the JIT-refresh engine end-to-end
against fakeredis storage and a timed fake DAG. Verifies six core
invariants: no semaphore leak, _in_flight drains, total_mutants reaches
the cap with bounded overshoot (≤ max_in_flight), programs_processed
equals accepted+rejected, ParentRefresher flip count is bounded, and
snapshot counters are monotonically non-decreasing.
Sweeps (max_in_flight, n_mutants, duration_dist, overlap_rate) across
mif ∈ {1,4,16}, n ∈ {50,200}, dist ∈ {const,expo,heavy_tail}, ov ∈
{0,0.5}. The high-overlap arm seeds the archive with a single elite so
concurrent producers contend on one parent and exercise the per-id
ParentRefresher lock; the low-overlap arm seeds 2×mif elites so producers
pick distinct parents.
Closes Task 19A from the steady-state JIT-refresh refactor plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): cancellation + resume-after-kill invariants
Two new test files paired with one engine fix:
* test_engine_cancellation.py — cancels run() mid-flight and after early
start; verifies slot accounting (sema._value + |_in_flight| ==
max_in_flight), counters never regress, snapshot remains consistent.
* test_engine_resume_after_kill.py — runs engine A to cap=5, tears it
down, rebuilds engine B against the same fakeredis server, calls
restore_state(), runs to cap=10. Verifies progress is strictly forward
across the resume and the cap window includes bounded overshoot.
Engine fix: SteadyStateEvolutionEngine.run()'s finally clause now
explicitly cancels the dispatcher and ingestor tasks. asyncio.wait()
does not propagate cancellation into its waited tasks, so without this
they leaked across an external run-task cancel, holding semaphore slots
forever (the cancellation test caught this directly).
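The `asyncio.wait()` behaviour behind this fix is easy to reproduce outside the engine. The sketch below uses hypothetical `worker` and `run` coroutines, not the engine's dispatcher or ingestor; it only shows that cancelling the waiting coroutine leaves the waited tasks running unless a `finally` block cancels them.

```python
import asyncio


async def worker() -> None:
    await asyncio.sleep(3600)


async def run(cleanup: bool, children: list) -> None:
    tasks = [asyncio.create_task(worker()) for _ in range(2)]
    children.extend(tasks)
    try:
        await asyncio.wait(tasks)
    finally:
        if cleanup:
            # Explicit teardown: wait() will not do this for us.
            for task in tasks:
                task.cancel()
            await asyncio.gather(*tasks, return_exceptions=True)


async def demo(cleanup: bool) -> list:
    children: list = []
    outer = asyncio.create_task(run(cleanup, children))
    await asyncio.sleep(0.01)  # let run() reach asyncio.wait()
    outer.cancel()
    try:
        await outer
    except asyncio.CancelledError:
        pass
    finished = [task.done() for task in children]
    # Tidy up any survivors so the event loop can close cleanly.
    for task in children:
        task.cancel()
    await asyncio.gather(*children, return_exceptions=True)
    return finished
```

`asyncio.run(demo(False))` reports both children still pending after the outer cancel, while `asyncio.run(demo(True))` reports both finished.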
Closes Task 19B from the steady-state JIT-refresh refactor plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): ParentRefresher failure-mode resilience
Adds four new failure-mode tests to test_refresh_parents.py:
* No-timeout-default: with timeout_seconds=None, a brief DAG pause is
absorbed and the refresh still completes successfully.
* Mid-flight DISCARD: a parent flipped DISCARDED by another path during
the await raises ValueError rather than returning stale state.
* Mid-flight vanish: a parent removed from storage during the await
raises ValueError.
* Reversed input order: two concurrent refreshes on the same parent
set with reversed input orderings both complete — the per-id locks
are acquired in deterministic sorted order, so classic
lock-order-inversion deadlocks are impossible.
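The sorted-acquisition idea from the reversed-input test can be illustrated with a small sketch. `SortedLockRefresher` is a hypothetical class, not ParentRefresher itself; it assumes parent ids are sortable strings and replaces the refresh work with a short sleep.

```python
import asyncio
from contextlib import AsyncExitStack


class SortedLockRefresher:
    def __init__(self) -> None:
        self._locks: dict = {}

    async def refresh(self, parent_ids, completed) -> None:
        async with AsyncExitStack() as stack:
            # Always acquire in sorted order, regardless of input order.
            for pid in sorted(set(parent_ids)):
                lock = self._locks.setdefault(pid, asyncio.Lock())
                await stack.enter_async_context(lock)
            await asyncio.sleep(0.01)  # stand-in for the refresh work
            completed.append(list(parent_ids))


async def demo() -> list:
    refresher = SortedLockRefresher()
    completed: list = []
    # A deadlock would surface here as asyncio.TimeoutError.
    await asyncio.wait_for(
        asyncio.gather(
            refresher.refresh(["a", "b"], completed),
            refresher.refresh(["b", "a"], completed),
        ),
        timeout=1.0,
    )
    return completed
```

Because both callers request `a` before `b`, neither can hold one lock while waiting on the other in the opposite order.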
Closes Task 19D from the steady-state JIT-refresh refactor plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): final ingestion sweep runs under cancellation
Chaos-hacker review identified two compounding High-severity bugs:
#1 cancellation between _in_flight.add and _write_snapshot permanently
leaks slots (slot_transferred=True blocks per-task release).
#2 final-sweep loop in run() body is unreachable when CancelledError
propagates from asyncio.wait().
Fix: move the final ingestion sweep into run()'s finally block with
asyncio.shield to survive outer cancellation, bounded by
max_in_flight + 1 passes to avoid hangs on QUEUED stragglers.
Also cancel dispatcher/ingestor tasks explicitly in finally — asyncio.wait()
does not cancel its waited tasks when the outer coroutine is cancelled, so
they could otherwise survive engine teardown and continue spawning mutants.
Regression test test_cancel_drains_done_programs_via_final_sweep asserts
that DONE programs in _in_flight at cancel time are ingested by the sweep,
with programs_processed advancing accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): serialise _write_snapshot to keep Redis in sync with memory
Chaos-hacker review finding #3 (Medium): concurrent mutant tasks call
_write_snapshot from run_one_mutant after incrementing total_mutants.
Without synchronisation, two writers can compute monotone versions
v=N+1 and v=N+2 synchronously, then both await save_run_state — if the
v=N+2 save lands first and v=N+1 lands second, Redis ends at v=N+1
with stale fields while the in-memory mirror sits at v=N+2. A crash
resume then rehydrates the older v=N+1 and loses the latest updates.
Fix: wrap the model_copy + set_current_snapshot + storage.save_run_state
in an asyncio.Lock so the per-call version bump and Redis write land
atomically. Last-writer-wins still holds; only the ordering is
guaranteed.
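The reordering race and the lock-based fix can be shown with a toy model. Everything here is an assumption made for illustration: `SnapshotStore` is hypothetical, an `asyncio.sleep` stands in for the storage write, and `persisted_version` stands in for what Redis would hold.

```python
import asyncio


class SnapshotStore:
    def __init__(self, use_lock: bool) -> None:
        self.memory_version = 0      # in-memory mirror
        self.persisted_version = 0   # stand-in for the stored snapshot
        self._use_lock = use_lock
        self._lock = asyncio.Lock()

    async def _bump_and_save(self, save_delay: float) -> None:
        self.memory_version += 1
        version = self.memory_version
        await asyncio.sleep(save_delay)  # simulated storage latency
        self.persisted_version = version

    async def write_snapshot(self, save_delay: float) -> None:
        if self._use_lock:
            async with self._lock:
                await self._bump_and_save(save_delay)
        else:
            await self._bump_and_save(save_delay)


async def demo(use_lock: bool) -> tuple:
    store = SnapshotStore(use_lock)
    # The first writer's save is slow, the second writer's save is fast.
    await asyncio.gather(store.write_snapshot(0.02), store.write_snapshot(0.0))
    return store.memory_version, store.persisted_version
```

Without the lock the slow first save lands last and overwrites the newer version; with the lock the version bump and the save form one critical section, so the two values agree.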
Regression test concurrent_write_snapshot_keeps_redis_and_memory_in_sync
issues 50 concurrent _write_snapshot calls and asserts the Redis-persisted
version equals the in-memory mirror's version at the end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(refresh): bound _locks dict via WeakValueDictionary
Chaos-hacker review finding #4 (Medium): ParentRefresher._locks was a
plain dict that retained an asyncio.Lock per distinct parent id forever.
On a multi-day run touching tens of thousands of mutants this leaks
~100 bytes/lock plus event-loop bookkeeping per entry — small in
absolute terms but proportional to evolution history.
Fix: switch to weakref.WeakValueDictionary so locks are retained only
while at least one in-flight refresh holds a strong reference. The lock
contract is unchanged — concurrent refreshes for the same parent id
still share the same lock, because the active caller's strong ref keeps
the entry alive across reentrant lookups.
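A minimal sketch of the weak-reference registry idea follows. `LockRegistry` is a hypothetical class, not the ParentRefresher code, and the immediate shrinkage after `del` relies on CPython reference counting (the `gc.collect()` call covers other cases).

```python
import asyncio
import gc
import weakref


class LockRegistry:
    def __init__(self) -> None:
        self._locks = weakref.WeakValueDictionary()

    def get(self, key: str) -> asyncio.Lock:
        lock = self._locks.get(key)
        if lock is None:
            lock = asyncio.Lock()
            self._locks[key] = lock
        return lock

    def __len__(self) -> int:
        return len(self._locks)


async def demo() -> tuple:
    registry = LockRegistry()
    lock = registry.get("parent-1")
    # While a caller holds a strong reference, lookups share the same lock.
    shared = registry.get("parent-1") is lock
    async with lock:
        size_while_held = len(registry)
    del lock
    gc.collect()
    return shared, size_while_held, len(registry)
```

The caller must keep the returned lock in a local variable for the duration of the critical section; otherwise the entry could vanish between lookup and use.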
Regression test test_refresh_locks_dict_does_not_grow_unboundedly
sequentially refreshes 20 distinct parents and asserts the dict shrinks
back to (near-)empty after gc.collect().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(integration): real-Redis smoke for JIT-refresh engine (Task 19C)
Adds tests/integration/test_engine_real_redis.py: end-to-end smoke
against an actual Redis at localhost:6379/0 (or REAL_REDIS_URL).
Auto-skips when no Redis is reachable, so committing it is safe on
machines without a local server.
What it verifies:
- The full dispatcher/ingestor/refresher/mutant-task pipeline survives
real network round-trips (not just fakeredis fast-paths).
- Bounded overshoot holds with cap=6, max_in_flight=2.
- No semaphore slot leak at run end.
- Snapshot is persisted to Redis at the same version the in-memory
mirror reports — i.e. the snapshot-lock fix actually serialises real
Redis writes, not just fakeredis ones.
Uses a unique key prefix per run and SCAN+DELETE cleanup in fixture
finally, so the test never clobbers another caller's data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): wall-clock bounded final sweep, patient on stragglers
The previous max_in_flight+1-pass bound terminated the final ingestion
sweep before the DAG could flip QUEUED→RUNNING→DONE for the last few
in-flight mutants on normal completion, leaking their semaphore slots
(stress suite caught a 1-slot leak on high-mif runs).
Switch the sweep to a wall-clock deadline (5s) with loop_interval sleep
between empty passes, while preserving the asyncio.shield + early-break
on CancelledError that made the cancellation-safety fix work. The sleep
itself is wrapped to bail on cancellation immediately.
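The switch from a pass-count bound to a wall-clock bound can be sketched as below. This is a simplified illustration: `final_sweep` and `poll_once` are hypothetical names, the `asyncio.shield` and cancellation handling described above are deliberately left out, and the 5 s default only mirrors the figure in the message.

```python
import asyncio
import time


async def final_sweep(poll_once, in_flight, deadline_s=5.0, loop_interval=0.01) -> int:
    """Keep polling until nothing is in flight or the deadline passes."""
    deadline = time.monotonic() + deadline_s
    ingested_total = 0
    while in_flight and time.monotonic() < deadline:
        ingested = await poll_once()
        ingested_total += ingested
        if ingested == 0:
            # Empty pass: wait for stragglers instead of giving up.
            await asyncio.sleep(loop_interval)
    return ingested_total


async def demo(ready_after: int, deadline_s: float) -> tuple:
    in_flight = {"m1"}
    calls = {"n": 0}

    async def poll_once() -> int:
        # Stand-in for the ingestor: the mutant becomes DONE after a few polls.
        calls["n"] += 1
        if calls["n"] >= ready_after:
            in_flight.discard("m1")
            return 1
        return 0

    total = await final_sweep(poll_once, in_flight, deadline_s, loop_interval=0.005)
    return total, len(in_flight)
```

A pass-count bound would stop after a fixed number of empty polls; the deadline lets slow stragglers finish while still guaranteeing termination.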
All 36 stress combos + 82 paranoia tests now green; 555-test evolution
sweep clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): rename final-sweep loop var to satisfy mypy
The cleanup loop reused `t` from the earlier ``for t in pending`` block
(typed `Task[Any]`), but the cleanup iterates a tuple of
`Task[Any] | None`. Renaming the variable removes the assignment-type
conflict without changing behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): apply PR #227 review fixes — naming + deprecated test cleanup
Address two review recommendations on the JIT-refresh refactor:
1. Naming consistency — make "generation" → "mutant" rename complete:
- core.py: rename _reached_generation_cap → _reached_mutant_cap
- core.py: 8 log prefixes "[EvolutionEngine] gen={}" → "mutants={}"
- dispatcher.py: 2 call sites updated to _reached_mutant_cap
- test_steady_state.py: section-comment reference updated
2. Remove deprecated tests left as @pytest.mark.skip after the refactor.
These covered the old epoch/step()/run-loop machinery that no longer
exists in the JIT-refresh engine. Removed in bulk via AST script
matching skip reasons like "JIT-refresh", "step() removed",
"Generational ...", "GenerationBoundary emission",
"_refresh_archive_programs", "_create_mutants".
Whole files deleted (only contained deprecated tests):
- tests/evolution/test_steady_state_determinism.py
- tests/evolution/test_generation_boundary_emit.py
Surgical class/function removals (kept the rest of each file):
- tests/evolution/test_evolution_engine.py
- tests/evolution/test_evolution_engine_complex.py
- tests/evolution/test_resume.py
- tests/evolution/bus/test_engine.py
- tests/integration/test_acceptor_engine.py
- tests/integration/test_advanced_scenarios.py
- tests/integration/test_complex_scenarios.py
- tests/integration/test_evolution_engine_edge_cases.py
Net: 13 files changed, 16 insertions(+), 2207 deletions(-).
Verified: ruff check + format clean; targeted pytest sweep
(tests/evolution/ + 4 integration files) green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(deps): unpin gigaevo-memory from private git URL — it's now public
The gigaevo-memory repo went public, so we can drop the
`@ git+https://...@<commit-sha>#subdirectory=client/python` form and
rely on the plain `gigaevo-memory` spec. This also unblocks CI's pip
install step, which was failing on the private-repo username prompt:
fatal: could not read Username for 'https://github.com':
No such device or address
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(heilbron_adversarial): replace absolute-path symlinks with relative
The 9 symlinks under problems/heilbron_adversarial/{pop_a_gan,pop_a_soft,
pop_b_soft}/{fallback,helper.py,initial_programs} were committed on
2026-05-02 with absolute targets baked in:
/mnt/virtual_ai0001071-04017_SR004-nfs1/CFS-SR008/workspace/mathemage/
gigaevo-core-internal/problems/heilbron_adversarial/pop_a/...
That path only exists on this NFS dev mount, so every CI runner saw
dangling links and ruff bailed out with:
E902 Failed to create cache key
Cause: No such file or directory (os error 2)
--> problems/heilbron_adversarial/pop_a_gan/helper.py
Replaced all 9 with relative siblings (e.g. ../pop_a/helper.py).
The `_soft` and `_gan` problem variants reuse pop_a's / pop_b's
helper.py + fallback/ + initial_programs/, same intent as before, now
portable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): rewire post_step_hook + adjacent observability polish
PR #227 deleted EvolutionEngine.step(), which historically fired
_post_step_hook once per generation. The kwarg + assignment in
EvolutionEngine.__init__ became dead code: CompositionInjectionHook —
the only production consumer, wired by 3 adversarial experiment launches
— silently no-opped on every Arm A run.
Changes
-------
1. Re-wire _post_step_hook in poll_and_ingest: fires once per ingest
sweep that adds >=1 program to the archive (the JIT analogue of the
old per-generation boundary). Fault-isolated — a buggy hook can't
abort ingestion, which has already committed to Redis.
2. H3 fix: ParentRefresher.timeout_seconds default None -> 600s.
None default could strand a mutant forever on DAG-runner crash,
leaking its in-flight semaphore slot.
3. Final-sweep observability: extract _final_ingestion_sweep() and
emit WARNING with stuck-IDs when the 5s wall-clock deadline elapses
before _in_flight drains. Operators previously had no signal that
a run shut down with leaked slots.
4. Drop stale "JIT-refresh" / "epoch" docstring framing from
config.py, core.py, mutant_task.py, steady_state.py.
Tests
-----
- 13 new SOTA tests in tests/evolution/test_post_step_hook_rewire.py
cover hook firing semantics (added==0 / added>0 / mixed / failure /
unset), finite-timeout default + override, and WARNING emission via
loguru sink capture.
- Existing test_refresh_no_timeout_default_waits_through_brief_pause
renamed + assertion updated for the new finite default.
Verification
------------
Full audit of evolution engine consumers ran clean:
- tests/evolution/ (1000+ tests, all pass)
- tests/integration/test_acceptor_engine,advanced_scenarios,
complex_scenarios,evolution_engine_edge_cases (42 tests)
- tests/adversarial_pipeline/ (composition_injection, progress_sync,
steady_state_adversarial_e2e)
- tests/memory/ (ideas_tracker_pipeline, engine_integration,
dag_memory_flow, memory_e2e_pipeline)
- tests/concurrency/test_deadlock_prevention
- tests/integration/test_brittleness, mini_run, multigen_e2e,
engine_regression, ingest_regression
- tests/prompts/test_coevolution_sync
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): two deadlock-class chaos-hacker findings + regressions
Closes the top two CRITICAL findings from the adversarial review on
commit 130fdbb2 (chaos-hacker agent a79a294de8502a7d8).
1) ParentRefresher: dedup parents by id before sorting + acquiring.
asyncio.Lock is NOT reentrant. If a parent bundle ever contains the
same program id twice (any ParentSelector returning duplicates, or a
future ParentRefreshSelector that walks lineage hitting the same id
via two paths), _acquire_all would call acquire() twice on the same
Lock from the same task and the mutant task hangs forever, holding
its in-flight slot. Eventually the engine starves.
Fix: fold duplicates by id inside refresh() before sort + lock
acquisition. First-seen wins. Test:
test_refresh_does_not_deadlock_on_duplicate_parent_ids and
test_refresh_selector_emitting_duplicates_does_not_deadlock both
would hang without this fix; with it, they complete and the parent
flips exactly once.
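The dedup-then-sort-then-acquire order can be sketched as follows. This is an illustration only: `dedup_by_id` and `acquire_all` are hypothetical names, and parents are modelled as plain dicts with an `"id"` key rather than the project's program objects.

```python
import asyncio
from collections import defaultdict


def dedup_by_id(parents):
    """Fold duplicates by id; the first-seen object wins."""
    seen = {}
    for parent in parents:
        seen.setdefault(parent["id"], parent)
    return list(seen.values())


async def acquire_all(locks, parents):
    """Acquire one lock per distinct parent id, in sorted id order."""
    ordered = sorted(dedup_by_id(parents), key=lambda p: p["id"])
    held = []
    try:
        for parent in ordered:
            lock = locks[parent["id"]]
            await lock.acquire()  # would hang on a duplicate id
            held.append(lock)
    except BaseException:
        # Release anything partially acquired before re-raising.
        for lock in reversed(held):
            lock.release()
        raise
    return ordered, held
```

Sorting gives every task the same acquisition order, and deduplicating first ensures no task ever calls `acquire()` twice on the same non-reentrant lock.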
2) _final_ingestion_sweep: track inner task explicitly so cancellation
does not leak a detached poll_and_ingest.
asyncio.shield(coro) only protects the inner coroutine from being
cancelled — it does NOT prevent CancelledError from propagating to
the awaiter. The previous code did `await asyncio.shield(poll_and_
ingest(self))` and on cancellation broke out of the loop. The inner
then continued as a detached Task, racing _post_run_hook.on_run_
complete and engine teardown for access to storage, _in_flight, and
the post_step_hook.
Fix: wrap poll_and_ingest in an explicit asyncio.create_task; on
outer cancellation, cancel the inner and wait_for(timeout=1.0) so
no zombie coroutine outlives the method. New test
test_cancellation_does_not_leak_inner_task asserts the inner's
finally fires before we move on.
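A sketch of the tracked-inner-task pattern, assuming a hypothetical `sweep_once` wrapper and an arbitrary `work` coroutine function. It uses `asyncio.wait` for the bounded reap, which differs from the `wait_for(timeout=1.0)` named in the commit.

```python
import asyncio


async def sweep_once(work):
    """Run `work()` as a tracked task; on cancellation, cancel and reap it."""
    inner = asyncio.create_task(work())
    try:
        # shield keeps an outer cancel from reaching `inner` directly,
        # but the CancelledError still propagates to this awaiter.
        return await asyncio.shield(inner)
    except asyncio.CancelledError:
        # We hold an explicit handle, so the inner task cannot be left
        # running detached: cancel it and wait (bounded) for its cleanup.
        inner.cancel()
        await asyncio.wait({inner}, timeout=1.0)
        raise
```

Keeping the `Task` handle is the important part: a bare `await asyncio.shield(coro())` leaves nothing to cancel once the awaiter has been interrupted.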
Chaos-hacker finding #1 (WeakValueDictionary GC race) was investigated
and dismissed: any task awaiting `lk.acquire()` keeps `lk` strongly
referenced on its suspended-coroutine frame, so the WeakValueDictionary
entry cannot be reclaimed while a waiter exists. The race the report
described requires a waiter without a strong ref, which is unreachable.
Verified all engine consumers green: evolution (1001 tests),
integration (83), adversarial+concurrency+memory+prompts (1424).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): drop dead code + fix cancel propagation in final sweep
Cycles 5-6 of the auto-optimize sprint:
Cycle 5 — systems-architect proposal #5 (dead-code deletion):
- Delete EvolutionEngine.pause(), resume(), is_running() — zero callers
anywhere in gigaevo/, tests/, tools/, experiments/. Verified with
`git grep` (tests/evolution/test_strategy_base.py hits are on
strategy.pause/resume, not engine.pause/resume).
- Delete _set_state() — one-line shim with zero internal callers.
- Delete _paused field — written but never read.
- Delete _run_start_mutants field + its dead write in
steady_state.py:63 — never consumed anywhere.
Cycle 6 — chaos-hacker Findings 1 (HIGH) + 2 (MED) on 75203666:
Finding 1 (HIGH): _final_ingestion_sweep used
`contextlib.suppress(BaseException)` around `wait_for(inner, 1.0)`.
That suppress catches asyncio.CancelledError, KeyboardInterrupt, and
SystemExit — meaning a second cancellation (or SIGINT) during the
inner-task cleanup was silently absorbed and the sweep returned
"normally", letting `_post_run_hook.on_run_complete` run in a teardown
context the supervisor never authorised.
* Narrow to `suppress(Exception)` so only true exceptions (Redis
transient, network blip) are tolerated during cleanup.
* Track the cancel locally and re-raise CancelledError after the
inner is settled and the (skipped) WARNING block — so the cancel
reaches `run()`'s awaiter.
* In `run()`'s finally, catch the re-raised CancelledError around
the sweep call so the finalizer (`post_run_hook.on_run_complete`)
still executes — cancellation is a shutdown signal, not a "skip
cleanup" one — then re-raise.
* Skip the "deadline elapsed" WARNING when sweep exits via cancel
(the message is for diagnostics of leaked semaphore slots, not
for shutdown-was-aborted).
Finding 2 (MED): docstring claimed `wait_for(timeout=1.0)` was a
"tight" cap. In CPython 3.12 `wait_for` cancels the inner and then
waits for it to honor the cancel — wall-clock cost is bounded by
inner cleanup latency, not the parameter. Updated docstring to say
"best-effort timeout" and clarified that only `Exception` is
suppressed (BaseException family — CancelledError, KeyboardInterrupt,
SystemExit — propagates intact).
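The difference between the two `suppress` guards in Finding 1 comes down to the exception hierarchy. The helper below (`absorbed` is a hypothetical name used only for this illustration) checks whether a given guard would swallow a given exception.

```python
import asyncio
import contextlib


def absorbed(exc, guard):
    """Return True if `contextlib.suppress(guard)` swallows `exc`."""
    try:
        with contextlib.suppress(guard):
            raise exc
    except BaseException:
        # The exception escaped the suppress block.
        return False
    return True
```

With `guard=Exception`, ordinary errors such as `ConnectionError` are absorbed while `asyncio.CancelledError`, `KeyboardInterrupt`, and `SystemExit` escape, because all three derive from `BaseException` directly.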
New regression tests in tests/evolution/test_post_step_hook_rewire.py
(TestFinalSweepCancellationSafety):
* test_cancellation_propagates_to_awaiter — pins Finding 1: cancel
must reach the engine awaiter; sweep_task.cancelled() must be true.
* test_normal_completion_returns_without_cancellederror — pins the
happy/timeout path so a future refactor of cancel plumbing doesn't
accidentally raise on deadline-elapsed.
Verified clean:
* tests/evolution/ + tests/integration/test_acceptor_engine.py +
test_advanced_scenarios.py + test_complex_scenarios.py +
test_evolution_engine_edge_cases.py → 1115 passed
* tests/adversarial_pipeline/ + tests/concurrency/ + tests/memory/ +
tests/prompts/ → 1581 passed, 5 skipped
* ruff check + format clean on the full repo
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): drop dead mutation_ids branch + dead fields, lock schema with extra=forbid
Cycle 7 of auto-optimize-loop on PR #227. Synthesizes systems-architect's
stale-refs audit (12 ranked proposals, bundle 1-4 + 9) with the chaos-hacker
LOW findings from cycle 6.
Production cleanups
- `_ingest_completed_programs(mutation_ids=...)` parameter dropped — the only
production caller passed `mutation_ids=None`. The fast-discard branch had
no live caller. Function now does one job: deserialize non-archive DONE
programs, push through acceptor + strategy.
- `EngineConfig.generation_timeout` deleted. Documented "deprecated, no
longer used" since 31b66de7 (2026-04-19); zero production reads.
- `EngineMetrics.errors_encountered` deleted. Zero production readers/writers;
only test_engine_metrics.py mutated it. EngineSnapshot doesn't embed
EngineMetrics, so no Redis-snapshot break.
Defense-in-depth
- `EngineConfig` now uses `extra="forbid"`. After a future field deletion, a
caller still passing the dead kwarg fails validation instead of having it
silently dropped under Pydantic's default `extra="ignore"`. Verified safe for
live Hydra configs (config/evolution/*.yaml only set declared fields).
- Swept 14 test sites still passing `generation_timeout=X` — chaos-hacker
flagged these as silent semantic drift if `extra="forbid"` is added without
the sweep.
Chaos-hacker LOW fixes (review of d5facada)
- `raise asyncio.CancelledError from None` on both sites in steady_state.py.
A Redis blip suppressed by the surrounding `contextlib.suppress(Exception)`
no longer shows up as chained context in the traceback and misleads the operator.
- Tightened `test_cancellation_propagates_to_awaiter` assertion: drops the
`cancelled() or (done() and exception() is CancelledError)` OR-branch.
Probed: on Py3.12, `raise asyncio.CancelledError` inside a coroutine ALWAYS
produces `task.cancelled() == True`, and calling `.exception()` on a
cancelled task re-raises CancelledError (so the OR-branch was unreachable).
Tightening is strictly safer; future regressions that break the
`.cancelled()` contract now surface immediately.
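The two probes mentioned above can be reproduced with a small standalone script. The `handler` and `probe` names are hypothetical, and the `ConnectionError` merely stands in for the Redis blip.

```python
import asyncio


async def handler():
    try:
        raise ConnectionError("transient blip")
    except ConnectionError:
        # `from None` hides the blip from the rendered traceback.
        raise asyncio.CancelledError from None


async def probe():
    # 1) Inspect the exception object itself.
    try:
        await handler()
    except asyncio.CancelledError as exc:
        flags = (exc.__suppress_context__, exc.__cause__)

    # 2) Inspect how a Task reports the same raise.
    task = asyncio.create_task(handler())
    await asyncio.wait({task})
    try:
        task.exception()
        exception_call_raised = False
    except asyncio.CancelledError:
        exception_call_raised = True
    return flags, task.cancelled(), exception_call_raised
```

A task whose coroutine raises `CancelledError` is reported as cancelled, and asking it for `.exception()` raises rather than returns, which is why asserting on `.cancelled()` alone is sufficient.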
Test cleanup
- Deleted `tests/evolution/test_ingest_mutation_ids.py` (299 LOC) — every
test pinned the dead `mutation_ids` branch.
- Removed stale "generation_timeout deprecated" zombie banner + module
docstring entry in test_evolution_engine_complex.py.
- Stripped `errors_encountered` assertions from test_engine_metrics.py.
Verification
- ruff: clean on touched dirs.
- Tests green:
* tests/evolution/ + selected integration (~700 tests, all dots)
* tests/concurrency/test_deadlock_prevention.py (all dots, 3 skipped)
* tests/integration/ + tests/benchmarks/ + tests/stages/ (all dots)
* tests/concurrency/ + tests/memory/ + tests/adversarial_pipeline/ +
tests/dag/ (all dots)
- chaos-hacker adversarial review of this diff: 1 HIGH (the
generation_timeout test-rot, fixed by the sweep above), 0 medium/low
remaining. Verdict: ship.
Adjacent finding (deferred)
- pre-existing observability gap: a second cancel landing during
`on_run_complete` skips the "[SteadyState] Stopped" log line.
Net behavior (cancellation reaches the awaiter) is correct; only the
log marker is missing. Out of scope for cycle 7.
LOC: -394 +32 (net -362). Full bytes-on-disk delta dominated by the
test_ingest_mutation_ids.py deletion (299 LOC).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): drop dead error counters + step() vestige, inline helpers
Cycle 8 quality pass on PR #227 — systems-architect proposals #1, #3, #6, #8 +
partial #2.
- Delete `elites_selection_errors` and `mutations_creation_errors` fields
(always passed 0 in production — verified every call site)
- Delete `record_elite_selection_metrics`, `record_mutation_metrics`,
`record_reprocess_metrics` (single-line accumulators with one caller each
after dropping the errors arg)
- Inline `_pick_parents` helper (4-line single-caller wrapper)
- Delete `SteadyStateEvolutionEngine.step()` NotImplementedError vestige and
its test (no production caller; `run()` already raises in the abstract base)
- Fix dated docstring `elites_selected` "across all generations" → "Total
elites cumulatively selected for mutation" (JIT-refresh has no generations)
- Update `tools/benchmarks/bench_multirun.py` call site for consistency
Net: 32 insertions, 107 deletions (-75 LOC). All `tests/evolution/`,
`tests/integration/`, `tests/concurrency/`, `tests/benchmarks/`,
`tests/stages/`, `tests/memory/`, `tests/adversarial_pipeline/`,
`tests/dag/` pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): eliminate ghost-persist by inlining single-mutant primitive
`generate_mutations(...)` wrapped `asyncio.gather(*tasks,
return_exceptions=True)`. If the outer awaiter (typically `run_one_mutant`,
spawned by the dispatcher and cancellable at engine teardown) was
cancelled after a child's `storage.add(program)` succeeded but before
`gather` returned, gather re-raised CancelledError to the caller — the
child's `except BaseException` handler still returned `persisted_id`,
but `results` was never bound. The program stayed in Redis with no
`_in_flight` tracking → ghost.
Refactor: extract `generate_one_mutation()` — a single-mutant primitive
with no gather. `mutant_task.run_one_mutant` calls it directly. The
function's `except BaseException` arm returns `persisted_id` to the
caller without any gather to swallow it. The caller registers the id in
`_in_flight` before the cancellation can re-propagate.
`generate_mutations(...)` is retained as a sequential batch wrapper for
the existing test suite (it loops over `generate_one_mutation` and
breaks on CancelledError, returning accumulated ids). Production
callers only ever passed `limit=1`, so there is no perf impact.
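The historical failure mode and the single-mutant alternative can be contrasted in a toy model. Everything here is hypothetical (`persist_then_work`, `batch`, `single`, and the list-based `store` and `tracked`), and the sketch omits whatever re-raise the real caller performs after registering the id.

```python
import asyncio


async def persist_then_work(store):
    store.append("p1")            # the "persist" step has succeeded
    try:
        await asyncio.sleep(10)   # cancellation lands here
    except BaseException:
        pass                      # still hand the id back to the caller
    return "p1"


async def batch(store, tracked):
    # gather re-raises CancelledError to us even though the child returned.
    results = await asyncio.gather(persist_then_work(store), return_exceptions=True)
    tracked.extend(results)       # skipped on cancel: the id is lost


async def single(store, tracked):
    pid = await persist_then_work(store)
    tracked.append(pid)           # reached: the id is registered


async def cancel_midway(runner):
    store, tracked = [], []
    task = asyncio.create_task(runner(store, tracked))
    await asyncio.sleep(0.01)
    task.cancel()
    await asyncio.wait({task})
    return store, tracked
```

In the `batch` case the program is persisted but never tracked, which is the ghost described above. In the `single` case the returned id reaches the caller because no `gather` sits between them.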
Adds `tests/evolution/test_engine_ghost_persist.py` with 7 deterministic
test cases covering: cancel-pre-persist (propagates cleanly), cancel-
post-persist (id surfaced), cancel-mid-lineage (id surfaced), an
integration test through `run_one_mutant` proving the id lands in
`_in_flight`, a gather-cancel regression-guard demonstrating the
historical failure mode, and backwards-compat checks for the batch
wrapper.
Files: 2 src changes (mutation.py refactor, mutant_task.py call-site),
1 new test file. 999/1000 evolution tests pass; 1 deselected test is a
pre-existing failure unrelated to this change (patches a non-existent
`steady_state.generate_mutations` symbol).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): drop dead category banners in test_evolution_engine_complex
The file had empty Category A/F/H/J banner comments left over after
those categories' tests were removed. They created a false signal of
"these areas are covered" without any actual test bodies. Drop them
and the corresponding lines in the module docstring.
No production code touched; all 11 tests in this file still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): annotate inlined parents var to satisfy mypy
The cycle-8 inlining of _pick_parents lost the helper's return type
annotation. Now that the assignment uses `next(..., [])` as the default,
mypy cannot infer the element type. Add an explicit `list[Program]` hint.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): add SOTA invariant test suite for steady-state concurrency
Plugs 8 coverage gaps identified by test-obsessed-reviewer's audit. The
new file `tests/evolution/test_engine_invariants.py` (440 LOC, 15 tests)
guards the engine's 8 concurrency invariants (I1-I8):
Gap 1 (I1) — cancel-between-acquire-and-slot-transfer releases slot
* test_cancel_before_elite_select_releases_slot
* test_cancel_during_parent_refresh_releases_slot
Gap 2 (I6) — dispatcher cancel drains all active mutant tasks
* test_active_tasks_are_cancelled_on_dispatcher_cancel
Gap 3 (I7) — ingestor uses fast interval (0.25*loop_interval) when saturated
* test_fast_interval_when_saturated
* test_slow_interval_when_idle (negative control)
Gap 4 (I6) — post_run_hook fires even on cancellation
* test_hook_fires_when_run_cancelled
Gap 5 (I4) — _in_flight_lock does not starve under contention
* test_many_waiters_all_progress (50 concurrent waiters, all land)
Gap 6 (I8) — _await_idle treats DISCARDED as idle (not active)
* test_discarded_only_returns_idle
* test_await_idle_returns_promptly_with_only_discarded
Gap 7 (I5) — snapshot version monotonic in Redis under concurrent writes
* test_concurrent_writes_versions_monotone (20 concurrent writes)
* test_in_memory_mirror_tracks_redis
Gap 8 (I1+I2) — double-poll same id releases slot exactly once
* test_id_not_double_released
* test_leaked_id_swept_once
Bonus (I3 deterministic) — slot_transferred flag is exclusive
* test_success_path_transfers_slot
* test_no_elite_releases_slot
All tests are deterministic — asyncio.Event for sync, no time.sleep
polling, no flaky timing assumptions. The full suite runs in <1s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): await metrics_collector cancel before storage.close
Without `await` after `_metrics_collector_task.cancel()`, the collector
may still be mid `await storage.<call>` when `storage.close()` fires
below — raising ConnectionClosedError into an orphan coroutine that has
no caller. Bound the wait so a wedged collector cannot indefinitely
block shutdown.
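A sketch of the cancel, bounded wait, then close ordering. The `stop` function and `close_storage` callback are hypothetical, and the sketch swaps in `asyncio.wait` for the `wait_for` budget the tests below refer to, since `asyncio.wait` returns at the deadline without raising.

```python
import asyncio


async def stop(collector_task, close_storage, budget_s=2.0):
    """Cancel the collector, wait up to `budget_s`, then close storage.

    Returns True if the collector finished within the budget.
    """
    collector_task.cancel()
    # Returns at the deadline even if the task ignores the cancel.
    _, pending = await asyncio.wait({collector_task}, timeout=budget_s)
    await close_storage()
    return not pending
```

Awaiting the cancelled task before closing storage lets its `finally` block run against a live connection, and the budget keeps a wedged collector from stalling shutdown.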
Add two regression tests:
- test_collector_finished_before_storage_close: asserts the collector's
finally runs strictly before storage.close().
- test_wedged_collector_does_not_block_stop_forever: asserts stop()
returns within the 2s wait_for budget even when the collector
shields against cancel.
Cycle 11: chaos-hacker F4 finding from cycle 10 deadlock probe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): drop redundant CancelledError arm + tidy Any import
Two micro-simplifications surfaced during cycle-12 quality review:
1. `dispatcher.py`: the explicit `except CancelledError: raise` arm was a
no-op — `finally` runs regardless, and `CancelledError` propagates
naturally without an explicit re-raise. Removing the dead arm keeps
the loop's control flow obvious: try → finally.
2. `core.py`: guarding a lone `from typing import Any` behind TYPE_CHECKING
saved nothing measurable (the import is effectively free). Promoted to top-level.
Regression test added (`TestDispatcherFinallyCancelsSpawnedMutants`):
monkey-patches `run_one_mutant` to a long-runner, cancels the dispatcher
mid-flight, asserts the spawned mutant received `CancelledError` via the
dispatcher's `finally` block. Pins the cancellation contract so a future
refactor cannot accidentally swallow the cancel.
Total invariant tests now 18 (was 17 in cycle 11).
ruff clean + full evolution+integration suite green.
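The try → finally contract from item 1 can be illustrated with a toy dispatcher. The `dispatcher` function and the sleeping child below are hypothetical stand-ins for the real loop and `run_one_mutant`.

```python
import asyncio


async def dispatcher(spawned):
    """Spawn a child; on any exit, cancel whatever is still running."""
    try:
        spawned.append(asyncio.create_task(asyncio.sleep(10)))
        await asyncio.sleep(10)
    finally:
        # No `except CancelledError: raise` arm is needed: the finally
        # block runs on cancellation and the error keeps propagating.
        for task in spawned:
            task.cancel()
        await asyncio.gather(*spawned, return_exceptions=True)
```

Cancelling the dispatcher task leaves both it and its child in the cancelled state, which is the behaviour the regression test pins.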
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): close two orphan paths in _final_ingestion_sweep
Cycle-13 chaos-hacker probe found a HIGH bug replaying the cycle-11 F4
shape through a different channel: the sweep's bounded-wait
(`suppress(Exception)` + `wait_for(timeout=1.0)`) had TWO escape paths
that left the `poll_and_ingest` inner task detached past the sweep's
return:
(1) Slow-cancel target — if inner takes >1s to honor cancel, wait_for
raises TimeoutError (Exception subclass), suppressed silently;
inner runs detached and races storage.close() in stop().
(2) Double-cancel — if a second cancel arrives during wait_for(inner),
wait_for re-raises CancelledError (BaseException, NOT Exception);
the suppress doesn't catch it, control exits the except arm with
`cancelled=True; break` skipped; inner is detached.
Both replay the cycle-11 metrics_collector orphan: ConnectionClosedError
fires into a coroutine that has no caller to surface it.
Fix (steady_state.py:174-205): explicit `suppress(CancelledError)`
catches the double-cancel and routes through the cancelled-flag path;
TimeoutError is logged as a WARNING so an operator can correlate the
orphan risk with whatever stranded the inner task in Redis. Generic
Exception still logs but does not let inner escape.
Regression coverage (+2 tests, total now 20):
- test_slow_cancel_inner_logs_timeout_but_no_orphan_on_normal_path
monkey-patches poll_and_ingest to a slow-cancel target (re-shields
the first cancel for 2s); asserts the WARN about "did not honor
cancel" / "orphan" is logged.
- test_double_cancel_routes_through_cancelled_flag — cancels the sweep
twice in succession; asserts the inner task still received its
CancelledError (no orphan).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): persist-then-mirror snapshot write — no version skip on retry
Cycle 14 of the PR #227 quality sprint.
`EvolutionEngine._write_snapshot` previously incremented the in-memory
mirror (`self._snapshot` + `set_current_snapshot`) BEFORE the Redis
`save_run_state` call. On a transient Redis failure this left the mirror
reflecting an unpersisted version: the next successful save then wrote
version N+2, silently skipping N+1 in Redis. Resumers reading from Redis
would see a gap that doesn't exist in any operator-visible log.
Persist-then-mirror reorders the two operations so the in-memory mirror
only advances after Redis confirms. If `save_run_state` raises, the mirror
keeps the prior version, the next call retries the SAME version number,
and Redis stays gap-free. Mirror is now always `≤` Redis — acceptable
because Redis is the source of truth on resume.
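Persist-then-mirror reduces to a few lines. This sketch is synchronous for brevity, and `SnapshotWriter` plus its `save` callable are hypothetical; the real `_write_snapshot` and `save_run_state` are not reproduced here.

```python
class SnapshotWriter:
    """Advance the in-memory version only after the save succeeds."""

    def __init__(self, save):
        self._save = save   # callable(version); may raise on failure
        self.version = 0    # in-memory mirror

    def write(self):
        candidate = self.version + 1
        self._save(candidate)      # persist first; an exception exits here
        self.version = candidate   # mirror advances only on success
        return candidate
```

If the first `write()` fails and the second succeeds, the store sees version 1 attempted twice and never sees a gap, whereas incrementing the mirror first would have produced attempts 1 and 2.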
Tests (tests/evolution/test_engine_invariants.py::TestWriteSnapshotPersistThenMirror):
- test_save_failure_leaves_mirror_at_old_version: asserts mirror stays
at version 0 when save_run_state raises RuntimeError
- test_successful_save_updates_mirror_and_redis_in_one_step: happy path
- test_retry_after_failure_uses_same_version: asserts saved_versions ==
[1, 1] (mirror-then-save form would have produced [1, 2])
Regression: 1060/1060 tests pass across tests/evolution/ and the four
integration suites (acceptor_engine, advanced_scenarios, complex_scenarios,
evolution_engine_edge_cases).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): bound post_step_hook to 300s — prevent ingestor wedge
Long-lived post_step_hook (CompositionInjectionHook walks the full G
archive) was previously awaited without a wall-clock bound: a hung
hook (network call without timeout, infinite loop) would freeze the
ingestor — no further sweeps fire, no new mutants land in the archive.
Fix: wrap the hook call in `_run_bounded_post_step_hook`, which drives
the hook via an explicit asyncio.Task and bounds it with
`asyncio.wait(timeout=_POST_STEP_HOOK_TIMEOUT_S)`. On timeout we
cancel + grace-wait + log; on outer cancel we cancel + await briefly
+ re-raise.
Key load-bearing detail: ``asyncio.wait`` (NOT ``asyncio.wait_for``).
``wait_for`` cancels the inner task then awaits the cancel to be
honored before raising TimeoutError, so a hook that catches
CancelledError and keeps looping extends our wait indefinitely —
defeating the bound. Plain ``wait`` returns at the deadline regardless
of the inner task's state; we surface the orphan via the pending set
and log "potential orphan coroutine; ingestor proceeding".
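A sketch of bounding a hook with `asyncio.wait`, assuming a hypothetical `run_bounded` helper that reports a status string in place of the logging the commit describes.

```python
import asyncio


async def run_bounded(hook, timeout_s, grace_s):
    """Run `hook()` under a hard wall-clock bound and report what happened."""
    task = asyncio.create_task(hook())
    try:
        _, pending = await asyncio.wait({task}, timeout=timeout_s)
    except asyncio.CancelledError:
        # Outer cancel: cancel the hook, wait briefly, then re-raise.
        task.cancel()
        await asyncio.wait({task}, timeout=grace_s)
        raise
    if not pending:
        if task.cancelled():
            return "cancelled"
        return "failed" if task.exception() is not None else "completed"
    # Deadline hit: cancel, then grace-wait without trusting the hook.
    task.cancel()
    _, still_pending = await asyncio.wait({task}, timeout=grace_s)
    return "orphaned" if still_pending else "timed_out"
```

Because `asyncio.wait` returns at the deadline regardless of the task's state, a hook that swallows its cancel costs at most `timeout_s + grace_s` before the caller moves on.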
Test suite adds TestPostStepHookTimeoutBound (5 tests):
- fast_hook_completes_normally — happy path, default budget
- hung_hook_cancelled_after_budget — sleeps 60s, monkeypatched to
0.1s budget, asserts WARN + hook_was_cancelled event set
- uncooperative_hook_logs_orphan_warn — bounded-badness stubborn
hook (swallows first cancel, honors second so test loop reaps it);
asserts elapsed < 1.0s and both WARN lines fire
- outer_cancel_propagates_to_hook — cancels poll_and_ingest mid-
hook, asserts hook cancelled and sweep re-raises
- default_timeout_is_generous — sanity: 60s ≤ T ≤ 3600s,
0.5s ≤ grace ≤ 30s
Regression: 1060+ evolution+integration tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(engine): post_step_hook timeout knobs; iteration-window stats; deadlock stress
* `EngineConfig` gains `post_step_hook_timeout_s` (default 300s) and
`post_step_hook_cancel_grace_s` (default 2s) so the wall-clock bound on
a single post-step hook invocation is tunable per run; ingestor no
longer carries module-private magic constants.
* `EvolutionaryStatisticsCollector` gains `iteration_window_size`
(default 8). The iteration cohort aggregates now use a trailing window
`[iter - N, iter]`, restoring the "stats over the last batch" signal
that the old generational engine produced from per-generation cohorts.
`N = 0` disables the feature and keeps the iteration fields None.
* New deadlock-stress suite in `tests/evolution/test_refresh_parents.py`
exercises 32-way same-parent storms, randomized-order overlapping
batches, and cancel-mid-acquire on the per-id parent lock.
* `tests/monitoring/test_experiment_monitor.py` helper now seeds both
`total_mutants` and `programs_processed` — the latter is the field
`RunSnapshot.generation` reads from, so the assertion-based tests
pass against the current snapshot schema.
* Scrub of historical refactor framing (cycle numbers, finding tags,
in-flight rewire wording) from comments, docstrings and one filename;
no behavioural change in those sites.
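The trailing-window aggregate behind `iteration_window_size` might look like the sketch below. `window_mean` and the `(iteration, fitness)` pair layout are assumptions for illustration, and the window is read as the closed interval `[iter - N, iter]`.

```python
from statistics import mean


def window_mean(fitness_by_iter, current_iter, window_size=8):
    """Mean fitness over iterations in [current_iter - window_size, current_iter].

    Returns None when the window is disabled (window_size == 0) or empty.
    """
    if window_size <= 0:
        return None
    lo = current_iter - window_size
    values = [f for it, f in fitness_by_iter if lo <= it <= current_iter]
    return mean(values) if values else None
```

With samples at iterations 1, 5, and 10 and `current_iter=10`, the default window of 8 covers iterations 2 through 10, so only the last two samples contribute.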
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(llm): langfuse v4 handler init; pin langfuse>=4,<5
`LangchainCallbackHandler` no longer exposes `.client` in langfuse 4.x,
so `handler.client.flush_at = 1` raises AttributeError at
`MultiModelRouter.__init__` -> Hydra instantiation fails before the run
even starts. Fix: configure the singleton `Langfuse` client with
`flush_at=1, flush_interval=1` before constructing the handler — the
handler picks it up via `get_client()` internally.
Also tighten the pin (`langfuse>=2.0.0` was unconstrained upward and
silently admitted v4) to `langfuse>=4.0.0,<5` so this API contract
doesn't drift again without a deliberate bump.
Pre-existing bug on main (introduced 2026-04-03, commit 51a14631);
unrelated to the steady-state refactor branch but blocking E2E.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(run.py): drop stale cfg.max_generations reference
The steady-state engine refactor deleted the generation/epoch concept;
``cfg.max_generations`` no longer exists, so the startup log on line 74
raised ``ConfigAttributeError`` and aborted every launch immediately
after the engine printed its own start banner.
Replaced with ``cfg.max_mutants`` — the top-level constant that backs
``MaxMutantsStopper``, which is the canonical termination signal now.
The engine's own log already reports ``stopper=MaxMutantsStopper``;
this just adds the bound.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(engine): extend per-parent-id lock through child-DAG via ParentRefreshTicket
Producer→ingestor handoff for the per-parent-id lock acquired in
ParentRefresher. The lock now spans refresh + mutate + child-DAG, not
just refresh — closing the invariant "parents are not refreshed while a
child of theirs is in flight."
Why: a concurrent producer that selected the same parents could refresh
them while another producer's child was mid-DAG (state=RUNNING,
metrics={}). AncestrySelector picked up that unscored child as
ancestry, triggering the "missing fitness key" warning the user
reported on `run.py problem.name=heilbron llm=gemini3_flash`.
Changes:
* refresh.py — Add ParentRefreshTicket (idempotent release; holds
per-parent-id locks in sorted order). New refresh_with_ticket()
returns the ticket; back-compat refresh() now wraps it and
auto-releases on return. Failure paths release any partially-acquired
locks before re-raising.
* mutant_task.py — Acquire ticket via refresh_with_ticket();
transfer atomically with _in_flight.add() under _in_flight_lock;
finally-release ticket if not transferred (failure path). Two
ownership-handoff invariants now documented in module docstring:
slot + ticket.
* steady_state.py — _inflight_tickets: dict[mutant_id, ticket]
paired with _in_flight set.
* ingestor.py — Pop tickets under _in_flight_lock atomically with
slot release; release() outside the lock to keep the critical
section short.
Tests:
* test_refresh_parents.py — Add TestRefreshWithTicket (6 tests):
ticket holds lock until release, idempotent release, empty parents,
back-compat refresh() auto-release, failure-path lock release.
* test_engine_invariants.py — Add TestNoRefreshWhileChildInFlight
(4 tests): second producer blocks until child ingested,
failure-before-register releases ticket, accept/reject paths both
release ticket, leaked child releases ticket.
* test_engine_ghost_persist.py — Update _FakeEngine to implement the
ticket API.
* test_engine_invariants.py — Update two mocks to use
refresh_with_ticket instead of refresh.
Verified: all engine + refresh + invariant tests pass (95 cases);
test_engine_stress.py passes its full 36-case parametrise sweep.
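The ticket idea (locks acquired in sorted order, handed off as one object, released idempotently) can be sketched as follows. `Ticket` and `acquire_ticket` are hypothetical names, not the `ParentRefreshTicket` / `refresh_with_ticket` implementation.

```python
import asyncio


class Ticket:
    """Owns a set of already-acquired locks; release() is idempotent."""

    def __init__(self, locks):
        self._locks = list(locks)
        self._released = False

    def release(self):
        if self._released:
            return
        self._released = True
        for lock in reversed(self._locks):
            lock.release()


async def acquire_ticket(locks_by_id, parent_ids):
    """Acquire per-id locks in sorted order and wrap them in a Ticket."""
    held = []
    try:
        for pid in sorted(set(parent_ids)):
            lock = locks_by_id.setdefault(pid, asyncio.Lock())
            await lock.acquire()
            held.append(lock)
    except BaseException:
        # Failure path: release partially acquired locks before re-raising.
        for lock in reversed(held):
            lock.release()
        raise
    return Ticket(held)
```

Whoever holds the ticket owns the locks, so passing it from producer to ingestor extends the critical section without either side touching individual locks.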
* refactor(engine): collapse elite→parent indirection in mutant_task
Source the elite pool size from parent_selector.num_parents instead of
the now-vestigial max_elites_per_generation. With pool == num_parents,
parent_selector.create_parent_iterator(elites) is a no-op shuffle, so
mutant_task.py no longer needs to do next(iter(...), []) over it.
- _select_elites_for_mutation → _select_parents_for_mutation, returns
the actual parent set directly.
- mutant_task.run_one_mutant calls it once; single empty-archive guard.
- Stress test stub now honours the EvolutionStrategy.select_elites
contract (return at most `total`); the old behaviour relied on the
parent_iterator to subsample.
max_elites_per_generation stays in EngineConfig for legacy YAML
compatibility but is no longer read by the engine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(cli): add `gigaevo profiler` subcommand for log flow profiling
Parses an evolution runner log and emits two artifacts per run:
- profile_<label>.txt -- pipeline summary (counts, refresh queue stats,
per-program timeline)
- profile_<label>.html -- interactive Plotly dashboard (lifecycle bars,
stage sub-bars, refresh + re-eval bands, accept/reject bars)
Resolution priority mirrors `logs`: --file <path> for arbitrary logs,
positional labels under -e for manifest resolution, no-args + -e to
profile every run in the manifest. Default output dir:
experiments/<exp>/profiler/.
Core renderer lives in gigaevo.monitoring.flow_profiler so the CLI is a
thin wrapper. Accept/reject markers use go.Bar (same width as the DAG
span bar) instead of scatter markers, so they sit on the program's
exact row at every zoom level. Min visual width clamped to 50ms (was
250ms) to keep sub-second events readable without smearing the early
timeline. Footer explains queue-wait pathology referencing
ParentRefresher._await_done() pinning in-flight slots during re-eval.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(scheduling): add CachedFirstPrioritizer for re-eval-first DAG launch
A program with non-empty stage_results has already been DAG-evaluated once,
so on re-eval most of its stages will hit cached_skip and finish in
milliseconds. Surfacing those to the front of the launch queue directly
unblocks producer tasks that are pinned on ParentRefresher._await_done()
(each pinned task holds an in-flight slot, so when N mutants x M-second
refresh queues collide, throughput collapses even though per-DAG exec is
near-zero).
The cache signal is sound: fresh mutants from Program.from_mutation_spec
inherit default_factory=dict (empty), re-eval candidates retain the dict
through batch_transition_by_ids (which only patches state + atomic_counter,
program.py:281 -> redis_program_storage.py:632-633), and dag_runner.mget
fetches without exclude=EXCLUDE_STAGE_RESULTS. No code path destroys the
field.
Implements a two-tier partition: cached programs first, fresh second.
Within each tier the input order is preserved -- Redis SMEMBERS hash
order, which the runner uses upstream, has no meaningful semantics.
No predictor needed -- the cache signal lives on the program itself.
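The two-tier partition described above can be sketched as follows. This is a minimal illustration, not the shipped `CachedFirstPrioritizer`: the `Program` dataclass and the `cached_first` function are hypothetical stand-ins, and only the `stage_results` field name comes from the commit message.

```python
from dataclasses import dataclass, field


@dataclass
class Program:
    # Hypothetical stand-in for the real Program model.
    id: str
    stage_results: dict = field(default_factory=dict)


def cached_first(programs):
    """Stable two-tier partition: cached programs first, fresh second."""
    cached = [p for p in programs if p.stage_results]      # re-eval candidates
    fresh = [p for p in programs if not p.stage_results]   # fresh mutants
    return cached + fresh  # input order is preserved inside each tier


queue = [
    Program("fresh-1"),
    Program("reeval-1", {"CallProgramFunction": "ok"}),
    Program("fresh-2"),
    Program("reeval-2", {"CallValidatorFunction": "ok"}),
]
ordered_ids = [p.id for p in cached_first(queue)]
```

Two list comprehensions keep the partition stable without relying on a sort key, which matches the "input order is preserved" property stated above.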
7 new tests in tests/evolution/test_scheduling.py::TestCachedFirstPrioritizer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(monitoring): emit LLM_CALL canonical event from MutationAgent
The MutationAgent overrides acall_llm to use a structured_llm pathway,
which bypassed the base BaseStrategyAgent._emit_event(LLMCall(...)) call.
As a result, /flow-profiler had no MutationAgent timings — only Lineage
and Insights showed up in canonical event aggregations.
Add a finally-block emission that records latency, token usage, model,
attempt count, and error_type on both success and failure, matching the
contract used by the base agent.
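A simplified, synchronous sketch of the finally-block emission pattern follows. The real override is `acall_llm` and also records token usage and attempt count; the wrapper name `call_with_llm_event`, the `emit` callback, and the exact payload keys here are assumptions made for illustration.

```python
import json
import time


def call_with_llm_event(fn, *, stage, model, emit):
    """Run fn() and emit one [LLM_CALL] line whether it succeeds or raises."""
    start = time.perf_counter()
    error_type = None
    try:
        return fn()
    except Exception as exc:
        error_type = type(exc).__name__
        raise  # the caller still sees the original failure
    finally:
        payload = {
            "stage": stage,
            "model": model,
            "duration_ms": round((time.perf_counter() - start) * 1000, 3),
            "ok": error_type is None,
            "error_type": error_type,
        }
        emit("[LLM_CALL] " + json.dumps(payload))


def failing_call():
    raise RuntimeError("provider timeout")


lines = []
result = call_with_llm_event(
    lambda: "mutant", stage="MutationAgent", model="demo-model", emit=lines.append
)
try:
    call_with_llm_event(
        failing_call, stage="MutationAgent", model="demo-model", emit=lines.append
    )
except RuntimeError:
    pass

events = [json.loads(line.split(" ", 1)[1]) for line in lines]
```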
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(profiler): utilization view — LLM/exec overlap + mutation archetypes
Adds a torch-profiler-style "is the LLM fully hidden behind exec stages"
signal to /flow-profiler. Three new primitives:
* LLMCallEvent dataclass + LLM_CALL_RE — parse every canonical
`[LLM_CALL] {json}` line into (stage, end, duration_ms, ok, …).
* classify_stage(name) — bucket stages into llm / exec / orchestration.
LLM stages (LineageStage, InsightsStage, *Agent canonical names) and
program-exec stages (CallProgramFunction, CallValidatorFunction) are
the two sides of the overlap; orchestration is excluded.
* compute_utilization(...) — interval-union math returning total_llm_s,
total_exec_s, overlap_s, overlap_efficiency = overlap / min(L, E),
plus peak_concurrent_dags and per-archetype accept/reject counts.
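The interval-union math behind `overlap_efficiency = overlap / min(L, E)` can be sketched like this. The helper names (`union`, `intersection`, `utilization`) and the `(start, end)` tuple representation are assumptions for illustration; the actual `compute_utilization` signature is not shown in the commit message.

```python
def union(intervals):
    """Merge overlapping (start, end) intervals into a sorted disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total(intervals):
    return sum(end - start for start, end in intervals)


def intersection(a, b):
    """Intersect two sorted disjoint interval lists with a two-pointer sweep."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def utilization(llm_intervals, exec_intervals):
    llm, exe = union(llm_intervals), union(exec_intervals)
    total_llm, total_exec = total(llm), total(exe)
    overlap = total(intersection(llm, exe))
    floor = min(total_llm, total_exec)
    efficiency = overlap / floor if floor else 0.0
    return total_llm, total_exec, overlap, efficiency


# Two overlapping LLM calls (wall time 12s) against one exec stage (12s).
total_llm_s, total_exec_s, overlap_s, overlap_efficiency = utilization(
    [(0, 10), (5, 12)], [(8, 20)]
)
```

Taking the union first means concurrent LLM calls are counted as wall time rather than summed, which is why the efficiency cannot exceed 1.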
Also:
* parse_log returns (programs, refreshes, llm_events) — 3-tuple.
* MUT_RE captures the optional `(model=…, archetype=…, prompt_id=…)`
suffix already emitted by the mutation operator, attaching it to
Program.mutation_archetype / .mutation_model.
* format_summary_text gains a "Utilization" section + archetype table.
* render_full_html gains a colored efficiency stat-bar (red <30%, amber
<60%, green ≥60%) and an archetype frequency table above the plot.
Smoke on experiments/heilbron/v1-honest-repro/run_A2_G.log:
LLM wall 76640s · exec wall 44860s · overlap 40377s (90% of min(L,E))
peak concurrent DAGs: 11 · 2421 LLM events (116 failed)
Computational Reinvention 91a/76r/49o · Guided Innovation 73a/61r/38o
Harmful Pattern Removal 12a/5r/4o · Solution Space Exploration 10a/4r/3o
19 new tests, all green; CLI smoke (10 tests) still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: ruff format follow-up on test_mutation_agent
Pre-push hook caught residual formatting in the LLM_CALL emission tests
added in c336eb58; reformat only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(profiler): drop experiment-branded subtitle from page header
The HTML header used to render `<h1>flow profile · A2_G</h1>` followed by
a `<span class="sub">heilbron/v1-honest-repro / A2_G</span>` next to it,
which made the generic profiler tool look "branded" with whatever
experiment was being analyzed.
Drop the prominent subtitle and relegate the source path to a small
muted `source: ...` line in the footer. The browser tab title and h1
are now clean — just `flow profile · <label>`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(monitoring): live flow profiler daemon for run.py
Adds gigaevo/monitoring/live_profiler.py — a small helper that spawns a
daemon thread to periodically re-render the running experiment log into
profile_live.html inside the Hydra output dir. Writes are atomic
(.tmp + os.replace) so a browser reload mid-render never sees a partial
file, and exceptions on one tick are logged but don't kill the loop.
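The atomic-write step (`.tmp` + `os.replace`) might look like the sketch below. The function name `atomic_write_text` and the temporary-directory setup are illustrative assumptions; only the `.tmp` suffix, `os.replace`, and the `profile_live.html` file name come from the commit message.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(target: Path, text: str) -> None:
    """Write to a sibling .tmp file, then swap it into place in one step."""
    tmp = target.with_suffix(target.suffix + ".tmp")
    tmp.write_text(text, encoding="utf-8")
    os.replace(tmp, target)  # readers see the old file or the new one, never a partial


out_dir = Path(tempfile.mkdtemp())  # stand-in for the Hydra output dir
target = out_dir / "profile_live.html"
atomic_write_text(target, "<html>tick 1</html>")
atomic_write_text(target, "<html>tick 2</html>")
```

Keeping the temporary file in the same directory as the target keeps both paths on one filesystem, which `os.replace` needs for the rename to be atomic.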
run.py picks up the new helper with a single line after setup_logger —
keeps the entry point minimal as requested.
Tests cover the render-once contract, daemon-thread bootstrap, lazy
log-creation wait, and atomic-write residue.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(profiler): inline Plotly so HTML renders in sandboxed previews
VS Code's HTML preview extension (and other sandboxed/offline viewers)
blocks external <script src="cdn.plot.ly/..."> loads, so the previous
include_plotlyjs="cdn" produced a blank page in those environments.
Switch to include_plotlyjs="inline" which embeds plotly.js directly into
the document. File grows from ~50KB to ~4.7MB, but it now renders
anywhere — VS Code preview, archived run artifacts, offline shares.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(specs): mutation-throughput two-semaphore redesign
Decouple "LLM/refresh in flight" from "produced-but-not-ingested"
so the DAG sees a freed slot back-to-back with the next ingest,
without waiting for a fresh refresh+LLM round-trip.
Single tunable (max_in_flight=N) sizes both semaphores. Steady-state
pipeline depth ~2N mutants: ~N producers (mix of LLM-running and
ready-result-held), ~N buffered (DAG queue + running). Ticket
ownership and orphan-window equivalence preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(plans): two-sema mutation-throughput implementation plan
11 TDD tasks: config docstring rewrite (1), engine init + log line +
sweep doc updates (2), dispatcher producer_sema (3), mutant_task
buffer-sema acquire-after-LLM with paired finally (4), ingestor
buffer-sema release (5), ghost-persist test migration (6), slot-leak
chaos invariants (7), JIT DAG-refill behavioral property (8),
resume-after-kill (9), real-Redis end-to-end smoke (10), full-sweep
+ push (11).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): rewrite max_in_flight docstring for two-sema semantics
Field name unchanged; semantics now apply symmetrically to producer
and buffer pools. Steady-state pipeline depth ~2N.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* refactor(engine): replace _in_flight_sema with _producer_sema + _buffer_sema
Two-semaphore backpressure for the steady-state engine. _producer_sema caps
concurrent LLM/refresh tasks; _buffer_sema caps produced-but-not-yet-ingested
mutants. Both sized to existing max_in_flight knob — no new config surface.
Touched: steady_state.py (init + log + sweep doc), dispatcher.py (acquire
_producer_sema), mutant_task.py (acquire _buffer_sema after LLM, paired
release in finally), ingestor.py (release _buffer_sema on DONE/DISCARDED).
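A toy asyncio model of the two-semaphore flow is sketched below, under the assumption that producers hold their producer slot while waiting for a buffer slot. The names `run_pipeline`, `produce`, `ingest`, and the `stats` dictionary are hypothetical; the sleeps stand in for the refresh + LLM round-trip and DAG evaluation.

```python
import asyncio

MAX_IN_FLIGHT = 2  # single knob sizing both pools, as in the commit
N_MUTANTS = 6      # illustrative workload size


async def run_pipeline():
    producer_sema = asyncio.Semaphore(MAX_IN_FLIGHT)
    buffer_sema = asyncio.Semaphore(MAX_IN_FLIGHT)
    ready = asyncio.Queue()
    stats = {"buffered": 0, "peak_buffered": 0, "ingested": []}

    async def produce(mutant_id):
        async with producer_sema:          # always released on exit
            await asyncio.sleep(0.01)      # stand-in for refresh + LLM call
            await buffer_sema.acquire()    # taken only after the LLM finished
            try:
                ready.put_nowait(mutant_id)
            except BaseException:
                buffer_sema.release()      # paired release on the failure path
                raise
            stats["buffered"] += 1
            stats["peak_buffered"] = max(stats["peak_buffered"], stats["buffered"])

    async def ingest():
        for _ in range(N_MUTANTS):
            mutant_id = await ready.get()
            await asyncio.sleep(0.005)     # stand-in for DAG evaluation
            stats["ingested"].append(mutant_id)
            stats["buffered"] -= 1
            buffer_sema.release()          # slot freed on DONE/DISCARDED

    await asyncio.gather(ingest(), *(produce(i) for i in range(N_MUTANTS)))
    return stats


stats = asyncio.run(run_pipeline())
```

The producer slot is handed back by the `async with` block regardless of outcome, while the buffer slot changes owner: the producer acquires it and the ingestor releases it.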
Ghost-persist test still pinned to old single-sema model — migration lives
in T6 to keep this commit reviewable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): migrate test suite from _in_flight_sema to two-sema pair
T2-T5 (95129056) replaced the single _in_flight_sema with _producer_sema
(dispatcher-side, always released in finally) and _buffer_sema (producer
acquires post-LLM, ingestor releases on DONE/DISCARDED). Migrate every
remaining test reference:
- caller-protocol acquire/release → _producer_sema (mirrors dispatcher)
- slot-accounting + len(_in_flight) conservation → _buffer_sema
(_in_flight membership is gated by _buffer_sema in the new model)
- 'all slots returned' assertions → both pools at full capacity
Test intent preserved; semantics translated 1:1 to the new model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(engine): T7 - Slot-leak chaos test for two-sema architecture
Add comprehensive chaos test suite validating slot conservation under
adversarial timings and concurrent access patterns. Tests verify that
the two-semaphore model (producer_sema + buffer_sema) maintains
invariants across:
- Race conditions: rapid acquire/release cycles, concurrent transfers
- Backpressure: ingestor slow-release blocking producer
- Cancellation: mid-acquire/mid-flight cancellation with proper cleanup
- Edge cases: minimal (max_in_flight=1), large (max_in_flight=100), full drain
Key invariant validated: semaphore values stay in [0, max_in_flight] range
and in-flight mutants do not exceed max_in_flight, proving no slot leak
across dispatcher, producer, and ingestor phases.
15 ne…
KhrulkovV added a commit that referenced this pull request on May 26, 2026
* prompts: kill few-shot fabrication leak in insights + lineage
The GOOD examples themselves contained invented magnitudes
("rejects 60% of viable candidates", "-2.3% runtime"), training the
LLM that fabricated effect estimates are valid output. Live judge
eval on 5 parent->child pairs across heilbron + hover, audited
against actual Redis program metrics + task_description, shows:
- ungrounded-number rate: 20.2% -> 6.9% (3x reduction)
- lineage rubric subscore: 17.35 -> 17.40
- 4-pair rubric avg (excl. known Gemini Pro structured-output flake):
16.97 -> 17.12
Edits:
- insights: remove fabricated "60%" from numeric GOOD example;
add "Quote, don't estimate" rule naming specific fabrication
patterns (% rejection rates, speedup factors, iteration budgets).
- lineage: remove "-2.3% runtime" from Quantification example;
spell out that cited numbers must come from diff, code, metrics,
or task description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(evolution-stats): iteration-window aggregation + snapshot bump
Evolutionary Statistics section in the mutation prompt was empty or
showed stale numbers under the steady-state engine. Two root causes:
1. Stale population snapshot — `bump()` was only called once at seed
drain. After that, every collector saw a frozen snapshot, so the
focal program's iteration was rarely in scope. Added
`bump(incremental=True)` in `poll_and_ingest` after every commit
pass so the snapshot tracks ingestion progress without flushing
cached program objects.
2. Per-generation aggregation is meaningless under JIT — generations
are an output of the schedule, not a fixed input. Replaced the
`generation_history` / per-gen fields with a symmetric iteration
window ([iter-R, iter+R], R=15) around the focal program:
window count/valid, best-in-window + iter, focal rank in window,
median-before / median-after horizons, trend via median-of-thirds
(5% multiplicative threshold, direction-aware via
`metrics_context.is_higher_better`), max invalid streak, and a
global running-best plateau marker (`iters_since_last_new_best`).
`EvolutionaryStatisticsMutationContext.format()` emits the locked
10-line "E_augmented" block; design doc lives at
`docs/superpowers/specs/2026-05-14-evolutionary-stats-redesign.md`.
Validated via 3-round LLM extraction eval: E_augmented scored 44/45
vs the old per-gen layout's 15/45.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(monitoring): file emit target writes frontier_<metric>.png each tick
start_live_frontier_compare gains an output_dir param and a new "file"
emit target that re-renders a frontier-trajectory PNG (best-so-far +
per-iter mean) in the Hydra run output directory on every tick, sibling
to live_profiler's profile_live.html. Default emit_targets now includes
"file". run.py threads the Hydra output_dir through to the daemon.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(memory): DAG-native intra+extra memory pipeline (per-parent lineage card + live global cards)
Adds the `intra_extra_memory` pipeline variant on top of the default builder:
* IntraMemoryStage (strong LLM, structured output) renders a per-parent
lineage card from DescendantProgramIds + MemoryContextStage as named inputs;
framework InputHashCache skips the LLM when neither changes. Output is
attached to the parent's metadata['intra_memory_card'] and concatenated
with the global memory cards block via ConcatMemoryStage.
* LiveMemoryRefreshHook wraps IdeaTracker.run_increment as a post_step_hook,
surfacing freshly evolved ideas to MemoryContextStage's reload-on-read
selector during the same run (no need to wait for end-of-run flush).
* New ExtraMemoryStage class (currently dormant in the wired pipeline) kept
as opt-in infra with its own caching test, pinning the structured-output
contract for future re-wiring.
* Bug fix bundled: invalid-child fitness sentinel (e.g. -1000 in heilbron)
no longer pollutes delta_distribution.min/median/max or per-cluster
mean_delta. Invalid children route to dedicated n_failed counters; the
rendered card shows "n_failed=N (excluded from stats above)" and
"mean delta n/a" for all-failed clusters. System prompt rule 3 now
instructs the LLM to exclude is_valid=false from delta math.
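The sentinel-handling fix can be illustrated with a small sketch. The `delta_stats` function and the dictionary shape of each child are assumptions for illustration; the idea from the commit is only that invalid children go to an `n_failed` counter and are left out of min/median/max.

```python
from statistics import median


def delta_stats(parent_fitness, children):
    """Summarise child-minus-parent deltas over valid children only."""
    deltas = [c["fitness"] - parent_fitness for c in children if c["is_valid"]]
    n_failed = sum(1 for c in children if not c["is_valid"])
    if not deltas:
        # All-failed cluster: nothing to summarise, rendered as "n/a".
        return {"n_failed": n_failed, "min": None, "median": None, "max": None}
    return {
        "n_failed": n_failed,
        "min": min(deltas),
        "median": median(deltas),
        "max": max(deltas),
    }


children = [
    {"fitness": 0.012, "is_valid": True},
    {"fitness": 0.009, "is_valid": True},
    {"fitness": -1000.0, "is_valid": False},  # invalid-child sentinel
]
stats_mixed = delta_stats(0.010, children)
stats_all_failed = delta_stats(0.010, [children[2]])
```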
Legacy lineage stages stripped from the builder (AncestorProgramIds,
LineageStage, LineagesToDescendants, LineagesFromAncestors, InsightsStage)
— DescendantProgramIds is kept and rewidened (max_selected=24) to feed
IntraMemoryStage instead of LineagesToDescendants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: intra/extra memory mode guide + USAGE / MEMORY_ARCHITECTURE cross-links
Adds docs/INTRA_EXTRA_MEMORY.md covering the pipeline introduced in 89f01be5:
architecture diagram, intra-card schema (with the n_failed sentinel-handling
contract), live external-memory refresh hook, caching invalidation triggers,
required co-overrides (ideas_tracker=default, memory=local), smoke / full /
nohup launch commands, tuning knobs, verification checklist, and a
troubleshooting matrix.
USAGE.md: adds `intra_extra_memory` to the `pipeline` config-group table and
a launch example under "Examples".
MEMORY_ARCHITECTURE.md: top-of-file pointer to the new mode guide so the
in-run / live-memory entry point is discoverable from the store-side docs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(intra-memory): ship unified diff (not full code) per child + soften mutator "untried directions" preference
IntraMemoryStage payload now carries either a unified diff (change_form="diff")
or full child source (change_form="full_code") per child. Diff is the default;
full code is the fallback when (a) is_valid=False so error_summary line refs
stay readable against the same buffer the analyst sees, (b) the diff is
empty (identical sources), or (c) the diff is no smaller than the file
(structural rewrites where every line differs). Expected 50-80% prompt-size
reduction on the typical small-mutation regime, where the parent's boilerplate
was previously repeated N times across children.
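The diff-versus-full-code selection could be sketched with the standard-library `difflib` as below. The function name `child_payload` is hypothetical, and comparing the diff length against the child source length is one reading of "no smaller than the file".

```python
import difflib


def child_payload(parent_src, child_src, is_valid):
    """Pick the change form for one child: a unified diff or the full source."""
    diff = "".join(
        difflib.unified_diff(
            parent_src.splitlines(keepends=True),
            child_src.splitlines(keepends=True),
            fromfile="parent",
            tofile="child",
        )
    )
    # Fallbacks: (a) invalid child, (b) empty diff, (c) diff not smaller than file.
    if not is_valid or not diff or len(diff) >= len(child_src):
        return {"change_form": "full_code", "code": child_src}
    return {"change_form": "diff", "diff": diff}


parent = "".join(f"value_{i} = {i}\n" for i in range(50))
child = parent.replace("value_10 = 10\n", "value_10 = 99\n")

small_mutation = child_payload(parent, child, is_valid=True)
invalid_child = child_payload(parent, child, is_valid=False)
identical = child_payload(parent, parent, is_valid=True)
```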
The intra system prompt's user-message-structure table is updated to document
both children[i].diff and children[i].code, plus the change_form discriminator,
so the analyst knows how to read either form.
Mutator system prompt: softened the "Untried directions" rule. Previously
"prefer it over inventing a new direction from scratch" — a hard preference
that let speculative hints dominate archetype selection. Now framed as
candidates to weigh alongside the model's own ideas, with explicit licence
to skip any whose mechanism does not actually fit the parent's code.
Tests: 4 new payload-shape tests on IntraMemoryStage (diff for small mutation,
full-code for structural rewrite, full-code for invalid child, system prompt
documents change_form/diff/full_code), plus a new pin on the mutator prompt
wording.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(prescriptive): MutationSuggestionStage + EvolutionaryStatistics wiring
Architectural split between descriptive (IntraMemoryStage) and prescriptive
(MutationSuggestionStage) memory: the intra stage now ONLY summarises lineage
history into a per-parent card; the new MutationSuggestionStage consumes
intra card + cross-population memory cards + ancestral momentum trail +
EvolutionaryStatistics population snapshot and emits structured
ProgramInsights into MutationContextStage's insights slot (same shape as the
legacy InsightsStage, so the mutator's PROGRAM INSIGHTS section renders
unchanged).
Key wiring (lineage_memory_pipeline.py):
* DescendantProgramIds → IntraMemoryStage.children_ids
* IntraMemoryStage → MutationSuggestionStage.intra_card
* MemoryContextStage → MutationSuggestionStage.memory_cards
* EvolutionaryStatisticsCollector → MutationSuggestionStage.evolutionary_statistics
* MutationSuggestionStage → MutationContextStage.insights
* IntraMemoryStage + MemoryContextStage → ConcatMemoryStage → MutationContextStage.memory
Both strong-LLM stages (Intra + Suggestion) gate on validator success and
(when enabled) archive acceptance, mirroring the legacy InsightsStage
skip-cascade so paid LLM tokens are never spent on a program that won't
enter the archive.
Other:
* Intra card delta-distribution + mean_delta now formatted using primary
metric's decimals from metrics.yaml (was rendering raw 16-sig-fig floats).
* PopulationSnapshot.refresh: refetch programs in INCOMPLETE_STATES so
QUEUED/RUNNING entries get up-to-date metrics on each snapshot.
* fix(memory): pick OPENROUTER_API_KEY when LLM_BASE_URL targets OpenRouter
Previously gigaevo.memory.config.OPENAI_API_KEY preferred $OPENAI_API_KEY
over $OPENROUTER_API_KEY unconditionally. In intra_extra_memory smokes we
export both — $OPENAI_API_KEY=sk-gigaevo (LiteLLM proxy) for the main Qwen
pipeline and $OPENROUTER_API_KEY=sk-or-v1-... for the GAM/A-Mem cheap path
(Gemini Flash via OpenRouter). The wrong-key-for-endpoint combination made
every GAM research_agent and IdeaTracker LLM call fail silently with a 401,
killing the extra-memory channel without any pipeline error.
Two-line fix:
- config.OPENAI_API_KEY now resolves OpenRouter key first when LLM_BASE_URL
contains "openrouter.ai" (e.g. settings.yaml default).
- ideas_tracker.llm._init_clients picks the right key for the effective
base_url (OPENROUTER_API_KEY for openrouter, OPENAI_API_KEY otherwise).
Verified: with both keys exported and settings.yaml's OpenRouter base_url,
client.api_key now starts with "sk-or-". With base_url set to the LiteLLM
proxy, client.api_key falls back to "sk-gigaevo".
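The key-selection rule can be sketched as a pure function over the base URL and an environment mapping. The function name `resolve_api_key`, the proxy URL, and the fallback to `OPENAI_API_KEY` when no OpenRouter key is set are assumptions; the demo key values are placeholders.

```python
def resolve_api_key(base_url, env):
    """Choose the API key that matches the effective endpoint."""
    if "openrouter.ai" in (base_url or ""):
        # Assumed fallback: use the OpenAI-style key if no OpenRouter key is set.
        return env.get("OPENROUTER_API_KEY") or env.get("OPENAI_API_KEY")
    return env.get("OPENAI_API_KEY")


env = {"OPENAI_API_KEY": "sk-gigaevo", "OPENROUTER_API_KEY": "sk-or-v1-demo"}
openrouter_key = resolve_api_key("https://openrouter.ai/api/v1", env)
proxy_key = resolve_api_key("http://localhost:4000", env)  # hypothetical LiteLLM proxy URL
```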
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(suggester): rank-aware ambition rule in mutation_suggestions/system.txt
Adds a sub-bullet under the Evolutionary Statistics input description that
calibrates the suggester's parametric-vs-structural mix to the rank of the
parent in the window (already reported as `rank X/Y in window`):
* Top quartile -> at least one suggestion must be structurally orthogonal
(different algorithm family, init scheme, or objective), not a parameter
tweak. Parametric refinements alone are insufficient when the parent
already tops its window.
* Bottom half -> at least one structural change required; fragile/harmful
tags take precedence over rigid-parameter tweaks.
* Middle band -> mix exploitation with at least one orthogonal axis.
Rationale: smoke #3 (cycle 1 at max_mutants=20) showed gen-3 105901c4
(rank=1/Y) receiving 5 of 6 suggestions tagged `rigid` (pure parameter
tweaks), producing a plateau at 0.01142 (32.6% of 0.035). The breakthrough
to 0.01885 (53.9%) came from a SIBLING program a35a0f72 whose suggester
happened to find a structural harm (asymmetric_extra_points / symmetry
restoration). The new rule makes that structural pivot a stable
expectation at top-of-window, not an accident.
Generic — uses only the existing rank-in-window signal already in
EvolutionaryStatistics. No new fields, no new code, no heilbron-specific
tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(stats): rank line dropout when focal missing from snapshot
The iter-window rank computation in `_compute_iter_window_fields`
called `sorted(fits).index(focal_fit)` to find the focal's rank.
When the population snapshot lagged behind the pipeline view (or the
snapshot contained a stale `is_valid=0` view of the focal), `focal_fit`
was not in `fits`, the ValueError was swallowed silently, and
`iter_window_rank` became None.
Downstream renderer in `evolution/mutation/context.py:168` gates the
rank line on `iter_window_rank is not None`, so the entire
"rank X/Y in window" segment disappeared for top-of-window programs.
The mutation_suggestions/system.txt rank-aware ambition rule relies on
that text; with the rank line missing the rule was DORMANT throughout
the cycle-2 run (struct counter stayed at 0 for 100 mutations).
Verified on production program 4578cea1 from
output/cycle2_rankambition_20260518_022450 (fit=0.01509 at iter=49):
window valid=10, best in window=0.01455 — focal excluded, rank=None.
Fix:
- When focal is valid and not already in `valid_with_fit` (snapshot
lag), include it explicitly using the up-to-date metrics passed by
the pipeline. Downstream best/median/trend/valid_count then reflect
reality.
- Replace `sorted.index` with a count-based rank (better+1). Robust to
tied fitness values, which previously got under-counted by `index`'s
first-match semantics.
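A minimal sketch of the count-based rank, assuming the window is a mapping from program id to fitness (the function name and argument shapes are hypothetical). The focal program is inserted explicitly before counting, mirroring the snapshot-lag fix.

```python
def iter_window_rank(focal_id, focal_fit, window_fits, higher_is_better=True):
    """Return (rank, window_size), where rank = strictly better programs + 1."""
    fits = dict(window_fits)
    fits[focal_id] = focal_fit  # snapshot lag: include or refresh the focal entry
    if higher_is_better:
        better = sum(1 for f in fits.values() if f > focal_fit)
    else:
        better = sum(1 for f in fits.values() if f < focal_fit)
    return better + 1, len(fits)


# Focal program is missing from the lagging snapshot but beats everything in it.
snapshot = {"a": 0.01455, "b": 0.01200}
rank_missing = iter_window_rank("focal", 0.01509, snapshot)

# Tied fitness values share the same rank.
rank_tied = iter_window_rank("focal", 0.5, {"x": 0.5, "y": 0.3})

# The old lookup raises when the focal fitness is absent from the list.
try:
    sorted(snapshot.values()).index(0.01509)
    old_lookup_failed = False
except ValueError:
    old_lookup_failed = True
```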
Tests:
- test_iter_window_rank_when_focal_missing_from_snapshot (RED→GREEN)
- test_iter_window_rank_when_focal_in_snapshot_but_stale_metrics (RED→GREEN)
- existing test_iter_window_rank_none_when_focal_invalid still passes
(invalid focal correctly yields rank=None)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(suggester): lineage-exhaustion override in rank-aware ambition
Cycle-3 (rank rule LIVE) plateaued at 0.02614 (74.7%). Forensics:
- Top-5 fitness: 4/5 are structural-pivot archetypes (Guided Innovation
/ Approach Synthesis). Mean fit by archetype: Guided Innovation 0.01735
vs Exploitation 0.01016. Structural pivots win.
- But ~50% of all mutations chose Exploitation. Among the 6 programs
whose parent's intra card flagged "all valid children regressed or
failed" (lineage exhaustion), the mutator still chose
Exploitation/Proven Pattern Extension in 3/6 cases — wasting budget
re-tweaking failed clusters.
The existing rank-aware rule says "at least one orthogonal-axis
suggestion" — too soft when local gradient is empirically dead.
New sub-bullet: when intra card shows ≥2 failed/regressed tried_strategy
clusters (or delta distribution catastrophic+failed ≥ 2 with improving=0),
EVERY suggestion must propose a structural axis NOT in tried_strategies.
Parametric tweaks of failed clusters are explicitly rejected in this
regime. Forces the suggester to leave exhausted local basins.
Generic, no task-specific tokens. +11 lines in
mutation_suggestions/system.txt under the rank-aware ambition block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(suggester): escape literal {} braces in lineage-exhaustion sub-bullet
673b9fb6 introduced `{regressed, failed}` as a literal phrase in the
mutation_suggestions/system.txt template. The prompt loader passes this
template through `str.format()` (factories.py:200), so `{regressed, failed}`
was parsed as a placeholder named 'regressed, failed' — raising KeyError
at every DAG build during cycle-4 startup. All 5 seed-eval DAGs failed,
the engine spun in an idle "no parents" loop, and the process exited at
t=07:53:04 with progs=5/scored=0/mut_done=0 — zero useful work done.
Fix: escape the literal braces as `{{regressed, failed}}`. Verified via
str.format() round-trip — only the three intended placeholders
({task_description}, {metrics_description}, {max_insights}) remain.
Lesson: any literal `{` or `}` in `.txt` prompts that flow through
.format() must be doubled. See feedback memory for hardening guide.
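The failure and the fix can be reproduced with plain `str.format()`. The template strings below are shortened illustrations, not the actual prompt text; `string.Formatter().parse` is used to list the placeholders that survive escaping.

```python
import string

template_bad = "clusters in {regressed, failed}; task: {task_description}"
template_ok = "clusters in {{regressed, failed}}; task: {task_description}"

# Unescaped braces are parsed as a placeholder and raise KeyError.
try:
    template_bad.format(task_description="demo")
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

# Doubled braces render as literal braces.
rendered = template_ok.format(task_description="demo")

# Round-trip check: only intended placeholders remain.
placeholders = [
    name for _, name, _, _ in string.Formatter().parse(template_ok) if name
]
```

A placeholder audit like the last statement could run as a unit test over every `.txt` prompt that flows through `.format()`.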
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(suggester): server-computed EXHAUSTION ALERT banner overrides soft LEX
Cycle-4b shipped a soft "Lineage exhaustion override" sub-bullet in the
mutation-suggester system prompt. Qwen-3-235B-A22B-Thinking-2507 ignored
it: 3 parents in cycle-4b had ≥2 regressed/failed intra clusters yet
their children received parametric refinements of already-tried clusters
(plateau at 0.02630, +0.6% vs cycle-3 baseline 0.02614).
Replace the soft text with a deterministic server-side banner prepended
to the user message — most salient location, no LLM judgement on the
trigger condition.
Trigger (computed in MutationSuggestionAgent._format_exhaustion_block):
- cond_a: ≥2 distinct clusters in {regressed, failed}, OR
- cond_b: catastrophic + n_failed ≥ 2 AND improving = 0
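The trigger can be sketched as a small predicate. The function name and the input shapes (a cluster-to-verdict mapping plus three counters) are assumptions, and `cond_b` is read as "catastrophic plus n_failed is at least 2, with no improving children".

```python
NEGATIVE_VERDICTS = {"regressed", "failed"}


def exhaustion_triggered(cluster_verdicts, catastrophic, n_failed, improving):
    """Deterministic trigger for the exhaustion banner (sketch)."""
    negative = [c for c, v in cluster_verdicts.items() if v in NEGATIVE_VERDICTS]
    cond_a = len(negative) >= 2
    cond_b = (catastrophic + n_failed) >= 2 and improving == 0
    return cond_a or cond_b


two_bad_clusters = exhaustion_triggered(
    {"step-size tweak": "regressed", "restart count": "failed", "symmetry": "improved"},
    catastrophic=0, n_failed=0, improving=1,
)
single_bad_cluster = exhaustion_triggered(
    {"step-size tweak": "regressed"}, catastrophic=0, n_failed=1, improving=1,
)
dead_gradient = exhaustion_triggered(
    {}, catastrophic=1, n_failed=1, improving=0,
)
```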
When triggered, emit `## EXHAUSTION ALERT — strict structural-pivot mode`
header + OVERRIDES sentence + explicit AVOID-LIST of negative-verdict
clusters + full tried-strategies context + `---` separator. The system
prompt now references the banner as a HARD CONSTRAINT that overrides the
rank-aware ambition mix.
Banner is task-agnostic (no heilbron/triangle leak — covered by test).
Tests: 13 new in tests/llm/test_mutation_suggestion_exhaustion.py cover
empty intra, single-cluster non-triggers, cond_a/cond_b paths, mixed
verdicts, override/AVOID-LIST language, task-agnosticism, and the
trailing separator. All pass. Lint clean. No new regressions in
tests/llm/ (371 pass) or tests/stages/ (938 pass; pre-existing 3
failures unrelated).
Pure context-building change — schema unchanged, pipeline unchanged,
launch command unchanged. Stays within the 0.035-sprint allowed-knobs
envelope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(suggester): revert rank+LEX to 9cca4344 baseline for cycle-6 A/B
Drops 24 lines from mutation_suggestions/system.txt — the rank-aware ambition
sub-bullet (commit 4caeb1b9) and the lineage-exhaustion banner clause
(commit 0ebd405c built on 73ed1207's brace-fix of 673b9fb6).
Net effect: the suggester prompt is now identical to its 9cca4344 state.
Empirical motivation:
| Run | Best fitness | system.txt |
|------------------------------|--------------|-------------------|
| sprint cycle-2 (2026-05-17) | 0.0315 | pre-prescriptive |
| cycle3-from-scratch (uncomm) | ~0.030 | 9cca4344 baseline |
| cycle-3 today (rank rule) | 0.02614 | +13 rank |
| cycle-4b today (LEX soft) | 0.02630 | +24 rank + LEX |
| cycle-5 today (LEX hard) | 0.02536 | +24 rank + banner |
Today's three runs cluster at 0.025-0.026 (~17% below the 9cca4344
baseline). The collector.py rank-line bugfix + EXHAUSTION ALERT formatter
remain in place — only the LLM-facing prompt content is reverted. The
formatter just becomes dormant since the prompt no longer references its
output.
Cycle-6 will A/B this against the cycle-5 state to confirm whether the
+24 lines of guidance were net-destructive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(stats): R2 — MAD-based trend noise floor + archive_valid_fitnesses field
Replace legacy 5%·|t1| trend threshold in `_trend_from_thirds` with a
nonparametric MAD (median absolute deviation) over the recent valid-fitness
window. The fixed 5% ratio reads as "flat" on low-fitness regimes where real
regressions are present at sub-5% absolute magnitude — the cycle-6 audit
showed parent contexts with medians falling 0.00165 → 0.00082 (clear
regression) still labelled `flat` and feeding the consumer's flat-trend
condition into Exploitation.
MAD adapts to the run's empirical noise scale: no chosen constant. Bootstrap
fallback to legacy 5%·|t1| ratio when fewer than `N_MIN_FOR_MAD=4` valid
samples in the window — pre-existing framework behaviour preserved during
the initial iterations.
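One way to read the MAD noise floor is sketched below. The two constants come from the commit message; how the thirds are sliced, comparing the first-to-last-third median delta against the MAD, and restricting to higher-is-better metrics are assumptions of this sketch rather than the shipped `_trend_from_thirds`.

```python
from statistics import median

N_MIN_FOR_MAD = 4        # minimum sample size for MAD to be meaningful
_TREND_EPSILON = 1e-12   # numerical safety against degenerate MAD=0


def mad(values):
    """Median absolute deviation around the median."""
    center = median(values)
    return median(abs(v - center) for v in values)


def trend_from_thirds(fits):
    """Label a non-empty fitness window as rising / flat / falling."""
    third = max(1, len(fits) // 3)
    t1 = median(fits[:third])    # median of the first third
    t3 = median(fits[-third:])   # median of the last third
    if len(fits) >= N_MIN_FOR_MAD:
        floor = max(mad(fits), _TREND_EPSILON)
    else:
        floor = 0.05 * abs(t1)   # bootstrap fallback to the legacy ratio
    delta = t3 - t1
    if delta > floor:
        return "rising"
    if delta < -floor:
        return "falling"
    return "flat"


falling = trend_from_thirds([0.00165, 0.0016, 0.0017, 0.0009, 0.00082, 0.0008])
flat = trend_from_thirds([1.0, 1.2, 0.8, 1.2, 1.05, 1.0])
bootstrap = trend_from_thirds([1.0, 1.0, 1.2])
```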
Also exposes `archive_valid_fitnesses: tuple[float, ...]` as a transient
field on `EvolutionaryStatistics` (not persisted; rebuilt per emission).
This is the source-of-truth distribution that R1 (archive-quartile regime)
will read in a follow-up commit.
Constants introduced are all data-availability gates, not regime thresholds:
- `N_MIN_FOR_MAD = 4` — minimum sample size for MAD to be meaningful
- `_TREND_EPSILON = 1e-12` — numerical safety against degenerate MAD=0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(context): R1+R3 — archive-quartile regime in mutation_context render
Adds two new tokens to every rendered parent context once the archive holds
≥ `N_MIN_ARCHIVE=4` valid programs:
Archive: N=N median=… q75=… best=…
Regime: BAD/MIDDLE/GOOD (Q? of archive)
And appends `archive-quartile Q?` inline to the existing rank line:
rank 2/8 in window, archive-quartile Q1)
Both signals are derived from the same archive distribution emitted by R2's
`archive_valid_fitnesses` field on `EvolutionaryStatistics`. Quartile
boundaries are universal statistical convention (Q1=25%, Q2=50%, Q3=75%) —
not chosen thresholds. The mapping Q1→BAD, Q2/Q3→MIDDLE, Q4→GOOD is the
framework's editorial choice with no numeric constants.
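A possible shape for the quartile-to-regime mapping is sketched here. `N_MIN_ARCHIVE` and the Q1/Q2/Q3/Q4 to BAD/MIDDLE/GOOD mapping come from the commit message; the function name, the percentile-rank formula, and the tie handling are assumptions.

```python
N_MIN_ARCHIVE = 4
REGIME_BY_QUARTILE = {1: "BAD", 2: "MIDDLE", 3: "MIDDLE", 4: "GOOD"}


def archive_regime(focal_fit, archive_fits, higher_is_better=True):
    """Return (quartile_str, regime), or None while the archive is too small."""
    n = len(archive_fits)
    if n < N_MIN_ARCHIVE:
        return None  # bootstrap: no Regime token is emitted
    if higher_is_better:
        worse = sum(1 for f in archive_fits if f < focal_fit)
    else:
        worse = sum(1 for f in archive_fits if f > focal_fit)
    # Share of the archive that the focal beats, bucketed into four bands.
    quartile = min(4, (worse * 4) // n + 1)
    return f"Q{quartile}", REGIME_BY_QUARTILE[quartile]


archive = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08]
top = archive_regime(0.08, archive)
bottom = archive_regime(0.01, archive)
middle = archive_regime(0.04, archive)
loss_style = archive_regime(0.01, archive, higher_is_better=False)
too_small = archive_regime(0.5, [0.1, 0.2, 0.3])
```

Returning the quartile string and the regime from one function reflects the single-source-of-truth property the commit describes for the rank line and the Regime line.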
R1 v2 design properties:
- No dependency on `MetricSpec.upper_bound` — regime is derived from the
run's empirical archive distribution, so the bundle is task-agnostic.
Tasks declaring `upper_bound` additionally get an informational
`Target: … focal_gap=…` line; R6's archetype gate does NOT read it.
- Direction-aware via `MetricSpec.higher_is_better` — works identically for
loss-style metrics where small = good.
- Bootstrap-safe: no token emitted when archive < 4 valid; rule falls back
to original Step-6 logic.
- O(N log N) per render on archive size bounded by `max_mutants=100`.
R3 reuses R1's `quartile_str` so there is a single source of truth and the
rank line cannot drift from the Regime line.
Tests: 10 new `TestArchiveQuartileRegime` cases cover Q1/Q2/Q3/Q4 placement,
archive < 4 (no regime emission), `higher_is_better=False` direction,
archive-quartile inclusion in rank line, ties at quartile boundaries, target
decoration on/off.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(prompts): R6 — archive-quartile archetype gate + suggester tag-bias
Two-layer defense against wasted-budget Exploitation on weak parents.
mutation/system.txt (consumer):
- Adds a Selection Rule that makes `Regime: BAD` (focal in Q1 of archive)
a HARD GATE: Exploitation archetypes (1-3) FORBIDDEN. Choose Exploration
(4-6) or, if intra has an "improved" verdict with an untried extension,
Hybrid (7-8). MIDDLE (Q2/Q3) gates Exploitation on intra-improved +
untried-extension; otherwise prefer Hybrid. GOOD (Q4) opens all
archetypes per other rules. Bootstrap (Regime line absent) falls back
to original logic.
- Adds an Evolutionary Statistics descriptive paragraph explaining the
new `Archive:` and `Regime:` lines so the LLM knows the rendered
tokens.
- Trend label vocab synced to code emit: `rising / flat / falling` (the
legacy `improving / regressing` words drifted from collector.py:126
and broke the consumer's match logic).
mutation_suggestions/system.txt (producer):
- Adds an Archive-quartile awareness rule: in BAD regime (Q1) do NOT tag
patterns as `beneficial` based only on local intra-card "improved"
verdicts. Prefer `fragile` or `rigid`. Reserve `beneficial` for MIDDLE
(Q2/Q3) or GOOD (Q4) regimes.
- Disambiguates earlier informal "low-fitness regime" wording (which
collided with the formal `Regime:` tag) — the metric-scale heuristic
is now explicitly called out as SEPARATE from the formal Regime tag.
- Trend vocab synced to `rising / flat / falling`.
Defense-in-depth: the producer suppresses `beneficial` tag at source for
Q1 parents; the consumer additionally forbids the Exploitation archetype
the tag would have biased toward. The two rules layer — they don't
duplicate. If the suggester slips and emits `beneficial`, the mutator's
hard gate still routes the mutation to Exploration.
Tests: TestR6ArchiveQuartileGate (consumer) + TestR6SuggesterTagBias
(producer) + TestRegimeAndQuartileVocabularySynergy (cross-prompt vocab
consistency for tag, verdict, quartile, regime scales).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(tools): trajectory_shape.py — log-based closeout analyzer for cycle comparisons
Computes the 6 trajectory metrics from plan §Verification on any
output/cycle*_*/evolution_*.log:
- best_at_end (frontier final)
- monotonicity_pct (cohort-mean signal, NOT running-max which would always be 100%)
- per_stage_best (early/mid/late thirds)
- longest_stagnation_min (gap between consecutive frontier-bumps)
- rtail_ge_020 / rtail_ge_030 (right-tail mass — # cells reached past 0.020 / 0.030)
- cells_filled
Two modes:
  python tools/trajectory_shape.py <log_file>                   # single report
  python tools/trajectory_shape.py --compare a.log b.log c.log  # variance-floor verdict
Variance-floor rule (1.5×spread): N≥3 → mean(baselines)+1.5×spread is the bar;
treatment > bar → CONFIRMED, else NOISE.
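The variance-floor rule can be sketched as follows. This is an illustration under stated assumptions, not the code in `tools/trajectory_shape.py`: "spread" is taken to be max minus min of the baselines (the commit does not define it), and fewer than three baselines raises an error. The function name is hypothetical.

```python
from statistics import fmean


def variance_floor_verdict(baselines, treatment, k=1.5):
    """Return (bar, verdict) where verdict is 'CONFIRMED' or 'NOISE'."""
    if len(baselines) < 3:
        # The rule is only stated for N >= 3; refusing smaller N is an assumption.
        raise ValueError("variance-floor rule needs at least 3 baselines")
    spread = max(baselines) - min(baselines)  # assumed definition of "spread"
    bar = fmean(baselines) + k * spread
    verdict = "CONFIRMED" if treatment > bar else "NOISE"
    return bar, verdict
```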
Works on any log file regardless of Redis state — logs are permanent, Redis dbs
get flushed. Built during cycle-7's runtime, smoke-tested retroactively on
cycles 3/4b/5/6 to establish n=4 baseline (mean=0.02596, spread=0.00094,
zero breakouts past 0.030).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(context): R7+R8 v3.1 — archive distribution with worst/median/best + archive-percentile token
Renders the archive's distribution as `Archive: N=… worst=… median=… best=…` plus an
`archive-percentile pXX of N=Y` annotation on the existing rank line. Both lines are direction-
aware via `MetricSpec.higher_is_better`. No `Regime:` or `Target:` token is rendered — the LLM
reads the task's target from the task description and judges the focal against the rendered
distribution itself.
Why v3.1:
- v3's `Regime: BAD/MIDDLE/GOOD` was a pre-baked classifier. Trust-the-model-synthesis
principle: render data, let the LLM judge.
- v3's `Target:`/`focal_gap` was likewise pre-baked. The task description already states the
target; rendering it twice (and adding a derived `focal_gap`) introduces a hardcoded
interpretation channel the LLM doesn't need.
- Only deterministic gate kept: archive-percentile (a single direction-aware quality
percentile, 100=best). Quartile boundaries 25/75 are statistical convention, not magic.
Bootstrap-mislead defense: the rendered `worst=… median=… best=…` triplet makes archive
compression visible. A compressed bootstrap (N=7, all <0.002) lands the focal at p100 but the
distribution itself shows the LLM the archive is far from the task's stated target. The
qualitative target-awareness clause in the prompt instructs the model to apply that judgment.
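A rough sketch of the two rendered tokens, using the compressed-bootstrap case as the example. The percentile definition (share of archive entries the focal is at least as good as), the five-decimal number format, and both function names are assumptions made for illustration; the real logic lives in `gigaevo.evolution.mutation.context` and may differ.

```python
from statistics import median


def render_archive_line(values, higher_is_better=True):
    """Render the `Archive:` line with direction-aware worst/best."""
    best = max(values) if higher_is_better else min(values)
    worst = min(values) if higher_is_better else max(values)
    return (
        f"Archive: N={len(values)} worst={worst:.5f} "
        f"median={median(values):.5f} best={best:.5f}"
    )


def archive_percentile(focal, values, higher_is_better=True):
    """Quality percentile of the focal, 100 = best (assumed definition)."""
    if not values:
        return None
    if higher_is_better:
        at_least_as_good = sum(1 for v in values if focal >= v)
    else:
        at_least_as_good = sum(1 for v in values if focal <= v)
    return round(100 * at_least_as_good / len(values))
```

With N=7 values all below 0.002, the best entry lands at p100 while the rendered `best=` value still shows how far the archive is from any larger task target.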
Tests:
- test_archive_line_includes_worst_higher_is_better_true
- test_archive_line_worst_inverts_for_higher_is_better_false
- test_compressed_bootstrap_renders_rich_archive_no_target_line
- test_target_line_never_rendered_when_upper_bound_declared
- test_target_line_never_rendered_when_upper_bound_none
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(prompts): R9 v3.1 — archive-percentile gate + qualitative target awareness
Mutator and suggester prompts now reference the v3.1 context surface: `Archive: …` and
`archive-percentile pXX of N=Y`. The only deterministic gate is the archive-percentile gate
(focal in bottom quartile → Exploitation FORBIDDEN; focal in top quartile → all archetypes
eligible). Quartile boundaries 25/75 are statistical convention, not magic numbers.
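The deterministic part of the gate can be pictured as below. The archetype numbering (Exploitation 1-3, Exploration 4-6, Hybrid 7-8) and the 25 cut-off come from this PR; returning the full set for the middle band and for a missing percentile is an assumption of this sketch, since those cases are left to the prompt's other rules. `gate_permitted_archetypes` is a hypothetical name.

```python
EXPLOITATION = frozenset({1, 2, 3})
EXPLORATION = frozenset({4, 5, 6})
HYBRID = frozenset({7, 8})
ALL_ARCHETYPES = EXPLOITATION | EXPLORATION | HYBRID


def gate_permitted_archetypes(percentile):
    """Return the archetype numbers the archive-percentile gate does not forbid."""
    if percentile is None:
        # No percentile rendered (bootstrap): the gate stays silent (assumption).
        return ALL_ARCHETYPES
    if percentile < 25:
        # Bottom quartile: Exploitation is FORBIDDEN.
        return ALL_ARCHETYPES - EXPLOITATION
    # Middle band and top quartile: the gate itself forbids nothing.
    return ALL_ARCHETYPES
```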
Target awareness is qualitative: the task description states the problem's target/bound; the
prompt instructs the LLM to compare the rendered `worst=… median=… best=…` distribution
against that target and apply judgment. No numeric threshold is imposed because fitness scale
is typically non-linear — small absolute distances at low fitness are structurally harder than
the same absolute distance at high fitness.
Removed from previous v3:
- `Regime: BAD/MIDDLE/GOOD` pre-baked classifier (replaced by archive-percentile + prose)
- `Target:`/`focal_gap` rendered tokens (LLM reads target from task description)
- `half the distance` magic-constant compound rule
Removed in this v3.1 cleanup pass:
- Legacy "framework does NOT render a separate `Target:` line" mentions — negating-by-mention
is noise; prompts now positively instruct reading the target from the task description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(v3.1): archive-percentile gate, archive distribution, no Target/Regime tokens
test_mutation_context.py:
- test_target_line_never_rendered_when_upper_bound_declared: asserts `Target:` and `focal_gap`
are absent even when MetricSpec.upper_bound is set.
- test_target_line_never_rendered_when_upper_bound_none: parallel assertion for tasks without
declared upper bound.
- test_compressed_bootstrap_renders_rich_archive_no_target_line: documents the bootstrap-
mislead defense — N=7 compressed archive with focal at p100 still renders worst/median/best
so the LLM can judge the gap against the task target itself.
- test_archive_line_includes_worst_higher_is_better_true: asserts `worst=…` and `best=…` are
rendered with the strongest at `best` for fitness-style metrics.
- test_archive_line_worst_inverts_for_higher_is_better_false: parallel for loss-style metrics
(worst = highest value, best = lowest).
test_prompts.py (TestV31* replacing TestR6*):
- TestV31ArchivePercentileGate: archive-percentile referenced, no Regime/archive-quartile/
Target/focal_gap/half-distance vocab, FORBIDDEN keyword on Q1 Exploitation, 25/75 cited,
qualitative target awareness via task description, non-linear scale acknowledged, trend
vocab matches collector (rising/flat/falling).
- TestV31SuggesterTagBias: parallel for mutation_suggestions prompt.
- TestV31VocabularySynergy: cross-prompt consistency for tag scale, verdict scale, quartile
boundaries (25, 75), archive distribution vocab (worst/median/best).
364 tests pass. Lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(audit): v3.1 mutation decision tree — channels, gate, decision table, worked examples
Spells out exactly how the LLM mutator selects an archetype (Exploitation 1–3 / Exploration
4–6 / Hybrid 7–8) under the v3.1 surface:
1. Six context channels (C1 Metrics, C2 Insights, C3 Intra Memory, C4 Memory Cards, C5
Evolutionary Statistics, C6 Ancestral Momentum) — producer / carrier / consumer.
2. Two decision components: one deterministic gate (archive-percentile, only constants
are quartile boundaries 25/75) and one qualitative target-awareness clause (LLM reads
task description, judges against rendered Archive distribution; no numeric threshold
because fitness scale is non-linear).
3. Exhaustive 18-row decision table covering (archive-percentile bucket × intra verdict
× trend × invalid streak) → archetype.
4. Four worked heilbron examples: bootstrap-mislead p100 case (Hybrid 7 override),
normal mid-run (Hybrid), late-run refinement (Exploitation 1), plateau exit
(Exploration).
5. Cycle-9 mid-run invariants: 0% Exploitation on archive-percentile<25 focals; no
Regime/Target/archive-quartile tokens; archive-percentile rendered ≥95% post-bootstrap.
6. Universal-across-tasks proof: only per-task input is higher_is_better flag; 25/75 are
statistical-convention quartile boundaries, not chosen values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tools(v31-validator): read-only sampler — recompute archive-percentile + decision-tree predictions vs LLM choices
Non-mutating Redis sampler that walks DBs 13/14/15 program by program,
parses `Program.metadata.mutation_context` (the rendered prompt) for the
parent's state (focal fitness, valid sibling fitnesses, trend, intra
verdicts), and recomputes the v3.1 archive-percentile from valid
siblings. Emits JSONL with fields: tree_bucket, tree_eligible_archetypes,
archetype_chosen, fitness_delta, cf_tags (which of CF-A..CF-E cells the
sample lands in), match (tree-prediction vs LLM-choice).
Reuses:
- gigaevo.programs.stages.collector.N_MIN_ARCHIVE
- gigaevo.evolution.mutation.context._archive_percentile_of_focal
(direction-aware)
- gigaevo.database.redis_program_storage RedisProgramStorage.get_all
- gigaevo.programs.program.Program.get_metadata (base64 deserialization)
Output drives docs/audits/MUTATION_DECISION_TREE_V3_1_COUNTERFACTUAL_AUDIT_2026-05-18.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(audit): v3.1 decision tree — counterfactual audit on 289 prior-run programs (DBs 13/14/15)
Recomputes v3.1 archive-percentile and tree-predicted archetype bucket
on every program with mutation_context metadata across DBs 13/14/15
(n=289), compares to the LLM's actual archetype_choice and the child's
fitness_delta, then groups by the 5 candidate counterfactual cells
identified in the audit plan:
  CF-A  empty intra (first child)                 n=225  modal Hybrid 7 (57), pos-rate 42.1%
  CF-B  invalid focal with archive line           n=0    unreachable under OLD prompt; documented as gap
  CF-C  low-N flat trend (noise-dominated)        n=32
  CF-D  middle-band percentile + falling trend    n=19
  CF-E  top-quartile + spread wide + far-target   n=53   pos-rate 92.5%, Exploitation 32/35 = 91% wins
Headline finding: v3.1's target-awareness override, which demotes top-quartile
parents to Hybrid, is empirically TOO RESTRICTIVE. CF-E shows Exploitation
beats Hybrid 32/35 when the gate permits both. The audit also found 18
counterfactual-A samples: gate violations that improved fitness anyway.
Recommendations applied in next 2 commits:
- REV-1 soften row 13 (target-awareness no longer forbids Exploitation)
- REV-3 add row 19 — empty intra middle-band → Hybrid 7 default
- REV-4 add row 19a — invalid focal → Exploration with corrective
- REV-2 (row 11 softening for CF-D) documented as PROPOSAL — n=19 too thin
Observational only — contexts rendered under OLD prompt surface; the
sampler recomputes v3.1 tokens from the same underlying numerics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(audit): v3.1 tree — soften row 13 target-awareness + add rows 19 / 19a per counterfactual audit
Three additive/softening edits to MUTATION_DECISION_TREE_V3_1_2026-05-18.md,
all driven by empirical findings in the counterfactual audit
(MUTATION_DECISION_TREE_V3_1_COUNTERFACTUAL_AUDIT_2026-05-18.md):
REV-1 — row 13 (top-quartile + spread wide + far-target):
OLD: "demote to Hybrid 7 even though gate permits Exploitation"
NEW: "prefer Hybrid 7 OR Exploitation 1 — gate does NOT forbid
Exploitation. Choose by C2 insight severity and C6 ancestral_step_delta."
Evidence: CF-E n=53, pos-rate 92.5%; Exploitation wins 32/35 in this cell.
REV-3 — new row 19 (empty intra, middle-band 25≤p<75):
Add explicit default: Hybrid 7 (Guided Innovation).
Evidence: CF-A n=225, modal choice Hybrid 7 (57 picks), pos-rate 42.1%.
Closes a gap where the tree relied on the LLM inferring a default.
REV-4 — new row 19a (invalid focal with archive line):
Defensive rule — force Exploration with corrective mechanism regardless
of archive-percentile (focal cannot be refined when it is invalid).
Evidence: CF-B was unreachable in prior data; rule is forward-looking.
REV-2 (row 11 middle-band+falling softening) NOT applied — CF-D n=19 too
thin; documented in audit as proposal pending more data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(prompts): soften target-awareness clause + add noise-dominated & empty-intra rules
Mirrors REV-1/REV-3 from the decision-tree counterfactual audit into the
two system prompts the LLM actually sees.
mutation/system.txt:
- Softened target-awareness paragraph: target awareness shapes priority
WITHIN the gate-permitted set; it never forbids an archetype the gate
allows. When archive.best far below target AND focal in top quartile,
BOTH Hybrid 7 and Exploitation 1 are legitimate.
- Added noise-dominated-trends bullet: falling/flat with
`[only N valid — too few for trend]` (or iter_window_valid < 9) is
inconclusive — do NOT force Exploration on it alone.
- Added empty-intra default bullet: first child of a fresh parent in
middle band defaults to Hybrid 7 (Guided Innovation).
mutation_suggestions/system.txt:
- Softened target-awareness paragraph (same wording philosophy).
- Added noise-dominated-trends bullet so the analyst does not raise
severity to `high` on a noisy signal alone.
Total: ≤10 lines net per file, all additive or replacement softening.
Empirical justification: CF-E audit shows Exploitation wins 32/35 when
the gate permits both; CF-A shows Hybrid 7 is the modal LLM choice
(57/225) and best per-pick pos-rate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tools(pseudo-evo-bench): single-shot mutation A/B harness with archetype-distribution analysis
Tests prompt-wrapper changes (mutation/system.txt + user.txt) on a fixed cohort of
50 stratified parents from DBs 13/14/15. Each parent's mutation_context blob is
frozen from its original search-time run; only the system.txt/user.txt wrappers
are re-rendered from the working tree at sample time.
Components:
- sample_parents.py: stratify 50 parents (17/18/15 by fitness bucket), render
current HEAD prompts, write parents.json (idempotent under SEED=20260518)
- run_qwen.py: 1 LLM call per parent at concurrency=6 via LiteLLM proxy
- eval_mutants.py + eval_runner.py: parallel heilbron validate() at concurrency=4
- analyze.py: PRIMARY signal is archetype/strategy distribution (coverage,
entropy, group balance, v3.1 gate compliance, archetype shift matrix).
Fitness delta is reported as SECONDARY since single-shot is noise-dominated
by parent quality and validity.
Scope (per README): tests prompt-wrapper changes only. Stage-internal
mutation_context build (collector, intra/extra memory, lineage) is NOT
exercised because contexts are pre-rendered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tools(pseudo-evo-bench): iter0 (HEAD) vs iter1 (pre-softening v3.1) — archetype-distribution A/B
Same 50 parents (verified parent_id-identical), frozen mutation_context blocks
(verified byte-identical per parent). Only difference: rendered system_prompt
(iter0 = 14,375 chars / HEAD with +14-line softening; iter1 = 13,415 chars at
commit 11eb4d4b before softening).
PRIMARY: archetype/strategy distribution
=========================================
                         iter0 (HEAD softened)   iter1 (v3.1 sharp)
  Coverage               7/8 archetypes          7/8 archetypes
  Entropy                2.583 bits              2.565 bits
  Group balance E/X/H    28% / 41% / 30%         20% / 52% / 28%
  Group skew (max-min)   13.0% ← fairer          32.6%
  Low anti-Exploit       87.5% (n=16)            93.8% (n=16)
  High Exploit rate      50.0%                   42.9%
Per-bucket group shift (where softening actually changed behavior):
low: ~unchanged (gate respected in both)
mid: iter0 56% Hybrid → iter1 50% Explore (softening pushed mid TOWARD Hybrid)
high: iter0 14% Hybrid → iter1 36% Hybrid (softening pushed high TOWARD Explore)
Archetype shift matrix: 16/45 decisive pairs (36%) cross GROUP boundary
between iter0 and iter1 — the +14 lines DO steer the LLM, the question is
whether the steering is desirable.
Common finding across BOTH iterations:
archetype #8 "Conservative Exploration" picked ZERO times (0/92 mutations)
→ strong signal the prompt does NOT surface this archetype effectively
archetype #7 "Guided Innovation" dominates Hybrid (14/14 + 13/13)
SECONDARY: fitness (noise-dominated, but directionally informative)
==================================================================
Sign test on paired Δ: iter1 wins 20, iter0 wins 7, ties 23 → p≈0.012
Validity: iter0 26/50, iter1 32/50
Interpretation: HEAD softening trades 6pp of single-shot fitness loss for
group-balance fairness. Neither prompt surfaces archetype #8.
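For readers who want to reproduce the sign test, here is a small sketch. It is not the code in `analyze.py`; the function name is hypothetical, and which variant the bench uses is not stated. Ties are dropped, and both an exact two-sided binomial p-value and a normal approximation without continuity correction are returned.

```python
from math import comb, erfc, sqrt


def sign_test(wins_a, wins_b):
    """Two-sided sign test on paired outcomes; ties must already be excluded."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Exact: double the lower binomial tail under p = 0.5, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    p_exact = min(1.0, 2 * tail)
    # Normal approximation without continuity correction.
    z = abs(wins_a - wins_b) / sqrt(n)
    p_normal = erfc(z / sqrt(2))
    return p_exact, p_normal
```

For 20 wins against 7, the normal approximation gives roughly 0.012, in line with the figure quoted above, while the exact binomial value is somewhat larger at roughly 0.019.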
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* prompts(mutation/system.txt): rewrite archetype #8 as Component Substitution + sharpen #5 separator from #4
Empirical motivation: pseudo-evo bench iter0+iter1 picked archetype #8 (Conservative Exploration) ZERO times across 92 valid responses. Diagnosis — "explore within structural / interface constraints" describes properties any valid mutation already has, not an edit pattern the model can operationalise.
Archetype #8 → Component Substitution
Replace ONE named subroutine or building block (scoring function, sampler, init scheme, distance metric, update rule, post-processor) with an alternative of the same kind, occupying the same slot — same inputs, same output shape — so surrounding control flow, interfaces, and hyper-parameters remain unchanged. Distinct from #4 (changes algorithm family) and from #7 (adds a component alongside an existing one rather than replacing).
Archetype #5 → sharpened separator from #4
"Change the SET of admissible solutions (relax/tighten a constraint, drop a parity or symmetry rule, allow rotations, switch from discrete to continuous parameterisation) without changing the search algorithm itself. #4 changes HOW the search runs; #5 changes WHAT set is searched."
Net effect — eight distinct edit verbs:
Exploit: tune (#1) / extend-scope (#2) / remove (#3)
Explore: reinvent-algorithm (#4) / change-feasibility-set (#5) / synthesise-from-memory (#6)
Hybrid: add-alongside (#7) / substitute-component (#8)
Raises realised entropy ceiling above today's log2(5)≈2.32 bits toward the 3.0-bit max for 8 archetypes. Pseudo-evo iter2 will verify the model actually picks #8 and discriminates #5 from #4.
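The entropy ceilings quoted here follow from Shannon entropy over the archetype pick distribution. A minimal sketch, with a hypothetical function name, that makes the two bounds checkable:

```python
from collections import Counter
from math import log2


def archetype_entropy_bits(picks):
    """Shannon entropy (in bits) of a sequence of archetype picks."""
    counts = Counter(picks)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

A uniform spread over five archetypes yields log2(5) bits and a uniform spread over all eight yields 3 bits, so picks concentrated on five archetypes cannot exceed the lower ceiling.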
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* schema(mutation): enforce archetype names via Literal + add drift-detection tests
Changes:
1. Added ARCHETYPE_NAMES constant and ArchetypeName Literal type to constants.py (single source of truth for canonical names)
2. Updated MutationStructuredOutput schema to use ArchetypeName Literal (strict validation, rejects unknown archetype strings at parse time)
3. Added 4 new test functions to test_mutator_system_prompt.py:
- test_archetype_names_appear_in_system_prompt() — catches drift between schema Literal set and prompt's archetype menu
- test_archetype_count_is_eight() — asserts len(ARCHETYPE_NAMES)==8
- test_mutation_output_accepts_canonical_archetypes() — validates each canonical name
- test_mutation_output_rejects_unknown_archetype() — rejects out-of-set strings with ValidationError
4. Fixed test_defaults in test_mutation_agent.py to use canonical archetype "Precision Optimization" instead of invalid "test"
Motivation: Prevent silent LLM output rejection when system.txt and schema drift (e.g., if archetype #8 is rewritten in prompt but not updated in Literal).
All 104 tests pass. No changes to mutation/system.txt (archetype #5/#8 redesign committed separately 2026-05-18 at 7a52f45d).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(auto-optimize-loop): spec + reference schemas + history/patterns scaffolds
Adds the auto-optimize-loop task spec that drives autonomous cycles tuning
ONLY the mutation operator's context factory graph and the archetype
framework. Primary success criterion is healthy trajectory + healthy
mutants; 0.03 is the "real improvement" floor with 0.035 aspirational.
All future auto-loop cycle commits land linearly on r7-r8-r9-v3-bundle
and are identified by their commit SHA captured at LAUNCH time. Each
cycle writes a Reconstruction MD and Analytics MD; the Analytics MD's
retroactive invariant audit hard-gates the next cycle's PROPOSE step.
Files:
- docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md (primary spec)
- docs/audits/references/AUTO_OPTIMIZE_RECONSTRUCTION_MD_SCHEMA.md
- docs/audits/references/AUTO_OPTIMIZE_ANALYTICS_MD_SCHEMA.md
- docs/audits/AUTO_OPTIMIZE_CYCLE_HISTORY.md (append-only ledger)
- docs/audits/AUTO_OPTIMIZE_PATTERNS.md (evidence ledger)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: gitignore output/runs/tool-caches + capture pre-loop audit MDs
- Add .gitignore rules for output/ runs/ problems/heilbron_pro/
rotated litellm sampler logs, and throwaway tools/ subdirs
(benchmark_gemini_ab, insights_ablation, lineage_card_scaffold).
- Capture pre-loop audit MDs under docs/audits/ that informed the
cycle-9 redesign + auto-optimize-loop spec (insights, lineage memory
plans, mutation guidance rubric, cycle-8 prelaunch, prompt redesign,
etc.). These are read-only history; future PRs cite them.
- Collapse multi-line attach_inputs({...}) calls in
tests/stages/test_intra_memory_cache.py to single-line form
(pure formatting, no behavior change).
- Add docs/audits/AUTO_OPTIMIZE_CYCLE_0_ANALYTICS.md (cycle-0 = cycle-9
baseline) pre-drafted at T+1h17m with <TBD-FINALIZE> markers;
end-of-run values will be filled in once PID 2008891 exits.
This is a non-loop chore commit. It becomes cycle-1's PARENT_SHA
so the §8.1 invariant
git rev-list --count $PARENT_SHA..HEAD == 1
will be satisfiable when cycle-1's single IMPLEMENT commit lands on top.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: ruff fix + format on tools/pseudo_evo_bench (pre-loop)
Applies `ruff check --fix` + `ruff format` to the pseudo_evo_bench
A/B harness scripts. All changes are cosmetic:
- 11 I001 import-order errors fixed (stdlib imports merged into
alphabetic order with site-package imports).
- Long json.dumps(...) and dict literals reflowed by the formatter.
These files have been failing `ruff check .` since they landed
(commits 893113c1 / a9b4bee5). The §8.1 pre-launch lint invariant
in the auto-optimize loop spec requires a clean `ruff check .` +
`ruff format --check .`, so this clean-up is a prerequisite for
cycle-1 launch.
Safety:
- pseudo_evo_bench is NOT imported by gigaevo/ or tests/ — grep
across both trees returns zero hits. The currently-running
cycle-9 (PID 2008891) does not touch these files.
- Only formatting changes; no semantic edits, no API surface change.
- `pytest tests/prompts/test_mutator_system_prompt.py` continues
to pass (archetype-drift detection).
Verified:
ruff check . → All checks passed!
ruff format --check . → 1172 files already formatted
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(auto-optimize-loop): finalize cycle-0 Analytics + cycle-1 PROPOSE + scope-expansion note
Finalizes cycle-0 (= pre-loop cycle-9 archetype redesign + R-bundle) baseline analytics
against the full evolution log (20043 lines, exit at T+1h55m41s).
- AUTO_OPTIMIZE_CYCLE_0_ANALYTICS.md: substitute all TBD-FINALIZE markers with
extracted numbers. Final outcome: HEALTHY-NEUTRAL (S2.1 trajectory PASS narrow;
S2.2 fitness FAIL with best_fitness=0.02620 < 0.03 floor, below baseline 0.02788).
Trajectory has 6 strictly-increasing best-fitness peaks with one mid-run plateau
of ~46 mutants strict (peaks #4->#5) followed by fast late rescue (peak #6).
Strict stagnation_interval_max=46 NARROW-FAILS <=40 gate; inclusive frontier-event
defn PASSES at <=5 mutants. valid_rate=52%, frontier_new_cell_events=42,
right_tail_mass=42.3%, per_parent_advance_rate=6% strict / 10% inclusive. Component Substitution
(new archetype #8) NOT dead - rose 0->18 picks (6%) by run end; vindicates the
feedback_archetype_distribution_not_a_goal user reframe.
- AUTO_OPTIMIZE_CYCLE_HISTORY.md: cycle-0 row now shows HEALTHY-NEUTRAL decision,
best_fitness=0.02620, S2.1 PASS narrow, S2.2 FAIL.
- AUTO_OPTIMIZE_PATTERNS.md: cycle-0 entry as NEUTRAL ceiling evidence; documents
surface scope (R-bundle), numbers (S2.1 components), caveats (n=1 variance not
yet measured; below baseline by 6% within plausible n=1 variance).
- AUTO_OPTIMIZE_CYCLE_1_PROPOSE.md: cycle-1 = variance-floor replicate of baseline
(NO EDIT per S7). Decision rationale cites feedback_variance_floor_first +
feedback_consistent_improvement_all_stages + feedback_auto_optimize_trajectory_first.
S4 citations rewritten as trajectory-shape-only signals (plateau duration,
per_parent_advance_rate, stagnation_interval_max) per
feedback_archetype_distribution_not_a_goal. Updated parent-SHA reference to current operational HEAD.
- AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md: S3 prelude blockquote captures the 2026-05-19
user verbal directive expanding cycle-3+ scope to the entire mutation context harness
(feedback_mutation_context_harness_in_scope). S3.2 SLIGHTLY cap on mutation/system.txt
preserved as engineering constraint only.
Non-cycle docs commit - the cycle-1 commit will follow as a separate --allow-empty
commit per the variance-floor protocol.
* auto-loop cycle 1: variance-floor replicate of baseline (no edit)
* auto-loop meta cycle 1: variance-floor replicate (HEALTHY-NEUTRAL)
Cycle-1 ran 2026-05-19 02:33→04:53 MSK on db=12, identical config to
cycle-0 baseline (a4925a90). Cycle commit a527b256 is the --allow-empty
IMPLEMENT SHA; zero code/prompt/config diff.
Result: HEALTHY-NEUTRAL (variance-floor; informational §2.2 PASS).
- best_fitness = 0.031187 (cycle-0 was 0.02620; Δ=+0.00499)
- |Δ| < §7 variance threshold 0.01310 → within variance floor
- §2.1 trajectory: PASS all 5 gates strict (frontier_new_cell 52,
right_tail_mass 57.4%, advance_rate 6% strict / 12% inclusive,
stagnation_interval_max 14 strict within-active, valid_rate 64%)
- §2.2 fitness floor: PASS (0.031187 ≥ 0.03) — INFORMATIONAL
flip vs cycle-0 (which failed by 0.004); NO-EDIT cycles cannot
be WIN-CANDIDATE by spec.
- Trajectory shape: rapidly ascending for first 29 mutants (6 peaks
compressed), then 70-mutant trailing plateau. INVERSE of cycle-0's
mid-run plateau + late rescue. Two equally consistent interpretations
(A: baseline mean ~0.029±0.003, B: cycle-1 high-side outlier).
Cycle-2 (db=13) will disambiguate. If cycle-2 lands within Δ=0.006 of
either prior cycle, baseline mean ≈ midpoint(0.026, 0.031, cycle-2);
loop may be near a structural Heilbron ceiling per
feedback_auto_optimize_trajectory_first ("task may be unsolvable in
knob scope — that's valid").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(auto-optimize-loop): cycle-2 PROPOSE — variance-floor replicate #2 (db=13)
Continuation of §7 variance-floor methodology. Cycle-2 is NO-EDIT,
db=13 only delta vs cycle-0/1. Adds the third sample point to lock
baseline mean + std before cycle-3 first real intervention.
After cycle-1, n=2: best=[0.02620, 0.03119], midpoint 0.02870,
sample std 0.00353, §7 variance threshold 0.01435 (50% of midpoint).
Cycle-2 decision tree:
- best ∈ [0.026, 0.034] AND §2.1 PASS → cycle-3 PROPOSE proceeds
- best outside [0.020, 0.040] OR §2.1 FAIL → §7 STOP
- best in marginal bands → optional cycle-2.5 NO-EDIT before cycle-3
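The three-way decision above can be written as a small classifier. This is a sketch only: the band edges are copied from the list, treating them as inclusive is an assumption, and `cycle2_next_step` is a hypothetical name.

```python
def cycle2_next_step(best, trajectory_pass):
    """Map a cycle-2 outcome to 'PROCEED', 'STOP' or 'MARGINAL'."""
    if not trajectory_pass or best < 0.020 or best > 0.040:
        # Outside the outer band, or a failed §2.1 trajectory: §7 STOP.
        return "STOP"
    if 0.026 <= best <= 0.034:
        # Inside the inner band with §2.1 PASS: cycle-3 PROPOSE proceeds.
        return "PROCEED"
    # Between the bands: optional extra NO-EDIT replicate (cycle-2.5).
    return "MARGINAL"
```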
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop cycle 2: variance-floor replicate of baseline (no edit)
Per §7 variance-floor methodology. Cycle-2 = second NO-EDIT replicate
on db=13. Cycle identity SHA = this commit's HEAD.
PARENT_SHA = previous commit (cycle-2 PROPOSE meta).
No code/prompt/config diff vs cycle-0 baseline (a4925a90).
§8.1 invariants verified pre-launch:
- branch r7-r8-r9-v3-bundle
- working tree clean
- LiteLLM proxy 10.232.30.185:4000 reachable (/health/readiness)
- archetype-schema drift tests 51 passed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop meta cycle 2: variance-floor replicate (HEALTHY-NEUTRAL; marginal → cycle-2.5)
Cycle-2 IMPLEMENT commit (57442737) was a --allow-empty NO-EDIT replicate on db=13.
This meta commit captures the cycle-2 ANALYZE-post artifacts:
- AUTO_OPTIMIZE_CYCLE_2_RECONSTRUCTION.md: best_fitness=0.021266 at mutant ~38;
§2.1 trajectory PASS (5/5 gates strict); §2.2 fitness floor FAIL (< 0.03 by 0.00873)
- AUTO_OPTIMIZE_CYCLE_2_ANALYTICS.md: n=3 baseline mean=0.02622, stdev=0.00496;
cycle-1 reclassified as high-side outlier; trailing-plateau shape dominates 2/3 cycles
- AUTO_OPTIMIZE_CYCLE_HISTORY.md: cycle-2 row appended
- AUTO_OPTIMIZE_PATTERNS.md: cycle-2 NEUTRAL evidence entry
- AUTO_OPTIMIZE_CYCLE_3_SURFACE_MENU_DRAFT.md: surface menu for cycle-3 PROPOSE
(gated by cycle-2.5 4th NO-EDIT replicate per cycle-2 PROPOSE marginal-band rule)
Decision: cycle-2 best_fitness 0.021266 ∈ [0.020, 0.025] MARGINAL band per cycle-2
PROPOSE §5 decision tree → next cycle (2.5) is another NO-EDIT replicate on db=14
to add a 4th variance sample before cycle-3 first real intervention.
Per spec §8.1: git rev-list --count 57442737..HEAD == 1 (one commit per cycle step).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(auto-optimize-loop): cycle-2.5 PROPOSE — 4th NO-EDIT variance-floor replicate (db=14)
Triggered by cycle-2 PROPOSE §5 decision-tree marginal-band rule:
cycle-2 best_fitness 0.021266 ∈ [0.020, 0.025] → need 4th sample.
n=3 stats: mean=0.02622, stdev=0.00496; §7 STOP NOT triggered.
n=4 will tighten baseline mean variance ~2.6× and disambiguate
the bimodal-suspicious distribution (cycle-1 1σ above mean).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop cycle 2.5: variance-floor replicate of baseline (4th sample, no edit)
Identical config to cycle-2 except redis.db=14 (cycle-2 was 13).
Per cycle-2 PROPOSE §5 decision-tree: cycle-2 best 0.021266 in MARGINAL
band [0.020, 0.025] required this 4th NO-EDIT sample.
After cycle-2.5 closes:
- if best ∈ [0.015, 0.035] AND §2.1 PASS → cycle-3 PROPOSE proceeds
- if outside band OR §2.1 FAIL → STOP per §7
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop meta cycle 2.5: variance-floor replicate (HEALTHY-NEUTRAL; n=4 baseline LOCKED; proceed to cycle-3)
cycle-2.5 best_fitness = 0.025709 at mutant 80/100 on db=14.
n=4 baseline LOCKED:
- mean = 0.02609, stdev (sample) = 0.00406
- range [0.02127, 0.03119], CV = 15.6%
- §7 STOP NOT triggered (0.00406 << 0.5 × 0.03119 = 0.01559)
§2.1 trajectory: PASS lenient (4/5 strict + stagnation NARROW-FAIL at 55 mutants,
same shape as cycle-0's 46). §2.2 fitness floor: FAIL (0.025709 < 0.03 by 14%).
Trajectory-shape census (n=4):
- 2/4 cycles: mid-run plateau + late rescue (cycle-0, cycle-2.5)
- 2/4 cycles: leading sprint + trailing plateau (cycle-1, cycle-2)
Bimodal 2/2 — cycle-3 intervention must address BOTH shapes.
Decision tree (cycle-2.5 PROPOSE §5):
best 0.02571 in [0.015, 0.035] AND §2.1 PASS lenient
-> PROCEED TO CYCLE-3 PROPOSE (first real intervention)
Cycle-3 WIN-CAND threshold = mean+1sigma = 0.03015.
Cycle-3 STRONG-WIN threshold = mean+2sigma = 0.03421.
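The locked n=4 statistics and both thresholds can be recomputed from the four best_fitness values reported for cycles 0, 1, 2 and 2.5. A short sketch (the helper name is hypothetical) using the sample standard deviation:

```python
from statistics import mean, stdev


def baseline_thresholds(values):
    """Return mean, sample stdev, CV and the mean+1σ / mean+2σ thresholds."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return {
        "mean": mu,
        "stdev": sigma,
        "cv": sigma / mu,
        "win_cand": mu + sigma,
        "strong_win": mu + 2 * sigma,
    }


# best_fitness of cycle-0, cycle-1, cycle-2 and cycle-2.5 as reported above.
stats = baseline_thresholds([0.02620, 0.031187, 0.021266, 0.025709])
```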
First cycle with DIRECT live /proc/<pid>/environ verification of all four
section 8.1 environment invariants (OPENROUTER_API_KEY len=73, OPENAI_API_KEY=sk-gigaevo,
HTTP_PROXY+HTTPS_PROXY unset). Strengthens n=4 baseline vs cycle-1/2 INFERRED env.
Files:
- AUTO_OPTIMIZE_CYCLE_2_5_RECONSTRUCTION.md (FINAL; sections 9-13 filled)
- AUTO_OPTIMIZE_CYCLE_2_5_ANALYTICS.md (created; sections 0-6)
- AUTO_OPTIMIZE_PATTERNS.md (append cycle-2.5 entry; n=4 stats)
- AUTO_OPTIMIZE_CYCLE_HISTORY.md (append cycle-2.5 row)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop cycle 3: intra-memory saturation detection (K=3 narrative streak → header inject)
INT 1 from cycle-3 surface-menu draft. First REAL intervention after n=4
NO-EDIT variance-floor baseline (mean=0.02609, σ=0.00406; locked at f225e1db).
Change: IntraMemoryStage now tracks an SHA1-keyed hash of each rendered
intra-card's narrative signature (summary + tried_strategies' label/verdict/notes,
excluding monotonic counters n_attempts/mean_delta/delta_distribution per
chaos-hacker CRITICAL #1). When the same hash is observed for K=3 consecutive
renders on the same parent, the next render is prepended with a
"[STAGNATION DETECTED]" header plus a child-delta archetype histogram.
Hypothesis: the K=3 stagnation header makes the parent's saturation visible
to the MutationSuggestionAgent → encourages archetype shift OR genuinely new
strategies in identical-narrative branches → reduces stagnation_interval_max
and/or trailing plateau without changing model/prompt template/problem.
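The detector described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in lineage_memory.py: the card is modelled as a plain dict, `StreakTracker` and `narrative_signature` are hypothetical names, and the sketch flags the render that follows a K-long streak without re-checking that render's own hash.

```python
import hashlib
import json

K = 3  # consecutive identical narrative signatures before the header is injected


def narrative_signature(card: dict) -> str:
    """SHA1 over summary plus each tried strategy's label/verdict/notes.

    Monotonic counters (n_attempts, mean_delta, delta_distribution) are left
    out on purpose, otherwise the hash would change on every render.
    """
    payload = {
        "summary": card.get("summary", ""),
        "tried_strategies": [
            {key: strategy.get(key) for key in ("label", "verdict", "notes")}
            for strategy in card.get("tried_strategies", [])
        ],
    }
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()


class StreakTracker:
    """Tracks, per parent, how many consecutive renders shared one signature."""

    def __init__(self, k: int = K) -> None:
        self.k = k
        self._last_sig: dict[str, str] = {}
        self._streak: dict[str, int] = {}

    def observe(self, parent_id: str, card: dict) -> bool:
        """Record this render; return True if it should carry the header."""
        # The flag depends on the streak built by the PREVIOUS renders.
        flag = self._streak.get(parent_id, 0) >= self.k
        sig = narrative_signature(card)
        if self._last_sig.get(parent_id) == sig:
            self._streak[parent_id] += 1
        else:
            self._last_sig[parent_id] = sig
            self._streak[parent_id] = 1
        return flag
```

Keeping the counters out of the signature is what makes a streak possible at all. Any field that changes on every render would reset the streak to 1 each time.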
Scope: lineage_memory.py (+90 lines) + 6 unit tests bypassing InputHashCache
to exercise the new code path directly.
Frozen invariants (unchanged): problem.heilbron, validator, fitness fn,
Pydantic Literal archetype enforcement, num_parents=1, max_mutants=100,
model=Qwen3-235B-A22B-Thinking-2507, prompts (all 4 SHAs unchanged).
WIN-CAND threshold: best_fitness ≥ 0.03015 (n=4 baseline mean+1σ).
STRONG-WIN: ≥ 0.03421 (mean+2σ).
Spec: docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md
PROPOSE: docs/audits/AUTO_OPTIMIZE_CYCLE_3_PROPOSE.md
RECONSTRUCTION: docs/audits/AUTO_OPTIMIZE_CYCLE_3_RECONSTRUCTION.md (skeleton; populated post-run)
ANALYTICS: docs/audits/AUTO_OPTIMIZE_CYCLE_3_ANALYTICS.md (skeleton; populated post-run)
* auto-loop meta cycle 3: K=3 stagnation detector — HEALTHY-NEUTRAL (mechanism never fired)
Cycle-3 INT 1 (commit 45255a3d, db=15, 1.85h run) outcome class HEALTHY-NEUTRAL.
Headline:
- best_fitness = 0.024369 (mutant cc09c637, frontier event #38 of 55 valid mints)
- Δ vs n=4 baseline mean (0.02609) = -0.00172 (|Δ| < 1σ = 0.00406, WITHIN noise band)
- §2.1 trajectory: 5/5 strict PASS (valid_rate ~0.55, frontier_new_cell=46,
right_tail_mass=0.344, advance 0.06 strict / 0.55 inclusive, stagnation_interval_max=14)
- §2.2 fitness floor: FAIL (0.024369 < 0.030 by 0.00563)
- §7 STOP threshold: NOT triggered (loop continues)
Key empirical finding: STAGNATION DETECTED header activations = 0/100.
The K=3 narrative-streak SHA1 detector never fired in 100 mutants. The bucketed
memory representation evolves enough between consecutive renders that
SHA1(narrative_signature) changes before K=3 is reached, even with num_parents=1
and many sibling renders per parent.
This vindicates the user's preference (feedback_llm_rules_over_hardcoded):
hardcoded Python predicates are too brittle to fire at this run scale.
Cycle-3 is empirically a NO-EDIT replicate of cycle-2.5 from the mutator's
perspective. Trajectory improvements (stagnation_interval_max=14 vs cycle-0's
46 / cycle-2.5's 55) are not causally attributable to the mechanism that
never fired — they are sampling variance.
Dual-axis verification per project_fat_context_direction.md (§1):
- signature: STAGNATION activations = 0 → ✗
- metric of interest: stagnation_interval_max = 14 (< 46 cycle-0) → ✓
- quadrant ✗✓ → "noise/lucky; replicate before claiming win" → cannot ship as a
trajectory-shape win because the mechanism never fired
Archetype distribution shifted dramatically from cycle-0 (informational only):
- cycle-0: Guided Innovation 25%, Computational Reinvention 21%
- cycle-3: Guided Innovation 53% (mode-collapse), Computational Reinvention 2%
ARCHETYPE-EFFICIENCY MISMATCH persists: highest hit-rate archetypes
(Computational Reinvention 100% n=2, Harmful Pattern Removal 100% n=1,
Precision Optimization 66.7% n=6) are under-sampled, while highest-pick
archetype (Guided Innovation 53%) has below-average hit-rate (35.8%).
Decision: HEALTHY-NEUTRAL. Next: cycle-4 PROPOSE per fat-context methodology
will target a measurable failure mode (candidate: pick-rate-vs-hit-rate
mismatch) with LLM-side fat context, NOT hardcoded Python predicates.
Spec: docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md
RECONSTRUCTION: docs/audits/AUTO_OPTIMIZE_CYCLE_3_RECONSTRUCTION.md (FINAL)
ANALYTICS: docs/audits/AUTO_OPTIMIZE_CYCLE_3_ANALYTICS.md (FINAL)
HISTORY: docs/audits/AUTO_OPTIMIZE_CYCLE_HISTORY.md (row appended)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop cycle 4: surface per-archetype yield in evolutionary statistics block
INT 2 (cycle-4) — FIRST LLM-side fat-context intervention. Per cycle-3 closeout
(commit 08cbd5b2, HEALTHY-NEUTRAL, K=3 narrative-streak detector NEVER FIRED in
100 mutants), the user's verbal directive on 2026-05-19 expanded the loop scope
to the entire mutation-context harness (feedback_mutation_context_harness_in_scope)
and re-affirmed fat-informative-context over hardcoded Python predicates
(project_fat_context_direction).
Change: per-archetype yield (picks, valid_hits, hit_rate, mean_delta_to_parent)
aggregated over the whole-run population in EvolutionaryStatisticsCollector,
attached to the EvolutionaryStatistics StageIO, and rendered as a markdown
table inside the existing "## Evolutionary Statistics" block of the mutation
suggester's prompt (gigaevo/prompts/mutation_suggestions/system.txt also
extended with PRIORITY-reshape guidance, NOT invention).
Three additive surface touches, all in scope per §3.1 of LOOP_TASK +
feedback_mutation_context_harness_in_scope:
- gigaevo/programs/stages/collector.py: new _compute_archetype_yield()
helper + archetype_yield field on EvolutionaryStatistics + cache wiring
in _ensure_population_cache. Also flips EvolutionaryStatisticsCollector
._EXCLUDE from EXCLUDE_FOR_ANALYTICS (strips metadata) to
EXCLUDE_STAGE_RESULTS (keeps metadata) — REQUIRED for the helper to read
program.metadata[MutationSpec.META_OUTPUT]["archetype"]. The metadata
cost is bounded by N=100 programs per snapshot, well under 1% of cycle
wall-time. Other collectors continue to exclude metadata.
- gigaevo/evolution/mutation/context.py: extended
EvolutionaryStatisticsMutationContext.format() to render the yield table
when total picks >= 5 (suppresses bootstrap noise). Sorted by hit_rate
desc, picks desc tie-break.
- gigaevo/prompts/mutation_suggestions/system.txt: extended the existing
"Evolutionary Statistics" bullet with explicit guidance on how to read
the new table (UNDER-UTILIZED vs OVER-RELIED-ON cells). Reaffirms
PRIORITY-reshape, NOT invention.
Hypothesis: surfacing per-archetype yield (cycle-3 ANALYTICS Signal #1:
Computational Reinvention 100% hit-rate at 2% pick-share; Guided Innovation
35.8% hit-rate at 53% pick-share) to the suggester lets the LLM reshape
priority toward higher-yield archetypes. Cycle-4 takes the OPPOSITE design
to cycle-3: NO Python threshold, NO if-streak-K predicate, NO header
inject. The LLM decides.
CRITICIZE-pre v1 returned REVISE with one CRITICAL + two HIGH findings,
all mitigated:
- CRITICAL: PROPOSE v1 specified wrong metadata key (metadata["mutation"]
vs canonical MutationSpec.META_OUTPUT = "mutation_output"). Fixed.
- HIGH: TDD fixtures echoed wrong key. Fixed + new test #7 regression-guards
the dead key (test_rejects_dead_metadata_key).
- HIGH: integration smoke step #4 only checked header presence. Tightened to
require >=1 canonical-named row + (other) share <= 20%.
8 RED tests in tests/stages/test_archetype_yield.py cover: empty population,
per-archetype aggregation with delta-to-parent, canonical ordering with zero
picks, attachment to EvolutionaryStatistics, format() rendering (sorted),
threshold suppression, defensive bucketing of unknown archetypes + missing
mutation_output, regression guard against the dead "mutation" key. All 8
pass post-GREEN. Adjacent suites (tests/stages/test_collector.py +
tests/stages/test_mutation_context.py) pass 85/85 — _EXCLUDE flip has no
regressions.
Scope discipline: NO change to gigaevo/llm/agents/mutation.py (archetype
Literal preserved), NO change to gigaevo/prompts/mutation/system.txt (the
SLIGHTLY rule does not apply), NO change to problems/heilbron/, num_parents,
max_mutants, model_name, llm_base_url. No Heilbron-specific anything in any
touched file. The 8 canonical archetype names are loaded from
gigaevo/evolution/mutation/constants.py — problem-agnostic.
Frozen invariants (unchanged): problem.heilbron, validator, fitness fn,
Pydantic Literal archetype enforcement, num_parents=1, max_mutants=100,
model=Qwen3-235B-A22B-Thinking-2507.
WIN-CAND threshold: best_fitness >= 0.03015 (n=4 baseline mean+1sigma).
STRONG-WIN: >= 0.03421 (mean+2sigma). Cycle-4 prediction at n=1: best
0.0275 +/- 0.005 (~30% chance of WIN-CAND given baseline variance).
Riskiest link: the suggester must actually read the yield table and
reshape priority. Dual-axis verification (PROPOSE §11) detects ignore
vs. follow via (DIAGNOSTIC) archetype-efficiency CV halving signature
+ (PRIMARY) metric of interest (best_fitness, trajectory gates).
Outcome quadrants per project_fat_context_direction 4-quadrant matrix.
Followup captured (NOT in this commit, future cycle): lineage_memory.py:711
reads metadata.get("mutation", {}) — the dead key. TransitionAnalysis archetype
field always None as a result.
* Revert "auto-loop cycle 4: surface per-archetype yield in evolutionary statistics block"
This reverts commit e6cfe6eed30fc81da752af950a4a32a45b2352f4.
* auto-loop meta cycle 4: archetype-yield prompt bloat — LOSE-REVERT
Cycle-4 INT 2 (commit e6cfe6ee, db=12, 2.16h run) outcome class LOSE-REVERT.
Reverted by 69a5a708 per feedback_auto_optimize_branch_policy (no reset).
Headline:
- best_fitness = 0.01686 (run high-water mark; SEED-level, no post-seed mints)
- Δ vs n=4 baseline mean (0.02609) = -0.00923 = -2.27σ → INSIDE LOSE band (baseline-2σ=0.01797)
- 0/37 ACCEPTED…
`ValidityMetricAcceptor` rejected zero and negative `is_valid` via `is_valid <= 0`, but NaN comparisons all return False and so does `inf <= 0`. A crashed validity stage emitting NaN, or an unbounded-objective sentinel of +inf, was therefore silently accepted as an elite.
Fix: add an `isfinite()` guard before the `<= 0` check. Tests: finite small positive (0.5) accepted, NaN rejected, +inf rejected.
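A minimal sketch of the failure mode and the guard, assuming the check reduces to a single float comparison. The two function names are hypothetical and stand in for the acceptor's real method, which is not shown on this page.

```python
import math


def legacy_accepts(is_valid: float) -> bool:
    """Pre-fix logic: reject only when `is_valid <= 0` evaluates to True."""
    return not (is_valid <= 0)


def guarded_accepts(is_valid: float) -> bool:
    """Post-fix logic: reject non-finite values first, then apply `<= 0`."""
    if not math.isfinite(is_valid):
        return False  # NaN, +inf and -inf are never valid
    return not (is_valid <= 0)


nan = float("nan")
inf = float("inf")

# Every ordered comparison against NaN is False, so the old check let it through.
assert legacy_accepts(nan) is True
# +inf is not <= 0 either, so it was accepted as well.
assert legacy_accepts(inf) is True

# With the guard, only finite positive values pass.
assert guarded_accepts(0.5) is True
assert guarded_accepts(nan) is False
assert guarded_accepts(inf) is False
assert guarded_accepts(0.0) is False
assert guarded_accepts(-1.0) is False
```

Placing the finiteness test before the comparison matters here, because a lone `<= 0` check cannot tell NaN or +inf apart from a valid positive score.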