fix(optuna): harden literal parsing across both desubstitution paths by GrigoryEvko · Pull Request #7 · FusionBrainLab/gigaevo-core

GrigoryEvko · 2026-05-15T01:41:32Z

The AST-level _EvalCleaner (add_tuned_comment=False path) checks isinstance(arg, (Constant, List, Tuple, Set, Dict)) to decide whether an eval(...) call wraps a literal that can be inlined. This misses UnaryOp(USub, Constant), so eval(-5) survives in the AST path while the source-level _clean_eval_in_source strips it correctly via ast.literal_eval. The two desubstitution modes silently produced different output for the same input.

Reproducer against current main:

import ast
from gigaevo.programs.stages.optimization.optuna.desubstitution import (
    _EvalCleaner, _clean_eval_in_source,
)

src = "x = eval(-5)"
tree = ast.parse(src); _EvalCleaner().visit(tree); ast.fix_missing_locations(tree)
print(ast.unparse(tree))           # before: 'x = eval(-5)' ;  after: 'x = -5'
print(_clean_eval_in_source(src))  # always: 'x = -5'

Fix: in _EvalCleaner.visit_Call, replace the isinstance check with a try/except around ast.literal_eval(arg), so both paths share the same literal-detection logic. Verified across 12 sample inputs: the AST and source paths now produce identical output. Also broaden the (ValueError, SyntaxError) catches around literal_eval in _coerce_param_value and _clean_eval_in_source to include MemoryError, RecursionError, and TypeError.
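A minimal sketch of the predicate approach, assuming a simplified transformer. `EvalInliner`, `_is_literal`, and `inline_literal_evals` are illustrative names, not the project's `_EvalCleaner`. The idea is that `ast.literal_eval` accepts an AST node as well as a string, so it can decide whether an `eval(...)` argument is a literal without enumerating node types by hand.

```python
import ast

# Exceptions the PR description lists as possible from ast.literal_eval.
_LITERAL_EVAL_ERRORS = (ValueError, SyntaxError, TypeError, MemoryError, RecursionError)


def _is_literal(node: ast.AST) -> bool:
    """Return True if literal_eval accepts the node (hypothetical helper)."""
    try:
        ast.literal_eval(node)
    except _LITERAL_EVAL_ERRORS:
        return False
    return True


class EvalInliner(ast.NodeTransformer):
    """Replace eval(<literal>) with <literal>; leave every other call untouched."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # handle nested eval(...) calls first
        if (
            isinstance(node.func, ast.Name)
            and node.func.id == "eval"
            and len(node.args) == 1
            and not node.keywords
            and _is_literal(node.args[0])
        ):
            return node.args[0]
        return node


def inline_literal_evals(source: str) -> str:
    tree = EvalInliner().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)
```

With this predicate, `UnaryOp(USub, Constant)` needs no special case: `literal_eval` already accepts signed numbers, while a bare name such as `eval(y)` raises `ValueError` and is left alone.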

short id will generate based on full id when required

…d delta fitness

…regression feature weights

… and improve docstring clarity

…cstrings
Phase 1 cleanup (completed):
- Move A_mem/, GAM_root/ → _vendor/ (vendored MIT libs)
- Move contrib licenses → _vendor/
- Move 3 example scripts → examples/
- Fix 15 broken vendored library imports (A_mem/GAM_root bare imports)
- Update 8 consumer import paths to _vendor/
- Add _vendor/__init__.py docstring (vendored libs notice)
- Add examples/__init__.py docstring (not production code)
- Update shared_memory/__init__.py docstring
- Update pyproject.toml (ruff/mypy exclude paths for vendors)
Tests: 770 passed
Lint: clean
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): Phase 1 — directory reorg (vendor/examples/docstrings)

- Delete 5 duplicated usage-merge functions from memory_write_example.py (_to_float, _median_or_none, _extract_usage_task_deltas, _build_usage_payload_from_task_deltas, _merge_usage_payloads) → import from card_update_dedup.py (canonical home)
- Delete duplicate dedupe_keep_order from card_update_dedup.py → import from shared_memory/utils.py
- Remove deprecated _apply_update_actions() from memory.py (dead wrapper)
- Make memory_to_card private (_memory_to_card); it is only used internally
- Simplify single-iteration loop in _extract_json_object
Net: ~120 lines deleted, zero behavior change.
Tests: 770 passed | Lint: clean
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): deduplicate code and delete dead paths

Replace hand-written RetrievalWeights.from_mapping() and CardUpdateDedupConfig.from_mapping() dict parsers (~87 lines) with Pydantic v2 @model_validator(mode="before") — same behavior, idiomatic. Also: add docstrings to all functions in card_update_dedup.py, fix stale test references to deleted _apply_update_actions wrapper, add test for flat config format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): Pydantic-idiomatic config parsing

- memory_write_example.py → write_pipeline.py (it's production, not an example)
- memory_write_config.py → write_pipeline_config.py
- selected_ideas_6.py → origin_analysis.py (remove versioned filename)
- Delete test_memory_write_example_extended.py (duplicate of test_write_pipeline.py)
- Update all 8 import sites + 1 dynamic importlib.import_module call
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): rename write pipeline and analysis files

Extract from card_conversion.py (554 → 420 lines):
- base.py: GigaEvoMemoryBase abstract class (20 lines)
- card_search.py: format_search_results, search_cards_by_keyword, synthesize_search_results (115 lines)
Update 4 import sites directly (no re-exports).
card_conversion.py retains: normalization, conversion, GAM config, constants, protocols.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): split card_conversion into focused modules

Define MemoryError, MemoryRetrieverError, MemorySearchError, and MemoryStorageError in gigaevo/exceptions.py following the existing GigaEvoError hierarchy. Wire them into the memory subsystem:
- gam_search.build() wraps all failures in MemoryRetrieverError
- memory.py narrows two gam.build() catches from bare Exception
- card_store._load() narrows to (json.JSONDecodeError, OSError)
- card_dedup import block narrows to (ImportError, OSError)
Resilience-critical catches (search fallback, merge loop, __exit__) remain broad by design.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): custom exception hierarchy and narrowed catches
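A rough sketch of the pattern this commit describes, under stated assumptions: the class names come from the commit message, but the parent/child relationships between them, the `load_index` function, and the choice to re-raise as `MemoryStorageError` are illustrative guesses rather than the project's actual code.

```python
import json
from pathlib import Path


class GigaEvoError(Exception):
    """Stand-in for the project's root exception."""


# Name taken from the commit message. Note that defining it shadows
# Python's builtin MemoryError inside the defining module.
class MemoryError(GigaEvoError):
    """Assumed base class for memory-subsystem failures."""


class MemoryStorageError(MemoryError):
    """Assumed subclass for storage and I/O failures."""


def load_index(path: str) -> dict:
    """Hypothetical loader showing a narrowed catch instead of `except Exception`."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (json.JSONDecodeError, OSError) as exc:
        # Only the expected failure modes are caught; anything else propagates.
        raise MemoryStorageError(f"cannot load index from {path}") from exc
```

Narrowing the tuple means a programming error such as an `AttributeError` inside the loader surfaces immediately instead of being folded into a storage failure.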

@abstractmethod

…t base to ABC
- concept_api.py: all 5 RuntimeError raises → MemoryStorageError (matches gigaevo/database pattern of wrapping I/O errors)
- base.py: GigaEvoMemoryBase now uses ABC + @abstractmethod (matches MutationOperator, Stage, LangGraphAgent pattern)
- card_dedup.py: narrow two broad catches:
  - JSONL read fallback: except Exception → (json.JSONDecodeError, OSError)
  - GAM store build: except Exception → (MemoryRetrieverError, OSError)
- Update 6 test assertions from RuntimeError to MemoryStorageError
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ormity refactor(memory): exception conformity + ABC base class

When write_pipeline.py passes MemoryCard/ProgramCard Pydantic models to memory_platform.save_card(), the dict() call on a Pydantic model doesn't properly flatten nested Pydantic objects like ConnectedIdea. This caused TypeError in _persist_index() when json.dumps() tried to serialize. Root cause: write_pipeline returns list[AnyCard] (Pydantic models) and both backends (memory_platform and memory/shared_memory) consume these cards via save_card(). memory_platform's normalize_memory_card() must explicitly call .model_dump() on Pydantic inputs to flatten nested objects. Fix verified: all 788 memory + integration tests pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tests the exact bug path: Pydantic MemoryCard/ProgramCard with nested ConnectedIdea and MemoryCardExplanation objects must be properly flattened to plain dicts before JSON serialization. 6 tests covering: ProgramCard with ConnectedIdea, MemoryCard with MemoryCardExplanation, plain dict passthrough, JSON round-trips, None. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add gigaevo-memory Git dependency to pyproject.toml
- Remove sys.path manipulation from memory_platform/memory.py and remote_gam_retriever.py (no longer needed with proper install)
- Simplify test file to use direct imports instead of module mocking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Expands from 6 to 11 tests covering the complete save_card → _persist_index flow with Pydantic inputs. Tests verify:
- normalize_memory_card: ConnectedIdea/MemoryCardExplanation → dict
- save_card: Pydantic ProgramCard/MemoryCard → JSON-serializable index
- _card_to_backend_content: API payload is clean dict
- persist/reload roundtrip: index file survives write→read cycle
Uses _make_platform_memory() factory with mocked API client to test memory_platform in isolation without network dependencies.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add docstrings to 15 public methods across 5 files (memory.py, concept_api.py, card_dedup.py, openai_inference.py, write_pipeline.py)
- Add return type annotations to 4 functions in amem_gam_retriever.py
- Fix 2 mypy errors: annotate retrievers dict, rename variable in api_sync.py
- Extract magic numbers: _MAX_SUMMARY_CHARS, _MAX_DESCRIPTION_CHARS, _ENTITY_NAME_MAX_LENGTH, _MAX_CONNECTED_DESCRIPTIONS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(memory): type annotations, docstrings, constants, platform bug fix

The AST-level `_EvalCleaner` checked `isinstance(arg, (Constant, List, Tuple, Set, Dict))` to decide whether an `eval(...)` argument is a literal. This missed `UnaryOp(USub, Constant)`, so `eval(-5)` survived in the AST path while the source-level `_clean_eval_in_source` stripped it correctly. Switch to `ast.literal_eval` as the predicate so both paths share the same logic. Also broaden the `(ValueError, SyntaxError)` except clauses around `literal_eval` in `_coerce_param_value` and `_clean_eval_in_source` — it can also raise `MemoryError`, `RecursionError`, and `TypeError` on pathological inputs. Verified against the source-level path on 12 sample inputs; both produce identical output post-fix.

Address 2 critical + 2 major issues from methodology expert audit:
- C1: Add required_prefix constraint enforcement in PromptExecutionStage and GigaEvoArchivePromptFetcher; this prevents the mutation LLM from evolving away frozen SYSTEM_CONSTRAINTS.
- C2: Change default prior from Beta(1,1) to Beta(1,3); untested prompts start at fitness=0.25 instead of 0.50, preventing archive churn.
- M1: Track metrics_count separately from total_trials and use it as the denominator for per-metric means (fixes bias from REJECTED_ACCEPTOR trials).
- M4: prompt_text_to_id() now hashes both system and user text, which prevents stats conflation when user prompts differ.
All prior data invalidated (fitness computation + ID hashing changed). DBs 4-7 flushed for clean restart.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
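The 0.50 → 0.25 shift in C2 follows from the mean of a Beta distribution, alpha / (alpha + beta). A small sketch, assuming (this is not confirmed by the commit) that a prompt's fitness is the posterior mean after a number of success/failure trials; `beta_mean` and `posterior_mean` are illustrative names.

```python
def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


def posterior_mean(successes: int, trials: int,
                   alpha: float = 1.0, beta: float = 3.0) -> float:
    """Assumed fitness: posterior mean after `trials` Bernoulli outcomes."""
    return (alpha + successes) / (alpha + beta + trials)


print(beta_mean(1, 1))       # 0.5  (old default prior)
print(beta_mean(1, 3))       # 0.25 (new default prior)
print(posterior_mean(3, 4))  # 0.5  (3 successes in 4 trials under Beta(1,3))
```

Under this reading, an untested prompt no longer ties with a prompt that has a real 50% success record, which is one plausible way the change reduces archive churn.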

Addresses chaos-hacker adversarial review findings:
- HIGH #1: Test _await_idle's actual ghost cleanup branch via time.monotonic patch
- HIGH #2: Test generation_timeout on real step() with ghost IDs + stuck RUNNING
- HIGH #3: Verify snapshot data correctness (not just no-hang) after bump()
- MEDIUM #4: Truly concurrent writes via asyncio.Barrier + serialization proof
- MEDIUM #5: Stuck RUNNING program triggers generation_timeout
- MEDIUM #6: Write serialization assertion (max_concurrent == 1)
- MEDIUM #7: Lock eviction race (concurrent reuse after terminal pop)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* prompts: kill few-shot fabrication leak in insights + lineage The GOOD examples themselves contained invented magnitudes ("rejects 60% of viable candidates", "-2.3% runtime"), training the LLM that fabricated effect estimates are valid output. Live judge eval on 5 parent->child pairs across heilbron + hover, audited against actual Redis program metrics + task_description, shows: - ungrounded-number rate: 20.2% -> 6.9% (3x reduction) - lineage rubric subscore: 17.35 -> 17.40 - 4-pair rubric avg (excl. known Gemini Pro structured-output flake): 16.97 -> 17.12 Edits: - insights: remove fabricated "60%" from numeric GOOD example; add "Quote, don't estimate" rule naming specific fabrication patterns (% rejection rates, speedup factors, iteration budgets). - lineage: remove "-2.3% runtime" from Quantification example; spell out that cited numbers must come from diff, code, metrics, or task description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(evolution-stats): iteration-window aggregation + snapshot bump Evolutionary Statistics section in the mutation prompt was empty or showed stale numbers under the steady-state engine. Two root causes: 1. Stale population snapshot — `bump()` was only called once at seed drain. After that, every collector saw a frozen snapshot, so the focal program's iteration was rarely in scope. Added `bump(incremental=True)` in `poll_and_ingest` after every commit pass so the snapshot tracks ingestion progress without flushing cached program objects. 2. Per-generation aggregation is meaningless under JIT — generations are an output of the schedule, not a fixed input. 
Replaced the `generation_history` / per-gen fields with a symmetric iteration window ([iter-R, iter+R], R=15) around the focal program: window count/valid, best-in-window + iter, focal rank in window, median-before / median-after horizons, trend via median-of-thirds (5% multiplicative threshold, direction-aware via `metrics_context.is_higher_better`), max invalid streak, and a global running-best plateau marker (`iters_since_last_new_best`). `EvolutionaryStatisticsMutationContext.format()` emits the locked 10-line "E_augmented" block; design doc lives at `docs/superpowers/specs/2026-05-14-evolutionary-stats-redesign.md`. Validated via 3-round LLM extraction eval: E_augmented scored 44/45 vs the old per-gen layout's 15/45. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(monitoring): file emit target writes frontier_<metric>.png each tick start_live_frontier_compare gains an output_dir param and a new "file" emit target that re-renders a frontier-trajectory PNG (best-so-far + per-iter mean) in the Hydra run output directory on every tick, sibling to live_profiler's profile_live.html. Default emit_targets now includes "file". run.py threads the Hydra output_dir through to the daemon. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(memory): DAG-native intra+extra memory pipeline (per-parent lineage card + live global cards) Adds the `intra_extra_memory` pipeline variant on top of the default builder: * IntraMemoryStage (strong LLM, structured output) renders a per-parent lineage card from DescendantProgramIds + MemoryContextStage as named inputs; framework InputHashCache skips the LLM when neither changes. Output is attached to the parent's metadata['intra_memory_card'] and concatenated with the global memory cards block via ConcatMemoryStage. 
* LiveMemoryRefreshHook wraps IdeaTracker.run_increment as a post_step_hook, surfacing freshly evolved ideas to MemoryContextStage's reload-on-read selector during the same run (no need to wait for end-of-run flush). * New ExtraMemoryStage class (currently dormant in the wired pipeline) kept as opt-in infra with its own caching test, pinning the structured-output contract for future re-wiring. * Bug fix bundled: invalid-child fitness sentinel (e.g. -1000 in heilbron) no longer pollutes delta_distribution.min/median/max or per-cluster mean_delta. Invalid children route to dedicated n_failed counters; the rendered card shows "n_failed=N (excluded from stats above)" and "mean delta n/a" for all-failed clusters. System prompt rule 3 now instructs the LLM to exclude is_valid=false from delta math. Legacy lineage stages stripped from the builder (AncestorProgramIds, LineageStage, LineagesToDescendants, LineagesFromAncestors, InsightsStage) — DescendantProgramIds is kept and rewidened (max_selected=24) to feed IntraMemoryStage instead of LineagesToDescendants. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: intra/extra memory mode guide + USAGE / MEMORY_ARCHITECTURE cross-links Adds docs/INTRA_EXTRA_MEMORY.md covering the pipeline introduced in 89f01be5: architecture diagram, intra-card schema (with the n_failed sentinel-handling contract), live external-memory refresh hook, caching invalidation triggers, required co-overrides (ideas_tracker=default, memory=local), smoke / full / nohup launch commands, tuning knobs, verification checklist, and a troubleshooting matrix. USAGE.md: adds `intra_extra_memory` to the `pipeline` config-group table and a launch example under "Examples". MEMORY_ARCHITECTURE.md: top-of-file pointer to the new mode guide so the in-run / live-memory entry point is discoverable from the store-side docs. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(intra-memory): ship unified diff (not full code) per child + soften mutator "untried directions" preference IntraMemoryStage payload now carries either a unified diff (change_form="diff") or full child source (change_form="full_code") per child. Diff is the default; full code is the fallback when (a) is_valid=False so error_summary line refs stay readable against the same buffer the analyst sees, (b) the diff is empty (identical sources), or (c) the diff is no smaller than the file (structural rewrites where every line differs). Expected 50-80% prompt-size reduction on the typical small-mutation regime, where the parent's boilerplate was previously repeated N times across children. The intra system prompt's user-message-structure table is updated to document both children[i].diff and children[i].code, plus the change_form discriminator, so the analyst knows how to read either form. Mutator system prompt: softened the "Untried directions" rule. Previously "prefer it over inventing a new direction from scratch" — a hard preference that let speculative hints dominate archetype selection. Now framed as candidates to weigh alongside the model's own ideas, with explicit licence to skip any whose mechanism does not actually fit the parent's code. Tests: 4 new payload-shape tests on IntraMemoryStage (diff for small mutation, full-code for structural rewrite, full-code for invalid child, system prompt documents change_form/diff/full_code), plus a new pin on the mutator prompt wording. 
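The diff-or-full-code selection described in this commit could look roughly like the sketch below. The `change_form`, `diff`, and `code` keys come from the commit message; the function name `child_change_payload` and the use of character length as the "no smaller than the file" test are assumptions.

```python
import difflib


def child_change_payload(parent_src: str, child_src: str, is_valid: bool) -> dict:
    """Pick the per-child payload form (hypothetical helper)."""
    diff = "".join(
        difflib.unified_diff(
            parent_src.splitlines(keepends=True),
            child_src.splitlines(keepends=True),
            fromfile="parent",
            tofile="child",
        )
    )
    use_full_code = (
        not is_valid                      # (a) keep error line refs readable
        or not diff                       # (b) identical sources give an empty diff
        or len(diff) >= len(child_src)    # (c) diff is no smaller than the file
    )
    if use_full_code:
        return {"change_form": "full_code", "code": child_src}
    return {"change_form": "diff", "diff": diff}
```

For a one-line change in a long file the diff carries only the hunk plus a few context lines, which is where the prompt-size saving would come from.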
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(prescriptive): MutationSuggestionStage + EvolutionaryStatistics wiring Architectural split between descriptive (IntraMemoryStage) and prescriptive (MutationSuggestionStage) memory: the intra stage now ONLY summarises lineage history into a per-parent card; the new MutationSuggestionStage consumes intra card + cross-population memory cards + ancestral momentum trail + EvolutionaryStatistics population snapshot and emits structured ProgramInsights into MutationContextStage's insights slot (same shape as the legacy InsightsStage, so the mutator's PROGRAM INSIGHTS section renders unchanged). Key wiring (lineage_memory_pipeline.py): * DescendantProgramIds → IntraMemoryStage.children_ids * IntraMemoryStage → MutationSuggestionStage.intra_card * MemoryContextStage → MutationSuggestionStage.memory_cards * EvolutionaryStatisticsCollector → MutationSuggestionStage.evolutionary_statistics * MutationSuggestionStage → MutationContextStage.insights * IntraMemoryStage + MemoryContextStage → ConcatMemoryStage → MutationContextStage.memory Both strong-LLM stages (Intra + Suggestion) gate on validator success and (when enabled) archive acceptance, mirroring the legacy InsightsStage skip-cascade so paid LLM tokens are never spent on a program that won't enter the archive. Other: * Intra card delta-distribution + mean_delta now formatted using primary metric's decimals from metrics.yaml (was rendering raw 16-sig-fig floats). * PopulationSnapshot.refresh: refetch programs in INCOMPLETE_STATES so QUEUED/RUNNING entries get up-to-date metrics on each snapshot. * fix(memory): pick OPENROUTER_API_KEY when LLM_BASE_URL targets OpenRouter Previously gigaevo.memory.config.OPENAI_API_KEY preferred $OPENAI_API_KEY over $OPENROUTER_API_KEY unconditionally. In intra_extra_memory smokes we export both — $OPENAI_API_KEY=sk-gigaevo (LiteLLM proxy) for the main Qwen pipeline and $OPENROUTER_API_KEY=sk-or-v1-... 
for the GAM/A-Mem cheap path (Gemini Flash via OpenRouter). The wrong-key-for-endpoint combination made every GAM research_agent and IdeaTracker LLM call 401-silently, killing the extra-memory channel without any pipeline error. Two-line fix: - config.OPENAI_API_KEY now resolves OpenRouter key first when LLM_BASE_URL contains "openrouter.ai" (e.g. settings.yaml default). - ideas_tracker.llm._init_clients picks the right key for the effective base_url (OPENROUTER_API_KEY for openrouter, OPENAI_API_KEY otherwise). Verified: with both keys exported and settings.yaml's OpenRouter base_url, client.api_key now starts with "sk-or-". With base_url set to the LiteLLM proxy, client.api_key falls back to "sk-gigaevo". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(suggester): rank-aware ambition rule in mutation_suggestions/system.txt Adds a sub-bullet under the Evolutionary Statistics input description that calibrates the suggester's parametric-vs-structural mix to the rank of the parent in the window (already reported as `rank X/Y in window`): * Top quartile -> at least one suggestion must be structurally orthogonal (different algorithm family, init scheme, or objective), not a parameter tweak. Parametric refinements alone are insufficient when the parent already tops its window. * Bottom half -> at least one structural change required; fragile/harmful tags take precedence over rigid-parameter tweaks. * Middle band -> mix exploitation with at least one orthogonal axis. Rationale: smoke #3 (cycle 1 at max_mutants=20) showed gen-3 105901c4 (rank=1/Y) receiving 5 of 6 suggestions tagged `rigid` (pure parameter tweaks), producing a plateau at 0.01142 (32.6% of 0.035). The breakthrough to 0.01885 (53.9%) came from a SIBLING program a35a0f72 whose suggester happened to find a structural harm (asymmetric_extra_points / symmetry restoration). The new rule makes that structural pivot a stable expectation at top-of-window, not an accident. 
Generic — uses only the existing rank-in-window signal already in EvolutionaryStatistics. No new fields, no new code, no heilbron-specific tokens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(stats): rank line dropout when focal missing from snapshot The iter-window rank computation in `_compute_iter_window_fields` called `sorted(fits).index(focal_fit)` to find the focal's rank. When the population snapshot lagged behind the pipeline view (or the snapshot contained a stale `is_valid=0` view of the focal), `focal_fit` was not in `fits`, the ValueError was swallowed silently, and `iter_window_rank` became None. Downstream renderer in `evolution/mutation/context.py:168` gates the rank line on `iter_window_rank is not None`, so the entire "rank X/Y in window" segment disappeared for top-of-window programs. The mutation_suggestions/system.txt rank-aware ambition rule relies on that text; with the rank line missing the rule was DORMANT throughout the cycle-2 run (struct counter stayed at 0 for 100 mutations). Verified on production program 4578cea1 from output/cycle2_rankambition_20260518_022450 (fit=0.01509 at iter=49): window valid=10, best in window=0.01455 — focal excluded, rank=None. Fix: - When focal is valid and not already in `valid_with_fit` (snapshot lag), include it explicitly using the up-to-date metrics passed by the pipeline. Downstream best/median/trend/valid_count then reflect reality. - Replace `sorted.index` with a count-based rank (better+1). Robust to tied fitness values, which previously got under-counted by `index`'s first-match semantics. 
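The count-based rank from the fix above can be sketched as follows. `rank_in_window` and its parameters are hypothetical names; the sketch assumes only that rank means one plus the number of window entries strictly better than the focal program.

```python
from typing import Iterable


def rank_in_window(focal_fit: float, window_fits: Iterable[float],
                   higher_is_better: bool = True) -> int:
    """Rank of the focal fitness: 1 + number of strictly better entries."""
    if higher_is_better:
        better = sum(1 for fit in window_fits if fit > focal_fit)
    else:
        better = sum(1 for fit in window_fits if fit < focal_fit)
    return better + 1


# The focal value need not be present in the snapshot list at all.
snapshot_fits = [0.01455, 0.0120, 0.0100]
print(rank_in_window(0.01509, snapshot_fits))  # 1
```

Unlike `sorted(fits).index(focal_fit)`, this never raises `ValueError` when the snapshot lags behind and the focal fitness is missing from the list.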
Tests: - test_iter_window_rank_when_focal_missing_from_snapshot (RED→GREEN) - test_iter_window_rank_when_focal_in_snapshot_but_stale_metrics (RED→GREEN) - existing test_iter_window_rank_none_when_focal_invalid still passes (invalid focal correctly yields rank=None) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(suggester): lineage-exhaustion override in rank-aware ambition Cycle-3 (rank rule LIVE) plateaued at 0.02614 (74.7%). Forensics: - Top-5 fitness: 4/5 are structural-pivot archetypes (Guided Innovation / Approach Synthesis). Mean fit by archetype: Guided Innovation 0.01735 vs Exploitation 0.01016. Structural pivots win. - But ~50% of all mutations chose Exploitation. Among the 6 programs whose parent's intra card flagged "all valid children regressed or failed" (lineage exhaustion), the mutator still chose Exploitation/Proven Pattern Extension in 3/6 cases — wasting budget re-tweaking failed clusters. The existing rank-aware rule says "at least one orthogonal-axis suggestion" — too soft when local gradient is empirically dead. New sub-bullet: when intra card shows ≥2 failed/regressed tried_strategy clusters (or delta distribution catastrophic+failed ≥ 2 with improving=0), EVERY suggestion must propose a structural axis NOT in tried_strategies. Parametric tweaks of failed clusters are explicitly rejected in this regime. Forces the suggester to leave exhausted local basins. Generic, no task-specific tokens. +11 lines in mutation_suggestions/system.txt under the rank-aware ambition block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(suggester): escape literal {} braces in lineage-exhaustion sub-bullet 673b9fb6 introduced `{regressed, failed}` as a literal phrase in the mutation_suggestions/system.txt template. 
The prompt loader passes this template through `str.format()` (factories.py:200), so `{regressed, failed}` was parsed as a placeholder named 'regressed, failed' — raising KeyError at every DAG build during cycle-4 startup. All 5 seed-eval DAGs failed, the engine spun in an idle "no parents" loop, and the process exited at t=07:53:04 with progs=5/scored=0/mut_done=0 — zero useful work done. Fix: escape the literal braces as `{{regressed, failed}}`. Verified via str.format() round-trip — only the three intended placeholders ({task_description}, {metrics_description}, {max_insights}) remain. Lesson: any literal `{` or `}` in `.txt` prompts that flow through .format() must be doubled. See feedback memory for hardening guide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(suggester): server-computed EXHAUSTION ALERT banner overrides soft LEX Cycle-4b shipped a soft "Lineage exhaustion override" sub-bullet in the mutation-suggester system prompt. Qwen-3-235B-A22B-Thinking-2507 ignored it: 3 parents in cycle-4b had ≥2 regressed/failed intra clusters yet their children received parametric refinements of already-tried clusters (plateau at 0.02630, +0.6% vs cycle-3 baseline 0.02614). Replace the soft text with a deterministic server-side banner prepended to the user message — most salient location, no LLM judgement on the trigger condition. Trigger (computed in MutationSuggestionAgent._format_exhaustion_block): - cond_a: ≥2 distinct clusters in {regressed, failed}, OR - cond_b: catastrophic + n_failed ≥ 2 AND improving = 0 When triggered, emit `## EXHAUSTION ALERT — strict structural-pivot mode` header + OVERRIDES sentence + explicit AVOID-LIST of negative-verdict clusters + full tried-strategies context + `---` separator. The system prompt now references the banner as a HARD CONSTRAINT that overrides the rank-aware ambition mix. Banner is task-agnostic (no heilbron/triangle leak — covered by test). 
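As an illustration of the trigger logic only, the two conditions might be computed as below. `IntraCardSummary` and its fields are hypothetical stand-ins for whatever the intra card actually exposes, and `cond_b` is read here as "catastrophic count plus n_failed is at least 2, with no improving children".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

NEGATIVE_VERDICTS = frozenset({"regressed", "failed"})


@dataclass
class IntraCardSummary:
    """Hypothetical view of the fields the trigger needs."""
    cluster_verdicts: Dict[str, str] = field(default_factory=dict)
    catastrophic: int = 0
    n_failed: int = 0
    improving: int = 0


def exhaustion_check(card: IntraCardSummary) -> Tuple[bool, List[str]]:
    """Return (triggered, avoid_list) without any LLM judgement."""
    avoid_list = sorted(
        name for name, verdict in card.cluster_verdicts.items()
        if verdict in NEGATIVE_VERDICTS
    )
    cond_a = len(avoid_list) >= 2
    cond_b = (card.catastrophic + card.n_failed) >= 2 and card.improving == 0
    return (cond_a or cond_b), avoid_list
```

Because the decision is plain arithmetic over the card, the banner appears or not deterministically, which is the property the commit relies on.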
Tests: 13 new in tests/llm/test_mutation_suggestion_exhaustion.py cover empty intra, single-cluster non-triggers, cond_a/cond_b paths, mixed verdicts, override/AVOID-LIST language, task-agnosticism, and the trailing separator. All pass. Lint clean. No new regressions in tests/llm/ (371 pass) or tests/stages/ (938 pass; pre-existing 3 failures unrelated).
Pure context-building change: schema unchanged, pipeline unchanged, launch command unchanged. Stays within the 0.035-sprint allowed-knobs envelope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(suggester): revert rank+LEX to 9cca4344 baseline for cycle-6 A/B
Drops 24 lines from mutation_suggestions/system.txt: the rank-aware ambition sub-bullet (commit 4caeb1b9) and the lineage-exhaustion banner clause (commit 0ebd405c built on 73ed1207's brace-fix of 673b9fb6). Net effect: the suggester prompt is now identical to its 9cca4344 state.
Empirical motivation:

| Run | Best fitness | system.txt |
|------------------------------|--------------|-------------------|
| sprint cycle-2 (2026-05-17) | 0.0315 | pre-prescriptive |
| cycle3-from-scratch (uncomm) | ~0.030 | 9cca4344 baseline |
| cycle-3 today (rank rule) | 0.02614 | +13 rank |
| cycle-4b today (LEX soft) | 0.02630 | +24 rank + LEX |
| cycle-5 today (LEX hard) | 0.02536 | +24 rank + banner |

Today's three runs cluster at 0.025-0.026 (~17% below the 9cca4344 baseline). The collector.py rank-line bugfix + EXHAUSTION ALERT formatter remain in place; only the LLM-facing prompt content is reverted. The formatter just becomes dormant since the prompt no longer references its output. Cycle-6 will A/B this against the cycle-5 state to confirm whether the +24 lines of guidance were net-destructive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(stats): R2 — MAD-based trend noise floor + archive_valid_fitnesses field Replace legacy 5%·|t1| trend threshold in `_trend_from_thirds` with a nonparametric MAD (median absolute deviation) over the recent valid-fitness window. The fixed 5% ratio reads as "flat" on low-fitness regimes where real regressions are present at sub-5% absolute magnitude — the cycle-6 audit showed parent contexts with medians falling 0.00165 → 0.00082 (clear regression) still labelled `flat` and feeding the consumer's flat-trend condition into Exploitation. MAD adapts to the run's empirical noise scale: no chosen constant. Bootstrap fallback to legacy 5%·|t1| ratio when fewer than `N_MIN_FOR_MAD=4` valid samples in the window — pre-existing framework behaviour preserved during the initial iterations. Also exposes `archive_valid_fitnesses: tuple[float, ...]` as a transient field on `EvolutionaryStatistics` (not persisted; rebuilt per emission). This is the source-of-truth distribution that R1 (archive-quartile regime) will read in a follow-up commit. Constants introduced are all data-availability gates, not regime thresholds: - `N_MIN_FOR_MAD = 4` — minimum sample size for MAD to be meaningful - `_TREND_EPSILON = 1e-12` — numerical safety against degenerate MAD=0 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(context): R1+R3 — archive-quartile regime in mutation_context render Adds two new tokens to every rendered parent context once the archive holds ≥ `N_MIN_ARCHIVE=4` valid programs: Archive: N=N median=… q75=… best=… Regime: BAD/MIDDLE/GOOD (Q? of archive) And appends `archive-quartile Q?` inline to the existing rank line: rank 2/8 in window, archive-quartile Q1) Both signals are derived from the same archive distribution emitted by R2's `archive_valid_fitnesses` field on `EvolutionaryStatistics`. 
Quartile boundaries are universal statistical convention (Q1=25%, Q2=50%, Q3=75%) — not chosen thresholds. The mapping Q1→BAD, Q2/Q3→MIDDLE, Q4→GOOD is the framework's editorial choice with no numeric constants. R1 v2 design properties: - No dependency on `MetricSpec.upper_bound` — regime is derived from the run's empirical archive distribution, so the bundle is task-agnostic. Tasks declaring `upper_bound` additionally get an informational `Target: … focal_gap=…` line; R6's archetype gate does NOT read it. - Direction-aware via `MetricSpec.higher_is_better` — works identically for loss-style metrics where small = good. - Bootstrap-safe: no token emitted when archive < 4 valid; rule falls back to original Step-6 logic. - O(N log N) per render on archive size bounded by `max_mutants=100`. R3 reuses R1's `quartile_str` so there is a single source of truth and the rank line cannot drift from the Regime line. Tests: 10 new `TestArchiveQuartileRegime` cases cover Q1/Q2/Q3/Q4 placement, archive < 4 (no regime emission), `higher_is_better=False` direction, archive-quartile inclusion in rank line, ties at quartile boundaries, target decoration on/off. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(prompts): R6 — archive-quartile archetype gate + suggester tag-bias Two-layer defense against wasted-budget Exploitation on weak parents. mutation/system.txt (consumer): - Adds a Selection Rule that makes `Regime: BAD` (focal in Q1 of archive) a HARD GATE: Exploitation archetypes (1-3) FORBIDDEN. Choose Exploration (4-6) or, if intra has an "improved" verdict with an untried extension, Hybrid (7-8). MIDDLE (Q2/Q3) gates Exploitation on intra-improved + untried-extension; otherwise prefer Hybrid. GOOD (Q4) opens all archetypes per other rules. Bootstrap (Regime line absent) falls back to original logic. - Adds an Evolutionary Statistics descriptive paragraph explaining the new `Archive:` and `Regime:` lines so the LLM knows the rendered tokens. 
- Trend label vocab synced to code emit: `rising / flat / falling` (the legacy `improving / regressing` words drifted from collector.py:126 and broke the consumer's match logic). mutation_suggestions/system.txt (producer): - Adds an Archive-quartile awareness rule: in BAD regime (Q1) do NOT tag patterns as `beneficial` based only on local intra-card "improved" verdicts. Prefer `fragile` or `rigid`. Reserve `beneficial` for MIDDLE (Q2/Q3) or GOOD (Q4) regimes. - Disambiguates earlier informal "low-fitness regime" wording (which collided with the formal `Regime:` tag) — the metric-scale heuristic is now explicitly called out as SEPARATE from the formal Regime tag. - Trend vocab synced to `rising / flat / falling`. Defense-in-depth: the producer suppresses `beneficial` tag at source for Q1 parents; the consumer additionally forbids the Exploitation archetype the tag would have biased toward. The two rules layer — they don't duplicate. If the suggester slips and emits `beneficial`, the mutator's hard gate still routes the mutation to Exploration. Tests: TestR6ArchiveQuartileGate (consumer) + TestR6SuggesterTagBias (producer) + TestRegimeAndQuartileVocabularySynergy (cross-prompt vocab consistency for tag, verdict, quartile, regime scales). 
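The R6 Selection Rule described above can be restated as a small lookup. This is only a sketch of the rule as the commit message words it, not code from the repository: the function names, the archetype-number tuples, and the `improved_with_untried_extension` flag are hypothetical, and the real gate lives in prompt text that the LLM interprets.

```python
from typing import Optional, Tuple

# Archetype groups as numbered in the commit message (hypothetical constants).
EXPLOITATION: Tuple[int, ...] = (1, 2, 3)
EXPLORATION: Tuple[int, ...] = (4, 5, 6)
HYBRID: Tuple[int, ...] = (7, 8)


def regime_from_quartile(quartile: int) -> str:
    """Map the focal's archive quartile (1..4) to the R1 regime label."""
    mapping = {1: "BAD", 2: "MIDDLE", 3: "MIDDLE", 4: "GOOD"}
    if quartile not in mapping:
        raise ValueError(f"quartile must be 1..4, got {quartile}")
    return mapping[quartile]


def eligible_archetypes(
    regime: Optional[str], improved_with_untried_extension: bool
) -> Tuple[int, ...]:
    """Return the archetype numbers the R6 gate leaves open.

    regime=None models the bootstrap case (no Regime line rendered); the
    original Step-6 logic is not modelled here, so nothing is excluded.
    """
    if regime is None or regime == "GOOD":
        return EXPLOITATION + EXPLORATION + HYBRID
    if regime == "BAD":
        # Hard gate: Exploitation is forbidden; Hybrid needs an "improved"
        # verdict with an untried extension.
        return EXPLORATION + (HYBRID if improved_with_untried_extension else ())
    if regime == "MIDDLE":
        # Exploitation is gated on the same condition. The "prefer Hybrid"
        # wording is a preference, which this set-based sketch does not rank.
        if improved_with_untried_extension:
            return EXPLOITATION + EXPLORATION + HYBRID
        return EXPLORATION + HYBRID
    raise ValueError(f"unknown regime: {regime!r}")
```

For example, `eligible_archetypes(regime_from_quartile(1), False)` leaves only the Exploration archetypes 4-6 open.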
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(tools): trajectory_shape.py — log-based closeout analyzer for cycle comparisons

Computes the 6 trajectory metrics from plan §Verification on any output/cycle*_*/evolution_*.log:
- best_at_end (frontier final)
- monotonicity_pct (cohort-mean signal, NOT running-max which would always be 100%)
- per_stage_best (early/mid/late thirds)
- longest_stagnation_min (gap between consecutive frontier-bumps)
- rtail_ge_020 / rtail_ge_030 (right-tail mass — # cells reached past 0.020 / 0.030)
- cells_filled

Two modes:
  python tools/trajectory_shape.py <log_file>                     # single report
  python tools/trajectory_shape.py --compare a.log b.log c.log    # variance-floor verdict

Variance-floor rule (1.5×spread): N≥3 → mean(baselines)+1.5×spread is the bar; treatment > bar → CONFIRMED, else NOISE.

Works on any log file regardless of Redis state — logs are permanent, Redis dbs get flushed. Built during cycle-7's runtime, smoke-tested retroactively on cycles 3/4b/5/6 to establish n=4 baseline (mean=0.02596, spread=0.00094, zero breakouts past 0.030).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(context): R7+R8 v3.1 — archive distribution with worst/median/best + archive-percentile token

Renders the archive's distribution as `Archive: N=… worst=… median=… best=…` plus an `archive-percentile pXX of N=Y` annotation on the existing rank line. Both lines are direction-aware via `MetricSpec.higher_is_better`. No `Regime:` or `Target:` token is rendered — the LLM reads the task's target from the task description and judges the focal against the rendered distribution itself.

Why v3.1:
- v3's `Regime: BAD/MIDDLE/GOOD` was a pre-baked classifier. Trust-the-model-synthesis principle: render data, let the LLM judge.
- v3's `Target:`/`focal_gap` was likewise pre-baked.
The task description already states the target; rendering it twice (and adding a derived `focal_gap`) introduces a hardcoded interpretation channel the LLM doesn't need. - Only deterministic gate kept: archive-percentile (a single direction-aware quality percentile, 100=best). Quartile boundaries 25/75 are statistical convention, not magic. Bootstrap-mislead defense: the rendered `worst=… median=… best=…` triplet makes archive compression visible. A compressed bootstrap (N=7, all <0.002) lands the focal at p100 but the distribution itself shows the LLM the archive is far from the task's stated target. The qualitative target-awareness clause in the prompt instructs the model to apply that judgment. Tests: - test_archive_line_includes_worst_higher_is_better_true - test_archive_line_worst_inverts_for_higher_is_better_false - test_compressed_bootstrap_renders_rich_archive_no_target_line - test_target_line_never_rendered_when_upper_bound_declared - test_target_line_never_rendered_when_upper_bound_none Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(prompts): R9 v3.1 — archive-percentile gate + qualitative target awareness Mutator and suggester prompts now reference the v3.1 context surface: `Archive: …` and `archive-percentile pXX of N=Y`. The only deterministic gate is the archive-percentile gate (focal in bottom quartile → Exploitation FORBIDDEN; focal in top quartile → all archetypes eligible). Quartile boundaries 25/75 are statistical convention, not magic numbers. Target awareness is qualitative: the task description states the problem's target/bound; the prompt instructs the LLM to compare the rendered `worst=… median=… best=…` distribution against that target and apply judgment. No numeric threshold is imposed because fitness scale is typically non-linear — small absolute distances at low fitness are structurally harder than the same absolute distance at high fitness. 
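A minimal sketch of a direction-aware archive percentile and the 25/75 gate follows. The commits reference a helper named `_archive_percentile_of_focal` but do not show its formula, so the definition here (the share of archive values the focal is at least as good as) is an assumption, as are the function names and the returned labels. `N_MIN_ARCHIVE = 4` is the data-availability gate quoted in the commits.

```python
from typing import Optional, Sequence

N_MIN_ARCHIVE = 4  # minimum valid programs before any archive token is rendered


def archive_percentile(
    focal: float, archive: Sequence[float], higher_is_better: bool = True
) -> Optional[float]:
    """Quality percentile of the focal within the archive (100 = best).

    Assumed definition: percentage of archive entries the focal is at least
    as good as. Returns None during bootstrap (archive too small).
    """
    if len(archive) < N_MIN_ARCHIVE:
        return None
    if higher_is_better:
        not_better = sum(1 for value in archive if value <= focal)
    else:
        # Loss-style metric: smaller is better, so the comparison flips.
        not_better = sum(1 for value in archive if value >= focal)
    return 100.0 * not_better / len(archive)


def percentile_gate(percentile: Optional[float]) -> str:
    """Apply the quartile boundaries 25/75 named in the commit message."""
    if percentile is None:
        return "bootstrap"
    if percentile < 25:
        return "exploitation_forbidden"
    if percentile >= 75:
        return "all_archetypes_eligible"
    return "middle_band"
```

Under this definition the best program of a compressed seven-entry archive lands at p100, which matches the bootstrap-mislead case the commit describes. The percentile alone says nothing about distance to the task's target; that judgment is left to the `worst=… median=… best=…` line.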
Removed from previous v3:
- `Regime: BAD/MIDDLE/GOOD` pre-baked classifier (replaced by archive-percentile + prose)
- `Target:`/`focal_gap` rendered tokens (LLM reads target from task description)
- `half the distance` magic-constant compound rule

Removed in this v3.1 cleanup pass:
- Legacy "framework does NOT render a separate `Target:` line" mentions — negating-by-mention is noise; prompts now positively instruct reading the target from the task description.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(v3.1): archive-percentile gate, archive distribution, no Target/Regime tokens

test_mutation_context.py:
- test_target_line_never_rendered_when_upper_bound_declared: asserts `Target:` and `focal_gap` are absent even when MetricSpec.upper_bound is set.
- test_target_line_never_rendered_when_upper_bound_none: parallel assertion for tasks without declared upper bound.
- test_compressed_bootstrap_renders_rich_archive_no_target_line: documents the bootstrap-mislead defense — N=7 compressed archive with focal at p100 still renders worst/median/best so the LLM can judge the gap against the task target itself.
- test_archive_line_includes_worst_higher_is_better_true: asserts `worst=…` and `best=…` are rendered with the strongest at `best` for fitness-style metrics.
- test_archive_line_worst_inverts_for_higher_is_better_false: parallel for loss-style metrics (worst = highest value, best = lowest).

test_prompts.py (TestV31* replacing TestR6*):
- TestV31ArchivePercentileGate: archive-percentile referenced, no Regime/archive-quartile/Target/focal_gap/half-distance vocab, FORBIDDEN keyword on Q1 Exploitation, 25/75 cited, qualitative target awareness via task description, non-linear scale acknowledged, trend vocab matches collector (rising/flat/falling).
- TestV31SuggesterTagBias: parallel for mutation_suggestions prompt.
- TestV31VocabularySynergy: cross-prompt consistency for tag scale, verdict scale, quartile boundaries (25, 75), archive distribution vocab (worst/median/best). 364 tests pass. Lint clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(audit): v3.1 mutation decision tree — channels, gate, decision table, worked examples Spells out exactly how the LLM mutator selects an archetype (Exploitation 1–3 / Exploration 4–6 / Hybrid 7–8) under the v3.1 surface: 1. Six context channels (C1 Metrics, C2 Insights, C3 Intra Memory, C4 Memory Cards, C5 Evolutionary Statistics, C6 Ancestral Momentum) — producer / carrier / consumer. 2. Two decision components: one deterministic gate (archive-percentile, only constants are quartile boundaries 25/75) and one qualitative target-awareness clause (LLM reads task description, judges against rendered Archive distribution; no numeric threshold because fitness scale is non-linear). 3. Exhaustive 18-row decision table covering (archive-percentile bucket × intra verdict × trend × invalid streak) → archetype. 4. Four worked heilbron examples: bootstrap-mislead p100 case (Hybrid 7 override), normal mid-run (Hybrid), late-run refinement (Exploitation 1), plateau exit (Exploration). 5. Cycle-9 mid-run invariants: 0% Exploitation on archive-percentile<25 focals; no Regime/Target/archive-quartile tokens; archive-percentile rendered ≥95% post-bootstrap. 6. Universal-across-tasks proof: only per-task input is higher_is_better flag; 25/75 are statistical-convention quartile boundaries, not chosen values. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tools(v31-validator): read-only sampler — recompute archive-percentile + decision-tree predictions vs LLM choices Non-mutating Redis sampler that walks DBs 13/14/15 program by program, parses `Program.metadata.mutation_context` (the rendered prompt) for the parent's state (focal fitness, valid sibling fitnesses, trend, intra verdicts), and recomputes the v3.1 archive-percentile from valid siblings. Emits JSONL with fields: tree_bucket, tree_eligible_archetypes, archetype_chosen, fitness_delta, cf_tags (which of CF-A..CF-E cells the sample lands in), match (tree-prediction vs LLM-choice). Reuses: - gigaevo.programs.stages.collector.N_MIN_ARCHIVE - gigaevo.evolution.mutation.context._archive_percentile_of_focal (direction-aware) - gigaevo.database.redis_program_storage RedisProgramStorage.get_all - gigaevo.programs.program.Program.get_metadata (base64 deserialization) Output drives docs/audits/MUTATION_DECISION_TREE_V3_1_COUNTERFACTUAL_AUDIT_2026-05-18.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(audit): v3.1 decision tree — counterfactual audit on 289 prior-run programs (DBs 13/14/15) Recomputes v3.1 archive-percentile and tree-predicted archetype bucket on every program with mutation_context metadata across DBs 13/14/15 (n=289), compares to the LLM's actual archetype_choice and the child's fitness_delta, then groups by the 5 candidate counterfactual cells identified in the audit plan: CF-A empty intra (first child) n=225 — modal Hybrid 7 (57), pos-rate 42.1% CF-B invalid focal with archive line n=0 — unreachable under OLD prompt; documented as gap CF-C low-N flat trend (noise-dominated) n=32 CF-D middle-band percentile + falling trend n=19 CF-E top-quartile + spread wide + far-target n=53 — pos-rate 92.5%, Exploitation 32/35 = 91% wins Headline finding: v3.1's target-awareness override demoting top-quartile parents to Hybrid is empirically TOO RESTRICTIVE. 
CF-E shows exploitation beats hybrid 32/35 when the gate permits both. 18 counterfactual-A samples — gate-violations that improved fitness anyway. Recommendations applied in next 2 commits: - REV-1 soften row 13 (target-awareness no longer forbids Exploitation) - REV-3 add row 19 — empty intra middle-band → Hybrid 7 default - REV-4 add row 19a — invalid focal → Exploration with corrective - REV-2 (row 11 softening for CF-D) documented as PROPOSAL — n=19 too thin Observational only — contexts rendered under OLD prompt surface; the sampler recomputes v3.1 tokens from the same underlying numerics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(audit): v3.1 tree — soften row 13 target-awareness + add rows 19 / 19a per counterfactual audit Three additive/softening edits to MUTATION_DECISION_TREE_V3_1_2026-05-18.md, all driven by empirical findings in the counterfactual audit (MUTATION_DECISION_TREE_V3_1_COUNTERFACTUAL_AUDIT_2026-05-18.md): REV-1 — row 13 (top-quartile + spread wide + far-target): OLD: "demote to Hybrid 7 even though gate permits Exploitation" NEW: "prefer Hybrid 7 OR Exploitation 1 — gate does NOT forbid Exploitation. Choose by C2 insight severity and C6 ancestral_step_delta." Evidence: CF-E n=53, pos-rate 92.5%; Exploitation wins 32/35 in this cell. REV-3 — new row 19 (empty intra, middle-band 25≤p<75): Add explicit default: Hybrid 7 (Guided Innovation). Evidence: CF-A n=225, modal choice Hybrid 7 (57 picks), pos-rate 42.1%. Closes a gap where the tree relied on the LLM inferring a default. REV-4 — new row 19a (invalid focal with archive line): Defensive rule — force Exploration with corrective mechanism regardless of archive-percentile (focal cannot be refined when it is invalid). Evidence: CF-B was unreachable in prior data; rule is forward-looking. REV-2 (row 11 middle-band+falling softening) NOT applied — CF-D n=19 too thin; documented in audit as proposal pending more data. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(prompts): soften target-awareness clause + add noise-dominated & empty-intra rules Mirrors REV-1/REV-3 from the decision-tree counterfactual audit into the two system prompts the LLM actually sees. mutation/system.txt: - Softened target-awareness paragraph: target awareness shapes priority WITHIN the gate-permitted set; it never forbids an archetype the gate allows. When archive.best far below target AND focal in top quartile, BOTH Hybrid 7 and Exploitation 1 are legitimate. - Added noise-dominated-trends bullet: falling/flat with `[only N valid — too few for trend]` (or iter_window_valid < 9) is inconclusive — do NOT force Exploration on it alone. - Added empty-intra default bullet: first child of a fresh parent in middle band defaults to Hybrid 7 (Guided Innovation). mutation_suggestions/system.txt: - Softened target-awareness paragraph (same wording philosophy). - Added noise-dominated-trends bullet so the analyst does not raise severity to `high` on a noisy signal alone. Total: ≤10 lines net per file, all additive or replacement softening. Empirical justification: CF-E audit shows Exploitation wins 32/35 when the gate permits both; CF-A shows Hybrid 7 is the modal LLM choice (57/225) and best per-pick pos-rate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tools(pseudo-evo-bench): single-shot mutation A/B harness with archetype-distribution analysis Tests prompt-wrapper changes (mutation/system.txt + user.txt) on a fixed cohort of 50 stratified parents from DBs 13/14/15. Each parent's mutation_context blob is frozen from its original search-time run; only the system.txt/user.txt wrappers are re-rendered from the working tree at sample time. 
Components: - sample_parents.py: stratify 50 parents (17/18/15 by fitness bucket), render current HEAD prompts, write parents.json (idempotent under SEED=20260518) - run_qwen.py: 1 LLM call per parent at concurrency=6 via LiteLLM proxy - eval_mutants.py + eval_runner.py: parallel heilbron validate() at concurrency=4 - analyze.py: PRIMARY signal is archetype/strategy distribution (coverage, entropy, group balance, v3.1 gate compliance, archetype shift matrix). Fitness delta is reported as SECONDARY since single-shot is noise-dominated by parent quality and validity. Scope (per README): tests prompt-wrapper changes only. Stage-internal mutation_context build (collector, intra/extra memory, lineage) is NOT exercised because contexts are pre-rendered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tools(pseudo-evo-bench): iter0 (HEAD) vs iter1 (pre-softening v3.1) — archetype-distribution A/B Same 50 parents (verified parent_id-identical), frozen mutation_context blocks (verified byte-identical per parent). Only difference: rendered system_prompt (iter0 = 14,375 chars / HEAD with +14-line softening; iter1 = 13,415 chars at commit 11eb4d4b before softening). 
PRIMARY: archetype/strategy distribution
=========================================
                          iter0 (HEAD softened)    iter1 (v3.1 sharp)
Coverage                  7/8 archetypes           7/8 archetypes
Entropy                   2.583 bits               2.565 bits
Group balance E/X/H       28% / 41% / 30%          20% / 52% / 28%
Group skew (max-min)      13.0% ← fairer           32.6%
Low anti-Exploit          87.5% (n=16)             93.8% (n=16)
High Exploit rate         50.0%                    42.9%

Per-bucket group shift (where softening actually changed behavior):
  low:  ~unchanged (gate respected in both)
  mid:  iter0 56% Hybrid → iter1 50% Explore (softening pushed mid TOWARD Hybrid)
  high: iter0 14% Hybrid → iter1 36% Hybrid (softening pushed high TOWARD Explore)

Archetype shift matrix: 16/45 decisive pairs (36%) cross GROUP boundary between iter0 and iter1 — the +14 lines DO steer the LLM, the question is whether the steering is desirable.

Common finding across BOTH iterations:
  archetype #8 "Conservative Exploration" picked ZERO times (0/92 mutations) → strong signal the prompt does NOT surface this archetype effectively
  archetype #7 "Guided Innovation" dominates Hybrid (14/14 + 13/13)

SECONDARY: fitness (noise-dominated, but directionally informative)
==================================================================
Sign test on paired Δ: iter1 wins 20, iter0 wins 7, ties 23 → p≈0.012
Validity: iter0 26/50, iter1 32/50

Interpretation: HEAD softening trades 6pp of single-shot fitness loss for group-balance fairness. Neither prompt surfaces archetype #8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* prompts(mutation/system.txt): rewrite archetype #8 as Component Substitution + sharpen #5 separator from #4

Empirical motivation: pseudo-evo bench iter0+iter1 picked archetype #8 (Conservative Exploration) ZERO times across 92 valid responses. Diagnosis — "explore within structural / interface constraints" describes properties any valid mutation already has, not an edit pattern the model can operationalise.
Archetype #8 → Component Substitution Replace ONE named subroutine or building block (scoring function, sampler, init scheme, distance metric, update rule, post-processor) with an alternative of the same kind, occupying the same slot — same inputs, same output shape — so surrounding control flow, interfaces, and hyper-parameters remain unchanged. Distinct from #4 (changes algorithm family) and from #7 (adds a component alongside an existing one rather than replacing). Archetype #5 → sharpened separator from #4 "Change the SET of admissible solutions (relax/tighten a constraint, drop a parity or symmetry rule, allow rotations, switch from discrete to continuous parameterisation) without changing the search algorithm itself. #4 changes HOW the search runs; #5 changes WHAT set is searched." Net effect — eight distinct edit verbs: Exploit: tune (#1) / extend-scope (#2) / remove (#3) Explore: reinvent-algorithm (#4) / change-feasibility-set (#5) / synthesise-from-memory (#6) Hybrid: add-alongside (#7) / substitute-component (#8) Raises realised entropy ceiling above today's log2(5)≈2.32 bits toward the 3.0-bit max for 8 archetypes. Pseudo-evo iter2 will verify the model actually picks #8 and discriminates #5 from #4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * schema(mutation): enforce archetype names via Literal + add drift-detection tests Changes: 1. Added ARCHETYPE_NAMES constant and ArchetypeName Literal type to constants.py (single source of truth for canonical names) 2. Updated MutationStructuredOutput schema to use ArchetypeName Literal (strict validation, rejects unknown archetype strings at parse time) 3. 
Added 4 new test functions to test_mutator_system_prompt.py: - test_archetype_names_appear_in_system_prompt() — catches drift between schema Literal set and prompt's archetype menu - test_archetype_count_is_eight() — asserts len(ARCHETYPE_NAMES)==8 - test_mutation_output_accepts_canonical_archetypes() — validates each canonical name - test_mutation_output_rejects_unknown_archetype() — rejects out-of-set strings with ValidationError 4. Fixed test_defaults in test_mutation_agent.py to use canonical archetype "Precision Optimization" instead of invalid "test" Motivation: Prevent silent LLM output rejection when system.txt and schema drift (e.g., if archetype #8 is rewritten in prompt but not updated in Literal). All 104 tests pass. No changes to mutation/system.txt (archetype #5/#8 redesign committed separately 2026-05-18 at 7a52f45d). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(auto-optimize-loop): spec + reference schemas + history/patterns scaffolds Adds the auto-optimize-loop task spec that drives autonomous cycles tuning ONLY the mutation operator's context factory graph and the archetype framework. Primary success criterion is healthy trajectory + healthy mutants; 0.03 is the "real improvement" floor with 0.035 aspirational. All future auto-loop cycle commits land linearly on r7-r8-r9-v3-bundle and are identified by their commit SHA captured at LAUNCH time. Each cycle writes a Reconstruction MD and Analytics MD; the Analytics MD's retroactive invariant audit hard-gates the next cycle's PROPOSE step. 
Files: - docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md (primary spec) - docs/audits/references/AUTO_OPTIMIZE_RECONSTRUCTION_MD_SCHEMA.md - docs/audits/references/AUTO_OPTIMIZE_ANALYTICS_MD_SCHEMA.md - docs/audits/AUTO_OPTIMIZE_CYCLE_HISTORY.md (append-only ledger) - docs/audits/AUTO_OPTIMIZE_PATTERNS.md (evidence ledger) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: gitignore output/runs/tool-caches + capture pre-loop audit MDs - Add .gitignore rules for output/ runs/ problems/heilbron_pro/ rotated litellm sampler logs, and throwaway tools/ subdirs (benchmark_gemini_ab, insights_ablation, lineage_card_scaffold). - Capture pre-loop audit MDs under docs/audits/ that informed the cycle-9 redesign + auto-optimize-loop spec (insights, lineage memory plans, mutation guidance rubric, cycle-8 prelaunch, prompt redesign, etc.). These are read-only history; future PRs cite them. - Collapse multi-line attach_inputs({...}) calls in tests/stages/test_intra_memory_cache.py to single-line form (pure formatting, no behavior change). - Add docs/audits/AUTO_OPTIMIZE_CYCLE_0_ANALYTICS.md (cycle-0 = cycle-9 baseline) pre-drafted at T+1h17m with <TBD-FINALIZE> markers; end-of-run values will be filled in once PID 2008891 exits. This is a non-loop chore commit. It becomes cycle-1's PARENT_SHA so the §8.1 invariant git rev-list --count \$PARENT_SHA..HEAD == 1 will be satisfiable when cycle-1's single IMPLEMENT commit lands on top. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: ruff fix + format on tools/pseudo_evo_bench (pre-loop) Applies `ruff check --fix` + `ruff format` to the pseudo_evo_bench A/B harness scripts. All changes are cosmetic: - 11 I001 import-order errors fixed (stdlib imports merged into alphabetic order with site-package imports). - Long json.dumps(...) and dict literals reflowed by the formatter. These files have been failing `ruff check .` since they landed (commits 893113c1 / a9b4bee5). 
The §8.1 pre-launch lint invariant in the auto-optimize loop spec requires a clean `ruff check .` + `ruff format --check .`, so this clean-up is a prerequisite for cycle-1 launch.

Safety:
- pseudo_evo_bench is NOT imported by gigaevo/ or tests/ — grep across both trees returns zero hits. The currently-running cycle-9 (PID 2008891) does not touch these files.
- Only formatting changes; no semantic edits, no API surface change.
- `pytest tests/prompts/test_mutator_system_prompt.py` continues to pass (archetype-drift detection).

Verified:
  ruff check . → All checks passed!
  ruff format --check . → 1172 files already formatted

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(auto-optimize-loop): finalize cycle-0 Analytics + cycle-1 PROPOSE + scope-expansion note

Finalizes cycle-0 (= pre-loop cycle-9 archetype redesign + R-bundle) baseline analytics against the full evolution log (20043 lines, exit at T+1h55m41s).

- AUTO_OPTIMIZE_CYCLE_0_ANALYTICS.md: substitute all TBD-FINALIZE markers with extracted numbers. Final outcome: HEALTHY-NEUTRAL (S2.1 trajectory PASS narrow; S2.2 fitness FAIL with best_fitness=0.02620 < 0.03 floor, below baseline 0.02788). Trajectory has 6 strictly-increasing best-fitness peaks with one mid-run plateau of ~46 mutants strict (peaks #4->#5) followed by fast late rescue (peak #6). Strict stagnation_interval_max=46 NARROW-FAILS <=40 gate; inclusive frontier-event defn PASSES at <=5 mutants. valid_rate=52%, frontier_new_cell_events=42, right_tail_mass=42.3%, per_parent_advance_rate=6% strict / 10% inclusive. Component Substitution (new archetype #8) NOT dead - rose 0->18 picks (6%) by run end; vindicates the feedback_archetype_distribution_not_a_goal user reframe.
- AUTO_OPTIMIZE_CYCLE_HISTORY.md: cycle-0 row now shows HEALTHY-NEUTRAL decision, best_fitness=0.02620, S2.1 PASS narrow, S2.2 FAIL.
- AUTO_OPTIMIZE_PATTERNS.md: cycle-0 entry as NEUTRAL ceiling evidence; documents surface scope (R-bundle), numbers (S2.1 components), caveats (n=1 variance not yet measured; below baseline by 6% within plausible n=1 variance).
- AUTO_OPTIMIZE_CYCLE_1_PROPOSE.md: cycle-1 = variance-floor replicate of baseline (NO EDIT per S7). Decision rationale cites feedback_variance_floor_first + feedback_consistent_improvement_all_stages + feedback_auto_optimize_trajectory_first. S4 citations rewritten as trajectory-shape-only signals (plateau duration, per_parent_advance_rate, stagnation_interval_max) per feedback_archetype_distribution_not_a_goal. Updated parent-SHA reference to current operational HEAD.
- AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md: S3 prelude blockquote captures the 2026-05-19 user verbal directive expanding cycle-3+ scope to the entire mutation context harness (feedback_mutation_context_harness_in_scope). S3.2 SLIGHTLY cap on mutation/system.txt preserved as engineering constraint only.

Non-cycle docs commit - the cycle-1 commit will follow as a separate --allow-empty commit per the variance-floor protocol.

* auto-loop cycle 1: variance-floor replicate of baseline (no edit)

* auto-loop meta cycle 1: variance-floor replicate (HEALTHY-NEUTRAL)

Cycle-1 ran 2026-05-19 02:33→04:53 MSK on db=12, identical config to cycle-0 baseline (a4925a90). Cycle commit a527b256 is the --allow-empty IMPLEMENT SHA; zero code/prompt/config diff.

Result: HEALTHY-NEUTRAL (variance-floor; informational §2.2 PASS).
- best_fitness = 0.031187 (cycle-0 was 0.02620; Δ=+0.00499)
- |Δ| < §7 variance threshold 0.01310 → within variance floor
- §2.1 trajectory: PASS all 5 gates strict (frontier_new_cell 52, right_tail_mass 57.4%, advance_rate 6% strict / 12% inclusive, stagnation_interval_max 14 strict within-active, valid_rate 64%)
- §2.2 fitness floor: PASS (0.031187 ≥ 0.03) — INFORMATIONAL flip vs cycle-0 (which failed by 0.004); NO-EDIT cycles cannot be WIN-CANDIDATE by spec.
- Trajectory shape: rapidly ascending for first 29 mutants (6 peaks compressed), then 70-mutant trailing plateau. INVERSE of cycle-0's mid-run plateau + late rescue. Two equally consistent interpretations (A: baseline mean ~0.029±0.003, B: cycle-1 high-side outlier). Cycle-2 (db=13) will disambiguate. If cycle-2 lands within Δ=0.006 of either prior cycle, baseline mean ≈ midpoint(0.026, 0.031, cycle-2); loop may be near a structural Heilbron ceiling per feedback_auto_optimize_trajectory_first ("task may be unsolvable in knob scope — that's valid"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(auto-optimize-loop): cycle-2 PROPOSE — variance-floor replicate #2 (db=13) Continuation of §7 variance-floor methodology. Cycle-2 is NO-EDIT, db=13 only delta vs cycle-0/1. Adds the third sample point to lock baseline mean + std before cycle-3 first real intervention. After cycle-1, n=2: best=[0.02620, 0.03119], midpoint 0.02870, sample std 0.00353, §7 variance threshold 0.01435 (50% of midpoint). Cycle-2 decision tree: - best ∈ [0.026, 0.034] AND §2.1 PASS → cycle-3 PROPOSE proceeds - best outside [0.020, 0.040] OR §2.1 FAIL → §7 STOP - best in marginal bands → optional cycle-2.5 NO-EDIT before cycle-3 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * auto-loop cycle 2: variance-floor replicate of baseline (no edit) Per §7 variance-floor methodology. Cycle-2 = second NO-EDIT replicate on db=13. Cycle identity SHA = this commit's HEAD. PARENT_SHA = previous commit (cycle-2 PROPOSE meta). No code/prompt/config diff vs cycle-0 baseline (a4925a90). 
§8.1 invariants verified pre-launch: - branch r7-r8-r9-v3-bundle - working tree clean - LiteLLM proxy 10.232.30.185:4000 reachable (/health/readiness) - archetype-schema drift tests 51 passed Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * auto-loop meta cycle 2: variance-floor replicate (HEALTHY-NEUTRAL; marginal → cycle-2.5) Cycle-2 IMPLEMENT commit (57442737) was a --allow-empty NO-EDIT replicate on db=13. This meta commit captures the cycle-2 ANALYZE-post artifacts: - AUTO_OPTIMIZE_CYCLE_2_RECONSTRUCTION.md: best_fitness=0.021266 at mutant ~38; §2.1 trajectory PASS (5/5 gates strict); §2.2 fitness floor FAIL (< 0.03 by 0.00874) - AUTO_OPTIMIZE_CYCLE_2_ANALYTICS.md: n=3 baseline mean=0.02622, stdev=0.00496; cycle-1 reclassified as high-side outlier; trailing-plateau shape dominates 2/3 cycles - AUTO_OPTIMIZE_CYCLE_HISTORY.md: cycle-2 row appended - AUTO_OPTIMIZE_PATTERNS.md: cycle-2 NEUTRAL evidence entry - AUTO_OPTIMIZE_CYCLE_3_SURFACE_MENU_DRAFT.md: surface menu for cycle-3 PROPOSE (gated by cycle-2.5 4th NO-EDIT replicate per cycle-2 PROPOSE marginal-band rule) Decision: cycle-2 best_fitness 0.021266 ∈ [0.020, 0.025] MARGINAL band per cycle-2 PROPOSE §5 decision tree → next cycle (2.5) is another NO-EDIT replicate on db=14 to add a 4th variance sample before cycle-3 first real intervention. Per spec §8.1: git rev-list --count 57442737..HEAD == 1 (one commit per cycle step). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(auto-optimize-loop): cycle-2.5 PROPOSE — 4th NO-EDIT variance-floor replicate (db=14) Triggered by cycle-2 PROPOSE §5 decision-tree marginal-band rule: cycle-2 best_fitness 0.021266 ∈ [0.020, 0.025] → need 4th sample. n=3 stats: mean=0.02622, stdev=0.00496; §7 STOP NOT triggered. n=4 will tighten baseline mean variance ~2.6× and disambiguate the bimodal-suspicious distribution (cycle-1 1σ above mean). 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * auto-loop cycle 2.5: variance-floor replicate of baseline (4th sample, no edit) Identical config to cycle-2 except redis.db=14 (cycle-2 was 13). Per cycle-2 PROPOSE §5 decision-tree: cycle-2 best 0.021266 in MARGINAL band [0.020, 0.025] required this 4th NO-EDIT sample. After cycle-2.5 closes: - if best ∈ [0.015, 0.035] AND §2.1 PASS → cycle-3 PROPOSE proceeds - if outside band OR §2.1 FAIL → STOP per §7 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * auto-loop meta cycle 2.5: variance-floor replicate (HEALTHY-NEUTRAL; n=4 baseline LOCKED; proceed to cycle-3) cycle-2.5 best_fitness = 0.025709 at mutant 80/100 on db=14. n=4 baseline LOCKED: - mean = 0.02609, stdev (sample) = 0.00406 - range [0.02127, 0.03119], CV = 15.6% - §7 STOP NOT triggered (0.00406 << 0.5 × 0.03119 = 0.01559) §2.1 trajectory: PASS lenient (4/5 strict + stagnation NARROW-FAIL at 55 mutants, same shape as cycle-0's 46). §2.2 fitness floor: FAIL (0.025709 < 0.03 by 14%). Trajectory-shape census (n=4): - 2/4 cycles: mid-run plateau + late rescue (cycle-0, cycle-2.5) - 2/4 cycles: leading sprint + trailing plateau (cycle-1, cycle-2) Bimodal 2/2 — cycle-3 intervention must address BOTH shapes. Decision tree (cycle-2.5 PROPOSE §5): best 0.02571 in [0.015, 0.035] AND §2.1 PASS lenient -> PROCEED TO CYCLE-3 PROPOSE (first real intervention) Cycle-3 WIN-CAND threshold = mean+1sigma = 0.03015. Cycle-3 STRONG-WIN threshold = mean+2sigma = 0.03421. First cycle with DIRECT live /proc/<pid>/environ verification of all four section 8.1 environment invariants (OPENROUTER_API_KEY len=73, OPENAI_API_KEY=sk-gigaevo, HTTP_PROXY+HTTPS_PROXY unset). Strengthens n=4 baseline vs cycle-1/2 INFERRED env. 
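The n=4 baseline figures quoted above can be reproduced from the four `best_fitness` values reported in the cycle-0, cycle-1, cycle-2 and cycle-2.5 messages. This is a sketch, not a script from the repository; the variable names are illustrative, and the §7 STOP check follows the form this message uses (sample stdev against 0.5 × the largest sample), which is an assumption about the spec.

```python
import statistics

# best_fitness per NO-EDIT cycle, as reported in the commit messages.
best_fitness = {
    "cycle-0": 0.02620,
    "cycle-1": 0.031187,
    "cycle-2": 0.021266,
    "cycle-2.5": 0.025709,
}

values = list(best_fitness.values())
mean = statistics.mean(values)
stdev = statistics.stdev(values)  # sample standard deviation (n - 1)
cv = stdev / mean                 # coefficient of variation

win_candidate = mean + stdev      # cycle-3 WIN-CAND threshold (mean + 1 sigma)
strong_win = mean + 2 * stdev     # cycle-3 STRONG-WIN threshold (mean + 2 sigma)

# Assumed form of the §7 STOP check, taken from this message's arithmetic.
stop_triggered = stdev >= 0.5 * max(values)

print(f"mean={mean:.5f} stdev={stdev:.5f} CV={cv:.1%}")
print(f"WIN-CAND>={win_candidate:.5f} STRONG-WIN>={strong_win:.5f} STOP={stop_triggered}")
```

With only four samples the standard deviation is itself a rough estimate, so the two thresholds are best read as working bars rather than significance levels.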
Files: - AUTO_OPTIMIZE_CYCLE_2_5_RECONSTRUCTION.md (FINAL; sections 9-13 filled) - AUTO_OPTIMIZE_CYCLE_2_5_ANALYTICS.md (created; sections 0-6) - AUTO_OPTIMIZE_PATTERNS.md (append cycle-2.5 entry; n=4 stats) - AUTO_OPTIMIZE_CYCLE_HISTORY.md (append cycle-2.5 row) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * auto-loop cycle 3: intra-memory saturation detection (K=3 narrative streak → header inject) INT 1 from cycle-3 surface-menu draft. First REAL intervention after n=4 NO-EDIT variance-floor baseline (mean=0.02609, σ=0.00406; locked at f225e1db). Change: IntraMemoryStage now tracks an SHA1-keyed hash of each rendered intra-card's narrative signature (summary + tried_strategies' label/verdict/notes, excluding monotonic counters n_attempts/mean_delta/delta_distribution per chaos-hacker CRITICAL #1). When the same hash is observed for K=3 consecutive renders on the same parent, the next render is prepended with a "[STAGNATION DETECTED]" header plus a child-delta archetype histogram. Hypothesis: the K=3 stagnation header makes the parent's saturation visible to the MutationSuggestionAgent → encourages archetype shift OR genuinely new strategies in identical-narrative branches → reduces stagnation_interval_max and/or trailing plateau without changing model/prompt template/problem. Scope: lineage_memory.py (+90 lines) + 6 unit tests bypassing InputHashCache to exercise the new code path directly. Frozen invariants (unchanged): problem.heilbron, validator, fitness fn, Pydantic Literal archetype enforcement, num_parents=1, max_mutants=100, model=Qwen3-235B-A22B-Thinking-2507, prompts (all 4 SHAs unchanged). WIN-CAND threshold: best_fitness ≥ 0.03015 (n=4 baseline mean+1σ). STRONG-WIN: ≥ 0.03421 (mean+2σ). 
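The K=3 narrative-streak detection described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in lineage_memory.py: the card is modelled as a plain dict, and `narrative_signature`, `StreakTracker` and `observe` are hypothetical names. Only the hashed fields (summary plus each tried strategy's label/verdict/notes) and the excluded counters come from the commit message.

```python
import hashlib
import json
from collections import defaultdict
from typing import Any, Dict

K_STREAK = 3  # consecutive identical narratives before the header is injected


def narrative_signature(card: Dict[str, Any]) -> str:
    """SHA1 over the narrative fields only.

    Monotonic counters (n_attempts, mean_delta, delta_distribution) are left
    out on purpose, so they cannot break a streak by themselves.
    """
    payload = {
        "summary": card.get("summary", ""),
        "tried_strategies": [
            {key: strategy.get(key) for key in ("label", "verdict", "notes")}
            for strategy in card.get("tried_strategies", [])
        ],
    }
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(blob.encode("utf-8")).hexdigest()


class StreakTracker:
    """Counts consecutive renders with an unchanged signature, per parent."""

    def __init__(self, k: int = K_STREAK) -> None:
        self.k = k
        self._last: Dict[str, str] = {}
        self._streak: Dict[str, int] = defaultdict(int)

    def observe(self, parent_id: str, card: Dict[str, Any]) -> bool:
        """Record one render; True means the NEXT render should carry the
        "[STAGNATION DETECTED]" header."""
        signature = narrative_signature(card)
        if self._last.get(parent_id) == signature:
            self._streak[parent_id] += 1
        else:
            self._last[parent_id] = signature
            self._streak[parent_id] = 1
        return self._streak[parent_id] >= self.k
```

Any change to a hashed field resets the streak to 1, so a detector of this shape needs the hashed narrative to stay exactly identical across K consecutive renders before it fires.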
Spec: docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md
PROPOSE: docs/audits/AUTO_OPTIMIZE_CYCLE_3_PROPOSE.md
RECONSTRUCTION: docs/audits/AUTO_OPTIMIZE_CYCLE_3_RECONSTRUCTION.md (skeleton; populated post-run)
ANALYTICS: docs/audits/AUTO_OPTIMIZE_CYCLE_3_ANALYTICS.md (skeleton; populated post-run)

* auto-loop meta cycle 3: K=3 stagnation detector - HEALTHY-NEUTRAL (mechanism never fired)

Cycle-3 INT 1 (commit 45255a3d, db=15, 1.85h run) outcome class HEALTHY-NEUTRAL.

Headline:
- best_fitness = 0.024369 (mutant cc09c637, frontier event #38 of 55 valid mints)
- Δ vs n=4 baseline mean (0.02609) = -0.00172 (|Δ| < 1σ = 0.00406, WITHIN noise band)
- §2.1 trajectory: 5/5 strict PASS (valid_rate ~0.55, frontier_new_cell=46, right_tail_mass=0.344, advance 0.06 strict / 0.55 inclusive, stagnation_interval_max=14)
- §2.2 fitness floor: FAIL (0.024369 < 0.030 by 0.00563)
- §7 STOP threshold: NOT triggered (loop continues)

Key empirical finding: STAGNATION DETECTED header activations = 0/100. The K=3 narrative-streak SHA1 detector never fired in 100 mutants. The bucketed memory representation evolves enough between consecutive renders that SHA1(narrative_signature) changes before K=3 is reached, even with num_parents=1 and many sibling renders per parent. This vindicates the user's preference (feedback_llm_rules_over_hardcoded): hardcoded Python predicates are too brittle to fire at this run scale.

Cycle-3 is empirically a NO-EDIT replicate of cycle-2.5 from the mutator's perspective. Trajectory improvements (stagnation_interval_max=14 vs cycle-0's 46 / cycle-2.5's 55) are not causally attributable to the mechanism that never fired; they are sampling variance.
Dual-axis verification per project_fat_context_direction.md (§1):
- signature: STAGNATION activations = 0 → ✗
- metric of interest: stagnation_interval_max = 14 (< 46 cycle-0) → ✓
- quadrant ✗✓ → "noise/lucky; replicate before claiming win" → cannot ship as a trajectory-shape win because the mechanism never fired

Archetype distribution shifted dramatically from cycle-0 (informational only):
- cycle-0: Guided Innovation 25%, Computational Reinvention 21%
- cycle-3: Guided Innovation 53% (mode-collapse), Computational Reinvention 2%

ARCHETYPE-EFFICIENCY MISMATCH persists: highest hit-rate archetypes (Computational Reinvention 100% n=2, Harmful Pattern Removal 100% n=1, Precision Optimization 66.7% n=6) are under-sampled, while highest-pick archetype (Guided Innovation 53%) has below-average hit-rate (35.8%).

Decision: HEALTHY-NEUTRAL. Next: cycle-4 PROPOSE per fat-context methodology will target a measurable failure mode (candidate: pick-rate-vs-hit-rate mismatch) with LLM-side fat context, NOT hardcoded Python predicates.

Spec: docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md
RECONSTRUCTION: docs/audits/AUTO_OPTIMIZE_CYCLE_3_RECONSTRUCTION.md (FINAL)
ANALYTICS: docs/audits/AUTO_OPTIMIZE_CYCLE_3_ANALYTICS.md (FINAL)
HISTORY: docs/audits/AUTO_OPTIMIZE_CYCLE_HISTORY.md (row appended)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* auto-loop cycle 4: surface per-archetype yield in evolutionary statistics block

INT 2 (cycle-4): FIRST LLM-side fat-context intervention. Per cycle-3 closeout (commit 08cbd5b2, HEALTHY-NEUTRAL, K=3 narrative-streak detector NEVER FIRED in 100 mutants), the user verbal directive on 2026-05-19 expanded the loop scope to the entire mutation-context harness (feedback_mutation_context_harness_in_scope) and re-affirmed fat-informative-context over hardcoded Python predicates (project_fat_context_direction).
Change: per-archetype yield (picks, valid_hits, hit_rate, mean_delta_to_parent) aggregated over the whole-run population in EvolutionaryStatisticsCollector, attached to the EvolutionaryStatistics StageIO, and rendered as a markdown table inside the existing "## Evolutionary Statistics" block of the mutation suggester's prompt (gigaevo/prompts/mutation_suggestions/system.txt also extended with PRIORITY-reshape guidance, NOT invention).

Three additive surface touches, all in scope per §3.1 of LOOP_TASK + feedback_mutation_context_harness_in_scope:
- gigaevo/programs/stages/collector.py: new _compute_archetype_yield() helper + archetype_yield field on EvolutionaryStatistics + cache wiring in _ensure_population_cache. Also flips EvolutionaryStatisticsCollector._EXCLUDE from EXCLUDE_FOR_ANALYTICS (strips metadata) to EXCLUDE_STAGE_RESULTS (keeps metadata), which is REQUIRED for the helper to read program.metadata[MutationSpec.META_OUTPUT]["archetype"]. The metadata cost is bounded by N=100 programs per snapshot, well under 1% of cycle wall-time. Other collectors continue to exclude metadata.
- gigaevo/evolution/mutation/context.py: extended EvolutionaryStatisticsMutationContext.format() to render the yield table when total picks >= 5 (suppresses bootstrap noise). Sorted by hit_rate desc, picks desc tie-break.
- gigaevo/prompts/mutation_suggestions/system.txt: extended the existing "Evolutionary Statistics" bullet with explicit guidance on how to read the new table (UNDER-UTILIZED vs OVER-RELIED-ON cells). Reaffirms PRIORITY-reshape, NOT invention.

Hypothesis: surfacing per-archetype yield (cycle-3 ANALYTICS Signal #1: Computational Reinvention 100% hit-rate at 2% pick-share; Guided Innovation 35.8% hit-rate at 53% pick-share) to the suggester lets the LLM reshape priority toward higher-yield archetypes.

Cycle-4 takes the OPPOSITE design to cycle-3: NO Python threshold, NO if-streak-K predicate, NO header inject. The LLM decides.
CRITICIZE-pre v1 returned REVISE with one CRITICAL + two HIGH findings, all mitigated:
- CRITICAL: PROPOSE v1 specified wrong metadata key (metadata["mutation"] vs canonical MutationSpec.META_OUTPUT = "mutation_output"). Fixed.
- HIGH: TDD fixtures echoed wrong key. Fixed + new test #7 regression-guards the dead key (test_rejects_dead_metadata_key).
- HIGH: integration smoke step #4 only checked header presence. Tightened to require >=1 canonical-named row + (other) share <= 20%.

8 RED tests in tests/stages/test_archetype_yield.py cover: empty population, per-archetype aggregation with delta-to-parent, canonical ordering with zero picks, attachment to EvolutionaryStatistics, format() rendering (sorted), threshold suppression, defensive bucketing of unknown archetypes + missing mutation_output, regression guard against the dead "mutation" key. All 8 pass post-GREEN. Adjacent suites (tests/stages/test_collector.py + tests/stages/test_mutation_context.py) pass 85/85: the _EXCLUDE flip has no regressions.

Scope discipline: NO change to gigaevo/llm/agents/mutation.py (archetype Literal preserved), NO change to gigaevo/prompts/mutation/system.txt (the SLIGHTLY rule does not apply), NO change to problems/heilbron/, num_parents, max_mutants, model_name, llm_base_url. No Heilbron-specific anything in any touched file. The 8 canonical archetype names are loaded from gigaevo/evolution/mutation/constants.py and are problem-agnostic.

Frozen invariants (unchanged): problem.heilbron, validator, fitness fn, Pydantic Literal archetype enforcement, num_parents=1, max_mutants=100, model=Qwen3-235B-A22B-Thinking-2507.

WIN-CAND threshold: best_fitness >= 0.03015 (n=4 baseline mean+1sigma). STRONG-WIN: >= 0.03421 (mean+2sigma). Cycle-4 prediction at n=1: best 0.0275 +/- 0.005 (~30% chance of WIN-CAND given baseline variance).

Riskiest link: the suggester must actually read the yield table and reshape priority. Dual-axis verification (PROPOSE §11) detects ignore vs. follow via (DIAGNOSTIC) archetype-efficiency CV halving signature + (PRIMARY) metric of interest (best_fitness, trajectory gates). Outcome quadrants per project_fat_context_direction 4-quadrant matrix.

Followup captured (NOT in this commit, future cycle): lineage_memory.py:711 reads metadata.get("mutation", {}), the dead key. TransitionAnalysis archetype field always None as a result.

* Revert "auto-loop cycle 4: surface per-archetype yield in evolutionary statistics block"

This reverts commit e6cfe6eed30fc81da752af950a4a32a45b2352f4.

* auto-loop meta cycle 4: archetype-yield prompt bloat - LOSE-REVERT

Cycle-4 INT 2 (commit e6cfe6ee, db=12, 2.16h run) outcome class LOSE-REVERT. Reverted by 69a5a708 per feedback_auto_optimize_branch_policy (no reset).

Headline:
- best_fitness = 0.01686 (run high-water mark; SEED-level, no post-seed mints)
- Δ vs n=4 baseline mean (0.02609) = -0.00923 = -2.27σ → INSIDE LOSE band (baseline-2σ=0.01797)
- 0/37 ACCEPTED…


PetrAnokhin and others added 30 commits April 1, 2026 16:00

gitignore

c00aeb3

feat: add changes extraction to mutation agent

52ef218

feat: add idea tracker

5ec4f06

feat: add logging for idea tracker

842746e

fix: circular import in logger

fdca9a7

fix: remove short id separate storage and generation

dca364c

short id will generate based on full id when required

feat: add best idea extraction based on top_k selection by fitness an…

2d2c1e6

…d delta fitness

feat: experimental ml pipeline for impact estimation based on linear …

1872d38

…regression feature weights

refactor: remove debug code

c94e78a

fix: changed cooccurrence threshold aggressive scaling to fixed minimum

97ad639

feat: add idea description rewriting logic

c6c3421

chore: removed unused prompts

5170fe9

gitignore

3293d4d

fix: correct serialization of dict and lists in pd columns

7c7e3e8

feat: csv loading to IdeaTracker

db5d7c4

memory in config

ce2ca3d

fixed my cat stepping on keyboard probably

43def44

Update idea_tracker

9fd9cf1

feat: add extended record card dataclass

6ea7bd9

feat: add update logic for extended record card

bec4351

feat: support for extended record card

49a2a00

refactor: record card extended minor refactor

4830094

feat: task description loading

1662eed

fix: remove debug print

7c8d45a

feat: update main logic to work with extended record card

78584c7

chore: update docstrings

97f96d1

fix: wrong key name fix

3d1d740

fix: IncomingIdeas update logic fix

2402efb

refactor: replace ML impact pipeline with origin analysis computation…

ba95f51

… and improve docstring clarity

fix: add break condition for processing when no new ideas are present

4974f97

KhrulkovV and others added 24 commits April 6, 2026 13:21

Merge pull request #167 from KhrulkovV/worktree-memory-optimize

ca62aae

refactor(memory): Phase 1 — directory reorg (vendor/examples/docstrings)

Merge pull request #168 from KhrulkovV/refactor/memory-deduplicate

381e63b

refactor(memory): deduplicate code and delete dead paths

Merge pull request #170 from KhrulkovV/refactor/memory-pydantic-config

a6968cf

refactor(memory): Pydantic-idiomatic config parsing

Merge pull request #171 from KhrulkovV/refactor/memory-rename-relocate

1ae530a

refactor(memory): rename write pipeline and analysis files

Merge pull request #172 from KhrulkovV/refactor/memory-split-narrow

37a486e

refactor(memory): split card_conversion into focused modules

Merge pull request #173 from KhrulkovV/refactor/memory-exceptions

e8c60d6

refactor(memory): custom exception hierarchy and narrowed catches

Merge pull request #174 from KhrulkovV/refactor/memory-exception-conf…

407b757

…ormity refactor(memory): exception conformity + ABC base class

fix: lint errors in memory_platform test (unused import, sort)

b99e221

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge pull request #175 from KhrulkovV/refactor/memory-mypy-docs

4f249d3

refactor(memory): type annotations, docstrings, constants, platform bug fix

Update pyproject.toml

98bd0ab

Update run.py

054df39

GrigoryEvko mentioned this pull request May 21, 2026

refactor(config): replace hydra/omegaconf with typed pydantic+tyro #21

Open

10 tasks

KhrulkovV force-pushed the main branch from 054df39 to 0f2b866 Compare May 26, 2026 09:37

chore: empty commit to refresh PR mergeability after main history rew…

2a616f2

…rite


fix(optuna): harden literal parsing across both desubstitution paths #7

GrigoryEvko wants to merge 800 commits into FusionBrainLab:main from GrigoryEvko:fix/optuna-literal-parsing

GrigoryEvko commented May 15, 2026


5 participants
