fix(optuna): harden literal parsing across both desubstitution paths #7
Open
GrigoryEvko wants to merge 800 commits into
short id will generate based on full id when required
…regression feature weights
… and improve docstring clarity
…cstrings
Phase 1 cleanup (completed):
- Move A_mem/, GAM_root/ → _vendor/ (vendored MIT libs)
- Move contrib licenses → _vendor/
- Move 3 example scripts → examples/
- Fix 15 broken vendored library imports (A_mem/GAM_root bare imports)
- Update 8 consumer import paths to _vendor/
- Add _vendor/__init__.py docstring (vendored libs notice)
- Add examples/__init__.py docstring (not production code)
- Update shared_memory/__init__.py docstring
- Update pyproject.toml (ruff/mypy exclude paths for vendors)
Tests: 770 passed
Lint: clean
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): Phase 1 — directory reorg (vendor/examples/docstrings)
- Delete 5 duplicated usage-merge functions from memory_write_example.py (_to_float, _median_or_none, _extract_usage_task_deltas, _build_usage_payload_from_task_deltas, _merge_usage_payloads) → import from card_update_dedup.py (canonical home)
- Delete duplicate dedupe_keep_order from card_update_dedup.py → import from shared_memory/utils.py
- Remove deprecated _apply_update_actions() from memory.py (dead wrapper)
- Make memory_to_card private (_memory_to_card); only used internally
- Simplify single-iteration loop in _extract_json_object
Net: ~120 lines deleted, zero behavior change.
Tests: 770 passed | Lint: clean
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): deduplicate code and delete dead paths
Replace hand-written RetrievalWeights.from_mapping() and CardUpdateDedupConfig.from_mapping() dict parsers (~87 lines) with Pydantic v2 @model_validator(mode="before"): same behavior, idiomatic.
Also: add docstrings to all functions in card_update_dedup.py, fix stale test references to deleted _apply_update_actions wrapper, add test for flat config format.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): Pydantic-idiomatic config parsing
- memory_write_example.py → write_pipeline.py (it's production, not an example)
- memory_write_config.py → write_pipeline_config.py
- selected_ideas_6.py → origin_analysis.py (remove versioned filename)
- Delete test_memory_write_example_extended.py (duplicate of test_write_pipeline.py)
- Update all 8 import sites + 1 dynamic importlib.import_module call
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): rename write pipeline and analysis files
Extract from card_conversion.py (554 → 420 lines):
- base.py: GigaEvoMemoryBase abstract class (20 lines)
- card_search.py: format_search_results, search_cards_by_keyword, synthesize_search_results (115 lines)
Update 4 import sites directly (no re-exports).
card_conversion.py retains: normalization, conversion, GAM config, constants, protocols.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): split card_conversion into focused modules
Define MemoryError, MemoryRetrieverError, MemorySearchError, and MemoryStorageError in gigaevo/exceptions.py following the existing GigaEvoError hierarchy.
Wire them into the memory subsystem:
- gam_search.build() wraps all failures in MemoryRetrieverError
- memory.py narrows two gam.build() catches from bare Exception
- card_store._load() narrows to (json.JSONDecodeError, OSError)
- card_dedup import block narrows to (ImportError, OSError)
Resilience-critical catches (search fallback, merge loop, __exit__) remain broad by design.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): custom exception hierarchy and narrowed catches
…t base to ABC
- concept_api.py: all 5 RuntimeError raises → MemoryStorageError (matches gigaevo/database pattern of wrapping I/O errors)
- base.py: GigaEvoMemoryBase now uses ABC + @abstractmethod (matches MutationOperator, Stage, LangGraphAgent pattern)
- card_dedup.py: narrow two broad catches:
  - JSONL read fallback: except Exception → (json.JSONDecodeError, OSError)
  - GAM store build: except Exception → (MemoryRetrieverError, OSError)
- Update 6 test assertions from RuntimeError to MemoryStorageError
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ormity refactor(memory): exception conformity + ABC base class
When write_pipeline.py passes MemoryCard/ProgramCard Pydantic models to memory_platform.save_card(), the dict() call on a Pydantic model doesn't properly flatten nested Pydantic objects like ConnectedIdea. This caused TypeError in _persist_index() when json.dumps() tried to serialize.
Root cause: write_pipeline returns list[AnyCard] (Pydantic models) and both backends (memory_platform and memory/shared_memory) consume these cards via save_card(). memory_platform's normalize_memory_card() must explicitly call .model_dump() on Pydantic inputs to flatten nested objects.
Fix verified: all 788 memory + integration tests pass.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests the exact bug path: Pydantic MemoryCard/ProgramCard with nested ConnectedIdea and MemoryCardExplanation objects must be properly flattened to plain dicts before JSON serialization.
6 tests covering: ProgramCard with ConnectedIdea, MemoryCard with MemoryCardExplanation, plain dict passthrough, JSON round-trips, None.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add gigaevo-memory Git dependency to pyproject.toml
- Remove sys.path manipulation from memory_platform/memory.py and remote_gam_retriever.py (no longer needed with proper install)
- Simplify test file to use direct imports instead of module mocking
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Expands from 6 to 11 tests covering the complete save_card → _persist_index flow with Pydantic inputs.
Tests verify:
- normalize_memory_card: ConnectedIdea/MemoryCardExplanation → dict
- save_card: Pydantic ProgramCard/MemoryCard → JSON-serializable index
- _card_to_backend_content: API payload is clean dict
- persist/reload roundtrip: index file survives write→read cycle
Uses _make_platform_memory() factory with mocked API client to test memory_platform in isolation without network dependencies.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add docstrings to 15 public methods across 5 files (memory.py, concept_api.py, card_dedup.py, openai_inference.py, write_pipeline.py)
- Add return type annotations to 4 functions in amem_gam_retriever.py
- Fix 2 mypy errors: annotate retrievers dict, rename variable in api_sync.py
- Extract magic numbers: _MAX_SUMMARY_CHARS, _MAX_DESCRIPTION_CHARS, _ENTITY_NAME_MAX_LENGTH, _MAX_CONNECTED_DESCRIPTIONS
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
refactor(memory): type annotations, docstrings, constants, platform bug fix
The AST-level `_EvalCleaner` checked `isinstance(arg, (Constant, List, Tuple, Set, Dict))` to decide whether an `eval(...)` argument is a literal. This missed `UnaryOp(USub, Constant)`, so `eval(-5)` survived in the AST path while the source-level `_clean_eval_in_source` stripped it correctly. Switch to `ast.literal_eval` as the predicate so both paths share the same logic. Also broaden the `(ValueError, SyntaxError)` except clauses around `literal_eval` in `_coerce_param_value` and `_clean_eval_in_source` — it can also raise `MemoryError`, `RecursionError`, and `TypeError` on pathological inputs. Verified against the source-level path on 12 sample inputs; both produce identical output post-fix.
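The predicate change described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `is_literal`, `EvalCleanerSketch`, and `clean` are hypothetical names, and the real `_EvalCleaner` may differ in structure. It shows why delegating to `ast.literal_eval` covers `UnaryOp(USub, Constant)` without enumerating node types, and it catches the wider exception set named in the description.

```python
import ast

# Exceptions ast.literal_eval may raise on malformed or pathological input.
_LITERAL_EVAL_ERRORS = (ValueError, TypeError, SyntaxError, MemoryError, RecursionError)


def is_literal(node: ast.AST) -> bool:
    """Return True if ast.literal_eval accepts the node (hypothetical helper)."""
    try:
        ast.literal_eval(node)
    except _LITERAL_EVAL_ERRORS:
        return False
    return True


class EvalCleanerSketch(ast.NodeTransformer):
    """Hypothetical stand-in for _EvalCleaner: unwrap eval(<literal>) calls."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)  # clean nested calls first
        if (
            isinstance(node.func, ast.Name)
            and node.func.id == "eval"
            and len(node.args) == 1
            and not node.keywords
            and is_literal(node.args[0])
        ):
            return node.args[0]  # replace eval(<literal>) with the literal itself
        return node


def clean(source: str) -> str:
    """Parse, strip literal eval() wrappers, and return the rewritten source."""
    tree = EvalCleanerSketch().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


if __name__ == "__main__":
    print(clean("x = eval(-5)"))    # the UnaryOp case the isinstance check missed
    print(clean("y = eval(name)"))  # non-literal argument is left alone
```

With the old `isinstance` check, the first call would have kept `eval(-5)` because the argument is a `UnaryOp`, not a `Constant`.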
10 tasks
KhrulkovV added a commit that referenced this pull request on May 26, 2026
Address 2 critical + 2 major issues from methodology expert audit:
C1: Add required_prefix constraint enforcement in PromptExecutionStage
and GigaEvoArchivePromptFetcher — prevents mutation LLM from
evolving away frozen SYSTEM_CONSTRAINTS.
C2: Change default prior from Beta(1,1) to Beta(1,3) — untested prompts
start at fitness=0.25 instead of 0.50, preventing archive churn.
M1: Track metrics_count separately from total_trials — use as denominator
for per-metric means (fixes bias from REJECTED_ACCEPTOR trials).
M4: prompt_text_to_id() now hashes both system and user text — prevents
stats conflation when user prompts differ.
All prior data invalidated (fitness computation + ID hashing changed).
DBs 4-7 flushed for clean restart.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
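To illustrate C2, here is a minimal sketch that assumes fitness is the posterior mean of a Beta-Bernoulli model. The commit states only the prior and the resulting starting fitness; `posterior_mean` and its parameters are hypothetical names.

```python
def posterior_mean(successes: int, failures: int,
                   alpha: float = 1.0, beta: float = 3.0) -> float:
    """Posterior mean of Beta(alpha, beta) after observing the given trials.

    Assumption: fitness equals this mean; the real computation may differ.
    """
    return (alpha + successes) / (alpha + beta + successes + failures)


if __name__ == "__main__":
    # An untested prompt has no trials, so its fitness is the prior mean.
    print(posterior_mean(0, 0, alpha=1.0, beta=1.0))  # old prior Beta(1,1)
    print(posterior_mean(0, 0))                       # new prior Beta(1,3)
```

Under this reading, an untested prompt no longer ties with a prompt that has a 50% success record, which is consistent with the stated goal of reducing archive churn.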
KhrulkovV added a commit that referenced this pull request on May 26, 2026
Addresses chaos-hacker adversarial review findings:
- HIGH #1: Test _await_idle's actual ghost cleanup branch via time.monotonic patch
- HIGH #2: Test generation_timeout on real step() with ghost IDs + stuck RUNNING
- HIGH #3: Verify snapshot data correctness (not just no-hang) after bump()
- MEDIUM #4: Truly concurrent writes via asyncio.Barrier + serialization proof
- MEDIUM #5: Stuck RUNNING program triggers generation_timeout
- MEDIUM #6: Write serialization assertion (max_concurrent == 1)
- MEDIUM #7: Lock eviction race (concurrent reuse after terminal pop)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
KhrulkovV added a commit that referenced this pull request on May 26, 2026
* prompts: kill few-shot fabrication leak in insights + lineage
The GOOD examples themselves contained invented magnitudes
("rejects 60% of viable candidates", "-2.3% runtime"), training the
LLM that fabricated effect estimates are valid output. Live judge
eval on 5 parent->child pairs across heilbron + hover, audited
against actual Redis program metrics + task_description, shows:
- ungrounded-number rate: 20.2% -> 6.9% (3x reduction)
- lineage rubric subscore: 17.35 -> 17.40
- 4-pair rubric avg (excl. known Gemini Pro structured-output flake):
16.97 -> 17.12
Edits:
- insights: remove fabricated "60%" from numeric GOOD example;
add "Quote, don't estimate" rule naming specific fabrication
patterns (% rejection rates, speedup factors, iteration budgets).
- lineage: remove "-2.3% runtime" from Quantification example;
spell out that cited numbers must come from diff, code, metrics,
or task description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(evolution-stats): iteration-window aggregation + snapshot bump
Evolutionary Statistics section in the mutation prompt was empty or
showed stale numbers under the steady-state engine. Two root causes:
1. Stale population snapshot — `bump()` was only called once at seed
drain. After that, every collector saw a frozen snapshot, so the
focal program's iteration was rarely in scope. Added
`bump(incremental=True)` in `poll_and_ingest` after every commit
pass so the snapshot tracks ingestion progress without flushing
cached program objects.
2. Per-generation aggregation is meaningless under JIT — generations
are an output of the schedule, not a fixed input. Replaced the
`generation_history` / per-gen fields with a symmetric iteration
window ([iter-R, iter+R], R=15) around the focal program:
window count/valid, best-in-window + iter, focal rank in window,
median-before / median-after horizons, trend via median-of-thirds
(5% multiplicative threshold, direction-aware via
`metrics_context.is_higher_better`), max invalid streak, and a
global running-best plateau marker (`iters_since_last_new_best`).
`EvolutionaryStatisticsMutationContext.format()` emits the locked
10-line "E_augmented" block; design doc lives at
`docs/superpowers/specs/2026-05-14-evolutionary-stats-redesign.md`.
Validated via 3-round LLM extraction eval: E_augmented scored 44/45
vs the old per-gen layout's 15/45.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(monitoring): file emit target writes frontier_<metric>.png each tick
start_live_frontier_compare gains an output_dir param and a new "file"
emit target that re-renders a frontier-trajectory PNG (best-so-far +
per-iter mean) in the Hydra run output directory on every tick, sibling
to live_profiler's profile_live.html. Default emit_targets now includes
"file". run.py threads the Hydra output_dir through to the daemon.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(memory): DAG-native intra+extra memory pipeline (per-parent lineage card + live global cards)
Adds the `intra_extra_memory` pipeline variant on top of the default builder:
* IntraMemoryStage (strong LLM, structured output) renders a per-parent
lineage card from DescendantProgramIds + MemoryContextStage as named inputs;
framework InputHashCache skips the LLM when neither changes. Output is
attached to the parent's metadata['intra_memory_card'] and concatenated
with the global memory cards block via ConcatMemoryStage.
* LiveMemoryRefreshHook wraps IdeaTracker.run_increment as a post_step_hook,
surfacing freshly evolved ideas to MemoryContextStage's reload-on-read
selector during the same run (no need to wait for end-of-run flush).
* New ExtraMemoryStage class (currently dormant in the wired pipeline) kept
as opt-in infra with its own caching test, pinning the structured-output
contract for future re-wiring.
* Bug fix bundled: invalid-child fitness sentinel (e.g. -1000 in heilbron)
no longer pollutes delta_distribution.min/median/max or per-cluster
mean_delta. Invalid children route to dedicated n_failed counters; the
rendered card shows "n_failed=N (excluded from stats above)" and
"mean delta n/a" for all-failed clusters. System prompt rule 3 now
instructs the LLM to exclude is_valid=false from delta math.
Legacy lineage stages stripped from the builder (AncestorProgramIds,
LineageStage, LineagesToDescendants, LineagesFromAncestors, InsightsStage)
— DescendantProgramIds is kept and rewidened (max_selected=24) to feed
IntraMemoryStage instead of LineagesToDescendants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: intra/extra memory mode guide + USAGE / MEMORY_ARCHITECTURE cross-links
Adds docs/INTRA_EXTRA_MEMORY.md covering the pipeline introduced in 89f01be5:
architecture diagram, intra-card schema (with the n_failed sentinel-handling
contract), live external-memory refresh hook, caching invalidation triggers,
required co-overrides (ideas_tracker=default, memory=local), smoke / full /
nohup launch commands, tuning knobs, verification checklist, and a
troubleshooting matrix.
USAGE.md: adds `intra_extra_memory` to the `pipeline` config-group table and
a launch example under "Examples".
MEMORY_ARCHITECTURE.md: top-of-file pointer to the new mode guide so the
in-run / live-memory entry point is discoverable from the store-side docs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(intra-memory): ship unified diff (not full code) per child + soften mutator "untried directions" preference
IntraMemoryStage payload now carries either a unified diff (change_form="diff")
or full child source (change_form="full_code") per child. Diff is the default;
full code is the fallback when (a) is_valid=False so error_summary line refs
stay readable against the same buffer the analyst sees, (b) the diff is
empty (identical sources), or (c) the diff is no smaller than the file
(structural rewrites where every line differs). Expected 50-80% prompt-size
reduction on the typical small-mutation regime, where the parent's boilerplate
was previously repeated N times across children.
The intra system prompt's user-message-structure table is updated to document
both children[i].diff and children[i].code, plus the change_form discriminator,
so the analyst knows how to read either form.
Mutator system prompt: softened the "Untried directions" rule. Previously
"prefer it over inventing a new direction from scratch" — a hard preference
that let speculative hints dominate archetype selection. Now framed as
candidates to weigh alongside the model's own ideas, with explicit licence
to skip any whose mechanism does not actually fit the parent's code.
Tests: 4 new payload-shape tests on IntraMemoryStage (diff for small mutation,
full-code for structural rewrite, full-code for invalid child, system prompt
documents change_form/diff/full_code), plus a new pin on the mutator prompt
wording.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
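The diff-versus-full-code selection described in this commit can be sketched with the standard library's `difflib`. This is an illustration under assumptions: `child_change` is a hypothetical helper, and "no smaller than the file" is interpreted here as a character-length comparison against the child source.

```python
import difflib


def child_change(parent_src: str, child_src: str, is_valid: bool) -> tuple[str, str]:
    """Return (change_form, payload) for one child (hypothetical helper)."""
    if not is_valid:
        # (a) invalid child: ship full code so error line refs stay readable
        return "full_code", child_src
    diff = "".join(
        difflib.unified_diff(
            parent_src.splitlines(keepends=True),
            child_src.splitlines(keepends=True),
            fromfile="parent",
            tofile="child",
        )
    )
    # (b) empty diff (identical sources) or (c) diff no smaller than the file
    if not diff or len(diff) >= len(child_src):
        return "full_code", child_src
    return "diff", diff


if __name__ == "__main__":
    parent = "".join(f"line {i} of the parent program\n" for i in range(50))
    child = parent.replace("line 25 of", "LINE 25 of")
    form, payload = child_change(parent, child, is_valid=True)
    print(form, len(payload), len(child))
```

For a one-line mutation in a 50-line file the diff carries only the hunk plus context, which is where the expected prompt-size reduction comes from.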
* feat(prescriptive): MutationSuggestionStage + EvolutionaryStatistics wiring
Architectural split between descriptive (IntraMemoryStage) and prescriptive
(MutationSuggestionStage) memory: the intra stage now ONLY summarises lineage
history into a per-parent card; the new MutationSuggestionStage consumes
intra card + cross-population memory cards + ancestral momentum trail +
EvolutionaryStatistics population snapshot and emits structured
ProgramInsights into MutationContextStage's insights slot (same shape as the
legacy InsightsStage, so the mutator's PROGRAM INSIGHTS section renders
unchanged).
Key wiring (lineage_memory_pipeline.py):
* DescendantProgramIds → IntraMemoryStage.children_ids
* IntraMemoryStage → MutationSuggestionStage.intra_card
* MemoryContextStage → MutationSuggestionStage.memory_cards
* EvolutionaryStatisticsCollector → MutationSuggestionStage.evolutionary_statistics
* MutationSuggestionStage → MutationContextStage.insights
* IntraMemoryStage + MemoryContextStage → ConcatMemoryStage → MutationContextStage.memory
Both strong-LLM stages (Intra + Suggestion) gate on validator success and
(when enabled) archive acceptance, mirroring the legacy InsightsStage
skip-cascade so paid LLM tokens are never spent on a program that won't
enter the archive.
Other:
* Intra card delta-distribution + mean_delta now formatted using primary
metric's decimals from metrics.yaml (was rendering raw 16-sig-fig floats).
* PopulationSnapshot.refresh: refetch programs in INCOMPLETE_STATES so
QUEUED/RUNNING entries get up-to-date metrics on each snapshot.
* fix(memory): pick OPENROUTER_API_KEY when LLM_BASE_URL targets OpenRouter
Previously gigaevo.memory.config.OPENAI_API_KEY preferred $OPENAI_API_KEY
over $OPENROUTER_API_KEY unconditionally. In intra_extra_memory smokes we
export both — $OPENAI_API_KEY=sk-gigaevo (LiteLLM proxy) for the main Qwen
pipeline and $OPENROUTER_API_KEY=sk-or-v1-... for the GAM/A-Mem cheap path
(Gemini Flash via OpenRouter). The wrong-key-for-endpoint combination made
every GAM research_agent and IdeaTracker LLM call fail silently with a 401, killing
the extra-memory channel without any pipeline error.
Two-line fix:
- config.OPENAI_API_KEY now resolves OpenRouter key first when LLM_BASE_URL
contains "openrouter.ai" (e.g. settings.yaml default).
- ideas_tracker.llm._init_clients picks the right key for the effective
base_url (OPENROUTER_API_KEY for openrouter, OPENAI_API_KEY otherwise).
Verified: with both keys exported and settings.yaml's OpenRouter base_url,
client.api_key now starts with "sk-or-". With base_url set to the LiteLLM
proxy, client.api_key falls back to "sk-gigaevo".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
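A minimal sketch of the key-selection rule described above. `pick_api_key` is a hypothetical function, not the code in `gigaevo.memory.config`, and the URLs in the usage example are placeholders.

```python
import os
from typing import Mapping, Optional


def pick_api_key(base_url: str, env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Choose the API key that matches the endpoint (hypothetical helper)."""
    env = os.environ if env is None else env
    if "openrouter.ai" in base_url:
        # OpenRouter endpoint: prefer its own key, fall back to the OpenAI one.
        return env.get("OPENROUTER_API_KEY") or env.get("OPENAI_API_KEY")
    return env.get("OPENAI_API_KEY")


if __name__ == "__main__":
    env = {"OPENAI_API_KEY": "sk-gigaevo", "OPENROUTER_API_KEY": "sk-or-v1-example"}
    print(pick_api_key("https://openrouter.ai/api/v1", env))
    print(pick_api_key("http://localhost:4000/v1", env))
```

Passing the environment as a mapping keeps the rule testable without touching the real process environment.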
* feat(suggester): rank-aware ambition rule in mutation_suggestions/system.txt
Adds a sub-bullet under the Evolutionary Statistics input description that
calibrates the suggester's parametric-vs-structural mix to the rank of the
parent in the window (already reported as `rank X/Y in window`):
* Top quartile -> at least one suggestion must be structurally orthogonal
(different algorithm family, init scheme, or objective), not a parameter
tweak. Parametric refinements alone are insufficient when the parent
already tops its window.
* Bottom half -> at least one structural change required; fragile/harmful
tags take precedence over rigid-parameter tweaks.
* Middle band -> mix exploitation with at least one orthogonal axis.
Rationale: smoke #3 (cycle 1 at max_mutants=20) showed gen-3 105901c4
(rank=1/Y) receiving 5 of 6 suggestions tagged `rigid` (pure parameter
tweaks), producing a plateau at 0.01142 (32.6% of 0.035). The breakthrough
to 0.01885 (53.9%) came from a SIBLING program a35a0f72 whose suggester
happened to find a structural harm (asymmetric_extra_points / symmetry
restoration). The new rule makes that structural pivot a stable
expectation at top-of-window, not an accident.
Generic — uses only the existing rank-in-window signal already in
EvolutionaryStatistics. No new fields, no new code, no heilbron-specific
tokens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(stats): rank line dropout when focal missing from snapshot
The iter-window rank computation in `_compute_iter_window_fields`
called `sorted(fits).index(focal_fit)` to find the focal's rank.
When the population snapshot lagged behind the pipeline view (or the
snapshot contained a stale `is_valid=0` view of the focal), `focal_fit`
was not in `fits`, the ValueError was swallowed silently, and
`iter_window_rank` became None.
Downstream renderer in `evolution/mutation/context.py:168` gates the
rank line on `iter_window_rank is not None`, so the entire
"rank X/Y in window" segment disappeared for top-of-window programs.
The mutation_suggestions/system.txt rank-aware ambition rule relies on
that text; with the rank line missing the rule was DORMANT throughout
the cycle-2 run (struct counter stayed at 0 for 100 mutations).
Verified on production program 4578cea1 from
output/cycle2_rankambition_20260518_022450 (fit=0.01509 at iter=49):
window valid=10, best in window=0.01455 — focal excluded, rank=None.
Fix:
- When focal is valid and not already in `valid_with_fit` (snapshot
lag), include it explicitly using the up-to-date metrics passed by
the pipeline. Downstream best/median/trend/valid_count then reflect
reality.
- Replace `sorted.index` with a count-based rank (better+1). Robust to
tied fitness values, which previously got under-counted by `index`'s
first-match semantics.
Tests:
- test_iter_window_rank_when_focal_missing_from_snapshot (RED→GREEN)
- test_iter_window_rank_when_focal_in_snapshot_but_stale_metrics (RED→GREEN)
- existing test_iter_window_rank_none_when_focal_invalid still passes
(invalid focal correctly yields rank=None)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
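The count-based rank from the fix can be sketched as below. `rank_in_window` is a hypothetical name and the signature is an assumption; the point is that the rank no longer depends on the focal value being present in the list.

```python
from typing import Sequence


def rank_in_window(focal_fit: float, window_fits: Sequence[float],
                   higher_is_better: bool = True) -> int:
    """Rank = (number of strictly better fitness values) + 1.

    Works whether or not focal_fit appears in window_fits, so a lagging
    snapshot cannot raise the ValueError that list.index() would.
    """
    if higher_is_better:
        better = sum(1 for f in window_fits if f > focal_fit)
    else:
        better = sum(1 for f in window_fits if f < focal_fit)
    return better + 1


if __name__ == "__main__":
    # Focal 0.01509 is missing from the snapshot whose best is 0.01455.
    print(rank_in_window(0.01509, [0.01455, 0.0120, 0.0100]))
```

In the production case quoted above, `sorted(fits).index(0.01509)` raises because the value is absent, while the count-based form still yields a rank.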
* feat(suggester): lineage-exhaustion override in rank-aware ambition
Cycle-3 (rank rule LIVE) plateaued at 0.02614 (74.7%). Forensics:
- Top-5 fitness: 4/5 are structural-pivot archetypes (Guided Innovation
/ Approach Synthesis). Mean fit by archetype: Guided Innovation 0.01735
vs Exploitation 0.01016. Structural pivots win.
- But ~50% of all mutations chose Exploitation. Among the 6 programs
whose parent's intra card flagged "all valid children regressed or
failed" (lineage exhaustion), the mutator still chose
Exploitation/Proven Pattern Extension in 3/6 cases — wasting budget
re-tweaking failed clusters.
The existing rank-aware rule says "at least one orthogonal-axis
suggestion" — too soft when local gradient is empirically dead.
New sub-bullet: when intra card shows ≥2 failed/regressed tried_strategy
clusters (or delta distribution catastrophic+failed ≥ 2 with improving=0),
EVERY suggestion must propose a structural axis NOT in tried_strategies.
Parametric tweaks of failed clusters are explicitly rejected in this
regime. Forces the suggester to leave exhausted local basins.
Generic, no task-specific tokens. +11 lines in
mutation_suggestions/system.txt under the rank-aware ambition block.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(suggester): escape literal {} braces in lineage-exhaustion sub-bullet
673b9fb6 introduced `{regressed, failed}` as a literal phrase in the
mutation_suggestions/system.txt template. The prompt loader passes this
template through `str.format()` (factories.py:200), so `{regressed, failed}`
was parsed as a placeholder named 'regressed, failed' — raising KeyError
at every DAG build during cycle-4 startup. All 5 seed-eval DAGs failed,
the engine spun in an idle "no parents" loop, and the process exited at
t=07:53:04 with progs=5/scored=0/mut_done=0 — zero useful work done.
Fix: escape the literal braces as `{{regressed, failed}}`. Verified via
str.format() round-trip — only the three intended placeholders
({task_description}, {metrics_description}, {max_insights}) remain.
Lesson: any literal `{` or `}` in `.txt` prompts that flow through
.format() must be doubled. See feedback memory for hardening guide.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
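The failure mode and the fix can be reproduced with the standard library alone. The template strings below are shortened stand-ins for the real prompt, and `placeholders` is a hypothetical helper for auditing a template before it reaches `.format()`.

```python
from string import Formatter


def placeholders(template: str) -> list[str]:
    """List the field names str.format() would try to fill."""
    return [name for _, name, _, _ in Formatter().parse(template) if name is not None]


broken = "Clusters in {regressed, failed} for {task_description}"
fixed = "Clusters in {{regressed, failed}} for {task_description}"

if __name__ == "__main__":
    print(placeholders(broken))  # includes the accidental 'regressed, failed' field
    print(placeholders(fixed))   # only the intended placeholder remains
    print(fixed.format(task_description="demo"))
    try:
        broken.format(task_description="demo")
    except KeyError as exc:
        print("KeyError:", exc)
```

A check like `placeholders(template)` against the expected field set could run in a unit test so that a stray brace fails CI rather than a DAG build.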
* feat(suggester): server-computed EXHAUSTION ALERT banner overrides soft LEX
Cycle-4b shipped a soft "Lineage exhaustion override" sub-bullet in the
mutation-suggester system prompt. Qwen-3-235B-A22B-Thinking-2507 ignored
it: 3 parents in cycle-4b had ≥2 regressed/failed intra clusters yet
their children received parametric refinements of already-tried clusters
(plateau at 0.02630, +0.6% vs cycle-3 baseline 0.02614).
Replace the soft text with a deterministic server-side banner prepended
to the user message — most salient location, no LLM judgement on the
trigger condition.
Trigger (computed in MutationSuggestionAgent._format_exhaustion_block):
- cond_a: ≥2 distinct clusters in {regressed, failed}, OR
- cond_b: catastrophic + n_failed ≥ 2 AND improving = 0
When triggered, emit `## EXHAUSTION ALERT — strict structural-pivot mode`
header + OVERRIDES sentence + explicit AVOID-LIST of negative-verdict
clusters + full tried-strategies context + `---` separator. The system
prompt now references the banner as a HARD CONSTRAINT that overrides the
rank-aware ambition mix.
Banner is task-agnostic (no heilbron/triangle leak — covered by test).
Tests: 13 new in tests/llm/test_mutation_suggestion_exhaustion.py cover
empty intra, single-cluster non-triggers, cond_a/cond_b paths, mixed
verdicts, override/AVOID-LIST language, task-agnosticism, and the
trailing separator. All pass. Lint clean. No new regressions in
tests/llm/ (371 pass) or tests/stages/ (938 pass; pre-existing 3
failures unrelated).
Pure context-building change — schema unchanged, pipeline unchanged,
launch command unchanged. Stays within the 0.035-sprint allowed-knobs
envelope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
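The trigger condition can be sketched as a pure function. The input shapes here are assumptions (a mapping of cluster name to verdict plus three counters), and `exhaustion_triggered` is a hypothetical name; the actual logic lives in `MutationSuggestionAgent._format_exhaustion_block`.

```python
from typing import Mapping

NEGATIVE_VERDICTS = {"regressed", "failed"}


def exhaustion_triggered(cluster_verdicts: Mapping[str, str],
                         catastrophic: int, n_failed: int, improving: int) -> bool:
    """Evaluate cond_a OR cond_b as described in the commit message."""
    negative = {name for name, v in cluster_verdicts.items() if v in NEGATIVE_VERDICTS}
    cond_a = len(negative) >= 2                               # >=2 distinct negative clusters
    cond_b = (catastrophic + n_failed) >= 2 and improving == 0
    return cond_a or cond_b


if __name__ == "__main__":
    print(exhaustion_triggered({"step_size": "regressed", "init": "failed"}, 0, 0, 1))
    print(exhaustion_triggered({"step_size": "improved"}, 1, 1, 0))
    print(exhaustion_triggered({"step_size": "regressed"}, 0, 1, 2))
```

Because the decision is computed in code, the model only sees the resulting banner and never has to judge whether the condition holds.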
* feat(suggester): revert rank+LEX to 9cca4344 baseline for cycle-6 A/B
Drops 24 lines from mutation_suggestions/system.txt — the rank-aware ambition
sub-bullet (commit 4caeb1b9) and the lineage-exhaustion banner clause
(commit 0ebd405c built on 73ed1207's brace-fix of 673b9fb6).
Net effect: the suggester prompt is now identical to its 9cca4344 state.
Empirical motivation:
| Run | Best fitness | system.txt |
|------------------------------|--------------|-------------------|
| sprint cycle-2 (2026-05-17) | 0.0315 | pre-prescriptive |
| cycle3-from-scratch (uncomm) | ~0.030 | 9cca4344 baseline |
| cycle-3 today (rank rule) | 0.02614 | +13 rank |
| cycle-4b today (LEX soft) | 0.02630 | +24 rank + LEX |
| cycle-5 today (LEX hard) | 0.02536 | +24 rank + banner |
Today's three runs cluster at 0.025-0.026 (~17% below the 9cca4344
baseline). The collector.py rank-line bugfix + EXHAUSTION ALERT formatter
remain in place — only the LLM-facing prompt content is reverted. The
formatter just becomes dormant since the prompt no longer references its
output.
Cycle-6 will A/B this against the cycle-5 state to confirm whether the
+24 lines of guidance were net-destructive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(stats): R2 — MAD-based trend noise floor + archive_valid_fitnesses field
Replace legacy 5%·|t1| trend threshold in `_trend_from_thirds` with a
nonparametric MAD (median absolute deviation) over the recent valid-fitness
window. The fixed 5% ratio reads as "flat" on low-fitness regimes where real
regressions are present at sub-5% absolute magnitude — the cycle-6 audit
showed parent contexts with medians falling 0.00165 → 0.00082 (clear
regression) still labelled `flat` and feeding the consumer's flat-trend
condition into Exploitation.
MAD adapts to the run's empirical noise scale: no chosen constant. Bootstrap
fallback to legacy 5%·|t1| ratio when fewer than `N_MIN_FOR_MAD=4` valid
samples in the window — pre-existing framework behaviour preserved during
the initial iterations.
Also exposes `archive_valid_fitnesses: tuple[float, ...]` as a transient
field on `EvolutionaryStatistics` (not persisted; rebuilt per emission).
This is the source-of-truth distribution that R1 (archive-quartile regime)
will read in a follow-up commit.
Constants introduced are all data-availability gates, not regime thresholds:
- `N_MIN_FOR_MAD = 4` — minimum sample size for MAD to be meaningful
- `_TREND_EPSILON = 1e-12` — numerical safety against degenerate MAD=0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
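A rough sketch of a MAD-based noise floor with the bootstrap fallback. How `_trend_from_thirds` actually combines the third medians with the MAD is not stated in the commit, so the comparison below (last-third median minus first-third median against the MAD of the whole window, higher-is-better only) is an assumption, as is the function name `trend_from_thirds`.

```python
from statistics import median
from typing import Sequence

N_MIN_FOR_MAD = 4        # minimum sample size for MAD to be meaningful
_TREND_EPSILON = 1e-12   # guards against a degenerate MAD of 0


def mad(values: Sequence[float]) -> float:
    """Median absolute deviation around the median."""
    m = median(values)
    return median([abs(v - m) for v in values])


def trend_from_thirds(fits: Sequence[float]) -> str:
    """Label the window 'rising', 'flat', or 'falling' (higher is better)."""
    n = len(fits)
    if n < 3:
        return "flat"
    k = n // 3
    t1, t3 = median(fits[:k]), median(fits[-k:])
    if n >= N_MIN_FOR_MAD:
        threshold = max(mad(fits), _TREND_EPSILON)
    else:
        threshold = 0.05 * abs(t1)  # legacy ratio used while bootstrapping
    delta = t3 - t1
    if delta > threshold:
        return "rising"
    if delta < -threshold:
        return "falling"
    return "flat"


if __name__ == "__main__":
    window = [0.00165, 0.0016, 0.0017, 0.0012, 0.0011, 0.0013, 0.00082, 0.0008, 0.0009]
    print(trend_from_thirds(window))
```

The window in the usage example is made up to mirror the 0.00165 → 0.00082 medians quoted above; it is not data from the run.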
* feat(context): R1+R3 — archive-quartile regime in mutation_context render
Adds two new tokens to every rendered parent context once the archive holds
≥ `N_MIN_ARCHIVE=4` valid programs:
Archive: N=N median=… q75=… best=…
Regime: BAD/MIDDLE/GOOD (Q? of archive)
And appends `archive-quartile Q?` inline to the existing rank line:
(rank 2/8 in window, archive-quartile Q1)
Both signals are derived from the same archive distribution emitted by R2's
`archive_valid_fitnesses` field on `EvolutionaryStatistics`. Quartile
boundaries are universal statistical convention (Q1=25%, Q2=50%, Q3=75%) —
not chosen thresholds. The mapping Q1→BAD, Q2/Q3→MIDDLE, Q4→GOOD is the
framework's editorial choice with no numeric constants.
R1 v2 design properties:
- No dependency on `MetricSpec.upper_bound` — regime is derived from the
run's empirical archive distribution, so the bundle is task-agnostic.
Tasks declaring `upper_bound` additionally get an informational
`Target: … focal_gap=…` line; R6's archetype gate does NOT read it.
- Direction-aware via `MetricSpec.higher_is_better` — works identically for
loss-style metrics where small = good.
- Bootstrap-safe: no token emitted when archive < 4 valid; rule falls back
to original Step-6 logic.
- O(N log N) per render on archive size bounded by `max_mutants=100`.
R3 reuses R1's `quartile_str` so there is a single source of truth and the
rank line cannot drift from the Regime line.
Tests: 10 new `TestArchiveQuartileRegime` cases cover Q1/Q2/Q3/Q4 placement,
archive < 4 (no regime emission), `higher_is_better=False` direction,
archive-quartile inclusion in rank line, ties at quartile boundaries, target
decoration on/off.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
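The quartile-to-regime mapping can be sketched as follows. The commit does not say how quartile membership or ties are computed, so the percentile-rank rule below (share of archive entries strictly worse than the focal) is an assumption, and `quartile_of` and `regime_of` are hypothetical names.

```python
from typing import Optional, Sequence

N_MIN_ARCHIVE = 4
_REGIME = {1: "BAD", 2: "MIDDLE", 3: "MIDDLE", 4: "GOOD"}


def quartile_of(focal: float, archive: Sequence[float],
                higher_is_better: bool = True) -> Optional[int]:
    """Return 1..4, or None while the archive is too small (bootstrap)."""
    if len(archive) < N_MIN_ARCHIVE:
        return None
    if higher_is_better:
        worse = sum(1 for f in archive if f < focal)
    else:
        worse = sum(1 for f in archive if f > focal)
    frac = worse / len(archive)          # share of the archive the focal beats
    return min(4, int(frac * 4) + 1)


def regime_of(focal: float, archive: Sequence[float],
              higher_is_better: bool = True) -> Optional[str]:
    """Map the quartile to BAD / MIDDLE / GOOD, or None during bootstrap."""
    q = quartile_of(focal, archive, higher_is_better)
    return None if q is None else _REGIME[q]


if __name__ == "__main__":
    archive = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print(regime_of(1.0, archive), regime_of(5.0, archive), regime_of(8.0, archive))
    print(regime_of(1.0, archive, higher_is_better=False))
```

Deriving both the rank-line suffix and the Regime line from one function mirrors the single-source-of-truth point made for `quartile_str`.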
* feat(prompts): R6 — archive-quartile archetype gate + suggester tag-bias
Two-layer defense against wasted-budget Exploitation on weak parents.
mutation/system.txt (consumer):
- Adds a Selection Rule that makes `Regime: BAD` (focal in Q1 of archive)
a HARD GATE: Exploitation archetypes (1-3) FORBIDDEN. Choose Exploration
(4-6) or, if intra has an "improved" verdict with an untried extension,
Hybrid (7-8). MIDDLE (Q2/Q3) gates Exploitation on intra-improved +
untried-extension; otherwise prefer Hybrid. GOOD (Q4) opens all
archetypes per other rules. Bootstrap (Regime line absent) falls back
to original logic.
- Adds an Evolutionary Statistics descriptive paragraph explaining the
new `Archive:` and `Regime:` lines so the LLM knows the rendered
tokens.
- Trend label vocab synced to code emit: `rising / flat / falling` (the
legacy `improving / regressing` words drifted from collector.py:126
and broke the consumer's match logic).
mutation_suggestions/system.txt (producer):
- Adds an Archive-quartile awareness rule: in BAD regime (Q1) do NOT tag
patterns as `beneficial` based only on local intra-card "improved"
verdicts. Prefer `fragile` or `rigid`. Reserve `beneficial` for MIDDLE
(Q2/Q3) or GOOD (Q4) regimes.
- Disambiguates earlier informal "low-fitness regime" wording (which
collided with the formal `Regime:` tag) — the metric-scale heuristic
is now explicitly called out as SEPARATE from the formal Regime tag.
- Trend vocab synced to `rising / flat / falling`.
Defense-in-depth: the producer suppresses `beneficial` tag at source for
Q1 parents; the consumer additionally forbids the Exploitation archetype
the tag would have biased toward. The two rules layer — they don't
duplicate. If the suggester slips and emits `beneficial`, the mutator's
hard gate still routes the mutation to Exploration.
Tests: TestR6ArchiveQuartileGate (consumer) + TestR6SuggesterTagBias
(producer) + TestRegimeAndQuartileVocabularySynergy (cross-prompt vocab
consistency for tag, verdict, quartile, regime scales).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(tools): trajectory_shape.py — log-based closeout analyzer for cycle comparisons
Computes the 6 trajectory metrics from plan §Verification on any
output/cycle*_*/evolution_*.log:
- best_at_end (frontier final)
- monotonicity_pct (cohort-mean signal, NOT running-max which would always be 100%)
- per_stage_best (early/mid/late thirds)
- longest_stagnation_min (gap between consecutive frontier-bumps)
- rtail_ge_020 / rtail_ge_030 (right-tail mass — # cells reached past 0.020 / 0.030)
- cells_filled
Two modes:
python tools/trajectory_shape.py <log_file> # single report
python tools/trajectory_shape.py --compare a.log b.log c.log # variance-floor verdict
Variance-floor rule (1.5×spread): N≥3 → mean(baselines)+1.5×spread is the bar;
treatment > bar → CONFIRMED, else NOISE.
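The variance-floor rule above can be sketched as follows. This is a minimal illustration and not the code in tools/trajectory_shape.py: the function name `variance_floor_verdict` is hypothetical, and `spread` is assumed to mean max minus min of the baseline values, which the message does not define.

```python
from statistics import mean


def variance_floor_verdict(baselines, treatment, k=1.5):
    """Return (bar, verdict) for a treatment value against baseline runs.

    Assumption: spread = max(baselines) - min(baselines).
    """
    if len(baselines) < 3:
        raise ValueError("variance-floor rule needs N >= 3 baselines")
    spread = max(baselines) - min(baselines)
    bar = mean(baselines) + k * spread
    return bar, ("CONFIRMED" if treatment > bar else "NOISE")


# Hypothetical numbers, chosen only to show both outcomes.
bar, verdict = variance_floor_verdict([0.025, 0.026, 0.027], 0.0295)
```

With these made-up baselines the bar sits near 0.029, so a treatment of 0.0295 clears it while 0.028 would not.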
Works on any log file regardless of Redis state — logs are permanent, Redis dbs
get flushed. Built during cycle-7's runtime, smoke-tested retroactively on
cycles 3/4b/5/6 to establish n=4 baseline (mean=0.02596, spread=0.00094,
zero breakouts past 0.030).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(context): R7+R8 v3.1 — archive distribution with worst/median/best + archive-percentile token
Renders the archive's distribution as `Archive: N=… worst=… median=… best=…` plus an
`archive-percentile pXX of N=Y` annotation on the existing rank line. Both lines are direction-
aware via `MetricSpec.higher_is_better`. No `Regime:` or `Target:` token is rendered — the LLM
reads the task's target from the task description and judges the focal against the rendered
distribution itself.
Why v3.1:
- v3's `Regime: BAD/MIDDLE/GOOD` was a pre-baked classifier. Trust-the-model-synthesis
principle: render data, let the LLM judge.
- v3's `Target:`/`focal_gap` was likewise pre-baked. The task description already states the
target; rendering it twice (and adding a derived `focal_gap`) introduces a hardcoded
interpretation channel the LLM doesn't need.
- Only deterministic gate kept: archive-percentile (a single direction-aware quality
percentile, 100=best). Quartile boundaries 25/75 are statistical convention, not magic.
Bootstrap-mislead defense: the rendered `worst=… median=… best=…` triplet makes archive
compression visible. A compressed bootstrap (N=7, all <0.002) lands the focal at p100 but the
distribution itself shows the LLM the archive is far from the task's stated target. The
qualitative target-awareness clause in the prompt instructs the model to apply that judgment.
Tests:
- test_archive_line_includes_worst_higher_is_better_true
- test_archive_line_worst_inverts_for_higher_is_better_false
- test_compressed_bootstrap_renders_rich_archive_no_target_line
- test_target_line_never_rendered_when_upper_bound_declared
- test_target_line_never_rendered_when_upper_bound_none
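A minimal sketch of the direction-aware percentile and the `Archive:` line described above. It is not the project's `_archive_percentile_of_focal`; the function names, the tie handling, and the five-decimal number format are assumptions made for illustration.

```python
from statistics import median


def archive_percentile(focal, archive, higher_is_better=True):
    """Quality percentile of `focal` within `archive`; 100 = best.

    Assumption: percentile = share of archive values the focal is at
    least as good as, so ties count in the focal's favour.
    """
    if not archive:
        raise ValueError("archive must be non-empty")
    if higher_is_better:
        not_better = sum(1 for v in archive if v <= focal)
    else:
        not_better = sum(1 for v in archive if v >= focal)
    return round(100 * not_better / len(archive))


def render_archive_line(archive, higher_is_better=True):
    """Render the distribution triplet; worst/best swap for loss metrics."""
    worst, best = min(archive), max(archive)
    if not higher_is_better:
        worst, best = best, worst
    return (
        f"Archive: N={len(archive)} worst={worst:.5f} "
        f"median={median(archive):.5f} best={best:.5f}"
    )


# Compressed bootstrap in the spirit of the N=7, all < 0.002 case (values invented).
archive = [0.0010, 0.0012, 0.0013, 0.0015, 0.0016, 0.0018, 0.0019]
print(archive_percentile(0.0019, archive))  # 100
print(render_archive_line(archive))
```

The focal lands at p100, yet the rendered triplet still shows how narrow and low the archive is, which is the point of the bootstrap-mislead defense.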
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(prompts): R9 v3.1 — archive-percentile gate + qualitative target awareness
Mutator and suggester prompts now reference the v3.1 context surface: `Archive: …` and
`archive-percentile pXX of N=Y`. The only deterministic gate is the archive-percentile gate
(focal in bottom quartile → Exploitation FORBIDDEN; focal in top quartile → all archetypes
eligible). Quartile boundaries 25/75 are statistical convention, not magic numbers.
Target awareness is qualitative: the task description states the problem's target/bound; the
prompt instructs the LLM to compare the rendered `worst=… median=… best=…` distribution
against that target and apply judgment. No numeric threshold is imposed because fitness scale
is typically non-linear — small absolute distances at low fitness are structurally harder than
the same absolute distance at high fitness.
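The deterministic part of the rule can be sketched as a small function. The gate itself lives in prompt text, so this is only an illustration of the logic a checker might apply; the function name is hypothetical, and treating the middle band as gate-open (leaving the choice to the qualitative clauses) is an assumption.

```python
EXPLOITATION = {1, 2, 3}
EXPLORATION = {4, 5, 6}
HYBRID = {7, 8}


def gate_permitted_archetypes(archive_percentile):
    """Archetype ids the archive-percentile gate leaves open.

    Only the bottom quartile (< 25) removes anything. Assumption: the
    gate itself does not restrict the middle band or the top quartile.
    """
    if archive_percentile < 25:
        return EXPLORATION | HYBRID
    return EXPLOITATION | EXPLORATION | HYBRID


print(sorted(gate_permitted_archetypes(10)))  # [4, 5, 6, 7, 8]
```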
Removed from previous v3:
- `Regime: BAD/MIDDLE/GOOD` pre-baked classifier (replaced by archive-percentile + prose)
- `Target:`/`focal_gap` rendered tokens (LLM reads target from task description)
- `half the distance` magic-constant compound rule
Removed in this v3.1 cleanup pass:
- Legacy "framework does NOT render a separate `Target:` line" mentions — negating-by-mention
is noise; prompts now positively instruct reading the target from the task description.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(v3.1): archive-percentile gate, archive distribution, no Target/Regime tokens
test_mutation_context.py:
- test_target_line_never_rendered_when_upper_bound_declared: asserts `Target:` and `focal_gap`
are absent even when MetricSpec.upper_bound is set.
- test_target_line_never_rendered_when_upper_bound_none: parallel assertion for tasks without
declared upper bound.
- test_compressed_bootstrap_renders_rich_archive_no_target_line: documents the bootstrap-
mislead defense — N=7 compressed archive with focal at p100 still renders worst/median/best
so the LLM can judge the gap against the task target itself.
- test_archive_line_includes_worst_higher_is_better_true: asserts `worst=…` and `best=…` are
rendered with the strongest at `best` for fitness-style metrics.
- test_archive_line_worst_inverts_for_higher_is_better_false: parallel for loss-style metrics
(worst = highest value, best = lowest).
test_prompts.py (TestV31* replacing TestR6*):
- TestV31ArchivePercentileGate: archive-percentile referenced, no Regime/archive-quartile/
Target/focal_gap/half-distance vocab, FORBIDDEN keyword on Q1 Exploitation, 25/75 cited,
qualitative target awareness via task description, non-linear scale acknowledged, trend
vocab matches collector (rising/flat/falling).
- TestV31SuggesterTagBias: parallel for mutation_suggestions prompt.
- TestV31VocabularySynergy: cross-prompt consistency for tag scale, verdict scale, quartile
boundaries (25, 75), archive distribution vocab (worst/median/best).
364 tests pass. Lint clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(audit): v3.1 mutation decision tree — channels, gate, decision table, worked examples
Spells out exactly how the LLM mutator selects an archetype (Exploitation 1–3 / Exploration
4–6 / Hybrid 7–8) under the v3.1 surface:
1. Six context channels (C1 Metrics, C2 Insights, C3 Intra Memory, C4 Memory Cards, C5
Evolutionary Statistics, C6 Ancestral Momentum) — producer / carrier / consumer.
2. Two decision components: one deterministic gate (archive-percentile, only constants
are quartile boundaries 25/75) and one qualitative target-awareness clause (LLM reads
task description, judges against rendered Archive distribution; no numeric threshold
because fitness scale is non-linear).
3. Exhaustive 18-row decision table covering (archive-percentile bucket × intra verdict
× trend × invalid streak) → archetype.
4. Four worked heilbron examples: bootstrap-mislead p100 case (Hybrid 7 override),
normal mid-run (Hybrid), late-run refinement (Exploitation 1), plateau exit
(Exploration).
5. Cycle-9 mid-run invariants: 0% Exploitation on archive-percentile<25 focals; no
Regime/Target/archive-quartile tokens; archive-percentile rendered ≥95% post-bootstrap.
6. Universal-across-tasks proof: only per-task input is higher_is_better flag; 25/75 are
statistical-convention quartile boundaries, not chosen values.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tools(v31-validator): read-only sampler — recompute archive-percentile + decision-tree predictions vs LLM choices
Non-mutating Redis sampler that walks DBs 13/14/15 program by program,
parses `Program.metadata.mutation_context` (the rendered prompt) for the
parent's state (focal fitness, valid sibling fitnesses, trend, intra
verdicts), and recomputes the v3.1 archive-percentile from valid
siblings. Emits JSONL with fields: tree_bucket, tree_eligible_archetypes,
archetype_chosen, fitness_delta, cf_tags (which of CF-A..CF-E cells the
sample lands in), match (tree-prediction vs LLM-choice).
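One JSONL row with the fields listed above might be assembled like this. The helper names and the value types (integer archetype ids, string bucket labels) are assumptions; only the field names come from the message.

```python
import json


def make_record(tree_bucket, tree_eligible_archetypes, archetype_chosen,
                fitness_delta, cf_tags):
    """Build one sampler row; `match` compares the tree to the LLM choice."""
    return {
        "tree_bucket": tree_bucket,
        "tree_eligible_archetypes": list(tree_eligible_archetypes),
        "archetype_chosen": archetype_chosen,
        "fitness_delta": fitness_delta,
        "cf_tags": list(cf_tags),
        "match": archetype_chosen in tree_eligible_archetypes,
    }


def to_jsonl(records):
    """Serialize records one JSON object per line."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records)


# Hypothetical sample: a bottom-quartile parent where the LLM chose Exploitation.
row = make_record("bottom", [4, 5, 6, 7, 8], 1, -0.001, ["CF-A"])
print(to_jsonl([row]))
```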
Reuses:
- gigaevo.programs.stages.collector.N_MIN_ARCHIVE
- gigaevo.evolution.mutation.context._archive_percentile_of_focal
(direction-aware)
- gigaevo.database.redis_program_storage RedisProgramStorage.get_all
- gigaevo.programs.program.Program.get_metadata (base64 deserialization)
Output drives docs/audits/MUTATION_DECISION_TREE_V3_1_COUNTERFACTUAL_AUDIT_2026-05-18.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(audit): v3.1 decision tree — counterfactual audit on 289 prior-run programs (DBs 13/14/15)
Recomputes v3.1 archive-percentile and tree-predicted archetype bucket
on every program with mutation_context metadata across DBs 13/14/15
(n=289), compares to the LLM's actual archetype_choice and the child's
fitness_delta, then groups by the 5 candidate counterfactual cells
identified in the audit plan:
CF-A empty intra (first child) n=225 — modal Hybrid 7 (57), pos-rate 42.1%
CF-B invalid focal with archive line n=0 — unreachable under OLD prompt; documented as gap
CF-C low-N flat trend (noise-dominated) n=32
CF-D middle-band percentile + falling trend n=19
CF-E top-quartile + spread wide + far-target n=53 — pos-rate 92.5%, Exploitation 32/35 = 91% wins
Headline finding: v3.1's target-awareness override demoting top-quartile
parents to Hybrid is empirically TOO RESTRICTIVE. CF-E shows exploitation
beats hybrid 32/35 when the gate permits both. 18 counterfactual-A
samples — gate-violations that improved fitness anyway.
Recommendations applied in next 2 commits:
- REV-1 soften row 13 (target-awareness no longer forbids Exploitation)
- REV-3 add row 19 — empty intra middle-band → Hybrid 7 default
- REV-4 add row 19a — invalid focal → Exploration with corrective
- REV-2 (row 11 softening for CF-D) documented as PROPOSAL — n=19 too thin
Observational only — contexts rendered under OLD prompt surface; the
sampler recomputes v3.1 tokens from the same underlying numerics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(audit): v3.1 tree — soften row 13 target-awareness + add rows 19 / 19a per counterfactual audit
Three additive/softening edits to MUTATION_DECISION_TREE_V3_1_2026-05-18.md,
all driven by empirical findings in the counterfactual audit
(MUTATION_DECISION_TREE_V3_1_COUNTERFACTUAL_AUDIT_2026-05-18.md):
REV-1 — row 13 (top-quartile + spread wide + far-target):
OLD: "demote to Hybrid 7 even though gate permits Exploitation"
NEW: "prefer Hybrid 7 OR Exploitation 1 — gate does NOT forbid
Exploitation. Choose by C2 insight severity and C6 ancestral_step_delta."
Evidence: CF-E n=53, pos-rate 92.5%; Exploitation wins 32/35 in this cell.
REV-3 — new row 19 (empty intra, middle-band 25≤p<75):
Add explicit default: Hybrid 7 (Guided Innovation).
Evidence: CF-A n=225, modal choice Hybrid 7 (57 picks), pos-rate 42.1%.
Closes a gap where the tree relied on the LLM inferring a default.
REV-4 — new row 19a (invalid focal with archive line):
Defensive rule — force Exploration with corrective mechanism regardless
of archive-percentile (focal cannot be refined when it is invalid).
Evidence: CF-B was unreachable in prior data; rule is forward-looking.
REV-2 (row 11 middle-band+falling softening) NOT applied — CF-D n=19 too
thin; documented in audit as proposal pending more data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(prompts): soften target-awareness clause + add noise-dominated & empty-intra rules
Mirrors REV-1/REV-3 from the decision-tree counterfactual audit into the
two system prompts the LLM actually sees.
mutation/system.txt:
- Softened target-awareness paragraph: target awareness shapes priority
WITHIN the gate-permitted set; it never forbids an archetype the gate
allows. When archive.best far below target AND focal in top quartile,
BOTH Hybrid 7 and Exploitation 1 are legitimate.
- Added noise-dominated-trends bullet: falling/flat with
`[only N valid — too few for trend]` (or iter_window_valid < 9) is
inconclusive — do NOT force Exploration on it alone.
- Added empty-intra default bullet: first child of a fresh parent in
middle band defaults to Hybrid 7 (Guided Innovation).
mutation_suggestions/system.txt:
- Softened target-awareness paragraph (same wording philosophy).
- Added noise-dominated-trends bullet so the analyst does not raise
severity to `high` on a noisy signal alone.
Total: ≤10 lines net per file, all additive or replacement softening.
Empirical justification: CF-E audit shows Exploitation wins 32/35 when
the gate permits both; CF-A shows Hybrid 7 is the modal LLM choice
(57/225) and best per-pick pos-rate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tools(pseudo-evo-bench): single-shot mutation A/B harness with archetype-distribution analysis
Tests prompt-wrapper changes (mutation/system.txt + user.txt) on a fixed cohort of
50 stratified parents from DBs 13/14/15. Each parent's mutation_context blob is
frozen from its original search-time run; only the system.txt/user.txt wrappers
are re-rendered from the working tree at sample time.
Components:
- sample_parents.py: stratify 50 parents (17/18/15 by fitness bucket), render
current HEAD prompts, write parents.json (idempotent under SEED=20260518)
- run_qwen.py: 1 LLM call per parent at concurrency=6 via LiteLLM proxy
- eval_mutants.py + eval_runner.py: parallel heilbron validate() at concurrency=4
- analyze.py: PRIMARY signal is archetype/strategy distribution (coverage,
entropy, group balance, v3.1 gate compliance, archetype shift matrix).
Fitness delta is reported as SECONDARY since single-shot is noise-dominated
by parent quality and validity.
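The idempotent stratification in sample_parents.py could look roughly like the sketch below. The bucket names and the mapping of 17/18/15 onto low/mid/high are assumptions; the message gives only the quotas and the seed.

```python
import random

SEED = 20260518
QUOTAS = {"low": 17, "mid": 18, "high": 15}  # assumed bucket order


def stratified_sample(parents_by_bucket, quotas=QUOTAS, seed=SEED):
    """Draw a fixed-size cohort per fitness bucket, reproducibly.

    Sorting both the bucket names and each pool makes the draw independent
    of dict or storage iteration order, so the same seed always yields
    the same cohort.
    """
    rng = random.Random(seed)
    chosen = []
    for bucket in sorted(quotas):
        pool = sorted(parents_by_bucket[bucket])
        chosen.extend(rng.sample(pool, quotas[bucket]))
    return chosen
```

Using a local `random.Random(seed)` rather than the module-level generator keeps the draw unaffected by any other code that consumes random numbers.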
Scope (per README): tests prompt-wrapper changes only. Stage-internal
mutation_context build (collector, intra/extra memory, lineage) is NOT
exercised because contexts are pre-rendered.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tools(pseudo-evo-bench): iter0 (HEAD) vs iter1 (pre-softening v3.1) — archetype-distribution A/B
Same 50 parents (verified parent_id-identical), frozen mutation_context blocks
(verified byte-identical per parent). Only difference: rendered system_prompt
(iter0 = 14,375 chars / HEAD with +14-line softening; iter1 = 13,415 chars at
commit 11eb4d4b before softening).
PRIMARY: archetype/strategy distribution
=========================================
                       iter0 (HEAD softened)   iter1 (v3.1 sharp)
Coverage               7/8 archetypes          7/8 archetypes
Entropy                2.583 bits              2.565 bits
Group balance E/X/H    28% / 41% / 30%         20% / 52% / 28%
Group skew (max-min)   13.0% ← fairer          32.6%
Low anti-Exploit       87.5% (n=16)            93.8% (n=16)
High Exploit rate      50.0%                   42.9%
Per-bucket group shift (where softening actually changed behavior):
low: ~unchanged (gate respected in both)
mid: iter0 56% Hybrid → iter1 50% Explore (softening pushed mid TOWARD Hybrid)
high: iter0 14% Hybrid → iter1 36% Hybrid (softening pushed high TOWARD Explore)
Archetype shift matrix: 16/45 decisive pairs (36%) cross GROUP boundary
between iter0 and iter1 — the +14 lines DO steer the LLM, the question is
whether the steering is desirable.
Common finding across BOTH iterations:
archetype #8 "Conservative Exploration" picked ZERO times (0/92 mutations)
→ strong signal the prompt does NOT surface this archetype effectively
archetype #7 "Guided Innovation" dominates Hybrid (14/14 + 13/13)
SECONDARY: fitness (noise-dominated, but directionally informative)
==================================================================
Sign test on paired Δ: iter1 wins 20, iter0 wins 7, ties 23 → p≈0.012
Validity: iter0 26/50, iter1 32/50
Interpretation: HEAD softening trades 6pp of single-shot fitness loss for
group-balance fairness. Neither prompt surfaces archetype #8.
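For reference, a sign test on the paired counts above can be computed as in this sketch. The message does not say which variant produced p≈0.012; the normal approximation without continuity correction reproduces that figure, while the exact two-sided binomial gives roughly 0.019. The function name is hypothetical.

```python
from math import comb, erfc, sqrt


def sign_test(wins_a, wins_b):
    """Two-sided sign test on paired outcomes, ties dropped.

    Returns (p_exact, p_normal): the exact binomial p-value and the
    normal-approximation p-value without continuity correction.
    """
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    z = abs(wins_a - wins_b) / sqrt(n)
    p_normal = erfc(z / sqrt(2))
    return p_exact, p_normal


p_exact, p_normal = sign_test(20, 7)  # iter1 wins 20, iter0 wins 7, 23 ties dropped
```

Both variants stay below 0.05, so the direction of the finding does not depend on the choice.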
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* prompts(mutation/system.txt): rewrite archetype #8 as Component Substitution + sharpen #5 separator from #4
Empirical motivation: pseudo-evo bench iter0+iter1 picked archetype #8 (Conservative Exploration) ZERO times across 92 valid responses. Diagnosis — "explore within structural / interface constraints" describes properties any valid mutation already has, not an edit pattern the model can operationalise.
Archetype #8 → Component Substitution
Replace ONE named subroutine or building block (scoring function, sampler, init scheme, distance metric, update rule, post-processor) with an alternative of the same kind, occupying the same slot — same inputs, same output shape — so surrounding control flow, interfaces, and hyper-parameters remain unchanged. Distinct from #4 (changes algorithm family) and from #7 (adds a component alongside an existing one rather than replacing).
Archetype #5 → sharpened separator from #4
"Change the SET of admissible solutions (relax/tighten a constraint, drop a parity or symmetry rule, allow rotations, switch from discrete to continuous parameterisation) without changing the search algorithm itself. #4 changes HOW the search runs; #5 changes WHAT set is searched."
Net effect — eight distinct edit verbs:
Exploit: tune (#1) / extend-scope (#2) / remove (#3)
Explore: reinvent-algorithm (#4) / change-feasibility-set (#5) / synthesise-from-memory (#6)
Hybrid: add-alongside (#7) / substitute-component (#8)
Raises realised entropy ceiling above today's log2(5)≈2.32 bits toward the 3.0-bit max for 8 archetypes. Pseudo-evo iter2 will verify the model actually picks #8 and discriminates #5 from #4.
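The entropy figures quoted here follow from the Shannon entropy of the archetype pick distribution. A minimal sketch, with a hypothetical function name:

```python
from collections import Counter
from math import log2


def archetype_entropy_bits(picks):
    """Shannon entropy (bits) of the empirical archetype distribution."""
    counts = Counter(picks)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Uniform use of 5 archetypes caps entropy at log2(5); uniform use of 8 reaches 3 bits.
print(archetype_entropy_bits(range(5)))
print(archetype_entropy_bits(range(8)))
```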
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* schema(mutation): enforce archetype names via Literal + add drift-detection tests
Changes:
1. Added ARCHETYPE_NAMES constant and ArchetypeName Literal type to constants.py (single source of truth for canonical names)
2. Updated MutationStructuredOutput schema to use ArchetypeName Literal (strict validation, rejects unknown archetype strings at parse time)
3. Added 4 new test functions to test_mutator_system_prompt.py:
- test_archetype_names_appear_in_system_prompt() — catches drift between schema Literal set and prompt's archetype menu
- test_archetype_count_is_eight() — asserts len(ARCHETYPE_NAMES)==8
- test_mutation_output_accepts_canonical_archetypes() — validates each canonical name
- test_mutation_output_rejects_unknown_archetype() — rejects out-of-set strings with ValidationError
4. Fixed test_defaults in test_mutation_agent.py to use canonical archetype "Precision Optimization" instead of invalid "test"
Motivation: Prevent silent LLM output rejection when system.txt and schema drift (e.g., if archetype #8 is rewritten in prompt but not updated in Literal).
All 104 tests pass. No changes to mutation/system.txt (archetype #5/#8 redesign committed separately 2026-05-18 at 7a52f45d).
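The drift check can be illustrated with the standard library alone. This is a sketch under assumptions: it lists only the three archetype names that appear in this log (the real set has eight), it derives the tuple from the Literal whereas the actual constants.py layout is not shown here, and `parse_archetype` raises `ValueError` as a stand-in for the schema's Pydantic `ValidationError`.

```python
from typing import Literal, get_args

# Illustrative subset only; the real ARCHETYPE_NAMES has eight entries.
ArchetypeName = Literal[
    "Precision Optimization",
    "Guided Innovation",
    "Component Substitution",
]
ARCHETYPE_NAMES = get_args(ArchetypeName)


def missing_from_prompt(prompt_text, names=ARCHETYPE_NAMES):
    """Names in the schema that the prompt's archetype menu no longer mentions."""
    return [name for name in names if name not in prompt_text]


def parse_archetype(value):
    """Reject out-of-set strings at parse time (stand-in for the schema)."""
    if value not in ARCHETYPE_NAMES:
        raise ValueError(f"unknown archetype: {value!r}")
    return value
```

A drift test then reduces to asserting that `missing_from_prompt(system_prompt_text)` is empty.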
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(auto-optimize-loop): spec + reference schemas + history/patterns scaffolds
Adds the auto-optimize-loop task spec that drives autonomous cycles tuning
ONLY the mutation operator's context factory graph and the archetype
framework. Primary success criterion is healthy trajectory + healthy
mutants; 0.03 is the "real improvement" floor with 0.035 aspirational.
All future auto-loop cycle commits land linearly on r7-r8-r9-v3-bundle
and are identified by their commit SHA captured at LAUNCH time. Each
cycle writes a Reconstruction MD and Analytics MD; the Analytics MD's
retroactive invariant audit hard-gates the next cycle's PROPOSE step.
Files:
- docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md (primary spec)
- docs/audits/references/AUTO_OPTIMIZE_RECONSTRUCTION_MD_SCHEMA.md
- docs/audits/references/AUTO_OPTIMIZE_ANALYTICS_MD_SCHEMA.md
- docs/audits/AUTO_OPTIMIZE_CYCLE_HISTORY.md (append-only ledger)
- docs/audits/AUTO_OPTIMIZE_PATTERNS.md (evidence ledger)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: gitignore output/runs/tool-caches + capture pre-loop audit MDs
- Add .gitignore rules for output/ runs/ problems/heilbron_pro/
rotated litellm sampler logs, and throwaway tools/ subdirs
(benchmark_gemini_ab, insights_ablation, lineage_card_scaffold).
- Capture pre-loop audit MDs under docs/audits/ that informed the
cycle-9 redesign + auto-optimize-loop spec (insights, lineage memory
plans, mutation guidance rubric, cycle-8 prelaunch, prompt redesign,
etc.). These are read-only history; future PRs cite them.
- Collapse multi-line attach_inputs({...}) calls in
tests/stages/test_intra_memory_cache.py to single-line form
(pure formatting, no behavior change).
- Add docs/audits/AUTO_OPTIMIZE_CYCLE_0_ANALYTICS.md (cycle-0 = cycle-9
baseline) pre-drafted at T+1h17m with <TBD-FINALIZE> markers;
end-of-run values will be filled in once PID 2008891 exits.
This is a non-loop chore commit. It becomes cycle-1's PARENT_SHA
so the §8.1 invariant
git rev-list --count $PARENT_SHA..HEAD == 1
will be satisfiable when cycle-1's single IMPLEMENT commit lands on top.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: ruff fix + format on tools/pseudo_evo_bench (pre-loop)
Applies `ruff check --fix` + `ruff format` to the pseudo_evo_bench
A/B harness scripts. All changes are cosmetic:
- 11 I001 import-order errors fixed (stdlib imports merged into
alphabetic order with site-package imports).
- Long json.dumps(...) and dict literals reflowed by the formatter.
These files have been failing `ruff check .` since they landed
(commits 893113c1 / a9b4bee5). The §8.1 pre-launch lint invariant
in the auto-optimize loop spec requires a clean `ruff check .` +
`ruff format --check .`, so this clean-up is a prerequisite for
cycle-1 launch.
Safety:
- pseudo_evo_bench is NOT imported by gigaevo/ or tests/ — grep
across both trees returns zero hits. The currently-running
cycle-9 (PID 2008891) does not touch these files.
- Only formatting changes; no semantic edits, no API surface change.
- `pytest tests/prompts/test_mutator_system_prompt.py` continues
to pass (archetype-drift detection).
Verified:
ruff check . → All checks passed!
ruff format --check . → 1172 files already formatted
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(auto-optimize-loop): finalize cycle-0 Analytics + cycle-1 PROPOSE + scope-expansion note
Finalizes cycle-0 (= pre-loop cycle-9 archetype redesign + R-bundle) baseline analytics
against the full evolution log (20043 lines, exit at T+1h55m41s).
- AUTO_OPTIMIZE_CYCLE_0_ANALYTICS.md: substitute all TBD-FINALIZE markers with
extracted numbers. Final outcome: HEALTHY-NEUTRAL (S2.1 trajectory PASS narrow;
S2.2 fitness FAIL with best_fitness=0.02620 < 0.03 floor, below baseline 0.02788).
Trajectory has 6 strictly-increasing best-fitness peaks with one mid-run plateau
of ~46 mutants strict (peaks #4->#5) followed by fast late rescue (peak #6).
Strict stagnation_interval_max=46 NARROW-FAILS <=40 gate; inclusive frontier-event
definition PASSES at <=5 mutants. valid_rate=52%, frontier_new_cell_events=42,
right_tail_mass=42.3%, per_parent_advance_rate=6% strict / 10% inclusive. Component Substitution
(new archetype #8) NOT dead - rose 0->18 picks (6%) by run end; vindicates the
feedback_archetype_distribution_not_a_goal user reframe.
- AUTO_OPTIMIZE_CYCLE_HISTORY.md: cycle-0 row now shows HEALTHY-NEUTRAL decision,
best_fitness=0.02620, S2.1 PASS narrow, S2.2 FAIL.
- AUTO_OPTIMIZE_PATTERNS.md: cycle-0 entry as NEUTRAL ceiling evidence; documents
surface scope (R-bundle), numbers (S2.1 components), caveats (n=1 variance not
yet measured; below baseline by 6% within plausible n=1 variance).
- AUTO_OPTIMIZE_CYCLE_1_PROPOSE.md: cycle-1 = variance-floor replicate of baseline
(NO EDIT per S7). Decision rationale cites feedback_variance_floor_first +
feedback_consistent_improvement_all_stages + feedback_auto_optimize_trajectory_first.
S4 citations rewritten as trajectory-shape-only signals (plateau duration,
per_parent_advance_rate, stagnation_interval_max) per
feedback_archetype_distribution_not_a_goal. Updated parent-SHA reference to current operational HEAD.
- AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md: S3 prelude blockquote captures the 2026-05-19
user verbal directive expanding cycle-3+ scope to the entire mutation context harness
(feedback_mutation_context_harness_in_scope). S3.2 SLIGHTLY cap on mutation/system.txt
preserved as engineering constraint only.
Non-cycle docs commit - the cycle-1 commit will follow as a separate --allow-empty
commit per the variance-floor protocol.
* auto-loop cycle 1: variance-floor replicate of baseline (no edit)
* auto-loop meta cycle 1: variance-floor replicate (HEALTHY-NEUTRAL)
Cycle-1 ran 2026-05-19 02:33→04:53 MSK on db=12, identical config to
cycle-0 baseline (a4925a90). Cycle commit a527b256 is the --allow-empty
IMPLEMENT SHA; zero code/prompt/config diff.
Result: HEALTHY-NEUTRAL (variance-floor; informational §2.2 PASS).
- best_fitness = 0.031187 (cycle-0 was 0.02620; Δ=+0.00499)
- |Δ| < §7 variance threshold 0.01310 → within variance floor
- §2.1 trajectory: PASS all 5 gates strict (frontier_new_cell 52,
right_tail_mass 57.4%, advance_rate 6% strict / 12% inclusive,
stagnation_interval_max 14 strict within-active, valid_rate 64%)
- §2.2 fitness floor: PASS (0.031187 ≥ 0.03) — INFORMATIONAL
flip vs cycle-0 (which failed by 0.004); NO-EDIT cycles cannot
be WIN-CANDIDATE by spec.
- Trajectory shape: rapidly ascending for first 29 mutants (6 peaks
compressed), then 70-mutant trailing plateau. INVERSE of cycle-0's
mid-run plateau + late rescue. Two equally consistent interpretations
(A: baseline mean ~0.029±0.003, B: cycle-1 high-side outlier).
Cycle-2 (db=13) will disambiguate. If cycle-2 lands within Δ=0.006 of
either prior cycle, baseline mean ≈ mean(0.026, 0.031, cycle-2);
loop may be near a structural Heilbron ceiling per
feedback_auto_optimize_trajectory_first ("task may be unsolvable in
knob scope — that's valid").
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(auto-optimize-loop): cycle-2 PROPOSE — variance-floor replicate #2 (db=13)
Continuation of §7 variance-floor methodology. Cycle-2 is NO-EDIT,
db=13 only delta vs cycle-0/1. Adds the third sample point to lock
baseline mean + std before cycle-3 first real intervention.
After cycle-1, n=2: best=[0.02620, 0.03119], midpoint 0.02870,
sample std 0.00353, §7 variance threshold 0.01435 (50% of midpoint).
Cycle-2 decision tree:
- best ∈ [0.026, 0.034] AND §2.1 PASS → cycle-3 PROPOSE proceeds
- best outside [0.020, 0.040] OR §2.1 FAIL → §7 STOP
- best in marginal bands → optional cycle-2.5 NO-EDIT before cycle-3
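The three-way decision above can be written out as a small function. The name is hypothetical and the inclusive band edges are an assumption; the bands themselves are the ones listed.

```python
def cycle2_next_step(best, trajectory_pass):
    """Map cycle-2's best fitness and its trajectory gate to the next step."""
    if not trajectory_pass or not (0.020 <= best <= 0.040):
        return "STOP"      # outside the outer band, or trajectory gate failed
    if 0.026 <= best <= 0.034:
        return "PROCEED"   # cycle-3 PROPOSE
    return "MARGINAL"      # optional cycle-2.5 NO-EDIT replicate


print(cycle2_next_step(0.0225, True))  # MARGINAL
```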
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop cycle 2: variance-floor replicate of baseline (no edit)
Per §7 variance-floor methodology. Cycle-2 = second NO-EDIT replicate
on db=13. Cycle identity SHA = this commit's HEAD.
PARENT_SHA = previous commit (cycle-2 PROPOSE meta).
No code/prompt/config diff vs cycle-0 baseline (a4925a90).
§8.1 invariants verified pre-launch:
- branch r7-r8-r9-v3-bundle
- working tree clean
- LiteLLM proxy 10.232.30.185:4000 reachable (/health/readiness)
- archetype-schema drift tests 51 passed
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop meta cycle 2: variance-floor replicate (HEALTHY-NEUTRAL; marginal → cycle-2.5)
Cycle-2 IMPLEMENT commit (57442737) was a --allow-empty NO-EDIT replicate on db=13.
This meta commit captures the cycle-2 ANALYZE-post artifacts:
- AUTO_OPTIMIZE_CYCLE_2_RECONSTRUCTION.md: best_fitness=0.021266 at mutant ~38;
§2.1 trajectory PASS (5/5 gates strict); §2.2 fitness floor FAIL (< 0.03 by 0.00873)
- AUTO_OPTIMIZE_CYCLE_2_ANALYTICS.md: n=3 baseline mean=0.02622, stdev=0.00496;
cycle-1 reclassified as high-side outlier; trailing-plateau shape dominates 2/3 cycles
- AUTO_OPTIMIZE_CYCLE_HISTORY.md: cycle-2 row appended
- AUTO_OPTIMIZE_PATTERNS.md: cycle-2 NEUTRAL evidence entry
- AUTO_OPTIMIZE_CYCLE_3_SURFACE_MENU_DRAFT.md: surface menu for cycle-3 PROPOSE
(gated by cycle-2.5 4th NO-EDIT replicate per cycle-2 PROPOSE marginal-band rule)
Decision: cycle-2 best_fitness 0.021266 ∈ [0.020, 0.025] MARGINAL band per cycle-2
PROPOSE §5 decision tree → next cycle (2.5) is another NO-EDIT replicate on db=14
to add a 4th variance sample before cycle-3 first real intervention.
Per spec §8.1: git rev-list --count 57442737..HEAD == 1 (one commit per cycle step).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(auto-optimize-loop): cycle-2.5 PROPOSE — 4th NO-EDIT variance-floor replicate (db=14)
Triggered by cycle-2 PROPOSE §5 decision-tree marginal-band rule:
cycle-2 best_fitness 0.021266 ∈ [0.020, 0.025] → need 4th sample.
n=3 stats: mean=0.02622, stdev=0.00496; §7 STOP NOT triggered.
n=4 will tighten baseline mean variance ~2.6× and disambiguate
the bimodal-suspicious distribution (cycle-1 1σ above mean).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop cycle 2.5: variance-floor replicate of baseline (4th sample, no edit)
Identical config to cycle-2 except redis.db=14 (cycle-2 was 13).
Per cycle-2 PROPOSE §5 decision-tree: cycle-2 best 0.021266 in MARGINAL
band [0.020, 0.025] required this 4th NO-EDIT sample.
After cycle-2.5 closes:
- if best ∈ [0.015, 0.035] AND §2.1 PASS → cycle-3 PROPOSE proceeds
- if outside band OR §2.1 FAIL → STOP per §7
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop meta cycle 2.5: variance-floor replicate (HEALTHY-NEUTRAL; n=4 baseline LOCKED; proceed to cycle-3)
cycle-2.5 best_fitness = 0.025709 at mutant 80/100 on db=14.
n=4 baseline LOCKED:
- mean = 0.02609, stdev (sample) = 0.00406
- range [0.02127, 0.03119], CV = 15.6%
- §7 STOP NOT triggered (0.00406 << 0.5 × 0.03119 = 0.01559)
§2.1 trajectory: PASS lenient (4/5 strict + stagnation NARROW-FAIL at 55 mutants,
same shape as cycle-0's 46). §2.2 fitness floor: FAIL (0.025709 < 0.03 by 14%).
Trajectory-shape census (n=4):
- 2/4 cycles: mid-run plateau + late rescue (cycle-0, cycle-2.5)
- 2/4 cycles: leading sprint + trailing plateau (cycle-1, cycle-2)
Bimodal 2/2 — cycle-3 intervention must address BOTH shapes.
Decision tree (cycle-2.5 PROPOSE §5):
best 0.02571 in [0.015, 0.035] AND §2.1 PASS lenient
-> PROCEED TO CYCLE-3 PROPOSE (first real intervention)
Cycle-3 WIN-CAND threshold = mean+1sigma = 0.03015.
Cycle-3 STRONG-WIN threshold = mean+2sigma = 0.03421.
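The locked n=4 figures can be reproduced from the four best_fitness values reported for cycles 0, 1, 2 and 2.5. A minimal sketch; the function name is hypothetical, and the sample (n-1) standard deviation is used as stated above.

```python
from statistics import mean, stdev


def baseline_summary(bests):
    """Mean, sample stdev, CV, and the thresholds derived from them."""
    mu = mean(bests)
    sigma = stdev(bests)  # sample standard deviation (n - 1)
    return {
        "mean": mu,
        "stdev": sigma,
        "cv": sigma / mu,
        "stop_threshold": 0.5 * max(bests),
        "win_cand": mu + sigma,
        "strong_win": mu + 2 * sigma,
    }


# best_fitness of cycle-0, cycle-1, cycle-2, cycle-2.5 as reported in this log
summary = baseline_summary([0.02620, 0.031187, 0.021266, 0.025709])
for key, value in summary.items():
    print(f"{key}: {value:.5f}")
```

Rounded to five decimals this yields mean 0.02609, stdev 0.00406, WIN-CAND 0.03015 and STRONG-WIN 0.03421, matching the numbers above.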
First cycle with DIRECT live /proc/<pid>/environ verification of all four
section 8.1 environment invariants (OPENROUTER_API_KEY len=73, OPENAI_API_KEY=sk-gigaevo,
HTTP_PROXY+HTTPS_PROXY unset). Strengthens n=4 baseline vs cycle-1/2 INFERRED env.
Files:
- AUTO_OPTIMIZE_CYCLE_2_5_RECONSTRUCTION.md (FINAL; sections 9-13 filled)
- AUTO_OPTIMIZE_CYCLE_2_5_ANALYTICS.md (created; sections 0-6)
- AUTO_OPTIMIZE_PATTERNS.md (append cycle-2.5 entry; n=4 stats)
- AUTO_OPTIMIZE_CYCLE_HISTORY.md (append cycle-2.5 row)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop cycle 3: intra-memory saturation detection (K=3 narrative streak → header inject)
INT 1 from cycle-3 surface-menu draft. First REAL intervention after n=4
NO-EDIT variance-floor baseline (mean=0.02609, σ=0.00406; locked at f225e1db).
Change: IntraMemoryStage now tracks an SHA1-keyed hash of each rendered
intra-card's narrative signature (summary + tried_strategies' label/verdict/notes,
excluding monotonic counters n_attempts/mean_delta/delta_distribution per
chaos-hacker CRITICAL #1). When the same hash is observed for K=3 consecutive
renders on the same parent, the next render is prepended with a
"[STAGNATION DETECTED]" header plus a child-delta archetype histogram.
Hypothesis: the K=3 stagnation header makes the parent's saturation visible
to the MutationSuggestionAgent → encourages archetype shift OR genuinely new
strategies in identical-narrative branches → reduces stagnation_interval_max
and/or trailing plateau without changing model/prompt template/problem.
Scope: lineage_memory.py (+90 lines) + 6 unit tests bypassing InputHashCache
to exercise the new code path directly.
Frozen invariants (unchanged): problem.heilbron, validator, fitness fn,
Pydantic Literal archetype enforcement, num_parents=1, max_mutants=100,
model=Qwen3-235B-A22B-Thinking-2507, prompts (all 4 SHAs unchanged).
WIN-CAND threshold: best_fitness ≥ 0.03015 (n=4 baseline mean+1σ).
STRONG-WIN: ≥ 0.03421 (mean+2σ).
Spec: docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md
PROPOSE: docs/audits/AUTO_OPTIMIZE_CYCLE_3_PROPOSE.md
RECONSTRUCTION: docs/audits/AUTO_OPTIMIZE_CYCLE_3_RECONSTRUCTION.md (skeleton; populated post-run)
ANALYTICS: docs/audits/AUTO_OPTIMIZE_CYCLE_3_ANALYTICS.md (skeleton; populated post-run)
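The K=3 streak detection described in the cycle-3 entry above might look roughly like the following sketch. The card layout (a dict with `summary` and `tried_strategies`), `narrative_signature`, and `StreakTracker` are hypothetical names chosen for illustration; the actual change lives in lineage_memory.py and may differ in every detail.

```python
import hashlib
import json
from collections import defaultdict

K = 3  # consecutive identical renders before the header is injected


def narrative_signature(card: dict) -> str:
    """SHA1 over the narrative fields only; monotonic counters are excluded."""
    payload = {
        "summary": card.get("summary", ""),
        "tried_strategies": [
            {key: s.get(key) for key in ("label", "verdict", "notes")}
            for s in card.get("tried_strategies", [])
        ],
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha1(blob).hexdigest()


class StreakTracker:
    """Tracks, per parent, how many consecutive renders share one signature."""

    def __init__(self, k: int = K) -> None:
        self.k = k
        self._last = {}
        self._streak = defaultdict(int)

    def observe(self, parent_id: str, card: dict) -> bool:
        """Record one render; return True once the streak has reached k."""
        sig = narrative_signature(card)
        if self._last.get(parent_id) == sig:
            self._streak[parent_id] += 1
        else:
            self._last[parent_id] = sig
            self._streak[parent_id] = 1
        return self._streak[parent_id] >= self.k


if __name__ == "__main__":
    tracker = StreakTracker()
    card = {"summary": "s", "tried_strategies": [], "n_attempts": 0}
    for i in range(3):
        # n_attempts changes each time but is not part of the signature.
        print(tracker.observe("parent-a", dict(card, n_attempts=i)))
```

A caller would prepend the "[STAGNATION DETECTED]" header to the next render once `observe` returns True. Any change to the narrative fields resets the streak, which fits the later observation in this log that a detector of this kind may never fire if the rendered memory keeps changing.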
* auto-loop meta cycle 3: K=3 stagnation detector — HEALTHY-NEUTRAL (mechanism never fired)
Cycle-3 INT 1 (commit 45255a3d, db=15, 1.85h run) outcome class HEALTHY-NEUTRAL.
Headline:
- best_fitness = 0.024369 (mutant cc09c637, frontier event #38 of 55 valid mints)
- Δ vs n=4 baseline mean (0.02609) = -0.00172 (|Δ| < 1σ = 0.00406, WITHIN noise band)
- §2.1 trajectory: 5/5 strict PASS (valid_rate ~0.55, frontier_new_cell=46,
right_tail_mass=0.344, advance 0.06 strict / 0.55 inclusive, stagnation_interval_max=14)
- §2.2 fitness floor: FAIL (0.024369 < 0.030 by 0.00563)
- §7 STOP threshold: NOT triggered (loop continues)
Key empirical finding: STAGNATION DETECTED header activations = 0/100.
The K=3 narrative-streak SHA1 detector never fired in 100 mutants. The bucketed
memory representation evolves enough between consecutive renders that
SHA1(narrative_signature) changes before K=3 is reached, even with num_parents=1
and many sibling renders per parent.
This vindicates the user's preference (feedback_llm_rules_over_hardcoded):
hardcoded Python predicates are too brittle to fire at this run scale.
Cycle-3 is empirically a NO-EDIT replicate of cycle-2.5 from the mutator's
perspective. Trajectory improvements (stagnation_interval_max=14 vs cycle-0's
46 / cycle-2.5's 55) are not causally attributable to the mechanism that
never fired — they are sampling variance.
Dual-axis verification per project_fat_context_direction.md (§1):
- signature: STAGNATION activations = 0 → ✗
- metric of interest: stagnation_interval_max = 14 (< 46 cycle-0) → ✓
- quadrant ✗✓ → "noise/lucky; replicate before claiming win" → cannot ship as a
trajectory-shape win because the mechanism never fired
Archetype distribution shifted dramatically from cycle-0 (informational only):
- cycle-0: Guided Innovation 25%, Computational Reinvention 21%
- cycle-3: Guided Innovation 53% (mode-collapse), Computational Reinvention 2%
ARCHETYPE-EFFICIENCY MISMATCH persists: highest hit-rate archetypes
(Computational Reinvention 100% n=2, Harmful Pattern Removal 100% n=1,
Precision Optimization 66.7% n=6) are under-sampled, while highest-pick
archetype (Guided Innovation 53%) has below-average hit-rate (35.8%).
Decision: HEALTHY-NEUTRAL. Next: cycle-4 PROPOSE per fat-context methodology
will target a measurable failure mode (candidate: pick-rate-vs-hit-rate
mismatch) with LLM-side fat context, NOT hardcoded Python predicates.
Spec: docs/audits/AUTO_OPTIMIZE_LOOP_TASK_2026-05-19.md
RECONSTRUCTION: docs/audits/AUTO_OPTIMIZE_CYCLE_3_RECONSTRUCTION.md (FINAL)
ANALYTICS: docs/audits/AUTO_OPTIMIZE_CYCLE_3_ANALYTICS.md (FINAL)
HISTORY: docs/audits/AUTO_OPTIMIZE_CYCLE_HISTORY.md (row appended)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* auto-loop cycle 4: surface per-archetype yield in evolutionary statistics block
INT 2 (cycle-4) — FIRST LLM-side fat-context intervention. Per cycle-3 closeout
(commit 08cbd5b2, HEALTHY-NEUTRAL, K=3 narrative-streak detector NEVER FIRED in
100 mutants), the user verbal directive on 2026-05-19 expanded the loop scope
to the entire mutation-context harness (feedback_mutation_context_harness_in_scope)
and re-affirmed fat-informative-context over hardcoded Python predicates
(project_fat_context_direction).
Change: per-archetype yield (picks, valid_hits, hit_rate, mean_delta_to_parent)
aggregated over the whole-run population in EvolutionaryStatisticsCollector,
attached to the EvolutionaryStatistics StageIO, and rendered as a markdown
table inside the existing "## Evolutionary Statistics" block of the mutation
suggester's prompt (gigaevo/prompts/mutation_suggestions/system.txt also
extended with PRIORITY-reshape guidance, NOT invention).
Three additive surface touches, all in scope per §3.1 of LOOP_TASK +
feedback_mutation_context_harness_in_scope:
- gigaevo/programs/stages/collector.py: new _compute_archetype_yield()
helper + archetype_yield field on EvolutionaryStatistics + cache wiring
in _ensure_population_cache. Also flips EvolutionaryStatisticsCollector
._EXCLUDE from EXCLUDE_FOR_ANALYTICS (strips metadata) to
EXCLUDE_STAGE_RESULTS (keeps metadata) — REQUIRED for the helper to read
program.metadata[MutationSpec.META_OUTPUT]["archetype"]. The metadata
cost is bounded by N=100 programs per snapshot, well under 1% of cycle
wall-time. Other collectors continue to exclude metadata.
- gigaevo/evolution/mutation/context.py: extended
EvolutionaryStatisticsMutationContext.format() to render the yield table
when total picks >= 5 (suppresses bootstrap noise). Sorted by hit_rate
desc, picks desc tie-break.
- gigaevo/prompts/mutation_suggestions/system.txt: extended the existing
"Evolutionary Statistics" bullet with explicit guidance on how to read
the new table (UNDER-UTILIZED vs OVER-RELIED-ON cells). Reaffirms
PRIORITY-reshape, NOT invention.
Hypothesis: surfacing per-archetype yield (cycle-3 ANALYTICS Signal #1:
Computational Reinvention 100% hit-rate at 2% pick-share; Guided Innovation
35.8% hit-rate at 53% pick-share) to the suggester lets the LLM reshape
priority toward higher-yield archetypes. Cycle-4 takes the OPPOSITE design
to cycle-3: NO Python threshold, NO if-streak-K predicate, NO header
inject. The LLM decides.
CRITICIZE-pre v1 returned REVISE with one CRITICAL + two HIGH findings,
all mitigated:
- CRITICAL: PROPOSE v1 specified wrong metadata key (metadata["mutation"]
vs canonical MutationSpec.META_OUTPUT = "mutation_output"). Fixed.
- HIGH: TDD fixtures echoed wrong key. Fixed + new test #7 regression-guards
the dead key (test_rejects_dead_metadata_key).
- HIGH: integration smoke step #4 only checked header presence. Tightened to
require >=1 canonical-named row + (other) share <= 20%.
8 RED tests in tests/stages/test_archetype_yield.py cover: empty population,
per-archetype aggregation with delta-to-parent, canonical ordering with zero
picks, attachment to EvolutionaryStatistics, format() rendering (sorted),
threshold suppression, defensive bucketing of unknown archetypes + missing
mutation_output, regression guard against the dead "mutation" key. All 8
pass post-GREEN. Adjacent suites (tests/stages/test_collector.py +
tests/stages/test_mutation_context.py) pass 85/85 — _EXCLUDE flip has no
regressions.
Scope discipline: NO change to gigaevo/llm/agents/mutation.py (archetype
Literal preserved), NO change to gigaevo/prompts/mutation/system.txt (the
SLIGHTLY rule does not apply), NO change to problems/heilbron/, num_parents,
max_mutants, model_name, llm_base_url. No Heilbron-specific anything in any
touched file. The 8 canonical archetype names are loaded from
gigaevo/evolution/mutation/constants.py — problem-agnostic.
Frozen invariants (unchanged): problem.heilbron, validator, fitness fn,
Pydantic Literal archetype enforcement, num_parents=1, max_mutants=100,
model=Qwen3-235B-A22B-Thinking-2507.
WIN-CAND threshold: best_fitness >= 0.03015 (n=4 baseline mean+1sigma).
STRONG-WIN: >= 0.03421 (mean+2sigma). Cycle-4 prediction at n=1: best
0.0275 +/- 0.005 (~30% chance of WIN-CAND given baseline variance).
Riskiest link: the suggester must actually read the yield table and
reshape priority. Dual-axis verification (PROPOSE §11) detects ignore
vs. follow via (DIAGNOSTIC) archetype-efficiency CV halving signature
+ (PRIMARY) metric of interest (best_fitness, trajectory gates).
Outcome quadrants per project_fat_context_direction 4-quadrant matrix.
Followup captured (NOT in this commit, future cycle): lineage_memory.py:711
reads metadata.get("mutation", {}) — the dead key. TransitionAnalysis archetype
field always None as a result.
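The per-archetype yield aggregation and table rendering described in the cycle-4 entry above can be sketched as follows. This is an illustration under stated assumptions, not the `_compute_archetype_yield()` helper itself: the record shape `(archetype, is_valid_hit, delta_to_parent)`, the `YieldRow` class, and the exact column formatting are hypothetical. Only the four reported fields, the `picks >= 5` suppression threshold, and the sort order (hit_rate descending, picks descending as tie-break) come from the commit message.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

Record = Tuple[str, bool, Optional[float]]  # hypothetical input shape


@dataclass
class YieldRow:
    archetype: str
    picks: int = 0
    valid_hits: int = 0
    delta_sum: float = 0.0
    delta_count: int = 0

    @property
    def hit_rate(self) -> float:
        return self.valid_hits / self.picks if self.picks else 0.0

    @property
    def mean_delta(self) -> Optional[float]:
        return self.delta_sum / self.delta_count if self.delta_count else None


def compute_yield(records: Iterable[Record]) -> List[YieldRow]:
    """Aggregate per archetype, then sort by hit_rate desc, picks desc."""
    rows = {}
    for archetype, is_hit, delta in records:
        row = rows.setdefault(archetype, YieldRow(archetype))
        row.picks += 1
        row.valid_hits += int(is_hit)
        if delta is not None:
            row.delta_sum += delta
            row.delta_count += 1
    return sorted(rows.values(), key=lambda r: (-r.hit_rate, -r.picks))


def render_table(rows: List[YieldRow], min_total_picks: int = 5) -> str:
    """Render a markdown table; return '' below the pick threshold."""
    if sum(r.picks for r in rows) < min_total_picks:
        return ""  # suppress bootstrap noise
    lines = [
        "| archetype | picks | valid_hits | hit_rate | mean_delta_to_parent |",
        "|---|---|---|---|---|",
    ]
    for r in rows:
        delta = "n/a" if r.mean_delta is None else f"{r.mean_delta:+.5f}"
        lines.append(
            f"| {r.archetype} | {r.picks} | {r.valid_hits} "
            f"| {r.hit_rate:.1%} | {delta} |"
        )
    return "\n".join(lines)
```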
* Revert "auto-loop cycle 4: surface per-archetype yield in evolutionary statistics block"
This reverts commit e6cfe6eed30fc81da752af950a4a32a45b2352f4.
* auto-loop meta cycle 4: archetype-yield prompt bloat — LOSE-REVERT
Cycle-4 INT 2 (commit e6cfe6ee, db=12, 2.16h run) outcome class LOSE-REVERT.
Reverted by 69a5a708 per feedback_auto_optimize_branch_policy (no reset).
Headline:
- best_fitness = 0.01686 (run high-water mark; SEED-level, no post-seed mints)
- Δ vs n=4 baseline mean (0.02609) = -0.00923 = -2.27σ → INSIDE LOSE band (baseline-2σ=0.01797)
- 0/37 ACCEPTED…
The AST-level `_EvalCleaner` (`add_tuned_comment=False` path) checks `isinstance(arg, (Constant, List, Tuple, Set, Dict))` to decide whether an `eval(...)` call wraps a literal that can be inlined. This misses `UnaryOp(USub, Constant)`, so `eval(-5)` survives in the AST path while the source-level `_clean_eval_in_source` strips it correctly via `ast.literal_eval`. The two desubstitution modes silently produced different output for the same input.

Reproducer against current main:

Fix: in `_EvalCleaner.visit_Call`, replace the isinstance check with `ast.literal_eval(arg)` as a try/except predicate so both paths share the same literal-detection logic. Verified across 12 sample inputs: the AST and source paths now produce identical output. Also broaden the `(ValueError, SyntaxError)` catches around `literal_eval` in `_coerce_param_value` and `_clean_eval_in_source` to include `MemoryError`, `RecursionError`, and `TypeError`.
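
The difference between the two predicates can be shown with a small standalone sketch. It does not touch `_EvalCleaner`; the helper names `is_literal_by_isinstance`, `is_literal_by_literal_eval`, and `eval_arg` are hypothetical and exist only to compare the old node-type check with a `literal_eval`-based check on the argument of an `eval(...)` call.

```python
import ast

# Node types accepted by the original isinstance-based check.
LITERAL_NODES = (ast.Constant, ast.List, ast.Tuple, ast.Set, ast.Dict)

# Exceptions named in the fix description above.
LITERAL_EVAL_ERRORS = (
    ValueError, SyntaxError, TypeError, MemoryError, RecursionError,
)


def is_literal_by_isinstance(arg: ast.AST) -> bool:
    """Old-style predicate: looks only at the top-level node type."""
    return isinstance(arg, LITERAL_NODES)


def is_literal_by_literal_eval(arg: ast.AST) -> bool:
    """New-style predicate: the node is a literal if literal_eval accepts it."""
    try:
        ast.literal_eval(arg)
    except LITERAL_EVAL_ERRORS:
        return False
    return True


def eval_arg(source: str) -> ast.AST:
    """Return the first argument node of a call expression like 'eval(-5)'."""
    call = ast.parse(source, mode="eval").body
    assert isinstance(call, ast.Call)
    return call.args[0]


if __name__ == "__main__":
    for src in ("eval(5)", "eval(-5)", "eval([1, 2])", "eval([x])", "eval(x)"):
        arg = eval_arg(src)
        print(
            f"{src:14} isinstance={is_literal_by_isinstance(arg)!s:5} "
            f"literal_eval={is_literal_by_literal_eval(arg)}"
        )
```

In this sketch `eval(-5)` is rejected by the node-type check but accepted by `literal_eval`, which is the mismatch described above. The two predicates also disagree in the other direction on `eval([x])`: the node-type check accepts any `List` node, while `literal_eval` rejects a list that contains a non-literal name.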