fix(codex): correct provider switching contamination, base_url resolution, and official routing#297
Conversation
|
@louisneal heads up, main changed a lot in this area. I pulled out the add-only live-config merge entirely. Switching now does a clean write of the selected provider instead of merging into the existing config.toml, so a few of your five fixes land differently and this'll need a rebase:
If you can rebase onto current main and trim it to what still repros (looks like #5 plus the base_url helper, maybe #3) I'll get it reviewed. Sorry for the churn, the switch rework overlapped heavily with your files. |
…talog on backfill After main's clean-write rework only two of the originally reported issues still reproduce; this trims the fix set accordingly. Kept: - Failover snapshot routing contamination. build_failover_live_snapshot merges the user's live config into each failover target's snapshot with a None base (2-way add-only). The live config carries the active provider's routing, so it leaks into other providers' snapshots: a failover to the official provider keeps the previous third-party model_provider and routes to the old endpoint, and (with preserve_codex_official_auth_on_switch) the third-party experimental_bearer_token leaks into another provider's snapshot. Strip provider-owned fields from the live backup before the merge so each target only inherits user customizations. - modelCatalog wipe on backfill. backfill_codex_current read config.toml text directly and stored it without the modelCatalog field, so switching away from a provider dropped its catalog; switching back then stripped model_catalog_json from config.toml, disabling the catalog. backfill now reads the live catalog via read_codex_model_catalog_simplified_from_live (the same source read_codex_live_settings_with_model_catalog uses) so it round-trips. Dropped (fixed by main's clean-write rework): base_url first-match (config.toml now carries only the active provider and extract_codex_base_url_from_toml follows model_provider), and snapshot cross-contamination / switching-to-official routing (the add-only merge no longer accumulates [model_providers] sections). Tests: regression tests for failover to third-party, to official, bearer-token leak, and modelCatalog round-trip, plus strip_codex_provider_routing unit tests.
82b9b2f to
b7649f5
Compare
|
@SaladDay Thanks for the detailed triage — and sorry about the overlap with the switch rework. Rebased onto current Dropped (fixed by main's clean-write rework)
Kept: #5 failover routing contaminationApplied as-is at #3 modelCatalog wipe — double-checked as you asked, and it still reproduces
Repro test Verification
Happy to split #3 into its own commit/PR if you'd prefer to land #5 first. |
Summary
Switching between Codex providers (with and without the local proxy / takeover) produced several correctness bugs. The common root cause is that live-config merges are add-only, so provider routing accumulates and leaks back into stored snapshots. This PR fixes five related issues, all with unit + regression tests.
cargo test --libpasses (2726 tests),cargo fmt/clippyclean.Problems & fixes
1.
base_urlresolves to the wrong provider (display / stream-check / proxy / usage)The active provider's URL was extracted by returning the first
base_urlline inconfig.toml. After a switch, every provider's live config accumulates multiple[model_providers]sections, so every provider displayed the same (first) URL.extract_codex_active_base_url()now resolves the top-levelmodel_provider→
[model_providers.<id>].base_url. The OpenAI-official case (nomodel_provider) returnsNone. Replaces naive first-match in 5 call sites (tui, provider-quota, stream-check, usage script, proxy codex adapter).2. Stored snapshots get cross-contaminated
The add-only merge accumulates
[model_providers]sections in live; the live→snapshot backfills (refresh_provider_snapshot/backfill_codex_current/capture_codex_temp_launch_snapshot) wrote that unioned config back into each provider's snapshot, so every provider inherited the others' sections.strip_foreign_codex_model_providers()keeps only the active provider's section (all of them, for official) before persisting a snapshot.3.
modelCatalogis wiped when switching away from a providerbackfill_codex_current/capture_codex_temp_launch_snapshotrebuilt settings fromconfig.toml(which carries nomodelCatalog), wiping the outgoing provider's stored model list — wrong catalog next time it was switched to.modelCatalog(and bearer token) from the existing snapshot, mirroringrefresh_provider_snapshot.4. Switching to OpenAI-official (or any provider lacking routing fields) keeps routing to the old provider
The add-only merge kept the previous provider's
model_provider/[model_providers]/experimental_bearer_tokenin live, so Codex kept hitting the old provider after a switch to official.strip_codex_provider_routing_for_switch()removes provider-owned routing from the local live config before the merge; the incoming provider re-adds exactly what it needs (official adds none). User customizations (projects, features, mcp_servers, plugins, marketplaces) are preserved.5. Proxy hot-switch (takeover) failover snapshots get contaminated
build_failover_live_snapshotmerged the captured live backup (carrying the then-active provider's routing) with every provider's snapshot, baking e.g. GLM routing into the official provider's failover snapshot.strip_codex_routing_in_settings()strips provider routing from the backup base so each failover snapshot reflects only its own provider.Relationship to #295
#295 added a 3-way merge (previous-provider snapshot as base) to remove stale provider-owned keys on hot-switch. This PR is complementary: #295 passes
Nonefor that base insidebuild_failover_live_snapshot(still 2-way / add-only), which is exactly where fix #5 applies. Fixes 1–4 are outside #295's scope (URL extraction, snapshot contamination on the normal switch path, modelCatalog wipe, official-provider routing).Verification
Reproduced all five against the release binary, then confirmed fixed with the patched binary (proxy on/off, round-trip switches between a third-party pair and the official provider).
cargo test --lib: 2726 passed.