Skip to content

fix(codex): correct provider switching contamination, base_url resolution, and official routing#297

Open
louisneal wants to merge 1 commit into
SaladDay:mainfrom
louisneal:fix/codex-provider-switching
Open

fix(codex): correct provider switching contamination, base_url resolution, and official routing#297
louisneal wants to merge 1 commit into
SaladDay:mainfrom
louisneal:fix/codex-provider-switching

Conversation

@louisneal

@louisneal louisneal commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Switching between Codex providers (with and without the local proxy / takeover) produced several correctness bugs. The common root cause is that live-config merges are add-only, so provider routing accumulates and leaks back into stored snapshots. This PR fixes five related issues, all with unit + regression tests. cargo test --lib passes (2726 tests), cargo fmt/clippy clean.

Problems & fixes

1. base_url resolves to the wrong provider (display / stream-check / proxy / usage)
The active provider's URL was extracted by returning the first base_url line in config.toml. After a switch, every provider's live config accumulates multiple [model_providers] sections, so every provider displayed the same (first) URL.

  • extract_codex_active_base_url() now resolves the top-level model_provider
    [model_providers.<id>].base_url. The OpenAI-official case (no model_provider) returns None. Replaces naive first-match in 5 call sites (tui, provider-quota, stream-check, usage script, proxy codex adapter).

2. Stored snapshots get cross-contaminated
The add-only merge accumulates [model_providers] sections in live; the live→snapshot backfills (refresh_provider_snapshot / backfill_codex_current / capture_codex_temp_launch_snapshot) wrote that unioned config back into each provider's snapshot, so every provider inherited the others' sections.

  • strip_foreign_codex_model_providers() keeps only the active provider's section (all of them, for official) before persisting a snapshot.

3. modelCatalog is wiped when switching away from a provider
backfill_codex_current / capture_codex_temp_launch_snapshot rebuilt settings from config.toml (which carries no modelCatalog), wiping the outgoing provider's stored model list — wrong catalog next time it was switched to.

  • Both now restore the per-provider modelCatalog (and bearer token) from the existing snapshot, mirroring refresh_provider_snapshot.

4. Switching to OpenAI-official (or any provider lacking routing fields) keeps routing to the old provider
The add-only merge kept the previous provider's model_provider / [model_providers] / experimental_bearer_token in live, so Codex kept hitting the old provider after a switch to official.

  • strip_codex_provider_routing_for_switch() removes provider-owned routing from the local live config before the merge; the incoming provider re-adds exactly what it needs (official adds none). User customizations (projects, features, mcp_servers, plugins, marketplaces) are preserved.

5. Proxy hot-switch (takeover) failover snapshots get contaminated
build_failover_live_snapshot merged the captured live backup (carrying the then-active provider's routing) with every provider's snapshot, baking e.g. GLM routing into the official provider's failover snapshot.

  • strip_codex_routing_in_settings() strips provider routing from the backup base so each failover snapshot reflects only its own provider.

Relationship to #295

#295 added a 3-way merge (previous-provider snapshot as base) to remove stale provider-owned keys on hot-switch. This PR is complementary: #295 passes None for that base inside build_failover_live_snapshot (still 2-way / add-only), which is exactly where fix #5 applies. Fixes 1–4 are outside #295's scope (URL extraction, snapshot contamination on the normal switch path, modelCatalog wipe, official-provider routing).

Verification

Reproduced all five against the release binary, then confirmed fixed with the patched binary (proxy on/off, round-trip switches between a third-party pair and the official provider). cargo test --lib: 2726 passed.

@SaladDay

Copy link
Copy Markdown
Owner

@louisneal heads up, main changed a lot in this area. I pulled out the add-only live-config merge entirely. Switching now does a clean write of the selected provider instead of merging into the existing config.toml, so a few of your five fixes land differently and this'll need a rebase:

  • 考虑拓展修改供应商配置的方式 #2 (snapshot cross-contamination) and codex配置问题 #4 (switching to official still routes to the old provider) were both driven by the add-only merge accumulating [model_providers] sections. With the clean write that accumulation doesn't happen, so those two should mostly fall out on their own. You can probably drop them.
  • MCP 服务器同步不生效 #5 (failover snapshot routing contamination) is still there. build_failover_live_snapshot doesn't strip routing on main, so this one applies as-is.
  • Fix clippy warnings and remove dead code patterns #1 (base_url first-match) the wrong-URL symptom is mostly gone now that config.toml only carries the active provider, but extract_codex_active_base_url is still a more robust way to resolve it than grabbing the first match, so I'd keep that part.
  • codex配置问题 #3 (modelCatalog wipe) main's backfill already goes through read_codex_live_settings_with_model_catalog, so please double check whether this still reproduces after the rebase.

If you can rebase onto current main and trim it to what still repros (looks like #5 plus the base_url helper, maybe #3) I'll get it reviewed. Sorry for the churn, the switch rework overlapped heavily with your files.

…talog on backfill

After main's clean-write rework only two of the originally reported issues

still reproduce; this trims the fix set accordingly.

Kept:

- Failover snapshot routing contamination. build_failover_live_snapshot merges
  the user's live config into each failover target's snapshot with a None base
  (2-way add-only). The live config carries the active provider's routing, so it
  leaks into other providers' snapshots: a failover to the official provider
  keeps the previous third-party model_provider and routes to the old endpoint,
  and (with preserve_codex_official_auth_on_switch) the third-party
  experimental_bearer_token leaks into another provider's snapshot. Strip
  provider-owned fields from the live backup before the merge so each target
  only inherits user customizations.

- modelCatalog wipe on backfill. backfill_codex_current read config.toml text
  directly and stored it without the modelCatalog field, so switching away from
  a provider dropped its catalog; switching back then stripped model_catalog_json
  from config.toml, disabling the catalog. backfill now reads the live catalog
  via read_codex_model_catalog_simplified_from_live (the same source
  read_codex_live_settings_with_model_catalog uses) so it round-trips.

Dropped (fixed by main's clean-write rework): base_url first-match (config.toml
now carries only the active provider and extract_codex_base_url_from_toml follows
model_provider), and snapshot cross-contamination / switching-to-official routing
(the add-only merge no longer accumulates [model_providers] sections).

Tests: regression tests for failover to third-party, to official, bearer-token
leak, and modelCatalog round-trip, plus strip_codex_provider_routing unit tests.
@louisneal louisneal force-pushed the fix/codex-provider-switching branch from 82b9b2f to b7649f5 Compare June 25, 2026 14:41
@louisneal

louisneal commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

@SaladDay Thanks for the detailed triage — and sorry about the overlap with the switch rework. Rebased onto current main and trimmed to a single commit with only what still reproduces.

Dropped (fixed by main's clean-write rework)

Kept: #5 failover routing contamination

Applied as-is at build_failover_live_snapshot: provider-owned fields are stripped from the live backup before the 2-way merge, so each target only inherits user customizations. Adapted to the new unconditional prefer-incoming merge (the ConflictResolution threading is gone). Regression tests cover failover to a third-party target, to the official target (the real wrong-routing case — the official snapshot kept the previous model_provider), and a bearer-token leak guard for preserve_codex_official_auth_on_switch.

#3 modelCatalog wipe — double-checked as you asked, and it still reproduces

backfill_codex_current (services/provider/codex.rs) reads config.toml via std::fs::read_to_string and stores it without the modelCatalog field. read_codex_live_settings_with_model_catalog is only reached from import_default_config and read_live_settings — not the switch backfill. So switch A→B→A drops model_catalog_json from config.toml on the way back: the backfilled A snapshot has no modelCatalog, codex_catalog_model_specs returns empty, and set_codex_model_catalog_json_field(_, None) strips the reference; the catalog file is left as a stale orphan and codex stops loading any catalog.

Repro test codex_model_catalog_survives_round_trip_switch fails on main, passes with the fix. Fix: backfill reads the live catalog via read_codex_model_catalog_simplified_from_live (the same source read_codex_live_settings_with_model_catalog uses) so it round-trips.

Verification

cargo fmt --check clean; cargo clippy clean apart from pre-existing warnings; cargo test3193 passed, 0 failed.

Happy to split #3 into its own commit/PR if you'd prefer to land #5 first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MCP 服务器同步不生效

2 participants