team mode: wire team prompt + env into the three Python-loop adapters #55

Merged: ProKil merged 19 commits into main from team-all-adapters on May 21, 2026

@ProKil ProKil commented May 17, 2026

Summary

Brings every adapter (claude_code, codex, mini_swe_agent_v2, swe_agent, openhands_sdk) to full team-mode parity — each now runs end-to-end in team mode, including the openhands_sdk variant whose Modal-isolated agent-server required a custom Redis tunnel + tool-registry override to actually function.

Stacks on #52 (which itself stacks on #51).

Backend support matrix (after this PR)

| Adapter | Docker | Modal | Notes |
|---|---|---|---|
| mini_swe_agent_v2 | ✓ | ✓ | both verified end-to-end; backend dispatch in the adapter |
| claude_code | ✓ | ✓ | backend now threaded through _coop/runtime.build_environment |
| codex | ✓ | ✓ | Modal stdin hang fixed (this PR); verified solo on dottxt 1/1 in 1m 48s |
| swe_agent | ✓ | ✓ | docker via swerex.DockerDeploymentConfig (with entrypoint-clear + pipx-spec patch, both in this PR); verified solo on dottxt: Modal 1/1 in 3m 12s, Docker 1/1 in 2m 53s |
| openhands_sdk | (n/a) | ✓ | Modal-only by design; eval now refuses Docker with a clear warning |

What landed (in order of commits)

  1. Per-adapter team-prompt wiring + env-var propagation. mini_swe_agent_v2, swe_agent, openhands_sdk now append team_task_section to the task, propagate CB_TEAM_* into their containers, and (where they manage docker) mount the team scratchpad volume.
  2. CoopTaskTracker typed tool for openhands — Redis-backed drop-in for openhands' built-in TaskTrackerTool, registered under the SAME name so the registry override is transparent to the agent. Required because gpt-5.5 strongly prefers typed tools over shell commands, even when the prompt tells it otherwise.
  3. Codex coordination fixes: a "## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING" prompt rewrite that frames merging as the explicit final step, plus a test_merged short-circuit that copies one patch to merged.patch when both agents submit byte-identical merged trees.
  4. Modal-hosted Redis for openhands: runner/team.py detects agent_name == "openhands_sdk" and spins up a Modal sandbox running redis-server on unencrypted_ports=[6379], exposed via unencrypted_host:unencrypted_port. Both the host TaskListClient and the agent-server's CoopTaskTracker point at the same public TCP endpoint.
  5. CoopTaskTracker injection into the Modal sandbox: add_local_file the tool definition + a pre-rendered replacement __init__.py (no shell-heredoc fragility) + .pyc cache wipe so the registration override actually takes effect.
  6. Harvest-time fresh Redis client — TCP tunnels drop idle connections after several minutes; re-open at harvest time.
  7. fakeredis dev dependency — was undeclared, causing CI ImportErrors on team-mode test files.
  8. swe_agent import fix (cooperbench.agents.mini_swe_agent → mini_swe_agent_v2) + missing transitive deps (numpy, boto3, docker) in swe-agent extras. This was a pre-existing bug from v0.0.13's rename: every swe_agent invocation errored before any LLM call.
  9. Core-subset bug fixes + new eval policy — see "Follow-up validation" below.
  10. Modal backend fixes — codex stdin hang + openhands docker-eval guardrail (this PR's most recent commit).
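
The codex stdin hang in item 10 comes down to a child process waiting for EOF on a stdin that never closes. A minimal Python sketch of the same idea as a `</dev/null` redirect, using a stand-in child process rather than the real codex invocation (the helper name `run_without_stdin` is hypothetical and not part of this PR):

```python
import subprocess
import sys


def run_without_stdin(cmd: list[str], timeout: float = 30.0) -> subprocess.CompletedProcess:
    """Run cmd with stdin bound to the null device so any read hits EOF at once."""
    return subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # same effect as `</dev/null` in a shell
        capture_output=True,
        text=True,
        timeout=timeout,
    )


# Stand-in for a CLI that reads "additional input" from stdin before exiting.
child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
result = run_without_stdin(child)
print(result.stdout.strip())
```

Without the null-device binding, the same child would block for as long as the parent keeps its stdin pipe open, which matches the Modal behaviour described in this PR.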

Per-adapter wiring + verified result

| Adapter | Team prompt | CLI in container | CB_TEAM_* env | Typed task tracker | E2E pass rate |
|---|---|---|---|---|---|
| claude_code | full | ✓ | ✓ | n/a | 2/2 ✓ (variable per-run) |
| codex | full | ✓ | ✓ | n/a | 2/2 ✓ (after prompt + eval fixes); Modal smoke 1/1 ✓ |
| mini_swe_agent_v2 | section | ✓ | ✓ (via env_kwargs["env"]) | n/a (in-loop TeamPoller) | 2/2 ✓ |
| openhands_sdk | section + system-prompt block | ✓ (Modal image layered) | ✓ (via coop_info["team_env"]) | ✓ (CoopTaskTracker overrides upstream TaskTracker) | 2/2 ✓ (via Modal-hosted Redis) |
| swe_agent | section | not yet | not yet | n/a | Modal smoke 1/1 ✓; Docker smoke 1/1 ✓; full team-tool integration is a follow-up |

Follow-up validation: 10-pair core-subset horizontal comparison

Took the team wiring above to a real workload (the new dataset/subsets/core.json subset) and discovered five compounding bugs that prevented anything other than openhands from reaching honest pass-rates. Fixed them and then re-thought the team-mode eval policy.

Final results (10-pair core subset, team setting)

Eval policy: identical → naive merge → lead's patch alone. Union merge and member-only fallback were intentionally dropped — they reward lucky non-overlap or partial coordination rather than genuine team integration. Details in docs/BENCHMARK_RESULTS.md.

| Agent | Pass | Cost | Wall | Notes |
|---|---|---|---|---|
| mini_swe_agent_v2 | 6 / 10 | $13.37 | 24m | 5 naive + 1 lead-alone |
| claude_code | 5 / 10 | ~$8.5 | 21m | 2 naive + 3 lead-alone |
| codex | 5 / 10 | $0* | 21m | 2 naive + 3 lead-alone |
| openhands_sdk | 4 / 10 | $31.90 | 16m | 0 naive + 4 lead-alone; every oh task hits naive-conflict (patch.txt committed mid-run); only 4/10 have a lead-alone patch that passes both features |

* gpt-5.5 is not in the local pricing table; codex did real work (400k+ input tokens per agent).

Three of the four agents reach ≥ 5/10 under the strict policy. oh at 4/10 is the right number: its union-merge passes under the older lenient eval were partly false positives. Because oh commits patch.txt mid-run, every merge hit a patch.txt conflict that union resolved trivially, while the actual source-code merge was non-conflicting. Under the stricter policy those cases route to lead-alone, and oh's lead doesn't always integrate.

Bugs fixed (all in the unreleased CHANGELOG entry)

  • codex exec hung in Modal sandbox — codex's exec mode blocks reading "additional input from stdin"; Modal sandbox keeps stdin open while Docker non-tty docker exec gives EOF for free. Fix: </dev/null on the codex invocation. Smoke-verified solo on dottxt: 1/1 in 1m 48s.
  • openhands_sdk eval guardrail — eval refuses to run a non-modal backend against an openhands_sdk-produced run with a clear warning, because openhands relies on the Modal-hosted Redis tunnel and commits patch.txt into the working tree (Docker eval silently changes the test environment).
  • normalize_patch was using text.strip(), eating trailing blank context lines (" \n") from valid git diff output and breaking last-hunk line counts so git apply rejected them.
  • mini_swe_agent_v2 adapter wasn't routing patches through normalize_patch at all — raw .strip(), same underlying issue, one layer deeper.
  • mini_swe_agent_v2 ModalEnvironment created the sandbox without a long-running command, so the image's default CMD exited and every exec() hit "Sandbox not found". Now passes "sleep", "infinity" (matches the eval backend's existing fix).
  • claude_code and codex adapters silently ignored --backend modal — shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters.
  • Team-lead prompt buried the integration step at the bottom of the workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist; member prompt now opens with "stay in your lane" per the lead's PLAN.md.
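
The normalize_patch bug above is easy to reproduce in isolation. The sketch below is an assumption about what a safe normalizer could look like (the real function lives in `_coop/runtime.py` and is not shown in this PR); it trims only newline characters from the end, so a trailing blank context line survives:

```python
def normalize_patch(text: str) -> str:
    """Return the patch with no leading whitespace and exactly one trailing newline.

    Only newline characters are trimmed from the end, so a final blank
    context line (a single space followed by a newline) is preserved.
    """
    body = text.lstrip().rstrip("\r\n")
    if not body:
        return ""
    return body + "\n"


# The last hunk ends with a blank context line: a lone space, then a newline.
patch = "\n--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,3 @@\n x = 1\n+y = 2\n \n\n"

stripped = patch.strip()            # old behaviour: drops the blank context line
normalized = normalize_patch(patch)  # keeps it, so the hunk line counts still match
```

With plain `.strip()` the hunk header still claims two old lines and three new ones, but the body is one line short, which is the kind of mismatch `git apply` rejects.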

Eval policy change

test_merged now uses identical → naive → lead-alone-when-naive-conflicts. Previous chain was identical → naive → union → solo-fallback (lead-or-member). Union merge concatenates conflicting hunks (usually broken code; rewards lucky non-overlap rather than coordination); member-only fallback is incoherent in team mode (the lead is the designated integrator). When naive conflicts, the lead's patch.txt must pass both feature suites alone. Surfaced as merge.strategy = "solo-agent1" in eval.json.
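
The fallback chain above can be sketched as a small decision function. This is illustrative only, not the actual test_merged code: the function name, the `naive_merge` callable, and the "naive" label are assumptions, while "identical" and "solo-agent1" are the labels named in this PR.

```python
from typing import Callable, Optional


def choose_merge_strategy(
    lead_patch: str,
    member_patch: str,
    naive_merge: Callable[[str, str], Optional[str]],
) -> tuple[str, str]:
    """Return (strategy, patch_to_test) following identical -> naive -> lead-alone."""
    if lead_patch == member_patch:
        return "identical", lead_patch
    merged = naive_merge(lead_patch, member_patch)
    if merged is not None:
        return "naive", merged
    # Naive merge conflicted: only the lead's patch is tested, against both suites.
    # There is deliberately no union merge and no member-only fallback.
    return "solo-agent1", lead_patch


def toy_naive_merge(a: str, b: str) -> Optional[str]:
    """Stand-in merge: pretend two patches conflict if they share any line."""
    if set(a.splitlines()) & set(b.splitlines()):
        return None
    return a + b


print(choose_merge_strategy("a\nc\n", "b\nc\n", toy_naive_merge)[0])
```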

Added

  • dataset/subsets/core.json (+ scripts/generate_core_subset.py) — 10-pair stratified core subset for quick agent comparisons.
  • docs/BENCHMARK_RESULTS.md — the horizontal comparison with per-task matrix and rerun narrative.

Tests

16 new unit tests + 1 prompt regression (full suite: 329 passed, 63 skipped):

  • 3 prompt tests for team_task_section vs build_team_instruction consistency
  • 10 cross-adapter compatibility tests (every runner accepts team kwargs; openhands env shape; image-layering call counts)
  • 1 swe_agent signature check
  • 1 openhands image-layering (3 add_local_file + 2 run_commands + .pyc wipe)
  • 1 CoopTaskTracker registry-override regression test
  • 1 eval short-circuit regression (identical patches)

Ruff / format / mypy all green.

Real follow-ups

  • swe_agent in-container CLI install + scratchpad mount + auto-refresh — once this PR lands, swe_agent's sandbox layer needs the same treatment mini_swe_agent_v2 got for full team primitives. Today it only sees the team prompt section + env vars; the actual coop-task-* CLI isn't installed in its sandbox.
  • In-loop task-refresh for openhands_sdk — works without this thanks to the typed CoopTaskTracker tool, but a push-style refresh hook would close the agency gap further.
  • openhands patch.txt workflow — oh agents commit their patch.txt into the working tree mid-run, which forces every team-mode merge into a patch.txt conflict. Not strictly a bug (the merge falls back to lead-alone correctly) but it's noise in the eval logs.

Test plan

  • ruff check, ruff format --check, mypy, pytest tests/ (all green locally — 329 passed)
  • With OPENAI_API_KEY or ANTHROPIC_API_KEY exported, run uv run cooperbench run -a <adapter> -m <model> -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend docker
  • Modal-backend smoke: uv run cooperbench run -a codex -m gpt-5.5 -r dottxt_ai_outlines_task -t 1655 -f 1,3 --setting solo --backend modal — should finish in under 5 min (verifies the stdin fix)
  • openhands docker-eval guardrail: uv run cooperbench eval -n <oh_run> --backend docker — should refuse with a warning before doing any work
  • For openhands_sdk: confirm redis ready redis://r...modal.host:... line appears and metrics dict in result.json is populated (non-empty claims_per_agent)
  • For swe_agent: confirm install completes with '.[swe-agent]' and no numpy / boto3 / docker ImportError surfaces
  • Optionally: uv run cooperbench run -a <adapter> -m <model> -s core --setting team --backend docker -c 3 to reproduce the core-subset results

🤖 Generated with Claude Code

Ubuntu and others added 17 commits May 16, 2026 20:42
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter.  Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:

  - Invokes ``codex exec --json --sandbox danger-full-access
    --skip-git-repo-check --model <id>``.
  - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
    inside the container so the CLI authenticates without prompts.
  - Parses Codex's JSONL event stream for status / token totals /
    messages.  Cost is reported as 0.0 because Codex does not emit a
    cost field; tokens are summed across ``turn.completed`` events.
  - Model fallback: if Codex rejects ``--model gpt-5.5`` with a
    "model not found" shaped error, the adapter retries once without
    ``--model`` and lets Codex pick its default.
  - Preflight credential check: if OPENAI_API_KEY is unset the adapter
    returns Error immediately instead of spinning up a container that
    can only fail.
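
As an illustration of the token accounting described above, here is a minimal sketch that sums usage across ``turn.completed`` events in a JSONL stream. The ``usage`` field names (``input_tokens``, ``output_tokens``) and the function name are assumptions for the example, not a statement of Codex's exact schema or of the adapter's parser:

```python
import json


def sum_turn_tokens(jsonl_text: str) -> dict[str, int]:
    """Sum token usage over every turn.completed event in a JSONL stream."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise interleaved in the stream
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0))
    return totals


stream = "\n".join(
    [
        json.dumps({"type": "turn.started"}),
        json.dumps({"type": "turn.completed", "usage": {"input_tokens": 1200, "output_tokens": 300}}),
        "not json",
        json.dumps({"type": "turn.completed", "usage": {"input_tokens": 800, "output_tokens": 100}}),
    ]
)
totals = sum_turn_tokens(stream)
```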

Shared ``_coop`` module:

  - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
    installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
    ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
  - ``install_snippet.sh`` — pip-installs redis and drops the shell
    wrappers; each adapter's setup.sh sources it.
  - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
    agnostic.
  - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
    ``write_file_in_container`` / ``read_file_from_container``,
    ``rewrite_comm_url_for_container``, ``build_git_setup_command``,
    ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires.  Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace).  This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module.  Full suite: 228
passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:

  - codex solo f1:           Submitted, 1 turn, 365k input tokens,
                             184-line patch (with the trailing-newline
                             fix it applies cleanly)
  - codex coop+git f1,f2:    both Submitted, both patches applied but
                             0/2 tests pass — coordination failure
                             (agent1 fetched ``team`` but never merged,
                             so the stacked patches produce a Python
                             SyntaxError at line 144 of the modified
                             file).  Claude on the same task scored
                             2/2; Codex used the tools less aggressively
                             on this run.

The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug.  Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product.  Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:

  1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
     backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
     atomic claim semantics (HSETNX-style — exactly one caller wins on a
     race) and an audit log of every mutation.  Exposed in the container
     as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
     / ``coop-task-list`` shell wrappers.

  2. A **lead / member role split**.  The first agent is designated
     ``team-lead`` and gets a system-prompt block instructing them to
     break the spec into tasks, assign them via ``coop-task-create
     --assign``, watch progress, and integrate.  Other agents are
     ``member`` and look for open tasks to claim.

  3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
     mounted at ``/workspace/shared`` in every container.  Free
     coordination artifact for design notes, partial diffs, interface
     sketches.
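
The atomic claim in primitive 1 can be sketched with a set-if-absent write on the task's owner field. This is a hedged illustration, not the ``TaskListClient`` code: the key layout after the ``cb:<run_id>:`` prefix, the ``owner`` field, and ``claim_task`` are assumptions, and ``InMemoryStore`` is a single-process stand-in so the example runs without a Redis server. A redis-py client exposes an ``hsetnx(name, key, value)`` method with the same return convention.

```python
from typing import Protocol


class HashStore(Protocol):
    def hsetnx(self, name: str, key: str, value: str) -> int: ...


class InMemoryStore:
    """Single-process stand-in that mimics Redis HSETNX return values."""

    def __init__(self) -> None:
        self._hashes: dict[str, dict[str, str]] = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        fields = self._hashes.setdefault(name, {})
        if key in fields:
            return 0  # field already set: the caller lost the race
        fields[key] = value
        return 1


def claim_task(store: HashStore, run_id: str, task_id: str, agent_id: str) -> bool:
    """Return True only for the first agent to claim the task."""
    return store.hsetnx(f"cb:{run_id}:task:{task_id}", "owner", agent_id) == 1


store = InMemoryStore()
first = claim_task(store, "run1", "7", "agent1")
second = claim_task(store, "run1", "7", "agent2")
other_run = claim_task(store, "run2", "7", "agent2")
```

Because the run id is part of the key, a claim in one run never collides with the same task id in another run.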

Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``.  Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
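
A rough sketch of how the metrics named above could be derived from an audit log. The event schema (``ts``, ``op``, ``task_id``, ``agent``, ``status``, ``assignee``) and the function name are assumptions made for the example; the PR does not show the real log format.

```python
from collections import Counter
from typing import Any, Optional


def compute_metrics(log: list[dict[str, Any]], run_start: float) -> dict[str, Any]:
    """Fold an audit log into the coordination metrics saved in result.json."""
    claims: Counter = Counter()
    updates: Counter = Counter()
    owner: dict[str, Optional[str]] = {}
    status: dict[str, str] = {}
    first_claim: Optional[float] = None

    for event in sorted(log, key=lambda e: e["ts"]):
        task_id, op = event["task_id"], event["op"]
        if op == "create":
            owner[task_id] = event.get("assignee")
            status[task_id] = "open"
        elif op == "claim":
            claims[event["agent"]] += 1
            owner[task_id] = event["agent"]
            status.setdefault(task_id, "open")
            if first_claim is None:
                first_claim = event["ts"] - run_start
        elif op == "update":
            updates[event["agent"]] += 1
            status[task_id] = event.get("status", status.get(task_id, "open"))

    return {
        "time_to_first_claim_seconds": first_claim,
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for t in status if owner.get(t) is None),
    }


log = [
    {"ts": 101.0, "op": "create", "task_id": "t1"},
    {"ts": 102.0, "op": "create", "task_id": "t2"},
    {"ts": 110.0, "op": "claim", "task_id": "t1", "agent": "agent1"},
    {"ts": 150.0, "op": "update", "task_id": "t1", "agent": "agent1", "status": "done"},
]
metrics = compute_metrics(log, run_start=100.0)
```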

Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs.  The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``.  The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
  - 18 task_list (CRUD, atomic claim, owner-only update, audit log,
    run isolation)
  - 12 prompt (lead vs member branches, solo fallback, git interaction)
  -  3 runtime (env assembly, scratchpad mount args)
  -  4 metrics (happy path, unowned-at-end, empty log, multiple claims)
  -  5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
    metrics in result, three-agent team)
  -  4 misc

Full suite: 274 passed, 63 skipped.  Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:

  - 5 tasks created (2 by bench-runner, 3 by the lead splitting its
    work), all reached ``done``
  - time_to_first_claim_seconds=34.2
  - claims_per_agent={agent1: 2, agent2: 1}
  - updates_per_agent={agent1: 4, agent2: 3}
  - scratchpad volume actively used (agent2 wrote its diff to
    /workspace/shared/agent2.patch + a summary.md)
  - **0/1 pass rate** — both ``patch.txt`` files were empty: the
    members wrote diffs to the scratchpad instead of also writing
    ``/workspace/repo/patch.txt``, and the lead never ran the final
    integration step.  This is real coordination signal (the prompt
    told them to write both places but they followed the scratchpad
    half only) — a follow-up will tighten the prompt to make patch.txt
    submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a
reviewable size):

  - In-loop auto-refresh for the Python-loop adapters
  - MCP long-poll tool to give CLI adapters push-ish inbox semantics
  - Typed ``coop-request`` / ``coop-respond`` protocol on top of
    messaging (CC's plan_approval_request shape)
  - Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resh (#53)

Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
   Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
   so agents can ``ls`` and ``cat`` tasks with their existing tools
   rather than going through the ``coop-task-list`` CLI.  Layout
   mirrors Claude Code's team primitive: one ``<id>.json`` per task,
   plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
   trail).  Triggered on every ``coop-task-list`` invocation and from
   the host runner at startup.  Files written via tempfile+replace so
   readers never observe a partial state.

2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
   Layered on plain Redis messaging, mirroring CC's
   ``plan_approval_request`` / ``plan_approval_response`` shape.
   ``coop-request <peer> <kind> <body>`` returns a request_id (and
   optionally blocks via ``--wait N`` for a response).
   ``coop-respond <request_id> <body>`` writes back; the sender's
   ``await_response`` uses BLPOP so it actually sleeps instead of
   busy-polling.  Both events flow into the shared task-log so
   coordination metrics include protocol events.

3. **MCP long-poll server** (``_team/mcp_server.py``).  Stdio
   JSON-RPC server that exposes a single ``wait_for_message`` tool
   backed by BLPOP on the agent's inbox.  Registered automatically:
   Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
   the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
   The point is to make "watch the inbox" a natural idle behavior for
   the CLI adapters instead of a busy-loop on ``coop-recv`` returning
   empty — the closest we can get to push-style delivery for opaque
   CLI agent loops.

4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
   ``TeamPoller`` is a per-agent host-side helper that
   ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
   queries — same hook as the existing inbox poll.  The LLM sees a
   compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
   prepended to every turn so it doesn't need to remember to call
   ``coop-task-list``.  Plumbed via ``agent.team_poller`` so the
   ``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
   The same module also exports ``poll_team_state()`` for in-container
   use (env-driven variant).

5. **Prompt fix**: the previous team-mode end-to-end had members
   writing diffs to ``/workspace/shared/<id>.patch`` only and never to
   ``/workspace/repo/patch.txt``, scoring 0/2 despite great
   coordination.  Both lead and member prompts now have an explicit
   ``### Final submission — REQUIRED`` section that calls out
   ``patch.txt`` as the only file the bench evaluates and provides
   the exact ``git diff > patch.txt`` command.
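
The tempfile+replace write mentioned for the filesystem mirror in item 1 can be sketched as follows. The helper name and the JSON payload are assumptions for illustration; the actual ``_team/fs_mirror.py`` code is not shown in this PR.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write payload to path so readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)  # atomic rename over any existing file
    except BaseException:
        # Remove the temp file if the write or the rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "tasks" / "1.json"
    write_json_atomic(target, {"id": "1", "status": "open"})
    write_json_atomic(target, {"id": "1", "status": "done"})
    final = json.loads(target.read_text(encoding="utf-8"))
    leftovers = [p.name for p in target.parent.iterdir() if p.name != "1.json"]
```

An agent running ``cat`` on the task file during the second write would read either the ``open`` or the ``done`` version, never a half-written one.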

Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.

Tests: 37 new unit tests
  -  8 fs_mirror     (atomic writes, stale cleanup, empty index)
  -  9 protocol      (request roundtrip, await, timeout, audit log)
  -  9 mcp_server    (initialize, tools/list, tools/call,
                      timeout, blocking, unknown-tool error,
                      env factory)
  -  8 loop_refresh  (summary formatting, TeamPoller, env variant)
  -  3 prompt        (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).  All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``.  Previous run scored 0/2 on the
same task — the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode.  Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.

New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds.  Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.

Per-adapter wiring:

  - ``mini_swe_agent_v2``: appends team_task_section to task;
    propagates CB_TEAM_* through env_kwargs["env"]; adds
    ``--add-host=host.docker.internal:host-gateway`` + scratchpad
    volume to docker run args; installs the team CLI scripts + pip
    redis in the container after env spin-up.  The existing
    ``TeamPoller`` host-side hook (already in step()) still fires.

  - ``openhands_sdk``: appends team_task_section to task; folds a new
    ``team_env`` dict into ``coop_info`` so
    ``_build_credentials_dict`` propagates CB_TEAM_* into the
    sandbox.  Coop-task-* binary install in the OpenHands agent-server
    image is a follow-up — OpenHands manages its own image build and
    doesn't expose a clean post-start exec hook.

  - ``swe_agent``: appends team_task_section to task.  The SWE-agent
    framework's sandbox + agent loop is third-party and harder to
    instrument; everything beyond the prompt is a follow-up.

Tests: 13 new
  - 3 prompt unit tests for team_task_section (lead, member, empty)
  - 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
    consistency between team_task_section and build_team_instruction,
    every registered runner accepts the team kwargs, openhands env
    keys, swe_agent signature

Full suite: 324 passed, 63 skipped.  Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment.  Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.

Compatibility matrix is now:

  | Adapter             | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
  |---------------------|---------|-------------|--------------|------------------|----------|
  | claude_code         | yes     | yes (full)  | n/a          | yes              | yes      |
  | codex               | yes     | yes (full)  | n/a          | yes              | yes      |
  | mini_swe_agent_v2   | yes     | yes (sec.)  | yes          | yes              | yes      |
  | openhands_sdk       | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
  | swe_agent           | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required).  Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost.  Modal caches the layered image; subsequent
team runs are instant.

Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5.  The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked.  Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex.  Both patches applied; f1 14/14,
f2 19/20.

Tests: 2 new (full suite: 326 passed)
  - test_team_env_triggers_image_layering  — verifies add_local_file
    + pip_install + run_commands fire with the right args when team
    mode is active
  - test_no_layering_when_team_inactive    — verifies solo / coop
    runs skip the image-build cost

Matrix update — openhands_sdk now reads:
  Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
  CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it.  Both fetched their peer's branch (2 each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.

The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it.  This rewrite:

  - Renames the section ``## Git collaboration — MERGE IS REQUIRED
    BEFORE SUBMITTING`` so the imperative is in the heading itself.
  - Adds an explicit "Required final sequence — run this verbatim
    before exiting" block with the full fetch+merge+diff sequence,
    parameterized over every partner branch.
  - Explains *why* (each agent's patch.txt is evaluated against every
    feature's tests; without the merge, the peer feature's symbols
    are missing → ImportError).
  - Frames it the same way the patch.txt step is framed (REQUIRED,
    skip-at-your-loss), which the original prompt fix proved
    codex responds to.

Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``.  Both
patches now contain both features' symbols.  Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).

A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree.  This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch.  That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked.  Fix is a separate concern from team-mode wiring.

Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.

Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches.  If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance.  Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
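
A minimal sketch of the short-circuit described above, under stated assumptions: the function name and return shape are hypothetical (only the ``merge.status: "identical"`` value comes from this commit), and the ``git apply --recount`` normalization step is left out to keep the example self-contained.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def shortcircuit_identical(patch1: Path, patch2: Path, merged: Path) -> Optional[dict]:
    """Copy patch1 to merged and report "identical" when both patches match byte for byte."""
    if not filecmp.cmp(patch1, patch2, shallow=False):
        return None  # caller falls through to the normal merge path
    shutil.copyfile(patch1, merged)
    return {"merge": {"status": "identical"}}


with tempfile.TemporaryDirectory() as root:
    base = Path(root)
    (base / "a.patch").write_text("+x = 1\n", encoding="utf-8")
    (base / "b.patch").write_text("+x = 1\n", encoding="utf-8")
    (base / "c.patch").write_text("+x = 2\n", encoding="utf-8")

    same = shortcircuit_identical(base / "a.patch", base / "b.patch", base / "merged.patch")
    merged_text = (base / "merged.patch").read_text(encoding="utf-8")
    different = shortcircuit_identical(base / "a.patch", base / "c.patch", base / "other.patch")
    other_written = (base / "other.patch").exists()
```

Passing ``shallow=False`` forces a content comparison rather than relying on file size and modification time.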

Verified on the codex-team e2e:

  - cx_team_v5 (codex agents perfectly merged to identical 243-line
    patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
  - cx_team_v4 (codex agents diverged on the merge): unchanged at
    f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
    agent2-alone via apply_status: {'agent1': 'failed', ...}

I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.

Also lands previously-uncommitted housekeeping:
  - prompt.py: ruff-format-only diff on the merge-required block
    from the prior commit
  - test_team_wiring.py: ruff --fix removed unused MagicMock
    imports
  - test_gcp_backend.py / test_tasks.py: ruff --fix removed
    f-string-without-placeholder and unused-json import (both
    unrelated drift caught by the gate)

Tests: 1 new (full suite: 327 passed)
  - ``test_test_merged_shortcircuits_on_identical_patches`` — source
    inspection confirms the short-circuit branch + "identical"
    merge-status string exist in test_merged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands.  This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.

Files:

  * ``openhands/tools/task_tracker/coop_definition.py`` — new tool
    definition + executor.  Same ``TaskTrackerAction`` /
    ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
    trip through the shared ``cb:<run_id>:`` Redis namespace that
    ``TaskListClient`` (host side) writes to.  Tasks are auto-owned
    by the calling agent; ``view`` shows peer tasks prefixed with
    ``[<their_agent_id>]``.  Registered under both
    ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
    the module rebinds the latter to the Coop variant.

  * ``openhands/tools/preset/default.py`` — gains a ``team_mode``
    kwarg (kept for API stability + tests; the actual swap happens
    server-side via the .pth/__init__ side-effect import, not by
    changing the host-side tool list).  The pre-PR coop block is now
    split out into a more nuanced team-mode prompt section that
    documents the TaskTracker → shared-list behavior.

  * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
    layers two more bits into the Modal image at build time:
      - ``add_local_file`` of ``coop_definition.py`` to
        ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
        install)
      - ``grep ... || echo`` appending
        ``from . import coop_definition`` to the package's
        ``__init__.py`` so the registration runs at import time.

Tests: 1 new + updated image-layering assertions
  - ``test_importing_coop_definition_overrides_local_registration``:
    inspecting the registry's ``_MODULE_QUALNAMES`` confirms
    ``TaskTrackerTool.name`` resolves to ``coop_definition``'s
    registration after import.
  - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
    calls + 2 ``run_commands`` layers (tool-file install +
    ``coop-task-*`` wrappers) and that the
    ``from . import coop_definition`` line is in the install
    commands.

Full suite: 329 passed.  Ruff / format / mypy all green.

KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis.  The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``.  The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.

For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning.  Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
   ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
   running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
   accessed via ``unencrypted_host:unencrypted_port``).  Re-uses the
   existing ``connectors/redis_server.ModalRedisServer`` — it was
   already written, just unused.  Both the host TaskListClient and
   the agent sandboxes point at the same public TCP endpoint, so
   pre-seed and agent reads/writes share state.  Falls back to local
   Redis for the other adapters.

2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
   adapter now ``add_local_file``s three pieces into the OpenHands
   image at build time:
     - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
     - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
     - ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
       (replaces upstream; same exports + a side-effect import of
       coop_definition so the Redis-backed executor overrides the
       local TaskTracker registration at first import).
   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
   bytecode cache so the new __init__ actually re-runs.

3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
   connections after a few minutes, so the Redis client opened for
   the startup pre-seed is closed before the 9-min agent run
   finishes.  Re-open the client at harvest time using the same URL.

End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:

  - Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
  - Both agents Submitted, 9m total
  - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
  - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
    time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
    agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
  - Cost: $3.33

Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.

Full suite: 329 passed.  Ruff / format / mypy all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.

GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file.  Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):

1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
   in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
   The old package was renamed in v0.0.13; both swe_agent files
   had stale imports that failed at module load (TypeError or
   ModuleNotFoundError depending on how the framework was invoked),
   making every swe_agent invocation return Error before any LLM
   call.

2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
   in pyproject.toml.  swe_agent's vendored framework imports these
   at module-load time even when the docker/S3/model paths are
   dormant, so a clean ``pip install '.[swe-agent]'`` without these
   would still ImportError on first invocation.

3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply.  Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring.  swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.

Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body.  For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core
subset in team setting. All four benchmarked frameworks (these three
plus `openhands_sdk`) now score ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip()
  consumes " \n"), breaking last-hunk line counts so git apply
  rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
  "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
  raw .strip() on the patch.txt read, so every msa patch ended
  in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
  long-running command, so the image's default CMD exited and
  every exec hit "Sandbox not found". Pass "sleep", "infinity"
  as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
  because shared build_environment was hardcoded to DockerEnvironment.
  Added a backend kwarg and threaded config["backend"] through both
  adapters.
- Team lead prompt buried the integration step at the bottom of a
  long workflow list; Claude/Codex consistently exited after their
  own feature without reading /workspace/shared/<agent>.patch.
  Rewrote with a hard-rule opener and a 5-point pre-submission
  checklist. Member prompt now opens with "stay in your lane" per
  the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
  alone when the merged tree doesn't pass both features. Surfaced
  as merge.strategy="solo-agent1" / "solo-agent2". Credits the
  agent (typically the lead) who correctly integrated both
  features into one working patch but had it corrupted by
  union-merging with the other agent's partial implementation.
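
The first bullet's whitespace bug is easy to reproduce. Below is a minimal sketch of the newline-only trimming it describes; the function name ``normalize_patch_sketch`` is hypothetical and the real ``normalize_patch`` may do more.

```python
def normalize_patch_sketch(text: str) -> str:
    """Trim only newline characters at both ends, then end with one newline.

    Unlike ``text.strip()``, this keeps a trailing blank context line
    (a line holding a single space), which the last hunk's line count
    depends on.
    """
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A diff whose last hunk ends in a blank context line (" "):
raw = "\n--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,3 @@\n a\n+b\n \n\n"
fixed = normalize_patch_sketch(raw)
broken = raw.strip() + "\n"  # the old behavior: the blank context line is lost
```

Here ``fixed`` still ends with the blank context line followed by one newline, while ``broken`` ends right after ``+b``, so the hunk header promises more lines than the patch contains.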

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- dataset/subsets/core.json: 10-pair subset for quick agent
  comparisons. Stratified by repo (largest-remainder proportional
  allocation by full-dataset pair count) with a one-slot floor per
  primary language (Python / Go / Rust / TS). Reproducible via
  scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
  frameworks on the core subset in team setting. Includes per-task
  pass/fail matrix annotated with the merge strategy used, plus the
  chronological narrative of the dozen reruns that surfaced each of
  the bugs fixed in the previous commit.
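
For readers unfamiliar with largest-remainder allocation, here is a minimal sketch of the proportional step. It is not ``scripts/generate_core_subset.py``: the function name and the repo names in the example are made up, and the per-language floor and the seeded sampling are omitted.

```python
def largest_remainder(counts: dict[str, int], slots: int) -> dict[str, int]:
    """Split ``slots`` across keys in proportion to ``counts``.

    Each key first gets the floor of its exact quota; leftover slots go
    to the keys with the largest fractional remainders.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    # Integer arithmetic avoids floating-point ties in the remainders.
    alloc = {key: count * slots // total for key, count in counts.items()}
    leftover = slots - sum(alloc.values())
    by_remainder = sorted(
        counts, key=lambda key: counts[key] * slots % total, reverse=True
    )
    for key in by_remainder[:leftover]:
        alloc[key] += 1
    return alloc


# Hypothetical pair counts per repo, allocated over a 10-pair subset.
example = largest_remainder({"repo_a": 47, "repo_b": 16, "repo_c": 37}, 10)
```

With these made-up counts the exact quotas are 4.7, 1.6 and 3.7, so the two leftover slots go to ``repo_a`` and ``repo_c``.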

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.

Doesn't change any of the current 40 eval results — union's merge=union
attribute is tolerant enough that every task in the dataset produces
some tree (potentially broken code with stitched-together lines); the
broken-tree-tests-fail path already triggered the solo fallback. This
just closes the defensive gap for future pathological cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:

  1. identical patches → skip-merge short-circuit
  2. naive 3-way merge clean → merged-tree tests are authoritative
                               (no further fallback)
  3. naive merge conflicts → test the lead's patch.txt alone against
                             both feature suites

Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
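
The three-step chain can be read as a small decision function. The sketch below is illustrative only: ``choose_eval_strategy`` and its string labels are assumptions, and the real ``test_merged`` performs the merge and runs the tests rather than taking a boolean.

```python
def choose_eval_strategy(
    lead_patch: bytes, member_patch: bytes, naive_merge_clean: bool
) -> str:
    """Pick which tree the feature suites are run against."""
    if lead_patch == member_patch:
        # 1. Identical patches: skip the merge entirely.
        return "identical"
    if naive_merge_clean:
        # 2. Clean naive 3-way merge: merged-tree results are final.
        return "naive"
    # 3. Conflicts: test the lead's patch alone against both suites.
    return "lead-alone"
```

Note that there is no branch for the member's patch: under this policy a conflicted merge is judged solely on what the lead integrated.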

Effect on the core-subset horizontal comparison:
  msa  6 → 6  (unchanged)
  oh   5 → 4  (loses pallets_jinja/1621 — was passing via union, which
              concealed that oh's lead doesn't integrate)
  cc   5 → 5  (unchanged)
  cx   5 → 5  (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
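
A rough sketch of the stdin fix, assuming the command is assembled as a shell string. This is not the body of ``_build_codex_command``; the function name, its parameters, and passing the prompt as a positional argument are assumptions, while the flags are the ones listed for the Codex adapter.

```python
import shlex


def build_codex_command_sketch(model: str, prompt: str) -> str:
    """Build a shell command that runs ``codex exec`` with stdin closed."""
    argv = [
        "codex", "exec", "--json",
        "--sandbox", "danger-full-access",
        "--skip-git-repo-check",
        "--model", model,
        prompt,
    ]
    # Redirecting stdin from /dev/null yields an immediate EOF, so the
    # CLI does not block waiting for additional input.
    return shlex.join(argv) + " </dev/null"


command = build_codex_command_sketch("gpt-5.5", "Implement feature 1")
```

The redirect is appended after quoting so that the shell interprets it; quoting it along with the arguments would pass ``</dev/null`` to the CLI as a literal string.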

openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.
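
The guardrail amounts to a small pre-flight check. A minimal sketch follows, assuming ``config.json`` stores the adapter name under an ``"agent"`` key; both that key and the function name are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def eval_backend_warning(run_dir: Path, backend: str) -> Optional[str]:
    """Return a warning when an openhands_sdk run is evaluated off Modal.

    Returns None when the evaluation may proceed.
    """
    config = json.loads((run_dir / "config.json").read_text())
    # The "agent" key is an assumed field name for this sketch.
    if config.get("agent") == "openhands_sdk" and backend != "modal":
        return (
            "This run was produced by openhands_sdk; "
            f"refusing to evaluate with --backend {backend} (use modal)."
        )
    return None
```

Returning the message instead of raising leaves it to the caller to decide whether to skip the run or abort the whole evaluation.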

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
   so swerex's `docker run ... image sh -c "<startup>"` becomes
   `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
   the feature-patch path. Pass docker_args=["--entrypoint", ""]
   to clear the entrypoint (mirrors the existing Modal monkey-patch
   that does .entrypoint([]) on the image).

2. swerex's startup falls back to `pipx run swe-rex ...` when the
   swerex-remote binary isn't pre-installed, but pipx looks for an
   executable literally named "swe-rex" — which doesn't exist in
   the published `swe-rex` package (it provides "swerex-remote").
   Monkey-patch DockerDeployment._get_swerex_start_cmd to use
   `pipx run --spec swe-rex swerex-remote ...` instead.
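
Both workarounds come down to changing what gets passed to ``docker run`` and ``pipx``. The sketch below shows the two values in isolation; the helper names are hypothetical and nothing here imports swerex, so treat it as an outline of the arguments rather than the actual monkey-patch.

```python
def docker_deployment_kwargs(image: str) -> dict:
    """Keyword arguments for a docker deployment with the entrypoint cleared."""
    return {
        "image": image,
        # An empty --entrypoint stops the image's runner script from
        # swallowing the startup command as its first argument.
        "docker_args": ["--entrypoint", ""],
    }


def swerex_start_argv(remote_args: list[str]) -> list[str]:
    """Fallback start command: run the ``swerex-remote`` app from ``swe-rex``."""
    # ``--spec`` names the package to install; the next token names the
    # executable inside it, which differs from the package name here.
    return ["pipx", "run", "--spec", "swe-rex", "swerex-remote", *remote_args]
```
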

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from team-mode to main May 18, 2026 23:44
Resolves the squash-merge conflicts from #52 landing on main.

All conflicts followed the same pattern: this branch's HEAD contains
#52's content plus the subsequent work on top, while main's
squashed-merge commit contains only #52. Resolved each conflict by
taking ours (HEAD), which preserves the cumulative state of:

- CHANGELOG: full Fixed/Changed/Added entries for team-mode bug
  fixes, eval policy change, core subset + benchmark doc, plus the
  original "team setting" bullet from #52
- _team/prompt.py: the stronger lead-prompt with the 5-point
  integration checklist (#52 had the older "buried integration"
  version)
- swe_agent/adapter.py: team-mode kwarg propagation + Docker
  backend dispatch + pipx --spec monkey-patch
- runner/team.py: openhands_sdk Modal-Redis tunnel branch
- everywhere else: my newer adapter changes are strict supersets
  of #52's

CI green locally: 329 tests passed, ruff clean, mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ProKil ProKil merged commit 32ad650 into main May 21, 2026
3 checks passed
@ProKil ProKil deleted the team-all-adapters branch May 21, 2026 16:52
ProKil added a commit that referenced this pull request May 21, 2026
#58)

* agents/codex: add Codex adapter; lift shared coop bits into _coop

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter.  Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:

  - Invokes ``codex exec --json --sandbox danger-full-access
    --skip-git-repo-check --model <id>``.
  - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
    inside the container so the CLI authenticates without prompts.
  - Parses Codex's JSONL event stream for status / token totals /
    messages.  Cost is reported as 0.0 because Codex does not emit a
    cost field; tokens are summed across ``turn.completed`` events.
  - Model fallback: if Codex rejects ``--model gpt-5.5`` with a
    "model not found" shaped error, the adapter retries once without
    ``--model`` and lets Codex pick its default.
  - Preflight credential check: if OPENAI_API_KEY is unset the adapter
    returns Error immediately instead of spinning up a container that
    can only fail.

Shared ``_coop`` module:

  - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
    installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
    ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
  - ``install_snippet.sh`` — pip-installs redis and drops the shell
    wrappers; each adapter's setup.sh sources it.
  - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
    agnostic.
  - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
    ``write_file_in_container`` / ``read_file_from_container``,
    ``rewrite_comm_url_for_container``, ``build_git_setup_command``,
    ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires.  Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace).  This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module.  Full suite: 228
passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:

  - codex solo f1:           Submitted, 1 turn, 365k input tokens,
                             184-line patch (with the trailing-newline
                             fix it applies cleanly)
  - codex coop+git f1,f2:    both Submitted, both patches applied but
                             0/2 tests pass — coordination failure
                             (agent1 fetched ``team`` but never merged,
                             so the stacked patches produce a Python
                             SyntaxError at line 144 of the modified
                             file).  Claude on the same task scored
                             2/2; Codex used the tools less aggressively
                             on this run.

The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug.  Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* runner: add team mode (lead + members + shared task list + scratchpad)

Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product.  Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:

  1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
     backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
     atomic claim semantics (HSETNX-style — exactly one caller wins on a
     race) and an audit log of every mutation.  Exposed in the container
     as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
     / ``coop-task-list`` shell wrappers.

  2. A **lead / member role split**.  The first agent is designated
     ``team-lead`` and gets a system-prompt block instructing them to
     break the spec into tasks, assign them via ``coop-task-create
     --assign``, watch progress, and integrate.  Other agents are
     ``member`` and look for open tasks to claim.

  3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
     mounted at ``/workspace/shared`` in every container.  Free
     coordination artifact for design notes, partial diffs, interface
     sketches.
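
The atomic-claim semantics in item 1 can be sketched as follows. This is not ``TaskListClient``: the key layout after the ``cb:<run_id>:`` prefix, the ``owner`` field, and the in-memory stand-in are assumptions made so the example runs without a Redis server. The stand-in mirrors the return convention of Redis ``HSETNX`` (1 if the field was set, 0 if it already existed).

```python
from typing import Optional


class InMemoryHashes:
    """Tiny stand-in for a Redis client, supporting only what the sketch needs."""

    def __init__(self) -> None:
        self._data: dict[str, dict[str, str]] = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        fields = self._data.setdefault(name, {})
        if key in fields:
            return 0  # field already set: the caller lost the race
        fields[key] = value
        return 1

    def hget(self, name: str, key: str) -> Optional[str]:
        return self._data.get(name, {}).get(key)


def claim_task(store, run_id: str, task_id: str, agent_id: str) -> bool:
    """Try to claim a task; exactly one caller gets True for a given task."""
    task_key = f"cb:{run_id}:task:{task_id}"  # hypothetical key layout
    return bool(store.hsetnx(task_key, "owner", agent_id))


store = InMemoryHashes()
first = claim_task(store, "run1", "t1", "agent1")
second = claim_task(store, "run1", "t1", "agent2")
```

Because the set-if-absent happens in a single operation, there is no window between checking for an owner and writing one.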

Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``.  Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.

Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs.  The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``.  The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
  - 18 task_list (CRUD, atomic claim, owner-only update, audit log,
    run isolation)
  - 12 prompt (lead vs member branches, solo fallback, git interaction)
  -  3 runtime (env assembly, scratchpad mount args)
  -  4 metrics (happy path, unowned-at-end, empty log, multiple claims)
  -  5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
    metrics in result, three-agent team)
  -  4 misc

Full suite: 274 passed, 63 skipped.  Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:

  - 5 tasks created (2 by bench-runner, 3 by the lead splitting its
    work), all reached ``done``
  - time_to_first_claim_seconds=34.2
  - claims_per_agent={agent1: 2, agent2: 1}
  - updates_per_agent={agent1: 4, agent2: 3}
  - scratchpad volume actively used (agent2 wrote its diff to
    /workspace/shared/agent2.patch + a summary.md)
  - **0/1 pass rate** — both ``patch.txt`` files were empty: the
    members wrote diffs to the scratchpad instead of also writing
    ``/workspace/repo/patch.txt``, and the lead never ran the final
    integration step.  This is real coordination signal (the prompt
    told them to write both places but they followed the scratchpad
    half only) — a follow-up will tighten the prompt to make patch.txt
    submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a
reviewable size):

  - In-loop auto-refresh for the Python-loop adapters
  - MCP long-poll tool to give CLI adapters push-ish inbox semantics
  - Typed ``coop-request`` / ``coop-respond`` protocol on top of
    messaging (CC's plan_approval_request shape)
  - Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53)

Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
   Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
   so agents can ``ls`` and ``cat`` tasks with their existing tools
   rather than going through the ``coop-task-list`` CLI.  Layout
   mirrors Claude Code's team primitive: one ``<id>.json`` per task,
   plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
   trail).  Triggered on every ``coop-task-list`` invocation and from
   the host runner at startup.  Files written via tempfile+replace so
   readers never observe a partial state.

2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
   Layered on plain Redis messaging, mirroring CC's
   ``plan_approval_request`` / ``plan_approval_response`` shape.
   ``coop-request <peer> <kind> <body>`` returns a request_id (and
   optionally blocks via ``--wait N`` for a response).
   ``coop-respond <request_id> <body>`` writes back; the sender's
   ``await_response`` uses BLPOP so it actually sleeps instead of
   busy-polling.  Both events flow into the shared task-log so
   coordination metrics include protocol events.

3. **MCP long-poll server** (``_team/mcp_server.py``).  Stdio
   JSON-RPC server that exposes a single ``wait_for_message`` tool
   backed by BLPOP on the agent's inbox.  Registered automatically:
   Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
   the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
   The point is to make "watch the inbox" a natural idle behavior for
   the CLI adapters instead of a busy-loop on ``coop-recv`` returning
   empty — the closest we can get to push-style delivery for opaque
   CLI agent loops.

4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
   ``TeamPoller`` is a per-agent host-side helper that
   ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
   queries — same hook as the existing inbox poll.  The LLM sees a
   compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
   prepended to every turn so it doesn't need to remember to call
   ``coop-task-list``.  Plumbed via ``agent.team_poller`` so the
   ``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
   The same module also exports ``poll_team_state()`` for in-container
   use (env-driven variant).

5. **Prompt fix**: the previous team-mode end-to-end had members
   writing diffs to ``/workspace/shared/<id>.patch`` only and never to
   ``/workspace/repo/patch.txt``, scoring 0/2 despite great
   coordination.  Both lead and member prompts now have an explicit
   ``### Final submission — REQUIRED`` section that calls out
   ``patch.txt`` as the only file the bench evaluates and provides
   the exact ``git diff > patch.txt`` command.
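
The tempfile+replace write mentioned in item 1 is a standard pattern. Here is a minimal sketch, with a hypothetical function name rather than the code in ``_team/fs_mirror.py``:

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so that readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)  # atomic swap into place
    except BaseException:
        os.unlink(tmp_name)  # do not leave a stray temp file behind
        raise
```

A reader running ``cat`` on a task file during a refresh therefore never sees a half-written JSON document.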

Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.

Tests: 37 new unit tests
  -  8 fs_mirror     (atomic writes, stale cleanup, empty index)
  -  9 protocol      (request roundtrip, await, timeout, audit log)
  -  9 mcp_server    (initialize, tools/list, tools/call,
                      timeout, blocking, unknown-tool error,
                      env factory)
  -  8 loop_refresh  (summary formatting, TeamPoller, env variant)
  -  3 prompt        (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).  All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``.  Previous run scored 0/2 on the
same task — the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team mode: wire team prompt + env into the three Python-loop adapters

Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode.  Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.

New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds.  Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.

Per-adapter wiring:

  - ``mini_swe_agent_v2``: appends team_task_section to task;
    propagates CB_TEAM_* through env_kwargs["env"]; adds
    ``--add-host=host.docker.internal:host-gateway`` + scratchpad
    volume to docker run args; installs the team CLI scripts + pip
    redis in the container after env spin-up.  The existing
    ``TeamPoller`` host-side hook (already in step()) still fires.

  - ``openhands_sdk``: appends team_task_section to task; folds a new
    ``team_env`` dict into ``coop_info`` so
    ``_build_credentials_dict`` propagates CB_TEAM_* into the
    sandbox.  Coop-task-* binary install in the OpenHands agent-server
    image is a follow-up — OpenHands manages its own image build and
    doesn't expose a clean post-start exec hook.

  - ``swe_agent``: appends team_task_section to task.  The SWE-agent
    framework's sandbox + agent loop is third-party and harder to
    instrument; everything beyond the prompt is a follow-up.

Tests: 13 new
  - 3 prompt unit tests for team_task_section (lead, member, empty)
  - 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
    consistency between team_task_section and build_team_instruction,
    every registered runner accepts the team kwargs, openhands env
    keys, swe_agent signature

Full suite: 324 passed, 63 skipped.  Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment.  Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.

Compatibility matrix is now:

  | Adapter             | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
  |---------------------|---------|-------------|--------------|------------------|----------|
  | claude_code         | yes     | yes (full)  | n/a          | yes              | yes      |
  | codex               | yes     | yes (full)  | n/a          | yes              | yes      |
  | mini_swe_agent_v2   | yes     | yes (sec.)  | yes          | yes              | yes      |
  | openhands_sdk       | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
  | swe_agent           | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: layer coop-task-* install onto Modal image for team mode

Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required).  Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost.  Modal caches the layered image; subsequent
team runs are instant.

Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5.  The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked.  Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex.  Both patches applied; f1 14/14,
f2 19/20.

Tests: 2 new (full suite: 326 passed)
  - test_team_env_triggers_image_layering  — verifies add_local_file
    + pip_install + run_commands fire with the right args when team
    mode is active
  - test_no_layering_when_team_inactive    — verifies solo / coop
    runs skip the image-build cost

Matrix update — openhands_sdk now reads:
  Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
  CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team prompt: make the merge-before-submit step REQUIRED

The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it.  Both fetched their peer's branch (2 each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.

The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it.  This rewrite:

  - Renames the section ``## Git collaboration — MERGE IS REQUIRED
    BEFORE SUBMITTING`` so the imperative is in the heading itself.
  - Adds an explicit "Required final sequence — run this verbatim
    before exiting" block with the full fetch+merge+diff sequence,
    parameterized over every partner branch.
  - Explains *why* (each agent's patch.txt is evaluated against every
    feature's tests; without the merge, the peer feature's symbols
    are missing → ImportError).
  - Frames it the same way the patch.txt step is framed (REQUIRED,
    skip-at-your-loss), which the original prompt fix proved
    codex responds to.

Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``.  Both
patches now contain both features' symbols.  Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).

A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree.  This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch.  That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked.  Fix is a separate concern from team-mode wiring.

Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval: short-circuit when both agents submit identical merged patches

In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.

Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches.  If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance.  Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
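
A minimal sketch of the short-circuit described above, assuming the two patches are plain files on disk.  The function name and the return shape are illustrative, not the actual ``test_merged`` code, and the ``git apply --recount`` normalization step is left out:

```python
import filecmp
import shutil
from pathlib import Path
from typing import Optional


def short_circuit_if_identical(patch1: Path, patch2: Path, merged: Path) -> Optional[dict]:
    """Copy patch1 to merged and report "identical" when both patches match byte for byte."""
    # shallow=False forces a content comparison (like `cmp`) instead of
    # trusting file size and mtime.
    if not filecmp.cmp(patch1, patch2, shallow=False):
        return None  # caller continues with the regular merge path
    shutil.copyfile(patch1, merged)
    # Hypothetical result shape; only the "identical" status string is from the source.
    return {"merge": {"status": "identical"}}
```

Returning ``None`` on a mismatch keeps the regular merge path untouched, so the short-circuit can only affect runs where the two submissions are already the same tree.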

Verified on the codex-team e2e:

  - cx_team_v5 (codex agents perfectly merged to identical 243-line
    patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
  - cx_team_v4 (codex agents diverged on the merge): unchanged at
    f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
    agent2-alone via apply_status: {'agent1': 'failed', ...}

I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.

Also lands previously-uncommitted housekeeping:
  - prompt.py: ruff-format-only diff on the merge-required block
    from the prior commit
  - test_team_wiring.py: ruff --fix removed unused MagicMock
    imports
  - test_gcp_backend.py / test_tasks.py: ruff --fix removed
    f-string-without-placeholder and unused-json import (both
    unrelated drift caught by the gate)

Tests: 1 new (full suite: 327 passed)
  - ``test_test_merged_shortcircuits_on_identical_patches`` — source
    inspection confirms the short-circuit branch + "identical"
    merge-status string exist in test_merged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: register Redis-backed CoopTaskTracker as a typed tool

The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands.  This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.

Files:

  * ``openhands/tools/task_tracker/coop_definition.py`` — new tool
    definition + executor.  Same ``TaskTrackerAction`` /
    ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
    trip through the shared ``cb:<run_id>:`` Redis namespace that
    ``TaskListClient`` (host side) writes to.  Tasks are auto-owned
    by the calling agent; ``view`` shows peer tasks prefixed with
    ``[<their_agent_id>]``.  Registered under both
    ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
    the module rebinds the latter to the Coop variant.

  * ``openhands/tools/preset/default.py`` — gains a ``team_mode``
    kwarg (kept for API stability + tests; the actual swap happens
    server-side via the .pth/__init__ side-effect import, not by
    changing the host-side tool list).  Pre-PR coop block split into
    a more nuanced team-mode prompt section that documents the
    TaskTracker → shared-list behavior.

  * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
    layers two more bits into the Modal image at build time:
      - ``add_local_file`` of ``coop_definition.py`` to
        ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
        install)
      - ``grep ... || echo`` appending
        ``from . import coop_definition`` to the package's
        ``__init__.py`` so the registration runs at import time.

Tests: 1 new + updated image-layering assertions
  - ``test_importing_coop_definition_overrides_local_registration``:
    inspecting the registry's ``_MODULE_QUALNAMES`` confirms
    ``TaskTrackerTool.name`` resolves to ``coop_definition``'s
    registration after import.
  - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
    calls + 2 ``run_commands`` layers (tool-file install +
    ``coop-task-*`` wrappers) and that the
    ``from . import coop_definition`` line is in the install
    commands.

Full suite: 329 passed.  Ruff / format / mypy all green.

KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis.  The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``.  The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.

For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands team mode: end-to-end working with Modal-hosted Redis

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning.  Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
   ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
   running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
   accessed via ``unencrypted_host:unencrypted_port``).  Re-uses the
   existing ``connectors/redis_server.ModalRedisServer`` — it was
   already written, just unused.  Both the host TaskListClient and
   the agent sandboxes point at the same public TCP endpoint, so
   pre-seed and agent reads/writes share state.  Falls back to local
   Redis for the other adapters.

2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
   adapter now ``add_local_file``s three pieces into the OpenHands
   image at build time:
     - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
     - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
     - ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
       (replaces upstream; same exports + a side-effect import of
       coop_definition so the Redis-backed executor overrides the
       local TaskTracker registration at first import).
   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
   bytecode cache so the new __init__ actually re-runs.

3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
   connections after a few minutes, so the Redis client opened for
   the startup pre-seed is closed before the 9-min agent run
   finishes.  Re-open the client at harvest time using the same URL.

End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:

  - Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
  - Both agents Submitted, 9m total
  - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
  - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
    time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
    agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
  - Cost: $3.33

Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.

Full suite: 329 passed.  Ruff / format / mypy all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* deps: add fakeredis to dev extras

The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.

GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file.  Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* swe_agent: fix import error + add missing transitive deps

Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):

1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
   in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
   The old package was renamed in v0.0.13; both swe_agent files
   had stale imports that failed at module load (TypeError or
   ModuleNotFoundError depending on how the framework was invoked),
   so every swe_agent invocation returned Error before any LLM
   call.

2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
   in pyproject.toml.  swe_agent's vendored framework imports these
   at module-load time even when the docker/S3/model paths are
   dormant, so a clean ``pip install '.[swe-agent]'`` without these
   would still ImportError on first invocation.

3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply.  Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring.  swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.

Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body.  For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: team-mode bugs surfaced by 10-pair core run

Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core
subset in team setting. All four frameworks (these three plus
`openhands_sdk`) now reach ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip()
  consumes " \n"), breaking last-hunk line counts so git apply
  rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
  "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
  raw .strip() on the patch.txt read, so every msa patch ended
  in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
  long-running command, so the image's default CMD exited and
  every exec hit "Sandbox not found". Pass "sleep", "infinity"
  as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
  because shared build_environment was hardcoded to DockerEnvironment.
  Added a backend kwarg and threaded config["backend"] through both
  adapters.
- Team lead prompt buried the integration step at the bottom of a
  long workflow list; Claude/Codex consistently exited after their
  own feature without reading /workspace/shared/<agent>.patch.
  Rewrote with a hard-rule opener and a 5-point pre-submission
  checklist. Member prompt now opens with "stay in your lane" per
  the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
  alone when the merged tree doesn't pass both features. Surfaced
  as merge.strategy="solo-agent1" / "solo-agent2". Credits the
  agent (typically the lead) who correctly integrated both
  features into one working patch but had it corrupted by
  union-merging with the other agent's partial implementation.
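
The first bullet's fix can be illustrated with a small sketch.  It is not the actual ``normalize_patch`` from ``_coop/runtime.py``; the function name is hypothetical and it only shows why trimming newlines alone keeps a trailing blank context line (a single space followed by a newline) intact:

```python
def normalize_patch_sketch(text: str) -> str:
    """Trim only newline characters from both ends, then end with exactly one newline."""
    body = text.lstrip("\n").rstrip("\n")
    # A bare str.strip() would also eat the " " of a trailing blank context
    # line, which shortens the last hunk and makes `git apply` reject it.
    return body + "\n" if body else ""


raw = "\n--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,3 @@\n x\n+y\n \n\n"
cleaned = normalize_patch_sketch(raw)
```

Here ``cleaned`` still ends with the blank context line, whereas ``raw.strip()`` would end at ``+y``.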

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs+data: core subset and team-mode horizontal comparison

- dataset/subsets/core.json: 10-pair subset for quick agent
  comparisons. Stratified by repo (largest-remainder proportional
  allocation by full-dataset pair count) with a one-slot floor per
  primary language (Python / Go / Rust / TS). Reproducible via
  scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
  frameworks on the core subset in team setting. Includes per-task
  pass/fail matrix annotated with the merge strategy used, plus the
  chronological narrative of the dozen reruns that surfaced each of
  the bugs fixed in the previous commit.
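
For readers unfamiliar with largest-remainder allocation, here is a rough sketch of the proportional step only.  It is not ``scripts/generate_core_subset.py``: the function name, the repo names and the pair counts are made up, and the one-slot floor per language and the seeded sampling are left out:

```python
def largest_remainder(counts: dict[str, int], slots: int) -> dict[str, int]:
    """Split `slots` across keys in proportion to `counts` (largest-remainder method)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    # Integer arithmetic avoids float rounding: quota = count * slots / total.
    alloc = {k: c * slots // total for k, c in counts.items()}
    leftover = slots - sum(alloc.values())
    # Hand the remaining slots to the largest fractional remainders;
    # ties are broken by key name so the result is deterministic.
    order = sorted(counts, key=lambda k: (-(counts[k] * slots % total), k))
    for k in order[:leftover]:
        alloc[k] += 1
    return alloc


# Illustrative pair counts, not the real dataset.
example = largest_remainder({"repo_a": 47, "repo_b": 33, "repo_c": 20}, 10)
```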

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(eval): don't bail when union-merge also conflicts

Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.

Doesn't change any of the current 40 eval results — union's merge=union
attribute is tolerant enough that every task in the dataset produces
some tree (potentially broken code with stitched-together lines); the
broken-tree-tests-fail path already triggered the solo fallback. This
just closes the defensive gap for future pathological cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval(team): identical / naive / lead-when-naive-conflicts policy

Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:

  1. identical patches → skip-merge short-circuit
  2. naive 3-way merge clean → merged-tree tests are authoritative
                               (no further fallback)
  3. naive merge conflicts → test the lead's patch.txt alone against
                             both feature suites
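
The chain above can be sketched as a small decision function.  This is an illustration under assumed names (``pick_eval_strategy``, the ``try_naive_merge`` callback and the returned labels other than "identical" are hypothetical), not the code in ``test_merged``:

```python
from typing import Callable


def pick_eval_strategy(
    lead_patch: bytes,
    member_patch: bytes,
    try_naive_merge: Callable[[], bool],
) -> str:
    """Return which tree the eval should test, following the three-step policy."""
    # 1. Byte-identical patches: skip the merge entirely.
    if lead_patch == member_patch:
        return "identical"
    # 2. A clean naive merge is authoritative, with no further fallback.
    if try_naive_merge():
        return "naive"
    # 3. On conflict, test the lead's patch alone against both feature suites.
    return "lead-alone"
```

Passing the merge attempt as a callback keeps the identical-patch case from ever invoking it, which mirrors the skip-merge behaviour of step 1.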

Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).

Effect on the core-subset horizontal comparison:
  msa  6 → 6  (unchanged)
  oh   5 → 4  (loses pallets_jinja/1621 — was passing via union, which
              concealed that oh's lead doesn't integrate)
  cc   5 → 5  (unchanged)
  cx   5 → 5  (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(modal): codex stdin hang; eval guardrail for openhands_sdk

codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
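
A small sketch of the stdin fix, assuming a POSIX shell is available.  ``with_closed_stdin`` is a hypothetical helper, not the real ``_build_codex_command``, and the child process is a stand-in that reads stdin until EOF the way the commit describes ``codex exec`` doing:

```python
import shlex
import subprocess
import sys


def with_closed_stdin(argv: list[str]) -> str:
    """Render argv as a shell command whose stdin is redirected from /dev/null."""
    return shlex.join(argv) + " </dev/null"


def demo() -> str:
    # Stand-in child: blocks until stdin reaches EOF, then reports bytes read.
    child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    result = subprocess.run(
        with_closed_stdin(child),
        shell=True,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip()
```

With the redirect in place the child sees EOF immediately and exits, regardless of whether the surrounding sandbox keeps its own stdin open.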

openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(swe_agent): add --backend docker support

swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
   so swerex's `docker run ... image sh -c "<startup>"` becomes
   `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
   the feature-patch path. Pass docker_args=["--entrypoint", ""]
   to clear the entrypoint (mirrors the existing Modal monkey-patch
   that does .entrypoint([]) on the image).

2. swerex's startup falls back to `pipx run swe-rex ...` when the
   swerex-remote binary isn't pre-installed, but pipx looks for an
   executable literally named "swe-rex" — which doesn't exist in
   the published `swe-rex` package (it provides "swerex-remote").
   Monkey-patch DockerDeployment._get_swerex_start_cmd to use
   `pipx run --spec swe-rex swerex-remote ...` instead.

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team_harness: extract team mode as standalone harness + ablation flags

Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.

Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL,
  ordered agent list, scratchpad volume name) with the feature config
  and exposes adapter-facing factories that each return None / [] / {}
  when their feature is disabled, so adapter code paths collapse to one
  branch:

    coop_env.update(session.env_for(agent_id))
    extra_run_args.extend(session.scratchpad_mount_args())
    mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
  (task_list, scratchpad, mcp, auto_refresh, protocol).  The lead/member
  role split is the always-on baseline -- without it team is just coop.
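
A rough sketch of how such a config and session could look.  Only the five field names, the three factory names, the ``cb-team-<run_id>`` volume and the ``/workspace/shared`` mount point come from the commit messages; the class names, the environment variable names, which feature gates ``env_for``, and the MCP config shape are assumptions for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class HarnessConfigSketch:
    """Five per-feature switches, all on by default."""

    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class SessionSketch:
    run_id: str
    redis_url: str
    agents: list[str]
    config: HarnessConfigSketch = field(default_factory=HarnessConfigSketch)

    def env_for(self, agent_id: str) -> dict[str, str]:
        # Assumption: env propagation is gated on the task-list feature.
        if not self.config.task_list:
            return {}
        # Placeholder names; only the CB_TEAM_ prefix is from the source.
        return {
            "CB_TEAM_EXAMPLE_RUN_ID": self.run_id,
            "CB_TEAM_EXAMPLE_AGENT_ID": agent_id,
            "CB_TEAM_EXAMPLE_REDIS_URL": self.redis_url,
        }

    def scratchpad_mount_args(self) -> list[str]:
        if not self.config.scratchpad:
            return []
        # Named Docker volume mounted at the shared scratchpad path.
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[dict]:
        if not self.config.mcp:
            return None
        # Hypothetical config shape, for illustration only.
        return {"command": "python", "args": [container_script_path]}
```

Because a disabled factory returns an empty dict or list, calls such as ``coop_env.update(...)`` and ``extra_run_args.extend(...)`` become no-ops and the adapter needs no separate branch for the ablated case.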

Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter.  result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers.  Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.

Tests: tests/agents/_team -> tests/team_harness (rename); new
test_session.py (29 cases) covers the facade; four new ablation tests
in tests/runner/test_team.py verify the runner-side gating.  Full suite:
363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
  volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
  tasks files, empty metrics dict, no volume.  team_features in
  result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil added a commit that referenced this pull request May 21, 2026
* agents/codex: add Codex adapter; lift shared coop bits into _coop

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter.  Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:

  - Invokes ``codex exec --json --sandbox danger-full-access
    --skip-git-repo-check --model <id>``.
  - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
    inside the container so the CLI authenticates without prompts.
  - Parses Codex's JSONL event stream for status / token totals /
    messages.  Cost is reported as 0.0 because Codex does not emit a
    cost field; tokens are summed across ``turn.completed`` events.
  - Model fallback: if Codex rejects ``--model gpt-5.5`` with a
    "model not found" shaped error, the adapter retries once without
    ``--model`` and lets Codex pick its default.
  - Preflight credential check: if OPENAI_API_KEY is unset the adapter
    returns Error immediately instead of spinning up a container that
    can only fail.
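
The model-fallback behaviour could be sketched roughly as follows.  The CLI flags are the ones listed above; the helper names, the injected ``run`` callback and the exact error-matching rule are assumptions, since the commit only says the error is "model not found" shaped:

```python
from typing import Callable, Optional


def build_codex_argv(model: Optional[str]) -> list[str]:
    argv = [
        "codex", "exec", "--json",
        "--sandbox", "danger-full-access",
        "--skip-git-repo-check",
    ]
    if model:
        argv += ["--model", model]
    return argv


def run_with_model_fallback(
    run: Callable[[list[str]], tuple[int, str]],
    model: str,
) -> tuple[int, str, bool]:
    """Run once with --model; retry once without it if the model is rejected.

    Returns (exit_code, output, used_fallback).
    """
    code, output = run(build_codex_argv(model))
    # Assumed detection rule: non-zero exit plus a "model not found" message.
    if code != 0 and "model not found" in output.lower():
        code, output = run(build_codex_argv(None))
        return code, output, True
    return code, output, False
```

Injecting ``run`` keeps the retry logic testable without a container; an adapter would pass a function that executes the argv inside the task container.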

Shared ``_coop`` module:

  - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
    installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
    ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
  - ``install_snippet.sh`` — pip-installs redis and drops the shell
    wrappers; each adapter's setup.sh sources it.
  - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
    agnostic.
  - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
    ``write_file_in_container`` / ``read_file_from_container``,
    ``rewrite_comm_url_for_container``, ``build_git_setup_command``,
    ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires.  Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace).  This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module.  Full suite: 228
passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:

  - codex solo f1:           Submitted, 1 turn, 365k input tokens,
                             184-line patch (with the trailing-newline
                             fix it applies cleanly)
  - codex coop+git f1,f2:    both Submitted, both patches applied but
                             0/2 tests pass — coordination failure
                             (agent1 fetched ``team`` but never merged,
                             so the stacked patches produce a Python
                             SyntaxError at line 144 of the modified
                             file).  Claude on the same task scored
                             2/2; Codex used the tools less aggressively
                             on this run.

The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug.  Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* runner: add team mode (lead + members + shared task list + scratchpad)

Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product.  Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:

  1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
     backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
     atomic claim semantics (HSETNX-style — exactly one caller wins on a
     race) and an audit log of every mutation.  Exposed in the container
     as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
     / ``coop-task-list`` shell wrappers.

  2. A **lead / member role split**.  The first agent is designated
     ``team-lead`` and gets a system-prompt block instructing them to
     break the spec into tasks, assign them via ``coop-task-create
     --assign``, watch progress, and integrate.  Other agents are
     ``member`` and look for open tasks to claim.

  3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
     mounted at ``/workspace/shared`` in every container.  Free
     coordination artifact for design notes, partial diffs, interface
     sketches.
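
To make the atomic-claim semantics of item 1 concrete, here is a minimal sketch.  It assumes a client exposing redis-py's ``hsetnx(name, key, value)`` call; the key suffix ``task:<id>`` and the ``owner`` field are hypothetical, and the in-memory store is only a stand-in so the example runs without a Redis server:

```python
class InMemoryHashStore:
    """Stand-in exposing the one hash command the sketch needs."""

    def __init__(self) -> None:
        self._hashes: dict[str, dict[str, str]] = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        # Same contract as Redis HSETNX: 1 if the field was set, 0 if it existed.
        fields = self._hashes.setdefault(name, {})
        if key in fields:
            return 0
        fields[key] = value
        return 1


def claim_task(store, run_id: str, task_id: str, agent_id: str) -> bool:
    """Try to become the owner of a task; exactly one caller can succeed."""
    key = f"cb:{run_id}:task:{task_id}"  # assumed key layout under cb:<run_id>:
    return bool(store.hsetnx(key, "owner", agent_id))


store = InMemoryHashStore()
first = claim_task(store, "run1", "t1", "agent1")
second = claim_task(store, "run1", "t1", "agent2")
```

Against a real Redis server the set-if-absent check and the write happen as one command, which is what makes the race safe; the stub above is not itself thread-safe.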

Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``.  Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
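
As an illustration of how such metrics can fall out of an audit log, the sketch below computes three of them.  The event schema (``ts``, ``op``, ``agent``) is an assumption, the function is not the bench's metrics code, and ``tasks_done`` / ``unowned_at_end`` are omitted because they need final task state rather than the log alone:

```python
from collections import Counter


def coordination_metrics(events: list[dict], run_start: float) -> dict:
    """Derive claim/update counts and time to first claim from audit events."""
    claims: Counter = Counter()
    updates: Counter = Counter()
    first_claim = None
    for event in sorted(events, key=lambda e: e["ts"]):
        if event["op"] == "claim":
            claims[event["agent"]] += 1
            if first_claim is None:
                first_claim = event["ts"] - run_start
        elif event["op"] == "update":
            updates[event["agent"]] += 1
    return {
        "time_to_first_claim_seconds": first_claim,  # None if nobody claimed
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
    }


# Made-up log for illustration.
log = [
    {"ts": 140.0, "op": "update", "agent": "agent1"},
    {"ts": 134.0, "op": "claim", "agent": "agent1"},
    {"ts": 150.0, "op": "claim", "agent": "agent2"},
]
metrics = coordination_metrics(log, run_start=100.0)
```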

Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs.  The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``.  The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
  - 18 task_list (CRUD, atomic claim, owner-only update, audit log,
    run isolation)
  - 12 prompt (lead vs member branches, solo fallback, git interaction)
  -  3 runtime (env assembly, scratchpad mount args)
  -  4 metrics (happy path, unowned-at-end, empty log, multiple claims)
  -  5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
    metrics in result, three-agent team)
  -  4 misc

Full suite: 274 passed, 63 skipped.  Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:

  - 5 tasks created (2 by bench-runner, 3 by the lead splitting its
    work), all reached ``done``
  - time_to_first_claim_seconds=34.2
  - claims_per_agent={agent1: 2, agent2: 1}
  - updates_per_agent={agent1: 4, agent2: 3}
  - scratchpad volume actively used (agent2 wrote its diff to
    /workspace/shared/agent2.patch + a summary.md)
  - **0/1 pass rate** — both ``patch.txt`` files were empty: the
    members wrote diffs to the scratchpad instead of also writing
    ``/workspace/repo/patch.txt``, and the lead never ran the final
    integration step.  This is real coordination signal (the prompt
    told them to write both places but they followed the scratchpad
    half only) — a follow-up will tighten the prompt to make patch.txt
    submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a
reviewable size):

  - In-loop auto-refresh for the Python-loop adapters
  - MCP long-poll tool to give CLI adapters push-ish inbox semantics
  - Typed ``coop-request`` / ``coop-respond`` protocol on top of
    messaging (CC's plan_approval_request shape)
  - Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53)

Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
   Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
   so agents can ``ls`` and ``cat`` tasks with their existing tools
   rather than going through the ``coop-task-list`` CLI.  Layout
   mirrors Claude Code's team primitive: one ``<id>.json`` per task,
   plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
   trail).  Triggered on every ``coop-task-list`` invocation and from
   the host runner at startup.  Files written via tempfile+replace so
   readers never observe a partial state.

2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
   Layered on plain Redis messaging, mirroring CC's
   ``plan_approval_request`` / ``plan_approval_response`` shape.
   ``coop-request <peer> <kind> <body>`` returns a request_id (and
   optionally blocks via ``--wait N`` for a response).
   ``coop-respond <request_id> <body>`` writes back; the sender's
   ``await_response`` uses BLPOP so it actually sleeps instead of
   busy-polling.  Both events flow into the shared task-log so
   coordination metrics include protocol events.

3. **MCP long-poll server** (``_team/mcp_server.py``).  Stdio
   JSON-RPC server that exposes a single ``wait_for_message`` tool
   backed by BLPOP on the agent's inbox.  Registered automatically:
   Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
   the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
   The point is to make "watch the inbox" a natural idle behavior for
   the CLI adapters instead of a busy-loop on ``coop-recv`` returning
   empty — the closest we can get to push-style delivery for opaque
   CLI agent loops.

4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
   ``TeamPoller`` is a per-agent host-side helper that
   ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
   queries — same hook as the existing inbox poll.  The LLM sees a
   compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
   prepended to every turn so it doesn't need to remember to call
   ``coop-task-list``.  Plumbed via ``agent.team_poller`` so the
   ``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
   The same module also exports ``poll_team_state()`` for in-container
   use (env-driven variant).

5. **Prompt fix**: the previous team-mode end-to-end had members
   writing diffs to ``/workspace/shared/<id>.patch`` only and never to
   ``/workspace/repo/patch.txt``, scoring 0/2 despite great
   coordination.  Both lead and member prompts now have an explicit
   ``### Final submission — REQUIRED`` section that calls out
   ``patch.txt`` as the only file the bench evaluates and provides
   the exact ``git diff > patch.txt`` command.
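
The tempfile+replace write mentioned for the filesystem mirror (item 1)
can be sketched as follows.  This is a generic illustration of the
pattern; the helper name ``atomic_write_json`` is hypothetical and the
real ``_team/fs_mirror.py`` may differ in detail.

```python
import json
import os
import tempfile


def atomic_write_json(path, payload):
    """Write JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on
    # one filesystem, where the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Never leave a half-written temp file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

An agent running ``cat`` on a task file mid-refresh therefore reads a
complete snapshot, never a truncated one.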

Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.
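
A minimal sketch of the kind of fallback this implies, assuming a
standalone helper (``patch_line_count`` is a hypothetical name, not the
function in ``runner/core``):

```python
def patch_line_count(result):
    """Line count for the run table, for team-mode and other result dicts."""
    # Team-mode per-agent dicts carry a precomputed integer.
    if "patch_lines" in result:
        return int(result["patch_lines"])
    # Other settings store the patch text itself.
    return len(result.get("patch", "").splitlines())
```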

Tests: 37 new unit tests
  -  8 fs_mirror     (atomic writes, stale cleanup, empty index)
  -  9 protocol      (request roundtrip, await, timeout, audit log)
  -  9 mcp_server    (initialize, tools/list, tools/call,
                      timeout, blocking, unknown-tool error,
                      env factory)
  -  8 loop_refresh  (summary formatting, TeamPoller, env variant)
  -  3 prompt        (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).  All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``.  Previous run scored 0/2 on the
same task — the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team mode: wire team prompt + env into the three Python-loop adapters

Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode.  Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.

New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds.  Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.

Per-adapter wiring:

  - ``mini_swe_agent_v2``: appends team_task_section to task;
    propagates CB_TEAM_* through env_kwargs["env"]; adds
    ``--add-host=host.docker.internal:host-gateway`` + scratchpad
    volume to docker run args; installs the team CLI scripts + pip
    redis in the container after env spin-up.  The existing
    ``TeamPoller`` host-side hook (already in step()) still fires.

  - ``openhands_sdk``: appends team_task_section to task; folds a new
    ``team_env`` dict into ``coop_info`` so
    ``_build_credentials_dict`` propagates CB_TEAM_* into the
    sandbox.  Coop-task-* binary install in the OpenHands agent-server
    image is a follow-up — OpenHands manages its own image build and
    doesn't expose a clean post-start exec hook.

  - ``swe_agent``: appends team_task_section to task.  The SWE-agent
    framework's sandbox and agent loop are third-party and harder to
    instrument; everything beyond the prompt is a follow-up.
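
For the adapters that control their own container, the extra ``docker
run`` arguments can be sketched as below.  The ``--add-host`` flag, the
volume name, and the mount point come from this commit; the helper name
``team_docker_args`` and the concrete variable ``CB_TEAM_ROLE`` in the
usage note are illustrative assumptions (only the ``CB_TEAM_*`` prefix is
given here).

```python
def team_docker_args(run_id, env):
    """Extra `docker run` args for team mode (illustrative sketch)."""
    args = [
        # Lets the container reach services on the host by a fixed name.
        "--add-host=host.docker.internal:host-gateway",
        # Shared scratchpad volume, mounted in every team container.
        "-v", f"cb-team-{run_id}:/workspace/shared",
    ]
    # Forward only the team-mode variables, in a stable order.
    for key in sorted(env):
        if key.startswith("CB_TEAM_"):
            args += ["-e", f"{key}={env[key]}"]
    return args
```

For example, ``team_docker_args("abc", {"CB_TEAM_ROLE": "member"})``
yields the host alias, the ``cb-team-abc`` mount, and one ``-e`` pair.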

Tests: 13 new
  - 3 prompt unit tests for team_task_section (lead, member, empty)
  - 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
    consistency between team_task_section and build_team_instruction,
    every registered runner accepts the team kwargs, openhands env
    keys, swe_agent signature

Full suite: 324 passed, 63 skipped.  Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment.  Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.

Compatibility matrix is now:

  | Adapter             | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
  |---------------------|---------|-------------|--------------|------------------|----------|
  | claude_code         | yes     | yes (full)  | n/a          | yes              | yes      |
  | codex               | yes     | yes (full)  | n/a          | yes              | yes      |
  | mini_swe_agent_v2   | yes     | yes (sec.)  | yes          | yes              | yes      |
  | openhands_sdk       | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
  | swe_agent           | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: layer coop-task-* install onto Modal image for team mode

Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required).  Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost.  Modal caches the layered image; subsequent
team runs are instant.

Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5.  The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked.  Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex.  Both patches applied; f1 14/14,
f2 19/20.

Tests: 2 new (full suite: 326 passed)
  - test_team_env_triggers_image_layering  — verifies add_local_file
    + pip_install + run_commands fire with the right args when team
    mode is active
  - test_no_layering_when_team_inactive    — verifies solo / coop
    runs skip the image-build cost

Matrix update — openhands_sdk now reads:
  Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
  CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team prompt: make the merge-before-submit step REQUIRED

The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it.  Both fetched their peer's branch (2 each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.

The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it.  This rewrite:

  - Renames the section ``## Git collaboration — MERGE IS REQUIRED
    BEFORE SUBMITTING`` so the imperative is in the heading itself.
  - Adds an explicit "Required final sequence — run this verbatim
    before exiting" block with the full fetch+merge+diff sequence,
    parameterized over every partner branch.
  - Explains *why* (each agent's patch.txt is evaluated against every
    feature's tests; without the merge, the peer feature's symbols
    are missing → ImportError).
  - Frames it the same way the patch.txt step is framed (REQUIRED,
    skip-at-your-loss), which the original prompt fix proved
    codex responds to.

Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``.  Both
patches now contain both features' symbols.  Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).

A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree.  This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch.  That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked.  Fix is a separate concern from team-mode wiring.

Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval: short-circuit when both agents submit identical merged patches

In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.

Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches.  If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance.  Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
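
The comparison step can be sketched as below.  This is a simplified
illustration: the function name ``shortcircuit_identical`` is
hypothetical, and the ``git apply --recount`` normalization described
above is deliberately left out.

```python
import filecmp
import shutil


def shortcircuit_identical(patch1, patch2, merged):
    """Copy patch1 to merged when both patches are byte-identical."""
    # shallow=False forces a content comparison instead of trusting
    # os.stat() signatures.
    if not filecmp.cmp(patch1, patch2, shallow=False):
        return None  # caller proceeds with the normal merge path
    shutil.copyfile(patch1, merged)
    return {"merge": {"status": "identical"}}
```

Returning ``None`` on a mismatch leaves the existing merge path
untouched, which is why the short-circuit cannot regress diverging runs.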

Verified on the codex-team e2e:

  - cx_team_v5 (codex agents perfectly merged to identical 243-line
    patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
  - cx_team_v4 (codex agents diverged on the merge): unchanged at
    f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
    agent2-alone via apply_status: {'agent1': 'failed', ...}

I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.

Also lands previously-uncommitted housekeeping:
  - prompt.py: ruff-format-only diff on the merge-required block
    from the prior commit
  - test_team_wiring.py: ruff --fix removed unused MagicMock
    imports
  - test_gcp_backend.py / test_tasks.py: ruff --fix removed
    f-string-without-placeholder and unused-json import (both
    unrelated drift caught by the gate)

Tests: 1 new (full suite: 327 passed)
  - ``test_test_merged_shortcircuits_on_identical_patches`` — source
    inspection confirms the short-circuit branch + "identical"
    merge-status string exist in test_merged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: register Redis-backed CoopTaskTracker as a typed tool

The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands.  This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.

Files:

  * ``openhands/tools/task_tracker/coop_definition.py`` — new tool
    definition + executor.  Same ``TaskTrackerAction`` /
    ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
    trip through the shared ``cb:<run_id>:`` Redis namespace that
    ``TaskListClient`` (host side) writes to.  Tasks are auto-owned
    by the calling agent; ``view`` shows peer tasks prefixed with
    ``[<their_agent_id>]``.  Registered under both
    ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
    the module rebinds the latter to the Coop variant.

  * ``openhands/tools/preset/default.py`` — gains a ``team_mode``
    kwarg (kept for API stability + tests; the actual swap happens
    server-side via the .pth/__init__ side-effect import, not by
    changing the host-side tool list).  The pre-PR coop block is now
    split out into a more detailed team-mode prompt section that
    documents the TaskTracker → shared-list behavior.

  * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
    layers two more bits into the Modal image at build time:
      - ``add_local_file`` of ``coop_definition.py`` to
        ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
        install)
      - ``grep ... || echo`` appending
        ``from . import coop_definition`` to the package's
        ``__init__.py`` so the registration runs at import time.
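
The same-name override can be pictured with a toy registry.  This is
not OpenHands' registry code: ``register_tool``, ``resolve_tool``, and
both classes are hypothetical stand-ins that only show why a later
registration under an existing name wins.

```python
_REGISTRY = {}


def register_tool(name, factory):
    """Bind a tool name to a factory; a later call rebinds the name."""
    _REGISTRY[name] = factory


def resolve_tool(name):
    """Build the tool currently registered under `name`."""
    return _REGISTRY[name]()


class LocalTaskTracker:
    backend = "local"


class CoopTaskTracker:
    backend = "redis"


# Built-in registration happens first ...
register_tool("TaskTrackerTool", LocalTaskTracker)
# ... then the side-effect import registers the coop variant under both
# its own name and the built-in name.
register_tool("CoopTaskTrackerTool", CoopTaskTracker)
register_tool("TaskTrackerTool", CoopTaskTracker)
```

Because the agent only ever asks for ``TaskTrackerTool`` by name, it
never notices that the backing implementation changed.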

Tests: 1 new + updated image-layering assertions
  - ``test_importing_coop_definition_overrides_local_registration``:
    inspecting the registry's ``_MODULE_QUALNAMES`` confirms
    ``TaskTrackerTool.name`` resolves to ``coop_definition``'s
    registration after import.
  - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
    calls + 2 ``run_commands`` layers (tool-file install +
    ``coop-task-*`` wrappers) and that the
    ``from . import coop_definition`` line is in the install
    commands.

Full suite: 329 passed.  Ruff / format / mypy all green.

KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis.  The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``.  The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.

For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands team mode: end-to-end working with Modal-hosted Redis

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning.  Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
   ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
   running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
   accessed via ``unencrypted_host:unencrypted_port``).  Re-uses the
   existing ``connectors/redis_server.ModalRedisServer`` — it was
   already written, just unused.  Both the host TaskListClient and
   the agent sandboxes point at the same public TCP endpoint, so
   pre-seed and agent reads/writes share state.  Falls back to local
   Redis for the other adapters.

2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
   adapter now ``add_local_file``s three pieces into the OpenHands
   image at build time:
     - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
     - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
     - ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
       (replaces upstream; same exports + a side-effect import of
       coop_definition so the Redis-backed executor overrides the
       local TaskTracker registration at first import).
   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
   bytecode cache so the new __init__ actually re-runs.

3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
   connections after a few minutes, so the Redis client opened for the
   pre-seed at startup is closed before the 9-min agent run finishes.
   The runner now re-opens the client at harvest time using the same URL.

End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:

  - Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
  - Both agents Submitted, 9m total
  - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
  - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
    time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
    agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
  - Cost: $3.33

Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.

Full suite: 329 passed.  Ruff / format / mypy all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* deps: add fakeredis to dev extras

The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.

GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file.  Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* swe_agent: fix import error + add missing transitive deps

Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):

1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
   in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
   The old package was renamed in v0.0.13; both swe_agent files
   had stale imports that failed at module load (TypeError or
   ModuleNotFoundError depending on how the framework was invoked),
   so every swe_agent invocation returned Error before any LLM
   call.

2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
   in pyproject.toml.  swe_agent's vendored framework imports these
   at module-load time even when the docker/S3/model paths are
   dormant, so a clean ``pip install '.[swe-agent]'`` without these
   would still ImportError on first invocation.

3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply.  Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring.  swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.

Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body.  For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: team-mode bugs surfaced by 10-pair core run

Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass rates on the core
subset in team setting. All four compared frameworks (these three plus
`openhands_sdk`) now reach ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip()
  consumes " \n"), breaking last-hunk line counts so git apply
  rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
  "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
  raw .strip() on the patch.txt read, so every msa patch ended
  in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
  long-running command, so the image's default CMD exited and
  every exec hit "Sandbox not found". Pass "sleep", "infinity"
  as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
  because shared build_environment was hardcoded to DockerEnvironment.
  Added a backend kwarg and threaded config["backend"] through both
  adapters.
- Team lead prompt buried the integration step at the bottom of a
  long workflow list; Claude/Codex consistently exited after their
  own feature without reading /workspace/shared/<agent>.patch.
  Rewrote with a hard-rule opener and a 5-point pre-submission
  checklist. Member prompt now opens with "stay in your lane" per
  the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
  alone when the merged tree doesn't pass both features. Surfaced
  as merge.strategy="solo-agent1" / "solo-agent2". Credits the
  agent (typically the lead) who correctly integrated both
  features into one working patch but had it corrupted by
  union-merging with the other agent's partial implementation.
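
The whitespace pitfall in the first bullet is easy to reproduce.  The
sketch below assumes only what the bullet states (strip newlines, not
all whitespace); the body is an illustration rather than the actual
``normalize_patch`` implementation.

```python
def normalize_patch(text):
    """Trim surrounding newlines but keep blank context lines intact."""
    # A blank context line in a unified diff is a single space followed
    # by a newline.  str.strip() would eat that space; stripping only
    # "\n" preserves it.
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""
```

With a patch whose last hunk ends in a blank context line, plain
``strip()`` shortens the hunk by one line relative to its header, which
is the mismatch that made ``git apply`` reject the diff.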

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs+data: core subset and team-mode horizontal comparison

- dataset/subsets/core.json: 10-pair subset for quick agent
  comparisons. Stratified by repo (largest-remainder proportional
  allocation by full-dataset pair count) with a one-slot floor per
  primary language (Python / Go / Rust / TS). Reproducible via
  scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
  frameworks on the core subset in team setting. Includes per-task
  pass/fail matrix annotated with the merge strategy used, plus the
  chronological narrative of the dozen reruns that surfaced each of
  the bugs fixed in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(eval): don't bail when union-merge also conflicts

Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.

Doesn't change any of the current 40 eval results — union's merge=union
attribute is tolerant enough that every task in the dataset produces
some tree (potentially broken code with stitched-together lines); the
broken-tree-tests-fail path already triggered the solo fallback. This
just closes the defensive gap for future pathological cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval(team): identical / naive / lead-when-naive-conflicts policy

Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:

  1. identical patches → skip-merge short-circuit
  2. naive 3-way merge clean → merged-tree tests are authoritative
                               (no further fallback)
  3. naive merge conflicts → test the lead's patch.txt alone against
                             both feature suites
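
The three-step chain amounts to a small decision function.  This is a
sketch under assumptions: ``choose_eval_strategy`` and its string labels
are hypothetical, and the real ``test_merged`` also performs the merge
and runs the tests.

```python
def choose_eval_strategy(patch_a, patch_b, naive_merge_clean):
    """Pick which tree the feature tests should run against."""
    if patch_a == patch_b:
        return "identical"   # 1. skip the merge entirely
    if naive_merge_clean:
        return "naive"       # 2. merged tree is authoritative
    return "lead-alone"      # 3. conflicts: test only the lead's patch
```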

Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).

Effect on the core-subset horizontal comparison:
  msa  6 → 6  (unchanged)
  oh   5 → 4  (loses pallets_jinja/1621 — was passing via union, which
              concealed that oh's lead doesn't integrate)
  cc   5 → 5  (unchanged)
  cx   5 → 5  (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(modal): codex stdin hang; eval guardrail for openhands_sdk

codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
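
The effect of ``</dev/null`` can be demonstrated from Python.  The fix
in this commit is the shell redirect; the snippet below is only an
equivalent illustration using ``subprocess``, with a hypothetical helper
name ``run_with_closed_stdin``.

```python
import subprocess
import sys


def run_with_closed_stdin(argv, timeout=30):
    """Run a command whose stdin hits EOF immediately."""
    # stdin=DEVNULL is the subprocess counterpart of `</dev/null`: a
    # child that reads stdin to EOF returns at once instead of blocking.
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,
    )


if __name__ == "__main__":
    # A child that reads all of stdin finishes immediately.
    child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    print(run_with_closed_stdin(child).stdout.strip())
```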

openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(swe_agent): add --backend docker support

swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
   so swerex's `docker run ... image sh -c "<startup>"` becomes
   `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
   the feature-patch path. Pass docker_args=["--entrypoint", ""]
   to clear the entrypoint (mirrors the existing Modal monkey-patch
   that does .entrypoint([]) on the image).

2. swerex's startup falls back to `pipx run swe-rex ...` when the
   swerex-remote binary isn't pre-installed, but pipx looks for an
   executable literally named "swe-rex" — which doesn't exist in
   the published `swe-rex` package (it provides "swerex-remote").
   Monkey-patch DockerDeployment._get_swerex_start_cmd to use
   `pipx run --spec swe-rex swerex-remote ...` instead.

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team_harness: extract team mode as standalone harness + ablation flags

Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.

Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL,
  ordered agent list, scratchpad volume name) with the feature config
  and exposes adapter-facing factories that each return None / [] / {}
  when their feature is disabled, so adapter code paths collapse to one
  branch:

    coop_env.update(session.env_for(agent_id))
    extra_run_args.extend(session.scratchpad_mount_args())
    mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
  (task_list, scratchpad, mcp, auto_refresh, protocol).  The lead/member
  role split is the always-on baseline -- without it team is just coop.
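
A minimal sketch of the shape described above, not the shipped classes. The field names come from the commit text; the return value of `mcp_config` and the exact mount arguments are assumptions chosen to show how a disabled feature collapses to `[]` or `None`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TeamHarnessConfig:
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class TeamSession:
    run_id: str
    agents: list[str]
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def scratchpad_mount_args(self) -> list[str]:
        # Disabled feature: an empty list, so callers can extend() blindly.
        if not self.config.scratchpad:
            return []
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[dict]:
        # Disabled feature: None, so callers skip writing any MCP entry.
        if not self.config.mcp:
            return None
        # Hypothetical entry shape, for illustration only.
        return {"command": "python3", "args": [container_script_path]}
```

Because every factory returns an empty value when its feature is off, an adapter can call all of them unconditionally and never branch on the config itself.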

Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter.  result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
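
One way the five flags could map onto the `team_features` record, sketched with argparse. The real CLI wiring is not shown in this commit, and the spellings `--team-no-auto-refresh` and `--team-no-protocol` are inferred from the `--team-no-*` pattern rather than quoted from it.

```python
import argparse

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cooperbench-sketch")
    for feature in FEATURES:
        flag = "--team-no-" + feature.replace("_", "-")
        parser.add_argument(flag, dest=f"no_{feature}", action="store_true")
    return parser


def team_features(argv: list[str]) -> dict[str, bool]:
    """Map parsed flags to a feature -> enabled dict (all on by default)."""
    args = build_parser().parse_args(argv)
    return {feature: not getattr(args, f"no_{feature}") for feature in FEATURES}
```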

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers.  Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.

Tests: tests/agents/_team -> tests/team_harness (rename); new
test_session.py (29 cases) covers the facade, and four new ablation
tests in tests/runner/test_team.py verify the runner-side gating.  Full
suite: 363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
  volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
  tasks files, empty metrics dict, no volume.  team_features in
  result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: team-harness ablation report (flash, codex/gpt-5.5)

Self-contained HTML report of the team-harness ablation + multi-agent
comparison run on the flash subset (50 task pairs), codex/gpt-5.5,
docker, 1 seed.

Contents:
- docs/team_harness_ablation_report.html — setting comparison
  (solo/coop/coop+git/team), one-feature-off ablation matrix, timing,
  findings, methodology, caveats.  All numbers embedded inline.
- docs/team_harness_ablation_data/{core,flash}_ablation.csv — raw rows.
- scripts/run_team_ablation.py — sweep driver (config -> cooperbench run+eval).
- scripts/gen_ablation_report.py — regenerates the HTML from logs/.

Headline results (passed / 50, both-features-pass):
  coop msg-only 13 · team no-scratchpad 15 · team no-task_list 20 ·
  solo 24 · coop+git 28 · team no-mcp 30 · team no-auto_refresh 30 ·
  team baseline 31 · team no-protocol 35

Findings:
- scratchpad (-16) and task_list (-11) are load-bearing; removing
  either drops team below solo (two uncoordinated agents < one).
- mcp/auto_refresh/protocol show no positive effect for codex
  (auto_refresh is a no-op for CLI adapters by design; protocol-off
  even scored +4, i.e. mild overhead without payoff).
- Most multi-agent value is a shared code substrate, not orchestration:
  coop+git (56%) ~ team-scratchpad (62%) >> messaging-only coop (26%).
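
The deltas and percentages quoted in the findings follow directly from the headline counts; a short worked check using only the numbers listed above:

```python
TOTAL = 50
passed = {
    "coop msg-only": 13,
    "team no-scratchpad": 15,
    "team no-task_list": 20,
    "solo": 24,
    "coop+git": 28,
    "team no-mcp": 30,
    "team no-auto_refresh": 30,
    "team baseline": 31,
    "team no-protocol": 35,
}

baseline = passed["team baseline"]

# Change in pass count when one feature is switched off.
deltas = {
    name[len("team no-"):]: count - baseline
    for name, count in passed.items()
    if name.startswith("team no-")
}

# Pass rate per setting, as a whole-number percentage of 50 task pairs.
rates = {name: round(100 * count / TOTAL) for name, count in passed.items()}
```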

Caveat: team runs used the scratchpad for code-sharing, NOT --git, so
"team vs coop+git" compares two sharing substrates; the team --git cell
is untested (follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil added a commit that referenced this pull request May 21, 2026
* agents/codex: add Codex adapter; lift shared coop bits into _coop

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter.  Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:

  - Invokes ``codex exec --json --sandbox danger-full-access
    --skip-git-repo-check --model <id>``.
  - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
    inside the container so the CLI authenticates without prompts.
  - Parses Codex's JSONL event stream for status / token totals /
    messages.  Cost is reported as 0.0 because Codex does not emit a
    cost field; tokens are summed across ``turn.completed`` events.
  - Model fallback: if Codex rejects ``--model gpt-5.5`` with a
    "model not found" shaped error, the adapter retries once without
    ``--model`` and lets Codex pick its default.
  - Preflight credential check: if OPENAI_API_KEY is unset the adapter
    returns Error immediately instead of spinning up a container that
    can only fail.
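
A sketch of the token accounting described in the highlights. It assumes each `turn.completed` event carries a `usage` object with `input_tokens` and `output_tokens`; the function name is illustrative and the adapter's real parser may handle more event types.

```python
import json


def sum_codex_tokens(jsonl_text: str) -> dict[str, int]:
    """Sum token usage across turn.completed events in a JSONL stream."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise mixed into stdout
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0))
    return totals
```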

Shared ``_coop`` module:

  - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
    installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
    ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
  - ``install_snippet.sh`` — pip-installs redis and drops the shell
    wrappers; each adapter's setup.sh sources it.
  - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
    agnostic.
  - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
    ``write_file_in_container`` / ``read_file_from_container``,
    ``rewrite_comm_url_for_container``, ``build_git_setup_command``,
    ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires.  Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace).  This broke codex's solo run with a
"corrupt patch at line N" error; Claude's runs happened not to hit it.

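The normalization rule stated above (one trailing newline, no leading whitespace) can be sketched as follows. This is an illustration of the rule, not the body of the shipped ``normalize_patch()``; returning an empty string for an empty patch is an assumption.

```python
def normalize_patch(patch: str) -> str:
    """Drop leading whitespace and end the patch with exactly one newline."""
    # Only trailing newlines are removed, so a final context line that
    # consists of a single space survives intact.
    body = patch.lstrip().rstrip("\n")
    return body + "\n" if body else ""
```
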
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module.  Full suite: 228
passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:

  - codex solo f1:           Submitted, 1 turn, 365k input tokens,
                             184-line patch (with the trailing-newline
                             fix it applies cleanly)
  - codex coop+git f1,f2:    both Submitted, both patches applied but
                             0/2 tests pass — coordination failure
                             (agent1 fetched ``team`` but never merged,
                             so the stacked patches produce a Python
                             SyntaxError at line 144 of the modified
                             file).  Claude on the same task scored
                             2/2; Codex used the tools less aggressively
                             on this run.

The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug.  Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* runner: add team mode (lead + members + shared task list + scratchpad)

Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product.  Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:

  1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
     backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
     atomic claim semantics (HSETNX-style — exactly one caller wins on a
     race) and an audit log of every mutation.  Exposed in the container
     as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
     / ``coop-task-list`` shell wrappers.

  2. A **lead / member role split**.  The first agent is designated
     ``team-lead`` and gets a system-prompt block instructing them to
     break the spec into tasks, assign them via ``coop-task-create
     --assign``, watch progress, and integrate.  Other agents are
     ``member`` and look for open tasks to claim.

  3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
     mounted at ``/workspace/shared`` in every container.  Free
     coordination artifact for design notes, partial diffs, interface
     sketches.
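
The atomic-claim behavior in item 1 can be illustrated with a set-if-absent write. The key suffix `task:<id>`, the `owner` field, and the dict-backed stand-in are assumptions for illustration; only the `cb:<run_id>:` prefix and the "exactly one caller wins" semantics come from the description above. A real redis-py client exposes `hsetnx(name, key, value)` with the same 1/0 return.

```python
class FakeHashStore:
    """Dict-backed stand-in exposing the one Redis call the sketch needs."""

    def __init__(self) -> None:
        self._hashes: dict[str, dict[str, str]] = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        fields = self._hashes.setdefault(name, {})
        if key in fields:
            return 0  # field already set: this caller lost the race
        fields[key] = value
        return 1

    def hget(self, name: str, key: str):
        return self._hashes.get(name, {}).get(key)


def claim_task(client, run_id: str, task_id: str, agent_id: str) -> bool:
    """Return True only for the first agent that claims the task."""
    key = f"cb:{run_id}:task:{task_id}"  # hypothetical key layout
    return bool(client.hsetnx(key, "owner", agent_id))
```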

Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``.  Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
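
A sketch of how these metrics could be derived from an audit log. The entry shape (`ts`, `op`, `agent`, `task_id`, optional `status` / `owner`) is hypothetical; only the metric names are taken from the text above.

```python
from collections import Counter


def coordination_metrics(log: list[dict], run_start: float) -> dict:
    """Replay audit-log entries in time order and summarize coordination."""
    claims: Counter = Counter()
    updates: Counter = Counter()
    status: dict[str, str] = {}
    owner: dict[str, object] = {}
    first_claim = None
    for entry in sorted(log, key=lambda e: e["ts"]):
        task_id = entry["task_id"]
        if entry["op"] == "create":
            status[task_id] = "open"
            owner[task_id] = entry.get("owner")  # set when pre-assigned
        elif entry["op"] == "claim":
            claims[entry["agent"]] += 1
            owner[task_id] = entry["agent"]
            if first_claim is None:
                first_claim = entry["ts"]
        elif entry["op"] == "update":
            updates[entry["agent"]] += 1
            if "status" in entry:
                status[task_id] = entry["status"]
    return {
        "time_to_first_claim_seconds": (
            None if first_claim is None else round(first_claim - run_start, 1)
        ),
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for t in status if owner.get(t) is None),
    }
```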

Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs.  The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``.  The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
  - 18 task_list (CRUD, atomic claim, owner-only update, audit log,
    run isolation)
  - 12 prompt (lead vs member branches, solo fallback, git interaction)
  -  3 runtime (env assembly, scratchpad mount args)
  -  4 metrics (happy path, unowned-at-end, empty log, multiple claims)
  -  5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
    metrics in result, three-agent team)
  -  4 misc

Full suite: 274 passed, 63 skipped.  Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:

  - 5 tasks created (2 by bench-runner, 3 by the lead splitting its
    work), all reached ``done``
  - time_to_first_claim_seconds=34.2
  - claims_per_agent={agent1: 2, agent2: 1}
  - updates_per_agent={agent1: 4, agent2: 3}
  - scratchpad volume actively used (agent2 wrote its diff to
    /workspace/shared/agent2.patch + a summary.md)
  - **0/1 pass rate** — both ``patch.txt`` files were empty: the
    members wrote diffs to the scratchpad instead of also writing
    ``/workspace/repo/patch.txt``, and the lead never ran the final
    integration step.  This is real coordination signal (the prompt
    told them to write both places but they followed the scratchpad
    half only) — a follow-up will tighten the prompt to make patch.txt
    submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a
reviewable size):

  - In-loop auto-refresh for the Python-loop adapters
  - MCP long-poll tool to give CLI adapters push-ish inbox semantics
  - Typed ``coop-request`` / ``coop-respond`` protocol on top of
    messaging (CC's plan_approval_request shape)
  - Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53)

Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
   Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
   so agents can ``ls`` and ``cat`` tasks with their existing tools
   rather than going through the ``coop-task-list`` CLI.  Layout
   mirrors Claude Code's team primitive: one ``<id>.json`` per task,
   plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
   trail).  Triggered on every ``coop-task-list`` invocation and from
   the host runner at startup.  Files written via tempfile+replace so
   readers never observe a partial state.

2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
   Layered on plain Redis messaging, mirroring CC's
   ``plan_approval_request`` / ``plan_approval_response`` shape.
   ``coop-request <peer> <kind> <body>`` returns a request_id (and
   optionally blocks via ``--wait N`` for a response).
   ``coop-respond <request_id> <body>`` writes back; the sender's
   ``await_response`` uses BLPOP so it actually sleeps instead of
   busy-polling.  Both events flow into the shared task-log so
   coordination metrics include protocol events.

3. **MCP long-poll server** (``_team/mcp_server.py``).  Stdio
   JSON-RPC server that exposes a single ``wait_for_message`` tool
   backed by BLPOP on the agent's inbox.  Registered automatically:
   Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
   the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
   The point is to make "watch the inbox" a natural idle behavior for
   the CLI adapters instead of a busy-loop on ``coop-recv`` returning
   empty — the closest we can get to push-style delivery for opaque
   CLI agent loops.

4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
   ``TeamPoller`` is a per-agent host-side helper that
   ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
   queries — same hook as the existing inbox poll.  The LLM sees a
   compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
   prepended to every turn so it doesn't need to remember to call
   ``coop-task-list``.  Plumbed via ``agent.team_poller`` so the
   ``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
   The same module also exports ``poll_team_state()`` for in-container
   use (env-driven variant).

5. **Prompt fix**: the previous team-mode end-to-end had members
   writing diffs to ``/workspace/shared/<id>.patch`` only and never to
   ``/workspace/repo/patch.txt``, scoring 0/2 despite great
   coordination.  Both lead and member prompts now have an explicit
   ``### Final submission — REQUIRED`` section that calls out
   ``patch.txt`` as the only file the bench evaluates and provides
   the exact ``git diff > patch.txt`` command.
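
For the filesystem mirror in item 1, the tempfile+replace pattern can be sketched as below. Helper names are illustrative, and the `_index.json` content (a list of task ids) is an assumption; the point is that `os.replace` swaps the file in one step, so a reader sees either the old snapshot or the new one.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload) -> None:
    """Write JSON to a temp file in the target directory, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)  # same directory, so a single rename
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def mirror_tasks(tasks: list[dict], out_dir: str) -> None:
    """Snapshot each task to <id>.json plus a small _index.json."""
    for task in tasks:
        write_json_atomic(os.path.join(out_dir, f"{task['id']}.json"), task)
    write_json_atomic(os.path.join(out_dir, "_index.json"), [t["id"] for t in tasks])
```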

Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.

Tests: 37 new unit tests
  -  8 fs_mirror     (atomic writes, stale cleanup, empty index)
  -  9 protocol      (request roundtrip, await, timeout, audit log)
  -  9 mcp_server    (initialize, tools/list, tools/call,
                      timeout, blocking, unknown-tool error,
                      env factory)
  -  8 loop_refresh  (summary formatting, TeamPoller, env variant)
  -  3 prompt        (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).  All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``.  Previous run scored 0/2 on the
same task — the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team mode: wire team prompt + env into the three Python-loop adapters

Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode.  Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.

New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds.  Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.

Per-adapter wiring:

  - ``mini_swe_agent_v2``: appends team_task_section to task;
    propagates CB_TEAM_* through env_kwargs["env"]; adds
    ``--add-host=host.docker.internal:host-gateway`` + scratchpad
    volume to docker run args; installs the team CLI scripts + pip
    redis in the container after env spin-up.  The existing
    ``TeamPoller`` host-side hook (already in step()) still fires.

  - ``openhands_sdk``: appends team_task_section to task; folds a new
    ``team_env`` dict into ``coop_info`` so
    ``_build_credentials_dict`` propagates CB_TEAM_* into the
    sandbox.  Coop-task-* binary install in the OpenHands agent-server
    image is a follow-up — OpenHands manages its own image build and
    doesn't expose a clean post-start exec hook.

  - ``swe_agent``: appends team_task_section to task.  The SWE-agent
    framework's sandbox + agent loop is third-party and harder to
    instrument; everything beyond the prompt is a follow-up.

Tests: 13 new
  - 3 prompt unit tests for team_task_section (lead, member, empty)
  - 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
    consistency between team_task_section and build_team_instruction,
    every registered runner accepts the team kwargs, openhands env
    keys, swe_agent signature

Full suite: 324 passed, 63 skipped.  Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment.  Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.

Compatibility matrix is now:

  | Adapter             | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
  |---------------------|---------|-------------|--------------|------------------|----------|
  | claude_code         | yes     | yes (full)  | n/a          | yes              | yes      |
  | codex               | yes     | yes (full)  | n/a          | yes              | yes      |
  | mini_swe_agent_v2   | yes     | yes (sec.)  | yes          | yes              | yes      |
  | openhands_sdk       | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
  | swe_agent           | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: layer coop-task-* install onto Modal image for team mode

Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required).  Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost.  Modal caches the layered image; subsequent
team runs are instant.

Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5.  The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked.  Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex.  Both patches applied; f1 14/14,
f2 19/20.

Tests: 2 new (full suite: 326 passed)
  - test_team_env_triggers_image_layering  — verifies add_local_file
    + pip_install + run_commands fire with the right args when team
    mode is active
  - test_no_layering_when_team_inactive    — verifies solo / coop
    runs skip the image-build cost

Matrix update — openhands_sdk now reads:
  Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
  CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team prompt: make the merge-before-submit step REQUIRED

The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it.  Both fetched their peer's branch (2 fetches
each) and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.

The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it.  This rewrite:

  - Renames the section ``## Git collaboration — MERGE IS REQUIRED
    BEFORE SUBMITTING`` so the imperative is in the heading itself.
  - Adds an explicit "Required final sequence — run this verbatim
    before exiting" block with the full fetch+merge+diff sequence,
    parameterized over every partner branch.
  - Explains *why* (each agent's patch.txt is evaluated against every
    feature's tests; without the merge, the peer feature's symbols
    are missing → ImportError).
  - Frames it the same way the patch.txt step is framed (REQUIRED,
    skip-at-your-loss), which the original prompt fix proved
    codex responds to.

Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``.  Both
patches now contain both features' symbols.  Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).

A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree.  This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch.  That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked.  Fix is a separate concern from team-mode wiring.

Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval: short-circuit when both agents submit identical merged patches

In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.

Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches.  If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance.  Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
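
A simplified sketch of the short-circuit. It compares the two patch files byte for byte and skips the merge when they match; the ``git apply --recount`` normalization mentioned above is deliberately left out, and `merge_fn` is a hypothetical stand-in for the existing merge path.

```python
import filecmp
import shutil


def merge_or_shortcircuit(patch1: str, patch2: str, merged: str, merge_fn) -> dict:
    """Copy patch1 to merged when both patches are identical, else merge."""
    # shallow=False forces a content comparison rather than a stat() check.
    if filecmp.cmp(patch1, patch2, shallow=False):
        shutil.copyfile(patch1, merged)
        return {"merge": {"status": "identical"}}
    return merge_fn(patch1, patch2, merged)
```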

Verified on the codex-team e2e:

  - cx_team_v5 (codex agents perfectly merged to identical 243-line
    patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
  - cx_team_v4 (codex agents diverged on the merge): unchanged at
    f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
    agent2-alone via apply_status: {'agent1': 'failed', ...}

I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.

Also lands previously-uncommitted housekeeping:
  - prompt.py: ruff-format-only diff on the merge-required block
    from the prior commit
  - test_team_wiring.py: ruff --fix removed unused MagicMock
    imports
  - test_gcp_backend.py / test_tasks.py: ruff --fix removed
    f-string-without-placeholder and unused-json import (both
    unrelated drift caught by the gate)

Tests: 1 new (full suite: 327 passed)
  - ``test_test_merged_shortcircuits_on_identical_patches`` — source
    inspection confirms the short-circuit branch + "identical"
    merge-status string exist in test_merged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: register Redis-backed CoopTaskTracker as a typed tool

The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands.  This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.

Files:

  * ``openhands/tools/task_tracker/coop_definition.py`` — new tool
    definition + executor.  Same ``TaskTrackerAction`` /
    ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
    trip through the shared ``cb:<run_id>:`` Redis namespace that
    ``TaskListClient`` (host side) writes to.  Tasks are auto-owned
    by the calling agent; ``view`` shows peer tasks prefixed with
    ``[<their_agent_id>]``.  Registered under both
    ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
    the module rebinds the latter to the Coop variant.

  * ``openhands/tools/preset/default.py`` — gains a ``team_mode``
    kwarg (kept for API stability + tests; the actual swap happens
    server-side via the .pth/__init__ side-effect import, not by
    changing the host-side tool list).  The pre-PR coop block is now
    split out into a team-mode prompt section that documents the
    TaskTracker → shared-list behavior.

  * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
    layers two more bits into the Modal image at build time:
      - ``add_local_file`` of ``coop_definition.py`` to
        ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
        install)
      - ``grep ... || echo`` appending
        ``from . import coop_definition`` to the package's
        ``__init__.py`` so the registration runs at import time.
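
The name-override mechanism can be illustrated with a toy registry in which the last registration under a name wins. This does not use the openhands registry API; the registry, helper functions, and placeholder classes are all hypothetical.

```python
_REGISTRY: dict[str, type] = {}


def register_tool(name: str, factory: type) -> None:
    _REGISTRY[name] = factory  # a later registration rebinds the name


def resolve_tool(name: str):
    return _REGISTRY[name]()


class LocalTaskTracker:
    backend = "local"


class CoopTaskTracker:
    backend = "redis"


# Built-in registration, as the upstream package would do at import.
register_tool("TaskTrackerTool", LocalTaskTracker)

# What a side-effect import of the coop module would then run.
register_tool("CoopTaskTrackerTool", CoopTaskTracker)
register_tool("TaskTrackerTool", CoopTaskTracker)
```

The agent keeps asking for the same tool name, so no prompt or tool-list change is needed for the swap to take effect.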

Tests: 1 new + updated image-layering assertions
  - ``test_importing_coop_definition_overrides_local_registration``:
    inspecting the registry's ``_MODULE_QUALNAMES`` confirms
    ``TaskTrackerTool.name`` resolves to ``coop_definition``'s
    registration after import.
  - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
    calls + 2 ``run_commands`` layers (tool-file install +
    ``coop-task-*`` wrappers) and that the
    ``from . import coop_definition`` line is in the install
    commands.

Full suite: 329 passed.  Ruff / format / mypy all green.

KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis.  The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``.  The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.

For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands team mode: end-to-end working with Modal-hosted Redis

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning.  Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
   ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
   running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
   accessed via ``unencrypted_host:unencrypted_port``).  Re-uses the
   existing ``connectors/redis_server.ModalRedisServer`` — it was
   already written, just unused.  Both the host TaskListClient and
   the agent sandboxes point at the same public TCP endpoint, so
   pre-seed and agent reads/writes share state.  Falls back to local
   Redis for the other adapters.

2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
   adapter now ``add_local_file``s three pieces into the OpenHands
   image at build time:
     - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
     - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
     - ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
       (replaces upstream; same exports + a side-effect import of
       coop_definition so the Redis-backed executor overrides the
       local TaskTracker registration at first import).
   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
   bytecode cache so the new __init__ actually re-runs.

3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
   connections after a few minutes, so the Redis client that the
   pre-seed opened at startup is closed before the 9-min agent run
   finishes.  The runner now re-opens the client at harvest time
   using the same URL.
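
The harvest-time reconnect in item 3 amounts to opening a new client from the stored URL instead of reusing the startup one. In this sketch `connect` is injected so the example stays self-contained (with redis-py it could be `redis.Redis.from_url`); storing the audit log as a Redis list read with `lrange` is an assumption.

```python
def harvest_audit_log(connect, url: str, key: str) -> list:
    """Open a fresh client for ``url``, read the whole list, then close it."""
    client = connect(url)  # never reuse the long-idle startup connection
    try:
        return client.lrange(key, 0, -1)
    finally:
        close = getattr(client, "close", None)
        if close is not None:
            close()
```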

End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:

  - Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
  - Both agents Submitted, 9m total
  - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
  - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
    time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
    agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
  - Cost: $3.33

Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.

Full suite: 329 passed.  Ruff / format / mypy all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* deps: add fakeredis to dev extras

The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.

GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file.  Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* swe_agent: fix import error + add missing transitive deps

Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):

1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
   in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
   The old package was renamed in v0.0.13; both swe_agent files
   still carried the stale import, which failed at module load
   (TypeError or ModuleNotFoundError depending on how the framework
   was invoked), so every swe_agent invocation returned Error before
   any LLM call.

2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
   in pyproject.toml.  swe_agent's vendored framework imports these
   at module-load time even when the docker/S3/model paths are
   dormant, so a clean ``pip install '.[swe-agent]'`` without these
   would still ImportError on first invocation.

3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply.  Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring.  swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.

Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body.  For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: team-mode bugs surfaced by 10-pair core run

Compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core
subset in team setting; the six changes below address them. All four
frameworks in the comparison (these three plus openhands_sdk) now
reach ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip()
  consumes " \n"), breaking last-hunk line counts so git apply
  rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
  "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
  raw .strip() on the patch.txt read, so every msa patch ended
  in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
  long-running command, so the image's default CMD exited and
  every exec hit "Sandbox not found". Pass "sleep", "infinity"
  as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
  because shared build_environment was hardcoded to DockerEnvironment.
  Added a backend kwarg and threaded config["backend"] through both
  adapters.
- Team lead prompt buried the integration step at the bottom of a
  long workflow list; Claude/Codex consistently exited after their
  own feature without reading /workspace/shared/<agent>.patch.
  Rewrote with a hard-rule opener and a 5-point pre-submission
  checklist. Member prompt now opens with "stay in your lane" per
  the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
  alone when the merged tree doesn't pass both features. Surfaced
  as merge.strategy="solo-agent1" / "solo-agent2". Credits the
  agent (typically the lead) who correctly integrated both
  features into one working patch but had it corrupted by
  union-merging with the other agent's partial implementation.
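
The newline-only normalization from the first two bullets can be
sketched as follows. This is a minimal reconstruction from the
description above (strip only "\n" at both ends, keep exactly one
trailing newline); the real ``normalize_patch`` may differ in detail.

```python
def normalize_patch(text: str) -> str:
    """Trim only newline characters so blank context lines (" ") survive.

    Assumption: an empty or newline-only input normalizes to "".
    """
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A diff whose last hunk ends in a blank context line (a single space).
# str.strip() would remove that line and break the hunk's line counts.
PATCH = "--- a/x.py\n+++ b/x.py\n@@ -1,2 +1,3 @@\n a\n+b\n \n"
```

Because " " is whitespace, ``PATCH.strip()`` ends at ``+b`` and the
hunk header no longer matches the body, which is the failure
``git apply`` reported.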

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs+data: core subset and team-mode horizontal comparison

- dataset/subsets/core.json: 10-pair subset for quick agent
  comparisons. Stratified by repo (largest-remainder proportional
  allocation by full-dataset pair count) with a one-slot floor per
  primary language (Python / Go / Rust / TS). Reproducible via
  scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
  frameworks on the core subset in team setting. Includes per-task
  pass/fail matrix annotated with the merge strategy used, plus the
  chronological narrative of the dozen reruns that surfaced each of
  the bugs fixed in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(eval): don't bail when union-merge also conflicts

Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.

Doesn't change any of the current 40 eval results — union's merge=union
attribute is tolerant enough that every task in the dataset produces
some tree (potentially broken code with stitched-together lines); the
broken-tree-tests-fail path already triggered the solo fallback. This
just closes the defensive gap for future pathological cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval(team): identical / naive / lead-when-naive-conflicts policy

Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:

  1. identical patches → skip-merge short-circuit
  2. naive 3-way merge clean → merged-tree tests are authoritative
                               (no further fallback)
  3. naive merge conflicts → test the lead's patch.txt alone against
                             both feature suites
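
A minimal sketch of the three-step chain above as a pure decision
function. The function name, its arguments, and the strategy labels
are hypothetical; the real ``test_merged`` also applies patches and
runs the test suites, which is omitted here.

```python
from typing import Literal

Strategy = Literal["identical", "naive", "lead-alone"]


def choose_strategy(
    lead_patch: str, member_patch: str, naive_merge_clean: bool
) -> Strategy:
    """Pick which tree the eval should test, in priority order."""
    # 1. Identical patches: nothing to merge, test one copy.
    if lead_patch == member_patch:
        return "identical"
    # 2. Clean naive 3-way merge: the merged tree is authoritative.
    if naive_merge_clean:
        return "naive"
    # 3. Conflict: fall back to the lead's patch alone.
    return "lead-alone"
```

Note that step 2 has no further fallback: a clean merge that fails
its tests is a failure, not a trigger for lead-alone.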

Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).

Effect on the core-subset horizontal comparison:
  msa  6 → 6  (unchanged)
  oh   5 → 4  (loses pallets_jinja/1621 — was passing via union, which
              concealed that oh's lead doesn't integrate)
  cc   5 → 5  (unchanged)
  cx   5 → 5  (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(modal): codex stdin hang; eval guardrail for openhands_sdk

codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
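
The mechanism can be reproduced without codex. In this sketch a
stand-in child process reads stdin until EOF; appending
``</dev/null`` makes the read return immediately. The helper names
are hypothetical and a POSIX ``sh`` is assumed.

```python
import shlex
import subprocess
import sys
from typing import List


def with_closed_stdin(argv: List[str]) -> str:
    """Render argv as a shell command whose stdin is /dev/null."""
    return shlex.join(argv) + " </dev/null"


def run_until_eof() -> str:
    """Run a child that blocks on stdin EOF and return its output."""
    # Stand-in for a CLI that reads additional input from stdin.
    child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    result = subprocess.run(
        ["sh", "-c", with_closed_stdin(child)],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip()
```

With the redirect the child sees EOF at once and reports zero bytes
read; without it, the child would wait for as long as the caller
keeps stdin open.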

openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(swe_agent): add --backend docker support

swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
   so swerex's `docker run ... image sh -c "<startup>"` becomes
   `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
   the feature-patch path. Pass docker_args=["--entrypoint", ""]
   to clear the entrypoint (mirrors the existing Modal monkey-patch
   that does .entrypoint([]) on the image).

2. swerex's startup falls back to `pipx run swe-rex ...` when the
   swerex-remote binary isn't pre-installed, but pipx looks for an
   executable literally named "swe-rex" — which doesn't exist in
   the published `swe-rex` package (it provides "swerex-remote").
   Monkey-patch DockerDeployment._get_swerex_start_cmd to use
   `pipx run --spec swe-rex swerex-remote ...` instead.
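
Both workarounds can be illustrated with a small model of how Docker
combines an image ENTRYPOINT with the command given to ``docker run``.
The helpers below are hypothetical and do not call swerex; the
``<startup>`` and ``<remote-args>`` placeholders stand for values the
real deployment fills in.

```python
from typing import List, Optional


def effective_argv(
    image_entrypoint: List[str],
    command: List[str],
    entrypoint_override: Optional[str] = None,
) -> List[str]:
    """Model the argv a container runs: ENTRYPOINT followed by the command."""
    if entrypoint_override is None:
        entrypoint = image_entrypoint
    elif entrypoint_override == "":
        # `--entrypoint ""` clears the image's ENTRYPOINT.
        entrypoint = []
    else:
        entrypoint = [entrypoint_override]
    return entrypoint + command


def swerex_start_cmd(remote_args: List[str]) -> List[str]:
    """Name the package with --spec and the executable separately."""
    return ["pipx", "run", "--spec", "swe-rex", "swerex-remote", *remote_args]
```

With the image's entrypoint left in place, ``runner.sh`` receives
``sh`` as its first argument; with the override, the startup command
runs as intended.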

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team_harness: extract team mode as standalone harness + ablation flags

Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.

Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL,
  ordered agent list, scratchpad volume name) with the feature config
  and exposes adapter-facing factories that each return None / [] / {}
  when their feature is disabled, so adapter code paths collapse to one
  branch:

    coop_env.update(session.env_for(agent_id))
    extra_run_args.extend(session.scratchpad_mount_args())
    mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
  (task_list, scratchpad, mcp, auto_refresh, protocol).  The lead/member
  role split is the always-on baseline -- without it team is just coop.
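
A rough sketch of the facade shape described above, to show how the
"empty value when disabled" factories keep adapter code to one
branch. The specific ``CB_TEAM_*`` variable names, the feature that
gates ``env_for``, and the ``mcp_config`` payload are assumptions; only
the five config fields, the ``cb-team-<run_id>`` volume name, and the
``/workspace/shared`` mount point come from this PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TeamHarnessConfig:
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class TeamSession:
    run_id: str
    redis_url: str
    agents: List[str]
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def env_for(self, agent_id: str) -> Dict[str, str]:
        # Assumption: env propagation is gated on the task-list feature.
        if not self.config.task_list:
            return {}
        role = "team-lead" if agent_id == self.agents[0] else "member"
        return {
            "CB_TEAM_RUN_ID": self.run_id,  # hypothetical variable names
            "CB_TEAM_ROLE": role,
            "CB_TEAM_REDIS_URL": self.redis_url,
        }

    def scratchpad_mount_args(self) -> List[str]:
        if not self.config.scratchpad:
            return []
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[dict]:
        if not self.config.mcp:
            return None
        # Hypothetical payload; the real entry is adapter-specific.
        return {"command": "python3", "args": [container_script_path]}
```

An ablated session is built by passing a config with the relevant
fields set to False; adapters call the same three methods either way.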

Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter.  result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers.  Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.

Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating.  Full suite
363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
  volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
  tasks files, empty metrics dict, no volume.  team_features in
  result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* codex: add Azure OpenAI support

Set AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT (the OpenAI-compatible
v1 base, e.g. https://<resource>.cognitiveservices.azure.com/openai/v1)
and pass the Azure deployment name via -m.  When both are present they
take precedence over OPENAI_API_KEY.

How it works:
- resolve_azure_config() reads the two env vars (endpoint trailing slash
  stripped); _azure_config_toml() writes a `model_provider = "azure"`
  block into codex's config.toml with wire_api = "responses" (codex
  0.132 dropped the chat wire API) and env_key = AZURE_OPENAI_API_KEY.
- The key is exported into the codex command and read via the provider
  env_key; auth.json is skipped on the Azure path.
- config.toml is now composed from independent fragments (azure provider
  + team-mode MCP server) so both can coexist.
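
A minimal sketch of the two helpers described above. The signatures
are hypothetical (the real ones are not shown in this PR), and the
``[model_providers.azure]`` table layout with ``name`` and ``base_url``
keys is an assumption about codex's config format; ``model_provider``,
``wire_api`` and ``env_key`` are the values named in this message.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_azure_config(
    env: Mapping[str, str] = os.environ,
) -> Optional[Tuple[str, str]]:
    """Return (api_key, endpoint) when both Azure variables are set."""
    key = env.get("AZURE_OPENAI_API_KEY")
    endpoint = env.get("AZURE_OPENAI_ENDPOINT")
    if not key or not endpoint:
        return None
    # Strip the trailing slash so the base URL joins cleanly.
    return key, endpoint.rstrip("/")


def azure_config_toml(endpoint: str) -> str:
    """Render the provider fragment; the key is read via env_key at runtime."""
    return "\n".join(
        [
            'model_provider = "azure"',
            "",
            "[model_providers.azure]",
            'name = "Azure OpenAI"',
            f'base_url = "{endpoint}"',
            'env_key = "AZURE_OPENAI_API_KEY"',
            'wire_api = "responses"',
            "",
        ]
    )
```

Keeping the fragment a plain string makes it easy to concatenate
with the team-mode MCP fragment when both are active.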

Non-json fallback: codex 0.132's --json event stream deterministically
fails against Azure's HTTP/2 /responses endpoint ("stream disconnected:
error sending request") while plain output works.  Captured requests are
byte-identical between modes, so it's a codex response-handling bug, not
a config error.  The Azure path therefore runs codex WITHOUT --json,
harvests the patch from patch.txt (as always) and the final message via
--output-last-message, and derives status from codex's exit code.
Trade-off: no token/cost/trajectory telemetry on Azure (codex's plain
output carries none; cost was already $0 via the broken json parser).

Tests: 5 new (resolve_azure_config, _azure_config_toml, non-json run
shape + provider config + no auth.json, error status on non-zero exit);
autouse fixture clears AZURE_* so non-Azure tests stay hermetic.
Full suite 369 passed; ruff/format/mypy green.

Validated end-to-end on dottxt_ai_outlines/1655 [1,3] with
`-a codex -m gpt-5.5-hao` against a live Azure deployment: Submitted,
clean stream (no disconnects), eval passes both features.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(codex): preserve Azure key in coop/team mode

The is_coop branch reassigned `coop_env = {...}`, wiping the
AZURE_OPENAI_API_KEY added just above it.  Codex then failed provider
auth ("Missing environment variable: AZURE_OPENAI_API_KEY") in every
coop / coop+git / team run, producing empty patches — a full-dataset
coop+git Azure sweep scored 0/652 while solo (same path) scored 355/652.

Fix: `coop_env.update({...})` so the Azure key survives.  Verified with
a coop+git Azure smoke (both agents Submitted, real patches, zero
missing-key errors).  Adds a regression test
(test_azure_key_survives_coop_mode).
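
The bug and the fix in isolation, as a sketch. The function names
and the ``COOP_AGENT_ID`` variable are hypothetical stand-ins for
whatever the is_coop branch actually sets.

```python
from typing import Dict, Optional


def build_env_buggy(azure_key: Optional[str], is_coop: bool) -> Dict[str, str]:
    coop_env: Dict[str, str] = {}
    if azure_key:
        coop_env["AZURE_OPENAI_API_KEY"] = azure_key
    if is_coop:
        # Reassignment replaces the dict and discards the Azure key.
        coop_env = {"COOP_AGENT_ID": "agent1"}
    return coop_env


def build_env_fixed(azure_key: Optional[str], is_coop: bool) -> Dict[str, str]:
    coop_env: Dict[str, str] = {}
    if azure_key:
        coop_env["AZURE_OPENAI_API_KEY"] = azure_key
    if is_coop:
        # update() merges into the existing dict, so the key survives.
        coop_env.update({"COOP_AGENT_ID": "agent1"})
    return coop_env
```

Solo runs never enter the is_coop branch, which is why the same code
path scored normally there.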

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(codex): harden container install for concurrent runs

Codex's setup.sh ran apt without DEBIAN_FRONTEND=noninteractive, so in
TTY-less containers debconf fell through Dialog->Readline->Teletype and
tripped dpkg ("Sub-process /usr/bin/dpkg returned an error code (1)").
Rare at solo concurrency (6 containers, ~0.6% fail) but dominant under
coop/team (12 containers at concurrency 6, ~87% fail) — a full-dataset
coop+git sweep collapsed to install failures.

Fix: export DEBIAN_FRONTEND=noninteractive and wrap apt/apk/yum installs
in a 3x retry (transient mirror throttling under many simultaneous
installs from one host).  Validated with 15 coop+git tasks at
concurrency 6: 15/15 installed cleanly (was ~1/8 before), 30/30 agents
produced patches.
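
The fix itself lives in setup.sh (shell). The same control flow,
sketched in Python for illustration: force a non-interactive debconf
frontend and retry a failing install command up to three times. The
helper names and the delay parameter are hypothetical.

```python
import os
import subprocess
import time
from typing import Dict, List, Mapping, Optional


def noninteractive_env(base: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Copy the environment and force debconf's non-interactive frontend."""
    env = dict(base)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    return env


def run_with_retry(
    argv: List[str],
    attempts: int = 3,
    delay_seconds: float = 0.0,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Run argv until it exits 0 or attempts are exhausted; return the last exit code."""
    returncode = 1
    for attempt in range(1, attempts + 1):
        returncode = subprocess.run(argv, env=env).returncode
        if returncode == 0:
            break
        if attempt < attempts:
            time.sleep(delay_seconds)
    return returncode
```

A short delay between attempts gives a throttled mirror time to
recover when many containers install from one host at once.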

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil added a commit that referenced this pull request May 21, 2026
* agents/codex: add Codex adapter; lift shared coop bits into _coop

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter.  Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:

  - Invokes ``codex exec --json --sandbox danger-full-access
    --skip-git-repo-check --model <id>``.
  - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
    inside the container so the CLI authenticates without prompts.
  - Parses Codex's JSONL event stream for status / token totals /
    messages.  Cost is reported as 0.0 because Codex does not emit a
    cost field; tokens are summed across ``turn.completed`` events.
  - Model fallback: if Codex rejects ``--model gpt-5.5`` with a
    "model not found" shaped error, the adapter retries once without
    ``--model`` and lets Codex pick its default.
  - Preflight credential check: if OPENAI_API_KEY is unset the adapter
    returns Error immediately instead of spinning up a container that
    can only fail.

Shared ``_coop`` module:

  - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
    installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
    ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
  - ``install_snippet.sh`` — pip-installs redis and drops the shell
    wrappers; each adapter's setup.sh sources it.
  - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
    agnostic.
  - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
    ``write_file_in_container`` / ``read_file_from_container``,
    ``rewrite_comm_url_for_container``, ``build_git_setup_command``,
    ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires.  Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace).  This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module.  Full suite: 228
passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:

  - codex solo f1:           Submitted, 1 turn, 365k input tokens,
                             184-line patch (with the trailing-newline
                             fix it applies cleanly)
  - codex coop+git f1,f2:    both Submitted, both patches applied but
                             0/2 tests pass — coordination failure
                             (agent1 fetched ``team`` but never merged,
                             so the stacked patches produce a Python
                             SyntaxError at line 144 of the modified
                             file).  Claude on the same task scored
                             2/2; Codex used the tools less aggressively
                             on this run.

The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug.  Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* runner: add team mode (lead + members + shared task list + scratchpad)

Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product.  Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:

  1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
     backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
     atomic claim semantics (HSETNX-style — exactly one caller wins on a
     race) and an audit log of every mutation.  Exposed in the container
     as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
     / ``coop-task-list`` shell wrappers.

  2. A **lead / member role split**.  The first agent is designated
     ``team-lead`` and gets a system-prompt block instructing them to
     break the spec into tasks, assign them via ``coop-task-create
     --assign``, watch progress, and integrate.  Other agents are
     ``member`` and look for open tasks to claim.

  3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
     mounted at ``/workspace/shared`` in every container.  Free
     coordination artifact for design notes, partial diffs, interface
     sketches.
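
The atomic claim in primitive 1 can be illustrated without Redis.
This in-memory stand-in is a sketch only: the class and its audit-log
record shape are hypothetical, and a lock plays the role that a
set-if-absent Redis operation plays in the real ``TaskListClient``.

```python
import threading
from typing import Dict, List, Optional


class InMemoryTaskList:
    """Set-if-absent ownership: the first claim wins, later ones lose."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}
        self._lock = threading.Lock()
        self.audit_log: List[dict] = []

    def claim(self, task_id: str, agent_id: str) -> bool:
        with self._lock:
            won = task_id not in self._owners
            if won:
                self._owners[task_id] = agent_id
            # Every attempt is logged, including losing ones.
            self.audit_log.append(
                {"event": "claim", "task": task_id, "agent": agent_id, "won": won}
            )
            return won

    def owner(self, task_id: str) -> Optional[str]:
        return self._owners.get(task_id)
```

Checking and writing under one lock is what makes the race safe; a
separate "read owner, then write owner" sequence would let two agents
both believe they had won.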

Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``.  Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
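
A sketch of how such metrics could be derived from an audit log. The
event schema (``ts``, ``event``, ``task``, ``agent``, ``status``) and the
function name are hypothetical; only the metric names come from this
PR.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def coordination_metrics(events: List[Dict[str, Any]], start_ts: float) -> Dict[str, Any]:
    claims: Counter = Counter()
    updates: Counter = Counter()
    status: Dict[str, str] = {}
    owner: Dict[str, Optional[str]] = {}
    first_claim_ts: Optional[float] = None

    for event in sorted(events, key=lambda e: e["ts"]):
        task = event["task"]
        status.setdefault(task, "open")
        owner.setdefault(task, None)
        if event["event"] == "claim":
            claims[event["agent"]] += 1
            owner[task] = event["agent"]
            if first_claim_ts is None:
                first_claim_ts = event["ts"]
        elif event["event"] == "update":
            updates[event["agent"]] += 1
            status[task] = event.get("status", status[task])

    return {
        "tasks_total": len(status),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for o in owner.values() if o is None),
        "time_to_first_claim_seconds": (
            None if first_claim_ts is None else first_claim_ts - start_ts
        ),
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
    }
```

Computing everything after the run from the log keeps the agents'
hot path free of metric bookkeeping.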

Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs.  The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``.  The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
  - 18 task_list (CRUD, atomic claim, owner-only update, audit log,
    run isolation)
  - 12 prompt (lead vs member branches, solo fallback, git interaction)
  -  3 runtime (env assembly, scratchpad mount args)
  -  4 metrics (happy path, unowned-at-end, empty log, multiple claims)
  -  5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
    metrics in result, three-agent team)
  -  4 misc

Full suite: 274 passed, 63 skipped.  Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:

  - 5 tasks created (2 by bench-runner, 3 by the lead splitting its
    work), all reached ``done``
  - time_to_first_claim_seconds=34.2
  - claims_per_agent={agent1: 2, agent2: 1}
  - updates_per_agent={agent1: 4, agent2: 3}
  - scratchpad volume actively used (agent2 wrote its diff to
    /workspace/shared/agent2.patch + a summary.md)
  - **0/1 pass rate** — both ``patch.txt`` files were empty: the
    members wrote diffs to the scratchpad instead of also writing
    ``/workspace/repo/patch.txt``, and the lead never ran the final
    integration step.  This is real coordination signal (the prompt
    told them to write both places but they followed the scratchpad
    half only) — a follow-up will tighten the prompt to make patch.txt
    submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a
reviewable size):

  - In-loop auto-refresh for the Python-loop adapters
  - MCP long-poll tool to give CLI adapters push-ish inbox semantics
  - Typed ``coop-request`` / ``coop-respond`` protocol on top of
    messaging (CC's plan_approval_request shape)
  - Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53)

Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
   Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
   so agents can ``ls`` and ``cat`` tasks with their existing tools
   rather than going through the ``coop-task-list`` CLI.  Layout
   mirrors Claude Code's team primitive: one ``<id>.json`` per task,
   plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
   trail).  Triggered on every ``coop-task-list`` invocation and from
   the host runner at startup.  Files written via tempfile+replace so
   readers never observe a partial state.

2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
   Layered on plain Redis messaging, mirroring CC's
   ``plan_approval_request`` / ``plan_approval_response`` shape.
   ``coop-request <peer> <kind> <body>`` returns a request_id (and
   optionally blocks via ``--wait N`` for a response).
   ``coop-respond <request_id> <body>`` writes back; the sender's
   ``await_response`` uses BLPOP so it actually sleeps instead of
   busy-polling.  Both events flow into the shared task-log so
   coordination metrics include protocol events.

3. **MCP long-poll server** (``_team/mcp_server.py``).  Stdio
   JSON-RPC server that exposes a single ``wait_for_message`` tool
   backed by BLPOP on the agent's inbox.  Registered automatically:
   Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
   the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
   The point is to make "watch the inbox" a natural idle behavior for
   the CLI adapters instead of a busy-loop on ``coop-recv`` returning
   empty — the closest we can get to push-style delivery for opaque
   CLI agent loops.

4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
   ``TeamPoller`` is a per-agent host-side helper that
   ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
   queries — same hook as the existing inbox poll.  The LLM sees a
   compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
   prepended to every turn so it doesn't need to remember to call
   ``coop-task-list``.  Plumbed via ``agent.team_poller`` so the
   ``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
   The same module also exports ``poll_team_state()`` for in-container
   use (env-driven variant).

5. **Prompt fix**: the previous team-mode end-to-end had members
   writing diffs to ``/workspace/shared/<id>.patch`` only and never to
   ``/workspace/repo/patch.txt``, scoring 0/2 despite great
   coordination.  Both lead and member prompts now have an explicit
   ``### Final submission — REQUIRED`` section that calls out
   ``patch.txt`` as the only file the bench evaluates and provides
   the exact ``git diff > patch.txt`` command.
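
The tempfile+replace write from item 1 can be sketched as below. The
helper names and the ``_index.json`` payload (a sorted list of ids)
are assumptions; the one-file-per-task layout is the one described
above.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def write_json_atomic(path: Path, payload: Any) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=f".{path.name}.", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def mirror_tasks(tasks: List[Dict[str, Any]], root: Path) -> None:
    """Snapshot each task to <id>.json, then refresh the index."""
    for task in tasks:
        write_json_atomic(root / f"{task['id']}.json", task)
    write_json_atomic(root / "_index.json", sorted(t["id"] for t in tasks))
```

Creating the temp file next to the target matters: a rename across
filesystems is not atomic, so a temp file under /tmp would lose the
guarantee.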

Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.

Tests: 37 new unit tests
  -  8 fs_mirror     (atomic writes, stale cleanup, empty index)
  -  9 protocol      (request roundtrip, await, timeout, audit log)
  -  9 mcp_server    (initialize, tools/list, tools/call,
                      timeout, blocking, unknown-tool error,
                      env factory)
  -  8 loop_refresh  (summary formatting, TeamPoller, env variant)
  -  3 prompt        (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).  All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``.  Previous run scored 0/2 on the
same task — the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team mode: wire team prompt + env into the three Python-loop adapters

Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode.  Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.

New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds.  Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.

Per-adapter wiring:

  - ``mini_swe_agent_v2``: appends team_task_section to task;
    propagates CB_TEAM_* through env_kwargs["env"]; adds
    ``--add-host=host.docker.internal:host-gateway`` + scratchpad
    volume to docker run args; installs the team CLI scripts + pip
    redis in the container after env spin-up.  The existing
    ``TeamPoller`` host-side hook (already in step()) still fires.

  - ``openhands_sdk``: appends team_task_section to task; folds a new
    ``team_env`` dict into ``coop_info`` so
    ``_build_credentials_dict`` propagates CB_TEAM_* into the
    sandbox.  Coop-task-* binary install in the OpenHands agent-server
    image is a follow-up — OpenHands manages its own image build and
    doesn't expose a clean post-start exec hook.

  - ``swe_agent``: appends team_task_section to task.  The SWE-agent
    framework's sandbox + agent loop is third-party and harder to
    instrument; everything beyond the prompt is a follow-up.

Tests: 13 new
  - 3 prompt unit tests for team_task_section (lead, member, empty)
  - 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
    consistency between team_task_section and build_team_instruction,
    every registered runner accepts the team kwargs, openhands env
    keys, swe_agent signature

Full suite: 324 passed, 63 skipped.  Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment.  Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.

Compatibility matrix is now:

  | Adapter             | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
  |---------------------|---------|-------------|--------------|------------------|----------|
  | claude_code         | yes     | yes (full)  | n/a          | yes              | yes      |
  | codex               | yes     | yes (full)  | n/a          | yes              | yes      |
  | mini_swe_agent_v2   | yes     | yes (sec.)  | yes          | yes              | yes      |
  | openhands_sdk       | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
  | swe_agent           | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: layer coop-task-* install onto Modal image for team mode

Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required).  Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost.  Modal caches the layered image; subsequent
team runs are instant.

Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5.  The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked.  Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex.  Both patches applied; f1 14/14,
f2 19/20.

Tests: 2 new (full suite: 326 passed)
  - test_team_env_triggers_image_layering  — verifies add_local_file
    + pip_install + run_commands fire with the right args when team
    mode is active
  - test_no_layering_when_team_inactive    — verifies solo / coop
    runs skip the image-build cost

Matrix update — openhands_sdk now reads:
  Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
  CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team prompt: make the merge-before-submit step REQUIRED

The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it.  Both fetched their peer's branch (2 each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.

The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it.  This rewrite:

  - Renames the section ``## Git collaboration — MERGE IS REQUIRED
    BEFORE SUBMITTING`` so the imperative is in the heading itself.
  - Adds an explicit "Required final sequence — run this verbatim
    before exiting" block with the full fetch+merge+diff sequence,
    parameterized over every partner branch.
  - Explains *why* (each agent's patch.txt is evaluated against every
    feature's tests; without the merge, the peer feature's symbols
    are missing → ImportError).
  - Frames it the same way the patch.txt step is framed (REQUIRED,
    skip-at-your-loss), which the original prompt fix proved
    codex responds to.
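
A rough sketch of how a fetch+merge+diff block could be generated for every partner branch. This is not the text in prompt.py: the function name, the `origin` remote, the `--no-edit` flag, and the `base_ref` parameter are all assumptions made for illustration.

```python
from typing import List


def required_final_sequence(
    partner_branches: List[str],
    remote: str = "origin",      # assumed remote name
    base_ref: str = "BASE_REF",  # placeholder for the task's base commit
) -> List[str]:
    """Return the shell lines an agent would run verbatim before exiting."""
    lines: List[str] = []
    for branch in partner_branches:
        # Fetch the partner's latest work, then merge it into the local branch.
        lines.append(f"git fetch {remote} {branch}")
        lines.append(f"git merge --no-edit {remote}/{branch}")
    # Only after every merge: write the combined diff that the eval consumes.
    lines.append(f"git diff {base_ref} > patch.txt")
    return lines


sequence = required_final_sequence(["agent2"])
```

Generating the block per partner keeps the diff step last, so patch.txt always reflects the merged tree rather than one agent's own work.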

Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``.  Both
patches now contain both features' symbols.  Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).

A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree.  This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch.  That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked.  Fix is a separate concern from team-mode wiring.

Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval: short-circuit when both agents submit identical merged patches

In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.

Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches.  If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance.  Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
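
A minimal sketch of the short-circuit described above, assuming both patches are files on disk. The function name and the shape of the returned dict are illustrative, not copied from ``eval/evaluate.py``, and the ``git apply --recount`` normalization step is left out.

```python
import filecmp
import shutil
from pathlib import Path
from typing import Optional


def short_circuit_identical(patch1: Path, patch2: Path, merged: Path) -> Optional[dict]:
    """Copy patch1 to merged when both patches match byte for byte.

    Returns None when the patches differ so the caller can run the real merge.
    """
    # shallow=False forces a content comparison instead of trusting stat metadata.
    if not filecmp.cmp(patch1, patch2, shallow=False):
        return None
    shutil.copyfile(patch1, merged)
    return {"merge": {"status": "identical"}}
```

Returning ``None`` on a mismatch leaves the existing merge path untouched, which matches the "no regression" property claimed for this fix.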

Verified on the codex-team e2e:

  - cx_team_v5 (codex agents perfectly merged to identical 243-line
    patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
  - cx_team_v4 (codex agents diverged on the merge): unchanged at
    f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
    agent2-alone via apply_status: {'agent1': 'failed', ...}

I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.

Also lands previously-uncommitted housekeeping:
  - prompt.py: ruff-format-only diff on the merge-required block
    from the prior commit
  - test_team_wiring.py: ruff --fix removed unused MagicMock
    imports
  - test_gcp_backend.py / test_tasks.py: ruff --fix removed
    f-string-without-placeholder and unused-json import (both
    unrelated drift caught by the gate)

Tests: 1 new (full suite: 327 passed)
  - ``test_test_merged_shortcircuits_on_identical_patches`` — source
    inspection confirms the short-circuit branch + "identical"
    merge-status string exist in test_merged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: register Redis-backed CoopTaskTracker as a typed tool

The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands.  This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.

Files:

  * ``openhands/tools/task_tracker/coop_definition.py`` — new tool
    definition + executor.  Same ``TaskTrackerAction`` /
    ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
    trip through the shared ``cb:<run_id>:`` Redis namespace that
    ``TaskListClient`` (host side) writes to.  Tasks are auto-owned
    by the calling agent; ``view`` shows peer tasks prefixed with
    ``[<their_agent_id>]``.  Registered under both
    ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
    the module rebinds the latter to the Coop variant.

  * ``openhands/tools/preset/default.py`` — gains a ``team_mode``
    kwarg (kept for API stability + tests; the actual swap happens
    server-side via the .pth/__init__ side-effect import, not by
    changing the host-side tool list).  The pre-PR coop block is
    split out into a team-mode prompt section that documents the
    TaskTracker → shared-list behavior.

  * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
    layers two more bits into the Modal image at build time:
      - ``add_local_file`` of ``coop_definition.py`` to
        ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
        install)
      - ``grep ... || echo`` appending
        ``from . import coop_definition`` to the package's
        ``__init__.py`` so the registration runs at import time.

Tests: 1 new + updated image-layering assertions
  - ``test_importing_coop_definition_overrides_local_registration``:
    inspecting the registry's ``_MODULE_QUALNAMES`` confirms
    ``TaskTrackerTool.name`` resolves to ``coop_definition``'s
    registration after import.
  - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
    calls + 2 ``run_commands`` layers (tool-file install +
    ``coop-task-*`` wrappers) and that the
    ``from . import coop_definition`` line is in the install
    commands.

Full suite: 329 passed.  Ruff / format / mypy all green.

KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis.  The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``.  The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.

For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands team mode: end-to-end working with Modal-hosted Redis

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning.  Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
   ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
   running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
   accessed via ``unencrypted_host:unencrypted_port``).  Re-uses the
   existing ``connectors/redis_server.ModalRedisServer`` — it was
   already written, just unused.  Both the host TaskListClient and
   the agent sandboxes point at the same public TCP endpoint, so
   pre-seed and agent reads/writes share state.  Falls back to local
   Redis for the other adapters.

2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
   adapter now ``add_local_file``s three pieces into the OpenHands
   image at build time:
     - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
     - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
     - ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
       (replaces upstream; same exports + a side-effect import of
       coop_definition so the Redis-backed executor overrides the
       local TaskTracker registration at first import).
   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
   bytecode cache so the new __init__ actually re-runs.

3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
   connections after a few minutes, so the Redis client opened for
   the startup pre-seed is closed before the 9-min agent run
   finishes.  The harvest step now opens a new client against the
   same URL.

End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:

  - Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
  - Both agents Submitted, 9m total
  - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
  - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
    time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
    agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
  - Cost: $3.33

Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.

Full suite: 329 passed.  Ruff / format / mypy all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* deps: add fakeredis to dev extras

The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.

GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file.  Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* swe_agent: fix import error + add missing transitive deps

Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):

1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
   in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
   The old package was renamed in v0.0.13; both swe_agent files
   had stale imports that failed at module load (TypeError or
   ModuleNotFoundError depending on how the framework was invoked),
   making every swe_agent invocation return Error before any LLM
   call.

2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
   in pyproject.toml.  swe_agent's vendored framework imports these
   at module-load time even when the docker/S3/model paths are
   dormant, so a clean ``pip install '.[swe-agent]'`` without these
   would still ImportError on first invocation.

3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply.  Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring.  swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.

Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body.  For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: team-mode bugs surfaced by 10-pair core run

Six compounding issues prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core
subset in team setting. All four benchmarked adapters (these three
plus `openhands_sdk`) now reach ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip()
  consumes " \n"), breaking last-hunk line counts so git apply
  rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
  "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
  raw .strip() on the patch.txt read, so every msa patch ended
  in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
  long-running command, so the image's default CMD exited and
  every exec hit "Sandbox not found". Pass "sleep", "infinity"
  as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
  because shared build_environment was hardcoded to DockerEnvironment.
  Added a backend kwarg and threaded config["backend"] through both
  adapters.
- Team lead prompt buried the integration step at the bottom of a
  long workflow list; Claude/Codex consistently exited after their
  own feature without reading /workspace/shared/<agent>.patch.
  Rewrote with a hard-rule opener and a 5-point pre-submission
  checklist. Member prompt now opens with "stay in your lane" per
  the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
  alone when the merged tree doesn't pass both features. Surfaced
  as merge.strategy="solo-agent1" / "solo-agent2". Credits the
  agent (typically the lead) who correctly integrated both
  features into one working patch but had it corrupted by
  union-merging with the other agent's partial implementation.
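
The first bullet's failure mode is easy to reproduce. The sketch below is not the project's normalize_patch; it is a hypothetical `strip_newlines_only` helper that shows why `str.strip()` damages a diff whose last hunk ends in a blank context line (a single space followed by a newline).

```python
def strip_newlines_only(text: str) -> str:
    """Trim surrounding newlines but keep the leading space of blank context lines."""
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A hunk whose final context line is blank: " " followed by "\n".
patch = "@@ -1,3 +1,3 @@\n-old\n+new\n ctx\n \n"

# strip() treats the lone space as whitespace and removes the whole last line,
# so the hunk body no longer matches the "-1,3 +1,3" counts in the header.
broken = patch.strip() + "\n"

# Stripping only "\n" keeps the blank context line intact.
fixed = strip_newlines_only(patch)
```

With the blank context line gone, the hunk carries two old-side lines while its header promises three, which is the kind of mismatch that makes `git apply` reject the diff.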

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs+data: core subset and team-mode horizontal comparison

- dataset/subsets/core.json: 10-pair subset for quick agent
  comparisons. Stratified by repo (largest-remainder proportional
  allocation by full-dataset pair count) with a one-slot floor per
  primary language (Python / Go / Rust / TS). Reproducible via
  scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
  frameworks on the core subset in team setting. Includes per-task
  pass/fail matrix annotated with the merge strategy used, plus the
  chronological narrative of the dozen reruns that surfaced each of
  the bugs fixed in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(eval): don't bail when union-merge also conflicts

Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.

This doesn't change any of the current 40 eval results: git's
merge=union attribute is tolerant enough that every task in the
dataset produces some tree (potentially broken code with
stitched-together lines), and the broken-tree-tests-fail path already
triggered the solo fallback. This just closes the defensive gap for
future pathological cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval(team): identical / naive / lead-when-naive-conflicts policy

Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:

  1. identical patches → skip-merge short-circuit
  2. naive 3-way merge clean → merged-tree tests are authoritative
                               (no further fallback)
  3. naive merge conflicts → test the lead's patch.txt alone against
                             both feature suites

Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
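
The three-step chain can be summarized as a small decision function. This is a sketch under assumptions: the function name, its arguments, and the returned labels are illustrative and do not mirror the actual test_merged signature.

```python
def choose_merge_strategy(lead_patch: str, member_patch: str, naive_merge_clean: bool) -> str:
    """Return which tree the eval should test, following the three-step chain."""
    if lead_patch == member_patch:
        # 1. Identical patches: skip the merge and test one copy.
        return "identical"
    if naive_merge_clean:
        # 2. Clean naive 3-way merge: the merged tree is authoritative.
        return "naive"
    # 3. Conflicts: test only the lead's patch.txt against both feature suites.
    return "lead-alone"
```

There is deliberately no branch that tests the member's patch alone, and no retry after a clean merge fails its tests.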

Effect on the core-subset horizontal comparison:
  msa  6 → 6  (unchanged)
  oh   5 → 4  (loses pallets_jinja/1621 — was passing via union, which
              concealed that oh's lead doesn't integrate)
  cc   5 → 5  (unchanged)
  cx   5 → 5  (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(modal): codex stdin hang; eval guardrail for openhands_sdk

codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
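
The hang can be reproduced with any child process that reads stdin to EOF. The sketch below uses a Python one-liner as a stand-in for `codex exec` (an assumption, so the example runs anywhere); `stdin=subprocess.DEVNULL` is the subprocess equivalent of appending `</dev/null` to a shell command.

```python
import subprocess
import sys

# Stand-in child: blocks until stdin reaches EOF, then prints a marker.
child = [sys.executable, "-c", "import sys; sys.stdin.read(); print('done')"]

# With stdin tied to the null device the child sees EOF immediately instead
# of waiting on an open stream, which is what `</dev/null` does in a shell.
result = subprocess.run(
    child,
    stdin=subprocess.DEVNULL,
    capture_output=True,
    text=True,
    timeout=30,
)
output = result.stdout.strip()
```

The `timeout` is only a safety net for the example; with the null device attached the child returns right away.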

openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.
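
A guardrail of this shape can be sketched as follows. The function name and the `agent` key inside config.json are assumptions; only the rule itself (openhands_sdk runs require the Modal backend) comes from the description above.

```python
import json
from pathlib import Path


def check_eval_backend(run_dir: Path, backend: str) -> bool:
    """Return False, with a warning, when an openhands_sdk run is evaluated off Modal."""
    config = json.loads((run_dir / "config.json").read_text())
    # "agent" is an assumed key name for the adapter recorded in config.json.
    if config.get("agent") == "openhands_sdk" and backend != "modal":
        print(
            "warning: this run was produced by openhands_sdk; "
            f"re-run eval with --backend modal (got {backend!r})"
        )
        return False
    return True
```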

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(swe_agent): add --backend docker support

swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
   so swerex's `docker run ... image sh -c "<startup>"` becomes
   `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
   the feature-patch path. Pass docker_args=["--entrypoint", ""]
   to clear the entrypoint (mirrors the existing Modal monkey-patch
   that does .entrypoint([]) on the image).

2. swerex's startup falls back to `pipx run swe-rex ...` when the
   swerex-remote binary isn't pre-installed, but pipx looks for an
   executable literally named "swe-rex" — which doesn't exist in
   the published `swe-rex` package (it provides "swerex-remote").
   Monkey-patch DockerDeployment._get_swerex_start_cmd to use
   `pipx run --spec swe-rex swerex-remote ...` instead.
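
Both workarounds amount to small argument rewrites. The sketch below assumes the start command is available as an argv list, which may not match how swerex builds it; `patch_pipx_fallback`, `DOCKER_ARGS`, and the `--port 8000` arguments in the usage line are hypothetical.

```python
from typing import List

# Extra `docker run` arguments that clear the image's ENTRYPOINT so the
# startup command is not handed to runner.sh as its first argument.
DOCKER_ARGS = ["--entrypoint", ""]


def patch_pipx_fallback(cmd: List[str]) -> List[str]:
    """Rewrite `pipx run swe-rex ...` to `pipx run --spec swe-rex swerex-remote ...`."""
    if cmd[:3] == ["pipx", "run", "swe-rex"]:
        # --spec names the package to install; the next token is the executable.
        return ["pipx", "run", "--spec", "swe-rex", "swerex-remote", *cmd[3:]]
    return list(cmd)


patched = patch_pipx_fallback(["pipx", "run", "swe-rex", "--port", "8000"])
```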

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team_harness: extract team mode as standalone harness + ablation flags

Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.

Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL,
  ordered agent list, scratchpad volume name) with the feature config
  and exposes adapter-facing factories that each return None / [] / {}
  when their feature is disabled, so adapter code paths collapse to one
  branch:

    coop_env.update(session.env_for(agent_id))
    extra_run_args.extend(session.scratchpad_mount_args())
    mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
  (task_list, scratchpad, mcp, auto_refresh, protocol).  The lead/member
  role split is the always-on baseline -- without it team is just coop.

Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter.  result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
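
A simplified sketch of the config/session pairing, to show how disabled features collapse to empty values. The five field names come from the description above; the `CB_TEAM_*` variable names, the `/workspace/shared` mount target, and the choice of which flag gates which factory are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Five per-feature switches; the lead/member role split is always on."""

    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class TeamSession:
    run_id: str
    redis_url: str
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def env_for(self, agent_id: str) -> Dict[str, str]:
        # Illustrative CB_TEAM_* keys, not the real set.
        if not self.config.task_list:
            return {}
        return {
            "CB_TEAM_RUN_ID": self.run_id,
            "CB_TEAM_AGENT_ID": agent_id,
            "CB_TEAM_REDIS_URL": self.redis_url,
        }

    def scratchpad_mount_args(self) -> List[str]:
        # Volume name follows the cb-team-<run> pattern; the target path is assumed.
        if not self.config.scratchpad:
            return []
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]
```

An adapter can then call `coop_env.update(session.env_for(agent_id))` unconditionally, because a disabled feature contributes an empty dict or list instead of requiring an `if` at every call site.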

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers.  Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.

Tests: tests/agents/_team -> tests/team_harness (rename); a new
test_session.py (29 cases) covers the facade; four new ablation tests
in tests/runner/test_team.py verify the runner-side gating.  Full suite
363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
  volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
  tasks files, empty metrics dict, no volume.  team_features in
  result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* codex: add Azure OpenAI support

Set AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT (the OpenAI-compatible
v1 base, e.g. https://<resource>.cognitiveservices.azure.com/openai/v1)
and pass the Azure deployment name via -m.  When both are present they
take precedence over OPENAI_API_KEY.

How it works:
- resolve_azure_config() reads the two env vars (endpoint trailing slash
  stripped); _azure_config_toml() writes a `model_provider = "azure"`
  block into codex's config.toml with wire_api = "responses" (codex
  0.132 dropped the chat wire API) and env_key = AZURE_OPENAI_API_KEY.
- The key is exported into the codex command and read via the provider
  env_key; auth.json is skipped on the Azure path.
- config.toml is now composed from independent fragments (azure provider
  + team-mode MCP server) so both can coexist.
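
A sketch of the detection and config-fragment steps, under assumptions. The helper names, the tuple return shape, and the `[model_providers.azure]`, `name`, and `base_url` keys are illustrative and were not copied from the adapter; only the two environment variables, `model_provider`, `wire_api`, and `env_key` are taken from the description above.

```python
import os
from typing import Mapping, Optional, Tuple


def read_azure_env(env: Mapping[str, str] = os.environ) -> Optional[Tuple[str, str]]:
    """Return (api_key, endpoint) when both Azure variables are set, else None."""
    key = env.get("AZURE_OPENAI_API_KEY")
    endpoint = env.get("AZURE_OPENAI_ENDPOINT")
    if not key or not endpoint:
        return None
    # Strip the trailing slash so later path joins don't produce "//".
    return key, endpoint.rstrip("/")


def render_azure_provider(endpoint: str) -> str:
    """Render a provider fragment for codex's config.toml (assumed layout)."""
    return (
        'model_provider = "azure"\n'
        "\n"
        "[model_providers.azure]\n"
        'name = "Azure OpenAI"\n'
        f'base_url = "{endpoint}"\n'
        'env_key = "AZURE_OPENAI_API_KEY"\n'
        'wire_api = "responses"\n'
    )
```

The fragment references the key by variable name rather than embedding it, so the secret stays out of config.toml.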

Non-json fallback: codex 0.132's --json event stream deterministically
fails against Azure's HTTP/2 /responses endpoint ("stream disconnected:
error sending request") while plain output works.  Captured requests are
byte-identical between modes, so it's a codex response-handling bug, not
a config error.  The Azure path therefore runs codex WITHOUT --json,
harvests the patch from patch.txt (as always) and the final message via
--output-last-message, and derives status from codex's exit code.
Trade-off: no token/cost/trajectory telemetry on Azure (codex's plain
output carries none; cost was already $0 via the broken json parser).

Tests: 5 new (resolve_azure_config, _azure_config_toml, non-json run
shape + provider config + no auth.json, error status on non-zero exit);
autouse fixture clears AZURE_* so non-Azure tests stay hermetic.
Full suite 369 passed; ruff/format/mypy green.

Validated end-to-end on dottxt_ai_outlines/1655 [1,3] with
`-a codex -m gpt-5.5-hao` against a live Azure deployment: Submitted,
clean stream (no disconnects), eval passes both features.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agents: add Azure OpenAI support to msa / swe_agent / openhands

Extends Azure support (added for codex in the prior commit) to the three
litellm/SDK-backed adapters.  claude_code is intentionally excluded.

Shared detection in cooperbench/agents/_azure.py:
- resolve_azure_config() reads AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT
  (same env vars as codex), endpoint trailing slash stripped.
- azure_litellm_model() returns `openai/<deployment>` — litellm's
  openai-compatible provider pointed at Azure's v1 base, mirroring how
  the OpenAI SDK is pointed at Azure (base_url=<v1>).  No api_version pin
  (both the openai-compatible and native azure/ litellm routes were
  verified against the live endpoint; the former is used).

Wiring (each gated on resolve_azure_config(), no-op when unset):
- mini_swe_agent_v2: model_name -> openai/<deployment>; api_base + api_key
  folded into LitellmModelConfig.model_kwargs.
- swe_agent: GenericAPIModelConfig(name=openai/<deployment>,
  api_base=..., api_key=...).
- openhands_sdk: LLM(model=openai/<deployment>, api_key=..., base_url=...).
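
The common thread across the three adapters is the `openai/<deployment>` model id plus an explicit base URL and key. A minimal sketch, assuming a hypothetical helper name and a plain dict return; each adapter would map these values onto its own config object.

```python
from typing import Dict


def azure_litellm_kwargs(deployment: str, api_key: str, endpoint: str) -> Dict[str, str]:
    """Build litellm-style arguments for an Azure deployment via the openai/ route."""
    return {
        # openai/ selects litellm's OpenAI-compatible provider.
        "model": f"openai/{deployment}",
        # Azure's v1 base, with any trailing slash removed.
        "api_base": endpoint.rstrip("/"),
        "api_key": api_key,
    }


kwargs = azure_litellm_kwargs("my-deployment", "secret", "https://x.example/openai/v1/")
```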

Tests: tests/agents/test_azure.py (9) covers detection precedence,
endpoint normalization, deployment-name parsing, and the litellm model
id.  Full suite 378 passed; ruff/format/mypy green.

Validation: the litellm->Azure route was confirmed directly (both
openai-compatible and azure/ provider forms return 200).  mini_swe_agent_v2
validated end-to-end on docker.  openhands_sdk (Modal backend) and
swe_agent (swerex path) are wired but not yet end-to-end-validated against
Azure — deferred so as not to compete with the running full-dataset codex
sweep for the shared Azure deployment's quota.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: drop incidental reformatting, keep only the Azure edit

The openhands_agent_sdk/ tree is ruff-excluded in pyproject.toml
(adapted from the OpenHands SDK), so the prior commit's `ruff format`
churned ~90 unrelated lines.  Restore the base file and re-apply only
the Azure LLM branch so the diff is minimal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>