team mode: wire team prompt + env into the three Python-loop adapters by ProKil · Pull Request #55 · cooperbench/CooperBench

ProKil · 2026-05-17T15:26:56Z

Summary

Brings every adapter (claude_code, codex, mini_swe_agent_v2, swe_agent, openhands_sdk) to full team-mode parity — each now runs end-to-end in team mode, including the openhands_sdk variant whose Modal-isolated agent-server required a custom Redis tunnel + tool-registry override to actually function.

Stacks on #52 (which itself stacks on #51).

Backend support matrix (after this PR)

Adapter	Docker	Modal	Notes
`mini_swe_agent_v2`	✓	✓	both verified end-to-end; backend dispatch in the adapter
`claude_code`	✓	✓	backend now threaded through `_coop/runtime.build_environment`
`codex`	✓	✓	Modal stdin hang fixed (this PR); verified solo on dottxt 1/1 in 1m 48s
`swe_agent`	✓	✓	docker via `swerex.DockerDeploymentConfig` (with entrypoint-clear + pipx-spec patch, both in this PR); verified solo on dottxt — Modal 1/1 in 3m 12s, Docker 1/1 in 2m 53s
`openhands_sdk`	(n/a)	✓	Modal-only by design; eval now refuses Docker with a clear warning

What landed (in order of commits)

Per-adapter team-prompt wiring + env-var propagation. mini_swe_agent_v2, swe_agent, openhands_sdk now append team_task_section to the task, propagate CB_TEAM_* into their containers, and (where they manage docker) mount the team scratchpad volume.
CoopTaskTracker typed tool for openhands — Redis-backed drop-in for openhands' built-in TaskTrackerTool, registered under the SAME name so the registry override is transparent to the agent. Required because gpt-5.5 strongly prefers typed tools over shell commands, even when the prompt tells it otherwise.
Codex coordination fixes — ## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING prompt rewrite that frames merging as the explicit final step, plus a test_merged short-circuit that copies one patch to merged.patch when both agents submit byte-identical merged trees.
Modal-hosted Redis for openhands — runner/team.py detects agent_name == "openhands_sdk" and spins up a Modal sandbox running redis-server on unencrypted_ports=[6379], exposed via unencrypted_host:unencrypted_port. Both the host TaskListClient and the agent-server's CoopTaskTracker point at the same public TCP endpoint.
CoopTaskTracker injection into the Modal sandbox — add_local_file the tool definition + a pre-rendered replacement __init__.py (no shell-heredoc fragility) + .pyc cache wipe so the registration override actually takes effect.
Harvest-time fresh Redis client — TCP tunnels drop idle connections after several minutes; re-open at harvest time.
fakeredis dev dependency — was undeclared, causing CI ImportErrors on team-mode test files.
swe_agent import fix (cooperbench.agents.mini_swe_agent → mini_swe_agent_v2) + missing transitive deps (numpy, boto3, docker) in swe-agent extras. Was a pre-existing bug from v0.0.13's rename — every swe_agent invocation errored before any LLM call.
Core-subset bug fixes + new eval policy — see "Follow-up validation" below.
Modal backend fixes — codex stdin hang + openhands docker-eval guardrail (this PR's most recent commit).

Per-adapter wiring + verified result

Adapter	Team prompt	CLI in container	`CB_TEAM_*` env	Typed task tracker	E2E pass rate
`claude_code`	full	✓	✓	n/a	2/2 ✓ (variable per-run)
`codex`	full	✓	✓	n/a	2/2 ✓ (after prompt + eval fixes); Modal smoke 1/1 ✓
`mini_swe_agent_v2`	section	✓	✓ (via `env_kwargs["env"]`)	n/a (in-loop `TeamPoller`)	2/2 ✓
`openhands_sdk`	section + system-prompt block	✓ (Modal image layered)	✓ (via `coop_info["team_env"]`)	✓ (CoopTaskTracker overrides upstream TaskTracker)	2/2 ✓ (via Modal-hosted Redis)
`swe_agent`	section	not yet	not yet	n/a	Modal smoke 1/1 ✓; Docker smoke 1/1 ✓; full team-tool integration is a follow-up

Follow-up validation: 10-pair core-subset horizontal comparison

Took the team wiring above to a real workload (the new dataset/subsets/core.json subset) and discovered five compounding bugs that prevented anything other than openhands from reaching honest pass-rates. Fixed them and then re-thought the team-mode eval policy.

Final results (10-pair core subset, team setting)

Eval policy: identical → naive merge → lead's patch alone. Union merge and member-only fallback were intentionally dropped — they reward lucky non-overlap or partial coordination rather than genuine team integration. Details in docs/BENCHMARK_RESULTS.md.

Agent	Pass	Cost	Wall	Notes
`mini_swe_agent_v2`	6 / 10	$13.37	24m	5 naive + 1 lead-alone
`claude_code`	5 / 10	~$8.5	21m	2 naive + 3 lead-alone
`codex`	5 / 10	$0*	21m	2 naive + 3 lead-alone
`openhands_sdk`	4 / 10	$31.90	16m	0 naive + 4 lead-alone — every oh task hits naive-conflict (patch.txt committed mid-run); only 4/10 have a lead-alone patch that passes both features

* gpt-5.5 not in local pricing table; codex did real work (400 k+ input tokens per agent).

Three of the four adapters reach at least 5/10 under the strict policy. oh at 4/10 is the right number: its union-merge passes on the older lenient eval were partly false positives. Because oh agents commit patch.txt mid-run, every merge hit a patch.txt conflict, which union resolved trivially while the actual source-code merge was non-conflicting. Under the stricter policy those cases route to lead-alone, and oh's lead doesn't always integrate.

Bugs fixed (all in the unreleased CHANGELOG entry)

codex exec hung in Modal sandbox — codex's exec mode blocks reading "additional input from stdin"; Modal sandbox keeps stdin open while Docker non-tty docker exec gives EOF for free. Fix: </dev/null on the codex invocation. Smoke-verified solo on dottxt: 1/1 in 1m 48s.
openhands_sdk eval guardrail — eval refuses to run a non-modal backend against an openhands_sdk-produced run with a clear warning, because openhands relies on the Modal-hosted Redis tunnel and commits patch.txt into the working tree (Docker eval silently changes the test environment).
normalize_patch was using text.strip(), eating trailing blank context lines (" \n") from valid git diff output and breaking last-hunk line counts so git apply rejected them.
mini_swe_agent_v2 adapter wasn't routing patches through normalize_patch at all — raw .strip(), same underlying issue, one layer deeper.
mini_swe_agent_v2 ModalEnvironment created the sandbox without a long-running command, so the image's default CMD exited and every exec() hit "Sandbox not found". Now passes "sleep", "infinity" (matches the eval backend's existing fix).
claude_code and codex adapters silently ignored --backend modal — shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters.
Team-lead prompt buried the integration step at the bottom of the workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist; member prompt now opens with "stay in your lane" per the lead's PLAN.md.
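
The two normalize_patch bugs above come down to stripping spaces as well as newlines from the end of a diff. A minimal sketch of the difference, assuming a hypothetical helper named normalize_patch_sketch (the project's real normalize_patch may differ in detail):

```python
def normalize_patch_sketch(text: str) -> str:
    """Hypothetical stand-in for normalize_patch; not the project's actual code."""
    # Drop leading whitespace (for example blank lines before "diff --git").
    body = text.lstrip()
    if not body:
        return ""
    # Remove trailing newlines only, never spaces: a line holding a single
    # space is a blank context line that the hunk header's counts include.
    return body.rstrip("\n") + "\n"


# A valid diff whose last hunk ends with a blank context line (" \n").
patch = (
    "diff --git a/f.py b/f.py\n"
    "--- a/f.py\n"
    "+++ b/f.py\n"
    "@@ -1,2 +1,3 @@\n"
    " x = 1\n"
    "+y = 2\n"
    " \n"
)

stripped = patch.strip()  # old behaviour: the blank context line is lost
normalized = normalize_patch_sketch("\n\n" + patch + "\n\n")
```

With plain .strip() the hunk header still announces two old lines and three new lines, but the body is one line short, which is the kind of mismatch git apply rejects.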

Eval policy change

test_merged now uses identical → naive → lead-alone-when-naive-conflicts. Previous chain was identical → naive → union → solo-fallback (lead-or-member). Union merge concatenates conflicting hunks (usually broken code; rewards lucky non-overlap rather than coordination); member-only fallback is incoherent in team mode (the lead is the designated integrator). When naive conflicts, the lead's patch.txt must pass both feature suites alone. Surfaced as merge.strategy = "solo-agent1" in eval.json.
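
The order of that chain can be sketched as a small selector. This is an illustration only: the function name and the "naive" label are assumptions, while "identical" and "solo-agent1" are the values named in this PR. It does not run any tests; in the lead-alone case the lead's patch still has to pass both feature suites on its own.

```python
from typing import Callable


def choose_merge_strategy(
    lead_patch: str,
    member_patch: str,
    try_naive_merge: Callable[[], bool],
) -> str:
    """Hypothetical selector mirroring identical -> naive -> lead-alone."""
    # 1. Byte-identical patches: nothing to merge, reuse one of them.
    if lead_patch == member_patch:
        return "identical"
    # 2. Attempt the naive merge only when the patches actually differ.
    if try_naive_merge():
        return "naive"
    # 3. Naive merge conflicted: evaluate the lead's patch alone.
    #    No union merge and no member-only fallback.
    return "solo-agent1"


calls = []


def failing_merge() -> bool:
    calls.append("naive")
    return False


identical = choose_merge_strategy("p\n", "p\n", failing_merge)
conflicted = choose_merge_strategy("a\n", "b\n", failing_merge)
clean = choose_merge_strategy("a\n", "b\n", lambda: True)
```

Passing the merge attempt as a callable keeps the short-circuit visible: the identical case never invokes it.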

Added

dataset/subsets/core.json (+ scripts/generate_core_subset.py) — 10-pair stratified core subset for quick agent comparisons.
docs/BENCHMARK_RESULTS.md — the horizontal comparison with per-task matrix and rerun narrative.

Tests

16 new unit tests + 1 prompt regression (full suite: 329 passed, 63 skipped):

3 prompt tests for team_task_section vs build_team_instruction consistency
10 cross-adapter compatibility tests (every runner accepts team kwargs; openhands env shape; image-layering call counts)
1 swe_agent signature check
1 openhands image-layering (3 add_local_file + 2 run_commands + .pyc wipe)
1 CoopTaskTracker registry-override regression test
1 eval short-circuit regression (identical patches)

Ruff / format / mypy all green.

Real follow-ups

swe_agent in-container CLI install + scratchpad mount + auto-refresh — once this PR lands, swe_agent's sandbox layer needs the same treatment mini_swe_agent_v2 got for full team primitives. Today it only sees the team prompt section + env vars; the actual coop-task-* CLI isn't installed in its sandbox.
In-loop task-refresh for openhands_sdk — works without this thanks to the typed CoopTaskTracker tool, but a push-style refresh hook would close the agency gap further.
openhands patch.txt workflow — oh agents commit their patch.txt into the working tree mid-run, which forces every team-mode merge into a patch.txt conflict. Not strictly a bug (the merge falls back to lead-alone correctly) but it's noise in the eval logs.

Test plan

ruff check, ruff format --check, mypy, pytest tests/ (all green locally — 329 passed)
With OPENAI_API_KEY or ANTHROPIC_API_KEY exported, run uv run cooperbench run -a <adapter> -m <model> -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend docker
Modal-backend smoke: uv run cooperbench run -a codex -m gpt-5.5 -r dottxt_ai_outlines_task -t 1655 -f 1,3 --setting solo --backend modal — should finish in under 5 min (verifies the stdin fix)
openhands docker-eval guardrail: uv run cooperbench eval -n <oh_run> --backend docker — should refuse with a warning before doing any work
For openhands_sdk: confirm redis ready redis://r...modal.host:... line appears and metrics dict in result.json is populated (non-empty claims_per_agent)
For swe_agent: confirm install completes with '.[swe-agent]' and no numpy / boto3 / docker ImportError surfaces
Optionally: uv run cooperbench run -a <adapter> -m <model> -s core --setting team --backend docker -c 3 to reproduce the core-subset results
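
The stdin fix exercised by the Modal smoke test is a shell redirect (</dev/null). When a command is launched from Python instead, the same effect can be sketched with subprocess.DEVNULL. The child below is a stand-in that blocks until stdin reaches EOF; it is not the codex CLI, and this is not a claim about how the adapter launches it.

```python
import subprocess
import sys

# Stand-in child process: reads stdin to EOF, then reports how much it read.
child = [
    sys.executable,
    "-c",
    "import sys; data = sys.stdin.read(); print(len(data))",
]

result = subprocess.run(
    child,
    stdin=subprocess.DEVNULL,  # same effect as appending </dev/null in a shell
    capture_output=True,
    text=True,
    timeout=30,  # guard so a regression fails fast instead of hanging
)
```

With stdin left attached to an open pipe that never closes, the same child would wait indefinitely, which matches the hang described for the Modal sandbox.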

🤖 Generated with Claude Code

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating.

Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail.

Shared ``_coop`` module:
- ``coop_msg.py``: Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh``: pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it.
- ``prompt.py``: solo / coop / coop+git prompt assembly, agent-agnostic.
- ``runtime.py``: ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass. This is a coordination failure (agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file). Claude on the same task scored 2/2; Codex used the tools less aggressively on this run.

The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
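
The token accounting described in this commit (summing across ``turn.completed`` events) can be sketched as below. The event field layout (a ``usage`` object with ``input_tokens`` and ``output_tokens``) is an assumption made for illustration, and ``sum_turn_tokens`` is a hypothetical name rather than the adapter's parser.

```python
import json


def sum_turn_tokens(jsonl_text: str) -> dict:
    """Hypothetical parser: add up token usage over turn.completed events."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise interleaved in the stream
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}  # assumed field name
        for key in totals:
            totals[key] += int(usage.get(key, 0))
    return totals


sample = "\n".join(
    [
        '{"type": "turn.started"}',
        '{"type": "turn.completed", "usage": {"input_tokens": 100, "output_tokens": 20}}',
        "not json",
        '{"type": "turn.completed", "usage": {"input_tokens": 50, "output_tokens": 5}}',
    ]
)
totals = sum_turn_tokens(sample)
```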

Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives:

1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style: exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches.

Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``.

Evaluation is identical to coop (per-agent ``patch.txt`` evaluated per-feature), so no eval changes were needed beyond discovering ``team/`` log directories.

Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team)
- 4 misc

Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md)
- **0/1 pass rate**: both ``patch.txt`` files were empty. The members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only); a follow-up will tighten the prompt to make patch.txt submission the explicit final step.

Future PRs (intentionally out of scope here so this lands at a reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)

Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
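
The HSETNX-style claim from this commit can be illustrated with a short sketch. ``claim_task`` and the key layout below the ``cb:<run_id>:`` prefix are hypothetical, and the in-memory store only imitates the one call used here so the example runs without a Redis server; a redis-py client exposes ``hsetnx(name, key, value)`` with the same shape, and the real atomicity comes from the Redis server, not from this stand-in.

```python
class InMemoryHashStore:
    """Tiny stand-in exposing the hash calls used below (not thread-safe)."""

    def __init__(self) -> None:
        self._hashes: dict = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        # Set the field only if it does not exist yet; 1 = set, 0 = untouched.
        bucket = self._hashes.setdefault(name, {})
        if key in bucket:
            return 0
        bucket[key] = value
        return 1

    def hget(self, name: str, key: str):
        return self._hashes.get(name, {}).get(key)


def claim_task(client, run_id: str, task_id: str, agent_id: str) -> bool:
    """Hypothetical claim helper: True only for the first caller."""
    key = f"cb:{run_id}:task:{task_id}"  # layout after the prefix is assumed
    return bool(client.hsetnx(key, "owner", agent_id))


store = InMemoryHashStore()
first = claim_task(store, "run42", "t1", "agent1")
second = claim_task(store, "run42", "t1", "agent2")
owner = store.hget("cb:run42:task:t1", "owner")
```

Because the write and the existence check are a single server-side operation in Redis, two agents racing on the same task cannot both see a successful claim.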

…resh (#53)

Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run.

1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state.
2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty. It is the closest we can get to push-style delivery for opaque CLI agent loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries, the same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command.

Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table. Previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict.

Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)

Full suite: **311 passed**, 63 skipped.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task, so the prompt fix is doing real work.

Stacks on #52.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
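
The tempfile+replace write mentioned for the filesystem mirror is a standard pattern; a minimal sketch follows. ``write_json_atomic`` is a hypothetical helper written for illustration, not the code in ``_team/fs_mirror.py``.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Hypothetical helper: readers see the old file or the new one, never half."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so os.replace stays a
    # same-filesystem rename.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic swap over any existing file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "tasks" / "1.json"
    write_json_atomic(target, {"id": "1", "status": "open"})
    write_json_atomic(target, {"id": "1", "status": "done"})
    final_state = json.loads(target.read_text(encoding="utf-8"))
    leftovers = sorted(p.name for p in target.parent.iterdir())
```

An agent running ``cat`` on a task file during a refresh therefore reads either the previous snapshot or the new one in full.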

Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to parity with the CLI adapters for team mode. Before this commit they accepted the team kwargs but discarded them; now each one appends the team prompt section to the task it sends the agent, and (where the adapter actually controls the container) propagates ``CB_TEAM_*`` env vars + mounts the team scratchpad.

New helper: ``_team.team_task_section(agents, agent_id, team_role)`` returns ONLY the lead-or-member block + coop-task-* CLI usage, without the surrounding task/submission/git scaffolding that ``build_team_instruction`` adds. Python-loop adapters already have their own prompts covering messaging/git/submission, so they need only the new piece; CLI adapters keep using the bigger function.

Per-adapter wiring:
- ``mini_swe_agent_v2``: appends team_task_section to task; propagates CB_TEAM_* through env_kwargs["env"]; adds ``--add-host=host.docker.internal:host-gateway`` + scratchpad volume to docker run args; installs the team CLI scripts + pip redis in the container after env spin-up. The existing ``TeamPoller`` host-side hook (already in step()) still fires.
- ``openhands_sdk``: appends team_task_section to task; folds a new ``team_env`` dict into ``coop_info`` so ``_build_credentials_dict`` propagates CB_TEAM_* into the sandbox. Coop-task-* binary install in the OpenHands agent-server image is a follow-up: OpenHands manages its own image build and doesn't expose a clean post-start exec hook.
- ``swe_agent``: appends team_task_section to task. The SWE-agent framework's sandbox + agent loop is third-party and harder to instrument; everything beyond the prompt is a follow-up.

Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py: consistency between team_task_section and build_team_instruction, every registered runner accepts the team kwargs, openhands env keys, swe_agent signature

Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code + team + git (sanity check that the shared changes didn't regress the CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent, openhands_sdk) requires API keys (Anthropic for the three Python-loop adapters via litellm, OpenAI for codex) that aren't available in this environment. Unit tests cover the new wiring; the e2e validations should be run with real keys before relying on the per-adapter behavior.

Compatibility matrix is now:

| Adapter             | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
|---------------------|---------|-------------|--------------|------------------|----------|
| claude_code         | yes     | yes (full)  | n/a          | yes              | yes      |
| codex               | yes     | yes (full)  | n/a          | yes              | yes      |
| mini_swe_agent_v2   | yes     | yes (sec.)  | yes          | yes              | yes      |
| openhands_sdk       | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
| swe_agent           | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the documented gap from the prior commit's matrix: the ``coop-task-*`` binaries now ship into the OpenHands agent-server sandbox, layered onto the upstream ``-oh`` image via Modal's ``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no upstream image rebuild required). Triggered only when ``coop_info["team_env"]`` is set so solo / coop runs don't pay the ~10s first-build cost. Modal caches the layered image; subsequent team runs are instant.

Verified end-to-end: ran openhands_sdk team+git on dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran ``compgen -c | grep coop-task`` and got back all 7 wrappers (create / claim / update / list / request / respond / pending), so the install worked. Whether the model actually invokes the tools is a separate (coordination-quality) axis; in this run it discovered them but didn't use them, same as codex. Both patches applied; f1 14/14, f2 19/20.

Tests: 2 new (full suite: 326 passed)
- test_team_env_triggers_image_layering: verifies add_local_file + pip_install + run_commands fire with the right args when team mode is active
- test_no_layering_when_team_inactive: verifies solo / coop runs skip the image-build cost

Matrix update: openhands_sdk now reads: Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a / CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The codex team e2e (cx_team_v3) hit 0/2 with great coordination metrics (5/5 tasks done, 27s first claim, claims even), but neither agent ran ``git merge`` despite the prompt's "Recommended workflow" mentioning it. Both fetched their peer's branch (2 each) and then submitted only their own work, so the eval's naive diff-stacker produced syntactically broken Python.

The previous prompt buried the critical step in a "Concretely:" sentence at the end; gpt-5.5 didn't follow it. This rewrite:
- Renames the section ``## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING`` so the imperative is in the heading itself.
- Adds an explicit "Required final sequence — run this verbatim before exiting" block with the full fetch+merge+diff sequence, parameterized over every partner branch.
- Explains *why* (each agent's patch.txt is evaluated against every feature's tests; without the merge, the peer feature's symbols are missing → ImportError).
- Frames it the same way the patch.txt step is framed (REQUIRED, skip-at-your-loss), which the original prompt fix proved codex responds to.

Verified: re-ran cx_team_v4 (codex team+git, same task as v3). Git activity went from ``fetch=2 merge=0 push=0`` per agent → ``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both patches now contain both features' symbols. Pass rate v4: 33/34 tests (97%). f2 fully passes 20/20; f1 fails one test because gpt-5.5's merged code put the ``filters`` kwarg on a helper function rather than the ``prompt`` decorator (content quality, not coordination).

A second run (cx_team_v5) produced byte-identical 243-line patches on both agents: codex coordinated so well both ended up with the exact same merged tree. This surfaces a separate bench-side limitation: the eval's diff-stacker fails to apply patch B on top of patch A when every hunk already matches, producing an empty merged.patch. That's a real bug in ``eval/evaluate.py``'s coop merge step, NOT a coordination failure; codex did exactly what the prompt asked. Fix is a separate concern from team-mode wiring.

Tests still pass (existing prompt tests are content-agnostic; 326 / 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In team mode codex can coordinate so well that both agents end up with byte-identical patches (each fully merged the other's branch). The existing eval combiner sequence (apply patch1 → apply patch2 on top) chokes because every hunk in patch2 is already applied, producing an empty merged.patch and a downstream "No valid patches in input" failure even though both submissions are individually fine.

Fix in ``test_merged``: before invoking ``_setup_branches`` / ``_merge_naive``, ``cmp`` the two patches. If they match, copy patch1 to merged.patch (normalized via ``git apply --recount`` so agents that emit unified diffs with miscounted hunk headers still work) and skip the merge dance. Returns a fresh result with ``merge.status: "identical"`` so the caller can tell the short-circuit fired vs a real merge.

Verified on the codex-team e2e:
- cx_team_v5 (codex agents perfectly merged to identical 243-line patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
- cx_team_v4 (codex agents diverged on the merge): unchanged at f2 20/20 + f1 13/14 = 33/34 tests, still falls back to agent2-alone via apply_status: {'agent1': 'failed', ...}

I also briefly tried adding ``git apply --recount`` to ``_setup_branches``'s fallback chain, but that REGRESSED v4: it made agent1's malformed patch apply where it previously failed silently, triggering a real merge attempt that produced duplicate function definitions (broken Python) via union merge. The identical-patches short-circuit is the strictly-better fix: no regression, recovers the v5 case, and the malformed-hunk normalization only kicks in on the short-circuit path where it can't cause merge conflicts.

Also lands previously-uncommitted housekeeping:
- prompt.py: ruff-format-only diff on the merge-required block from the prior commit
- test_team_wiring.py: ruff --fix removed unused MagicMock imports
- test_gcp_backend.py / test_tasks.py: ruff --fix removed f-string-without-placeholder and unused-json import (both unrelated drift caught by the gate)

Tests: 1 new (full suite: 327 passed)
- ``test_test_merged_shortcircuits_on_identical_patches``: source inspection confirms the short-circuit branch + "identical" merge-status string exist in test_merged

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous openhands team runs (oh_team_v3) showed agents discovering the ``coop-task-*`` shell wrappers via ``compgen`` but never invoking them: gpt-5.5 strongly prefers typed tools registered with the LLM over arbitrary shell commands. This commit lands the architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered under the same name as openhands' built-in ``TaskTrackerTool`` so the registry resolution swaps it transparently.

Files:
* ``openhands/tools/task_tracker/coop_definition.py``: new tool definition + executor. Same ``TaskTrackerAction`` / ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-trip through the shared ``cb:<run_id>:`` Redis namespace that ``TaskListClient`` (host side) writes to. Tasks are auto-owned by the calling agent; ``view`` shows peer tasks prefixed with ``[<their_agent_id>]``. Registered under both ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing the module rebinds the latter to the Coop variant.
* ``openhands/tools/preset/default.py``: gains a ``team_mode`` kwarg (kept for API stability + tests; the actual swap happens server-side via the .pth/__init__ side-effect import, not by changing the host-side tool list). Pre-PR coop block split into a more nuanced team-mode prompt section that documents the TaskTracker → shared-list behavior.
* ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__``: layers two more bits into the Modal image at build time:
  - ``add_local_file`` of ``coop_definition.py`` to ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands install)
  - ``grep ... || echo`` appending ``from . import coop_definition`` to the package's ``__init__.py`` so the registration runs at import time.

Tests: 1 new + updated image-layering assertions
- ``test_importing_coop_definition_overrides_local_registration``: inspecting the registry's ``_MODULE_QUALNAMES`` confirms ``TaskTrackerTool.name`` resolves to ``coop_definition``'s registration after import.
- ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file`` calls + 2 ``run_commands`` layers (tool-file install + ``coop-task-*`` wrappers) and that the ``from . import coop_definition`` line is in the install commands.

Full suite: 329 passed. Ruff / format / mypy all green.

KNOWN LIMITATION (documented in coop_definition.py docstring): the openhands_sdk agent-server runs in a Modal sandbox that's network-isolated from the host Redis. The CoopTaskTracker is correctly registered and the LLM can call it, but every operation returns "Shared task list unavailable" because the sandbox can't ``socket.getaddrinfo("host.docker.internal")``. The fix is in the deployment layer (Modal tunnels, a Modal-hosted Redis, or running openhands directly via docker like the other adapters), not in this PR. Verified by oh_team_v10: agent ran ``coop-task-list`` first ("The coop CLI failed; I'll use the shared task tracker."), then fell back to TaskTrackerAction which still hit the local executor because the override + Redis combo can't actually work in Modal.

For non-Modal openhands deployments (e.g. local docker-backed openhands runs, future remote-conversation transports that share the host network), this tool works as designed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker swap from actually functioning. Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``, accessed via ``unencrypted_host:unencrypted_port``). Re-uses the existing ``connectors/redis_server.ModalRedisServer``, which was already written, just unused. Both the host TaskListClient and the agent sandboxes point at the same public TCP endpoint, so pre-seed and agent reads/writes share state. Falls back to local Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The adapter now ``add_local_file``s three pieces into the OpenHands image at build time:
   - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
   - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
   - ``_team_init_override.py`` → ``$OH_DIR/__init__.py`` (replaces upstream; same exports + a side-effect import of coop_definition so the Redis-backed executor overrides the local TaskTracker registration at first import).
   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle connections after a few minutes, so the Redis client opened for the pre-seed at startup gets closed before the 9-min agent run finishes. Re-open the client at harvest time using the same URL.

End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``-a openhands_sdk --setting team --git``:
- Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
- Both agents Submitted, 9m total
- Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
- Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0, time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2, agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
- Cost: $3.33

Tests: image-layering assertions expanded. ``add_local_file`` is now called 3 times (CLI helper, tool def, __init__ override), and the run_commands chain copies both files + wipes .pyc caches.

Full suite: 329 passed. Ruff / format / mypy all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three changes that together unblock swe_agent team-mode runs (and solo/coop runs too; the bug wasn't team-specific):

1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2`` in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``. The old package was renamed in v0.0.13; both swe_agent files had stale imports that failed at module load (TypeError or ModuleNotFoundError depending on how the framework was invoked), making every swe_agent invocation return Error before any LLM call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras in pyproject.toml. swe_agent's vendored framework imports these at module-load time even when the docker/S3/model paths are dormant, so a clean ``pip install '.[swe-agent]'`` without these would still ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with ``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5): both agents Submitted, patches 373 + 88 lines, both applied via git apply. Eval failed 0/2 due to a content-quality issue (``NameError: name 'Set' is not defined``: the agent used Set without importing it; both agents hit the exit_cost budget limit mid-implementation), but that's model variance, not adapter wiring.

swe_agent is unblocked: it runs end-to-end, produces patches, and the eval pipeline processes them. Coordination metrics are still empty (claims_per_agent: {}) because swe_agent doesn't yet have the in-container coop-task-* CLI install or in-loop task auto-refresh; those are tracked as follow-ups in the PR body. For now the swe_agent team-mode run just gets the team prompt section + env vars; full team-tool integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five compounding bugs prevented `claude_code`, `codex`, and `mini_swe_agent_v2` from reaching honest pass-rates on the core subset in team setting. All four now ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip() consumes " \n"), breaking last-hunk line counts so git apply rejected otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all: it did a raw .strip() on the patch.txt read, so every msa patch ended in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no long-running command, so the image's default CMD exited and every exec hit "Sandbox not found". Pass "sleep", "infinity" as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal because shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters.
- Team lead prompt buried the integration step at the bottom of a long workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist. Member prompt now opens with "stay in your lane" per the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch alone when the merged tree doesn't pass both features. Surfaced as merge.strategy="solo-agent1" / "solo-agent2". Credits the agent (typically the lead) who correctly integrated both features into one working patch but had it corrupted by union-merging with the other agent's partial implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
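
The normalize_patch fix above can be illustrated with a minimal sketch. This is not the repository's implementation: the function body and the empty-input behaviour are assumptions, and only the "trim newlines, not all whitespace" rule comes from the commit message.

```python
def normalize_patch(text: str) -> str:
    """Trim surrounding newlines only, then end with exactly one newline.

    Assumption: an empty or newline-only input normalizes to "".
    """
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A diff whose last hunk ends with a blank context line (a single space).
raw = "\n--- a/f.py\n+++ b/f.py\n@@ -1,2 +1,3 @@\n x = 1\n+y = 2\n \n\n"

naive = raw.strip() + "\n"    # old behaviour: strip() also eats the " " context line
fixed = normalize_patch(raw)  # new behaviour: the " " context line survives

print(repr(naive.splitlines()[-1]))
print(repr(fixed.splitlines()[-1]))
```

With the naive variant the hunk header still claims three new-side lines while only two remain, which is the kind of mismatch that makes git apply reject the patch.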

- dataset/subsets/core.json: 10-pair subset for quick agent comparisons. Stratified by repo (largest-remainder proportional allocation by full-dataset pair count) with a one-slot floor per primary language (Python / Go / Rust / TS). Reproducible via scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent frameworks on the core subset in team setting. Includes per-task pass/fail matrix annotated with the merge strategy used, plus the chronological narrative of the dozen reruns that surfaced each of the bugs fixed in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
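
For readers unfamiliar with largest-remainder allocation, here is a small sketch of the proportional step only. The repo names and pair counts are made up for illustration, the tie-break by name is an assumption, and the per-language one-slot floor and seeded sampling done by scripts/generate_core_subset.py are not modelled.

```python
from math import floor


def largest_remainder(counts: dict[str, int], slots: int) -> dict[str, int]:
    """Split `slots` across keys in proportion to `counts`."""
    total = sum(counts.values())
    quotas = {k: slots * v / total for k, v in counts.items()}
    # Everyone first gets the integer part of their quota.
    alloc = {k: floor(q) for k, q in quotas.items()}
    leftover = slots - sum(alloc.values())
    # Remaining slots go to the largest fractional remainders;
    # ties break by name so the result is deterministic (assumption).
    order = sorted(counts, key=lambda k: (-(quotas[k] - alloc[k]), k))
    for k in order[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical pair counts per repo, not taken from the dataset.
pair_counts = {"repo_a": 120, "repo_b": 45, "repo_c": 25, "repo_d": 10}
allocation = largest_remainder(pair_counts, 10)
print(allocation)
```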

Previously test_merged returned early with an error when both naive and union merge strategies hit conflicts, so the solo-agent fallback never got a chance to credit a team whose lead alone integrated both features. Now we write an empty merged.patch, let run_tests fail naturally on the merged tree, and fall through to the solo fallback. Doesn't change any of the current 40 eval results — union's merge=union attribute is tolerant enough that every task in the dataset produces some tree (potentially broken code with stitched-together lines); the broken-tree-tests-fail path already triggered the solo fallback. This just closes the defensive gap for future pathological cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the union-merge strategy and the member-only fallback from test_merged. The new chain is:

1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative (no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against both feature suites

Rationale: union merge concatenates conflicting hunks, which usually produces syntactically broken code; the cases where it accidentally produced a working tree were rewarding lucky non-overlap, not genuine coordination. The member-only fallback was symmetric to lead-only but incoherent under team-mode semantics (the lead is the designated integrator; if they didn't integrate, the team failed regardless of what the member's branch looks like).

Effect on the core-subset horizontal comparison:

  msa 6 → 6 (unchanged)
  oh  5 → 4 (loses pallets_jinja/1621, which was passing via union and concealed that oh's lead doesn't integrate)
  cc  5 → 5 (unchanged)
  cx  5 → 5 (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit their patch.txt into the working tree, which forces a merge conflict on patch.txt that union resolved while the actual source merge was non-conflicting). Under the stricter policy this gets routed through lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
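
The three-step chain above can be sketched as a small decision function. This is an illustration, not the code in test_merged: the function name, the callback, and the returned labels are hypothetical, chosen to match the legend (identical / naive / lead-alone).

```python
from typing import Callable


def choose_merge_path(
    lead_patch: str,
    member_patch: str,
    naive_merge_is_clean: Callable[[], bool],
) -> str:
    """Return which tree the evaluator should test (labels are illustrative)."""
    # 1. Byte-identical submissions: skip the merge entirely.
    if lead_patch == member_patch:
        return "identical"
    # 2. A clean naive 3-way merge is authoritative, whether its tests pass or fail.
    if naive_merge_is_clean():
        return "naive"
    # 3. Conflicts: test the lead's patch alone against both feature suites.
    return "lead-alone"


same = choose_merge_path("diff A\n", "diff A\n", lambda: False)
clean = choose_merge_path("diff A\n", "diff B\n", lambda: True)
conflict = choose_merge_path("diff A\n", "diff B\n", lambda: False)
print(same, clean, conflict)
```

Passing the merge attempt as a callback keeps the identical-patch check ahead of any git work, which mirrors the ordering described in the commit.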

codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.

openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
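
The stdin fix can be reproduced without codex. In the sketch below a small Python child stands in for any program that reads stdin until EOF, and `with_closed_stdin` is a hypothetical helper, not the actual _build_codex_command. It assumes a POSIX `sh` is available.

```python
import shlex
import subprocess
import sys


def with_closed_stdin(command: str) -> str:
    """Hypothetical helper: redirect stdin from /dev/null so the child sees EOF at once."""
    return f"{command} </dev/null"


# Stand-in for a CLI that blocks until stdin reaches EOF.
reader = "{} -c {}".format(
    shlex.quote(sys.executable),
    shlex.quote("import sys; print(len(sys.stdin.read()))"),
)

result = subprocess.run(
    ["sh", "-c", with_closed_stdin(reader)],
    capture_output=True,
    text=True,
    timeout=30,  # guard: without the redirect this could wait on an open stdin
)
print(result.stdout.strip())
```

When the command is launched as an argv list rather than a shell string, passing `stdin=subprocess.DEVNULL` to `subprocess.run` has the same effect.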

swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex", which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead.

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
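
To make the two workarounds concrete, the sketch below builds the command lines involved as plain data. Both helpers are hypothetical and do not reproduce swerex's real startup code; the image name, the `--port` argument, and the `||` fallback shape are assumptions. Only `--entrypoint ""` and `pipx run --spec swe-rex swerex-remote` come from the commit message.

```python
def docker_run_argv(image: str, startup: str, clear_entrypoint: bool) -> list[str]:
    """Hypothetical helper: argv for `docker run ... image sh -c "<startup>"`."""
    argv = ["docker", "run"]
    if clear_entrypoint:
        # Without this, an image ENTRYPOINT such as runner.sh receives
        # "sh" as its first argument instead of sh being the program run.
        argv += ["--entrypoint", ""]
    argv += [image, "sh", "-c", startup]
    return argv


def swerex_start_cmd(remote_args: str) -> str:
    """Hypothetical helper: prefer a pre-installed binary, else fetch it via pipx.

    `--spec swe-rex` names the package to install; `swerex-remote` is the
    executable that package provides.
    """
    return (
        f"swerex-remote {remote_args} || "
        f"pipx run --spec swe-rex swerex-remote {remote_args}"
    )


startup = swerex_start_cmd("--port 8000")  # illustrative arguments
argv = docker_run_argv("example/task-image:latest", startup, clear_entrypoint=True)
print(argv)
```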

Resolves the squash-merge conflicts from #52 landing on main. All conflicts followed the same pattern: this branch's HEAD contains #52's content plus the subsequent work on top, while main's squashed-merge commit contains only #52. Resolved each conflict by taking ours (HEAD), which preserves the cumulative state of:

- CHANGELOG: full Fixed/Changed/Added entries for team-mode bug fixes, eval policy change, core subset + benchmark doc, plus the original "team setting" bullet from #52
- _team/prompt.py: the stronger lead-prompt with the 5-point integration checklist (#52 had the older "buried integration" version)
- swe_agent/adapter.py: team-mode kwarg propagation + Docker backend dispatch + pipx --spec monkey-patch
- runner/team.py: openhands_sdk Modal-Redis tunnel branch
- everywhere else: my newer adapter changes are strict supersets of #52's

CI green locally: 329 tests passed, ruff clean, mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts:
#	CHANGELOG.md

#58) * agents/codex: add Codex adapter; lift shared coop bits into _coop Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating. Codex adapter highlights: - Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``. - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts. - Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events. - Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default. - Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail. Shared ``_coop`` module: - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin. - ``install_snippet.sh`` — pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it. - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent- agnostic. - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``. 
Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't. Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped. End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2: - codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly) - codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass — coordination failure (agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file). Claude on the same task scored 2/2; Codex used the tools less aggressively on this run. The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * runner: add team mode (lead + members + shared task list + scratchpad) Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives: 1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style — exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers. 2. 
A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim. 3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches. Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``. Evaluation is identical to coop — per-agent ``patch.txt`` evaluated per-feature — so no eval changes were needed beyond discovering ``team/`` log directories. Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up. Unit tests: 46 new - 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation) - 12 prompt (lead vs member branches, solo fallback, git interaction) - 3 runtime (env assembly, scratchpad mount args) - 4 metrics (happy path, unowned-at-end, empty log, multiple claims) - 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team) - 4 misc Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green. 
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode: - 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done`` - time_to_first_claim_seconds=34.2 - claims_per_agent={agent1: 2, agent2: 1} - updates_per_agent={agent1: 4, agent2: 3} - scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md) - **0/1 pass rate** — both ``patch.txt`` files were empty: the members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only) — a follow-up will tighten the prompt to make patch.txt submission the explicit final step. Future PRs (intentionally out of scope here so this lands at a reviewable size): - In-loop auto-refresh for the Python-loop adapters - MCP long-poll tool to give CLI adapters push-ish inbox semantics - Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape) - Filesystem mirror of the task list (CC-style ``ls`` artefacts) Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53) Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). 
Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. 
**Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. 
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: wire team prompt + env into the three Python-loop adapters Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to parity with the CLI adapters for team mode. Before this commit they accepted the team kwargs but discarded them; now each one appends the team prompt section to the task it sends the agent, and (where the adapter actually controls the container) propagates ``CB_TEAM_*`` env vars + mounts the team scratchpad. New helper: ``_team.team_task_section(agents, agent_id, team_role)`` returns ONLY the lead-or-member block + coop-task-* CLI usage, without the surrounding task/submission/git scaffolding that ``build_team_instruction`` adds. Python-loop adapters already have their own prompts covering messaging/git/submission, so they need only the new piece; CLI adapters keep using the bigger function. Per-adapter wiring: - ``mini_swe_agent_v2``: appends team_task_section to task; propagates CB_TEAM_* through env_kwargs["env"]; adds ``--add-host=host.docker.internal:host-gateway`` + scratchpad volume to docker run args; installs the team CLI scripts + pip redis in the container after env spin-up. The existing ``TeamPoller`` host-side hook (already in step()) still fires. - ``openhands_sdk``: appends team_task_section to task; folds a new ``team_env`` dict into ``coop_info`` so ``_build_credentials_dict`` propagates CB_TEAM_* into the sandbox. Coop-task-* binary install in the OpenHands agent-server image is a follow-up — OpenHands manages its own image build and doesn't expose a clean post-start exec hook. - ``swe_agent``: appends team_task_section to task. The SWE-agent framework's sandbox + agent loop is third-party and harder to instrument; everything beyond the prompt is a follow-up. 
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py: consistency between team_task_section and build_team_instruction, every registered runner accepts the team kwargs, openhands env keys, swe_agent signature

Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code + team + git (sanity check that the shared changes didn't regress the CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent, openhands_sdk) requires API keys (Anthropic for the three Python-loop adapters via litellm, OpenAI for codex) that aren't available in this environment. Unit tests cover the new wiring; the e2e validations should be run with real keys before relying on the per-adapter behavior.

Compatibility matrix is now:

| Adapter           | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
|-------------------|---------|-------------|--------------|------------------|----------|
| claude_code       | yes     | yes (full)  | n/a          | yes              | yes      |
| codex             | yes     | yes (full)  | n/a          | yes              | yes      |
| mini_swe_agent_v2 | yes     | yes (sec.)  | yes          | yes              | yes      |
| openhands_sdk     | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
| swe_agent         | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: layer coop-task-* install onto Modal image for team mode

Closes the documented gap from the prior commit's matrix: the ``coop-task-*`` binaries now ship into the OpenHands agent-server sandbox, layered onto the upstream ``-oh`` image via Modal's ``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no upstream image rebuild required). Triggered only when ``coop_info["team_env"]`` is set so solo / coop runs don't pay the ~10s first-build cost.
Modal caches the layered image; subsequent team runs are instant. Verified end-to-end: ran openhands_sdk team+git on dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran ``compgen -c | grep coop-task`` and got back all 7 wrappers (create / claim / update / list / request / respond / pending) — the install worked. Whether the model actually invokes the tools is a separate (coordination-quality) axis; in this run it discovered them but didn't use them, same as codex. Both patches applied; f1 14/14, f2 19/20. Tests: 2 new (full suite: 326 passed) - test_team_env_triggers_image_layering — verifies add_local_file + pip_install + run_commands fire with the right args when team mode is active - test_no_layering_when_team_inactive — verifies solo / coop runs skip the image-build cost Matrix update — openhands_sdk now reads: Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a / CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team prompt: make the merge-before-submit step REQUIRED The codex team e2e (cx_team_v3) hit 0/2 with great coordination metrics — 5/5 tasks done, 27s first claim, claims even — but neither agent ran ``git merge`` despite the prompt's "Recommended workflow" mentioning it. Both fetched their peer's branch (2 each) and then submitted only their own work, so the eval's naive diff-stacker produced syntactically broken Python. The previous prompt buried the critical step in a "Concretely:" sentence at the end; gpt-5.5 didn't follow it. This rewrite: - Renames the section ``## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING`` so the imperative is in the heading itself. - Adds an explicit "Required final sequence — run this verbatim before exiting" block with the full fetch+merge+diff sequence, parameterized over every partner branch. 
- Explains *why* (each agent's patch.txt is evaluated against every feature's tests; without the merge, the peer feature's symbols are missing → ImportError). - Frames it the same way the patch.txt step is framed (REQUIRED, skip-at-your-loss), which the original prompt fix proved codex responds to. Verified: re-ran cx_team_v4 (codex team+git, same task as v3). Git activity went from ``fetch=2 merge=0 push=0`` per agent → ``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both patches now contain both features' symbols. Pass rate v4: 33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test because gpt-5.5's merged code put the ``filters`` kwarg on a helper function rather than the ``prompt`` decorator (content quality, not coordination). A second run (cx_team_v5) produced byte-identical 243-line patches on both agents — codex coordinated so well both ended up with the exact same merged tree. This surfaces a separate bench-side limitation: the eval's diff-stacker fails to apply patch B on top of patch A when every hunk already matches, producing an empty merged.patch. That's a real bug in ``eval/evaluate.py``'s coop merge step, NOT a coordination failure — codex did exactly what the prompt asked. Fix is a separate concern from team-mode wiring. Tests still pass (existing prompt tests are content-agnostic; 326 / 63 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * eval: short-circuit when both agents submit identical merged patches In team mode codex can coordinate so well that both agents end up with byte-identical patches (each fully merged the other's branch). The existing eval combiner sequence — apply patch1 → apply patch2 on top — chokes because every hunk in patch2 is already applied, producing an empty merged.patch and a downstream "No valid patches in input" failure even though both submissions are individually fine. 
Fix in ``test_merged``: before invoking ``_setup_branches`` / ``_merge_naive``, ``cmp`` the two patches. If they match, copy patch1 to merged.patch (normalized via ``git apply --recount`` so agents that emit unified diffs with miscounted hunk headers still work) and skip the merge dance. Returns a fresh result with ``merge.status: "identical"`` so the caller can tell the short-circuit fired vs a real merge. Verified on the codex-team e2e: - cx_team_v5 (codex agents perfectly merged to identical 243-line patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20) - cx_team_v4 (codex agents diverged on the merge): unchanged at f2 20/20 + f1 13/14 = 33/34 tests, still falls back to agent2-alone via apply_status: {'agent1': 'failed', ...} I also briefly tried adding ``git apply --recount`` to ``_setup_branches``'s fallback chain, but that REGRESSED v4: it made agent1's malformed patch apply where it previously failed silently, triggering a real merge attempt that produced duplicate function definitions (broken Python) via union merge. The identical-patches short-circuit is the strictly-better fix — no regression, recovers the v5 case, and the malformed-hunk normalization only kicks in on the short-circuit path where it can't cause merge conflicts. 
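The short-circuit described above can be sketched as follows. This is a minimal illustration, not the code in ``test_merged``: the function name ``merge_or_shortcircuit`` and the ``merge_fn`` callback are hypothetical, and the ``git apply --recount`` normalization step is deliberately left out.

```python
import filecmp
import shutil
from pathlib import Path
from typing import Callable


def merge_or_shortcircuit(
    patch1: Path,
    patch2: Path,
    merged: Path,
    merge_fn: Callable[[Path, Path, Path], dict],
) -> dict:
    """Skip the merge when both patches are byte-identical.

    `merge_fn` is a hypothetical stand-in for the real merge path
    (branch setup + naive merge); it is only called when the patches differ.
    """
    # shallow=False forces a byte-by-byte comparison, like `cmp`.
    if filecmp.cmp(patch1, patch2, shallow=False):
        shutil.copyfile(patch1, merged)
        # A distinct status lets the caller tell the short-circuit
        # apart from a real merge.
        return {"merge": {"status": "identical"}}
    return merge_fn(patch1, patch2, merged)
```

Keeping the comparison ahead of any branch setup means the identical case never touches the merge machinery, which is why it cannot introduce new merge conflicts.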
Also lands previously-uncommitted housekeeping: - prompt.py: ruff-format-only diff on the merge-required block from the prior commit - test_team_wiring.py: ruff --fix removed unused MagicMock imports - test_gcp_backend.py / test_tasks.py: ruff --fix removed f-string-without-placeholder and unused-json import (both unrelated drift caught by the gate) Tests: 1 new (full suite: 327 passed) - ``test_test_merged_shortcircuits_on_identical_patches`` — source inspection confirms the short-circuit branch + "identical" merge-status string exist in test_merged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands: register Redis-backed CoopTaskTracker as a typed tool The previous openhands team runs (oh_team_v3) showed agents discovering the ``coop-task-*`` shell wrappers via ``compgen`` but never invoking them — gpt-5.5 strongly prefers typed tools registered with the LLM over arbitrary shell commands. This commit lands the architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered under the same name as openhands' built-in ``TaskTrackerTool`` so the registry resolution swaps it transparently. Files: * ``openhands/tools/task_tracker/coop_definition.py`` — new tool definition + executor. Same ``TaskTrackerAction`` / ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round- trip through the shared ``cb:<run_id>:`` Redis namespace that ``TaskListClient`` (host side) writes to. Tasks are auto-owned by the calling agent; ``view`` shows peer tasks prefixed with ``[<their_agent_id>]``. Registered under both ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing the module rebinds the latter to the Coop variant. * ``openhands/tools/preset/default.py`` — gains a ``team_mode`` kwarg (kept for API stability + tests; the actual swap happens server-side via the .pth/__init__ side-effect import, not by changing the host-side tool list). 
Pre-PR coop block split into a more nuanced team-mode prompt section that documents the TaskTracker → shared-list behavior. * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` — layers two more bits into the Modal image at build time: - ``add_local_file`` of ``coop_definition.py`` to ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands install) - ``grep ... || echo`` appending ``from . import coop_definition`` to the package's ``__init__.py`` so the registration runs at import time. Tests: 1 new + updated image-layering assertions - ``test_importing_coop_definition_overrides_local_registration``: inspecting the registry's ``_MODULE_QUALNAMES`` confirms ``TaskTrackerTool.name`` resolves to ``coop_definition``'s registration after import. - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file`` calls + 2 ``run_commands`` layers (tool-file install + ``coop-task-*`` wrappers) and that the ``from . import coop_definition`` line is in the install commands. Full suite: 329 passed. Ruff / format / mypy all green. KNOWN LIMITATION (documented in coop_definition.py docstring): the openhands_sdk agent-server runs in a Modal sandbox that's network-isolated from the host Redis. The CoopTaskTracker is correctly registered and the LLM can call it, but every operation returns "Shared task list unavailable" because the sandbox can't ``socket.getaddrinfo("host.docker.internal")``. The fix is in the deployment layer (Modal tunnels, a Modal-hosted Redis, or running openhands directly via docker like the other adapters), not in this PR — verified by oh_team_v10: agent ran ``coop-task-list`` first ("The coop CLI failed; I'll use the shared task tracker."), then fell back to TaskTrackerAction which still hit the local executor because the override + Redis combo can't actually work in Modal. For non-Modal openhands deployments (e.g. local docker-backed openhands runs, future remote-conversation transports that share the host network), this tool works as designed. 
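The same-name override used by this commit can be illustrated with a toy registry. Everything below is a stand-in written for illustration: ``register_tool`` / ``resolve_tool`` and the two empty classes are hypothetical and are not the openhands registry API; the only point is that a later registration under an existing name wins.

```python
# Toy name -> class registry (hypothetical; NOT the openhands registry API).
_REGISTRY: dict = {}


def register_tool(name: str, cls: type) -> None:
    """Bind `name` to `cls`; a later call with the same name rebinds it."""
    _REGISTRY[name] = cls


def resolve_tool(name: str) -> type:
    return _REGISTRY[name]


class TaskTrackerTool:
    """Stand-in for the built-in, local-only tracker."""


class CoopTaskTrackerTool:
    """Stand-in for the Redis-backed variant."""


# The built-in registration happens first.
register_tool("TaskTrackerTool", TaskTrackerTool)

# Importing the coop module registers the new tool under BOTH names,
# so lookups by the built-in name now resolve to the coop variant.
register_tool("CoopTaskTrackerTool", CoopTaskTrackerTool)
register_tool("TaskTrackerTool", CoopTaskTrackerTool)
```

Because the agent only ever asks for ``"TaskTrackerTool"``, the swap needs no change to the tool list it is given.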
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands team mode: end-to-end working with Modal-hosted Redis Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker swap from actually functioning. Three pieces, working together: 1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``, accessed via ``unencrypted_host:unencrypted_port``). Re-uses the existing ``connectors/redis_server.ModalRedisServer`` — it was already written, just unused. Both the host TaskListClient and the agent sandboxes point at the same public TCP endpoint, so pre-seed and agent reads/writes share state. Falls back to local Redis for the other adapters. 2. **CoopTaskTrackerTool injection into the Modal sandbox.** The adapter now ``add_local_file``s three pieces into the OpenHands image at build time: - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py`` - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py`` - ``_team_init_override.py`` → ``$OH_DIR/__init__.py`` (replaces upstream; same exports + a side-effect import of coop_definition so the Redis-backed executor overrides the local TaskTracker registration at first import). Plus a ``find -name '*.pyc' -delete`` to invalidate Python's bytecode cache so the new __init__ actually re-runs. 3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle connections after a few minutes, so the original Redis client pre-seed used at startup gets closed before the 9-min agent run finishes. Re-open the client at harvest time using the same URL. 
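The harvest-time reconnect in item 3 can be sketched like this. The helper name ``harvest_with_fresh_client`` and its ``connect`` / ``harvest`` callbacks are assumptions made for illustration; with the ``redis`` package, ``connect`` could plausibly be ``redis.Redis.from_url``.

```python
from typing import Any, Callable


def harvest_with_fresh_client(
    url: str,
    connect: Callable[[str], Any],
    harvest: Callable[[Any], Any],
) -> Any:
    """Open a new client from the same URL instead of reusing the startup one.

    The startup client may have been dropped by an idle tunnel, so harvest
    never depends on it. `connect` and `harvest` are hypothetical callbacks.
    """
    client = connect(url)
    try:
        return harvest(client)
    finally:
        # Close the short-lived client if it supports closing.
        close = getattr(client, "close", None)
        if callable(close):
            close()
```

Reconnecting unconditionally is simpler than probing the old connection, and the cost is one extra handshake per run.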
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``-a openhands_sdk --setting team --git``: - Modal Redis startup: ``redis ready redis://r450.modal.host:41899`` - Both agents Submitted, 9m total - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓) - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0, time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2, agent1:1}, updates_per_agent: {agent2:4, agent1:5}`` - Cost: $3.33 Tests: image-layering assertions expanded — ``add_local_file`` now called 3 times (CLI helper, tool def, __init__ override), and the run_commands chain copies both files + wipes .pyc caches. Full suite: 329 passed. Ruff / format / mypy all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * deps: add fakeredis to dev extras The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * swe_agent: fix import error + add missing transitive deps Three changes that together unblock swe_agent team-mode runs (and solo/coop runs too — the bug wasn't team-specific): 1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2`` in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``. 
The old package was renamed in v0.0.13; both swe_agent files had stale imports that failed at module load (TypeError or ModuleNotFoundError, depending on how the framework was invoked), making every swe_agent invocation return Error before any LLM call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras in pyproject.toml. swe_agent's vendored framework imports these at module-load time even when the docker/S3/model paths are dormant, so a clean ``pip install '.[swe-agent]'`` without them would still raise ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with ``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5): both agents Submitted, patches 373 + 88 lines, both applied via git apply. Eval failed 0/2 due to a content-quality issue (``NameError: name 'Set' is not defined``: the agent used Set without importing it, and both agents hit the exit_cost budget limit mid-implementation), but that's model variance, not adapter wiring. swe_agent is unblocked: it runs end-to-end, produces patches, and the eval pipeline processes them.

Coordination metrics are still empty (claims_per_agent: {}) because swe_agent doesn't yet have the in-container coop-task-* CLI install or in-loop task auto-refresh; those are tracked as follow-ups in the PR body. For now the swe_agent team-mode run just gets the team prompt section + env vars; full team-tool integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: team-mode bugs surfaced by 10-pair core run

Five compounding bugs prevented `claude_code`, `codex`, and `mini_swe_agent_v2` from reaching honest pass-rates on the core subset in team setting. All four benchmarked frameworks (these three plus `openhands_sdk`) now reach ≥ 5/10.
- normalize_patch ate trailing blank context lines (text.strip() consumes " \n"), breaking last-hunk line counts so git apply rejected otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all — raw .strip() on the patch.txt read, so every msa patch ended in a non-newline byte. Now routes through normalize_patch. - mini_swe_agent_v2 ModalEnvironment created the sandbox with no long-running command, so the image's default CMD exited and every exec hit "Sandbox not found". Pass "sleep", "infinity" as the positional command (matches eval backend's existing fix). - claude_code and codex adapters silently ignored --backend modal because shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters. - Team lead prompt buried the integration step at the bottom of a long workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist. Member prompt now opens with "stay in your lane" per the lead's PLAN.md. - eval test_merged now falls back to testing each agent's patch alone when the merged tree doesn't pass both features. Surfaced as merge.strategy="solo-agent1" / "solo-agent2". Credits the agent (typically the lead) who correctly integrated both features into one working patch but had it corrupted by union-merging with the other agent's partial implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs+data: core subset and team-mode horizontal comparison - dataset/subsets/core.json: 10-pair subset for quick agent comparisons. Stratified by repo (largest-remainder proportional allocation by full-dataset pair count) with a one-slot floor per primary language (Python / Go / Rust / TS). Reproducible via scripts/generate_core_subset.py (seed=42). - docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent frameworks on the core subset in team setting. 
Includes per-task pass/fail matrix annotated with the merge strategy used, plus the chronological narrative of the dozen reruns that surfaced each of the bugs fixed in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(eval): don't bail when union-merge also conflicts Previously test_merged returned early with an error when both naive and union merge strategies hit conflicts, so the solo-agent fallback never got a chance to credit a team whose lead alone integrated both features. Now we write an empty merged.patch, let run_tests fail naturally on the merged tree, and fall through to the solo fallback. Doesn't change any of the current 40 eval results — union's merge=union attribute is tolerant enough that every task in the dataset produces some tree (potentially broken code with stitched-together lines); the broken-tree-tests-fail path already triggered the solo fallback. This just closes the defensive gap for future pathological cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * eval(team): identical / naive / lead-when-naive-conflicts policy Drops the union-merge strategy and the member-only fallback from test_merged. The new chain is: 1. identical patches → skip-merge short-circuit 2. naive 3-way merge clean → merged-tree tests are authoritative (no further fallback) 3. naive merge conflicts → test the lead's patch.txt alone against both feature suites Rationale: union merge concatenates conflicting hunks, which usually produces syntactically broken code; the cases where it accidentally produced a working tree were rewarding lucky non-overlap, not genuine coordination. The member-only fallback was symmetric to lead-only but incoherent under team-mode semantics (the lead is the designated integrator; if they didn't integrate, the team failed regardless of what the member's branch looks like). 
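The three-step policy above reduces to a small decision function. This is a sketch only: ``choose_eval_strategy`` and its boolean inputs are hypothetical names, and the returned labels simply mirror the wording of the chain rather than any confirmed field values.

```python
def choose_eval_strategy(patches_identical: bool, naive_merge_clean: bool) -> str:
    """Pick which tree the feature tests run against (hypothetical helper).

    1. identical patches    -> skip the merge entirely
    2. clean naive merge    -> the merged tree is authoritative, no fallback
    3. naive merge conflict -> test the lead's patch.txt alone
    """
    if patches_identical:
        return "identical"
    if naive_merge_clean:
        return "naive"
    return "lead-alone"
```

Note that there is no branch for a member-only fallback or a union merge: once the naive merge is clean, a failing merged tree is simply a failure.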
Effect on the core-subset horizontal comparison:
- msa 6 → 6 (unchanged)
- oh 5 → 4 (loses pallets_jinja/1621; it was passing via union, which concealed that oh's lead doesn't integrate)
- cc 5 → 5 (unchanged)
- cx 5 → 5 (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit their patch.txt into the working tree, which forces a merge conflict on patch.txt that union resolved while the actual source merge was non-conflicting). Under the stricter policy this gets routed through lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(modal): codex stdin hang; eval guardrail for openhands_sdk

codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.

openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
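The codex stdin hang described in this commit comes down to a child process waiting for EOF on an inherited stdin. The sketch below is an analogous illustration in Python, not the adapter's code: it uses ``subprocess.DEVNULL`` where the actual fix appends ``</dev/null`` to a shell command, and the helper name ``read_stdin_length`` is hypothetical.

```python
import subprocess
import sys


def read_stdin_length() -> str:
    """Run a child that reads stdin to EOF, with stdin wired to the null device.

    With stdin=DEVNULL the child sees EOF immediately; if it inherited an
    open pipe that nobody closes, `sys.stdin.read()` would block instead.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"],
        stdin=subprocess.DEVNULL,  # Python-side equivalent of `</dev/null`
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip()
```

The child here reads zero bytes and exits at once, which is the behavior a non-tty `docker exec` provides implicitly.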
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(swe_agent): add --backend docker support swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default. Two upstream-swerex issues had to be worked around to make the docker path actually start a container: 1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image). 2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex" — which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead. Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team_harness: extract team mode as standalone harness + ablation flags Move team-mode primitives from cooperbench/agents/_team (private) to cooperbench/team_harness (public, library-shaped) so other benchmarks can consume the multi-agent coordination algorithm without depending on CooperBench's task layout. 
Adds TeamSession + TeamHarnessConfig: - TeamSession bundles per-run state (run_id, namespaced Redis URL, ordered agent list, scratchpad volume name) with the feature config and exposes adapter-facing factories that each return None / [] / {} when their feature is disabled, so adapter code paths collapse to one branch: coop_env.update(session.env_for(agent_id)) extra_run_args.extend(session.scratchpad_mount_args()) mcp_config = session.mcp_config(container_script_path=...) - TeamHarnessConfig is a frozen dataclass of five per-feature booleans (task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member role split is the always-on baseline -- without it team is just coop. Wires five --team-no-* CLI flags through cli.py -> runner.run -> runner.core -> runner.team -> each adapter. result.json now records team_features so post-hoc analysis can attribute deltas to the feature that was off. Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and openhands_agent_sdk now accept team_features kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt, env, mount, MCP, install) gate on the session's config. Tests: tests/agents/_team -> tests/team_harness (rename), new test_session.py (29 cases) covers the facade, four new ablation tests in tests/runner/test_team.py verify the runner-side gating. Full suite 363 passed, 63 skipped; ruff/format/mypy clean. End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker): - Default: writes task_log.json + tasks.json + metrics, cb-team-<run> volume created. - --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log / tasks files, empty metrics dict, no volume. team_features in result.json reflects the requested ablation. 
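The config and facade described above might look roughly like the sketch below. The five field names and the three method names come from this commit message; everything else is an assumption made for illustration, including which flag gates which factory, the ``CB_TEAM_RUN_ID`` / ``CB_TEAM_AGENT_ID`` variable names, the ``-v`` mount form, and the shape of the MCP config dict.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Five per-feature booleans; all on by default."""
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class TeamSession:
    run_id: str
    redis_url: str
    agents: list
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def env_for(self, agent_id: str) -> dict:
        # Assumption: env vars are gated on the task-list feature.
        if not self.config.task_list:
            return {}
        # Hypothetical variable names under the CB_TEAM_* prefix.
        return {"CB_TEAM_RUN_ID": self.run_id, "CB_TEAM_AGENT_ID": agent_id}

    def scratchpad_mount_args(self) -> list:
        if not self.config.scratchpad:
            return []
        # Volume name and mount point follow the team-mode description;
        # the exact docker flag form is an assumption.
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[dict]:
        if not self.config.mcp:
            return None
        # Hypothetical config shape.
        return {"command": "python", "args": [container_script_path]}
```

Because each factory returns an empty value when its feature is off, adapter code can call all three unconditionally, as in the ``coop_env.update(session.env_for(agent_id))`` pattern quoted above.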
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agents/codex: add Codex adapter; lift shared coop bits into _coop Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating. Codex adapter highlights: - Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``. - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts. - Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events. - Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default. - Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail. Shared ``_coop`` module: - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin. - ``install_snippet.sh`` — pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it. - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent- agnostic. - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``. 
Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't. Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped. End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2: - codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly) - codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass — coordination failure (agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file). Claude on the same task scored 2/2; Codex used the tools less aggressively on this run. The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * runner: add team mode (lead + members + shared task list + scratchpad) Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives: 1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style — exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers. 2. 
A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim. 3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches. Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``. Evaluation is identical to coop — per-agent ``patch.txt`` evaluated per-feature — so no eval changes were needed beyond discovering ``team/`` log directories. Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up. Unit tests: 46 new - 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation) - 12 prompt (lead vs member branches, solo fallback, git interaction) - 3 runtime (env assembly, scratchpad mount args) - 4 metrics (happy path, unowned-at-end, empty log, multiple claims) - 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team) - 4 misc Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green. 
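Deriving the coordination metrics from the audit log could look like the sketch below. The metric names are the ones listed in this commit; the log entry shape (dicts with ``ts``, ``event``, ``agent``) and the choice to measure the first claim relative to the first log entry are assumptions, and ``coordination_metrics`` is a hypothetical name.

```python
from collections import Counter


def coordination_metrics(log: list) -> dict:
    """Compute a subset of the team metrics from an audit log.

    Assumed entry shape: {"ts": float, "event": str, "agent": str}.
    """
    claims = [e for e in log if e["event"] == "claim"]
    updates = [e for e in log if e["event"] == "update"]
    first_claim = None
    if log and claims:
        # Assumption: the clock starts at the first logged mutation.
        first_claim = min(e["ts"] for e in claims) - log[0]["ts"]
    return {
        "time_to_first_claim_seconds": first_claim,
        "claims_per_agent": dict(Counter(e["agent"] for e in claims)),
        "updates_per_agent": dict(Counter(e["agent"] for e in updates)),
    }
```

An empty log yields ``None`` and empty dicts rather than raising, which matches the "empty log" case the metrics tests are said to cover.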
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode: - 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done`` - time_to_first_claim_seconds=34.2 - claims_per_agent={agent1: 2, agent2: 1} - updates_per_agent={agent1: 4, agent2: 3} - scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md) - **0/1 pass rate** — both ``patch.txt`` files were empty: the members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only) — a follow-up will tighten the prompt to make patch.txt submission the explicit final step. Future PRs (intentionally out of scope here so this lands at a reviewable size): - In-loop auto-refresh for the Python-loop adapters - MCP long-poll tool to give CLI adapters push-ish inbox semantics - Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape) - Filesystem mirror of the task list (CC-style ``ls`` artefacts) Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53) Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). 
Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. 
**Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. 
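The tempfile+replace write used by the filesystem mirror (item 1 of this commit) follows a standard pattern, sketched below. The helper name ``write_json_atomic`` is hypothetical and this is not the code in ``fs_mirror.py``; it only shows why readers never observe a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so readers see either the old file or the new one.

    The temp file lives in the destination directory so that os.replace
    is a same-filesystem rename.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
        # Atomically swap the finished file into place.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

An agent running ``cat`` on a task file during a refresh therefore gets a complete JSON document from either before or after the update.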
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: wire team prompt + env into the three Python-loop adapters Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to parity with the CLI adapters for team mode. Before this commit they accepted the team kwargs but discarded them; now each one appends the team prompt section to the task it sends the agent, and (where the adapter actually controls the container) propagates ``CB_TEAM_*`` env vars + mounts the team scratchpad. New helper: ``_team.team_task_section(agents, agent_id, team_role)`` returns ONLY the lead-or-member block + coop-task-* CLI usage, without the surrounding task/submission/git scaffolding that ``build_team_instruction`` adds. Python-loop adapters already have their own prompts covering messaging/git/submission, so they need only the new piece; CLI adapters keep using the bigger function. Per-adapter wiring: - ``mini_swe_agent_v2``: appends team_task_section to task; propagates CB_TEAM_* through env_kwargs["env"]; adds ``--add-host=host.docker.internal:host-gateway`` + scratchpad volume to docker run args; installs the team CLI scripts + pip redis in the container after env spin-up. The existing ``TeamPoller`` host-side hook (already in step()) still fires. - ``openhands_sdk``: appends team_task_section to task; folds a new ``team_env`` dict into ``coop_info`` so ``_build_credentials_dict`` propagates CB_TEAM_* into the sandbox. Coop-task-* binary install in the OpenHands agent-server image is a follow-up — OpenHands manages its own image build and doesn't expose a clean post-start exec hook. - ``swe_agent``: appends team_task_section to task. The SWE-agent framework's sandbox + agent loop is third-party and harder to instrument; everything beyond the prompt is a follow-up. 
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py: consistency between team_task_section and build_team_instruction, every registered runner accepts the team kwargs, openhands env keys, swe_agent signature

Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code + team + git (sanity check that the shared changes didn't regress the CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent, openhands_sdk) requires API keys (Anthropic for the three Python-loop adapters via litellm, OpenAI for codex) that aren't available in this environment. Unit tests cover the new wiring; the e2e validations should be run with real keys before relying on the per-adapter behavior.

Compatibility matrix is now:

| Adapter           | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
|-------------------|---------|-------------|--------------|------------------|----------|
| claude_code       | yes     | yes (full)  | n/a          | yes              | yes      |
| codex             | yes     | yes (full)  | n/a          | yes              | yes      |
| mini_swe_agent_v2 | yes     | yes (sec.)  | yes          | yes              | yes      |
| openhands_sdk     | yes     | yes (sec.)  | n/a          | NOT YET          | yes      |
| swe_agent         | yes     | yes (sec.)  | NOT YET      | NOT YET          | NOT YET  |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: layer coop-task-* install onto Modal image for team mode

Closes the documented gap from the prior commit's matrix: the ``coop-task-*`` binaries now ship into the OpenHands agent-server sandbox, layered onto the upstream ``-oh`` image via Modal's ``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no upstream image rebuild required). Triggered only when ``coop_info["team_env"]`` is set so solo / coop runs don't pay the ~10s first-build cost.
Modal caches the layered image; subsequent team runs are instant. Verified end-to-end: ran openhands_sdk team+git on dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran ``compgen -c | grep coop-task`` and got back all 7 wrappers (create / claim / update / list / request / respond / pending) — the install worked. Whether the model actually invokes the tools is a separate (coordination-quality) axis; in this run it discovered them but didn't use them, same as codex. Both patches applied; f1 14/14, f2 19/20. Tests: 2 new (full suite: 326 passed) - test_team_env_triggers_image_layering — verifies add_local_file + pip_install + run_commands fire with the right args when team mode is active - test_no_layering_when_team_inactive — verifies solo / coop runs skip the image-build cost Matrix update — openhands_sdk now reads: Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a / CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team prompt: make the merge-before-submit step REQUIRED The codex team e2e (cx_team_v3) hit 0/2 with great coordination metrics — 5/5 tasks done, 27s first claim, claims even — but neither agent ran ``git merge`` despite the prompt's "Recommended workflow" mentioning it. Both fetched their peer's branch (2 each) and then submitted only their own work, so the eval's naive diff-stacker produced syntactically broken Python. The previous prompt buried the critical step in a "Concretely:" sentence at the end; gpt-5.5 didn't follow it. This rewrite: - Renames the section ``## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING`` so the imperative is in the heading itself. - Adds an explicit "Required final sequence — run this verbatim before exiting" block with the full fetch+merge+diff sequence, parameterized over every partner branch. 
- Explains *why* (each agent's patch.txt is evaluated against every feature's tests; without the merge, the peer feature's symbols are missing → ImportError). - Frames it the same way the patch.txt step is framed (REQUIRED, skip-at-your-loss), which the original prompt fix proved codex responds to. Verified: re-ran cx_team_v4 (codex team+git, same task as v3). Git activity went from ``fetch=2 merge=0 push=0`` per agent → ``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both patches now contain both features' symbols. Pass rate v4: 33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test because gpt-5.5's merged code put the ``filters`` kwarg on a helper function rather than the ``prompt`` decorator (content quality, not coordination). A second run (cx_team_v5) produced byte-identical 243-line patches on both agents — codex coordinated so well both ended up with the exact same merged tree. This surfaces a separate bench-side limitation: the eval's diff-stacker fails to apply patch B on top of patch A when every hunk already matches, producing an empty merged.patch. That's a real bug in ``eval/evaluate.py``'s coop merge step, NOT a coordination failure — codex did exactly what the prompt asked. Fix is a separate concern from team-mode wiring. Tests still pass (existing prompt tests are content-agnostic; 326 / 63 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * eval: short-circuit when both agents submit identical merged patches In team mode codex can coordinate so well that both agents end up with byte-identical patches (each fully merged the other's branch). The existing eval combiner sequence — apply patch1 → apply patch2 on top — chokes because every hunk in patch2 is already applied, producing an empty merged.patch and a downstream "No valid patches in input" failure even though both submissions are individually fine. 
Fix in ``test_merged``: before invoking ``_setup_branches`` / ``_merge_naive``, ``cmp`` the two patches. If they match, copy patch1 to merged.patch (normalized via ``git apply --recount`` so agents that emit unified diffs with miscounted hunk headers still work) and skip the merge dance. Returns a fresh result with ``merge.status: "identical"`` so the caller can tell the short-circuit fired vs a real merge. Verified on the codex-team e2e: - cx_team_v5 (codex agents perfectly merged to identical 243-line patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20) - cx_team_v4 (codex agents diverged on the merge): unchanged at f2 20/20 + f1 13/14 = 33/34 tests, still falls back to agent2-alone via apply_status: {'agent1': 'failed', ...} I also briefly tried adding ``git apply --recount`` to ``_setup_branches``'s fallback chain, but that REGRESSED v4: it made agent1's malformed patch apply where it previously failed silently, triggering a real merge attempt that produced duplicate function definitions (broken Python) via union merge. The identical-patches short-circuit is the strictly-better fix — no regression, recovers the v5 case, and the malformed-hunk normalization only kicks in on the short-circuit path where it can't cause merge conflicts. 
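
The identical-patch short-circuit described above can be sketched roughly as follows. This is only an illustration, not the code in ``test_merged``: the function name ``merge_or_shortcircuit`` and the ``"needs-merge"`` status are hypothetical, and the ``git apply --recount`` normalization step is deliberately left out.

```python
from __future__ import annotations

import filecmp
import shutil
from pathlib import Path


def merge_or_shortcircuit(patch1: Path, patch2: Path, merged: Path) -> dict[str, str]:
    """Hypothetical helper: copy patch1 to merged when both patches are byte-identical."""
    # shallow=False forces a byte-by-byte comparison, similar to `cmp`.
    if filecmp.cmp(patch1, patch2, shallow=False):
        shutil.copyfile(patch1, merged)
        # "identical" is the merge status named in the commit message.
        return {"status": "identical"}
    # Differing patches would go through the real merge path (not sketched here).
    return {"status": "needs-merge"}
```

The point of the design is that the comparison happens before any branch setup, so the case where patch2 has nothing left to apply never reaches the combiner.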
Also lands previously-uncommitted housekeeping: - prompt.py: ruff-format-only diff on the merge-required block from the prior commit - test_team_wiring.py: ruff --fix removed unused MagicMock imports - test_gcp_backend.py / test_tasks.py: ruff --fix removed f-string-without-placeholder and unused-json import (both unrelated drift caught by the gate) Tests: 1 new (full suite: 327 passed) - ``test_test_merged_shortcircuits_on_identical_patches`` — source inspection confirms the short-circuit branch + "identical" merge-status string exist in test_merged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands: register Redis-backed CoopTaskTracker as a typed tool The previous openhands team runs (oh_team_v3) showed agents discovering the ``coop-task-*`` shell wrappers via ``compgen`` but never invoking them — gpt-5.5 strongly prefers typed tools registered with the LLM over arbitrary shell commands. This commit lands the architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered under the same name as openhands' built-in ``TaskTrackerTool`` so the registry resolution swaps it transparently. Files: * ``openhands/tools/task_tracker/coop_definition.py`` — new tool definition + executor. Same ``TaskTrackerAction`` / ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round- trip through the shared ``cb:<run_id>:`` Redis namespace that ``TaskListClient`` (host side) writes to. Tasks are auto-owned by the calling agent; ``view`` shows peer tasks prefixed with ``[<their_agent_id>]``. Registered under both ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing the module rebinds the latter to the Coop variant. * ``openhands/tools/preset/default.py`` — gains a ``team_mode`` kwarg (kept for API stability + tests; the actual swap happens server-side via the .pth/__init__ side-effect import, not by changing the host-side tool list). 
Pre-PR coop block split into a more nuanced team-mode prompt section that documents the TaskTracker → shared-list behavior. * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` — layers two more bits into the Modal image at build time: - ``add_local_file`` of ``coop_definition.py`` to ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands install) - ``grep ... || echo`` appending ``from . import coop_definition`` to the package's ``__init__.py`` so the registration runs at import time. Tests: 1 new + updated image-layering assertions - ``test_importing_coop_definition_overrides_local_registration``: inspecting the registry's ``_MODULE_QUALNAMES`` confirms ``TaskTrackerTool.name`` resolves to ``coop_definition``'s registration after import. - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file`` calls + 2 ``run_commands`` layers (tool-file install + ``coop-task-*`` wrappers) and that the ``from . import coop_definition`` line is in the install commands. Full suite: 329 passed. Ruff / format / mypy all green. KNOWN LIMITATION (documented in coop_definition.py docstring): the openhands_sdk agent-server runs in a Modal sandbox that's network-isolated from the host Redis. The CoopTaskTracker is correctly registered and the LLM can call it, but every operation returns "Shared task list unavailable" because the sandbox can't ``socket.getaddrinfo("host.docker.internal")``. The fix is in the deployment layer (Modal tunnels, a Modal-hosted Redis, or running openhands directly via docker like the other adapters), not in this PR — verified by oh_team_v10: agent ran ``coop-task-list`` first ("The coop CLI failed; I'll use the shared task tracker."), then fell back to TaskTrackerAction which still hit the local executor because the override + Redis combo can't actually work in Modal. For non-Modal openhands deployments (e.g. local docker-backed openhands runs, future remote-conversation transports that share the host network), this tool works as designed. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands team mode: end-to-end working with Modal-hosted Redis Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker swap from actually functioning. Three pieces, working together: 1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``, accessed via ``unencrypted_host:unencrypted_port``). Re-uses the existing ``connectors/redis_server.ModalRedisServer`` — it was already written, just unused. Both the host TaskListClient and the agent sandboxes point at the same public TCP endpoint, so pre-seed and agent reads/writes share state. Falls back to local Redis for the other adapters. 2. **CoopTaskTrackerTool injection into the Modal sandbox.** The adapter now ``add_local_file``s three pieces into the OpenHands image at build time: - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py`` - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py`` - ``_team_init_override.py`` → ``$OH_DIR/__init__.py`` (replaces upstream; same exports + a side-effect import of coop_definition so the Redis-backed executor overrides the local TaskTracker registration at first import). Plus a ``find -name '*.pyc' -delete`` to invalidate Python's bytecode cache so the new __init__ actually re-runs. 3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle connections after a few minutes, so the original Redis client pre-seed used at startup gets closed before the 9-min agent run finishes. Re-open the client at harvest time using the same URL. 
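
The harvest-time reconnect in step 3 amounts to opening a new client from the same URL rather than reusing the one created at startup. A minimal sketch, assuming the client is built by an injected ``connect`` callable (in a real run that might be ``redis.Redis.from_url``); the helper name ``harvest_with_fresh_client`` is hypothetical and not taken from the PR:

```python
from typing import Callable, TypeVar

C = TypeVar("C")
R = TypeVar("R")


def harvest_with_fresh_client(
    url: str,
    connect: Callable[[str], C],
    harvest: Callable[[C], R],
) -> R:
    """Open a new client for the harvest step instead of reusing the startup one."""
    # The startup connection may have been dropped by an idle tunnel,
    # so build a brand-new client from the same URL.
    client = connect(url)
    try:
        return harvest(client)
    finally:
        # Close the client if it exposes close(); harmless otherwise.
        close = getattr(client, "close", None)
        if callable(close):
            close()
```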
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``-a openhands_sdk --setting team --git``: - Modal Redis startup: ``redis ready redis://r450.modal.host:41899`` - Both agents Submitted, 9m total - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓) - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0, time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2, agent1:1}, updates_per_agent: {agent2:4, agent1:5}`` - Cost: $3.33 Tests: image-layering assertions expanded — ``add_local_file`` now called 3 times (CLI helper, tool def, __init__ override), and the run_commands chain copies both files + wipes .pyc caches. Full suite: 329 passed. Ruff / format / mypy all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * deps: add fakeredis to dev extras The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * swe_agent: fix import error + add missing transitive deps Three changes that together unblock swe_agent team-mode runs (and solo/coop runs too — the bug wasn't team-specific): 1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2`` in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``. 
The old package was renamed in v0.0.13; both swe_agent files had stale imports that no-op'd at module load (TypeError or ModuleNotFoundError depending on how the framework was invoked), making every swe_agent invocation return Error before any LLM call. 2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras in pyproject.toml. swe_agent's vendored framework imports these at module-load time even when the docker/S3/model paths are dormant, so a clean ``pip install '.[swe-agent]'`` without these would still ImportError on first invocation. 3. uv.lock refreshed with the new transitive deps. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with ``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5): both agents Submitted, patches 373 + 88 lines, both applied via git apply. Eval failed 0/2 due to a content-quality issue (``NameError: name 'Set' is not defined`` — agent used Set without importing it; both agents hit exit_cost budget limit mid-implementation), but that's model variance, not adapter wiring. swe_agent is unblocked: it runs end-to-end, produces patches, the eval pipeline processes them. Coordination metrics still empty (claims_per_agent: {}) because swe_agent doesn't yet have the in-container coop-task-* CLI install or in-loop task auto-refresh — those are tracked as follow-ups in the PR body. For now the swe_agent team-mode run just gets the team prompt section + env vars; full team-tool integration is a separate PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: team-mode bugs surfaced by 10-pair core run Five compounding bugs prevented `claude_code`, `codex`, and `mini_swe_agent_v2` from reaching honest pass-rates on the core subset in team setting. All four now ≥ 5/10. - normalize_patch ate trailing blank context lines (text.strip() consumes " \n"), breaking last-hunk line counts so git apply rejected otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only. 
- mini_swe_agent_v2 adapter wasn't normalizing patches at all — raw .strip() on the patch.txt read, so every msa patch ended in a non-newline byte. Now routes through normalize_patch. - mini_swe_agent_v2 ModalEnvironment created the sandbox with no long-running command, so the image's default CMD exited and every exec hit "Sandbox not found". Pass "sleep", "infinity" as the positional command (matches eval backend's existing fix). - claude_code and codex adapters silently ignored --backend modal because shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters. - Team lead prompt buried the integration step at the bottom of a long workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist. Member prompt now opens with "stay in your lane" per the lead's PLAN.md. - eval test_merged now falls back to testing each agent's patch alone when the merged tree doesn't pass both features. Surfaced as merge.strategy="solo-agent1" / "solo-agent2". Credits the agent (typically the lead) who correctly integrated both features into one working patch but had it corrupted by union-merging with the other agent's partial implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs+data: core subset and team-mode horizontal comparison - dataset/subsets/core.json: 10-pair subset for quick agent comparisons. Stratified by repo (largest-remainder proportional allocation by full-dataset pair count) with a one-slot floor per primary language (Python / Go / Rust / TS). Reproducible via scripts/generate_core_subset.py (seed=42). - docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent frameworks on the core subset in team setting. 
Includes per-task pass/fail matrix annotated with the merge strategy used, plus the chronological narrative of the dozen reruns that surfaced each of the bugs fixed in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(eval): don't bail when union-merge also conflicts Previously test_merged returned early with an error when both naive and union merge strategies hit conflicts, so the solo-agent fallback never got a chance to credit a team whose lead alone integrated both features. Now we write an empty merged.patch, let run_tests fail naturally on the merged tree, and fall through to the solo fallback. Doesn't change any of the current 40 eval results — union's merge=union attribute is tolerant enough that every task in the dataset produces some tree (potentially broken code with stitched-together lines); the broken-tree-tests-fail path already triggered the solo fallback. This just closes the defensive gap for future pathological cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * eval(team): identical / naive / lead-when-naive-conflicts policy Drops the union-merge strategy and the member-only fallback from test_merged. The new chain is: 1. identical patches → skip-merge short-circuit 2. naive 3-way merge clean → merged-tree tests are authoritative (no further fallback) 3. naive merge conflicts → test the lead's patch.txt alone against both feature suites Rationale: union merge concatenates conflicting hunks, which usually produces syntactically broken code; the cases where it accidentally produced a working tree were rewarding lucky non-overlap, not genuine coordination. The member-only fallback was symmetric to lead-only but incoherent under team-mode semantics (the lead is the designated integrator; if they didn't integrate, the team failed regardless of what the member's branch looks like). 
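
The three-step policy above can be read as a small decision function. The sketch below is only an illustration under stated assumptions: ``choose_eval_strategy`` is a hypothetical name, patches are modelled as strings, and ``naive_merge`` stands in for whatever performs the 3-way merge, returning ``None`` on conflict.

```python
from __future__ import annotations

from typing import Callable, Optional


def choose_eval_strategy(
    lead_patch: str,
    member_patch: str,
    naive_merge: Callable[[str, str], Optional[str]],
) -> tuple[str, str]:
    """Return (strategy, patch_to_test) following the identical / naive / lead-alone chain."""
    # 1. Identical patches: skip the merge entirely.
    if lead_patch == member_patch:
        return "identical", lead_patch
    # 2. Clean naive merge: the merged tree is authoritative, no further fallback.
    merged = naive_merge(lead_patch, member_patch)
    if merged is not None:
        return "naive", merged
    # 3. Conflict: test the lead's patch alone against both feature suites.
    return "lead-alone", lead_patch
```

Note that there is no branch for the member's patch alone, which mirrors the rationale that the lead is the designated integrator.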
Effect on the core-subset horizontal comparison:
- msa 6 → 6 (unchanged)
- oh 5 → 4 (loses pallets_jinja/1621; it was passing via union, which concealed that oh's lead doesn't integrate)
- cc 5 → 5 (unchanged)
- cx 5 → 5 (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit their patch.txt into the working tree, which forces a merge conflict on patch.txt that union resolved while the actual source merge was non-conflicting). Under the stricter policy this gets routed through lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(modal): codex stdin hang; eval guardrail for openhands_sdk

codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.

openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
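
The codex stdin hang described above is a general failure mode: a child process that reads stdin until EOF will block forever if the parent leaves stdin open. The actual fix is the shell redirect `</dev/null`; the sketch below shows the equivalent idea with Python's ``subprocess`` module, using a hypothetical helper ``run_without_stdin`` and a stand-in child process rather than codex itself.

```python
from __future__ import annotations

import subprocess
import sys


def run_without_stdin(argv: list[str], timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run a command with stdin attached to the null device so reads hit EOF at once."""
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,  # same effect as `</dev/null` in a shell command
        capture_output=True,
        text=True,
        timeout=timeout,
    )


if __name__ == "__main__":
    # Stand-in child: reads stdin to EOF, then reports how many characters it saw.
    child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    print(run_without_stdin(child).stdout.strip())
```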
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(swe_agent): add --backend docker support swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default. Two upstream-swerex issues had to be worked around to make the docker path actually start a container: 1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image). 2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex" — which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead. Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team_harness: extract team mode as standalone harness + ablation flags Move team-mode primitives from cooperbench/agents/_team (private) to cooperbench/team_harness (public, library-shaped) so other benchmarks can consume the multi-agent coordination algorithm without depending on CooperBench's task layout. 
Adds TeamSession + TeamHarnessConfig:
- TeamSession bundles per-run state (run_id, namespaced Redis URL, ordered agent list, scratchpad volume name) with the feature config and exposes adapter-facing factories that each return None / [] / {} when their feature is disabled, so adapter code paths collapse to one branch:

      coop_env.update(session.env_for(agent_id))
      extra_run_args.extend(session.scratchpad_mount_args())
      mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans (task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member role split is the always-on baseline -- without it team is just coop.

Wires five --team-no-* CLI flags through cli.py -> runner.run -> runner.core -> runner.team -> each adapter. result.json now records team_features so post-hoc analysis can attribute deltas to the feature that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and openhands_agent_sdk now accept team_features kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt, env, mount, MCP, install) gate on the session's config.

Tests: tests/agents/_team -> tests/team_harness (rename), new test_session.py (29 cases) covers the facade, four new ablation tests in tests/runner/test_team.py verify the runner-side gating. Full suite 363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run> volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log / tasks files, empty metrics dict, no volume. team_features in result.json reflects the requested ablation.
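
To illustrate the "factories return an empty value when the feature is off" pattern described above, here is a minimal sketch. It is not the project's ``TeamSession`` / ``TeamHarnessConfig`` code: the class names ``HarnessConfigSketch`` and ``SessionSketch`` are hypothetical, the default values are assumed, and only the scratchpad factory is shown.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class HarnessConfigSketch:
    """Five per-feature booleans; defaults of True are an assumption."""
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class SessionSketch:
    run_id: str
    config: HarnessConfigSketch = field(default_factory=HarnessConfigSketch)

    def scratchpad_mount_args(self) -> list[str]:
        # Disabled feature: return an empty list so callers can always .extend().
        if not self.config.scratchpad:
            return []
        # Volume name and mount point follow the cb-team-<run_id> convention.
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]
```

Because the disabled case returns ``[]`` instead of raising or returning ``None``, an adapter can call ``extra_run_args.extend(session.scratchpad_mount_args())`` unconditionally.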
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: team-harness ablation report (flash, codex/gpt-5.5)

Self-contained HTML report of the team-harness ablation + multi-agent comparison run on the flash subset (50 task pairs), codex/gpt-5.5, docker, 1 seed.

Contents:
- docs/team_harness_ablation_report.html: setting comparison (solo/coop/coop+git/team), one-feature-off ablation matrix, timing, findings, methodology, caveats. All numbers embedded inline.
- docs/team_harness_ablation_data/{core,flash}_ablation.csv: raw rows.
- scripts/run_team_ablation.py: sweep driver (config -> cooperbench run+eval).
- scripts/gen_ablation_report.py: regenerates the HTML from logs/.

Headline results (passed / 50, both-features-pass):

| Configuration        | Passed |
|----------------------|--------|
| coop msg-only        | 13     |
| team no-scratchpad   | 15     |
| team no-task_list    | 20     |
| solo                 | 24     |
| coop+git             | 28     |
| team no-mcp          | 30     |
| team no-auto_refresh | 30     |
| team baseline        | 31     |
| team no-protocol     | 35     |

Findings:
- scratchpad (-16) and task_list (-11) are load-bearing; removing either drops team below solo (two uncoordinated agents < one).
- mcp/auto_refresh/protocol show no positive effect for codex (auto_refresh is a no-op for CLI adapters by design; protocol-off even scored +4, i.e. mild overhead without payoff).
- Most multi-agent value is a shared code substrate, not orchestration: coop+git (56%) ~ team-scratchpad (62%) >> messaging-only coop (26%).

Caveat: team runs used the scratchpad for code-sharing, NOT --git, so "team vs coop+git" compares two sharing substrates; the team --git cell is untested (follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agents/codex: add Codex adapter; lift shared coop bits into _coop Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating. Codex adapter highlights: - Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``. - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts. - Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events. - Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default. - Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail. Shared ``_coop`` module: - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin. - ``install_snippet.sh`` — pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it. - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent- agnostic. - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``. 
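
``normalize_patch``, the last helper in the list above, exists so that a patch read from ``patch.txt`` keeps the trailing newline ``git apply`` needs. A minimal sketch of that idea, assuming the behavior is "trim only newline characters at both ends, then end with exactly one newline"; the name ``normalize_patch_sketch`` is hypothetical and the real helper may differ:

```python
def normalize_patch_sketch(text: str) -> str:
    """Trim surrounding newlines only and guarantee a single trailing newline."""
    # Strip "\n" characters only. A plain .strip() would also remove the
    # single space that marks a blank context line at the end of a hunk.
    body = text.strip("\n")
    # Treat a patch with no non-whitespace content as empty.
    if not body.strip():
        return ""
    return body + "\n"
```

For example, a patch ending in a blank context line (``" \n"``) keeps that line, whereas ``text.strip() + "\n"`` would drop it and leave the last hunk's line count wrong.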
Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't. Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped. End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2: - codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly) - codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass — coordination failure (agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file). Claude on the same task scored 2/2; Codex used the tools less aggressively on this run. The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * runner: add team mode (lead + members + shared task list + scratchpad) Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives: 1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style — exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers. 2. 
A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim.

3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches.

Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``. Evaluation is identical to coop (per-agent ``patch.txt`` evaluated per-feature), so no eval changes were needed beyond discovering ``team/`` log directories.

Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team)
- 4 misc

Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
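
As a rough illustration of how coordination metrics can be derived from an audit log, the sketch below computes three of the metrics named above. The log schema (dicts with ``ts``, ``agent``, and ``op`` keys) and the function name ``coordination_metrics`` are assumptions made for this example; the real audit-log format is not shown in this PR.

```python
from __future__ import annotations

from collections import Counter


def coordination_metrics(log: list[dict], run_start: float) -> dict:
    """Summarize claim/update activity from a hypothetical audit log."""
    claims = [entry for entry in log if entry["op"] == "claim"]
    updates = [entry for entry in log if entry["op"] == "update"]
    # Earliest claim timestamp, or None if nobody claimed anything.
    first_claim = min((entry["ts"] for entry in claims), default=None)
    return {
        "time_to_first_claim_seconds": (
            None if first_claim is None else round(first_claim - run_start, 1)
        ),
        "claims_per_agent": dict(Counter(entry["agent"] for entry in claims)),
        "updates_per_agent": dict(Counter(entry["agent"] for entry in updates)),
    }
```

An empty log yields ``None`` for the first-claim time and empty per-agent dicts, which matches the "empty log" case listed among the metrics tests.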
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode: - 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done`` - time_to_first_claim_seconds=34.2 - claims_per_agent={agent1: 2, agent2: 1} - updates_per_agent={agent1: 4, agent2: 3} - scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md) - **0/1 pass rate** — both ``patch.txt`` files were empty: the members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only) — a follow-up will tighten the prompt to make patch.txt submission the explicit final step. Future PRs (intentionally out of scope here so this lands at a reviewable size): - In-loop auto-refresh for the Python-loop adapters - MCP long-poll tool to give CLI adapters push-ish inbox semantics - Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape) - Filesystem mirror of the task list (CC-style ``ls`` artefacts) Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53) Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). 
Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. 
**Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. 
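The tempfile+replace write pattern used by the filesystem mirror (item 1 above) can be sketched as follows. This is an illustration under assumptions, not the contents of ``_team/fs_mirror.py``: ``write_json_atomic`` and ``mirror_tasks`` are hypothetical helpers, and the ``_index.json`` payload here is simply a sorted list of task ids.

```python
import json
import os
import tempfile


def write_json_atomic(path, payload):
    """Write JSON so that readers see either the old file or the new one, never a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def mirror_tasks(tasks, root):
    """Snapshot tasks as one <id>.json per task plus an _index.json listing the ids."""
    for task in tasks:
        write_json_atomic(os.path.join(root, f"{task['id']}.json"), task)
    write_json_atomic(os.path.join(root, "_index.json"), sorted(t["id"] for t in tasks))
```

Because each file is renamed into place only after it has been fully written, an agent running ``cat`` on a task file mid-refresh reads a complete JSON document.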
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: wire team prompt + env into the three Python-loop adapters Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to parity with the CLI adapters for team mode. Before this commit they accepted the team kwargs but discarded them; now each one appends the team prompt section to the task it sends the agent, and (where the adapter actually controls the container) propagates ``CB_TEAM_*`` env vars + mounts the team scratchpad. New helper: ``_team.team_task_section(agents, agent_id, team_role)`` returns ONLY the lead-or-member block + coop-task-* CLI usage, without the surrounding task/submission/git scaffolding that ``build_team_instruction`` adds. Python-loop adapters already have their own prompts covering messaging/git/submission, so they need only the new piece; CLI adapters keep using the bigger function. Per-adapter wiring: - ``mini_swe_agent_v2``: appends team_task_section to task; propagates CB_TEAM_* through env_kwargs["env"]; adds ``--add-host=host.docker.internal:host-gateway`` + scratchpad volume to docker run args; installs the team CLI scripts + pip redis in the container after env spin-up. The existing ``TeamPoller`` host-side hook (already in step()) still fires. - ``openhands_sdk``: appends team_task_section to task; folds a new ``team_env`` dict into ``coop_info`` so ``_build_credentials_dict`` propagates CB_TEAM_* into the sandbox. Coop-task-* binary install in the OpenHands agent-server image is a follow-up — OpenHands manages its own image build and doesn't expose a clean post-start exec hook. - ``swe_agent``: appends team_task_section to task. The SWE-agent framework's sandbox + agent loop is third-party and harder to instrument; everything beyond the prompt is a follow-up. 
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py: consistency between team_task_section and build_team_instruction, every registered runner accepts the team kwargs, openhands env keys, swe_agent signature

Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code + team + git (sanity check that the shared changes didn't regress the CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent, openhands_sdk) requires API keys (Anthropic for the three Python-loop adapters via litellm, OpenAI for codex) that aren't available in this environment. Unit tests cover the new wiring; the e2e validations should be run with real keys before relying on the per-adapter behavior.

Compatibility matrix is now:

| Adapter             | Accepts kwargs | Team prompt    | Auto-refresh | CLI in container | CB_TEAM_* env |
|---------------------|----------------|----------------|--------------|------------------|---------------|
| claude_code         | yes            | yes (full)     | n/a          | yes              | yes           |
| codex               | yes            | yes (full)     | n/a          | yes              | yes           |
| mini_swe_agent_v2   | yes            | yes (section)  | yes          | yes              | yes           |
| openhands_sdk       | yes            | yes (section)  | n/a          | NOT YET          | yes           |
| swe_agent           | yes            | yes (section)  | NOT YET      | NOT YET          | NOT YET       |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: layer coop-task-* install onto Modal image for team mode

Closes the documented gap from the prior commit's matrix: the ``coop-task-*`` binaries now ship into the OpenHands agent-server sandbox, layered onto the upstream ``-oh`` image via Modal's ``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no upstream image rebuild required). Triggered only when ``coop_info["team_env"]`` is set so solo / coop runs don't pay the ~10s first-build cost.
Modal caches the layered image; subsequent team runs are instant. Verified end-to-end: ran openhands_sdk team+git on dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran ``compgen -c | grep coop-task`` and got back all 7 wrappers (create / claim / update / list / request / respond / pending) — the install worked. Whether the model actually invokes the tools is a separate (coordination-quality) axis; in this run it discovered them but didn't use them, same as codex. Both patches applied; f1 14/14, f2 19/20. Tests: 2 new (full suite: 326 passed) - test_team_env_triggers_image_layering — verifies add_local_file + pip_install + run_commands fire with the right args when team mode is active - test_no_layering_when_team_inactive — verifies solo / coop runs skip the image-build cost Matrix update — openhands_sdk now reads: Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a / CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team prompt: make the merge-before-submit step REQUIRED The codex team e2e (cx_team_v3) hit 0/2 with great coordination metrics — 5/5 tasks done, 27s first claim, claims even — but neither agent ran ``git merge`` despite the prompt's "Recommended workflow" mentioning it. Both fetched their peer's branch (2 each) and then submitted only their own work, so the eval's naive diff-stacker produced syntactically broken Python. The previous prompt buried the critical step in a "Concretely:" sentence at the end; gpt-5.5 didn't follow it. This rewrite: - Renames the section ``## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING`` so the imperative is in the heading itself. - Adds an explicit "Required final sequence — run this verbatim before exiting" block with the full fetch+merge+diff sequence, parameterized over every partner branch. 
- Explains *why* (each agent's patch.txt is evaluated against every feature's tests; without the merge, the peer feature's symbols are missing → ImportError). - Frames it the same way the patch.txt step is framed (REQUIRED, skip-at-your-loss), which the original prompt fix proved codex responds to. Verified: re-ran cx_team_v4 (codex team+git, same task as v3). Git activity went from ``fetch=2 merge=0 push=0`` per agent → ``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both patches now contain both features' symbols. Pass rate v4: 33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test because gpt-5.5's merged code put the ``filters`` kwarg on a helper function rather than the ``prompt`` decorator (content quality, not coordination). A second run (cx_team_v5) produced byte-identical 243-line patches on both agents — codex coordinated so well both ended up with the exact same merged tree. This surfaces a separate bench-side limitation: the eval's diff-stacker fails to apply patch B on top of patch A when every hunk already matches, producing an empty merged.patch. That's a real bug in ``eval/evaluate.py``'s coop merge step, NOT a coordination failure — codex did exactly what the prompt asked. Fix is a separate concern from team-mode wiring. Tests still pass (existing prompt tests are content-agnostic; 326 / 63 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * eval: short-circuit when both agents submit identical merged patches In team mode codex can coordinate so well that both agents end up with byte-identical patches (each fully merged the other's branch). The existing eval combiner sequence — apply patch1 → apply patch2 on top — chokes because every hunk in patch2 is already applied, producing an empty merged.patch and a downstream "No valid patches in input" failure even though both submissions are individually fine. 
Fix in ``test_merged``: before invoking ``_setup_branches`` / ``_merge_naive``, ``cmp`` the two patches. If they match, copy patch1 to merged.patch (normalized via ``git apply --recount`` so agents that emit unified diffs with miscounted hunk headers still work) and skip the merge dance. Returns a fresh result with ``merge.status: "identical"`` so the caller can tell the short-circuit fired vs a real merge. Verified on the codex-team e2e: - cx_team_v5 (codex agents perfectly merged to identical 243-line patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20) - cx_team_v4 (codex agents diverged on the merge): unchanged at f2 20/20 + f1 13/14 = 33/34 tests, still falls back to agent2-alone via apply_status: {'agent1': 'failed', ...} I also briefly tried adding ``git apply --recount`` to ``_setup_branches``'s fallback chain, but that REGRESSED v4: it made agent1's malformed patch apply where it previously failed silently, triggering a real merge attempt that produced duplicate function definitions (broken Python) via union merge. The identical-patches short-circuit is the strictly-better fix — no regression, recovers the v5 case, and the malformed-hunk normalization only kicks in on the short-circuit path where it can't cause merge conflicts. 
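A minimal sketch of the identical-patches short-circuit, assuming the two patches are plain files on disk. It is not the code in ``test_merged``: ``shortcircuit_identical`` is a hypothetical helper, ``filecmp.cmp`` stands in for the ``cmp`` call, and the ``git apply --recount`` normalization described above is deliberately left out.

```python
import filecmp
import shutil


def shortcircuit_identical(patch1, patch2, merged):
    """Copy patch1 to merged and report "identical" when both patches match byte for byte.

    Returns None when the patches differ, so the caller can fall through to a real merge.
    """
    # shallow=False forces a content comparison instead of trusting os.stat metadata.
    if not filecmp.cmp(patch1, patch2, shallow=False):
        return None
    shutil.copyfile(patch1, merged)
    return {"merge": {"status": "identical"}}
```

Returning ``None`` for the non-identical case keeps the short-circuit strictly additive: the existing merge path runs unchanged whenever the patches differ.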
Also lands previously-uncommitted housekeeping: - prompt.py: ruff-format-only diff on the merge-required block from the prior commit - test_team_wiring.py: ruff --fix removed unused MagicMock imports - test_gcp_backend.py / test_tasks.py: ruff --fix removed f-string-without-placeholder and unused-json import (both unrelated drift caught by the gate) Tests: 1 new (full suite: 327 passed) - ``test_test_merged_shortcircuits_on_identical_patches`` — source inspection confirms the short-circuit branch + "identical" merge-status string exist in test_merged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands: register Redis-backed CoopTaskTracker as a typed tool The previous openhands team runs (oh_team_v3) showed agents discovering the ``coop-task-*`` shell wrappers via ``compgen`` but never invoking them — gpt-5.5 strongly prefers typed tools registered with the LLM over arbitrary shell commands. This commit lands the architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered under the same name as openhands' built-in ``TaskTrackerTool`` so the registry resolution swaps it transparently. Files: * ``openhands/tools/task_tracker/coop_definition.py`` — new tool definition + executor. Same ``TaskTrackerAction`` / ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round- trip through the shared ``cb:<run_id>:`` Redis namespace that ``TaskListClient`` (host side) writes to. Tasks are auto-owned by the calling agent; ``view`` shows peer tasks prefixed with ``[<their_agent_id>]``. Registered under both ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing the module rebinds the latter to the Coop variant. * ``openhands/tools/preset/default.py`` — gains a ``team_mode`` kwarg (kept for API stability + tests; the actual swap happens server-side via the .pth/__init__ side-effect import, not by changing the host-side tool list). 
Pre-PR coop block split into a more nuanced team-mode prompt section that documents the TaskTracker → shared-list behavior. * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` — layers two more bits into the Modal image at build time: - ``add_local_file`` of ``coop_definition.py`` to ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands install) - ``grep ... || echo`` appending ``from . import coop_definition`` to the package's ``__init__.py`` so the registration runs at import time. Tests: 1 new + updated image-layering assertions - ``test_importing_coop_definition_overrides_local_registration``: inspecting the registry's ``_MODULE_QUALNAMES`` confirms ``TaskTrackerTool.name`` resolves to ``coop_definition``'s registration after import. - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file`` calls + 2 ``run_commands`` layers (tool-file install + ``coop-task-*`` wrappers) and that the ``from . import coop_definition`` line is in the install commands. Full suite: 329 passed. Ruff / format / mypy all green. KNOWN LIMITATION (documented in coop_definition.py docstring): the openhands_sdk agent-server runs in a Modal sandbox that's network-isolated from the host Redis. The CoopTaskTracker is correctly registered and the LLM can call it, but every operation returns "Shared task list unavailable" because the sandbox can't ``socket.getaddrinfo("host.docker.internal")``. The fix is in the deployment layer (Modal tunnels, a Modal-hosted Redis, or running openhands directly via docker like the other adapters), not in this PR — verified by oh_team_v10: agent ran ``coop-task-list`` first ("The coop CLI failed; I'll use the shared task tracker."), then fell back to TaskTrackerAction which still hit the local executor because the override + Redis combo can't actually work in Modal. For non-Modal openhands deployments (e.g. local docker-backed openhands runs, future remote-conversation transports that share the host network), this tool works as designed. 
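The same-name registration trick described in this commit can be illustrated with a toy registry. This is a sketch only, not the openhands registry API: ``register_tool``, ``resolve_tool`` and both tracker classes are hypothetical, and the only property shown is that a later registration under an existing name rebinds it.

```python
from typing import Callable, Dict

# Toy name -> factory registry (hypothetical; not openhands' real registry).
_REGISTRY: Dict[str, Callable[[], object]] = {}


def register_tool(name: str, factory: Callable[[], object]) -> None:
    """Bind name to factory; a later registration under the same name wins."""
    _REGISTRY[name] = factory


def resolve_tool(name: str) -> object:
    return _REGISTRY[name]()


class LocalTaskTracker:
    backend = "local"


class CoopTaskTracker:
    backend = "redis"


# Built-in registration, as it would happen when the stock package is imported.
register_tool("TaskTrackerTool", LocalTaskTracker)

# Side-effect import of the coop module: register under its own name AND the
# built-in name, so callers asking for "TaskTrackerTool" get the coop variant.
register_tool("CoopTaskTrackerTool", CoopTaskTracker)
register_tool("TaskTrackerTool", CoopTaskTracker)
```

After the second block runs, ``resolve_tool("TaskTrackerTool").backend`` is ``"redis"``, which is why the swap depends on import order: the overriding module must be imported after the built-in registration.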
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands team mode: end-to-end working with Modal-hosted Redis

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker swap from actually functioning. Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``, accessed via ``unencrypted_host:unencrypted_port``). Re-uses the existing ``connectors/redis_server.ModalRedisServer``, which was already written but unused. Both the host TaskListClient and the agent sandboxes point at the same public TCP endpoint, so pre-seed and agent reads/writes share state. Falls back to local Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The adapter now ``add_local_file``s three pieces into the OpenHands image at build time:
   - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
   - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
   - ``_team_init_override.py`` → ``$OH_DIR/__init__.py`` (replaces upstream; same exports + a side-effect import of coop_definition so the Redis-backed executor overrides the local TaskTracker registration at first import).

   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle connections after a few minutes, so the Redis client opened at startup for the pre-seed is closed before the 9-min agent run finishes. The fix re-opens the client at harvest time using the same URL.
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``-a openhands_sdk --setting team --git``: - Modal Redis startup: ``redis ready redis://r450.modal.host:41899`` - Both agents Submitted, 9m total - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓) - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0, time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2, agent1:1}, updates_per_agent: {agent2:4, agent1:5}`` - Cost: $3.33 Tests: image-layering assertions expanded — ``add_local_file`` now called 3 times (CLI helper, tool def, __init__ override), and the run_commands chain copies both files + wipes .pyc caches. Full suite: 329 passed. Ruff / format / mypy all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * deps: add fakeredis to dev extras The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * swe_agent: fix import error + add missing transitive deps Three changes that together unblock swe_agent team-mode runs (and solo/coop runs too — the bug wasn't team-specific): 1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2`` in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``. 
The old package was renamed in v0.0.13; both swe_agent files had stale imports that failed at module load (raising TypeError or ModuleNotFoundError depending on how the framework was invoked), making every swe_agent invocation return Error before any LLM call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras in pyproject.toml. swe_agent's vendored framework imports these at module-load time even when the docker/S3/model paths are dormant, so a clean ``pip install '.[swe-agent]'`` without these would still raise ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with ``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5): both agents Submitted, patches 373 + 88 lines, both applied via git apply. Eval failed 0/2 due to a content-quality issue (``NameError: name 'Set' is not defined``: the agent used Set without importing it, and both agents hit the exit_cost budget limit mid-implementation), but that's model variance, not adapter wiring.

swe_agent is unblocked: it runs end-to-end, produces patches, the eval pipeline processes them. Coordination metrics still empty (claims_per_agent: {}) because swe_agent doesn't yet have the in-container coop-task-* CLI install or in-loop task auto-refresh; those are tracked as follow-ups in the PR body. For now the swe_agent team-mode run just gets the team prompt section + env vars; full team-tool integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: team-mode bugs surfaced by 10-pair core run

Five compounding bugs prevented `claude_code`, `codex`, and `mini_swe_agent_v2` from reaching honest pass-rates on the core subset in team setting. All four frameworks in the core-subset comparison now score ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip() consumes " \n"), breaking last-hunk line counts so git apply rejected otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all — raw .strip() on the patch.txt read, so every msa patch ended in a non-newline byte. Now routes through normalize_patch. - mini_swe_agent_v2 ModalEnvironment created the sandbox with no long-running command, so the image's default CMD exited and every exec hit "Sandbox not found". Pass "sleep", "infinity" as the positional command (matches eval backend's existing fix). - claude_code and codex adapters silently ignored --backend modal because shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters. - Team lead prompt buried the integration step at the bottom of a long workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist. Member prompt now opens with "stay in your lane" per the lead's PLAN.md. - eval test_merged now falls back to testing each agent's patch alone when the merged tree doesn't pass both features. Surfaced as merge.strategy="solo-agent1" / "solo-agent2". Credits the agent (typically the lead) who correctly integrated both features into one working patch but had it corrupted by union-merging with the other agent's partial implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs+data: core subset and team-mode horizontal comparison - dataset/subsets/core.json: 10-pair subset for quick agent comparisons. Stratified by repo (largest-remainder proportional allocation by full-dataset pair count) with a one-slot floor per primary language (Python / Go / Rust / TS). Reproducible via scripts/generate_core_subset.py (seed=42). - docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent frameworks on the core subset in team setting. 
Includes per-task pass/fail matrix annotated with the merge strategy used, plus the chronological narrative of the dozen reruns that surfaced each of the bugs fixed in the previous commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(eval): don't bail when union-merge also conflicts Previously test_merged returned early with an error when both naive and union merge strategies hit conflicts, so the solo-agent fallback never got a chance to credit a team whose lead alone integrated both features. Now we write an empty merged.patch, let run_tests fail naturally on the merged tree, and fall through to the solo fallback. Doesn't change any of the current 40 eval results — union's merge=union attribute is tolerant enough that every task in the dataset produces some tree (potentially broken code with stitched-together lines); the broken-tree-tests-fail path already triggered the solo fallback. This just closes the defensive gap for future pathological cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * eval(team): identical / naive / lead-when-naive-conflicts policy Drops the union-merge strategy and the member-only fallback from test_merged. The new chain is: 1. identical patches → skip-merge short-circuit 2. naive 3-way merge clean → merged-tree tests are authoritative (no further fallback) 3. naive merge conflicts → test the lead's patch.txt alone against both feature suites Rationale: union merge concatenates conflicting hunks, which usually produces syntactically broken code; the cases where it accidentally produced a working tree were rewarding lucky non-overlap, not genuine coordination. The member-only fallback was symmetric to lead-only but incoherent under team-mode semantics (the lead is the designated integrator; if they didn't integrate, the team failed regardless of what the member's branch looks like). 
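The three-step policy above can be sketched as a small decision function. This is an illustration under assumptions rather than the logic in ``test_merged``: ``choose_eval_strategy`` and the strategy labels are hypothetical, and the naive 3-way merge is abstracted into a callable that reports whether it applied cleanly.

```python
from typing import Callable, Dict


def choose_eval_strategy(
    lead_patch: str,
    member_patch: str,
    naive_merge_is_clean: Callable[[], bool],
) -> Dict[str, str]:
    """Pick what the eval tests under the identical / naive / lead-alone chain."""
    # 1. Identical patches: skip the merge entirely.
    if lead_patch == member_patch:
        return {"strategy": "identical", "tested": "lead patch (merge skipped)"}
    # 2. Clean naive merge: the merged tree is authoritative, with no further fallback.
    if naive_merge_is_clean():
        return {"strategy": "naive", "tested": "merged tree"}
    # 3. Naive merge conflicts: the lead's patch is tested alone against both feature suites.
    return {"strategy": "lead-alone", "tested": "lead patch"}
```

The merge check is passed as a callable so that it only runs when the patches differ, matching the order of the chain.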
Effect on the core-subset horizontal comparison:
- msa 6 → 6 (unchanged)
- oh 5 → 4 (loses pallets_jinja/1621, which was passing via union and concealed that oh's lead doesn't integrate)
- cc 5 → 5 (unchanged)
- cx 5 → 5 (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit their patch.txt into the working tree, which forces a merge conflict on patch.txt that union resolved while the actual source merge was non-conflicting). Under the stricter policy this gets routed through lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(modal): codex stdin hang; eval guardrail for openhands_sdk

codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.

openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
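The stdin fix can be reproduced in isolation. The sketch below is not ``_build_codex_command``: it uses a stand-in child process that reads stdin to EOF, and ``stdin=subprocess.DEVNULL`` as the Python equivalent of appending `</dev/null` to a shell command. ``run_with_closed_stdin`` is a hypothetical helper.

```python
import subprocess
import sys

# Stand-in for a CLI that blocks until stdin reaches EOF before doing any work.
CHILD = "import sys; data = sys.stdin.read(); print(len(data))"


def run_with_closed_stdin(timeout: float = 30.0) -> str:
    """Run the child with stdin wired to the null device so its read returns EOF at once."""
    completed = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.DEVNULL,  # same effect as `cmd </dev/null` in a shell
        capture_output=True,
        text=True,
        timeout=timeout,  # guard against the hang this pattern is meant to prevent
        check=True,
    )
    return completed.stdout.strip()
```

With an inherited stdin that is never closed, the same child would block in ``sys.stdin.read()`` until the timeout fires; with the null device it reads zero bytes and exits.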
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(swe_agent): add --backend docker support swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default. Two upstream-swerex issues had to be worked around to make the docker path actually start a container: 1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image). 2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex" — which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead. Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team_harness: extract team mode as standalone harness + ablation flags Move team-mode primitives from cooperbench/agents/_team (private) to cooperbench/team_harness (public, library-shaped) so other benchmarks can consume the multi-agent coordination algorithm without depending on CooperBench's task layout. 
Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL, ordered agent list, scratchpad volume name) with the feature config and exposes adapter-facing factories that each return None / [] / {} when their feature is disabled, so adapter code paths collapse to one branch:

      coop_env.update(session.env_for(agent_id))
      extra_run_args.extend(session.scratchpad_mount_args())
      mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans (task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member role split is the always-on baseline; without it, team is just coop.

Wires five --team-no-* CLI flags through cli.py -> runner.run -> runner.core -> runner.team -> each adapter. result.json now records team_features so post-hoc analysis can attribute deltas to the feature that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and openhands_sdk now accept a team_features kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt, env, mount, MCP, install) gate on the session's config.

Tests: tests/agents/_team -> tests/team_harness (rename); new test_session.py (29 cases) covers the facade; four new ablation tests in tests/runner/test_team.py verify the runner-side gating. Full suite 363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run> volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log / tasks files, empty metrics dict, no volume. team_features in result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* codex: add Azure OpenAI support

Set AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT (the OpenAI-compatible v1 base, e.g.
https://<resource>.cognitiveservices.azure.com/openai/v1) and pass the Azure deployment name via -m. When both are present they take precedence over OPENAI_API_KEY. How it works: - resolve_azure_config() reads the two env vars (endpoint trailing slash stripped); _azure_config_toml() writes a `model_provider = "azure"` block into codex's config.toml with wire_api = "responses" (codex 0.132 dropped the chat wire API) and env_key = AZURE_OPENAI_API_KEY. - The key is exported into the codex command and read via the provider env_key; auth.json is skipped on the Azure path. - config.toml is now composed from independent fragments (azure provider + team-mode MCP server) so both can coexist. Non-json fallback: codex 0.132's --json event stream deterministically fails against Azure's HTTP/2 /responses endpoint ("stream disconnected: error sending request") while plain output works. Captured requests are byte-identical between modes, so it's a codex response-handling bug, not a config error. The Azure path therefore runs codex WITHOUT --json, harvests the patch from patch.txt (as always) and the final message via --output-last-message, and derives status from codex's exit code. Trade-off: no token/cost/trajectory telemetry on Azure (codex's plain output carries none; cost was already $0 via the broken json parser). Tests: 5 new (resolve_azure_config, _azure_config_toml, non-json run shape + provider config + no auth.json, error status on non-zero exit); autouse fixture clears AZURE_* so non-Azure tests stay hermetic. Full suite 369 passed; ruff/format/mypy green. Validated end-to-end on dottxt_ai_outlines/1655 [1,3] with `-a codex -m gpt-5.5-hao` against a live Azure deployment: Submitted, clean stream (no disconnects), eval passes both features. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(codex): preserve Azure key in coop/team mode The is_coop branch reassigned `coop_env = {...}`, wiping the AZURE_OPENAI_API_KEY added just above it. 
Codex then failed provider auth ("Missing environment variable: AZURE_OPENAI_API_KEY") in every coop / coop+git / team run, producing empty patches — a full-dataset coop+git Azure sweep scored 0/652 while solo (same path) scored 355/652. Fix: `coop_env.update({...})` so the Azure key survives. Verified with a coop+git Azure smoke (both agents Submitted, real patches, zero missing-key errors). Adds a regression test (test_azure_key_survives_coop_mode). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(codex): harden container install for concurrent runs Codex's setup.sh ran apt without DEBIAN_FRONTEND=noninteractive, so in TTY-less containers debconf fell through Dialog->Readline->Teletype and tripped dpkg ("Sub-process /usr/bin/dpkg returned an error code (1)"). Rare at solo concurrency (6 containers, ~0.6% fail) but dominant under coop/team (12 containers at concurrency 6, ~87% fail) — a full-dataset coop+git sweep collapsed to install failures. Fix: export DEBIAN_FRONTEND=noninteractive and wrap apt/apk/yum installs in a 3x retry (transient mirror throttling under many simultaneous installs from one host). Validated with 15 coop+git tasks at concurrency 6: 15/15 installed cleanly (was ~1/8 before), 30/30 agents produced patches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
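The Azure-key regression fixed above comes down to reassigning a dict versus updating it. A minimal sketch, assuming a simplified environment builder; the function names and the `CB_AGENT_ID` key are hypothetical and only stand in for the adapter's real coop variables.

```python
def build_env_buggy(azure_key, is_coop):
    coop_env = {}
    if azure_key:
        coop_env["AZURE_OPENAI_API_KEY"] = azure_key
    if is_coop:
        # Reassignment replaces the whole dict, so the Azure key is lost.
        coop_env = {"CB_AGENT_ID": "agent1"}  # hypothetical coop variable
    return coop_env


def build_env_fixed(azure_key, is_coop):
    coop_env = {}
    if azure_key:
        coop_env["AZURE_OPENAI_API_KEY"] = azure_key
    if is_coop:
        # update() merges the coop variables into the existing dict.
        coop_env.update({"CB_AGENT_ID": "agent1"})  # hypothetical coop variable
    return coop_env
```

The two functions behave identically in solo mode, which is consistent with the solo sweep passing while every coop run failed.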

* agents/codex: add Codex adapter; lift shared coop bits into _coop Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating. Codex adapter highlights: - Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``. - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts. - Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events. - Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default. - Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail. Shared ``_coop`` module: - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin. - ``install_snippet.sh`` — pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it. - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent- agnostic. - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``. 
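A minimal sketch of what a `normalize_patch` helper like the one listed above might do, assuming the behavior the commit messages describe: strip leading and trailing newlines only, then end with exactly one newline. This is not the actual `_coop/runtime.py` code. It deliberately avoids a bare `str.strip()`, which would also eat a trailing blank context line (`" \n"`) that `git apply` counts in the last hunk.

```python
def normalize_patch(text: str) -> str:
    """Return the patch with no leading newlines and exactly one trailing newline."""
    # Strip only "\n" so whitespace-only context lines inside the diff survive.
    body = text.lstrip("\n").rstrip("\n")
    if not body:
        return ""  # treating an empty patch as empty output is an assumption
    return body + "\n"
```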
Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't. Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped. End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2: - codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly) - codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass — coordination failure (agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file). Claude on the same task scored 2/2; Codex used the tools less aggressively on this run. The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * runner: add team mode (lead + members + shared task list + scratchpad) Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives: 1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style — exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers. 2. 
A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches.

Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``. Evaluation is identical to coop — per-agent ``patch.txt`` evaluated per-feature — so no eval changes were needed beyond discovering ``team/`` log directories.

Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up.

Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team)
- 4 misc

Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
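A minimal sketch of how the coordination metrics named above could be derived from an audit log. The log entry shape (`ts`, `event`, `agent`), the task record shape, and the definition of `time_to_first_claim_seconds` as "first claim minus first create" are all assumptions for illustration; the real `TaskListClient` log format is not shown in this PR text.

```python
from collections import Counter


def coordination_metrics(log, tasks):
    """Summarize an audit log (list of dicts) and final task states (dict by id)."""
    creates = [e["ts"] for e in log if e["event"] == "create"]
    claims = [e for e in log if e["event"] == "claim"]
    updates = [e for e in log if e["event"] == "update"]

    first_claim = None
    if creates and claims:
        # Assumed definition: seconds from the first task creation to the first claim.
        first_claim = min(e["ts"] for e in claims) - min(creates)

    return {
        "time_to_first_claim_seconds": first_claim,
        "claims_per_agent": dict(Counter(e["agent"] for e in claims)),
        "updates_per_agent": dict(Counter(e["agent"] for e in updates)),
        "tasks_done": sum(1 for t in tasks.values() if t["status"] == "done"),
        "unowned_at_end": sum(1 for t in tasks.values() if t.get("owner") is None),
    }
```

An empty log yields `None` for the first-claim time and empty per-agent dicts, which matches the "empty log" case the unit tests mention.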
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode: - 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done`` - time_to_first_claim_seconds=34.2 - claims_per_agent={agent1: 2, agent2: 1} - updates_per_agent={agent1: 4, agent2: 3} - scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md) - **0/1 pass rate** — both ``patch.txt`` files were empty: the members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only) — a follow-up will tighten the prompt to make patch.txt submission the explicit final step. Future PRs (intentionally out of scope here so this lands at a reviewable size): - In-loop auto-refresh for the Python-loop adapters - MCP long-poll tool to give CLI adapters push-ish inbox semantics - Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape) - Filesystem mirror of the task list (CC-style ``ls`` artefacts) Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53) Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). 
Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. 
**Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. 
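The filesystem mirror in item 1 relies on tempfile+replace so readers never see a half-written file. A minimal sketch of that technique using only the standard library; the function names and the contents of `_index.json` (a sorted list of task ids) are assumptions, and `_log.jsonl` plus stale-file cleanup are left out.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
        # os.replace is atomic on the same filesystem: readers see old or new, never partial.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def mirror_tasks(root: Path, tasks: dict) -> None:
    """Snapshot each task to <id>.json plus a cheap _index.json listing (assumed shape)."""
    for task_id, task in tasks.items():
        write_json_atomic(root / f"{task_id}.json", task)
    write_json_atomic(root / "_index.json", sorted(tasks))
```

The temp file is created next to the target rather than in the system temp directory so that the final rename stays on one filesystem.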
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: wire team prompt + env into the three Python-loop adapters Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to parity with the CLI adapters for team mode. Before this commit they accepted the team kwargs but discarded them; now each one appends the team prompt section to the task it sends the agent, and (where the adapter actually controls the container) propagates ``CB_TEAM_*`` env vars + mounts the team scratchpad. New helper: ``_team.team_task_section(agents, agent_id, team_role)`` returns ONLY the lead-or-member block + coop-task-* CLI usage, without the surrounding task/submission/git scaffolding that ``build_team_instruction`` adds. Python-loop adapters already have their own prompts covering messaging/git/submission, so they need only the new piece; CLI adapters keep using the bigger function. Per-adapter wiring: - ``mini_swe_agent_v2``: appends team_task_section to task; propagates CB_TEAM_* through env_kwargs["env"]; adds ``--add-host=host.docker.internal:host-gateway`` + scratchpad volume to docker run args; installs the team CLI scripts + pip redis in the container after env spin-up. The existing ``TeamPoller`` host-side hook (already in step()) still fires. - ``openhands_sdk``: appends team_task_section to task; folds a new ``team_env`` dict into ``coop_info`` so ``_build_credentials_dict`` propagates CB_TEAM_* into the sandbox. Coop-task-* binary install in the OpenHands agent-server image is a follow-up — OpenHands manages its own image build and doesn't expose a clean post-start exec hook. - ``swe_agent``: appends team_task_section to task. The SWE-agent framework's sandbox + agent loop is third-party and harder to instrument; everything beyond the prompt is a follow-up. 
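A minimal sketch of the `mini_swe_agent_v2` wiring described above, assuming a helper that takes the task text, the team prompt section, and the adapter's `env_kwargs` and docker run args. The helper name and signature are hypothetical; the `--add-host` flag, the `cb-team-<run_id>` volume, and the `/workspace/shared` mount point come from the commit message.

```python
def wire_team_mode(task, team_section, env_kwargs, run_args, team_env, run_id):
    """Return (task, env_kwargs, run_args) with team-mode additions applied."""
    if team_section:
        # Append the lead-or-member block after the adapter's own prompt.
        task = f"{task.rstrip()}\n\n{team_section.strip()}\n"

    # Copy before mutating so the caller's dicts and lists are left untouched.
    env_kwargs = dict(env_kwargs)
    env = dict(env_kwargs.get("env", {}))
    env.update({k: v for k, v in team_env.items() if k.startswith("CB_TEAM_")})
    env_kwargs["env"] = env

    run_args = list(run_args) + [
        "--add-host=host.docker.internal:host-gateway",
        "-v", f"cb-team-{run_id}:/workspace/shared",
    ]
    return task, env_kwargs, run_args
```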
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py: consistency between team_task_section and build_team_instruction, every registered runner accepts the team kwargs, openhands env keys, swe_agent signature

Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code + team + git (sanity check that the shared changes didn't regress the CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent, openhands_sdk) requires API keys (Anthropic for the three Python-loop adapters via litellm, OpenAI for codex) that aren't available in this environment. Unit tests cover the new wiring; the e2e validations should be run with real keys before relying on the per-adapter behavior.

Compatibility matrix is now:

| Adapter | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
|---------------------|---------|-------------|--------------|------------------|----------|
| claude_code | yes | yes (full) | n/a | yes | yes |
| codex | yes | yes (full) | n/a | yes | yes |
| mini_swe_agent_v2 | yes | yes (sec.) | yes | yes | yes |
| openhands_sdk | yes | yes (sec.) | n/a | NOT YET | yes |
| swe_agent | yes | yes (sec.) | NOT YET | NOT YET | NOT YET |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: layer coop-task-* install onto Modal image for team mode

Closes the documented gap from the prior commit's matrix: the ``coop-task-*`` binaries now ship into the OpenHands agent-server sandbox, layered onto the upstream ``-oh`` image via Modal's ``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no upstream image rebuild required). Triggered only when ``coop_info["team_env"]`` is set so solo / coop runs don't pay the ~10s first-build cost.
Modal caches the layered image; subsequent team runs are instant. Verified end-to-end: ran openhands_sdk team+git on dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran ``compgen -c | grep coop-task`` and got back all 7 wrappers (create / claim / update / list / request / respond / pending) — the install worked. Whether the model actually invokes the tools is a separate (coordination-quality) axis; in this run it discovered them but didn't use them, same as codex. Both patches applied; f1 14/14, f2 19/20. Tests: 2 new (full suite: 326 passed) - test_team_env_triggers_image_layering — verifies add_local_file + pip_install + run_commands fire with the right args when team mode is active - test_no_layering_when_team_inactive — verifies solo / coop runs skip the image-build cost Matrix update — openhands_sdk now reads: Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a / CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team prompt: make the merge-before-submit step REQUIRED The codex team e2e (cx_team_v3) hit 0/2 with great coordination metrics — 5/5 tasks done, 27s first claim, claims even — but neither agent ran ``git merge`` despite the prompt's "Recommended workflow" mentioning it. Both fetched their peer's branch (2 each) and then submitted only their own work, so the eval's naive diff-stacker produced syntactically broken Python. The previous prompt buried the critical step in a "Concretely:" sentence at the end; gpt-5.5 didn't follow it. This rewrite: - Renames the section ``## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING`` so the imperative is in the heading itself. - Adds an explicit "Required final sequence — run this verbatim before exiting" block with the full fetch+merge+diff sequence, parameterized over every partner branch. 
- Explains *why* (each agent's patch.txt is evaluated against every feature's tests; without the merge, the peer feature's symbols are missing → ImportError). - Frames it the same way the patch.txt step is framed (REQUIRED, skip-at-your-loss), which the original prompt fix proved codex responds to. Verified: re-ran cx_team_v4 (codex team+git, same task as v3). Git activity went from ``fetch=2 merge=0 push=0`` per agent → ``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both patches now contain both features' symbols. Pass rate v4: 33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test because gpt-5.5's merged code put the ``filters`` kwarg on a helper function rather than the ``prompt`` decorator (content quality, not coordination). A second run (cx_team_v5) produced byte-identical 243-line patches on both agents — codex coordinated so well both ended up with the exact same merged tree. This surfaces a separate bench-side limitation: the eval's diff-stacker fails to apply patch B on top of patch A when every hunk already matches, producing an empty merged.patch. That's a real bug in ``eval/evaluate.py``'s coop merge step, NOT a coordination failure — codex did exactly what the prompt asked. Fix is a separate concern from team-mode wiring. Tests still pass (existing prompt tests are content-agnostic; 326 / 63 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * eval: short-circuit when both agents submit identical merged patches In team mode codex can coordinate so well that both agents end up with byte-identical patches (each fully merged the other's branch). The existing eval combiner sequence — apply patch1 → apply patch2 on top — chokes because every hunk in patch2 is already applied, producing an empty merged.patch and a downstream "No valid patches in input" failure even though both submissions are individually fine. 
Fix in ``test_merged``: before invoking ``_setup_branches`` / ``_merge_naive``, ``cmp`` the two patches. If they match, copy patch1 to merged.patch (normalized via ``git apply --recount`` so agents that emit unified diffs with miscounted hunk headers still work) and skip the merge dance. Returns a fresh result with ``merge.status: "identical"`` so the caller can tell the short-circuit fired vs a real merge. Verified on the codex-team e2e: - cx_team_v5 (codex agents perfectly merged to identical 243-line patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20) - cx_team_v4 (codex agents diverged on the merge): unchanged at f2 20/20 + f1 13/14 = 33/34 tests, still falls back to agent2-alone via apply_status: {'agent1': 'failed', ...} I also briefly tried adding ``git apply --recount`` to ``_setup_branches``'s fallback chain, but that REGRESSED v4: it made agent1's malformed patch apply where it previously failed silently, triggering a real merge attempt that produced duplicate function definitions (broken Python) via union merge. The identical-patches short-circuit is the strictly-better fix — no regression, recovers the v5 case, and the malformed-hunk normalization only kicks in on the short-circuit path where it can't cause merge conflicts. 
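A minimal sketch of the identical-patch short-circuit described above, assuming the two patches and the output are plain files on disk. The function name is hypothetical, and the `git apply --recount` normalization the commit mentions is omitted here so the example stays self-contained.

```python
import filecmp
import shutil
from pathlib import Path


def short_circuit_if_identical(patch1: Path, patch2: Path, merged: Path):
    """Copy patch1 to merged and report "identical" when both patches match byte for byte.

    Returns None when the patches differ, so the caller falls through to the real merge.
    """
    # shallow=False forces a content comparison instead of trusting size and mtime.
    if not filecmp.cmp(patch1, patch2, shallow=False):
        return None
    shutil.copyfile(patch1, merged)
    return {"merge": {"status": "identical"}}
```

Returning a distinct status lets the caller tell a skipped merge apart from a merge that actually ran.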
Also lands previously-uncommitted housekeeping: - prompt.py: ruff-format-only diff on the merge-required block from the prior commit - test_team_wiring.py: ruff --fix removed unused MagicMock imports - test_gcp_backend.py / test_tasks.py: ruff --fix removed f-string-without-placeholder and unused-json import (both unrelated drift caught by the gate) Tests: 1 new (full suite: 327 passed) - ``test_test_merged_shortcircuits_on_identical_patches`` — source inspection confirms the short-circuit branch + "identical" merge-status string exist in test_merged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands: register Redis-backed CoopTaskTracker as a typed tool The previous openhands team runs (oh_team_v3) showed agents discovering the ``coop-task-*`` shell wrappers via ``compgen`` but never invoking them — gpt-5.5 strongly prefers typed tools registered with the LLM over arbitrary shell commands. This commit lands the architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered under the same name as openhands' built-in ``TaskTrackerTool`` so the registry resolution swaps it transparently. Files: * ``openhands/tools/task_tracker/coop_definition.py`` — new tool definition + executor. Same ``TaskTrackerAction`` / ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round- trip through the shared ``cb:<run_id>:`` Redis namespace that ``TaskListClient`` (host side) writes to. Tasks are auto-owned by the calling agent; ``view`` shows peer tasks prefixed with ``[<their_agent_id>]``. Registered under both ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing the module rebinds the latter to the Coop variant. * ``openhands/tools/preset/default.py`` — gains a ``team_mode`` kwarg (kept for API stability + tests; the actual swap happens server-side via the .pth/__init__ side-effect import, not by changing the host-side tool list). 
Pre-PR coop block split into a more nuanced team-mode prompt section that documents the TaskTracker → shared-list behavior. * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` — layers two more bits into the Modal image at build time: - ``add_local_file`` of ``coop_definition.py`` to ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands install) - ``grep ... || echo`` appending ``from . import coop_definition`` to the package's ``__init__.py`` so the registration runs at import time. Tests: 1 new + updated image-layering assertions - ``test_importing_coop_definition_overrides_local_registration``: inspecting the registry's ``_MODULE_QUALNAMES`` confirms ``TaskTrackerTool.name`` resolves to ``coop_definition``'s registration after import. - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file`` calls + 2 ``run_commands`` layers (tool-file install + ``coop-task-*`` wrappers) and that the ``from . import coop_definition`` line is in the install commands. Full suite: 329 passed. Ruff / format / mypy all green. KNOWN LIMITATION (documented in coop_definition.py docstring): the openhands_sdk agent-server runs in a Modal sandbox that's network-isolated from the host Redis. The CoopTaskTracker is correctly registered and the LLM can call it, but every operation returns "Shared task list unavailable" because the sandbox can't ``socket.getaddrinfo("host.docker.internal")``. The fix is in the deployment layer (Modal tunnels, a Modal-hosted Redis, or running openhands directly via docker like the other adapters), not in this PR — verified by oh_team_v10: agent ran ``coop-task-list`` first ("The coop CLI failed; I'll use the shared task tracker."), then fell back to TaskTrackerAction which still hit the local executor because the override + Redis combo can't actually work in Modal. For non-Modal openhands deployments (e.g. local docker-backed openhands runs, future remote-conversation transports that share the host network), this tool works as designed. 
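The same-name registration used in this commit can be illustrated with a toy registry. This is a sketch of the general "last registration wins" pattern only; it is not the openhands tool registry API, and every name below except the two tool-name strings is hypothetical.

```python
# Toy name -> class registry; a later registration under the same name replaces the earlier one.
_REGISTRY = {}


def register_tool(name, cls):
    _REGISTRY[name] = cls


def resolve_tool(name):
    return _REGISTRY[name]


class LocalTaskTracker:
    backend = "local"


class CoopTaskTracker:
    backend = "redis"


# The built-in tool registers first ...
register_tool("TaskTrackerTool", LocalTaskTracker)
# ... then importing the coop module registers under both names, rebinding the built-in one.
register_tool("CoopTaskTrackerTool", CoopTaskTracker)
register_tool("TaskTrackerTool", CoopTaskTracker)
```

Any caller that resolves `"TaskTrackerTool"` after the second registration gets the Redis-backed class without changing its own code, which is why the swap is invisible to the agent.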
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands team mode: end-to-end working with Modal-hosted Redis

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker swap from actually functioning. Three pieces, working together:

1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``, accessed via ``unencrypted_host:unencrypted_port``). Re-uses the existing ``connectors/redis_server.ModalRedisServer`` — it was already written, just unused. Both the host TaskListClient and the agent sandboxes point at the same public TCP endpoint, so pre-seed and agent reads/writes share state. Falls back to local Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The adapter now ``add_local_file``s three pieces into the OpenHands image at build time:
   - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
   - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
   - ``_team_init_override.py`` → ``$OH_DIR/__init__.py`` (replaces upstream; same exports + a side-effect import of coop_definition so the Redis-backed executor overrides the local TaskTracker registration at first import).

   Plus a ``find -name '*.pyc' -delete`` to invalidate Python's bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle connections after a few minutes, so the Redis client that the pre-seed step opened at startup is closed before the 9-min agent run finishes. The runner now re-opens the client at harvest time using the same URL.
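A minimal sketch of the harvest-time reconnect in item 3, assuming the audit log is stored as a Redis list. The function name, the list-based storage, and the key passed in are assumptions; `redis.Redis.from_url`, `lrange`, and `close` are real redis-py calls. The connection factory is injectable so the logic can be exercised without a server.

```python
def _default_connect(url):
    # Imported lazily so the sketch loads even where redis-py is not installed.
    import redis
    return redis.Redis.from_url(url)


def harvest_audit_log(url, key, connect=_default_connect):
    """Open a fresh connection at harvest time instead of reusing the startup client."""
    client = connect(url)
    try:
        # Read the whole list; storing the log as a Redis list is an assumption.
        return client.lrange(key, 0, -1)
    finally:
        client.close()
```

Opening a new client costs one extra connection per run, which is cheap compared with losing the metrics to a tunnel that dropped the idle startup connection.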
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``-a openhands_sdk --setting team --git``: - Modal Redis startup: ``redis ready redis://r450.modal.host:41899`` - Both agents Submitted, 9m total - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓) - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0, time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2, agent1:1}, updates_per_agent: {agent2:4, agent1:5}`` - Cost: $3.33 Tests: image-layering assertions expanded — ``add_local_file`` now called 3 times (CLI helper, tool def, __init__ override), and the run_commands chain copies both files + wipes .pyc caches. Full suite: 329 passed. Ruff / format / mypy all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * deps: add fakeredis to dev extras The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * swe_agent: fix import error + add missing transitive deps Three changes that together unblock swe_agent team-mode runs (and solo/coop runs too — the bug wasn't team-specific): 1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2`` in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``. 
The old package was renamed in v0.0.13; both swe_agent files had stale imports that failed at module load (TypeError or ModuleNotFoundError, depending on how the framework was invoked), making every swe_agent invocation return Error before any LLM call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras in pyproject.toml. swe_agent's vendored framework imports these at module-load time even when the docker/S3/model paths are dormant, so a clean ``pip install '.[swe-agent]'`` without these would still raise ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with ``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5): both agents Submitted, patches 373 + 88 lines, both applied via git apply. Eval failed 0/2 due to a content-quality issue (``NameError: name 'Set' is not defined`` — agent used Set without importing it; both agents hit exit_cost budget limit mid-implementation), but that's model variance, not adapter wiring. swe_agent is unblocked: it runs end-to-end, produces patches, the eval pipeline processes them.

Coordination metrics still empty (claims_per_agent: {}) because swe_agent doesn't yet have the in-container coop-task-* CLI install or in-loop task auto-refresh — those are tracked as follow-ups in the PR body. For now the swe_agent team-mode run just gets the team prompt section + env vars; full team-tool integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: team-mode bugs surfaced by 10-pair core run

Five compounding bugs prevented `claude_code`, `codex`, and `mini_swe_agent_v2` from reaching honest pass-rates on the core subset in team setting. All four now ≥ 5/10.

- normalize_patch ate trailing blank context lines (text.strip() consumes " \n"), breaking last-hunk line counts so git apply rejected otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all: raw .strip() on the patch.txt read, so every msa patch ended in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no long-running command, so the image's default CMD exited and every exec hit "Sandbox not found". Pass "sleep", "infinity" as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal because shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters.
- Team lead prompt buried the integration step at the bottom of a long workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist. Member prompt now opens with "stay in your lane" per the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch alone when the merged tree doesn't pass both features. Surfaced as merge.strategy="solo-agent1" / "solo-agent2". Credits the agent (typically the lead) who correctly integrated both features into one working patch but had it corrupted by union-merging with the other agent's partial implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs+data: core subset and team-mode horizontal comparison

- dataset/subsets/core.json: 10-pair subset for quick agent comparisons. Stratified by repo (largest-remainder proportional allocation by full-dataset pair count) with a one-slot floor per primary language (Python / Go / Rust / TS). Reproducible via scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent frameworks on the core subset in team setting.
  Includes per-task pass/fail matrix annotated with the merge strategy used, plus the chronological narrative of the dozen reruns that surfaced each of the bugs fixed in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(eval): don't bail when union-merge also conflicts

Previously test_merged returned early with an error when both naive and union merge strategies hit conflicts, so the solo-agent fallback never got a chance to credit a team whose lead alone integrated both features. Now we write an empty merged.patch, let run_tests fail naturally on the merged tree, and fall through to the solo fallback.

Doesn't change any of the current 40 eval results: union's merge=union attribute is tolerant enough that every task in the dataset produces some tree (potentially broken code with stitched-together lines); the broken-tree-tests-fail path already triggered the solo fallback. This just closes the defensive gap for future pathological cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* eval(team): identical / naive / lead-when-naive-conflicts policy

Drops the union-merge strategy and the member-only fallback from test_merged. The new chain is:

1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative (no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against both feature suites

Rationale: union merge concatenates conflicting hunks, which usually produces syntactically broken code; the cases where it accidentally produced a working tree were rewarding lucky non-overlap, not genuine coordination. The member-only fallback was symmetric to lead-only but incoherent under team-mode semantics (the lead is the designated integrator; if they didn't integrate, the team failed regardless of what the member's branch looks like).
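The three-step chain above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the actual test_merged code: ``choose_patch_to_test``, ``MergeDecision``, the strategy labels, and the ``naive_merge`` callback (which returns ``None`` on conflict) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MergeDecision:
    strategy: str  # illustrative labels: "identical", "naive", "lead-alone"
    patch: str     # the patch whose tree is tested against both feature suites


def choose_patch_to_test(
    lead_patch: str,
    member_patch: str,
    naive_merge: Callable[[str, str], Optional[str]],
) -> MergeDecision:
    """Pick the patch to evaluate, following the identical/naive/lead-alone order."""
    # 1. Byte-identical patches: nothing to merge, test either one.
    if lead_patch == member_patch:
        return MergeDecision("identical", lead_patch)
    # 2. A clean naive merge is authoritative; there is no further fallback.
    merged = naive_merge(lead_patch, member_patch)
    if merged is not None:
        return MergeDecision("naive", merged)
    # 3. The naive merge conflicted: only the lead's patch is tested.
    return MergeDecision("lead-alone", lead_patch)
```

Note that there is deliberately no branch that tests the member's patch alone, which mirrors the rationale that the lead is the designated integrator.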
Effect on the core-subset horizontal comparison:

    msa  6 → 6  (unchanged)
    oh   5 → 4  (loses pallets_jinja/1621: was passing via union, which concealed that oh's lead doesn't integrate)
    cc   5 → 5  (unchanged)
    cx   5 → 5  (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit their patch.txt into the working tree, which forces a merge conflict on patch.txt that union resolved while the actual source merge was non-conflicting). Under the stricter policy this gets routed through lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(modal): codex stdin hang; eval guardrail for openhands_sdk

codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.

openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
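The codex stdin fix in this commit relies on the child process seeing EOF immediately instead of waiting on an open stdin. A minimal Python analogue of the `</dev/null` redirect is sketched below; it is not the adapter's code, and ``run_with_closed_stdin`` is a hypothetical helper. The child here is a stand-in that reads stdin to EOF, the way `codex exec` is described as doing.

```python
import subprocess
import sys


def run_with_closed_stdin(argv: list[str]) -> subprocess.CompletedProcess:
    """Run argv with stdin attached to the null device, like `cmd </dev/null`."""
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,  # child reads EOF at once instead of blocking
        capture_output=True,
        text=True,
        timeout=30,
    )


# Stand-in child: reads all of stdin, then reports how many characters it got.
child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
result = run_with_closed_stdin(child)
print(result.stdout.strip())
```

With stdin inherited from a parent that never closes it, the same child would block in ``sys.stdin.read()`` until the timeout.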
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(swe_agent): add --backend docker support

swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex", which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead.

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* team_harness: extract team mode as standalone harness + ablation flags

Move team-mode primitives from cooperbench/agents/_team (private) to cooperbench/team_harness (public, library-shaped) so other benchmarks can consume the multi-agent coordination algorithm without depending on CooperBench's task layout.
Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL, ordered agent list, scratchpad volume name) with the feature config and exposes adapter-facing factories that each return None / [] / {} when their feature is disabled, so adapter code paths collapse to one branch:

      coop_env.update(session.env_for(agent_id))
      extra_run_args.extend(session.scratchpad_mount_args())
      mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans (task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member role split is the always-on baseline -- without it team is just coop.

Wires five --team-no-* CLI flags through cli.py -> runner.run -> runner.core -> runner.team -> each adapter. result.json now records team_features so post-hoc analysis can attribute deltas to the feature that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and openhands_agent_sdk now accept team_features kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt, env, mount, MCP, install) gate on the session's config.

Tests: tests/agents/_team -> tests/team_harness (rename), new test_session.py (29 cases) covers the facade, four new ablation tests in tests/runner/test_team.py verify the runner-side gating. Full suite 363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):

- Default: writes task_log.json + tasks.json + metrics, cb-team-<run> volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log / tasks files, empty metrics dict, no volume. team_features in result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* codex: add Azure OpenAI support

Set AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT (the OpenAI-compatible v1 base, e.g.
https://<resource>.cognitiveservices.azure.com/openai/v1) and pass the Azure deployment name via -m. When both are present they take precedence over OPENAI_API_KEY.

How it works:

- resolve_azure_config() reads the two env vars (endpoint trailing slash stripped); _azure_config_toml() writes a `model_provider = "azure"` block into codex's config.toml with wire_api = "responses" (codex 0.132 dropped the chat wire API) and env_key = AZURE_OPENAI_API_KEY.
- The key is exported into the codex command and read via the provider env_key; auth.json is skipped on the Azure path.
- config.toml is now composed from independent fragments (azure provider + team-mode MCP server) so both can coexist.

Non-json fallback: codex 0.132's --json event stream deterministically fails against Azure's HTTP/2 /responses endpoint ("stream disconnected: error sending request") while plain output works. Captured requests are byte-identical between modes, so it's a codex response-handling bug, not a config error. The Azure path therefore runs codex WITHOUT --json, harvests the patch from patch.txt (as always) and the final message via --output-last-message, and derives status from codex's exit code. Trade-off: no token/cost/trajectory telemetry on Azure (codex's plain output carries none; cost was already $0 via the broken json parser).

Tests: 5 new (resolve_azure_config, _azure_config_toml, non-json run shape + provider config + no auth.json, error status on non-zero exit); autouse fixture clears AZURE_* so non-Azure tests stay hermetic. Full suite 369 passed; ruff/format/mypy green.

Validated end-to-end on dottxt_ai_outlines/1655 [1,3] with `-a codex -m gpt-5.5-hao` against a live Azure deployment: Submitted, clean stream (no disconnects), eval passes both features.
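The detection and config-fragment steps described in this commit could look roughly like the sketch below. It is an assumption-laden illustration, not the repository's implementation: the tuple return shape, the optional ``env`` parameter, the public name ``azure_config_toml``, and the ``[model_providers.azure]`` table with its ``name`` and ``base_url`` keys are my assumptions about how the fragment is laid out; only the env var names, the trailing-slash stripping, ``model_provider = "azure"``, ``wire_api`` and ``env_key`` come from the commit message.

```python
import os
from typing import Mapping, Optional, Tuple


def resolve_azure_config(
    env: Optional[Mapping[str, str]] = None,
) -> Optional[Tuple[str, str]]:
    """Return (api_key, endpoint) when both Azure env vars are set, else None."""
    env = os.environ if env is None else env
    key = env.get("AZURE_OPENAI_API_KEY")
    endpoint = env.get("AZURE_OPENAI_ENDPOINT")
    if not key or not endpoint:
        return None  # Azure path stays a no-op unless both are present
    return key, endpoint.rstrip("/")  # trailing slash stripped


def azure_config_toml(endpoint: str) -> str:
    """Build a config.toml fragment selecting an Azure provider (assumed layout)."""
    lines = [
        'model_provider = "azure"',
        "",
        "[model_providers.azure]",
        'name = "Azure OpenAI"',               # assumed display name
        f'base_url = "{endpoint}"',            # assumed key for the v1 base URL
        'env_key = "AZURE_OPENAI_API_KEY"',    # key is read from this env var
        'wire_api = "responses"',
    ]
    return "\n".join(lines) + "\n"
```

Keeping the fragment as a standalone string is what lets it be concatenated with other fragments, such as a team-mode MCP server block.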
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* agents: add Azure OpenAI support to msa / swe_agent / openhands

Extends Azure support (added for codex in the prior commit) to the three litellm/SDK-backed adapters. claude_code is intentionally excluded.

Shared detection in cooperbench/agents/_azure.py:

- resolve_azure_config() reads AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT (same env vars as codex), endpoint trailing slash stripped.
- azure_litellm_model() returns `openai/<deployment>`: litellm's openai-compatible provider pointed at Azure's v1 base, mirroring how the OpenAI SDK is pointed at Azure (base_url=<v1>). No api_version pin (both the openai-compatible and native azure/ litellm routes were verified against the live endpoint; the former is used).

Wiring (each gated on resolve_azure_config(), no-op when unset):

- mini_swe_agent_v2: model_name -> openai/<deployment>; api_base + api_key folded into LitellmModelConfig.model_kwargs.
- swe_agent: GenericAPIModelConfig(name=openai/<deployment>, api_base=..., api_key=...).
- openhands_sdk: LLM(model=openai/<deployment>, api_key=..., base_url=...).

Tests: tests/agents/test_azure.py (9) covers detection precedence, endpoint normalization, deployment-name parsing, and the litellm model id. Full suite 378 passed; ruff/format/mypy green.

Validation: the litellm->Azure route was confirmed directly (both openai-compatible and azure/ provider forms return 200). mini_swe_agent_v2 validated end-to-end on docker. openhands_sdk (Modal backend) and swe_agent (swerex path) are wired but not yet end-to-end-validated against Azure; deferred so as not to compete with the running full-dataset codex sweep for the shared Azure deployment's quota.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* openhands: drop incidental reformatting, keep only the Azure edit

The openhands_agent_sdk/ tree is ruff-excluded in pyproject.toml (adapted from the OpenHands SDK), so the prior commit's `ruff format` churned ~90 unrelated lines. Restore the base file and re-apply only the Azure LLM branch so the diff is minimal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
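The normalize_patch fix in the "team-mode bugs surfaced by 10-pair core run" commit above turns on a detail of the unified diff format: a blank context line is encoded as a single space followed by a newline, so a whitespace-wide ``strip()`` deletes it. The sketch below contrasts the two behaviours. It assumes the function also guarantees one trailing newline, which the commit message does not spell out, and ``normalize_patch_buggy`` is a hypothetical reconstruction of the old behaviour.

```python
def normalize_patch_buggy(text: str) -> str:
    # strip() removes any whitespace, including the " \n" of a blank context line.
    return text.strip() + "\n"


def normalize_patch(text: str) -> str:
    # Trim newlines only, so a final " " context line survives.
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


patch = (
    "--- a/f.py\n"
    "+++ b/f.py\n"
    "@@ -1,3 +1,4 @@\n"
    " x = 1\n"
    "+y = 2\n"
    " z = 3\n"
    " \n"    # blank context line: one space, then newline
    "\n\n"   # stray trailing newlines from the file read
)

# The hunk header promises 3 old lines; count the context lines that remain.
kept = normalize_patch(patch).splitlines()[3:]
lost = normalize_patch_buggy(patch).splitlines()[3:]
print(len([l for l in kept if not l.startswith("+")]))  # context lines kept
print(len([l for l in lost if not l.startswith("+")]))  # context lines after strip()
```

The buggy variant leaves the hunk one context line short of what its header declares, which is the mismatch that made git apply reject otherwise-valid diffs.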

Ubuntu and others added 17 commits May 16, 2026 20:42

Base automatically changed from team-mode to main May 18, 2026 23:44

ProKil mentioned this pull request May 19, 2026

team_harness: extract team mode as standalone harness + ablation flags #58

Merged

4 tasks

Merge remote-tracking branch 'origin/main' into team-all-adapters

de21d0c

# Conflicts: # CHANGELOG.md

ProKil merged commit 32ad650 into main May 21, 2026
3 checks passed

ProKil deleted the team-all-adapters branch May 21, 2026 16:52

ProKil merged 19 commits into main from team-all-adapters


