docs: team-harness ablation report (flash, codex/gpt-5.5) by ProKil · Pull Request #59 · cooperbench/CooperBench

ProKil · 2026-05-19T22:38:12Z

Summary

Stacks on #58. Adds a self-contained HTML report of the ablation + multi-agent comparison experiments run against the team harness, plus the driver/generator scripts and raw CSVs so the numbers are reproducible.

Open docs/team_harness_ablation_report.html in a browser (or GitHub's raw/preview) — all numbers are embedded inline, no external assets.

Experiment setup

agent codex · model gpt-5.5 · subset flash (50 task pairs) · backend docker · 1 seed
6-config one-feature-off ablation (baseline + each of the 5 team features off) + 3 setting comparisons (solo / coop / coop+git)
A pair "passes" only if both features' held-out suites pass against one merged tree (see report Methodology for the full identical→naive→lead-alone protocol)

Results (passed / 50)

| configuration | passed (of 50) | rate |
|---------------|----------------|------|
| coop (messaging only) | 13 | 26% |
| team, no scratchpad | 15 | 30% |
| team, no task_list | 20 | 40% |
| solo (1 agent) | 24 | 48% |
| coop + git | 28 | 56% |
| team, no mcp | 30 | 60% |
| team, no auto_refresh | 30 | 60% |
| team, baseline (all on) | 31 | 62% |
| team, no protocol | 35 | 70% |

Findings

Code-sharing is load-bearing. scratchpad (−16) and task_list (−11) account for nearly all of team mode's value; remove either and team drops below solo — two uncoordinated agents are worse than one.
mcp / auto_refresh / protocol show no positive effect for codex. auto_refresh is a no-op for CLI adapters by design (only fires in Python-loop adapters); protocol-off even scored +4 (mild overhead, no payoff).
Most multi-agent value comes from a shared code substrate, not orchestration. coop+git (56%) ≈ team baseline, which shares code through the scratchpad (62%) ≫ messaging-only coop (26%, the worst config, below solo at 48%).
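The deltas quoted in the findings can be recomputed from the results table. The sketch below is illustrative only: the dictionary keys are shorthand labels for the table rows, and the helper names are hypothetical, not part of the report generator.

```python
# Pass counts copied from the results table (out of 50 task pairs).
TOTAL_PAIRS = 50
PASSED = {
    "coop (messaging only)": 13,
    "team, no scratchpad": 15,
    "team, no task_list": 20,
    "solo (1 agent)": 24,
    "coop + git": 28,
    "team, no mcp": 30,
    "team, no auto_refresh": 30,
    "team, baseline (all on)": 31,
    "team, no protocol": 35,
}
BASELINE = "team, baseline (all on)"
SOLO = "solo (1 agent)"


def delta_vs_baseline(config: str) -> int:
    """Passed pairs relative to the all-features-on team baseline."""
    return PASSED[config] - PASSED[BASELINE]


def pass_rate(config: str) -> float:
    """Fraction of the 50 pairs that passed for this configuration."""
    return PASSED[config] / TOTAL_PAIRS


if __name__ == "__main__":
    for config in sorted(PASSED, key=PASSED.get):
        print(f"{config:28s} {PASSED[config]:2d}  {pass_rate(config):.0%}  "
              f"delta={delta_vs_baseline(config):+d}")
```

With a single seed these deltas are point estimates; the caveats section applies to every number printed here.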

Caveats (also in the report)

Single seed, n=50, codex/gpt-5.5 only; effective discriminating n < 50 (many pairs pass/fail regardless of coordination).
Team runs used the scratchpad for code-sharing, not --git — so "team vs coop+git" compares two different sharing substrates, not "team = coop+git + extras". The team --git cell (both substrates) is untested.
codex exec ran with no step cap (2h wall-clock only); steps=1 in raw logs is one codex turn (~50–95 internal tool calls). Cost shows $0 because codex's --json omits a cost field.

Files

docs/team_harness_ablation_report.html — the report
docs/team_harness_ablation_data/{core,flash}_ablation.csv — raw rows
scripts/run_team_ablation.py — sweep driver
scripts/gen_ablation_report.py — regenerates the HTML from logs/

Test plan

Open the HTML, confirm tables render and numbers match the CSVs
uv run python scripts/gen_ablation_report.py reproduces the file from logs

🤖 Generated with Claude Code

Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the agent-agnostic pieces (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so that the two adapters (and any future CLI adapter) consume them rather than duplicating them.

Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail.

Shared ``_coop`` module:
- ``coop_msg.py``: Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh``: pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it.
- ``prompt.py``: solo / coop / coop+git prompt assembly, agent-agnostic.
- ``runtime.py``: ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``.

Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't hit it.

Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped.

End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass. This is a coordination failure: agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file. Claude on the same task scored 2/2; Codex used the tools less aggressively on this run.

The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. A future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
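The token accounting described above (sum usage across ``turn.completed`` events, report cost as 0.0) might look roughly like the following. This is a sketch, not the adapter's parser: the ``usage`` / ``input_tokens`` / ``output_tokens`` field names are assumptions about the event shape, and ``sum_turn_tokens`` is a hypothetical helper name.

```python
import json


def sum_turn_tokens(jsonl_text: str) -> dict:
    """Sum token usage over every ``turn.completed`` event in a JSONL stream.

    ASSUMPTION: each such event carries a ``usage`` object with integer
    ``input_tokens`` and ``output_tokens`` fields. Lines that are not valid
    JSON objects, and events of other types, are skipped.
    """
    totals = {"input_tokens": 0, "output_tokens": 0}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output in the stream
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0))
    # The stream has no cost field, so cost is reported as 0.0.
    return {**totals, "cost": 0.0}


SAMPLE_STREAM = "\n".join([
    '{"type": "turn.started"}',
    '{"type": "turn.completed", "usage": {"input_tokens": 100, "output_tokens": 20}}',
    'not json at all',
    '{"type": "turn.completed", "usage": {"input_tokens": 50, "output_tokens": 5}}',
])
```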

Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives: 1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style — exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers. 2. A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim. 3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches. Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``. Evaluation is identical to coop — per-agent ``patch.txt`` evaluated per-feature — so no eval changes were needed beyond discovering ``team/`` log directories. Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. 
The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up. Unit tests: 46 new - 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation) - 12 prompt (lead vs member branches, solo fallback, git interaction) - 3 runtime (env assembly, scratchpad mount args) - 4 metrics (happy path, unowned-at-end, empty log, multiple claims) - 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team) - 4 misc Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode: - 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done`` - time_to_first_claim_seconds=34.2 - claims_per_agent={agent1: 2, agent2: 1} - updates_per_agent={agent1: 4, agent2: 3} - scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md) - **0/1 pass rate** — both ``patch.txt`` files were empty: the members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only) — a follow-up will tighten the prompt to make patch.txt submission the explicit final step. Future PRs (intentionally out of scope here so this lands at a reviewable size): - In-loop auto-refresh for the Python-loop adapters - MCP long-poll tool to give CLI adapters push-ish inbox semantics - Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape) - Filesystem mirror of the task list (CC-style ``ls`` artefacts) Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
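As an illustration of how the coordination metrics named above could be derived from an audit log, here is a minimal sketch. The event schema (``ts``, ``op``, ``agent``) and the task schema (``status``, ``owner``) are assumptions made for the example, and ``coordination_metrics`` is a hypothetical function, not the code in ``cooperbench.agents._team``.

```python
from collections import Counter


def coordination_metrics(events: list, tasks: dict, run_start: float) -> dict:
    """Compute post-run coordination metrics from an audit log.

    ASSUMED shapes: each event is {"ts": float, "op": str, "agent": str};
    each task is {"status": str, "owner": str | None}.
    """
    claims = [e for e in events if e["op"] == "claim"]
    updates = [e for e in events if e["op"] == "update"]
    first_claim = min((e["ts"] for e in claims), default=None)
    return {
        "time_to_first_claim_seconds": (
            None if first_claim is None else round(first_claim - run_start, 1)
        ),
        "claims_per_agent": dict(Counter(e["agent"] for e in claims)),
        "updates_per_agent": dict(Counter(e["agent"] for e in updates)),
        "tasks_done": sum(1 for t in tasks.values() if t["status"] == "done"),
        "unowned_at_end": sum(1 for t in tasks.values() if t.get("owner") is None),
    }


# Made-up log for illustration only.
EXAMPLE_EVENTS = [
    {"ts": 134.2, "op": "claim", "agent": "agent1"},
    {"ts": 140.0, "op": "claim", "agent": "agent2"},
    {"ts": 150.0, "op": "update", "agent": "agent1"},
    {"ts": 155.0, "op": "claim", "agent": "agent1"},
]
EXAMPLE_TASKS = {
    "t1": {"status": "done", "owner": "agent1"},
    "t2": {"status": "done", "owner": "agent2"},
    "t3": {"status": "open", "owner": None},
}
```

Computing the metrics after the run, from the log alone, keeps the agents' hot path free of bookkeeping.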

…resh (#53) Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). 
``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. **Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). 
All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
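The tempfile+replace pattern mentioned for the filesystem mirror can be sketched as follows. This is a generic illustration of the technique, not the contents of ``_team/fs_mirror.py``; ``write_json_atomic`` is a hypothetical name.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload) -> None:
    """Write JSON so that readers see either the old file or the new one.

    The data goes to a temporary file in the same directory, then
    os.replace() swaps it into place. The temp file must live on the same
    filesystem as the target for the rename to be atomic.
    """
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A reader running ``cat`` on a task file during a refresh therefore never observes a half-written JSON document.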

Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to parity with the CLI adapters for team mode. Before this commit they accepted the team kwargs but discarded them; now each one appends the team prompt section to the task it sends the agent, and (where the adapter actually controls the container) propagates ``CB_TEAM_*`` env vars + mounts the team scratchpad. New helper: ``_team.team_task_section(agents, agent_id, team_role)`` returns ONLY the lead-or-member block + coop-task-* CLI usage, without the surrounding task/submission/git scaffolding that ``build_team_instruction`` adds. Python-loop adapters already have their own prompts covering messaging/git/submission, so they need only the new piece; CLI adapters keep using the bigger function. Per-adapter wiring: - ``mini_swe_agent_v2``: appends team_task_section to task; propagates CB_TEAM_* through env_kwargs["env"]; adds ``--add-host=host.docker.internal:host-gateway`` + scratchpad volume to docker run args; installs the team CLI scripts + pip redis in the container after env spin-up. The existing ``TeamPoller`` host-side hook (already in step()) still fires. - ``openhands_sdk``: appends team_task_section to task; folds a new ``team_env`` dict into ``coop_info`` so ``_build_credentials_dict`` propagates CB_TEAM_* into the sandbox. Coop-task-* binary install in the OpenHands agent-server image is a follow-up — OpenHands manages its own image build and doesn't expose a clean post-start exec hook. - ``swe_agent``: appends team_task_section to task. The SWE-agent framework's sandbox + agent loop is third-party and harder to instrument; everything beyond the prompt is a follow-up. 
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py: consistency between team_task_section and build_team_instruction, every registered runner accepts the team kwargs, openhands env keys, swe_agent signature

Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code + team + git (sanity check that the shared changes didn't regress the CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.

End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent, openhands_sdk) requires API keys (Anthropic for the three Python-loop adapters via litellm, OpenAI for codex) that aren't available in this environment. Unit tests cover the new wiring; the e2e validations should be run with real keys before relying on the per-adapter behavior.

Compatibility matrix is now:

| Adapter           | Accepts kwargs | Team prompt   | Auto-refresh | CLI in container | CB_TEAM_* env |
|-------------------|----------------|---------------|--------------|------------------|---------------|
| claude_code       | yes            | yes (full)    | n/a          | yes              | yes           |
| codex             | yes            | yes (full)    | n/a          | yes              | yes           |
| mini_swe_agent_v2 | yes            | yes (section) | yes          | yes              | yes           |
| openhands_sdk     | yes            | yes (section) | n/a          | NOT YET          | yes           |
| swe_agent         | yes            | yes (section) | NOT YET      | NOT YET          | NOT YET       |

Stacks on #52 (merged-up team-mode branch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the documented gap from the prior commit's matrix: the ``coop-task-*`` binaries now ship into the OpenHands agent-server sandbox, layered onto the upstream ``-oh`` image via Modal's ``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no upstream image rebuild required). Triggered only when ``coop_info["team_env"]`` is set so solo / coop runs don't pay the ~10s first-build cost. Modal caches the layered image; subsequent team runs are instant. Verified end-to-end: ran openhands_sdk team+git on dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran ``compgen -c | grep coop-task`` and got back all 7 wrappers (create / claim / update / list / request / respond / pending) — the install worked. Whether the model actually invokes the tools is a separate (coordination-quality) axis; in this run it discovered them but didn't use them, same as codex. Both patches applied; f1 14/14, f2 19/20. Tests: 2 new (full suite: 326 passed) - test_team_env_triggers_image_layering — verifies add_local_file + pip_install + run_commands fire with the right args when team mode is active - test_no_layering_when_team_inactive — verifies solo / coop runs skip the image-build cost Matrix update — openhands_sdk now reads: Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a / CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The codex team e2e (cx_team_v3) hit 0/2 with great coordination metrics — 5/5 tasks done, 27s first claim, claims even — but neither agent ran ``git merge`` despite the prompt's "Recommended workflow" mentioning it. Both fetched their peer's branch (2 each) and then submitted only their own work, so the eval's naive diff-stacker produced syntactically broken Python. The previous prompt buried the critical step in a "Concretely:" sentence at the end; gpt-5.5 didn't follow it. This rewrite: - Renames the section ``## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING`` so the imperative is in the heading itself. - Adds an explicit "Required final sequence — run this verbatim before exiting" block with the full fetch+merge+diff sequence, parameterized over every partner branch. - Explains *why* (each agent's patch.txt is evaluated against every feature's tests; without the merge, the peer feature's symbols are missing → ImportError). - Frames it the same way the patch.txt step is framed (REQUIRED, skip-at-your-loss), which the original prompt fix proved codex responds to. Verified: re-ran cx_team_v4 (codex team+git, same task as v3). Git activity went from ``fetch=2 merge=0 push=0`` per agent → ``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both patches now contain both features' symbols. Pass rate v4: 33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test because gpt-5.5's merged code put the ``filters`` kwarg on a helper function rather than the ``prompt`` decorator (content quality, not coordination). A second run (cx_team_v5) produced byte-identical 243-line patches on both agents — codex coordinated so well both ended up with the exact same merged tree. This surfaces a separate bench-side limitation: the eval's diff-stacker fails to apply patch B on top of patch A when every hunk already matches, producing an empty merged.patch. 
That's a real bug in ``eval/evaluate.py``'s coop merge step, NOT a coordination failure — codex did exactly what the prompt asked. Fix is a separate concern from team-mode wiring. Tests still pass (existing prompt tests are content-agnostic; 326 / 63 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In team mode codex can coordinate so well that both agents end up with byte-identical patches (each fully merged the other's branch). The existing eval combiner sequence — apply patch1 → apply patch2 on top — chokes because every hunk in patch2 is already applied, producing an empty merged.patch and a downstream "No valid patches in input" failure even though both submissions are individually fine. Fix in ``test_merged``: before invoking ``_setup_branches`` / ``_merge_naive``, ``cmp`` the two patches. If they match, copy patch1 to merged.patch (normalized via ``git apply --recount`` so agents that emit unified diffs with miscounted hunk headers still work) and skip the merge dance. Returns a fresh result with ``merge.status: "identical"`` so the caller can tell the short-circuit fired vs a real merge. Verified on the codex-team e2e: - cx_team_v5 (codex agents perfectly merged to identical 243-line patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20) - cx_team_v4 (codex agents diverged on the merge): unchanged at f2 20/20 + f1 13/14 = 33/34 tests, still falls back to agent2-alone via apply_status: {'agent1': 'failed', ...} I also briefly tried adding ``git apply --recount`` to ``_setup_branches``'s fallback chain, but that REGRESSED v4: it made agent1's malformed patch apply where it previously failed silently, triggering a real merge attempt that produced duplicate function definitions (broken Python) via union merge. The identical-patches short-circuit is the strictly-better fix — no regression, recovers the v5 case, and the malformed-hunk normalization only kicks in on the short-circuit path where it can't cause merge conflicts. 
Also lands previously-uncommitted housekeeping: - prompt.py: ruff-format-only diff on the merge-required block from the prior commit - test_team_wiring.py: ruff --fix removed unused MagicMock imports - test_gcp_backend.py / test_tasks.py: ruff --fix removed f-string-without-placeholder and unused-json import (both unrelated drift caught by the gate) Tests: 1 new (full suite: 327 passed) - ``test_test_merged_shortcircuits_on_identical_patches`` — source inspection confirms the short-circuit branch + "identical" merge-status string exist in test_merged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous openhands team runs (oh_team_v3) showed agents discovering the ``coop-task-*`` shell wrappers via ``compgen`` but never invoking them — gpt-5.5 strongly prefers typed tools registered with the LLM over arbitrary shell commands. This commit lands the architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered under the same name as openhands' built-in ``TaskTrackerTool`` so the registry resolution swaps it transparently. Files: * ``openhands/tools/task_tracker/coop_definition.py`` — new tool definition + executor. Same ``TaskTrackerAction`` / ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round- trip through the shared ``cb:<run_id>:`` Redis namespace that ``TaskListClient`` (host side) writes to. Tasks are auto-owned by the calling agent; ``view`` shows peer tasks prefixed with ``[<their_agent_id>]``. Registered under both ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing the module rebinds the latter to the Coop variant. * ``openhands/tools/preset/default.py`` — gains a ``team_mode`` kwarg (kept for API stability + tests; the actual swap happens server-side via the .pth/__init__ side-effect import, not by changing the host-side tool list). Pre-PR coop block split into a more nuanced team-mode prompt section that documents the TaskTracker → shared-list behavior. * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` — layers two more bits into the Modal image at build time: - ``add_local_file`` of ``coop_definition.py`` to ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands install) - ``grep ... || echo`` appending ``from . import coop_definition`` to the package's ``__init__.py`` so the registration runs at import time. Tests: 1 new + updated image-layering assertions - ``test_importing_coop_definition_overrides_local_registration``: inspecting the registry's ``_MODULE_QUALNAMES`` confirms ``TaskTrackerTool.name`` resolves to ``coop_definition``'s registration after import. 
- ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file`` calls + 2 ``run_commands`` layers (tool-file install + ``coop-task-*`` wrappers) and that the ``from . import coop_definition`` line is in the install commands. Full suite: 329 passed. Ruff / format / mypy all green. KNOWN LIMITATION (documented in coop_definition.py docstring): the openhands_sdk agent-server runs in a Modal sandbox that's network-isolated from the host Redis. The CoopTaskTracker is correctly registered and the LLM can call it, but every operation returns "Shared task list unavailable" because the sandbox can't ``socket.getaddrinfo("host.docker.internal")``. The fix is in the deployment layer (Modal tunnels, a Modal-hosted Redis, or running openhands directly via docker like the other adapters), not in this PR — verified by oh_team_v10: agent ran ``coop-task-list`` first ("The coop CLI failed; I'll use the shared task tracker."), then fell back to TaskTrackerAction which still hit the local executor because the override + Redis combo can't actually work in Modal. For non-Modal openhands deployments (e.g. local docker-backed openhands runs, future remote-conversation transports that share the host network), this tool works as designed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
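The "register under the same name so resolution swaps transparently" idea can be shown with a toy name-to-factory registry. This is a deliberately simplified stand-in: it is not the OpenHands registry API, and every name below except the two tool-name strings is hypothetical.

```python
# Toy registry: the last registration for a name wins.
_REGISTRY = {}


def register_tool(name, factory):
    """Bind (or rebind) a tool name to a factory callable."""
    _REGISTRY[name] = factory


def resolve_tool(name):
    """Instantiate whatever factory is currently bound to the name."""
    return _REGISTRY[name]()


class LocalTaskTracker:
    backend = "local"  # stands in for the built-in, per-agent tracker


class CoopTaskTracker:
    backend = "redis"  # stands in for the shared, Redis-backed tracker


# The built-in registration happens first...
register_tool("TaskTrackerTool", LocalTaskTracker)

# ...then the coop module is imported and registers under BOTH names,
# which rebinds "TaskTrackerTool" to the coop variant.
register_tool("CoopTaskTrackerTool", CoopTaskTracker)
register_tool("TaskTrackerTool", CoopTaskTracker)
```

Callers that only know the name ``TaskTrackerTool`` pick up the shared variant without any change on their side, which is why import order matters for this approach.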

Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker swap from actually functioning. Three pieces, working together: 1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``, accessed via ``unencrypted_host:unencrypted_port``). Re-uses the existing ``connectors/redis_server.ModalRedisServer`` — it was already written, just unused. Both the host TaskListClient and the agent sandboxes point at the same public TCP endpoint, so pre-seed and agent reads/writes share state. Falls back to local Redis for the other adapters. 2. **CoopTaskTrackerTool injection into the Modal sandbox.** The adapter now ``add_local_file``s three pieces into the OpenHands image at build time: - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py`` - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py`` - ``_team_init_override.py`` → ``$OH_DIR/__init__.py`` (replaces upstream; same exports + a side-effect import of coop_definition so the Redis-backed executor overrides the local TaskTracker registration at first import). Plus a ``find -name '*.pyc' -delete`` to invalidate Python's bytecode cache so the new __init__ actually re-runs. 3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle connections after a few minutes, so the original Redis client pre-seed used at startup gets closed before the 9-min agent run finishes. Re-open the client at harvest time using the same URL. 
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``-a openhands_sdk --setting team --git``: - Modal Redis startup: ``redis ready redis://r450.modal.host:41899`` - Both agents Submitted, 9m total - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓) - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0, time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2, agent1:1}, updates_per_agent: {agent2:4, agent1:5}`` - Cost: $3.33 Tests: image-layering assertions expanded — ``add_local_file`` now called 3 times (CLI helper, tool def, __init__ override), and the run_commands chain copies both files + wipes .pyc caches. Full suite: 329 passed. Ruff / format / mypy all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three changes that together unblock swe_agent team-mode runs (and solo/coop runs too, since the bug wasn't team-specific):

1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2`` in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``. The old package was renamed in v0.0.13; both swe_agent files had stale imports that failed at module load (TypeError or ModuleNotFoundError depending on how the framework was invoked), making every swe_agent invocation return Error before any LLM call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras in pyproject.toml. swe_agent's vendored framework imports these at module-load time even when the docker/S3/model paths are dormant, so a clean ``pip install '.[swe-agent]'`` without them would still raise ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.

End-to-end on dottxt_ai_outlines_task/1371 [1,2] with ``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5): both agents Submitted, patches 373 + 88 lines, both applied via git apply. Eval failed 0/2 due to a content-quality issue (``NameError: name 'Set' is not defined``: the agent used Set without importing it, and both agents hit the exit_cost budget limit mid-implementation), but that's model variance, not adapter wiring.

swe_agent is unblocked: it runs end-to-end, produces patches, and the eval pipeline processes them. Coordination metrics are still empty (claims_per_agent: {}) because swe_agent doesn't yet have the in-container coop-task-* CLI install or in-loop task auto-refresh; those are tracked as follow-ups in the PR body. For now the swe_agent team-mode run just gets the team prompt section + env vars; full team-tool integration is a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Five compounding bugs prevented `claude_code`, `codex`, and `mini_swe_agent_v2` from reaching honest pass rates on the core subset in team setting. All four benchmarked adapters (these three plus `openhands_sdk`) now score ≥ 5/10.

Bug fixes:
- normalize_patch ate trailing blank context lines (text.strip() consumes " \n"), breaking last-hunk line counts so git apply rejected otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all: it used a raw .strip() on the patch.txt read, so every msa patch ended in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no long-running command, so the image's default CMD exited and every exec hit "Sandbox not found". Now passes "sleep", "infinity" as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal because shared build_environment was hardcoded to DockerEnvironment. Added a backend kwarg and threaded config["backend"] through both adapters.
- Team lead prompt buried the integration step at the bottom of a long workflow list; Claude/Codex consistently exited after their own feature without reading /workspace/shared/<agent>.patch. Rewrote with a hard-rule opener and a 5-point pre-submission checklist. Member prompt now opens with "stay in your lane" per the lead's PLAN.md.

Eval change:
- eval test_merged now falls back to testing each agent's patch alone when the merged tree doesn't pass both features. Surfaced as merge.strategy="solo-agent1" / "solo-agent2". Credits the agent (typically the lead) who correctly integrated both features into one working patch but had it corrupted by union-merging with the other agent's partial implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
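The trailing-context-line bug is easy to reproduce in isolation. The sketch below contrasts a ``strip()``-based normalizer with one that trims only newlines; both functions are reconstructions from the description above, not the repository's code, and returning an empty string for an empty patch is an assumption.

```python
def normalize_patch_old(text: str) -> str:
    """ASSUMED shape of the old behavior: strip all whitespace, add one newline."""
    return text.strip() + "\n"


def normalize_patch(text: str) -> str:
    """Trim only leading/trailing newlines, then end with exactly one newline.

    A final context line consisting of a single space (a blank line in the
    source file) survives, so the hunk's line counts stay correct.
    """
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A diff whose last hunk ends in a blank context line (" " + newline),
# padded with extra newlines the way a sloppy writer might emit it.
PATCH = "\n\n" + "\n".join([
    "diff --git a/f.py b/f.py",
    "--- a/f.py",
    "+++ b/f.py",
    "@@ -1,3 +1,3 @@",
    "-x = 1",
    "+x = 2",
    " y = 3",
    " ",  # blank context line: must be kept
]) + "\n\n\n"
```

Here ``normalize_patch_old(PATCH)`` ends at `` y = 3`` and leaves the hunk one line short of its ``-1,3 +1,3`` header, while ``normalize_patch(PATCH)`` keeps the blank context line.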

- dataset/subsets/core.json: 10-pair subset for quick agent comparisons. Stratified by repo (largest-remainder proportional allocation by full-dataset pair count) with a one-slot floor per primary language (Python / Go / Rust / TS). Reproducible via scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent frameworks on the core subset in team setting. Includes per-task pass/fail matrix annotated with the merge strategy used, plus the chronological narrative of the dozen reruns that surfaced each of the bugs fixed in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
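For readers unfamiliar with largest-remainder allocation, here is a minimal sketch of the proportional step only. It is not the code in scripts/generate_core_subset.py: the function name and the repo counts are hypothetical, and the per-language floor and the seeded sampling within each repo are deliberately left out.

```python
def largest_remainder(counts, total_slots):
    """Split total_slots across keys in proportion to counts (Hamilton method).

    Integer arithmetic is used throughout so the result does not depend on
    floating-point rounding.
    """
    total = sum(counts.values())
    # Whole-slot part of each key's quota.
    alloc = {key: total_slots * n // total for key, n in counts.items()}
    # Fractional part of each quota, kept as an integer numerator.
    remainder = {key: total_slots * n % total for key, n in counts.items()}
    leftover = total_slots - sum(alloc.values())
    # Hand out the leftover slots to the largest remainders; ties break by name.
    ranked = sorted(counts, key=lambda key: (-remainder[key], key))
    for key in ranked[:leftover]:
        alloc[key] += 1
    return alloc


# Hypothetical pair counts per repo, for illustration only.
example_counts = {"repo_a": 47, "repo_b": 33, "repo_c": 20}
print(largest_remainder(example_counts, 10))
# {'repo_a': 5, 'repo_b': 3, 'repo_c': 2}
```

With quotas of 4.7, 3.3, and 2.0, the floors sum to 9, and the one leftover slot goes to `repo_a` because it has the largest fractional part.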

Previously test_merged returned early with an error when both naive and union merge strategies hit conflicts, so the solo-agent fallback never got a chance to credit a team whose lead alone integrated both features. Now we write an empty merged.patch, let run_tests fail naturally on the merged tree, and fall through to the solo fallback.

This doesn't change any of the current 40 eval results: union's merge=union attribute is tolerant enough that every task in the dataset produces some tree (potentially broken code with stitched-together lines), and the broken-tree-tests-fail path already triggered the solo fallback. This just closes the defensive gap for future pathological cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops the union-merge strategy and the member-only fallback from test_merged. The new chain is:

1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative (no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against both feature suites

Rationale: union merge concatenates conflicting hunks, which usually produces syntactically broken code; the cases where it accidentally produced a working tree were rewarding lucky non-overlap, not genuine coordination. The member-only fallback was symmetric to lead-only but incoherent under team-mode semantics (the lead is the designated integrator; if they didn't integrate, the team failed regardless of what the member's branch looks like).

Effect on the core-subset horizontal comparison:

- msa 6 → 6 (unchanged)
- oh 5 → 4 (loses pallets_jinja/1621, which was passing via union; that pass concealed that oh's lead doesn't integrate)
- cc 5 → 5 (unchanged)
- cx 5 → 5 (unchanged)

oh sliding below 5/10 is the correct outcome: the previous union-pass on pallets_jinja/1621 was a false positive of sorts (oh's agents commit their patch.txt into the working tree, which forces a merge conflict on patch.txt that union resolved while the actual source merge was non-conflicting). Under the stricter policy this gets routed through lead-alone, which oh's lead does not pass.

BENCHMARK_RESULTS.md updated to reflect the new totals + per-task matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry revised; full test suite still green (329 passed, 63 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
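The three-step chain can be summarized as a small decision function. This is a sketch of the policy as described in the commit message, not the actual test_merged code: the function names, the boolean inputs, and the strategy strings are assumptions chosen to match the N/L legend.

```python
def choose_strategy(lead_patch: str, member_patch: str, naive_merge_clean: bool) -> str:
    """Pick which tree the held-out suites are run against."""
    if lead_patch == member_patch:
        # Step 1: identical patches, so no merge is needed.
        return "identical"
    if naive_merge_clean:
        # Step 2: a clean naive 3-way merge; its test result is final.
        return "naive"
    # Step 3: conflicts, so only the lead's patch is evaluated.
    return "lead-alone"


def pair_passes(strategy: str, merged_tree_passes: bool, lead_alone_passes: bool) -> bool:
    """A pair passes only if both feature suites pass on the selected tree."""
    if strategy in ("identical", "naive"):
        return merged_tree_passes  # no further fallback after a clean merge
    return lead_alone_passes


# A clean merge whose tests fail is a failure, even if the lead's patch
# would have passed on its own.
strategy = choose_strategy("patch A", "patch B", naive_merge_clean=True)
print(strategy, pair_passes(strategy, merged_tree_passes=False, lead_alone_passes=True))
# naive False
```

The point of the design is visible in the last line: the lead-alone result is consulted only when the naive merge conflicts, never as a rescue for a clean merge that fails its tests.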

codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; the Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.

openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested on the same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
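The stdin hang has a direct analogue in Python's `subprocess` module. The commit fixes it with a shell redirect (`</dev/null`); the sketch below shows the same idea using `stdin=subprocess.DEVNULL`, with a stand-in child process rather than `codex exec` itself. The helper name is hypothetical.

```python
import subprocess
import sys


def run_with_closed_stdin(argv):
    """Run argv with stdin attached to the null device, so reads see EOF at once."""
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=30,
    )


# Stand-in child: like the behaviour described above, it reads stdin to EOF
# before doing anything else.
child = [sys.executable, "-c", "import sys; data = sys.stdin.read(); print(len(data))"]

result = run_with_closed_stdin(child)
print(result.stdout.strip())  # 0: the child saw EOF immediately and read nothing
```

If the child instead inherited a stdin that is open but never written to, `sys.stdin.read()` would block indefinitely, which matches the symptom reported for the Modal sandbox.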

swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the docker path actually start a container:

1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex", which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead.

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
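Both workarounds come down to how the final command line is assembled. The sketch below models that assembly with plain lists; it is not the adapter's code, the function names are hypothetical, and `"<startup>"` is the same placeholder used in the commit message.

```python
def effective_argv(entrypoint, command):
    """Docker runs ENTRYPOINT followed by the command given to `docker run`."""
    return list(entrypoint) + list(command)


def pipx_start_cmd(extra_args):
    """Build the pipx fallback: package via --spec, executable named explicitly."""
    return ["pipx", "run", "--spec", "swe-rex", "swerex-remote", *extra_args]


startup = ["sh", "-c", "<startup>"]

# With the image's ENTRYPOINT in place, "sh" becomes runner.sh's first argument.
print(effective_argv(["/usr/local/bin/runner.sh"], startup))
# ['/usr/local/bin/runner.sh', 'sh', '-c', '<startup>']

# With the entrypoint cleared (--entrypoint ""), the command runs as intended.
print(effective_argv([], startup))
# ['sh', '-c', '<startup>']

print(pipx_start_cmd([]))
# ['pipx', 'run', '--spec', 'swe-rex', 'swerex-remote']
```

The first printed list shows why runner.sh ends up treating "sh" as its feature-patch argument; the third shows the package name and the executable name being passed separately, which is what `--spec` allows.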

Resolves the squash-merge conflicts from #52 landing on main.

All conflicts followed the same pattern: this branch's HEAD contains #52's content plus the subsequent work on top, while main's squashed-merge commit contains only #52. Resolved each conflict by taking ours (HEAD), which preserves the cumulative state of:

- CHANGELOG: full Fixed/Changed/Added entries for team-mode bug fixes, eval policy change, core subset + benchmark doc, plus the original "team setting" bullet from #52
- _team/prompt.py: the stronger lead-prompt with the 5-point integration checklist (#52 had the older "buried integration" version)
- swe_agent/adapter.py: team-mode kwarg propagation + Docker backend dispatch + pipx --spec monkey-patch
- runner/team.py: openhands_sdk Modal-Redis tunnel branch
- everywhere else: my newer adapter changes are strict supersets of #52's

CI green locally: 329 tests passed, ruff clean, mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move team-mode primitives from cooperbench/agents/_team (private) to cooperbench/team_harness (public, library-shaped) so other benchmarks can consume the multi-agent coordination algorithm without depending on CooperBench's task layout.

Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL, ordered agent list, scratchpad volume name) with the feature config and exposes adapter-facing factories that each return None / [] / {} when their feature is disabled, so adapter code paths collapse to one branch:

      coop_env.update(session.env_for(agent_id))
      extra_run_args.extend(session.scratchpad_mount_args())
      mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans (task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member role split is the always-on baseline -- without it team is just coop.

Wires five --team-no-* CLI flags through cli.py -> runner.run -> runner.core -> runner.team -> each adapter. result.json now records team_features so post-hoc analysis can attribute deltas to the feature that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and openhands_agent_sdk now accept a team_features kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt, env, mount, MCP, install) gate on the session's config.

Tests: tests/agents/_team -> tests/team_harness (rename); new test_session.py (29 cases) covers the facade; four new ablation tests in tests/runner/test_team.py verify the runner-side gating. Full suite 363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):

- Default: writes task_log.json + tasks.json + metrics, cb-team-<run> volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log / tasks files, empty metrics dict, no volume. team_features in result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
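To make the "factories return None / [] / {} when disabled" pattern concrete, here is a minimal sketch. It borrows the class and method names from the commit message, but the bodies are assumptions: which flag gates `env_for`, the environment variable names, the mount target, and the shape of the MCP config are all hypothetical and may differ from cooperbench/team_harness.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Five per-feature switches; field names follow the commit message."""
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class TeamSession:
    run_id: str
    redis_url: str
    agents: list
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def env_for(self, agent_id: str) -> dict:
        # Assumption: env vars are tied to the task_list feature.
        if not self.config.task_list:
            return {}
        # Hypothetical variable names.
        return {"TEAM_REDIS_URL": self.redis_url, "TEAM_AGENT_ID": agent_id}

    def scratchpad_mount_args(self) -> list:
        if not self.config.scratchpad:
            return []
        # Volume name pattern from the smoke test; mount target is assumed.
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[dict]:
        if not self.config.mcp:
            return None
        # Hypothetical config shape.
        return {"command": "python", "args": [container_script_path]}


# Adapter-side usage: no feature checks, just one code path.
session = TeamSession(
    run_id="demo",
    redis_url="redis://localhost:6379/0",
    agents=["agent1", "agent2"],
    config=TeamHarnessConfig(scratchpad=False, mcp=False),
)
coop_env = {}
coop_env.update(session.env_for("agent1"))
extra_run_args = []
extra_run_args.extend(session.scratchpad_mount_args())
mcp_config = session.mcp_config(container_script_path="/opt/team/mcp_server.py")
print(coop_env, extra_run_args, mcp_config)
```

Because disabled features yield empty values, `update` and `extend` become no-ops and the adapter needs no `if` per feature; only the `None` from `mcp_config` has to be checked where it is consumed.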

Self-contained HTML report of the team-harness ablation + multi-agent comparison run on the flash subset (50 task pairs), codex/gpt-5.5, docker, 1 seed.

Contents:

- docs/team_harness_ablation_report.html: setting comparison (solo/coop/coop+git/team), one-feature-off ablation matrix, timing, findings, methodology, caveats. All numbers embedded inline.
- docs/team_harness_ablation_data/{core,flash}_ablation.csv: raw rows.
- scripts/run_team_ablation.py: sweep driver (config -> cooperbench run+eval).
- scripts/gen_ablation_report.py: regenerates the HTML from logs/.

Headline results (passed / 50, both-features-pass):

coop msg-only 13 · team no-scratchpad 15 · team no-task_list 20 · solo 24 · coop+git 28 · team no-mcp 30 · team no-auto_refresh 30 · team baseline 31 · team no-protocol 35

Findings:

- scratchpad (-16) and task_list (-11) are load-bearing; removing either drops team below solo (two uncoordinated agents < one).
- mcp/auto_refresh/protocol show no positive effect for codex (auto_refresh is a no-op for CLI adapters by design; protocol-off even scored +4, i.e. mild overhead without payoff).
- Most multi-agent value is a shared code substrate, not orchestration: coop+git (56%) ~ team-scratchpad (62%) >> messaging-only coop (26%).

Caveat: team runs used the scratchpad for code-sharing, NOT --git, so "team vs coop+git" compares two sharing substrates; the team --git cell is untested (follow-up).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
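The deltas and percentages quoted in the findings follow directly from the headline counts. The short script below recomputes them; the counts are copied from the commit message, while the dictionary keys and variable names are just labels chosen for this sketch.

```python
TOTAL_PAIRS = 50

# Passed pairs per configuration, as listed in the headline results.
passed = {
    "coop msg-only": 13,
    "team no-scratchpad": 15,
    "team no-task_list": 20,
    "solo": 24,
    "coop+git": 28,
    "team no-mcp": 30,
    "team no-auto_refresh": 30,
    "team baseline": 31,
    "team no-protocol": 35,
}

baseline = passed["team baseline"]

for name, count in sorted(passed.items(), key=lambda item: item[1]):
    rate = count / TOTAL_PAIRS
    delta = count - baseline  # difference in passed pairs vs. the all-on team run
    print(f"{name:22s} {count:2d}/{TOTAL_PAIRS}  {rate:4.0%}  delta {delta:+d}")
```

The output reproduces the figures cited above: -16 for no-scratchpad, -11 for no-task_list, +4 for no-protocol, and rates of 26%, 56%, and 62% for messaging-only coop, coop+git, and the team baseline.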

CooperBench had no docs-deployment pipeline: the original template docs.yaml (MkDocs + mike) was deleted in b61bc00, and GitHub Pages points at a non-existent gh-pages branch (live site 404s). So docs/*.html (e.g. the team-harness ablation report from PR #59) never get published.

Add a Cloudflare Pages workflow modeled on cooperbench/CooperTrain's deploy-report.yml: on push to main (and PRs) touching docs/**, it generates a docs/index.html listing every docs/*.html report and deploys the docs/ dir to the Cloudflare Pages project "cooperbench-reports". PRs get a preview deployment + comment.

Requires CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID secrets (same ones CooperTrain uses). Until they're set on this repo, the deploy step skips with a warning so PR checks stay green.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
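The index-generation step can be pictured with a few lines of Python. This is only an illustration of the idea: the workflow's real implementation is not shown in this PR and may well be a shell step, and the function name and HTML layout here are assumptions.

```python
import tempfile
from html import escape
from pathlib import Path


def build_index(docs_dir: Path) -> str:
    """Return an HTML page linking every *.html report in docs_dir."""
    # Skip index.html itself so the page does not link to its own output.
    reports = sorted(p.name for p in docs_dir.glob("*.html") if p.name != "index.html")
    items = "\n".join(
        f'    <li><a href="{escape(name, quote=True)}">{escape(name)}</a></li>'
        for name in reports
    )
    return (
        "<!doctype html>\n<html>\n<body>\n  <ul>\n"
        + items
        + "\n  </ul>\n</body>\n</html>\n"
    )


# Demo against a throwaway directory rather than the real docs/ folder.
with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp)
    (docs / "team_harness_ablation_report.html").write_text("<html></html>")
    (docs / "notes.md").write_text("not a report")
    index_html = build_index(docs)
    (docs / "index.html").write_text(index_html)
    print(index_html)
```

Only the `.html` report is listed; the Markdown file is ignored, and rerunning the function after `index.html` exists does not add a self-link.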


github-actions · 2026-05-21T21:01:10Z

Cloudflare Pages preview deployed

Latest report: https://team-harness-ablation-report.cooperbench-reports.pages.dev/team_harness_ablation_report
Versioned URL: https://f0c92690.cooperbench-reports.pages.dev/team_harness_ablation_report
Project root: https://team-harness-ablation-report.cooperbench-reports.pages.dev

Deployed from 6702199 on team-harness-ablation-report.

Ubuntu and others added 20 commits May 16, 2026 20:42

Base automatically changed from team-harness-module to main May 21, 2026 17:37

ProKil mentioned this pull request May 21, 2026

ci: auto-deploy docs to Cloudflare Pages #62

Merged

Merge remote-tracking branch 'origin/main' into team-harness-ablation…

db211ac

…-report # Conflicts: # CHANGELOG.md # src/cooperbench/agents/openhands_agent_sdk/adapter.py

ProKil merged commit bc5d99b into main May 21, 2026
4 checks passed

ProKil deleted the team-harness-ablation-report branch May 21, 2026 21:02

ProKil merged 21 commits into main from team-harness-ablation-report
