docs: team-harness ablation report (flash, codex/gpt-5.5) #59
Merged
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter. Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.
Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access
--skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals /
messages. Cost is reported as 0.0 because Codex does not emit a
cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a
"model not found" shaped error, the adapter retries once without
``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter
returns Error immediately instead of spinning up a container that
can only fail.
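The token accounting described in the highlights above can be sketched as follows. This is a minimal illustration, not the adapter's actual parser: the ``usage`` object and its ``input_tokens`` / ``output_tokens`` field names are assumptions about the event payload, and ``sum_turn_tokens`` is a hypothetical helper name.

```python
import json
from typing import Dict, Iterable


def sum_turn_tokens(lines: Iterable[str]) -> Dict[str, int]:
    """Sum token usage across every ``turn.completed`` event in a JSONL stream."""
    # Field names below are assumed for illustration.
    totals = {"input_tokens": 0, "output_tokens": 0}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise interleaved in the stream
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key) or 0)
    return totals


stream = [
    '{"type": "turn.started"}',
    '{"type": "turn.completed", "usage": {"input_tokens": 100, "output_tokens": 20}}',
    "not json",
    '{"type": "turn.completed", "usage": {"input_tokens": 50, "output_tokens": 5}}',
]
totals = sum_turn_tokens(stream)
```

Cost stays at 0.0 in this scheme because nothing in the stream carries a price; only the per-turn counts are accumulated.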
Shared ``_coop`` module:
- ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh`` — pip-installs redis and drops the shell
wrappers; each adapter's setup.sh sources it.
- ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
agnostic.
- ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
``write_file_in_container`` / ``read_file_from_container``,
``rewrite_comm_url_for_container``, ``build_git_setup_command``,
``parse_sent_messages_log``, and ``normalize_patch``.
Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires. Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace). This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module. Full suite: 228
passed, 63 skipped.
End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch
(applies cleanly with the trailing-newline fix).
- codex coop+git f1,f2: both Submitted, both patches applied, but 0/2
tests pass because of a coordination failure: agent1 fetched ``team``
but never merged, so the stacked patches produce a Python SyntaxError
at line 144 of the modified file. Claude on the same task scored 2/2;
Codex used the tools less aggressively on this run.
The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug. Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product. Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:
1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
atomic claim semantics (HSETNX-style — exactly one caller wins on a
race) and an audit log of every mutation. Exposed in the container
as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
/ ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated
``team-lead`` and gets a system-prompt block instructing them to
break the spec into tasks, assign them via ``coop-task-create
--assign``, watch progress, and integrate. Other agents are
``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
mounted at ``/workspace/shared`` in every container. Free
coordination artifact for design notes, partial diffs, interface
sketches.
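The atomic claim in primitive 1 can be sketched as below. This is an illustration under stated assumptions, not ``TaskListClient`` itself: only the ``cb:<run_id>:`` prefix comes from the description above, while the ``task:<id>`` key suffix, the ``owner`` field, and the ``InMemoryHashStore`` stand-in are hypothetical. The stand-in mirrors the return convention of redis-py's ``hsetnx(name, key, value)`` (1 if the field was set, 0 if it already existed) so the example runs without a server.

```python
from typing import Dict, Optional


class InMemoryHashStore:
    """Tiny stand-in exposing the two Redis hash calls this sketch needs."""

    def __init__(self) -> None:
        self._hashes: Dict[str, Dict[str, str]] = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        fields = self._hashes.setdefault(name, {})
        if key in fields:
            return 0  # field already present: the caller lost the race
        fields[key] = value
        return 1

    def hget(self, name: str, key: str) -> Optional[str]:
        return self._hashes.get(name, {}).get(key)


def claim_task(store, run_id: str, task_id: str, agent_id: str) -> bool:
    """Return True only for the first agent to claim the task."""
    task_key = f"cb:{run_id}:task:{task_id}"  # key suffix is an assumption
    return bool(store.hsetnx(task_key, "owner", agent_id))


store = InMemoryHashStore()
first = claim_task(store, "run1", "t1", "agent1")
second = claim_task(store, "run1", "t1", "agent2")
owner = store.hget("cb:run1:task:t1", "owner")
```

Because set-if-absent is a single server-side operation, two agents racing on the same task cannot both observe success, which is what makes the claim safe without a separate lock.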
Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``. Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
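A minimal sketch of how those metrics could be derived from an audit log. The event shape (``ts``, ``op``, ``task_id``, ``agent``, ``status``) and the ``run_start`` parameter are assumptions made for illustration; the real log schema may differ.

```python
from collections import Counter
from typing import Dict, List, Optional


def coordination_metrics(audit_log: List[dict], run_start: float) -> dict:
    """Replay the audit log in time order and derive the summary metrics."""
    claims: Counter = Counter()
    updates: Counter = Counter()
    owner: Dict[str, Optional[str]] = {}
    status: Dict[str, str] = {}
    first_claim_ts: Optional[float] = None
    for event in sorted(audit_log, key=lambda e: e["ts"]):
        task_id = event["task_id"]
        owner.setdefault(task_id, None)
        status.setdefault(task_id, "open")
        if event["op"] == "claim":
            claims[event["agent"]] += 1
            owner[task_id] = event["agent"]
            if first_claim_ts is None:
                first_claim_ts = event["ts"]
        elif event["op"] == "update":
            updates[event["agent"]] += 1
            status[task_id] = event.get("status", status[task_id])
    return {
        "time_to_first_claim_seconds": (
            None if first_claim_ts is None else round(first_claim_ts - run_start, 1)
        ),
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for o in owner.values() if o is None),
    }


log = [
    {"ts": 1.0, "op": "create", "task_id": "t1", "agent": "bench-runner"},
    {"ts": 1.0, "op": "create", "task_id": "t2", "agent": "bench-runner"},
    {"ts": 34.2, "op": "claim", "task_id": "t1", "agent": "agent1"},
    {"ts": 60.0, "op": "update", "task_id": "t1", "agent": "agent1", "status": "done"},
]
metrics = coordination_metrics(log, run_start=0.0)
```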
Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs. The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.
Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log,
run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
metrics in result, three-agent team)
- 4 misc
Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its
work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to
/workspace/shared/agent2.patch + a summary.md)
- **0/2 pass rate**: both ``patch.txt`` files were empty. The
members wrote diffs to the scratchpad instead of also writing
``/workspace/repo/patch.txt``, and the lead never ran the final
integration step. This is real coordination signal (the prompt
told them to write to both places, but they followed only the
scratchpad half). A follow-up will tighten the prompt to make
patch.txt submission the explicit final step.
Future PRs (intentionally out of scope here so this lands at a
reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of
messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)
Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resh (#53)
Lands the four follow-ups that were called out as "Out of scope" on the
team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.
1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so
agents can ``ls`` and ``cat`` tasks with their existing tools rather
than going through the ``coop-task-list`` CLI. Layout mirrors Claude
Code's team primitive: one ``<id>.json`` per task, plus ``_index.json``
(cheap ``ls`` target) and ``_log.jsonl`` (audit trail). Triggered on
every ``coop-task-list`` invocation and from the host runner at
startup. Files are written via tempfile+replace so readers never
observe a partial state.
2. **Typed coop-request / coop-respond protocol**
(``_team/protocol.py``). Layered on plain Redis messaging, mirroring
CC's ``plan_approval_request`` / ``plan_approval_response`` shape.
``coop-request <peer> <kind> <body>`` returns a request_id (and
optionally blocks via ``--wait N`` for a response).
``coop-respond <request_id> <body>`` writes back; the sender's
``await_response`` uses BLPOP so it actually sleeps instead of
busy-polling. Both events flow into the shared task-log so coordination
metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC
server that exposes a single ``wait_for_message`` tool backed by BLPOP
on the agent's inbox. Registered automatically: the Claude Code adapter
writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; the
Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make
"watch the inbox" a natural idle behavior for the CLI adapters instead
of a busy-loop on ``coop-recv`` returning empty; this is the closest we
can get to push-style delivery for opaque CLI agent loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
``TeamPoller`` is a per-agent host-side helper that
``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries
(the same hook as the existing inbox poll). The LLM sees a compact
``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to
every turn so it doesn't need to remember to call ``coop-task-list``.
Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree
change is one branch in ``step()``. The same module also exports
``poll_team_state()`` for in-container use (env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members
writing diffs to ``/workspace/shared/<id>.patch`` only and never to
``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination.
Both lead and member prompts now have an explicit
``### Final submission — REQUIRED`` section that calls out
``patch.txt`` as the only file the bench evaluates and provides the
exact ``git diff > patch.txt`` command.
Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table. Previously the column showed 0 because the
function tried ``len(r.get("patch", "").splitlines())`` and team mode
doesn't store the full patch in the agents dict.
Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking,
unknown-tool error, env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)
Full suite: **311 passed**, 63 skipped.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).
All four follow-ups are visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``. The previous run scored 0/2 on
the same task, so the prompt fix is doing real work.
Stacks on #52.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
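The tempfile+replace write used by the filesystem mirror can be sketched as follows. This is a generic illustration of the technique, not the contents of ``fs_mirror.py``; ``write_json_atomic`` is a hypothetical helper name.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "tasks" / "_index.json"
    write_json_atomic(target, {"open": 1})
    write_json_atomic(target, {"open": 0, "done": 1})
    final = json.loads(target.read_text(encoding="utf-8"))
    leftovers = [p.name for p in target.parent.iterdir() if p.name.endswith(".tmp")]
```

An agent running ``cat`` on the index while the snapshot is being rewritten reads the previous complete version, because the destination path only ever points at fully written files.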
Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode. Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.
New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds. Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.
Per-adapter wiring:
- ``mini_swe_agent_v2``: appends team_task_section to task;
propagates CB_TEAM_* through env_kwargs["env"]; adds
``--add-host=host.docker.internal:host-gateway`` + scratchpad
volume to docker run args; installs the team CLI scripts + pip
redis in the container after env spin-up. The existing
``TeamPoller`` host-side hook (already in step()) still fires.
- ``openhands_sdk``: appends team_task_section to task; folds a new
``team_env`` dict into ``coop_info`` so
``_build_credentials_dict`` propagates CB_TEAM_* into the
sandbox. Coop-task-* binary install in the OpenHands agent-server
image is a follow-up — OpenHands manages its own image build and
doesn't expose a clean post-start exec hook.
- ``swe_agent``: appends team_task_section to task. The SWE-agent
framework's sandbox + agent loop is third-party and harder to
instrument; everything beyond the prompt is a follow-up.
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
consistency between team_task_section and build_team_instruction,
every registered runner accepts the team kwargs, openhands env
keys, swe_agent signature
Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.
End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment. Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.
Compatibility matrix is now:
| Adapter           | Accepts kwargs | Team prompt   | Auto-refresh | CLI in container | CB_TEAM_* env |
|-------------------|----------------|---------------|--------------|------------------|---------------|
| claude_code       | yes            | yes (full)    | n/a          | yes              | yes           |
| codex             | yes            | yes (full)    | n/a          | yes              | yes           |
| mini_swe_agent_v2 | yes            | yes (section) | yes          | yes              | yes           |
| openhands_sdk     | yes            | yes (section) | n/a          | NOT YET          | yes           |
| swe_agent         | yes            | yes (section) | NOT YET      | NOT YET          | NOT YET       |
Stacks on #52 (merged-up team-mode branch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required). Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost. Modal caches the layered image; subsequent
team runs are instant.
Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked. Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex. Both patches applied; f1 14/14,
f2 19/20.
Tests: 2 new (full suite: 326 passed)
- test_team_env_triggers_image_layering — verifies add_local_file
+ pip_install + run_commands fire with the right args when team
mode is active
- test_no_layering_when_team_inactive — verifies solo / coop
runs skip the image-build cost
Matrix update — openhands_sdk now reads:
Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it. Both fetched their peer's branch (2 each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.
The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it. This rewrite:
- Renames the section ``## Git collaboration — MERGE IS REQUIRED
BEFORE SUBMITTING`` so the imperative is in the heading itself.
- Adds an explicit "Required final sequence — run this verbatim
before exiting" block with the full fetch+merge+diff sequence,
parameterized over every partner branch.
- Explains *why* (each agent's patch.txt is evaluated against every
feature's tests; without the merge, the peer feature's symbols
are missing → ImportError).
- Frames it the same way the patch.txt step is framed (REQUIRED,
skip-at-your-loss), which the original prompt fix proved
codex responds to.
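A sketch of how a fetch+merge+diff block parameterized over partner branches might be assembled. Only the ``team`` remote name and the ``patch.txt`` target come from the text; the ``<remote>/<branch>`` naming, the ``base_ref`` default, and the function name are hypothetical, and the real prompt text may differ.

```python
import shlex
from typing import List, Sequence


def required_final_sequence(
    partner_branches: Sequence[str],
    remote: str = "team",
    base_ref: str = "origin/main",  # hypothetical base to diff against
) -> List[str]:
    """Build the shell commands an agent is told to run verbatim before exiting."""
    commands = [f"git fetch {shlex.quote(remote)}"]
    for branch in partner_branches:
        # One merge per partner so every peer feature lands in this agent's tree.
        commands.append(f"git merge --no-edit {shlex.quote(f'{remote}/{branch}')}")
    # Diff against a fixed base so the merged peer work is part of patch.txt.
    commands.append(f"git diff {shlex.quote(base_ref)} > patch.txt")
    return commands


sequence = required_final_sequence(["agent2"])
```

Diffing against a base ref rather than the working tree matters here: after a committed merge, a bare ``git diff`` would be empty.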
Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both
patches now contain both features' symbols. Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).
A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree. This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch. That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked. Fix is a separate concern from team-mode wiring.
Tests still pass (existing prompt tests are content-agnostic;
326 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.
Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches. If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance. Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
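The short-circuit can be sketched as a byte comparison ahead of the merge. This is an assumption-laden illustration: the function name and the ``"needs-merge"`` placeholder are hypothetical, and the ``git apply --recount`` normalization mentioned above is deliberately left out.

```python
import shutil
import tempfile
from pathlib import Path


def shortcircuit_identical(patch1: Path, patch2: Path, merged: Path) -> dict:
    """Copy patch1 to merged and report 'identical' when both patches match byte for byte."""
    if patch1.read_bytes() == patch2.read_bytes():
        shutil.copyfile(patch1, merged)
        return {"merge": {"status": "identical"}}
    # Placeholder status: the caller would continue with the real merge path.
    return {"merge": {"status": "needs-merge"}}


with tempfile.TemporaryDirectory() as root:
    base = Path(root)
    (base / "p1.patch").write_text("--- a/f\n+++ b/f\n@@ -1 +1 @@\n-x\n+y\n")
    (base / "p2.patch").write_text("--- a/f\n+++ b/f\n@@ -1 +1 @@\n-x\n+y\n")
    (base / "p3.patch").write_text("--- a/f\n+++ b/f\n@@ -1 +1 @@\n-x\n+z\n")
    same = shortcircuit_identical(base / "p1.patch", base / "p2.patch", base / "merged.patch")
    merged_text = (base / "merged.patch").read_text()
    different = shortcircuit_identical(base / "p1.patch", base / "p3.patch", base / "other.patch")
    other_written = (base / "other.patch").exists()
```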
Verified on the codex-team e2e:
- cx_team_v5 (codex agents perfectly merged to identical 243-line
patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
- cx_team_v4 (codex agents diverged on the merge): unchanged at
f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
agent2-alone via apply_status: {'agent1': 'failed', ...}
I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.
Also lands previously-uncommitted housekeeping:
- prompt.py: ruff-format-only diff on the merge-required block
from the prior commit
- test_team_wiring.py: ruff --fix removed unused MagicMock
imports
- test_gcp_backend.py / test_tasks.py: ruff --fix removed
f-string-without-placeholder and unused-json import (both
unrelated drift caught by the gate)
Tests: 1 new (full suite: 327 passed)
- ``test_test_merged_shortcircuits_on_identical_patches`` — source
inspection confirms the short-circuit branch + "identical"
merge-status string exist in test_merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands. This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.
Files:
* ``openhands/tools/task_tracker/coop_definition.py`` — new tool
definition + executor. Same ``TaskTrackerAction`` /
``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
trip through the shared ``cb:<run_id>:`` Redis namespace that
``TaskListClient`` (host side) writes to. Tasks are auto-owned
by the calling agent; ``view`` shows peer tasks prefixed with
``[<their_agent_id>]``. Registered under both
``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
the module rebinds the latter to the Coop variant.
* ``openhands/tools/preset/default.py`` — gains a ``team_mode``
kwarg (kept for API stability + tests; the actual swap happens
server-side via the .pth/__init__ side-effect import, not by
changing the host-side tool list). The pre-PR coop block is split
into a more nuanced team-mode prompt section that documents the
TaskTracker → shared-list behavior.
* ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
layers two more bits into the Modal image at build time:
- ``add_local_file`` of ``coop_definition.py`` to
``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
install)
- ``grep ... || echo`` appending
``from . import coop_definition`` to the package's
``__init__.py`` so the registration runs at import time.
Tests: 1 new + updated image-layering assertions
- ``test_importing_coop_definition_overrides_local_registration``:
inspecting the registry's ``_MODULE_QUALNAMES`` confirms
``TaskTrackerTool.name`` resolves to ``coop_definition``'s
registration after import.
- ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
calls + 2 ``run_commands`` layers (tool-file install +
``coop-task-*`` wrappers) and that the
``from . import coop_definition`` line is in the install
commands.
Full suite: 329 passed. Ruff / format / mypy all green.
KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis. The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``. The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.
For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning. Three pieces, working together:
1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
accessed via ``unencrypted_host:unencrypted_port``). Re-uses the
existing ``connectors/redis_server.ModalRedisServer`` — it was
already written, just unused. Both the host TaskListClient and
the agent sandboxes point at the same public TCP endpoint, so
pre-seed and agent reads/writes share state. Falls back to local
Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
adapter now ``add_local_file``s three pieces into the OpenHands
image at build time:
- ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
- ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
- ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
(replaces upstream; same exports + a side-effect import of
coop_definition so the Redis-backed executor overrides the
local TaskTracker registration at first import).
Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
connections after a few minutes, so the Redis client opened for the
startup pre-seed is closed before the 9-min agent run finishes. The
runner re-opens the client at harvest time using the same URL.
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:
- Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
- Both agents Submitted, 9m total
- Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
- Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
- Cost: $3.33
Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.
Full suite: 329 passed. Ruff / format / mypy all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):
1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
The old package was renamed in v0.0.13; both swe_agent files
had stale imports that failed at module load (TypeError or
ModuleNotFoundError depending on how the framework was invoked),
so every swe_agent invocation returned Error before any LLM
call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
in pyproject.toml. swe_agent's vendored framework imports these
at module-load time even when the docker/S3/model paths are
dormant, so a clean ``pip install '.[swe-agent]'`` without these
would still ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply. Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring. swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.
Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body. For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core subset
in team setting. All four frameworks in the core-subset comparison now
score ≥ 5/10.
- normalize_patch ate trailing blank context lines (text.strip()
consumes " \n"), breaking last-hunk line counts so git apply rejected
otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all: it
called raw .strip() on the patch.txt read, so every msa patch ended in
a non-newline byte. It now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
long-running command, so the image's default CMD exited and every exec
hit "Sandbox not found". It now passes "sleep", "infinity" as the
positional command (matches the eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
because the shared build_environment was hardcoded to
DockerEnvironment. Added a backend kwarg and threaded
config["backend"] through both adapters.
- Team lead prompt buried the integration step at the bottom of a long
workflow list; Claude/Codex consistently exited after their own feature
without reading /workspace/shared/<agent>.patch. Rewrote with a
hard-rule opener and a 5-point pre-submission checklist. Member prompt
now opens with "stay in your lane" per the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch alone
when the merged tree doesn't pass both features. Surfaced as
merge.strategy="solo-agent1" / "solo-agent2". Credits the agent
(typically the lead) who correctly integrated both features into one
working patch but had it corrupted by union-merging with the other
agent's partial implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
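The normalize_patch fix in the first bullet above can be illustrated with a small sketch. It follows the "lstrip/rstrip on newline only" description but is not the repository's actual function body; the sample hunk is made up.

```python
def normalize_patch(text: str) -> str:
    """Trim surrounding newlines only and end with exactly one trailing newline."""
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A hunk whose last line is a blank *context* line: a single space, then newline.
patch = "@@ -1,2 +1,3 @@\n a\n+b\n \n"

# Plain strip() also removes the space, so the hunk loses its final context
# line and no longer matches the counts in its header.
stripped = patch.strip() + "\n"
fixed = normalize_patch("\n\n" + patch + "\n\n")
```

Here ``stripped`` ends at ``+b`` while ``fixed`` is byte-identical to ``patch``, which is the difference between a rejected diff and one whose line counts still agree with the hunk header.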
- dataset/subsets/core.json: 10-pair subset for quick agent
comparisons. Stratified by repo (largest-remainder proportional
allocation by full-dataset pair count) with a one-slot floor per
primary language (Python / Go / Rust / TS). Reproducible via
scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
frameworks on the core subset in team setting. Includes a per-task
pass/fail matrix annotated with the merge strategy used, plus the
chronological narrative of the dozen reruns that surfaced each of the
bugs fixed in the previous commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
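The largest-remainder allocation named in the core.json entry above works roughly as sketched here. This is not ``scripts/generate_core_subset.py``: the repo names and pair counts are invented, and the per-language floor and the seeded sampling step are omitted.

```python
from typing import Dict


def largest_remainder(counts: Dict[str, int], slots: int) -> Dict[str, int]:
    """Split `slots` across keys in proportion to `counts` (largest-remainder method)."""
    total = sum(counts.values())
    quotas = {key: value * slots / total for key, value in counts.items()}
    allocation = {key: int(quota) for key, quota in quotas.items()}  # floor of each quota
    leftover = slots - sum(allocation.values())
    # Hand the remaining slots to the largest fractional remainders,
    # breaking ties in favor of the larger group.
    ranked = sorted(
        counts,
        key=lambda key: (quotas[key] - allocation[key], counts[key]),
        reverse=True,
    )
    for key in ranked[:leftover]:
        allocation[key] += 1
    return allocation


# Hypothetical pair counts per repo.
allocation = largest_remainder({"repo_a": 47, "repo_b": 33, "repo_c": 20}, slots=10)
```

With these made-up counts the floors are 4, 3, and 2, and the single leftover slot goes to ``repo_a``, which has the largest fractional remainder.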
Previously test_merged returned early with an error when both naive and union merge strategies hit conflicts, so the solo-agent fallback never got a chance to credit a team whose lead alone integrated both features. Now we write an empty merged.patch, let run_tests fail naturally on the merged tree, and fall through to the solo fallback. Doesn't change any of the current 40 eval results — union's merge=union attribute is tolerant enough that every task in the dataset produces some tree (potentially broken code with stitched-together lines); the broken-tree-tests-fail path already triggered the solo fallback. This just closes the defensive gap for future pathological cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:
1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative
(no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against
both feature suites
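The three-step chain above amounts to a small decision function. The sketch below is illustrative only: ``pick_eval_strategy`` and its boolean input are hypothetical, and the labels borrow from the ``identical`` merge status and the naive / lead-alone legend used elsewhere in this PR.

```python
def pick_eval_strategy(lead_patch: bytes, member_patch: bytes, naive_merge_clean: bool) -> str:
    """Choose which tree the feature tests are run against."""
    if lead_patch == member_patch:
        return "identical"  # step 1: skip the merge entirely
    if naive_merge_clean:
        return "naive"  # step 2: merged-tree tests are authoritative
    return "lead-alone"  # step 3: conflicts, so only the lead's patch is tested


strategies = [
    pick_eval_strategy(b"same", b"same", naive_merge_clean=False),
    pick_eval_strategy(b"lead", b"member", naive_merge_clean=True),
    pick_eval_strategy(b"lead", b"member", naive_merge_clean=False),
]
```

There is no branch that tests the member's patch alone, which reflects the rationale that the lead is the designated integrator.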
Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
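The three-step chain above can be sketched as a small decision function. This is an illustration, not the actual test_merged code: `try_naive_merge` and `run_tests` are hypothetical callables injected by the caller, and the strategy labels are assumed names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class MergeOutcome:
    strategy: str  # assumed labels: "identical", "naive", "lead-alone"
    passed: bool


def evaluate_team(
    lead_patch: str,
    member_patch: str,
    try_naive_merge: Callable[[str, str], Optional[str]],  # None on conflict
    run_tests: Callable[[str], bool],  # True if both feature suites pass
) -> MergeOutcome:
    # 1. Identical patches: skip the merge entirely.
    if lead_patch == member_patch:
        return MergeOutcome("identical", run_tests(lead_patch))
    # 2. Clean naive merge: the merged tree is authoritative, no fallback.
    merged = try_naive_merge(lead_patch, member_patch)
    if merged is not None:
        return MergeOutcome("naive", run_tests(merged))
    # 3. Conflict: test the lead's patch alone against both suites.
    return MergeOutcome("lead-alone", run_tests(lead_patch))


# Stub usage: the merge conflicts and only the lead's patch passes.
conflict = evaluate_team("A", "B", lambda a, b: None, lambda p: p == "A")
```

There is deliberately no branch that tests the member's patch alone, matching the rationale that the lead is the designated integrator.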
Effect on the core-subset horizontal comparison:
msa 6 → 6 (unchanged)
oh 5 → 4 (loses pallets_jinja/1621 — was passing via union, which
concealed that oh's lead doesn't integrate)
cc 5 → 5 (unchanged)
cx 5 → 5 (unchanged)
oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.
BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.
Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
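A rough sketch of the stdin fix, assuming the adapter assembles a shell command string. The function name, its parameters, and passing the prompt as a positional argument are assumptions; the flags are the ones listed in the PR description.

```python
import shlex


def build_codex_command(prompt: str, model: str) -> str:
    """Assemble a `codex exec` shell command that cannot block on stdin."""
    argv = [
        "codex", "exec", "--json",
        "--sandbox", "danger-full-access",
        "--skip-git-repo-check",
        "--model", model,
        prompt,  # assumption: the prompt is passed positionally
    ]
    # Redirecting stdin from /dev/null yields an immediate EOF, so codex
    # does not wait for additional input on backends that keep stdin open.
    return shlex.join(argv) + " </dev/null"


cmd = build_codex_command("fix the failing test", "gpt-5.5")
```

The redirect is appended after quoting so the shell treats it as a redirection rather than as one more argument.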
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default.
Two upstream-swerex issues had to be worked around to make the docker path actually start a container:
1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex", which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead.
Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
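The backend dispatch and the two workarounds can be outlined without importing swerex. Both helpers and the returned dict shape are hypothetical stand-ins for the adapter's real code; only the `--entrypoint ""` arguments and the `pipx run --spec swe-rex swerex-remote` form come from the commit message.

```python
def pick_deployment(config: dict) -> dict:
    """Return illustrative deployment settings for the chosen backend."""
    backend = config.get("backend", "modal")  # Modal stays the default
    if backend == "docker":
        # An empty entrypoint keeps runner.sh from swallowing `sh -c ...`.
        return {"kind": "docker", "docker_args": ["--entrypoint", ""]}
    return {"kind": "modal"}


def swerex_start_cmd(remote_args: list[str]) -> list[str]:
    """Start swerex-remote via pipx, naming the package with --spec."""
    # `--spec swe-rex` tells pipx which package to install; the executable
    # it then runs is `swerex-remote`, which that package provides.
    return ["pipx", "run", "--spec", "swe-rex", "swerex-remote", *remote_args]


docker_cfg = pick_deployment({"backend": "docker"})
start = swerex_start_cmd(["<remote-args>"])  # placeholder arguments
```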
Resolves the squash-merge conflicts from #52 landing on main. All conflicts followed the same pattern: this branch's HEAD contains #52's content plus the subsequent work on top, while main's squashed-merge commit contains only #52. Resolved each conflict by taking ours (HEAD), which preserves the cumulative state of:
- CHANGELOG: full Fixed/Changed/Added entries for team-mode bug fixes, eval policy change, core subset + benchmark doc, plus the original "team setting" bullet from #52
- _team/prompt.py: the stronger lead-prompt with the 5-point integration checklist (#52 had the older "buried integration" version)
- swe_agent/adapter.py: team-mode kwarg propagation + Docker backend dispatch + pipx --spec monkey-patch
- runner/team.py: openhands_sdk Modal-Redis tunnel branch
- everywhere else: my newer adapter changes are strict supersets of #52's
CI green locally: 329 tests passed, ruff clean, mypy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.
Adds TeamSession + TeamHarnessConfig:
- TeamSession bundles per-run state (run_id, namespaced Redis URL,
ordered agent list, scratchpad volume name) with the feature config
and exposes adapter-facing factories that each return None / [] / {}
when their feature is disabled, so adapter code paths collapse to one
branch:
coop_env.update(session.env_for(agent_id))
extra_run_args.extend(session.scratchpad_mount_args())
mcp_config = session.mcp_config(container_script_path=...)
- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
(task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member
role split is the always-on baseline -- without it team is just coop.
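The "factories return empty values when disabled" pattern might look roughly like this. The field set follows the five booleans named above; everything else (the default values, the mount target, the MCP config shape) is an assumption for illustration, not the team_harness API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class TeamHarnessConfig:
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class TeamSession:
    run_id: str
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def scratchpad_mount_args(self) -> list[str]:
        # Disabled feature -> empty list, so callers can always .extend().
        if not self.config.scratchpad:
            return []
        # Assumed mount target; the volume name follows cb-team-<run>.
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[dict]:
        if not self.config.mcp:
            return None
        # Hypothetical config shape, for illustration only.
        return {"mcpServers": {"team": {"command": "python", "args": [container_script_path]}}}


full = TeamSession("r1")
ablated = TeamSession("r1", TeamHarnessConfig(scratchpad=False, mcp=False))
```

Because the disabled case returns a value of the same type as the enabled one, an adapter can call `extra_run_args.extend(session.scratchpad_mount_args())` without checking the config first.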
Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter. result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
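One way to map the `--team-no-*` flags onto a `team_features` record, sketched with argparse. The spellings `--team-no-auto-refresh` and `--team-no-protocol` are assumed by analogy with the three flags shown in the smoke test, and the helper names are hypothetical.

```python
import argparse

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    for name in FEATURES:
        # e.g. task_list -> --team-no-task-list
        flag = "--team-no-" + name.replace("_", "-")
        parser.add_argument(flag, dest=f"no_{name}", action="store_true")
    return parser


def team_features(argv: list[str]) -> dict[str, bool]:
    """Return feature -> enabled, suitable for recording in result.json."""
    ns = build_parser().parse_args(argv)
    return {name: not getattr(ns, f"no_{name}") for name in FEATURES}


features = team_features(["--team-no-task-list", "--team-no-mcp"])
```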
Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers. Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.
Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating. Full suite
363 passed, 63 skipped; ruff/format/mypy clean.
End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
tasks files, empty metrics dict, no volume. team_features in
result.json reflects the requested ablation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-contained HTML report of the team-harness ablation + multi-agent
comparison run on the flash subset (50 task pairs), codex/gpt-5.5,
docker, 1 seed.
Contents:
- docs/team_harness_ablation_report.html — setting comparison
(solo/coop/coop+git/team), one-feature-off ablation matrix, timing,
findings, methodology, caveats. All numbers embedded inline.
- docs/team_harness_ablation_data/{core,flash}_ablation.csv — raw rows.
- scripts/run_team_ablation.py — sweep driver (config -> cooperbench run+eval).
- scripts/gen_ablation_report.py — regenerates the HTML from logs/.
Headline results (passed / 50, both-features-pass):
coop msg-only 13 · team no-scratchpad 15 · team no-task_list 20 ·
solo 24 · coop+git 28 · team no-mcp 30 · team no-auto_refresh 30 ·
team baseline 31 · team no-protocol 35
Findings:
- scratchpad (-16) and task_list (-11) are load-bearing; removing
either drops team below solo (two uncoordinated agents < one).
- mcp/auto_refresh/protocol show no positive effect for codex
(auto_refresh is a no-op for CLI adapters by design; protocol-off
even scored +4, i.e. mild overhead without payoff).
- Most multi-agent value is a shared code substrate, not orchestration:
coop+git (56%) ~ team-scratchpad (62%) >> messaging-only coop (26%).
Caveat: team runs used the scratchpad for code-sharing, NOT --git, so
"team vs coop+git" compares two sharing substrates; the team --git cell
is untested (follow-up).
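The deltas and percentages quoted in the findings follow directly from the headline counts. A short check, using only the numbers listed above (variable names are illustrative):

```python
TOTAL = 50
passed = {
    "coop msg-only": 13, "team no-scratchpad": 15, "team no-task_list": 20,
    "solo": 24, "coop+git": 28, "team no-mcp": 30,
    "team no-auto_refresh": 30, "team baseline": 31, "team no-protocol": 35,
}

baseline = passed["team baseline"]
# One-feature-off delta relative to the full team baseline.
deltas = {
    name.removeprefix("team no-"): count - baseline
    for name, count in passed.items()
    if name.startswith("team no-")
}
# Pass rate in percent for every setting.
rates = {name: round(100 * count / TOTAL) for name, count in passed.items()}
```

This gives −16 for scratchpad, −11 for task_list, −1 each for mcp and auto_refresh, and +4 for protocol; the rates for coop+git, team baseline and messaging-only coop come out at 56%, 62% and 26%.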
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil added a commit that referenced this pull request on May 21, 2026
CooperBench had no docs-deployment pipeline: the original template docs.yaml (MkDocs + mike) was deleted in b61bc00, and GitHub Pages points at a non-existent gh-pages branch (live site 404s). So docs/*.html (e.g. the team-harness ablation report from PR #59) never get published.
Add a Cloudflare Pages workflow modeled on cooperbench/CooperTrain's deploy-report.yml: on push to main (and PRs) touching docs/**, it generates a docs/index.html listing every docs/*.html report and deploys the docs/ dir to the Cloudflare Pages project "cooperbench-reports". PRs get a preview deployment + comment.
Requires CLOUDFLARE_API_TOKEN and CLOUDFLARE_ACCOUNT_ID secrets (same ones CooperTrain uses). Until they're set on this repo, the deploy step skips with a warning so PR checks stay green.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
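The index-generation step could be approximated as below. The real workflow may well do this in shell; the function names and the HTML layout here are assumptions.

```python
from html import escape
from pathlib import Path


def list_reports(docs_dir: Path) -> list[str]:
    """Names of every docs/*.html report, excluding the index itself."""
    return sorted(p.name for p in docs_dir.glob("*.html") if p.name != "index.html")


def render_index(report_names: list[str]) -> str:
    """Render a minimal HTML page linking to each report."""
    items = "\n".join(
        f'  <li><a href="{escape(name)}">{escape(name)}</a></li>'
        for name in report_names
    )
    return f"<!doctype html>\n<title>Reports</title>\n<ul>\n{items}\n</ul>\n"


page = render_index(["team_harness_ablation_report.html"])
```

A caller would write `render_index(list_reports(Path("docs")))` to `docs/index.html` before the deploy step.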
…-report # Conflicts: # CHANGELOG.md # src/cooperbench/agents/openhands_agent_sdk/adapter.py
Summary
Stacks on #58. Adds a self-contained HTML report of the ablation + multi-agent comparison experiments run against the team harness, plus the driver/generator scripts and raw CSVs so the numbers are reproducible.
Open `docs/team_harness_ablation_report.html` in a browser (or GitHub's raw/preview); all numbers are embedded inline, with no external assets.
Experiment setup
`codex` · model `gpt-5.5` · subset `flash` (50 task pairs) · backend `docker` · 1 seed
Results (passed / 50)
Findings
- `scratchpad` (−16) and `task_list` (−11) account for nearly all of team mode's value; remove either and team drops below solo: two uncoordinated agents are worse than one.
- `auto_refresh` is a no-op for CLI adapters by design (it only fires in Python-loop adapters); `protocol`-off even scored +4 (mild overhead, no payoff).
Caveats (also in the report)
- Team runs used the scratchpad for code sharing, not `--git`, so "team vs coop+git" compares two different sharing substrates, not "team = coop+git + extras". The `team --git` cell (both substrates) is untested.
- `exec` ran with no step cap (2h wall-clock only); `steps=1` in raw logs is one codex turn (~50–95 internal tool calls). Cost shows $0 because codex's `--json` omits a cost field.
Files
- `docs/team_harness_ablation_report.html`: the report
- `docs/team_harness_ablation_data/{core,flash}_ablation.csv`: raw rows
- `scripts/run_team_ablation.py`: sweep driver
- `scripts/gen_ablation_report.py`: regenerates the HTML from `logs/`
Test plan
- `uv run python scripts/gen_ablation_report.py` reproduces the file from logs
🤖 Generated with Claude Code