team mode: wire team prompt + env into the three Python-loop adapters #55
Merged
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter. Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.
Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access
--skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals /
messages. Cost is reported as 0.0 because Codex does not emit a
cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a
"model not found" shaped error, the adapter retries once without
``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter
returns Error immediately instead of spinning up a container that
can only fail.
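As a rough illustration of the token accounting described in the highlights above, the sketch below sums usage across ``turn.completed`` events in a JSONL stream. The event field names (``type``, ``usage``, ``input_tokens``, ``output_tokens``) and the helper name ``sum_turn_tokens`` are assumptions made for this example, not taken from the adapter's source.

```python
import json
from typing import Iterable


def sum_turn_tokens(lines: Iterable[str]) -> dict[str, int]:
    """Sum token usage across ``turn.completed`` events (assumed event shape)."""
    totals = {"input_tokens": 0, "output_tokens": 0}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise interleaved in the stream
        if not isinstance(event, dict) or event.get("type") != "turn.completed":
            continue
        usage = event.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0) or 0)
    return totals


# Hypothetical stream: two completed turns, one unrelated event, one junk line.
stream = [
    '{"type": "turn.completed", "usage": {"input_tokens": 100, "output_tokens": 20}}',
    "not json",
    '{"type": "item.completed"}',
    '{"type": "turn.completed", "usage": {"input_tokens": 50, "output_tokens": 5}}',
]
# Cost stays 0.0 because the stream carries no cost field.
result = {"tokens": sum_turn_tokens(stream), "cost": 0.0}
```

Skipping unparseable lines rather than raising keeps a single malformed event from discarding the totals for an otherwise successful run.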
Shared ``_coop`` module:
- ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh`` — pip-installs redis and drops the shell
wrappers; each adapter's setup.sh sources it.
- ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
agnostic.
- ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
``write_file_in_container`` / ``read_file_from_container``,
``rewrite_comm_url_for_container``, ``build_git_setup_command``,
``parse_sent_messages_log``, and ``normalize_patch``.
Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires. Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace). This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module. Full suite: 228
passed, 63 skipped.
End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line
patch (with the trailing-newline fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but 0/2
tests pass. Coordination failure: agent1 fetched ``team`` but never
merged, so the stacked patches produce a Python SyntaxError at line
144 of the modified file. Claude on the same task scored 2/2; Codex
used the tools less aggressively on this run.
The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug. Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product. Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:
1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
atomic claim semantics (HSETNX-style — exactly one caller wins on a
race) and an audit log of every mutation. Exposed in the container
as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
/ ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated
``team-lead`` and gets a system-prompt block instructing them to
break the spec into tasks, assign them via ``coop-task-create
--assign``, watch progress, and integrate. Other agents are
``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
mounted at ``/workspace/shared`` in every container. Free
coordination artifact for design notes, partial diffs, interface
sketches.
Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``. Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
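To make the metric definitions concrete, here is a minimal sketch of deriving them from an audit log. The event schema (``ts``, ``op``, ``task_id``, ``agent``, ``status``, ``assignee``) and the function name are assumptions for illustration; the real log format may differ.

```python
from collections import Counter


def coordination_metrics(log: list[dict], run_start: float) -> dict:
    """Derive team-mode metrics from an audit log (hypothetical event schema)."""
    claims: Counter = Counter()
    updates: Counter = Counter()
    owner: dict[str, str | None] = {}
    status: dict[str, str] = {}
    first_claim: float | None = None

    for event in sorted(log, key=lambda e: e["ts"]):
        task = event["task_id"]
        status.setdefault(task, "open")
        owner.setdefault(task, None)
        if event["op"] == "create":
            owner[task] = event.get("assignee")
        elif event["op"] == "claim":
            claims[event["agent"]] += 1
            owner[task] = event["agent"]
            if first_claim is None:
                first_claim = event["ts"]
        elif event["op"] == "update":
            updates[event["agent"]] += 1
            status[task] = event.get("status", status[task])

    return {
        "time_to_first_claim_seconds": (
            None if first_claim is None else round(first_claim - run_start, 1)
        ),
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for t in status if owner[t] is None),
    }


# Hypothetical log: two tasks created, one claimed and finished, one left unowned.
sample_log = [
    {"ts": 0.0, "op": "create", "task_id": "t1"},
    {"ts": 1.0, "op": "create", "task_id": "t2"},
    {"ts": 34.2, "op": "claim", "task_id": "t1", "agent": "agent1"},
    {"ts": 50.0, "op": "update", "task_id": "t1", "agent": "agent1", "status": "done"},
]
metrics = coordination_metrics(sample_log, run_start=0.0)
```

Computing everything from the log after the run, rather than instrumenting the agents, keeps the metrics independent of which adapter produced the events.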
Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs. The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.
Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log,
run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
metrics in result, three-agent team)
- 4 misc
Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its
work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to
/workspace/shared/agent2.patch + a summary.md)
- **0/1 pass rate** — both ``patch.txt`` files were empty: the
members wrote diffs to the scratchpad instead of also writing
``/workspace/repo/patch.txt``, and the lead never ran the final
integration step. This is real coordination signal (the prompt
told them to write both places but they followed the scratchpad
half only) — a follow-up will tighten the prompt to make patch.txt
submission the explicit final step.
Future PRs (intentionally out of scope here so this lands at a
reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of
messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)
Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…resh (#53)
Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.
1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
so agents can ``ls`` and ``cat`` tasks with their existing tools
rather than going through the ``coop-task-list`` CLI. Layout mirrors
Claude Code's team primitive: one ``<id>.json`` per task, plus
``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
trail). Triggered on every ``coop-task-list`` invocation and from the
host runner at startup. Files are written via tempfile+replace so
readers never observe a partial state.
2. **Typed coop-request / coop-respond protocol**
(``_team/protocol.py``). Layered on plain Redis messaging, mirroring
CC's ``plan_approval_request`` / ``plan_approval_response`` shape.
``coop-request <peer> <kind> <body>`` returns a request_id (and
optionally blocks via ``--wait N`` for a response).
``coop-respond <request_id> <body>`` writes back; the sender's
``await_response`` uses BLPOP so it actually sleeps instead of
busy-polling. Both events flow into the shared task-log so
coordination metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC
server that exposes a single ``wait_for_message`` tool backed by
BLPOP on the agent's inbox. Registered automatically: the Claude Code
adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server
entry; the Codex adapter writes ``$CODEX_HOME/config.toml``. The
point is to make "watch the inbox" a natural idle behavior for the
CLI adapters instead of a busy-loop on ``coop-recv`` returning empty.
This is the closest we can get to push-style delivery for opaque CLI
agent loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
``TeamPoller`` is a per-agent host-side helper that
``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries
(the same hook as the existing inbox poll). The LLM sees a compact
``[Team task list] open: 1, in_progress: 2, ...`` summary prepended
to every turn so it doesn't need to remember to call
``coop-task-list``. Plumbed via ``agent.team_poller`` so the
``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The
same module also exports ``poll_team_state()`` for in-container use
(env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members
writing diffs to ``/workspace/shared/<id>.patch`` only and never to
``/workspace/repo/patch.txt``, scoring 0/2 despite great
coordination. Both lead and member prompts now have an explicit
``### Final submission — REQUIRED`` section that calls out
``patch.txt`` as the only file the bench evaluates and provides the
exact ``git diff > patch.txt`` command.
Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table; previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.
Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call, timeout,
blocking, unknown-tool error, env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)
Full suite: **311 passed**, 63 skipped.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests).
All four follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the
same task; the prompt fix is doing real work.
Stacks on #52.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
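The tempfile+replace pattern mentioned for the filesystem mirror can be sketched as follows. This is an illustrative version only: ``write_json_atomic`` and the file names in the demo are hypothetical, not the contents of ``fs_mirror.py``.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as root:
    tasks_dir = Path(root) / "tasks"
    write_json_atomic(tasks_dir / "t1.json", {"id": "t1", "status": "open"})
    write_json_atomic(tasks_dir / "t1.json", {"id": "t1", "status": "done"})
    loaded = json.loads((tasks_dir / "t1.json").read_text(encoding="utf-8"))
    leftovers = sorted(p.name for p in tasks_dir.iterdir() if p.name.endswith(".tmp"))
```

Because the rename is the only step that touches the final path, an agent running ``cat`` on a task file mid-refresh reads a complete snapshot either way.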
Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode. Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.
New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds. Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.
Per-adapter wiring:
- ``mini_swe_agent_v2``: appends team_task_section to task;
propagates CB_TEAM_* through env_kwargs["env"]; adds
``--add-host=host.docker.internal:host-gateway`` + scratchpad
volume to docker run args; installs the team CLI scripts + pip
redis in the container after env spin-up. The existing
``TeamPoller`` host-side hook (already in step()) still fires.
- ``openhands_sdk``: appends team_task_section to task; folds a new
``team_env`` dict into ``coop_info`` so
``_build_credentials_dict`` propagates CB_TEAM_* into the
sandbox. Coop-task-* binary install in the OpenHands agent-server
image is a follow-up — OpenHands manages its own image build and
doesn't expose a clean post-start exec hook.
- ``swe_agent``: appends team_task_section to task. The SWE-agent
framework's sandbox + agent loop is third-party and harder to
instrument; everything beyond the prompt is a follow-up.
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
consistency between team_task_section and build_team_instruction,
every registered runner accepts the team kwargs, openhands env
keys, swe_agent signature
Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.
End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment. Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.
Compatibility matrix is now:
| Adapter           | Accepts kwargs | Team prompt   | Auto-refresh | CLI in container | CB_TEAM_* env |
|-------------------|----------------|---------------|--------------|------------------|---------------|
| claude_code       | yes            | yes (full)    | n/a          | yes              | yes           |
| codex             | yes            | yes (full)    | n/a          | yes              | yes           |
| mini_swe_agent_v2 | yes            | yes (section) | yes          | yes              | yes           |
| openhands_sdk     | yes            | yes (section) | n/a          | NOT YET          | yes           |
| swe_agent         | yes            | yes (section) | NOT YET      | NOT YET          | NOT YET       |
Stacks on #52 (merged-up team-mode branch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required). Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost. Modal caches the layered image; subsequent
team runs are instant.
Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked. Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex. Both patches applied; f1 14/14,
f2 19/20.
Tests: 2 new (full suite: 326 passed)
- test_team_env_triggers_image_layering — verifies add_local_file
+ pip_install + run_commands fire with the right args when team
mode is active
- test_no_layering_when_team_inactive — verifies solo / coop
runs skip the image-build cost
Matrix update — openhands_sdk now reads:
Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics (5/5 tasks done, 27s first claim, claims even), but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it. Both fetched their peer's branch (2 fetches
each) and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.
The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it. This rewrite:
- Renames the section ``## Git collaboration — MERGE IS REQUIRED
BEFORE SUBMITTING`` so the imperative is in the heading itself.
- Adds an explicit "Required final sequence — run this verbatim
before exiting" block with the full fetch+merge+diff sequence,
parameterized over every partner branch.
- Explains *why* (each agent's patch.txt is evaluated against every
feature's tests; without the merge, the peer feature's symbols
are missing → ImportError).
- Frames it the same way the patch.txt step is framed (REQUIRED,
skip-at-your-loss), which the original prompt fix proved
codex responds to.
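One way to picture a "required final sequence" parameterized over partner branches is a small generator like the one below. Everything here is a hypothetical sketch: the function name, the ``team`` remote default, the ``base_ref`` parameter, and the exact git invocations are assumptions, and the actual prompt text may use different commands.

```python
import shlex


def final_sequence(
    partner_branches: list[str],
    base_ref: str,
    remote: str = "team",
) -> list[str]:
    """Build fetch + merge + diff commands for every partner branch (illustrative)."""
    commands = [f"git fetch {shlex.quote(remote)}"]
    for branch in partner_branches:
        # One merge per partner so no peer's work is skipped.
        commands.append(f"git merge --no-edit {shlex.quote(f'{remote}/{branch}')}")
    # base_ref is whatever commit the bench diffs against (an assumption here).
    commands.append(f"git diff {shlex.quote(base_ref)} > patch.txt")
    return commands


commands = final_sequence(["agent2"], base_ref="base")
prompt_block = "\n".join(commands)
```

Generating the block per agent means a three-agent team gets two explicit merge lines rather than a generic instruction to "merge your peers".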
Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both
patches now contain both features' symbols. Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).
A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree. This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch. That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked. Fix is a separate concern from team-mode wiring.
Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.
Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches. If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance. Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
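The short-circuit amounts to a byte comparison before any merge work. The sketch below illustrates the idea with ``filecmp``; it omits the ``git apply --recount`` normalization, and the function name and the ``"needs-merge"`` placeholder status are assumptions rather than names from ``test_merged``.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def merge_or_shortcircuit(patch1: Path, patch2: Path, merged: Path) -> dict:
    """Copy patch1 to merged when both patches are byte-identical."""
    if filecmp.cmp(patch1, patch2, shallow=False):  # compare contents, not just stat
        shutil.copyfile(patch1, merged)
        return {"merge": {"status": "identical"}}
    # Placeholder: a real caller would fall through to the normal merge path here.
    return {"merge": {"status": "needs-merge"}}


with tempfile.TemporaryDirectory() as root:
    base = Path(root)
    same_a, same_b, other = base / "a.patch", base / "b.patch", base / "c.patch"
    same_a.write_text("+line\n", encoding="utf-8")
    same_b.write_text("+line\n", encoding="utf-8")
    other.write_text("+different\n", encoding="utf-8")

    identical = merge_or_shortcircuit(same_a, same_b, base / "merged.patch")
    merged_text = (base / "merged.patch").read_text(encoding="utf-8")
    divergent = merge_or_shortcircuit(same_a, other, base / "merged2.patch")
    merged2_exists = (base / "merged2.patch").exists()
```

Returning a distinct status lets downstream reporting separate "agents converged on one tree" from "a merge actually ran".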
Verified on the codex-team e2e:
- cx_team_v5 (codex agents perfectly merged to identical 243-line
patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
- cx_team_v4 (codex agents diverged on the merge): unchanged at
f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
agent2-alone via apply_status: {'agent1': 'failed', ...}
I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.
Also lands previously-uncommitted housekeeping:
- prompt.py: ruff-format-only diff on the merge-required block
from the prior commit
- test_team_wiring.py: ruff --fix removed unused MagicMock
imports
- test_gcp_backend.py / test_tasks.py: ruff --fix removed
f-string-without-placeholder and unused-json import (both
unrelated drift caught by the gate)
Tests: 1 new (full suite: 327 passed)
- ``test_test_merged_shortcircuits_on_identical_patches`` — source
inspection confirms the short-circuit branch + "identical"
merge-status string exist in test_merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands. This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.
Files:
* ``openhands/tools/task_tracker/coop_definition.py`` — new tool
definition + executor. Same ``TaskTrackerAction`` /
``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
trip through the shared ``cb:<run_id>:`` Redis namespace that
``TaskListClient`` (host side) writes to. Tasks are auto-owned
by the calling agent; ``view`` shows peer tasks prefixed with
``[<their_agent_id>]``. Registered under both
``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
the module rebinds the latter to the Coop variant.
* ``openhands/tools/preset/default.py`` — gains a ``team_mode``
kwarg (kept for API stability + tests; the actual swap happens
server-side via the .pth/__init__ side-effect import, not by
changing the host-side tool list). Pre-PR coop block split into
a more nuanced team-mode prompt section that documents the
TaskTracker → shared-list behavior.
* ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
layers two more bits into the Modal image at build time:
- ``add_local_file`` of ``coop_definition.py`` to
``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
install)
- ``grep ... || echo`` appending
``from . import coop_definition`` to the package's
``__init__.py`` so the registration runs at import time.
Tests: 1 new + updated image-layering assertions
- ``test_importing_coop_definition_overrides_local_registration``:
inspecting the registry's ``_MODULE_QUALNAMES`` confirms
``TaskTrackerTool.name`` resolves to ``coop_definition``'s
registration after import.
- ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
calls + 2 ``run_commands`` layers (tool-file install +
``coop-task-*`` wrappers) and that the
``from . import coop_definition`` line is in the install
commands.
Full suite: 329 passed. Ruff / format / mypy all green.
KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis. The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``. The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.
For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning. Three pieces, working together:
1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
accessed via ``unencrypted_host:unencrypted_port``). Re-uses the
existing ``connectors/redis_server.ModalRedisServer`` — it was
already written, just unused. Both the host TaskListClient and
the agent sandboxes point at the same public TCP endpoint, so
pre-seed and agent reads/writes share state. Falls back to local
Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
adapter now ``add_local_file``s three pieces into the OpenHands
image at build time:
- ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
- ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
- ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
(replaces upstream; same exports + a side-effect import of
coop_definition so the Redis-backed executor overrides the
local TaskTracker registration at first import).
Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
connections after a few minutes, so the original Redis client
pre-seed used at startup gets closed before the 9-min agent run
finishes. Re-open the client at harvest time using the same URL.
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:
- Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
- Both agents Submitted, 9m total
- Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
- Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
- Cost: $3.33
Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.
Full suite: 329 passed. Ruff / format / mypy all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml; it just happened to be present in my local venv
because something else pulled it in transitively.
GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with
``ModuleNotFoundError: No module named 'fakeredis'`` on every
team-mode test file.
Adding the dependency explicitly fixes PR #52 (team-mode) CI; once
team-mode merges, PR #55 (team-all-adapters) will also pick it up via
the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):
1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
The old package was renamed in v0.0.13; both swe_agent files
had stale imports that failed at module load (TypeError or
ModuleNotFoundError depending on how the framework was invoked),
making every swe_agent invocation return Error before any LLM
call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
in pyproject.toml. swe_agent's vendored framework imports these
at module-load time even when the docker/S3/model paths are
dormant, so a clean ``pip install '.[swe-agent]'`` without these
would still ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply. Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring. swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.
Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body. For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core subset
in team setting. All four now ≥ 5/10.
- normalize_patch ate trailing blank context lines (text.strip()
consumes " \n"), breaking last-hunk line counts so git apply rejected
otherwise-valid diffs. Replaced with lstrip/rstrip on "\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all: it
used a raw .strip() on the patch.txt read, so every msa patch ended
in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
long-running command, so the image's default CMD exited and every
exec hit "Sandbox not found". Pass "sleep", "infinity" as the
positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
because shared build_environment was hardcoded to DockerEnvironment.
Added a backend kwarg and threaded config["backend"] through both
adapters.
- Team lead prompt buried the integration step at the bottom of a
long workflow list; Claude/Codex consistently exited after their own
feature without reading /workspace/shared/<agent>.patch. Rewrote with
a hard-rule opener and a 5-point pre-submission checklist. Member
prompt now opens with "stay in your lane" per the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch alone
when the merged tree doesn't pass both features. Surfaced as
merge.strategy="solo-agent1" / "solo-agent2". Credits the agent
(typically the lead) who correctly integrated both features into one
working patch but had it corrupted by union-merging with the other
agent's partial implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
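The ``normalize_patch`` fix described above comes down to which characters get stripped. The sketch below contrasts the two behaviors on a diff whose last hunk ends with a blank context line; the function body is a minimal reconstruction from the description (strip only ``"\n"``, keep one trailing newline), and the empty-input handling is an assumption.

```python
def normalize_patch(text: str) -> str:
    """Trim only newlines at both ends, then add exactly one trailing newline."""
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""  # empty-input behavior is an assumption


# The hunk header promises 3 old and 3 new lines; the last one is a blank
# context line, which a unified diff encodes as a single space.
patch = (
    "\n"
    "--- a/f.py\n"
    "+++ b/f.py\n"
    "@@ -1,3 +1,3 @@\n"
    "-old\n"
    "+new\n"
    " x\n"
    " \n"
    "\n"
)

naive = patch.strip() + "\n"     # str.strip() also eats the " " context line
fixed = normalize_patch(patch)   # keeps it, so the hunk line counts still match
```

With the naive version the final hunk is one line short of what its header declares, which is the kind of mismatch that ``git apply`` reports as a corrupt patch.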
- dataset/subsets/core.json: 10-pair subset for quick agent
comparisons. Stratified by repo (largest-remainder proportional
allocation by full-dataset pair count) with a one-slot floor per
primary language (Python / Go / Rust / TS). Reproducible via
scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
frameworks on the core subset in team setting. Includes per-task
pass/fail matrix annotated with the merge strategy used, plus the
chronological narrative of the dozen reruns that surfaced each of the
bugs fixed in the previous commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
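For readers unfamiliar with largest-remainder allocation, the sketch below shows the core step: give each repo the integer part of its proportional quota, then hand the leftover slots to the largest remainders. The repo names and pair counts are made up, the tie-break by name is an assumption, and the per-language floor and seeded sampling from ``generate_core_subset.py`` are not modelled.

```python
def largest_remainder(counts: dict[str, int], slots: int) -> dict[str, int]:
    """Allocate ``slots`` proportionally to ``counts`` using largest remainders."""
    total = sum(counts.values())
    if total <= 0 or slots < 0:
        raise ValueError("need a positive total and a non-negative slot count")
    # Integer arithmetic avoids float rounding: quota = count * slots / total.
    allocation = {name: (count * slots) // total for name, count in counts.items()}
    remainder = {name: (count * slots) % total for name, count in counts.items()}
    leftover = slots - sum(allocation.values())
    # Largest remainder first; ties broken by name (an arbitrary but stable choice).
    for name in sorted(counts, key=lambda n: (-remainder[n], n))[:leftover]:
        allocation[name] += 1
    return allocation


# Hypothetical pair counts per repo.
allocation = largest_remainder({"repo_a": 7, "repo_b": 2, "repo_c": 1}, slots=4)
```

Here the quotas are 2.8, 0.8 and 0.4, so the two leftover slots go to ``repo_a`` and ``repo_b``; a floor rule like the one described above would then adjust cases such as ``repo_c`` receiving nothing.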
Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.
Doesn't change any of the current 40 eval results: union's
merge=union attribute is tolerant enough that every task in the
dataset produces some tree (potentially broken code with
stitched-together lines); the broken-tree-tests-fail path already
triggered the solo fallback. This just closes the defensive gap for
future pathological cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:
1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative
(no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against
both feature suites
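The three-step chain above reduces to a small decision function. The sketch below is illustrative only: the function name and boolean inputs are hypothetical, while the strategy labels follow the naive / identical / lead-alone wording used in this commit.

```python
def choose_strategy(patches_identical: bool, naive_merge_clean: bool) -> str:
    """Pick the evaluation path in the order the chain lists it."""
    if patches_identical:
        return "identical"   # skip-merge short-circuit
    if naive_merge_clean:
        return "naive"       # merged-tree tests are authoritative, no fallback
    return "lead-alone"      # conflicts: test the lead's patch.txt on both suites


strategies = {
    (identical, clean): choose_strategy(identical, clean)
    for identical in (True, False)
    for clean in (True, False)
}
```

Note that a clean naive merge never falls through to lead-alone, even if its tests fail; the fallback exists only for conflicts.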
Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
Effect on the core-subset horizontal comparison:
msa 6 → 6 (unchanged)
oh 5 → 4 (loses pallets_jinja/1621 — was passing via union, which
concealed that oh's lead doesn't integrate)
cc 5 → 5 (unchanged)
cx 5 → 5 (unchanged)
oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.
BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
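The three-step chain in the commit message above can be sketched as a small decision function. The names `MergeDecision` and `choose_merge` are hypothetical, as is representing a clean naive merge by an optional merged patch; the actual test_merged works on files and git branches rather than strings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MergeDecision:
    strategy: str  # "identical", "naive", or "lead-alone" (labels are illustrative)
    patch: str     # the patch whose resulting tree gets tested


def choose_merge(lead_patch: str, member_patch: str,
                 naive_merged: Optional[str]) -> MergeDecision:
    """Pick which tree is tested; `naive_merged` is None when the 3-way merge conflicts."""
    # 1. Identical patches: skip the merge entirely.
    if lead_patch == member_patch:
        return MergeDecision("identical", lead_patch)
    # 2. Clean naive merge: the merged tree is authoritative, no further fallback.
    if naive_merged is not None:
        return MergeDecision("naive", naive_merged)
    # 3. Conflict: test the lead's patch alone against both feature suites.
    return MergeDecision("lead-alone", lead_patch)


print(choose_merge("A", "B", None).strategy)
```

Note that step 2 has no fallback in this sketch either: a clean merge that fails tests is simply a failure, which matches the stated policy.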
codex on Modal: `codex exec` was hanging for the full sandbox lifetime (~2h) producing zero stream output. Root cause: codex's exec mode prints "Reading additional input from stdin..." and blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal sandbox keeps stdin open. Fix: add `</dev/null` to the codex invocation in _build_codex_command. Smoke-tested on dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.

openhands_sdk eval guardrail: openhands_sdk produces patches that include a committed patch.txt in the working tree and relies on Modal-hosted Redis for coordination; running eval through Docker silently changed the test environment. The eval now reads the run's config.json and refuses with a clear warning when the run was produced by openhands_sdk but --backend != modal.

Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig by default; the earlier docs claiming it was docker-only were wrong). Smoke-tested on the same dottxt task: 1/1 pass in 3m 12s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
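The stdin hang and its fix can be reproduced in miniature. In this sketch `with_closed_stdin` is a hypothetical helper (the real change lives in _build_codex_command), and the child process is a stand-in for any CLI that reads stdin to EOF before doing work; it is not codex itself.

```python
import subprocess
import sys


def with_closed_stdin(command: str) -> str:
    """Append a /dev/null redirect so the child sees EOF on stdin immediately."""
    return f"{command} </dev/null"


# Stand-in for a CLI that blocks reading stdin until EOF, then reports what it read.
child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]

# subprocess.DEVNULL is the Python-level equivalent of the shell's `</dev/null`:
# the read returns at once with zero bytes instead of waiting on an open pipe.
result = subprocess.run(child, stdin=subprocess.DEVNULL,
                        capture_output=True, text=True, timeout=30)
print(result.stdout.strip())
```

With an inherited, never-closed stdin the same child would wait indefinitely, which is the behavior described for the Modal sandbox.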
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig. Added a backend dispatch that picks DockerDeploymentConfig when config["backend"] == "docker"; Modal stays as the default.

Two upstream-swerex issues had to be worked around to make the docker path actually start a container:
1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh, so swerex's `docker run ... image sh -c "<startup>"` becomes `runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the feature-patch path. Pass docker_args=["--entrypoint", ""] to clear the entrypoint (mirrors the existing Modal monkey-patch that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the swerex-remote binary isn't pre-installed, but pipx looks for an executable literally named "swe-rex", which doesn't exist in the published `swe-rex` package (it provides "swerex-remote"). Monkey-patch DockerDeployment._get_swerex_start_cmd to use `pipx run --spec swe-rex swerex-remote ...` instead.

Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1 pass in 2m 53s, 17 steps, $0.32, no errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the squash-merge conflicts from #52 landing on main. All conflicts followed the same pattern: this branch's HEAD contains #52's content plus the subsequent work on top, while main's squashed-merge commit contains only #52. Resolved each conflict by taking ours (HEAD), which preserves the cumulative state of:
- CHANGELOG: full Fixed/Changed/Added entries for team-mode bug fixes, eval policy change, core subset + benchmark doc, plus the original "team setting" bullet from #52
- _team/prompt.py: the stronger lead-prompt with the 5-point integration checklist (#52 had the older "buried integration" version)
- swe_agent/adapter.py: team-mode kwarg propagation + Docker backend dispatch + pipx --spec monkey-patch
- runner/team.py: openhands_sdk Modal-Redis tunnel branch
- everywhere else: my newer adapter changes are strict supersets of #52's

CI green locally: 329 tests passed, ruff clean, mypy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # CHANGELOG.md
ProKil added a commit that referenced this pull request May 21, 2026
#58) * agents/codex: add Codex adapter; lift shared coop bits into _coop Adds an OpenAI Codex CLI adapter alongside the existing Claude Code adapter. Both adapters wrap a third-party CLI inside the task's Docker container; the bits that are agent-agnostic (Redis messaging helper, prompt blocks for solo/coop/coop+git, git remote setup) now live in a new ``cooperbench.agents._coop`` module so the two adapters (and any future CLI adapter) consume them rather than duplicating. Codex adapter highlights: - Invokes ``codex exec --json --sandbox danger-full-access --skip-git-repo-check --model <id>``. - Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY inside the container so the CLI authenticates without prompts. - Parses Codex's JSONL event stream for status / token totals / messages. Cost is reported as 0.0 because Codex does not emit a cost field; tokens are summed across ``turn.completed`` events. - Model fallback: if Codex rejects ``--model gpt-5.5`` with a "model not found" shaped error, the adapter retries once without ``--model`` and lets Codex pick its default. - Preflight credential check: if OPENAI_API_KEY is unset the adapter returns Error immediately instead of spinning up a container that can only fail. Shared ``_coop`` module: - ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent) installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` / ``coop-peek`` / ``coop-agents`` under /usr/local/bin. - ``install_snippet.sh`` — pip-installs redis and drops the shell wrappers; each adapter's setup.sh sources it. - ``prompt.py`` — solo / coop / coop+git prompt assembly, agent- agnostic. - ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``, ``write_file_in_container`` / ``read_file_from_container``, ``rewrite_comm_url_for_container``, ``build_git_setup_command``, ``parse_sent_messages_log``, and ``normalize_patch``. 
Bug fix during this refactor: the previous adapter's ``.strip()`` on ``patch.txt`` was eating the trailing newline that ``git apply`` requires. Replaced with ``normalize_patch()`` (one trailing newline, no leading whitespace). This bit codex's solo run with a "corrupt patch at line N" error; Claude got lucky and didn't. Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code tests re-pointed at the shared ``_coop`` module. Full suite: 228 passed, 63 skipped. End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2: - codex solo f1: Submitted, 1 turn, 365k input tokens, 184-line patch (with the trailing-newline fix it applies cleanly) - codex coop+git f1,f2: both Submitted, both patches applied but 0/2 tests pass — coordination failure (agent1 fetched ``team`` but never merged, so the stacked patches produce a Python SyntaxError at line 144 of the modified file). Claude on the same task scored 2/2; Codex used the tools less aggressively on this run. The 0/2 result is the kind of coordination failure the bench is designed to surface, not an adapter bug. Future iteration could tighten the prompt or hard-enforce a post-run merge, but neither is necessary to land the adapter itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * runner: add team mode (lead + members + shared task list + scratchpad) Adds a third setting alongside ``solo`` and ``coop``, modelled on the agent-team primitives Claude Code uses in its own product. Where coop gives N peer agents one feature each and a Redis inbox to chat over, team mode adds three load-bearing primitives: 1. A typed **shared task list** (cooperbench.agents._team.TaskListClient) backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with atomic claim semantics (HSETNX-style — exactly one caller wins on a race) and an audit log of every mutation. Exposed in the container as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update`` / ``coop-task-list`` shell wrappers. 2. 
A **lead / member role split**. The first agent is designated ``team-lead`` and gets a system-prompt block instructing them to break the spec into tasks, assign them via ``coop-task-create --assign``, watch progress, and integrate. Other agents are ``member`` and look for open tasks to claim. 3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``) mounted at ``/workspace/shared`` in every container. Free coordination artifact for design notes, partial diffs, interface sketches. Coordination metrics are computed from the task-list audit log after the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``, ``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved into ``result.json``. Evaluation is identical to coop — per-agent ``patch.txt`` evaluated per-feature — so no eval changes were needed beyond discovering ``team/`` log directories. Compatibility: all five existing adapters accept the new ``team_role`` / ``team_id`` / ``task_list_url`` kwargs. The CLI adapters (``claude_code``, ``codex``) wire the team install snippet into their ``setup.sh`` so the ``coop-task-*`` wrappers land at ``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``, ``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking; their in-loop integration with the task list (auto-refresh between steps, similar to the existing inbox poll) lands in a follow-up. Unit tests: 46 new - 18 task_list (CRUD, atomic claim, owner-only update, audit log, run isolation) - 12 prompt (lead vs member branches, solo fallback, git interaction) - 3 runtime (env assembly, scratchpad mount args) - 4 metrics (happy path, unowned-at-end, empty log, multiple claims) - 5 runner (lead-is-first-agent, pre-seed, kwarg propagation, metrics in result, three-agent team) - 4 misc Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green. 
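As a rough illustration of how the coordination metrics named above could be derived from the task-list audit log, here is a minimal sketch. The event schema (`ts`, `op`, `agent`, `task_id`, `assignee`, `status`) is hypothetical, and measuring time-to-first-claim from the first logged event is an assumption; the real audit log format is not shown in this message.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def coordination_metrics(log: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Fold a chronological audit log into the metrics stored in result.json."""
    claims: Counter = Counter()
    updates: Counter = Counter()
    owner: Dict[str, Optional[str]] = {}
    status: Dict[str, str] = {}
    first_claim: Optional[float] = None
    for event in log:
        task_id = event["task_id"]
        if event["op"] == "create":
            owner.setdefault(task_id, event.get("assignee"))
            status[task_id] = "open"
        elif event["op"] == "claim":
            claims[event["agent"]] += 1
            owner[task_id] = event["agent"]
            if first_claim is None:
                # Assumption: measured from the first logged event.
                first_claim = event["ts"] - log[0]["ts"]
        elif event["op"] == "update":
            updates[event["agent"]] += 1
            status[task_id] = event.get("status", status.get(task_id, "open"))
    return {
        "time_to_first_claim_seconds": first_claim,
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for o in owner.values() if o is None),
    }


example_log = [
    {"ts": 0.0, "op": "create", "agent": "bench-runner", "task_id": "t1"},
    {"ts": 0.0, "op": "create", "agent": "bench-runner", "task_id": "t2"},
    {"ts": 34.2, "op": "claim", "agent": "agent1", "task_id": "t1"},
    {"ts": 90.0, "op": "update", "agent": "agent1", "task_id": "t1", "status": "done"},
]
print(coordination_metrics(example_log))
```

Because the metrics are a pure fold over the log, they can be recomputed after the run without touching live task state.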
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in team+git mode: - 5 tasks created (2 by bench-runner, 3 by the lead splitting its work), all reached ``done`` - time_to_first_claim_seconds=34.2 - claims_per_agent={agent1: 2, agent2: 1} - updates_per_agent={agent1: 4, agent2: 3} - scratchpad volume actively used (agent2 wrote its diff to /workspace/shared/agent2.patch + a summary.md) - **0/1 pass rate** — both ``patch.txt`` files were empty: the members wrote diffs to the scratchpad instead of also writing ``/workspace/repo/patch.txt``, and the lead never ran the final integration step. This is real coordination signal (the prompt told them to write both places but they followed the scratchpad half only) — a follow-up will tighten the prompt to make patch.txt submission the explicit final step. Future PRs (intentionally out of scope here so this lands at a reviewable size): - In-loop auto-refresh for the Python-loop adapters - MCP long-poll tool to give CLI adapters push-ish inbox semantics - Typed ``coop-request`` / ``coop-respond`` protocol on top of messaging (CC's plan_approval_request shape) - Filesystem mirror of the task list (CC-style ``ls`` artefacts) Stacks on #51 (Codex adapter) so the diff stays focused on team-mode additions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53) Lands the four follow-ups that were called out as "Out of scope" on the team-mode PR (#52), plus a prompt fix surfaced by the team-mode end-to-end run. 1. **Filesystem mirror of task list** (``_team/fs_mirror.py``). Snapshots the Redis-backed task list to ``/workspace/shared/tasks/`` so agents can ``ls`` and ``cat`` tasks with their existing tools rather than going through the ``coop-task-list`` CLI. Layout mirrors Claude Code's team primitive: one ``<id>.json`` per task, plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit trail). 
Triggered on every ``coop-task-list`` invocation and from the host runner at startup. Files written via tempfile+replace so readers never observe a partial state. 2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``). Layered on plain Redis messaging, mirroring CC's ``plan_approval_request`` / ``plan_approval_response`` shape. ``coop-request <peer> <kind> <body>`` returns a request_id (and optionally blocks via ``--wait N`` for a response). ``coop-respond <request_id> <body>`` writes back; the sender's ``await_response`` uses BLPOP so it actually sleeps instead of busy-polling. Both events flow into the shared task-log so coordination metrics include protocol events. 3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio JSON-RPC server that exposes a single ``wait_for_message`` tool backed by BLPOP on the agent's inbox. Registered automatically: Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``. The point is to make "watch the inbox" a natural idle behavior for the CLI adapters instead of a busy-loop on ``coop-recv`` returning empty — the closest we can get to push-style delivery for opaque CLI agent loops. 4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``). ``TeamPoller`` is a per-agent host-side helper that ``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM queries — same hook as the existing inbox poll. The LLM sees a compact ``[Team task list] open: 1, in_progress: 2, ...`` summary prepended to every turn so it doesn't need to remember to call ``coop-task-list``. Plumbed via ``agent.team_poller`` so the ``mini_swe_agent_v2`` subtree change is one branch in ``step()``. The same module also exports ``poll_team_state()`` for in-container use (env-driven variant). 5. 
**Prompt fix**: the previous team-mode end-to-end had members writing diffs to ``/workspace/shared/<id>.patch`` only and never to ``/workspace/repo/patch.txt``, scoring 0/2 despite great coordination. Both lead and member prompts now have an explicit ``### Final submission — REQUIRED`` section that calls out ``patch.txt`` as the only file the bench evaluates and provides the exact ``git diff > patch.txt`` command. Also: cosmetic fix to ``runner/core._print_single_result`` so team mode's per-agent dicts (which carry ``patch_lines: int``) render correctly in the run table — previously the column showed 0 because the function tried ``len(r.get("patch", "").splitlines())`` and team mode doesn't store the full patch in the agents dict. Tests: 37 new unit tests - 8 fs_mirror (atomic writes, stale cleanup, empty index) - 9 protocol (request roundtrip, await, timeout, audit log) - 9 mcp_server (initialize, tools/list, tools/call, timeout, blocking, unknown-tool error, env factory) - 8 loop_refresh (summary formatting, TeamPoller, env variant) - 3 prompt (regression: lead+member prompts demand patch.txt) Full suite: **311 passed**, 63 skipped. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code + team + git: **2/2 features pass** (14/14 + 20/20 tests). All four follow-ups visibly active in the run artifacts: ``/workspace/shared/tasks/`` populated with per-task JSON + _index + _log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in ``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's sub-task split), 4 reached ``done``, ``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the same task — the prompt fix is doing real work. Stacks on #52. 
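The tempfile-plus-replace write described for the filesystem mirror can be sketched as follows. `write_json_atomic` is a hypothetical name and the task payload is made up for illustration; only the pattern (write to a temporary file in the same directory, then rename over the target) is taken from the description.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)  # atomic rename over the destination
    except BaseException:
        os.unlink(tmp_name)  # clean up the temp file if anything failed
        raise


with tempfile.TemporaryDirectory() as shared:
    target = Path(shared) / "tasks" / "1.json"
    write_json_atomic(target, {"id": "1", "status": "open"})
    write_json_atomic(target, {"id": "1", "status": "done"})
    final_state = json.loads(target.read_text(encoding="utf-8"))
    leftovers = [p.name for p in target.parent.iterdir() if p.name.endswith(".tmp")]
```

An agent running `cat` on the file during the second write would get the complete "open" record or the complete "done" record, never a truncated one.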
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team mode: wire team prompt + env into the three Python-loop adapters Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to parity with the CLI adapters for team mode. Before this commit they accepted the team kwargs but discarded them; now each one appends the team prompt section to the task it sends the agent, and (where the adapter actually controls the container) propagates ``CB_TEAM_*`` env vars + mounts the team scratchpad. New helper: ``_team.team_task_section(agents, agent_id, team_role)`` returns ONLY the lead-or-member block + coop-task-* CLI usage, without the surrounding task/submission/git scaffolding that ``build_team_instruction`` adds. Python-loop adapters already have their own prompts covering messaging/git/submission, so they need only the new piece; CLI adapters keep using the bigger function. Per-adapter wiring: - ``mini_swe_agent_v2``: appends team_task_section to task; propagates CB_TEAM_* through env_kwargs["env"]; adds ``--add-host=host.docker.internal:host-gateway`` + scratchpad volume to docker run args; installs the team CLI scripts + pip redis in the container after env spin-up. The existing ``TeamPoller`` host-side hook (already in step()) still fires. - ``openhands_sdk``: appends team_task_section to task; folds a new ``team_env`` dict into ``coop_info`` so ``_build_credentials_dict`` propagates CB_TEAM_* into the sandbox. Coop-task-* binary install in the OpenHands agent-server image is a follow-up — OpenHands manages its own image build and doesn't expose a clean post-start exec hook. - ``swe_agent``: appends team_task_section to task. The SWE-agent framework's sandbox + agent loop is third-party and harder to instrument; everything beyond the prompt is a follow-up. 
Tests: 13 new - 3 prompt unit tests for team_task_section (lead, member, empty) - 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py: consistency between team_task_section and build_team_instruction, every registered runner accepts the team kwargs, openhands env keys, swe_agent signature Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green. End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code + team + git (sanity check that the shared changes didn't regress the CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines. End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent, openhands_sdk) requires API keys (Anthropic for the three Python-loop adapters via litellm, OpenAI for codex) that aren't available in this environment. Unit tests cover the new wiring; the e2e validations should be run with real keys before relying on the per-adapter behavior. Compatibility matrix is now: | Adapter | Accepts | Team prompt | Auto-refresh | CLI in container | env vars | |---------------------|---------|-------------|--------------|------------------|----------| | claude_code | yes | yes (full) | n/a | yes | yes | | codex | yes | yes (full) | n/a | yes | yes | | mini_swe_agent_v2 | yes | yes (sec.) | yes | yes | yes | | openhands_sdk | yes | yes (sec.) | n/a | NOT YET | yes | | swe_agent | yes | yes (sec.) | NOT YET | NOT YET | NOT YET | Stacks on #52 (merged-up team-mode branch). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands: layer coop-task-* install onto Modal image for team mode Closes the documented gap from the prior commit's matrix: the ``coop-task-*`` binaries now ship into the OpenHands agent-server sandbox, layered onto the upstream ``-oh`` image via Modal's ``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no upstream image rebuild required). Triggered only when ``coop_info["team_env"]`` is set so solo / coop runs don't pay the ~10s first-build cost. 
Modal caches the layered image; subsequent team runs are instant. Verified end-to-end: ran openhands_sdk team+git on dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran ``compgen -c | grep coop-task`` and got back all 7 wrappers (create / claim / update / list / request / respond / pending) — the install worked. Whether the model actually invokes the tools is a separate (coordination-quality) axis; in this run it discovered them but didn't use them, same as codex. Both patches applied; f1 14/14, f2 19/20. Tests: 2 new (full suite: 326 passed) - test_team_env_triggers_image_layering — verifies add_local_file + pip_install + run_commands fire with the right args when team mode is active - test_no_layering_when_team_inactive — verifies solo / coop runs skip the image-build cost Matrix update — openhands_sdk now reads: Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a / CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * team prompt: make the merge-before-submit step REQUIRED The codex team e2e (cx_team_v3) hit 0/2 with great coordination metrics — 5/5 tasks done, 27s first claim, claims even — but neither agent ran ``git merge`` despite the prompt's "Recommended workflow" mentioning it. Both fetched their peer's branch (2 each) and then submitted only their own work, so the eval's naive diff-stacker produced syntactically broken Python. The previous prompt buried the critical step in a "Concretely:" sentence at the end; gpt-5.5 didn't follow it. This rewrite: - Renames the section ``## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING`` so the imperative is in the heading itself. - Adds an explicit "Required final sequence — run this verbatim before exiting" block with the full fetch+merge+diff sequence, parameterized over every partner branch. 
- Explains *why* (each agent's patch.txt is evaluated against every feature's tests; without the merge, the peer feature's symbols are missing → ImportError). - Frames it the same way the patch.txt step is framed (REQUIRED, skip-at-your-loss), which the original prompt fix proved codex responds to. Verified: re-ran cx_team_v4 (codex team+git, same task as v3). Git activity went from ``fetch=2 merge=0 push=0`` per agent → ``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both patches now contain both features' symbols. Pass rate v4: 33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test because gpt-5.5's merged code put the ``filters`` kwarg on a helper function rather than the ``prompt`` decorator (content quality, not coordination). A second run (cx_team_v5) produced byte-identical 243-line patches on both agents — codex coordinated so well both ended up with the exact same merged tree. This surfaces a separate bench-side limitation: the eval's diff-stacker fails to apply patch B on top of patch A when every hunk already matches, producing an empty merged.patch. That's a real bug in ``eval/evaluate.py``'s coop merge step, NOT a coordination failure — codex did exactly what the prompt asked. Fix is a separate concern from team-mode wiring. Tests still pass (existing prompt tests are content-agnostic; 326 / 63 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * eval: short-circuit when both agents submit identical merged patches In team mode codex can coordinate so well that both agents end up with byte-identical patches (each fully merged the other's branch). The existing eval combiner sequence — apply patch1 → apply patch2 on top — chokes because every hunk in patch2 is already applied, producing an empty merged.patch and a downstream "No valid patches in input" failure even though both submissions are individually fine. 
Fix in ``test_merged``: before invoking ``_setup_branches`` / ``_merge_naive``, ``cmp`` the two patches. If they match, copy patch1 to merged.patch (normalized via ``git apply --recount`` so agents that emit unified diffs with miscounted hunk headers still work) and skip the merge dance. Returns a fresh result with ``merge.status: "identical"`` so the caller can tell the short-circuit fired vs a real merge. Verified on the codex-team e2e: - cx_team_v5 (codex agents perfectly merged to identical 243-line patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20) - cx_team_v4 (codex agents diverged on the merge): unchanged at f2 20/20 + f1 13/14 = 33/34 tests, still falls back to agent2-alone via apply_status: {'agent1': 'failed', ...} I also briefly tried adding ``git apply --recount`` to ``_setup_branches``'s fallback chain, but that REGRESSED v4: it made agent1's malformed patch apply where it previously failed silently, triggering a real merge attempt that produced duplicate function definitions (broken Python) via union merge. The identical-patches short-circuit is the strictly-better fix — no regression, recovers the v5 case, and the malformed-hunk normalization only kicks in on the short-circuit path where it can't cause merge conflicts. 
Also lands previously-uncommitted housekeeping: - prompt.py: ruff-format-only diff on the merge-required block from the prior commit - test_team_wiring.py: ruff --fix removed unused MagicMock imports - test_gcp_backend.py / test_tasks.py: ruff --fix removed f-string-without-placeholder and unused-json import (both unrelated drift caught by the gate) Tests: 1 new (full suite: 327 passed) - ``test_test_merged_shortcircuits_on_identical_patches`` — source inspection confirms the short-circuit branch + "identical" merge-status string exist in test_merged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands: register Redis-backed CoopTaskTracker as a typed tool The previous openhands team runs (oh_team_v3) showed agents discovering the ``coop-task-*`` shell wrappers via ``compgen`` but never invoking them — gpt-5.5 strongly prefers typed tools registered with the LLM over arbitrary shell commands. This commit lands the architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered under the same name as openhands' built-in ``TaskTrackerTool`` so the registry resolution swaps it transparently. Files: * ``openhands/tools/task_tracker/coop_definition.py`` — new tool definition + executor. Same ``TaskTrackerAction`` / ``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round- trip through the shared ``cb:<run_id>:`` Redis namespace that ``TaskListClient`` (host side) writes to. Tasks are auto-owned by the calling agent; ``view`` shows peer tasks prefixed with ``[<their_agent_id>]``. Registered under both ``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing the module rebinds the latter to the Coop variant. * ``openhands/tools/preset/default.py`` — gains a ``team_mode`` kwarg (kept for API stability + tests; the actual swap happens server-side via the .pth/__init__ side-effect import, not by changing the host-side tool list). 
Pre-PR coop block split into a more nuanced team-mode prompt section that documents the TaskTracker → shared-list behavior. * ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` — layers two more bits into the Modal image at build time: - ``add_local_file`` of ``coop_definition.py`` to ``$OH_DIR/coop_definition.py`` (in the sandbox's openhands install) - ``grep ... || echo`` appending ``from . import coop_definition`` to the package's ``__init__.py`` so the registration runs at import time. Tests: 1 new + updated image-layering assertions - ``test_importing_coop_definition_overrides_local_registration``: inspecting the registry's ``_MODULE_QUALNAMES`` confirms ``TaskTrackerTool.name`` resolves to ``coop_definition``'s registration after import. - ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file`` calls + 2 ``run_commands`` layers (tool-file install + ``coop-task-*`` wrappers) and that the ``from . import coop_definition`` line is in the install commands. Full suite: 329 passed. Ruff / format / mypy all green. KNOWN LIMITATION (documented in coop_definition.py docstring): the openhands_sdk agent-server runs in a Modal sandbox that's network-isolated from the host Redis. The CoopTaskTracker is correctly registered and the LLM can call it, but every operation returns "Shared task list unavailable" because the sandbox can't ``socket.getaddrinfo("host.docker.internal")``. The fix is in the deployment layer (Modal tunnels, a Modal-hosted Redis, or running openhands directly via docker like the other adapters), not in this PR — verified by oh_team_v10: agent ran ``coop-task-list`` first ("The coop CLI failed; I'll use the shared task tracker."), then fell back to TaskTrackerAction which still hit the local executor because the override + Redis combo can't actually work in Modal. For non-Modal openhands deployments (e.g. local docker-backed openhands runs, future remote-conversation transports that share the host network), this tool works as designed. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * openhands team mode: end-to-end working with Modal-hosted Redis Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker swap from actually functioning. Three pieces, working together: 1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects ``agent_name == "openhands_sdk"`` and spins up a Modal sandbox running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``, accessed via ``unencrypted_host:unencrypted_port``). Re-uses the existing ``connectors/redis_server.ModalRedisServer`` — it was already written, just unused. Both the host TaskListClient and the agent sandboxes point at the same public TCP endpoint, so pre-seed and agent reads/writes share state. Falls back to local Redis for the other adapters. 2. **CoopTaskTrackerTool injection into the Modal sandbox.** The adapter now ``add_local_file``s three pieces into the OpenHands image at build time: - ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py`` - ``coop_definition.py`` → ``$OH_DIR/coop_definition.py`` - ``_team_init_override.py`` → ``$OH_DIR/__init__.py`` (replaces upstream; same exports + a side-effect import of coop_definition so the Redis-backed executor overrides the local TaskTracker registration at first import). Plus a ``find -name '*.pyc' -delete`` to invalidate Python's bytecode cache so the new __init__ actually re-runs. 3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle connections after a few minutes, so the original Redis client pre-seed used at startup gets closed before the 9-min agent run finishes. Re-open the client at harvest time using the same URL. 
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with ``-a openhands_sdk --setting team --git``: - Modal Redis startup: ``redis ready redis://r450.modal.host:41899`` - Both agents Submitted, 9m total - Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓) - Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0, time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2, agent1:1}, updates_per_agent: {agent2:4, agent1:5}`` - Cost: $3.33 Tests: image-layering assertions expanded — ``add_local_file`` now called 3 times (CLI helper, tool def, __init__ override), and the run_commands chain copies both files + wipes .pyc caches. Full suite: 329 passed. Ruff / format / mypy all green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * deps: add fakeredis to dev extras The team-mode unit tests (task_list / protocol / fs_mirror / loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere in pyproject.toml — it just happened to be present in my local venv because something else pulled it in transitively. GitHub CI installs ``[dev]`` only, so on a clean install pytest collection fails with ``ModuleNotFoundError: No module named 'fakeredis'`` on every team-mode test file. Adding the dependency explicitly fixes PR #52 (team-mode) CI; once team-mode merges, PR #55 (team-all-adapters) will also pick it up via the same path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * swe_agent: fix import error + add missing transitive deps Three changes that together unblock swe_agent team-mode runs (and solo/coop runs too — the bug wasn't team-specific): 1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2`` in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``. 
The old package was renamed in v0.0.13; both swe_agent files had
stale imports that no-op'd at module load (TypeError or
ModuleNotFoundError depending on how the framework was invoked),
making every swe_agent invocation return Error before any LLM call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras in
pyproject.toml. swe_agent's vendored framework imports these at
module-load time even when the docker/S3/model paths are dormant, so
a clean ``pip install '.[swe-agent]'`` without these would still
ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5): both
agents Submitted, patches 373 + 88 lines, both applied via git apply.
Eval failed 0/2 due to a content-quality issue (``NameError: name
'Set' is not defined``: agent used Set without importing it; both
agents hit exit_cost budget limit mid-implementation), but that's
model variance, not adapter wiring.
swe_agent is unblocked: it runs end-to-end, produces patches, the eval
pipeline processes them.
Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI install or
in-loop task auto-refresh; those are tracked as follow-ups in the PR
body. For now the swe_agent team-mode run just gets the team prompt
section + env vars; full team-tool integration is a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: team-mode bugs surfaced by 10-pair core run
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core subset
in team setting. All four now ≥ 5/10.
- normalize_patch ate trailing blank context lines (text.strip()
consumes " \n"), breaking last-hunk line counts so git apply
rejected otherwise-valid diffs. Replaced with lstrip/rstrip on "\n"
only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all: raw
.strip() on the patch.txt read, so every msa patch ended in a
non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
long-running command, so the image's default CMD exited and every
exec hit "Sandbox not found". Pass "sleep", "infinity" as the
positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
because shared build_environment was hardcoded to DockerEnvironment.
Added a backend kwarg and threaded config["backend"] through both
adapters.
- Team lead prompt buried the integration step at the bottom of a long
workflow list; Claude/Codex consistently exited after their own
feature without reading /workspace/shared/<agent>.patch. Rewrote
with a hard-rule opener and a 5-point pre-submission checklist.
Member prompt now opens with "stay in your lane" per the lead's
PLAN.md.
- eval test_merged now falls back to testing each agent's patch alone
when the merged tree doesn't pass both features. Surfaced as
merge.strategy="solo-agent1" / "solo-agent2". Credits the agent
(typically the lead) who correctly integrated both features into one
working patch but had it corrupted by union-merging with the other
agent's partial implementation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs+data: core subset and team-mode horizontal comparison
- dataset/subsets/core.json: 10-pair subset for quick agent
comparisons. Stratified by repo (largest-remainder proportional
allocation by full-dataset pair count) with a one-slot floor per
primary language (Python / Go / Rust / TS). Reproducible via
scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
frameworks on the core subset in team setting.
Includes per-task pass/fail matrix annotated with the merge strategy
used, plus the chronological narrative of the dozen reruns that
surfaced each of the bugs fixed in the previous commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(eval): don't bail when union-merge also conflicts
Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.
Doesn't change any of the current 40 eval results: union's
merge=union attribute is tolerant enough that every task in the
dataset produces some tree (potentially broken code with
stitched-together lines); the broken-tree-tests-fail path already
triggered the solo fallback. This just closes the defensive gap for
future pathological cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* eval(team): identical / naive / lead-when-naive-conflicts policy
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:
1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative (no
further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against
both feature suites
Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
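The three-step chain above can be read as a small decision function.
This is only a sketch of the policy as the commit message states it,
not the code in ``test_merged``; the function name and the returned
strategy labels are hypothetical.

```python
def choose_merge_strategy(patches_identical: bool, naive_merge_clean: bool) -> str:
    """Pick the eval path for a two-agent team run (illustrative labels only)."""
    if patches_identical:
        # Step 1: nothing to merge, evaluate the single shared patch.
        return "identical"
    if naive_merge_clean:
        # Step 2: the merged tree is authoritative, with no further fallback.
        return "naive"
    # Step 3: naive merge conflicted, so only the lead's patch.txt is tested.
    return "lead-alone"
```

Note that there is deliberately no branch for the member's patch: under
this policy a conflicting merge is credited only if the lead integrated
both features.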
Effect on the core-subset horizontal comparison:
msa 6 → 6 (unchanged)
oh 5 → 4 (loses pallets_jinja/1621; was passing via union, which
concealed that oh's lead doesn't integrate)
cc 5 → 5 (unchanged)
cx 5 → 5 (unchanged)
oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents
commit their patch.txt into the working tree, which forces a merge
conflict on patch.txt that union resolved while the actual source
merge was non-conflicting). Under the stricter policy this gets
routed through lead-alone, which oh's lead does not pass.
BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(modal): codex stdin hang; eval guardrail for openhands_sdk
codex on Modal: `codex exec` was hanging for the full sandbox lifetime
(~2h) producing zero stream output. Root cause: codex's exec mode
prints "Reading additional input from stdin..." and blocks until
stdin EOF. Docker's non-tty `docker exec` gives EOF for free; Modal
sandbox keeps stdin open. Fix: add `</dev/null` to the codex
invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the run's
config.json and refuses with a clear warning when the run was
produced by openhands_sdk but --backend != modal.
Note: swe_agent already runs on Modal (uses
swerex.ModalDeploymentConfig by default; the earlier docs claiming it
was docker-only were wrong). Smoke-tested same dottxt task: 1/1 pass
in 3m 12s.
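The codex stdin hang above comes down to a child process waiting for
EOF on an inherited stdin. The actual fix appends `</dev/null` to the
shell command; the sketch below shows the same idea from Python with
``subprocess.DEVNULL``. The helper name is hypothetical, and the child
here is a stand-in Python one-liner, not the codex CLI.

```python
import subprocess
import sys


def run_with_closed_stdin(argv: list[str], timeout: float = 30.0) -> str:
    """Run argv with stdin wired to /dev/null so reads see EOF immediately."""
    completed = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,  # same effect as `</dev/null` in a shell
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return completed.stdout


# Stand-in child: reads stdin to EOF, then reports how many characters it got.
CHILD = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
```

With stdin left open and nobody writing to it, the same child would
block in ``sys.stdin.read()`` until the timeout fired.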
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(swe_agent): add --backend docker support
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.
Two upstream-swerex issues had to be worked around to make the docker
path actually start a container:
1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
so swerex's `docker run ... image sh -c "<startup>"` becomes
`runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as the
feature-patch path. Pass docker_args=["--entrypoint", ""] to clear
the entrypoint (mirrors the existing Modal monkey-patch that does
.entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the
swerex-remote binary isn't pre-installed, but pipx looks for an
executable literally named "swe-rex", which doesn't exist in the
published `swe-rex` package (it provides "swerex-remote").
Monkey-patch DockerDeployment._get_swerex_start_cmd to use
`pipx run --spec swe-rex swerex-remote ...` instead.
Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker: 1/1
pass in 2m 53s, 17 steps, $0.32, no errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team_harness: extract team mode as standalone harness + ablation flags
Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending
on CooperBench's task layout.
Adds TeamSession + TeamHarnessConfig:
- TeamSession bundles per-run state (run_id, namespaced Redis URL,
ordered agent list, scratchpad volume name) with the feature config
and exposes adapter-facing factories that each return None / [] / {}
when their feature is disabled, so adapter code paths collapse to
one branch:
coop_env.update(session.env_for(agent_id))
extra_run_args.extend(session.scratchpad_mount_args())
mcp_config = session.mcp_config(container_script_path=...)
- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
(task_list, scratchpad, mcp, auto_refresh, protocol). The
lead/member role split is the always-on baseline -- without it team
is just coop.
Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter. result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent,
and openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers. Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.
Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating. Full
suite 363 passed, 63 skipped; ruff/format/mypy clean.
End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log
/ tasks files, empty metrics dict, no volume. team_features in
result.json reflects the requested ablation.
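A frozen dataclass of five booleans, as described for
TeamHarnessConfig above, might look like the following. The five field
names come from the commit message; the all-on defaults are an
assumption inferred from the ``--team-no-*`` flag naming, and the
``config_from_disabled`` helper is hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Per-feature switches; defaults assumed to be on (not confirmed by the source)."""

    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


def config_from_disabled(disabled: set[str]) -> TeamHarnessConfig:
    """Build a config with the named features switched off (hypothetical helper)."""
    known = set(asdict(TeamHarnessConfig()))
    unknown = disabled - known
    if unknown:
        raise ValueError(f"unknown team features: {sorted(unknown)}")
    return TeamHarnessConfig(**{name: name not in disabled for name in known})
```

Because the dataclass is frozen, a config handed to an adapter cannot
be flipped mid-run, and ``asdict(config)`` gives a plain dict suitable
for recording alongside other run metadata.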
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil added a commit that referenced this pull request May 21, 2026
* agents/codex: add Codex adapter; lift shared coop bits into _coop
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter. Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.
Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access
--skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals /
messages. Cost is reported as 0.0 because Codex does not emit a
cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a
"model not found" shaped error, the adapter retries once without
``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter
returns Error immediately instead of spinning up a container that
can only fail.
Shared ``_coop`` module:
- ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh`` — pip-installs redis and drops the shell
wrappers; each adapter's setup.sh sources it.
- ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
agnostic.
- ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
``write_file_in_container`` / ``read_file_from_container``,
``rewrite_comm_url_for_container``, ``build_git_setup_command``,
``parse_sent_messages_log``, and ``normalize_patch``.
Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires. Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace). This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module. Full suite: 228
passed, 63 skipped.
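For the ``normalize_patch()`` fix described above, a minimal sketch of
the stated contract (no leading blank lines, exactly one trailing
newline) could be the following. It assumes only newline characters
need stripping and that an empty patch stays empty; the real function
in ``runtime.py`` may differ.

```python
def normalize_patch(text: str) -> str:
    """Return the patch with no leading newlines and exactly one trailing newline."""
    # Strip newline characters only. A bare .strip() would also remove the
    # " \n" that encodes a trailing blank context line in a unified diff.
    body = text.lstrip("\n").rstrip("\n")
    # Assumption: an empty or newline-only patch stays empty.
    return body + "\n" if body else ""
```

The point of the trailing newline is that ``git apply`` treats a final
hunk line without one as a corrupt patch.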
End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens,
184-line patch (with the trailing-newline
fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but
0/2 tests pass — coordination failure
(agent1 fetched ``team`` but never merged,
so the stacked patches produce a Python
SyntaxError at line 144 of the modified
file). Claude on the same task scored
2/2; Codex used the tools less aggressively
on this run.
The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug. Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runner: add team mode (lead + members + shared task list + scratchpad)
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product. Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:
1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
atomic claim semantics (HSETNX-style — exactly one caller wins on a
race) and an audit log of every mutation. Exposed in the container
as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
/ ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated
``team-lead`` and gets a system-prompt block instructing them to
break the spec into tasks, assign them via ``coop-task-create
--assign``, watch progress, and integrate. Other agents are
``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
mounted at ``/workspace/shared`` in every container. Free
coordination artifact for design notes, partial diffs, interface
sketches.
Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``. Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
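As a rough illustration of deriving these metrics from an audit log,
the sketch below computes four of them from a list of event dicts. The
event schema (``ts``, ``op``, ``agent``, ``task_id``, ``status``) is an
assumption made for the example; the real log format is not shown in
this message, and ``unowned_at_end`` is left out because it depends on
how assignment is recorded.

```python
from collections import Counter


def coordination_metrics(events: list[dict], run_start: float) -> dict:
    """Summarize claim/update activity from an (assumed) audit-log schema."""
    claims = [e for e in events if e["op"] == "claim"]
    updates = [e for e in events if e["op"] == "update"]
    first_claim = min((e["ts"] for e in claims), default=None)
    # A task counts as done if any update moved it to status "done".
    done = {e["task_id"] for e in updates if e.get("status") == "done"}
    return {
        "time_to_first_claim_seconds": (
            None if first_claim is None else round(first_claim - run_start, 1)
        ),
        "claims_per_agent": dict(Counter(e["agent"] for e in claims)),
        "updates_per_agent": dict(Counter(e["agent"] for e in updates)),
        "tasks_done": len(done),
    }
```

Because everything is computed after the run from the log alone, the
agents never see or influence the metrics while working.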
Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs. The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.
Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log,
run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
metrics in result, three-agent team)
- 4 misc
Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its
work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to
/workspace/shared/agent2.patch + a summary.md)
- **0/1 pass rate** — both ``patch.txt`` files were empty: the
members wrote diffs to the scratchpad instead of also writing
``/workspace/repo/patch.txt``, and the lead never ran the final
integration step. This is real coordination signal (the prompt
told them to write both places but they followed the scratchpad
half only) — a follow-up will tighten the prompt to make patch.txt
submission the explicit final step.
Future PRs (intentionally out of scope here so this lands at a
reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of
messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)
Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53)
Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.
1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
so agents can ``ls`` and ``cat`` tasks with their existing tools
rather than going through the ``coop-task-list`` CLI. Layout
mirrors Claude Code's team primitive: one ``<id>.json`` per task,
plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
trail). Triggered on every ``coop-task-list`` invocation and from
the host runner at startup. Files written via tempfile+replace so
readers never observe a partial state.
2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
Layered on plain Redis messaging, mirroring CC's
``plan_approval_request`` / ``plan_approval_response`` shape.
``coop-request <peer> <kind> <body>`` returns a request_id (and
optionally blocks via ``--wait N`` for a response).
``coop-respond <request_id> <body>`` writes back; the sender's
``await_response`` uses BLPOP so it actually sleeps instead of
busy-polling. Both events flow into the shared task-log so
coordination metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio
JSON-RPC server that exposes a single ``wait_for_message`` tool
backed by BLPOP on the agent's inbox. Registered automatically:
Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
The point is to make "watch the inbox" a natural idle behavior for
the CLI adapters instead of a busy-loop on ``coop-recv`` returning
empty — the closest we can get to push-style delivery for opaque
CLI agent loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
``TeamPoller`` is a per-agent host-side helper that
``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
queries — same hook as the existing inbox poll. The LLM sees a
compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
prepended to every turn so it doesn't need to remember to call
``coop-task-list``. Plumbed via ``agent.team_poller`` so the
``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
The same module also exports ``poll_team_state()`` for in-container
use (env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members
writing diffs to ``/workspace/shared/<id>.patch`` only and never to
``/workspace/repo/patch.txt``, scoring 0/2 despite great
coordination. Both lead and member prompts now have an explicit
``### Final submission — REQUIRED`` section that calls out
``patch.txt`` as the only file the bench evaluates and provides
the exact ``git diff > patch.txt`` command.
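The "tempfile+replace" write mentioned for the filesystem mirror in
item 1 is a standard pattern: write to a temporary file in the same
directory, then rename it over the target so a concurrent reader sees
either the old file or the new one. A minimal sketch, with a
hypothetical helper name rather than the code in ``fs_mirror.py``:

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: object) -> None:
    """Write payload as JSON so readers never observe a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the target directory: os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

A mirror built this way could call the helper once per ``<id>.json``
and once for ``_index.json``; each file is individually consistent,
though the set of files as a whole is not updated in one step.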
Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.
Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call,
timeout, blocking, unknown-tool error,
env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)
Full suite: **311 passed**, 63 skipped.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests). All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the
same task — the prompt fix is doing real work.
Stacks on #52.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team mode: wire team prompt + env into the three Python-loop adapters
Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode. Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.
New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds. Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.
Per-adapter wiring:
- ``mini_swe_agent_v2``: appends team_task_section to task;
propagates CB_TEAM_* through env_kwargs["env"]; adds
``--add-host=host.docker.internal:host-gateway`` + scratchpad
volume to docker run args; installs the team CLI scripts + pip
redis in the container after env spin-up. The existing
``TeamPoller`` host-side hook (already in step()) still fires.
- ``openhands_sdk``: appends team_task_section to task; folds a new
``team_env`` dict into ``coop_info`` so
``_build_credentials_dict`` propagates CB_TEAM_* into the
sandbox. Coop-task-* binary install in the OpenHands agent-server
image is a follow-up — OpenHands manages its own image build and
doesn't expose a clean post-start exec hook.
- ``swe_agent``: appends team_task_section to task. The SWE-agent
framework's sandbox + agent loop is third-party and harder to
instrument; everything beyond the prompt is a follow-up.
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
consistency between team_task_section and build_team_instruction,
every registered runner accepts the team kwargs, openhands env
keys, swe_agent signature
Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.
End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment. Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.
Compatibility matrix is now:
| Adapter | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
|---------------------|---------|-------------|--------------|------------------|----------|
| claude_code | yes | yes (full) | n/a | yes | yes |
| codex | yes | yes (full) | n/a | yes | yes |
| mini_swe_agent_v2 | yes | yes (sec.) | yes | yes | yes |
| openhands_sdk | yes | yes (sec.) | n/a | NOT YET | yes |
| swe_agent | yes | yes (sec.) | NOT YET | NOT YET | NOT YET |
Stacks on #52 (merged-up team-mode branch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands: layer coop-task-* install onto Modal image for team mode
Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required). Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost. Modal caches the layered image; subsequent
team runs are instant.
Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked. Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex. Both patches applied; f1 14/14,
f2 19/20.
Tests: 2 new (full suite: 326 passed)
- test_team_env_triggers_image_layering — verifies add_local_file
+ pip_install + run_commands fire with the right args when team
mode is active
- test_no_layering_when_team_inactive — verifies solo / coop
runs skip the image-build cost
Matrix update — openhands_sdk now reads:
Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team prompt: make the merge-before-submit step REQUIRED
The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it. Both fetched their peer's branch (2 each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.
The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it. This rewrite:
- Renames the section ``## Git collaboration — MERGE IS REQUIRED
BEFORE SUBMITTING`` so the imperative is in the heading itself.
- Adds an explicit "Required final sequence — run this verbatim
before exiting" block with the full fetch+merge+diff sequence,
parameterized over every partner branch.
- Explains *why* (each agent's patch.txt is evaluated against every
feature's tests; without the merge, the peer feature's symbols
are missing → ImportError).
- Frames it the same way the patch.txt step is framed (REQUIRED,
skip-at-your-loss), which the original prompt fix proved
codex responds to.
Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both
patches now contain both features' symbols. Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).
A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree. This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch. That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked. Fix is a separate concern from team-mode wiring.
Tests still pass (existing prompt tests are content-agnostic;
326 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* eval: short-circuit when both agents submit identical merged patches
In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.
Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches. If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance. Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
Verified on the codex-team e2e:
- cx_team_v5 (codex agents perfectly merged to identical 243-line
patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
- cx_team_v4 (codex agents diverged on the merge): unchanged at
f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
agent2-alone via apply_status: {'agent1': 'failed', ...}
I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.
Also lands previously-uncommitted housekeeping:
- prompt.py: ruff-format-only diff on the merge-required block
from the prior commit
- test_team_wiring.py: ruff --fix removed unused MagicMock
imports
- test_gcp_backend.py / test_tasks.py: ruff --fix removed
f-string-without-placeholder and unused-json import (both
unrelated drift caught by the gate)
Tests: 1 new (full suite: 327 passed)
- ``test_test_merged_shortcircuits_on_identical_patches`` — source
inspection confirms the short-circuit branch + "identical"
merge-status string exist in test_merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands: register Redis-backed CoopTaskTracker as a typed tool
The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands. This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.
Files:
* ``openhands/tools/task_tracker/coop_definition.py`` — new tool
definition + executor. Same ``TaskTrackerAction`` /
``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
trip through the shared ``cb:<run_id>:`` Redis namespace that
``TaskListClient`` (host side) writes to. Tasks are auto-owned
by the calling agent; ``view`` shows peer tasks prefixed with
``[<their_agent_id>]``. Registered under both
``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
the module rebinds the latter to the Coop variant.
* ``openhands/tools/preset/default.py`` — gains a ``team_mode``
kwarg (kept for API stability + tests; the actual swap happens
server-side via the .pth/__init__ side-effect import, not by
changing the host-side tool list). The pre-PR coop block is split
out into a more detailed team-mode prompt section that documents
the TaskTracker → shared-list behavior.
* ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
layers two more bits into the Modal image at build time:
- ``add_local_file`` of ``coop_definition.py`` to
``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
install)
- ``grep ... || echo`` appending
``from . import coop_definition`` to the package's
``__init__.py`` so the registration runs at import time.
Tests: 1 new + updated image-layering assertions
- ``test_importing_coop_definition_overrides_local_registration``:
inspecting the registry's ``_MODULE_QUALNAMES`` confirms
``TaskTrackerTool.name`` resolves to ``coop_definition``'s
registration after import.
- ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
calls + 2 ``run_commands`` layers (tool-file install +
``coop-task-*`` wrappers) and that the
``from . import coop_definition`` line is in the install
commands.
Full suite: 329 passed. Ruff / format / mypy all green.
KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis. The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``. The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.
For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands team mode: end-to-end working with Modal-hosted Redis
Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning. Three pieces, working together:
1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
accessed via ``unencrypted_host:unencrypted_port``). Re-uses the
existing ``connectors/redis_server.ModalRedisServer`` — it was
already written, just unused. Both the host TaskListClient and
the agent sandboxes point at the same public TCP endpoint, so
pre-seed and agent reads/writes share state. Falls back to local
Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
adapter now ``add_local_file``s three pieces into the OpenHands
image at build time:
- ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
- ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
- ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
(replaces upstream; same exports + a side-effect import of
coop_definition so the Redis-backed executor overrides the
local TaskTracker registration at first import).
Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
connections after a few minutes, so the Redis client opened for the
startup pre-seed is closed before the 9-minute agent run finishes.
The harvest step now re-opens the client using the same URL.
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:
- Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
- Both agents Submitted, 9m total
- Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
- Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
- Cost: $3.33
Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.
Full suite: 329 passed. Ruff / format / mypy all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* deps: add fakeredis to dev extras
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.
GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file. Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* swe_agent: fix import error + add missing transitive deps
Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):
1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
The old package was renamed in v0.0.13; both swe_agent files
had stale imports that failed at module load (TypeError or
ModuleNotFoundError depending on how the framework was invoked),
so every swe_agent invocation returned Error before any LLM
call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
in pyproject.toml. swe_agent's vendored framework imports these
at module-load time even when the docker/S3/model paths are
dormant, so a clean ``pip install '.[swe-agent]'`` without these
would still ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply. Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring. swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.
Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body. For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: team-mode bugs surfaced by 10-pair core run
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass rates on the core
subset in team setting. All four frameworks in the core-subset
comparison now score ≥ 5/10.
- normalize_patch ate trailing blank context lines (text.strip()
consumes " \n"), breaking last-hunk line counts so git apply
rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
"\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
raw .strip() on the patch.txt read, so every msa patch ended
in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
long-running command, so the image's default CMD exited and
every exec hit "Sandbox not found". Pass "sleep", "infinity"
as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
because shared build_environment was hardcoded to DockerEnvironment.
Added a backend kwarg and threaded config["backend"] through both
adapters.
- Team lead prompt buried the integration step at the bottom of a
long workflow list; Claude/Codex consistently exited after their
own feature without reading /workspace/shared/<agent>.patch.
Rewrote with a hard-rule opener and a 5-point pre-submission
checklist. Member prompt now opens with "stay in your lane" per
the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
alone when the merged tree doesn't pass both features. Surfaced
as merge.strategy="solo-agent1" / "solo-agent2". Credits the
agent (typically the lead) who correctly integrated both
features into one working patch but had it corrupted by
union-merging with the other agent's partial implementation.
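The trailing-context bug in the first bullet can be shown in a few lines. This is a sketch of the described behavior (trim newlines only, end with one newline), not the actual ``runtime.normalize_patch`` source; returning an empty string for an empty patch is an assumption.

```python
def normalize_patch(text: str) -> str:
    """Trim blank lines around a patch and end it with exactly one newline."""
    # Strip newline characters only. A plain text.strip() would also remove the
    # single space that marks a blank context line at the end of the last hunk,
    # leaving the hunk one line shorter than its header claims.
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""  # empty-in, empty-out is an assumption


# A tiny patch whose last hunk ends with a blank context line (" \n").
raw = "\n--- a/f\n+++ b/f\n@@ -1,2 +1,3 @@\n x\n+y\n \n\n"
clean = normalize_patch(raw)
```

Here ``raw.strip()`` ends at ``+y`` and drops the final context line, while ``clean`` keeps it.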
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs+data: core subset and team-mode horizontal comparison
- dataset/subsets/core.json: 10-pair subset for quick agent
comparisons. Stratified by repo (largest-remainder proportional
allocation by full-dataset pair count) with a one-slot floor per
primary language (Python / Go / Rust / TS). Reproducible via
scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
frameworks on the core subset in team setting. Includes per-task
pass/fail matrix annotated with the merge strategy used, plus the
chronological narrative of the dozen reruns that surfaced each of
the bugs fixed in the previous commit.
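For readers unfamiliar with largest-remainder allocation, a rough sketch follows. It is not ``scripts/generate_core_subset.py``: the function name is illustrative, the per-language one-slot floor and the seeded sampling are omitted, and the repo names in the usage note are made up.

```python
from math import floor
from typing import Dict


def largest_remainder(counts: Dict[str, int], slots: int) -> Dict[str, int]:
    """Split `slots` across keys in proportion to `counts`."""
    total = sum(counts.values())
    quotas = {k: slots * v / total for k, v in counts.items()}
    # Every key first receives the integer part of its exact quota.
    alloc = {k: floor(q) for k, q in quotas.items()}
    leftover = slots - sum(alloc.values())
    # Remaining slots go to the largest fractional parts; ties favor bigger strata.
    order = sorted(counts, key=lambda k: (quotas[k] - alloc[k], counts[k]), reverse=True)
    for k in order[:leftover]:
        alloc[k] += 1
    return alloc
```

With ``largest_remainder({"repo_a": 7, "repo_b": 2, "repo_c": 1}, 3)`` the exact quotas are 2.1, 0.6 and 0.3, so ``repo_a`` gets two slots and the single leftover slot goes to ``repo_b``.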
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(eval): don't bail when union-merge also conflicts
Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.
Doesn't change any of the current 40 eval results — union's merge=union
attribute is tolerant enough that every task in the dataset produces
some tree (potentially broken code with stitched-together lines); the
broken-tree-tests-fail path already triggered the solo fallback. This
just closes the defensive gap for future pathological cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* eval(team): identical / naive / lead-when-naive-conflicts policy
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:
1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative
(no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against
both feature suites
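The three-step chain above reduces to a small decision function. The sketch below is illustrative only: ``choose_strategy`` and the strategy strings ``"naive"`` and ``"lead-alone"`` are assumed names, and the real ``test_merged`` also runs the merges and test suites.

```python
from typing import Literal

Strategy = Literal["identical", "naive", "lead-alone"]


def choose_strategy(patches_identical: bool, naive_merge_clean: bool) -> Strategy:
    """Pick which tree the eval treats as the team's submission."""
    if patches_identical:
        # Step 1: nothing to merge, use either patch as-is.
        return "identical"
    if naive_merge_clean:
        # Step 2: the merged tree is authoritative, with no further fallback.
        return "naive"
    # Step 3: conflicts, so only the lead's patch is tested against both suites.
    return "lead-alone"
```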
Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
Effect on the core-subset horizontal comparison:
msa 6 → 6 (unchanged)
oh 5 → 4 (loses pallets_jinja/1621 — was passing via union, which
concealed that oh's lead doesn't integrate)
cc 5 → 5 (unchanged)
cx 5 → 5 (unchanged)
oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.
BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(modal): codex stdin hang; eval guardrail for openhands_sdk
codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
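The same idea expressed from Python, as a sketch: a child that reads stdin until EOF returns immediately once stdin is wired to the null device. The child here is a stand-in script, not ``codex exec``, and the real fix lives in the shell command string built by ``_build_codex_command``.

```python
import subprocess
import sys

# Stand-in for a CLI that blocks until stdin reaches EOF before doing work.
child = [sys.executable, "-c", "import sys; sys.stdin.read(); print('done')"]

# stdin=subprocess.DEVNULL is the Python counterpart of appending `</dev/null`:
# the child sees EOF on its first read instead of waiting on an open pipe.
result = subprocess.run(
    child,
    stdin=subprocess.DEVNULL,
    capture_output=True,
    text=True,
    timeout=30,
)
```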
openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.
Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(swe_agent): add --backend docker support
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.
Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:
1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
so swerex's `docker run ... image sh -c "<startup>"` becomes
`runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
the feature-patch path. Pass docker_args=["--entrypoint", ""]
to clear the entrypoint (mirrors the existing Modal monkey-patch
that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the
swerex-remote binary isn't pre-installed, but pipx looks for an
executable literally named "swe-rex" — which doesn't exist in
the published `swe-rex` package (it provides "swerex-remote").
Monkey-patch DockerDeployment._get_swerex_start_cmd to use
`pipx run --spec swe-rex swerex-remote ...` instead.
Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team_harness: extract team mode as standalone harness + ablation flags
Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.
Adds TeamSession + TeamHarnessConfig:
- TeamSession bundles per-run state (run_id, namespaced Redis URL,
ordered agent list, scratchpad volume name) with the feature config
and exposes adapter-facing factories that each return None / [] / {}
when their feature is disabled, so adapter code paths collapse to one
branch:
coop_env.update(session.env_for(agent_id))
extra_run_args.extend(session.scratchpad_mount_args())
mcp_config = session.mcp_config(container_script_path=...)
- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
(task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member
role split is the always-on baseline -- without it team is just coop.
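A rough sketch of the facade shape described above. Only the class names, the five field names, the three method names, the ``cb-team-<run_id>`` volume and the ``/workspace/shared`` mount point come from this PR. The method bodies, the specific ``CB_TEAM_*`` variable names, the MCP dict layout, and the choice to gate ``env_for`` on ``task_list`` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Five per-feature switches; everything is on by default."""
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass(frozen=True)
class TeamSession:
    run_id: str
    redis_url: str
    agents: List[str]
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def env_for(self, agent_id: str) -> Dict[str, str]:
        if not self.config.task_list:  # gating on task_list is a guess
            return {}
        # Placeholder variable names under the CB_TEAM_ prefix.
        return {
            "CB_TEAM_RUN_ID": self.run_id,
            "CB_TEAM_AGENT_ID": agent_id,
            "CB_TEAM_URL": self.redis_url,
        }

    def scratchpad_mount_args(self) -> List[str]:
        if not self.config.scratchpad:
            return []
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[Dict[str, object]]:
        if not self.config.mcp:
            return None
        # Hypothetical server entry; the real config format is adapter-specific.
        return {"command": "python3", "args": [container_script_path]}
```

Because each factory returns an empty value when its feature is off, the adapter can call ``update`` / ``extend`` unconditionally and only needs a ``None`` check for the MCP entry.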
Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter. result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers. Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.
Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating. Full suite
363 passed, 63 skipped; ruff/format/mypy clean.
End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
tasks files, empty metrics dict, no volume. team_features in
result.json reflects the requested ablation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: team-harness ablation report (flash, codex/gpt-5.5)
Self-contained HTML report of the team-harness ablation + multi-agent
comparison run on the flash subset (50 task pairs), codex/gpt-5.5,
docker, 1 seed.
Contents:
- docs/team_harness_ablation_report.html — setting comparison
(solo/coop/coop+git/team), one-feature-off ablation matrix, timing,
findings, methodology, caveats. All numbers embedded inline.
- docs/team_harness_ablation_data/{core,flash}_ablation.csv — raw rows.
- scripts/run_team_ablation.py — sweep driver (config -> cooperbench run+eval).
- scripts/gen_ablation_report.py — regenerates the HTML from logs/.
Headline results (passed / 50, both-features-pass):
coop msg-only 13 · team no-scratchpad 15 · team no-task_list 20 ·
solo 24 · coop+git 28 · team no-mcp 30 · team no-auto_refresh 30 ·
team baseline 31 · team no-protocol 35
Findings:
- scratchpad (-16) and task_list (-11) are load-bearing; removing
either drops team below solo (two uncoordinated agents < one).
- mcp/auto_refresh/protocol show no positive effect for codex
(auto_refresh is a no-op for CLI adapters by design; protocol-off
even scored +4, i.e. mild overhead without payoff).
- Most multi-agent value is a shared code substrate, not orchestration:
coop+git (56%) ~ team-scratchpad (62%) >> messaging-only coop (26%).
Caveat: team runs used the scratchpad for code-sharing, NOT --git, so
"team vs coop+git" compares two sharing substrates; the team --git cell
is untested (follow-up).
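The deltas and percentages quoted in the findings can be recomputed from the headline counts. The snippet below only restates the numbers above; the variable names are arbitrary.

```python
TOTAL_PAIRS = 50

# Headline counts: task pairs where both features pass.
passed = {
    "coop msg-only": 13,
    "team no-scratchpad": 15,
    "team no-task_list": 20,
    "solo": 24,
    "coop+git": 28,
    "team no-mcp": 30,
    "team no-auto_refresh": 30,
    "team baseline": 31,
    "team no-protocol": 35,
}

baseline = passed["team baseline"]
# Change in passes when one feature is switched off, relative to the full team.
deltas = {name: n - baseline for name, n in passed.items() if name.startswith("team no-")}
# Pass rate per configuration, in whole percent.
rates = {name: round(100 * n / TOTAL_PAIRS) for name, n in passed.items()}

for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:<22} {delta:+d}  ({rates[name]}%)")
```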
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil added a commit that referenced this pull request on May 21, 2026
* agents/codex: add Codex adapter; lift shared coop bits into _coop
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter. Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.
Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access
--skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals /
messages. Cost is reported as 0.0 because Codex does not emit a
cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a
"model not found" shaped error, the adapter retries once without
``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter
returns Error immediately instead of spinning up a container that
can only fail.
Shared ``_coop`` module:
- ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh`` — pip-installs redis and drops the shell
wrappers; each adapter's setup.sh sources it.
- ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
agnostic.
- ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
``write_file_in_container`` / ``read_file_from_container``,
``rewrite_comm_url_for_container``, ``build_git_setup_command``,
``parse_sent_messages_log``, and ``normalize_patch``.
Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires. Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace). This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module. Full suite: 228
passed, 63 skipped.
End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens,
184-line patch (with the trailing-newline
fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but
0/2 tests pass — coordination failure
(agent1 fetched ``team`` but never merged,
so the stacked patches produce a Python
SyntaxError at line 144 of the modified
file). Claude on the same task scored
2/2; Codex used the tools less aggressively
on this run.
The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug. Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runner: add team mode (lead + members + shared task list + scratchpad)
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product. Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:
1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
atomic claim semantics (HSETNX-style — exactly one caller wins on a
race) and an audit log of every mutation. Exposed in the container
as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
/ ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated
``team-lead`` and gets a system-prompt block instructing them to
break the spec into tasks, assign them via ``coop-task-create
--assign``, watch progress, and integrate. Other agents are
``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
mounted at ``/workspace/shared`` in every container. Free
coordination artifact for design notes, partial diffs, interface
sketches.
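The atomic-claim semantics in item 1 can be sketched as below, assuming a client that exposes ``hsetnx(name, key, value)`` the way redis-py does. The ``task:<id>`` key suffix and the ``owner`` field are hypothetical, ``claim_task`` is not the real ``TaskListClient`` API, and ``InMemoryHashes`` is only a stand-in so the example runs without a server.

```python
from typing import Dict, Protocol


class HashClient(Protocol):
    def hsetnx(self, name: str, key: str, value: str) -> int: ...


def claim_task(client: HashClient, run_id: str, task_id: str, agent_id: str) -> bool:
    """Return True only for the single caller that wins the claim."""
    key = f"cb:{run_id}:task:{task_id}"  # key layout beyond "cb:<run_id>:" is assumed
    # HSETNX writes the field only if it is absent, so concurrent claimers
    # cannot both succeed: the server applies the command atomically.
    return bool(client.hsetnx(key, "owner", agent_id))


class InMemoryHashes:
    """Single-process stand-in with the same HSETNX return convention."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, str]] = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        fields = self._data.setdefault(name, {})
        if key in fields:
            return 0
        fields[key] = value
        return 1
```

Namespacing the key by ``run_id`` is what keeps concurrent benchmark runs from seeing each other's tasks.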
Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``. Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
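One way the metrics could be derived from an audit log, as a sketch. The event schema (``op``, ``agent``, ``ts`` keys) and the function name are assumptions; only the metric names come from this commit, and ``tasks_done`` / ``unowned_at_end`` are omitted because they need task state rather than the event stream.

```python
from collections import Counter
from typing import Any, Dict, List, Optional


def coordination_metrics(log: List[Dict[str, Any]], start_ts: float) -> Dict[str, Any]:
    """Summarize claim and update activity from a list of audit events."""
    claims = [e for e in log if e["op"] == "claim"]
    updates = [e for e in log if e["op"] == "update"]
    # None when nobody ever claimed a task.
    first_claim: Optional[float] = min((e["ts"] for e in claims), default=None)
    return {
        "time_to_first_claim_seconds": (
            None if first_claim is None else round(first_claim - start_ts, 1)
        ),
        "claims_per_agent": dict(Counter(e["agent"] for e in claims)),
        "updates_per_agent": dict(Counter(e["agent"] for e in updates)),
    }
```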
Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs. The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.
Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log,
run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
metrics in result, three-agent team)
- 4 misc
Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its
work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to
/workspace/shared/agent2.patch + a summary.md)
- **0/2 pass rate**. Both ``patch.txt`` files were empty: the
members wrote diffs to the scratchpad instead of also writing
``/workspace/repo/patch.txt``, and the lead never ran the final
integration step. This is real coordination signal (the prompt
told them to write to both places but they followed only the
scratchpad half); a follow-up will tighten the prompt to make
patch.txt submission the explicit final step.
Future PRs (intentionally out of scope here so this lands at a
reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of
messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)
Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53)
Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.
1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
so agents can ``ls`` and ``cat`` tasks with their existing tools
rather than going through the ``coop-task-list`` CLI. Layout
mirrors Claude Code's team primitive: one ``<id>.json`` per task,
plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
trail). Triggered on every ``coop-task-list`` invocation and from
the host runner at startup. Files written via tempfile+replace so
readers never observe a partial state.
2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
Layered on plain Redis messaging, mirroring CC's
``plan_approval_request`` / ``plan_approval_response`` shape.
``coop-request <peer> <kind> <body>`` returns a request_id (and
optionally blocks via ``--wait N`` for a response).
``coop-respond <request_id> <body>`` writes back; the sender's
``await_response`` uses BLPOP so it actually sleeps instead of
busy-polling. Both events flow into the shared task-log so
coordination metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio
JSON-RPC server that exposes a single ``wait_for_message`` tool
backed by BLPOP on the agent's inbox. Registered automatically:
Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
The point is to make "watch the inbox" a natural idle behavior for
the CLI adapters instead of a busy-loop on ``coop-recv`` returning
empty — the closest we can get to push-style delivery for opaque
CLI agent loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
``TeamPoller`` is a per-agent host-side helper that
``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
queries — same hook as the existing inbox poll. The LLM sees a
compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
prepended to every turn so it doesn't need to remember to call
``coop-task-list``. Plumbed via ``agent.team_poller`` so the
``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
The same module also exports ``poll_team_state()`` for in-container
use (env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members
writing diffs to ``/workspace/shared/<id>.patch`` only and never to
``/workspace/repo/patch.txt``, scoring 0/2 despite great
coordination. Both lead and member prompts now have an explicit
``### Final submission — REQUIRED`` section that calls out
``patch.txt`` as the only file the bench evaluates and provides
the exact ``git diff > patch.txt`` command.
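The "tempfile+replace" write used by the filesystem mirror in item 1 follows a common pattern, sketched here. ``write_json_atomic`` is an illustrative name and not the ``fs_mirror.py`` implementation.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def write_json_atomic(path: Path, payload: Any) -> None:
    """Write JSON so that readers see either the old file or the new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the target directory: os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)
    except BaseException:
        # Do not leave half-written temp files behind for `ls` to show.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

An agent running ``cat`` on a task file during a refresh therefore never reads truncated JSON.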
Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.
Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call,
timeout, blocking, unknown-tool error,
env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)
Full suite: **311 passed**, 63 skipped.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests). All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the
same task — the prompt fix is doing real work.
Stacks on #52.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team mode: wire team prompt + env into the three Python-loop adapters
Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode. Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.
New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds. Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.
Per-adapter wiring:
- ``mini_swe_agent_v2``: appends team_task_section to task;
propagates CB_TEAM_* through env_kwargs["env"]; adds
``--add-host=host.docker.internal:host-gateway`` + scratchpad
volume to docker run args; installs the team CLI scripts + pip
redis in the container after env spin-up. The existing
``TeamPoller`` host-side hook (already in step()) still fires.
- ``openhands_sdk``: appends team_task_section to task; folds a new
``team_env`` dict into ``coop_info`` so
``_build_credentials_dict`` propagates CB_TEAM_* into the
sandbox. Coop-task-* binary install in the OpenHands agent-server
image is a follow-up — OpenHands manages its own image build and
doesn't expose a clean post-start exec hook.
- ``swe_agent``: appends team_task_section to task. The SWE-agent
framework's sandbox + agent loop is third-party and harder to
instrument; everything beyond the prompt is a follow-up.
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
consistency between team_task_section and build_team_instruction,
every registered runner accepts the team kwargs, openhands env
keys, swe_agent signature
Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.
End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment. Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.
Compatibility matrix is now:
| Adapter | Accepts | Team prompt | Auto-refresh | CLI in container | env vars |
|---------------------|---------|-------------|--------------|------------------|----------|
| claude_code | yes | yes (full) | n/a | yes | yes |
| codex | yes | yes (full) | n/a | yes | yes |
| mini_swe_agent_v2 | yes | yes (sec.) | yes | yes | yes |
| openhands_sdk | yes | yes (sec.) | n/a | NOT YET | yes |
| swe_agent | yes | yes (sec.) | NOT YET | NOT YET | NOT YET |
Stacks on #52 (merged-up team-mode branch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands: layer coop-task-* install onto Modal image for team mode
Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required). Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost. Modal caches the layered image; subsequent
team runs are instant.
Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked. Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex. Both patches applied; f1 14/14,
f2 19/20.
Tests: 2 new (full suite: 326 passed)
- test_team_env_triggers_image_layering — verifies add_local_file
+ pip_install + run_commands fire with the right args when team
mode is active
- test_no_layering_when_team_inactive — verifies solo / coop
runs skip the image-build cost
Matrix update — openhands_sdk now reads:
Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team prompt: make the merge-before-submit step REQUIRED
The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it. Both fetched their peer's branch (two fetches each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.
The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it. This rewrite:
- Renames the section ``## Git collaboration — MERGE IS REQUIRED
BEFORE SUBMITTING`` so the imperative is in the heading itself.
- Adds an explicit "Required final sequence — run this verbatim
before exiting" block with the full fetch+merge+diff sequence,
parameterized over every partner branch.
- Explains *why* (each agent's patch.txt is evaluated against every
feature's tests; without the merge, the peer feature's symbols
are missing → ImportError).
- Frames it the same way the patch.txt step is framed (REQUIRED,
skip-at-your-loss), which the original prompt fix proved
codex responds to.
Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both
patches now contain both features' symbols. Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).
A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree. This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch. That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked. Fix is a separate concern from team-mode wiring.
Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* eval: short-circuit when both agents submit identical merged patches
In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.
Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches. If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance. Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
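A minimal sketch of the short-circuit, under stated assumptions: the
function name ``merge_or_shortcircuit``, the ``"needs-merge"`` marker, and
the result-dict shape are illustrative and not the actual ``test_merged``
code, and the ``git apply --recount`` normalization is left out.

```python
import filecmp
import shutil
import tempfile
from pathlib import Path


def merge_or_shortcircuit(patch1: Path, patch2: Path, merged: Path) -> dict:
    """Copy patch1 to merged when both patches are byte-identical."""
    # shallow=False forces a byte-by-byte comparison, like `cmp`.
    if filecmp.cmp(patch1, patch2, shallow=False):
        shutil.copyfile(patch1, merged)
        return {"merge": {"status": "identical"}}
    # Hypothetical marker: the caller would run the real merge here.
    return {"merge": {"status": "needs-merge"}}


with tempfile.TemporaryDirectory() as tmp:
    a, b, out = (Path(tmp) / n for n in ("a.patch", "b.patch", "merged.patch"))
    a.write_text("diff --git a/x b/x\n")
    b.write_text("diff --git a/x b/x\n")
    result = merge_or_shortcircuit(a, b, out)
    merged_text = out.read_text()
```

Comparing bytes rather than parsed hunks keeps the check cheap and means
the short-circuit only fires when the two submissions are exactly the same.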
Verified on the codex-team e2e:
- cx_team_v5 (codex agents perfectly merged to identical 243-line
patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
- cx_team_v4 (codex agents diverged on the merge): unchanged at
f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
agent2-alone via apply_status: {'agent1': 'failed', ...}
I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.
Also lands previously-uncommitted housekeeping:
- prompt.py: ruff-format-only diff on the merge-required block
from the prior commit
- test_team_wiring.py: ruff --fix removed unused MagicMock
imports
- test_gcp_backend.py / test_tasks.py: ruff --fix removed
f-string-without-placeholder and unused-json import (both
unrelated drift caught by the gate)
Tests: 1 new (full suite: 327 passed)
- ``test_test_merged_shortcircuits_on_identical_patches`` — source
inspection confirms the short-circuit branch + "identical"
merge-status string exist in test_merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands: register Redis-backed CoopTaskTracker as a typed tool
The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands. This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.
Files:
* ``openhands/tools/task_tracker/coop_definition.py`` — new tool
definition + executor. Same ``TaskTrackerAction`` /
``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
trip through the shared ``cb:<run_id>:`` Redis namespace that
``TaskListClient`` (host side) writes to. Tasks are auto-owned
by the calling agent; ``view`` shows peer tasks prefixed with
``[<their_agent_id>]``. Registered under both
``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
the module rebinds the latter to the Coop variant.
* ``openhands/tools/preset/default.py`` — gains a ``team_mode``
kwarg (kept for API stability + tests; the actual swap happens
server-side via the .pth/__init__ side-effect import, not by
changing the host-side tool list). The pre-PR coop block is split
out into a more detailed team-mode prompt section that documents
the TaskTracker → shared-list behavior.
* ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
layers two more bits into the Modal image at build time:
- ``add_local_file`` of ``coop_definition.py`` to
``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
install)
- ``grep ... || echo`` appending
``from . import coop_definition`` to the package's
``__init__.py`` so the registration runs at import time.
Tests: 1 new + updated image-layering assertions
- ``test_importing_coop_definition_overrides_local_registration``:
inspecting the registry's ``_MODULE_QUALNAMES`` confirms
``TaskTrackerTool.name`` resolves to ``coop_definition``'s
registration after import.
- ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
calls + 2 ``run_commands`` layers (tool-file install +
``coop-task-*`` wrappers) and that the
``from . import coop_definition`` line is in the install
commands.
Full suite: 329 passed. Ruff / format / mypy all green.
KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis. The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``. The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.
For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands team mode: end-to-end working with Modal-hosted Redis
Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning. Three pieces, working together:
1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
accessed via ``unencrypted_host:unencrypted_port``). Re-uses the
existing ``connectors/redis_server.ModalRedisServer`` — it was
already written, just unused. Both the host TaskListClient and
the agent sandboxes point at the same public TCP endpoint, so
pre-seed and agent reads/writes share state. Falls back to local
Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
adapter now ``add_local_file``s three pieces into the OpenHands
image at build time:
- ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
- ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
- ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
(replaces upstream; same exports + a side-effect import of
coop_definition so the Redis-backed executor overrides the
local TaskTracker registration at first import).
Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
connections after a few minutes, so the Redis client opened at
startup for the pre-seed is closed before the 9-min agent run
finishes. The runner now re-opens the client at harvest time using
the same URL.
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:
- Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
- Both agents Submitted, 9m total
- Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
- Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
- Cost: $3.33
Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.
Full suite: 329 passed. Ruff / format / mypy all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* deps: add fakeredis to dev extras
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.
GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file. Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* swe_agent: fix import error + add missing transitive deps
Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):
1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
The old package was renamed in v0.0.13; both swe_agent files
had stale imports that failed at module load (TypeError or
ModuleNotFoundError depending on how the framework was invoked),
making every swe_agent invocation return Error before any LLM
call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
in pyproject.toml. swe_agent's vendored framework imports these
at module-load time even when the docker/S3/model paths are
dormant, so a clean ``pip install '.[swe-agent]'`` without these
would still ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply. Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring. swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.
Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body. For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: team-mode bugs surfaced by 10-pair core run
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core
subset in team setting. All four compared frameworks (these three
plus `openhands_sdk`) now score ≥ 5/10.
- normalize_patch ate trailing blank context lines (text.strip()
consumes " \n"), breaking last-hunk line counts so git apply
rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
"\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
raw .strip() on the patch.txt read, so every msa patch ended
in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
long-running command, so the image's default CMD exited and
every exec hit "Sandbox not found". Pass "sleep", "infinity"
as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
because shared build_environment was hardcoded to DockerEnvironment.
Added a backend kwarg and threaded config["backend"] through both
adapters.
- Team lead prompt buried the integration step at the bottom of a
long workflow list; Claude/Codex consistently exited after their
own feature without reading /workspace/shared/<agent>.patch.
Rewrote with a hard-rule opener and a 5-point pre-submission
checklist. Member prompt now opens with "stay in your lane" per
the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
alone when the merged tree doesn't pass both features. Surfaced
as merge.strategy="solo-agent1" / "solo-agent2". Credits the
agent (typically the lead) who correctly integrated both
features into one working patch but had it corrupted by
union-merging with the other agent's partial implementation.
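The ``normalize_patch`` bullet above is easy to reproduce. The sketch below
is an assumed shape for the function (the real one may do more); it shows
why ``str.strip()`` corrupts a diff whose last hunk ends in a blank context
line, which a unified diff encodes as a single space.

```python
def normalize_patch(text: str) -> str:
    """Trim only newlines at both ends and leave exactly one trailing newline."""
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# A hunk whose last line is a blank context line (" " followed by "\n").
raw = "\n@@ -1,2 +1,3 @@\n line\n+added\n \n\n"

stripped = raw.strip()        # old behavior: also drops the " " context line
fixed = normalize_patch(raw)  # keeps it and ends with one newline
```

With ``strip()`` the hunk header still promises three new-side lines but
only two remain, which is the line-count mismatch ``git apply`` rejects.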
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs+data: core subset and team-mode horizontal comparison
- dataset/subsets/core.json: 10-pair subset for quick agent
comparisons. Stratified by repo (largest-remainder proportional
allocation by full-dataset pair count) with a one-slot floor per
primary language (Python / Go / Rust / TS). Reproducible via
scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
frameworks on the core subset in team setting. Includes per-task
pass/fail matrix annotated with the merge strategy used, plus the
chronological narrative of the dozen reruns that surfaced each of
the bugs fixed in the previous commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(eval): don't bail when union-merge also conflicts
Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.
Doesn't change any of the current 40 eval results — union's merge=union
attribute is tolerant enough that every task in the dataset produces
some tree (potentially broken code with stitched-together lines); the
broken-tree-tests-fail path already triggered the solo fallback. This
just closes the defensive gap for future pathological cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* eval(team): identical / naive / lead-when-naive-conflicts policy
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:
1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative
(no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against
both feature suites
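The three-step chain can be sketched as a single decision function. The
name ``choose_strategy``, its parameters, and the returned labels are
assumptions for illustration; the real logic lives inside ``test_merged``.

```python
from typing import Literal

Strategy = Literal["identical", "naive", "lead-alone"]


def choose_strategy(lead_patch: str, member_patch: str, naive_merge_clean: bool) -> Strategy:
    """Pick which tree the feature tests run against."""
    if lead_patch == member_patch:
        return "identical"   # 1. skip the merge entirely
    if naive_merge_clean:
        return "naive"       # 2. merged-tree tests are authoritative
    return "lead-alone"      # 3. conflicts: test the lead's patch.txt alone
```

Note that there is no branch that tests the member's patch alone, and a
clean naive merge never falls through to the lead-alone path.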
Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
Effect on the core-subset horizontal comparison:
msa 6 → 6 (unchanged)
oh 5 → 4 (loses pallets_jinja/1621 — was passing via union, which
concealed that oh's lead doesn't integrate)
cc 5 → 5 (unchanged)
cx 5 → 5 (unchanged)
oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.
BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(modal): codex stdin hang; eval guardrail for openhands_sdk
codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
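A small sketch of the stdin fix, assuming a POSIX shell. ``build_command``
is a hypothetical stand-in, not the real ``_build_codex_command``, and
``cat`` stands in for any program that blocks until stdin reaches EOF.

```python
import shlex
import subprocess


def build_command(argv: list[str]) -> str:
    """Quote argv for a shell and redirect stdin from /dev/null."""
    return shlex.join(argv) + " </dev/null"


# Without the redirect, `cat` would wait on whatever stdin it inherits.
cmd = build_command(["cat"])
result = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=10)
```

Redirecting inside the command string makes the behavior independent of
how the backend wires up stdin for the exec call.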
openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.
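The guardrail amounts to one check against the run's config.json. In this
sketch the function name ``backend_warning``, the ``"agent"`` key, and the
warning text are assumptions; the real config layout may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def backend_warning(run_dir: Path, backend: str) -> Optional[str]:
    """Return a warning when the eval should refuse to run, else None."""
    config = json.loads((run_dir / "config.json").read_text())
    # "agent" is an assumed key for the adapter that produced the run.
    if config.get("agent") == "openhands_sdk" and backend != "modal":
        return "run was produced by openhands_sdk; re-run eval with --backend modal"
    return None


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    (run_dir / "config.json").write_text(json.dumps({"agent": "openhands_sdk"}))
    docker_warning = backend_warning(run_dir, "docker")
    modal_warning = backend_warning(run_dir, "modal")
```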
Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(swe_agent): add --backend docker support
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.
Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:
1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
so swerex's `docker run ... image sh -c "<startup>"` becomes
`runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
the feature-patch path. Pass docker_args=["--entrypoint", ""]
to clear the entrypoint (mirrors the existing Modal monkey-patch
that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the
swerex-remote binary isn't pre-installed, but pipx looks for an
executable literally named "swe-rex" — which doesn't exist in
the published `swe-rex` package (it provides "swerex-remote").
Monkey-patch DockerDeployment._get_swerex_start_cmd to use
`pipx run --spec swe-rex swerex-remote ...` instead.
Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team_harness: extract team mode as standalone harness + ablation flags
Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.
Adds TeamSession + TeamHarnessConfig:
- TeamSession bundles per-run state (run_id, namespaced Redis URL,
ordered agent list, scratchpad volume name) with the feature config
and exposes adapter-facing factories that each return None / [] / {}
when their feature is disabled, so adapter code paths collapse to one
branch:
coop_env.update(session.env_for(agent_id))
extra_run_args.extend(session.scratchpad_mount_args())
mcp_config = session.mcp_config(container_script_path=...)
- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
(task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member
role split is the always-on baseline -- without it team is just coop.
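A reduced sketch of the two types, to show how disabled features collapse
to empty values. Only the five flag names come from this commit; the
environment variable names, the choice of which flag gates ``env_for``,
and the constructor signature are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TeamHarnessConfig:
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class TeamSession:
    run_id: str
    redis_url: str
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def env_for(self, agent_id: str) -> dict[str, str]:
        # Assumption: the task-list flag gates the team env vars.
        if not self.config.task_list:
            return {}
        # Placeholder names under the CB_TEAM_ prefix.
        return {"CB_TEAM_RUN_ID": self.run_id, "CB_TEAM_AGENT_ID": agent_id}

    def scratchpad_mount_args(self) -> list[str]:
        if not self.config.scratchpad:
            return []
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]


session = TeamSession("run1", "redis://localhost:6379", TeamHarnessConfig(scratchpad=False))
coop_env: dict[str, str] = {}
extra_run_args: list[str] = []
coop_env.update(session.env_for("agent1"))
extra_run_args.extend(session.scratchpad_mount_args())
```

Because each factory returns an empty container when its feature is off,
the adapter needs no ``if`` around the ``update`` / ``extend`` calls.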
Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter. result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers. Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.
Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating. Full suite
363 passed, 63 skipped; ruff/format/mypy clean.
End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
tasks files, empty metrics dict, no volume. team_features in
result.json reflects the requested ablation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* codex: add Azure OpenAI support
Set AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT (the OpenAI-compatible
v1 base, e.g. https://<resource>.cognitiveservices.azure.com/openai/v1)
and pass the Azure deployment name via -m. When both are present they
take precedence over OPENAI_API_KEY.
How it works:
- resolve_azure_config() reads the two env vars (endpoint trailing slash
stripped); _azure_config_toml() writes a `model_provider = "azure"`
block into codex's config.toml with wire_api = "responses" (codex
0.132 dropped the chat wire API) and env_key = AZURE_OPENAI_API_KEY.
- The key is exported into the codex command and read via the provider
env_key; auth.json is skipped on the Azure path.
- config.toml is now composed from independent fragments (azure provider
+ team-mode MCP server) so both can coexist.
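A sketch of the env-var resolution step. The signature (an explicit
mapping instead of the process environment) and the ``AzureConfig`` return
type are assumptions made to keep the example testable; the example
endpoint is a made-up resource name.

```python
from typing import Mapping, NamedTuple, Optional


class AzureConfig(NamedTuple):
    api_key: str
    endpoint: str


def resolve_azure_config(env: Mapping[str, str]) -> Optional[AzureConfig]:
    """Return the Azure settings only when both variables are set."""
    key = env.get("AZURE_OPENAI_API_KEY", "")
    endpoint = env.get("AZURE_OPENAI_ENDPOINT", "").rstrip("/")
    if key and endpoint:
        return AzureConfig(key, endpoint)
    return None


example = resolve_azure_config(
    {
        "AZURE_OPENAI_API_KEY": "secret",
        "AZURE_OPENAI_ENDPOINT": "https://example.cognitiveservices.azure.com/openai/v1/",
    }
)
```

Returning ``None`` unless both values are present lets the caller fall
back to the OPENAI_API_KEY path with a single check.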
Non-json fallback: codex 0.132's --json event stream deterministically
fails against Azure's HTTP/2 /responses endpoint ("stream disconnected:
error sending request") while plain output works. Captured requests are
byte-identical between modes, so it's a codex response-handling bug, not
a config error. The Azure path therefore runs codex WITHOUT --json,
harvests the patch from patch.txt (as always) and the final message via
--output-last-message, and derives status from codex's exit code.
Trade-off: no token/cost/trajectory telemetry on Azure (codex's plain
output carries none; cost was already $0 via the broken json parser).
Tests: 5 new (resolve_azure_config, _azure_config_toml, non-json run
shape + provider config + no auth.json, error status on non-zero exit);
autouse fixture clears AZURE_* so non-Azure tests stay hermetic.
Full suite 369 passed; ruff/format/mypy green.
Validated end-to-end on dottxt_ai_outlines/1655 [1,3] with
`-a codex -m gpt-5.5-hao` against a live Azure deployment: Submitted,
clean stream (no disconnects), eval passes both features.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(codex): preserve Azure key in coop/team mode
The is_coop branch reassigned `coop_env = {...}`, wiping the
AZURE_OPENAI_API_KEY added just above it. Codex then failed provider
auth ("Missing environment variable: AZURE_OPENAI_API_KEY") in every
coop / coop+git / team run, producing empty patches — a full-dataset
coop+git Azure sweep scored 0/652 while solo (same path) scored 355/652.
Fix: `coop_env.update({...})` so the Azure key survives. Verified with
a coop+git Azure smoke (both agents Submitted, real patches, zero
missing-key errors). Adds a regression test
(test_azure_key_survives_coop_mode).
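The bug reduces to rebinding a dict versus updating it. In this sketch
``build_coop_env``, its ``overwrite`` switch, and the ``COOP_AGENT_ID``
variable are hypothetical; only the key name and the ``update`` fix come
from the commit.

```python
def build_coop_env(azure_key: str, is_coop: bool, overwrite: bool) -> dict[str, str]:
    coop_env = {"AZURE_OPENAI_API_KEY": azure_key}
    if is_coop:
        coop_vars = {"COOP_AGENT_ID": "agent1"}  # placeholder coop variable
        if overwrite:
            coop_env = coop_vars        # buggy: rebinding drops the key
        else:
            coop_env.update(coop_vars)  # fixed: merge into the existing dict
    return coop_env


buggy = build_coop_env("k", is_coop=True, overwrite=True)
fixed = build_coop_env("k", is_coop=True, overwrite=False)
```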
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(codex): harden container install for concurrent runs
Codex's setup.sh ran apt without DEBIAN_FRONTEND=noninteractive, so in
TTY-less containers debconf fell through Dialog->Readline->Teletype and
tripped dpkg ("Sub-process /usr/bin/dpkg returned an error code (1)").
Rare at solo concurrency (6 containers, ~0.6% fail) but dominant under
coop/team (12 containers at concurrency 6, ~87% fail) — a full-dataset
coop+git sweep collapsed to install failures.
Fix: export DEBIAN_FRONTEND=noninteractive and wrap apt/apk/yum installs
in a 3x retry (transient mirror throttling under many simultaneous
installs from one host). Validated with 15 coop+git tasks at
concurrency 6: 15/15 installed cleanly (was ~1/8 before), 30/30 agents
produced patches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ProKil added a commit that referenced this pull request May 21, 2026
* agents/codex: add Codex adapter; lift shared coop bits into _coop
Adds an OpenAI Codex CLI adapter alongside the existing Claude Code
adapter. Both adapters wrap a third-party CLI inside the task's
Docker container; the bits that are agent-agnostic (Redis messaging
helper, prompt blocks for solo/coop/coop+git, git remote setup) now
live in a new ``cooperbench.agents._coop`` module so the two adapters
(and any future CLI adapter) consume them rather than duplicating.
Codex adapter highlights:
- Invokes ``codex exec --json --sandbox danger-full-access
--skip-git-repo-check --model <id>``.
- Writes ``${CODEX_HOME}/auth.json`` with the host's OPENAI_API_KEY
inside the container so the CLI authenticates without prompts.
- Parses Codex's JSONL event stream for status / token totals /
messages. Cost is reported as 0.0 because Codex does not emit a
cost field; tokens are summed across ``turn.completed`` events.
- Model fallback: if Codex rejects ``--model gpt-5.5`` with a
"model not found" shaped error, the adapter retries once without
``--model`` and lets Codex pick its default.
- Preflight credential check: if OPENAI_API_KEY is unset the adapter
returns Error immediately instead of spinning up a container that
can only fail.
Shared ``_coop`` module:
- ``coop_msg.py`` — Redis-backed messaging CLI (one inbox per agent)
installed as ``coop-send`` / ``coop-recv`` / ``coop-broadcast`` /
``coop-peek`` / ``coop-agents`` under /usr/local/bin.
- ``install_snippet.sh`` — pip-installs redis and drops the shell
wrappers; each adapter's setup.sh sources it.
- ``prompt.py`` — solo / coop / coop+git prompt assembly, agent-
agnostic.
- ``runtime.py`` — ``ContainerEnv`` protocol, ``build_environment``,
``write_file_in_container`` / ``read_file_from_container``,
``rewrite_comm_url_for_container``, ``build_git_setup_command``,
``parse_sent_messages_log``, and ``normalize_patch``.
Bug fix during this refactor: the previous adapter's ``.strip()`` on
``patch.txt`` was eating the trailing newline that ``git apply``
requires. Replaced with ``normalize_patch()`` (one trailing newline,
no leading whitespace). This bit codex's solo run with a
"corrupt patch at line N" error; Claude got lucky and didn't.
Tests: 24 new for Codex (parsers + adapter), existing 45 Claude Code
tests re-pointed at the shared ``_coop`` module. Full suite: 228
passed, 63 skipped.
End-to-end runs against dottxt_ai_outlines_task/1371 features 1+2:
- codex solo f1: Submitted, 1 turn, 365k input tokens,
184-line patch (with the trailing-newline
fix it applies cleanly)
- codex coop+git f1,f2: both Submitted, both patches applied but
0/2 tests pass — coordination failure
(agent1 fetched ``team`` but never merged,
so the stacked patches produce a Python
SyntaxError at line 144 of the modified
file). Claude on the same task scored
2/2; Codex used the tools less aggressively
on this run.
The 0/2 result is the kind of coordination failure the bench is
designed to surface, not an adapter bug. Future iteration could
tighten the prompt or hard-enforce a post-run merge, but neither is
necessary to land the adapter itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runner: add team mode (lead + members + shared task list + scratchpad)
Adds a third setting alongside ``solo`` and ``coop``, modelled on the
agent-team primitives Claude Code uses in its own product. Where coop
gives N peer agents one feature each and a Redis inbox to chat over,
team mode adds three load-bearing primitives:
1. A typed **shared task list** (cooperbench.agents._team.TaskListClient)
backed by Redis hashes + sets, namespaced ``cb:<run_id>:``, with
atomic claim semantics (HSETNX-style — exactly one caller wins on a
race) and an audit log of every mutation. Exposed in the container
as ``coop-task-create`` / ``coop-task-claim`` / ``coop-task-update``
/ ``coop-task-list`` shell wrappers.
2. A **lead / member role split**. The first agent is designated
``team-lead`` and gets a system-prompt block instructing them to
break the spec into tasks, assign them via ``coop-task-create
--assign``, watch progress, and integrate. Other agents are
``member`` and look for open tasks to claim.
3. A **shared scratchpad** Docker volume (``cb-team-<run_id>``)
mounted at ``/workspace/shared`` in every container. Free
coordination artifact for design notes, partial diffs, interface
sketches.
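The atomic claim in item 1 can be sketched as follows. This is a sketch under stated assumptions, not ``TaskListClient`` itself: the key layout ``cb:<run_id>:task:<id>``, the ``owner`` field, and the ``InMemoryHashStore`` stand-in are hypothetical. The stand-in mirrors the ``hsetnx(name, key, value)`` contract of redis-py (returns 1 when the field was set, 0 when it already existed) so the example runs without a server. Real atomicity under concurrent callers comes from Redis, not from this class.

```python
class InMemoryHashStore:
    """Tiny stand-in exposing the same ``hsetnx`` contract as redis-py."""

    def __init__(self) -> None:
        self._hashes: dict = {}

    def hsetnx(self, name: str, key: str, value: str) -> int:
        # Set the field only if it does not exist yet: 1 = set, 0 = untouched.
        fields = self._hashes.setdefault(name, {})
        if key in fields:
            return 0
        fields[key] = value
        return 1

    def hget(self, name: str, key: str):
        return self._hashes.get(name, {}).get(key)


def claim_task(store, run_id: str, task_id: str, agent_id: str) -> bool:
    """Return True only for the first agent to claim ``task_id``."""
    key = f"cb:{run_id}:task:{task_id}"  # hypothetical key layout
    return store.hsetnx(key, "owner", agent_id) == 1


store = InMemoryHashStore()
first = claim_task(store, "run1", "t1", "agent1")   # wins the claim
second = claim_task(store, "run1", "t1", "agent2")  # loses the race
```

Because the set-if-absent check and the write are a single command on the server, two agents racing on the same task cannot both see success.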
Coordination metrics are computed from the task-list audit log after
the run finishes (``time_to_first_claim_seconds``, ``claims_per_agent``,
``updates_per_agent``, ``tasks_done``, ``unowned_at_end``) and saved
into ``result.json``. Evaluation is identical to coop — per-agent
``patch.txt`` evaluated per-feature — so no eval changes were needed
beyond discovering ``team/`` log directories.
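A rough sketch of how such metrics could be derived from an audit log. The event shape (``ts``, ``op``, ``task_id``, ``agent``, ``owner``, ``status``) and the function name are assumptions made for illustration; only the metric names come from the commit message.

```python
from collections import Counter


def coordination_metrics(events: list, started_at: float) -> dict:
    """Fold a list of audit-log events into the coordination metrics."""
    claims = Counter()
    updates = Counter()
    owner = {}    # task_id -> owning agent (None while unowned)
    status = {}   # task_id -> last known status
    first_claim = None
    for ev in sorted(events, key=lambda e: e["ts"]):
        task = ev["task_id"]
        status.setdefault(task, "open")
        owner.setdefault(task, None)
        if ev["op"] == "create":
            owner[task] = ev.get("owner")  # set when created with an assignee
        elif ev["op"] == "claim":
            claims[ev["agent"]] += 1
            owner[task] = ev["agent"]
            if first_claim is None:
                first_claim = ev["ts"] - started_at
        elif ev["op"] == "update":
            updates[ev["agent"]] += 1
            status[task] = ev.get("status", status[task])
    return {
        "time_to_first_claim_seconds": first_claim,
        "claims_per_agent": dict(claims),
        "updates_per_agent": dict(updates),
        "tasks_done": sum(1 for s in status.values() if s == "done"),
        "unowned_at_end": sum(1 for t in status if owner[t] is None),
    }
```

An empty log yields ``None`` for the first-claim time and zero counts, which keeps the "no coordination happened" case distinguishable from a fast first claim.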
Compatibility: all five existing adapters accept the new ``team_role``
/ ``team_id`` / ``task_list_url`` kwargs. The CLI adapters
(``claude_code``, ``codex``) wire the team install snippet into their
``setup.sh`` so the ``coop-task-*`` wrappers land at
``/usr/local/bin``. The Python-loop adapters (``mini_swe_agent_v2``,
``swe_agent``, ``openhands_sdk``) accept the kwargs without breaking;
their in-loop integration with the task list (auto-refresh between
steps, similar to the existing inbox poll) lands in a follow-up.
Unit tests: 46 new
- 18 task_list (CRUD, atomic claim, owner-only update, audit log,
run isolation)
- 12 prompt (lead vs member branches, solo fallback, git interaction)
- 3 runtime (env assembly, scratchpad mount args)
- 4 metrics (happy path, unowned-at-end, empty log, multiple claims)
- 5 runner (lead-is-first-agent, pre-seed, kwarg propagation,
metrics in result, three-agent team)
- 4 misc
Full suite: 274 passed, 63 skipped. Ruff / format / mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code in
team+git mode:
- 5 tasks created (2 by bench-runner, 3 by the lead splitting its
work), all reached ``done``
- time_to_first_claim_seconds=34.2
- claims_per_agent={agent1: 2, agent2: 1}
- updates_per_agent={agent1: 4, agent2: 3}
- scratchpad volume actively used (agent2 wrote its diff to
/workspace/shared/agent2.patch + a summary.md)
- **0/2 pass rate**. Both ``patch.txt`` files were empty: the
members wrote diffs to the scratchpad instead of also writing
``/workspace/repo/patch.txt``, and the lead never ran the final
integration step. This is real coordination signal (the prompt
told them to write both places but they followed the scratchpad
half only) — a follow-up will tighten the prompt to make patch.txt
submission the explicit final step.
Future PRs (intentionally out of scope here so this lands at a
reviewable size):
- In-loop auto-refresh for the Python-loop adapters
- MCP long-poll tool to give CLI adapters push-ish inbox semantics
- Typed ``coop-request`` / ``coop-respond`` protocol on top of
messaging (CC's plan_approval_request shape)
- Filesystem mirror of the task list (CC-style ``ls`` artefacts)
Stacks on #51 (Codex adapter) so the diff stays focused on team-mode
additions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team mode: filesystem mirror, typed protocol, MCP server, in-loop refresh (#53)
Lands the four follow-ups that were called out as "Out of scope" on
the team-mode PR (#52), plus a prompt fix surfaced by the team-mode
end-to-end run.
1. **Filesystem mirror of task list** (``_team/fs_mirror.py``).
Snapshots the Redis-backed task list to ``/workspace/shared/tasks/``
so agents can ``ls`` and ``cat`` tasks with their existing tools
rather than going through the ``coop-task-list`` CLI. Layout
mirrors Claude Code's team primitive: one ``<id>.json`` per task,
plus ``_index.json`` (cheap ``ls`` target) and ``_log.jsonl`` (audit
trail). Triggered on every ``coop-task-list`` invocation and from
the host runner at startup. Files written via tempfile+replace so
readers never observe a partial state.
2. **Typed coop-request / coop-respond protocol** (``_team/protocol.py``).
Layered on plain Redis messaging, mirroring CC's
``plan_approval_request`` / ``plan_approval_response`` shape.
``coop-request <peer> <kind> <body>`` returns a request_id (and
optionally blocks via ``--wait N`` for a response).
``coop-respond <request_id> <body>`` writes back; the sender's
``await_response`` uses BLPOP so it actually sleeps instead of
busy-polling. Both events flow into the shared task-log so
coordination metrics include protocol events.
3. **MCP long-poll server** (``_team/mcp_server.py``). Stdio
JSON-RPC server that exposes a single ``wait_for_message`` tool
backed by BLPOP on the agent's inbox. Registered automatically:
Claude Code adapter writes ``$CLAUDE_CONFIG_DIR/.claude.json`` with
the server entry; Codex adapter writes ``$CODEX_HOME/config.toml``.
The point is to make "watch the inbox" a natural idle behavior for
the CLI adapters instead of a busy-loop on ``coop-recv`` returning
empty — the closest we can get to push-style delivery for opaque
CLI agent loops.
4. **In-loop task-list auto-refresh** (``_team/loop_refresh.py``).
``TeamPoller`` is a per-agent host-side helper that
``mini_swe_agent_v2.DefaultAgent.step()`` calls between LLM
queries — same hook as the existing inbox poll. The LLM sees a
compact ``[Team task list] open: 1, in_progress: 2, ...`` summary
prepended to every turn so it doesn't need to remember to call
``coop-task-list``. Plumbed via ``agent.team_poller`` so the
``mini_swe_agent_v2`` subtree change is one branch in ``step()``.
The same module also exports ``poll_team_state()`` for in-container
use (env-driven variant).
5. **Prompt fix**: the previous team-mode end-to-end had members
writing diffs to ``/workspace/shared/<id>.patch`` only and never to
``/workspace/repo/patch.txt``, scoring 0/2 despite great
coordination. Both lead and member prompts now have an explicit
``### Final submission — REQUIRED`` section that calls out
``patch.txt`` as the only file the bench evaluates and provides
the exact ``git diff > patch.txt`` command.
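The tempfile+replace write described for the filesystem mirror (item 1 above) can be sketched like this. The helper name ``write_json_atomically`` is hypothetical and the actual ``fs_mirror.py`` may differ; the pattern itself relies on ``os.replace`` being an atomic rename when source and destination are on the same filesystem.

```python
import json
import os
import tempfile


def write_json_atomically(path: str, payload: dict) -> None:
    """Write ``payload`` so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the target directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, sort_keys=True)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if serialization or the rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

An agent running ``cat`` on a task file while the snapshot is being refreshed therefore never reads a half-written JSON document.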
Also: cosmetic fix to ``runner/core._print_single_result`` so team
mode's per-agent dicts (which carry ``patch_lines: int``) render
correctly in the run table — previously the column showed 0 because
the function tried ``len(r.get("patch", "").splitlines())`` and team
mode doesn't store the full patch in the agents dict.
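The shape of that fix, as a hedged sketch: prefer the precomputed ``patch_lines`` count when the per-agent dict carries one, and fall back to counting lines of ``patch`` otherwise. The helper name ``patch_line_count`` is hypothetical; the real change lives inside ``_print_single_result``.

```python
def patch_line_count(agent_result: dict) -> int:
    """Return the patch length for the run table."""
    lines = agent_result.get("patch_lines")
    if isinstance(lines, int):
        return lines  # team mode: count stored directly
    # solo / coop: the full patch text is stored, so count its lines
    return len((agent_result.get("patch") or "").splitlines())
```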
Tests: 37 new unit tests
- 8 fs_mirror (atomic writes, stale cleanup, empty index)
- 9 protocol (request roundtrip, await, timeout, audit log)
- 9 mcp_server (initialize, tools/list, tools/call,
timeout, blocking, unknown-tool error,
env factory)
- 8 loop_refresh (summary formatting, TeamPoller, env variant)
- 3 prompt (regression: lead+member prompts demand patch.txt)
Full suite: **311 passed**, 63 skipped.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with Claude Code +
team + git: **2/2 features pass** (14/14 + 20/20 tests). All four
follow-ups visibly active in the run artifacts:
``/workspace/shared/tasks/`` populated with per-task JSON + _index +
_log; scratchpad has agent2.patch; ``cb-mcp-server.py`` registered in
``.claude.json``; 6 tasks created (2 by runner pre-seed, 4 by lead's
sub-task split), 4 reached ``done``,
``time_to_first_claim_seconds=29.9``. Previous run scored 0/2 on the
same task — the prompt fix is doing real work.
Stacks on #52.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team mode: wire team prompt + env into the three Python-loop adapters
Brings ``mini_swe_agent_v2``, ``swe_agent``, and ``openhands_sdk`` to
parity with the CLI adapters for team mode. Before this commit they
accepted the team kwargs but discarded them; now each one appends the
team prompt section to the task it sends the agent, and (where the
adapter actually controls the container) propagates ``CB_TEAM_*`` env
vars + mounts the team scratchpad.
New helper: ``_team.team_task_section(agents, agent_id, team_role)``
returns ONLY the lead-or-member block + coop-task-* CLI usage,
without the surrounding task/submission/git scaffolding that
``build_team_instruction`` adds. Python-loop adapters already have
their own prompts covering messaging/git/submission, so they need
only the new piece; CLI adapters keep using the bigger function.
Per-adapter wiring:
- ``mini_swe_agent_v2``: appends team_task_section to task;
propagates CB_TEAM_* through env_kwargs["env"]; adds
``--add-host=host.docker.internal:host-gateway`` + scratchpad
volume to docker run args; installs the team CLI scripts + pip
redis in the container after env spin-up. The existing
``TeamPoller`` host-side hook (already in step()) still fires.
- ``openhands_sdk``: appends team_task_section to task; folds a new
``team_env`` dict into ``coop_info`` so
``_build_credentials_dict`` propagates CB_TEAM_* into the
sandbox. Coop-task-* binary install in the OpenHands agent-server
image is a follow-up — OpenHands manages its own image build and
doesn't expose a clean post-start exec hook.
- ``swe_agent``: appends team_task_section to task. The SWE-agent
framework's sandbox + agent loop is third-party and harder to
instrument; everything beyond the prompt is a follow-up.
Tests: 13 new
- 3 prompt unit tests for team_task_section (lead, member, empty)
- 10 cross-adapter sanity tests in tests/agents/test_team_wiring.py:
consistency between team_task_section and build_team_instruction,
every registered runner accepts the team kwargs, openhands env
keys, swe_agent signature
Full suite: 324 passed, 63 skipped. Ruff/format/mypy all green.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with claude_code +
team + git (sanity check that the shared changes didn't regress the
CLI adapter): both Submitted in 4m21s, $0.93, patches 210 + 81 lines.
End-to-end for the other four (codex, mini_swe_agent_v2, swe_agent,
openhands_sdk) requires API keys (Anthropic for the three Python-loop
adapters via litellm, OpenAI for codex) that aren't available in this
environment. Unit tests cover the new wiring; the e2e validations
should be run with real keys before relying on the per-adapter
behavior.
Compatibility matrix is now:
| Adapter           | Accepts kwargs | Team prompt   | Auto-refresh | CLI in container | CB_TEAM_* env |
|-------------------|----------------|---------------|--------------|------------------|---------------|
| claude_code       | yes            | yes (full)    | n/a          | yes              | yes           |
| codex             | yes            | yes (full)    | n/a          | yes              | yes           |
| mini_swe_agent_v2 | yes            | yes (section) | yes          | yes              | yes           |
| openhands_sdk     | yes            | yes (section) | n/a          | NOT YET          | yes           |
| swe_agent         | yes            | yes (section) | NOT YET      | NOT YET          | NOT YET       |
Stacks on #52 (merged-up team-mode branch).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands: layer coop-task-* install onto Modal image for team mode
Closes the documented gap from the prior commit's matrix: the
``coop-task-*`` binaries now ship into the OpenHands agent-server
sandbox, layered onto the upstream ``-oh`` image via Modal's
``add_local_file`` / ``pip_install`` / ``run_commands`` chain (no
upstream image rebuild required). Triggered only when
``coop_info["team_env"]`` is set so solo / coop runs don't pay the
~10s first-build cost. Modal caches the layered image; subsequent
team runs are instant.
Verified end-to-end: ran openhands_sdk team+git on
dottxt_ai_outlines_task/1371 [1,2] with gpt-5.5. The agent ran
``compgen -c | grep coop-task`` and got back all 7 wrappers
(create / claim / update / list / request / respond / pending) — the
install worked. Whether the model actually invokes the tools is a
separate (coordination-quality) axis; in this run it discovered them
but didn't use them, same as codex. Both patches applied; f1 14/14,
f2 19/20.
Tests: 2 new (full suite: 326 passed)
- test_team_env_triggers_image_layering — verifies add_local_file
+ pip_install + run_commands fire with the right args when team
mode is active
- test_no_layering_when_team_inactive — verifies solo / coop
runs skip the image-build cost
Matrix update — openhands_sdk now reads:
Accepts kwargs: yes / Team prompt: section / Auto-refresh: n/a /
CLI in container: YES (was NOT YET) / CB_TEAM_* env: yes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team prompt: make the merge-before-submit step REQUIRED
The codex team e2e (cx_team_v3) hit 0/2 with great coordination
metrics — 5/5 tasks done, 27s first claim, claims even — but
neither agent ran ``git merge`` despite the prompt's "Recommended
workflow" mentioning it. Both fetched their peer's branch (2 fetches each)
and then submitted only their own work, so the eval's naive
diff-stacker produced syntactically broken Python.
The previous prompt buried the critical step in a "Concretely:"
sentence at the end; gpt-5.5 didn't follow it. This rewrite:
- Renames the section ``## Git collaboration — MERGE IS REQUIRED
BEFORE SUBMITTING`` so the imperative is in the heading itself.
- Adds an explicit "Required final sequence — run this verbatim
before exiting" block with the full fetch+merge+diff sequence,
parameterized over every partner branch.
- Explains *why* (each agent's patch.txt is evaluated against every
feature's tests; without the merge, the peer feature's symbols
are missing → ImportError).
- Frames it the same way the patch.txt step is framed (REQUIRED,
skip-at-your-loss), which the original prompt fix proved
codex responds to.
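A sketch of how a "required final sequence" could be generated for an arbitrary set of partner branches. The remote name ``team`` appears earlier in this log; the function name, the ``<remote>/<branch>`` naming, and the ``base_ref`` parameter (the commit the final diff is taken against) are assumptions for illustration and may not match what ``prompt.py`` emits.

```python
import shlex


def final_merge_sequence(partner_branches, base_ref, remote="team"):
    """Build the fetch + merge + diff commands an agent runs before exiting."""
    commands = [f"git fetch {shlex.quote(remote)}"]
    for branch in partner_branches:
        ref = shlex.quote(f"{remote}/{branch}")
        # --no-edit accepts the default merge message without opening an editor
        commands.append(f"git merge --no-edit {ref}")
    # Diff the merged working tree against the base so patch.txt
    # contains every merged feature, not just this agent's own work.
    commands.append(f"git diff {shlex.quote(base_ref)} > patch.txt")
    return commands


sequence = final_merge_sequence(["agent2"], base_ref="abc123")
```

Generating one ``git merge`` line per partner keeps the block copy-pasteable for three-agent teams as well as pairs.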
Verified: re-ran cx_team_v4 (codex team+git, same task as v3).
Git activity went from ``fetch=2 merge=0 push=0`` per agent →
``fetch=3 merge=2 push=2`` and ``fetch=1 merge=1 push=1``. Both
patches now contain both features' symbols. Pass rate v4:
33/34 tests (97%) — f2 fully passes 20/20, f1 fails one test
because gpt-5.5's merged code put the ``filters`` kwarg on a helper
function rather than the ``prompt`` decorator (content quality, not
coordination).
A second run (cx_team_v5) produced byte-identical 243-line patches
on both agents — codex coordinated so well both ended up with the
exact same merged tree. This surfaces a separate bench-side
limitation: the eval's diff-stacker fails to apply patch B on top
of patch A when every hunk already matches, producing an empty
merged.patch. That's a real bug in ``eval/evaluate.py``'s coop
merge step, NOT a coordination failure — codex did exactly what the
prompt asked. Fix is a separate concern from team-mode wiring.
Tests still pass (existing prompt tests are content-agnostic;
326 / 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* eval: short-circuit when both agents submit identical merged patches
In team mode codex can coordinate so well that both agents end up
with byte-identical patches (each fully merged the other's branch).
The existing eval combiner sequence — apply patch1 → apply patch2
on top — chokes because every hunk in patch2 is already applied,
producing an empty merged.patch and a downstream "No valid patches
in input" failure even though both submissions are individually
fine.
Fix in ``test_merged``: before invoking ``_setup_branches`` /
``_merge_naive``, ``cmp`` the two patches. If they match, copy
patch1 to merged.patch (normalized via ``git apply --recount`` so
agents that emit unified diffs with miscounted hunk headers still
work) and skip the merge dance. Returns a fresh result with
``merge.status: "identical"`` so the caller can tell the
short-circuit fired vs a real merge.
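A simplified sketch of the short-circuit, assuming the patches are files on disk. It uses ``filecmp.cmp`` in place of the ``cmp`` utility and omits the ``git apply --recount`` normalization step; the function name and the exact result shape beyond ``merge.status`` are hypothetical.

```python
import filecmp
import shutil


def short_circuit_if_identical(patch1: str, patch2: str, merged: str):
    """Copy patch1 to ``merged`` when both patches are byte-identical.

    Returns a result dict when the short-circuit fires, else None so the
    caller proceeds with the normal merge.
    """
    # shallow=False forces a byte-by-byte comparison instead of
    # trusting size and mtime alone.
    if not filecmp.cmp(patch1, patch2, shallow=False):
        return None
    shutil.copyfile(patch1, merged)
    return {"merge": {"status": "identical"}}
```

Returning ``None`` for the non-identical case leaves the existing merge path untouched, which matches the "no regression" property claimed for this fix.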
Verified on the codex-team e2e:
- cx_team_v5 (codex agents perfectly merged to identical 243-line
patches): 0/2 → 2/2 ✓ (f1: 14/14, f2: 20/20)
- cx_team_v4 (codex agents diverged on the merge): unchanged at
f2 20/20 + f1 13/14 = 33/34 tests, still falls back to
agent2-alone via apply_status: {'agent1': 'failed', ...}
I also briefly tried adding ``git apply --recount`` to
``_setup_branches``'s fallback chain, but that REGRESSED v4: it
made agent1's malformed patch apply where it previously failed
silently, triggering a real merge attempt that produced
duplicate function definitions (broken Python) via union merge.
The identical-patches short-circuit is the strictly-better fix —
no regression, recovers the v5 case, and the malformed-hunk
normalization only kicks in on the short-circuit path where it
can't cause merge conflicts.
Also lands previously-uncommitted housekeeping:
- prompt.py: ruff-format-only diff on the merge-required block
from the prior commit
- test_team_wiring.py: ruff --fix removed unused MagicMock
imports
- test_gcp_backend.py / test_tasks.py: ruff --fix removed
f-string-without-placeholder and unused-json import (both
unrelated drift caught by the gate)
Tests: 1 new (full suite: 327 passed)
- ``test_test_merged_shortcircuits_on_identical_patches`` — source
inspection confirms the short-circuit branch + "identical"
merge-status string exist in test_merged
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands: register Redis-backed CoopTaskTracker as a typed tool
The previous openhands team runs (oh_team_v3) showed agents
discovering the ``coop-task-*`` shell wrappers via ``compgen`` but
never invoking them — gpt-5.5 strongly prefers typed tools registered
with the LLM over arbitrary shell commands. This commit lands the
architectural fix: a Redis-backed ``CoopTaskTrackerTool`` registered
under the same name as openhands' built-in ``TaskTrackerTool`` so the
registry resolution swaps it transparently.
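The "register under the same name" idea, reduced to a toy registry. This is not the openhands registry API; ``register_tool``, ``resolve_tool``, and both classes are hypothetical stand-ins that only illustrate why a later registration transparently replaces an earlier one.

```python
_TOOL_REGISTRY = {}


def register_tool(name, factory):
    """Bind ``name`` to ``factory``; a later call with the same name wins."""
    _TOOL_REGISTRY[name] = factory


def resolve_tool(name):
    """Instantiate whatever is currently registered under ``name``."""
    return _TOOL_REGISTRY[name]()


class LocalTaskTracker:
    backend = "local"


class CoopTaskTracker:
    backend = "redis"


# Built-in registration happens first.
register_tool("TaskTrackerTool", LocalTaskTracker)
# Importing the coop module then registers under both names,
# rebinding the built-in name to the Redis-backed variant.
register_tool("CoopTaskTrackerTool", CoopTaskTracker)
register_tool("TaskTrackerTool", CoopTaskTracker)
```

Callers that ask for ``TaskTrackerTool`` by name need no changes, which is what makes the swap a side effect of import order.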
Files:
* ``openhands/tools/task_tracker/coop_definition.py`` — new tool
definition + executor. Same ``TaskTrackerAction`` /
``TaskTrackerObservation`` shape, but ``plan`` and ``view`` round-
trip through the shared ``cb:<run_id>:`` Redis namespace that
``TaskListClient`` (host side) writes to. Tasks are auto-owned
by the calling agent; ``view`` shows peer tasks prefixed with
``[<their_agent_id>]``. Registered under both
``"CoopTaskTrackerTool"`` AND ``"TaskTrackerTool"`` so importing
the module rebinds the latter to the Coop variant.
* ``openhands/tools/preset/default.py`` — gains a ``team_mode``
kwarg (kept for API stability + tests; the actual swap happens
server-side via the .pth/__init__ side-effect import, not by
changing the host-side tool list). Pre-PR coop block split into
a more nuanced team-mode prompt section that documents the
TaskTracker → shared-list behavior.
* ``openhands_sdk/adapter.py:ModalSandboxContext.__enter__`` —
layers two more bits into the Modal image at build time:
- ``add_local_file`` of ``coop_definition.py`` to
``$OH_DIR/coop_definition.py`` (in the sandbox's openhands
install)
- ``grep ... || echo`` appending
``from . import coop_definition`` to the package's
``__init__.py`` so the registration runs at import time.
Tests: 1 new + updated image-layering assertions
- ``test_importing_coop_definition_overrides_local_registration``:
inspecting the registry's ``_MODULE_QUALNAMES`` confirms
``TaskTrackerTool.name`` resolves to ``coop_definition``'s
registration after import.
- ``TestOpenHandsImageLayering`` now asserts 2 ``add_local_file``
calls + 2 ``run_commands`` layers (tool-file install +
``coop-task-*`` wrappers) and that the
``from . import coop_definition`` line is in the install
commands.
Full suite: 329 passed. Ruff / format / mypy all green.
KNOWN LIMITATION (documented in coop_definition.py docstring):
the openhands_sdk agent-server runs in a Modal sandbox that's
network-isolated from the host Redis. The CoopTaskTracker is
correctly registered and the LLM can call it, but every operation
returns "Shared task list unavailable" because the sandbox can't
``socket.getaddrinfo("host.docker.internal")``. The fix is in the
deployment layer (Modal tunnels, a Modal-hosted Redis, or running
openhands directly via docker like the other adapters), not in this
PR — verified by oh_team_v10: agent ran ``coop-task-list`` first
("The coop CLI failed; I'll use the shared task tracker."), then
fell back to TaskTrackerAction which still hit the local executor
because the override + Redis combo can't actually work in Modal.
For non-Modal openhands deployments (e.g. local docker-backed
openhands runs, future remote-conversation transports that share the
host network), this tool works as designed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands team mode: end-to-end working with Modal-hosted Redis
Resolves the Modal-Redis isolation that blocked the prior CoopTaskTracker
swap from actually functioning. Three pieces, working together:
1. **Modal-hosted Redis.** ``runner/team.py:execute_team`` detects
``agent_name == "openhands_sdk"`` and spins up a Modal sandbox
running redis-server on a TCP tunnel (``unencrypted_ports=[6379]``,
accessed via ``unencrypted_host:unencrypted_port``). Re-uses the
existing ``connectors/redis_server.ModalRedisServer`` — it was
already written, just unused. Both the host TaskListClient and
the agent sandboxes point at the same public TCP endpoint, so
pre-seed and agent reads/writes share state. Falls back to local
Redis for the other adapters.
2. **CoopTaskTrackerTool injection into the Modal sandbox.** The
adapter now ``add_local_file``s three pieces into the OpenHands
image at build time:
- ``coop_task.py`` → ``/usr/local/bin/cb-coop-task.py``
- ``coop_definition.py`` → ``$OH_DIR/coop_definition.py``
- ``_team_init_override.py`` → ``$OH_DIR/__init__.py``
(replaces upstream; same exports + a side-effect import of
coop_definition so the Redis-backed executor overrides the
local TaskTracker registration at first import).
Plus a ``find -name '*.pyc' -delete`` to invalidate Python's
bytecode cache so the new __init__ actually re-runs.
3. **Harvest-time fresh client.** Modal's TCP tunnels drop idle
connections after a few minutes, so the Redis client that the
pre-seed step opened at startup is closed before the 9-min agent run
finishes. The runner re-opens the client at harvest time using the same URL.
End-to-end on ``dottxt_ai_outlines_task/1371 [1,2]`` with
``-a openhands_sdk --setting team --git``:
- Modal Redis startup: ``redis ready redis://r450.modal.host:41899``
- Both agents Submitted, 9m total
- Eval: 2/2 PASS (f1: 14/14 ✓, f2: 20/20 ✓)
- Metrics: ``tasks_total: 4, tasks_done: 4, unowned_at_end: 0,
time_to_first_claim_seconds: 52.6, claims_per_agent: {agent2:2,
agent1:1}, updates_per_agent: {agent2:4, agent1:5}``
- Cost: $3.33
Tests: image-layering assertions expanded — ``add_local_file`` now
called 3 times (CLI helper, tool def, __init__ override), and the
run_commands chain copies both files + wipes .pyc caches.
Full suite: 329 passed. Ruff / format / mypy all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* deps: add fakeredis to dev extras
The team-mode unit tests (task_list / protocol / fs_mirror /
loop_refresh / mcp_server) use ``fakeredis.FakeRedis`` as a hermetic
stand-in for redis-server, but ``fakeredis`` wasn't declared anywhere
in pyproject.toml — it just happened to be present in my local venv
because something else pulled it in transitively.
GitHub CI installs ``[dev]`` only, so on a clean install pytest
collection fails with ``ModuleNotFoundError: No module named
'fakeredis'`` on every team-mode test file. Adding the dependency
explicitly fixes PR #52 (team-mode) CI; once team-mode merges,
PR #55 (team-all-adapters) will also pick it up via the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* swe_agent: fix import error + add missing transitive deps
Three changes that together unblock swe_agent team-mode runs (and
solo/coop runs too — the bug wasn't team-specific):
1. ``cooperbench.agents.mini_swe_agent`` → ``mini_swe_agent_v2``
in ``swe_agent/adapter.py`` and ``swe_agent/agent/agents.py``.
The old package was renamed in v0.0.13; both swe_agent files
had stale imports that failed at module load (TypeError or
ModuleNotFoundError depending on how the framework was invoked),
making every swe_agent invocation return Error before any LLM
call.
2. Add ``numpy``, ``boto3``, ``docker`` to the ``swe-agent`` extras
in pyproject.toml. swe_agent's vendored framework imports these
at module-load time even when the docker/S3/model paths are
dormant, so a clean ``pip install '.[swe-agent]'`` without these
would still ImportError on first invocation.
3. uv.lock refreshed with the new transitive deps.
End-to-end on dottxt_ai_outlines_task/1371 [1,2] with
``-a swe_agent -m gpt-5.5 --setting team --git`` (sw_team_v5):
both agents Submitted, patches 373 + 88 lines, both applied via
git apply. Eval failed 0/2 due to a content-quality issue
(``NameError: name 'Set' is not defined`` — agent used Set
without importing it; both agents hit exit_cost budget limit
mid-implementation), but that's model variance, not adapter
wiring. swe_agent is unblocked: it runs end-to-end, produces
patches, the eval pipeline processes them.
Coordination metrics still empty (claims_per_agent: {}) because
swe_agent doesn't yet have the in-container coop-task-* CLI
install or in-loop task auto-refresh — those are tracked as
follow-ups in the PR body. For now the swe_agent team-mode run
just gets the team prompt section + env vars; full team-tool
integration is a separate PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: team-mode bugs surfaced by 10-pair core run
Five compounding bugs prevented `claude_code`, `codex`, and
`mini_swe_agent_v2` from reaching honest pass-rates on the core
subset in team setting. All four benchmarked adapters (these three
plus `openhands_sdk`) now reach ≥ 5/10.
- normalize_patch ate trailing blank context lines (text.strip()
consumes " \n"), breaking last-hunk line counts so git apply
rejected otherwise-valid diffs. Replaced with lstrip/rstrip on
"\n" only.
- mini_swe_agent_v2 adapter wasn't normalizing patches at all —
raw .strip() on the patch.txt read, so every msa patch ended
in a non-newline byte. Now routes through normalize_patch.
- mini_swe_agent_v2 ModalEnvironment created the sandbox with no
long-running command, so the image's default CMD exited and
every exec hit "Sandbox not found". Pass "sleep", "infinity"
as the positional command (matches eval backend's existing fix).
- claude_code and codex adapters silently ignored --backend modal
because shared build_environment was hardcoded to DockerEnvironment.
Added a backend kwarg and threaded config["backend"] through both
adapters.
- Team lead prompt buried the integration step at the bottom of a
long workflow list; Claude/Codex consistently exited after their
own feature without reading /workspace/shared/<agent>.patch.
Rewrote with a hard-rule opener and a 5-point pre-submission
checklist. Member prompt now opens with "stay in your lane" per
the lead's PLAN.md.
- eval test_merged now falls back to testing each agent's patch
alone when the merged tree doesn't pass both features. Surfaced
as merge.strategy="solo-agent1" / "solo-agent2". Credits the
agent (typically the lead) who correctly integrated both
features into one working patch but had it corrupted by
union-merging with the other agent's partial implementation.
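The `normalize_patch` bug from the first item can be reproduced in a few lines. This is a sketch of the described behavior (strip only newline characters, end with exactly one newline), not the repository's code; returning an empty string for an empty patch is an assumption.

```python
def normalize_patch(text: str) -> str:
    """Trim surrounding newlines only and end with exactly one newline."""
    # Stripping "\n" alone preserves a final blank context line (" \n"),
    # which a bare text.strip() would consume as whitespace.
    body = text.lstrip("\n").rstrip("\n")
    return body + "\n" if body else ""


# The last hunk line is a blank context line: one space, then newline.
patch = "@@ -1,2 +1,3 @@\n line\n+new\n \n"
old_behavior = patch.strip() + "\n"   # drops the blank context line
new_behavior = normalize_patch(patch)  # keeps it
```

With the old behavior the hunk header still promises two old-side lines but only one remains, which is the kind of count mismatch `git apply` rejects.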
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs+data: core subset and team-mode horizontal comparison
- dataset/subsets/core.json: 10-pair subset for quick agent
comparisons. Stratified by repo (largest-remainder proportional
allocation by full-dataset pair count) with a one-slot floor per
primary language (Python / Go / Rust / TS). Reproducible via
scripts/generate_core_subset.py (seed=42).
- docs/BENCHMARK_RESULTS.md: horizontal comparison of four agent
frameworks on the core subset in team setting. Includes per-task
pass/fail matrix annotated with the merge strategy used, plus the
chronological narrative of the dozen reruns that surfaced each of
the bugs fixed in the previous commit.
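Largest-remainder proportional allocation, the method named for the core subset, can be sketched as below. The repo names and counts in the usage line are made up, and the per-language one-slot floor and the seeded sampling within each repo are deliberately not modeled, so this is not a reimplementation of scripts/generate_core_subset.py.

```python
from fractions import Fraction


def largest_remainder(counts: dict, slots: int) -> dict:
    """Split ``slots`` across keys in proportion to ``counts``."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    # Exact rational quotas avoid float ties when ranking remainders.
    quotas = {k: Fraction(v * slots, total) for k, v in counts.items()}
    alloc = {k: int(q) for k, q in quotas.items()}  # floor of each quota
    leftover = slots - sum(alloc.values())
    # Hand the remaining slots to the largest fractional remainders,
    # breaking ties by key name for reproducibility.
    ranked = sorted(counts, key=lambda k: (-(quotas[k] - alloc[k]), k))
    for k in ranked[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical pair counts per repo, allocated over a 4-slot subset.
example = largest_remainder({"repo_a": 5, "repo_b": 3, "repo_c": 2}, slots=4)
```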
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(eval): don't bail when union-merge also conflicts
Previously test_merged returned early with an error when both naive
and union merge strategies hit conflicts, so the solo-agent fallback
never got a chance to credit a team whose lead alone integrated both
features. Now we write an empty merged.patch, let run_tests fail
naturally on the merged tree, and fall through to the solo fallback.
Doesn't change any of the current 40 eval results — union's merge=union
attribute is tolerant enough that every task in the dataset produces
some tree (potentially broken code with stitched-together lines); the
broken-tree-tests-fail path already triggered the solo fallback. This
just closes the defensive gap for future pathological cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* eval(team): identical / naive / lead-when-naive-conflicts policy
Drops the union-merge strategy and the member-only fallback from
test_merged. The new chain is:
1. identical patches → skip-merge short-circuit
2. naive 3-way merge clean → merged-tree tests are authoritative
(no further fallback)
3. naive merge conflicts → test the lead's patch.txt alone against
both feature suites
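The three-step chain reads naturally as a small decision function. The function name and the returned labels are illustrative (the labels follow the matrix legend's naive / identical / lead-alone vocabulary); the real logic lives in test_merged.

```python
def choose_merge_strategy(patches_identical: bool, naive_merge_clean: bool) -> str:
    """Pick which tree the team-mode eval tests."""
    if patches_identical:
        return "identical"   # step 1: skip the merge entirely
    if naive_merge_clean:
        return "naive"       # step 2: merged tree is authoritative
    return "lead-alone"      # step 3: test only the lead's patch.txt
```

Note that a clean naive merge whose tests fail does not fall through to the lead-alone branch: step 2 has no further fallback.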
Rationale: union merge concatenates conflicting hunks, which usually
produces syntactically broken code; the cases where it accidentally
produced a working tree were rewarding lucky non-overlap, not genuine
coordination. The member-only fallback was symmetric to lead-only but
incoherent under team-mode semantics (the lead is the designated
integrator; if they didn't integrate, the team failed regardless of
what the member's branch looks like).
Effect on the core-subset horizontal comparison:
msa 6 → 6 (unchanged)
oh 5 → 4 (loses pallets_jinja/1621 — was passing via union, which
concealed that oh's lead doesn't integrate)
cc 5 → 5 (unchanged)
cx 5 → 5 (unchanged)
oh sliding below 5/10 is the correct outcome: the previous union-pass
on pallets_jinja/1621 was a false-positive of sorts (oh's agents commit
their patch.txt into the working tree, which forces a merge conflict
on patch.txt that union resolved while the actual source merge was
non-conflicting). Under the stricter policy this gets routed through
lead-alone, which oh's lead does not pass.
BENCHMARK_RESULTS.md updated to reflect the new totals + per-task
matrix legend (N = naive/identical, L = lead-alone). CHANGELOG entry
revised; full test suite still green (329 passed, 63 skipped).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(modal): codex stdin hang; eval guardrail for openhands_sdk
codex on Modal: `codex exec` was hanging for the full sandbox
lifetime (~2h) producing zero stream output. Root cause: codex's
exec mode prints "Reading additional input from stdin..." and
blocks until stdin EOF. Docker's non-tty `docker exec` gives EOF
for free; Modal sandbox keeps stdin open. Fix: add `</dev/null`
to the codex invocation in _build_codex_command. Smoke-tested on
dottxt_ai_outlines/1655 [1,3] solo on Modal: 1/1 pass in 1m 48s.
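A minimal sketch of why the redirect helps, assuming a POSIX `sh` is available. `with_closed_stdin` and `exits_while_stdin_stays_open` are hypothetical helpers (not the adapter's `_build_codex_command`), and `cat` stands in for any program that reads stdin until EOF.

```python
import subprocess


def with_closed_stdin(command: str) -> str:
    """Append a /dev/null redirect so the command sees EOF on stdin at once."""
    return f"{command} </dev/null"


def exits_while_stdin_stays_open(command: str, timeout: float = 5.0) -> bool:
    """Run `command` with a stdin pipe that is never closed by the parent.

    Returns True if the command exits within `timeout` seconds, which is
    roughly the situation described for a sandbox that keeps stdin open.
    """
    proc = subprocess.Popen(
        ["sh", "-c", command],
        stdin=subprocess.PIPE,
        stdout=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return False
    finally:
        if proc.stdin is not None:
            proc.stdin.close()
```

Under these assumptions, `exits_while_stdin_stays_open("cat")` should time out, while `exits_while_stdin_stays_open(with_closed_stdin("cat"))` should return promptly.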
openhands_sdk eval guardrail: openhands_sdk produces patches that
include a committed patch.txt in the working tree and relies on
Modal-hosted Redis for coordination; running eval through Docker
silently changed the test environment. The eval now reads the
run's config.json and refuses with a clear warning when the run
was produced by openhands_sdk but --backend != modal.
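The guardrail could look roughly like the sketch below. The function name `check_eval_backend` and the `"agent"` key in `config.json` are assumptions made for illustration; the commit only states that the eval reads the run's `config.json` and refuses when the run came from `openhands_sdk` and the backend is not `modal`.

```python
import json
from pathlib import Path
from typing import Optional


def check_eval_backend(run_dir: Path, backend: str) -> Optional[str]:
    """Return a refusal message, or None if the eval may proceed.

    Assumes (hypothetically) that config.json stores the adapter name
    under the "agent" key.
    """
    config_path = run_dir / "config.json"
    if not config_path.exists():
        return None  # nothing to check against
    config = json.loads(config_path.read_text())
    if config.get("agent") == "openhands_sdk" and backend != "modal":
        return (
            "run was produced by openhands_sdk; "
            f"refusing to evaluate with --backend {backend} (use modal)"
        )
    return None
```

Returning a message rather than raising keeps the caller free to print a warning and exit before any container work starts.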
Note: swe_agent already runs on Modal (uses swerex.ModalDeploymentConfig
by default; the earlier docs claiming it was docker-only were
wrong). Smoke-tested same dottxt task: 1/1 pass in 3m 12s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(swe_agent): add --backend docker support
swe_agent adapter was hardcoded to swerex.ModalDeploymentConfig.
Added a backend dispatch that picks DockerDeploymentConfig when
config["backend"] == "docker"; Modal stays as the default.
Two upstream-swerex issues had to be worked around to make the
docker path actually start a container:
1. CooperBench task images set ENTRYPOINT=/usr/local/bin/runner.sh,
so swerex's `docker run ... image sh -c "<startup>"` becomes
`runner.sh sh -c "<startup>"` and runner.sh interprets "sh" as
the feature-patch path. Pass docker_args=["--entrypoint", ""]
to clear the entrypoint (mirrors the existing Modal monkey-patch
that does .entrypoint([]) on the image).
2. swerex's startup falls back to `pipx run swe-rex ...` when the
swerex-remote binary isn't pre-installed, but pipx looks for an
executable literally named "swe-rex" — which doesn't exist in
the published `swe-rex` package (it provides "swerex-remote").
Monkey-patch DockerDeployment._get_swerex_start_cmd to use
`pipx run --spec swe-rex swerex-remote ...` instead.
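The two workarounds reduce to small argument rewrites. The sketch below is not the actual monkey-patch: `use_pipx_spec` is a hypothetical helper, and it assumes the start command is available as a token list, which may not match how swerex builds it internally.

```python
from typing import List

# Workaround 1: clear the image ENTRYPOINT so `sh -c "<startup>"` is not
# handed to runner.sh as its arguments.
DOCKER_ARGS: List[str] = ["--entrypoint", ""]


def use_pipx_spec(start_cmd: List[str]) -> List[str]:
    """Workaround 2: name the package and the executable separately.

    `pipx run swe-rex ...` looks for an executable called "swe-rex";
    `pipx run --spec swe-rex swerex-remote ...` installs the package
    "swe-rex" and runs the "swerex-remote" executable it provides.
    """
    if start_cmd[:3] == ["pipx", "run", "swe-rex"]:
        return ["pipx", "run", "--spec", "swe-rex", "swerex-remote"] + start_cmd[3:]
    return list(start_cmd)
```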
Smoke-tested with `dottxt_ai_outlines/1655 [1,3]` solo on docker:
1/1 pass in 2m 53s, 17 steps, $0.32, no errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* team_harness: extract team mode as standalone harness + ablation flags
Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.
Adds TeamSession + TeamHarnessConfig:
- TeamSession bundles per-run state (run_id, namespaced Redis URL,
ordered agent list, scratchpad volume name) with the feature config
and exposes adapter-facing factories that each return None / [] / {}
when their feature is disabled, so adapter code paths collapse to one
branch:
coop_env.update(session.env_for(agent_id))
extra_run_args.extend(session.scratchpad_mount_args())
mcp_config = session.mcp_config(container_script_path=...)
- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
(task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member
role split is the always-on baseline -- without it team is just coop.
Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter. result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
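A reduced sketch of the config and facade described above. Only the five field names, the frozen dataclass, the `cb-team-<run>` volume name, and the "return None / [] / {} when disabled" contract come from the commit text; the defaults, the `/workspace/shared` mount target, and the shape of the MCP dict are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Five per-feature switches; all-on is assumed to be the default."""

    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True


@dataclass
class TeamSession:
    """Per-run state plus factories that collapse to empty values when off."""

    run_id: str
    config: TeamHarnessConfig = field(default_factory=TeamHarnessConfig)

    def scratchpad_mount_args(self) -> List[str]:
        if not self.config.scratchpad:
            return []
        # Mount target is an assumption for illustration.
        return ["-v", f"cb-team-{self.run_id}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[Dict[str, object]]:
        if not self.config.mcp:
            return None
        # Hypothetical shape; the real MCP server entry may differ.
        return {"command": "python", "args": [container_script_path]}
```

Because disabled features return empty values, adapter code can call `extra_run_args.extend(session.scratchpad_mount_args())` unconditionally instead of branching on each flag.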
Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers. Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.
Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating. Full suite
363 passed, 63 skipped; ruff/format/mypy clean.
End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
tasks files, empty metrics dict, no volume. team_features in
result.json reflects the requested ablation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* codex: add Azure OpenAI support
Set AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT (the OpenAI-compatible
v1 base, e.g. https://<resource>.cognitiveservices.azure.com/openai/v1)
and pass the Azure deployment name via -m. When both are present they
take precedence over OPENAI_API_KEY.
How it works:
- resolve_azure_config() reads the two env vars (endpoint trailing slash
stripped); _azure_config_toml() writes a `model_provider = "azure"`
block into codex's config.toml with wire_api = "responses" (codex
0.132 dropped the chat wire API) and env_key = AZURE_OPENAI_API_KEY.
- The key is exported into the codex command and read via the provider
env_key; auth.json is skipped on the Azure path.
- config.toml is now composed from independent fragments (azure provider
+ team-mode MCP server) so both can coexist.
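The detection and fragment composition can be sketched as follows. The signatures are assumptions (the real `resolve_azure_config()` may take no arguments and return a different type), and `AzureConfig` and `compose_config_toml` are hypothetical names; only the two env var names, the both-must-be-set rule, and the trailing-slash stripping come from the commit text.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class AzureConfig:
    api_key: str
    endpoint: str  # OpenAI-compatible v1 base, without a trailing slash


def resolve_azure_config(env: Optional[Mapping[str, str]] = None) -> Optional[AzureConfig]:
    """Return Azure settings only when both env vars are present and non-empty."""
    source = os.environ if env is None else env
    api_key = source.get("AZURE_OPENAI_API_KEY", "").strip()
    endpoint = source.get("AZURE_OPENAI_ENDPOINT", "").strip().rstrip("/")
    if not api_key or not endpoint:
        return None
    return AzureConfig(api_key=api_key, endpoint=endpoint)


def compose_config_toml(fragments: List[str]) -> str:
    """Join independent config fragments, skipping empty ones."""
    kept = [fragment.strip() for fragment in fragments if fragment.strip()]
    return "\n\n".join(kept) + "\n" if kept else ""
```

Callers would check `resolve_azure_config()` first and fall back to `OPENAI_API_KEY` only when it returns `None`, which gives the precedence described above.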
Non-json fallback: codex 0.132's --json event stream deterministically
fails against Azure's HTTP/2 /responses endpoint ("stream disconnected:
error sending request") while plain output works. Captured requests are
byte-identical between modes, so it's a codex response-handling bug, not
a config error. The Azure path therefore runs codex WITHOUT --json,
harvests the patch from patch.txt (as always) and the final message via
--output-last-message, and derives status from codex's exit code.
Trade-off: no token/cost/trajectory telemetry on Azure (codex's plain
output carries none; cost was already $0 via the broken json parser).
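A sketch of how a result might be assembled on the non-JSON path. `summarize_plain_run` is a hypothetical helper and the result keys are assumptions; the status strings `Submitted` and `Error` are the ones used elsewhere in this PR, and telemetry is left as `None` because plain output carries none.

```python
from pathlib import Path
from typing import Dict, Optional, Union

ResultValue = Union[str, Optional[int], Optional[float]]


def _read_or_empty(path: Path) -> str:
    return path.read_text() if path.exists() else ""


def summarize_plain_run(
    exit_code: int, patch_path: Path, last_message_path: Path
) -> Dict[str, ResultValue]:
    """Build a run summary without parsing any event stream.

    `last_message_path` is assumed to be the file that codex was asked to
    write via --output-last-message.
    """
    return {
        "status": "Submitted" if exit_code == 0 else "Error",
        "patch": _read_or_empty(patch_path),
        "final_message": _read_or_empty(last_message_path),
        # No token / cost telemetry is available on this path.
        "tokens": None,
        "cost": None,
    }
```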
Tests: 5 new (resolve_azure_config, _azure_config_toml, non-json run
shape + provider config + no auth.json, error status on non-zero exit);
autouse fixture clears AZURE_* so non-Azure tests stay hermetic.
Full suite 369 passed; ruff/format/mypy green.
Validated end-to-end on dottxt_ai_outlines/1655 [1,3] with
`-a codex -m gpt-5.5-hao` against a live Azure deployment: Submitted,
clean stream (no disconnects), eval passes both features.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* agents: add Azure OpenAI support to msa / swe_agent / openhands
Extends Azure support (added for codex in the prior commit) to the three
litellm/SDK-backed adapters. claude_code is intentionally excluded.
Shared detection in cooperbench/agents/_azure.py:
- resolve_azure_config() reads AZURE_OPENAI_API_KEY + AZURE_OPENAI_ENDPOINT
(same env vars as codex), endpoint trailing slash stripped.
- azure_litellm_model() returns `openai/<deployment>` — litellm's
openai-compatible provider pointed at Azure's v1 base, mirroring how
the OpenAI SDK is pointed at Azure (base_url=<v1>). No api_version pin
(both the openai-compatible and native azure/ litellm routes were
verified against the live endpoint; the former is used).
Wiring (each gated on resolve_azure_config(), no-op when unset):
- mini_swe_agent_v2: model_name -> openai/<deployment>; api_base + api_key
folded into LitellmModelConfig.model_kwargs.
- swe_agent: GenericAPIModelConfig(name=openai/<deployment>,
api_base=..., api_key=...).
- openhands_sdk: LLM(model=openai/<deployment>, api_key=..., base_url=...).
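The shared model-id mapping is small enough to sketch directly. `azure_litellm_model` follows the description above; `azure_completion_kwargs` is a hypothetical helper showing how the three values might be bundled for a litellm-style call, and is not taken from any of the adapters.

```python
from typing import Dict


def azure_litellm_model(deployment: str) -> str:
    """Map an Azure deployment name to an openai-compatible litellm model id."""
    return f"openai/{deployment}"


def azure_completion_kwargs(deployment: str, endpoint: str, api_key: str) -> Dict[str, str]:
    """Bundle model id, base URL and key (no api_version pin, as noted above)."""
    return {
        "model": azure_litellm_model(deployment),
        "api_base": endpoint.rstrip("/"),
        "api_key": api_key,
    }
```

Each adapter then places these values wherever its own config type expects them, for example inside `model_kwargs` for mini_swe_agent_v2 or as `base_url` for the OpenHands `LLM`.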
Tests: tests/agents/test_azure.py (9) covers detection precedence,
endpoint normalization, deployment-name parsing, and the litellm model
id. Full suite 378 passed; ruff/format/mypy green.
Validation: the litellm->Azure route was confirmed directly (both
openai-compatible and azure/ provider forms return 200). mini_swe_agent_v2
validated end-to-end on docker. openhands_sdk (Modal backend) and
swe_agent (swerex path) are wired but not yet end-to-end-validated against
Azure — deferred so as not to compete with the running full-dataset codex
sweep for the shared Azure deployment's quota.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* openhands: drop incidental reformatting, keep only the Azure edit
The openhands_agent_sdk/ tree is ruff-excluded in pyproject.toml
(adapted from the OpenHands SDK), so the prior commit's `ruff format`
churned ~90 unrelated lines. Restore the base file and re-apply only
the Azure LLM branch so the diff is minimal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Ubuntu <ubuntu@ip-172-31-58-153.us-west-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Brings every adapter (`claude_code`, `codex`, `mini_swe_agent_v2`, `swe_agent`, `openhands_sdk`) to full team-mode parity: each now runs end-to-end in team mode, including the `openhands_sdk` variant, whose Modal-isolated agent-server required a custom Redis tunnel + tool-registry override to actually function.
Stacks on #52 (which itself stacks on #51).
Backend support matrix (after this PR)
- mini_swe_agent_v2
- claude_code: `_coop/runtime.build_environment`
- codex
- swe_agent: `swerex.DockerDeploymentConfig` (with entrypoint-clear + pipx-spec patch, both in this PR); verified solo on dottxt: Modal 1/1 in 3m 12s, Docker 1/1 in 2m 53s
- openhands_sdk
What landed (in order of commits)
- `mini_swe_agent_v2`, `swe_agent`, `openhands_sdk` now append `team_task_section` to the task, propagate `CB_TEAM_*` into their containers, and (where they manage docker) mount the team scratchpad volume.
- `## Git collaboration — MERGE IS REQUIRED BEFORE SUBMITTING` prompt rewrite that frames merging as the explicit final step, plus a `test_merged` short-circuit that copies one patch to `merged.patch` when both agents submit byte-identical merged trees.
- `runner/team.py` detects `agent_name == "openhands_sdk"` and spins up a Modal sandbox running redis-server on `unencrypted_ports=[6379]`, exposed via `unencrypted_host:unencrypted_port`. Both the host TaskListClient and the agent-server's CoopTaskTracker point at the same public TCP endpoint.
- `add_local_file` the tool definition + a pre-rendered replacement `__init__.py` (no shell-heredoc fragility) + `.pyc` cache wipe so the registration override actually takes effect.
- `fakeredis` dev dependency: was undeclared, causing CI ImportErrors on team-mode test files.
- `cooperbench.agents.mini_swe_agent` → `mini_swe_agent_v2`) + missing transitive deps (`numpy`, `boto3`, `docker`) in `swe-agent` extras. Was a pre-existing bug from v0.0.13's rename: every swe_agent invocation errored before any LLM call.
Per-adapter wiring + verified result
`CB_TEAM_*` env, per adapter:
- claude_code
- codex
- mini_swe_agent_v2: `env_kwargs["env"]`; `TeamPoller`
- openhands_sdk: `coop_info["team_env"]`
- swe_agent
Follow-up validation: 10-pair core-subset horizontal comparison
Took the team wiring above to a real workload (the new `dataset/subsets/core.json` subset) and discovered five compounding bugs that prevented anything other than openhands from reaching honest pass-rates. Fixed them and then re-thought the team-mode eval policy.
Final results (10-pair core subset, team setting)
Eval policy: identical → naive merge → lead's patch alone. Union merge and member-only fallback were intentionally dropped: they reward lucky non-overlap or partial coordination rather than genuine team integration. Details in `docs/BENCHMARK_RESULTS.md`.
- mini_swe_agent_v2: 6/10
- claude_code: 5/10
- codex: 5/10 *
- openhands_sdk: 4/10
* gpt-5.5 not in local pricing table; codex did real work (400 k+ input tokens per agent).
Three of four ≥ 5/10 under the strict policy. `oh` at 4/10 is the right number: their union-merge passes on the older lenient eval were partly a false positive of their `patch.txt`-commits-mid-run workflow, which forced a `patch.txt` merge conflict that union resolved trivially while the actual source-code merge was non-conflicting. Under the stricter policy those route to lead-alone, and oh's lead doesn't always integrate.
Bugs fixed (all in the unreleased CHANGELOG entry)
- `codex exec` hung in Modal sandbox: codex's exec mode blocks reading "additional input from stdin"; Modal sandbox keeps stdin open while Docker non-tty `docker exec` gives EOF for free. Fix: `</dev/null` on the codex invocation. Smoke-verified solo on dottxt: 1/1 in 1m 48s.
- `normalize_patch` was using `text.strip()`, eating trailing blank context lines (`" \n"`) from valid `git diff` output and breaking last-hunk line counts so `git apply` rejected them.
- `mini_swe_agent_v2` adapter wasn't routing patches through `normalize_patch` at all: raw `.strip()`, same underlying issue, one layer deeper.
- `mini_swe_agent_v2` `ModalEnvironment` created the sandbox without a long-running command, so the image's default CMD exited and every `exec()` hit "Sandbox not found". Now passes `"sleep", "infinity"` (matches the eval backend's existing fix).
- `claude_code` and `codex` adapters silently ignored `--backend modal`: shared `build_environment` was hardcoded to `DockerEnvironment`. Added a `backend` kwarg and threaded `config["backend"]` through both adapters.
- `/workspace/shared/<agent>.patch`. Rewrote with a hard-rule opener and a 5-point pre-submission checklist; member prompt now opens with "stay in your lane" per the lead's `PLAN.md`.
Eval policy change
`test_merged` now uses `identical → naive → lead-alone-when-naive-conflicts`. Previous chain was `identical → naive → union → solo-fallback (lead-or-member)`. Union merge concatenates conflicting hunks (usually broken code; rewards lucky non-overlap rather than coordination); member-only fallback is incoherent in team mode (the lead is the designated integrator). When naive conflicts, the lead's `patch.txt` must pass both feature suites alone. Surfaced as `merge.strategy = "solo-agent1"` in `eval.json`.
Added
- `dataset/subsets/core.json` (+ `scripts/generate_core_subset.py`): 10-pair stratified core subset for quick agent comparisons.
- `docs/BENCHMARK_RESULTS.md`: the horizontal comparison with per-task matrix and rerun narrative.
Tests
16 new unit tests + 1 prompt regression (full suite: 329 passed, 63 skipped):
- `team_task_section` vs `build_team_instruction` consistency
- `add_local_file` + 2 `run_commands` + `.pyc` wipe)
Ruff / format / mypy all green.
Real follow-ups
- `coop-task-*` CLI isn't installed in its sandbox.
- `openhands_sdk`: works without this thanks to the typed CoopTaskTracker tool, but a push-style refresh hook would close the agency gap further.
- `patch.txt` workflow: oh agents commit their `patch.txt` into the working tree mid-run, which forces every team-mode merge into a `patch.txt` conflict. Not strictly a bug (the merge falls back to lead-alone correctly) but it's noise in the eval logs.
Test plan
- `ruff check`, `ruff format --check`, `mypy`, `pytest tests/` (all green locally: 329 passed)
- `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` exported, run `uv run cooperbench run -a <adapter> -m <model> -r <repo> -t <task> -f <f1>,<f2> --setting team --git --backend docker`
- `uv run cooperbench run -a codex -m gpt-5.5 -r dottxt_ai_outlines_task -t 1655 -f 1,3 --setting solo --backend modal`: should finish in under 5 min (verifies the stdin fix)
- `uv run cooperbench eval -n <oh_run> --backend docker`: should refuse with a warning before doing any work
- `redis ready redis://r...modal.host:...` line appears and metrics dict in `result.json` is populated (non-empty `claims_per_agent`)
- `'.[swe-agent]'` and no `numpy`/`boto3`/`docker` ImportError surfaces
- `uv run cooperbench run -a <adapter> -m <model> -s core --setting team --backend docker -c 3` to reproduce the core-subset results
🤖 Generated with Claude Code