team_harness: extract team mode as standalone harness + ablation flags #58
ProKil wants to merge 1 commit into
Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.
Adds TeamSession + TeamHarnessConfig:
- TeamSession bundles per-run state (run_id, namespaced Redis URL,
  ordered agent list, scratchpad volume name) with the feature config
  and exposes adapter-facing factories that each return None / [] / {}
  when their feature is disabled, so adapter code paths collapse to one
  branch:
      coop_env.update(session.env_for(agent_id))
      extra_run_args.extend(session.scratchpad_mount_args())
      mcp_config = session.mcp_config(container_script_path=...)
- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
  (task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member
  role split is the always-on baseline -- without it team is just coop.
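To make the shape of the config concrete, here is a minimal sketch of a frozen dataclass with the five booleans named above. The all-on defaults and the bodies of `with_only` / `disabled` (helper names that appear in the PR summary) are assumptions made for illustration; this is not the code in cooperbench/team_harness.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Illustrative stand-in: five per-feature booleans, assumed on by default."""

    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True

    @classmethod
    def disabled(cls) -> "TeamHarnessConfig":
        # Every feature off; only the always-on lead/member split would remain.
        return cls(**{f.name: False for f in fields(cls)})

    @classmethod
    def with_only(cls, *names: str) -> "TeamHarnessConfig":
        # Assumed semantics: start from all-off, switch on just the named features.
        known = {f.name for f in fields(cls)}
        unknown = set(names) - known
        if unknown:
            raise ValueError(f"unknown features: {sorted(unknown)}")
        return replace(cls.disabled(), **{name: True for name in names})


config = TeamHarnessConfig.with_only("task_list", "mcp")
print(config)
```

Freezing the dataclass means one config instance can be handed to every adapter without any of them mutating it.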
Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter. result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.
Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers. Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.
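The "factories return empty values when disabled" pattern can be sketched as follows. Everything here is hypothetical: the `Features` and `Session` classes, the environment variable names, the mount point, and the choice of which features count as Redis consumers are invented for the example and are not taken from the harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Features:
    # Hypothetical subset of the feature flags, enough for the sketch.
    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True


@dataclass(frozen=True)
class Session:
    run_id: str
    redis_url: str
    team_volume: str
    config: Features

    def env_for(self, agent_id: str) -> dict[str, str]:
        # Assumption: task_list and mcp are the only Redis consumers here.
        if not (self.config.task_list or self.config.mcp):
            return {}
        # Variable names are invented for this example.
        return {"TEAM_REDIS_URL": self.redis_url, "TEAM_AGENT_ID": agent_id}

    def scratchpad_mount_args(self) -> list[str]:
        if not self.config.scratchpad or not self.team_volume:
            return []
        # The /scratchpad mount point is an assumption.
        return ["-v", f"{self.team_volume}:/scratchpad"]


def build_run(session: Session, agent_id: str) -> tuple[dict[str, str], list[str]]:
    # Adapter side: one unconditional code path, no per-feature branching.
    coop_env: dict[str, str] = {"AGENT": agent_id}
    extra_run_args: list[str] = ["--rm"]
    coop_env.update(session.env_for(agent_id))
    extra_run_args.extend(session.scratchpad_mount_args())
    return coop_env, extra_run_args


on = Session("r1", "redis://localhost:6379/0", "cb-team-r1", Features())
off = Session("r1", "redis://localhost:6379/0", "cb-team-r1", Features(False, False, False))
print(build_run(on, "agent1"))
print(build_run(off, "agent1"))
```

Because `update({})` and `extend([])` are no-ops, the adapter needs no `if` around either call.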
Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating. Full suite
363 passed, 63 skipped; ruff/format/mypy clean.
End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
tasks files, empty metrics dict, no volume. team_features in
result.json reflects the requested ablation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end eval results (was missing from the original PR body — ran
Real pass-rate delta in the expected direction: with the shared task list / scratchpad / MCP off, the lead had nowhere to fetch the member's patch from, the naive merge produced a non-composing diff, and eval failed. With them on, the lead integrated and the team passed. n=1 is far too small for any conclusion about which of the three features is doing the work, but the plumbing and the ablation signal both work end-to-end. Next-PR target: the same matrix across the
Ablation matrix (core 10-pair × marginal-effect design)
Used the marginal-effect design (1 baseline + 5 one-feature-off) instead of the full 2⁵=32 interaction matrix; it answers the "what's each feature's contribution" question at ~1/5 the cost. Codex CLI on
Headline numbers
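For reference, the marginal-effect design described above (one all-on baseline plus one run per disabled feature) can be enumerated next to the full factorial. This is only an illustration of the run counts; the function names are hypothetical and the feature names are the five listed in this PR.

```python
from itertools import product

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def marginal_design(features):
    """One all-on baseline plus one run per feature with only that feature off."""
    runs = [{name: True for name in features}]
    for off in features:
        runs.append({name: name != off for name in features})
    return runs


def full_factorial(features):
    """Every on/off combination: 2 ** len(features) runs."""
    return [
        dict(zip(features, bits))
        for bits in product((True, False), repeat=len(features))
    ]


marginal = marginal_design(FEATURES)
full = full_factorial(FEATURES)
print(len(marginal), len(full), len(marginal) / len(full))  # 6 32 0.1875
```

Six runs against 32 is where the "~1/5 the cost" figure comes from; the trade-off is that interactions between features are not measured.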
Per-task breakdown — the signal is concentrated
What this says about each feature
Caveats
Raw data
🤖 Generated with Claude Code
Summary
Stacks on #55. Lifts the team-mode coordination primitives out of cooperbench/agents/_team (private, benchmark-internal) into cooperbench/team_harness (public, library-shaped) so the algorithm can be studied and consumed by other benchmarks, the long-horizon target discussed in #52's followups. Adds five per-feature ablation flags so we can measure each coordination mechanism's contribution independently.
What's new
cooperbench.team_harness: public package
A documented sibling of cooperbench/agents, not nested under it. Same modules as the old _team/ (task_list, protocol, mcp_server, prompt, loop_refresh, fs_mirror, metrics, runtime, coop_task, install_snippet.sh) plus a facade in __init__.py:
- TeamHarnessConfig: frozen dataclass of five per-feature booleans (task_list, scratchpad, mcp, auto_refresh, protocol), with with_only("task_list", "mcp") and disabled() helpers.
- TeamSession: bundles run_id / redis_url / agents / team_volume / config. Adapter-facing factories: env_for, scratchpad_mount_args, mcp_config, prompt_for, prompt_section, loop_poller, task_list_client, harvest_metrics. Each factory returns None / [] / {} when its feature is disabled so adapters write one code path.
- Constants: COOP_TASK_SCRIPT_PATH, INSTALL_SNIPPET_PATH, MCP_SERVER_SCRIPT_PATH, MCP_SERVER_NAME. Adapters import these instead of computing Path(__file__).parent.parent / "_team".
TeamSession.redis_url is the host-side URL; env_for() rewrites localhost / 127.0.0.1 → host.docker.internal so adapters don't have to plumb that themselves. The rewrite is duplicated from _coop.runtime rather than imported, because the harness is meant to be portable to other benchmarks that don't ship _coop.
Ablation flags
Five --team-no-* flags on cooperbench run, each gating one coordination mechanism. The lead/member role split stays on either way: without it team mode collapses to coop, so it's the always-on baseline.
result.json now records which features were enabled, so post-hoc analysis can attribute pass-rate deltas to the specific feature that was off, without cross-referencing CLI invocations.
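A rough sketch of how such flags could map to the recorded team_features follows. It assumes argparse (the real CLI may be built differently) and derives the flag spellings from the feature names; only --team-no-task-list, --team-no-scratchpad and --team-no-mcp are spelled out in this PR, so the other two spellings are guesses.

```python
import argparse
import json

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def build_parser() -> argparse.ArgumentParser:
    # "run-sketch" is a placeholder program name, not the real CLI.
    parser = argparse.ArgumentParser(prog="run-sketch")
    for name in FEATURES:
        # task_list -> --team-no-task-list (argparse dest: team_no_task_list)
        parser.add_argument(
            f"--team-no-{name.replace('_', '-')}",
            action="store_true",
            help=f"disable the {name} feature",
        )
    return parser


def team_features(args: argparse.Namespace) -> dict[str, bool]:
    # A feature is enabled unless its --team-no-* flag was passed.
    return {name: not getattr(args, f"team_no_{name}") for name in FEATURES}


args = build_parser().parse_args(
    ["--team-no-task-list", "--team-no-scratchpad", "--team-no-mcp"]
)
result = {"team_features": team_features(args)}
print(json.dumps(result, indent=2))
```

Storing the resolved booleans, not the raw flags, keeps result.json self-describing even if flag names change later.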
Adapter refactor
claude_code, codex, mini_swe_agent_v2, swe_agent, openhands_agent_sdk all accept a new team_features: TeamHarnessConfig | None kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt assembly, env vars, scratchpad mount, MCP install, in-loop poller, CLI install) gate on session.config.<feature>. For example, the MCP install in claude_code now gates on the session's config, with no is_team flag spread around.
Tests
tests/agents/_team → tests/team_harness (rename, 83 existing tests still pass). Plus:
- tests/team_harness/test_session.py (29 new) covers:
  - TeamHarnessConfig defaults / with_only / disabled
  - TeamSession.lead / role_for / is_active
  - env_for (default, localhost rewrite, non-localhost passthrough, empty when all consumers off, full when only mcp remains)
  - scratchpad_mount_args (default, off, empty volume)
  - mcp_config (default, off)
  - task_list_client (on, off)
  - loop_poller (on, off)
  - harvest_metrics (active, off, None client)
  - prompt_for (lead, member)
  - prompt_section (default, single-agent collapse)
- tests/runner/test_team.py (4 new):
  - team_features recorded in result.json by default
  - same config instance propagates to every adapter call
  - --team-no-task-list skips Redis pre-seed and produces empty metrics dict while keeping the role split
  - --team-no-scratchpad clears config["team_volume"] to the empty string
Full suite: 363 passed, 63 skipped. Ruff / format / mypy all green.
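The localhost rewrite exercised by the env_for tests above could look roughly like this. It is a sketch using only the standard library; the function name `container_redis_url` is hypothetical and the harness's actual implementation may differ.

```python
from urllib.parse import urlsplit, urlunsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1"}


def container_redis_url(host_url: str) -> str:
    """Point a loopback Redis URL at the Docker host; pass others through."""
    parts = urlsplit(host_url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return host_url
    # Rebuild the netloc, preserving any port and credentials.
    netloc = "host.docker.internal"
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo = f"{userinfo}:{parts.password}"
        netloc = f"{userinfo}@{netloc}"
    return urlunsplit(parts._replace(netloc=netloc))


print(container_redis_url("redis://localhost:6379/0"))
print(container_redis_url("redis://redis.internal:6379/0"))
```

Parsing the URL instead of doing a plain string replace avoids touching a path or password that happens to contain "localhost".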
End-to-end on dottxt_ai_outlines/1371 [1,2] with codex --setting team --backend docker
- smoke-team-default
  - result.metrics: {tasks_total: 2, tasks_done: 2, time_to_first_claim_seconds: 67.3, claims_per_agent: {agent2: 1}}
  - cb-team-* volume: created (cb-team-96067cd3)
  - team_features: all true
- smoke-team-ablated (--team-no-task-list --team-no-scratchpad --team-no-mcp)
  - result.metrics: {}
  - cb-team-* volume: none
  - team_features: task_list/scratchpad/mcp: false, auto_refresh/protocol: true
Both runs Submitted 2/2 agents.
Out of scope (next PRs)
- TeamSession API is validated only by CooperBench itself.
- TeamHarnessConfig × prompt variants × refresh cadence, with eval pass-rate as the objective and the coordination metrics (time_to_first_claim_seconds, unowned_at_end, claims_per_agent) as cheap first-pass proxies.
Test plan
- ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
- uv run cooperbench run -a codex -r dottxt_ai_outlines_task -t 1371 -f 1,2 --setting team --backend docker --no-auto-eval -n smoke-team-default
- --team-no-task-list --team-no-scratchpad --team-no-mcp
- result.json: team_features matches the flags on both runs
🤖 Generated with Claude Code