team_harness: extract team mode as standalone harness + ablation flags by ProKil · Pull Request #58 · cooperbench/CooperBench

ProKil · 2026-05-19T01:02:54Z

Summary

Stacks on #55. Lifts the team-mode coordination primitives out of cooperbench/agents/_team (private, benchmark-internal) into cooperbench/team_harness (public, library-shaped) so the algorithm can be studied and consumed by other benchmarks — the long-horizon target discussed in #52's followups. Adds five per-feature ablation flags so we can measure each coordination mechanism's contribution independently.

What's new

`cooperbench.team_harness` — public package

A documented sibling of cooperbench/agents, not nested under it. Same modules as the old _team/ (task_list, protocol, mcp_server, prompt, loop_refresh, fs_mirror, metrics, runtime, coop_task, install_snippet.sh) plus a facade in __init__.py:

Public symbol	What it is
`TeamHarnessConfig`	Frozen dataclass of five booleans (`task_list`, `scratchpad`, `mcp`, `auto_refresh`, `protocol`). `with_only("task_list", "mcp")` and `disabled()` helpers.
`TeamSession`	Per-run object bundling `run_id` / `redis_url` / `agents` / `team_volume` / `config`. Adapter-facing factories: `env_for`, `scratchpad_mount_args`, `mcp_config`, `prompt_for`, `prompt_section`, `loop_poller`, `task_list_client`, `harvest_metrics`. Each factory returns `None` / `[]` / `{}` when its feature is disabled so adapters write one code path.
Path constants	`COOP_TASK_SCRIPT_PATH`, `INSTALL_SNIPPET_PATH`, `MCP_SERVER_SCRIPT_PATH`, `MCP_SERVER_NAME` — adapters import these instead of computing `Path(__file__).parent.parent / "_team"`.
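For readers who haven't opened the module, here is a minimal sketch of what a frozen five-boolean config with these two helpers could look like. It is not the code in `cooperbench.team_harness`: the all-on defaults are inferred from the default smoke run, and the classmethod shape and the "enable exactly the named features" semantics of `with_only` are assumptions.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TeamHarnessConfig:
    """Illustrative stand-in: five per-feature booleans, assumed on by default."""

    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True

    @classmethod
    def disabled(cls) -> "TeamHarnessConfig":
        # Every feature off; only the always-on lead/member split would remain.
        return cls(**{f.name: False for f in fields(cls)})

    @classmethod
    def with_only(cls, *names: str) -> "TeamHarnessConfig":
        # Assumed semantics: enable exactly the named features, disable the rest.
        known = {f.name for f in fields(cls)}
        unknown = set(names) - known
        if unknown:
            raise ValueError(f"unknown feature(s): {sorted(unknown)}")
        return cls(**{name: name in names for name in known})


cfg = TeamHarnessConfig.with_only("task_list", "mcp")
print(cfg)
```

Freezing the dataclass means one instance can be handed to every adapter call without any of them mutating it.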

TeamSession.redis_url is the host-side URL; env_for() rewrites localhost / 127.0.0.1 → host.docker.internal so adapters don't have to plumb that themselves. The rewrite is duplicated from _coop.runtime rather than imported, because the harness is meant to be portable to other benchmarks that don't ship _coop.
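The loopback rewrite can be sketched with the standard library alone. This is an illustration of the idea, not the harness's implementation; the function name `rewrite_for_container` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

_LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def rewrite_for_container(redis_url: str) -> str:
    """Point a host-side loopback URL at host.docker.internal; pass others through."""
    parts = urlsplit(redis_url)
    if parts.hostname not in _LOCAL_HOSTS:
        return redis_url
    # Rebuild netloc by hand so credentials and port survive the host swap.
    netloc = "host.docker.internal"
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    if parts.username is not None:
        auth = parts.username
        if parts.password is not None:
            auth = f"{auth}:{parts.password}"
        netloc = f"{auth}@{netloc}"
    return urlunsplit(parts._replace(netloc=netloc))


print(rewrite_for_container("redis://localhost:6379/0"))
```

Matching on the parsed hostname rather than doing a string replace avoids touching a database path or password that happens to contain "localhost".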

Ablation flags

Five --team-no-* flags on cooperbench run, each gating one coordination mechanism:

--team-no-task-list      shared Redis task list + pre-seeding + metrics
--team-no-scratchpad     /workspace/shared Docker volume
--team-no-mcp            wait_for_message MCP registration
--team-no-auto-refresh   in-loop task-list summary injection (Python-loop adapters)
--team-no-protocol       coop-request / coop-respond / coop-pending verbs

The lead/member role split stays on either way — without it team mode collapses to coop, so it's the always-on baseline.
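As a rough illustration of how the five opt-out flags map onto per-feature booleans, here is a sketch using `argparse`. The flag names come from the list above; the use of `argparse`, the `FEATURES` tuple, and both helper functions are assumptions, since this description does not show how cli.py parses them.

```python
import argparse

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="cooperbench-run-sketch")
    for feature in FEATURES:
        flag = "--team-no-" + feature.replace("_", "-")
        # store_false: the feature stays on unless its --team-no-* flag is passed.
        parser.add_argument(flag, dest=feature, action="store_false", default=True)
    return parser


def team_features(argv: list[str]) -> dict[str, bool]:
    args = build_parser().parse_args(argv)
    return {feature: getattr(args, feature) for feature in FEATURES}


print(team_features(["--team-no-task-list", "--team-no-scratchpad", "--team-no-mcp"]))
```

Opt-out flags keep the all-on configuration as the default, so an unflagged run is always the baseline.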

result.json now records which features were enabled:

"team_features": {
  "task_list": false,
  "scratchpad": false,
  "mcp": false,
  "auto_refresh": true,
  "protocol": true
}

so post-hoc analysis can attribute pass-rate deltas to the specific feature that was off, without cross-referencing CLI invocations.
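A post-hoc script only needs the `team_features` block to label a run. A minimal sketch, assuming the block sits at the top level of result.json (the helper name `disabled_features` is hypothetical):

```python
import json


def disabled_features(result: dict) -> list[str]:
    """Names of the coordination features that were off for this run."""
    features = result.get("team_features", {})
    return sorted(name for name, enabled in features.items() if not enabled)


sample = json.loads(
    """
    {
      "team_features": {
        "task_list": false,
        "scratchpad": false,
        "mcp": false,
        "auto_refresh": true,
        "protocol": true
      }
    }
    """
)
print(disabled_features(sample))
```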

Adapter refactor

claude_code, codex, mini_swe_agent_v2, swe_agent, openhands_agent_sdk all accept a new team_features: TeamHarnessConfig | None kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt assembly, env vars, scratchpad mount, MCP install, in-loop poller, CLI install) gate on session.config.<feature>. For example, the MCP install in claude_code is now:

mcp_config = team_session.mcp_config(container_script_path=CONTAINER_TEAM_MCP_PATH) if team_session else None
if mcp_config is not None:
    write_file_in_container(env, CONTAINER_TEAM_MCP_PATH, TEAM_MCP_SCRIPT_PATH.read_text())
    write_file_in_container(env, f"{CONTAINER_CLAUDE_CONFIG_DIR}/.claude.json", json.dumps(mcp_config, indent=2))

The check gates on the session's config, so there is no is_team flag spread around the adapter.
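The "empty value when disabled" convention is what lets an adapter keep a single code path. A toy sketch of that pattern follows; `Features`, `SessionSketch`, `docker_run_args`, the volume name, the MCP server name, and the shape of the returned MCP dict are all hypothetical stand-ins, not the real `TeamSession`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Features:
    scratchpad: bool = True
    mcp: bool = True


@dataclass(frozen=True)
class SessionSketch:
    team_volume: str
    config: Features

    def scratchpad_mount_args(self) -> list[str]:
        # [] when the feature is off or no volume was created.
        if not self.config.scratchpad or not self.team_volume:
            return []
        return ["-v", f"{self.team_volume}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[dict]:
        # None when MCP is off; the dict layout here is only a placeholder.
        if not self.config.mcp:
            return None
        return {"mcpServers": {"team": {"command": "python", "args": [container_script_path]}}}


def docker_run_args(session: SessionSketch) -> list[str]:
    args = ["docker", "run", "--rm"]
    # extend() with [] is a no-op, so the adapter needs no feature check here.
    args.extend(session.scratchpad_mount_args())
    return args


on = SessionSketch(team_volume="cb-team-demo", config=Features())
off = SessionSketch(team_volume="cb-team-demo", config=Features(scratchpad=False, mcp=False))
print(docker_run_args(on))
print(docker_run_args(off))
```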

Tests

tests/agents/_team → tests/team_harness (rename, 83 existing tests still pass). Plus:

tests/team_harness/test_session.py (29 new) — covers TeamHarnessConfig defaults / with_only / disabled, TeamSession.lead/role_for/is_active, env_for (default, localhost rewrite, non-localhost passthrough, empty when all consumers off, full when only mcp remains), scratchpad_mount_args (default, off, empty volume), mcp_config (default, off), task_list_client (on, off), loop_poller (on, off), harvest_metrics (active, off, None client), prompt_for (lead, member), prompt_section (default, single-agent collapse).
tests/runner/test_team.py (4 new) — team_features recorded in result.json by default; same config instance propagates to every adapter call; --team-no-task-list skips Redis pre-seed and produces empty metrics dict while keeping the role split; --team-no-scratchpad clears config["team_volume"] to the empty string.

Full suite: 363 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end on `dottxt_ai_outlines/1371 [1,2]` with `codex --setting team --backend docker`

Run	flags	task_log.json	tasks.json	`result.metrics`	`cb-team-*` volume	`team_features`
`smoke-team-default`	(defaults)	✓	✓	`{tasks_total: 2, tasks_done: 2, time_to_first_claim_seconds: 67.3, claims_per_agent: {agent2: 1}}`	created (`cb-team-96067cd3`)	all `true`
`smoke-team-ablated`	`--team-no-task-list --team-no-scratchpad --team-no-mcp`	✗	✗	`{}`	not created	`task_list/scratchpad/mcp: false, auto_refresh/protocol: true`

Both runs Submitted 2/2 agents.

Out of scope (next PRs)

Long-horizon benchmark consumer that drives task creation dynamically rather than pre-seeding one task per feature — the actual second-consumer this PR is in service of. Without it the TeamSession API is validated only by CooperBench itself.
Automatic search over TeamHarnessConfig × prompt variants × refresh cadence, with eval pass-rate as the objective and the coordination metrics (time_to_first_claim_seconds, unowned_at_end, claims_per_agent) as cheap first-pass proxies.

Test plan

ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
Default smoke: uv run cooperbench run -a codex -r dottxt_ai_outlines_task -t 1371 -f 1,2 --setting team --backend docker --no-auto-eval -n smoke-team-default
Ablation smoke: same + --team-no-task-list --team-no-scratchpad --team-no-mcp
Confirm result.json:team_features matches the flags on both runs

🤖 Generated with Claude Code

Move team-mode primitives from cooperbench/agents/_team (private) to cooperbench/team_harness (public, library-shaped) so other benchmarks can consume the multi-agent coordination algorithm without depending on CooperBench's task layout.

Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL, ordered agent list, scratchpad volume name) with the feature config and exposes adapter-facing factories that each return None / [] / {} when their feature is disabled, so adapter code paths collapse to one branch:

    coop_env.update(session.env_for(agent_id))
    extra_run_args.extend(session.scratchpad_mount_args())
    mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans (task_list, scratchpad, mcp, auto_refresh, protocol). The lead/member role split is the always-on baseline -- without it team is just coop.

Wires five --team-no-* CLI flags through cli.py -> runner.run -> runner.core -> runner.team -> each adapter. result.json now records team_features so post-hoc analysis can attribute deltas to the feature that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and openhands_agent_sdk now accept team_features kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt, env, mount, MCP, install) gate on the session's config.

Tests: tests/agents/_team -> tests/team_harness (rename), new test_session.py (29 cases) covers the facade, four new ablation tests in tests/runner/test_team.py verify the runner-side gating. Full suite 363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):

- Default: writes task_log.json + tasks.json + metrics, cb-team-<run> volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log / tasks files, empty metrics dict, no volume. team_features in result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ProKil · 2026-05-19T03:34:10Z

End-to-end eval results (was missing from the original PR body — ran cooperbench eval after pushing):

run	merge strategy	both pass	pass rate
`smoke-team-default`	`solo-agent1` (lead integrated both features)	✓	100%
`smoke-team-ablated` (`--team-no-task-list --team-no-scratchpad --team-no-mcp`)	`naive` (diff-cat)	✗	0%

Real pass-rate delta in the expected direction — with the shared task list / scratchpad / MCP off, the lead had nowhere to fetch the member's patch from, naive merge produced a non-composing diff, and eval failed. With them on, the lead integrated and the team passed.

n=1 is far too small for any conclusion about which of the three features is doing the work, but the plumbing and the ablation signal both work end-to-end. Next-PR target: the same matrix across the core 10-pair subset with one-flag-off ablations.

ProKil · 2026-05-19T06:40:08Z

Ablation matrix (core 10-pair × marginal-effect design)

Used the marginal-effect design (1 baseline + 5 one-feature-off) instead of the full 2⁵=32 interaction matrix — answers the "what's each feature's contribution" question at ~1/5 the cost. Codex CLI on gpt-5.5, --setting team --backend docker.
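To make the cost comparison concrete, the two designs can be enumerated side by side. This is a sketch: the `ablate-XXXXX` naming (one bit per feature, in the order task_list, scratchpad, mcp, auto_refresh, protocol) is inferred from the config names in the results table, and the helper names are hypothetical.

```python
from itertools import product

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def config_name(bits: tuple[bool, ...]) -> str:
    # One character per feature: "1" = on, "0" = off.
    return "ablate-" + "".join("1" if b else "0" for b in bits)


def marginal_design() -> list[tuple[bool, ...]]:
    # Baseline with everything on, then one run per feature with only it off.
    n = len(FEATURES)
    runs = [tuple(True for _ in range(n))]
    for off in range(n):
        runs.append(tuple(i != off for i in range(n)))
    return runs


marginal = marginal_design()
full = list(product((False, True), repeat=len(FEATURES)))

print([config_name(bits) for bits in marginal])
print(f"{len(marginal)} runs vs {len(full)} ({len(marginal) / len(full):.2f} of the full matrix)")
```

The trade-off is that a one-feature-off design cannot see interactions, for example whether the task list matters more when the scratchpad is also off.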

Headline numbers

config	flag off	pass / 10	Δ vs baseline
`ablate-11111`	(baseline, all on)	6	—
`ablate-11011`	mcp	5	−1
`ablate-11101`	auto_refresh	5	−1
`ablate-11110`	protocol	5	−1
`ablate-01111`	task_list	4	−2
`ablate-10111`	scratchpad	3	−3

Per-task breakdown — the signal is concentrated

task                                  base no_tl no_sp no_mcp no_ar no_pr
dottxt_ai_outlines/1655 [1,3]          ✓    ✓    ✓    ✓    ✓    ✓
dspy/8563 [1,4]                        ✗    ✗    ✗    ✗    ✗    ✗
go_chi/27 [3,4]                        ✗    ✗    ✗    ✗    ✗    ✗
llama_index/17244 [5,6]                ✓    ✓    ✓    ✓    ✓    ✓
openai_tiktoken/0 [4,8]                ✓    ✓    ✗    ✓    ✓    ✓
pallets_click/2800 [1,4]               ✗    ✗    ✗    ✗    ✗    ✗
pallets_jinja/1559 [5,8]               ✓    ✗    ✗    ✗    ✗    ✓
pallets_jinja/1621 [6,10]              ✓    ✗    ✗    ✓    ✓    ✗
react_hook_form/153 [2,6]              ✗    ✗    ✗    ✗    ✗    ✗
typst/6554 [2,6]                       ✓    ✓    ✓    ✓    ✓    ✓

3 tasks pass in every config — easy enough that any coordination level works.
4 tasks fail in every config — too hard for codex regardless of coordination.
Only 3 tasks are sensitive to the ablation (openai_tiktoken/0, pallets_jinja/1559, pallets_jinja/1621). Effective sample size = 3.
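The three buckets above can be recomputed directly from the per-task breakdown. The sketch below hard-codes that table as bit strings (column order base, no_tl, no_sp, no_mcp, no_ar, no_pr); the encoding and function names are mine, not part of the repo.

```python
COLUMNS = ("base", "no_tl", "no_sp", "no_mcp", "no_ar", "no_pr")

# "1" = pass, "0" = fail, transcribed from the per-task breakdown.
RESULTS = {
    "dottxt_ai_outlines/1655": "111111",
    "dspy/8563": "000000",
    "go_chi/27": "000000",
    "llama_index/17244": "111111",
    "openai_tiktoken/0": "110111",
    "pallets_click/2800": "000000",
    "pallets_jinja/1559": "100001",
    "pallets_jinja/1621": "100110",
    "react_hook_form/153": "000000",
    "typst/6554": "111111",
}


def pass_counts(results: dict[str, str]) -> dict[str, int]:
    # Number of passing tasks per config column.
    return {col: sum(row[i] == "1" for row in results.values()) for i, col in enumerate(COLUMNS)}


def classify(results: dict[str, str]) -> dict[str, list[str]]:
    # A task is "sensitive" if its outcome differs across at least two configs.
    groups: dict[str, list[str]] = {"always_pass": [], "always_fail": [], "sensitive": []}
    for task, row in results.items():
        if set(row) == {"1"}:
            groups["always_pass"].append(task)
        elif set(row) == {"0"}:
            groups["always_fail"].append(task)
        else:
            groups["sensitive"].append(task)
    return groups


print(pass_counts(RESULTS))
print(classify(RESULTS)["sensitive"])
```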

What this says about each feature

feature	effect	confidence	reading
scratchpad	−3/10 (−3/3 of sensitive tasks)	strongest	`/workspace/shared/` is where members drop their patches and the lead picks them up. Without it, no integration is possible because the lead can't read what the member did.
task_list	−2/10 (−2/3 of sensitive tasks)	moderate	Without coordination state, the lead doesn't know what the member's working on. Still passes when the agents independently solve their feature without needing alignment (openai_tiktoken).
mcp	−1/10 (within noise)	low	wait_for_message long-poll. Codex already idles cheaply between commands; the marginal effect is small.
protocol	−1/10 (within noise)	low	Typed coop-request / coop-respond verbs. The agents in this task set don't actually use them much — message-passing via `coop-send` covers most needs.
auto_refresh	−1/10 (within noise)	n/a	`auto_refresh` only fires in Python-loop adapters (mini_swe_agent_v2 etc). Codex is a CLI adapter, so this flag is effectively a no-op for this sweep — the −1 is sample noise, not a real effect.

Caveats

n=3 effective is too small to distinguish mcp/protocol from noise. The scratchpad signal (3/3 of sensitive tasks) is the only one I'd call robust.
Codex CLI's --json stream doesn't emit cost or a model field, so result.json:total_cost shows $0 throughout. Real spend was ~$200-300 for this sweep based on token-count × public gpt-5.5 estimates (~46-50M input mostly cached + ~400K-1M output across the 6 configs).
For the auto_refresh measurement to mean anything, rerun on mini_swe_agent_v2 (where it actually fires).
For tighter CIs on mcp/protocol, expand the core subset or re-stratify to drop the always-pass / always-fail tasks.
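One conventional way to put a number on how much the paired pass/fail flips can support is an exact two-sided sign test on the discordant tasks (the exact McNemar test). This is not part of the sweep, only a sketch of how a rerun could be scored. The inputs are the counts of tasks that passed only in the baseline and only in the ablated config, read off the per-task breakdown.

```python
from math import comb


def sign_test_p(n_base_only: int, n_ablated_only: int) -> float:
    """Two-sided exact sign test p-value on discordant pairs."""
    n = n_base_only + n_ablated_only
    if n == 0:
        return 1.0  # No task changed outcome, so there is nothing to test.
    k = min(n_base_only, n_ablated_only)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# (passed only in baseline, passed only when ablated) per feature.
flips = {"scratchpad": (3, 0), "task_list": (2, 0), "mcp": (1, 0), "protocol": (1, 0)}
for feature, (base_only, ablated_only) in flips.items():
    print(feature, sign_test_p(base_only, ablated_only))
```

Only discordant tasks enter the test, which is the same reason dropping the always-pass and always-fail tasks costs nothing statistically.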

Raw data

ablation_matrix.csv (in repo working dir, gitignored)
Per-run logs under logs/ablate-*/team/...

🤖 Generated with Claude Code

ProKil mentioned this pull request May 19, 2026

docs: team-harness ablation report (flash, codex/gpt-5.5) #59

ProKil wants to merge 1 commit into team-all-adapters from team-harness-module
