
team_harness: extract team mode as standalone harness + ablation flags #58

Open
ProKil wants to merge 1 commit into team-all-adapters from team-harness-module


ProKil commented May 19, 2026

Summary

Stacks on #55. Lifts the team-mode coordination primitives out of cooperbench/agents/_team (private, benchmark-internal) into cooperbench/team_harness (public, library-shaped) so the algorithm can be studied and consumed by other benchmarks — the long-horizon target discussed in #52's followups. Adds five per-feature ablation flags so we can measure each coordination mechanism's contribution independently.

What's new

cooperbench.team_harness — public package

A documented sibling of cooperbench/agents, not nested under it. Same modules as the old _team/ (task_list, protocol, mcp_server, prompt, loop_refresh, fs_mirror, metrics, runtime, coop_task, install_snippet.sh) plus a facade in __init__.py:

| Public symbol | What it is |
| --- | --- |
| TeamHarnessConfig | Frozen dataclass of five booleans (task_list, scratchpad, mcp, auto_refresh, protocol). with_only("task_list", "mcp") and disabled() helpers. |
| TeamSession | Per-run object bundling run_id / redis_url / agents / team_volume / config. Adapter-facing factories: env_for, scratchpad_mount_args, mcp_config, prompt_for, prompt_section, loop_poller, task_list_client, harvest_metrics. Each factory returns None / [] / {} when its feature is disabled so adapters write one code path. |
| Path constants | COOP_TASK_SCRIPT_PATH, INSTALL_SNIPPET_PATH, MCP_SERVER_SCRIPT_PATH, MCP_SERVER_NAME. Adapters import these instead of computing Path(__file__).parent.parent / "_team". |
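A minimal sketch of what a config of this shape could look like. The class name HarnessConfigSketch is hypothetical, and the semantics of with_only (named features on, everything else off) and disabled (everything off) are assumptions inferred from the helper names, not taken from the real TeamHarnessConfig.

```python
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class HarnessConfigSketch:
    """Illustrative stand-in for TeamHarnessConfig (not the real class)."""

    task_list: bool = True
    scratchpad: bool = True
    mcp: bool = True
    auto_refresh: bool = True
    protocol: bool = True

    @classmethod
    def disabled(cls) -> "HarnessConfigSketch":
        # Assumed semantics: every feature off.
        return cls(**{f.name: False for f in fields(cls)})

    @classmethod
    def with_only(cls, *names: str) -> "HarnessConfigSketch":
        # Assumed semantics: the named features on, everything else off.
        known = {f.name for f in fields(cls)}
        unknown = set(names) - known
        if unknown:
            raise ValueError(f"unknown feature(s): {sorted(unknown)}")
        return cls(**{name: name in names for name in known})


cfg = HarnessConfigSketch.with_only("task_list", "mcp")
print(asdict(cfg))
```

Freezing the dataclass means one instance can be handed to every adapter without any of them mutating it.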

TeamSession.redis_url is the host-side URL; env_for() rewrites localhost / 127.0.0.1 to host.docker.internal so adapters don't have to plumb that themselves. The rewrite is duplicated from _coop.runtime rather than imported, because the harness is meant to be portable to other benchmarks that don't ship _coop.
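The loopback rewrite could be sketched as below. rewrite_for_container is a hypothetical name and this is not the code in env_for() or _coop.runtime; it only illustrates the localhost / 127.0.0.1 to host.docker.internal substitution while keeping port, credentials, and path.

```python
from urllib.parse import urlsplit, urlunsplit

LOOPBACK_HOSTS = {"localhost", "127.0.0.1"}


def rewrite_for_container(redis_url: str) -> str:
    """Point loopback Redis URLs at the Docker host; pass others through."""
    parts = urlsplit(redis_url)
    if parts.hostname not in LOOPBACK_HOSTS:
        return redis_url
    # Rebuild the netloc, keeping any userinfo and port from the original.
    userinfo, _, _ = parts.netloc.rpartition("@")
    netloc = "host.docker.internal"
    if userinfo:
        netloc = f"{userinfo}@{netloc}"
    if parts.port is not None:
        netloc = f"{netloc}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


print(rewrite_for_container("redis://localhost:6379/0"))
```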

Ablation flags

Five --team-no-* flags on cooperbench run, each gating one coordination mechanism:

--team-no-task-list      shared Redis task list + pre-seeding + metrics
--team-no-scratchpad     /workspace/shared Docker volume
--team-no-mcp            wait_for_message MCP registration
--team-no-auto-refresh   in-loop task-list summary injection (Python-loop adapters)
--team-no-protocol       coop-request / coop-respond / coop-pending verbs

The lead/member role split stays on either way — without it team mode collapses to coop, so it's the always-on baseline.
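The mapping from flags to feature booleans can be sketched with argparse. This is an illustration under assumptions: the PR does not say which CLI library cooperbench uses, and build_parser / features_from_argv are hypothetical names.

```python
import argparse

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ablation-flags-sketch")
    for feature in FEATURES:
        # task_list -> --team-no-task-list; passing the flag turns the feature off.
        flag = "--team-no-" + feature.replace("_", "-")
        parser.add_argument(flag, dest=feature, action="store_false", default=True)
    return parser


def features_from_argv(argv: list[str]) -> dict[str, bool]:
    args = build_parser().parse_args(argv)
    return {feature: getattr(args, feature) for feature in FEATURES}


print(features_from_argv(["--team-no-task-list", "--team-no-scratchpad", "--team-no-mcp"]))
```

Every feature defaults to on, so a run with no --team-no-* flags is the all-on baseline.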

result.json now records which features were enabled:

"team_features": {
  "task_list": false,
  "scratchpad": false,
  "mcp": false,
  "auto_refresh": true,
  "protocol": true
}

so post-hoc analysis can attribute pass-rate deltas to the specific feature that was off, without cross-referencing CLI invocations.
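A post-hoc script could read the block back like this. It is a sketch that assumes only the team_features key shown above; disabled_features and load_result are hypothetical helpers, and the rest of the result.json layout is not assumed.

```python
import json
from pathlib import Path


def disabled_features(result: dict) -> list[str]:
    """Names of the features recorded as off, sorted alphabetically."""
    features = result.get("team_features", {})
    return sorted(name for name, enabled in features.items() if not enabled)


def load_result(path: Path) -> dict:
    # Plain JSON load; no schema beyond "team_features" is assumed.
    return json.loads(path.read_text())


example = {
    "team_features": {
        "task_list": False,
        "scratchpad": False,
        "mcp": False,
        "auto_refresh": True,
        "protocol": True,
    }
}
print(disabled_features(example))
```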

Adapter refactor

claude_code, codex, mini_swe_agent_v2, swe_agent, openhands_agent_sdk all accept a new team_features: TeamHarnessConfig | None kwarg and construct a local TeamSession instead of calling loose helpers. Each adapter's team-mode blocks (prompt assembly, env vars, scratchpad mount, MCP install, in-loop poller, CLI install) gate on session.config.<feature>. For example, the MCP install in claude_code is now:

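# Build the MCP config only when a team session exists; mcp_config() returns None when the mcp feature is off.
mcp_config = team_session.mcp_config(container_script_path=CONTAINER_TEAM_MCP_PATH) if team_session else None
if mcp_config is not None:
    # Copy the MCP server script into the container, then register it in Claude's config file.
    write_file_in_container(env, CONTAINER_TEAM_MCP_PATH, TEAM_MCP_SCRIPT_PATH.read_text())
    write_file_in_container(env, f"{CONTAINER_CLAUDE_CONFIG_DIR}/.claude.json", json.dumps(mcp_config, indent=2))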

The only gate is the session's config; no is_team flag is spread around the adapter.
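The "factory returns an empty value when its feature is off" pattern can be sketched as follows. SessionSketch is a hypothetical stand-in, not TeamSession: the -v volume mount is standard Docker CLI syntax, the /workspace/shared path comes from the flag list above, and the mcp_config payload is a placeholder whose real shape is not shown in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SessionSketch:
    """Illustrative stand-in for TeamSession (not the real class)."""

    team_volume: str
    scratchpad: bool = True
    mcp: bool = True

    def scratchpad_mount_args(self) -> list[str]:
        # Empty list when the feature is off or no volume was created.
        if not self.scratchpad or not self.team_volume:
            return []
        return ["-v", f"{self.team_volume}:/workspace/shared"]

    def mcp_config(self, container_script_path: str) -> Optional[dict]:
        if not self.mcp:
            return None
        # Placeholder payload; the real config shape is defined by the harness.
        return {"script": container_script_path}


# The adapter writes the same lines whether or not the feature is enabled.
session = SessionSketch(team_volume="cb-team-example", mcp=False)
extra_run_args: list[str] = []
extra_run_args.extend(session.scratchpad_mount_args())
mcp_config = session.mcp_config(container_script_path="/tmp/mcp_server.py")
```

Extending a list with [] and checking a single value for None keeps the feature decision inside the session object rather than in each adapter.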

Tests

tests/agents/_team -> tests/team_harness (rename; the 83 existing tests still pass). Plus:

  • tests/team_harness/test_session.py (29 new): covers TeamHarnessConfig defaults / with_only / disabled, TeamSession.lead/role_for/is_active, env_for (default, localhost rewrite, non-localhost passthrough, empty when all consumers off, full when only mcp remains), scratchpad_mount_args (default, off, empty volume), mcp_config (default, off), task_list_client (on, off), loop_poller (on, off), harvest_metrics (active, off, None client), prompt_for (lead, member), prompt_section (default, single-agent collapse).

  • tests/runner/test_team.py (4 new): team_features is recorded in result.json by default; the same config instance propagates to every adapter call; --team-no-task-list skips the Redis pre-seed and produces an empty metrics dict while keeping the role split; --team-no-scratchpad clears config["team_volume"] to the empty string.

Full suite: 363 passed, 63 skipped. Ruff / format / mypy all green.

End-to-end on dottxt_ai_outlines/1371 [1,2] with codex --setting team --backend docker

| Run | Flags | task_log.json | tasks.json | result.metrics | cb-team-* volume | team_features |
| --- | --- | --- | --- | --- | --- | --- |
| smoke-team-default | (defaults) | ✓ | ✓ | {tasks_total: 2, tasks_done: 2, time_to_first_claim_seconds: 67.3, claims_per_agent: {agent2: 1}} | created (cb-team-96067cd3) | all true |
| smoke-team-ablated | --team-no-task-list --team-no-scratchpad --team-no-mcp | ✗ | ✗ | {} | not created | task_list/scratchpad/mcp: false, auto_refresh/protocol: true |

Both runs ended with 2/2 agents in the Submitted state.

Out of scope (next PRs)

  • Long-horizon benchmark consumer that drives task creation dynamically rather than pre-seeding one task per feature — the actual second-consumer this PR is in service of. Without it the TeamSession API is validated only by CooperBench itself.
  • Automatic search over TeamHarnessConfig × prompt variants × refresh cadence, with eval pass-rate as the objective and the coordination metrics (time_to_first_claim_seconds, unowned_at_end, claims_per_agent) as cheap first-pass proxies.

Test plan

  • ruff check, ruff format --check, mypy, pytest tests/ (all green locally)
  • Default smoke: uv run cooperbench run -a codex -r dottxt_ai_outlines_task -t 1371 -f 1,2 --setting team --backend docker --no-auto-eval -n smoke-team-default
  • Ablation smoke: same + --team-no-task-list --team-no-scratchpad --team-no-mcp
  • Confirm result.json:team_features matches the flags on both runs

🤖 Generated with Claude Code

Move team-mode primitives from cooperbench/agents/_team (private) to
cooperbench/team_harness (public, library-shaped) so other benchmarks
can consume the multi-agent coordination algorithm without depending on
CooperBench's task layout.

Adds TeamSession + TeamHarnessConfig:

- TeamSession bundles per-run state (run_id, namespaced Redis URL,
  ordered agent list, scratchpad volume name) with the feature config
  and exposes adapter-facing factories that each return None / [] / {}
  when their feature is disabled, so adapter code paths collapse to one
  branch:

    coop_env.update(session.env_for(agent_id))
    extra_run_args.extend(session.scratchpad_mount_args())
    mcp_config = session.mcp_config(container_script_path=...)

- TeamHarnessConfig is a frozen dataclass of five per-feature booleans
  (task_list, scratchpad, mcp, auto_refresh, protocol).  The lead/member
  role split is the always-on baseline -- without it team is just coop.

Wires five --team-no-* CLI flags through cli.py -> runner.run ->
runner.core -> runner.team -> each adapter.  result.json now records
team_features so post-hoc analysis can attribute deltas to the feature
that was off.

Adapter refactor: claude_code, codex, mini_swe_agent_v2, swe_agent, and
openhands_agent_sdk now accept team_features kwarg and construct a
local TeamSession instead of calling loose helpers.  Each adapter's
team-mode blocks (prompt, env, mount, MCP, install) gate on the
session's config.

Tests: tests/agents/_team -> tests/team_harness (rename), new
test_session.py (29 cases) covers the facade, four new ablation tests
in tests/runner/test_team.py verify the runner-side gating.  Full suite
363 passed, 63 skipped; ruff/format/mypy clean.

End-to-end smoke on dottxt_ai_outlines/1371 [1,2] with codex (docker):
- Default: writes task_log.json + tasks.json + metrics, cb-team-<run>
  volume created.
- --team-no-task-list --team-no-scratchpad --team-no-mcp: no task_log /
  tasks files, empty metrics dict, no volume.  team_features in
  result.json reflects the requested ablation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ProKil commented May 19, 2026

End-to-end eval results (was missing from the original PR body — ran cooperbench eval after pushing):

| Run | Merge strategy | Both pass | Pass rate |
| --- | --- | --- | --- |
| smoke-team-default | solo-agent1 (lead integrated both features) | ✓ | 100% |
| smoke-team-ablated (--team-no-task-list --team-no-scratchpad --team-no-mcp) | naive (diff-cat) | ✗ | 0% |

Real pass-rate delta in the expected direction — with the shared task list / scratchpad / MCP off, the lead had nowhere to fetch the member's patch from, naive merge produced a non-composing diff, and eval failed. With them on, the lead integrated and the team passed.

n=1 is far too small for any conclusion about which of the three features is doing the work, but the plumbing and the ablation signal both work end-to-end. Next-PR target: the same matrix across the core 10-pair subset with one-flag-off ablations.


ProKil commented May 19, 2026

Ablation matrix (core 10-pair × marginal-effect design)

Used the marginal-effect design (1 baseline + 5 one-feature-off) instead of the full 2⁵=32 interaction matrix — answers the "what's each feature's contribution" question at ~1/5 the cost. Codex CLI on gpt-5.5, --setting team --backend docker.
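The two designs can be enumerated side by side. This is a sketch: the bit order (task_list, scratchpad, mcp, auto_refresh, protocol) is inferred from the config names in the headline table, and config_name / marginal_design are hypothetical helpers, not part of the sweep tooling.

```python
from itertools import product

FEATURES = ("task_list", "scratchpad", "mcp", "auto_refresh", "protocol")


def config_name(bits: tuple[int, ...]) -> str:
    # One digit per feature, 1 = on, 0 = off.
    return "ablate-" + "".join(str(b) for b in bits)


def marginal_design() -> list[tuple[int, ...]]:
    """Baseline (all on) plus one config per feature with only that feature off."""
    n = len(FEATURES)
    baseline = (1,) * n
    one_off = [tuple(0 if i == off else 1 for i in range(n)) for off in range(n)]
    return [baseline, *one_off]


full_factorial = list(product((0, 1), repeat=len(FEATURES)))
marginal = marginal_design()
print(len(marginal), "of", len(full_factorial), "configs")
print([config_name(bits) for bits in marginal])
```

The marginal design measures each feature's effect only relative to the all-on baseline; interactions between features need the larger matrix.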

Headline numbers

| Config | Flag off | Pass / 10 | Δ vs baseline |
| --- | --- | --- | --- |
| ablate-11111 | (baseline, all on) | 6 | |
| ablate-11011 | mcp | 5 | −1 |
| ablate-11101 | auto_refresh | 5 | −1 |
| ablate-11110 | protocol | 5 | −1 |
| ablate-01111 | task_list | 4 | −2 |
| ablate-10111 | scratchpad | 3 | −3 |

Per-task breakdown — the signal is concentrated

task                                  base no_tl no_sp no_mcp no_ar no_pr
dottxt_ai_outlines/1655 [1,3]          ✓    ✓    ✓    ✓    ✓    ✓
dspy/8563 [1,4]                        ✗    ✗    ✗    ✗    ✗    ✗
go_chi/27 [3,4]                        ✗    ✗    ✗    ✗    ✗    ✗
llama_index/17244 [5,6]                ✓    ✓    ✓    ✓    ✓    ✓
openai_tiktoken/0 [4,8]                ✓    ✓    ✗    ✓    ✓    ✓
pallets_click/2800 [1,4]               ✗    ✗    ✗    ✗    ✗    ✗
pallets_jinja/1559 [5,8]               ✓    ✗    ✗    ✗    ✗    ✓
pallets_jinja/1621 [6,10]              ✓    ✗    ✗    ✓    ✓    ✗
react_hook_form/153 [2,6]              ✗    ✗    ✗    ✗    ✗    ✗
typst/6554 [2,6]                       ✓    ✓    ✓    ✓    ✓    ✓
  • 3 tasks pass in every config — easy enough that any coordination level works.
  • 4 tasks fail in every config — too hard for codex regardless of coordination.
  • Only 3 tasks are sensitive to the ablation (openai_tiktoken/0, pallets_jinja/1559, pallets_jinja/1621). Effective sample size = 3.
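The tallies above can be reproduced from the per-task table. In this sketch the rows are transcribed by hand from that table (1 = ✓, 0 = ✗, in column order); nothing is read from ablation_matrix.csv, whose layout is not shown here.

```python
CONFIGS = ("base", "no_tl", "no_sp", "no_mcp", "no_ar", "no_pr")

# One string per task, one character per config in CONFIGS order.
RESULTS = {
    "dottxt_ai_outlines/1655": "111111",
    "dspy/8563": "000000",
    "go_chi/27": "000000",
    "llama_index/17244": "111111",
    "openai_tiktoken/0": "110111",
    "pallets_click/2800": "000000",
    "pallets_jinja/1559": "100001",
    "pallets_jinja/1621": "100110",
    "react_hook_form/153": "000000",
    "typst/6554": "111111",
}

# Passes per config (column sums).
pass_counts = {
    cfg: sum(int(row[i]) for row in RESULTS.values())
    for i, cfg in enumerate(CONFIGS)
}

# A task is ablation-sensitive if its outcome differs across configs.
sensitive = sorted(task for task, row in RESULTS.items() if len(set(row)) > 1)
always_pass = sorted(task for task, row in RESULTS.items() if set(row) == {"1"})
always_fail = sorted(task for task, row in RESULTS.items() if set(row) == {"0"})

print(pass_counts)
print(sensitive)
```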

What this says about each feature

| Feature | Effect | Confidence | Reading |
| --- | --- | --- | --- |
| scratchpad | −3/10 (−3/3 of sensitive tasks) | strongest | /workspace/shared/ is where members drop their patches and the lead picks them up. Without it, no integration is possible: the lead can't read what the member did. |
| task_list | −2/10 (−2/3 of sensitive tasks) | moderate | Without coordination state, the lead doesn't know what the member is working on. Still passes when the agents independently solve their feature without needing alignment (openai_tiktoken). |
| mcp | −1/10 (within noise) | low | wait_for_message long-poll. Codex already idles cheaply between commands; the marginal effect is small. |
| protocol | −1/10 (within noise) | low | Typed coop-request / coop-respond verbs. The agents in this task set don't actually use them much; message-passing via coop-send covers most needs. |
| auto_refresh | −1/10 (within noise) | n/a | auto_refresh only fires in Python-loop adapters (mini_swe_agent_v2 etc). Codex is a CLI adapter, so this flag is effectively a no-op for this sweep; the −1 is sample noise, not a real effect. |

Caveats

  • n=3 effective is too small to distinguish mcp/protocol from noise. The scratchpad signal (3/3 of sensitive tasks) is the only one I'd call robust.
  • Codex CLI's --json stream doesn't emit cost or a model field, so result.json:total_cost shows $0 throughout. Real spend was ~$200-300 for this sweep based on token-count × public gpt-5.5 estimates (~46-50M input mostly cached + ~400K-1M output across the 6 configs).
  • For the auto_refresh measurement to mean anything, rerun on mini_swe_agent_v2 (where it actually fires).
  • For tighter CIs on mcp/protocol, expand the core subset or re-stratify to drop the always-pass / always-fail tasks.

Raw data

  • ablation_matrix.csv (in repo working dir, gitignored)
  • Per-run logs under logs/ablate-*/team/...

🤖 Generated with Claude Code
