Skip to content

feat(mcp): run_command tool for visible-pane shell with read-back#262

Open
kshivang wants to merge 4 commits into
masterfrom
dev
Open

feat(mcp): run_command tool for visible-pane shell with read-back#262
kshivang wants to merge 4 commits into
masterfrom
dev

Conversation

@kshivang
Copy link
Copy Markdown
Owner

Summary

  • New blocking MCP tool mcp__bossterm__run_command that runs a shell command in a visible BossTerm pane and returns its stdout/stderr, exit code, and duration. AI clients (Claude Code, Codex, Cursor) now have a default shell that the user can watch in their real terminal — no more invisible Bash calls.
  • Cross-client preference advertising via the MCP initialize-response instructions field — every spec-compliant client sees "prefer run_command over your built-in shell" in its system prompt at handshake, no per-project config needed.
  • ~/.bossterm/mcp.port marker file written atomically on every successful bind, deleted on clean stop — lets external hooks (e.g. a Claude Code PreToolUse hook) decide reachability with a 5 ms stat + nc -z instead of a 300 ms HTTP probe.

What's in the box

Server tool (BossTermMcpServer.registerRunCommand):

  • Pane resolution: explicit pane_id > cached "MCP scratch pane" for the tab > newly-created via splitHorizontal/splitVertical/createTab.
  • Per-tab cache Mutex so two concurrent calls can't both miss the cache and orphan a pane.
  • Per-pane Mutex so two concurrent calls into the same pane can't interleave their scripts in the shell's stdin.
  • Listener-before-send: registers a transient CommandStateListener BEFORE the script is written so OSC 133;A/B/D are all captured.
  • Absolute history-line tracking for output slicing — slice stays correct when output scrolls into history mid-command.
  • Alt-screen detection: polls isUsingAlternateBuffer while waiting for D. On transition, returns error: "TUI detected" and leaves the process alive so the caller can drive it via send_input + read_scrollback.
  • Reserved (UNDISABLABLE_TOOLS): manage_tools rejects attempts to hide it, and applyDisabledSet defensively filters reserved names against hand-edits of disabledMcpTools in settings.json.

Settings (TerminalSettings.kt, McpSettingsSection.kt):

  • mcpRunCommandDefaultTimeoutMs (120_000, UI input).
  • mcpRunCommandDefaultPanel ("horizontal_split", UI dropdown; invalid values silently fall back to horizontal_split so a typo'd setting never bubbles up as a per-call error).
  • mcpRunCommandMaxOutputBytes (120_000, advanced — sized to fit under mcpMaxAnswerChars=150_000 with JSON-wrapper headroom so the response-shortening ladder never has to truncate run_command output).
  • mcpRunCommandShellReadyTimeoutMs (1_500, advanced).
  • UI also gained an "always-exposed meta-tool manage_tools" blurb so users see that tool exists even though it's not in the toggle list.

Docs (README.md, docs/mcp-server.md):

  • Per-tool run_command section: schema, response shape, error contract (TUI / timeout / shell-integration missing).
  • ~/.bossterm/mcp.port marker file note + initialize-time instructions section.
  • Settings reference table rows for the four new keys.
  • "Using as Claude Code's default shell" how-to: copy-paste hook script (~/.claude/hooks/prefer-bossterm.sh), settings.json snippet, optional CLAUDE.md addendum.
  • Troubleshooting notes for multi-instance / sudo / missing nc.

Rigorous review pass

Three independent review agents (server-side, settings/UI, integration/hook) audited the implementation before this commit. They surfaced 3 P0 bugs, 5 P1 issues, 4 P2 doc/UX items, and verified 11 items as correct. All P0 and P1 findings are addressed here:

What's NOT in this PR

  • User-global config files (~/.claude/CLAUDE.md, ~/.claude/hooks/prefer-bossterm.sh, ~/.claude/settings.json hook entry) — per-user, lives outside the repo. The docs walk through wiring them up.

Test plan

  • ./gradlew :compose-ui:compileKotlinDesktop --no-daemon — BUILD SUCCESSFUL.
  • Hook script smoke-tested in all four branches: no marker / dead port / live port / no nc — all PASS.
  • Manual (reviewer): launch BossTerm, enable MCP, attach to Claude Code, ask Claude to ls — pane should appear in the active tab and output returns to Claude.
  • Manual (reviewer): run two run_command calls in parallel from MCP Inspector against the same tab — confirm only ONE new pane spawns (cache race fix).
  • Manual (reviewer): call run_command with script: "vim\n" — expect error: "TUI detected", pane stays alive.
  • Manual (reviewer): kill BossTerm with MCP enabled — confirm ~/.bossterm/mcp.port is gone.
  • Manual (reviewer): hand-edit ~/.bossterm/settings.json to "mcpRunCommandDefaultPanel": "reuse" — first run_command call should silently fall back to horizontal_split instead of erroring.

🤖 Generated with Claude Code

Today AI clients (Claude Code, Codex, Cursor, …) running shell commands
go through their built-in Bash tool — invisible to the user. The user
asked for shell work to happen in their actual BossTerm window while
the agent still receives stdout/stderr + exit code, so they can watch
what's running and the agent isn't blind.

New write tool mcp__bossterm__run_command, blocking variant of
run_in_panel:
- Spawns (or reuses) a visible BossTerm pane, writes the script, waits
  for OSC 133;D, returns {exitCode, output, durationMs, paneId,
  truncated, error}.
- Pane reuse: first call splits below the focused pane (or whatever
  mcpRunCommandDefaultPanel says); subsequent calls in the same tab
  reuse that pane. The cache (tabId → paneId) is guarded by a per-tab
  Mutex so two concurrent calls can't both miss the cache and orphan
  a pane. A per-pane Mutex serializes overlapping calls into the same
  pane's stdin.
- Output capture: absolute history-line tracking via the OSC 133;B and
  ;D events, so the slice stays correct when output scrolls into
  history mid-command. ANSI-stripped via the emulator's per-line .text
  accessor (same path read_scrollback uses).
- Alt-screen / TUI detection: polls textBuffer.isUsingAlternateBuffer
  while waiting for D; on transition, completes with
  error: "TUI detected …" instead of timing out. The process stays
  alive so the caller can drive it via send_input + read_scrollback.
- Reserved tool: added to UNDISABLABLE_TOOLS so manage_tools rejects
  attempts to hide it (otherwise the PreToolUse hook below would route
  Bash to a tool that's no longer on the wire). applyDisabledSet also
  filters reserved names defensively against hand-edits of
  disabledMcpTools in settings.json.

Cross-client preference advertising:
- Server now passes a BOSSTERM_MCP_INSTRUCTIONS string into the SDK's
  4-arg Server constructor. The MCP initialize response surfaces it
  into every client's system prompt, so the "prefer run_command over
  your built-in shell" guidance lands without per-project config.

Port marker file for cheap external probes:
- BossTermMcpManager atomically writes the actual bound port (after
  the 7676→7685 fallback) to ~/.bossterm/mcp.port on every successful
  bind, deletes it on clean stop. Lets the user-global PreToolUse hook
  decide reachability with stat + nc -z (~5ms) instead of an HTTP
  probe with timeout (~300ms worst case).

Settings:
- mcpRunCommandDefaultTimeoutMs (default 120_000, UI input)
- mcpRunCommandDefaultPanel (default horizontal_split, UI dropdown;
  invalid values silently fall back to horizontal_split — never bubble
  the user's bad setting up as a per-call error)
- mcpRunCommandMaxOutputBytes (default 120_000, advanced — sized to
  fit under mcpMaxAnswerChars=150_000 with JSON-wrapper headroom)
- mcpRunCommandShellReadyTimeoutMs (default 1_500, advanced)

UI:
- Two new rows under "BossTerm MCP Server" (panel dropdown + timeout
  input).
- ExposedToolsSection grays out reserved tools and gains a blurb
  surfacing the always-exposed manage_tools meta-tool (which lives
  outside BUILT_IN_READ/WRITE_TOOLS).

Docs:
- README intro updated; new "Using as Claude Code's default shell"
  subsection pointing at the docs.
- docs/mcp-server.md: full run_command schema + response shape +
  error contract, ~/.bossterm/mcp.port marker file note, initialize-
  time instructions section, settings reference rows, "Using as
  Claude Code's default shell" how-to (copy-paste hook script +
  settings.json + CLAUDE.md), multi-instance + sudo + nc caveats in
  Troubleshooting.

Verification:
- ./gradlew :compose-ui:compileKotlinDesktop --no-daemon clean.
- Hook script smoke-tested across all four branches (no marker /
  dead port / live port / no nc).
- End-to-end (MCP Inspector + two concurrent run_command calls + real
  Bash hooking) is still manual; the listener-before-send, alt-screen
  detection, and cache lock can't be exercised without a live UI.

User-global config (~/.claude/CLAUDE.md, hooks/prefer-bossterm.sh,
settings.json hook entry) is per-user and lives outside this repo —
the docs walk through it.

Generated with [Claude Code](https://claude.com/claude-code)
@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Review of PR #262run_command tool

Solid, well-thought-out feature; the per-tab cache lock + per-pane mutex layering is the right model, and the absolute-history-line slice is clever. A few real bugs and contract gaps worth fixing before merge.

Issues

1. Hook script contradicts its own docs (and the PR description) when nc is missing. docs/mcp-server.md:524-527 says the hook "fails closed (lets Bash through) when nc is missing", and the PR description lists this as a P0 fix. The actual script falls through and emits the deny JSON when command -v nc returns non-zero, so a missing nc ends up denying Bash and routing Claude to a port no one verified is alive — exactly the behavior the P0 was supposed to remove. The minimal fix:

if command -v nc >/dev/null 2>&1; then
    nc -z 127.0.0.1 "$port" >/dev/null 2>&1 || exit 0
else
    exit 0
fi

2. panel: "new_tab" confuses tab ids and pane ids. state.createTab(...) returns a tab id (TabbedTerminalState.kt:226-236), but the new-tab branch in registerRunCommand stuffs that into newPaneId and then:

  • caches it as mcpScratchPanes[origTabId] = newTabId
  • returns it as RunCommandResult.paneId while RunCommandResult.tabId stays the original (now-background) tab

Consequences:

  • The next run_command call with the same (original) tab_id reads the cache, gets the new tab's id as a "pane id", and state.findSession(origTabId, newTabId) returns null because that id isn't in origTabId's split tree. The entry is cleared as stale and a brand-new tab is created — every call. So panel: "new_tab" never gets the documented reuse behavior.
  • The response's tabId doesn't reflect where the pane actually lives, so a caller passing paneId back without tab_id will land in whatever tab is active at call time, not the one the pane is in.

Fix: when panel == "new_tab", treat the return value as a tab id — cache mcpScratchPanes[newTabId] = focusedPaneIdOfNewTab and return that tab id in RunCommandResult.tabId.

3. Multi-statement scripts (raw \n) corrupt the result. The schema says "Shell command to run. A trailing newline is added if absent." but doesn't forbid embedded \n. With e.g. "cd /tmp\nls", the shell drives two B/D cycles. finishedSignal.complete() (BossTermMcpServer.kt:1168-1172) is one-shot and fires at the first D — but historyAtB/cursorYAtB get re-assigned on every onCommandStarted, so by the time the awaiter wakes up the start mark may already point at the second statement. Net effect is non-deterministic: either the exit code is the first statement's and the slice runs past it into the second, or the start mark is stale and the slice misses the first statement's output entirely. Two reasonable options:

  • Document this and tell callers to use bash -lc '...' for compound scripts.
  • Stop overwriting historyAtB after the first B, mirroring the one-shot pattern already used for finishedSignal.

4. Stale OSC 133;D from concurrent pane activity can spuriously finish the call. On a reused pane, the listener is wired up before writeUserInput. If a previous tool (a fire-and-forget run_in_panel script, or a keystroke the user typed) is mid-command, the resulting D event lands in our finishedSignal and we return that command's exit code with a garbled output slice. The per-pane mutex serializes run_command against itself but doesn't protect against other writers on the same pane. Cheap mitigation: ignore onCommandFinished until we've observed an onCommandStarted that fired after our writeUserInput.

5. mcpRunCommandMaxOutputBytes is measured in UTF-16 chars, not bytes. sliceCommandOutput checks sb.length + line.length + 1 > maxOutputBytes (BossTermMcpServer.kt ~1316). For ASCII this matches, but a pane full of emoji/CJK can return a string whose UTF-8 byte count is 2–4× the limit — silently breaking the "fits under mcpMaxAnswerChars with JSON-wrapper headroom" guarantee that's the whole point of the 120_000 default. Either rename to …MaxOutputChars everywhere, or measure line.toByteArray(Charsets.UTF_8).size.

6. Per-pane Mutex map leaks on pane close. McpTerminalRegistry.paneMutexes only sheds entries in clearScratchPane(tabId, paneId), which only runs on the cache-miss path. If a user closes the cached pane and never calls run_command against that tab again, the mutex sits forever. The leak is human-scale but unbounded across long-lived sessions — easy to plug by hooking pane-close (TerminalTab already has onProcessExit).

Doc/UX nits

  • docs/mcp-server.md:352 still says "output is capped at 200 KB" — the actual default cap is 120_000 (~120 KB) per the settings table on the same page.
  • The hook script does cat >/dev/null 2>&1 || true; cat redirected to /dev/null doesn't fail on EOF, so the || true is dead code under set -e. Minor.
  • script: "\n" (whitespace only) passes the script.isEmpty() guard and writes a bare newline. Consider script.isBlank().

Things that look correct

  • UNDISABLABLE_TOOLS defensive subtraction in applyDisabledSet (belt-and-braces against hand-edited settings).
  • Atomic move via Files.move(..., ATOMIC_MOVE, REPLACE_EXISTING) for the marker file.
  • Per-tab cache Mutex properly closes the create-and-cache race for concurrent calls on the same tab.
  • Listener-before-send ordering plus freshlyCreated gating so reused panes don't deadlock waiting for an A that already fired.
  • tuiPoller cancellation inside the coroutineScope so the poll loop doesn't outlive the await.

Test coverage

The manual checklist is reasonable, but none of the locking, slicing, or pane-resolution logic has automated coverage. A few cheap JVM tests would be high-value:

  • Slice math: feed a fake TerminalTextBuffer with known scrollback deltas, assert sliceCommandOutput returns the right rows when historyDelta > 0 and when the start mark falls past historyLinesCount.
  • Cache race: two concurrent coroutines hitting registerRunCommand for the same tab_id while a stub state.splitHorizontal records each call — assert exactly one creation.
  • UNDISABLABLE_TOOLS filter: persist disabledMcpTools = setOf("manage_tools", "run_command") and assert both tools remain registered after applyDisabledSet.

Generated with Claude Code

…stered

Today McpTerminalRegistry.primaryState() returns the first-registered
window, so run_command / run_in_panel / get_active_tab without a tab_id
target the wrong window whenever multiple BossTerm windows are open —
the user sees commands run somewhere they aren't looking. The correct
invariant is "the window the calling MCP client is running INSIDE."

New file ProcessAncestry.kt resolves that window by:

1. Resolving the PID owning the loopback TCP socket on the client side
   (lsof on macOS; /proc/net/tcp + /proc/<pid>/fd on Linux).
2. Walking the parent process tree from that PID.
3. Matching the first ancestor PID against every tracked pane's shell
   PID across all registered states.
4. Returning the state owning that pane — or null if no ancestor
   matches (client running outside any BossTerm pane: Claude Desktop,
   external Inspector, CI…).

BossTermMcpManager's Ktor interceptor calls the resolver on every
incoming MCP request and records the result in
McpTerminalRegistry.lastResolvedClientWindow (AtomicReference). The
three tool handlers that previously used registry.primaryState()
(get_active_tab, run_in_panel, run_command) now read the resolved
client window first, with the existing primaryState() as fallback.

Design notes:

- Resolution failures are silent — caller falls back to first-
  registered. Never crashes a tool call on a parent-walk hiccup.
- Concurrent multi-client racing is acceptable last-writer-wins for
  v1 — the dominant single-client case is consistent.
- Resolver runs INSIDE the existing intercept block, AFTER the
  DNS-rebinding check, so a hostile request never triggers lsof.
- Stale lastResolvedClientWindow entries (window closed) are detected
  lazily on read (state is no longer in `states` list) and cleared.

Platform support: macOS + Linux. Windows and unknown platforms skip
the resolver (resolveClientWindow returns null) and the server falls
back to first-registered.

Cost: ~50-150ms per request on macOS (two shell-outs per parent hop —
lsof, then ps for each ancestor up to 4 hops deep), ~5-15ms on Linux
(pure /proc reads).

Docs: new "Caller-window resolution" subsection in mcp-server.md
covering mechanism, scenarios, platform support, and the multi-client
race caveat.

Verification:
- ./gradlew :compose-ui:compileKotlinDesktop --no-daemon clean.
- Live end-to-end testing needs a restart of BossTerm (currently
  running binary is pre-change) and a multi-window scenario — manual,
  reviewer to confirm before merge.

Generated with [Claude Code](https://claude.com/claude-code)
@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Follow-up review pass — ProcessAncestry and items the prior review did not surface

The previous review covered the main run_command correctness issues (hook script nc-missing branch, panel new_tab tab-vs-pane id confusion, multi-statement script slice, stale OSC 133;D from concurrent writers, UTF-16-chars vs bytes for maxOutputBytes, per-pane mutex leak, doc nit on 200 KB). Those still stand. This pass focuses on ProcessAncestry.kt and a couple of items that pass did not cover.

ProcessAncestry — bugs and performance

  1. SUBPROCESS_TIMEOUT_S is dead code on macOS. findClientPidMacOS calls readLines() before process.waitFor(timeout, ...) (ProcessAncestry.kt:124-125); same shape in parentPidMacOS (:208-209). readLines/readText blocks until EOF — i.e. until the child exits — so if lsof or ps hangs while still holding stdout open, the read blocks indefinitely and waitFor is never reached. Even if waitFor does run, its return value is ignored — there is no destroyForcibly, so a slow child is not killed either. Worst case: the Ktor interceptor thread for an MCP request hangs forever on a slow lsof. Fix shape: drain stdout on a background thread, waitFor with the timeout, destroyForcibly and return null on false.

  2. Resolution runs on every MCP request, paying ~50-150 ms on macOS. The Ktor interceptor calls ProcessAncestry.resolveClientWindow for every incoming request, including high-frequency reads (list_tabs, read_scrollback, search_output, get_active_tab). Two lsof or ps shell-outs per request adds up fast under any agentic loop. SSE keeps the source port stable for the lifetime of a connection, so a (remotePort -> state) cache scoped to the SSE session would amortize this to near-zero for steady-state work. The kdoc already calls out the cost; the fix is just to memoize per port.

  3. Linux IPv6 loopback is not scanned. findClientPidLinux reads only /proc/net/tcp (:147). If localhost resolves to ::1 on the user system, the connection lives in /proc/net/tcp6 and the resolver silently fails — the server falls back to first-registered. The server binds 127.0.0.1 explicitly so this is rare, but a client connecting via localhost rather than 127.0.0.1 can end up on the v6 socket. Either scan both files, or document the v4-only constraint.

  4. Linux /proc fd-scan cost is overstated by the kdoc. Comment claims ~5-10 ms (:142) but the inner loop calls Files.readSymbolicLink for every fd of every PID on the system (except our own). On a workstation with hundreds of processes and thousands of fds — common with containers, browsers, IDEs — the wall-clock cost is more like 50-200 ms. Not a correctness bug, just worth honest-numbering in the comment. Memoizing per (2) makes this moot regardless.

  5. lastResolvedClientWindow survives a window unregistering, but only by accident. setLastResolvedClientWindow(null) is a documented no-op, so the cached state lingers after its window is closed; lastResolvedClientWindow() then lazily clears it on the next read by checking cached !in states. The kdoc on setLastResolvedClientWindow claims "the prior resolution from a request that DID resolve" survives, which is only true until that window is unregistered — the next read invalidates it. The behavior is correct; the kdoc oversells it.

Items the prior review did not surface

  1. mcpScratchPanes and tabCacheLocks leak on tab close. Mirroring the per-pane mutex leak the prior review flagged: when a tab is closed entirely (not just a pane within it), both mcpScratchPanes[tabId] and tabCacheLocks[tabId] still hold its tabId. Hook eviction into the same tab-removed signal that already exists in TabbedTerminalState and you kill all three leaks (tab cache, tab lock, pane mutex) in one place.

  2. JSON-schema panel enum is not declared. The run_command panel parameter accepts only reuse / horizontal_split / vertical_split / new_tab, but the schema (BossTermMcpServer.kt:945-948) declares it as a bare string with a free-form description. Adding "enum": [...] lets clients validate before sending and lets Claude / Codex constrain their generations against the allowed set, reducing the "Unknown panel" error path. Minor UX win.

  3. Setting clamps happen silently per-call with no user feedback. mcpRunCommandMaxOutputBytes.coerceAtLeast(1024), mcpRunCommandDefaultTimeoutMs.coerceIn(100, 600_000), and mcpRunCommandShellReadyTimeoutMs.coerceAtLeast(0) all run on every call. A user who edits settings.json directly (as the advanced settings docs encourage) to 0 or a negative gets no log line saying their setting was overridden. Cheap fix: log once at startup (or first call) when the persisted value is out-of-range.

  4. Newline-padding is byte-naive on CRLF. A script ending in \r\n passes the endsWith("\n") check, so no additional newline is added, but the shell still sees \r before the newline. POSIX shells tolerate this; some shell features (here-doc parsing, certain builtins on minimal sh) do not. Either strip a trailing \r before the check, or document the LF-only contract.

  5. mcpRunCommandDefaultPanel is silently normalized at runtime but not on persistence. The runtime tolerates a bad value by falling back to horizontal_split (good — already a P0 fix in the description), but the settings UI dropdown only offers the three valid options. A user who hand-edits settings.json to "reuse" (or any other bad value) keeps that string in their saved settings indefinitely; the UI dropdown then has no matching selection. Either normalize on settings load, or render an explicit "(invalid)" state in the dropdown.

Test coverage — same as prior review

Still no automated coverage for any of: slice math (history delta, wraparound), cache-race resolution, UNDISABLABLE_TOOLS filter, ProcessAncestry IPv6 / subprocess-timeout paths. The prior review three suggested unit tests would catch most of what is flagged here too.


Net: lots of careful engineering (per-tab cache lock + per-pane mutex layering, absolute-history slice, sealed PaneResolution for exhaustive when, atomic marker write, defensive effectiveDisabled filter). The ProcessAncestry subprocess hang and per-request cost are the items most worth fixing before merge alongside the prior review P0/P1 list.

Generated with Claude Code

Today consecutive `run_in_panel` calls with `panel: "horizontal_split"`
each split whichever pane currently has focus — so a second long-running
script lands wherever the user happened to click last, often splitting
the agent's own pane and pushing the prior MCP pane around. With three
or four parallel fire-and-forget launches the layout becomes random and
unpleasant.

Anchor the new pane off the last MCP-created scratch pane instead of
the focused pane: a horizontal_split splits the existing scratch pane
*vertically* (right of it), and the new pane becomes the next anchor.
Result: consecutive MCP scratch panes line up as a horizontal strip
along the bottom of the tab, and the agent's primary pane keeps its
real estate.

New: `TabbedTerminalState.splitVerticalFromPane(tabId, anchorPaneId,
ratio, initialCommand)`. Implemented by extending the existing
`performSplit` helper with an optional `anchorPaneId` argument — when
provided, `SplitViewState.setFocusedPane` aims focus at the anchor
before the existing `splitFocusedPane` call runs, so the same code path
handles both flavors. Anchor lookup degrades gracefully: if the cached
paneId no longer resolves to a session (user closed the pane), the
caller falls back to the original `splitHorizontal` (focused-pane)
behavior.

run_in_panel changes:
 - When `panel == "horizontal_split"` and `registry.getScratchPane(tab)`
   points at a still-live pane, route through `splitVerticalFromPane`
   to stack the new pane right of the anchor.
 - Every successful split now updates the scratch-pane cache to the
   new paneId, so the *next* call sees the new pane as its anchor
   (chains horizontally). Previously only run_command updated the
   cache; now run_in_panel does too, and the two tools share the same
   anchor for the same tab.

`vertical_split` and `new_tab` keep their existing semantics — no
behavior change for callers that explicitly opt out of the anchor
stacking.

run_command is intentionally untouched: it reuses the cached pane on
the next call (no new pane created), so the anchor-stacking rule
doesn't apply.

Verification:
 - `./gradlew :compose-ui:compileKotlinDesktop --no-daemon` clean.
 - Manual (reviewer): two run_in_panel horizontal_split calls in a row
   should yield two panes side-by-side in the bottom row, not nested.

Generated with [Claude Code](https://claude.com/claude-code)
@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Review of PR #262run_command MCP tool

Thorough, well-organized work — the per-tab cache lock, per-pane mutex, absolute-history slicing, and UNDISABLABLE_TOOLS defense-in-depth all show solid concurrency/safety thinking. A few real issues below.

Bugs

1. Hook script contradicts its own docs on the "nc missing" branch (docs/mcp-server.md)

The doc at line 567–570 promises:

The hook fails closed (lets Bash through) when nc is missing, since it can't verify the marker isn't stale.

…and the PR description's P0 #1 says the same. But the script as written does the opposite:

if command -v nc >/dev/null 2>&1; then
    nc -z 127.0.0.1 "\$port" >/dev/null 2>&1 || exit 0
fi
cat <<'JSON' …deny…

When nc is missing, the if body is skipped entirely and execution falls through to the cat that emits the deny. Net result: machines without nc (minimal Alpine, some stripped containers) get Bash blocked even when BossTerm's marker is stale or the port is dead — exactly the "route Claude to a dead port" failure mode P0 #1 claims to have fixed.

Fix:

if command -v nc >/dev/null 2>&1; then
    nc -z 127.0.0.1 "\$port" >/dev/null 2>&1 || exit 0
else
    exit 0  # can't verify port — let Bash through
fi

Worth re-running the "no nc — PASS" entry in the test plan on a host with nc actually removed from PATH.

2. mcpRunCommandMaxOutputBytes is a char count, not a byte count (BossTermMcpServer.kt: sliceCommandOutput)

if (sb.length + line.length + 1 > maxOutputBytes) {

StringBuilder.length and String.length are UTF-16 code-unit counts in the JVM, not byte counts. For ASCII the two coincide, but a run_command whose output contains emoji, CJK, or other supplementary-plane characters will silently exceed the documented byte budget — and the headroom math against mcpMaxAnswerChars (also char-based) leaks the discrepancy. Either rename the setting to mcpRunCommandMaxOutputChars, or convert via .toByteArray(Charsets.UTF_8).size (more expensive — keeping it char-based and aligning the name + docs is fine).

3. Doc says output cap is 200 KB, code defaults to 120 KB (docs/mcp-server.md:395)

- `output` is capped at 200 KB; `truncated` becomes `true` …

But the default for mcpRunCommandMaxOutputBytes is 120_000 and the settings table further down says so. Either a leftover from an earlier draft or one of the two wasn't updated. Suggest: "output is capped at the mcpRunCommandMaxOutputBytes setting (default 120 KB)".

Concerns

4. Caller-window resolution is re-run from scratch on every MCP request (ProcessAncestry.kt)

The Ktor interceptor fires resolveClientWindow per request. For SSE + per-POST JSON-RPC, that's once per tool call — typically dozens during a Claude Code session. Each macOS resolution spawns lsof once + ps -o ppid= per ancestor hop (2–4 subprocesses on the budget you quote); each Linux resolution can iterate all of /proc/*/fd/*. There's no caching keyed on remotePort even though (port → PID → ancestor → state) is invariant within a connection.

A small ConcurrentHashMap<Int, TabbedTerminalState> keyed on remotePort with an LRU bound (or invalidated on state un-register) would cut this to one resolution per ephemeral port. Worth doing before this lands on macOS where the subprocess fan-out is most painful.

5. Cache lock is held during pane creation (BossTermMcpServer.kt, registerRunCommand)

The tabCacheLock(tabId).withLock { … } block calls state.splitHorizontal / state.createTab while holding the lock. Those involve UI-thread interaction and may take more than the "held briefly" the registry comment promises. Two concurrent calls on the same tab will serialize across pane creation, which is fine for correctness but means the second caller's latency includes the first call's full pane-spawn time. Probably acceptable — flagging because the comment understates it.

6. if (script.isEmpty()) rejects the empty string but accepts whitespace-only

run_command(script: " ") and run_command(script: "\n") will both proceed to the shell. Fine for the shell ("just a newline"), but combined with the script.removeSuffix("\n").ifEmpty normalization in run_in_panel it's slightly inconsistent. Worth script.isBlank() if you want the early-exit guard to be uniform.

Smaller things

  • state.findSession(tabId, newPaneId) ?: state.findSession(newPaneId) fallback chain (BossTermMcpServer.kt ~1103): the second lookup uses a different (tab-less) overload — worth a one-line comment explaining when the new_tab path causes the first to miss (the new pane lives in the new tab, not the requested one).
  • writePortMarker ATOMIC_MOVE (BossTermMcpManager.kt ~510): nice. One small thing — the .mcp.port.tmp filename is hard-coded, which means two BossTerm processes binding concurrently could clobber each other's tmp file mid-write. Multi-instance is documented as unsupported, so likely a no-fix, but a per-PID suffix (.mcp.port.tmp.\${pid}) would be a one-line hardening.
  • SUBPROCESS_TIMEOUT_S = 2L (ProcessAncestry.kt): under load, lsof/ps can briefly exceed 2 s. The failure mode is benign (returns null → falls back to first-registered), but with MAX_PARENT_WALK = 16 the worst-case wait for a single resolution is 16 × 2 s = 32 s. Probably never hit, but worth tightening either the per-call timeout or the hop budget.
  • lastResolvedClientWindow() (McpTerminalRegistry.kt): the cached !in states membership check on CopyOnWriteArrayList is O(N) per call — fine for a handful of windows, but worth a Set mirror if window counts grow.
  • maxOutputBytes.coerceAtLeast(1024) is silent (BossTermMcpServer.kt:1018): a user who sets it to 100 won't know their value was raised. Same for the timeout clamp. A one-line log.warn on first coercion would surface misconfiguration.
  • UNDISABLABLE_TOOLS doc-only contract: applyDisabledSet correctly filters now, but the description claims manage_tools's own handler also rejects the reserved names — a quick comment pointer next to the constant would help future readers verify both paths in one place.

Tests

No unit tests in this PR for a sizeable chunk of net-new logic (run_command driver, slicing, pane resolution, ProcessAncestry). The manual test plan is good but reproducibility-fragile. At minimum, sliceCommandOutput is pure and trivially testable — startHistory/endHistory/cursorY combinations including the scroll-into-history case, the truncation case, and the endRowInclusive < startRow edge would catch regressions cheaply.

Nits

  • Consider tightening the instructions blurb (BOSSTERM_MCP_INSTRUCTIONS) — it's six sentences of system-prompt text on every initialize. Claude Code does surface MCP server instructions but tokens add up across many servers.
  • run_command description mentions "Requires OSC 133 shell integration" but doesn't tell the model what happens if it's missing (exit code = null after timeout). Worth adding that explicitly so the agent doesn't infinite-loop retrying.

Overall a strong PR. The hook-script-vs-doc divergence (1) is the only blocker I'd want addressed before merge; everything else is hardening/polish.

P0 #1 — hook script in docs/mcp-server.md "Using as Claude Code's default
shell" didn't match the actual hook file shipped to ~/.claude/hooks/
prefer-bossterm.sh: it falls through and emits the deny JSON when nc is
missing, contradicting the doc's own "fails closed" prose. Match the
shipped script's `if ! command -v nc; then exit 0; fi; nc -z … || exit 0`
shape so a missing nc never routes Claude to an unverified port.

P0 #2 — `panel: "new_tab"` in run_command was structurally broken.
`state.createTab(...)` returns a TAB id, but the code stuffed it into
`newPaneId`, cached `mcpScratchPanes[origTabId] = newTabId`, and returned
`RunCommandResult(tabId = origTabId, paneId = newTabId)`. Consequences:
the next call with the same `tab_id` would look up `findSession(origTab,
newTab)` (which can't resolve — newTab isn't a pane in origTab's split
tree), evict as stale, and create yet another fresh tab — so the new_tab
mode never got its reuse. The response's tabId was also wrong.

Refactor:
 - `PaneResolution.Hit` now carries an `effectiveTabId` field. Splits and
   cache-hits carry the input tab id forward; new_tab path overrides
   with the freshly-created tab's id (and uses `findSession(newTab,
   newTab)` for a single-pane lookup that actually resolves).
 - The cache writes / reads land on the new tab id when applicable, so
   `panel: "reuse"` follow-ups on the same effective tab actually reuse.
 - `RunCommandResult.tabId` is now `resolveResult.tabId` (the effective
   tab id), so callers passing it back land in the right tab.

P0 #3 — Multi-statement scripts (raw "\n" inside `script`) corrupted the
result. The shell fires B/D once per statement; my `onCommandStarted`
unconditionally overwrote historyAtB/cursorYAtB, so by the time the
awaiter wakes up the start mark might point at the SECOND statement
while the exit code is from the first.

Fix: guard the snapshot with an `ourCommandStarted` AtomicBoolean
flipped via `compareAndSet(false, true)`. Only the first B after our
write records markers; later statements' B events no-op. Also added a
schema note in the docs telling callers to use `bash -lc '…'` for
compound logic, since the captured exit code is still only the first
statement's even with the fix.

P0 #4 — Stale OSC 133;D from a prior writer on the same pane (a
fire-and-forget run_in_panel script, a send_input keystroke, the user
manually typing) could complete `finishedSignal` before our script
even left the gate, returning that command's exit code with a garbled
slice. The per-pane Mutex serializes against other run_command calls
but not other writers on the pane.

Fix: an AtomicBoolean `weHaveWritten` is flipped right BEFORE the
`session.writeUserInput(...)` call. The listener ignores any B until
`weHaveWritten` is set, and ignores any D until `ourCommandStarted`
has been set (which only happens on a B we accepted). Together they
gate the entire B/D cycle to OUR command. Captured-var visibility
across threads is handled via atomics — captured `var`s become
kotlin.jvm.internal.Ref whose .element field isn't volatile.

P1 #5 — `mcpRunCommandMaxOutputBytes` was a misnomer: `sb.length` and
`line.length` are Kotlin String lengths (UTF-16 code units), not UTF-8
bytes. For ASCII it matches; for emoji-heavy output the real byte count
can be 4× the configured value.

Renamed everywhere to `mcpRunCommandMaxOutputChars` and updated the
kdoc to spell out the unit choice (apples-to-apples with the also-
char-counting `mcpMaxAnswerChars`). Settings.json entries set under
the old name silently revert to the default — acceptable for a setting
that shipped on the same dev branch.

P1 #6 — `paneMutexes` only sheds entries on the cache-miss-with-stale-
pane path. If a user closes the cached pane and never re-runs against
that tab, the mutex sits forever. Bounded by pane-creation rate but
unbounded over long-lived sessions.

Added opportunistic GC in `McpTerminalRegistry.setScratchPane`: when
the mutex map exceeds 32 entries, walk every state's tabs + pane
snapshots, collect live ids, evict everything else. Naturally batches
behind real new-pane work; no listener wiring needed. Also sheds
matching stale `mcpScratchPanes` entries while the sweep is hot.

Build: `./gradlew :compose-ui:compileKotlinDesktop --no-daemon` clean.

Generated with [Claude Code](https://claude.com/claude-code)
@claude
Copy link
Copy Markdown

claude Bot commented May 20, 2026

Review — PR #262 run_command MCP tool

Solid, deeply-considered work. The OSC 133 listener-before-send sequencing, the per-tab cache lock for cache-miss orphan-prevention, the per-pane FIFO mutex, and the explicit "first-B" gate via weHaveWritten + ourCommandStarted are well-thought-out concurrency invariants and the comments earn their keep. The sealed PaneResolution for exhaustive failure handling is the right shape.

Below: issues worth a second pass before merge, then smaller nits.

1. Per-request lsof / ps shell-out has no caching (perf)

compose-ui/.../BossTermMcpManager.kt:359-365 runs ProcessAncestry.resolveClientWindow inside the Ktor intercept block on every incoming request. The doc on ProcessAncestry itself budgets ~50–150 ms per resolution on macOS (lsof + ps × hops). MCP-over-SSE keeps the TCP connection alive but POSTs each JSON-RPC request; for an agent that fires run_command in a tight loop you pay that cost every time.

The client's remotePort is stable for the lifetime of one TCP connection, so caching keyed on remotePort (with eviction on disconnect or a small TTL) would collapse the steady-state cost to one resolution per connection. Today lastResolvedClient is the only "cache" — and it's a single global slot, which leads to issue #2.

2. lastResolvedClient is a single global slot — multi-client misrouting

McpTerminalRegistry.kt:260 is one AtomicReference<TabbedTerminalState?>. The PR docs flag the "concurrent multi-client racing" case as a known limitation, but it's more structural than it reads: any time two clients in different windows alternate, the no-tab_id tool calls swing between windows. Since MCP requests can interleave on different SSE connections, this isn't just a "rare race for a single window of time" — it's the default behavior under any multi-client setup.

The structurally-correct fix is per-request resolution: stash the result in a Ktor AttributeKey inside the interceptor, then have the tool handler read it. That requires the MCP SDK to expose the inbound ApplicationCall to tool handlers, which may or may not be possible in SDK 0.8.3 — if it isn't, threading a ThreadLocal/coroutine context element from the interceptor into the handler dispatch is the next-best option. Either avoids the cross-client write storm on the global slot.

3. Linux client-PID resolver only scans /proc/net/tcp (IPv4) — IPv6 loopback misses

ProcessAncestry.kt:144-193 reads only /proc/net/tcp. On some Linux configs the loopback connection lands in /proc/net/tcp6 (e.g. clients that pick ::1 or kernels that report mapped-v4 there). Inet4Address-only listening sockets on 127.0.0.1 should normally keep the connection in tcp, but the resolver will silently return null whenever it doesn't and the tool falls back to first-registered without anyone noticing. Worth either also scanning tcp6 or a DEBUG log when both are empty.

4. splitVerticalFromPane permanently shifts user focus

TabbedTerminalState.kt:652-654 calls splitState.setFocusedPane(anchorPaneId) before splitting. There's no restore. When an MCP run_command fires while the user is typing in some other pane, their keyboard focus snaps to the bottom scratch strip mid-keystroke. The split needs the focus redirect to land in the right place, but the user's prior focus should be captured before the redirect and restored after splitFocusedPane returns.

5. Output truncation keeps the first maxOutputChars, not the last

sliceCommandOutput at BossTermMcpServer.kt:1413 breaks out of the loop the first time sb.length + line.length + 1 > maxOutputChars. For the dominant use case (builds, tests, lints, package installs), the tail of the output is where errors live — head-truncation throws away exactly the bytes Claude needs. Consider tail-truncating instead, or at minimum a "head + … + tail" split. Even the docs say "Captured stdout/stderr (ANSI-stripped)" without flagging which end gets dropped, so the contract should match.

6. UNDISABLABLE_TOOLS containing run_command is a UX overreach

BossTermMcpServer.kt:938 reserves run_command because the user-global PreToolUse hook (which is opt-in, not bundled) routes Bash to it. Users who don't install the hook lose the ability to hide run_command from clients, even though the justification doesn't apply to them. manage_tools being reserved is genuinely structural — without it there's no way back. run_command's reservation is a side-effect of an optional external integration. Consider gating its reservation on the user having opted into the hook (e.g. only when ~/.claude/hooks/prefer-bossterm.sh exists), or making it a normal-but-default-on tool.

7. Race: OSC 133;A on fresh pane can arrive before the listener registers

In executeInPane, addCommandStateListener(listener) runs after splitHorizontal/createTab returned. For a fast shell (e.g. dash, or a cached zsh with no rc-file work) the prompt-ready OSC 133;A can fire before the listener attaches, and promptReadySignal.await() waits the full shellReadyTimeoutMs (1500 ms default) for an event that already passed. The pane resolution path could attach the listener before kicking off the split, or TerminalSession could expose a "has-seen-first-prompt" latched flag for the listener to consult on attach.

Nits

  • BossTermMcpServer.kt:1386 and :1909 reference MAX_OUTPUT_BYTES in kdoc but the constant was renamed to maxOutputChars / the per-call cap parameter — stale doc.
  • script.isEmpty() check at BossTermMcpServer.kt:1008 accepts " " and "\n". isBlank() would catch the no-op cases too.
  • Hook script: cat >/dev/null 2>&1 || true after set -e is harmless but unnecessary — cat consuming stdin into /dev/null doesn't fail in practice. The comment is the load-bearing part.
  • paneMutexes.size > MCP_MUTEX_GC_THRESHOLD triggers gcStalePaneMutexes() from inside setScratchPane. The GC walks every state × every tab × every pane snapshot — for users with 30+ panes the sweep on every successful new-pane creation isn't free. A small TTL or a per-N-calls counter would amortize better.
  • mcpScratchPanes value-side staleness check in gcStalePaneMutexes (v !in live) is good, but the "lazy clear on next read" comment in clearScratchPane doesn't actually exist as a code path — readers (run_command's cache-miss-with-stale-pane branch) call clearScratchPane only when they hit the stale entry. Worth either implementing the read-time lazy clear or removing the comment promise.

Test coverage

Test plan is manual-only (5 reviewer checklist items). For something with this much concurrency state (per-tab cache lock, per-pane FIFO mutex, B/D event gating, OSC 133;A race window, history-coordinate slicing) at least a few unit tests on sliceCommandOutput (history-delta scroll-through, oldest-row clamp, exact-boundary truncation) and on PaneResolution's new_tab branch would be cheap insurance. The atomic flags in executeInPane in particular look right but are exactly the kind of code that quietly regresses six months from now.


Overall direction is good and the rigor evident in the existing internal review pass is real — these are mostly polish/edge-case items, not "stop the merge" findings. #1 and #2 are the two I'd most want addressed before this becomes the default shell path for AI clients.

Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant