feat(mcp): run_command tool for visible-pane shell with read-back by kshivang · Pull Request #262 · kshivang/BossTerm

kshivang · 2026-05-20T08:29:39Z

Summary

New blocking MCP tool mcp__bossterm__run_command that runs a shell command in a visible BossTerm pane and returns its stdout/stderr, exit code, and duration. AI clients (Claude Code, Codex, Cursor) now have a default shell that the user can watch in their real terminal — no more invisible Bash calls.
Cross-client preference advertising via the MCP initialize-response instructions field — every spec-compliant client sees "prefer run_command over your built-in shell" in its system prompt at handshake, no per-project config needed.
~/.bossterm/mcp.port marker file written atomically on every successful bind, deleted on clean stop — lets external hooks (e.g. a Claude Code PreToolUse hook) decide reachability with a 5 ms stat + nc -z instead of a 300 ms HTTP probe.

What's in the box

Server tool (BossTermMcpServer.registerRunCommand):

Pane resolution: explicit pane_id > cached "MCP scratch pane" for the tab > newly-created via splitHorizontal/splitVertical/createTab.
Per-tab cache Mutex so two concurrent calls can't both miss the cache and orphan a pane.
Per-pane Mutex so two concurrent calls into the same pane can't interleave their scripts in the shell's stdin.
Listener-before-send: registers a transient CommandStateListener BEFORE the script is written so OSC 133;A/B/D are all captured.
Absolute history-line tracking for output slicing — slice stays correct when output scrolls into history mid-command.
Alt-screen detection: polls isUsingAlternateBuffer while waiting for D. On transition, returns error: "TUI detected" and leaves the process alive so the caller can drive it via send_input + read_scrollback.
Reserved (UNDISABLABLE_TOOLS): manage_tools rejects attempts to hide it, and applyDisabledSet defensively filters reserved names against hand-edits of disabledMcpTools in settings.json.

Settings (TerminalSettings.kt, McpSettingsSection.kt):

mcpRunCommandDefaultTimeoutMs (120_000, UI input).
mcpRunCommandDefaultPanel ("horizontal_split", UI dropdown; invalid values silently fall back to horizontal_split so a typo'd setting never bubbles up as a per-call error).
mcpRunCommandMaxOutputBytes (120_000, advanced — sized to fit under mcpMaxAnswerChars=150_000 with JSON-wrapper headroom so the response-shortening ladder never has to truncate run_command output).
mcpRunCommandShellReadyTimeoutMs (1_500, advanced).
UI also gained an "always-exposed meta-tool manage_tools" blurb so users see that tool exists even though it's not in the toggle list.

Docs (README.md, docs/mcp-server.md):

Per-tool run_command section: schema, response shape, error contract (TUI / timeout / shell-integration missing).
~/.bossterm/mcp.port marker file note + initialize-time instructions section.
Settings reference table rows for the four new keys.
"Using as Claude Code's default shell" how-to: copy-paste hook script (~/.claude/hooks/prefer-bossterm.sh), settings.json snippet, optional CLAUDE.md addendum.
Troubleshooting notes for multi-instance / sudo / missing nc.

Rigorous review pass

Three independent review agents (server-side, settings/UI, integration/hook) audited the implementation before this commit. They surfaced 3 P0 bugs, 5 P1 issues, 4 P2 doc/UX items, and verified 11 items as correct. All P0 and P1 findings are addressed here:

P0 feat: Terminal rendering improvements + 4 high-priority features (#2, #3, #4, #5) #1: hook now fails closed when nc is missing — previously it would emit deny without a probe and could route Claude to a dead port.
P0 🔍 Implement text search/find functionality (Ctrl+F) #2: per-tab cache Mutex prevents the concurrent-call orphan-pane race.
P0 🔗 Implement hyperlink detection and click handling #3: invalid mcpRunCommandDefaultPanel (e.g. "reuse") silently falls back to horizontal_split; error messages reference both the per-call panel arg and the setting.
P1: max-output default lowered + 1024 floor documented; timeout kdoc clarified; pane mutex eviction wired into clearScratchPane(tabId, paneId?); gradle-build kdoc note added; shell-ready =0 semantics tightened.

What's NOT in this PR

User-global config files (~/.claude/CLAUDE.md, ~/.claude/hooks/prefer-bossterm.sh, ~/.claude/settings.json hook entry) — per-user, lives outside the repo. The docs walk through wiring them up.

Test plan

./gradlew :compose-ui:compileKotlinDesktop --no-daemon — BUILD SUCCESSFUL.
Hook script smoke-tested in all four branches: no marker / dead port / live port / no nc — all PASS.
Manual (reviewer): launch BossTerm, enable MCP, attach to Claude Code, ask Claude to ls — pane should appear in the active tab and output returns to Claude.
Manual (reviewer): run two run_command calls in parallel from MCP Inspector against the same tab — confirm only ONE new pane spawns (cache race fix).
Manual (reviewer): call run_command with script: "vim\n" — expect error: "TUI detected", pane stays alive.
Manual (reviewer): kill BossTerm with MCP enabled — confirm ~/.bossterm/mcp.port is gone.
Manual (reviewer): hand-edit ~/.bossterm/settings.json to "mcpRunCommandDefaultPanel": "reuse" — first run_command call should silently fall back to horizontal_split instead of erroring.

🤖 Generated with Claude Code

Today AI clients (Claude Code, Codex, Cursor, …) running shell commands go through their built-in Bash tool — invisible to the user. The user asked for shell work to happen in their actual BossTerm window while the agent still receives stdout/stderr + exit code, so they can watch what's running and the agent isn't blind. New write tool mcp__bossterm__run_command, blocking variant of run_in_panel: - Spawns (or reuses) a visible BossTerm pane, writes the script, waits for OSC 133;D, returns {exitCode, output, durationMs, paneId, truncated, error}. - Pane reuse: first call splits below the focused pane (or whatever mcpRunCommandDefaultPanel says); subsequent calls in the same tab reuse that pane. The cache (tabId → paneId) is guarded by a per-tab Mutex so two concurrent calls can't both miss the cache and orphan a pane. A per-pane Mutex serializes overlapping calls into the same pane's stdin. - Output capture: absolute history-line tracking via the OSC 133;B and ;D events, so the slice stays correct when output scrolls into history mid-command. ANSI-stripped via the emulator's per-line .text accessor (same path read_scrollback uses). - Alt-screen / TUI detection: polls textBuffer.isUsingAlternateBuffer while waiting for D; on transition, completes with error: "TUI detected …" instead of timing out. The process stays alive so the caller can drive it via send_input + read_scrollback. - Reserved tool: added to UNDISABLABLE_TOOLS so manage_tools rejects attempts to hide it (otherwise the PreToolUse hook below would route Bash to a tool that's no longer on the wire). applyDisabledSet also filters reserved names defensively against hand-edits of disabledMcpTools in settings.json. Cross-client preference advertising: - Server now passes a BOSSTERM_MCP_INSTRUCTIONS string into the SDK's 4-arg Server constructor. The MCP initialize response surfaces it into every client's system prompt, so the "prefer run_command over your built-in shell" guidance lands without per-project config. Port marker file for cheap external probes: - BossTermMcpManager atomically writes the actual bound port (after the 7676→7685 fallback) to ~/.bossterm/mcp.port on every successful bind, deletes it on clean stop. Lets the user-global PreToolUse hook decide reachability with stat + nc -z (~5ms) instead of an HTTP probe with timeout (~300ms worst case). Settings: - mcpRunCommandDefaultTimeoutMs (default 120_000, UI input) - mcpRunCommandDefaultPanel (default horizontal_split, UI dropdown; invalid values silently fall back to horizontal_split — never bubble the user's bad setting up as a per-call error) - mcpRunCommandMaxOutputBytes (default 120_000, advanced — sized to fit under mcpMaxAnswerChars=150_000 with JSON-wrapper headroom) - mcpRunCommandShellReadyTimeoutMs (default 1_500, advanced) UI: - Two new rows under "BossTerm MCP Server" (panel dropdown + timeout input). - ExposedToolsSection grays out reserved tools and gains a blurb surfacing the always-exposed manage_tools meta-tool (which lives outside BUILT_IN_READ/WRITE_TOOLS). Docs: - README intro updated; new "Using as Claude Code's default shell" subsection pointing at the docs. - docs/mcp-server.md: full run_command schema + response shape + error contract, ~/.bossterm/mcp.port marker file note, initialize- time instructions section, settings reference rows, "Using as Claude Code's default shell" how-to (copy-paste hook script + settings.json + CLAUDE.md), multi-instance + sudo + nc caveats in Troubleshooting. Verification: - ./gradlew :compose-ui:compileKotlinDesktop --no-daemon clean. - Hook script smoke-tested across all four branches (no marker / dead port / live port / no nc). - End-to-end (MCP Inspector + two concurrent run_command calls + real Bash hooking) is still manual; the listener-before-send, alt-screen detection, and cache lock can't be exercised without a live UI. User-global config (~/.claude/CLAUDE.md, hooks/prefer-bossterm.sh, settings.json hook entry) is per-user and lives outside this repo — the docs walk through it. Generated with [Claude Code](https://claude.com/claude-code)

claude · 2026-05-20T08:36:03Z

Review of PR #262 — `run_command` tool

Solid, well-thought-out feature; the per-tab cache lock + per-pane mutex layering is the right model, and the absolute-history-line slice is clever. A few real bugs and contract gaps worth fixing before merge.

Issues

1. Hook script contradicts its own docs (and the PR description) when nc is missing. docs/mcp-server.md:524-527 says the hook "fails closed (lets Bash through) when nc is missing", and the PR description lists this as a P0 fix. The actual script falls through and emits the deny JSON when command -v nc returns non-zero, so a missing nc ends up denying Bash and routing Claude to a port no one verified is alive — exactly the behavior the P0 was supposed to remove. The minimal fix:

if command -v nc >/dev/null 2>&1; then
    nc -z 127.0.0.1 "$port" >/dev/null 2>&1 || exit 0
else
    exit 0
fi

2. panel: "new_tab" confuses tab ids and pane ids. state.createTab(...) returns a tab id (TabbedTerminalState.kt:226-236), but the new-tab branch in registerRunCommand stuffs that into newPaneId and then:

caches it as mcpScratchPanes[origTabId] = newTabId
returns it as RunCommandResult.paneId while RunCommandResult.tabId stays the original (now-background) tab

Consequences:

The next run_command call with the same (original) tab_id reads the cache, gets the new tab's id as a "pane id", and state.findSession(origTabId, newTabId) returns null because that id isn't in origTabId's split tree. The entry is cleared as stale and a brand-new tab is created — every call. So panel: "new_tab" never gets the documented reuse behavior.
The response's tabId doesn't reflect where the pane actually lives, so a caller passing paneId back without tab_id will land in whatever tab is active at call time, not the one the pane is in.

Fix: when panel == "new_tab", treat the return value as a tab id — cache mcpScratchPanes[newTabId] = focusedPaneIdOfNewTab and return that tab id in RunCommandResult.tabId.

3. Multi-statement scripts (raw \n) corrupt the result. The schema says "Shell command to run. A trailing newline is added if absent." but doesn't forbid embedded \n. With e.g. "cd /tmp\nls", the shell drives two B/D cycles. finishedSignal.complete() (BossTermMcpServer.kt:1168-1172) is one-shot and fires at the first D — but historyAtB/cursorYAtB get re-assigned on every onCommandStarted, so by the time the awaiter wakes up the start mark may already point at the second statement. Net effect is non-deterministic: either the exit code is the first statement's and the slice runs past it into the second, or the start mark is stale and the slice misses the first statement's output entirely. Two reasonable options:

Document this and tell callers to use bash -lc '...' for compound scripts.
Stop overwriting historyAtB after the first B, mirroring the one-shot pattern already used for finishedSignal.

4. Stale OSC 133;D from concurrent pane activity can spuriously finish the call. On a reused pane, the listener is wired up before writeUserInput. If a previous tool (a fire-and-forget run_in_panel script, or a keystroke the user typed) is mid-command, the resulting D event lands in our finishedSignal and we return that command's exit code with a garbled output slice. The per-pane mutex serializes run_command against itself but doesn't protect against other writers on the same pane. Cheap mitigation: ignore onCommandFinished until we've observed an onCommandStarted that fired after our writeUserInput.

5. mcpRunCommandMaxOutputBytes is measured in UTF-16 chars, not bytes. sliceCommandOutput checks sb.length + line.length + 1 > maxOutputBytes (BossTermMcpServer.kt ~1316). For ASCII this matches, but a pane full of emoji/CJK can return a string whose UTF-8 byte count is 2–4× the limit — silently breaking the "fits under mcpMaxAnswerChars with JSON-wrapper headroom" guarantee that's the whole point of the 120_000 default. Either rename to …MaxOutputChars everywhere, or measure line.toByteArray(Charsets.UTF_8).size.

6. Per-pane Mutex map leaks on pane close. McpTerminalRegistry.paneMutexes only sheds entries in clearScratchPane(tabId, paneId), which only runs on the cache-miss path. If a user closes the cached pane and never calls run_command against that tab again, the mutex sits forever. The leak is human-scale but unbounded across long-lived sessions — easy to plug by hooking pane-close (TerminalTab already has onProcessExit).

Doc/UX nits

docs/mcp-server.md:352 still says "output is capped at 200 KB" — the actual default cap is 120_000 (~120 KB) per the settings table on the same page.
The hook script does cat >/dev/null 2>&1 || true; cat redirected to /dev/null doesn't fail on EOF, so the || true is dead code under set -e. Minor.
script: "\n" (whitespace only) passes the script.isEmpty() guard and writes a bare newline. Consider script.isBlank().

Things that look correct

UNDISABLABLE_TOOLS defensive subtraction in applyDisabledSet (belt-and-braces against hand-edited settings).
Atomic move via Files.move(..., ATOMIC_MOVE, REPLACE_EXISTING) for the marker file.
Per-tab cache Mutex properly closes the create-and-cache race for concurrent calls on the same tab.
Listener-before-send ordering plus freshlyCreated gating so reused panes don't deadlock waiting for an A that already fired.
tuiPoller cancellation inside the coroutineScope so the poll loop doesn't outlive the await.

Test coverage

The manual checklist is reasonable, but none of the locking, slicing, or pane-resolution logic has automated coverage. A few cheap JVM tests would be high-value:

Slice math: feed a fake TerminalTextBuffer with known scrollback deltas, assert sliceCommandOutput returns the right rows when historyDelta > 0 and when the start mark falls past historyLinesCount.
Cache race: two concurrent coroutines hitting registerRunCommand for the same tab_id while a stub state.splitHorizontal records each call — assert exactly one creation.
UNDISABLABLE_TOOLS filter: persist disabledMcpTools = setOf("manage_tools", "run_command") and assert both tools remain registered after applyDisabledSet.

Generated with Claude Code

…stered Today McpTerminalRegistry.primaryState() returns the first-registered window, so run_command / run_in_panel / get_active_tab without a tab_id target the wrong window whenever multiple BossTerm windows are open — the user sees commands run somewhere they aren't looking. The correct invariant is "the window the calling MCP client is running INSIDE." New file ProcessAncestry.kt resolves that window by: 1. Resolving the PID owning the loopback TCP socket on the client side (lsof on macOS; /proc/net/tcp + /proc/<pid>/fd on Linux). 2. Walking the parent process tree from that PID. 3. Matching the first ancestor PID against every tracked pane's shell PID across all registered states. 4. Returning the state owning that pane — or null if no ancestor matches (client running outside any BossTerm pane: Claude Desktop, external Inspector, CI…). BossTermMcpManager's Ktor interceptor calls the resolver on every incoming MCP request and records the result in McpTerminalRegistry.lastResolvedClientWindow (AtomicReference). The three tool handlers that previously used registry.primaryState() (get_active_tab, run_in_panel, run_command) now read the resolved client window first, with the existing primaryState() as fallback. Design notes: - Resolution failures are silent — caller falls back to first- registered. Never crashes a tool call on a parent-walk hiccup. - Concurrent multi-client racing is acceptable last-writer-wins for v1 — the dominant single-client case is consistent. - Resolver runs INSIDE the existing intercept block, AFTER the DNS-rebinding check, so a hostile request never triggers lsof. - Stale lastResolvedClientWindow entries (window closed) are detected lazily on read (state is no longer in `states` list) and cleared. Platform support: macOS + Linux. Windows and unknown platforms skip the resolver (resolveClientWindow returns null) and the server falls back to first-registered. Cost: ~50-150ms per request on macOS (two shell-outs per parent hop — lsof, then ps for each ancestor up to 4 hops deep), ~5-15ms on Linux (pure /proc reads). Docs: new "Caller-window resolution" subsection in mcp-server.md covering mechanism, scenarios, platform support, and the multi-client race caveat. Verification: - ./gradlew :compose-ui:compileKotlinDesktop --no-daemon clean. - Live end-to-end testing needs a restart of BossTerm (currently running binary is pre-change) and a multi-window scenario — manual, reviewer to confirm before merge. Generated with [Claude Code](https://claude.com/claude-code)

claude · 2026-05-20T09:00:39Z

Follow-up review pass — ProcessAncestry and items the prior review did not surface

The previous review covered the main run_command correctness issues (hook script nc-missing branch, panel new_tab tab-vs-pane id confusion, multi-statement script slice, stale OSC 133;D from concurrent writers, UTF-16-chars vs bytes for maxOutputBytes, per-pane mutex leak, doc nit on 200 KB). Those still stand. This pass focuses on ProcessAncestry.kt and a couple of items that pass did not cover.

ProcessAncestry — bugs and performance

SUBPROCESS_TIMEOUT_S is dead code on macOS. findClientPidMacOS calls readLines() before process.waitFor(timeout, ...) (ProcessAncestry.kt:124-125); same shape in parentPidMacOS (:208-209). readLines/readText blocks until EOF — i.e. until the child exits — so if lsof or ps hangs while still holding stdout open, the read blocks indefinitely and waitFor is never reached. Even if waitFor does run, its return value is ignored — there is no destroyForcibly, so a slow child is not killed either. Worst case: the Ktor interceptor thread for an MCP request hangs forever on a slow lsof. Fix shape: drain stdout on a background thread, waitFor with the timeout, destroyForcibly and return null on false.
Resolution runs on every MCP request, paying ~50-150 ms on macOS. The Ktor interceptor calls ProcessAncestry.resolveClientWindow for every incoming request, including high-frequency reads (list_tabs, read_scrollback, search_output, get_active_tab). Two lsof or ps shell-outs per request adds up fast under any agentic loop. SSE keeps the source port stable for the lifetime of a connection, so a (remotePort -> state) cache scoped to the SSE session would amortize this to near-zero for steady-state work. The kdoc already calls out the cost; the fix is just to memoize per port.
Linux IPv6 loopback is not scanned. findClientPidLinux reads only /proc/net/tcp (:147). If localhost resolves to ::1 on the user system, the connection lives in /proc/net/tcp6 and the resolver silently fails — the server falls back to first-registered. The server binds 127.0.0.1 explicitly so this is rare, but a client connecting via localhost rather than 127.0.0.1 can end up on the v6 socket. Either scan both files, or document the v4-only constraint.
Linux /proc fd-scan cost is overstated by the kdoc. Comment claims ~5-10 ms (:142) but the inner loop calls Files.readSymbolicLink for every fd of every PID on the system (except our own). On a workstation with hundreds of processes and thousands of fds — common with containers, browsers, IDEs — the wall-clock cost is more like 50-200 ms. Not a correctness bug, just worth honest-numbering in the comment. Memoizing per (2) makes this moot regardless.
lastResolvedClientWindow survives a window unregistering, but only by accident. setLastResolvedClientWindow(null) is a documented no-op, so the cached state lingers after its window is closed; lastResolvedClientWindow() then lazily clears it on the next read by checking cached !in states. The kdoc on setLastResolvedClientWindow claims "the prior resolution from a request that DID resolve" survives, which is only true until that window is unregistered — the next read invalidates it. The behavior is correct; the kdoc oversells it.

Items the prior review did not surface

mcpScratchPanes and tabCacheLocks leak on tab close. Mirroring the per-pane mutex leak the prior review flagged: when a tab is closed entirely (not just a pane within it), both mcpScratchPanes[tabId] and tabCacheLocks[tabId] still hold its tabId. Hook eviction into the same tab-removed signal that already exists in TabbedTerminalState and you kill all three leaks (tab cache, tab lock, pane mutex) in one place.
JSON-schema panel enum is not declared. The run_command panel parameter accepts only reuse / horizontal_split / vertical_split / new_tab, but the schema (BossTermMcpServer.kt:945-948) declares it as a bare string with a free-form description. Adding "enum": [...] lets clients validate before sending and lets Claude / Codex constrain their generations against the allowed set, reducing the "Unknown panel" error path. Minor UX win.
Setting clamps happen silently per-call with no user feedback. mcpRunCommandMaxOutputBytes.coerceAtLeast(1024), mcpRunCommandDefaultTimeoutMs.coerceIn(100, 600_000), and mcpRunCommandShellReadyTimeoutMs.coerceAtLeast(0) all run on every call. A user who edits settings.json directly (as the advanced settings docs encourage) to 0 or a negative gets no log line saying their setting was overridden. Cheap fix: log once at startup (or first call) when the persisted value is out-of-range.
Newline-padding is byte-naive on CRLF. A script ending in \r\n passes the endsWith("\n") check, so no additional newline is added, but the shell still sees \r before the newline. POSIX shells tolerate this; some shell features (here-doc parsing, certain builtins on minimal sh) do not. Either strip a trailing \r before the check, or document the LF-only contract.
mcpRunCommandDefaultPanel is silently normalized at runtime but not on persistence. The runtime tolerates a bad value by falling back to horizontal_split (good — already a P0 fix in the description), but the settings UI dropdown only offers the three valid options. A user who hand-edits settings.json to "reuse" (or any other bad value) keeps that string in their saved settings indefinitely; the UI dropdown then has no matching selection. Either normalize on settings load, or render an explicit "(invalid)" state in the dropdown.

Test coverage — same as prior review

Still no automated coverage for any of: slice math (history delta, wraparound), cache-race resolution, UNDISABLABLE_TOOLS filter, ProcessAncestry IPv6 / subprocess-timeout paths. The prior review three suggested unit tests would catch most of what is flagged here too.

Net: lots of careful engineering (per-tab cache lock + per-pane mutex layering, absolute-history slice, sealed PaneResolution for exhaustive when, atomic marker write, defensive effectiveDisabled filter). The ProcessAncestry subprocess hang and per-request cost are the items most worth fixing before merge alongside the prior review P0/P1 list.

Generated with Claude Code

Today consecutive `run_in_panel` calls with `panel: "horizontal_split"` each split whichever pane currently has focus — so a second long-running script lands wherever the user happened to click last, often splitting the agent's own pane and pushing the prior MCP pane around. With three or four parallel fire-and-forget launches the layout becomes random and unpleasant. Anchor the new pane off the last MCP-created scratch pane instead of the focused pane: a horizontal_split splits the existing scratch pane *vertically* (right of it), and the new pane becomes the next anchor. Result: consecutive MCP scratch panes line up as a horizontal strip along the bottom of the tab, and the agent's primary pane keeps its real estate. New: `TabbedTerminalState.splitVerticalFromPane(tabId, anchorPaneId, ratio, initialCommand)`. Implemented by extending the existing `performSplit` helper with an optional `anchorPaneId` argument — when provided, `SplitViewState.setFocusedPane` aims focus at the anchor before the existing `splitFocusedPane` call runs, so the same code path handles both flavors. Anchor lookup degrades gracefully: if the cached paneId no longer resolves to a session (user closed the pane), the caller falls back to the original `splitHorizontal` (focused-pane) behavior. run_in_panel changes: - When `panel == "horizontal_split"` and `registry.getScratchPane(tab)` points at a still-live pane, route through `splitVerticalFromPane` to stack the new pane right of the anchor. - Every successful split now updates the scratch-pane cache to the new paneId, so the *next* call sees the new pane as its anchor (chains horizontally). Previously only run_command updated the cache; now run_in_panel does too, and the two tools share the same anchor for the same tab. `vertical_split` and `new_tab` keep their existing semantics — no behavior change for callers that explicitly opt out of the anchor stacking. run_command is intentionally untouched: it reuses the cached pane on the next call (no new pane created), so the anchor-stacking rule doesn't apply. Verification: - `./gradlew :compose-ui:compileKotlinDesktop --no-daemon` clean. - Manual (reviewer): two run_in_panel horizontal_split calls in a row should yield two panes side-by-side in the bottom row, not nested. Generated with [Claude Code](https://claude.com/claude-code)

claude · 2026-05-20T09:13:06Z

Review of PR #262 — `run_command` MCP tool

Thorough, well-organized work — the per-tab cache lock, per-pane mutex, absolute-history slicing, and UNDISABLABLE_TOOLS defense-in-depth all show solid concurrency/safety thinking. A few real issues below.

Bugs

1. Hook script contradicts its own docs on the "nc missing" branch (docs/mcp-server.md)

The doc at line 567–570 promises:

The hook fails closed (lets Bash through) when nc is missing, since it can't verify the marker isn't stale.

…and the PR description's P0 #1 says the same. But the script as written does the opposite:

if command -v nc >/dev/null 2>&1; then
    nc -z 127.0.0.1 "\$port" >/dev/null 2>&1 || exit 0
fi
cat <<'JSON' …deny…

When nc is missing, the if body is skipped entirely and execution falls through to the cat that emits the deny. Net result: machines without nc (minimal Alpine, some stripped containers) get Bash blocked even when BossTerm's marker is stale or the port is dead — exactly the "route Claude to a dead port" failure mode P0 #1 claims to have fixed.

Fix:

if command -v nc >/dev/null 2>&1; then
    nc -z 127.0.0.1 "\$port" >/dev/null 2>&1 || exit 0
else
    exit 0  # can't verify port — let Bash through
fi

Worth re-running the "no nc — PASS" entry in the test plan on a host with nc actually removed from PATH.

2. mcpRunCommandMaxOutputBytes is a char count, not a byte count (BossTermMcpServer.kt: sliceCommandOutput)

if (sb.length + line.length + 1 > maxOutputBytes) {

StringBuilder.length and String.length are UTF-16 code-unit counts in the JVM, not byte counts. For ASCII the two coincide, but a run_command whose output contains emoji, CJK, or other supplementary-plane characters will silently exceed the documented byte budget — and the headroom math against mcpMaxAnswerChars (also char-based) leaks the discrepancy. Either rename the setting to mcpRunCommandMaxOutputChars, or convert via .toByteArray(Charsets.UTF_8).size (more expensive — keeping it char-based and aligning the name + docs is fine).

3. Doc says output cap is 200 KB, code defaults to 120 KB (docs/mcp-server.md:395)

- `output` is capped at 200 KB; `truncated` becomes `true` …

But the default for mcpRunCommandMaxOutputBytes is 120_000 and the settings table further down says so. Either a leftover from an earlier draft or one of the two wasn't updated. Suggest: "output is capped at the mcpRunCommandMaxOutputBytes setting (default 120 KB)".

Concerns

4. Caller-window resolution is re-run from scratch on every MCP request (ProcessAncestry.kt)

The Ktor interceptor fires resolveClientWindow per request. For SSE + per-POST JSON-RPC, that's once per tool call — typically dozens during a Claude Code session. Each macOS resolution spawns lsof once + ps -o ppid= per ancestor hop (2–4 subprocesses on the budget you quote); each Linux resolution can iterate all of /proc/*/fd/*. There's no caching keyed on remotePort even though (port → PID → ancestor → state) is invariant within a connection.

A small ConcurrentHashMap<Int, TabbedTerminalState> keyed on remotePort with an LRU bound (or invalidated on state un-register) would cut this to one resolution per ephemeral port. Worth doing before this lands on macOS where the subprocess fan-out is most painful.

5. Cache lock is held during pane creation (BossTermMcpServer.kt, registerRunCommand)

The tabCacheLock(tabId).withLock { … } block calls state.splitHorizontal / state.createTab while holding the lock. Those involve UI-thread interaction and may take more than the "held briefly" the registry comment promises. Two concurrent calls on the same tab will serialize across pane creation, which is fine for correctness but means the second caller's latency includes the first call's full pane-spawn time. Probably acceptable — flagging because the comment understates it.

6. if (script.isEmpty()) rejects the empty string but accepts whitespace-only

run_command(script: " ") and run_command(script: "\n") will both proceed to the shell. Fine for the shell ("just a newline"), but combined with the script.removeSuffix("\n").ifEmpty normalization in run_in_panel it's slightly inconsistent. Worth script.isBlank() if you want the early-exit guard to be uniform.

Smaller things

state.findSession(tabId, newPaneId) ?: state.findSession(newPaneId) fallback chain (BossTermMcpServer.kt ~1103): the second lookup uses a different (tab-less) overload — worth a one-line comment explaining when the new_tab path causes the first to miss (the new pane lives in the new tab, not the requested one).
writePortMarker ATOMIC_MOVE (BossTermMcpManager.kt ~510): nice. One small thing — the .mcp.port.tmp filename is hard-coded, which means two BossTerm processes binding concurrently could clobber each other's tmp file mid-write. Multi-instance is documented as unsupported, so likely a no-fix, but a per-PID suffix (.mcp.port.tmp.\${pid}) would be a one-line hardening.
SUBPROCESS_TIMEOUT_S = 2L (ProcessAncestry.kt): under load, lsof/ps can briefly exceed 2 s. The failure mode is benign (returns null → falls back to first-registered), but with MAX_PARENT_WALK = 16 the worst-case wait for a single resolution is 16 × 2 s = 32 s. Probably never hit, but worth tightening either the per-call timeout or the hop budget.
lastResolvedClientWindow() (McpTerminalRegistry.kt): the cached !in states membership check on CopyOnWriteArrayList is O(N) per call — fine for a handful of windows, but worth a Set mirror if window counts grow.
maxOutputBytes.coerceAtLeast(1024) is silent (BossTermMcpServer.kt:1018): a user who sets it to 100 won't know their value was raised. Same for the timeout clamp. A one-line log.warn on first coercion would surface misconfiguration.
UNDISABLABLE_TOOLS doc-only contract: applyDisabledSet correctly filters now, but the description claims manage_tools's own handler also rejects the reserved names — a quick comment pointer next to the constant would help future readers verify both paths in one place.

Tests

No unit tests in this PR for a sizeable chunk of net-new logic (run_command driver, slicing, pane resolution, ProcessAncestry). The manual test plan is good but reproducibility-fragile. At minimum, sliceCommandOutput is pure and trivially testable — startHistory/endHistory/cursorY combinations including the scroll-into-history case, the truncation case, and the endRowInclusive < startRow edge would catch regressions cheaply.

Nits

Consider tightening the instructions blurb (BOSSTERM_MCP_INSTRUCTIONS) — it's six sentences of system-prompt text on every initialize. Claude Code does surface MCP server instructions but tokens add up across many servers.
run_command description mentions "Requires OSC 133 shell integration" but doesn't tell the model what happens if it's missing (exit code = null after timeout). Worth adding that explicitly so the agent doesn't infinite-loop retrying.

Overall a strong PR. The hook-script-vs-doc divergence (1) is the only blocker I'd want addressed before merge; everything else is hardening/polish.

P0 #1 — hook script in docs/mcp-server.md "Using as Claude Code's default shell" didn't match the actual hook file shipped to ~/.claude/hooks/ prefer-bossterm.sh: it falls through and emits the deny JSON when nc is missing, contradicting the doc's own "fails closed" prose. Match the shipped script's `if ! command -v nc; then exit 0; fi; nc -z … || exit 0` shape so a missing nc never routes Claude to an unverified port. P0 #2 — `panel: "new_tab"` in run_command was structurally broken. `state.createTab(...)` returns a TAB id, but the code stuffed it into `newPaneId`, cached `mcpScratchPanes[origTabId] = newTabId`, and returned `RunCommandResult(tabId = origTabId, paneId = newTabId)`. Consequences: the next call with the same `tab_id` would look up `findSession(origTab, newTab)` (which can't resolve — newTab isn't a pane in origTab's split tree), evict as stale, and create yet another fresh tab — so the new_tab mode never got its reuse. The response's tabId was also wrong. Refactor: - `PaneResolution.Hit` now carries an `effectiveTabId` field. Splits and cache-hits carry the input tab id forward; new_tab path overrides with the freshly-created tab's id (and uses `findSession(newTab, newTab)` for a single-pane lookup that actually resolves). - The cache writes / reads land on the new tab id when applicable, so `panel: "reuse"` follow-ups on the same effective tab actually reuse. - `RunCommandResult.tabId` is now `resolveResult.tabId` (the effective tab id), so callers passing it back land in the right tab. P0 #3 — Multi-statement scripts (raw "\n" inside `script`) corrupted the result. The shell fires B/D once per statement; my `onCommandStarted` unconditionally overwrote historyAtB/cursorYAtB, so by the time the awaiter wakes up the start mark might point at the SECOND statement while the exit code is from the first. Fix: guard the snapshot with an `ourCommandStarted` AtomicBoolean flipped via `compareAndSet(false, true)`. Only the first B after our write records markers; later statements' B events no-op. Also added a schema note in the docs telling callers to use `bash -lc '…'` for compound logic, since the captured exit code is still only the first statement's even with the fix. P0 #4 — Stale OSC 133;D from a prior writer on the same pane (a fire-and-forget run_in_panel script, a send_input keystroke, the user manually typing) could complete `finishedSignal` before our script even left the gate, returning that command's exit code with a garbled slice. The per-pane Mutex serializes against other run_command calls but not other writers on the pane. Fix: an AtomicBoolean `weHaveWritten` is flipped right BEFORE the `session.writeUserInput(...)` call. The listener ignores any B until `weHaveWritten` is set, and ignores any D until `ourCommandStarted` has been set (which only happens on a B we accepted). Together they gate the entire B/D cycle to OUR command. Captured-var visibility across threads is handled via atomics — captured `var`s become kotlin.jvm.internal.Ref whose .element field isn't volatile. P1 #5 — `mcpRunCommandMaxOutputBytes` was a misnomer: `sb.length` and `line.length` are Kotlin String lengths (UTF-16 code units), not UTF-8 bytes. For ASCII it matches; for emoji-heavy output the real byte count can be 4× the configured value. Renamed everywhere to `mcpRunCommandMaxOutputChars` and updated the kdoc to spell out the unit choice (apples-to-apples with the also- char-counting `mcpMaxAnswerChars`). Settings.json entries set under the old name silently revert to the default — acceptable for a setting that shipped on the same dev branch. P1 #6 — `paneMutexes` only sheds entries on the cache-miss-with-stale- pane path. If a user closes the cached pane and never re-runs against that tab, the mutex sits forever. Bounded by pane-creation rate but unbounded over long-lived sessions. Added opportunistic GC in `McpTerminalRegistry.setScratchPane`: when the mutex map exceeds 32 entries, walk every state's tabs + pane snapshots, collect live ids, evict everything else. Naturally batches behind real new-pane work; no listener wiring needed. Also sheds matching stale `mcpScratchPanes` entries while the sweep is hot. Build: `./gradlew :compose-ui:compileKotlinDesktop --no-daemon` clean. Generated with [Claude Code](https://claude.com/claude-code)

claude · 2026-05-20T09:29:57Z

Review — PR #262 `run_command` MCP tool

Solid, deeply-considered work. The OSC 133 listener-before-send sequencing, the per-tab cache lock for cache-miss orphan-prevention, the per-pane FIFO mutex, and the explicit "first-B" gate via weHaveWritten + ourCommandStarted are well-thought-out concurrency invariants and the comments earn their keep. The sealed PaneResolution for exhaustive failure handling is the right shape.

Below: issues worth a second pass before merge, then smaller nits.

1. Per-request `lsof` / `ps` shell-out has no caching (perf)

compose-ui/.../BossTermMcpManager.kt:359-365 runs ProcessAncestry.resolveClientWindow inside the Ktor intercept block on every incoming request. The doc on ProcessAncestry itself budgets ~50–150 ms per resolution on macOS (lsof + ps × hops). MCP-over-SSE keeps the TCP connection alive but POSTs each JSON-RPC request; for an agent that fires run_command in a tight loop you pay that cost every time.

The client's remotePort is stable for the lifetime of one TCP connection, so caching keyed on remotePort (with eviction on disconnect or a small TTL) would collapse the steady-state cost to one resolution per connection. Today lastResolvedClient is the only "cache" — and it's a single global slot, which leads to issue #2.

2. `lastResolvedClient` is a single global slot — multi-client misrouting

McpTerminalRegistry.kt:260 is one AtomicReference<TabbedTerminalState?>. The PR docs flag the "concurrent multi-client racing" case as a known limitation, but it's more structural than it reads: any time two clients in different windows alternate, the no-tab_id tool calls swing between windows. Since MCP requests can interleave on different SSE connections, this isn't just a "rare race for a single window of time" — it's the default behavior under any multi-client setup.

The structurally-correct fix is per-request resolution: stash the result in a Ktor AttributeKey inside the interceptor, then have the tool handler read it. That requires the MCP SDK to expose the inbound ApplicationCall to tool handlers, which may or may not be possible in SDK 0.8.3 — if it isn't, threading a ThreadLocal/coroutine context element from the interceptor into the handler dispatch is the next-best option. Either avoids the cross-client write storm on the global slot.

3. Linux client-PID resolver only scans `/proc/net/tcp` (IPv4) — IPv6 loopback misses

ProcessAncestry.kt:144-193 reads only /proc/net/tcp. On some Linux configs the loopback connection lands in /proc/net/tcp6 (e.g. clients that pick ::1 or kernels that report mapped-v4 there). Inet4Address-only listening sockets on 127.0.0.1 should normally keep the connection in tcp, but the resolver will silently return null whenever it doesn't and the tool falls back to first-registered without anyone noticing. Worth either also scanning tcp6 or a DEBUG log when both are empty.

4. `splitVerticalFromPane` permanently shifts user focus

TabbedTerminalState.kt:652-654 calls splitState.setFocusedPane(anchorPaneId) before splitting. There's no restore. When an MCP run_command fires while the user is typing in some other pane, their keyboard focus snaps to the bottom scratch strip mid-keystroke. The split needs the focus redirect to land in the right place, but the user's prior focus should be captured before the redirect and restored after splitFocusedPane returns.

5. Output truncation keeps the first `maxOutputChars`, not the last

sliceCommandOutput at BossTermMcpServer.kt:1413 breaks out of the loop the first time sb.length + line.length + 1 > maxOutputChars. For the dominant use case (builds, tests, lints, package installs), the tail of the output is where errors live — head-truncation throws away exactly the bytes Claude needs. Consider tail-truncating instead, or at minimum a "head + … + tail" split. Even the docs say "Captured stdout/stderr (ANSI-stripped)" without flagging which end gets dropped, so the contract should match.

6. `UNDISABLABLE_TOOLS` containing `run_command` is a UX overreach

BossTermMcpServer.kt:938 reserves run_command because the user-global PreToolUse hook (which is opt-in, not bundled) routes Bash to it. Users who don't install the hook lose the ability to hide run_command from clients, even though the justification doesn't apply to them. manage_tools being reserved is genuinely structural — without it there's no way back. run_command's reservation is a side-effect of an optional external integration. Consider gating its reservation on the user having opted into the hook (e.g. only when ~/.claude/hooks/prefer-bossterm.sh exists), or making it a normal-but-default-on tool.

7. Race: OSC 133;A on fresh pane can arrive before the listener registers

In executeInPane, addCommandStateListener(listener) runs after splitHorizontal/createTab returned. For a fast shell (e.g. dash, or a cached zsh with no rc-file work) the prompt-ready OSC 133;A can fire before the listener attaches, and promptReadySignal.await() waits the full shellReadyTimeoutMs (1500 ms default) for an event that already passed. The pane resolution path could attach the listener before kicking off the split, or TerminalSession could expose a "has-seen-first-prompt" latched flag for the listener to consult on attach.

Nits

BossTermMcpServer.kt:1386 and :1909 reference MAX_OUTPUT_BYTES in kdoc but the constant was renamed to maxOutputChars / the per-call cap parameter — stale doc.
script.isEmpty() check at BossTermMcpServer.kt:1008 accepts " " and "\n". isBlank() would catch the no-op cases too.
Hook script: cat >/dev/null 2>&1 || true after set -e is harmless but unnecessary — cat consuming stdin into /dev/null doesn't fail in practice. The comment is the load-bearing part.
paneMutexes.size > MCP_MUTEX_GC_THRESHOLD triggers gcStalePaneMutexes() from inside setScratchPane. The GC walks every state × every tab × every pane snapshot — for users with 30+ panes the sweep on every successful new-pane creation isn't free. A small TTL or a per-N-calls counter would amortize better.
mcpScratchPanes value-side staleness check in gcStalePaneMutexes (v !in live) is good, but the "lazy clear on next read" comment in clearScratchPane doesn't actually exist as a code path — readers (run_command's cache-miss-with-stale-pane branch) call clearScratchPane only when they hit the stale entry. Worth either implementing the read-time lazy clear or removing the comment promise.

Test coverage

Test plan is manual-only (5 reviewer checklist items). For something with this much concurrency state (per-tab cache lock, per-pane FIFO mutex, B/D event gating, OSC 133;A race window, history-coordinate slicing) at least a few unit tests on sliceCommandOutput (history-delta scroll-through, oldest-row clamp, exact-boundary truncation) and on PaneResolution's new_tab branch would be cheap insurance. The atomic flags in executeInPane in particular look right but are exactly the kind of code that quietly regresses six months from now.

Overall direction is good and the rigor evident in the existing internal review pass is real — these are mostly polish/edge-case items, not "stop the merge" findings. #1 and #2 are the two I'd most want addressed before this becomes the default shell path for AI clients.

Generated with Claude Code

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mcp): run_command tool for visible-pane shell with read-back#262

feat(mcp): run_command tool for visible-pane shell with read-back#262
kshivang wants to merge 4 commits into
masterfrom
dev

kshivang commented May 20, 2026

Uh oh!

claude Bot commented May 20, 2026

Uh oh!

claude Bot commented May 20, 2026

Uh oh!

claude Bot commented May 20, 2026

Uh oh!

claude Bot commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kshivang commented May 20, 2026

Summary

What's in the box

Rigorous review pass

What's NOT in this PR

Test plan

Uh oh!

claude Bot commented May 20, 2026

Review of PR #262 — run_command tool

Issues

Doc/UX nits

Things that look correct

Test coverage

Uh oh!

claude Bot commented May 20, 2026

Follow-up review pass — ProcessAncestry and items the prior review did not surface

ProcessAncestry — bugs and performance

Items the prior review did not surface

Test coverage — same as prior review

Uh oh!

claude Bot commented May 20, 2026

Review of PR #262 — run_command MCP tool

Bugs

Concerns

Smaller things

Tests

Nits

Uh oh!

claude Bot commented May 20, 2026

Review — PR #262 run_command MCP tool

1. Per-request lsof / ps shell-out has no caching (perf)

2. lastResolvedClient is a single global slot — multi-client misrouting

3. Linux client-PID resolver only scans /proc/net/tcp (IPv4) — IPv6 loopback misses

4. splitVerticalFromPane permanently shifts user focus

5. Output truncation keeps the first maxOutputChars, not the last

6. UNDISABLABLE_TOOLS containing run_command is a UX overreach

7. Race: OSC 133;A on fresh pane can arrive before the listener registers

Nits

Test coverage

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Review of PR #262 — `run_command` tool

Review of PR #262 — `run_command` MCP tool

Review — PR #262 `run_command` MCP tool

1. Per-request `lsof` / `ps` shell-out has no caching (perf)

2. `lastResolvedClient` is a single global slot — multi-client misrouting

3. Linux client-PID resolver only scans `/proc/net/tcp` (IPv4) — IPv6 loopback misses

4. `splitVerticalFromPane` permanently shifts user focus

5. Output truncation keeps the first `maxOutputChars`, not the last

6. `UNDISABLABLE_TOOLS` containing `run_command` is a UX overreach