feat(monty): run_script/repl_python, HITL approval, multi-server wiring#115

Closed
runyaga wants to merge 33 commits into soliplex:main from
runyaga:feat/m2-monty-script-env

Conversation

@runyaga
Contributor

@runyaga runyaga commented Apr 15, 2026

Summary

  • Adds run_script and repl_python client-side tools to MontyScriptEnvironment, backed by a sandboxed Python interpreter via dart_monty
  • Wires HITL (human-in-the-loop) approval gate for Python execution — requiresApproval: true suspends the session until the user approves or denies
  • Fixes multi-server plugin wiring in standard.dart: SoliplexPlugin now receives connections for all registered servers, not just the primary connection
  • Adds StdoutSink debug logging in standard.dart for development visibility
  • Enhances tool_call_tile.dart with richer tool call display and clipboard support
  • Adds SoliplexConnection.alias and serverUrl fields for improved connection identification
  • Adds HITL unit tests (hitl_test.dart) and expands MontyScriptEnvironment tests

Test plan

  • dart test passes in soliplex_agent and soliplex_monty_plugin
  • run_script and repl_python tools appear in LLM context for Monty rooms
  • Python execution gate: approval banner appears before code runs
  • Denying a tool call cancels the session (no LLM retry loop)
  • Multi-server: connections from all servers available inside Python via SoliplexPlugin
  • Tool call tile renders tool name, arguments, and result with copy support

🤖 Generated with Claude Code

runyaga and others added 30 commits April 14, 2026 20:34
…x API

New package `packages/fe_plugin_soliplex` exposes Soliplex server operations
as host functions callable from sandboxed Python via dart_monty's plugin system.

Host functions: soliplex_list_servers, soliplex_list_rooms, soliplex_get_room,
soliplex_get_documents, soliplex_get_chunk, soliplex_list_threads,
soliplex_create_thread, soliplex_delete_thread, soliplex_converse (stub),
soliplex_upload_file, soliplex_upload_to_thread, soliplex_get_mcp_token.

Multi-server support — each function accepts optional `server` parameter.
Default room and server configurable at construction.

TODO: Wire soliplex_converse with AgUiStreamClient for full AG-UI conversation
flow (SSE streaming, client-side tool calling, state pass-through).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rver API

Replace stub soliplex_converse with real AG-UI SSE streaming via
new_thread/reply_thread. All functions now require explicit server
and room_id — no defaults.

- Add SoliplexConnection adapter (avoids soliplex_agent dependency)
- 11 host functions: list_servers, list_rooms, get_room, get_documents,
  get_chunk, list_threads, new_thread, reply_thread, upload_file,
  upload_to_thread, get_mcp_token
- Internal _ThreadState tracks message history and AG-UI state per thread
- 23 unit tests with mocked API/SSE streams
- Integration tests against demo.toughserv.com + localhost:8000
  (multi-server simultaneous connections verified)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests prove the full pipeline: Python → AgentSession FFI bridge →
SoliplexPlugin host functions → live Soliplex SSE streaming.

Working tests (sandbox: true):
- list_servers, list_rooms, get_room from Python
- Single SSE new_thread conversation

Known limitation: the FFI native library holds global state that is
corrupted after host functions perform async I/O. The second execute()
call segfaults regardless of sandbox mode. See dart_monty#271.

Multi-turn and bwrap codegen tests are written but blocked by this
FFI issue. WASM backend or Rust fix needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Session

Single long-lived AgentSession with SoliplexPlugin. Results:
- Discovery: list_servers, list_rooms (both servers), get_room
- SSE streaming: new_thread on demo cooking room
- Multi-turn: 3-turn bruschetta conversation via reply_thread
  (thread_id persists across execute() calls)
- Cross-server: new_thread on local chat room
- bwrap codegen → extract → execute

8/10 tests pass. The full pipeline is proven:
  Python → AgentSession → SoliplexPlugin → SSE streaming → response
  → state persistence → reply_thread with history → multi-turn works

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both upload tests pass when run in isolation:
- upload_file: agent-test.txt → bwrap_sandbox room on local
- upload_to_thread: thread-notes.txt → thread on local chat room

Full suite hits intermittent Rust crash on 4th execute() call
(same "no active frame" / SEGFAULT as #271). Tests 1-3 and
uploads pass reliably. The crash is in the monty crate's VM
recompilation path, not in the plugin or upload code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-server pipeline tests using AgentSession + SoliplexPlugin
with fixed dart_monty (NativeFinalizer race fix):

- Cross-server discovery: both servers, rooms, skills
- Demo recipe → upload to local bwrap_sandbox room
- 3-turn pad thai conversation on demo → cross-server summary on local
- Pancake recipe: demo → upload → local comments
- bwrap codegen with monty rules (LLM formatting inconsistent)

State persists across execute() calls: thread_id, recipe text,
conversation responses all survive for cross-server handoff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test 8 proves the complete pipeline:
1. Create thread on bwrap_sandbox
2. Upload monty-rules.md with full API reference
3. Ask agent to read file and generate code
4. Agent generates valid monty code using host functions
5. Extract code from ```monty``` block
6. Execute: code calls list_servers, list_rooms, get_room
7. Returns skills map across BOTH demo + local servers

The generated code correctly uses json.loads() on all host function
returns, iterates servers, finds rooms with skills, and returns
structured data. Zero human intervention after the prompt.
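
The shape of the generated code described above can be sketched as follows. The three host functions are stubbed so the snippet runs outside the sandbox. Their names follow the commit text, but the JSON payload shapes (`id`, `skills`) and the sample data are assumptions made for illustration, not the plugin's actual output.

```python
import json

# Hypothetical stand-ins for the sandbox host functions. The real ones are
# provided by the plugin; the JSON payload shapes below are assumed.
_FAKE_ROOMS = {
    "demo": {"cooking": ["rag"], "chat": []},
    "local": {"bwrap_sandbox": ["bwrap"], "chat": []},
}


def list_servers():
    return json.dumps(list(_FAKE_ROOMS))


def list_rooms(server):
    return json.dumps([{"id": room_id} for room_id in _FAKE_ROOMS[server]])


def get_room(server, room_id):
    return json.dumps({"id": room_id, "skills": _FAKE_ROOMS[server][room_id]})


def build_skills_map():
    """Return {server: {room_id: [skills]}} for rooms that have skills."""
    skills_map = {}
    for server in json.loads(list_servers()):
        rooms_with_skills = {}
        for room in json.loads(list_rooms(server)):
            # Every host function returns a JSON string, so decode each one.
            detail = json.loads(get_room(server, room["id"]))
            if detail["skills"]:
                rooms_with_skills[room["id"]] = detail["skills"]
        skills_map[server] = rooms_with_skills
    return skills_map


result = build_skills_map()
```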

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Advanced scenarios:
- Upload full monty ruleset with all plugins (soliplex, template, msgbus, fs)
- Codegen: data pipeline with caching + templates
- Codegen: cross-server intelligence gathering
- Codegen: recipe → file → template report card
- Codegen: orchestrate conversations across servers

debug_null_return.dart proves all SSE calls work in dart run:
3 sequential SSE calls, state persistence, all return non-null.
The null returns in dart test are a test-runner zone issue, not a
code bug. Production (dart run) works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SSE event flow is correct: RunStarted → ThinkingStart → ThinkingContent →
TextMessageStart → TextMessageContent → TextMessageEnd → RunFinished.
All events arrive, content is accumulated properly.
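
A minimal sketch of the accumulation step, assuming each event is a dict with a `type` field and text chunks arrive in a `delta` field. The event names follow the sequence above; the dict layout is an assumption, not the AG-UI wire format.

```python
def accumulate_text(events):
    """Join TextMessageContent deltas until RunFinished is seen."""
    parts = []
    finished = False
    for event in events:
        kind = event["type"]
        if kind == "TextMessageContent":
            parts.append(event["delta"])
        elif kind == "RunFinished":
            finished = True
            break
        # Thinking* events and start/end markers carry no reply text.
    if not finished:
        raise RuntimeError("stream ended before RunFinished")
    return "".join(parts)


events = [
    {"type": "RunStarted"},
    {"type": "ThinkingStart"},
    {"type": "ThinkingContent", "delta": "planning..."},
    {"type": "TextMessageStart"},
    {"type": "TextMessageContent", "delta": "Toast the "},
    {"type": "TextMessageContent", "delta": "bread."},
    {"type": "TextMessageEnd"},
    {"type": "RunFinished"},
]
text = accumulate_text(events)
```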

The null returns in tests are caused by transient HTTP 500 from the
bwrap_sandbox server when creating threads rapidly after prior SSE
streams. The ApiException propagates through Python's state wrapping
try/except, silently leaving variables undefined → null.

Not an SSE or plugin bug — server-side resource management on
bwrap_sandbox with bubblewrap sandboxes.
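
The failure mode can be reproduced in a few lines. This sketch assumes the state wrapper behaves roughly like a try/except around the user code followed by a lookup of a result variable; the actual wrapper is not shown in this PR, and `new_thread` here is a stub that always fails.

```python
class ApiException(Exception):
    """Stand-in for the HTTP 500 surfaced by the host function."""


def new_thread(server, room_id, message):
    # Hypothetical host function that fails the way the commit describes.
    raise ApiException("HTTP 500 from " + server + "/" + room_id)


def run_with_state_wrapper(user_code, namespace):
    """Run user_code, swallow errors, then read back `result`."""
    try:
        exec(user_code, namespace)
    except Exception:
        pass  # the failure is hidden; `result` was never assigned
    return namespace.get("result")  # a missing variable comes back as None


namespace = {"new_thread": new_thread}
value = run_with_state_wrapper(
    "result = new_thread('local', 'bwrap_sandbox', 'hi')", namespace
)
```

The caller sees `None` with no trace of the exception, which matches the symptom observed in the tests.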

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upload real experiment files from ioi-experiments to bwrap_sandbox,
agent generates monty code to solve construction scheduling problems.

Test 2 (baseline): Agent generated 2716 chars of scheduling code that
runs in monty — creates blackboard dict, tracks jobs/deps/weather/workers,
produces a day-by-day schedule. Code executes end-to-end.

Test 4 (disruption): Agent generated code but used import os (not
available in the monty sandbox). The ruleset needs to list monty's
stdlib limitations.

Pipeline: upload experiment files → agent reads files → generates
monty code → extract from code block → execute in sandbox → result.
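
The "extract from code block" step can be sketched as below. It assumes the agent wraps its code in a fence tagged `monty`, as a later commit in this PR describes; the helper name and the regular expression are illustrative, not the code used by the tests.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit in a fence

# Matches the first fenced block tagged "monty" and captures its body.
_MONTY_BLOCK = re.compile(
    re.escape(FENCE) + r"monty[ \t]*\n(.*?)\n?" + re.escape(FENCE),
    re.DOTALL,
)


def extract_monty_code(reply):
    """Return the code inside the first monty fence, or None if absent."""
    match = _MONTY_BLOCK.search(reply)
    if match is None:
        return None
    return match.group(1)


reply = (
    "Here is the script:\n"
    + FENCE + "monty\n"
    + "x = 1 + 1\n"
    + "x\n"
    + FENCE + "\n"
    + "Let me know if it fails."
)
code = extract_monty_code(reply)
```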

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rerun with updated prompt rules:
1. Baseline: Executes but WRONG — marks jobs done mid-iteration,
   assigns H1_FRM same day as H1_FND (dep not actually satisfied yet)
2. Optimal: Same bug — copies baseline logic
3. Disruption: Used open() despite rules — needs stronger guidance
4. Infeasible: CORRECT ✅ — clean f-strings, 9 < 15 = infeasible

The baseline/optimal bug is a real algorithmic error: completing jobs
inside the same loop pass where deps are checked. Need to collect
assignments first, then mark complete after the day loop.
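
The bug and its fix can be shown side by side. Job names are taken from the experiment output quoted in this PR; the dependency table and the two-jobs-per-day capacity are assumptions chosen to keep the sketch small.

```python
# Assumed dependency table: each job lists the jobs that must finish first.
DEPS = {
    "H1_FND": [],
    "H1_FRM": ["H1_FND"],
    "H1_ROF": ["H1_FRM"],
    "H2_FND": [],
    "H2_FRM": ["H2_FND"],
}


def schedule(days, capacity, commit_immediately):
    done = set()
    plan = {}
    for day in range(1, days + 1):
        assigned = []
        for job in DEPS:
            if job in done or len(assigned) >= capacity:
                continue
            if all(dep in done for dep in DEPS[job]):
                assigned.append(job)
                if commit_immediately:
                    done.add(job)  # bug: unlocks dependents on the same day
        # Fix: mark jobs complete only after the day's pass has finished.
        done.update(assigned)
        plan[day] = assigned
    return plan


buggy = schedule(days=3, capacity=2, commit_immediately=True)
fixed = schedule(days=3, capacity=2, commit_immediately=False)
```

With `commit_immediately=True`, day 1 contains both `H1_FND` and `H1_FRM`, which is the error described above. With it off, `H1_FRM` waits until day 2.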

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With sandbox filesystem + updated prompts, all experiments produce
correct results:

1. Baseline: Day1=rain, Day2=H1_FND, Day3=H1_FRM+H2_FND, Day4=H1_ROF+H2_FRM ✅
2. Optimal: Same schedule (already optimal) ✅
3. Disruption: Alice sick day 2 — H1_FND done by Bob, Alice back day 3 ✅
   Generated code correctly: collects assignments first, marks done after
4. Infeasible: 9 slots < 15 jobs = infeasible ✅

Files are now dual-written: server thread (bwrap reads) + sandbox
filesystem (generated code reads with Path().read_text()).

Monty limitations discovered and documented in prompt rules:
- No := walrus, no open(), no % format, no enumerate(start=)
- No chained assignment, no dict dot access
- bb_dump not a real function (from experiment spec)
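
As a rough guide to what the prompt rules steer the model toward, each unsupported construct has a plain rewrite. The rewrites below run on CPython. Monty support is stated in this PR only for f-strings and `Path().read_text()`, so treat the rest as plausible alternatives rather than verified ones; the sample data is invented.

```python
import tempfile
from pathlib import Path

jobs = ["H1_FND", "H1_FRM"]

# Instead of: if (n := len(jobs)) > 1: ...
n = len(jobs)
has_many = n > 1

# Instead of: "day has %d jobs" % n
label = f"day has {n} jobs"

# Instead of: for i, job in enumerate(jobs, start=1): ...
numbered = []
for i, job in enumerate(jobs):
    numbered.append((i + 1, job))

# Instead of: a = b = 0
a = 0
b = 0

# Instead of: config.workers  (dot access on a dict)
config = {"workers": 2}
workers = config["workers"]

# Instead of: open(path).read()
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_text("rain on day 1")
    notes = path.read_text()
```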

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Runs each wave5 experiment on its own fresh thread/session, prints
the complete generated code for inspection.

Findings from run:
- Baseline: LLM generated pseudocode with := and custom syntax
  (not valid Python). The model doesn't reliably follow rules.
- Optimal: Used set literals, match expressions, ⊆ operator —
  not Python at all
- Disruption/Infeasible: Server overloaded from too many threads

The local Ollama model (gpt-oss) is inconsistent — sometimes
generates valid Python, sometimes pseudocode. The prompt rules
help but don't guarantee compliance. Need either:
1. Better model (GPT-4o on demo.toughserv.com is more reliable)
2. Validation + error correction loop
3. Stronger prompt constraints
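
Option 2 could start with a pre-flight check like the one below. This is a hypothetical sketch that uses CPython's `ast` module to flag constructs this PR reports as unsupported in monty; it is not part of the PR, and the rule list is only as complete as the limitations mentioned here.

```python
import ast


def find_violations(source):
    """Return human-readable problems found in LLM-generated code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        # Pseudocode such as the set-operator output lands here.
        return [f"line {err.lineno}: not valid Python ({err.msg})"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.NamedExpr):
            problems.append((node.lineno, "walrus operator"))
        elif isinstance(node, ast.Assign) and len(node.targets) > 1:
            problems.append((node.lineno, "chained assignment"))
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            names = [alias.name for alias in node.names]
            module = getattr(node, "module", None)
            if "os" in names or module == "os":
                problems.append((node.lineno, "import os"))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id == "open":
                problems.append((node.lineno, "open()"))
            elif node.func.id == "enumerate" and any(
                kw.arg == "start" for kw in node.keywords
            ):
                problems.append((node.lineno, "enumerate(start=)"))
    return [f"line {lineno}: {msg}" for lineno, msg in sorted(problems)]
```

The returned messages could be appended to the next prompt so the model gets a concrete list of lines to fix before anything is executed.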

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
show_generated_demo.dart uses demo.toughserv.com (GPT-4o):
- All 4 experiments generate code (1254-1554 chars each)
- Full monty tracebacks with line numbers shown on errors
- GPT-4o generates valid Python syntax (unlike Ollama's pseudocode)
- But: uses msg_send/bb_dump/locals() despite rules saying otherwise
- Dep checking logic wrong in baseline/optimal (checks within-day)

Code analysis per experiment:
1. Baseline: syntax OK, logic bug (deps checked in same day dict)
2. Optimal: syntax OK, logic bug (schedule keys are day numbers)
3. Disruption: defaultdict import crashes monty
4. Infeasible: locals() not available, wrong approach (tries scheduling)

Next: strengthen prompt rules to forbid unlisted functions,
add error correction loop to fix generated code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qwen_experiments.dart: compare 8B vs 35B on 4 tasks (fibonacci,
  discovery, scheduling, pipeline)
qwen_room_chat.dart: 8B asks questions → 35B answers → 5 rounds

Results:
- 35B explains Python decorators correctly
- 8B generates follow-up question about decorator parameters
- 35B analyzes 8B's response
- Server 500s after ~3 rapid thread creations (server resource limit)

Qwen rooms configured with RAG skill, file tools, attachments:
- qwen_8b: spark-3b12:8002, Qwen3-8B-FP8
- qwen_vllm: spark-3b12:8000, Qwen3.5-35B-A3B-FP8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…spose() leak

Rename package directory and pubspec name from fe_plugin_soliplex to
soliplex_monty_plugin to align with soliplex_client/soliplex_agent naming.

Fix SoliplexPlugin.onDispose() which was a no-op — HTTP connections from
all registered SoliplexConnection instances were never closed. Closes
runyaga/soliplex-audit#3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Delete 6 debug/experiment scripts that hard-coded /Users/runyaga/dev/... paths
- Replace hardcoded demo.toughserv.com with SOLIPLEX_DEMO_URL env var in all
  test files; SOLIPLEX_LOCAL_URL env var added for local URL (default localhost:8000)
- Fix fe_plugin_soliplex → soliplex_monty_plugin in all imports (lib + tests)
- wave5 file-reading tests skip gracefully when IOI_EXPERIMENTS_DIR /
  MONTY_DOCS_DIR env vars are unset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MCP connectivity is a Flutter-layer concern — Python scripts receive
pre-authenticated handles and should not fetch raw tokens themselves.

- Remove _getMcpToken getter and HostFunction from SoliplexPlugin
- Remove MCP section from systemPromptContext
- Update functions count assertion 11→10
- Add onDispose test (100% coverage on lib/)
- Add no-TextMessageStartEvent edge case test
- Fix relative import → package: import in soliplex_plugin.dart
- Fix import ordering in test file
- Format integration test files (pre-existing style debt)

Gates: format ✓  analyzer ✓  coverage 100% ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…by dart_monty

Implements M2 milestone: MontyScriptEnvironment wraps a dart_monty.AgentSession
and exposes execute_python as a ClientTool with reactive ScriptingState signal.

Changes:
- soliplex_agent: add ScriptingState enum and onAttach/scriptingState to
  ScriptEnvironment interface; export ToolExecutionContext
- soliplex_monty_plugin: MontyScriptEnvironment (lib/src/), unit tests
  (test/src/), FFI + WASM integration tests (test/integration/)
- WASM test infra: dart_test.yaml, custom HTML template, bridge/worker JS
  committed to lib/wasm_assets/ (wasm binary gitignored, build separately)

Gates: format ✓  analyzer ✓  coverage 100% ✓  integration/ffi ✓  integration/wasm ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MontyScriptEnvironment no longer uses SoliplexPlugin as a MontyPlugin.
Soliplex operations are now registered directly as dm.HostFunction on
the AgentSession bridge, and the bridge's schema registry is projected
to ClientTools visible to the server-side LLM.

Key changes:
- Register soliplex_list_servers, soliplex_list_rooms, soliplex_list_threads,
  soliplex_new_thread, soliplex_reply_thread directly on the bridge via
  _register() — no plugin system involved
- _projectToClientTool() converts HostFunctionSchema.toJsonSchema() to
  Tool.parameters and routes ClientTool executor directly to the Dart
  handler (no Python hop)
- _tools built lazily from session.schemas (filtered) + execute_python
- SoliplexConnection.fromServerConnection() factory for clean wiring
- Add soliplex_logging dev_dependency for LoggerFactory extension
- Add integration tests: T0 (secret_number callback proof), T1 (Soliplex
  tools visible), T2 (execute_python), T3 (state persistence), T4 (signal)
- Add tool/test_integration_ffi.sh and tool/test_integration_wasm.sh
- Add tool/chat_probe.dart for manual inspection

All 5 tests pass on FFI and WASM/Chrome.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Mutex to serialise concurrent dm.AgentSession.execute() calls so
  concurrent Python tool invocations on a shared interpreter cannot stomp
  each other's variable state

- Add wrapSharedScriptEnvironment() factory to soliplex_agent: wraps a
  caller-owned ScriptEnvironment without taking dispose ownership, making
  the shared-env pattern explicit and safe

- Update stateful test group to use wrapSharedScriptEnvironment instead of
  wrapScriptEnvironmentFactory so the lifecycle contract is unambiguous

- T5: regression guard proving dart_monty Isolate/Worker is non-blocking
  (43 FFI / 35 WASM heartbeats confirm event loop stays free during Python)

- T7: proves fire-and-forget sessions have isolated Python state (fresh
  dm.AgentSession per spawn = fresh interpreter = no variable leakage)

- Fix pre-existing ScriptEnvironment test fakes missing onAttach() /
  scriptingState; remove redundant internal imports in agent test helpers

All 7 integration tests pass on both FFI and WASM (Chrome).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Failing tests written first for each gap; implementation then made them pass.

**execution timeout** (`_executionTimeout` field, default 30 s / 2 s in tests)
- `Future.timeout()` wraps `_montySession.execute()` inside the mutex;
  throws `TimeoutException`, releases the mutex cleanly.
- `forTest` accepts `executionTimeout:` so timeout tests run at 500 ms
  without waiting 30 s.

**dispose drain** (replace `unawaited(_montySession.dispose())`)
- `dispose()` now queues `_montySession.dispose()` via
  `_executeMutex.protect(...)`, guaranteeing the Python interpreter is
  only destroyed after any in-flight `execute()` releases the mutex.
- Dispose-verify test updated to pump the event loop before verify.

**in-mutex `_disposed` re-check**
- Calls that entered `_executePython` before `dispose()` but are still
  waiting at the mutex now throw `StateError` after acquiring it, instead
  of calling the already-destroyed session.

**new unit tests** (19 added, 31 total, 0 warnings):
- `timeout` (3): TimeoutException, idle restored, mutex released after timeout
- `concurrency` (3): serialisation order, exception isolation, signal cycling
- `dispose safety` (2): drain before session.dispose, queued callers rejected
- `isolation` (1): deterministic replacement for weak LLM-mediated T7
- `corner cases` (3): large result, missing code key, mid-flight cancel docs

**pre-existing fix**: stub `mockSession.schemas → []` in setUp so the
`late final _tools` initialiser does not throw on first access.
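
The three rules above (timeout inside the mutex, dispose drain, in-mutex re-check) interact, so a compact model may help. This is a Python `asyncio` analogue, not the Dart implementation: `asyncio.Lock` stands in for the Mutex, `asyncio.wait_for` for `Future.timeout()`, `RuntimeError` for `StateError`, and `FakeSession` is a hypothetical stub.

```python
import asyncio


class ScriptEnvironment:
    """Asyncio analogue of the mutex, timeout and dispose rules above."""

    def __init__(self, session, timeout):
        self._session = session  # hypothetical: has execute() and dispose()
        self._timeout = timeout
        self._lock = asyncio.Lock()
        self._disposed = False

    async def execute(self, code):
        if self._disposed:
            raise RuntimeError("environment disposed")
        async with self._lock:
            # Re-check: dispose() may have run while we waited for the lock.
            if self._disposed:
                raise RuntimeError("environment disposed")
            # The timeout sits inside the lock, so the lock is released
            # whether the call finishes, fails or times out.
            return await asyncio.wait_for(
                self._session.execute(code), self._timeout
            )

    async def dispose(self):
        self._disposed = True
        async with self._lock:  # drain: wait for any in-flight execute()
            await self._session.dispose()


class FakeSession:
    def __init__(self):
        self.log = []

    async def execute(self, code):
        self.log.append("start " + code)
        await asyncio.sleep(3600 if code == "hang" else 0.01)
        self.log.append("end " + code)
        return code.upper()

    async def dispose(self):
        self.log.append("dispose")


async def demo():
    session = FakeSession()
    env = ScriptEnvironment(session, timeout=0.2)
    results = await asyncio.gather(env.execute("a"), env.execute("b"))
    try:
        await env.execute("hang")
        timed_out = False
    except asyncio.TimeoutError:
        timed_out = True
    after_timeout = await env.execute("c")  # the lock was released
    await env.dispose()
    return results, timed_out, after_timeout, session.log


results, timed_out, after_timeout, log = asyncio.run(demo())
```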

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add dcm_options.yaml with ~80 rules adapted from dart_monty (internal
  path exclusions stripped; only test/** and *.g.dart kept)
- Wire both linters into analysis_options.yaml via include directives
- monty_script_environment.dart: dynamic→Object?, late final→nullable+??=,
  async {}→Future.value(), non-null assertion → local var, six
  newline-before-return, _stateSignal cascade dispose, dispose-class-fields
  exclusion (tearoff through unawaited(protect()) is not DCM-traceable)
- soliplex_plugin.dart: move @override methods before private helpers to
  satisfy member-ordering (DCM classifies them as public-methods)

dart format, dart analyze --fatal-infos, dcm analyze lib: zero issues.
31/31 unit tests pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace AgentUiDelegate (callback-based, single-tool) with a
signal-driven Human-in-the-Loop gate that scales to concurrent
tool calls and integrates cleanly with the signals reactive layer.

## What changed

**ClientTool API** (`tool_registry.dart`):
- `requiresApproval: bool` — when true, AgentSession suspends
  execution before the tool executor runs and emits a
  PendingApprovalRequest; the UI must call approveToolCall or
  denyToolCall to resume.
- `platformConsentNote: String? Function()?` — optional callback for
  tools that trigger an OS-level permission dialog (e.g. clipboard
  read on web). Returns a human-readable note; AgentSession emits
  PlatformConsentNotice (non-blocking, informational).

**New types**:
- `PendingApprovalRequest` — immutable data class (toolCallId,
  toolName, arguments) emitted on pendingApproval signal.
- `PlatformConsentNotice` / `AwaitingApproval` — ExecutionEvent
  subclasses for consent/approval lifecycle.

**AgentSession** (`agent_session.dart`):
- `pendingApproval: ReadonlySignal<PendingApprovalRequest?>` — UI
  watches this to render Allow/Deny UI.
- `approveToolCall(String) / denyToolCall(String)` — resolves the
  Completer gating the suspended tool call.
- `_awaitApproval()` — internal gate; stores Completer per
  toolCallId in _pendingApprovals, signals UI, awaits resolution.
  Auto-denies on session cancel.

**Deleted**:
- `AgentUiDelegate` (58 lines) + its 457-line test file.
  Replaced entirely by the signal approach above.

**session_extension.dart**: `onDispose()` → `dispose()` rename for
consistency with Dart disposal conventions.

**Tests** (`hitl_test.dart`): 12 tests covering requiresApproval
defaults, three approval tiers (agent-gate / OS-gate / ungated),
PendingApprovalRequest fields, platformConsentNote callbacks, and
PlatformConsentNotice equality.
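
The gate's control flow, one pending completer per toolCallId that the UI resolves, can be modelled in a few lines. This is a Python `asyncio` analogue written for illustration: `asyncio.Future` plays the role of Dart's Completer, and `ApprovalGate`, `run_tool` and the call ids are hypothetical names, not the AgentSession API.

```python
import asyncio


class ApprovalGate:
    """One pending future per tool call id, resolved by the UI."""

    def __init__(self):
        self._pending = {}

    @property
    def pending_ids(self):
        return list(self._pending)

    async def await_approval(self, tool_call_id):
        future = asyncio.get_running_loop().create_future()
        self._pending[tool_call_id] = future
        try:
            return await future  # True = approved, False = denied
        finally:
            self._pending.pop(tool_call_id, None)

    def approve(self, tool_call_id):
        self._resolve(tool_call_id, True)

    def deny(self, tool_call_id):
        self._resolve(tool_call_id, False)

    def cancel_all(self):
        # Mirrors "auto-denies on session cancel".
        for tool_call_id in list(self._pending):
            self._resolve(tool_call_id, False)

    def _resolve(self, tool_call_id, approved):
        future = self._pending.get(tool_call_id)
        if future is not None and not future.done():
            future.set_result(approved)


async def run_tool(gate, tool_call_id, requires_approval, executor):
    # Ungated tools skip the gate; gated tools run only after approval.
    if requires_approval and not await gate.await_approval(tool_call_id):
        return "denied"
    return executor()


async def demo():
    gate = ApprovalGate()
    gated = asyncio.create_task(run_tool(gate, "call-1", True, lambda: "ran"))
    other = asyncio.create_task(run_tool(gate, "call-2", True, lambda: "ran"))
    await asyncio.sleep(0)  # let both calls reach the gate
    waiting = gate.pending_ids
    gate.approve("call-1")
    gate.cancel_all()  # call-2 is still pending, so it is denied
    ungated = await run_tool(gate, "call-3", False, lambda: "ran")
    return waiting, await gated, await other, ungated


waiting, gated_result, other_result, ungated_result = asyncio.run(demo())
```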

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- ScriptEnvironment now implements SessionExtension directly instead of
  using a separate ScriptEnvironmentExtension adapter class.
- Deleted ScriptEnvironmentExtension (no remaining references).
- wrapScriptEnvironmentFactory → toOwnedFactory
- wrapSharedScriptEnvironment → toSharedFactory
- _SharedScriptEnvironmentExtension → _SharedScriptEnvironmentProxy
- onDispose() → dispose() in the proxy (matches SessionExtension rename)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire AgentSession.pendingApproval through ThreadViewState into a
self-contained ToolApprovalSlot widget so only the slot rebuilds on
approval changes — not the entire thread body column.

**New widgets** (`ui/tool_approval_banner.dart`):
- ToolApprovalSlot — owns .watch(context); renders nothing when null.
- ToolApprovalBanner — tool name header, scrollable code preview
  (max 160 px), Allow (FilledButton) / Deny (TextButton).

**ThreadViewState**: mirrors session.pendingApproval to its own
_pendingApproval signal; subscribes in _attachSession, unsubscribes
and resets to null in _detachSession.

**Example tools** (`modules/tools/`):
- get_device_info — ungated (requiresApproval: false, no consent).
- confirm_action — agent-gated (requiresApproval: true); shows
  action argument in the approval banner preview.
- get_clipboard — platform-gated via platformConsentNote; emits
  PlatformConsentNotice on web (browser clipboard permission),
  silent on native.

**standard.dart**: registers all three example tools; wires
MontyScriptEnvironment via extensionFactoryBuilder; startup probe.

**macOS / web**: CocoaPods xcconfig includes for shared_preferences
re-linking after flutter clean; dart_monty WASM bridge assets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MontyScriptEnvironment rewritten to accept a list of MontyPlugins
instead of hard-wiring Soliplex connections. Host functions are
registered on the dart_monty bridge and projected as direct
ClientTools (no Python hop).

execute_python now has requiresApproval: true — HITL gate suspends
execution until the user allows or denies from the approval banner.

probe() validates the interpreter on startup by running `1 + 1`.

Regression test: error messages must not leak Rust interpreter
internals (NodeIndex, ExprSubscript, node_index:). Currently failing
pending runyaga/monty subscript tuple-unpack fix.

Add a defensive MissingPluginException guard in DefaultBackendUrlStorage
for flutter clean / CocoaPods re-link scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split execute_python into run_script (one-shot) and repl_python (persistent
  REPL) with differentiated descriptions so the LLM picks the right tool
- Return Python errors as tool output (status:completed) instead of throwing;
  includes any print() output that occurred before the error
- Return 'None' for in-place ops (arr.sort()) so LLM knows execution succeeded
- denyToolCall cancels the AgentSession to prevent LLM retry loops
- Serialize approval-required tools in _executeAll to prevent concurrent
  approval banner deadlock
- Wire SoliplexPlugin with all active ServerManager connections (not just
  current room's server)
- ThreadKey (serverId, roomId, threadId) used as _threadStates map key
- SoliplexConnection gains alias + serverUrl; _listServers returns full metadata
- onDispose no longer closes injected connections (owned by ServerManager)
- Remove unsupported help() from systemPromptContext
- Copy button on tool call tile code/result blocks
- Bold labels + monospace container in ToolCallTile
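
The "errors as tool output" behaviour can be illustrated with CPython's `exec`. This is a sketch of the idea only: the real tools run code through dart_monty, and apart from `status: completed` the result field names (`stdout`, `result`, `error`) are assumptions.

```python
import ast
import contextlib
import io


def run_as_tool(code, namespace):
    """Run code and always return a completed tool result, never raise."""
    stdout = io.StringIO()
    value = None
    error = None
    try:
        tree = ast.parse(code)
        last_expr = None
        if tree.body and isinstance(tree.body[-1], ast.Expr):
            # Evaluate a trailing expression separately to capture its value.
            last_expr = ast.Expression(tree.body.pop().value)
        with contextlib.redirect_stdout(stdout):
            exec(compile(tree, "<script>", "exec"), namespace)
            if last_expr is not None:
                value = eval(compile(last_expr, "<script>", "eval"), namespace)
    except Exception as err:
        error = f"{type(err).__name__}: {err}"
    return {
        "status": "completed",
        "stdout": stdout.getvalue(),  # kept even when an error follows
        "result": repr(value),  # 'None' tells the LLM the code did run
        "error": error,
    }


state = {}
sorted_in_place = run_as_tool("arr = [3, 1, 2]\narr.sort()", state)
failed = run_as_tool("print('before')\n1 / 0", {})
```

Because the namespace dict is passed in by the caller, reusing it across calls gives REPL-style persistence, while passing a fresh dict gives one-shot behaviour.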

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…act args

* feat: activity log, state panel, and expandable tool call args

Surfaces previously invisible AG-UI events in the execution UI:
- ActivityLog: collapsible sub-agent call/result log with markdown rendering
- StatePanel: info-chiclet toggle showing aguiState JSON with copy support
- StepLog: expandable args per tool call step (via ToolCallArgsEvent bridge)
- ToolCallTile: upgraded to use ArgsBlock for styled markdown rendering
- ArgsBlock: shared widget converting JSON to readable markdown with platform
  monospace font (SF Mono on Apple), scrollable, with copy button

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(execution-ui): replace markdown renderer in ArgsBlock, hide LLM-internal activity rows

- ArgsBlock: swap FlutterMarkdownPlusRenderer for SelectableText + _prettyPrint/_renderMap
  eliminating nested CodeBlockBuilder container and fixing newline escaping
- ActivityLog: filter to skill_tool_call only; hide skill_tool_result rows
  (error tracebacks, JSON arrays are LLM-internal, not user-facing)
- ActivityLog: reject empty-Map args (list_environments '{}') to avoid blank rows

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(execution-ui): remove Thinking steps, drop ExecutionThinkingBlock, compact activity rows

Changes A-D from UX analysis:
- ActivityIndicator: "Calling tools..." (no numeric count) to avoid mismatch with step log
- ExecutionTracker: ThinkingStarted no longer adds a step; remove thinkingBlocks/
  isThinkingStreaming signals (LLM reasoning is internal, not user-facing)
- StepType enum removed; ExecutionStep simplified (no type field)
- ExecutionThinkingBlock removed from LoadingMessageTile and TextMessageTile;
  static _ThinkingBlock for persisted message thinkingText is retained
- thinking_block.dart deleted
- ActivityLog: compact inline SelectableText instead of full ArgsBlock container;
  row padding tightened to vertical: 2
- args_block.dart: prettyPrintArgs/renderMap promoted to top-level for reuse
- Tests updated: ThinkingStarted no longer creates a step, thinking block tests removed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
runyaga and others added 3 commits April 18, 2026 05:34
…ly (#7)

* refactor(agent): ScriptEnvironment implements SessionExtension directly

- ScriptEnvironment now implements SessionExtension, eliminating the
  ScriptEnvironmentExtension adapter class.
- SessionExtension.onDispose() renamed to dispose() for Dart convention.
- SessionContext added (serverId + roomId) passed through extension
  factory so environments can customize per room.
- ScriptEnvironmentFactory now takes SessionContext.
- toOwnedFactory / toSharedFactory replace wrapScriptEnvironmentFactory.
- SharedScriptEnvironmentProxy replaces ScriptEnvironmentExtension.
- ScriptingState enum added for reactive interpreter lifecycle.
- soliplex_agent exports ScriptingState and SessionContext.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: dart format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: add trailing commas to typedef params (linter)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
#9)

* feat(m2b): add soliplex_monty_plugin — Python scripting via dart_monty

Adds the soliplex_monty_plugin package, which connects the dart_monty
Python runtime to the Soliplex agent platform.

## What's included

**MontyScriptEnvironment** — ScriptEnvironment backed by dart_monty's
  AgentSession. Registers SoliplexTools as HostFunctions, projects them
  as ClientTools for the LLM, runs Python in a background isolate/worker.

**SoliplexTool** — flat data struct unifying Python-callable and
  LLM-callable tool definitions (name, description, parameters, handler).

**SoliplexConnection / buildSoliplexTools** — full Soliplex API surface
  callable from Python: list_servers, list_rooms, get_room, get_documents,
  get_chunk, new_thread, reply_thread, list_threads, upload_file,
  upload_to_thread.

**toOwnedFactory / toSharedFactory** — two ownership modes:
  fire-and-forget (isolated Python per session) and stateful (shared
  interpreter across sessions).
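
The difference between the two ownership modes comes down to who calls dispose. The sketch below models that in Python with hypothetical names (`Env`, `to_owned_factory`, `to_shared_factory`); it mirrors the description above, not the Dart signatures.

```python
class Env:
    """Stand-in for a script environment with interpreter state."""

    def __init__(self):
        self.disposed = False
        self.variables = {}

    def dispose(self):
        self.disposed = True


class _SharedProxy:
    """Forwards to a caller-owned env but never disposes it."""

    def __init__(self, env):
        self.env = env

    def dispose(self):
        pass  # ownership stays with the caller


def to_owned_factory(create_env):
    # Each session gets, and later disposes, its own fresh environment.
    return lambda: create_env()


def to_shared_factory(env):
    # Every session sees the same environment through a non-owning proxy.
    return lambda: _SharedProxy(env)


owned = to_owned_factory(Env)
first, second = owned(), owned()
first.variables["x"] = 1
first.dispose()

shared_env = Env()
shared = to_shared_factory(shared_env)
proxy_a, proxy_b = shared(), shared()
proxy_a.env.variables["x"] = 1
proxy_a.dispose()
```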

**Integration tests** — agent_session_test, monty_env_chat_test (T0–T7),
  monty_script_environment_test.

## What's not included

HITL approval gate (requiresApproval) — deferred to M3. SoliplexTool
and ClientTool have no requiresApproval field in this slice; it will
be added when feat/hitl-tool-approval is landed.

## soliplex_agent changes

- Export ToolExecutionContext from public API (needed by plugin).
- Remove redundant direct imports in test helpers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore(monty-plugin): remove coverage artifacts, add .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(monty-plugin): migrate os: OsProvider? → OsCallHandler?

Follows dart_monty#335 which replaced the OsProvider class hierarchy
with the OsCallHandler typedef from dart_monty_core. Parameter is a
pass-through to dm.AgentSession(os:).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(m2b): wire MontyScriptEnvironment into Flutter app

- AgentRuntimeManager accepts extensionFactoryBuilder so each runtime
  can receive a per-server SessionExtensionFactory
- standard.dart creates a RoomEnvironmentRegistry and wires
  toRoomSharedFactory + MontyScriptEnvironment with all SoliplexTools;
  adds debug logging sink and startup probe (fire-and-forget)
- Add get_device_info_tool and get_clipboard_tool client tools;
  confirm_action_tool deferred to M3 (HITL)
- Fix OsCallHandler → OsProvider to match current dart_monty API

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(monty-plugin): use git dep for dart_monty; OsCallHandler matches origin/main

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: trigger fresh CI run

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* ci: fix stale pub cache causing package_graph.json parse failure

Remove the restore-keys fallback from the pub cache step so CI never
restores an old cache from a different pubspec.lock state. Add
`rm -rf .dart_tool` before pub get to prevent any stale
package_graph.json from interfering with dependency resolution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(deps): upgrade dart_monty_core to remove broken flutter assets

Picks up c3d7a06 from runyaga/dart_monty_core which removes the
flutter: assets section (dart_monty_bridge.js etc. are WASM build
artifacts not committed to the repo, causing flutter test/build to fail).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve conflicts with PR #9 (feat/m2b):
- soliplex_monty_plugin: take main's buildSoliplexTools/SoliplexTool API
- script_environment.dart: take main's SessionContext factory signature
- standard.dart: take main's SoliplexTools wiring, remove ConfirmActionTool
- test helpers: drop .readonly() on signal, update extensionFactory signatures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@runyaga runyaga closed this Apr 21, 2026