feat(monty): run_script/repl_python, HITL approval, multi-server wiring #115
Closed
runyaga wants to merge 33 commits into soliplex:main
…x API

New package `packages/fe_plugin_soliplex` exposes Soliplex server operations as host functions callable from sandboxed Python via dart_monty's plugin system.

Host functions: soliplex_list_servers, soliplex_list_rooms, soliplex_get_room, soliplex_get_documents, soliplex_get_chunk, soliplex_list_threads, soliplex_create_thread, soliplex_delete_thread, soliplex_converse (stub), soliplex_upload_file, soliplex_upload_to_thread, soliplex_get_mcp_token.

Multi-server support: each function accepts an optional `server` parameter. Default room and server are configurable at construction.

TODO: Wire soliplex_converse with AgUiStreamClient for full AG-UI conversation flow (SSE streaming, client-side tool calling, state pass-through).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rver API

Replace stub soliplex_converse with real AG-UI SSE streaming via new_thread/reply_thread. All functions now require explicit server and room_id; there are no defaults.

- Add SoliplexConnection adapter (avoids soliplex_agent dependency)
- 11 host functions: list_servers, list_rooms, get_room, get_documents, get_chunk, list_threads, new_thread, reply_thread, upload_file, upload_to_thread, get_mcp_token
- Internal _ThreadState tracks message history and AG-UI state per thread
- 23 unit tests with mocked API/SSE streams
- Integration tests against demo.toughserv.com + localhost:8000 (multi-server simultaneous connections verified)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests prove the full pipeline: Python → AgentSession FFI bridge → SoliplexPlugin host functions → live Soliplex SSE streaming.

Working tests (sandbox: true):
- list_servers, list_rooms, get_room from Python
- Single SSE new_thread conversation

Known limitation: the FFI native library has global state that is corrupted after async I/O host functions. A second execute() SEGFAULTs regardless of sandbox mode. See dart_monty#271.

Multi-turn and bwrap codegen tests are written but blocked by this FFI issue. A WASM backend or a Rust fix is needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Session

Single long-lived AgentSession with SoliplexPlugin. Results:
- Discovery: list_servers, list_rooms (both servers), get_room
- SSE streaming: new_thread on demo cooking room
- Multi-turn: 3-turn bruschetta conversation via reply_thread (thread_id persists across execute() calls)
- Cross-server: new_thread on local chat room
- bwrap codegen → extract → execute

8/10 tests pass. The full pipeline is proven: Python → AgentSession → SoliplexPlugin → SSE streaming → response → state persistence → reply_thread with history → multi-turn works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both upload tests pass when run in isolation:
- upload_file: agent-test.txt → bwrap_sandbox room on local
- upload_to_thread: thread-notes.txt → thread on local chat room

The full suite hits an intermittent Rust crash on the 4th execute() call (same "no active frame" / SEGFAULT as #271). Tests 1-3 and the uploads pass reliably. The crash is in the monty crate's VM recompilation path, not in the plugin or upload code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-server pipeline tests using AgentSession + SoliplexPlugin with fixed dart_monty (NativeFinalizer race fix):
- Cross-server discovery: both servers, rooms, skills
- Demo recipe → upload to local bwrap_sandbox room
- 3-turn pad thai conversation on demo → cross-server summary on local
- Pancake recipe: demo → upload → local comments
- bwrap codegen with monty rules (LLM formatting inconsistent)

State persists across execute() calls: thread_id, recipe text, and conversation responses all survive for cross-server handoff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test 8 proves the complete pipeline:
1. Create thread on bwrap_sandbox
2. Upload monty-rules.md with full API reference
3. Ask agent to read file and generate code
4. Agent generates valid monty code using host functions
5. Extract code from the ```monty``` block
6. Execute: code calls list_servers, list_rooms, get_room
7. Returns skills map across BOTH demo + local servers

The generated code correctly uses json.loads() on all host function returns, iterates servers, finds rooms with skills, and returns structured data. Zero human intervention after the prompt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
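A minimal sketch of the kind of script described in this commit, assuming host functions return JSON strings that must be decoded with `json.loads()`. The three functions below are hypothetical stand-ins defined in the snippet itself (in the sandbox they would be supplied by the host), and the `id` and `skills` field names and the call signatures are assumptions for illustration, not the actual Soliplex payload shape.

```python
import json

# Hypothetical stand-ins for the host functions. In the sandbox these would be
# injected by the host; return shapes and field names here are assumptions.
_FAKE = {
    "demo": {"cooking": {"skills": ["rag"]}, "chat": {"skills": []}},
    "local": {"bwrap_sandbox": {"skills": ["bwrap"]}},
}


def list_servers():
    return json.dumps(sorted(_FAKE))


def list_rooms(server):
    return json.dumps([{"id": room_id} for room_id in sorted(_FAKE[server])])


def get_room(server, room_id):
    return json.dumps({"id": room_id, **_FAKE[server][room_id]})


def skills_by_room():
    """Map "server/room" to its skills, skipping rooms that have none."""
    result = {}
    # Every host call returns a JSON string, so decode before iterating.
    for server in json.loads(list_servers()):
        for room in json.loads(list_rooms(server)):
            detail = json.loads(get_room(server, room["id"]))
            if detail.get("skills"):
                result[server + "/" + room["id"]] = detail["skills"]
    return result
```

Calling `skills_by_room()` walks every server and room and returns one flat dict, which is the "skills map across both servers" shape the test expects.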
Advanced scenarios:
- Upload full monty ruleset with all plugins (soliplex, template, msgbus, fs)
- Codegen: data pipeline with caching + templates
- Codegen: cross-server intelligence gathering
- Codegen: recipe → file → template report card
- Codegen: orchestrate conversations across servers

debug_null_return.dart proves all SSE calls work in dart run: 3 sequential SSE calls, state persistence, all return non-null. The null returns in dart test are a test-runner zone issue, not a code bug. Production (dart run) works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SSE event flow is correct: RunStarted → ThinkingStart → ThinkingContent → TextMessageStart → TextMessageContent → TextMessageEnd → RunFinished. All events arrive, and content is accumulated properly.

The null returns in tests are caused by transient HTTP 500 from the bwrap_sandbox server when creating threads rapidly after prior SSE streams. The ApiException propagates through Python's state wrapping try/except, silently leaving variables undefined → null.

This is not an SSE or plugin bug. It is server-side resource management on bwrap_sandbox with bubblewrap sandboxes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upload real experiment files from ioi-experiments to bwrap_sandbox; the agent generates monty code to solve construction scheduling problems.

Test 2 (baseline): Agent generated 2716 chars of scheduling code that runs in monty. It creates a blackboard dict, tracks jobs/deps/weather/workers, and produces a day-by-day schedule. Code executes end-to-end.

Test 4 (disruption): Agent generated code but used import os (not available in monty sandbox). The ruleset needs to cover monty stdlib limitations.

Pipeline: upload experiment files → agent reads files → generates monty code → extract from code block → execute in sandbox → result.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rerun with updated prompt rules:
1. Baseline: Executes but WRONG. Marks jobs done mid-iteration, assigns H1_FRM same day as H1_FND (dep not actually satisfied yet)
2. Optimal: Same bug (copies baseline logic)
3. Disruption: Used open() despite rules; needs stronger guidance
4. Infeasible: CORRECT ✅ (clean f-strings, 9 < 15 = infeasible)

The baseline/optimal bug is a real algorithmic error: completing jobs inside the same loop pass where deps are checked. Need to collect assignments first, then mark complete after the day loop.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
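The fix described in this commit (collect a day's assignments first, mark them complete only after the day is fully assigned) can be sketched as below. This is an illustrative greedy scheduler, not the generated code from the experiment; the function name, the `capacity_per_day` parameter, and the `max_days` guard are assumptions.

```python
def schedule(jobs, deps, capacity_per_day, max_days=30):
    """Greedy day-by-day schedule.

    jobs: job names in priority order
    deps: dict mapping a job to the list of jobs it depends on
    capacity_per_day: maximum number of jobs that can run on one day
    """
    done = set()
    plan = {}
    day = 1
    while len(done) < len(jobs) and day <= max_days:
        assigned = []
        for job in jobs:
            if job in done or len(assigned) >= capacity_per_day:
                continue
            # Only jobs finished on a PREVIOUS day satisfy a dependency,
            # because `done` is not touched while the day is being filled.
            if all(dep in done for dep in deps.get(job, [])):
                assigned.append(job)
        plan[day] = assigned
        # Mark complete only after the whole day has been assigned.
        done.update(assigned)
        day += 1
    return plan
```

With `deps = {"H1_FRM": ["H1_FND"]}`, H1_FRM can no longer land on the same day as H1_FND, which is the failure the rerun exposed.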
With sandbox filesystem + updated prompts, all experiments produce correct results:
1. Baseline: Day1=rain, Day2=H1_FND, Day3=H1_FRM+H2_FND, Day4=H1_ROF+H2_FRM ✅
2. Optimal: Same schedule (already optimal) ✅
3. Disruption: Alice sick day 2; H1_FND done by Bob, Alice back day 3 ✅
   Generated code correctly: collects assignments first, marks done after
4. Infeasible: 9 slots < 15 jobs = infeasible ✅

Files are now dual-written: server thread (bwrap reads) + sandbox filesystem (generated code reads with Path().read_text()).

Monty limitations discovered and documented in prompt rules:
- No := walrus, no open(), no % format, no enumerate(start=)
- No chained assignment, no dict dot access
- bb_dump not a real function (from experiment spec)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
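Several of the limitations listed above are syntactic, so they could in principle be caught before execution. The following is a rough sketch of such a pre-check using Python's standard `ast` module; it is not part of this PR, the rule names are made up for illustration, and it only catches direct uses (an aliased `open`, for example, would slip through).

```python
import ast


def find_violations(source):
    """Return a sorted list of rule names that the source breaks."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.NamedExpr):
            found.add("walrus")
        elif isinstance(node, ast.Assign) and len(node.targets) > 1:
            # `a = b = 1` parses as one Assign with two targets.
            found.add("chained-assignment")
        elif isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mod):
            # Treat `%` with a string literal on the left as % formatting.
            left = node.left
            if isinstance(left, ast.Constant) and isinstance(left.value, str):
                found.add("percent-format")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id == "open":
                found.add("open")
            elif node.func.id == "enumerate" and any(
                keyword.arg == "start" for keyword in node.keywords
            ):
                found.add("enumerate-start")
    return sorted(found)
```

A non-empty result could be fed back to the model as a correction prompt instead of running the code and waiting for a sandbox traceback.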
Runs each wave5 experiment on its own fresh thread/session and prints the complete generated code for inspection.

Findings from run:
- Baseline: LLM generated pseudocode with := and custom syntax (not valid Python). The model doesn't reliably follow rules.
- Optimal: Used set literals, match expressions, ⊆ operator; not Python at all
- Disruption/Infeasible: Server overloaded from too many threads

The local Ollama model (gpt-oss) is inconsistent: sometimes it generates valid Python, sometimes pseudocode. The prompt rules help but don't guarantee compliance. Need one of:
1. Better model (GPT-4o on demo.toughserv.com is more reliable)
2. Validation + error correction loop
3. Stronger prompt constraints

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
show_generated_demo.dart uses demo.toughserv.com (GPT-4o):
- All 4 experiments generate code (1254-1554 chars each)
- Full monty tracebacks with line numbers shown on errors
- GPT-4o generates valid Python syntax (unlike Ollama's pseudocode)
- But: uses msg_send/bb_dump/locals() despite rules saying otherwise
- Dep checking logic wrong in baseline/optimal (checks within-day)

Code analysis per experiment:
1. Baseline: syntax OK, logic bug (deps checked in same day dict)
2. Optimal: syntax OK, logic bug (schedule keys are day numbers)
3. Disruption: defaultdict import crashes monty
4. Infeasible: locals() not available, wrong approach (tries scheduling)

Next: strengthen prompt rules to forbid unlisted functions, add error correction loop to fix generated code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
qwen_experiments.dart: compare 8B vs 35B on 4 tasks (fibonacci, discovery, scheduling, pipeline)
qwen_room_chat.dart: 8B asks questions → 35B answers → 5 rounds

Results:
- 35B explains Python decorators correctly
- 8B generates follow-up question about decorator parameters
- 35B analyzes 8B's response
- Server 500s after ~3 rapid thread creations (server resource limit)

Qwen rooms configured with RAG skill, file tools, attachments:
- qwen_8b: spark-3b12:8002, Qwen3-8B-FP8
- qwen_vllm: spark-3b12:8000, Qwen3.5-35B-A3B-FP8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…spose() leak

Rename package directory and pubspec name from fe_plugin_soliplex to soliplex_monty_plugin to align with soliplex_client/soliplex_agent naming.

Fix SoliplexPlugin.onDispose(), which was a no-op: HTTP connections from all registered SoliplexConnection instances were never closed.

Closes runyaga/soliplex-audit#3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Delete 6 debug/experiment scripts that hard-coded /Users/runyaga/dev/... paths
- Replace hardcoded demo.toughserv.com with SOLIPLEX_DEMO_URL env var in all test files; SOLIPLEX_LOCAL_URL env var added for local URL (default localhost:8000)
- Fix fe_plugin_soliplex → soliplex_monty_plugin in all imports (lib + tests)
- wave5 file-reading tests skip gracefully when IOI_EXPERIMENTS_DIR / MONTY_DOCS_DIR env vars are unset

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MCP connectivity is a Flutter-layer concern: Python scripts receive pre-authenticated handles and should not fetch raw tokens themselves.

- Remove _getMcpToken getter and HostFunction from SoliplexPlugin
- Remove MCP section from systemPromptContext
- Update functions count assertion 11→10
- Add onDispose test (100% coverage on lib/)
- Add no-TextMessageStartEvent edge case test
- Fix relative import → package: import in soliplex_plugin.dart
- Fix import ordering in test file
- Format integration test files (pre-existing style debt)

Gates: format ✓ analyzer ✓ coverage 100% ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…by dart_monty

Implements M2 milestone: MontyScriptEnvironment wraps a dart_monty.AgentSession and exposes execute_python as a ClientTool with reactive ScriptingState signal.

Changes:
- soliplex_agent: add ScriptingState enum and onAttach/scriptingState to ScriptEnvironment interface; export ToolExecutionContext
- soliplex_monty_plugin: MontyScriptEnvironment (lib/src/), unit tests (test/src/), FFI + WASM integration tests (test/integration/)
- WASM test infra: dart_test.yaml, custom HTML template, bridge/worker JS committed to lib/wasm_assets/ (wasm binary gitignored, build separately)

Gates: format ✓ analyzer ✓ coverage 100% ✓ integration/ffi ✓ integration/wasm ✓

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MontyScriptEnvironment no longer uses SoliplexPlugin as a MontyPlugin. Soliplex operations are now registered directly as dm.HostFunction on the AgentSession bridge, and the bridge's schema registry is projected to ClientTools visible to the server-side LLM.

Key changes:
- Register soliplex_list_servers, soliplex_list_rooms, soliplex_list_threads, soliplex_new_thread, soliplex_reply_thread directly on the bridge via _register(); no plugin system involved
- _projectToClientTool() converts HostFunctionSchema.toJsonSchema() to Tool.parameters and routes ClientTool executor directly to the Dart handler (no Python hop)
- _tools built lazily from session.schemas (filtered) + execute_python
- SoliplexConnection.fromServerConnection() factory for clean wiring
- Add soliplex_logging dev_dependency for LoggerFactory extension
- Add integration tests: T0 (secret_number callback proof), T1 (Soliplex tools visible), T2 (execute_python), T3 (state persistence), T4 (signal)
- Add tool/test_integration_ffi.sh and tool/test_integration_wasm.sh
- Add tool/chat_probe.dart for manual inspection

All 5 tests pass on FFI and WASM/Chrome.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Mutex to serialise concurrent dm.AgentSession.execute() calls so concurrent Python tool invocations on a shared interpreter cannot stomp each other's variable state
- Add wrapSharedScriptEnvironment() factory to soliplex_agent: wraps a caller-owned ScriptEnvironment without taking dispose ownership, making the shared-env pattern explicit and safe
- Update stateful test group to use wrapSharedScriptEnvironment instead of wrapScriptEnvironmentFactory so the lifecycle contract is unambiguous
- T5: regression guard proving dart_monty Isolate/Worker is non-blocking (43 FFI / 35 WASM heartbeats confirm event loop stays free during Python)
- T7: proves fire-and-forget sessions have isolated Python state (fresh dm.AgentSession per spawn = fresh interpreter = no variable leakage)
- Fix pre-existing ScriptEnvironment test fakes missing onAttach() / scriptingState; remove redundant internal imports in agent test helpers

All 7 integration tests pass on both FFI and WASM (Chrome).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Failing tests written first for each gap; implementation then made them pass.

**execution timeout** (`_executionTimeout` field, default 30 s / 2 s in tests)
- `Future.timeout()` wraps `_montySession.execute()` inside the mutex; throws `TimeoutException`, releases the mutex cleanly.
- `forTest` accepts `executionTimeout:` so timeout tests run at 500 ms without waiting 30 s.

**dispose drain** (replace `unawaited(_montySession.dispose())`)
- `dispose()` now queues `_montySession.dispose()` via `_executeMutex.protect(...)`, guaranteeing the Python interpreter is only destroyed after any in-flight `execute()` releases the mutex.
- Dispose-verify test updated to pump the event loop before verify.

**in-mutex `_disposed` re-check**
- Calls that entered `_executePython` before `dispose()` but are still waiting at the mutex now throw `StateError` after acquiring it, instead of calling the already-destroyed session.

**new unit tests** (19 added, 31 total, 0 warnings):
- `timeout` (3): TimeoutException, idle restored, mutex released after timeout
- `concurrency` (3): serialisation order, exception isolation, signal cycling
- `dispose safety` (2): drain before session.dispose, queued callers rejected
- `isolation` (1): deterministic replacement for weak LLM-mediated T7
- `corner cases` (3): large result, missing code key, mid-flight cancel docs

**pre-existing fix**: stub `mockSession.schemas → []` in setUp so the `late final _tools` initialiser does not throw on first access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
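The three behaviours in this commit (timeout inside the lock, dispose that drains in-flight work, and a disposed re-check after acquiring the lock) can be illustrated with a Python `asyncio` analog. This is only a sketch of the pattern, not the Dart implementation: `ScriptRunner`, its fields, and the use of `RuntimeError` in place of Dart's `StateError` are all assumptions made for the example.

```python
import asyncio


class ScriptRunner:
    """Python analog of the execute path: lock, disposed re-check, timeout."""

    def __init__(self, execute, timeout=30.0):
        self._execute = execute      # async callable standing in for the session
        self._timeout = timeout
        self._lock = asyncio.Lock()
        self._disposed = False
        self.session_closed = False

    async def run(self, code):
        if self._disposed:
            raise RuntimeError("disposed")
        async with self._lock:
            # Re-check: dispose() may have been requested while we waited.
            if self._disposed:
                raise RuntimeError("disposed")
            # On timeout, leaving the `async with` block releases the lock.
            return await asyncio.wait_for(self._execute(code), self._timeout)

    async def dispose(self):
        self._disposed = True
        # Drain: queue behind any in-flight run before closing the session.
        async with self._lock:
            self.session_closed = True
```

Because `asyncio.Lock` wakes waiters in FIFO order, a caller that was already queued when `dispose()` ran acquires the lock first, sees `_disposed`, and is rejected before the session is closed.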
- Add dcm_options.yaml with ~80 rules adapted from dart_monty (internal
path exclusions stripped; only test/** and *.g.dart kept)
- Wire both linters into analysis_options.yaml via include directives
- monty_script_environment.dart: dynamic→Object?, late final→nullable+??=,
async {}→Future.value(), non-null assertion → local var, six
newline-before-return, _stateSignal cascade dispose, dispose-class-fields
exclusion (tearoff through unawaited(protect()) is not DCM-traceable)
- soliplex_plugin.dart: move @override methods before private helpers to
satisfy member-ordering (DCM classifies them as public-methods)
dart format, dart analyze --fatal-infos, dcm analyze lib: zero issues.
31/31 unit tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace AgentUiDelegate (callback-based, single-tool) with a signal-driven Human-in-the-Loop gate that scales to concurrent tool calls and integrates cleanly with the signals reactive layer.

## What changed

**ClientTool API** (`tool_registry.dart`):
- `requiresApproval: bool` - when true, AgentSession suspends execution before the tool executor runs and emits a PendingApprovalRequest; the UI must call approveToolCall or denyToolCall to resume.
- `platformConsentNote: String? Function()?` - optional callback for tools that trigger an OS-level permission dialog (e.g. clipboard read on web). Returns a human-readable note; AgentSession emits PlatformConsentNotice (non-blocking, informational).

**New types**:
- `PendingApprovalRequest` - immutable data class (toolCallId, toolName, arguments) emitted on pendingApproval signal.
- `PlatformConsentNotice` / `AwaitingApproval` - ExecutionEvent subclasses for consent/approval lifecycle.

**AgentSession** (`agent_session.dart`):
- `pendingApproval: ReadonlySignal<PendingApprovalRequest?>` - UI watches this to render Allow/Deny UI.
- `approveToolCall(String) / denyToolCall(String)` - resolves the Completer gating the suspended tool call.
- `_awaitApproval()` - internal gate; stores Completer per toolCallId in _pendingApprovals, signals UI, awaits resolution. Auto-denies on session cancel.

**Deleted**:
- `AgentUiDelegate` (58 lines) + its 457-line test file. Replaced entirely by the signal approach above.

**session_extension.dart**: `onDispose()` → `dispose()` rename for consistency with Dart disposal conventions.

**Tests** (`hitl_test.dart`): 12 tests covering requiresApproval defaults, three approval tiers (agent-gate / OS-gate / ungated), PendingApprovalRequest fields, platformConsentNote callbacks, and PlatformConsentNotice equality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
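The gate described here (one Completer per toolCallId, resolved by approve or deny, auto-denied on cancel) maps onto a per-call future. The sketch below is a Python `asyncio` analog for illustration only; `ApprovalGate`, `call_tool`, and their method names are hypothetical and do not mirror the Dart API in `agent_session.dart`.

```python
import asyncio


class ApprovalGate:
    """One pending future per tool call id; the UI side resolves it."""

    def __init__(self):
        self._pending = {}

    async def await_approval(self, tool_call_id):
        future = asyncio.get_running_loop().create_future()
        self._pending[tool_call_id] = future
        try:
            return await future
        finally:
            self._pending.pop(tool_call_id, None)

    def pending_ids(self):
        return list(self._pending)

    def approve(self, tool_call_id):
        self._resolve(tool_call_id, True)

    def deny(self, tool_call_id):
        self._resolve(tool_call_id, False)

    def cancel_all(self):
        # Session cancel: auto-deny everything that is still waiting.
        for tool_call_id in list(self._pending):
            self._resolve(tool_call_id, False)

    def _resolve(self, tool_call_id, approved):
        future = self._pending.get(tool_call_id)
        if future is not None and not future.done():
            future.set_result(approved)


async def call_tool(gate, tool_call_id, executor, requires_approval=True):
    """Run executor() only if the call is ungated or was approved."""
    if requires_approval and not await gate.await_approval(tool_call_id):
        return "denied"
    return executor()
```

Keying the futures by call id is what lets several gated calls wait at once, which a single callback-style delegate could not do.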
- ScriptEnvironment now implements SessionExtension directly instead of using a separate ScriptEnvironmentExtension adapter class.
- Deleted ScriptEnvironmentExtension (no remaining references).
- wrapScriptEnvironmentFactory → toOwnedFactory
- wrapSharedScriptEnvironment → toSharedFactory
- _SharedScriptEnvironmentExtension → _SharedScriptEnvironmentProxy
- onDispose() → dispose() in the proxy (matches SessionExtension rename)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Wire AgentSession.pendingApproval through ThreadViewState into a self-contained ToolApprovalSlot widget so only the slot rebuilds on approval changes, not the entire thread body column.

**New widgets** (`ui/tool_approval_banner.dart`):
- ToolApprovalSlot - owns .watch(context); renders nothing when null.
- ToolApprovalBanner - tool name header, scrollable code preview (max 160 px), Allow (FilledButton) / Deny (TextButton).

**ThreadViewState**: mirrors session.pendingApproval to its own _pendingApproval signal; subscribes in _attachSession, unsubscribes and resets to null in _detachSession.

**Example tools** (`modules/tools/`):
- get_device_info - ungated (requiresApproval: false, no consent).
- confirm_action - agent-gated (requiresApproval: true); shows action argument in the approval banner preview.
- get_clipboard - platform-gated via platformConsentNote; emits PlatformConsentNotice on web (browser clipboard permission), silent on native.

**standard.dart**: registers all three example tools; wires MontyScriptEnvironment via extensionFactoryBuilder; startup probe.

**macOS / web**: CocoaPods xcconfig includes for shared_preferences re-linking after flutter clean; dart_monty WASM bridge assets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
MontyScriptEnvironment rewritten to accept a list of MontyPlugins instead of hard-wiring Soliplex connections. Host functions are registered on the dart_monty bridge and projected as direct ClientTools (no Python hop).

execute_python now has requiresApproval: true. The HITL gate suspends execution until the user allows or denies from the approval banner.

probe() validates the interpreter on startup by running `1 + 1`.

Regression test: error messages must not leak Rust interpreter internals (NodeIndex, ExprSubscript, node_index:). Currently failing pending runyaga/monty subscript tuple-unpack fix.

Defensive MissingPluginException guard in DefaultBackendUrlStorage for flutter clean / CocoaPods re-link scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split execute_python into run_script (one-shot) and repl_python (persistent REPL) with differentiated descriptions so the LLM picks the right tool
- Return Python errors as tool output (status:completed) instead of throwing; includes any print() output that occurred before the error
- Return 'None' for in-place ops (arr.sort()) so LLM knows execution succeeded
- denyToolCall cancels the AgentSession to prevent LLM retry loops
- Serialize approval-required tools in _executeAll to prevent concurrent approval banner deadlock
- Wire SoliplexPlugin with all active ServerManager connections (not just current room's server)
- ThreadKey (serverId, roomId, threadId) used as _threadStates map key
- SoliplexConnection gains alias + serverUrl; _listServers returns full metadata
- onDispose no longer closes injected connections (owned by ServerManager)
- Remove unsupported help() from systemPromptContext
- Copy button on tool call tile code/result blocks
- Bold labels + monospace container in ToolCallTile

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
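The "errors as tool output" behaviour in the second and third bullets can be sketched in plain Python. This is an illustration of the idea only: it uses the built-in `exec`, which is not a sandbox and is not how dart_monty runs code, and the exact output format is an assumption.

```python
import contextlib
import io


def run_script(code):
    """Execute code and always return a text result instead of raising."""
    stdout = io.StringIO()
    try:
        with contextlib.redirect_stdout(stdout):
            exec(code, {})  # illustration only; exec() is NOT a sandbox
    except Exception as exc:
        # Keep whatever was printed before the failure, then the error line.
        return stdout.getvalue() + type(exc).__name__ + ": " + str(exc)
    # No output (e.g. an in-place arr.sort()) still yields a visible result.
    return stdout.getvalue() or "None"
```

Returning the error as ordinary output lets the model read the message alongside any earlier `print()` lines and decide how to fix the script, rather than seeing a failed tool call.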
…act args
* feat: activity log, state panel, and expandable tool call args
Surfaces previously invisible AG-UI events in the execution UI:
- ActivityLog: collapsible sub-agent call/result log with markdown rendering
- StatePanel: info-chiclet toggle showing aguiState JSON with copy support
- StepLog: expandable args per tool call step (via ToolCallArgsEvent bridge)
- ToolCallTile: upgraded to use ArgsBlock for styled markdown rendering
- ArgsBlock: shared widget converting JSON to readable markdown with platform
monospace font (SF Mono on Apple), scrollable, with copy button
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(execution-ui): replace markdown renderer in ArgsBlock, hide LLM-internal activity rows
- ArgsBlock: swap FlutterMarkdownPlusRenderer for SelectableText + _prettyPrint/_renderMap
eliminating nested CodeBlockBuilder container and fixing newline escaping
- ActivityLog: filter to skill_tool_call only; hide skill_tool_result rows
(error tracebacks, JSON arrays are LLM-internal, not user-facing)
- ActivityLog: reject empty-Map args (list_environments '{}') to avoid blank rows
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(execution-ui): remove Thinking steps, drop ExecutionThinkingBlock, compact activity rows
Changes A-D from UX analysis:
- ActivityIndicator: "Calling tools..." (no numeric count) to avoid mismatch with step log
- ExecutionTracker: ThinkingStarted no longer adds a step; remove thinkingBlocks/
isThinkingStreaming signals (LLM reasoning is internal, not user-facing)
- StepType enum removed; ExecutionStep simplified (no type field)
- ExecutionThinkingBlock removed from LoadingMessageTile and TextMessageTile;
static _ThinkingBlock for persisted message thinkingText is retained
- thinking_block.dart deleted
- ActivityLog: compact inline SelectableText instead of full ArgsBlock container;
row padding tightened to vertical: 2
- args_block.dart: prettyPrintArgs/renderMap promoted to top-level for reuse
- Tests updated: ThinkingStarted no longer creates a step, thinking block tests removed
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…ly (#7)
* refactor(agent): ScriptEnvironment implements SessionExtension directly
- ScriptEnvironment now implements SessionExtension, eliminating the ScriptEnvironmentExtension adapter class.
- SessionExtension.onDispose() renamed to dispose() for Dart convention.
- SessionContext added (serverId + roomId) passed through extension factory so environments can customize per room.
- ScriptEnvironmentFactory now takes SessionContext.
- toOwnedFactory / toSharedFactory replace wrapScriptEnvironmentFactory.
- SharedScriptEnvironmentProxy replaces ScriptEnvironmentExtension.
- ScriptingState enum added for reactive interpreter lifecycle.
- soliplex_agent exports ScriptingState and SessionContext.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: dart format
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* style: add trailing commas to typedef params (linter)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
#9)
* feat(m2b): add soliplex_monty_plugin — Python scripting via dart_monty
Adds the soliplex_monty_plugin package, which connects the dart_monty Python runtime to the Soliplex agent platform.

## What's included

**MontyScriptEnvironment** - ScriptEnvironment backed by dart_monty's AgentSession. Registers SoliplexTools as HostFunctions, projects them as ClientTools for the LLM, runs Python in a background isolate/worker.

**SoliplexTool** - flat data struct unifying Python-callable and LLM-callable tool definitions (name, description, parameters, handler).

**SoliplexConnection / buildSoliplexTools** - full Soliplex API surface callable from Python: list_servers, list_rooms, get_room, get_documents, get_chunk, new_thread, reply_thread, list_threads, upload_file, upload_to_thread.

**toOwnedFactory / toSharedFactory** - two ownership modes: fire-and-forget (isolated Python per session) and stateful (shared interpreter across sessions).

**Integration tests** - agent_session_test, monty_env_chat_test (T0–T7), monty_script_environment_test.

## What's not included

HITL approval gate (requiresApproval) - deferred to M3. SoliplexTool and ClientTool have no requiresApproval field in this slice; it will be added when feat/hitl-tool-approval is landed.

## soliplex_agent changes

- Export ToolExecutionContext from public API (needed by plugin).
- Remove redundant direct imports in test helpers.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore(monty-plugin): remove coverage artifacts, add .gitignore
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(monty-plugin): migrate os: OsProvider? → OsCallHandler?
Follows dart_monty#335 which replaced the OsProvider class hierarchy with the OsCallHandler typedef from dart_monty_core. Parameter is a pass-through to dm.AgentSession(os:).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(m2b): wire MontyScriptEnvironment into Flutter app
- AgentRuntimeManager accepts extensionFactoryBuilder so each runtime can receive a per-server SessionExtensionFactory
- standard.dart creates a RoomEnvironmentRegistry and wires toRoomSharedFactory + MontyScriptEnvironment with all SoliplexTools; adds debug logging sink and startup probe (fire-and-forget)
- Add get_device_info_tool and get_clipboard_tool client tools; confirm_action_tool deferred to M3 (HITL)
- Fix OsCallHandler → OsProvider to match current dart_monty API
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(monty-plugin): use git dep for dart_monty; OsCallHandler matches origin/main
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ci: trigger fresh CI run
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ci: fix stale pub cache causing package_graph.json parse failure
Remove the restore-keys fallback from the pub cache step so CI never restores an old cache from a different pubspec.lock state. Add `rm -rf .dart_tool` before pub get to prevent any stale package_graph.json from interfering with dependency resolution.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(deps): upgrade dart_monty_core to remove broken flutter assets
Picks up c3d7a06 from runyaga/dart_monty_core which removes the flutter: assets section (dart_monty_bridge.js etc. are WASM build artifacts not committed to the repo, causing flutter test/build to fail).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolve conflicts with PR #9 (feat/m2b):
- soliplex_monty_plugin: take main's buildSoliplexTools/SoliplexTool API
- script_environment.dart: take main's SessionContext factory signature
- standard.dart: take main's SoliplexTools wiring, remove ConfirmActionTool
- test helpers: drop .readonly() on signal, update extensionFactory signatures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- Adds `run_script` and `repl_python` client-side tools to `MontyScriptEnvironment`, backed by a sandboxed Python interpreter via `dart_monty`
- HITL approval: `requiresApproval: true` suspends the session until the user approves or denies
- `standard.dart`: `SoliplexPlugin` now receives connections for all registered servers, not just the primary connection
- Adds `StdoutSink` debug logging in `standard.dart` for development visibility
- Updates `tool_call_tile.dart` with richer tool call display and clipboard support
- Adds `SoliplexConnection.alias` and `serverUrl` fields for improved connection identification
- Adds HITL tests (`hitl_test.dart`) and expands `MontyScriptEnvironment` tests

## Test plan
- `dart test` passes in `soliplex_agent` and `soliplex_monty_plugin`
- `run_script` and `repl_python` tools appear in LLM context for Monty rooms
- `SoliplexPlugin`

🤖 Generated with Claude Code