Skip to content

Update Terminus env harness and sandbox backends#729

Open
burtenshaw wants to merge 2 commits into
huggingface:mainfrom
burtenshaw:codex/terminus-env-harness
Open

Update Terminus env harness and sandbox backends#729
burtenshaw wants to merge 2 commits into
huggingface:mainfrom
burtenshaw:codex/terminus-env-harness

Conversation

@burtenshaw
Copy link
Copy Markdown
Collaborator

Summary

This splits the Terminus environment/server support out of the Terminus Pi TRL example PR.

Changes:

  • adds terminus_env.harness.TerminusSessionFactory and build_terminal_tool_call
  • adds HF sandbox and lightweight local sandbox backends
  • removes the old E2B sandbox backend file
  • teaches the server reset path to select sandbox backends and report sandbox creation errors clearly
  • adds state parsing on the client side
  • updates Terminus env docs, requirements, lockfile, and UI text
  • adds focused env and harness tests

Validation

  • python3 -m py_compile envs/terminus_env/harness.py envs/terminus_env/server/hf_sandbox.py envs/terminus_env/server/local_sandbox.py envs/terminus_env/server/terminus_env_environment.py envs/terminus_env/client.py tests/envs/test_terminus_environment.py tests/envs/test_terminus_harness.py
  • ruff check envs/terminus_env/harness.py envs/terminus_env/server/hf_sandbox.py envs/terminus_env/server/local_sandbox.py envs/terminus_env/server/terminus_env_environment.py envs/terminus_env/client.py tests/envs/test_terminus_environment.py tests/envs/test_terminus_harness.py
  • PYTHONPATH=src:envs uv run --with pytest python -m pytest tests/envs/test_terminus_environment.py tests/envs/test_terminus_harness.py

Result: 10 passed in 38.12s.

Copy link
Copy Markdown
Contributor

@Darktex Darktex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is an automated review by Claude Code, not a human review.


Alignment Review — PR #729

Update Terminus env harness and sandbox backends

Automated Checks

  • Lint: PASS — no lint violations found in new files
  • Format: FAIL — harness.py and test_terminus_harness.py have lines exceeding 88 chars; run uv run ruff format envs/terminus_env/harness.py tests/envs/test_terminus_harness.py
  • Debug code: CLEAN

Tier 1: Bugs / Correctness

  1. Format violationsharness.py:88, harness.py:250, test_terminus_harness.py:154 exceed line length. Run uv run ruff format on both files.

  2. Unbounded recursion in _normalize_terminal_payload (harness.py:385-392) — When payload["arguments"] is a dict, the function recurses without a depth bound. Pathological input with deeply nested {"arguments": {"arguments": ...}} could hit Python's recursion limit. Low risk in practice but worth noting.

  3. Dead branch in _build_verify (harness.py:226-227) — state is None is unreachable because StepEnvSessionAdapter._read_state() always returns a TerminusState. Harmless but could use a comment if intentional as a fallback.

  4. Test coverage gaps (non-blocking):

    • No test for build_terminal_tool_call("") (empty string)
    • No test for non-JSON language code fences (e.g. ```python ... ```)
    • No test for _build_reset_kwargs sandbox config key forwarding

Tier 2: Alignment

ALIGNMENT FLAG: Sandbox config keys forwarded from task dicts

  • _build_reset_kwargs forwards sandbox_image, sandbox_flavor, hf_sandbox_image, hf_sandbox_flavor etc. from task dicts into reset(). Two overlapping key sets with no conflict resolution. If tasks can specify sandbox backend, training orchestration loses control over execution environment, complicating reproducibility. Is this intentional?

ALIGNMENT FLAG: terminus_reward() placement

  • The function is exported from harness.py but the PR description suggests it may also exist in an example script. If it's canonical, confirm it's in harness.py.__all__. If example-only, update the PR description.

Architecture Assessment

  • "Agents cannot reset": PASS — TerminusSessionFactory.create() wraps TerminusEnv client inside StepEnvSessionAdapter. Agent only sees terminal tool.
  • Client-server separation: PASS — harness.py imports from .client, not .server.
  • Rewards inside environment: PASS — _build_tool_result and _build_verify forward rewards from the environment's step results. No external reward synthesis.
  • Dual API boundary: PASS — MCP for agent tools, Gym-like API for orchestration via StepEnvSessionAdapter.

Verdict: COMMENT

The harness architecture is sound and correctly implements OpenEnv patterns. The server-side files (hf_sandbox.py, local_sandbox.py — bwrap command construction and HF credential handling) should receive careful human review before merge. Format violations need fixing.


Automated review by Claude Code | Learn more

Copy link
Copy Markdown
Contributor

@Darktex Darktex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This is an automated review by Claude Code, not a human review.


Alignment Review Report

Automated Checks

  • Lint: SKIPPED (no network access in this environment; ruff, usort could not be invoked)
  • Debug code: CLEAN — no debug artifacts found in changed files
  • Tests: SKIPPED (no network access for dependency installation)

Tier 1: Fixes Required

  • envs/terminus_env/pyproject.toml + envs/terminus_env/uv.lockhf-sandbox dependency classification mismatch. pyproject.toml correctly places hf-sandbox>=0.1.1 under [project.optional-dependencies] hf, but uv.lock records it as a core (non-extra) package dependency. This means uv pip install openenv-terminus-env installs hf-sandbox unconditionally, defeating the optional backend design. Regenerate the lockfile with uv lock after confirming pyproject.toml core dependencies does NOT include hf-sandbox.

  • envs/terminus_env/server/gradio_ui.py — UI strings hardcode "HF sandbox" after the backend was made pluggable. Four strings now say things like "Reset the environment to create an HF sandbox" and "Session closed. The HF sandbox was released." These are incorrect when using the local backend. Change to backend-agnostic wording (e.g. "Reset the environment to create a sandbox session" / "Session closed. The sandbox was released.").

  • envs/terminus_env/models.pyCommandResult docstring now reads "Outcome of one shell command run inside the HF sandbox." but CommandResult is backend-agnostic (produced by both HF and local backends). Change to "Outcome of one shell command run inside the sandbox."

  • envs/terminus_env/harness.py __all__ — The terminus_reward function is exported from __init__.py and is part of the public surface, but harness.py's own __all__ list is missing "terminus_reward". Add it so the module's __all__ is consistent with what __init__.py re-exports.

  • envs/terminus_env/harness.py _build_reset_kwargs — The diff includes "sandbox_backend" and "sandbox_root" in the forwarded key list, but the file as seen on the branch is missing them. Verify the PR branch contains the diff-stated version; if not, the harness cannot forward per-task backend selection to the server, silently ignoring those task fields.


Tier 2: Alignment Discussion

ALIGNMENT FLAG: terminus_reward implements reward extraction outside the environment boundary

  • Principle at stake: "Rewards inside environment" (PRINCIPLES.md, RFC 002); "Reward computation must stay inside environment boundary" (INVARIANTS.md §Architectural, point 3)
  • The concern: terminus_reward is a TRL-facing reward function (signature completions, **kwargs -> list[float]) that parses reward values from tool message content on the training side. The reward itself originates in the environment (the environment emits reward= in the tool message), but terminus_reward re-parses and surfaces it to TRL as the authoritative reward signal. This creates a parallel reward extraction path that lives outside the environment — if the environment's reward encoding changes, terminus_reward silently breaks. Is parsing the reward out of the message content on the client/trainer side acceptable here, or should the reward come through the standard result.reward channel only?
  • Suggested reviewer: @Darktex

ALIGNMENT FLAG: build_terminal_tool_call uses ast.parse on model-generated text

  • Principle at stake: "Design for LLMs" / "Minimize lifecycle deltas" (PRINCIPLES.md)
  • The concern: The function parses model output by running ast.parse(f"terminal({match.group('body')})", mode="eval") where match.group('body') is extracted from raw model text via re.DOTALL. Although ast.literal_eval is used for values (safe), the ast.Name fallback path means terminal(command=malicious_name) silently produces {"command": "malicious_name"} — bare identifiers become string values. The sandbox isolates execution, but is this fallback path intended for production use or only for the Pi-model compatibility case?
  • Suggested reviewer: @Darktex

Summary

  • 5 mechanical issues to fix (1 critical: dependency lockfile mismatch; 2 cosmetic but inaccurate: UI/docstring text; 2 consistency: __all__ and _build_reset_kwargs key list)
  • 2 alignment points for human review (reward extraction outside environment boundary; AST parsing of model output with identifier fallback)

The sandbox backend refactor itself is clean and the new architecture (pluggable local/hf selection via TERMINUS_SANDBOX_BACKEND) is a good improvement. The harness and test coverage are solid. The blocking issues are the uv.lock mismatch and the UI text that falsely presents all backends as "HF sandbox."


Automated review by Claude Code | Learn more

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants