Update Terminus env harness and sandbox backends by burtenshaw · Pull Request #729 · huggingface/OpenEnv

burtenshaw · 2026-05-29T17:54:47Z

Summary

This splits the Terminus environment/server support out of the Terminus Pi TRL example PR.

Changes:

adds terminus_env.harness.TerminusSessionFactory and build_terminal_tool_call
adds HF sandbox and lightweight local sandbox backends
removes the old E2B sandbox backend file
teaches the server reset path to select sandbox backends and report sandbox creation errors clearly
adds state parsing on the client side
updates Terminus env docs, requirements, lockfile, and UI text
adds focused env and harness tests

Validation

python3 -m py_compile envs/terminus_env/harness.py envs/terminus_env/server/hf_sandbox.py envs/terminus_env/server/local_sandbox.py envs/terminus_env/server/terminus_env_environment.py envs/terminus_env/client.py tests/envs/test_terminus_environment.py tests/envs/test_terminus_harness.py
ruff check envs/terminus_env/harness.py envs/terminus_env/server/hf_sandbox.py envs/terminus_env/server/local_sandbox.py envs/terminus_env/server/terminus_env_environment.py envs/terminus_env/client.py tests/envs/test_terminus_environment.py tests/envs/test_terminus_harness.py
PYTHONPATH=src:envs uv run --with pytest python -m pytest tests/envs/test_terminus_environment.py tests/envs/test_terminus_harness.py

Result: 10 passed in 38.12s.

Darktex

Note: This is an automated review by Claude Code, not a human review.

Alignment Review — PR #729

Update Terminus env harness and sandbox backends

Automated Checks

Lint: PASS — no lint violations found in new files
Format: FAIL — harness.py and test_terminus_harness.py have lines exceeding 88 chars; run uv run ruff format envs/terminus_env/harness.py tests/envs/test_terminus_harness.py
Debug code: CLEAN

Tier 1: Bugs / Correctness

Format violations — harness.py:88, harness.py:250, test_terminus_harness.py:154 exceed line length. Run uv run ruff format on both files.
Unbounded recursion in _normalize_terminal_payload (harness.py:385-392) — When payload["arguments"] is a dict, the function recurses without a depth bound. Pathological input with deeply nested {"arguments": {"arguments": ...}} could hit Python's recursion limit. Low risk in practice but worth noting.
Dead branch in _build_verify (harness.py:226-227) — state is None is unreachable because StepEnvSessionAdapter._read_state() always returns a TerminusState. Harmless but could use a comment if intentional as a fallback.
Test coverage gaps (non-blocking):
- No test for build_terminal_tool_call("") (empty string)
- No test for non-JSON language code fences (e.g. ```python ... ```)
- No test for _build_reset_kwargs sandbox config key forwarding

Tier 2: Alignment

ALIGNMENT FLAG: Sandbox config keys forwarded from task dicts

_build_reset_kwargs forwards sandbox_image, sandbox_flavor, hf_sandbox_image, hf_sandbox_flavor etc. from task dicts into reset(). Two overlapping key sets with no conflict resolution. If tasks can specify sandbox backend, training orchestration loses control over execution environment, complicating reproducibility. Is this intentional?

ALIGNMENT FLAG: terminus_reward() placement

The function is exported from harness.py but the PR description suggests it may also exist in an example script. If it's canonical, confirm it's in harness.py.__all__. If example-only, update the PR description.

Architecture Assessment

"Agents cannot reset": PASS — TerminusSessionFactory.create() wraps TerminusEnv client inside StepEnvSessionAdapter. Agent only sees terminal tool.
Client-server separation: PASS — harness.py imports from .client, not .server.
Rewards inside environment: PASS — _build_tool_result and _build_verify forward rewards from the environment's step results. No external reward synthesis.
Dual API boundary: PASS — MCP for agent tools, Gym-like API for orchestration via StepEnvSessionAdapter.

Verdict: COMMENT

The harness architecture is sound and correctly implements OpenEnv patterns. The server-side files (hf_sandbox.py, local_sandbox.py — bwrap command construction and HF credential handling) should receive careful human review before merge. Format violations need fixing.

Automated review by Claude Code | Learn more

Darktex

Note: This is an automated review by Claude Code, not a human review.

Alignment Review Report

Automated Checks

Lint: SKIPPED (no network access in this environment; ruff, usort could not be invoked)
Debug code: CLEAN — no debug artifacts found in changed files
Tests: SKIPPED (no network access for dependency installation)

Tier 1: Fixes Required

envs/terminus_env/pyproject.toml + envs/terminus_env/uv.lock — hf-sandbox dependency classification mismatch. pyproject.toml correctly places hf-sandbox>=0.1.1 under [project.optional-dependencies] hf, but uv.lock records it as a core (non-extra) package dependency. This means uv pip install openenv-terminus-env installs hf-sandbox unconditionally, defeating the optional backend design. Regenerate the lockfile with uv lock after confirming pyproject.toml core dependencies does NOT include hf-sandbox.
envs/terminus_env/server/gradio_ui.py — UI strings hardcode "HF sandbox" after the backend was made pluggable. Four strings now say things like "Reset the environment to create an HF sandbox" and "Session closed. The HF sandbox was released." These are incorrect when using the local backend. Change to backend-agnostic wording (e.g. "Reset the environment to create a sandbox session" / "Session closed. The sandbox was released.").
envs/terminus_env/models.py — CommandResult docstring now reads "Outcome of one shell command run inside the HF sandbox." but CommandResult is backend-agnostic (produced by both HF and local backends). Change to "Outcome of one shell command run inside the sandbox."
envs/terminus_env/harness.py __all__ — The terminus_reward function is exported from __init__.py and is part of the public surface, but harness.py's own __all__ list is missing "terminus_reward". Add it so the module's __all__ is consistent with what __init__.py re-exports.
envs/terminus_env/harness.py _build_reset_kwargs — The diff includes "sandbox_backend" and "sandbox_root" in the forwarded key list, but the file as seen on the branch is missing them. Verify the PR branch contains the diff-stated version; if not, the harness cannot forward per-task backend selection to the server, silently ignoring those task fields.

Tier 2: Alignment Discussion

ALIGNMENT FLAG: terminus_reward implements reward extraction outside the environment boundary

Principle at stake: "Rewards inside environment" (PRINCIPLES.md, RFC 002); "Reward computation must stay inside environment boundary" (INVARIANTS.md §Architectural, point 3)
The concern: terminus_reward is a TRL-facing reward function (signature completions, **kwargs -> list[float]) that parses reward values from tool message content on the training side. The reward itself originates in the environment (the environment emits reward= in the tool message), but terminus_reward re-parses and surfaces it to TRL as the authoritative reward signal. This creates a parallel reward extraction path that lives outside the environment — if the environment's reward encoding changes, terminus_reward silently breaks. Is parsing the reward out of the message content on the client/trainer side acceptable here, or should the reward come through the standard result.reward channel only?
Suggested reviewer: @Darktex

ALIGNMENT FLAG: build_terminal_tool_call uses ast.parse on model-generated text

Principle at stake: "Design for LLMs" / "Minimize lifecycle deltas" (PRINCIPLES.md)
The concern: The function parses model output by running ast.parse(f"terminal({match.group('body')})", mode="eval") where match.group('body') is extracted from raw model text via re.DOTALL. Although ast.literal_eval is used for values (safe), the ast.Name fallback path means terminal(command=malicious_name) silently produces {"command": "malicious_name"} — bare identifiers become string values. The sandbox isolates execution, but is this fallback path intended for production use or only for the Pi-model compatibility case?
Suggested reviewer: @Darktex

Summary

5 mechanical issues to fix (1 critical: dependency lockfile mismatch; 2 cosmetic but inaccurate: UI/docstring text; 2 consistency: __all__ and _build_reset_kwargs key list)
2 alignment points for human review (reward extraction outside environment boundary; AST parsing of model output with identifier fallback)

The sandbox backend refactor itself is clean and the new architecture (pluggable local/hf selection via TERMINUS_SANDBOX_BACKEND) is a good improvement. The harness and test coverage are solid. The blocking issues are the uv.lock mismatch and the UI text that falsely presents all backends as "HF sandbox."

Automated review by Claude Code | Learn more

terminus_env: add harness sandbox backends

5e254ae

burtenshaw mentioned this pull request May 29, 2026

[EXAMPLE] [WIP] Add Terminus Pi TRL training #675

Draft

terminus_env: add TRL reward helper

86b2225

Darktex reviewed May 30, 2026

View reviewed changes

Darktex suggested changes May 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update Terminus env harness and sandbox backends#729

Update Terminus env harness and sandbox backends#729
burtenshaw wants to merge 2 commits into
huggingface:mainfrom
burtenshaw:codex/terminus-env-harness

burtenshaw commented May 29, 2026

Uh oh!

Darktex left a comment

Uh oh!

Darktex left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

burtenshaw commented May 29, 2026

Summary

Validation

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Alignment Review — PR #729

Automated Checks

Tier 1: Bugs / Correctness

Tier 2: Alignment

Architecture Assessment

Verdict: COMMENT

Uh oh!

Darktex left a comment

Choose a reason for hiding this comment

Alignment Review Report

Automated Checks

Tier 1: Fixes Required

Tier 2: Alignment Discussion

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants