[BUG] AgentHarness gateway returns 503 for ~20 min on first use; atelet RAM grows to multi-GB during the wait #1996

### 🐛 Bug Description

When creating a new `AgentHarness` on a fresh kagent install (substrate runtime), the UI gateway path `http://<host>/api/agentharnesses/<ns>/<name>/gateway/` returns **HTTP 503** for many minutes while the upstream actor is being brought up. Nothing in the UI indicates that the AgentHarness is not yet `Ready`: the user simply cannot interact with the Agent/Actor and sees a bare nginx 503.

In the local case observed, the 503 lasted ~19 minutes after `AgentHarness` creation. During that window, the substrate node agent (`atelet`) leaked memory aggressively (RSS peaked at ~10 GB on a 32 GB node) before eventually succeeding.

### 🎯 Affected Service(s)
- UI Service (nginx proxy returns the 503)
- Upstream: substrate (`ate-controller`, `atelet`) — see "Underlying cause" below

### 🚦 Impact/Severity
Minor inconvenience for individual installs (the harness eventually becomes Ready on its own), but the failure mode is *invisible* in the UI and can push the node toward OOM on memory-constrained environments.

### 🔄 Steps To Reproduce

1. Local kind cluster with kagent installed (substrate runtime enabled).
2. Apply an `AgentHarness` whose `ActorTemplate` references a large container image (~360 MB compressed) that is not yet cached on the node, e.g. `ghcr.io/kagent-dev/nemoclaw/sandbox-base`:
   ```yaml
   apiVersion: kagent.dev/v1alpha2
   kind: AgentHarness
   metadata:
     name: peterj-claw
     namespace: kagent
   spec:
     backend: openclaw
     description: OpenClaw on Agent Substrate
     modelConfigRef: default-model-config
     runtime: substrate
     substrate:
       gatewayToken: test-token
       workerPoolRef:
         name: kagent-default
   ```
3. Immediately open `http://localhost:8001/api/agentharnesses/kagent/peterj-claw/gateway/` in a browser.
4. Observe: HTTP 503 with no further UI feedback for many minutes.
5. In parallel, `kubectl get agentharness -n kagent peterj-claw` shows `READY=False` with condition `ActorTemplateReady=False` ("waiting for ActorTemplate golden snapshot").
6. `kubectl logs -n ate-system deploy/ate-controller` shows repeated `ResumeGoldenActor` failures with `code = DeadlineExceeded`.
7. `docker exec <kind-node> ps -eo pid,rss,comm --sort=-rss | head` shows `atelet` RSS climbing into multi-GB territory.
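
To time how long the harness stays un-Ready during a repro, steps 5 onward can be scripted. The following is a minimal sketch, not part of kagent: it assumes the `AgentHarness` object exposes the conventional Kubernetes `status.conditions[]` list with `type`, `status` and `message` keys, and the helper names `ready_condition` and `wait_until_ready` are hypothetical.

```python
import json
import subprocess
import time


def ready_condition(harness: dict) -> tuple[bool, str]:
    """Return (is_ready, explanation) for an AgentHarness object.

    Assumption: conditions live under status.conditions[] with
    "type", "status" and "message" keys (not confirmed for kagent).
    """
    conditions = harness.get("status", {}).get("conditions", [])
    by_type = {c.get("type"): c for c in conditions}
    if by_type.get("Ready", {}).get("status") == "True":
        return True, "Ready"
    # Report the first non-True condition that carries a message.
    for cond in conditions:
        if cond.get("status") != "True" and cond.get("message"):
            return False, f'{cond.get("type")}: {cond["message"]}'
    return False, "no conditions reported yet"


def wait_until_ready(namespace: str, name: str,
                     timeout_s: float = 1800, interval_s: float = 10) -> bool:
    """Poll kubectl until the harness is Ready or the timeout elapses."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        out = subprocess.run(
            ["kubectl", "get", "agentharness", "-n", namespace, name, "-o", "json"],
            capture_output=True, text=True, check=True,
        ).stdout
        ok, why = ready_condition(json.loads(out))
        print(f"[{time.monotonic() - start:6.0f}s] {name}: {why}")
        if ok:
            return True
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready("kagent", "peterj-claw")` right after applying the manifest prints one status line per poll, which makes the length of the 503 window easy to read off.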

### 🤔 Expected Behavior

- The kagent UI surfaces a clear status when the `AgentHarness` is not yet Ready (e.g. "Bringing up sandbox — pulling image, this can take a few minutes on first use") rather than a bare nginx 503.
- The underlying substrate image-pull path eventually succeeds without entering a death-loop that retains GB-scale buffers in `atelet` RSS.

### 📱 Actual Behavior

- Bare HTTP 503 from the nginx in `kagent-ui` (`/api/agentharnesses/<ns>/<name>/gateway/`).
- `ate-controller` reconciles every ~30s; each reconcile starts a new `ResumeGoldenActor` RPC; each RPC's image pull is cancelled by the RPC deadline before it can complete; `atelet` retains buffers from cancelled pulls.
- After ~19 minutes of retries the pull finally succeeds and the harness transitions to `Ready=True`. RSS drops back to ~1.27 GB.

### 💻 Environment

- OS: macOS host running a local kind cluster
- Kubernetes: kind v1.36
- Kubernetes provider: kind (local, arm64)
- Substrate components: `ate-controller`, `atelet` (DaemonSet), `rustfs` as S3 backend
- Network: ~51 Mbps measured from the kind node to `ghcr.io`; image total compressed size ~360 MB → ~53s ideal pull time, longer than the `ResumeGoldenActor` RPC deadline (~30s)
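
The pull-time estimate can be reproduced with a short calculation. This is only a back-of-the-envelope sketch using the rounded figures from this report (360 MB, 51 Mbps, 30 s deadline); the function name `ideal_pull_seconds` is illustrative. With these rounded inputs the result is about 56 s, in the same range as the ~53s quoted above and well past the deadline either way.

```python
def ideal_pull_seconds(size_mb: float, bandwidth_mbps: float) -> float:
    """Lower bound on transfer time: megabytes -> megabits, divided by Mbps."""
    return size_mb * 8 / bandwidth_mbps


RPC_DEADLINE_S = 30  # approximate ResumeGoldenActor deadline from this report

pull_s = ideal_pull_seconds(360, 51)
print(f"ideal pull: {pull_s:.0f}s, RPC deadline: {RPC_DEADLINE_S}s")
# Share of the image a single attempt can fetch before it is cancelled.
print(f"fetched before cancel: {RPC_DEADLINE_S / pull_s:.0%}")
```

Under these numbers a single attempt can fetch only a bit more than half of the image before the deadline fires, so any link slower than roughly 96 Mbps would hit the same wall.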

### 🔍 Additional Context

**Underlying cause (substrate side, likely needs an upstream fix):**

- `ate-controller`'s `ResumeGoldenActor` RPC has a context deadline of ~30s
- Image size divided by available bandwidth exceeds that deadline, so the first-time pull cannot complete within it on typical home/laptop networks
- Each cancelled RPC starts a fresh pull rather than letting prior progress persist; `atelet` accumulates buffers in memory from cancelled pulls
- The pull eventually succeeds (probably once partial layer state has accumulated enough), but the experience until then is "503 forever, with the node sliding toward OOM"
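
Why restart-from-scratch matters can be illustrated with a toy model. This is a simplified sketch, not the substrate implementation: it assumes constant bandwidth, treats the image as a single transfer, and the function `attempts_needed` is hypothetical. It compares retries that keep prior progress with retries that start over.

```python
def attempts_needed(total_s: float, deadline_s: float, resumable: bool,
                    max_attempts: int = 100):
    """Count attempts until a transfer needing total_s seconds completes.

    Each attempt runs for at most deadline_s seconds. If resumable,
    progress carries over between attempts; otherwise every attempt
    restarts from zero. Returns None if it never completes.
    """
    done = 0.0
    for attempt in range(1, max_attempts + 1):
        done = (done if resumable else 0.0) + deadline_s
        if done >= total_s:
            return attempt
    return None


# ~56 s ideal pull (rounded inputs from this report) vs. a ~30 s deadline.
print("resumable:", attempts_needed(56.5, 30, resumable=True))
print("restart from zero:", attempts_needed(56.5, 30, resumable=False))
```

In this model a resumable pull finishes on the second attempt, while a pull that restarts from zero never finishes. The fact that the real pull did succeed after ~19 minutes suggests the model is too strict somewhere (for example, some layer state may persist, or bandwidth may vary), but it shows why the deadline and the lack of resumption combine badly.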

**Kagent-side improvements that would help even if the substrate behavior is fixed:**

- Display the `AgentHarness` status conditions (`ActorTemplateReady`, `ActorReady`, `Ready`) in the UI; warn before opening the gateway view if `Ready=False`.
- Have the UI proxy intercept the 503 and present a status-aware error (e.g. include the latest condition message: "waiting for ActorTemplate golden snapshot").
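
The status-aware error suggested above could be derived from the harness conditions roughly as follows. This is a sketch of the logic only: the response field names (`ready`, `blockedOn`, `message`, `retryAfterSeconds`) and the function `gateway_error_body` are hypothetical, the ordering of conditions is assumed from the order they are listed in this report, and the actual kagent UI would implement this in its own stack.

```python
# Assumed bring-up order, taken from the order the conditions are listed above.
CONDITION_ORDER = ["ActorTemplateReady", "ActorReady", "Ready"]


def gateway_error_body(conditions: list) -> dict:
    """Build a JSON-serialisable body to return instead of a bare 503."""
    by_type = {c["type"]: c for c in conditions}
    for ctype in CONDITION_ORDER:
        cond = by_type.get(ctype)
        if cond is None or cond.get("status") != "True":
            # Surface the first blocking condition and its message, if any.
            message = (cond or {}).get("message") or "status not reported yet"
            return {"ready": False, "blockedOn": ctype,
                    "message": message, "retryAfterSeconds": 30}
    return {"ready": True, "blockedOn": None,
            "message": "AgentHarness is Ready", "retryAfterSeconds": 0}


print(gateway_error_body([
    {"type": "ActorTemplateReady", "status": "False",
     "message": "waiting for ActorTemplate golden snapshot"},
]))
```

Returning the first blocking condition rather than all of them keeps the message short enough to show inline where the gateway view would otherwise display the nginx error.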

### 📋 Logs

```
# kagent-ui nginx
"GET /api/agentharnesses/kagent/peterj-claw/gateway/ HTTP/1.1" 503 35

# ate-controller (every ~30s for ~19 min)
ERROR Reconciler error  controller=actortemplate  ActorTemplate={name:peterj-claw,namespace:kagent}
  error="while resuming golden actor: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

# atelet (sandbox-base layer never finishes populating)
INFO Cache miss  ref=ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee41...
INFO Cache miss  ref=ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee41...
# ...repeated, no matching "Populated image cache" log
```

### 🙋 Are you willing to contribute?
Happy to help triage / repro further; the underlying fix likely needs to land in the substrate repo (ate-controller + atelet) but kagent UI improvements to surface harness status are a self-contained kagent-side change.
