🐛 Bug Description
When creating a new AgentHarness on a fresh kagent install (substrate runtime), the UI gateway path http://<host>/api/agentharnesses/<ns>/<name>/gateway/ returns HTTP 503 for many minutes while the upstream actor is being brought up. There is no UI surface indicating that the AgentHarness is not yet Ready — the user simply cannot interact with the Agent/Actor and sees a bare nginx 503.
In the local case observed, the 503 lasted ~19 minutes after AgentHarness creation. During that window, the substrate node agent (atelet) leaked memory aggressively (RSS peaked at ~10 GB on a 32 GB node) before eventually succeeding.
🎯 Affected Service(s)
- UI Service (nginx proxy returns the 503)
- Upstream: substrate (ate-controller, atelet); see "Underlying cause" below
🚦 Impact/Severity
Minor inconvenience for individual installs (the harness eventually becomes Ready on its own), but the failure mode is invisible in the UI and can push the node toward OOM on memory-constrained environments.
🔄 Steps To Reproduce
- Local kind cluster with kagent installed (substrate runtime enabled).
- Apply an AgentHarness whose ActorTemplate references a large (~hundreds of MB) container image not yet cached on the node, e.g. ghcr.io/kagent-dev/nemoclaw/sandbox-base:
apiVersion: kagent.dev/v1alpha2
kind: AgentHarness
metadata:
  name: peterj-claw
  namespace: kagent
spec:
  backend: openclaw
  description: OpenClaw on Agent Substrate
  modelConfigRef: default-model-config
  runtime: substrate
  substrate:
    gatewayToken: test-token
    workerPoolRef:
      name: kagent-default
- Immediately open http://localhost:8001/api/agentharnesses/kagent/peterj-claw/gateway/ in a browser.
- Observe: HTTP 503 with no further UI feedback for many minutes.
- In parallel, kubectl get agentharness -n kagent peterj-claw shows READY=False with condition ActorTemplateReady=False ("waiting for ActorTemplate golden snapshot").
- kubectl logs -n ate-system deploy/ate-controller shows repeated ResumeGoldenActor failures with code = DeadlineExceeded.
- docker exec <kind-node> ps -eo pid,rss,comm --sort=-rss | head shows atelet RSS climbing into multi-GB territory.
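To put a number on "many minutes", the 503 window can be timed with a small poller. This is a minimal sketch, not part of kagent: the helper names `http_status` and `seconds_until_not_503` are hypothetical, and the URL is the one from the steps above. Only a 503 counts as "still waiting", so any other response (including a connection failure, reported here as 0) ends the measurement.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

# URL taken from the reproduction steps above.
GATEWAY_URL = "http://localhost:8001/api/agentharnesses/kagent/peterj-claw/gateway/"


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the connection fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return 0


def seconds_until_not_503(
    probe: Callable[[], int],
    interval: float = 10.0,
    max_wait: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Poll until probe() stops returning 503.

    Returns the elapsed seconds, or None if max_wait passes first.
    """
    start = clock()
    while clock() - start <= max_wait:
        if probe() != 503:
            return clock() - start
        sleep(interval)
    return None
```

Calling `seconds_until_not_503(lambda: http_status(GATEWAY_URL))` right after applying the manifest gives the length of the 503 window in seconds. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.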
🤔 Expected Behavior
- The kagent UI surfaces a clear status when the AgentHarness is not yet Ready (e.g. "Bringing up sandbox: pulling image, this can take a few minutes on first use") rather than a bare nginx 503.
- The underlying substrate image-pull path eventually succeeds without entering a death-loop that retains GB-scale buffers in atelet RSS.
📱 Actual Behavior
- Bare HTTP 503 from the nginx in kagent-ui (/api/agentharnesses/<ns>/<name>/gateway/).
- ate-controller reconciles every ~30s; each reconcile starts a new ResumeGoldenActor RPC; each RPC's image pull is cancelled by the RPC deadline before it can complete; atelet retains buffers from cancelled pulls.
- After ~19 minutes of retries the pull finally succeeds and the harness transitions to Ready=True. RSS drops back to ~1.27 GB.
💻 Environment
- OS: macOS host running a local kind cluster
- Kubernetes: kind v1.36
- Kubernetes provider: kind (local, arm64)
- Substrate components: ate-controller, atelet (DaemonSet), rustfs as S3 backend
- Network: ~51 Mbps measured from the kind node to ghcr.io; image total compressed size ~360 MB → ~53s ideal pull time, longer than the ResumeGoldenActor RPC deadline (~30s)
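The pull-time estimate is size divided by bandwidth. A quick sketch of that arithmetic, using the rounded figures from this section (so the result lands a few seconds away from the ~53s quoted above) and ignoring protocol overhead and parallel layer downloads:

```python
def ideal_pull_seconds(image_megabytes: float, link_megabits_per_s: float) -> float:
    """Transfer time for a compressed image: megabytes * 8 bits / link speed."""
    return image_megabytes * 8 / link_megabits_per_s


IMAGE_MB = 360       # approximate compressed image size reported above
LINK_MBPS = 51       # approximate measured bandwidth reported above
RPC_DEADLINE_S = 30  # approximate ResumeGoldenActor deadline reported above

pull_s = ideal_pull_seconds(IMAGE_MB, LINK_MBPS)
print(f"ideal pull: {pull_s:.0f}s, deadline: {RPC_DEADLINE_S}s")
# -> ideal pull: 56s, deadline: 30s
print(f"share of image transferable within one deadline: {RPC_DEADLINE_S / pull_s:.0%}")
# -> share of image transferable within one deadline: 53%
```

Either way, a single RPC can move only about half of the image before its deadline fires, which is why no single attempt can complete a cold pull on this link.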
🔍 Additional Context
Underlying cause (substrate side, likely needs an upstream fix):
- ate-controller's ResumeGoldenActor RPC has a context deadline of ~30s
- At the measured bandwidth, the image is too large for a first-time pull to complete within that deadline; the same holds on typical home/laptop networks
- Each cancelled RPC starts a fresh pull rather than letting prior progress persist; atelet accumulates buffers in memory from cancelled pulls
- The pull eventually succeeds (probably once partial layer state has accumulated enough), but the experience until then is "503 forever, with the node sliding toward OOM"
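The interaction between the deadline and restart-from-zero pulls can be illustrated with a toy model. This is a sketch of the reasoning only, not of atelet's actual behavior: `attempts_needed` is a hypothetical helper, and it covers just the two extremes (no progress kept, or all progress kept across attempts).

```python
import math
from typing import Optional


def attempts_needed(
    pull_seconds: float, deadline_seconds: float, progress_persists: bool
) -> Optional[int]:
    """Toy model: deadline-bounded attempts required before a pull completes.

    Returns None when the pull can never finish under this model.
    """
    if pull_seconds <= deadline_seconds:
        return 1
    if not progress_persists:
        # Every attempt restarts from zero and is cancelled again.
        return None
    # Each attempt contributes one deadline's worth of transfer.
    return math.ceil(pull_seconds / deadline_seconds)


print(attempts_needed(56.5, 30, progress_persists=False))  # -> None
print(attempts_needed(56.5, 30, progress_persists=True))   # -> 2
```

The observed behavior (success after ~19 minutes of ~30s retries) sits between these extremes, which fits the guess above that some partial layer state survives cancellation. If progress persisted fully, two attempts would have been enough.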
Kagent-side improvements that would help even if the substrate behavior is fixed:
- Display the AgentHarness status conditions (ActorTemplateReady, ActorReady, Ready) in the UI; warn before opening the gateway view if Ready=False.
- Have the UI proxy intercept the 503 and present a status-aware error (e.g. include the latest condition message: "waiting for ActorTemplate golden snapshot").
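As a sketch of the second idea, the logic that turns status conditions into a user-facing message could look like the following. It is written in Python for brevity rather than in the UI's own stack. The function name `gateway_status_message` is hypothetical, the `type`/`status`/`message` fields assume the usual Kubernetes condition shape, and the check order of the three condition types is an assumption.

```python
from typing import Iterable, Mapping, Optional

# Condition types named in this issue; the bring-up order is an assumption.
CONDITION_ORDER = ("ActorTemplateReady", "ActorReady", "Ready")


def gateway_status_message(conditions: Iterable[Mapping[str, str]]) -> Optional[str]:
    """Return a message to show instead of a bare 503, or None if the harness is Ready."""
    by_type = {c["type"]: c for c in conditions}
    if by_type.get("Ready", {}).get("status") == "True":
        return None
    # Report the earliest bring-up stage that is missing or not True.
    for ctype in CONDITION_ORDER:
        cond = by_type.get(ctype)
        if cond is None:
            return f"AgentHarness not ready ({ctype}): status not reported yet"
        if cond.get("status") != "True":
            detail = cond.get("message") or "no message"
            return f"AgentHarness not ready ({ctype}): {detail}"
    return "AgentHarness not ready"  # defensive fallback
```

With the conditions observed in this report, the result would be `AgentHarness not ready (ActorTemplateReady): waiting for ActorTemplate golden snapshot`, which already tells the user far more than the nginx error page.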
📋 Logs
# kagent-ui nginx
"GET /api/agentharnesses/kagent/peterj-claw/gateway/ HTTP/1.1" 503 35
# ate-controller (every ~30s for ~19 min)
ERROR Reconciler error controller=actortemplate ActorTemplate={name:peterj-claw,namespace:kagent}
error="while resuming golden actor: rpc error: code = DeadlineExceeded desc = context deadline exceeded"
# atelet (sandbox-base layer never finishes populating)
INFO Cache miss ref=ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee41...
INFO Cache miss ref=ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee41...
# ...repeated, no matching "Populated image cache" log
🙋 Are you willing to contribute?
Happy to help triage / repro further; the underlying fix likely needs to land in the substrate repo (ate-controller + atelet) but kagent UI improvements to surface harness status are a self-contained kagent-side change.