[BUG] AgentHarness gateway returns 503 for ~20 min on first use; atelet RAM grows to multi-GB during the wait #1996

@jjamroga

🐛 Bug Description

When creating a new AgentHarness on a fresh kagent install (substrate runtime), the UI gateway path http://<host>/api/agentharnesses/<ns>/<name>/gateway/ returns HTTP 503 for many minutes while the upstream actor is being brought up. There is no UI surface indicating that the AgentHarness is not yet Ready — the user simply cannot interact with the Agent/Actor and sees a bare nginx 503.

In the local case observed, the 503 lasted ~19 minutes after AgentHarness creation. During that window, the substrate node agent (atelet) leaked memory aggressively (RSS peaked at ~10 GB on a 32 GB node) before eventually succeeding.

🎯 Affected Service(s)

  • UI Service (nginx proxy returns the 503)
  • Upstream: substrate (ate-controller, atelet) — see "Underlying cause" below

🚦 Impact/Severity

Minor inconvenience for individual installs (the harness eventually becomes Ready on its own), but the failure mode is invisible in the UI and can push the node toward OOM on memory-constrained environments.

🔄 Steps To Reproduce

  1. Local kind cluster with kagent installed (substrate runtime enabled).
  2. Apply an AgentHarness whose ActorTemplate references a large container image (hundreds of MB) that is not yet cached on the node, e.g. ghcr.io/kagent-dev/nemoclaw/sandbox-base:
    apiVersion: kagent.dev/v1alpha2
    kind: AgentHarness
    metadata:
      name: peterj-claw
      namespace: kagent
    spec:
      backend: openclaw
      description: OpenClaw on Agent Substrate
      modelConfigRef: default-model-config
      runtime: substrate
      substrate:
        gatewayToken: test-token
        workerPoolRef:
          name: kagent-default
  3. Immediately open http://localhost:8001/api/agentharnesses/kagent/peterj-claw/gateway/ in a browser.
  4. Observe: HTTP 503 with no further UI feedback for many minutes.
  5. In parallel, kubectl get agentharness -n kagent peterj-claw shows READY=False with condition ActorTemplateReady=False ("waiting for ActorTemplate golden snapshot").
  6. kubectl logs -n ate-system deploy/ate-controller shows repeated ResumeGoldenActor failures with code = DeadlineExceeded.
  7. docker exec <kind-node> ps -eo pid,rss,comm --sort=-rss | head shows atelet RSS climbing into multi-GB territory.
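
To track step 7 over time, the `ps` output can be summarized per command. The sketch below is a hypothetical helper, not part of kagent or substrate; it assumes the procps `ps` convention that the `rss` column is in KiB, and the sample rows (PIDs and sizes) are made-up illustration values rather than data from this report.

```python
def rss_gib_by_command(ps_output: str) -> dict[str, float]:
    """Sum RSS per command from `ps -eo pid,rss,comm` output, in GiB."""
    totals: dict[str, float] = {}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore malformed or truncated rows
        _pid, rss_kib, comm = parts
        # procps reports rss in KiB; 1 GiB = 1024 * 1024 KiB
        totals[comm] = totals.get(comm, 0.0) + int(rss_kib) / (1024 * 1024)
    return totals


# Synthetic sample in the shape of `ps -eo pid,rss,comm --sort=-rss | head`.
sample = """\
  PID   RSS COMMAND
  101 10485760 atelet
  202 262144 containerd
"""
usage = rss_gib_by_command(sample)
print(f"atelet RSS: {usage['atelet']:.2f} GiB")
```

Piping the real `docker exec <kind-node> ps ...` output into this function every few seconds would give a simple RSS timeline for atelet during the 503 window.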

🤔 Expected Behavior

  • The kagent UI surfaces a clear status when the AgentHarness is not yet Ready (e.g. "Bringing up sandbox — pulling image, this can take a few minutes on first use") rather than a bare nginx 503.
  • The underlying substrate image-pull path eventually succeeds without entering a death-loop that retains GB-scale buffers in atelet RSS.

📱 Actual Behavior

  • Bare HTTP 503 from the nginx in kagent-ui (/api/agentharnesses/<ns>/<name>/gateway/).
  • ate-controller reconciles every ~30s; each reconcile starts a new ResumeGoldenActor RPC; each RPC's image pull is cancelled by the RPC deadline before it can complete; atelet retains buffers from cancelled pulls.
  • After ~19 minutes of retries the pull finally succeeds and the harness transitions to Ready=True. atelet RSS then drops back to ~1.27 GB.

💻 Environment

  • OS: macOS host running a local kind cluster
  • Kubernetes: kind v1.36
  • Kubernetes provider: kind (local, arm64)
  • Substrate components: ate-controller, atelet (DaemonSet), rustfs as S3 backend
  • Network: ~51 Mbps measured from the kind node to ghcr.io; image total compressed size ~360 MB → ~53s ideal pull time, longer than the ResumeGoldenActor RPC deadline (~30s)
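
As a rough check of the network figures above, the ideal transfer time is the image size in megabits divided by the link rate. The constants below are the rounded values from this report; with those rounded inputs the estimate lands near 56 s, the same ballpark as the ~53 s quoted, and in either case well beyond the ~30 s deadline. The function name is illustrative only.

```python
def ideal_pull_seconds(image_megabytes: float, bandwidth_mbps: float) -> float:
    """Lower bound on transfer time: size in megabits / link rate in Mbps."""
    return image_megabytes * 8 / bandwidth_mbps


IMAGE_MB = 360        # approximate compressed image size from the report
BANDWIDTH_MBPS = 51   # approximate kind-node to ghcr.io throughput
RPC_DEADLINE_S = 30   # approximate ResumeGoldenActor deadline

pull_s = ideal_pull_seconds(IMAGE_MB, BANDWIDTH_MBPS)
# Bandwidth that would be needed for the pull to fit inside one deadline.
min_mbps = IMAGE_MB * 8 / RPC_DEADLINE_S

print(f"ideal pull: {pull_s:.1f}s (deadline {RPC_DEADLINE_S}s)")
print(f"bandwidth needed to fit the deadline: {min_mbps:.0f} Mbps")
```

Real pulls also pay for registry round trips and layer decompression, so this is a lower bound rather than a prediction.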

🔍 Additional Context

Underlying cause (substrate side, likely needs an upstream fix):

  • ate-controller's ResumeGoldenActor RPC has a context deadline of ~30s
  • Image size ÷ bandwidth exceeds that deadline on typical home/laptop networks, so a first-time pull cannot complete within a single RPC
  • Each cancelled RPC starts a fresh pull rather than letting prior progress persist; atelet accumulates buffers in memory from cancelled pulls
  • The pull eventually succeeds (probably once partial layer state has accumulated enough), but the experience until then is "503 forever, with the node sliding toward OOM"
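
The retry behavior described above can be illustrated with a toy model. This is not substrate's actual logic: it simply assumes each attempt transfers data for one full deadline and is then cancelled, and compares discarding progress with keeping it. All names are hypothetical, and the model deliberately ignores whatever allowed the real pull to succeed after ~19 minutes.

```python
from typing import Optional


def attempts_until_done(total_s: float, deadline_s: float,
                        keep_progress: bool,
                        max_attempts: int = 40) -> Optional[int]:
    """Return the attempt on which the pull completes, or None if it never does.

    total_s       -- seconds of transfer needed for the whole image
    deadline_s    -- per-RPC deadline after which the pull is cancelled
    keep_progress -- whether work from a cancelled attempt is reused
    """
    done_s = 0.0
    for attempt in range(1, max_attempts + 1):
        if not keep_progress:
            done_s = 0.0      # the cancelled pull is thrown away
        done_s += deadline_s  # one deadline's worth of transfer
        if done_s >= total_s:
            return attempt
    return None


restart = attempts_until_done(53, 30, keep_progress=False)
resume = attempts_until_done(53, 30, keep_progress=True)
print(f"restart from scratch: {restart}")  # never completes within max_attempts
print(f"resume prior progress: {resume}")
```

Under these assumptions, persisting partial layer state (or lengthening the deadline past the pull time) turns an unbounded retry loop into a pull that completes on the second attempt.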

Kagent-side improvements that would help even if the substrate behavior is fixed:

  • Display the AgentHarness status conditions (ActorTemplateReady, ActorReady, Ready) in the UI; warn before opening the gateway view if Ready=False.
  • Have the UI proxy intercept the 503 and present a status-aware error (e.g. include the latest condition message: "waiting for ActorTemplate golden snapshot").
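
A minimal sketch of the status-aware message suggested above, written in Python for illustration rather than in the UI's own language. It assumes the AgentHarness conditions follow the usual Kubernetes shape (`type`, `status`, `message`) and that ActorTemplateReady precedes ActorReady in bring-up order; the function name and message wording are hypothetical.

```python
from typing import Optional


def gateway_status_message(conditions: list[dict]) -> Optional[str]:
    """Return a user-facing message if the harness is not Ready, else None."""
    by_type = {c.get("type"): c for c in conditions}
    if by_type.get("Ready", {}).get("status") == "True":
        return None  # gateway can be opened normally
    # Report the first prerequisite that is not yet True (assumed order).
    for cond_type in ("ActorTemplateReady", "ActorReady", "Ready"):
        cond = by_type.get(cond_type)
        if cond and cond.get("status") != "True":
            status = cond.get("status", "Unknown")
            detail = cond.get("message") or "no details reported"
            return f"AgentHarness is not ready ({cond_type}={status}): {detail}"
    return "AgentHarness is not ready: status not reported yet"


conditions = [
    {"type": "ActorTemplateReady", "status": "False",
     "message": "waiting for ActorTemplate golden snapshot"},
    {"type": "Ready", "status": "False"},
]
print(gateway_status_message(conditions))
```

The proxy could return this text in place of the bare 503 body, and the gateway view could call the same check before navigating.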

📋 Logs

# kagent-ui nginx
"GET /api/agentharnesses/kagent/peterj-claw/gateway/ HTTP/1.1" 503 35

# ate-controller (every ~30s for ~19 min)
ERROR Reconciler error  controller=actortemplate  ActorTemplate={name:peterj-claw,namespace:kagent}
  error="while resuming golden actor: rpc error: code = DeadlineExceeded desc = context deadline exceeded"

# atelet (sandbox-base layer never finishes populating)
INFO Cache miss  ref=ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee41...
INFO Cache miss  ref=ghcr.io/kagent-dev/nemoclaw/sandbox-base@sha256:d52bee41...
# ...repeated, no matching "Populated image cache" log

🙋 Are you willing to contribute?

Happy to help triage and reproduce further. The underlying fix likely needs to land in the substrate repo (ate-controller + atelet), but the kagent UI improvements to surface harness status are a self-contained kagent-side change.

Labels: UI (issue pertaining to the UI), bug (something isn't working)
