Fix split-topology result retrieval#2278
Conversation
Greptile SummaryThis PR fixes silent result loss in split-topology multi-replica deployments by replacing the load-balanced proxy fetch with a direct pod-IP handoff: workers advertise
|
| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/service/routers/ingest.py | Reworks result retrieval from proxy-based fetch to gateway-local filesystem read, adds result handoff in callback handler. The new 503 guard for pre_rec is None with result_worker_ip breaks the pre-PR graceful gateway-restart fallback. |
| nemo_retriever/src/nemo_retriever/service/services/worker_result_store.py | New module implementing immutable-generation filesystem store for retained result rows; well-structured with atomic rename protocol, TTL sweep, and in-memory fallback. |
| nemo_retriever/src/nemo_retriever/service/services/pipeline_pool.py | Adds bounded deferred-retry loop for gateway callbacks with semaphore-controlled concurrency and exponential backoff. Worker loop can stall when all N slots are occupied; otherwise correct and well-tested. |
| nemo_retriever/src/nemo_retriever/service/services/job_tracker.py | Replaces consume-on-read consume_result_data with idempotent get_result_data; adds opportunistic eviction on every read path. Logic is consistent and deepcopy of result_data prevents aliasing. |
| nemo_retriever/helm/templates/deployment.yaml | Correctly wires NEMO_RETRIEVER_RESULTS_DIR and TTL env vars to standalone and gateway pods only; PVC volume mounted only on gateway in split mode. |
Comments Outside Diff (1)
-
nemo_retriever/src/nemo_retriever/service/routers/ingest.py, line 389-394 (link)Gateway-restart resilience regression for retained-result callbacks
The new guard raises HTTP 503 when
pre_rec is Noneandresult_worker_ipis present, bypassing themark_completedcall. Pre-PR,mark_completedfor an unknown document returnedMarkOutcome.UNKNOWN_DOCUMENTand the endpoint returned HTTP 200, so the worker acknowledged the callback, calleddiscard_local_result_data, and moved on. Post-PR, those callbacks retry indefinitely (every 1 s, then exponentially up to 300 s) untilresult_retention_seconds()(8 h by default) expires. During that window each pre-restart document occupies one of the_num_workerssemaphore slots in_schedule_gateway_callback_retry. If N pre-restart documents each consume one slot, new completions whose callbacks also fail (e.g. during a rolling restart) block onawait slots.acquire(), stalling queue draining until a slot is freed.Prompt To Fix With AI
This is a comment left during a code review. Path: nemo_retriever/src/nemo_retriever/service/routers/ingest.py Line: 389-394 Comment: **Gateway-restart resilience regression for retained-result callbacks** The new guard raises HTTP 503 when `pre_rec is None` and `result_worker_ip` is present, bypassing the `mark_completed` call. Pre-PR, `mark_completed` for an unknown document returned `MarkOutcome.UNKNOWN_DOCUMENT` and the endpoint returned HTTP 200, so the worker acknowledged the callback, called `discard_local_result_data`, and moved on. Post-PR, those callbacks retry indefinitely (every 1 s, then exponentially up to 300 s) until `result_retention_seconds()` (8 h by default) expires. During that window each pre-restart document occupies one of the `_num_workers` semaphore slots in `_schedule_gateway_callback_retry`. If N pre-restart documents each consume one slot, new completions whose callbacks also fail (e.g. during a rolling restart) block on `await slots.acquire()`, stalling queue draining until a slot is freed. How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
nemo_retriever/src/nemo_retriever/service/routers/ingest.py:389-394
**Gateway-restart resilience regression for retained-result callbacks**
The new guard raises HTTP 503 when `pre_rec is None` and `result_worker_ip` is present, bypassing the `mark_completed` call. Pre-PR, `mark_completed` for an unknown document returned `MarkOutcome.UNKNOWN_DOCUMENT` and the endpoint returned HTTP 200, so the worker acknowledged the callback, called `discard_local_result_data`, and moved on. Post-PR, those callbacks retry indefinitely (every 1 s, then exponentially up to 300 s) until `result_retention_seconds()` (8 h by default) expires. During that window each pre-restart document occupies one of the `_num_workers` semaphore slots in `_schedule_gateway_callback_retry`. If N pre-restart documents each consume one slot, new completions whose callbacks also fail (e.g. during a rolling restart) block on `await slots.acquire()`, stalling queue draining until a slot is freed.
Reviews (10): Last reviewed commit: "Bound deferred callback retries" | Re-trigger Greptile
Description
Fix split-topology result retrieval when multiple worker replicas are present.
NVBug 6348464 exposed silent result loss because workers retained completed rows only in pod-local memory, while the gateway fetched them through a load-balanced ClusterIP Service with no document-to-pod affinity. Requests routed to a non-owning pod returned 404, so clients received incomplete result data even though every document completed successfully.
This PR:
retriever-resultsvolume.Result retrieval is now deterministic across split-topology replicas. Multi-node deployments must use an RWX-capable storage class; RWO remains suitable when all consumers run on one node.
Validation:
git diff --checkpassed.Checklist