
Distributed Worker Architecture for ML Processing (Processing Service V2) pt. 1#987

Merged
mihow merged 56 commits into RolnickLab:main from uw-ssec:carlosg/jobio
Jan 31, 2026

Conversation


@carlosgjs (Collaborator) commented Oct 8, 2025

Why This Change

Antenna's original ML processing architecture was designed as a demonstration: a single job would connect to a processing service endpoint and wait for images synchronously. This approach has become a bottleneck as the platform has grown:

  • Long-running jobs are fragile — a network interruption can cause a long-running job to fail partway through processing its images
  • Single worker bottleneck — only one processing service can work on a job at a time
  • Requires public endpoints — researchers must expose their ML models via a publicly accessible server, which is often impractical in university HPC environments or local workstations behind firewalls

This PR introduces a pull-based distributed worker architecture that fundamentally changes how ML processing works in Antenna. Users still queue jobs in Antenna, but workers pull tasks from the queue rather than Antenna pushing to them. Workers authenticate with a project token and can register as a service to subscribe to job queues.
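From a worker's perspective, the pull-based loop can be sketched as follows. This is an illustrative sketch only: the endpoint paths, auth header format, and payload shapes below are assumptions, not the actual API (see the real worker in RolnickLab/ami-data-companion#94).

```python
"""Sketch of a pull-based worker loop. Endpoint paths, the auth header,
and payload shapes are illustrative assumptions, not the real API."""
import requests

API_BASE = "https://antenna.example.org/api/v2"  # placeholder base URL


def auth_headers(project_token: str) -> dict:
    # Workers authenticate with a project-scoped token (header format assumed).
    return {"Authorization": f"Token {project_token}"}


def run_inference(image_url: str) -> list:
    # Placeholder for the user's ML model; returns a list of detections.
    raise NotImplementedError


def run_worker(job_id: int, project_token: str, batch_size: int = 4) -> None:
    headers = auth_headers(project_token)
    while True:
        # Pull a batch of tasks; an empty batch means the queue is drained.
        resp = requests.post(
            f"{API_BASE}/jobs/{job_id}/tasks/pull/",
            json={"batch": batch_size}, headers=headers, timeout=30,
        )
        resp.raise_for_status()
        tasks = resp.json().get("tasks", [])
        if not tasks:
            break
        for task in tasks:
            detections = run_inference(task["image_url"])
            # Post the result, then ACK so the task is not re-queued
            # when its visibility timeout expires.
            requests.post(
                f"{API_BASE}/jobs/{job_id}/tasks/{task['id']}/result/",
                json={"detections": detections}, headers=headers, timeout=30,
            ).raise_for_status()
            requests.post(
                f"{API_BASE}/jobs/{job_id}/tasks/{task['id']}/ack/",
                headers=headers, timeout=30,
            ).raise_for_status()
```

Because the worker initiates every request, this loop runs unchanged behind a firewall, on an HPC login node, or on a laptop.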

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                          ANTENNA (Server)                           │
│                                                                     │
│  User queues job → Job creates tasks → Tasks pushed to NATS queue  │
│                                        ↓                            │
│                          Tasks API (pull/ack/result)                │
└─────────────────────────────────────────────────────────────────────┘
                                    ↑
            ┌───────────────────────┼───────────────────────┐
            │                       │                       │
      ┌─────┴─────┐           ┌─────┴─────┐           ┌─────┴─────┐
      │  Worker   │           │  Worker   │           │  Worker   │
      │  (HPC)    │           │  (Local)  │           │  (Cloud)  │
      └───────────┘           └───────────┘           └───────────┘
        GPU node              Workstation              Any compute

Key Advantages

  1. Resilience — Tasks are individually tracked. Network failures affect only the current batch, not the entire job. Failed tasks can be re-queued automatically.

  2. Horizontal scaling — Run as many workers as you have compute resources. A job that takes hours with one worker can complete in minutes with ten.

  3. No public endpoint required — Workers pull tasks from Antenna's API. They can run anywhere: behind university firewalls, on HPC clusters, on local workstations with GPUs, or in cloud environments.

  4. Faster overall processing — Parallelism + reduced network sensitivity = significantly faster job completion.

Researchers can still bring their own ML models by following the new (but similar) API contract. We provide several pipelines, but custom models work the same way.


Summary

Initial version of the Processing Service V2, following up on the API structure introduced in #1046.

See the worker implementation in RolnickLab/ami-data-companion#94.

Closes #971
Closes #968
Closes #969

Current State

The async processing path is working but disabled by default in this PR to allow for extended testing. When enabled, starting a job creates a queue for that job and populates it with one task per image. The tasks can be pulled and ACKed via the APIs introduced in PR #1046. The new path can be enabled for a project via the async_pipeline_workers feature flag.
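The queuing step described above (one queue per job, one task per image) can be sketched roughly with nats-py. The stream/subject naming and payload shape here are assumptions for illustration, not the actual implementation:

```python
"""Sketch of per-job queuing on NATS JetStream via nats-py. Stream and
subject names and the payload shape are illustrative assumptions."""
import asyncio
import json


def task_subject(job_id: int) -> str:
    # One subject (and stream) per job, so each job gets its own queue.
    return f"tasks.job{job_id}"


async def queue_images_to_nats(job_id: int, image_ids: list) -> None:
    import nats  # requires the nats-py package

    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    # Create a stream dedicated to this job's tasks.
    await js.add_stream(name=f"job-{job_id}", subjects=[task_subject(job_id)])
    # Populate the queue with one task per image.
    for image_id in image_ids:
        payload = json.dumps({"image_id": image_id}).encode()
        await js.publish(task_subject(job_id), payload)
    await nc.close()


# asyncio.run(queue_images_to_nats(42, [1, 2, 3]))  # needs a running NATS server
```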

PR #1046 introduced a scaffold of the API endpoints & schemas, which will be published in the documentation when finalized.

List of Changes

  • Added NATS JetStream to the Docker Compose stack. I also tried RabbitMQ and Beanstalkd, but neither supports the visibility timeout semantics we want or a disconnected mode of pulling and ACKing tasks.
  • Added TaskStateManager and TaskQueueManager
  • Added the queuing and async results processing logic
  • Implemented task pull/ack/result endpoints in job views (previously stubs)
  • Added unit tests for TaskQueueManager and TaskStateManager
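The visibility timeout semantics that motivated choosing JetStream map onto its `ack_wait` consumer setting: a pulled message that is not ACKed within the window is automatically redelivered. A rough consumer-side sketch with nats-py (names and values here are assumptions):

```python
"""Consumer-side sketch with nats-py: pull a batch, ACK on success.
Messages not ACKed within `ack_wait` seconds are redelivered, which is
the visibility timeout behavior described above. Names are assumptions."""
import asyncio

ACK_WAIT_SECONDS = 300  # visibility timeout: redeliver if not ACKed in time


async def consume(job_id: int) -> None:
    import nats  # requires the nats-py package
    from nats.js.api import ConsumerConfig

    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    sub = await js.pull_subscribe(
        f"tasks.job{job_id}",
        durable="workers",  # shared durable consumer: workers split the queue
        config=ConsumerConfig(ack_wait=ACK_WAIT_SECONDS),
    )
    while True:
        try:
            msgs = await sub.fetch(batch=4, timeout=5)
        except asyncio.TimeoutError:
            break  # no messages within the timeout; queue likely drained
        for msg in msgs:
            # ... process msg.data here ...
            await msg.ack()  # ACK marks the task done; no redelivery
    await nc.close()
```

A worker that crashes mid-batch simply never ACKs, so its tasks reappear for other workers after the window expires; this is what makes individual task tracking and automatic re-queuing possible.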

Follow-up Work

Related Issues

See issues #970 and #971.

How to Test the Changes

This path can be enabled by turning on the job.project.feature_flags.async_pipeline_workers feature flag, see ami/jobs/models.py:400:

        if job.project.feature_flags.async_pipeline_workers:
            cls.queue_images_to_nats(job, images)
        else:
            cls.process_images(job, images)

Then run the ami worker from RolnickLab/ami-data-companion#94.
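The flag can also be flipped programmatically from a Django shell. This is a hedged sketch: the `Project` import path and the structure of the `feature_flags` field are assumptions about the codebase, not verified code.

```python
# Hypothetical Django shell snippet for enabling the async path on a project.
# Import path and feature_flags field shape are assumptions.
from ami.main.models import Project  # import path assumed

project = Project.objects.get(pk=1)  # the project that owns the job
project.feature_flags.async_pipeline_workers = True  # enable the async path
project.save()
```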

Test


Test both modes by tweaking the flag in the Django admin console.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.
