
Distributed Worker Architecture for ML Processing (Processing Service V2) pt. 1#987

Merged
mihow merged 56 commits into RolnickLab:main from uw-ssec:carlosg/jobio
Jan 31, 2026

Conversation


@carlosgjs (Collaborator) commented Oct 8, 2025

Why This Change

Antenna's original ML processing architecture was designed as a demonstration: a single job would connect to a processing service endpoint and wait for images synchronously. This approach has become a bottleneck as the platform has grown:

  • Long-running jobs are fragile — a network interruption can cause a long-running job to fail partway through processing its images
  • Single worker bottleneck — only one processing service can work on a job at a time
  • Requires public endpoints — researchers must expose their ML models via a publicly accessible server, which is often impractical in university HPC environments or local workstations behind firewalls

This PR introduces a pull-based distributed worker architecture that fundamentally changes how ML processing works in Antenna. Users still queue jobs in Antenna, but workers pull tasks from the queue rather than Antenna pushing to them. Workers authenticate with a project token and can register as a service to subscribe to job queues.
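From a worker's perspective, the pull-based loop can be sketched as follows. This is an illustrative sketch only: the endpoint paths, auth header format, and payload shapes below are assumptions, not the actual API (see the real worker in RolnickLab/ami-data-companion#94).

```python
"""Sketch of a pull-based worker loop. Endpoint paths, the auth header,
and payload shapes are illustrative assumptions, not the real API."""
import requests

API_BASE = "https://antenna.example.org/api/v2"  # placeholder base URL


def auth_headers(project_token: str) -> dict:
    # Workers authenticate with a project-scoped token (header format assumed).
    return {"Authorization": f"Token {project_token}"}


def run_inference(image_url: str) -> list:
    # Placeholder for the user's ML model; returns a list of detections.
    raise NotImplementedError


def run_worker(job_id: int, project_token: str, batch_size: int = 4) -> None:
    headers = auth_headers(project_token)
    while True:
        # Pull a batch of tasks; an empty batch means the queue is drained.
        resp = requests.post(
            f"{API_BASE}/jobs/{job_id}/tasks/pull/",
            json={"batch": batch_size}, headers=headers, timeout=30,
        )
        resp.raise_for_status()
        tasks = resp.json().get("tasks", [])
        if not tasks:
            break
        for task in tasks:
            detections = run_inference(task["image_url"])
            # Post the result, then ACK so the task is not re-queued
            # when its visibility timeout expires.
            requests.post(
                f"{API_BASE}/jobs/{job_id}/tasks/{task['id']}/result/",
                json={"detections": detections}, headers=headers, timeout=30,
            ).raise_for_status()
            requests.post(
                f"{API_BASE}/jobs/{job_id}/tasks/{task['id']}/ack/",
                headers=headers, timeout=30,
            ).raise_for_status()
```

Because the worker initiates every request, this loop runs unchanged behind a firewall, on an HPC login node, or on a laptop.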

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                          ANTENNA (Server)                           │
│                                                                     │
│  User queues job → Job creates tasks → Tasks pushed to NATS queue  │
│                                        ↓                            │
│                          Tasks API (pull/ack/result)                │
└─────────────────────────────────────────────────────────────────────┘
                                    ↑
            ┌───────────────────────┼───────────────────────┐
            │                       │                       │
      ┌─────┴─────┐           ┌─────┴─────┐           ┌─────┴─────┐
      │  Worker   │           │  Worker   │           │  Worker   │
      │  (HPC)    │           │  (Local)  │           │  (Cloud)  │
      └───────────┘           └───────────┘           └───────────┘
        GPU node              Workstation              Any compute

Key Advantages

  1. Resilience — Tasks are individually tracked. Network failures affect only the current batch, not the entire job. Failed tasks can be re-queued automatically.

  2. Horizontal scaling — Run as many workers as you have compute resources. A job that takes hours with one worker can complete in minutes with ten.

  3. No public endpoint required — Workers pull tasks from Antenna's API. They can run anywhere: behind university firewalls, on HPC clusters, on local workstations with GPUs, or in cloud environments.

  4. Faster overall processing — Parallelism + reduced network sensitivity = significantly faster job completion.

Researchers can still bring their own ML models by following the new (but similar) API contract. We provide several pipelines, but custom models work the same way.


Summary

Initial version of the Processing Service V2, following up on the API structure introduced in #1046.

See the worker implementation in RolnickLab/ami-data-companion#94.

Closes #971
Closes #968
Closes #969

Current State

The async processing path is working but disabled by default in this PR to allow for extended testing. When enabled, starting a job creates a queue for that job and populates it with one task per image. The tasks can be pulled and ACKed via the APIs introduced in PR #1046. The new path can be enabled for a project via the async_pipeline_workers feature flag.
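The queuing step described above (one queue per job, one task per image) can be sketched roughly with nats-py. The stream/subject naming and payload shape here are assumptions for illustration, not the actual implementation:

```python
"""Sketch of per-job queuing on NATS JetStream via nats-py. Stream and
subject names and the payload shape are illustrative assumptions."""
import asyncio
import json


def task_subject(job_id: int) -> str:
    # One subject (and stream) per job, so each job gets its own queue.
    return f"tasks.job{job_id}"


async def queue_images_to_nats(job_id: int, image_ids: list) -> None:
    import nats  # requires the nats-py package

    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    # Create a stream dedicated to this job's tasks.
    await js.add_stream(name=f"job-{job_id}", subjects=[task_subject(job_id)])
    # Populate the queue with one task per image.
    for image_id in image_ids:
        payload = json.dumps({"image_id": image_id}).encode()
        await js.publish(task_subject(job_id), payload)
    await nc.close()


# asyncio.run(queue_images_to_nats(42, [1, 2, 3]))  # needs a running NATS server
```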

PR #1046 introduced a scaffold of the API endpoints & schemas, which will be published in the documentation when finalized.

List of Changes

  • Added NATS JetStream to the Docker Compose stack. I also tried RabbitMQ and Beanstalkd, but neither supports the visibility timeout semantics we want or a disconnected mode of pulling and ACKing tasks.
  • Added TaskStateManager and TaskQueueManager
  • Added the queuing and async results processing logic
  • Implemented task pull/ack/result endpoints in job views (previously stubs)
  • Added unit tests for TaskQueueManager and TaskStateManager
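The visibility timeout semantics that motivated choosing JetStream map onto its `ack_wait` consumer setting: a pulled message that is not ACKed within the window is automatically redelivered. A rough consumer-side sketch with nats-py (names and values here are assumptions):

```python
"""Consumer-side sketch with nats-py: pull a batch, ACK on success.
Messages not ACKed within `ack_wait` seconds are redelivered, which is
the visibility timeout behavior described above. Names are assumptions."""
import asyncio

ACK_WAIT_SECONDS = 300  # visibility timeout: redeliver if not ACKed in time


async def consume(job_id: int) -> None:
    import nats  # requires the nats-py package
    from nats.js.api import ConsumerConfig

    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()
    sub = await js.pull_subscribe(
        f"tasks.job{job_id}",
        durable="workers",  # shared durable consumer: workers split the queue
        config=ConsumerConfig(ack_wait=ACK_WAIT_SECONDS),
    )
    while True:
        try:
            msgs = await sub.fetch(batch=4, timeout=5)
        except asyncio.TimeoutError:
            break  # no messages within the timeout; queue likely drained
        for msg in msgs:
            # ... process msg.data here ...
            await msg.ack()  # ACK marks the task done; no redelivery
    await nc.close()
```

A worker that crashes mid-batch simply never ACKs, so its tasks reappear for other workers after the window expires; this is what makes individual task tracking and automatic re-queuing possible.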

Follow-up Work

Related Issues

See issues #970 and #971.

How to Test the Changes

This path can be enabled by turning on the job.project.feature_flags.async_pipeline_workers feature flag, see ami/jobs/models.py:400:

        if job.project.feature_flags.async_pipeline_workers:
            cls.queue_images_to_nats(job, images)
        else:
            cls.process_images(job, images)

Then run the ami worker from RolnickLab/ami-data-companion#94.
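The flag can also be flipped programmatically from a Django shell. This is a hedged sketch: the `Project` import path and the structure of the `feature_flags` field are assumptions about the codebase, not verified code.

```python
# Hypothetical Django shell snippet for enabling the async path on a project.
# Import path and feature_flags field shape are assumptions.
from ami.main.models import Project  # import path assumed

project = Project.objects.get(pk=1)  # the project that owns the job
project.feature_flags.async_pipeline_workers = True  # enable the async path
project.save()
```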

Test


Test both modes by tweaking the flag in the Django admin console.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.
