Distributed Worker Architecture for ML Processing (Processing Service V2) pt. 1#987
mihow merged 56 commits into RolnickLab:main on Jan 31, 2026
Why This Change
Antenna's original ML processing architecture was designed as a demonstration: a single job would connect to a processing service endpoint and wait for images synchronously. This approach has become a bottleneck as the platform has grown.
This PR introduces a pull-based distributed worker architecture that fundamentally changes how ML processing works in Antenna. Users still queue jobs in Antenna, but workers pull tasks from the queue rather than Antenna pushing to them. Workers authenticate with a project token and can register as a service to subscribe to job queues.
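To make the pull-based flow concrete, here is a minimal sketch of a worker loop. This is illustrative only: the real worker lives in RolnickLab/ami-data-companion#94, and the `pull`/`ack` callables stand in for the actual Antenna API calls (authenticated with a project token).

```python
from typing import Callable, Optional


def run_worker(pull: Callable[[], Optional[dict]],
               process: Callable[[dict], dict],
               ack: Callable[[str, dict], None]) -> int:
    """Minimal pull-based worker loop (illustrative).

    `pull` returns the next task dict, or None when the queue is drained;
    `process` runs the ML pipeline on one task; `ack` reports the result
    back so the task is marked complete (un-ACKed tasks can be re-queued
    automatically, which is where the per-task resilience comes from).
    Returns the number of tasks handled.
    """
    handled = 0
    while (task := pull()) is not None:
        result = process(task)
        ack(task["id"], result)
        handled += 1
    return handled
```

Because each task is pulled, processed, and ACKed independently, a crash or network failure loses at most the in-flight task, and any number of such loops can drain the same queue in parallel.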
Architecture Overview
Key Advantages
Resilience — Tasks are individually tracked. Network failures affect only the current batch, not the entire job. Failed tasks can be re-queued automatically.
Horizontal scaling — Run as many workers as you have compute resources. A job that takes hours with one worker can complete in minutes with ten.
No public endpoint required — Workers pull tasks from Antenna's API. They can run anywhere: behind university firewalls, on HPC clusters, on local workstations with GPUs, or in cloud environments.
Faster overall processing — Parallelism + reduced network sensitivity = significantly faster job completion.
Researchers can still bring their own ML models by following the new (but similar) API contract. We provide several pipelines, but custom models work the same way.
Summary
Initial version of the Processing Service V2. Follow-up to the initial API structure in #1046.
See the worker implementation in RolnickLab/ami-data-companion#94.
Closes #971
Closes #968
Closes #969
Current State
The async processing path is working but disabled by default in this PR to allow for extended testing. When enabled, starting a job creates a queue for that job and populates it with one task per image. The tasks can be pulled and ACKed via the APIs introduced in PR #1046. The new path can be enabled for a project via the `async_pipeline_workers` feature flag.

PR #1046 introduced a scaffold of the API endpoints & schemas, which will be published in the documentation when finalized.
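The queue-population step described above can be sketched as follows. The field names are illustrative, not the real task schema, which is defined by the APIs introduced in PR #1046.

```python
def build_job_tasks(job_id: str, image_refs: list[str]) -> list[dict]:
    """Create one pending task per source image when a job starts.

    Each task is tracked individually, so a failed task can be
    re-queued without restarting the whole job.
    """
    return [
        {"job_id": job_id, "image": ref, "status": "PENDING"}
        for ref in image_refs
    ]
```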
List of Changes
`TaskStateManager` and `TaskQueueManager`

Follow-up Work
Related Issues
See issues #970 and #971.
How to Test the Changes
This path can be enabled by turning on the `job.project.feature_flags.async_pipeline_workers` feature flag (see `ami/jobs/models.py:400`) and running the `ami worker` command from RolnickLab/ami-data-companion#94.
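A hedged sketch of the gate this flag controls (the actual check lives around `ami/jobs/models.py:400`; the dict-of-flags shape here is an assumption):

```python
def async_workers_enabled(feature_flags: dict) -> bool:
    """Return True only when the project has opted into the
    pull-based async processing path via its feature flags."""
    return bool(feature_flags.get("async_pipeline_workers", False))
```

With the flag off, jobs fall back to the original synchronous push-based path, which is what makes side-by-side testing of both modes possible.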
Test both modes by tweaking the flag in the Django admin console.

Checklist