Scale celeryworkers to keep up with async ML processing #1295
Conversation
⚠️ Outside diff range comments (1)
docker-compose.worker.yml (1)
34-37: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the stale memory-pressure comment to reflect the new total worker count.

The comment still says 3 × concurrency prefork processes per host, which was accurate when all three services were at `scale: 1`. With `celeryworker_ml` now at `scale: 4`, the actual total is 6 × `CELERY_WORKER_CONCURRENCY` per host (1 + 1 + 4).

📝 Proposed fix

```diff
- # CELERY_WORKER_CONCURRENCY is inherited from .envs/.production/.django and
- # applies per service. On small worker VMs that adds up (3 × concurrency
- # prefork processes per host) — tune in the env file if memory pressure
- # becomes a problem.
+ # CELERY_WORKER_CONCURRENCY is inherited from .envs/.production/.django and
+ # applies per service. On small worker VMs that adds up (6 × concurrency
+ # prefork processes per host: 1 antenna + 1 jobs + 4 ml_results) — tune in
+ # the env file if memory pressure becomes a problem.
```
Force-pushed from 47a5ee0 to 8fbc8ac.
Pull request overview
This PR increases the default replica count for the celeryworker_ml service in the worker compose configuration to better handle higher throughput on the ml_results queue in async ML deployments.
Changes:
- Increase `celeryworker_ml` service scale from 1 to 4 in `docker-compose.worker.yml`.
```diff
 celeryworker_ml:
   <<: *django
-  scale: 1
+  scale: 4
   ports: []
```
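For context, here is a minimal sketch of what the scaled service can look like in `docker-compose.worker.yml`. Only the fields shown in the diff above come from the actual file; the worker command and the comments are assumptions added for illustration:

```yaml
# Sketch only: the real service inherits image, env files, and volumes from
# the shared `django` anchor, and the worker command below is an assumption.
celeryworker_ml:
  <<: *django
  command: celery -A config.celery_app worker -Q ml_results -l INFO
  scale: 4     # four replicas draining the ml_results queue
  ports: []    # replicas can't share a published host port
  # Each replica still runs CELERY_WORKER_CONCURRENCY prefork processes
  # (inherited from .envs/.production/.django), so memory use grows as
  # replicas × concurrency.
```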
When async ML jobs run with high concurrency, we observed the ml_results queue growing long enough that result-processing tasks waited past their retry timeouts and exhausted max_retries — losing results from successfully-processed images to queue-induced timeouts. Splitting ml_results into its own celery service (the celeryworker_ml setup in this file) was the first mitigation; bumping its default scale to 4 is the second. Together they appear to have helped. 4 is a starting point rather than an empirically tuned value. Each worker process inherits CELERY_WORKER_CONCURRENCY from the env file, so memory pressure on smaller VMs can be managed there.
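To make that failure mode concrete, here is a generic Celery sketch of how a backed-up queue interacts with per-retry countdowns and `max_retries`. The task name, queue routing, broker URL, and retry settings are illustrative assumptions, not Antenna's actual code:

```python
# Illustrative only: names, broker URL, and retry settings are assumptions.
from celery import Celery

app = Celery("worker_sketch", broker="redis://localhost:6379/0")

# Route result handling onto its own queue so dedicated workers
# (the celeryworker_ml replicas) can drain it independently of other work.
app.conf.task_routes = {
    "save_detection_results": {"queue": "ml_results"},
}


@app.task(bind=True, name="save_detection_results",
          max_retries=3, default_retry_delay=30)
def save_detection_results(self, image_id, predictions):
    try:
        # ... persist predictions for image_id to the database ...
        pass
    except Exception as exc:
        # Each retry is re-enqueued onto ml_results. When that queue is
        # backed up, the retried task waits far longer than the 30 s
        # countdown before any worker picks it up; once max_retries is
        # exhausted, the task is recorded as a failure even though the
        # image itself was processed successfully upstream.
        raise self.retry(exc=exc, countdown=30)
```

More replicas pulling from `ml_results` shortens that queue wait, which is what the `scale: 4` change targets.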
Force-pushed from 8fbc8ac to 0f094b3.
What this proposes
Bumps the default `scale` of `celeryworker_ml` in `docker-compose.worker.yml` from 1 to 4.

Why
When async ML jobs run with high concurrency, we observed the `ml_results` queue growing long enough that result-processing tasks waited past their retry timeouts and exhausted `max_retries`. User-visible effect: results from successfully-processed images getting recorded as failures because the result-handling task couldn't be picked up in time.

Splitting `ml_results` into its own dedicated celery service (the existing `celeryworker_ml` setup in this file) was the first mitigation. Bumping its default scale to 4 is the second. Together they appear to have helped, though we don't have a clean isolated before/after measurement for the scale change on its own.

Tradeoffs / things to discuss
- More replicas means more memory per worker host; smaller VMs may need a lower `CELERY_WORKER_CONCURRENCY` in `.env` to compensate (see the snippet after this list).
- The other worker services (`celeryworker` for the antenna queue and `celeryworker_jobs`) stay at `scale: 1` — they haven't shown the same retry-timeout pattern.
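For reference, that knob is a single line in the env file. The path follows the compose comment quoted in the review above, and the value here is a placeholder rather than a recommendation:

```bash
# .envs/.production/.django (or the host's .env, depending on the deployment)
# Placeholder value; size it to host memory and the replica count.
CELERY_WORKER_CONCURRENCY=2
```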
Test plan

- Confirm all four `celeryworker_ml` containers come up (example commands after this list).
- Watch memory on the worker host and lower `CELERY_WORKER_CONCURRENCY` if pressure rises.
- Track `ml_results` task retry counts under load; they should drop relative to the single-worker baseline.
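A rough way to check those items on a worker host. The commands assume the compose file name from this PR and a Redis broker, so adjust them to the actual deployment:

```bash
# Bring the worker stack up and confirm the replica count
docker compose -f docker-compose.worker.yml up -d
docker compose -f docker-compose.worker.yml ps celeryworker_ml

# Watch per-container memory while ML jobs run
docker stats --no-stream

# If the broker is Redis, a queue's backlog is the length of its list
# (assumes Celery's default queue naming)
redis-cli llen ml_results
```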