Scale celeryworkers to keep up with async ML processing #1295
Conversation
⚠️ Outside diff range comments (1)
docker-compose.worker.yml (1)
34-37: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the stale memory-pressure comment to reflect the new total worker count.

The comment still says 3 × concurrency prefork processes per host, which was accurate when all three services were at `scale: 1`. With `celeryworker_ml` now at `scale: 4`, the actual total is 6 × `CELERY_WORKER_CONCURRENCY` per host (1 + 1 + 4).

📝 Proposed fix

```diff
- # CELERY_WORKER_CONCURRENCY is inherited from .envs/.production/.django and
- # applies per service. On small worker VMs that adds up (3 × concurrency
- # prefork processes per host) — tune in the env file if memory pressure
- # becomes a problem.
+ # CELERY_WORKER_CONCURRENCY is inherited from .envs/.production/.django and
+ # applies per service. On small worker VMs that adds up (6 × concurrency
+ # prefork processes per host: 1 antenna + 1 jobs + 4 ml_results) — tune in
+ # the env file if memory pressure becomes a problem.
```
Force-pushed from 47a5ee0 to 8fbc8ac.
Pull request overview
This PR increases the default replica count for the celeryworker_ml service in the worker compose configuration to better handle higher throughput on the ml_results queue in async ML deployments.
Changes:
- Increase `celeryworker_ml` service scale from 1 to 4 in `docker-compose.worker.yml`.
```diff
 celeryworker_ml:
   <<: *django
-  scale: 1
+  scale: 4
   ports: []
```
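For context, here is a minimal sketch of what the scaled service can look like in `docker-compose.worker.yml`. Only the fields shown in the diff above come from the actual file; the worker command and the comments are assumptions added for illustration:

```yaml
# Sketch only: the real service inherits image, env files, and volumes from
# the shared `django` anchor, and the worker command below is an assumption.
celeryworker_ml:
  <<: *django
  command: celery -A config.celery_app worker -Q ml_results -l INFO
  scale: 4     # four replicas draining the ml_results queue
  ports: []    # replicas can't share a published host port
  # Each replica still runs CELERY_WORKER_CONCURRENCY prefork processes
  # (inherited from .envs/.production/.django), so memory use grows as
  # replicas × concurrency.
```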
When async ML jobs run with high concurrency, we observed the ml_results queue growing long enough that result-processing tasks waited past their retry timeouts and exhausted max_retries — losing results from successfully-processed images to queue-induced timeouts. Splitting ml_results into its own celery service (the celeryworker_ml setup in this file) was the first mitigation; bumping its default scale to 4 is the second. Together they appear to have helped. 4 is a starting point rather than an empirically tuned value. Each worker process inherits CELERY_WORKER_CONCURRENCY from the env file, so memory pressure on smaller VMs can be managed there.
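To make that failure mode concrete, here is a generic Celery sketch of how a backed-up queue interacts with per-retry countdowns and `max_retries`. The task name, queue routing, broker URL, and retry settings are illustrative assumptions, not Antenna's actual code:

```python
# Illustrative only: names, broker URL, and retry settings are assumptions.
from celery import Celery

app = Celery("worker_sketch", broker="redis://localhost:6379/0")

# Route result handling onto its own queue so dedicated workers
# (the celeryworker_ml replicas) can drain it independently of other work.
app.conf.task_routes = {
    "save_detection_results": {"queue": "ml_results"},
}


@app.task(bind=True, name="save_detection_results",
          max_retries=3, default_retry_delay=30)
def save_detection_results(self, image_id, predictions):
    try:
        # ... persist predictions for image_id to the database ...
        pass
    except Exception as exc:
        # Each retry is re-enqueued onto ml_results. When that queue is
        # backed up, the retried task waits far longer than the 30 s
        # countdown before any worker picks it up; once max_retries is
        # exhausted, the task is recorded as a failure even though the
        # image itself was processed successfully upstream.
        raise self.retry(exc=exc, countdown=30)
```

More replicas pulling from `ml_results` shortens that queue wait, which is what the `scale: 4` change targets.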
Force-pushed from 8fbc8ac to 0f094b3.
What this proposes
Bumps the default `scale` of `celeryworker_ml` in `docker-compose.worker.yml` from 1 to 4.

Why
When async ML jobs run with high concurrency, we observed the `ml_results` queue growing long enough that result-processing tasks waited past their retry timeouts and exhausted `max_retries`. User-visible effect: results from successfully-processed images getting recorded as failures because the result-handling task couldn't be picked up in time.

Splitting `ml_results` into its own dedicated celery service (the existing `celeryworker_ml` setup in this file) was the first mitigation. Bumping its default scale to 4 is the second. Together they appear to have helped, though we don't have a clean isolated before/after measurement for the scale change on its own.

Tradeoffs / things to discuss
- More replicas means more memory per worker host; smaller VMs may need a lower `CELERY_WORKER_CONCURRENCY` in `.env` to compensate (see the snippet after this list).
- The other worker services (`celeryworker` for the antenna queue and `celeryworker_jobs`) stay at `scale: 1` — they haven't shown the same retry-timeout pattern.
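For reference, that knob is a single line in the env file. The path follows the compose comment quoted in the review above, and the value here is a placeholder rather than a recommendation:

```bash
# .envs/.production/.django (or the host's .env, depending on the deployment)
# Placeholder value; size it to host memory and the replica count.
CELERY_WORKER_CONCURRENCY=2
```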
Test plan

- Confirm all four `celeryworker_ml` containers come up (example commands after this list).
- Watch memory on the worker host and lower `CELERY_WORKER_CONCURRENCY` if pressure rises.
- Track `ml_results` task retry counts under load; they should drop relative to the single-worker baseline.
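A rough way to check those items on a worker host. The commands assume the compose file name from this PR and a Redis broker, so adjust them to the actual deployment:

```bash
# Bring the worker stack up and confirm the replica count
docker compose -f docker-compose.worker.yml up -d
docker compose -f docker-compose.worker.yml ps celeryworker_ml

# Watch per-container memory while ML jobs run
docker stats --no-stream

# If the broker is Redis, a queue's backlog is the length of its list
# (assumes Celery's default queue naming)
redis-cli llen ml_results
```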