Add default retry policy to queued jobs #4591
Draft
JoshSalway wants to merge 4 commits into flarum:2.x from
Set `$tries = 3` and exponential `$backoff = [30, 60, 120]` on `AbstractJob` (the base class extended by most queued jobs) and on `SendReplyNotification` (the only queued job that extends `Queueable` directly). With `queue:work --tries=0` (the default on some hosting platforms, including Vapor prior to vapor-core 2.4.1), jobs inheriting no retry config retry indefinitely on transient failures. Bounded attempts with exponential backoff give DB/SMTP/filesystem blips time to recover, then fail cleanly. Subclasses that want a different policy can override either property. `retryUntil()` is not used by any current job, so the count-based policy is effective (Laravel's Worker silently ignores `$tries` when `retryUntil()` is set; see Worker.php:612 on 13.x, framework PR #35214).
- `SendNotificationsJob`: `Notification::notify()` uses a raw bulk insert (framework/core/src/Notification/Notification.php), so retries would create duplicate notification rows.
- `ExportJob` (gdpr): writes an export file and dispatches `Exporting`/`Exported` events; retries would duplicate both.
- `ComposerCommandJob` (package-manager): `composer install` is not idempotent, and an OOM bypasses the `handle()` try/catch that normally swallows exceptions, so a retry could run against a partially-written `vendor/`.

Each job sets `$tries = 1` to override the AbstractJob default.
`ShouldBeUnique` defaults `$uniqueFor` to 0, which means the lock never expires. `WithoutOverlapping` without `expireAfter` has the same pattern. If the worker crashes mid-composer-install, either lock can otherwise block future dispatches until the cache TTL runs down. Both are now bounded at 600 seconds (longer than `$timeout`'s 180s so they can't expire mid-run, short enough to auto-clear after a crash).
`WithoutOverlapping` defaults `releaseAfter` to 0, which re-queues a duplicate dispatch immediately. If one somehow slips past the `ShouldBeUnique` guard (stale `$uniqueFor`, dispatch race), a second worker would tight-loop against the held lock. `->dontRelease()` fails the duplicate cleanly instead, so the admin sees an explicit 'already running' error rather than a stealth second composer install after the first completes.
## What

Two related queue-safety improvements for Flarum's job system:

1. A default retry policy on `AbstractJob` (with explicit opt-outs for three non-idempotent jobs)
2. Lock-expiry hardening on `ComposerCommandJob`, which uses both `ShouldBeUnique` and `WithoutOverlapping` but leaves their expiry at the default (never expires)

Five files changed, 48 insertions, 1 deletion:
### Part 1: Retry defaults

Setting the defaults (2 files):

- `framework/core/src/Queue/AbstractJob.php` — the base class extended by most queued jobs
- `extensions/subscriptions/src/Job/SendReplyNotification.php` — extends `Queueable` directly (misses the base-class fix)

Both get `$tries = 3` and exponential `$backoff = [30, 60, 120]`.
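A minimal sketch of what both files gain. The namespace, trait list, and exact doc comments are assumptions based on the file path and standard Laravel job conventions, not the actual diff:

```php
<?php

namespace Flarum\Queue;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

abstract class AbstractJob implements ShouldQueue
{
    use InteractsWithQueue, Queueable, SerializesModels;

    /**
     * Attempt the job at most 3 times, then fail cleanly.
     * Subclasses with non-idempotent side effects override this
     * (e.g. `public int $tries = 1;`).
     */
    public int $tries = 3;

    /**
     * Exponential backoff: wait 30s before the second attempt,
     * 60s before the third, 120s before any further attempt.
     *
     * @var int[]
     */
    public array $backoff = [30, 60, 120];
}
```

Laravel reads both public properties directly off the job instance, so nothing else in the class needs to change.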
Opting out where retries could cause duplicate side effects (3 files):

- `framework/core/src/Notification/Job/SendNotificationsJob.php` — `Notification::notify()` uses a raw bulk insert, not upsert; retry = duplicate rows
- `extensions/gdpr/src/Jobs/ExportJob.php` — writes an export file and dispatches `Exporting`/`Exported` events; retry duplicates both
- `extensions/package-manager/src/Job/ComposerCommandJob.php` — composer commands are not idempotent, and an OOM bypasses the `handle()` try/catch, so a retry runs against a partially-modified `vendor/`

Each override is `public int $tries = 1;` with an inline comment explaining the specific reason.

On workers started with `queue:work --tries=0` (unlimited retries — a valid flag value), these three jobs would otherwise retry indefinitely on failure, producing duplicate notification rows, duplicate export files, or composer retries against a partially-modified `vendor/`. Explicit `$tries = 1` caps them at one attempt regardless of worker config.

Every queued job in the repo is now accounted for: it either inherits the new defaults, is patched directly, or is explicitly opted out. See "Per-job retry-safety audit" below for the reasoning.
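Each opt-out is a one-property override. Sketched here for `SendNotificationsJob` (namespace from the file list above; the constructor and `handle()` body are elided, and the comment wording is illustrative):

```php
<?php

namespace Flarum\Notification\Job;

use Flarum\Queue\AbstractJob;

class SendNotificationsJob extends AbstractJob
{
    /**
     * Not retry-safe: Notification::notify() performs a raw bulk
     * insert, so a second attempt would add duplicate notification
     * rows. Override the AbstractJob default of 3 attempts.
     */
    public int $tries = 1;

    // ... existing constructor and handle() unchanged ...
}
```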
### Part 2: ComposerCommandJob lock hardening
The job already implements `ShouldBeUnique` and uses `WithoutOverlapping` middleware, but both locks ship with defaults that cause problems on worker crash / duplicate dispatch. Three separate defects being patched:

1. `ShouldBeUnique` defaults `$uniqueFor = 0` (never expires). If a worker crashes mid-composer-install, the admin UI refuses new dispatches until the cache TTL runs down.
2. `WithoutOverlapping` defaults `$expiresAfter = 0` (never expires). Same problem at the worker-level lock — a crashed worker leaves the lock permanently held.
3. `WithoutOverlapping` defaults `$releaseAfter = 0` (re-queue immediately). If a duplicate dispatch somehow slips past `ShouldBeUnique`, the second copy tight-loops against the held lock instead of failing cleanly.

Fix: `$uniqueFor = 600` and `->expireAfter(600)->dontRelease()`. 600s is longer than `$timeout`'s 180s (locks won't expire mid-run) and short enough to self-clear after a realistic crash. `dontRelease()` fails a duplicate dispatch loudly instead of letting it stealth-run after the first completes.
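A sketch of the combined result, using Laravel's stock unique-job property and middleware APIs. The class internals and the `'composer'` lock key are assumptions; only the two 600s bounds and `dontRelease()` are from this PR:

```php
<?php

use Flarum\Queue\AbstractJob;
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Queue\Middleware\WithoutOverlapping;

class ComposerCommandJob extends AbstractJob implements ShouldBeUnique
{
    // Existing timeout, unchanged: composer gets at most 180s per run.
    public int $timeout = 60 * 3;

    // Non-idempotent: never retry (see the audit below).
    public int $tries = 1;

    // Unique lock self-expires after 600s: longer than $timeout, so it
    // can't lapse mid-run; short enough to clear after a worker crash.
    public int $uniqueFor = 600;

    public function middleware(): array
    {
        return [
            (new WithoutOverlapping('composer'))
                ->expireAfter(600)  // crashed worker can't hold this forever
                ->dontRelease(),    // duplicate dispatch fails instead of re-queueing
        ];
    }
}
```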
## Why

Laravel's `queue:work` processes each job exactly once unless the job (or the worker command) says otherwise. Right now no Flarum job sets `$tries`, `$backoff`, `retryUntil()`, or `$maxExceptions`, so:

- `queue:work` (default): any transient failure → the job fails immediately; the notification/email/index/push is silently lost.
- `queue:work --tries=0` (valid flag value meaning unlimited): the same job retries indefinitely on transient failures, which can saturate the worker pool.

Concrete examples of transient failures that currently don't retry:

- `SendEmailNotificationJob`, `SendInformationalEmailJob`, `RequestPasswordResetJob`, and `SendReplyNotification` all fail permanently on one blip
- `SendPusherNotificationsJob` silently drops the realtime message
- `IndexJob` leaves the post unindexed
- `ComposerCommandJob` marks the extension install permanently failed (but see the note below)

With `$tries = 3` and `[30, 60, 120]` backoff, the worker attempts the job, waits 30s on the first failure, 60s on the second, and 120s on the third. The total window of ~3.5 minutes covers most transient outages without piling up retries against a struggling upstream.
## Coverage

All 21 `ShouldQueue` classes in the repo (3 abstract bases + 18 concrete runnable jobs):

- `framework/core/src/Queue/AbstractJob.php`
- `framework/core/src/User/Job/RequestPasswordResetJob.php`
- `framework/core/src/Search/Job/IndexJob.php`
- `framework/core/src/Notification/Job/SendEmailNotificationJob.php`
- `framework/core/src/Notification/Job/SendNotificationsJob.php` — `$tries = 1` (see audit)
- `framework/core/src/Mail/Job/SendAbandonedExtensionsEmailJob.php`
- `framework/core/src/Mail/Job/SendInformationalEmailJob.php`
- `extensions/realtime/src/Push/Jobs/Job.php`
- `extensions/realtime/src/Push/Jobs/SendFlaggedJob.php`
- `extensions/realtime/src/Push/Jobs/SendGeneratedPayloadJob.php`
- `extensions/realtime/src/Push/Jobs/SendNotificationsJob.php`
- `extensions/realtime/src/Push/Jobs/SendTriggerJob.php`
- `extensions/realtime/src/Push/Jobs/SendDialogMessageJob.php`
- `extensions/pusher/src/SendPusherNotificationsJob.php`
- `extensions/package-manager/src/Job/ComposerCommandJob.php` — `$tries = 1` (see audit)
- `extensions/messages/src/Job/SendMessageNotificationsJob.php`
- `extensions/mentions/src/Job/SendMentionsNotificationsJob.php`
- `extensions/gdpr/src/Jobs/GdprJob.php`
- `extensions/gdpr/src/Jobs/ErasureJob.php`
- `extensions/gdpr/src/Jobs/ExportJob.php` — `$tries = 1` (see audit)
- `extensions/subscriptions/src/Job/SendReplyNotification.php`

Nothing else in `src/` uses `retryUntil()`, so this is a clean addition.
## Per-job retry-safety audit

Before shipping, I read the `handle()` of every job that would inherit the new defaults and checked whether a retry could produce duplicate side effects. Eleven jobs are retry-safe: three operate on unambiguously idempotent paths (`$notifications->sync()`, upsert/delete indexing, null-payload Pusher trigger), and the remainder perform transactional emails where duplicate delivery under ACK loss is preferable to lost delivery. Three jobs are not retry-safe and opt out of the base defaults with `public int $tries = 1;` plus a comment explaining why:

- `SendNotificationsJob` — calls `Notification::notify()`, which uses `static::insert()` at Notification.php:182. Raw bulk insert, not `firstOrCreate`/upsert — a retry would add duplicate notification rows to recipients' feeds.
- `ExportJob` (gdpr) — dispatches the `Exporting` event, calls `$exporter->export()` (which writes an export file), notifies the actor, then dispatches `Exported`. Each run has file + event + notification side effects that aren't deduped.
- `ComposerCommandJob` (package-manager) — the existing `handle()` wraps its body in `try { ... } catch (Throwable $exception) { $this->abort($exception); }`, which swallows application errors. But OOM / fatal errors kill the worker before `catch` runs, and on that path `$tries = 3` would kick in. Composer commands are not idempotent, so a retry against a partially-modified `vendor/` could fail in new ways.
## Existing overrides preserved

- `ComposerCommandJob` keeps its existing `public int $timeout = 60 * 3;` and its `WithoutOverlapping` + `ShouldBeUnique` middleware. The new `$tries = 1` override layers on top cleanly.
- `GdprJob` remains abstract, extending `AbstractJob`. Its concrete subclasses (`ExportJob`, `ErasureJob`) inherit the base defaults; `ExportJob` overrides to `$tries = 1` as described above.
- `ErasureJob` inherits the default `$tries = 3` — safe because its `handle()` early-returns on deleted users (`if (! $user->exists) { return; }`), so the common retry scenario (user already erased) is a no-op.

## Why `$tries + $backoff` and not `retryUntil()`

Laravel's Worker silently ignores `$tries` when `retryUntil()` is set on the job — the framework treats them as mutually exclusive retry policies. See Worker.php:612 on 13.x, introduced in laravel/framework#35214. None of the 21 Flarum jobs in scope use `retryUntil()` (verified via grep), so a count-based policy is effective. If a future subclass wants time-based semantics, it can override `retryUntil()` and the base `$tries` will be correctly ignored.
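A hypothetical future subclass choosing the time-based policy could look like this (the class name is invented; `retryUntil()` is the standard Laravel hook and takes precedence over the inherited `$tries`):

```php
<?php

use Flarum\Queue\AbstractJob;

class SomeTimeWindowedJob extends AbstractJob
{
    /**
     * Keep retrying (still spaced by the inherited $backoff schedule)
     * until ten minutes after first dispatch, then fail permanently.
     * With this method present, the base $tries = 3 is ignored.
     */
    public function retryUntil(): \DateTime
    {
        return now()->addMinutes(10);
    }
}
```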
## Why `$backoff` (not just `$tries`)

Default backoff is 0 seconds. Without `$backoff`, three retries fire back-to-back at worker speed — three hammer blows to the same SMTP server or Redis node that just blipped. Exponential `[30, 60, 120]` gives real recovery time between attempts.
## Note on `retry_after`

The queue connection config also has a `retry_after` safety net (default 90s across the `database`, `beanstalkd`, and `redis` drivers in Laravel 13). That's the worker-timeout fallback for stuck jobs, separate from the application-level `$tries`/`$backoff` policy. This PR doesn't touch it.
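For context, the setting lives in the connection block of `config/queue.php`; this excerpt follows the stock Laravel skeleton (shown for orientation only — nothing here is changed by this PR):

```php
<?php

// config/queue.php (stock Laravel skeleton, database driver shown)
return [
    'connections' => [
        'database' => [
            'driver' => 'database',
            'table' => 'jobs',
            'queue' => 'default',
            // Worker-level safety net: a job reserved for more than 90s
            // without finishing is released for another attempt. Keep it
            // above the job's $timeout to avoid double-processing.
            'retry_after' => 90,
        ],
    ],
];
```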
## Not a bug fix, but preventative hardening

No open issue directly tracks this. The closest reported symptom was #4301 (closed), an SMTP connection failure whose stack trace shows `SendInformationalEmailJob` and `RequestPasswordResetJob` failing with `stream_socket_client(): SSL operation failed` — both jobs this PR covers. That reporter was on `Queue driver: sync`, so retries wouldn't have helped them specifically, but the failure mode is real and happens on async drivers whenever a mail provider has a brief outage.

## Local verification
- `php -l` clean on both files

## Context
Part of a broader Laravel queue-safety audit. Full write-up: https://joshsalway.com/laravel-queue-safety