
Add default retry policy to queued jobs #4591

Draft

JoshSalway wants to merge 4 commits into flarum:2.x from JoshSalway:fix/base-job-retry-policy

Conversation


@JoshSalway JoshSalway commented Apr 20, 2026

What

Two related queue-safety improvements for Flarum's job system:

  1. A default count-based retry policy on AbstractJob (with explicit opt-outs for three non-idempotent jobs)
  2. Bounded lock lifetimes on ComposerCommandJob, which uses both ShouldBeUnique and WithoutOverlapping but leaves their expiry at the default (never expires)

Five files changed, 48 insertions, 1 deletion:

Setting the defaults (2 files):

| File | Role |
| --- | --- |
| framework/core/src/Queue/AbstractJob.php | Base class extended (directly or transitively) by 19 queued job classes across core and extensions |
| extensions/subscriptions/src/Job/SendReplyNotification.php | The one queued job that extends Queueable directly (misses the base-class fix) |

Both get:

```php
/** The maximum number of times the job may be attempted. */
public int $tries = 3;

/** Delay in seconds between retries. */
public array $backoff = [30, 60, 120];
```

Opting out where retries could cause duplicate side effects (3 files):

| File | Why |
| --- | --- |
| framework/core/src/Notification/Job/SendNotificationsJob.php | Notification::notify() uses a raw bulk insert, not upsert — retry = duplicate rows |
| extensions/gdpr/src/Jobs/ExportJob.php | Writes an export file and fires Exporting/Exported events — retry duplicates both |
| extensions/package-manager/src/Job/ComposerCommandJob.php | Composer isn't idempotent; on OOM (which bypasses its try/catch) a retry runs against a partially-modified vendor/ |

Each override is `public int $tries = 1;` with an inline comment explaining the specific reason.
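As a concrete sketch, here is what one opt-out looks like, using SendNotificationsJob as the example (the class body is abbreviated and the comment wording is illustrative, not the exact text in the PR):

```php
<?php

namespace Flarum\Notification\Job;

use Flarum\Queue\AbstractJob;

class SendNotificationsJob extends AbstractJob
{
    /**
     * Not retry-safe: Notification::notify() performs a raw bulk
     * insert, so a retried run would write duplicate notification
     * rows. Cap at one attempt, overriding the AbstractJob default.
     */
    public int $tries = 1;

    // ...existing constructor and handle() unchanged...
}
```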

On workers started with queue:work --tries=0 (unlimited retries — a valid flag value), these three jobs would otherwise retry indefinitely on failure, producing duplicate notification rows, duplicate export files, or composer retries against a partially-modified vendor/. Explicit $tries = 1 caps them at one attempt regardless of worker config.

Every queued job in the repo is now accounted for: it either inherits the new defaults, is patched directly, or is explicitly opted out. See "Per-job retry-safety audit" below for the reasoning.

Part 2: ComposerCommandJob lock hardening

The job already implements ShouldBeUnique and uses WithoutOverlapping middleware, but both locks ship with defaults that cause problems on worker crash / duplicate-dispatch:

```diff
+   public int $uniqueFor = 600;

    public function middleware(): array
    {
        return [
-           new WithoutOverlapping(),
+           (new WithoutOverlapping())
+               ->expireAfter(600)
+               ->dontRelease(),
        ];
    }
```

Three separate defects being patched:

  1. ShouldBeUnique defaults $uniqueFor = 0 (never expires). If a worker crashes mid-composer install, the admin UI refuses new dispatches until cache TTL runs down.
  2. WithoutOverlapping defaults $expiresAfter = 0 (never expires). Same problem at the worker-level lock — crashed worker leaves the lock permanently held.
  3. WithoutOverlapping defaults $releaseAfter = 0 (re-queue immediately). If a duplicate dispatch somehow slips past ShouldBeUnique, the second copy tight-loops against the held lock instead of failing cleanly.

Fix: $uniqueFor = 600 and ->expireAfter(600)->dontRelease(). 600s is longer than $timeout's 180s (locks won't expire mid-run) and short enough to self-clear after a realistic crash. dontRelease() fails a duplicate dispatch loudly instead of letting it stealth-run after the first completes.
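Put together, a sketch of the hardened job after this change (class skeleton abbreviated; namespace and unrelated members omitted, and only the properties and middleware relevant here are shown):

```php
<?php

use Flarum\Queue\AbstractJob;
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Queue\Middleware\WithoutOverlapping;

class ComposerCommandJob extends AbstractJob implements ShouldBeUnique
{
    public int $timeout = 60 * 3; // existing: composer killed after 180s
    public int $tries = 1;        // opt-out: composer is not idempotent
    public int $uniqueFor = 600;  // unique lock self-clears after a crash

    public function middleware(): array
    {
        return [
            (new WithoutOverlapping())
                ->expireAfter(600) // overlap lock also self-clears
                ->dontRelease(),   // duplicate dispatch fails instead of re-queueing
        ];
    }
}
```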

Why

Laravel's queue:work processes each job exactly once unless the job (or the worker command) says otherwise. Right now no Flarum job sets $tries, $backoff, retryUntil(), or $maxExceptions, so:

  • On a worker started with queue:work (default): any transient failure → job fails immediately, notification/email/index/push silently lost.
  • On a worker started with queue:work --tries=0 (valid flag value meaning unlimited): same job retries indefinitely on transient failures, which can saturate the worker pool.

Concrete examples of transient failures that currently don't retry:

  • SMTP provider hiccup → SendEmailNotificationJob, SendInformationalEmailJob, RequestPasswordResetJob, SendReplyNotification all fail permanently on one blip
  • Redis/Pusher reconnect → SendPusherNotificationsJob silently drops the realtime message
  • Search index network timeout → IndexJob leaves the post unindexed
  • Composer network hiccup → ComposerCommandJob marks the extension install permanently failed (but see note below)

With $tries=3 and [30, 60, 120] backoff, the worker attempts the job, waits 30s on first failure, 60s on second, 120s on third. Total window of ~3.5 minutes covers most transient outages without piling up retries against a struggling upstream.

Coverage

All 21 ShouldQueue classes in the repo (3 abstract bases + 18 concrete runnable jobs):

| # | File | Covered by |
| --- | --- | --- |
| 1 | framework/core/src/Queue/AbstractJob.php | direct (base class) |
| 2 | framework/core/src/User/Job/RequestPasswordResetJob.php | inherits |
| 3 | framework/core/src/Search/Job/IndexJob.php | inherits |
| 4 | framework/core/src/Notification/Job/SendEmailNotificationJob.php | inherits |
| 5 | framework/core/src/Notification/Job/SendNotificationsJob.php | override $tries = 1 (see audit) |
| 6 | framework/core/src/Mail/Job/SendAbandonedExtensionsEmailJob.php | inherits |
| 7 | framework/core/src/Mail/Job/SendInformationalEmailJob.php | inherits |
| 8 | extensions/realtime/src/Push/Jobs/Job.php | inherits (abstract realtime base) |
| 8a | extensions/realtime/src/Push/Jobs/SendFlaggedJob.php | inherits (via Push/Jobs/Job) |
| 8b | extensions/realtime/src/Push/Jobs/SendGeneratedPayloadJob.php | inherits (via Push/Jobs/Job) |
| 8c | extensions/realtime/src/Push/Jobs/SendNotificationsJob.php | inherits (via Push/Jobs/Job) |
| 8d | extensions/realtime/src/Push/Jobs/SendTriggerJob.php | inherits (via Push/Jobs/Job) |
| 8e | extensions/realtime/src/Push/Jobs/SendDialogMessageJob.php | inherits (via Push/Jobs/Job) |
| 9 | extensions/pusher/src/SendPusherNotificationsJob.php | inherits |
| 10 | extensions/package-manager/src/Job/ComposerCommandJob.php | override $tries = 1 (see audit) |
| 11 | extensions/messages/src/Job/SendMessageNotificationsJob.php | inherits |
| 12 | extensions/mentions/src/Job/SendMentionsNotificationsJob.php | inherits |
| 13 | extensions/gdpr/src/Jobs/GdprJob.php | inherits (abstract) |
| 13a | extensions/gdpr/src/Jobs/ErasureJob.php | inherits (early-return guard) |
| 13b | extensions/gdpr/src/Jobs/ExportJob.php | override $tries = 1 (see audit) |
| 14 | extensions/subscriptions/src/Job/SendReplyNotification.php | direct |

Nothing else in src/ uses retryUntil(), so this is a clean addition.

Per-job retry-safety audit

Before shipping, I read the handle() of every job that would inherit the new defaults and checked whether a retry could produce duplicate side effects. Eleven jobs are retry-safe: three operate on unambiguously idempotent paths ($notifications->sync(), upsert/delete indexing, null-payload Pusher trigger), and the remainder perform transactional emails where duplicate delivery under ACK-loss is preferable to lost delivery. Three jobs are not retry-safe and opt out of the base defaults with public int $tries = 1; plus a comment explaining why:

  • SendNotificationsJob — calls Notification::notify(), which uses static::insert() at Notification.php:182. Raw bulk insert, not firstOrCreate/upsert — retry would add duplicate notification rows to recipients' feeds.
  • ExportJob (gdpr) — dispatches Exporting event, calls $exporter->export() (which writes an export file), notifies the actor, then dispatches Exported. Each run has file + event + notification side effects that aren't deduped.
  • ComposerCommandJob (package-manager) — existing handle() wraps the body in try { ... } catch (Throwable) { $this->abort($exception); }, which swallows application errors. But OOM / fatal errors kill the worker before catch runs, and on that path $tries=3 would kick in. Composer commands are not idempotent, so a retry against a partially-modified vendor/ could fail in new ways.

Existing overrides preserved

  • ComposerCommandJob keeps its existing public int $timeout = 60 * 3; and its WithoutOverlapping + ShouldBeUnique middleware. The new $tries = 1 override layers on top cleanly.
  • GdprJob remains abstract extending AbstractJob. Its concrete subclasses (ExportJob, ErasureJob) inherit the base defaults; ExportJob overrides to $tries = 1 as described above. ErasureJob inherits the default $tries = 3 — safe because its handle() early-returns on deleted users (if (! $user->exists) { return; }), so the common retry scenario (user already erased) is a no-op.

Why $tries + $backoff and not retryUntil()

Laravel's Worker silently ignores $tries when retryUntil() is set on the job — the framework treats them as mutually exclusive retry policies. See Worker.php:612 on 13.x, introduced in laravel/framework#35214. None of the 21 Flarum jobs in scope use retryUntil() (verified via grep), so a count-based policy is effective. If a future subclass wants time-based semantics, it can override retryUntil() and the base $tries will be correctly ignored.
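For illustration, a hypothetical future subclass opting into time-based semantics (the class name is invented; `now()` is Laravel's date helper):

```php
<?php

use Flarum\Queue\AbstractJob;

class SomeFutureJob extends AbstractJob
{
    /**
     * When retryUntil() is defined, the worker ignores the inherited
     * $tries and instead retries until this timestamp passes.
     */
    public function retryUntil(): \DateTimeInterface
    {
        return now()->addMinutes(10);
    }
}
```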

Why $backoff (not just $tries)

Default backoff is 0 seconds. Without $backoff, three retries fire back-to-back at worker speed — three hammer blows to the same SMTP server or Redis node that just blipped. Exponential [30, 60, 120] gives real recovery time between attempts.
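Laravel also accepts the same schedule as a backoff() method rather than a property, which a subclass could use if it needs to compute delays at runtime; a minimal sketch:

```php
/**
 * Equivalent to `public array $backoff = [30, 60, 120];`: the worker
 * waits 30s before attempt 2, 60s before attempt 3, and reuses the
 * last value for any further attempts.
 */
public function backoff(): array
{
    return [30, 60, 120];
}
```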

Note on retry_after

The queue connection config also has a retry_after safety net (default 90s across the database, beanstalkd, and redis drivers in Laravel 13). That's the worker-timeout fallback for stuck jobs, separate from the application-level $tries/$backoff policy. This PR doesn't touch it.

Not a bug fix, but preventative hardening

No open issue directly tracks this. The closest reported symptom was #4301 (closed), an SMTP connection failure whose stack trace shows SendInformationalEmailJob and RequestPasswordResetJob failing with stream_socket_client(): SSL operation failed — both jobs this PR covers. That reporter was on Queue driver: sync so retries wouldn't have helped them specifically, but the failure mode is real and happens on async drivers whenever a mail provider has a brief outage.

Local verification

  • php -l clean on both files
  • No whitespace or formatting changes outside the added properties
  • Every audit claim above verified against this branch's source

Context

Part of a broader Laravel queue-safety audit. Full write-up: https://joshsalway.com/laravel-queue-safety

Commit messages

Set $tries=3 and exponential $backoff=[30,60,120] on AbstractJob
(the base class extended by most queued jobs) and on
SendReplyNotification (the only queued job that extends Queueable
directly).

With queue:work --tries=0 (the default on some hosting platforms,
including Vapor prior to vapor-core 2.4.1), jobs inheriting no retry
config retry indefinitely on transient failures. Bounded attempts
with exponential backoff give DB/SMTP/filesystem blips time to
recover, then fail cleanly.

Subclasses that want a different policy can override either property.
retryUntil() is not used by any current job, so the count-based policy
is effective (Laravel's Worker silently ignores $tries when
retryUntil() is set; see Worker.php:612 on 13.x, framework PR #35214).

- SendNotificationsJob: Notification::notify() uses a raw bulk insert
  (framework/core/src/Notification/Notification.php), so retries would
  create duplicate notification rows.
- ExportJob (gdpr): writes an export file and dispatches Exporting/
  Exported events; retries would duplicate both.
- ComposerCommandJob (package-manager): composer install is not
  idempotent, and an OOM bypasses the handle() try/catch that normally
  swallows exceptions, so a retry could run against a partially-written
  vendor/.

Each job sets $tries = 1 to override the AbstractJob default.

ShouldBeUnique defaults $uniqueFor to 0, which means the lock never
expires. WithoutOverlapping without expireAfter has the same pattern.
If the worker crashes mid-composer-install, either lock can otherwise
block future dispatches until cache TTL runs down.

Both now bounded at 600 seconds (longer than $timeout's 180s so they
can't expire mid-run, short enough to auto-clear after a crash).

WithoutOverlapping defaults releaseAfter to 0, which re-queues a
duplicate dispatch immediately. If one somehow slips past the
ShouldBeUnique guard (stale $uniqueFor, dispatch race), a second
worker would tight-loop against the held lock.

->dontRelease() fails the duplicate cleanly instead, so the admin
sees an explicit 'already running' rather than a stealth second
composer install after the first completes.
