
Add default retry policy to queued jobs #4591

Draft

JoshSalway wants to merge 4 commits into flarum:2.x from JoshSalway:fix/base-job-retry-policy

Conversation


@JoshSalway JoshSalway commented Apr 20, 2026

What

Two related queue-safety improvements for Flarum's job system:

  1. A default count-based retry policy on AbstractJob (with explicit opt-outs for three non-idempotent jobs)
  2. Bounded lock lifetimes on ComposerCommandJob, which uses both ShouldBeUnique and WithoutOverlapping but leaves their expiry at the default (never expires)

Five files changed, 48 insertions, 1 deletion:

Setting the defaults (2 files):

| File | Role |
| --- | --- |
| framework/core/src/Queue/AbstractJob.php | Base class extended (directly or transitively) by 19 queued job classes across core and extensions |
| extensions/subscriptions/src/Job/SendReplyNotification.php | The one queued job that extends Queueable directly (misses the base-class fix) |

Both get:

```php
/** The maximum number of times the job may be attempted. */
public int $tries = 3;

/** Delay in seconds between retries. */
public array $backoff = [30, 60, 120];
```

Opting out where retries could cause duplicate side effects (3 files):

| File | Why |
| --- | --- |
| framework/core/src/Notification/Job/SendNotificationsJob.php | Notification::notify() uses a raw bulk insert, not upsert — retry = duplicate rows |
| extensions/gdpr/src/Jobs/ExportJob.php | Writes an export file and fires Exporting/Exported events — retry duplicates both |
| extensions/package-manager/src/Job/ComposerCommandJob.php | Composer isn't idempotent; on OOM (which bypasses its try/catch) a retry runs against a partially-modified vendor/ |

Each override is `public int $tries = 1;` with an inline comment explaining the specific reason.
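As a concrete sketch, here is what one opt-out looks like, using SendNotificationsJob as the example (the class body is abbreviated and the comment wording is illustrative, not the exact text in the PR):

```php
<?php

namespace Flarum\Notification\Job;

use Flarum\Queue\AbstractJob;

class SendNotificationsJob extends AbstractJob
{
    /**
     * Not retry-safe: Notification::notify() performs a raw bulk
     * insert, so a retried run would write duplicate notification
     * rows. Cap at one attempt, overriding the AbstractJob default.
     */
    public int $tries = 1;

    // ...existing constructor and handle() unchanged...
}
```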

On workers started with queue:work --tries=0 (unlimited retries — a valid flag value), these three jobs would otherwise retry indefinitely on failure, producing duplicate notification rows, duplicate export files, or composer retries against a partially-modified vendor/. Explicit $tries = 1 caps them at one attempt regardless of worker config.

Every queued job in the repo is now accounted for: it either inherits the new defaults, is patched directly, or is explicitly opted out. See "Per-job retry-safety audit" below for the reasoning.

Part 2: ComposerCommandJob lock hardening

The job already implements ShouldBeUnique and uses WithoutOverlapping middleware, but both locks ship with defaults that cause problems on worker crash / duplicate-dispatch:

```diff
+   public int $uniqueFor = 600;

    public function middleware(): array
    {
        return [
-           new WithoutOverlapping(),
+           (new WithoutOverlapping())
+               ->expireAfter(600)
+               ->dontRelease(),
        ];
    }
```

Three separate defects being patched:

  1. ShouldBeUnique defaults $uniqueFor = 0 (never expires). If a worker crashes mid-composer install, the admin UI refuses new dispatches until cache TTL runs down.
  2. WithoutOverlapping defaults $expiresAfter = 0 (never expires). Same problem at the worker-level lock — crashed worker leaves the lock permanently held.
  3. WithoutOverlapping defaults $releaseAfter = 0 (re-queue immediately). If a duplicate dispatch somehow slips past ShouldBeUnique, the second copy tight-loops against the held lock instead of failing cleanly.

Fix: $uniqueFor = 600 and ->expireAfter(600)->dontRelease(). 600s is longer than $timeout's 180s (locks won't expire mid-run) and short enough to self-clear after a realistic crash. dontRelease() fails a duplicate dispatch loudly instead of letting it stealth-run after the first completes.
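Put together, a sketch of the hardened job after this change (class skeleton abbreviated; namespace and unrelated members omitted, and only the properties and middleware relevant here are shown):

```php
<?php

use Flarum\Queue\AbstractJob;
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Queue\Middleware\WithoutOverlapping;

class ComposerCommandJob extends AbstractJob implements ShouldBeUnique
{
    public int $timeout = 60 * 3; // existing: composer killed after 180s
    public int $tries = 1;        // opt-out: composer is not idempotent
    public int $uniqueFor = 600;  // unique lock self-clears after a crash

    public function middleware(): array
    {
        return [
            (new WithoutOverlapping())
                ->expireAfter(600) // overlap lock also self-clears
                ->dontRelease(),   // duplicate dispatch fails instead of re-queueing
        ];
    }
}
```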

Why

Laravel's queue:work processes each job exactly once unless the job (or the worker command) says otherwise. Right now no Flarum job sets $tries, $backoff, retryUntil(), or $maxExceptions, so:

  • On a worker started with queue:work (default): any transient failure → job fails immediately, notification/email/index/push silently lost.
  • On a worker started with queue:work --tries=0 (valid flag value meaning unlimited): same job retries indefinitely on transient failures, which can saturate the worker pool.

Concrete examples of transient failures that currently don't retry:

  • SMTP provider hiccup → SendEmailNotificationJob, SendInformationalEmailJob, RequestPasswordResetJob, SendReplyNotification all fail permanently on one blip
  • Redis/Pusher reconnect → SendPusherNotificationsJob silently drops the realtime message
  • Search index network timeout → IndexJob leaves the post unindexed
  • Composer network hiccup → ComposerCommandJob marks the extension install permanently failed (but see note below)

With $tries=3 and [30, 60, 120] backoff, the worker attempts the job, waits 30s on first failure, 60s on second, 120s on third. Total window of ~3.5 minutes covers most transient outages without piling up retries against a struggling upstream.

Coverage

All 21 ShouldQueue classes in the repo (3 abstract bases + 18 concrete runnable jobs):

| # | File | Covered by |
| --- | --- | --- |
| 1 | framework/core/src/Queue/AbstractJob.php | direct (base class) |
| 2 | framework/core/src/User/Job/RequestPasswordResetJob.php | inherits |
| 3 | framework/core/src/Search/Job/IndexJob.php | inherits |
| 4 | framework/core/src/Notification/Job/SendEmailNotificationJob.php | inherits |
| 5 | framework/core/src/Notification/Job/SendNotificationsJob.php | override $tries = 1 (see audit) |
| 6 | framework/core/src/Mail/Job/SendAbandonedExtensionsEmailJob.php | inherits |
| 7 | framework/core/src/Mail/Job/SendInformationalEmailJob.php | inherits |
| 8 | extensions/realtime/src/Push/Jobs/Job.php | inherits (abstract realtime base) |
| 8a | extensions/realtime/src/Push/Jobs/SendFlaggedJob.php | inherits (via Push/Jobs/Job) |
| 8b | extensions/realtime/src/Push/Jobs/SendGeneratedPayloadJob.php | inherits (via Push/Jobs/Job) |
| 8c | extensions/realtime/src/Push/Jobs/SendNotificationsJob.php | inherits (via Push/Jobs/Job) |
| 8d | extensions/realtime/src/Push/Jobs/SendTriggerJob.php | inherits (via Push/Jobs/Job) |
| 8e | extensions/realtime/src/Push/Jobs/SendDialogMessageJob.php | inherits (via Push/Jobs/Job) |
| 9 | extensions/pusher/src/SendPusherNotificationsJob.php | inherits |
| 10 | extensions/package-manager/src/Job/ComposerCommandJob.php | override $tries = 1 (see audit) |
| 11 | extensions/messages/src/Job/SendMessageNotificationsJob.php | inherits |
| 12 | extensions/mentions/src/Job/SendMentionsNotificationsJob.php | inherits |
| 13 | extensions/gdpr/src/Jobs/GdprJob.php | inherits (abstract) |
| 13a | extensions/gdpr/src/Jobs/ErasureJob.php | inherits (early-return guard) |
| 13b | extensions/gdpr/src/Jobs/ExportJob.php | override $tries = 1 (see audit) |
| 14 | extensions/subscriptions/src/Job/SendReplyNotification.php | direct |

Nothing else in src/ uses retryUntil(), so this is a clean addition.

Per-job retry-safety audit

Before shipping, I read the handle() of every job that would inherit the new defaults and checked whether a retry could produce duplicate side effects. Eleven jobs are retry-safe: three operate on unambiguously idempotent paths ($notifications->sync(), upsert/delete indexing, null-payload Pusher trigger), and the remainder perform transactional emails where duplicate delivery under ACK-loss is preferable to lost delivery. Three jobs are not retry-safe and opt out of the base defaults with public int $tries = 1; plus a comment explaining why:

  • SendNotificationsJob — calls Notification::notify(), which uses static::insert() at Notification.php:182. Raw bulk insert, not firstOrCreate/upsert — retry would add duplicate notification rows to recipients' feeds.
  • ExportJob (gdpr) — dispatches Exporting event, calls $exporter->export() (which writes an export file), notifies the actor, then dispatches Exported. Each run has file + event + notification side effects that aren't deduped.
  • ComposerCommandJob (package-manager) — existing handle() wraps the body in try { ... } catch (Throwable) { $this->abort($exception); }, which swallows application errors. But OOM / fatal errors kill the worker before catch runs, and on that path $tries=3 would kick in. Composer commands are not idempotent, so a retry against a partially-modified vendor/ could fail in new ways.

Existing overrides preserved

  • ComposerCommandJob keeps its existing public int $timeout = 60 * 3; and its WithoutOverlapping + ShouldBeUnique middleware. The new $tries = 1 override layers on top cleanly.
  • GdprJob remains abstract extending AbstractJob. Its concrete subclasses (ExportJob, ErasureJob) inherit the base defaults; ExportJob overrides to $tries = 1 as described above. ErasureJob inherits the default $tries = 3 — safe because its handle() early-returns on deleted users (if (! $user->exists) { return; }), so the common retry scenario (user already erased) is a no-op.

Why $tries + $backoff and not retryUntil()

Laravel's Worker silently ignores $tries when retryUntil() is set on the job — the framework treats them as mutually exclusive retry policies. See Worker.php:612 on 13.x, introduced in laravel/framework#35214. None of the 21 Flarum jobs in scope use retryUntil() (verified via grep), so a count-based policy is effective. If a future subclass wants time-based semantics, it can override retryUntil() and the base $tries will be correctly ignored.
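For illustration, a hypothetical future subclass opting into time-based semantics (the class name is invented; `now()` is Laravel's date helper):

```php
<?php

use Flarum\Queue\AbstractJob;

class SomeFutureJob extends AbstractJob
{
    /**
     * When retryUntil() is defined, the worker ignores the inherited
     * $tries and instead retries until this timestamp passes.
     */
    public function retryUntil(): \DateTimeInterface
    {
        return now()->addMinutes(10);
    }
}
```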

Why $backoff (not just $tries)

Default backoff is 0 seconds. Without $backoff, three retries fire back-to-back at worker speed — three hammer blows to the same SMTP server or Redis node that just blipped. Exponential [30, 60, 120] gives real recovery time between attempts.
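Laravel also accepts the same schedule as a backoff() method rather than a property, which a subclass could use if it needs to compute delays at runtime; a minimal sketch:

```php
/**
 * Equivalent to `public array $backoff = [30, 60, 120];`: the worker
 * waits 30s before attempt 2, 60s before attempt 3, and reuses the
 * last value for any further attempts.
 */
public function backoff(): array
{
    return [30, 60, 120];
}
```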

Note on retry_after

The queue connection config also has a retry_after safety net (default 90s across the database, beanstalkd, and redis drivers in Laravel 13). That's the worker-timeout fallback for stuck jobs, separate from the application-level $tries/$backoff policy. This PR doesn't touch it.

Not a bug fix, but preventative hardening

No open issue directly tracks this. The closest reported symptom was #4301 (closed), an SMTP connection failure whose stack trace shows SendInformationalEmailJob and RequestPasswordResetJob failing with stream_socket_client(): SSL operation failed — both jobs this PR covers. That reporter was on Queue driver: sync so retries wouldn't have helped them specifically, but the failure mode is real and happens on async drivers whenever a mail provider has a brief outage.

Local verification

  • php -l clean on both files
  • No whitespace or formatting changes outside the added properties
  • Every audit claim above verified against this branch's source

Context

Part of a broader Laravel queue-safety audit. Full write-up: https://joshsalway.com/laravel-queue-safety

Commit messages

Set $tries=3 and exponential $backoff=[30,60,120] on AbstractJob
(the base class extended by most queued jobs) and on
SendReplyNotification (the only queued job that extends Queueable
directly).

With queue:work --tries=0 (the default on some hosting platforms,
including Vapor prior to vapor-core 2.4.1), jobs inheriting no retry
config retry indefinitely on transient failures. Bounded attempts
with exponential backoff give DB/SMTP/filesystem blips time to
recover, then fail cleanly.

Subclasses that want a different policy can override either property.
retryUntil() is not used by any current job, so the count-based policy
is effective (Laravel's Worker silently ignores $tries when
retryUntil() is set; see Worker.php:612 on 13.x, framework PR #35214).

- SendNotificationsJob: Notification::notify() uses a raw bulk insert
  (framework/core/src/Notification/Notification.php), so retries would
  create duplicate notification rows.
- ExportJob (gdpr): writes an export file and dispatches Exporting/
  Exported events; retries would duplicate both.
- ComposerCommandJob (package-manager): composer install is not
  idempotent, and an OOM bypasses the handle() try/catch that normally
  swallows exceptions, so a retry could run against a partially-written
  vendor/.

Each job sets $tries = 1 to override the AbstractJob default.

ShouldBeUnique defaults $uniqueFor to 0, which means the lock never
expires. WithoutOverlapping without expireAfter has the same pattern.
If the worker crashes mid-composer-install, either lock can otherwise
block future dispatches until cache TTL runs down.

Both now bounded at 600 seconds (longer than $timeout's 180s so they
can't expire mid-run, short enough to auto-clear after a crash).

WithoutOverlapping defaults releaseAfter to 0, which re-queues a
duplicate dispatch immediately. If one somehow slips past the
ShouldBeUnique guard (stale $uniqueFor, dispatch race), a second
worker would tight-loop against the held lock.

->dontRelease() fails the duplicate cleanly instead, so the admin
sees an explicit 'already running' rather than a stealth second
composer install after the first completes.
