fix(backfill): make nightly backfill incremental and defer on GraphQL rate limits#195

Merged
entrius merged 3 commits into test from fix/reduce-nightly-backfill-window on Jun 24, 2026

@anderdc anderdc commented Jun 24, 2026

Collaborator

Problem

The nightly backfill (RepoBackfillScheduleService, daily) re-fetched every PR in a 40-day window from scratch every night — enqueuing a PR_METADATA + PR_FILES (content) job per PR. For high-PR-volume GitHub accounts this exhausted the per-installation GraphQL budget (5k points/hr) and flooded logs.

Evidence (prod ghm-das, 7-day buffer): 100% of GraphQL RATE_LIMIT log lines fall in the post-enqueue window (00:10–02:xx UTC); the other ~22h/day are clean. Driven entirely by two installations (the two largest accounts). Rate-limit + Halving lines were ~16% of all log volume, amplified ~5× per failed batch by the 50→25→12→6→5 retry ladder. Not webhook- or reconcile-driven.
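The ~5× amplification follows from the shape of the retry ladder: each rate-limited batch is retried at successively smaller sizes, and every rung emits its own log lines. As a rough illustration, the ladder quoted above can be reproduced by halving with a floor of 5. The function name and the exact rounding rule are assumptions chosen to match the quoted sequence, not the project's actual code.

```typescript
// Hypothetical reconstruction of the batch-size ladder: halve (rounding down)
// until the floor is reached. Assumed rule, chosen to match 50→25→12→6→5.
function halvingLadder(start: number, floor: number): number[] {
  const sizes = [start];
  let size = start;
  while (size > floor) {
    size = Math.max(floor, Math.floor(size / 2));
    sizes.push(size);
  }
  return sizes;
}

const ladder = halvingLadder(50, 5);
console.log(ladder.join(" → ")); // 50 → 25 → 12 → 6 → 5
console.log(`${ladder.length} attempts per failed batch`);
```

Five rungs means one exhausted budget can produce about five failed requests per batch, which is consistent with the ~5× figure.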

Changes

Incremental backfill — the same single nightly backfill, now only re-fetching what changed:

  • Skip the expensive PR_FILES content fetch when content is already stored (scoring_data_stored) and head+base SHA are unchanged (file diff is fully determined by head+base SHA).
  • Skip PR_METADATA when GitHub's PR updatedAt is unchanged (new nullable pull_requests.updated_at column).
  • Both gates fail safe: new PR / missing content / moved SHA / null timestamp → re-fetch.
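
A minimal sketch of how the two gates could look. `needsContentRefresh` and `needsMetadataRefresh` are the module names listed under Tests & CI; the record shapes, field names, and signatures below are assumptions for illustration only.

```typescript
// Hypothetical record shapes; only the fields the gates need.
interface StoredPr {
  scoringDataStored: boolean; // mirrors scoring_data_stored
  headSha: string | null;
  baseSha: string | null;
  updatedAt: string | null; // null until the first post-migration fetch
}

interface RemotePr {
  headSha: string;
  baseSha: string;
  updatedAt: string | null;
}

// PR_FILES gate: skip only when content is stored AND both SHAs are unchanged.
function needsContentRefresh(stored: StoredPr | undefined, remote: RemotePr): boolean {
  if (!stored) return true; // new PR
  if (!stored.scoringDataStored) return true; // missing content
  return stored.headSha !== remote.headSha || stored.baseSha !== remote.baseSha; // moved SHA
}

// PR_METADATA gate: skip only when both timestamps exist and are equal.
function needsMetadataRefresh(stored: StoredPr | undefined, remote: RemotePr): boolean {
  if (!stored) return true; // new PR
  if (stored.updatedAt === null || remote.updatedAt === null) return true; // null timestamp
  // Compare as instants so formatting differences do not matter; an
  // unparseable value yields NaN, and NaN !== NaN, which also re-fetches.
  return new Date(stored.updatedAt).getTime() !== new Date(remote.updatedAt).getTime();
}
```

Every uncertain branch returns `true`, so a bug in the gate costs an extra fetch rather than stale data.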

Rate-limit deferral instead of failure:

  • A GraphQL 200 carrying RATE_LIMIT/SECONDARY_RATE_LIMIT/graphql_rate_limit is now surfaced as a typed GitHubRateLimitError (these bypass HTTP-status handling since GraphQL returns 200).
  • The processor catches it → job.moveToDelayed(now + retryAfterMs, token) + DelayedError, re-queuing without consuming a retry attempt and freeing the worker slot. No more halving an already-exhausted budget in a tight loop.
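
The detect-then-defer flow can be sketched roughly as follows. To keep the example dependency-free, `DelayedError` and `DelayableJob` are local stand-ins for BullMQ's `DelayedError` and `Job`; the error-payload fields (`type`, `code`), the `processWithDeferral` wrapper, and the constructor signature of `GitHubRateLimitError` are assumptions, not the PR's actual code.

```typescript
// Local stand-ins (assumed shapes) for BullMQ's DelayedError and Job.
class DelayedError extends Error {
  constructor() {
    super("job deferred");
    this.name = "DelayedError";
  }
}
interface DelayableJob {
  moveToDelayed(timestamp: number, token?: string): Promise<void>;
}

class GitHubRateLimitError extends Error {
  readonly retryAfterMs: number;
  constructor(retryAfterMs: number) {
    super(`GitHub GraphQL rate limit; retry after ${retryAfterMs}ms`);
    this.name = "GitHubRateLimitError";
    this.retryAfterMs = retryAfterMs;
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on old targets
  }
}

// Assumed payload shape: the marker may appear in either `type` or `code`.
interface GraphQLErrorLike { type?: string; code?: string }
interface GraphQLBodyLike { errors?: GraphQLErrorLike[] }

const RATE_LIMIT_MARKERS = ["RATE_LIMIT", "SECONDARY_RATE_LIMIT", "graphql_rate_limit"];

// An HTTP 200 can still carry a rate-limit error, so inspect the body.
function isGraphQLRateLimit(body: GraphQLBodyLike): boolean {
  return (body.errors ?? []).some(
    (e) => RATE_LIMIT_MARKERS.some((m) => e.type === m || e.code === m),
  );
}

// Hypothetical wrapper: run the handler, and on a rate limit re-queue the job
// for later and signal "deferred, not failed" to the worker.
async function processWithDeferral<T>(
  job: DelayableJob,
  token: string | undefined,
  handler: () => Promise<T>,
  now: () => number = Date.now,
): Promise<T> {
  try {
    return await handler();
  } catch (err) {
    if (err instanceof GitHubRateLimitError) {
      await job.moveToDelayed(now() + err.retryAfterMs, token);
      throw new DelayedError();
    }
    throw err; // everything else keeps the normal retry path
  }
}
```

Throwing `DelayedError` after `moveToDelayed` is what lets the worker release the slot without counting the run as a failed attempt; any other error is rethrown unchanged.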

Tuning & observability:

  • Default backfill window 40 → 10 days.
  • Greppable [backfill-summary] (per-repo enqueued-vs-skipped tallies) and [rate-limit-defer] log lines.

Tests & CI:

  • Detection/gating logic extracted to dependency-free modules (isGraphQLRateLimit, needsContentRefresh, needsMetadataRefresh).
  • 21 unit tests (jest) covering detection, gating branches, and the processor deferral.
  • New Test CI workflow mirroring the existing Build/Lint jobs.

Deploy notes

  1. Apply the schema column to prod before deploy (additive, nullable, safe to run early):
    ALTER TABLE pull_requests ADD COLUMN IF NOT EXISTS updated_at TIMESTAMPTZ;
  2. First night after deploy does one full pass (every updated_at starts null → gates fail safe); steady state (night 2+) re-fetches only changed PRs.

Verification (post-deploy)

After ≥2 nights, the 00:10–02:xx UTC graphql_rate_limit flood for the two affected installations should collapse toward zero, and [backfill-summary] should show high *_skipped counts for the firehose repos.

Notes

  • Worker concurrency stays at 5: the rate-limit deferral self-heals secondary-limit hits (it backs off using the retry-after/reset headers), so lowering concurrency would be redundant prevention with a small throughput cost on real-time webhook jobs.
  • Follow-up (not in this PR): page the backfill by updatedAt-desc with early-stop so paging cost becomes independent of window size — would let us run a large safety-net window cheaply. ~100 LOC, needs new per-repo last_backfilled_at state.
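
For the follow-up idea, a rough sketch of updatedAt-desc paging with early stop, assuming the listing can be ordered by update time, newest first. `FetchPage`, `PrPage`, `PrSummary`, and `collectChangedSince` are hypothetical names; `lastBackfilledAt` corresponds to the proposed per-repo state.

```typescript
interface PrSummary { number: number; updatedAt: string }
interface PrPage { prs: PrSummary[]; nextCursor: string | null }
// Hypothetical page fetcher; must return PRs sorted by updatedAt, newest first.
type FetchPage = (cursor: string | null) => Promise<PrPage>;

// Walk pages until the first PR that has not changed since the last backfill.
// Because of the sort order, everything after it is also unchanged.
async function collectChangedSince(
  fetchPage: FetchPage,
  lastBackfilledAt: Date,
): Promise<PrSummary[]> {
  const changed: PrSummary[] = [];
  let cursor: string | null = null;
  do {
    const page: PrPage = await fetchPage(cursor);
    for (const pr of page.prs) {
      if (new Date(pr.updatedAt).getTime() <= lastBackfilledAt.getTime()) {
        return changed; // early stop: no further pages are requested
      }
      changed.push(pr);
    }
    cursor = page.nextCursor;
  } while (cursor !== null);
  return changed;
}
```

With this shape the number of pages fetched depends on how many PRs changed, not on how many fall inside the window.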

anderdc added 3 commits June 23, 2026 22:04
… rate limits

The nightly backfill re-fetched every PR in a 40-day window every night,
exhausting the per-installation GitHub GraphQL budget for high-PR-volume
accounts and flooding logs for ~30-40 min nightly (confirmed: 100% of
rate-limit lines fall in the post-enqueue window, driven by two installations).

- Incremental: skip the expensive PR_FILES job when content is already stored
  and head/base SHA are unchanged; skip PR_METADATA when GitHub's updatedAt is
  unchanged (new nullable pull_requests.updated_at column). Both gates fail safe.
- Defer instead of fail: a GraphQL 200 carrying RATE_LIMIT is now a typed
  GitHubRateLimitError; the processor moveToDelayed()s the job (no attempt
  consumed, worker slot freed) instead of halving the batch in a tight retry
  loop that produced ~5 log lines per failed batch.
- Reduce default backfill window 40 -> 10 days; worker concurrency 5 -> 4.
- Add greppable [backfill-summary] and [rate-limit-defer] log lines.
- Extract the detection/gating logic to pure modules; add jest + unit tests
  and a Test CI workflow.

The rate-limit deferral (moveToDelayed) self-heals secondary-limit hits, so
the concurrency reduction was redundant prevention with a small throughput
cost on real-time webhook jobs.
@entrius entrius merged commit 35aa68b into test Jun 24, 2026
3 checks passed
@entrius entrius deleted the fix/reduce-nightly-backfill-window branch June 24, 2026 03:32
anderdc added a commit that referenced this pull request Jun 24, 2026
PR #195 dropped DEFAULT_BACKFILL_DAYS 40->10 but .env.example kept the stale 40.
entrius pushed a commit that referenced this pull request Jun 24, 2026
…#196)

* fix(backfill): anchor nightly backfill to a fixed wall clock via cron

The nightly backfill used setInterval(24h) registered in onModuleInit, so its
fire time was boot + N*24h — anchored to whatever time the process last
restarted. Every redeploy silently moved the window (most recently to ~03:35
UTC), making the schedule unpredictable and the verification window a moving
target.

Switch to @nestjs/schedule's @Cron (ScheduleModule is already wired up and used
by MaintainerPopulateService). Default '10 0 * * *' in America/Chicago = 12:10am
local, stable across redeploys, with timeZone handling the CST/CDT shift so it
stays at local midnight year-round. Overridable via NIGHTLY_BACKFILL_CRON /
NIGHTLY_BACKFILL_TZ; the :10 offset preserves the prior 00:10 stagger.

Behavior preserved: still no run-at-startup, NIGHTLY_BACKFILL_ENABLED still
disables it, and the static per-repo jobId still dedupes overlapping ticks.
Replaces the removed NIGHTLY_BACKFILL_INTERVAL_MS knob in .env.example.
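
The drift described above can be illustrated with a small calculation. This is only a model of a 24h interval timer, not the service's code; `nextIntervalFire` and the boot times are made up for the example.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Interval model: the first tick is one full period after boot, then one
// tick per period. The fire time is therefore anchored to the boot time.
function nextIntervalFire(bootMs: number, nowMs: number, periodMs: number = DAY_MS): number {
  const elapsedPeriods = Math.floor((nowMs - bootMs) / periodMs);
  return bootMs + (elapsedPeriods + 1) * periodMs;
}

// Illustrative timestamps (UTC); months are zero-based, so 5 is June.
const bootA = Date.UTC(2026, 5, 20, 0, 10); // process started at 00:10
const bootB = Date.UTC(2026, 5, 23, 3, 35); // same service redeployed at 03:35
const now = Date.UTC(2026, 5, 24, 12, 0);

console.log(new Date(nextIntervalFire(bootA, now)).toISOString()); // 2026-06-25T00:10:00.000Z
console.log(new Date(nextIntervalFire(bootB, now)).toISOString()); // 2026-06-25T03:35:00.000Z
```

A cron expression removes the boot time from the calculation entirely, which is why the schedule stays put across redeploys.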

* refactor(backfill): hardcode cron schedule, trim comments

The cron expression and timezone won't realistically change, so drop the
NIGHTLY_BACKFILL_CRON / NIGHTLY_BACKFILL_TZ env indirection and make them plain
constants. Condense the surrounding comments.

* docs(env): sync NIGHTLY_BACKFILL_DAYS example to code default (40->10)

PR #195 dropped DEFAULT_BACKFILL_DAYS 40->10 but .env.example kept the stale 40.