fix(backfill): make nightly backfill incremental and defer on GraphQL rate limits #195
Merged
… rate limits

The nightly backfill re-fetched every PR in a 40-day window every night, exhausting the per-installation GitHub GraphQL budget for high-PR-volume accounts and flooding logs for ~30-40 min nightly (confirmed: 100% of rate-limit lines fall in the post-enqueue window, driven by two installations).

- Incremental: skip the expensive PR_FILES job when content is already stored and head/base SHA are unchanged; skip PR_METADATA when GitHub's updatedAt is unchanged (new nullable pull_requests.updated_at column). Both gates fail safe.
- Defer instead of fail: a GraphQL 200 carrying RATE_LIMIT is now a typed GitHubRateLimitError; the processor moveToDelayed()s the job (no attempt consumed, worker slot freed) instead of halving the batch in a tight retry loop that produced ~5 log lines per failed batch.
- Reduce default backfill window 40 -> 10 days; worker concurrency 5 -> 4.
- Add greppable [backfill-summary] and [rate-limit-defer] log lines.
- Extract the detection/gating logic to pure modules; add jest + unit tests and a Test CI workflow.
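The two fail-safe gates described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the function names `needsContentRefresh` and `needsMetadataRefresh` come from the PR description, but the `StoredPr` and `FetchedPr` shapes and their field names are assumptions made for illustration.

```typescript
// Hypothetical row shape for what is already persisted (field names are assumptions).
interface StoredPr {
  scoringDataStored: boolean;
  headSha: string | null;
  baseSha: string | null;
  updatedAt: string | null; // ISO timestamp; null for rows written before the column existed
}

// Hypothetical shape of the lightweight PR listing returned by GitHub.
interface FetchedPr {
  headSha: string;
  baseSha: string;
  updatedAt: string; // ISO timestamp
}

// Content gate: refresh unless content is stored AND both SHAs are known and unchanged.
function needsContentRefresh(stored: StoredPr | null, fetched: FetchedPr): boolean {
  if (!stored || !stored.scoringDataStored) return true; // nothing stored yet
  if (!stored.headSha || !stored.baseSha) return true; // unknown state: fail safe
  return stored.headSha !== fetched.headSha || stored.baseSha !== fetched.baseSha;
}

// Metadata gate: refresh unless the stored updatedAt is known and equal to GitHub's.
function needsMetadataRefresh(stored: StoredPr | null, fetched: FetchedPr): boolean {
  if (!stored || !stored.updatedAt) return true; // null column: fail safe
  const storedMs = Date.parse(stored.updatedAt);
  const fetchedMs = Date.parse(fetched.updatedAt);
  if (Number.isNaN(storedMs) || Number.isNaN(fetchedMs)) return true; // unparseable: fail safe
  return storedMs !== fetchedMs;
}
```

"Fail safe" here means that any missing or unparseable stored value results in a re-fetch, so the gates can only skip work when there is positive evidence that nothing changed.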
The rate-limit deferral (moveToDelayed) self-heals secondary-limit hits, so the concurrency reduction was redundant prevention with a small throughput cost on real-time webhook jobs.
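As an illustration of the deferral path, here is a minimal sketch under stated assumptions. `GitHubRateLimitError` is named in this PR, but its `retryAfterMs` field, the fallback and cap constants, and the helper names `deferUntil` and `deferIfRateLimited` are hypothetical. `DelayableJob` is a structural stand-in for the part of a BullMQ `Job` that the sketch uses, so the block stays dependency-free.

```typescript
// Typed error named in the PR; carrying retryAfterMs on it is an assumption of this sketch.
class GitHubRateLimitError extends Error {
  readonly retryAfterMs?: number;
  constructor(message: string, retryAfterMs?: number) {
    super(message);
    Object.setPrototypeOf(this, new.target.prototype); // keep instanceof working on older targets
    this.name = "GitHubRateLimitError";
    this.retryAfterMs = retryAfterMs;
  }
}

const FALLBACK_DELAY_MS = 60_000; // hypothetical default when no retry hint is available
const MAX_DELAY_MS = 60 * 60_000; // hypothetical cap: never defer more than one hour

// Pure helper: the timestamp to defer to, or null if the error is not a rate limit.
function deferUntil(err: unknown, nowMs: number): number | null {
  if (!(err instanceof GitHubRateLimitError)) return null;
  const hinted = err.retryAfterMs;
  const delay =
    typeof hinted === "number" && hinted > 0 ? Math.min(hinted, MAX_DELAY_MS) : FALLBACK_DELAY_MS;
  return nowMs + delay;
}

// Structural stand-in for the subset of a BullMQ Job used below.
interface DelayableJob {
  moveToDelayed(timestamp: number, token?: string): Promise<void>;
}

// Returns true if the job was moved to the delayed set, false if the error is unrelated.
async function deferIfRateLimited(
  job: DelayableJob,
  token: string | undefined,
  err: unknown,
): Promise<boolean> {
  const until = deferUntil(err, Date.now());
  if (until === null) return false;
  await job.moveToDelayed(until, token);
  return true;
}
```

In a real BullMQ processor, the caller would throw `DelayedError` (exported by `bullmq`) after `deferIfRateLimited` returns `true`, so the worker treats the job as delayed rather than completed or failed.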
entrius approved these changes on Jun 24, 2026
anderdc added a commit that referenced this pull request on Jun 24, 2026
PR #195 dropped DEFAULT_BACKFILL_DAYS 40->10 but .env.example kept the stale 40.
entrius pushed a commit that referenced this pull request on Jun 24, 2026
…#196)

* fix(backfill): anchor nightly backfill to a fixed wall clock via cron

  The nightly backfill used setInterval(24h) registered in onModuleInit, so its fire time was boot + N*24h, anchored to whatever time the process last restarted. Every redeploy silently moved the window (most recently to ~03:35 UTC), making the schedule unpredictable and the verification window a moving target.

  Switch to @nestjs/schedule's @Cron (ScheduleModule is already wired up and used by MaintainerPopulateService). Default '10 0 * * *' in America/Chicago = 12:10am local, stable across redeploys, with timeZone handling the CST/CDT shift so it stays at local midnight year-round. Overridable via NIGHTLY_BACKFILL_CRON / NIGHTLY_BACKFILL_TZ; the :10 offset preserves the prior 00:10 stagger.

  Behavior preserved: still no run-at-startup, NIGHTLY_BACKFILL_ENABLED still disables it, and the static per-repo jobId still dedupes overlapping ticks. Replaces the removed NIGHTLY_BACKFILL_INTERVAL_MS knob in .env.example.

* refactor(backfill): hardcode cron schedule, trim comments

  The cron expression and timezone won't realistically change, so drop the NIGHTLY_BACKFILL_CRON / NIGHTLY_BACKFILL_TZ env indirection and make them plain constants. Condense the surrounding comments.

* docs(env): sync NIGHTLY_BACKFILL_DAYS example to code default (40->10)

  PR #195 dropped DEFAULT_BACKFILL_DAYS 40->10 but .env.example kept the stale 40.
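The drift described in the first commit can be made concrete with a small calculation. The sketch below is illustrative only: both function names are hypothetical, and the wall-clock variant is simplified to UTC, whereas the commit uses a cron expression evaluated in America/Chicago by `@nestjs/schedule`.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// setInterval-style schedule: ticks at boot + N * period, so the fire time follows the boot time.
function nextIntervalFire(bootMs: number, nowMs: number, periodMs: number = DAY_MS): number {
  const elapsed = Math.max(0, nowMs - bootMs);
  const ticksSoFar = Math.floor(elapsed / periodMs);
  return bootMs + (ticksSoFar + 1) * periodMs;
}

// Fixed wall-clock schedule (UTC for simplicity): independent of when the process booted.
function nextDailyFireUtc(nowMs: number, hour: number, minute: number): number {
  const d = new Date(nowMs);
  const todayFire = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate(), hour, minute);
  return todayFire > nowMs ? todayFire : todayFire + DAY_MS;
}

// Two different redeploy times observed at the same "now".
const now = Date.UTC(2026, 5, 24, 12, 0);
const bootA = Date.UTC(2026, 5, 24, 3, 35);
const bootB = Date.UTC(2026, 5, 24, 9, 0);

console.log(new Date(nextIntervalFire(bootA, now)).toISOString()); // 2026-06-25T03:35:00.000Z
console.log(new Date(nextIntervalFire(bootB, now)).toISOString()); // 2026-06-25T09:00:00.000Z
console.log(new Date(nextDailyFireUtc(now, 0, 10)).toISOString()); // 2026-06-25T00:10:00.000Z
```

The interval-based fire time moves with each redeploy, while the wall-clock fire time is the same for both boots.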
Problem
The nightly backfill (`RepoBackfillScheduleService`, daily) re-fetched every PR in a 40-day window from scratch every night, enqueuing a `PR_METADATA` + `PR_FILES` (content) job per PR. For high-PR-volume GitHub accounts this exhausted the per-installation GraphQL budget (5k points/hr) and flooded logs.

Evidence (prod `ghm-das`, 7-day buffer): 100% of GraphQL `RATE_LIMIT` log lines fall in the post-enqueue window (00:10–02:xx UTC); the other ~22h/day are clean. Driven entirely by two installations (the two largest accounts). Rate-limit + `Halving` lines were ~16% of all log volume, amplified ~5× per failed batch by the `50→25→12→6→5` retry ladder. Not webhook- or reconcile-driven.

Changes
Incremental backfill: the same single nightly backfill, now only re-fetching what changed:

- Skip the `PR_FILES` content fetch when content is already stored (`scoring_data_stored`) and head+base SHA are unchanged (the file diff is fully determined by head+base SHA).
- Skip `PR_METADATA` when GitHub's PR `updatedAt` is unchanged (new nullable `pull_requests.updated_at` column).

Rate-limit deferral instead of failure:
- A GraphQL `RATE_LIMIT` / `SECONDARY_RATE_LIMIT` / `graphql_rate_limit` is now surfaced as a typed `GitHubRateLimitError` (these bypass HTTP-status handling since GraphQL returns 200).
- The processor calls `job.moveToDelayed(now + retryAfterMs, token)` + `DelayedError`, re-queuing without consuming a retry attempt and freeing the worker slot. No more halving an already-exhausted budget in a tight loop.

Tuning & observability:
- Default backfill window reduced `40 → 10` days.
- Added greppable `[backfill-summary]` (per-repo enqueued-vs-skipped tallies) and `[rate-limit-defer]` log lines.

Tests & CI:
- Detection/gating logic extracted to pure modules with jest unit tests (`isGraphQLRateLimit`, `needsContentRefresh`, `needsMetadataRefresh`).
- `Test` CI workflow mirroring the existing `Build`/`Lint` jobs.

Deploy notes
- … (`updated_at` starts null → gates fail safe); steady state (night 2+) re-fetches only changed PRs.

Verification (post-deploy)
After ≥2 nights, the 00:10–02:xx UTC `graphql_rate_limit` flood for the two affected installations should collapse toward zero, and `[backfill-summary]` should show high `*_skipped` counts for the firehose repos.

Notes
- … retry-after/reset headers), so lowering concurrency would be redundant prevention with a small throughput cost on real-time webhook jobs.
- … `updatedAt`-desc with early-stop so paging cost becomes independent of window size; this would let us run a large safety-net window cheaply. ~100 LOC, needs new per-repo `last_backfilled_at` state.
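For reference, the detection step listed under Changes has to inspect the response body, because a rate-limited GraphQL call still returns HTTP 200. The following is a minimal sketch, not the PR's implementation: `isGraphQLRateLimit` is the name used in the PR, but the body shape and the choice of which error fields to inspect (`type`, `code`, `extensions.code`) are assumptions, as is the case-insensitive substring match on `rate_limit`.

```typescript
// Assumed shape of a GraphQL error entry; real responses may carry other fields.
interface GraphQLErrorLike {
  type?: string;
  code?: string;
  extensions?: { code?: string };
}

// Assumed shape of a GraphQL response body delivered with HTTP 200.
interface GraphQLBodyLike {
  data?: unknown;
  errors?: GraphQLErrorLike[];
}

// True if any error entry carries a marker containing "rate_limit" (case-insensitive),
// which covers RATE_LIMIT, SECONDARY_RATE_LIMIT, and graphql_rate_limit.
function isGraphQLRateLimit(body: GraphQLBodyLike | null | undefined): boolean {
  const errors = body?.errors;
  if (!Array.isArray(errors)) return false;
  return errors.some((e) =>
    [e?.type, e?.code, e?.extensions?.code].some(
      (marker) => typeof marker === "string" && marker.toLowerCase().includes("rate_limit"),
    ),
  );
}
```

Keeping the check as a pure function over the parsed body is what makes it straightforward to cover with unit tests, independent of the HTTP client.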