fix(backfill): make nightly backfill incremental and defer on GraphQL rate limits by anderdc · Pull Request #195 · entrius/das-github-mirror

anderdc · 2026-06-24T03:05:05Z

Problem

The nightly backfill (RepoBackfillScheduleService, daily) re-fetched every PR in a 40-day window from scratch every night, enqueuing a PR_METADATA job and a PR_FILES (content) job per PR. For high-PR-volume GitHub accounts this exhausted the per-installation GraphQL budget (5k points/hr) and flooded the logs.

Evidence (prod ghm-das, 7-day buffer): 100% of GraphQL RATE_LIMIT log lines fall in the post-enqueue window (00:10–02:xx UTC); the other ~22h/day are clean. Driven entirely by two installations (the two largest accounts). Rate-limit + Halving lines were ~16% of all log volume, amplified ~5× per failed batch by the 50→25→12→6→5 retry ladder. Not webhook- or reconcile-driven.
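For reference, the ~5× amplification follows from the length of the retry ladder. A minimal sketch that reproduces the 50→25→12→6→5 sequence, assuming the rule is "halve with integer division, never below 5" (the rule and the `halvingLadder` name are inferred from the ladder quoted above, not taken from the code):

```typescript
// Hypothetical helper: lists the batch sizes tried for one failing batch,
// assuming each retry halves the size (rounded down) with a minimum of `floor`.
function halvingLadder(start: number, floor: number): number[] {
  const sizes: number[] = [start];
  let size = start;
  while (size > floor) {
    size = Math.max(floor, Math.floor(size / 2));
    sizes.push(size);
  }
  return sizes;
}

const ladder = halvingLadder(50, 5); // [50, 25, 12, 6, 5]
console.log(ladder.join(" -> "), `(${ladder.length} attempts)`);
```

Under that assumption each failed batch is attempted five times against an already-exhausted budget, which lines up with the roughly five log lines per failed batch.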

Changes

Incremental backfill — the same single nightly backfill, now only re-fetching what changed:

Skip the expensive PR_FILES content fetch when content is already stored (scoring_data_stored) and head+base SHA are unchanged (file diff is fully determined by head+base SHA).
Skip PR_METADATA when GitHub's PR updatedAt is unchanged (new nullable pull_requests.updated_at column).
Both gates fail safe: new PR / missing content / moved SHA / null timestamp → re-fetch.
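The two gates can be sketched roughly as below. The function names match the extracted modules, but the `StoredPr` / `RemotePr` shapes and field names are assumptions made for illustration, not the repository's actual types:

```typescript
// Hypothetical row/response shapes; field names are illustrative only.
interface StoredPr {
  scoringDataStored: boolean; // content already stored (scoring_data_stored)
  headSha: string | null;
  baseSha: string | null;
  updatedAt: Date | null; // pull_requests.updated_at, nullable
}

interface RemotePr {
  headSha: string;
  baseSha: string;
  updatedAt: string; // ISO-8601 timestamp as returned by GitHub
}

// PR_FILES gate: skip only when content is stored and both SHAs are unchanged.
function needsContentRefresh(stored: StoredPr | undefined, remote: RemotePr): boolean {
  if (!stored) return true; // new PR
  if (!stored.scoringDataStored) return true; // missing content
  return stored.headSha !== remote.headSha || stored.baseSha !== remote.baseSha; // moved SHA
}

// PR_METADATA gate: skip only when updatedAt is the same instant.
function needsMetadataRefresh(stored: StoredPr | undefined, remote: RemotePr): boolean {
  if (!stored || stored.updatedAt === null) return true; // new PR or null timestamp
  const remoteMs = Date.parse(remote.updatedAt);
  if (isNaN(remoteMs)) return true; // unparseable timestamp: fail safe
  return stored.updatedAt.getTime() !== remoteMs;
}
```

Every uncertain branch returns `true`, so a missing or unexpected value costs one extra fetch rather than a stale row. Comparing timestamps as instants (milliseconds) rather than as strings avoids false mismatches between formats such as `...05Z` and `...05.000Z`.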

Rate-limit deferral instead of failure:

A GraphQL 200 carrying RATE_LIMIT/SECONDARY_RATE_LIMIT/graphql_rate_limit is now surfaced as a typed GitHubRateLimitError (these bypass HTTP-status handling since GraphQL returns 200).
The processor catches it → job.moveToDelayed(now + retryAfterMs, token) + DelayedError, re-queuing without consuming a retry attempt and freeing the worker slot. No more halving an already-exhausted budget in a tight loop.
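A minimal sketch of the detect-then-defer flow described above. The response shape, the `code` field, and the `runWithDeferral` wrapper are assumptions for illustration. `DelayedError` here is a local stand-in so the snippet is self-contained; in a BullMQ worker it would be the `DelayedError` exported by `bullmq`, and `job` would be a real `Job`:

```typescript
// Hypothetical GraphQL body shape: errors arrive in the body of an HTTP 200.
interface GraphQLErrorEntry { type?: string; code?: string; message?: string }
interface GraphQLBody { errors?: GraphQLErrorEntry[] }

const RATE_LIMIT_MARKERS = ["RATE_LIMIT", "SECONDARY_RATE_LIMIT", "graphql_rate_limit"];

function isGraphQLRateLimit(body: GraphQLBody): boolean {
  return (body.errors || []).some(
    (e) =>
      RATE_LIMIT_MARKERS.indexOf(e.type || "") !== -1 ||
      RATE_LIMIT_MARKERS.indexOf(e.code || "") !== -1,
  );
}

class GitHubRateLimitError extends Error {
  retryAfterMs: number;
  constructor(retryAfterMs: number) {
    super(`GitHub GraphQL rate limit; retry after ${retryAfterMs}ms`);
    Object.setPrototypeOf(this, GitHubRateLimitError.prototype);
    this.name = "GitHubRateLimitError";
    this.retryAfterMs = retryAfterMs;
  }
}

// Stand-in for BullMQ's DelayedError (assumption: the real one is imported from "bullmq").
class DelayedError extends Error {
  constructor() {
    super("job moved to delayed");
    Object.setPrototypeOf(this, DelayedError.prototype);
    this.name = "DelayedError";
  }
}

// The only part of the BullMQ Job API this sketch relies on.
interface DeferrableJob {
  moveToDelayed(timestamp: number, token?: string): Promise<void>;
}

// Runs `work`; on a rate limit, re-queues the job for later instead of failing it.
async function runWithDeferral(
  job: DeferrableJob,
  token: string | undefined,
  work: () => Promise<void>,
  now: () => number = Date.now,
): Promise<void> {
  try {
    await work();
  } catch (err) {
    if (err instanceof GitHubRateLimitError) {
      await job.moveToDelayed(now() + err.retryAfterMs, token);
      throw new DelayedError(); // signals "already re-queued", so no attempt is consumed
    }
    throw err; // any other error keeps the normal retry path
  }
}
```

The injectable `now` parameter is only there to make the delay computation easy to unit test.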

Tuning & observability:

Default backfill window 40 → 10 days.
Greppable [backfill-summary] (per-repo enqueued-vs-skipped tallies) and [rate-limit-defer] log lines.

Tests & CI:

Detection/gating logic extracted to dependency-free modules (isGraphQLRateLimit, needsContentRefresh, needsMetadataRefresh).
21 unit tests (jest) covering detection, gating branches, and the processor deferral.
New Test CI workflow mirroring the existing Build/Lint jobs.

Deploy notes

Apply the schema column to prod before deploy (additive, nullable, safe to run early):
```
ALTER TABLE pull_requests ADD COLUMN IF NOT EXISTS updated_at TIMESTAMPTZ;
```
First night after deploy does one full pass (every updated_at starts null → gates fail safe); steady state (night 2+) re-fetches only changed PRs.

Verification (post-deploy)

After ≥2 nights, the 00:10–02:xx UTC graphql_rate_limit flood for the two affected installations should collapse toward zero, and [backfill-summary] should show high *_skipped counts for the firehose repos.

Notes

Worker concurrency stays at 5: the rate-limit deferral self-heals secondary-limit hits (it backs off using the retry-after/reset headers), so lowering concurrency would be redundant prevention with a small throughput cost on real-time webhook jobs.
Follow-up (not in this PR): page the backfill by updatedAt-desc with early-stop so paging cost becomes independent of window size — would let us run a large safety-net window cheaply. ~100 LOC, needs new per-repo last_backfilled_at state.
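To make the follow-up idea concrete, here is a rough sketch of updatedAt-desc paging with early-stop. Nothing here exists in this PR: `FetchPage`, `PrPage`, and `collectChangedSince` are hypothetical names, and the pages are assumed to arrive sorted by updatedAt descending:

```typescript
// Hypothetical shapes for a page of PRs sorted by updatedAt, newest first.
interface PrSummary { number: number; updatedAt: string }
interface PrPage { items: PrSummary[]; nextCursor: string | null }
type FetchPage = (cursor: string | null) => Promise<PrPage>;

// Collects PRs updated after `lastBackfilledAt`, stopping at the first older one.
async function collectChangedSince(
  fetchPage: FetchPage,
  lastBackfilledAt: Date,
): Promise<PrSummary[]> {
  const cutoffMs = lastBackfilledAt.getTime();
  const changed: PrSummary[] = [];
  let cursor: string | null = null;
  do {
    const page: PrPage = await fetchPage(cursor);
    for (const pr of page.items) {
      // Sorted newest-first, so everything after this PR is older as well.
      if (Date.parse(pr.updatedAt) <= cutoffMs) return changed;
      changed.push(pr);
    }
    cursor = page.nextCursor;
  } while (cursor !== null);
  return changed;
}
```

With this shape the number of pages fetched depends on how many PRs changed since the last run, not on the configured window, which is what would make a large safety-net window cheap.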

… rate limits

The nightly backfill re-fetched every PR in a 40-day window every night, exhausting the per-installation GitHub GraphQL budget for high-PR-volume accounts and flooding logs for ~30-40 min nightly (confirmed: 100% of rate-limit lines fall in the post-enqueue window, driven by two installations).

- Incremental: skip the expensive PR_FILES job when content is already stored and head/base SHA are unchanged; skip PR_METADATA when GitHub's updatedAt is unchanged (new nullable pull_requests.updated_at column). Both gates fail safe.
- Defer instead of fail: a GraphQL 200 carrying RATE_LIMIT is now a typed GitHubRateLimitError; the processor moveToDelayed()s the job (no attempt consumed, worker slot freed) instead of halving the batch in a tight retry loop that produced ~5 log lines per failed batch.
- Reduce default backfill window 40 -> 10 days; worker concurrency 5 -> 4.
- Add greppable [backfill-summary] and [rate-limit-defer] log lines.
- Extract the detection/gating logic to pure modules; add jest + unit tests and a Test CI workflow.

…#196)

* fix(backfill): anchor nightly backfill to a fixed wall clock via cron

The nightly backfill used setInterval(24h) registered in onModuleInit, so its fire time was boot + N*24h, anchored to whatever time the process last restarted. Every redeploy silently moved the window (most recently to ~03:35 UTC), making the schedule unpredictable and the verification window a moving target.

Switch to @nestjs/schedule's @Cron (ScheduleModule is already wired up and used by MaintainerPopulateService). Default '10 0 * * *' in America/Chicago = 12:10am local, stable across redeploys, with timeZone handling the CST/CDT shift so it stays at local midnight year-round. Overridable via NIGHTLY_BACKFILL_CRON / NIGHTLY_BACKFILL_TZ; the :10 offset preserves the prior 00:10 stagger.

Behavior preserved: still no run-at-startup, NIGHTLY_BACKFILL_ENABLED still disables it, and the static per-repo jobId still dedupes overlapping ticks. Replaces the removed NIGHTLY_BACKFILL_INTERVAL_MS knob in .env.example.

* refactor(backfill): hardcode cron schedule, trim comments

The cron expression and timezone won't realistically change, so drop the NIGHTLY_BACKFILL_CRON / NIGHTLY_BACKFILL_TZ env indirection and make them plain constants. Condense the surrounding comments.

* docs(env): sync NIGHTLY_BACKFILL_DAYS example to code default (40->10)

PR #195 dropped DEFAULT_BACKFILL_DAYS 40->10 but .env.example kept the stale 40.

anderdc added 3 commits June 23, 2026 22:04

refactor(backfill): trim comments in incremental-backfill helpers

32a60b6

revert(backfill): keep worker concurrency at 5

be2de2b

The rate-limit deferral (moveToDelayed) self-heals secondary-limit hits, so the concurrency reduction was redundant prevention with a small throughput cost on real-time webhook jobs.

entrius approved these changes Jun 24, 2026

entrius merged commit 35aa68b into test Jun 24, 2026
3 checks passed

entrius deleted the fix/reduce-nightly-backfill-window branch June 24, 2026 03:32

anderdc mentioned this pull request Jun 24, 2026

fix(backfill): anchor nightly backfill to a fixed wall clock via cron #196

Merged

anderdc added a commit that referenced this pull request Jun 24, 2026

docs(env): sync NIGHTLY_BACKFILL_DAYS example to code default (40->10)

fcbe4d0

PR #195 dropped DEFAULT_BACKFILL_DAYS 40->10 but .env.example kept the stale 40.

anderdc mentioned this pull request Jun 25, 2026

fix(backfill): compare PR updatedAt by instant so the metadata gate skips #197

Merged

