perf(migrations): speed up reward disbursements backfill #829

Merged
rickyrombo merged 1 commit into main from mp/speed-up-reward-disbursements-backfill
May 19, 2026

@rickyrombo
Contributor

Summary

  • Adds two CREATE INDEX CONCURRENTLY statements at the top of 0201_backfill_missing_reward_disbursements.sql (outside the BEGIN/COMMIT):
    • sol_reward_disbursements (challenge_id, specifier) — lets the dedup LEFT JOIN find existing rows by index instead of a per-row sequential scan.
    • sol_claimable_accounts (ethereum_address, mint, slot DESC) — supports the "latest claimable account per wallet" lookup pattern (used by this migration and the live reward_manager indexer).
  • Replaces the per-row LATERAL subquery with a WITH user_banks AS MATERIALIZED CTE that pre-computes DISTINCT ON (ethereum_address) once and hash-joins against the result.
  • Sets session_replication_role = replica via SET LOCAL inside the backfill transaction to suppress the on_sol_reward_disbursement trigger, which fires per row to create challenge_reward notifications and call pg_notify. A one-shot backfill of months-old historical rewards should not spam users, and the trigger work was a meaningful share of the per-row cost.
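
The index and trigger changes above can be sketched roughly as follows. The index names are taken from the test plan further down; the exact statements in 0201 may differ, and the INSERT body is elided here.

```sql
-- Run outside any transaction: CONCURRENTLY cannot execute inside BEGIN/COMMIT.
CREATE INDEX CONCURRENTLY IF NOT EXISTS sol_reward_disbursements_challenge_specifier_idx
    ON sol_reward_disbursements (challenge_id, specifier);

CREATE INDEX CONCURRENTLY IF NOT EXISTS sol_claimable_accounts_eth_mint_slot_idx
    ON sol_claimable_accounts (ethereum_address, mint, slot DESC);

BEGIN;
-- LOCAL scopes the setting to this transaction; it reverts at COMMIT or ROLLBACK,
-- so other sessions keep firing on_sol_reward_disbursement as usual.
SET LOCAL session_replication_role = replica;

-- ... backfill INSERT ... ON CONFLICT (signature, instruction_index) DO NOTHING ...

COMMIT;
```

One caveat worth keeping in mind: session_replication_role = replica disables every trigger not marked ENABLE REPLICA or ENABLE ALWAYS, including the internal triggers that enforce foreign keys, and by default only a superuser can set it.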

Why

The 0201 backfill is taking over an hour against prod. Diagnosis:

  1. The LEFT JOIN on (challenge_id, specifier) had no index — sol_reward_disbursements is keyed by (signature, instruction_index), and the only other indexes (from 0198) are on recipient_eth_address and created_at.
  2. The LATERAL against sol_claimable_accounts reran ORDER BY slot DESC LIMIT 1 per row.
  3. The row-level trigger added DB work and unwanted historical notifications.

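With the new index alone, the LEFT JOIN goes from O(n×m) to O(n log m), where n is the number of challenge_disbursements rows and m the number of sol_reward_disbursements rows. Turning the trigger off removes its per-row queries, and the CTE replaces one ORDER BY slot DESC LIMIT 1 probe per row with a single pre-computed lookup. Expected runtime: well under a minute, vs >1h currently.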
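
A self-contained toy version of the CTE pattern, runnable in any scratch PostgreSQL database (12 or later, for the MATERIALIZED keyword). The demo_* tables, the account column, and the 'mintA' filter are illustrative assumptions, not the real schema; only the DISTINCT ON (ethereum_address) shape ordered by slot DESC comes from this PR.

```sql
-- Toy stand-ins for sol_claimable_accounts and the rows being backfilled.
CREATE TEMP TABLE demo_claimable_accounts (
    ethereum_address text,
    mint             text,
    account          text,   -- hypothetical column name
    slot             bigint
);
CREATE TEMP TABLE demo_candidates (
    challenge_id text,
    specifier    text,
    wallet       text        -- hypothetical column name
);

INSERT INTO demo_claimable_accounts VALUES
    ('0xaaa', 'mintA', 'bank_old', 100),
    ('0xaaa', 'mintA', 'bank_new', 200),
    ('0xbbb', 'mintA', 'bank_b',   150);
INSERT INTO demo_candidates VALUES
    ('c1', 's1', '0xaaa'),
    ('c2', 's2', '0xbbb'),
    ('c3', 's3', '0xccc');

-- Compute the latest account per wallet once, then join against the result
-- instead of re-running ORDER BY slot DESC LIMIT 1 for every candidate row.
WITH user_banks AS MATERIALIZED (
    SELECT DISTINCT ON (ethereum_address)
           ethereum_address,
           account
      FROM demo_claimable_accounts
     WHERE mint = 'mintA'
     ORDER BY ethereum_address, slot DESC
)
SELECT c.challenge_id, c.specifier, ub.account
  FROM demo_candidates c
  LEFT JOIN user_banks ub ON ub.ethereum_address = c.wallet
 ORDER BY c.challenge_id;
-- c1 | s1 | bank_new
-- c2 | s2 | bank_b
-- c3 | s3 | (null)
```

Prefixing the final SELECT with EXPLAIN shows whether the planner actually picks a hash join against the CTE on a given data set.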

Migration idempotency

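  • CREATE INDEX CONCURRENTLY IF NOT EXISTS: safe to re-run when the index already exists and is valid (the statement is a no-op). IF NOT EXISTS matches by name only, so an invalid index left behind by a previously failed CONCURRENTLY run is also skipped and must be removed manually with DROP INDEX first.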
  • INSERT … ON CONFLICT (signature, instruction_index) DO NOTHING — unchanged; safe on re-run.
  • Since the migration was never committed in prod (the in-flight one is what we're killing), changing the SQL body just bumps the md5 in pg_migrate.sh's check; the next deploy will run the new shape.

Test plan

  • Cancel the in-flight 0201 backfill (pg_cancel_backend(<pid>) on the stuck session).
  • Confirm neither index already exists as invalid: SELECT c.relname AS indexname, i.indisvalid FROM pg_index i JOIN pg_class c ON c.oid = i.indexrelid WHERE c.relname IN ('sol_reward_disbursements_challenge_specifier_idx', 'sol_claimable_accounts_eth_mint_slot_idx'); Drop any invalid ones.
  • Deploy via the migration Job; expect the Job to complete in seconds rather than hours.
  • Verify recovered row count: SELECT COUNT(*) FROM challenge_disbursements cd LEFT JOIN sol_reward_disbursements rd ON rd.challenge_id = cd.challenge_id AND rd.specifier = cd.specifier WHERE rd.signature IS NULL AND cd.slot > 355300886; — should drop from ~29k toward 0 (modulo the no-current-user bucket which is intentionally not recoverable).

🤖 Generated with Claude Code

The 0201 backfill is taking over an hour in prod. Three structural
issues account for the slowdown:

1. The dedup LEFT JOIN on (challenge_id, specifier) has no index.
   sol_reward_disbursements is keyed by (signature, instruction_index)
   and only indexed on recipient_eth_address and created_at. The join
   degenerates to a sequential scan per challenge_disbursements row.

2. The LATERAL subquery against sol_claimable_accounts re-runs an
   "ORDER BY slot DESC LIMIT 1" filter per row, without an index on
   (ethereum_address, mint).

3. The on_sol_reward_disbursement trigger fires for every insert,
   doing three SELECTs and possibly an INSERT into notification — 29k
   rows × that overhead is significant, and notifying users about
   months-old historical rewards is undesirable anyway.

Fixes:

- Add sol_reward_disbursements (challenge_id, specifier) index. Useful
  permanently, not just for this migration. CREATE CONCURRENTLY so the
  live indexer's writes aren't blocked; moved outside the BEGIN/COMMIT
  since CONCURRENTLY can't run inside an explicit transaction (psql
  runs each statement in its own implicit tx when not wrapped).

- Add sol_claimable_accounts (ethereum_address, mint, slot DESC) index.
  Same reasoning — the live indexer also benefits from this lookup
  shape for user_bank resolution.

- Replace the per-row LATERAL with a MATERIALIZED CTE that pre-computes
  DISTINCT ON (ethereum_address) once, then hash-joins. One indexed
  scan instead of N LATERAL invocations.

- SET LOCAL session_replication_role = replica inside the backfill
  transaction to suppress on_sol_reward_disbursement. LOCAL keeps the
  setting scoped to this transaction so concurrent indexer writes
  still fire the trigger normally.

Both index creations use IF NOT EXISTS so re-running is safe; the
backfill INSERT is already idempotent via ON CONFLICT DO NOTHING.
@rickyrombo rickyrombo merged commit 89c8794 into main May 19, 2026
5 checks passed
@rickyrombo rickyrombo deleted the mp/speed-up-reward-disbursements-backfill branch May 19, 2026 16:45
rickyrombo added a commit that referenced this pull request May 19, 2026
## Summary
- Switches `0201_backfill_missing_reward_disbursements.sql` from `CREATE
INDEX CONCURRENTLY` to plain `CREATE INDEX` inside the migration's
`BEGIN/COMMIT`.
- Both indexes (`sol_reward_disbursements (challenge_id, specifier)` and
`sol_claimable_accounts (ethereum_address, mint, slot DESC)`) are now
atomic with the backfill INSERT — if anything fails, the schema rolls
back cleanly.

## Why
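`CREATE INDEX CONCURRENTLY` waits on `virtualxid` locks between its build
phases. Before each table scan it waits for transactions that hold
conflicting locks on the target table; before finishing it waits for
every transaction in the database whose snapshot predates the second
scan, whether or not that transaction touches the target table.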

The legacy Python `index_rewards_manager` Celery task on
discovery-provider keeps ~3-minute transactions open against
`challenge_disbursements` continuously. As fast as one ends, another is
already open. So the CONCURRENTLY build can wait indefinitely without
ever seeing a quiet moment — and it did, for 10+ minutes blocked on
`Lock/virtualxid` in tonight's deploy.
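
One way to confirm this kind of stall is to inspect `pg_stat_activity` while
the build is waiting. A minimal sketch using only standard catalog columns;
the 80-character truncation is arbitrary.

```sql
-- Sessions currently blocked on a virtualxid lock (e.g. a CONCURRENTLY build).
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       left(query, 80)     AS query
  FROM pg_stat_activity
 WHERE wait_event_type = 'Lock'
   AND wait_event = 'virtualxid';

-- Open transactions, oldest first: candidates for what the build is waiting on.
SELECT pid,
       application_name,
       state,
       now() - xact_start AS xact_age,
       left(query, 80)    AS query
  FROM pg_stat_activity
 WHERE xact_start IS NOT NULL
   AND pid <> pg_backend_pid()
 ORDER BY xact_start;
```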

Trade-off accepted: regular `CREATE INDEX` takes a `ShareLock` on the
target table for the duration of the build, blocking writes. But both
target tables are written only by the Go indexer, and only on
reward_manager `EvaluateAttestations` and claimable token `Create`
instructions — sparse on-chain. At current row counts each build
completes in seconds; the blocked writes just queue on pgxpool and
resume right after.
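
A sketch of the transactional shape described above, assuming the same index
names as in the test plan below. The `lock_timeout` guard is an illustrative
addition, not something this PR states, and the INSERT body is elided.

```sql
BEGIN;

-- Hypothetical guard: abort instead of queueing indefinitely behind another
-- session's lock on the target table.
SET LOCAL lock_timeout = '30s';

-- Plain CREATE INDEX takes a ShareLock on the table: reads continue, writes
-- wait until this transaction commits or rolls back.
CREATE INDEX IF NOT EXISTS sol_reward_disbursements_challenge_specifier_idx
    ON sol_reward_disbursements (challenge_id, specifier);

CREATE INDEX IF NOT EXISTS sol_claimable_accounts_eth_mint_slot_idx
    ON sol_claimable_accounts (ethereum_address, mint, slot DESC);

-- ... backfill INSERT ... ON CONFLICT (signature, instruction_index) DO NOTHING ...

COMMIT;
```

If the timeout fires, the whole transaction rolls back, so no partial index or
partial backfill is left behind and the migration can simply be retried.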

## Test plan
- [ ] Cancel any in-flight 0201 attempt and drop any invalid index it
left behind:
      ```sql
      SELECT pg_cancel_backend(pid) FROM pg_stat_activity
       WHERE query ILIKE 'CREATE INDEX CONCURRENTLY%';
      DROP INDEX IF EXISTS sol_reward_disbursements_challenge_specifier_idx;
      DROP INDEX IF EXISTS sol_claimable_accounts_eth_mint_slot_idx;
      ```
- [ ] Roll the new image; migration Job's `bridge migrate` should
complete in well under a minute.
- [ ] Verify both indexes exist as `indisvalid = true`:
      ```sql
      SELECT indexrelid::regclass, indisvalid FROM pg_index
       WHERE indexrelid::regclass::text IN (
         'sol_reward_disbursements_challenge_specifier_idx',
         'sol_claimable_accounts_eth_mint_slot_idx'
       );
      ```
- [ ] Verify missing-row count drops as expected (per #829's test plan).

🤖 Generated with [Claude Code](https://claude.com/claude-code)