Skip to content

fix(migrations): propagate pg_migrate.sh failure to exit code#828

Merged
rickyrombo merged 1 commit into
mainfrom
mp/fail-fast-migration-error
May 19, 2026
Merged

fix(migrations): propagate pg_migrate.sh failure to exit code#828
rickyrombo merged 1 commit into
mainfrom
mp/fail-fast-migration-error

Conversation

@rickyrombo
Copy link
Copy Markdown
Contributor

Summary

  • api/main.go:30 called ddl.RunMigrations() and discarded its returned error. When pg_migrate.sh exited non-zero, the process kept going and reached os.Exit(0) in case "migrate": — so a failed migration produced a successful-looking pod exit.
  • Check the error and os.Exit(1) immediately so the failure propagates out of the container. Affects every binary entrypoint (server, indexer, es-indexer, solana-indexer, migrate); none should run against a half-migrated DB.

Background

Discovered today (2026-05-19). The migration Job in audius-k8s#496 is supposed to block its dependent Deployments via Pulumi DependsOn whenever the migration fails. During an incident, the 0201 backfill was manually killed (pg_terminate_backend on the stuck session) and pg_migrate.sh exited with code 2, but the bridge Deployment rolled out anyway.

Tracing the failure path:

  1. pg_migrate.sh exited 2. ✓
  2. cmd.CombinedOutput() in ddl/run_migrations.go:19 returned *exec.ExitError. ✓
  3. RunMigrations() printed "Error running pg_migrate.sh: exit status 2" and returned the error. ✓
  4. main.go:30 ignored the return value. ✗ ← bug
  5. case "migrate": ran os.Exit(0). Container exited 0.
  6. K8s Job flipped to condition=Complete.
  7. Pulumi-kubernetes's Job awaiter resolved success.
  8. DependsOn unblocked → Deployments rolled.

K8s, Pulumi, and the Job spec all behaved correctly given the inputs they got — the API binary was reporting success on failure.

Test plan

  • Manually run bridge migrate against a DB with a deliberately-broken migration (e.g., temporarily put a syntax error in the latest ddl/migrations/*.sql). Confirm exit code is now non-zero.
  • Run bridge migrate against a clean DB. Confirm exit 0 and no behavior change.
  • After merge, watch the next auto-upgrader deploy: a failed bridge migrate invocation should leave the migration Job in condition=Failed and block the Deployment rollout (per the audius-k8s side's DependsOn).

🤖 Generated with Claude Code

The `migrate` subcommand (and every other binary entrypoint) called
ddl.RunMigrations() and discarded its returned error, so when
pg_migrate.sh exited non-zero the process still reached os.Exit(0) in
the switch's `case "migrate":` arm.

This silently broke the K8s migration Job's success contract: a failed
migration ended in a successful-looking pod, the Job flipped to
condition=Complete, the dependent serving Deployments' DependsOn
unblocked, and pods rolled out against a half-migrated DB.

Discovered today (2026-05-19) when a manually-killed 0201 backfill exited
2 but the bridge Deployment rolled anyway. The k8s/Pulumi side was
behaving correctly given the inputs it got — the API binary was just
reporting success on failure.

Check the error and os.Exit(1) at the migration site, before the switch.
Affects every command (server, indexer, es-indexer, solana-indexer,
migrate) — none of them should run against a half-migrated DB.
@rickyrombo rickyrombo merged commit ff5ea5e into main May 19, 2026
5 checks passed
@rickyrombo rickyrombo deleted the mp/fail-fast-migration-error branch May 19, 2026 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant