fix(migrations): propagate pg_migrate.sh failure to exit code by rickyrombo · Pull Request #828 · AudiusProject/api

rickyrombo · 2026-05-19T06:58:41Z

Summary

api/main.go:30 called ddl.RunMigrations() and discarded its returned error. When pg_migrate.sh exited non-zero, the process kept going and reached os.Exit(0) in case "migrate": — so a failed migration produced a successful-looking pod exit.
Check the error and os.Exit(1) immediately so the failure propagates out of the container. Affects every binary entrypoint (server, indexer, es-indexer, solana-indexer, migrate); none should run against a half-migrated DB.

Background

Discovered today (2026-05-19). The migration Job in audius-k8s#496 is supposed to block its dependent Deployments via Pulumi DependsOn whenever the migration fails. During an incident, the 0201 backfill was manually killed (pg_terminate_backend on the stuck session) and pg_migrate.sh exited with code 2, but the bridge Deployment rolled out anyway.

Tracing the failure path:

pg_migrate.sh exited 2. ✓
cmd.CombinedOutput() in ddl/run_migrations.go:19 returned *exec.ExitError. ✓
RunMigrations() printed "Error running pg_migrate.sh: exit status 2" and returned the error. ✓
main.go:30 ignored the return value. ✗ ← bug
case "migrate": ran os.Exit(0). Container exited 0.
K8s Job flipped to condition=Complete.
Pulumi-kubernetes's Job awaiter resolved success.
DependsOn unblocked → Deployments rolled.

K8s, Pulumi, and the Job spec all behaved correctly given the inputs they got — the API binary was reporting success on failure.

Test plan

Manually run bridge migrate against a DB with a deliberately-broken migration (e.g., temporarily put a syntax error in the latest ddl/migrations/*.sql). Confirm exit code is now non-zero.
Run bridge migrate against a clean DB. Confirm exit 0 and no behavior change.
After merge, watch the next auto-upgrader deploy: a failed bridge migrate invocation should leave the migration Job in condition=Failed and block the Deployment rollout (per the audius-k8s side's DependsOn).

🤖 Generated with Claude Code

The `migrate` subcommand (and every other binary entrypoint) called ddl.RunMigrations() and discarded its returned error, so when pg_migrate.sh exited non-zero the process still reached os.Exit(0) in the switch's `case "migrate":` arm. This silently broke the K8s migration Job's success contract: a failed migration ended in a successful-looking pod, the Job flipped to condition=Complete, the dependent serving Deployments' DependsOn unblocked, and pods rolled out against a half-migrated DB. Discovered today (2026-05-19) when a manually-killed 0201 backfill exited 2 but the bridge Deployment rolled anyway. The k8s/Pulumi side was behaving correctly given the inputs it got — the API binary was just reporting success on failure. Check the error and os.Exit(1) at the migration site, before the switch. Affects every command (server, indexer, es-indexer, solana-indexer, migrate) — none of them should run against a half-migrated DB.

rickyrombo merged commit ff5ea5e into main May 19, 2026
5 checks passed

rickyrombo deleted the mp/fail-fast-migration-error branch May 19, 2026 16:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(migrations): propagate pg_migrate.sh failure to exit code#828

fix(migrations): propagate pg_migrate.sh failure to exit code#828
rickyrombo merged 1 commit into
mainfrom
mp/fail-fast-migration-error

rickyrombo commented May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rickyrombo commented May 19, 2026

Summary

Background

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant