test: harden schedule backfill polling in CI#115
Conversation
Verifies that BackfillSchedule triggers actual workflow starts for past schedule slots. Creates a per-minute schedule, backfills a 2-minute past window, then polls DescribeSchedule until total_runs >= 1 to confirm the scheduler processed the backfill signal and started the expected executions. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>
3395d42 to
1a519a1
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 🚀 New features to boost your workflow:
|
… test_schedule Co-authored-by: Abhishek Jha <17800780+abhishekj720@users.noreply.github.com>
imports for asyncio, time, and QueryFailedError were removed as part of the _describe_schedule_when_ready cleanup but test_backfill still uses them inline for its polling loop. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>
| try: | ||
| await client.create_schedule( | ||
| schedule_id, | ||
| spec=schedule_pb2.ScheduleSpec(cron_expression="* * * * *"), |
There was a problem hiding this comment.
It's worth noting that total_runs doesn't distinguish backfill fires from regular cron fires (server increments the same counter for both). So with * * * * *(fire every minute) and a 30s deadline, roughly half the time the next minute boundary falls inside our poll window and a normal cron tick alone satisfies >= 1. The test would still pass if backfill silently produced nothing.
…sitive total_runs counts both cron and backfill fires, so with * * * * * a regular tick within the poll window can satisfy total_runs >= 1 even if the backfill produced nothing. Pausing first ensures any increment is attributable to the backfill signal. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>
pausing the schedule sends a signal that triggers ContinueAsNew in the scheduler workflow; the subsequent backfill signal lands on the old (closed) run and is dropped, leaving total_runs=0. switch to a cron that never fires during the ~30s poll window so no automatic tick can satisfy the assertion, without touching schedule state. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>
…ertion the yearly cron (0 0 1 1 *) had no schedule slots in the backfill window (now-3min to now-1min), so no fires were triggered and total_runs stayed 0. switch back to * * * * * which always covers exactly 2 slots in that window, and assert total_runs >= 2. a single spurious cron tick during the ~30s poll window adds at most 1 run and cannot produce a false-positive. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>
…fire with the default SkipNew policy, the second backfill slot is skipped because the first started workflow has no worker to complete it, leaving total_runs=1. CONCURRENT overlap allows both slots to fire regardless of in-flight runs, so the assertion total_runs >= 2 is reliably satisfied. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>
describe_schedule can raise CadenceRpcError when the scheduler workflow has just done ContinueAsNew and the server-side retry budget is exhausted. catching it alongside QueryFailedError lets the poll loop keep retrying within its own 30s window regardless of server version. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>
CadenceRpcError from describe_schedule means the server's retry budget for ContinueAsNew is exhausted - a real server failure, not a transient client condition. the test should surface it so server bugs are visible. only QueryFailedError is caught, which is the expected transient state while a new run processes its first decision task. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>
CI failed: The `test_backfill` integration test is failing consistently due to a 60-second polling timeout in `describe_schedule`, which is insufficient for the server to materialize backfill state under CI load.OverviewAll 4 analyzed logs show a consistent failure in the Failures
|
| Auto-apply | Compact |
|
|
Was this helpful? React with 👍 / 👎 | Gitar
What changed?
Updated the schedule backfill integration test to better tolerate transient server behavior in CI. The test now keeps polling when
DescribeSchedulebriefly returnsDEADLINE_EXCEEDED, extends the polling window from 30s to 60s, and documents why transient polling errors are ignored during backfill processing.Why?
The failing GitHub Actions
Unit Testsjob was caused bytests/integration_tests/test_schedule.py::test_backfillfailing intermittently while the scheduler was still processing the backfill request. The Actions logs showeddescribe_schedule()timing out withDeadline Exceeded, so this change makes the test resilient to that transient condition without changing production behavior.How did you test it?
80146212522uv tool run ruff check tests/integration_tests/test_schedule.pyuv run pytest -v tests/cadence/test_schedule.py(21 passed)uv run pytest -v tests/integration_tests/test_schedule.py::test_backfill --integration-tests, but the local Docker/Cassandra test environment hung on startup due to an unrelated sandbox DNS/container issuePotential risks
Low risk. This is a test-only change and does not modify runtime client behavior. The main impact is that a genuinely failing backfill test may now take longer to fail because of the longer polling window.
Release notes
None. Test-only change.
Documentation Changes
None.