test: harden schedule backfill polling in CI by abhishekj720 · Pull Request #115 · cadence-workflow/cadence-python-client

abhishekj720 · 2026-05-22T17:04:17Z

What changed?

Updated the schedule backfill integration test to better tolerate transient server behavior in CI. The test now keeps polling when DescribeSchedule briefly returns DEADLINE_EXCEEDED, extends the polling window from 30s to 60s, and documents why transient polling errors are ignored during backfill processing.

Why?

The failing GitHub Actions Unit Tests job was caused by tests/integration_tests/test_schedule.py::test_backfill failing intermittently while the scheduler was still processing the backfill request. The Actions logs showed describe_schedule() timing out with Deadline Exceeded, so this change makes the test resilient to that transient condition without changing production behavior.

How did you test it?

Reviewed the failing Actions logs for job 80146212522
Ran uv tool run ruff check tests/integration_tests/test_schedule.py
Ran uv run pytest -v tests/cadence/test_schedule.py (21 passed)
Attempted to run uv run pytest -v tests/integration_tests/test_schedule.py::test_backfill --integration-tests, but the local Docker/Cassandra test environment hung on startup due to an unrelated sandbox DNS/container issue

Potential risks

Low risk. This is a test-only change and does not modify runtime client behavior. The main impact is that a genuinely failing backfill test may now take longer to fail because of the longer polling window.

Release notes

None. Test-only change.

Documentation Changes

None.

Verifies that BackfillSchedule triggers actual workflow starts for past schedule slots. Creates a per-minute schedule, backfills a 2-minute past window, then polls DescribeSchedule until total_runs >= 1 to confirm the scheduler processed the backfill signal and started the expected executions. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>

codecov · 2026-05-22T21:07:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
see 1 file with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

… test_schedule Co-authored-by: Abhishek Jha <17800780+abhishekj720@users.noreply.github.com>

imports for asyncio, time, and QueryFailedError were removed as part of the _describe_schedule_when_ready cleanup but test_backfill still uses them inline for its polling loop. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>

YaweiZhang-930 · 2026-05-27T03:54:22Z

+        try:
+            await client.create_schedule(
+                schedule_id,
+                spec=schedule_pb2.ScheduleSpec(cron_expression="* * * * *"),


It's worth noting that total_runs doesn't distinguish backfill fires from regular cron fires (server increments the same counter for both). So with * * * * *(fire every minute) and a 30s deadline, roughly half the time the next minute boundary falls inside our poll window and a normal cron tick alone satisfies >= 1. The test would still pass if backfill silently produced nothing.

…sitive total_runs counts both cron and backfill fires, so with * * * * * a regular tick within the poll window can satisfy total_runs >= 1 even if the backfill produced nothing. Pausing first ensures any increment is attributable to the backfill signal. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>

pausing the schedule sends a signal that triggers ContinueAsNew in the scheduler workflow; the subsequent backfill signal lands on the old (closed) run and is dropped, leaving total_runs=0. switch to a cron that never fires during the ~30s poll window so no automatic tick can satisfy the assertion, without touching schedule state. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>

…ertion the yearly cron (0 0 1 1 *) had no schedule slots in the backfill window (now-3min to now-1min), so no fires were triggered and total_runs stayed 0. switch back to * * * * * which always covers exactly 2 slots in that window, and assert total_runs >= 2. a single spurious cron tick during the ~30s poll window adds at most 1 run and cannot produce a false-positive. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>

…fire with the default SkipNew policy, the second backfill slot is skipped because the first started workflow has no worker to complete it, leaving total_runs=1. CONCURRENT overlap allows both slots to fire regardless of in-flight runs, so the assertion total_runs >= 2 is reliably satisfied. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>

describe_schedule can raise CadenceRpcError when the scheduler workflow has just done ContinueAsNew and the server-side retry budget is exhausted. catching it alongside QueryFailedError lets the poll loop keep retrying within its own 30s window regardless of server version. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>

CadenceRpcError from describe_schedule means the server's retry budget for ContinueAsNew is exhausted - a real server failure, not a transient client condition. the test should surface it so server bugs are visible. only QueryFailedError is caught, which is the expected transient state while a new run processes its first decision task. Signed-off-by: abhishek.jha <abhishek.jha@uber.com>

gitar-bot · 2026-06-08T17:53:05Z

CI failed: The `test_backfill` integration test is failing consistently due to a 60-second polling timeout in `describe_schedule`, which is insufficient for the server to materialize backfill state under CI load.

Overview

All 4 analyzed logs show a consistent failure in the test_backfill integration test. The test fails because the describe_schedule polling loop hits its 60-second deadline before the Cadence server successfully returns the expected state, causing an AssertionError.

Failures

`test_backfill` timeout (confidence: high)

Type: test
Affected jobs: 80159846769, 80159847089, 80159846932, 80159846699
Related to change: yes
Root cause: The polling loop uses a hard-coded 60-second deadline that does not account for the latency of the Cadence server during heavy integration test loads when materializing backfill state.
Suggested fix: Increase the polling timeout to 120-180 seconds and improve diagnostic logging within the except blocks to capture why the schedule metadata is not materializing (e.g., logging specific error types or the absence of the schedule).

Summary

Change-related failures: 1 (the test_backfill logic is blocking completion).
Infrastructure/flaky failures: 0 (this is a deterministic test failure under current timeout configurations).
Recommended action: Increase the timeout threshold in the test_backfill loop and consider adding more granular error handling or logging to catch server-side failures during the state materialization process.

Code Review ✅ Approved 2 resolved / 2 findings

Integration test for backfill_schedule added with robust polling and concurrent overlap policy, addressing previous missing import and protobuf duration handling issues. No issues found.

✅ 2 resolved

✅ Bug: timedelta passed directly to protobuf Duration fields

📄 tests/integration_tests/test_schedule.py:30-31
Lines 30-31 pass timedelta objects directly to execution_start_to_close_timeout and task_start_to_close_timeout, which are google.protobuf.Duration message fields. While newer protobuf-python versions may handle this implicitly in constructors, the rest of the codebase (e.g., cadence/client.py:300-301) uses explicit Duration().FromTimedelta() + CopyFrom(). If the protobuf version doesn't support implicit conversion, this will raise a TypeError at runtime. Consider using the explicit conversion pattern for consistency and safety.

✅ Bug: Missing imports for time and asyncio used in test_backfill

📄 tests/integration_tests/test_schedule.py:7-16 📄 tests/integration_tests/test_schedule.py:177 📄 tests/integration_tests/test_schedule.py:184 📄 tests/integration_tests/test_schedule.py:186
The delta diff removes import asyncio and import time from the top of the file (they were previously present). However, test_backfill (lines 177, 186) uses time.monotonic() and await asyncio.sleep(0.5), and QueryFailedError is used at line 184 but its import was also removed. This will cause NameError at runtime when test_backfill executes.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

`Auto-apply`	`Compact`
`gitar auto-apply:on`	`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

abhishekj720 force-pushed the test/schedule-backfill-integration branch from 3395d42 to 1a519a1 Compare May 22, 2026 19:29

gitar-bot Bot reviewed May 22, 2026

View reviewed changes

Comment thread tests/integration_tests/test_schedule.py Outdated

Merge branch 'main' into test/schedule-backfill-integration

99435d4

gitar-bot and others added 2 commits May 22, 2026 21:09

fix: use explicit Duration conversion for protobuf Duration fields in…

3d2d0fe

… test_schedule Co-authored-by: Abhishek Jha <17800780+abhishekj720@users.noreply.github.com>

Merge branch 'main' into test/schedule-backfill-integration

e7ae5c2

gitar-bot Bot reviewed May 26, 2026

View reviewed changes

Comment thread tests/integration_tests/test_schedule.py

YaweiZhang-930 reviewed May 27, 2026

View reviewed changes

abhishekj720 and others added 4 commits May 27, 2026 10:53

Merge branch 'main' into test/schedule-backfill-integration

c62749e

abhishekj720 marked this pull request as draft May 28, 2026 07:04

abhishekj720 and others added 5 commits May 28, 2026 00:04

Merge branch 'main' into test/schedule-backfill-integration

ac339c6

Merge branch 'main' into test/schedule-backfill-integration

4957683

Copilot started work on behalf of abhishekj720 June 8, 2026 16:51 View session

Copilot AI added 2 commits June 8, 2026 17:10

test: harden schedule backfill integration polling

d7e6852

test: document transient schedule polling errors

0ce5183

Copilot AI changed the title ~~test(schedule): add integration test for backfill_schedule~~ test: harden schedule backfill polling in CI Jun 8, 2026

Copilot finished work on behalf of abhishekj720 June 8, 2026 17:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test: harden schedule backfill polling in CI#115

test: harden schedule backfill polling in CI#115
abhishekj720 wants to merge 16 commits into
mainfrom
test/schedule-backfill-integration

abhishekj720 commented May 22, 2026 •

edited by Copilot AI

Loading

Uh oh!

Uh oh!

codecov Bot commented May 22, 2026 •

edited

Loading

Uh oh!

Uh oh!

YaweiZhang-930 May 27, 2026

Uh oh!

gitar-bot Bot commented Jun 8, 2026

Overview

Failures

`test_backfill` timeout (confidence: high)

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

abhishekj720 commented May 22, 2026 • edited by Copilot AI Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

YaweiZhang-930 May 27, 2026

Choose a reason for hiding this comment

Uh oh!

gitar-bot Bot commented Jun 8, 2026

Overview

Failures

test_backfill timeout (confidence: high)

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

abhishekj720 commented May 22, 2026 •

edited by Copilot AI

Loading

codecov Bot commented May 22, 2026 •

edited

Loading

`test_backfill` timeout (confidence: high)