Add retry to gRPC calls that failed due to transient errors#714
@copilot add tests for the new retry logic in this PR
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds retry logic around worker gRPC calls when they fail with transient transport errors, with accompanying structured logging for each retry attempt.
Changes:
- Wrap multiple gRPC client calls (abandon/complete operations) in a shared retry helper with exponential backoff + jitter.
- Add a new warning log event to record transient gRPC retry attempts and backoff duration.
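The retry pattern described in these changes can be sketched roughly as follows. This is a minimal illustration, not the PR's actual ExecuteWithRetryAsync: the class name RetrySketch, the parameter names, the default delays, the attempt limit, and the "full jitter" formula are all assumptions made for the example. To keep the sketch free of gRPC package dependencies, the transient check is passed in as a predicate; in the worker it would presumably test for an RpcException whose status code is StatusCode.Unavailable. It also assumes .NET 6 or later for Random.Shared.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; names and defaults are assumptions.
public static class RetrySketch
{
    // Runs the operation, retrying with exponential backoff + jitter while the
    // failure is classified as transient and attempts remain.
    public static async Task<T> ExecuteWithRetryAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 5,
        TimeSpan? baseDelay = null,
        TimeSpan? maxDelay = null,
        Action<int, TimeSpan, Exception>? onRetry = null,
        CancellationToken cancellation = default)
    {
        TimeSpan initial = baseDelay ?? TimeSpan.FromMilliseconds(100);
        TimeSpan cap = maxDelay ?? TimeSpan.FromSeconds(5);

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            // On the last attempt the filter is false, so the exception propagates.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                TimeSpan delay = ComputeBackoff(attempt, initial, cap, Random.Shared.NextDouble());

                // Hook where a warning log for the retry attempt could be emitted.
                onRetry?.Invoke(attempt, delay, ex);

                await Task.Delay(delay, cancellation).ConfigureAwait(false);
            }
        }
    }

    // Exponential backoff capped at maxDelay, scaled by a jitter sample in [0, 1).
    public static TimeSpan ComputeBackoff(
        int attempt, TimeSpan baseDelay, TimeSpan maxDelay, double jitterSample)
    {
        double exponential = baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1);
        double capped = Math.Min(exponential, maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(capped * jitterSample);
    }
}
```

A caller would wrap each gRPC call in a lambda passed as operation and supply an onRetry callback that writes the retry warning. Separating ComputeBackoff from the loop keeps the delay calculation deterministic under test, since the jitter sample is an explicit argument.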
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/Worker/Grpc/Logs.cs | Adds a new structured log message for retry attempts. |
| src/Worker/Grpc/GrpcDurableTaskWorker.Processor.cs | Introduces ExecuteWithRetryAsync and applies it to multiple gRPC calls. |
Added tests for
@copilot now add integration tests for this retry functionality
Added integration-level retry tests in commit
The commit also fixes the unit test
Summary
What changed?
As per the PR title, this PR adds retry logic to gRPC calls in the worker process that failed due to transient errors (i.e. StatusCode.Unavailable).
Why is this change needed?
Previously, the call would simply fail, so the work item was abandoned and picked up again only after a relatively long delay. For these transient errors, we want to retry almost immediately instead.