Document known accuracy limitations for ApproximateBacklogCount #4392
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -720,9 +720,18 @@ The age is based on the creation time of the Task at the head of the queue. | |
|
|
||
| You can rely on both these counts when making scaling decisions. | ||
|
|
||
| Please note: [Sticky queues](https://docs.temporal.io/sticky-execution) will affect these values, but only for a few seconds. | ||
| That's because Tasks sent to Sticky queues are not included in the returned values for `ApproximateBacklogCount` and `ApproximateBacklogAge`. | ||
| Inaccuracies diminish as the backlog grows. | ||
| #### Known accuracy limitations {#backlog-accuracy-limitations} | ||
|
|
||
| These values are approximate and may be temporarily inaccurate in the following scenarios: | ||
|
Contributor
There's also just all the discrepancy due to metadata updates being infrequent, and discrepancy due to database TTLs (Cassandra makes rows disappear and we don't see them). I think "may be temporarily inaccurate" is too strong: it suggests inaccuracy is temporary and will go away, but that's not true at all.
||
|
|
||
| - **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire. | ||
|
Contributor
Same, I think "expire" is the wrong word here.
||
| Workers process Tasks in order within a partition, so valid Tasks _can_ be blocked behind invalid (expired) Tasks, but invalid Tasks cannot block other invalid Tasks. | ||
|
Contributor
Also: "invalid" (workflow is closed, or activity is canceled, or various other things) and "expired" (due to timeout) are technically different internally, but I think it's good to just bundle them together for this purpose. We should be consistent, though: maybe use "invalid" everywhere, or "invalid or expired"?
||
| The count eventually converges to the correct value as expired Tasks are cleared. | ||
|
Contributor
This feels like it's promising too much. Invalid tasks will eventually be accounted for, but I don't think we can say the count will converge to the correct value; there are other sources of discrepancy.
||
| - **Reset to zero on partition unload**: If no Workers are polling a Task Queue and no new Tasks are added for approximately 5 minutes, the Temporal Service unloads the Task Queue partition from memory. | ||
|
Contributor
Other task queue activities such as DescribeTaskQueue or UpdateTaskQueueConfig will also keep the task queue loaded.
||
| When this happens, `ApproximateBacklogCount` resets to zero until the partition is reloaded (by a Worker polling or a new Task being added). | ||
|
Contributor
should we say that calling
||
| This means that an idle Task Queue with a backlog but no active Workers may temporarily report zero. | ||
| - **Sticky queue exclusion**: [Sticky queues](/sticky-execution) are not included in these values. | ||
| Because Sticky queue Tasks only remain valid for a few seconds, this inaccuracy diminishes as the backlog grows. | ||
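
The limitations listed above suggest treating the reported backlog as a hint rather than an exact figure when scaling. The following is a minimal sketch under stated assumptions, not Temporal SDK code: `BacklogSample`, `desired_workers`, and the `tasks_per_worker`, `min_workers`, and `max_workers` parameters are hypothetical names, and the values are assumed to have been read from the task queue stats by some other means.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BacklogSample:
    """Hypothetical container for values read from task queue stats."""
    approximate_backlog_count: int
    tasks_add_rate: float       # tasks per second
    tasks_dispatch_rate: float  # tasks per second


def desired_workers(sample: BacklogSample, current_workers: int,
                    tasks_per_worker: int = 100,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Return a worker count that does not fully trust the backlog count."""
    # Ceiling division: workers needed to drain the reported backlog.
    target = -(-sample.approximate_backlog_count // tasks_per_worker)

    # If tasks arrive faster than they are dispatched, ask for one more
    # worker even when the reported count looks small.
    if sample.tasks_add_rate > sample.tasks_dispatch_rate:
        target = max(target, current_workers + 1)

    # Never go below min_workers: a reported zero may only mean that the
    # partition was unloaded, not that the backlog is empty.
    return max(min_workers, min(max_workers, target))
```

Keeping `min_workers` at 1 or more is a design choice in this sketch: a polling Worker keeps the partition loaded, so the reported count is less likely to be a reset-to-zero artifact. The upper bound limits the effect of an overcount from invalid Tasks.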
|
|
||
| ### `TasksAddRate` and `TasksDispatchRate` {#TasksAddRate-and-TasksDispatchRate} | ||
|
|
||
|
|
||
Nits:
I'm not sure what is meant by "... tasks that haven't yet expired". The idea is that when they expire or become invalid, they are still counted because they haven't been processed and discarded yet