5 changes: 5 additions & 0 deletions docs/cloud/metrics/openmetrics/metrics-reference.mdx
@@ -508,6 +508,11 @@ These metrics could have high cardinality depending on number of task queues pre

The approximate number of tasks pending in a task queue. Started Activities are not included in the count as they have been dequeued from the task queue.

:::note Known accuracy limitations
This metric may temporarily overcount due to cancelled Workflow Tasks that haven't yet expired, and may reset to zero if no Workers poll and no Tasks are added for approximately 5 minutes (due to partition unload).
Contributor

Nits:

  • It's not only canceled/terminated workflows; it's also expired tasks, say for expired workflows (especially in the new matcher).
  • Instead of "partition unload" we can just say "task queue unload", because the user should not be exposed to the partition concept.

Contributor

I'm not sure what is meant by "... tasks that haven't yet expired". The idea is that when they expire or become invalid, they are still counted because they haven't been processed and discarded yet.

See [backlog accuracy limitations](/develop/worker-performance#backlog-accuracy-limitations) for details.
:::

| Label | Description |
| ----- | ----- |
| `temporal_task_queue` | The task queue name |
15 changes: 12 additions & 3 deletions docs/develop/worker-performance.mdx
@@ -720,9 +720,18 @@ The age is based on the creation time of the Task at the head of the queue.

You can rely on both these counts when making scaling decisions.

Please note: [Sticky queues](https://docs.temporal.io/sticky-execution) will affect these values, but only for a few seconds.
That's because Tasks sent to Sticky queues are not included in the returned values for `ApproximateBacklogCount` and `ApproximateBacklogAge`.
Inaccuracies diminish as the backlog grows.
#### Known accuracy limitations {#backlog-accuracy-limitations}

These values are approximate and may be temporarily inaccurate in the following scenarios:
Contributor

There's also the discrepancy due to metadata updates being infrequent, and the discrepancy due to database TTLs (Cassandra makes rows disappear and we don't see them). I think "may be temporarily inaccurate" is too strong: it suggests the inaccuracy is temporary and will go away, but that's not true at all.


- **Overcount from cancelled Workflows**: When a Workflow is cancelled, its pending Tasks may still be counted in the backlog until they expire.
Contributor

same, I think "expire" is the wrong word here

Workers process Tasks in order within a partition, so valid Tasks _can_ be blocked behind invalid (expired) Tasks, but invalid Tasks cannot block other invalid Tasks.
Contributor

  • Again, the partition concept doesn't have much bearing here, so maybe we can omit it.
  • "valid Tasks can be blocked behind invalid (expired) Tasks" -> this is the other way around: valid tasks, until dispatched, can block invalid tasks. Once an invalid task is at the front of the queue, we remove it quickly, so invalid tasks do not generally block other invalid or valid tasks. A valid task may stay longer at the front of the queue if there are not enough workers.
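
The ordering described in this comment can be illustrated with a toy model. This is only a sketch of the behavior as stated above, not the Temporal matching service's implementation; `Task`, `drain_invalid_head`, and `dispatch_one` are hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    valid: bool  # False models an invalid or expired task (assumption: both treated alike)


def drain_invalid_head(queue: deque) -> int:
    """Discard invalid tasks at the head of the queue; stop at the first valid one."""
    removed = 0
    while queue and not queue[0].valid:
        queue.popleft()
        removed += 1
    return removed


def dispatch_one(queue: deque):
    """Hand the head task to a (hypothetical) worker, skipping invalid tasks first."""
    drain_invalid_head(queue)
    return queue.popleft() if queue else None


# A valid task "a" sits at the head, with two invalid tasks queued behind it.
queue = deque([Task("a", True), Task("b", False), Task("c", False), Task("d", True)])

# With no worker available, "a" stays at the head, so "b" and "c" are never
# reached and remain in the backlog count.
drain_invalid_head(queue)
backlog_without_workers = len(queue)

# Once "a" is dispatched, the invalid tasks reach the head and are removed quickly.
dispatched = dispatch_one(queue)
drain_invalid_head(queue)
backlog_after_dispatch = len(queue)
```

In this model the overcount lasts for as long as a valid task waits at the head, which is why a shortage of workers makes it more visible.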

Contributor

also: "invalid" (workflow is closed, or activity is canceled, or various other things) and "expired" (due to timeout) are technically different internally but I think it's good to just bundle them together for this purpose. We should be consistent, though, maybe use "invalid" everywhere, or "invalid or expired"?

The count eventually converges to the correct value as expired Tasks are cleared.
Contributor

This feels like it's promising too much. Invalid tasks will eventually be accounted for, but I don't think we can say the count will converge to the correct value, there are other sources of discrepancy.

- **Reset to zero on partition unload**: If no Workers are polling a Task Queue and no new Tasks are added for approximately 5 minutes, the Temporal Service unloads the Task Queue partition from memory.
Contributor

> no Workers are polling a Task Queue and no new Tasks are added

Other task queue activities such as DescribeTaskQueue or UpdateTaskQueueConfig will also keep the task queue loaded.

When this happens, `ApproximateBacklogCount` resets to zero until the partition is reloaded (by a Worker polling or a new Task being added).
Contributor

should we say that calling DescribeTaskQueue also reloads the task queue?

This means that an idle Task Queue with a backlog but no active Workers may temporarily report zero.
- **Sticky queue exclusion**: [Sticky queues](/sticky-execution) are not included in these values.
Because Sticky queue Tasks only remain valid for a few seconds, this inaccuracy diminishes as the backlog grows.
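
One way a scaling policy might account for these limitations is sketched below. It assumes the backlog count has already been read from the Task Queue stats; `desired_workers`, `tasks_per_worker`, `min_workers`, and `max_workers` are hypothetical names and defaults, not part of any Temporal API.

```python
import math


def desired_workers(
    backlog_count: int,
    tasks_per_worker: int,
    min_workers: int = 1,
    max_workers: int = 20,
) -> int:
    """Map an approximate backlog count to a worker count within fixed bounds.

    Keeping min_workers at 1 or more means a reading of zero never removes
    every poller, so an idle Task Queue is less likely to be unloaded and
    report zero while Tasks are still waiting.
    """
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    wanted = math.ceil(backlog_count / tasks_per_worker)
    return max(min_workers, min(max_workers, wanted))


# A reading of zero still keeps one Worker polling.
idle = desired_workers(0, tasks_per_worker=50)
# A moderate backlog scales proportionally.
busy = desired_workers(120, tasks_per_worker=50)
# A very large (possibly overcounted) reading is capped.
capped = desired_workers(10_000, tasks_per_worker=50)
```

The upper bound limits the effect of an overcount, and the lower bound limits the effect of a spurious zero; both values here are placeholders to tune per deployment.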

### `TasksAddRate` and `TasksDispatchRate` {#TasksAddRate-and-TasksDispatchRate}
