Add a few metrics on main thread contention points #328

Merged
fpacifici merged 1 commit into main from fpacifici/add_submit_metrics on Jun 4, 2026

fpacifici commented on Jun 4, 2026

Collaborator

Summary

After shipping the errors pipeline, I saw a severe throughput problem on semantically partitioned topics with large messages: the pipeline cannot keep up even with 64 replicas.
The batch step appears to fill batches of 3k messages every second, which should be more than enough.
The parallel step is not saturated.

So the likely explanation is that something is very slow in the arroyo code that runs between steps on the main thread.
The sensitive areas are the calls to submit and poll.

Changes:

  • Add gauge metrics and debug logs for BatchStep outbound drain (next_step.submit), including duration, outbound queue depth, and submit outcome.
  • Add gauge metrics and debug logs for PythonAdapter Python delegate submit, poll/handle paths, and downstream next_strategy.submit.
  • Reuse cached metric labels on both steps to avoid per-call allocations.
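The changes above amount to timing each main-thread hand-off and tagging it with its outcome. The following is a minimal Python sketch of that pattern for a batch drain, written under stated assumptions: `InstrumentedBatchDrain`, `RecordingMetrics`, `MessageRejected`, and the metric names are hypothetical illustrations, not the code merged in this PR. It records the outbound queue depth, the duration of each downstream submit, and whether the submit succeeded or was rejected (backpressure), and it builds its label tuples once so the hot path does not allocate new ones per call.

```python
import logging
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Tuple

logger = logging.getLogger(__name__)

Labels = Tuple[Tuple[str, str], ...]


class MessageRejected(Exception):
    """Hypothetical stand-in for a backpressure signal raised by the next step."""


@dataclass
class RecordingMetrics:
    """Hypothetical in-memory gauge backend that records (name, value, labels)."""

    gauges: List[Tuple[str, float, Labels]] = field(default_factory=list)

    def gauge(self, name: str, value: float, labels: Labels) -> None:
        self.gauges.append((name, value, labels))


class InstrumentedBatchDrain:
    """Hypothetical drain that forwards queued batches to the next step, timing each submit."""

    def __init__(self, step_name: str, next_submit: Callable[[object], None],
                 metrics: RecordingMetrics) -> None:
        self._next_submit = next_submit
        self._metrics = metrics
        self._outbound: Deque[object] = deque()
        # Build label sets once so each submit reuses them instead of allocating new ones.
        self._labels_depth: Labels = (("step", step_name),)
        self._labels_ok: Labels = (("step", step_name), ("outcome", "ok"))
        self._labels_rejected: Labels = (("step", step_name), ("outcome", "rejected"))

    def push(self, batch: object) -> None:
        self._outbound.append(batch)

    def drain(self) -> None:
        # A queue depth that keeps growing across drains hints at downstream backpressure.
        self._metrics.gauge("batch.outbound_queue_depth", len(self._outbound), self._labels_depth)
        while self._outbound:
            batch = self._outbound[0]
            start = time.perf_counter()
            try:
                self._next_submit(batch)
            except MessageRejected:
                elapsed = time.perf_counter() - start
                self._metrics.gauge("batch.next_submit_duration", elapsed, self._labels_rejected)
                logger.debug("next step rejected batch after %.6fs; %d batches queued",
                             elapsed, len(self._outbound))
                return  # Leave the batch queued and retry on the next drain.
            elapsed = time.perf_counter() - start
            self._metrics.gauge("batch.next_submit_duration", elapsed, self._labels_ok)
            self._outbound.popleft()
```

The same wrapper shape could be applied around a Python delegate's submit and poll calls. Note that a gauge keeps only the latest value per series, so it shows the current state but not the distribution; a histogram or timer metric, if the backend supports one, would make occasional slow submits easier to spot.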

Made with Cursor

Expose gauge metrics and debug logs to diagnose backpressure and slow downstream submits during batch drain and Python delegate poll/submit.

Co-authored-by: Cursor <cursoragent@cursor.com>
fpacifici requested a review from a team as a code owner on June 4, 2026 at 14:56
fpacifici changed the title from "Add submit and poll timing metrics for batch and Python adapter" to "Add a few metrics on main thread contention points" on Jun 4, 2026
fpacifici merged commit d4badc6 into main on Jun 4, 2026
25 checks passed
fpacifici deleted the fpacifici/add_submit_metrics branch on June 4, 2026 at 15:38