Skip to content

CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop#350

Merged
sarankk merged 8 commits into
apache:trunkfrom
mansikhara:CASSSIDECAR-462
Jun 24, 2026
Merged

CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop#350
sarankk merged 8 commits into
apache:trunkfrom
mansikhara:CASSSIDECAR-462

Conversation

@mansikhara

Copy link
Copy Markdown
Contributor

No description provided.

@mansikhara mansikhara changed the title CASSSIDECAR-462 Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop May 15, 2026

@yifan-c yifan-c left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level design suggestion

The slow discovery loop is the correctness guarantee — it must always run on schedule. The fast status-check loop is a best-effort latency optimization — it is safe to skip any individual tick. All shared mutable state (JobIdsByDay) is intentionally not self-synchronized; a single external mutual-exclusion gate is the sole source of thread safety. The changes below enforce this by replacing two independent CAS flags (which gave both loops equal priority and allowed concurrent access to shared state) with a ReentrantLock whose call sites encode the asymmetry directly: lock() for the slow loop (blocks briefly, never skips), tryLock() for everything else (opportunistic).

Concrete changes

  1. Replace AtomicBoolean isExecuting with ReentrantLock executionLock. Slow loop uses lock() (blocks until free, never skips); fast loop and processJobNow use tryLock() (skip if busy). This directly encodes the priority asymmetry: a shared CAS treats all callers equally, so a fast loop holding it when the slow loop's 5–10 min timer fires causes the slow loop to miss its entire cycle.

  2. Remove checkerExecuting from StatusCheckTask. ReentrantLock is now the sole gate; a second independent flag re-introduces the two-gate concurrency problem.

  3. Remove synchronized from all JobIdsByDay methods. They were only added because the two independent gates allowed concurrent access. With a single lock the original invariant — exactly one thread touches JobIdsByDay at a time — is fully restored.

  4. Restore volatile int inflightJobsCount. scheduleDecision() is called by the timer without holding the lock. Reading inflightJobIds() there would be an unsynchronized read of JobIdsByDay once synchronized is removed. The original counter is written under the lock (visible via the lock's happens-before edge) and read as a plain volatile from scheduleDecision(). Also fixes the O(jobs)-allocation-per-tick regression for hasInflightJobs() and the gauge update.

}

@JsonProperty(value = "job_discovery_status_check_interval")
public void setJobDiscoveryStatusCheckInterval(MillisecondBoundConfiguration jobDiscoveryStatusCheckInterval)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This method is unused. Seems like we are following the pattern of having set methods, other set methods in class are deprecated.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kept the setter — it's not actually unused, Jackson invokes it reflectively when deserializing sidecar.yaml via SidecarConfigurationImpl.readYamlConfiguration. It mirrors the shape of setJobDiscoveryActiveLoopDelay / setJobDiscoveryIdleLoopDelay right above. The @deprecated setters in this class are the legacy *_millis aliases (e.g. job_discovery_active_loop_delay_millis) being phased out; job_discovery_status_check_interval is brand new with no legacy form, so there's nothing to deprecate. Happy to add a *_millis alias if you'd like one for parity, but otherwise leaving as-is.

@sarankk sarankk merged commit 2cfea84 into apache:trunk Jun 24, 2026
40 of 42 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants