CASSSIDECAR-462: Split RestoreJobDiscoverer into a fast status-check loop and a slow slice-discovery loop#350
Conversation
…loop and a slow slice-discovery loop
…loop and a slow slice-discovery loop
5631573 to
f565864
Compare
…scoverer # Conflicts: # CHANGES.txt
…loop and a slow slice-discovery loop
…loop and a slow slice-discovery loop
yifan-c
left a comment
There was a problem hiding this comment.
High-level design suggestion
The slow discovery loop is the correctness guarantee — it must always run on schedule. The fast status-check loop is a best-effort latency optimization — it is safe to skip any individual tick. All shared mutable state (JobIdsByDay) is intentionally not self-synchronized; a single external mutual-exclusion gate is the sole source of thread safety. The changes below enforce this by replacing two independent CAS flags (which gave both loops equal priority and allowed concurrent access to shared state) with a ReentrantLock whose call sites encode the asymmetry directly: lock() for the slow loop (blocks briefly, never skips), tryLock() for everything else (opportunistic).
Concrete changes
-
Replace
AtomicBoolean isExecutingwithReentrantLock executionLock. Slow loop useslock()(blocks until free, never skips); fast loop andprocessJobNowusetryLock()(skip if busy). This directly encodes the priority asymmetry: a shared CAS treats all callers equally, so a fast loop holding it when the slow loop's 5–10 min timer fires causes the slow loop to miss its entire cycle. -
Remove
checkerExecutingfromStatusCheckTask.ReentrantLockis now the sole gate; a second independent flag re-introduces the two-gate concurrency problem. -
Remove
synchronizedfrom allJobIdsByDaymethods. They were only added because the two independent gates allowed concurrent access. With a single lock the original invariant — exactly one thread touchesJobIdsByDayat a time — is fully restored. -
Restore
volatile int inflightJobsCount.scheduleDecision()is called by the timer without holding the lock. ReadinginflightJobIds()there would be an unsynchronized read ofJobIdsByDayoncesynchronizedis removed. The original counter is written under the lock (visible via the lock's happens-before edge) and read as a plain volatile fromscheduleDecision(). Also fixes the O(jobs)-allocation-per-tick regression forhasInflightJobs()and the gauge update.
| } | ||
|
|
||
| @JsonProperty(value = "job_discovery_status_check_interval") | ||
| public void setJobDiscoveryStatusCheckInterval(MillisecondBoundConfiguration jobDiscoveryStatusCheckInterval) |
There was a problem hiding this comment.
Nit: This method is unused. Seems like we are following the pattern of having set methods, other set methods in class are deprecated.
There was a problem hiding this comment.
Kept the setter — it's not actually unused, Jackson invokes it reflectively when deserializing sidecar.yaml via SidecarConfigurationImpl.readYamlConfiguration. It mirrors the shape of setJobDiscoveryActiveLoopDelay / setJobDiscoveryIdleLoopDelay right above. The @deprecated setters in this class are the legacy *_millis aliases (e.g. job_discovery_active_loop_delay_millis) being phased out; job_discovery_status_check_interval is brand new with no legacy form, so there's nothing to deprecate. Happy to add a *_millis alias if you'd like one for parity, but otherwise leaving as-is.
…loop and a slow slice-discovery loop
…loop and a slow slice-discovery loop
No description provided.