
TSC Clocks + Benchmark Methodology Rework#26

Open
MoonFlowww wants to merge 4 commits into Compaile:main from MoonFlowww:main


Summary

Adds three RDTSC-family clock backends behind compile-time macros, with a runtime calibration cascade and chrono fallback. Reworks the benchmark to remove self-measurement bias and -O3 collapsing of the no-track baseline. A few smaller cleanups in the stats path.


Major changes

1. TSC clock backends (ctrack.hpp)

New Clock_RDTSC / Clock_RDTSCP / Clock_RDTSCP_LFENCE structs selected via CTRACK_CLOCK_RDTSC* macros. Selected backend aliased to ActiveClock; chrono is the default when no macro is set. Hard #error on non-x86_64 if a TSC macro is defined.

All std::chrono::high_resolution_clock::time_point and duration_cast<nanoseconds> calls in Event, Simple_Event, EventGroup, store, ctrack_result*, and EventHandler are replaced with ActiveClock::time_point and ActiveClock::duration_ns(). The interface contract (NOW(), duration_ns(s, e), to_string(tp)) is uniform across all four clocks.
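To make the interface contract concrete, here is a minimal sketch of how a chrono backend and one TSC backend could expose the same three members and be aliased to ActiveClock. It is not the code in ctrack.hpp: the name Clock_Chrono, the tsc_hz() accessor, and the time_since_epoch() formatting are assumptions made for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>

#if defined(CTRACK_CLOCK_RDTSC) && !defined(__x86_64__) && !defined(_M_X64)
#error "CTRACK_CLOCK_RDTSC requires an x86_64 target"
#endif
#if defined(CTRACK_CLOCK_RDTSC)
#  if defined(_MSC_VER)
#    include <intrin.h>     // __rdtsc on MSVC
#  else
#    include <x86intrin.h>  // __rdtsc on GCC/Clang
#  endif
#endif

// Default backend (name is hypothetical): thin wrapper over std::chrono.
struct Clock_Chrono {
    using time_point = std::chrono::high_resolution_clock::time_point;
    static time_point NOW() noexcept {
        return std::chrono::high_resolution_clock::now();
    }
    static std::int64_t duration_ns(time_point s, time_point e) noexcept {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(e - s).count();
    }
    static std::string to_string(time_point tp) {
        // Placeholder formatting: raw ticks since the clock's epoch.
        return std::to_string(tp.time_since_epoch().count());
    }
};

#if defined(CTRACK_CLOCK_RDTSC)
// TSC backend: time_point is a raw cycle count.
struct Clock_RDTSC {
    using time_point = std::uint64_t;
    // Hypothetical storage for the calibrated frequency (Hz).
    static double& tsc_hz() { static double hz = 0.0; return hz; }
    static time_point NOW() noexcept { return __rdtsc(); }
    static std::int64_t duration_ns(time_point s, time_point e) noexcept {
        assert(tsc_hz() > 0.0 && "calibration must run before conversion");
        return static_cast<std::int64_t>(static_cast<double>(e - s) * 1e9 / tsc_hz());
    }
    static std::string to_string(time_point tp) {
        return std::to_string(tp) + " cycles";
    }
};
using ActiveClock = Clock_RDTSC;
#else
using ActiveClock = Clock_Chrono;
#endif
```

Call sites only touch ActiveClock::NOW(), ActiveClock::duration_ns() and ActiveClock::to_string(), so swapping the backend is a compile-time decision with no change to the callers.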

Calibration cascade in calibrate_tsc():

| Source | Path | Coverage |
|---|---|---|
| C1 | CPUID 0x15 | Intel Skylake+ (exact TSC Hz) |
| C2 | CPUID 0x16 | Intel Haswell+ (base MHz, assumed = TSC) |
| C3 | /sys/.../cpu0/cpufreq/base_frequency | Linux + intel_pstate |
| C4 | HKLM\...\CentralProcessor\0\~MHz | Windows registry |
| Fallback | 3× 1 ms __rdtsc vs steady_clock, median | AMD / VMs |

Calibration runs once via a function-local static const bool _ = (calibrate_tsc(), true); inside EventHandler's ctor. It also records the anchors tsc_anchor_cycles and tsc_anchor_system, which are used for to_string conversion.
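The run-once idiom can be illustrated in isolation. In this sketch the calibration routine is replaced by a stub that only counts its invocations; calibration_runs(), calibrate_tsc_stub() and Handler are hypothetical names standing in for the real calibrate_tsc() and EventHandler.

```cpp
#include <cassert>

// Hypothetical counter so the example can observe how often calibration ran.
inline int& calibration_runs() { static int n = 0; return n; }

// Stand-in for calibrate_tsc(); the real routine probes the frequency sources.
inline void calibrate_tsc_stub() { ++calibration_runs(); }

struct Handler {
    Handler() {
        // The comma expression runs the stub, then yields true as the initializer.
        // A function-local static is initialized exactly once, and since C++11
        // that initialization is thread-safe.
        static const bool calibrated = (calibrate_tsc_stub(), true);
        (void)calibrated;
        assert(calibration_runs() == 1);
    }
};
```

Constructing any number of Handler objects triggers the stub only on the first construction; later constructions pay only the guard check on the static.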

If every source returns 0 the library calls std::abort() with a message pointing at the macro to remove. No silent fallback to chrono — wrong-by-frequency-ratio numbers are worse than an abort.
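The cascade logic itself is simple enough to sketch without any platform code. The example below is an assumption about the shape of the control flow, not the body of calibrate_tsc(): each source is modelled as a function returning Hz or 0, the first non-zero answer wins, and an all-zero result aborts. HzSource, first_nonzero_hz, median_of_three and calibrate_or_abort are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <vector>

// A source returns the TSC frequency in Hz, or 0 when it is unavailable.
using HzSource = std::uint64_t (*)();

// Try the sources in priority order and stop at the first usable answer.
inline std::uint64_t first_nonzero_hz(const std::vector<HzSource>& sources) {
    for (HzSource src : sources) {
        assert(src != nullptr);
        if (std::uint64_t hz = src()) return hz;
    }
    return 0;
}

// The measured fallback takes three samples and keeps the median,
// so one disturbed sample cannot skew the result.
inline std::uint64_t median_of_three(std::array<std::uint64_t, 3> samples) {
    std::sort(samples.begin(), samples.end());
    return samples[1];
}

// No silent fallback: if nothing produced a frequency, stop the program.
inline std::uint64_t calibrate_or_abort(const std::vector<HzSource>& sources) {
    const std::uint64_t hz = first_nonzero_hz(sources);
    if (hz == 0) {
        std::fprintf(stderr,
                     "TSC calibration failed; remove the CTRACK_CLOCK_RDTSC* "
                     "macro to use the chrono clock\n");
        std::abort();
    }
    return hz;
}
```

Ordering the sources from most exact to least exact means the measured fallback is only consulted when the enumerated paths are unavailable.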

2. Benchmark methodology (ctrack_benchmark.cpp)

Three fixes that together remove the bias the old harness had:

BENCHMARK_NOINLINE on busy_wait_ns and every *_no_track helper. The tracked variants are naturally barriered: each EventHandler ctor/dtor mutates thread-local event state, so the compiler cannot reorder or fuse adjacent calls. The *_no_track variants have no such barrier — at -O3 the entire call tree gets inlined into the worker loop as a flat sequence of busy_wait_ns calls, which the scheduler can then reorder asymmetrically vs the tracked path. NOINLINE on the no-track side restores call-site symmetry with the tracked side, so the delta reflects CTRACK overhead and not asymmetric optimization.
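For readers unfamiliar with the attribute, a sketch of how such a macro is commonly defined and applied follows. The macro expansion and the bodies of busy_wait_ns, leaf_no_track and parent_no_track are assumptions for illustration; the benchmark's actual definitions may differ.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Portable "do not inline" marker (assumed definition).
#if defined(_MSC_VER)
#  define BENCHMARK_NOINLINE __declspec(noinline)
#elif defined(__GNUC__) || defined(__clang__)
#  define BENCHMARK_NOINLINE __attribute__((noinline))
#else
#  define BENCHMARK_NOINLINE
#endif

// Spin until at least `ns` nanoseconds have elapsed.
BENCHMARK_NOINLINE void busy_wait_ns(std::int64_t ns) {
    assert(ns >= 0);
    const auto deadline =
        std::chrono::steady_clock::now() + std::chrono::nanoseconds(ns);
    while (std::chrono::steady_clock::now() < deadline) {
        // intentionally empty: burn CPU time
    }
}

// Untracked helpers keep a real call boundary, mirroring the call structure
// that the tracked variants get from their EventHandler ctor/dtor.
BENCHMARK_NOINLINE void leaf_no_track(std::int64_t ns) { busy_wait_ns(ns); }

BENCHMARK_NOINLINE void parent_no_track(std::int64_t ns) {
    leaf_no_track(ns);
    leaf_no_track(ns);
}
```

With the attribute in place the optimizer still compiles each helper at -O3, but it cannot flatten the whole call tree into the worker loop.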

raw_clock_ns(): CLOCK_MONOTONIC_RAW on POSIX, QueryPerformanceCounter on Windows. Replaces std::chrono::high_resolution_clock as the outer timer in measure_overhead(). Measuring a solution with itself adds bias; the outer timer needs a path independent of whatever ctrack uses internally.
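A possible shape for such an outer timer is sketched below. This is an illustrative implementation under stated assumptions, not necessarily the one in ctrack_benchmark.cpp: it falls back to CLOCK_MONOTONIC where CLOCK_MONOTONIC_RAW is not defined, and it splits the Windows conversion into whole seconds plus remainder to avoid overflow.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_WIN32)
#  ifndef NOMINMAX
#    define NOMINMAX
#  endif
#  include <windows.h>
#else
#  include <time.h>
#endif

// Monotonic nanosecond timestamp that does not go through std::chrono.
inline std::int64_t raw_clock_ns() {
#if defined(_WIN32)
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&count);
    // Convert ticks to ns in two parts so the multiplication cannot overflow.
    const std::int64_t sec = count.QuadPart / freq.QuadPart;
    const std::int64_t rem = count.QuadPart % freq.QuadPart;
    return sec * 1000000000LL + rem * 1000000000LL / freq.QuadPart;
#else
    timespec ts{};
#  if defined(CLOCK_MONOTONIC_RAW)
    const int rc = clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
#  else
    const int rc = clock_gettime(CLOCK_MONOTONIC, &ts);  // assumed fallback
#  endif
    assert(rc == 0);
    (void)rc;
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
#endif
}
```

Two consecutive calls give a non-negative difference in nanoseconds, independent of whichever ActiveClock backend the library itself was built with.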

measure_overhead() restructure:

  • Warmup pass before the trials so caches and branch predictors are hot. Its result is not used in the median.
  • 5 trials with alternating order (no_track first vs track first per parity) so any first-trial effect cancels across pairs.
  • ctrack::result_as_string() moved outside the timed window: pre-clear of accumulated state happens before t0, post-clear after t1. The old harness called it inside the timed window, charging stats-flush cost to overhead.
  • Median across trials rejects scheduler outliers.
  • Negative raw_diff clamped to 0 with a verbose-mode note (noise floor).
    Outer clock measures the track − no_track delta.
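The trial structure above can be sketched generically. The example below is a simplified model, not the actual measure_overhead(): the clock, both workloads, and the state-reset step are passed in as callables, and OverheadResult, timed and measure_overhead_sketch are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

struct OverheadResult {
    std::int64_t median_diff_ns;  // median of (track - no_track), clamped at 0
    bool clamped;                 // true when the raw median was negative
};

// One timed window: state is cleared before t0 and after t1, never inside.
template <class ClockFn, class Fn, class ResetFn>
std::int64_t timed(ClockFn& clock, Fn& fn, ResetFn& reset) {
    reset();
    const std::int64_t t0 = clock();
    fn();
    const std::int64_t t1 = clock();
    reset();
    assert(t1 >= t0 && "outer clock must be monotonic");
    return t1 - t0;
}

template <class ClockFn, class TrackFn, class NoTrackFn, class ResetFn>
OverheadResult measure_overhead_sketch(ClockFn clock, TrackFn track,
                                       NoTrackFn no_track, ResetFn reset) {
    constexpr int kTrials = 5;
    static_assert(kTrials % 2 == 1, "odd trial count gives a unique median");

    // Warmup pass: results are discarded.
    (void)timed(clock, no_track, reset);
    (void)timed(clock, track, reset);

    std::array<std::int64_t, kTrials> diffs{};
    for (int i = 0; i < kTrials; ++i) {
        std::int64_t t_track = 0, t_plain = 0;
        if (i % 2 == 0) {  // even trials: no_track first
            t_plain = timed(clock, no_track, reset);
            t_track = timed(clock, track, reset);
        } else {           // odd trials: track first
            t_track = timed(clock, track, reset);
            t_plain = timed(clock, no_track, reset);
        }
        diffs[i] = t_track - t_plain;
    }

    std::sort(diffs.begin(), diffs.end());
    const std::int64_t median = diffs[kTrials / 2];
    return {std::max<std::int64_t>(median, 0), median < 0};
}
```

Injecting the clock as a callable also makes the harness testable with a fake clock, which is how the clamping path can be exercised deterministically.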

3. Bench results

Timer variant benchmark: accuracy error vs overhead

| Variant | Accuracy err | Overhead |
|---|---|---|
| chrono | 12.84% | 17.79% |
| RDTSC | 5.85% | 8.55% |
| RDTSCP | 1.57% | 15.36% |
| RDTSCP + LFENCE | 0.31% | 19.73% |

Minor changes

  • load_child_events_simple: parent_event lookup hoisted out of the inner child loop (was redundantly fetched per child).
  • EventHandler ctor signature dropped the defaulted start_time parameter; start_time is now captured after register_event() and after the write_events_locked spin, so the spin no longer counts toward the measured interval.
  • EventHandler dtor: removed manual capacity()-size() < 1 check before emplace_back (let vector handle it). Sub-events still use an explicit reserve(max(4, cap*4)) growth pattern.
  • BeautifulTable::table_time unit string changed from "mcs" to "us". parse_function_timing in the benchmark updated to match. us is the standard ASCII fallback for µs in low-latency perf tooling; mcs is non-standard.
  • table_timepoint rewritten to dispatch through ActiveClock::to_string.
  • result_print / result_as_string print TSC frequency: X GHz when a TSC backend is active.
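As an illustration of the explicit growth pattern mentioned for sub-events, a small sketch follows. The helper name emplace_with_4x_growth is hypothetical and the code is only a model of the reserve(max(4, cap*4)) rule, not the library's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Append an element, growing capacity 4x (minimum 4) when the vector is full.
template <class T, class... Args>
T& emplace_with_4x_growth(std::vector<T>& v, Args&&... args) {
    if (v.size() == v.capacity()) {
        v.reserve(std::max<std::size_t>(4, v.capacity() * 4));
    }
    assert(v.size() < v.capacity());  // the emplace below cannot reallocate
    v.emplace_back(std::forward<Args>(args)...);
    return v.back();
}
```

Compared with leaving growth to emplace_back, this reallocates less often for deep sub-event lists, at the cost of holding more unused capacity.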
