feat(host_metrics source): add temperature metrics collector #25607
somaz94 wants to merge 1 commit into
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 34f8b6fec3
| impl HostMetrics { | ||
| pub async fn temperature_metrics(&self, output: &mut super::MetricsBuffer) { | ||
| output.name = "temperature"; | ||
| let components = Components::new_with_refreshed_list(); |
Persist Components before reporting max temperatures
When a Linux sensor does not expose a kernel tempN_highest file, sysinfo::Component::max() is computed by comparing successive refreshes of the same Component. Recreating Components on every temperature_metrics call resets that history, so temperature_max_celsius becomes the current sample on each scrape rather than the highest observed temperature. Keep the Components collection on HostMetrics and refresh it between scrapes, or avoid emitting the computed max when no persistent history is available.
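The effect described above can be modelled without `sysinfo` at all. The following is a minimal sketch, assuming a hypothetical `MaxTracker` type that stands in for state kept on `HostMetrics` between scrapes; it is not the sysinfo API and not the code in this PR. It only illustrates why a maximum needs storage that outlives a single `temperature_metrics` call.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for per-sensor state kept between scrapes.
/// It remembers the highest finite reading seen for each sensor key.
#[derive(Debug, Default)]
pub struct MaxTracker {
    highest: HashMap<String, f32>,
}

impl MaxTracker {
    /// Records `current` for `sensor` and returns the highest value observed so far.
    /// Non-finite readings are ignored and leave the stored maximum unchanged.
    pub fn observe(&mut self, sensor: &str, current: f32) -> Option<f32> {
        if current.is_finite() {
            let entry = self.highest.entry(sensor.to_string()).or_insert(current);
            if current > *entry {
                *entry = current;
            }
        }
        self.highest.get(sensor).copied()
    }
}
```

A tracker that lives across scrapes keeps reporting an earlier peak after the temperature drops, whereas a tracker created fresh on every scrape can only ever return the current sample, which is the behaviour the review comment describes.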
| impl HostMetrics { | ||
| pub async fn temperature_metrics(&self, output: &mut super::MetricsBuffer) { | ||
| output.name = "temperature"; | ||
| let components = Components::new_with_refreshed_list(); |
Honor SYSFS_ROOT when scraping temperatures
In containerized host-metrics deployments that mount the host sysfs somewhere like /host/sys and set SYSFS_ROOT, the other Linux collectors are redirected through init_roots(), but this sysinfo::Components call reads the process' normal sysfs path instead. Enabling the new collector in that documented setup will scrape the container's /sys and commonly emit no host temperature metrics even though the host sensors are mounted under SYSFS_ROOT.
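One possible direction, sketched under assumptions: resolve the sensor directory from a configurable root instead of a hard-coded `/sys`. The helper names `hwmon_dir` and `parse_millidegrees` are hypothetical and are not part of Vector or sysinfo. The review does not say whether `sysinfo` itself can be pointed at a different sysfs root, so this sketch only shows the path resolution and the parsing of a `tempN_input` value, which the kernel reports in millidegrees Celsius.

```rust
use std::path::{Path, PathBuf};

/// Resolves the hwmon directory under a configurable sysfs root.
/// `sysfs_root` would come from the `SYSFS_ROOT` environment variable;
/// `None` or an empty string falls back to `/sys`.
pub fn hwmon_dir(sysfs_root: Option<&str>) -> PathBuf {
    let root = sysfs_root.filter(|r| !r.is_empty()).unwrap_or("/sys");
    Path::new(root).join("class").join("hwmon")
}

/// Parses the contents of a `tempN_input` file (millidegrees Celsius)
/// into degrees Celsius. Returns `None` when the contents are not an integer.
pub fn parse_millidegrees(raw: &str) -> Option<f64> {
    raw.trim().parse::<i64>().ok().map(|m| m as f64 / 1000.0)
}
```

A caller could pass the environment variable through with `hwmon_dir(std::env::var("SYSFS_ROOT").ok().as_deref())`, so a container that mounts the host sysfs at `/host/sys` would look under `/host/sys/class/hwmon`.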
| if let Some(temperature) = component.temperature() { | ||
| output.gauge(TEMPERATURE_CELSIUS, temperature as f64, tags()); | ||
| } | ||
| if let Some(max) = component.max() { | ||
| output.gauge(TEMPERATURE_MAX_CELSIUS, max as f64, tags()); |
On Linux, sysinfo can return Some(f32::NAN) for temperature and max values when a sensor file exists but the read fails, and these branches emit that value as a normal gauge. In those sensor-error cases Vector will forward temperature_celsius/temperature_max_celsius samples with NaN values, which downstream metric sinks such as New Relic explicitly reject, so these readings should be filtered with is_finite() before creating metrics.
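The suggested `is_finite()` filter could look roughly like this. It is a minimal sketch: `finite_reading` is a hypothetical helper name, and it assumes the reading arrives as an `Option<f32>`, as the quoted diff implies for `component.temperature()` and `component.max()`.

```rust
/// Converts an optional sensor reading into a gauge value,
/// dropping NaN and infinite readings instead of emitting them.
pub fn finite_reading(value: Option<f32>) -> Option<f64> {
    value.filter(|v| v.is_finite()).map(f64::from)
}
```

With such a helper, each `if let Some(...)` branch in the collector would match on `finite_reading(component.temperature())` rather than on the raw value, so a failed sensor read produces no sample at all.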
| let label = component.label(); | ||
| let tags = || metric_tags!(COMPONENT => label); |
Fall back to component IDs for empty labels
On Linux systems where sysinfo falls back from hwmon to /sys/class/thermal (for example Raspberry Pi-style environments), component.label() is empty while component.id() contains the thermal-zone identifier. Using the empty label as the only component tag makes all temperature series share the same tag set when more than one thermal zone is present, so downstream aggregation can collapse distinct sensors; use the ID as a fallback when the label is empty.
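The fallback could be expressed as a small pure function. This is a sketch, not the PR's code: `component_tag` is a hypothetical name, and it assumes the label is available as `&str` and the ID as `Option<&str>`; the exact return types of `component.label()` and `component.id()` depend on the `sysinfo` version in use.

```rust
/// Picks the value for the `component` tag:
/// the label when it is non-empty, otherwise the ID when one is present.
/// Returns `None` when neither yields a usable identifier.
pub fn component_tag<'a>(label: &'a str, id: Option<&'a str>) -> Option<&'a str> {
    if !label.trim().is_empty() {
        Some(label)
    } else {
        id.filter(|id| !id.trim().is_empty())
    }
}
```

Returning `None` when both are empty leaves the caller free to skip the sensor or use a placeholder, rather than emitting several series with an identical empty tag.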
Force-pushed from 34f8b6f to 4c51702.
Summary
Adds a `temperature` collector to the `host_metrics` source. When enabled, it reads hardware temperature sensors via `sysinfo::Components` and emits three gauges, each tagged with the `component` label of the sensor it was read from:
- `temperature_celsius`: current temperature
- `temperature_max_celsius`: highest recorded temperature
- `temperature_critical_celsius`: critical threshold (only when the sensor reports one)

The collector is opt-in (it is not part of the default collector set). Many environments where Vector runs (containers, virtual machines, most cloud instances) expose no temperature sensors, so enabling it by default would add a per-scrape `Components` refresh that yields nothing. Users add `temperature` to `collectors` to turn it on. Components that do not report a given value are skipped, and hosts without sensors simply produce no metrics.

Closes: #21389
Vector configuration
How did you test this PR?
`generates_temperature_metrics` unit test that asserts every emitted metric is a gauge named `temperature*` and carries the `component` tag. The test tolerates an empty result so it also passes in sensorless CI environments.
generated/host_metrics.cue) for the new `temperature` collector enum value.

Change Type
Is this a breaking change?
Does this PR include user facing changes?
changelog.d/.
no-changelog label to this PR.

References