
Fix/system monitor #875

Open
jaagut wants to merge 3 commits into main from fix/system_monitor


Conversation

Member

@jaagut jaagut commented May 24, 2026

Summary

Proposed changes

Related issues

Checklist

  • Run pixi run build
  • Write documentation
  • Test on your machine
  • Test on the robot
  • Create issues for future work
  • Triage this PR and label it

Enhance GPU monitoring by integrating NVIDIA and AMD detection, updating collection methods, and adding support for nvidia-ml-py package
@jaagut jaagut force-pushed the fix/system_monitor branch from b051716 to 33366a8 Compare May 24, 2026 19:12
@jaagut jaagut marked this pull request as ready for review May 24, 2026 20:02
@jaagut jaagut moved this from 🆕 New to 📋 Backlog in Software May 24, 2026
@jaagut jaagut moved this from 📋 Backlog to 👀 In review in Software May 24, 2026
@jaagut jaagut requested review from ChlukasX, Flova, MegaIng and Copilot May 24, 2026 20:02
Contributor

Copilot AI left a comment


Pull request overview

Improves the ROS2 system_monitor package to collect and publish more robust system workload metrics (notably GPU stats) across different hardware backends, and ensures the required Workload message is generated in bitbots_msgs.

Changes:

  • Add Workload.msg to bitbots_msgs interface generation.
  • Refactor GPU monitoring to auto-detect NVIDIA (NVML), Jetson (sysfs), and AMD (pyamdgpuinfo) backends; tighten type consistency in collectors.
  • Adjust sampling behavior (CPU smoothing + lower default update frequency) and add nvidia-ml-py to the Pixi environment.
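
The backend auto-detection described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `detect_backend`, `default_probes`, the probe order, and the Jetson sysfs path are assumptions made for the example. The probes rely on `pynvml.nvmlInit()` and `pyamdgpuinfo.detect_gpus()` to check that a device is actually usable, since an installed package alone does not imply matching hardware.

```python
from pathlib import Path
from typing import Callable, Optional

# Assumed sysfs location of the Jetson GPU load file; the path used in the PR may differ.
JETSON_LOAD_PATH = Path("/sys/devices/gpu.0/load")

Probe = tuple[str, Callable[[], bool]]


def _nvml_usable() -> bool:
    """Return True if NVML can be imported and initialized."""
    try:
        import pynvml

        pynvml.nvmlInit()
        pynvml.nvmlShutdown()
        return True
    except Exception:
        # Covers a missing package as well as a missing driver or device.
        return False


def _amd_usable() -> bool:
    """Return True if pyamdgpuinfo reports at least one GPU."""
    try:
        import pyamdgpuinfo

        return pyamdgpuinfo.detect_gpus() > 0
    except Exception:
        return False


def default_probes() -> list[Probe]:
    """Probes in an assumed priority order: NVML, Jetson sysfs, AMD."""
    return [
        ("nvidia", _nvml_usable),
        ("jetson", JETSON_LOAD_PATH.exists),
        ("amd", _amd_usable),
    ]


def detect_backend(probes: Optional[list[Probe]] = None) -> Optional[str]:
    """Return the name of the first backend whose probe succeeds, else None."""
    for name, probe in default_probes() if probes is None else probes:
        try:
            if probe():
                return name
        except Exception:
            # A failing probe only means this backend is unavailable.
            continue
    return None
```

Passing the probes as a parameter keeps the selection logic testable on machines without any GPU.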

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/bitbots_msgs/CMakeLists.txt | Adds Workload.msg to rosidl-generated interfaces so downstream nodes can publish/subscribe it. |
| src/bitbots_misc/system_monitor/system_monitor/network_interfaces.py | Adds return type annotations for interface collection helpers. |
| src/bitbots_misc/system_monitor/system_monitor/monitor.py | Updates GPU collector call signature and aligns default “disabled” tuple types; minor comment grammar fix. |
| src/bitbots_misc/system_monitor/system_monitor/memory.py | Adds a typed return annotation for memory stats collection. |
| src/bitbots_misc/system_monitor/system_monitor/gpu.py | Replaces single-backend AMD logic with auto-detected NVIDIA/Jetson/AMD backends and improved error handling/logging. |
| src/bitbots_misc/system_monitor/system_monitor/cpus.py | Adds EMA smoothing for CPU usage values and updates return/type annotations. |
| src/bitbots_misc/system_monitor/config/config.yaml | Lowers default update frequency from 10 Hz to 2 Hz. |
| pixi.toml | Adds nvidia-ml-py dependency for NVML-based monitoring. |
| pixi.lock | Locks nvidia-ml-py into all environments. |
| .vscode/settings.json | Adds dictionary words related to new GPU monitoring terms. |


Comment on lines +79 to +85
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
load = float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
vram_used = mem_info.used
vram_total = mem_info.total
temperature = float(pynvml.nvmlDeviceGetTemperature(handle, 0))
return (load, vram_used, vram_total, temperature)

if raw_load is None:
    continue
# Jetson reports GPU load in permille on current L4T kernels.
load = raw_load / 10.0
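
The permille-to-percent conversion in the Jetson snippet can be isolated into a small parsing helper. The sketch below is hypothetical: `parse_permille` and `read_jetson_load` are names invented for this example, and it assumes the sysfs file contains a single integer in permille, as the comment in the diff states.

```python
from pathlib import Path
from typing import Optional


def parse_permille(text: str) -> Optional[float]:
    """Convert a permille reading such as "537\n" into percent (53.7)."""
    try:
        raw = int(text.strip())
    except ValueError:
        # Empty or non-numeric content: treat as "no reading".
        return None
    return raw / 10.0


def read_jetson_load(path: Path) -> Optional[float]:
    """Read a load file and return the load in percent, or None on failure."""
    try:
        return parse_permille(path.read_text())
    except OSError:
        # File missing or unreadable, e.g. when not running on a Jetson.
        return None
```

Returning `None` instead of raising lets the caller skip to the next candidate path, which matches the `if raw_load is None: continue` pattern in the snippet.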
Comment on lines +76 to +82
# smooth short-term sampling noise with exponential moving average
prev = _prev_usage[cpu_num]
if prev == 0.0:
    smoothed = float(round(raw_usage, 2))
else:
    smoothed = float(round((raw_usage * _EMA_ALPHA) + (prev * (1.0 - _EMA_ALPHA)), 2))
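
One thing to note about the snippet above: `prev == 0.0` serves as the "no previous sample" marker, so a core that was genuinely idle at 0 % is treated as uninitialized and the next reading bypasses smoothing. A possible alternative, sketched here under assumptions, tracks the missing state explicitly with `None`. `EmaSmoother` is a hypothetical name and the default `alpha` is illustrative; the value of `_EMA_ALPHA` in the PR is not shown.

```python
from typing import Optional


class EmaSmoother:
    """Per-CPU exponential moving average with an explicit 'no sample yet' state."""

    def __init__(self, alpha: float = 0.3) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self._alpha = alpha
        # Missing key means no sample has been seen for that CPU yet.
        self._prev: dict[int, float] = {}

    def update(self, cpu_num: int, raw_usage: float) -> float:
        """Fold a new raw reading into the average and return it rounded to 2 places."""
        prev: Optional[float] = self._prev.get(cpu_num)
        if prev is None:
            # First sample: nothing to smooth against.
            smoothed = raw_usage
        else:
            smoothed = raw_usage * self._alpha + prev * (1.0 - self._alpha)
        self._prev[cpu_num] = smoothed
        return round(smoothed, 2)
```

With `alpha=0.5`, a core that reports 0 % and then 50 % yields 25 % on the second call, whereas the zero-as-sentinel variant would report the unsmoothed 50 %.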

Comment on lines +72 to +90
def _collect_nvidia(node: Node) -> tuple[float, int, int, float]:
    """Collect GPU metrics from NVIDIA GPU using pynvml."""
    try:
        import pynvml

        pynvml.nvmlInit()
        try:
            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
            load = float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            vram_used = mem_info.used
            vram_total = mem_info.total
            temperature = float(pynvml.nvmlDeviceGetTemperature(handle, 0))
            return (load, vram_used, vram_total, temperature)
        finally:
            try:
                pynvml.nvmlShutdown()
            except Exception:
                pass
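
For comparison, here is a hedged variant of the same collection pattern that uses the named sensor constant `pynvml.NVML_TEMPERATURE_GPU` instead of the literal `0` and catches `pynvml.NVMLError` rather than a bare `Exception`. `collect_nvidia`, the injectable `nvml` parameter, and the `DISABLED` placeholder values are assumptions for this sketch; the actual "disabled" tuple in `monitor.py` is not shown in the diff.

```python
from typing import Any, Optional

# Assumed placeholder returned when no NVIDIA data is available.
DISABLED: tuple[float, int, int, float] = (0.0, 0, 0, 0.0)


def collect_nvidia(nvml: Optional[Any] = None, index: int = 0) -> tuple[float, int, int, float]:
    """Return (load %, VRAM used bytes, VRAM total bytes, temperature C) or DISABLED."""
    if nvml is None:
        try:
            import pynvml as nvml  # provided by the nvidia-ml-py package
        except ImportError:
            return DISABLED
    try:
        nvml.nvmlInit()
    except nvml.NVMLError:
        # No driver or no device: nothing to shut down.
        return DISABLED
    try:
        handle = nvml.nvmlDeviceGetHandleByIndex(index)
        load = float(nvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        mem = nvml.nvmlDeviceGetMemoryInfo(handle)
        temp = float(nvml.nvmlDeviceGetTemperature(handle, nvml.NVML_TEMPERATURE_GPU))
        return (load, int(mem.used), int(mem.total), temp)
    except nvml.NVMLError:
        return DISABLED
    finally:
        # Always pair a successful nvmlInit() with nvmlShutdown().
        try:
            nvml.nvmlShutdown()
        except nvml.NVMLError:
            pass
```

Accepting the module as a parameter is only a testing convenience; calling `collect_nvidia()` without arguments uses the real `pynvml` when it is installed.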
Comment on lines +186 to +188
If `node` is provided the ROS node's logger will be used for messages.

node: ROS node for logging (required for backend detection and error logging)

Labels

None yet

Projects

Status: 👀 In review


2 participants