Fix/system monitor by jaagut · Pull Request #875 · bit-bots/bitbots_main

jaagut · 2026-05-24T17:52:43Z

Summary

Proposed changes

Related issues

Checklist

Enhance GPU monitoring by integrating NVIDIA and AMD detection, updating collection methods, and adding support for nvidia-ml-py package

Copilot

Pull request overview

Improves the ROS2 system_monitor package to collect and publish more robust system workload metrics (notably GPU stats) across different hardware backends, and ensures the required Workload message is generated in bitbots_msgs.

Changes:

Add Workload.msg to bitbots_msgs interface generation.
Refactor GPU monitoring to auto-detect NVIDIA (NVML), Jetson (sysfs), and AMD (pyamdgpuinfo) backends; tighten type consistency in collectors.
Adjust sampling behavior (CPU smoothing + lower default update frequency) and add nvidia-ml-py to the Pixi environment.
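
The backend auto-detection described above can be sketched roughly as follows. This is not the code from `gpu.py`: the function names, the probe order, and the Jetson sysfs paths are assumptions made for illustration. Because `nvidia-ml-py` is locked into all environments, the sketch assumes that a successful import of `pynvml` is not enough and that `nvmlInit()` has to succeed before the NVIDIA backend is chosen.

```python
from __future__ import annotations

import importlib.util
import os
from collections.abc import Callable

# Hypothetical sysfs locations; the real path depends on the L4T release.
JETSON_LOAD_PATHS = ("/sys/devices/gpu.0/load", "/sys/devices/platform/gpu.0/load")


def _nvml_usable() -> bool:
    """True only if pynvml imports AND the NVML driver library initialises."""
    try:
        import pynvml

        pynvml.nvmlInit()
    except Exception:  # ImportError or pynvml.NVMLError, depending on the machine
        return False
    try:
        pynvml.nvmlShutdown()
    except Exception:
        pass
    return True


def _module_available(name: str) -> bool:
    """True if `name` could be imported, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def default_probes() -> list[tuple[str, Callable[[], bool]]]:
    """Probe order is an assumption: discrete NVIDIA, then Jetson, then AMD."""
    return [
        ("nvidia", _nvml_usable),
        ("jetson", lambda: any(os.path.exists(p) for p in JETSON_LOAD_PATHS)),
        ("amd", lambda: _module_available("pyamdgpuinfo")),
    ]


def detect_backend(probes: list[tuple[str, Callable[[], bool]]] | None = None) -> str | None:
    """Return the name of the first backend whose probe succeeds, else None."""
    for name, probe in default_probes() if probes is None else probes:
        try:
            if probe():
                return name
        except Exception:
            continue  # a failing probe means "backend not available"
    return None
```

Passing the probes in as a list keeps the selection logic testable on machines without any GPU.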

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/bitbots_msgs/CMakeLists.txt | Adds `Workload.msg` to rosidl-generated interfaces so downstream nodes can publish/subscribe it. |
| src/bitbots_misc/system_monitor/system_monitor/network_interfaces.py | Adds return type annotations for interface collection helpers. |
| src/bitbots_misc/system_monitor/system_monitor/monitor.py | Updates GPU collector call signature and aligns default “disabled” tuple types; minor comment grammar fix. |
| src/bitbots_misc/system_monitor/system_monitor/memory.py | Adds a typed return annotation for memory stats collection. |
| src/bitbots_misc/system_monitor/system_monitor/gpu.py | Replaces single-backend AMD logic with auto-detected NVIDIA/Jetson/AMD backends and improved error handling/logging. |
| src/bitbots_misc/system_monitor/system_monitor/cpus.py | Adds EMA smoothing for CPU usage values and updates return/type annotations. |
| src/bitbots_misc/system_monitor/config/config.yaml | Lowers default update frequency from 10 Hz to 2 Hz. |
| pixi.toml | Adds `nvidia-ml-py` dependency for NVML-based monitoring. |
| pixi.lock | Locks `nvidia-ml-py` into all environments. |
| .vscode/settings.json | Adds dictionary words related to new GPU monitoring terms. |

+            if raw_load is None:
+                continue
+            # Jetson reports GPU load in permille on current L4T kernels.
+            load = raw_load / 10.0
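
For context, the permille-to-percent conversion in this hunk might sit inside a reader like the one below. This is a sketch only: the function name and the idea of trying several candidate paths are assumptions, and the clamping to 0..100 is an extra safety margin that the hunk does not show.

```python
from __future__ import annotations

from collections.abc import Iterable
from pathlib import Path


def read_jetson_load_percent(candidates: Iterable[str | Path]) -> float | None:
    """Return GPU load in percent from the first readable sysfs file.

    The file is assumed to contain a single integer in permille (0..1000),
    as the comment in the hunk states for current L4T kernels.
    """
    for path in candidates:
        try:
            raw_load = int(Path(path).read_text().strip())
        except (OSError, ValueError):
            continue  # file missing, unreadable, or not an integer: try the next one
        # Convert permille to percent and clamp (clamping is an addition here).
        return max(0.0, min(100.0, raw_load / 10.0))
    return None
```

Returning `None` when no candidate is readable lets the caller distinguish "no Jetson GPU file found" from a real load of 0 %.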


+    # smooth short-term sampling noise with exponential moving average
+    prev = _prev_usage[cpu_num]
+    if prev == 0.0:
+        smoothed = float(round(raw_usage, 2))
+    else:
+        smoothed = float(round((raw_usage * _EMA_ALPHA) + (prev * (1.0 - _EMA_ALPHA)), 2))
+
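
The smoothing in this hunk is a standard exponential moving average, `smoothed = alpha * raw + (1 - alpha) * prev`. In the excerpt a previous value of exactly `0.0` is handled like the very first sample, so a core that was genuinely idle would skip smoothing on its next reading. The sketch below shows one possible alternative that tracks "no previous sample" separately; the class name and the value of `alpha` are assumptions, since the PR's `_EMA_ALPHA` is not visible in the excerpt.

```python
from __future__ import annotations

EMA_ALPHA = 0.5  # assumed value for illustration only


class EmaSmoother:
    """Per-CPU exponential moving average with an explicit 'no sample yet' state."""

    def __init__(self, alpha: float = EMA_ALPHA) -> None:
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self._alpha = alpha
        self._prev: dict[int, float] = {}  # missing key means: no previous sample

    def update(self, cpu_num: int, raw_usage: float) -> float:
        prev = self._prev.get(cpu_num)
        if prev is None:
            smoothed = float(raw_usage)  # first sample seeds the average
        else:
            smoothed = self._alpha * raw_usage + (1.0 - self._alpha) * prev
        self._prev[cpu_num] = smoothed  # keep the unrounded value as state
        return round(smoothed, 2)


smoother = EmaSmoother(alpha=0.5)
print(smoother.update(0, 80.0))  # 80.0 (seed)
print(smoother.update(0, 40.0))  # 60.0 = 0.5 * 40 + 0.5 * 80
```

With this layout a reading that follows a true 0 % sample is still smoothed: seeding with `0.0` and then feeding `100.0` yields `50.0` at `alpha = 0.5`.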


+def _collect_nvidia(node: Node) -> tuple[float, int, int, float]:
+    """Collect GPU metrics from NVIDIA GPU using pynvml."""
+    try:
+        import pynvml
+
+        pynvml.nvmlInit()
+        try:
+            handle = pynvml.nvmlDeviceGetHandleByIndex(0)
+            load = float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
+            mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
+            vram_used = mem_info.used
+            vram_total = mem_info.total
+            temperature = float(pynvml.nvmlDeviceGetTemperature(handle, 0))
+            return (load, vram_used, vram_total, temperature)
+        finally:
+            try:
+                pynvml.nvmlShutdown()
+            except Exception:
+                pass
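
The hunk above is cut off before the outer `try` is closed. As a self-contained illustration of the same NVML calls, a collector could look like the sketch below. It is not the PR's implementation: the function name and the choice to return `None` on any failure are assumptions. The literal `0` passed to `nvmlDeviceGetTemperature` in the hunk corresponds to the named constant `pynvml.NVML_TEMPERATURE_GPU`, which the sketch uses for readability.

```python
from __future__ import annotations


def collect_nvidia_sketch(index: int = 0) -> tuple[float, int, int, float] | None:
    """Return (load %, VRAM used bytes, VRAM total bytes, temperature °C) or None."""
    try:
        import pynvml  # provided by the nvidia-ml-py package
    except ImportError:
        return None

    try:
        pynvml.nvmlInit()
    except Exception:  # typically pynvml.NVMLError, e.g. driver library not found
        return None

    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        load = float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        mem_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temperature = float(
            pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        )
        return (load, int(mem_info.used), int(mem_info.total), temperature)
    except Exception:  # any NVML query failure is reported as "no data"
        return None
    finally:
        # Always release NVML, mirroring the finally block in the hunk.
        try:
            pynvml.nvmlShutdown()
        except Exception:
            pass
```

Calling `nvmlInit()` and `nvmlShutdown()` on every sample, as both the hunk and this sketch do, keeps the function stateless; initialising once at node start-up would be an alternative if the per-call overhead matters at the configured update frequency.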


+    If `node` is provided the ROS node's logger will be used for messages.
+
+    node: ROS node for logging (required for backend detection and error logging)


Refactor system monitor components for improved data handling and type consistency

c6fcff2

github-project-automation Bot added this to Software May 24, 2026

github-project-automation Bot moved this to 🆕 New in Software May 24, 2026

Fix system monitoring (on intel system)

33366a8

Enhance GPU monitoring by integrating NVIDIA and AMD detection, updating collection methods, and adding support for nvidia-ml-py package

jaagut force-pushed the fix/system_monitor branch from b051716 to 33366a8 Compare May 24, 2026 19:12

Fix system_monitor GPU on jetson

c6842cf

jaagut marked this pull request as ready for review May 24, 2026 20:02

jaagut moved this from 🆕 New to 📋 Backlog in Software May 24, 2026

jaagut moved this from 📋 Backlog to 👀 In review in Software May 24, 2026

jaagut requested review from ChlukasX, Flova, MegaIng and Copilot May 24, 2026 20:02

Copilot started reviewing on behalf of jaagut May 24, 2026 20:03 View session

Copilot AI reviewed May 24, 2026

jaagut wants to merge 3 commits into main from fix/system_monitor

2 participants
