GPU telemetry with workload attribution. One OTLP agent per node ties hardware metrics (NVIDIA, AMD, Intel Gaudi) to the K8s pod or Slurm job burning the GPU.
kubernetes amd gpu helm slurm nvidia gpu-monitoring mlops opentelemetry llm-observability nvml-monitoring dcgm intel-gaudi-base-operator workload-attribution
Updated May 19, 2026 · Python
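The tagline says each node's agent ties GPU metrics to the Kubernetes pod or Slurm job using the GPU. One common way to do this is to take the PIDs running on a GPU and read each PID's `/proc/<pid>/cgroup`. Those PIDs can come from NVML's per-device process queries, for example. Kubernetes places pod processes under cgroups whose paths contain `pod<uid>`, and Slurm's cgroup plugin nests job processes under `job_<id>`.

Below is a minimal Python sketch of that lookup. It is not this project's implementation. The function names `attribute_from_cgroup` and `read_cgroup` are hypothetical. `k8s.pod.uid` follows the OpenTelemetry Kubernetes resource conventions, while `slurm.job.id` is an assumed attribute key.

```python
import re
from typing import Dict

# Kubernetes pod UIDs appear in cgroup paths as "pod<uid>" (cgroupfs driver)
# or "pod<uid_with_underscores>.slice" (systemd driver).
_POD_RE = re.compile(
    r"pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})"
)
# Slurm's cgroup plugin nests each job's processes under a "job_<id>" directory.
_SLURM_RE = re.compile(r"/job_(\d+)(?:/|$)")


def attribute_from_cgroup(cgroup_text: str) -> Dict[str, str]:
    """Map the contents of /proc/<pid>/cgroup to workload attributes.

    Hypothetical helper. Returns an empty dict when the process belongs
    to neither a Kubernetes pod nor a Slurm job.
    """
    attrs: Dict[str, str] = {}
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        path = parts[2]
        pod = _POD_RE.search(path)
        if pod and "k8s.pod.uid" not in attrs:
            # The systemd driver writes UID dashes as underscores; normalize them.
            attrs["k8s.pod.uid"] = pod.group(1).replace("_", "-")
        job = _SLURM_RE.search(path)
        if job and "slurm.job.id" not in attrs:  # assumed attribute key
            attrs["slurm.job.id"] = job.group(1)
    return attrs


def read_cgroup(pid: int) -> str:
    """Read the cgroup membership of a live process (Linux only)."""
    with open(f"/proc/{pid}/cgroup", encoding="utf-8") as f:
        return f.read()
```

On a node, an agent would call `attribute_from_cgroup(read_cgroup(pid))` for each GPU process and attach the result to that GPU's metric data points. Parsing cgroup paths avoids a call to the Kubernetes API per sample. The trade-off is that it yields only the pod UID, so mapping the UID to a pod name or namespace still requires a separate lookup.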