amd/Gyanam-GPU-Observability-Assistant

GYANAM — GPU Observability Assistant

Primary goal: support large-scale GPU observability and any debug effort that needs accurate, continuous, out-of-band telemetry from AMD Instinct fleets — without proprietary agents or vendor lock-in.

GYANAM polls GPU servers via the standard Redfish API (with optional SSH-proxy and SSE streaming paths), parses the telemetry into a schema-aware metric model, stores it in a time-series database (InfluxDB by default; Prometheus also supported), and surfaces it through pre-built Grafana dashboards. On-demand diagnostic-log collection, alert subscriptions, and a CSV export pipeline round out the toolset so operators and debug engineers can move from "something is wrong" to a shareable evidence bundle without writing custom code.

GYANAM Fleet Overview dashboard — fleet-wide GPU temperature, power, and health at a glance

Problem Statement

Modern large-scale customer AI environments require reliable and continuous access to accurate out-of-band (OOB) telemetry and diagnostic logs to support effective hardware debug, triage, and RMA workflows. Today, debug organisations frequently receive incomplete or stale diagnostic data from the point of failure, significantly limiting root-cause accuracy and slowing RMA Failure Analysis.

The primary purpose of GYANAM (GPU Observability Assistant) is to close that gap. It is an on-premise, customer-deployable assistant designed to harvest time-series telemetry and critical debug logs from GPU fleets using only industry-standard interfaces — DMTF Redfish, OCP OAM, and other open specifications. This empowers operators and debug teams to autonomously collect, inspect, and share comprehensive observability and debug data — improving Time to Hypothesis (TTH) and Time to Root Cause (TTR) — while building trust through full transparency of the software solution implementation.

What GYANAM does

The GPU Observability Assistant provides:

  • Continuous large-scale observability — automated, periodic harvesting of OOB telemetry from 250-500+ GPU servers with per-target persistent connections, fire-and-forget scheduling, and tiered retention for long-term analysis.
  • Debug-effort enablement — on-demand diagnostic log-bundle collection per target, structured alert subscriptions (SSE + webhook fallback), and a robust CSV export pipeline with pre-flight count, chunking, retry, and server-side aggregation for sharing evidence with engineering.
  • Standards-only data plane — DMTF Redfish + OCP-aligned schemas, no proprietary agents on the host.
  • On-premise deployment — runs entirely inside the customer's security boundary; nothing leaves their network unless explicitly exported.
  • Customer-initiated data packaging — operators decide what snapshots to share when filing tickets or RMAs.
  • Reference implementation for the ecosystem — encourages OEMs and cloud partners to align around open observability standards.

The assistant acts as an enablement layer for both day-to-day GPU observability and ad-hoc debug efforts, improving the fidelity of telemetry delivered with debug tickets and enabling more accurate, data-driven RMA decisions.

Screenshots

Fleet temperature heatmap
Fleet Heatmap — temperature and power distribution across every GPU in the fleet, at a glance.



Targets management page
Targets — bulk CSV import, per-target test & on-demand log collection, live polling status.



Per-system GPU compute dashboard
Per-system drill-down — every GPU's compute, memory, interconnect, and power broken out.



Alerts subscription and history page
Alerts — real-time SSE / webhook subscriptions with severity routing and history.



Collected diagnostic log bundles
Diagnostic log bundles — on-demand harvest, download, and share for RMA or debug tickets.



Fleet outliers dashboard
Outlier detection — hot GPUs, high-power consumers, thermal imbalance — surfaces fleet anomalies fast.

The screenshots above are expected as PNG files under docs/screenshots/.

Architecture

See System Architecture (PDF) — Mermaid source in docs/architecture.mmd. For the runtime scalability aspects see docs/SCALABILITY.md.

Quick Start

All operations use the gyanam.sh management script.

1. Initialize environment

./gyanam.sh init

Generates a .env file with random passwords, tokens, and an encryption key. Save the printed credentials; the Grafana and InfluxDB logins come from this output.

2. Start

./gyanam.sh start

3. Access services

| Service  | URL                   | Default credentials |
|----------|-----------------------|---------------------|
| Web UI   | http://localhost:8080 | admin / changeme    |
| Grafana  | http://localhost:3000 | (from init output)  |
| InfluxDB | http://localhost:8086 | (from init output)  |

4. Add targets

Open http://localhost:8080, click Add Target, configure BMC connection details, then click Test to verify.

Targets page with bulk-import dialog and per-target Test / Collect Logs actions

For bulk onboarding, use Export CSV / Import CSV on the targets page.
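
Before adding a target, it can help to confirm that the BMC answers on the Redfish service root, which the DMTF specification places at /redfish/v1/ and which is normally readable without credentials. The sketch below is a hypothetical helper and not part of gyanam.sh; the function names, the 10-second timeout, and the example address are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check, not shipped with GYANAM.

# Build the URL of the standard Redfish service root for a BMC host.
redfish_root_url() {
  local host="$1"
  printf 'https://%s/redfish/v1/\n' "$host"
}

# Return 0 if the BMC answers the service root with HTTP 200.
# -k skips TLS verification because many BMCs ship self-signed
# certificates; drop it where a trusted certificate is installed.
check_bmc() {
  local host="$1" code
  code=$(curl -k -s -o /dev/null -w '%{http_code}' --max-time 10 \
    "$(redfish_root_url "$host")") || return 1
  [ "$code" = "200" ]
}

# Example (address is a placeholder):
#   check_bmc 10.0.0.5 && echo "BMC reachable"
```

A passing check only shows that the Redfish endpoint is reachable; the Test button on the targets page remains the way to verify the stored connection details.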



Management

./gyanam.sh start                       # Start all services
./gyanam.sh stop                        # Stop all services
./gyanam.sh restart                     # Restart all services
./gyanam.sh status                      # Show status + health checks
./gyanam.sh monitor                     # Check volume usage and disk space
./gyanam.sh logs -f                     # Follow logs (all services)
./gyanam.sh logs collector              # Logs for a specific service
./gyanam.sh build                       # Rebuild the api + collector images

# InfluxDB inspection / export (see docs/DATA_EXPORT_REFERENCE.md)
./gyanam.sh influx-status               # List all buckets + retention
./gyanam.sh influx-list [bucket]        # List measurements in a bucket
./gyanam.sh influx-export [bucket] [out.csv.gz] [start] [stop]

# Downsampling (run once after first start)
./gyanam.sh setup-15m-downsampling      # 15-min aggregates (30d retention)
./gyanam.sh setup-hourly-downsampling   # hourly aggregates (90d retention)
./gyanam.sh setup-all-downsampling      # both, recommended

./gyanam.sh clean                       # Stop AND delete all data (volumes)
./gyanam.sh help                        # Full reference incl. env vars

Monitoring Volume Growth and Disk Usage

Monitor Docker volume sizes and available disk space to prevent storage issues.

Quick Check

./gyanam.sh monitor

Shows:

  • Current size of each Docker volume (influxdb-data, grafana-data, shared-data)
  • Available disk space on the Docker partition (color-coded: green <80%, yellow 80-90%, red >90%)
  • Total Docker system usage
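
The color bands above can be reproduced in a few lines of shell, for example to reuse the same thresholds in a custom check. This is a minimal sketch and not the code inside gyanam.sh: the function names are hypothetical, /var/lib/docker is assumed to be the Docker data directory, and treating exactly 90% as yellow is one reading of the ranges listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the documented thresholds.

# Map an integer disk-usage percentage to a color band:
# green below 80, yellow from 80 to 90, red above 90.
usage_color() {
  local pct="$1"
  if [ "$pct" -lt 80 ]; then
    echo green
  elif [ "$pct" -le 90 ]; then
    echo yellow
  else
    echo red
  fi
}

# Print the usage percentage (without the % sign) of the filesystem
# that holds the given path, using POSIX-format df output.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Example (path is an assumption):
#   usage_color "$(disk_usage_pct /var/lib/docker)"
```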

Automated Tracking

Track volume growth over time and get alerts when disk usage is high:

# Set up automated monitoring (optional)
crontab -e

Add these cron jobs (replace /path/to/gyanam with your install path):

# Track volume growth every hour
0 * * * * /path/to/gyanam/scripts/log_volume_growth.sh

# Alert when disk usage exceeds 80% (checks every 15 minutes)
*/15 * * * * /path/to/gyanam/scripts/alert_disk_space.sh

View Growth History

After running the log script for a while:

# View recent volume growth
tail -20 docker_volume_growth.log

# Show the last 24 hourly samples for influxdb-data (about one day of growth)
grep "influxdb-data" docker_volume_growth.log | tail -24
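
To turn those samples into a single growth figure, the first and last sizes can be subtracted. The sketch below assumes that each log line ends with the volume size as a plain number in a fixed unit; the actual format written by scripts/log_volume_growth.sh is not shown here and may differ, so the awk field may need adjusting. The function name daily_growth is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper. ASSUMPTION: the last whitespace-separated field
# of each log line is the volume size as a plain number.

# Print the size difference between the oldest and newest of the last
# 24 samples for one volume (roughly one day with hourly logging).
daily_growth() {
  local volume="$1" logfile="$2"
  grep -F -- "$volume" "$logfile" | tail -n 24 |
    awk 'NR == 1 { first = $NF }
         { last = $NF }
         END { if (NR > 0) print last - first }'
}

# Example:
#   daily_growth influxdb-data docker_volume_growth.log
```

The result is in whatever unit the log records. If the log stores human-readable sizes such as 1.2G, they would need converting to a common unit first.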

Configure Alerts

Edit scripts/alert_disk_space.sh to enable notifications:

  • Slack webhooks
  • Email alerts
  • Custom monitoring integrations

Details: See scripts/README.md for complete monitoring setup and troubleshooting.

Documentation

| Doc                           | When to read it                                                        |
|-------------------------------|------------------------------------------------------------------------|
| docs/DEPLOYMENT.md            | Setting up Gyanam on Ubuntu, sized for ~300 nodes                      |
| docs/SCALABILITY.md           | Tuning per fleet size + understanding the runtime architecture         |
| docs/DATA_EXPORT_REFERENCE.md | Exporting metrics to CSV (gyanam.sh wrapper + native InfluxDB recipes) |
| docs/CODEQL_REPORT.md         | Current CodeQL state, accepted-risk audit, re-run instructions         |
| LINTING.md                    | Pre-commit / ruff / mypy / shellcheck setup                            |
| scripts/README.md             | Volume / disk-space monitoring scripts                                 |
