Primary goal: support large-scale GPU observability and any debug effort that needs accurate, continuous, out-of-band telemetry from AMD Instinct fleets — without proprietary agents or vendor lock-in.
GYANAM polls GPU servers via the standard Redfish API (with optional SSH-proxy and SSE streaming paths), parses the telemetry into a schema-aware metric model, stores it in a time-series database (InfluxDB by default; Prometheus also supported), and surfaces it through pre-built Grafana dashboards. On-demand diagnostic-log collection, alert subscriptions, and a CSV export pipeline round out the toolset so operators and debug engineers can move from "something is wrong" to a shareable evidence bundle without writing custom code.
Modern large-scale customer AI environments require reliable and continuous access to accurate out-of-band (OOB) telemetry and diagnostic logs to support effective hardware debug, triage, and RMA workflows. Today, debug organisations frequently receive incomplete or stale diagnostic data from the point of failure, significantly limiting root-cause accuracy and slowing RMA Failure Analysis.
The primary purpose of GYANAM (GPU Observability Assistant) is to close that gap. It is an on-premise, customer-deployable assistant designed to harvest time-series telemetry and critical debug logs from GPU fleets using only industry-standard interfaces: DMTF Redfish, OCP OAM, and other open specifications. Operators and debug teams can then collect, inspect, and share comprehensive observability and debug data on their own, improving Time to Hypothesis (TTH) and Time to Root Cause (TTR), while the fully transparent implementation builds trust.
The GPU Observability Assistant provides:
- Continuous large-scale observability — automated, periodic harvesting of OOB telemetry from 250-500+ GPU servers with per-target persistent connections, fire-and-forget scheduling, and tiered retention for long-term analysis.
- Debug-effort enablement — on-demand diagnostic log-bundle collection per target, structured alert subscriptions (SSE + webhook fallback), and a robust CSV export pipeline with pre-flight count, chunking, retry, and server-side aggregation for sharing evidence with engineering.
- Standards-only data plane — DMTF Redfish + OCP-aligned schemas, no proprietary agents on the host.
- On-premise deployment — runs entirely inside the customer's security boundary; nothing leaves their network unless explicitly exported.
- Customer-initiated data packaging — operators decide what snapshots to share when filing tickets or RMAs.
- Reference implementation for the ecosystem — encourages OEMs and cloud partners to align around open observability standards.
The assistant acts as an enablement layer for both day-to-day GPU observability and ad-hoc debug efforts, improving the fidelity of telemetry delivered with debug tickets and enabling more accurate, data-driven RMA decisions.
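The chunking and retry behaviour listed for the CSV export pipeline can be illustrated with a minimal sketch. It is not GYANAM's implementation: `chunk_range` and `with_retry` are hypothetical helper names, and the window size and backoff policy are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Illustrative only: hypothetical helpers, not part of gyanam.sh.

# Split the half-open range [start, end) into windows of at most `step`.
# Prints one "from to" pair per line; values could be epoch seconds.
chunk_range() {
  local start=$1 end=$2 step=$3 from to
  for ((from = start; from < end; from += step)); do
    to=$((from + step))
    if ((to > end)); then to=$end; fi
    echo "$from $to"
  done
}

# Run a command up to `attempts` times, doubling the delay between tries.
# Returns 0 on the first success, 1 if every attempt fails.
with_retry() {
  local attempts=$1 delay=1 i
  shift
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    if ((i < attempts)); then
      sleep "$delay"
      delay=$((delay * 2))
    fi
  done
  return 1
}
```

A caller could pipe `chunk_range "$t0" "$t1" 3600` into a `while read -r from to` loop and wrap each per-window export command in `with_retry 3`, so that one failed window is retried without restarting the whole export.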
Fleet Heatmap — temperature and power distribution across every GPU in the fleet, at a glance.
Targets — bulk CSV import, per-target test & on-demand log collection, live polling status.
Per-system drill-down — every GPU's compute, memory, interconnect, and power broken out.
Alerts — real-time SSE / webhook subscriptions with severity routing and history.
Diagnostic log bundles — on-demand harvest, download, and share for RMA or debug tickets.
Outlier detection — hot GPUs, high-power consumers, thermal imbalance — surfaces fleet anomalies fast.
Screenshots above expect PNG files under docs/screenshots/.
See System Architecture (PDF); the Mermaid source is in docs/architecture.mmd. For the runtime scalability aspects, see docs/SCALABILITY.md.
All operations use the gyanam.sh management script.
```shell
./gyanam.sh init
```

Generates .env with random passwords, tokens, and an encryption key. Save the printed credentials.
```shell
./gyanam.sh start
```

| Service | URL | Default Credentials |
|---|---|---|
| Web UI | http://localhost:8080 | admin / changeme |
| Grafana | http://localhost:3000 | (from init output) |
| InfluxDB | http://localhost:8086 | (from init output) |
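For scripted checks (for example, waiting for the stack to come up in CI), the endpoints in the table can be probed with `curl`. The sketch below is an illustration, not part of `gyanam.sh`, whose `status` command already reports health. `probe` and `check_services` are hypothetical names. `/health` and `/api/health` are the health endpoints documented by InfluxDB and Grafana respectively; probing the Web UI root is an assumption.

```shell
#!/usr/bin/env bash
# Illustrative only: hypothetical helpers, not part of gyanam.sh.

# Print "up" if the URL answers without an HTTP error within 3 s, else "down".
probe() {
  if curl -fsS -o /dev/null --max-time 3 "$1" 2>/dev/null; then
    echo up
  else
    echo down
  fi
}

# Probe the default ports from the table above and print one line per service.
check_services() {
  local url
  for url in \
    http://localhost:8080/ \
    http://localhost:3000/api/health \
    http://localhost:8086/health
  do
    printf '%-40s %s\n' "$url" "$(probe "$url")"
  done
}
```

Calling `check_services` after `./gyanam.sh start` prints each URL with `up` or `down`.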
Open http://localhost:8080, click Add Target, configure BMC connection details, then click Test to verify.
For bulk onboarding, use Export CSV / Import CSV on the targets page.
```shell
./gyanam.sh start                      # Start all services
./gyanam.sh stop                       # Stop all services
./gyanam.sh restart                    # Restart all services
./gyanam.sh status                     # Show status + health checks
./gyanam.sh monitor                    # Check volume usage and disk space
./gyanam.sh logs -f                    # Follow logs (all services)
./gyanam.sh logs collector             # Logs for a specific service
./gyanam.sh build                      # Rebuild the api + collector images

# InfluxDB inspection / export (see docs/DATA_EXPORT_REFERENCE.md)
./gyanam.sh influx-status              # List all buckets + retention
./gyanam.sh influx-list [bucket]       # List measurements in a bucket
./gyanam.sh influx-export [bucket] [out.csv.gz] [start] [stop]

# Downsampling (run once after first start)
./gyanam.sh setup-15m-downsampling     # 15-min aggregates (30d retention)
./gyanam.sh setup-hourly-downsampling  # hourly aggregates (90d retention)
./gyanam.sh setup-all-downsampling     # both, recommended

./gyanam.sh clean                      # Stop AND delete all data (volumes)
./gyanam.sh help                       # Full reference incl. env vars
```

Monitor Docker volume sizes and available disk space to prevent storage issues.
```shell
./gyanam.sh monitor
```

Shows:
- Current size of each Docker volume (influxdb-data, grafana-data, shared-data)
- Available disk space on the Docker partition (color-coded: green <80%, yellow 80-90%, red >90%)
- Total Docker system usage
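The colour bands above can be expressed as a small threshold check. This is a sketch of the stated thresholds only, not the code behind `./gyanam.sh monitor`. `usage_band` and `disk_usage_pct` are hypothetical names, and treating exactly 80% and 90% as yellow is an assumption drawn from the "80-90%" wording.

```shell
#!/usr/bin/env bash
# Illustrative only: hypothetical helpers, not part of gyanam.sh.

# Map an integer disk-usage percentage to the bands described above:
# green below 80, yellow from 80 to 90 inclusive, red above 90.
usage_band() {
  local pct=$1
  if [ "$pct" -lt 80 ]; then
    echo green
  elif [ "$pct" -le 90 ]; then
    echo yellow
  else
    echo red
  fi
}

# Usage percentage (digits only) of the filesystem holding a path.
# Relies on POSIX `df -P`, where column 5 is the capacity, e.g. "42%".
disk_usage_pct() {
  df -P "${1:-/}" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `usage_band "$(disk_usage_pct /var/lib/docker)"` would classify the Docker partition, assuming Docker's data lives at that default path.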
Track volume growth over time and get alerts when disk usage is high:
```shell
# Set up automated monitoring (optional)
crontab -e
```

Add these cron jobs (replace /path/to/gyanam with your install path):
```shell
# Track volume growth every hour
0 * * * * /path/to/gyanam/scripts/log_volume_growth.sh

# Alert when disk usage exceeds 80% (checks every 15 minutes)
*/15 * * * * /path/to/gyanam/scripts/alert_disk_space.sh
```

After running the log script for a while:
```shell
# View recent volume growth
tail -20 docker_volume_growth.log

# Calculate daily growth
grep "influxdb-data" docker_volume_growth.log | tail -24
```

Edit scripts/alert_disk_space.sh to enable notifications:
- Slack webhooks
- Email alerts
- Custom monitoring integrations
Details: See scripts/README.md for complete monitoring setup and troubleshooting.
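As an illustration of the Slack option, the sketch below builds the JSON body that Slack incoming webhooks accept (`{"text": "..."}`) and posts it with `curl`. It does not reproduce scripts/alert_disk_space.sh. `slack_payload`, `notify_slack`, and the `SLACK_WEBHOOK_URL` variable are hypothetical names, and the message is assumed to be a single line.

```shell
#!/usr/bin/env bash
# Illustrative only: hypothetical helpers, not the contents of
# scripts/alert_disk_space.sh.

# Build a {"text": "..."} JSON body, escaping backslashes and double quotes.
# Assumes a single-line message (newlines are not escaped).
slack_payload() {
  local msg=$1
  msg=${msg//\\/\\\\}   # escape backslashes first
  msg=${msg//\"/\\\"}   # then double quotes
  printf '{"text":"%s"}' "$msg"
}

# POST the message to the webhook URL held in SLACK_WEBHOOK_URL
# (an assumed variable name). Fails if the server returns an HTTP error.
notify_slack() {
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}
```

With the webhook URL exported, `notify_slack "Docker partition at 85% on $(hostname)"` would send one alert.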
| Doc | When to read it |
|---|---|
| docs/DEPLOYMENT.md | Setting up Gyanam on Ubuntu, sized for ~300 nodes |
| docs/SCALABILITY.md | Tuning per fleet size + understanding the runtime architecture |
| docs/DATA_EXPORT_REFERENCE.md | Exporting metrics to CSV (gyanam.sh wrapper + native InfluxDB recipes) |
| docs/CODEQL_REPORT.md | Current CodeQL state, accepted-risk audit, re-run instructions |
| LINTING.md | Pre-commit / ruff / mypy / shellcheck setup |
| scripts/README.md | Volume / disk-space monitoring scripts |
