GYANAM — GPU Observability Assistant

Primary goal: support large-scale GPU observability and any debug effort that needs accurate, continuous, out-of-band telemetry from AMD Instinct fleets — without proprietary agents or vendor lock-in.

GYANAM polls GPU servers via the standard Redfish API (with optional SSH-proxy and SSE streaming paths), parses the telemetry into a schema-aware metric model, stores it in a time-series database (InfluxDB by default; Prometheus also supported), and surfaces it through pre-built Grafana dashboards. On-demand diagnostic-log collection, alert subscriptions, and a CSV export pipeline round out the toolset so operators and debug engineers can move from "something is wrong" to a shareable evidence bundle without writing custom code.
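The poll, parse, and store path described above can be illustrated with a minimal Python sketch. It assumes a payload shaped like a DMTF Redfish Sensor resource (Id, Reading, ReadingUnits, ReadingType) and emits one InfluxDB line-protocol record. The function names, the tag names (target, sensor, units), and the field name reading are illustrative assumptions, not GYANAM's actual metric model.

```python
from typing import Any, Mapping, Optional


def _escape(value: str, chars: str) -> str:
    """Backslash-escape each character in `chars` (line-protocol special characters)."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def sensor_to_line(target: str, sensor: Mapping[str, Any], ts_ns: int) -> Optional[str]:
    """Convert one Redfish-style sensor reading into an InfluxDB line-protocol record.

    Returns None when the sensor reports no numeric reading.
    """
    reading = sensor.get("Reading")
    if isinstance(reading, bool) or not isinstance(reading, (int, float)):
        return None
    # Measurement names escape commas and spaces; tag values also escape "=".
    measurement = _escape(str(sensor.get("ReadingType", "Sensor")), ", ")
    tags = {
        "target": target,  # hypothetical tag: which server the reading came from
        "sensor": str(sensor.get("Id", "unknown")),
        "units": str(sensor.get("ReadingUnits", "")),
    }
    tag_str = ",".join(
        f"{key}={_escape(val, ',= ')}" for key, val in sorted(tags.items()) if val
    )
    return f"{measurement},{tag_str} reading={float(reading)} {ts_ns}"


sample = {
    "Id": "GPU_0_Temp",
    "Name": "GPU 0 Temperature",
    "Reading": 61.5,
    "ReadingUnits": "Cel",
    "ReadingType": "Temperature",
}
print(sensor_to_line("node-01", sample, 1700000000000000000))
```

Sorting the tags keeps the output deterministic, and returning None for a missing reading lets a caller skip absent sensors instead of writing placeholder values.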

Problem Statement

Modern large-scale customer AI environments require reliable and continuous access to accurate out-of-band (OOB) telemetry and diagnostic logs to support effective hardware debug, triage, and RMA workflows. Today, debug organisations frequently receive incomplete or stale diagnostic data from the point of failure, significantly limiting root-cause accuracy and slowing RMA Failure Analysis.

GYANAM (GPU Observability Assistant) exists to close that gap. It is an on-premise, customer-deployable assistant that harvests time-series telemetry and critical debug logs from GPU fleets using only industry-standard interfaces: DMTF Redfish, OCP OAM, and other open specifications. Operators and debug teams can collect, inspect, and share comprehensive observability and debug data on their own, which improves Time to Hypothesis (TTH) and Time to Root Cause (TTR). The implementation is fully transparent, which helps build trust in the tool.

What GYANAM does

The GPU Observability Assistant provides:

Continuous large-scale observability — automated, periodic harvesting of OOB telemetry from 250-500+ GPU servers with per-target persistent connections, fire-and-forget scheduling, and tiered retention for long-term analysis.
Debug-effort enablement — on-demand diagnostic log-bundle collection per target, structured alert subscriptions (SSE + webhook fallback), and a robust CSV export pipeline with pre-flight count, chunking, retry, and server-side aggregation for sharing evidence with engineering.
Standards-only data plane — DMTF Redfish + OCP-aligned schemas, no proprietary agents on the host.
On-premise deployment — runs entirely inside the customer's security boundary; nothing leaves their network unless explicitly exported.
Customer-initiated data packaging — operators decide what snapshots to share when filing tickets or RMAs.
Reference implementation for the ecosystem — encourages OEMs and cloud partners to align around open observability standards.

The assistant acts as an enablement layer for both day-to-day GPU observability and ad-hoc debug efforts, improving the fidelity of telemetry delivered with debug tickets and enabling more accurate, data-driven RMA decisions.
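The chunking and retry steps of the CSV export pipeline listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not GYANAM's export code: split_range, with_retry, export_rows, and the fetch_chunk callback are hypothetical names, and the pre-flight count and server-side aggregation steps are not covered.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterator, List, Tuple


def split_range(start: datetime, stop: datetime, chunk: timedelta) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive [lo, hi) windows that together cover start..stop."""
    if chunk <= timedelta(0):
        raise ValueError("chunk must be positive")
    cursor = start
    while cursor < stop:
        nxt = min(cursor + chunk, stop)
        yield cursor, nxt
        cursor = nxt


def with_retry(fn: Callable[[], List[str]], attempts: int = 3,
               base_delay: float = 0.5, sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Call fn, retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    return []  # only reached when attempts < 1


def export_rows(fetch_chunk: Callable[[datetime, datetime], List[str]],
                start: datetime, stop: datetime, chunk: timedelta,
                attempts: int = 3, base_delay: float = 0.5,
                sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch each window independently so one failure only repeats that window."""
    rows: List[str] = []
    for lo, hi in split_range(start, stop, chunk):
        rows.extend(with_retry(lambda lo=lo, hi=hi: fetch_chunk(lo, hi),
                               attempts=attempts, base_delay=base_delay, sleep=sleep))
    return rows
```

Fetching in bounded windows keeps each query small, and retrying per window means a transient failure does not restart the whole export.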

Screenshots

Fleet Heatmap: temperature and power distribution across every GPU in the fleet, at a glance.

Targets: bulk CSV import, per-target test and on-demand log collection, live polling status.

Per-system drill-down: every GPU's compute, memory, interconnect, and power broken out.

Alerts: real-time SSE / webhook subscriptions with severity routing and history.

Diagnostic log bundles: on-demand harvest, download, and share for RMA or debug tickets.

Outlier detection: hot GPUs, high-power consumers, and thermal imbalance, so fleet anomalies surface fast.

The screenshots above expect PNG files under docs/screenshots/.

Architecture

See System Architecture (PDF); the Mermaid source is in docs/architecture.mmd. For runtime scalability, see docs/SCALABILITY.md.

Quick Start

All operations use the gyanam.sh management script.

1. Initialize environment

./gyanam.sh init

Generates .env with random passwords, tokens, and an encryption key. Save the printed credentials; the Grafana and InfluxDB logins come from this output.
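To illustrate the kind of material such an init step produces, here is a minimal Python sketch that generates random secrets and renders them in .env format. The variable names (ADMIN_PASSWORD, INFLUXDB_TOKEN, ENCRYPTION_KEY) and the key lengths are assumptions for illustration; the names and formats that gyanam.sh actually writes are defined by the script and by .env.example.

```python
import base64
import secrets
from typing import Dict


def generate_env() -> Dict[str, str]:
    """Create fresh random secrets using the OS cryptographic random source."""
    return {
        # Hypothetical variable names; check .env.example for the real ones.
        "ADMIN_PASSWORD": secrets.token_urlsafe(24),
        "INFLUXDB_TOKEN": secrets.token_hex(32),
        "ENCRYPTION_KEY": base64.urlsafe_b64encode(secrets.token_bytes(32)).decode("ascii"),
    }


def render_env(values: Dict[str, str]) -> str:
    """Render KEY=value lines, one per secret."""
    return "".join(f"{key}={value}\n" for key, value in values.items())


if __name__ == "__main__":
    print(render_env(generate_env()), end="")
```

The secrets module is used rather than random because it draws from a source intended for credentials and tokens.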

2. Start

./gyanam.sh start

3. Access services

| Service | URL | Default Credentials |
| --- | --- | --- |
| Web UI | http://localhost:8080 | admin / changeme |
| Grafana | http://localhost:3000 | (from init output) |
| InfluxDB | http://localhost:8086 | (from init output) |
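A quick way to confirm the three services are reachable is a small probe script; ./gyanam.sh status remains the supported health check. The sketch below is only an illustration. The health paths are assumptions: /api/health is Grafana's health endpoint and /health is the InfluxDB 2.x one, while the Web UI is simply probed at its root because its health route is not documented here.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Mapping

# Ports come from the table above; the paths are assumptions (see lead-in).
ENDPOINTS: Dict[str, str] = {
    "web-ui": "http://localhost:8080/",
    "grafana": "http://localhost:3000/api/health",
    "influxdb": "http://localhost:8086/health",
}


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with a 2xx status, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        return False


def check_services(endpoints: Mapping[str, str] = ENDPOINTS,
                   fetch: Callable[[str], bool] = http_ok) -> Dict[str, bool]:
    """Probe every endpoint and report reachability per service name."""
    return {name: fetch(url) for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'up' if ok else 'DOWN'}")
```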

4. Add targets

Open http://localhost:8080, click Add Target, configure BMC connection details, then click Test to verify.

For bulk onboarding, use Export CSV / Import CSV on the targets page.
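The import column layout is not documented in this README, so one cautious approach is to add a single target in the UI, use Export CSV, and reuse the exported header when generating rows for the rest of the fleet. The Python sketch below assumes that workflow; the column names in the usage example (name, host, port) are hypothetical placeholders, not GYANAM's real schema.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_header(exported_csv: str) -> List[str]:
    """Return the column names from the first line of an exported targets CSV."""
    return next(csv.reader(io.StringIO(exported_csv)))


def build_import_csv(header: List[str], rows: Iterable[Dict[str, str]]) -> str:
    """Write rows under the exported header, rejecting rows with missing columns."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=header, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = [col for col in header if col not in row]
        if missing:
            raise ValueError(f"row is missing columns: {missing}")
        writer.writerow(row)  # DictWriter itself raises on unknown extra columns
    return buf.getvalue()


# Hypothetical export with placeholder columns.
exported = "name,host,port\nnode-01,10.0.0.1,443\n"
header = read_header(exported)
fleet = [{"name": f"node-{i:02d}", "host": f"10.0.0.{i}", "port": "443"} for i in range(2, 4)]
print(build_import_csv(header, fleet), end="")
```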

Management

./gyanam.sh start                       # Start all services
./gyanam.sh stop                        # Stop all services
./gyanam.sh restart                     # Restart all services
./gyanam.sh status                      # Show status + health checks
./gyanam.sh monitor                     # Check volume usage and disk space
./gyanam.sh logs -f                     # Follow logs (all services)
./gyanam.sh logs collector              # Logs for a specific service
./gyanam.sh build                       # Rebuild the api + collector images

# InfluxDB inspection / export (see docs/DATA_EXPORT_REFERENCE.md)
./gyanam.sh influx-status               # List all buckets + retention
./gyanam.sh influx-list [bucket]        # List measurements in a bucket
./gyanam.sh influx-export [bucket] [out.csv.gz] [start] [stop]

# Downsampling (run once after first start)
./gyanam.sh setup-15m-downsampling      # 15-min aggregates (30d retention)
./gyanam.sh setup-hourly-downsampling   # hourly aggregates (90d retention)
./gyanam.sh setup-all-downsampling      # both, recommended

./gyanam.sh clean                       # Stop AND delete all data (volumes)
./gyanam.sh help                        # Full reference incl. env vars
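The downsampling commands above create 15-minute and hourly aggregates. As a rough illustration of what windowed downsampling does, the Python sketch below buckets raw samples into fixed windows and averages each bucket. It is not the task GYANAM installs in InfluxDB: the use of a mean and the labelling of each window by its start time are assumptions made for the example.

```python
from collections import defaultdict
from statistics import fmean
from typing import DefaultDict, Iterable, List, Tuple


def downsample(samples: Iterable[Tuple[int, float]], window_s: int = 900) -> List[Tuple[int, float]]:
    """Average (unix_seconds, value) samples into fixed windows of window_s seconds.

    Each output tuple is (window_start, mean_of_values_in_window).
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    buckets: DefaultDict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % window_s].append(value)  # align to the window start
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


raw = [(0, 10.0), (60, 20.0), (900, 30.0)]
print(downsample(raw))  # [(0, 15.0), (900, 30.0)]
```

Keeping a short-retention raw bucket plus longer-retention aggregates is what lets long-term dashboards stay fast without storing every raw point.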

Monitoring Volume Growth and Disk Usage

Monitor Docker volume sizes and available disk space to prevent storage issues.

Quick Check

./gyanam.sh monitor

Shows:

Current size of each Docker volume (influxdb-data, grafana-data, shared-data)
Available disk space on the Docker partition (color-coded: green <80%, yellow 80-90%, red >90%)
Total Docker system usage
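The color thresholds listed above can be expressed in a few lines. This Python sketch mirrors the stated bands (green below 80%, yellow from 80% to 90%, red above 90%) using shutil.disk_usage; it is an illustration, not the logic in gyanam.sh, and the percentage may differ slightly from what df reports because of reserved filesystem blocks.

```python
import shutil
from typing import Tuple


def classify(used_percent: float) -> str:
    """Map a disk-usage percentage to the color bands used by the monitor output."""
    if used_percent < 80:
        return "green"
    if used_percent <= 90:
        return "yellow"
    return "red"


def disk_status(path: str = "/") -> Tuple[float, str]:
    """Return (used_percent, color) for the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    return used_percent, classify(used_percent)


if __name__ == "__main__":
    pct, color = disk_status("/")
    print(f"{pct:.1f}% used -> {color}")
```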

Automated Tracking

Track volume growth over time and get alerts when disk usage is high:

# Set up automated monitoring (optional)
crontab -e

Add these cron jobs (replace /path/to/gyanam with your install path):

# Track volume growth every hour
0 * * * * /path/to/gyanam/scripts/log_volume_growth.sh

# Alert when disk usage exceeds 80% (checks every 15 minutes)
*/15 * * * * /path/to/gyanam/scripts/alert_disk_space.sh

View Growth History

After running the log script for a while:

# View recent volume growth
tail -20 docker_volume_growth.log

# Show the last 24 hourly samples for influxdb-data (about one day of growth)
grep "influxdb-data" docker_volume_growth.log | tail -24
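To turn the logged samples into an actual growth rate, a small script can compare the oldest and newest entries for one volume. The log format is not documented in this README, so the Python sketch below assumes a hypothetical layout of three whitespace-separated fields per line (ISO timestamp, volume name, size in bytes); adjust the parsing to match what scripts/log_volume_growth.sh really writes.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


def daily_growth_bytes(lines: Iterable[str], volume: str) -> Optional[float]:
    """Estimate average bytes/day for one volume from first and last log samples.

    Assumed (hypothetical) line format: "<ISO timestamp> <volume> <size_bytes>".
    Returns None when fewer than two usable samples exist.
    """
    points: List[Tuple[datetime, int]] = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or parts[1] != volume:
            continue
        try:
            points.append((datetime.fromisoformat(parts[0]), int(parts[2])))
        except ValueError:
            continue  # skip malformed lines rather than fail the whole report
    if len(points) < 2:
        return None
    points.sort()
    (t0, size0), (t1, size1) = points[0], points[-1]
    days = (t1 - t0).total_seconds() / 86400
    return (size1 - size0) / days if days > 0 else None


log = [
    "2024-05-01T00:00:00 influxdb-data 1000",
    "2024-05-02T00:00:00 influxdb-data 3000",
    "2024-05-02T00:00:00 grafana-data 5",
]
print(daily_growth_bytes(log, "influxdb-data"))  # 2000.0
```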

Configure Alerts

Edit scripts/alert_disk_space.sh to enable notifications:

Slack webhooks
Email alerts
Custom monitoring integrations
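For the Slack option, an incoming webhook accepts a JSON body with a text field sent by HTTP POST. The alert script itself is a shell script, so the Python sketch below only illustrates the request shape; the function names, message wording, and the placeholder webhook URL are assumptions.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, host: str, used_percent: float) -> urllib.request.Request:
    """Build (but do not send) a Slack incoming-webhook POST for a disk alert."""
    payload = {"text": f":warning: Disk usage on {host} is at {used_percent:.0f}%"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# Placeholder URL; a real webhook URL is a secret and belongs in .env, not in code.
req = build_slack_request("https://hooks.slack.com/services/EXAMPLE", "gyanam-01", 83.4)
print(req.get_method(), req.data.decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect or log before any network call is made.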

Details: See scripts/README.md for complete monitoring setup and troubleshooting.

Documentation

| Doc | When to read it |
| --- | --- |
| `docs/DEPLOYMENT.md` | Setting up GYANAM on Ubuntu, sized for ~300 nodes |
| `docs/SCALABILITY.md` | Tuning per fleet size and understanding the runtime architecture |
| `docs/DATA_EXPORT_REFERENCE.md` | Exporting metrics to CSV (gyanam.sh wrapper and native InfluxDB recipes) |
| `docs/CODEQL_REPORT.md` | Current CodeQL state, accepted-risk audit, re-run instructions |
| `LINTING.md` | Pre-commit / ruff / mypy / shellcheck setup |
| `scripts/README.md` | Volume / disk-space monitoring scripts |

