Add torchrun cluster heartbeat support and update documentation #51

Merged: reiase merged 2 commits into `master` from `feature/new_cluster_report` on Jun 27, 2026
reiase (Contributor) commented Jun 27, 2026
  • Added a new `torchrun-cluster.md` file detailing the hierarchical cluster heartbeat mechanism for multi-process `torchrun` jobs.
  • Updated existing documentation to reference the new heartbeat feature, including changes in `distributed.md`, `index.md`, and their Chinese counterparts.
  • Introduced environment variable references for configuring the cluster heartbeat in `env-vars.md`.
  • Added a demo script `cluster_multinode_demo.py` to showcase the new functionality.
  • Updated `Cargo.lock` to include the new `probing-store` dependency for enhanced functionality.
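
For readers unfamiliar with the idea, here is a minimal sketch of what a two-level ("hierarchical") heartbeat layout could look like for a `torchrun` job. It is an illustration under stated assumptions, not the design described in `torchrun-cluster.md`: the `HeartbeatRole` type, `resolve_role`, `report_target`, and the leader-election rule (local rank 0 leads its node, global rank 0 leads the cluster) are all hypothetical. Only the `RANK`, `LOCAL_RANK`, and `GROUP_RANK` environment variables come from `torchrun` itself.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class HeartbeatRole:
    """Hypothetical role of one process in a two-level heartbeat tree."""
    rank: int                # global rank, from torchrun's RANK
    local_rank: int          # rank within the node, from LOCAL_RANK
    node_rank: int           # rank of the node, from GROUP_RANK
    is_node_leader: bool     # assumed: collects heartbeats from its own node
    is_cluster_leader: bool  # assumed: collects heartbeats from node leaders


def resolve_role(env=None):
    """Derive a role from the environment variables that torchrun sets."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    local_rank = int(env.get("LOCAL_RANK", "0"))
    node_rank = int(env.get("GROUP_RANK", "0"))
    return HeartbeatRole(
        rank=rank,
        local_rank=local_rank,
        node_rank=node_rank,
        is_node_leader=(local_rank == 0),
        is_cluster_leader=(rank == 0),
    )


def report_target(role):
    """Return a label for where this process would send its heartbeat."""
    if role.is_cluster_leader:
        return "self"
    if role.is_node_leader:
        return "cluster-leader"
    return f"node-leader:{role.node_rank}"


if __name__ == "__main__":
    # Simulate rank 5 on the second node of a 2-node, 4-process-per-node job.
    worker = resolve_role({"RANK": "5", "LOCAL_RANK": "1", "GROUP_RANK": "1"})
    print(report_target(worker))
```

Under this assumed layout, each node produces one aggregated report instead of one per process, which is the usual motivation for a hierarchy. The actual mechanism and its configuration are documented in `torchrun-cluster.md` and `env-vars.md`.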

reiase added 2 commits June 27, 2026 20:29
- Added a new `torchrun-cluster.md` file detailing the hierarchical cluster heartbeat mechanism for multi-process `torchrun` jobs.
- Updated existing documentation to reference the new heartbeat feature, including changes in `distributed.md`, `index.md`, and their Chinese counterparts.
- Introduced environment variable references for configuring the cluster heartbeat in `env-vars.md`.
- Added a demo script `cluster_multinode_demo.py` to showcase the new functionality.
- Updated `Cargo.lock` to include the new `probing-store` dependency for enhanced functionality.
…e navigation

- Added a new entry for "Torchrun Cluster Heartbeat" in the plugins section of `mkdocs.yml`.
- Updated the navigation in `mkdocs.yml` to include a link to the new `torchrun-cluster.md` design document.
- Revised links in `quickstart.md` and `quickstart.zh.md` to point to the correct paths for SQL Analytics and Distributed documentation.
reiase merged commit 6556900 into `master` on Jun 27, 2026
10 checks passed

codecov (bot) commented Jun 27, 2026

Codecov Report

❌ Patch coverage is 71.18644% with 34 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/lib.rs | 0.00% | 33 Missing ⚠️ |
| tests/regression/inspect/test_trace_helpers.py | 94.11% | 1 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| tests/regression/core/test_torchrun_cluster.py | 100.00% <100.00%> (+2.12%) ⬆️ |
| tests/unit/probing/skills/test_loader.py | 100.00% <100.00%> (ø) |
| tests/unit/probing/test_torchrun_cluster_report.py | 100.00% <100.00%> (ø) |
| tests/regression/inspect/test_trace_helpers.py | 97.64% <94.11%> (-0.40%) ⬇️ |
| src/lib.rs | 0.00% <0.00%> (ø) |
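
The patch coverage figure can be cross-checked against the per-file counts above. The sketch below assumes patch coverage is simply covered changed lines divided by all changed lines. The helper names are illustrative, and the total of 118 changed lines is inferred from the reported percentage rather than stated in the report.

```python
def patch_coverage(covered: int, missing: int) -> float:
    """Percentage of changed lines that are covered by tests."""
    total = covered + missing
    if total == 0:
        raise ValueError("patch contains no measurable lines")
    return 100.0 * covered / total


def infer_total_lines(missing: int, coverage_percent: float) -> int:
    """Back out the number of changed lines from the missing count and percentage."""
    uncovered_fraction = 1.0 - coverage_percent / 100.0
    return round(missing / uncovered_fraction)


# Missing-line counts taken from the report's first table.
missing_by_file = {
    "src/lib.rs": 33,
    "tests/regression/inspect/test_trace_helpers.py": 1,
}

missing = sum(missing_by_file.values())      # 34, matching the headline
total = infer_total_lines(missing, 71.18644)  # inferred, not reported
covered = total - missing

print(missing, total, covered)
print(round(patch_coverage(covered, missing), 5))
```

Nearly all of the gap comes from `src/lib.rs`, whose 33 changed lines have no coverage.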