Docs: cluster training by eugenevinitsky · Pull Request #433 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-05-20T12:07:08Z

TODO

Add two new docs and link them from README.md: - docs/cluster_training.md — sbatch template (incl. GPU heartbeat, required for jobs > ~2h or the cluster's idle-reclaimer will scancel them), CPU rebuild path (cpu_short partition; no GPU needed for build_ext), account/partition strategy (QOSGrpGRES vs QOSMaxGRESPerUser, _tandon_priority vs _general tiers, when to race partitions), replay-mode memory sizing (vec.num_envs lever, [eval.*] suite cost at first eval), and submit_cluster.py failure modes (login-node python lacks pip; submitit's srun launcher inherits the in-container venv python path). - docs/mining.md — mine_failures workflow, the score_threshold default-captures-nothing gotcha (docstring is misleading; -inf means no replay is saved), the pufferlib.vector Multiprocessing CUDA-after-fork hang and --vec.backend Serial workaround, the shape-mismatch gotcha when load_model_path checkpoints have non-default policy.* / rnn.* dimensions (mine_failures doesn't do the sibling config.yaml auto-merge that train() does), and the on-cluster sbatch pattern. .gitignore: un-ignore docs/ (was a blanket rule that prevented checking in markdown docs alongside the existing eval_unification.md spec); add failure_mining/ which is large local-only mining output. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copilot

Pull request overview

Adds operational documentation for running cluster training and failure mining workflows (Greene/SLURM + Singularity), and links these guides from the main README while updating .gitignore to track docs/ and ignore large mining artifacts.

Changes:

Add docs/cluster_training.md with SLURM/Singularity templates, GPU heartbeat guidance, memory sizing notes, and known submit_cluster.py failure modes.
Add docs/mining.md describing the mine_failures workflow, common pitfalls, and a cluster sbatch pattern.
Update README.md with pointers to the new docs; update .gitignore to stop ignoring docs/ and to ignore failure_mining/ outputs.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File	Description
README.md	Adds links to the new operational docs from existing cluster install / mining sections.
docs/mining.md	New end-to-end mining workflow guide (CLI usage, threshold semantics, backend workaround, cluster template).
docs/cluster_training.md	New operational cluster training guide (container model, sbatch template + heartbeat, resource sizing, submit tooling notes).
.gitignore	Ensures `docs/` content is tracked and ignores large local mining outputs (`failure_mining/`).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+# Failure mining workflow
+
+How to roll a trained policy out, capture compact replays, and produce a
+browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`


+## CPU rebuild path
+
+GPU partitions are routinely saturated by training jobs of this same project.
+`setup_container.sh rebuild` doesn't actually need a GPU — it just runs
+`python setup.py build_ext --inplace --force` plus a smoke import. Submit to a
+CPU partition for fast turnaround:
+
+```bash
+sbatch --account=<general-account> --partition=cpu_short \
+    --cpus-per-task=8 --mem=16gb --time=20 \
+    --chdir=$PWD \
+    -o /scratch/$USER/rebuild_logs/rebuild_%j.log \
+    --wrap "./scripts/setup_container.sh rebuild"
+```
+
+`--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s.
+


Two coupled changes that make submit_cluster.py work end-to-end on clusters where the system python differs across login and compute nodes (Greene: login 3.12, compute 3.9) and the only consistent python lives inside the singularity overlay. 1. Submission side: after constructing AutoExecutor, when --container is set, override executor._executor.python so submitit's outer launcher is invoked as singularity exec --nv --overlay :ro $IMAGE $VENV/bin/python That makes the compute-side srun command resolve the launcher python inside the container (where the venv's symlink to /ext3/miniforge3/bin/python3 is valid) instead of needing a cross-node-consistent system python with submitit installed. 2. launch_training side: when the function lands on the compute node it's now already inside singularity (from (1)), so skip the second singularity exec wrap and just run inner_cmd via bash -c. The /.singularity.d/Singularity marker file distinguishes the cases so direct-sbatch callers (not yet inside a container) still get the wrap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Drop 'we figured out X' framing per code review feedback. The submit_cluster.py path now works end-to-end (after the patch in the previous commit that wraps submitit's outer launcher in singularity exec), so the docs describe that as the canonical flow rather than as a workaround. Direct-sbatch is no longer documented as a fallback — submit_cluster.py is the single path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Replace the cryptic one-line 'you may want to set' with a self-contained explanation: what the env var does (per-arch fat binary), why it matters on a heterogeneous cluster ('no kernel image' on the wrong GPU), what the recommended value covers (A100/L40S/H100/H200), and when it actually matters in practice (interactive builds outside setup_container.sh rebuild, which already exports it). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

The previous CPU rebuild section just said 'doesn't need a GPU' without explaining the apparent contradiction (we're compiling CUDA, no?). Spell out that nvcc is a cross-compiler: it emits PTX/SASS for the target arches in TORCH_CUDA_ARCH_LIST without needing matching hardware, and the CUDA toolkit lives in the singularity image so any node that can mount the image can run the build. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler detour. Readers who want the why can find TORCH_CUDA_ARCH_LIST above. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

This PR is just cluster training. The mining workflow doc and its README pointer move to a separate PR so the two can be reviewed and landed independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

cluster_training.md: trim trailing whitespace on two lines and add a final newline (pre-commit hooks were rejecting the diff). .gitignore: drop the sphinx-themed entries. This repo doesn't use sphinx (no docs/conf.py, no Makefile, no mkdocs.yml — docs are just plain Markdown), so 'docs/_build/' and the misleading 'Generated docs (sphinx build output only)' comment were vestigial cookiecutter boilerplate. The earlier '# docs/' line is gone with them. README.md: short pointer paragraph added to the existing HPC section linking to docs/cluster_training.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

11 lines → 6. Kept the what (wrap launcher in singularity exec) + why (sys.executable is either version-mismatched host python or a dangling venv symlink) + pointer to the matched check in launch_training. Dropped the enumerated detail; future readers who need it can trace the code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Previous comment conflated two scenarios — the version-mismatched host python (real, what we hit) and the venv-symlink-dangling case (a hypothetical alternative setup we don't use). Just describe the real problem: login-node /usr/bin/python3 is 3.12, compute-node is 3.9, so ~/.local installs for 3.12 are invisible on the compute side. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Yvonne511 · 2026-05-20T16:24:05Z

This is a small bug inside setup_container.sh: mv "${TEMPLATE_NAME%.gz}" overlay.ext3 should be removed
I don't think this is necessary, at least for submitit. I did not need to install this separately. I have run into the "two python installation are being used here" issue, but if I just do "setup_container.sh install," it works fine. So I wonder if the 2 python issue is introduced else where.
python3 -m ensurepip --user python3 -m pip install --user submitit pyyaml cloudpickle
2.1 I do understand python3 scripts/submit_cluster.py does not work in login node without python3 -m ensurepip --user and etc. But you could source /scratch/yw4142/PufferDrive4/venv/bin/activate to solve it. Meaning, the change in submit_cluster.py is also not needed. But open to disagreement.

Per reviewer feedback (yw4142): 1. setup_container.sh create_overlay: remove the stray mv "${TEMPLATE_NAME%.gz}" overlay.ext3 — it renamed the just- extracted file to a name that doesn't match what OVERLAY_PATH defaults to, so every later step looked for the overlay at the wrong path. Drop the line; overlay stays at its original name. 2. submit_cluster.py: revert the executor._executor.python override and the /.singularity.d/Singularity sentinel check in launch_training. They were working around 'submitit not on the compute-side python', but if you source the project venv on the login node, sys.executable becomes the venv python (same path the compute node will run) and submitit's serialization round- trips without needing any wrap. Back to the original pattern: launcher python is the venv python, launch_training wraps the inner train command in singularity exec for CUDA libs. 3. cluster_training.md: replace the pip install --user submitit pyyaml cloudpickle bootstrap with 'source the venv'. setup_container.sh install already lands submitit in the venv via the project's pyproject.toml, and sourcing the venv on login makes it importable. The 'two python installations' section comes out entirely — there's just one python (the venv's). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

riccardosavorgnan · 2026-05-20T16:48:34Z

+    --prefix mytrain \
+    --compute_config scripts/cluster_configs/nyu_greene.yaml \
+    --program_config scripts/cluster_configs/train_base.yaml \
+    --account <acct> --partition <gpu-partition> --time 2880 \


Let's add a comment on where to get the suggested account and gpu partition for torch?

…rlay Move the base python from inside the overlay (/ext3/miniforge3/) onto /scratch (/scratch/$USER/miniforge3/). The venv's bin/python symlink now points at /scratch — a path that resolves on every node, inside or outside singularity, so 'source venv/activate && python ...' works on the login node without needing to enter the container. Changes: - New MINIFORGE3_DIR variable (default /scratch/$USER/miniforge3) and ensure_miniforge3() helper that runs the conda-forge installer (self-contained shell script; no root, no singularity). - CONTAINER_PYTHON default now points at the /scratch miniforge3. - The 'install' dispatch runs ensure_miniforge3 outside the container first, then enters singularity (read-only overlay) for the uv + pip + build_ext steps that need nvcc. - run_in_container_writable + --fakeroot are gone; nothing in the python flow writes to the overlay anymore. The overlay stays mounted :ro for rare system-tool installs but isn't on any write path. For existing users: re-run setup_container.sh install — it'll detect miniforge3 missing on /scratch, install it, recreate the venv against the new base, and reinstall the project. The overlay's old miniforge3 becomes stale but harmless; you can rm it from the overlay later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Existing installs have the venv's bin/python pointing at /ext3/miniforge3 from before we moved miniforge3 to /scratch. Re-running install would otherwise reuse the broken-symlink venv and skip recreating it. Detect the broken symlink (bash -e $VENV/bin/python) and rm -rf before recreating against $CONTAINER_PYTHON. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

The previous check (-e $VENV/bin/python) was wrong: install runs inside singularity, where the overlay is mounted, so the OLD venv's symlink into /ext3/miniforge3 IS valid — it just points at the wrong place relative to where MINIFORGE3_DIR now is. Use readlink -f to walk the chain and verify the resolved path is under $MINIFORGE3_DIR; rebuild if not. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

miniforge3 25.x ships Python 3.13, but torch's cu121 wheels are cp39..cp312 only — no cp313 — so the install fails with 'no solution found when resolving dependencies'. Pin the installer URL to 24.11.3-2 (Python 3.12) and add a version check in ensure_miniforge3 so an existing miniforge3 with the wrong python gets reinstalled. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Putting back the docs/_build/ line I dropped earlier alongside the blanket 'docs/' ignore. The blanket ignore was actively wrong (hid our checked-in markdown), but docs/_build/ is just the standard sphinx build output dir — costs nothing to keep, and protects future contributors from accidentally committing 'make html' output if sphinx is ever added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting to its own PR so the two are reviewable independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 20, 2026 12:07

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

Copilot started reviewing on behalf of eugenevinitsky May 20, 2026 12:07 View session

Copilot AI reviewed May 20, 2026

View reviewed changes

eugenevinitsky changed the title ~~Docs: cluster training + mining operational guides~~ (WIP: do not review or read) Docs: cluster training + mining operational guides May 20, 2026

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

docs: trim CPU rebuild section — drop the cross-compiler explanation

33d560e

Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler detour. Readers who want the why can find TORCH_CUDA_ARCH_LIST above. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

docs: drop mining doc from this PR (moved to a separate PR)

f864262

This PR is just cluster training. The mining workflow doc and its README pointer move to a separate PR so the two can be reviewed and landed independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

docs: cluster_training tweaks (TL;DR rewrite, formatting)

b793e60

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky mentioned this pull request May 20, 2026

(WIP: do not review or read) Docs: failure mining operational guide #434

Open

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

eugenevinitsky requested review from Yvonne511 and riccardosavorgnan May 20, 2026 16:05

eugenevinitsky changed the title ~~(WIP: do not review or read) Docs: cluster training + mining operational guides~~ Docs: cluster training May 20, 2026

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

riccardosavorgnan reviewed May 20, 2026

View reviewed changes

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

Eugene Vinitsky and others added 2 commits May 20, 2026 17:41

gitignore: drop failure_mining/ from this PR (moved to a separate PR)

e36b5d3

The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting to its own PR so the two are reviewable independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Conversation

eugenevinitsky commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Yvonne511 commented May 20, 2026

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

riccardosavorgnan May 20, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

eugenevinitsky commented May 20, 2026 •

edited

Loading