Docs: cluster training#433
Conversation
Add two new docs and link them from README.md: - docs/cluster_training.md — sbatch template (incl. GPU heartbeat, required for jobs > ~2h or the cluster's idle-reclaimer will scancel them), CPU rebuild path (cpu_short partition; no GPU needed for build_ext), account/partition strategy (QOSGrpGRES vs QOSMaxGRESPerUser, _tandon_priority vs _general tiers, when to race partitions), replay-mode memory sizing (vec.num_envs lever, [eval.*] suite cost at first eval), and submit_cluster.py failure modes (login-node python lacks pip; submitit's srun launcher inherits the in-container venv python path). - docs/mining.md — mine_failures workflow, the score_threshold default-captures-nothing gotcha (docstring is misleading; -inf means no replay is saved), the pufferlib.vector Multiprocessing CUDA-after-fork hang and --vec.backend Serial workaround, the shape-mismatch gotcha when load_model_path checkpoints have non-default policy.* / rnn.* dimensions (mine_failures doesn't do the sibling config.yaml auto-merge that train() does), and the on-cluster sbatch pattern. .gitignore: un-ignore docs/ (was a blanket rule that prevented checking in markdown docs alongside the existing eval_unification.md spec); add failure_mining/ which is large local-only mining output. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Pull request overview
Adds operational documentation for running cluster training and failure mining workflows (Greene/SLURM + Singularity), and links these guides from the main README while updating .gitignore to track docs/ and ignore large mining artifacts.
Changes:
- Add
docs/cluster_training.mdwith SLURM/Singularity templates, GPU heartbeat guidance, memory sizing notes, and knownsubmit_cluster.pyfailure modes. - Add
docs/mining.mddescribing themine_failuresworkflow, common pitfalls, and a cluster sbatch pattern. - Update
README.mdwith pointers to the new docs; update.gitignoreto stop ignoringdocs/and to ignorefailure_mining/outputs.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| README.md | Adds links to the new operational docs from existing cluster install / mining sections. |
| docs/mining.md | New end-to-end mining workflow guide (CLI usage, threshold semantics, backend workaround, cluster template). |
| docs/cluster_training.md | New operational cluster training guide (container model, sbatch template + heartbeat, resource sizing, submit tooling notes). |
| .gitignore | Ensures docs/ content is tracked and ignores large local mining outputs (failure_mining/). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Failure mining workflow | ||
|
|
||
| How to roll a trained policy out, capture compact replays, and produce a | ||
| browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures` |
| ## CPU rebuild path | ||
|
|
||
| GPU partitions are routinely saturated by training jobs of this same project. | ||
| `setup_container.sh rebuild` doesn't actually need a GPU — it just runs | ||
| `python setup.py build_ext --inplace --force` plus a smoke import. Submit to a | ||
| CPU partition for fast turnaround: | ||
|
|
||
| ```bash | ||
| sbatch --account=<general-account> --partition=cpu_short \ | ||
| --cpus-per-task=8 --mem=16gb --time=20 \ | ||
| --chdir=$PWD \ | ||
| -o /scratch/$USER/rebuild_logs/rebuild_%j.log \ | ||
| --wrap "./scripts/setup_container.sh rebuild" | ||
| ``` | ||
|
|
||
| `--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s. | ||
|
|
Two coupled changes that make submit_cluster.py work end-to-end on
clusters where the system python differs across login and compute
nodes (Greene: login 3.12, compute 3.9) and the only consistent
python lives inside the singularity overlay.
1. Submission side: after constructing AutoExecutor, when --container
is set, override executor._executor.python so submitit's outer
launcher is invoked as
singularity exec --nv --overlay :ro $IMAGE $VENV/bin/python
That makes the compute-side srun command resolve the launcher
python inside the container (where the venv's symlink to
/ext3/miniforge3/bin/python3 is valid) instead of needing a
cross-node-consistent system python with submitit installed.
2. launch_training side: when the function lands on the compute node
it's now already inside singularity (from (1)), so skip the second
singularity exec wrap and just run inner_cmd via bash -c. The
/.singularity.d/Singularity marker file distinguishes the cases
so direct-sbatch callers (not yet inside a container) still get
the wrap.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Drop 'we figured out X' framing per code review feedback. The submit_cluster.py path now works end-to-end (after the patch in the previous commit that wraps submitit's outer launcher in singularity exec), so the docs describe that as the canonical flow rather than as a workaround. Direct-sbatch is no longer documented as a fallback — submit_cluster.py is the single path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Replace the cryptic one-line 'you may want to set' with a self-contained
explanation: what the env var does (per-arch fat binary), why it matters
on a heterogeneous cluster ('no kernel image' on the wrong GPU), what
the recommended value covers (A100/L40S/H100/H200), and when it actually
matters in practice (interactive builds outside setup_container.sh
rebuild, which already exports it).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
The previous CPU rebuild section just said 'doesn't need a GPU' without explaining the apparent contradiction (we're compiling CUDA, no?). Spell out that nvcc is a cross-compiler: it emits PTX/SASS for the target arches in TORCH_CUDA_ARCH_LIST without needing matching hardware, and the CUDA toolkit lives in the singularity image so any node that can mount the image can run the build. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler detour. Readers who want the why can find TORCH_CUDA_ARCH_LIST above. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
This PR is just cluster training. The mining workflow doc and its README pointer move to a separate PR so the two can be reviewed and landed independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
cluster_training.md: trim trailing whitespace on two lines and add a final newline (pre-commit hooks were rejecting the diff). .gitignore: drop the sphinx-themed entries. This repo doesn't use sphinx (no docs/conf.py, no Makefile, no mkdocs.yml — docs are just plain Markdown), so 'docs/_build/' and the misleading 'Generated docs (sphinx build output only)' comment were vestigial cookiecutter boilerplate. The earlier '# docs/' line is gone with them. README.md: short pointer paragraph added to the existing HPC section linking to docs/cluster_training.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
11 lines → 6. Kept the what (wrap launcher in singularity exec) + why (sys.executable is either version-mismatched host python or a dangling venv symlink) + pointer to the matched check in launch_training. Dropped the enumerated detail; future readers who need it can trace the code. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Previous comment conflated two scenarios — the version-mismatched host python (real, what we hit) and the venv-symlink-dangling case (a hypothetical alternative setup we don't use). Just describe the real problem: login-node /usr/bin/python3 is 3.12, compute-node is 3.9, so ~/.local installs for 3.12 are invisible on the compute side. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
Per reviewer feedback (yw4142):
1. setup_container.sh create_overlay: remove the stray
mv "${TEMPLATE_NAME%.gz}" overlay.ext3 — it renamed the just-
extracted file to a name that doesn't match what OVERLAY_PATH
defaults to, so every later step looked for the overlay at the
wrong path. Drop the line; overlay stays at its original name.
2. submit_cluster.py: revert the executor._executor.python override
and the /.singularity.d/Singularity sentinel check in
launch_training. They were working around 'submitit not on the
compute-side python', but if you source the project venv on the
login node, sys.executable becomes the venv python (same path
the compute node will run) and submitit's serialization round-
trips without needing any wrap. Back to the original pattern:
launcher python is the venv python, launch_training wraps the
inner train command in singularity exec for CUDA libs.
3. cluster_training.md: replace the pip install --user submitit
pyyaml cloudpickle bootstrap with 'source the venv'. setup_container.sh
install already lands submitit in the venv via the project's
pyproject.toml, and sourcing the venv on login makes it
importable. The 'two python installations' section comes out
entirely — there's just one python (the venv's).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
| --prefix mytrain \ | ||
| --compute_config scripts/cluster_configs/nyu_greene.yaml \ | ||
| --program_config scripts/cluster_configs/train_base.yaml \ | ||
| --account <acct> --partition <gpu-partition> --time 2880 \ |
There was a problem hiding this comment.
Let's add a comment on where to get the suggested account and gpu partition for torch?
…rlay Move the base python from inside the overlay (/ext3/miniforge3/) onto /scratch (/scratch/$USER/miniforge3/). The venv's bin/python symlink now points at /scratch — a path that resolves on every node, inside or outside singularity, so 'source venv/activate && python ...' works on the login node without needing to enter the container. Changes: - New MINIFORGE3_DIR variable (default /scratch/$USER/miniforge3) and ensure_miniforge3() helper that runs the conda-forge installer (self-contained shell script; no root, no singularity). - CONTAINER_PYTHON default now points at the /scratch miniforge3. - The 'install' dispatch runs ensure_miniforge3 outside the container first, then enters singularity (read-only overlay) for the uv + pip + build_ext steps that need nvcc. - run_in_container_writable + --fakeroot are gone; nothing in the python flow writes to the overlay anymore. The overlay stays mounted :ro for rare system-tool installs but isn't on any write path. For existing users: re-run setup_container.sh install — it'll detect miniforge3 missing on /scratch, install it, recreate the venv against the new base, and reinstall the project. The overlay's old miniforge3 becomes stale but harmless; you can rm it from the overlay later. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Existing installs have the venv's bin/python pointing at /ext3/miniforge3 from before we moved miniforge3 to /scratch. Re-running install would otherwise reuse the broken-symlink venv and skip recreating it. Detect the broken symlink (bash -e $VENV/bin/python) and rm -rf before recreating against $CONTAINER_PYTHON. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
The previous check (-e $VENV/bin/python) was wrong: install runs inside singularity, where the overlay is mounted, so the OLD venv's symlink into /ext3/miniforge3 IS valid — it just points at the wrong place relative to where MINIFORGE3_DIR now is. Use readlink -f to walk the chain and verify the resolved path is under $MINIFORGE3_DIR; rebuild if not. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
miniforge3 25.x ships Python 3.13, but torch's cu121 wheels are cp39..cp312 only — no cp313 — so the install fails with 'no solution found when resolving dependencies'. Pin the installer URL to 24.11.3-2 (Python 3.12) and add a version check in ensure_miniforge3 so an existing miniforge3 with the wrong python gets reinstalled. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
Putting back the docs/_build/ line I dropped earlier alongside the blanket 'docs/' ignore. The blanket ignore was actively wrong (hid our checked-in markdown), but docs/_build/ is just the standard sphinx build output dir — costs nothing to keep, and protects future contributors from accidentally committing 'make html' output if sphinx is ever added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting to its own PR so the two are reviewable independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
TODO