Skip to content

Docs: cluster training#433

Open
eugenevinitsky wants to merge 18 commits into
emerge/temp_trainingfrom
ev/improve_docs
Open

Docs: cluster training#433
eugenevinitsky wants to merge 18 commits into
emerge/temp_trainingfrom
ev/improve_docs

Conversation

@eugenevinitsky
Copy link
Copy Markdown

@eugenevinitsky eugenevinitsky commented May 20, 2026

TODO

Add two new docs and link them from README.md:

- docs/cluster_training.md — sbatch template (incl. GPU heartbeat,
  required for jobs > ~2h or the cluster's idle-reclaimer will
  scancel them), CPU rebuild path (cpu_short partition; no GPU
  needed for build_ext), account/partition strategy (QOSGrpGRES vs
  QOSMaxGRESPerUser, _tandon_priority vs _general tiers, when to
  race partitions), replay-mode memory sizing (vec.num_envs lever,
  [eval.*] suite cost at first eval), and submit_cluster.py
  failure modes (login-node python lacks pip; submitit's srun
  launcher inherits the in-container venv python path).

- docs/mining.md — mine_failures workflow, the score_threshold
  default-captures-nothing gotcha (docstring is misleading; -inf
  means no replay is saved), the pufferlib.vector Multiprocessing
  CUDA-after-fork hang and --vec.backend Serial workaround, the
  shape-mismatch gotcha when load_model_path checkpoints have
  non-default policy.* / rnn.* dimensions (mine_failures doesn't
  do the sibling config.yaml auto-merge that train() does), and
  the on-cluster sbatch pattern.

.gitignore: un-ignore docs/ (was a blanket rule that prevented
checking in markdown docs alongside the existing eval_unification.md
spec); add failure_mining/ which is large local-only mining output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 20, 2026 12:07
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds operational documentation for running cluster training and failure mining workflows (Greene/SLURM + Singularity), and links these guides from the main README while updating .gitignore to track docs/ and ignore large mining artifacts.

Changes:

  • Add docs/cluster_training.md with SLURM/Singularity templates, GPU heartbeat guidance, memory sizing notes, and known submit_cluster.py failure modes.
  • Add docs/mining.md describing the mine_failures workflow, common pitfalls, and a cluster sbatch pattern.
  • Update README.md with pointers to the new docs; update .gitignore to stop ignoring docs/ and to ignore failure_mining/ outputs.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File Description
README.md Adds links to the new operational docs from existing cluster install / mining sections.
docs/mining.md New end-to-end mining workflow guide (CLI usage, threshold semantics, backend workaround, cluster template).
docs/cluster_training.md New operational cluster training guide (container model, sbatch template + heartbeat, resource sizing, submit tooling notes).
.gitignore Ensures docs/ content is tracked and ignores large local mining outputs (failure_mining/).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs/mining.md Outdated
# Failure mining workflow

How to roll a trained policy out, capture compact replays, and produce a
browser-viewable HTML index of episodes. Pairs with `pufferl.mine_failures`
Comment thread docs/cluster_training.md
Comment on lines +112 to +128
## CPU rebuild path

GPU partitions are routinely saturated by training jobs of this same project.
`setup_container.sh rebuild` doesn't actually need a GPU — it just runs
`python setup.py build_ext --inplace --force` plus a smoke import. Submit to a
CPU partition for fast turnaround:

```bash
sbatch --account=<general-account> --partition=cpu_short \
--cpus-per-task=8 --mem=16gb --time=20 \
--chdir=$PWD \
-o /scratch/$USER/rebuild_logs/rebuild_%j.log \
--wrap "./scripts/setup_container.sh rebuild"
```

`--chdir=$PWD` is required because the script uses `./scripts/`. Takes ~40s.

@eugenevinitsky eugenevinitsky changed the title Docs: cluster training + mining operational guides (WIP: do not review or read) Docs: cluster training + mining operational guides May 20, 2026
Two coupled changes that make submit_cluster.py work end-to-end on
clusters where the system python differs across login and compute
nodes (Greene: login 3.12, compute 3.9) and the only consistent
python lives inside the singularity overlay.

1. Submission side: after constructing AutoExecutor, when --container
   is set, override executor._executor.python so submitit's outer
   launcher is invoked as
       singularity exec --nv --overlay :ro $IMAGE $VENV/bin/python
   That makes the compute-side srun command resolve the launcher
   python inside the container (where the venv's symlink to
   /ext3/miniforge3/bin/python3 is valid) instead of needing a
   cross-node-consistent system python with submitit installed.

2. launch_training side: when the function lands on the compute node
   it's now already inside singularity (from (1)), so skip the second
   singularity exec wrap and just run inner_cmd via bash -c. The
   /.singularity.d/Singularity marker file distinguishes the cases
   so direct-sbatch callers (not yet inside a container) still get
   the wrap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Drop 'we figured out X' framing per code review feedback. The
submit_cluster.py path now works end-to-end (after the patch in
the previous commit that wraps submitit's outer launcher in
singularity exec), so the docs describe that as the canonical
flow rather than as a workaround. Direct-sbatch is no longer
documented as a fallback — submit_cluster.py is the single path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Replace the cryptic one-line 'you may want to set' with a self-contained
explanation: what the env var does (per-arch fat binary), why it matters
on a heterogeneous cluster ('no kernel image' on the wrong GPU), what
the recommended value covers (A100/L40S/H100/H200), and when it actually
matters in practice (interactive builds outside setup_container.sh
rebuild, which already exports it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

The previous CPU rebuild section just said 'doesn't need a GPU' without
explaining the apparent contradiction (we're compiling CUDA, no?). Spell
out that nvcc is a cross-compiler: it emits PTX/SASS for the target
arches in TORCH_CUDA_ARCH_LIST without needing matching hardware, and
the CUDA toolkit lives in the singularity image so any node that can
mount the image can run the build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Just say 'doesn't need a GPU' without the nvcc / fat binary / cross-compiler
detour. Readers who want the why can find TORCH_CUDA_ARCH_LIST above.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

This PR is just cluster training. The mining workflow doc and its
README pointer move to a separate PR so the two can be reviewed and
landed independently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

cluster_training.md: trim trailing whitespace on two lines and add a
final newline (pre-commit hooks were rejecting the diff).

.gitignore: drop the sphinx-themed entries. This repo doesn't use
sphinx (no docs/conf.py, no Makefile, no mkdocs.yml — docs are just
plain Markdown), so 'docs/_build/' and the misleading 'Generated docs
(sphinx build output only)' comment were vestigial cookiecutter
boilerplate. The earlier '# docs/' line is gone with them.

README.md: short pointer paragraph added to the existing HPC section
linking to docs/cluster_training.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

11 lines → 6. Kept the what (wrap launcher in singularity exec) +
why (sys.executable is either version-mismatched host python or a
dangling venv symlink) + pointer to the matched check in
launch_training. Dropped the enumerated detail; future readers who
need it can trace the code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Previous comment conflated two scenarios — the version-mismatched host
python (real, what we hit) and the venv-symlink-dangling case (a
hypothetical alternative setup we don't use). Just describe the real
problem: login-node /usr/bin/python3 is 3.12, compute-node is 3.9, so
~/.local installs for 3.12 are invisible on the compute side.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@Yvonne511
Copy link
Copy Markdown

  1. This is a small bug inside setup_container.sh: mv "${TEMPLATE_NAME%.gz}" overlay.ext3 should be removed
  2. I don't think this is necessary, at least for submitit. I did not need to install this separately. I have run into the "two python installation are being used here" issue, but if I just do "setup_container.sh install," it works fine. So I wonder if the 2 python issue is introduced else where.
    python3 -m ensurepip --user python3 -m pip install --user submitit pyyaml cloudpickle
    2.1 I do understand python3 scripts/submit_cluster.py does not work in login node without python3 -m ensurepip --user and etc. But you could source /scratch/yw4142/PufferDrive4/venv/bin/activate to solve it. Meaning, the change in submit_cluster.py is also not needed. But open to disagreement.

@eugenevinitsky eugenevinitsky changed the title (WIP: do not review or read) Docs: cluster training + mining operational guides Docs: cluster training May 20, 2026
Per reviewer feedback (yw4142):

1. setup_container.sh create_overlay: remove the stray
   mv "${TEMPLATE_NAME%.gz}" overlay.ext3 — it renamed the just-
   extracted file to a name that doesn't match what OVERLAY_PATH
   defaults to, so every later step looked for the overlay at the
   wrong path. Drop the line; overlay stays at its original name.

2. submit_cluster.py: revert the executor._executor.python override
   and the /.singularity.d/Singularity sentinel check in
   launch_training. They were working around 'submitit not on the
   compute-side python', but if you source the project venv on the
   login node, sys.executable becomes the venv python (same path
   the compute node will run) and submitit's serialization round-
   trips without needing any wrap. Back to the original pattern:
   launcher python is the venv python, launch_training wraps the
   inner train command in singularity exec for CUDA libs.

3. cluster_training.md: replace the pip install --user submitit
   pyyaml cloudpickle bootstrap with 'source the venv'. setup_container.sh
   install already lands submitit in the venv via the project's
   pyproject.toml, and sourcing the venv on login makes it
   importable. The 'two python installations' section comes out
   entirely — there's just one python (the venv's).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Comment thread docs/cluster_training.md
--prefix mytrain \
--compute_config scripts/cluster_configs/nyu_greene.yaml \
--program_config scripts/cluster_configs/train_base.yaml \
--account <acct> --partition <gpu-partition> --time 2880 \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment on where to get the suggested account and gpu partition for torch?

…rlay

Move the base python from inside the overlay (/ext3/miniforge3/) onto
/scratch (/scratch/$USER/miniforge3/). The venv's bin/python symlink
now points at /scratch — a path that resolves on every node, inside
or outside singularity, so 'source venv/activate && python ...' works
on the login node without needing to enter the container.

Changes:
- New MINIFORGE3_DIR variable (default /scratch/$USER/miniforge3) and
  ensure_miniforge3() helper that runs the conda-forge installer
  (self-contained shell script; no root, no singularity).
- CONTAINER_PYTHON default now points at the /scratch miniforge3.
- The 'install' dispatch runs ensure_miniforge3 outside the container
  first, then enters singularity (read-only overlay) for the uv + pip
  + build_ext steps that need nvcc.
- run_in_container_writable + --fakeroot are gone; nothing in the python
  flow writes to the overlay anymore. The overlay stays mounted :ro for
  rare system-tool installs but isn't on any write path.

For existing users: re-run setup_container.sh install — it'll detect
miniforge3 missing on /scratch, install it, recreate the venv against
the new base, and reinstall the project. The overlay's old miniforge3
becomes stale but harmless; you can rm it from the overlay later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Existing installs have the venv's bin/python pointing at /ext3/miniforge3
from before we moved miniforge3 to /scratch. Re-running install would
otherwise reuse the broken-symlink venv and skip recreating it. Detect
the broken symlink (bash -e $VENV/bin/python) and rm -rf before
recreating against $CONTAINER_PYTHON.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

The previous check (-e $VENV/bin/python) was wrong: install runs inside
singularity, where the overlay is mounted, so the OLD venv's symlink
into /ext3/miniforge3 IS valid — it just points at the wrong place
relative to where MINIFORGE3_DIR now is. Use readlink -f to walk the
chain and verify the resolved path is under $MINIFORGE3_DIR; rebuild
if not.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

miniforge3 25.x ships Python 3.13, but torch's cu121 wheels are cp39..cp312
only — no cp313 — so the install fails with 'no solution found when
resolving dependencies'. Pin the installer URL to 24.11.3-2 (Python
3.12) and add a version check in ensure_miniforge3 so an existing
miniforge3 with the wrong python gets reinstalled.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown

@greptile-apps greptile-apps Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

Eugene Vinitsky and others added 2 commits May 20, 2026 17:41
Putting back the docs/_build/ line I dropped earlier alongside the
blanket 'docs/' ignore. The blanket ignore was actively wrong (hid
our checked-in markdown), but docs/_build/ is just the standard
sphinx build output dir — costs nothing to keep, and protects future
contributors from accidentally committing 'make html' output if
sphinx is ever added.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The cluster-docs PR shouldn't carry the mining-outputs ignore. Splitting
to its own PR so the two are reviewable independently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants