
Reduce the memory usage that is important for ne1024 simulation #4102

Open
sjsprecious wants to merge 3 commits into ESCOMP:master from sjsprecious:reduce_init_memory


@sjsprecious


This PR introduces some changes in CDEPS that will be used in CTSM later and are critical to reduce memory usage of a simulation at ne1024 resolution. All the changes were made by Claude under my supervision.

This PR requires a new tag from CDEPS once my PR (ESCOMP/CDEPS#414) is merged.


The goal is to cut CTSM initialization memory (and some init time) at high resolution (like ne1024), where per-rank data replication and duplicate ESMF mesh construction dominate startup cost.

The detailed edits:

  1. New per-node shared-memory helper: clm_shmem_mod.F90

An MPI-3 shared-memory module specialized for CTSM's decomposition setup. The idea is that arrays that would otherwise be allocated identically on every MPI rank instead get one physical copy per shared-memory node, mapped into every rank on that node. This frees ranks_per_node − 1 copies per node.

  • clm_shmem_alloc_i4_1d(ptr, win, n) — allocate a node-shared default-integer rank-1 array (only the node leader requests storage via MPI_Win_allocate_shared; peers map the leader's segment via MPI_Win_shared_query).

  • clm_shmem_leader_allreduce_sum_i4(ptr, win, n) — fence → node leaders sum partials across nodes over a leader-only communicator → fence to publish. Builds a globally-summed array in the shared buffer without every rank holding a global-sized copy.

  • clm_shmem_free / clm_shmem_fence / clm_shmem_is_leader / clm_shmem_leader_comm / clm_shmem_npes_per_node — lifecycle and query helpers; lazily build node-local and node-leader communicators via mpi_comm_split_type(MPI_COMM_TYPE_SHARED).
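
The saving from holding one copy per node rather than one per rank can be estimated with simple arithmetic. The sketch below is a Python model, not code from this PR; the function names, the 4-byte element size (a typical default integer), and the example sizes are all assumptions chosen for illustration.

```python
# Back-of-the-envelope model of per-rank replication versus node-shared storage.
# All names and numbers here are illustrative assumptions, not values from the PR.

def replicated_bytes(n_elems, ranks_per_node, n_nodes, bytes_per_elem=4):
    """Total bytes when every rank holds its own private copy of the array."""
    return n_elems * bytes_per_elem * ranks_per_node * n_nodes


def node_shared_bytes(n_elems, n_nodes, bytes_per_elem=4):
    """Total bytes when each node holds exactly one copy shared by its ranks."""
    return n_elems * bytes_per_elem * n_nodes


def copies_freed_per_node(ranks_per_node):
    """Number of redundant copies removed on each node."""
    return ranks_per_node - 1


# Hypothetical job: a 50-million-element global array, 128 ranks per node, 64 nodes.
gsize, ranks_per_node, n_nodes = 50_000_000, 128, 64
print("replicated :", replicated_bytes(gsize, ranks_per_node, n_nodes), "bytes")
print("node-shared:", node_shared_bytes(gsize, n_nodes), "bytes")
print("copies freed per node:", copies_freed_per_node(ranks_per_node))
```

With these illustrative numbers a single copy is 200 MB, so 128 private copies occupy 25.6 GB on each node, against 0.2 GB for one shared copy.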

  2. lnd_set_decomp_and_domain.F90 — apply the shmem helper to the global land mask

The global land mask lndmask_glob(gsize) was previously allocated on every rank and built with an all-rank ESMF_VMAllReduce into a second global-sized temporary (itemp_glob). Now, in both code paths (lnd_set_lndmask_from_maskmesh and lnd_set_lndmask_from_lndmesh):

  • lndmask_glob is allocated once per node via clm_shmem_alloc_i4_1d, with a new lndmask_win window handle threaded through both subroutine signatures.

  • Leader zeroes it, fence, each rank fills its disjoint local indices, then clm_shmem_leader_allreduce_sum_i4 replaces the ESMF_VMAllReduce + itemp_glob temporary (the temporary is deleted entirely).

  • Cleanup is now branch-aware: the cmeps driver paths free via clm_shmem_free(lndmask_glob, lndmask_win); the lilac path still uses plain deallocate (it uses a plain allocate).

This removes two global-sized integer arrays per rank (the mask copy + the all-reduce temp), replaced by one node-shared copy.
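
The zero, fill, and leader-sum sequence described above can be modeled without MPI. The following Python sketch only simulates the data flow: each "node" is a list standing in for the shared window, fences are marked by comments, and the cross-node sum stands in for the allreduce over the leader-only communicator. The function name and input layout are hypothetical.

```python
# Serial model of building a global mask with one buffer per node.
# This simulates the data flow only; it does not use MPI.

def build_global_mask(owned_by_node, gsize):
    """owned_by_node[node][rank] is a dict {global_index: mask_value}
    holding the disjoint set of indices that rank owns."""
    # Step 1: one shared buffer per node, zeroed by the node leader.
    node_buffers = [[0] * gsize for _ in owned_by_node]
    # (fence: peers wait until the leader has finished zeroing)

    # Step 2: every rank writes only the indices it owns.
    for buf, ranks in zip(node_buffers, owned_by_node):
        for owned in ranks:
            for gidx, value in owned.items():
                buf[gidx] = value
    # (fence: all local writes are complete before the leaders communicate)

    # Step 3: leaders sum the per-node partials element by element.
    # Each index is owned by exactly one rank, so the sum equals the mask.
    total = [sum(column) for column in zip(*node_buffers)]

    # Step 4: each leader stores the result in its node's buffer.
    for buf in node_buffers:
        buf[:] = total
    # (fence: publishes the summed array to every rank on the node)
    return node_buffers


# Two nodes with two ranks each, covering an 8-point global grid.
owned = [
    [{0: 1, 1: 0}, {2: 1, 3: 1}],  # node 0
    [{4: 0, 5: 1}, {6: 0, 7: 1}],  # node 1
]
buffers = build_global_mask(owned, gsize=8)
print(buffers[0])
```

The sum reproduces the mask only because the buffer starts at zero and the owned index sets are disjoint, which is why the leader zeroes the buffer before any rank writes to it.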

  3. NetCDF file-handle close fixes

These changes close PIO file handles that were previously closed late or never closed, which frees their buffers earlier in init:

  • clm_instMod.F90: moves ncd_pio_closefile(params_ncid) earlier — to right after its last use (bgc_vegetation_inst%Init) instead of at the end of init_accflds.
  • initVerticalMod.F90: moves ncd_pio_closefile(ncid) to right after the last read (STD_ELEV) instead of the end of initVertical.
  • UrbanParamsType.F90: adds a missing ncd_pio_closefile(ncid) on the early-return path (nlevurb == 0) that previously leaked the handle.
  • organicFileMod.F90: adds ncd_pio_closefile(ncid) after reading ORGANIC.
  • surfrdMod.F90: adds two ncd_pio_closefile(ncid) calls after dimension reads complete (after the pft/cft dims, and after nlevurb).
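
The two patterns in this list, closing right after the last read and closing on an early-return path, can be illustrated with a small Python model. FakeHandle and read_urban_params are hypothetical stand-ins for a PIO file handle and a reader routine; they are not part of CTSM or PIO.

```python
# Model of closing a file handle on every exit path, as early as possible.
# FakeHandle is a hypothetical stand-in that counts currently open handles.

class FakeHandle:
    open_count = 0  # number of handles currently open

    def __init__(self, name):
        self.name = name
        self.closed = False
        FakeHandle.open_count += 1

    def close(self):
        if not self.closed:
            self.closed = True
            FakeHandle.open_count -= 1


def read_urban_params(nlevurb):
    handle = FakeHandle("surface_data")
    if nlevurb == 0:
        # Early return: without this close the handle would stay open.
        handle.close()
        return []
    data = list(range(nlevurb))  # stand-in for the last read from the file
    handle.close()               # close immediately after the last read
    # ... any remaining initialization work runs with the handle released
    return data


print(read_urban_params(0), FakeHandle.open_count)
print(read_urban_params(3), FakeHandle.open_count)
```
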

  4. Reuse the already-built CLM mesh

PrigentRoughnessStreamType.F90 and UrbanTimeVarType.F90 each wrap their single shr_strdata_init_from_inline call in an if (mapalgo == 'redist') branch:

  • redist branch (stream is already on the model grid, as for the ne1024 Prigent-roughness and urban-time-varying files): passes the new argument stream_mesh_in = mesh, handing CDEPS the already-built CLM model mesh so it does not read the stream mesh file and construct a duplicate full ESMF mesh — the duplicate is a large init memory/time cost at ne1024.

  • else branch: the original call unchanged (CDEPS builds the stream mesh as before).
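
The branch logic can be sketched as a small Python model that counts mesh constructions. Mesh and init_stream are hypothetical names used only for illustration; the real call is shr_strdata_init_from_inline in CDEPS, whose full argument list is not reproduced here.

```python
# Model of reusing an existing mesh when the stream is already on the model grid.
# Mesh and init_stream are hypothetical; they only count how many meshes get built.

class Mesh:
    built = 0  # total number of meshes constructed so far

    def __init__(self, source):
        self.source = source
        Mesh.built += 1


def init_stream(mapalgo, model_mesh, stream_meshfile):
    if mapalgo == "redist":
        # Stream shares the model grid: hand over the existing mesh
        # instead of reading the mesh file and building a duplicate.
        return model_mesh
    # Any other mapping algorithm: build the stream mesh from its file as before.
    return Mesh(stream_meshfile)


model_mesh = Mesh("model")
same_grid = init_stream("redist", model_mesh, "stream_mesh.nc")
other_grid = init_stream("bilinear", model_mesh, "stream_mesh.nc")
print(same_grid is model_mesh, other_grid is model_mesh, Mesh.built)
```

Keeping the original call in the else branch leaves streams on other grids unaffected, so only the same-grid case skips the second mesh construction.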

@samsrabin added the labels blocked: dependency (Wait to work on this until dependency is resolved), next (this should get some attention in the next week or two. Normally each Thursday SE meeting.), and performance (idea or PR to improve performance, e.g. throughput, memory) on Jun 26, 2026
@samsrabin

Member

Thanks for this, @sjsprecious! A couple of questions:

  1. Do you have a date you need this in by?
  2. Do you expect this to give bit-for-bit identical results to the previous version?

@ekluzek, I'm assigning you for now given your recent work on our task decomposition, but I'm also adding Next so we can discuss in our SE meeting.

@samsrabin samsrabin requested review from ekluzek and removed request for ekluzek June 26, 2026 15:52
@sjsprecious

Author

Thanks @samsrabin for your quick reply. To answer your questions:

  1. Do you have a date you need this in by?

We are waiting for a new tag for these CTSM changes so that our collaborators can start their scientific runs soon. Thus I would say no hard date, but the sooner, the better.

  2. Do you expect this to give bit-for-bit identical results to the previous version?

Yes, these changes should not change the answers for CTSM. I am happy to do some tests on Derecho if you can share the detailed instructions.

Let me know if you or Erik have any comments or suggestions about these code changes.
