Reduce the memory usage that is important for ne1024 simulation #4102
Open
sjsprecious wants to merge 3 commits into
Conversation
Member
Thanks for this, @sjsprecious! A couple of questions:
@ekluzek, I'm assigning you for now given your recent work on our task decomposition, but I'm also adding Next so we can discuss in our SE meeting.
Author
Thanks @samsrabin for your quick reply. To answer your questions:
We are waiting for a new tag for these CTSM changes so that our collaborators can start their scientific runs soon. Thus I would say no hard date, but the sooner, the better.
Yes, these changes should not change the answers for CTSM. I am happy to do some tests on Derecho if you can share the detailed instructions. Let me know if you or Erik have any comments/suggestions about these code changes.
This PR introduces some changes in CDEPS that will be used in CTSM later and are critical for reducing the memory usage of a simulation at ne1024 resolution. All the changes were made by Claude under my supervision.
This PR requires a new tag from CDEPS once my PR (ESCOMP/CDEPS#414) is merged.
The goal is to cut CTSM initialization memory (and some init time) at high resolution (like ne1024), where per-rank data replication and duplicate ESMF mesh construction dominate startup cost.
The detailed edits:
An MPI-3 shared-memory module, specialized for CTSM's decomposition setup. The idea is that arrays that are otherwise allocated identically on every MPI rank instead get one physical copy per shared-memory node, mapped into every rank on that node. This frees `ranks_per_node − 1` copies per node.
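To put a rough size on the `ranks_per_node − 1` saving, here is a back-of-the-envelope sketch in Python. Every number in it is an assumption rather than a measurement from this PR: it assumes the global array length equals the column count of an ne1024 spectral-element grid with np=4 (6·ne²·9 + 2), 4-byte default integers, and 128 MPI ranks per node. The helper names `se_np4_columns`, `array_bytes`, and `bytes_freed_per_node` are hypothetical.

```python
def se_np4_columns(ne: int) -> int:
    """Unique grid columns on a cubed sphere with ne x ne elements per face and np=4.

    Assumption: the global array length equals this column count.
    """
    return 6 * ne * ne * 9 + 2


def array_bytes(gsize: int, bytes_per_int: int = 4) -> int:
    """Size of one global-sized default-integer array (4-byte integers assumed)."""
    return gsize * bytes_per_int


def bytes_freed_per_node(gsize: int, ranks_per_node: int) -> int:
    """Memory freed on one node when a replicated array becomes one shared copy."""
    replicated = ranks_per_node * array_bytes(gsize)  # one copy per rank
    shared = array_bytes(gsize)                       # one copy per node
    return replicated - shared


gsize = se_np4_columns(1024)
ranks_per_node = 128  # assumption: fully populated 128-core node
freed = bytes_freed_per_node(gsize, ranks_per_node)
print(f"gsize            = {gsize}")
print(f"one array        = {array_bytes(gsize) / 2**20:.1f} MiB")
print(f"freed per node   = {freed / 2**30:.1f} GiB")
```

Under these assumptions the saving scales linearly with both the grid size and the number of ranks placed on a node, which is why the replication only becomes a problem at very high resolution.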
The shared-memory module provides:
- `clm_shmem_alloc_i4_1d(ptr, win, n)`: allocate a node-shared default-integer rank-1 array (only the node leader requests storage via `MPI_Win_allocate_shared`; peers map the leader's segment via `MPI_Win_shared_query`).
- `clm_shmem_leader_allreduce_sum_i4(ptr, win, n)`: fence → node leaders sum partials across nodes over a leader-only communicator → fence to publish. Builds a globally-summed array in the shared buffer without every rank holding a global-sized copy.
- `clm_shmem_free` / `clm_shmem_fence` / `clm_shmem_is_leader` / `clm_shmem_leader_comm` / `clm_shmem_npes_per_node`: lifecycle and query helpers; lazily build node-local and node-leader communicators via `mpi_comm_split_type(MPI_COMM_TYPE_SHARED)`.
The global land mask `lndmask_glob(gsize)` was previously allocated on every rank and built with an all-rank `ESMF_VMAllReduce` into a second global-sized temporary (`itemp_glob`). Now, in both code paths (`lnd_set_lndmask_from_maskmesh` and `lnd_set_lndmask_from_lndmesh`):
- `lndmask_glob` is allocated once per node via `clm_shmem_alloc_i4_1d`, with a new `lndmask_win` window handle threaded through both subroutine signatures.
- The leader zeroes it, a fence follows, and each rank fills its disjoint local indices; then `clm_shmem_leader_allreduce_sum_i4` replaces the `ESMF_VMAllReduce` + `itemp_glob` temporary (the temporary is deleted entirely).
- Cleanup is now branch-aware: the cmeps driver paths free via `clm_shmem_free(lndmask_glob, lndmask_win)`; the `lilac` path still uses plain `deallocate` (it uses a plain `allocate`).
This removes two global-sized integer arrays per rank (the mask copy + the all-reduce temp), replaced by one node-shared copy.
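The zero → fill → leader-sum sequence for the land mask can be imitated without MPI. The Python sketch below is only a serial model of the data flow, not the Fortran in this PR: one list per node stands in for the shared-memory window, a plain sum stands in for the all-reduce over the leader-only communicator, and the fences appear only as comments. The function name `build_global_mask` and its arguments are hypothetical.

```python
from typing import Dict, List


def build_global_mask(
    gsize: int,
    owners: List[List[int]],
    local_mask: List[List[int]],
    node_of_rank: List[int],
) -> Dict[int, List[int]]:
    """Serial model of building one node-shared global mask per node.

    owners[r]       global indices owned by rank r (disjoint across ranks)
    local_mask[r]   mask values for those indices
    node_of_rank[r] node id hosting rank r
    """
    nodes = sorted(set(node_of_rank))

    # One buffer per node models the shared window; the node leader zeroes it.
    shared = {node: [0] * gsize for node in nodes}

    # (fence) Each rank writes only its own global indices into its node's buffer.
    for rank, node in enumerate(node_of_rank):
        for gi, value in zip(owners[rank], local_mask[rank]):
            shared[node][gi] = value

    # (fence) Leaders sum the per-node partial buffers across nodes.
    total = [sum(shared[node][i] for node in nodes) for i in range(gsize)]

    # Each leader stores the summed result back into its node's buffer.
    for node in nodes:
        shared[node][:] = total

    # (fence) After this point every rank on a node reads the same global mask.
    return shared


# Four ranks on two nodes, eight global cells.
result = build_global_mask(
    gsize=8,
    owners=[[0, 1], [2, 3], [4, 5], [6, 7]],
    local_mask=[[1, 0], [1, 1], [0, 0], [1, 0]],
    node_of_rank=[0, 0, 1, 1],
)
print(result[0] == result[1])
```

Because the ranks own disjoint indices, the sum across nodes acts as a gather: each cell receives exactly one nonzero contribution at most, and only one global-sized buffer exists per node in this model.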
Closing pio file handles that were opened but closed late or never, which frees buffers earlier in init:
- Moves `ncd_pio_closefile(params_ncid)` earlier, to right after its last use (`bgc_vegetation_inst%Init`) instead of at the end of `init_accflds`.
- Moves `ncd_pio_closefile(ncid)` to right after the last read (`STD_ELEV`) instead of the end of `initVertical`.
- Adds `ncd_pio_closefile(ncid)` on the early-return path (`nlevurb == 0`) that previously leaked the handle.
- Calls `ncd_pio_closefile(ncid)` after reading `ORGANIC`.
- surfrdMod.F90: adds two `ncd_pio_closefile(ncid)` calls after dimension reads complete (after the pft/cft dims, and after `nlevurb`).
`PrigentRoughnessStreamType.F90` and `UrbanTimeVarType.F90` each wrap their single `shr_strdata_init_from_inline` call in an `if (mapalgo == 'redist')` branch:
- `redist` branch (stream is already on the model grid, as for the ne1024 Prigent-roughness and urban-time-varying files): passes the new argument `stream_mesh_in = mesh`, handing CDEPS the already-built CLM model mesh so it does not read the stream mesh file and construct a duplicate full ESMF mesh. The duplicate is a large init memory/time cost at ne1024.
- `else` branch: the original call unchanged (CDEPS builds the stream mesh as before).
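The effect of the `redist` branch can be shown with a small Python model. This is a sketch of the branching idea only and does not call CDEPS or ESMF: `MeshFactory`, `build_from_file`, and `choose_stream_mesh` are hypothetical names, and the dictionaries merely stand in for mesh objects.

```python
class MeshFactory:
    """Stand-in for the expensive 'read mesh file and build a mesh' step."""

    def __init__(self) -> None:
        self.builds = 0  # counts how many meshes were constructed

    def build_from_file(self, path: str) -> dict:
        self.builds += 1
        return {"source": path}


def choose_stream_mesh(mapalgo: str, model_mesh: dict, mesh_file: str,
                       factory: MeshFactory) -> dict:
    """Reuse the model mesh when the stream is already on the model grid."""
    if mapalgo == "redist":
        # Stream and model share a grid: hand over the existing mesh object.
        return model_mesh
    # Any other mapping algorithm: build the stream mesh from its file as before.
    return factory.build_from_file(mesh_file)


factory = MeshFactory()
model_mesh = {"source": "model"}

reused = choose_stream_mesh("redist", model_mesh, "stream_mesh.nc", factory)
built = choose_stream_mesh("bilinear", model_mesh, "stream_mesh.nc", factory)
print(reused is model_mesh, factory.builds)
```

In this model the `redist` case never touches the factory, which mirrors the intent of skipping the duplicate mesh construction; the other case behaves exactly as it did before the change.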