Reduce the memory usage that is important for ne1024 simulation#665
Reduce the memory usage that is important for ne1024 simulation#665sjsprecious wants to merge 4 commits into
Conversation
|
@mvertens - If you have a chance, I'd like to hear any thoughts you have on this change. (I have just skimmed it so far - haven't had a chance to look closely yet.) |
billsacks
left a comment
There was a problem hiding this comment.
The overall changes here look good to me. I appreciate your plugging in the use of lnd2rof_map, which seems like it previously wasn't hooked up. And I think I see why you needed to introduce a separate lnd2rof_fmap (though see my comments asking for this to be made more explicit in some documentation).
I do have a few requests... many about editing some comments, but a couple slightly more substantial... but still overall minor - for the most part this looks good to me.
Once you make the final changes, I'd like to see that at least one or two tests have been run with baseline comparisons to verify that these changes work and are bit-for-bit in out-of-the-box configurations. One test that I think would cover all of these changes would be a B compset test with the aoflux grid set to ogrid (the setting of ogrid is needed to cover the changes in med_phases_aofluxes_mod, and should still cover the other changes here). I'd like to see that run with comparisons against a baseline.
| ! Reuse the ocean field's mesh directly rather than building a duplicate full | ||
| ! ESMF mesh (ne1024pg2) per rank. lmesh persists in FBArea(compocn) and | ||
| ! aoflux_mesh is never destroyed, so sharing the handle is safe (ogrid path). |
There was a problem hiding this comment.
This change seems reasonable, and I appreciate the comment justifying this. Please remove the specific reference to ne1024pg2 from the comment.
That said, I don't understand this well enough to be totally confident that this is okay... I'll be more comfortable if you run a B compset test with the aoflux grid set to ogrid and show that it passes and is bit-for-bit with a baseline.
There was a problem hiding this comment.
Thanks @billsacks . I have updated the comments as you suggested.
Can you share the detailed instructions to run a B compset test with the aoflux grid set to ogrid? I am happy to do the tests later and let you know if these changes are BFB.
| integer :: n,n1,ns | ||
| integer :: maptype | ||
| integer :: fieldCount | ||
| logical :: isPresent, isSet, lexist ! lnd2rof_fmap attribute presence / weight-file check |
There was a problem hiding this comment.
These logicals might be used for other variables in the future, so it's probably best to remove the comment suggesting that they're just used for this one variable.
There was a problem hiding this comment.
Thanks. Fixed in the latest commit.
| integer :: maptype | ||
| integer :: fieldCount | ||
| logical :: isPresent, isSet, lexist ! lnd2rof_fmap attribute presence / weight-file check | ||
| character(len=CX) :: lnd2rof_fmap ! consd (destarea) lnd->rof fraction-map file ($LND2ROF_FRAC_FMAPNAME) |
There was a problem hiding this comment.
I don't think comments in the Fortran should refer to xml variables (LND2ROF_FRAC_FMAPNAME), since not all systems use these xml variables.
There was a problem hiding this comment.
Thanks. Fixed in the latest commit.
| call NUOPC_CompAttributeGet(gcomp, name='lnd2rof_fmap', value=lnd2rof_fmap, & | ||
| isPresent=isPresent, isSet=isSet, rc=rc) | ||
| if (ChkErr(rc,__LINE__,u_FILE_u)) return | ||
| if (.not. (isPresent .and. isSet)) lnd2rof_fmap = 'unset' |
There was a problem hiding this comment.
It seems like this line is redundant with the initialization to 'unset' a few lines above, and that just one or the other could be kept, with the other removed.
There was a problem hiding this comment.
Good catch. I have removed lnd2rof_fmap = 'unset' a few lines above.
| if (.not. (isPresent .and. isSet)) lnd2rof_fmap = 'unset' | ||
| if (trim(lnd2rof_fmap) /= 'unset') then | ||
| inquire(file=trim(lnd2rof_fmap), exist=lexist) | ||
| if (.not. lexist) lnd2rof_fmap = 'unset' |
There was a problem hiding this comment.
I feel like this should abort with an error message rather than silently setting this to 'unset'. Then this block could be merged with the following block so it would look like the following (noting also, as a side note, that trim is unnecessary in string comparisons... I know we do that in a lot of places, but it's unnecessary - e.g., see https://community.intel.com/t5/Intel-Fortran-Compiler/String-Comparisons/td-p/798323):
if (lnd2rof_fmap /= 'unset') then
inquire(...)
if (.not. lexist) then
! ERROR here with a meaningful message
end if
call med_map_routehandles_init(...)There was a problem hiding this comment.
Thanks. I have updated it based on your suggestions.
| <type>char</type> | ||
| <category>mapping</category> | ||
| <group>MED_attributes</group> | ||
| <desc>lnd to rof conservative destarea ('consd') fraction mapping file read by the mediator |
There was a problem hiding this comment.
If I understand correctly, it looks like the difference between this and LND2ROF_FMAPNAME is that this one uses consd mapping and the other uses consf mapping. Is that right? If so, I think it would be useful to extend the comment in LND2ROF_FMAPNAME pointing out that that one uses consf mapping.
There was a problem hiding this comment.
Correct. I updated it in the latest commit.
| <default_value>unset</default_value> | ||
| <group>run_domain</group> | ||
| <file>env_run.xml</file> | ||
| <desc>lnd2rof conservative (destarea, 'consd') fraction mapping file used by the mediator |
There was a problem hiding this comment.
Similar to my comment in namelist_definition_drv: it looks like the difference between this one and LND2ROF_FMAPNAME is whether it's using consd or consf mapping. Is that right? If so, please add a comment to LND2ROF_FMAPNAME also pointing out that that one should be consf mapping.
There was a problem hiding this comment.
Correct. I updated it in the latest commit.
| <desc>rof2lnd flux mapping file</desc> | ||
| </entry> | ||
|
|
||
| <entry id="LND2ROF_FRAC_FMAPNAME"> |
There was a problem hiding this comment.
This new entry would be better positioned right after LND2ROF_FMAPNAME (similar to its positioning in namelist_definition_drv).
There was a problem hiding this comment.
Thanks. Fixed in the latest commit.
| ! At ne1024pg2 resolution: the lnd->rof fraction map is the only grid-crossing | ||
| ! conservative coupling map; building it online (ESMF_FieldRegridStore -> | ||
| ! GeomRendezvous -> Zoltan_RCB) OOM-kills the job during DataInitialize. | ||
| ! Read offline consd (conserve/destarea) fraction weights from the file named by the | ||
| ! 'lnd2rof_fmap' attribute (xmlchange LND2ROF_FRAC_FMAPNAME=<file>) when it is set | ||
| ! and present; otherwise fall back to the online path (no behavior change). |
There was a problem hiding this comment.
This comment is very specific to your case. This could be appropriate for an issue / PR comment, but the code comment should be more general. Please rewrite this comment to be more general. Also, it seems best to avoid referencing XML variables in the Fortran, since xml variables aren't used by all modeling systems.
There was a problem hiding this comment.
Thanks. Fixed in the latest commit.
This PR reduces the memory usage that is critical for ultra-high resolution simulation such as ne1024. All the code changes are done by Claude under my supervisory.
The goal is to let the
lnd→rofconservative coupling maps be read from offline weight files instead of being computed online, which OOM-kills the job atne1024duringDataInitialize (ESMF_FieldRegridStore → GeomRendezvous → Zoltan_RCB). This is the implementation of thelnd2rof_consfOOM fix. There are two distinctlnd→rofmaps handled, plus anaofluxmemory optimization.In main, the
lnd2rof_mapattribute (driven by the pre-existingLND2ROF_FMAPNAMEXML var) was already read into thelnd2rof_mapvariable — but the fiveaddmap_fromcalls for the runoff fields (Flrl_rofsur, Flrl_rofi, Flrl_rofgwl, Flrl_rofsub, Flrl_irrig) hardcodedunset, so the file was silently ignored and the map was always built online.This is a separate map (
destarea, notfracarea) used duringmed_fraction_init, with no pre-existing namelist hook. The fix adds full new plumbing:LND2ROF_FRAC_FMAPNAME(default unset, env_run.xml, group run_domain) in config_component.xml.lnd2rof_fmap(modify_via_xml="LND2ROF_FRAC_FMAPNAME") innamelist_definition_drv.xml.lnd2rof_fmapattribute viaNUOPC_CompAttributeGet; if it is present, set, and the file exists on disk (inquire), it callsmed_map_routehandles_initwithmapfile=to read offline weights; otherwise it falls back to the original online path — no behavior change when unset. Adds ause NUOPCimport and local vars (isPresent/isSet/lexist, lnd2rof_fmap).optional, intent(in) :: mapfileargument throughmed_map_routehandles_initfrom_fieldbundle, forwarding it to the field-level routine (which already accepted mapfile) when present. Backward-compatible.Replaces
is_local%wrap%aoflux_mesh = ESMF_MeshCreate(lmesh, rc=rc)withis_local%wrap%aoflux_mesh = lmesh— reuses the ocean field's existing mesh handle instead of allocating a duplicate full ESMF mesh per rank at ne1024pg2. Safe because lmesh persists inFBArea(compocn)andaoflux_meshis never destroyed.