Local Aware IBM #1378

Open
danieljvickers wants to merge 81 commits into MFlowCode:master from danieljvickers:local-aware-ibm

Conversation

@danieljvickers
Member

Description

The current code does not scale much past 10k particles due to limitations of the MPI_Allreduce sum in the IB force computation. This PR alleviates the problem by limiting the set of IBs any given rank is aware of to those owned by its neighbors. This turns the all-reduce into an MPI neighbor communication, removing the communication bottleneck. To support this, a major overhaul of IB ownership between ranks was required.

Type of change

  • Refactor

Testing

TBD

Checklist

  • I added or updated tests for new behavior

AI code reviews

Reviews are not triggered automatically. To request a review, comment on the PR:

  • @coderabbitai review — incremental review (new changes only)
  • @coderabbitai full review — full review from scratch
  • /review — Qodo review
  • /improve — Qodo code suggestions
  • @claude full review — Claude full review (also triggers on PR open/reopen/ready)
  • Add label claude-full-review — Claude full review via label

@github-actions

github-actions Bot commented Apr 23, 2026

Claude Code Review

Head SHA: 9879df3

Files changed: 21, including:

  • src/simulation/m_ibm.fpp
  • src/simulation/m_start_up.fpp
  • src/simulation/m_collisions.fpp
  • src/simulation/m_global_parameters.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_mpi_common.fpp
  • src/post_process/m_data_output.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_time_steppers.fpp

Findings:

Critical: Stack overflow in s_reduce_ib_patch_array (src/simulation/m_start_up.fpp)

subroutine s_reduce_ib_patch_array()
    type(ib_patch_parameters), dimension(num_ib_patches_max) :: patch_ib_gbl

patch_ib_gbl is a local (stack-allocated) variable dimensioned by num_ib_patches_max, which was just raised from 50,000 to 2,050,000 in src/common/m_constants.fpp. ib_patch_parameters contains a character(LEN=pathlen_max) field (pathlen_max=400) plus ~500 bytes of numeric fields — roughly 900 bytes per element. At 2,050,000 elements this is ~1.8 GB on the call stack, which will segfault on every platform (typical stack limits are 8–64 MB). This variable must be heap-allocated (allocate/deallocate).


High: Debug print left in production code (src/simulation/m_ibm.fpp)

print *, proc_rank, " New Owner ", patch_ib(k)%gbl_patch_id  ! TODO :: REMOVE THIS DEBUG PRINT

This fires on every ownership transfer for every locally-tracked IB, generating unbounded output at scale. The ! TODO :: REMOVE THIS DEBUG PRINT marker confirms it is unintentional.


Medium: Commented-out code in m_collisions.fpp leaves pid2 lookup absent (src/simulation/m_collisions.fpp)

! call s_get_neighborhood_idx(pid1, pid1) ! global patch ID -> local index call s_get_neighborhood_idx(pid2, pid2)
if (pid1 <= 0 .or. pid2 <= 0) cycle

The comment text contains call s_get_neighborhood_idx(pid2, pid2) — what appears to be a second intended lookup that was accidentally folded into the comment instead of being left as executable code. Neither lookup actually executes: pid1 and pid2 are the raw decoded global IDs from s_decode_patch_periodicity, not local indices. The guard if (pid1 <= 0 .or. pid2 <= 0) cycle would never trigger for valid global IDs (which are ≥ 1), meaning the subsequent patch_ib(pid1) and patch_ib(pid2) accesses use global IDs as local array indices, which is out-of-bounds when num_ibs < num_gbl_ibs. If this is intentional (global IDs equal local indices in the no-MPI / single-rank path), that should be documented; otherwise both lookups need to be uncommented.


Low: GPU_UPDATE(device=[patch_ib(ib_idx)%moment]) removed without replacement (src/simulation/m_ibm.fpp)

-            patch_ib(ib_marker)%moment = moment*patch_ib(ib_marker)%mass/(count*cell_volume)
-            $:GPU_UPDATE(device='[patch_ib(ib_marker)%moment]')
+            patch_ib(ib_idx)%moment = moment*patch_ib(ib_idx)%mass/(count*cell_volume)

The per-field device update was removed. If patch_ib(i)%moment is consumed on the GPU before the next full GPU_UPDATE(device=[patch_ib(1:num_ibs)]) (e.g., in a subsequent call to s_ibm_correct_state within the same time step), the device will use a stale value. Verify that a covering full-struct update always precedes any GPU use of %moment after s_compute_moment_of_inertia is called from s_propagate_immersed_boundaries.

@danieljvickers danieljvickers marked this pull request as ready for review May 11, 2026 18:30
@qodo-code-review
Contributor

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@codecov

codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 39.46731% with 250 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.19%. Comparing base (f1fd862) to head (c862070).

Files with missing lines | Patch % | Lines
src/simulation/m_ibm.fpp | 30.89% | 121 Missing and 11 partials ⚠️
src/simulation/m_start_up.fpp | 51.79% | 57 Missing and 10 partials ⚠️
src/post_process/m_data_output.fpp | 0.00% | 32 Missing ⚠️
src/simulation/m_collisions.fpp | 27.27% | 13 Missing and 3 partials ⚠️
src/simulation/m_time_steppers.fpp | 50.00% | 2 Missing ⚠️
src/simulation/m_data_output.fpp | 88.88% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1378      +/-   ##
==========================================
- Coverage   64.95%   64.19%   -0.76%     
==========================================
  Files          72       72              
  Lines       18880    19145     +265     
  Branches     1573     1611      +38     
==========================================
+ Hits        12263    12290      +27     
- Misses       5641     5865     +224     
- Partials      976      990      +14     

☔ View full report in Codecov by Sentry.

danieljvickers and others added 11 commits May 12, 2026 11:16
Re-run ffmt with the corrected 0.4.0 build that includes the
breaks.retain filter preventing line splits immediately before `%`
(member accessor). Fixes three occurrences of `& %sf` in m_ibm.fpp,
one `& %geometry` in m_ibm.fpp, and one `& %beg` in m_time_steppers.fpp.

num_ib_patches_max = 2050000 caused a ~2.25 GB unconditional heap
allocation + full init loop in s_assign_default_values_to_user_inputs
on every startup, crashing the CI debug build even for cases with no IBM.
Also, s_reduce_ib_patch_array had a 2.25 GB stack-allocated local array
that caused SIGSEGV for IBM cases.

Introduce num_ib_patches_max_namelist = 50000 (restoring the pre-particle-bed
budget) for the initial allocation. s_generate_particle_beds grows patch_ib
to num_ib_patches_max via MOVE_ALLOC only when particle beds are actually
being generated. s_reduce_ib_patch_array now uses a heap-allocated local
array sized to num_ibs instead of the full num_ib_patches_max.
When particle beds are used, rank 0 grows patch_ib to num_ib_patches_max
inside s_generate_particle_beds. Non-rank-0 ranks only have the
namelist-sized allocation (num_ib_patches_max_namelist). The num_ibs
scalar is broadcast first, so we can check and grow before the per-patch
MPI_BCAST loop accesses patch_ib(i) for i > num_ib_patches_max_namelist.
@sbryngelson
Copy link
Copy Markdown
Member

You had an issue that I think I fixed.

Root cause: num_ib_patches_max = 2,050,000 caused a ~2.25 GB unconditional allocation + full init loop on every startup (even for non-IBM cases), crashing CI debug builds.

Fix — 5 files, 2 commits:

  1. m_constants.fpp: Added num_ib_patches_max_namelist = 50000 — the small initial budget (same as before particle beds were added).
  2. m_global_parameters.fpp + m_start_up.fpp: Initial patch_ib allocation uses num_ib_patches_max_namelist on all ranks (~55 MB instead of ~2.25 GB).
  3. m_particle_bed.fpp: Grows patch_ib to num_ib_patches_max via MOVE_ALLOC only when particle beds are actually being generated (rank 0 only, before the MPI broadcast).
  4. m_mpi_proxy.fpp: Non-rank-0 processes grow patch_ib to match if num_ibs > size(patch_ib) — handles the MPI > 1 case where rank 0 grew for particle beds but other ranks still have the small allocation.
  5. m_start_up.fpp s_reduce_ib_patch_array: Changed the 2.25 GB stack-allocated patch_ib_gbl to a heap-allocated array sized to exactly num_ibs — eliminates the SIGSEGV for IBM cases.

… limit

NIB and the case_validator both now reference num_ib_patches_max_namelist
(50000) instead of num_ib_patches_max (2050000). This constant is the
actual namelist limit; particle beds grow patch_ib beyond it at runtime
but those entries are never specified in the namelist. The fallback values
match the constant, ensuring Homebrew installs (which lack m_constants.fpp)
use the correct limit.

patch_ib is reallocated to num_aware_ibs slots (e.g. 54000 for 3D with
the default ib_neighborhood_radius=1) before the 1-rank and no-MPI copy.
Using patch_ib(1:num_gbl_ibs) crashed when num_gbl_ibs > num_aware_ibs
(e.g. large particle beds on a single MPI rank). Use min() to clamp
the copy, matching the original truncation behavior while avoiding the
out-of-bounds write on the newly allocatable patch_ib_gbl.

For single-rank and no-MPI cases every IB patch is local, so shrinking
patch_ib to num_aware_ibs (e.g. 54k for 3D) and then keeping num_ibs at
num_gbl_ibs was a latent out-of-bounds: IBM loops over num_ibs would
access patch_ib beyond its capacity.

MOVE_ALLOC transfers patch_ib_gbl (sized exactly num_gbl_ibs) directly
into patch_ib — no copy, no truncation, patch_ib size matches num_ibs.
The num_aware_ibs resize is still done for multi-rank cases where each
rank genuinely only needs its local neighbourhood subset.

patch_ib has GPU_DECLARE(create=...) in m_global_parameters, which means
OpenACC/OpenMP tracks it via plain allocate/deallocate automatically.
move_alloc is not reliably intercepted by all GPU runtimes for declare-
create variables, so replace all three grow-on-demand sites (m_particle_bed,
m_mpi_proxy, s_reduce_ib_patch_array 1-rank/no-MPI) with the established
pattern: allocate tmp, copy, deallocate patch_ib, allocate patch_ib.
