feat: Intel GPU Max (Ponte Vecchio) OpenMP target offload support #1445

Draft
sbryngelson wants to merge 7 commits into master from intel-gpu

@sbryngelson

Summary

Adds end-to-end support for building and running MFC on Intel Data Center GPU Max 1100 (Ponte Vecchio) using ifx 2025.0+ with OpenMP target offload to SPIR-V/SPIR64. Verified on GT CRNCH RoboGator (dash4). All 161 1D regression tests pass on the Intel GPU.

Usage

source ./mfc.sh load -c crnch -m g       # load Intel oneAPI 2025.1 modules
./mfc.sh build --gpu mp --intel-aot -j 8 # AOT compile to native PVC ISA
./mfc.sh test --gpu mp --intel-aot -- --binary mpirun

Changes

Build system (CMakeLists.txt, toolchain/)

  • Recognize IntelLLVM compiler ID throughout (was Intel)
  • Add -fiopenmp -fopenmp-targets=spir64 compile/link flags for GPU builds
  • Add -fp-model=precise to prevent ifx FP reassociation in SPIR-V kernels
  • Add --intel-aot flag: AOT compilation via ocloc to native PVC ISA, eliminates ~30 min Level Zero JIT delay (test runs: 30 min → 14 sec)
  • Strip SPIR-V from mkl_dfti_omp_offload.o via clang-offload-bundler to fix zeModuleDynamicLink Level Zero failures
  • Link libmkl_sycl_dft, libsycl, libOpenCL for oneMKL FFT
  • Add GT CRNCH RoboGator (crnch) module entry with Intel oneAPI 2025.1
  • run.py: auto-set LIBOMPTARGET_LEVEL_ZERO_COMMAND_BATCH=256 and SYCL_PI_LEVEL_ZERO_TRACK_INDIRECT_ACCESS_MEMORY=0 (~16% throughput gain)
  • Post-process pyrometheus m_thermochem.f90 for --gpu mp: replace C-macro GPU_ROUTINE with literal !$omp declare target
  • test.py: --binary mpirun support to bypass SLURM srun slot limits on CRNCH
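
The run.py bullet above could be implemented along the following lines. This is only a minimal sketch: the two variable names and values come from this description, while the function name `intel_gpu_env` and the policy of never overriding a value the user already exported are assumptions, not the code in this PR.

```python
import os

# Values taken from the PR description; everything else here is illustrative.
INTEL_GPU_ENV_DEFAULTS = {
    "LIBOMPTARGET_LEVEL_ZERO_COMMAND_BATCH": "256",
    "SYCL_PI_LEVEL_ZERO_TRACK_INDIRECT_ACCESS_MEMORY": "0",
}


def intel_gpu_env(base_env=None):
    """Return a copy of base_env with the Intel GPU defaults filled in.

    Hypothetical helper: variables that are already set are left untouched
    (assumed policy), so a user can still experiment with other values.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in INTEL_GPU_ENV_DEFAULTS.items():
        env.setdefault(name, value)
    return env
```

A launcher would then pass the result to the child process, for example `subprocess.run(cmd, env=intel_gpu_env())`.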

GPU macro layer (src/common/include/)

  • omp_macros.fpp: Intel-specific OMP_PARALLEL_LOOP, OMP_ROUTINE, OMP_MKL_DISPATCH branches for SPIR-V codegen
  • parallel_macros.fpp: GPU_MKL_DISPATCH() macro for oneMKL dispatch
  • shared_parallel_macros.fpp: add USING_INTEL Fypp variable; extend every #:if not MFC_CASE_OPTIMIZATION and USING_AMD guard to (USING_AMD or USING_INTEL), and likewise extend the bare #:if USING_AMD guards around dimension(sys_size) in the CBC modules

Source fixes (Intel SPIR-V constraints)

  • Assumed-shape arrays in GPU routines: Intel SPIR-V cannot propagate array descriptors in device subroutines — replaced with explicit-shape (num_fluids_max, dim(3), etc.) across 20 files
  • VLA private arrays in GPU loops: Intel SPIR-V needs fixed stack frame size at compile time — extended USING_AMD VLA guards to USING_INTEL in m_riemann_solvers, m_variables_conversion, m_bubbles_EE, m_weno, m_cbc, m_compute_cbc, and 13 other files
  • m_fftw.fpp: oneMKL DFTI + !$omp dispatch GPU FFT path for Intel
  • m_compute_levelset.fpp: split single if-else dispatch to fix multi-callee phi-node issue and ifx inliner ICE

Documentation

  • docs/documentation/intel-gpu-max.md: full build, run, and troubleshooting guide for Intel GPU Max

Test plan

  • All 161 1D tests pass on Intel GPU Max 1100 (verified locally on CRNCH dash4)
  • CI passes on existing gfortran / nvfortran / Cray ftn / ifx CPU targets
  • No regression on AMD GPU (USING_AMD guards preserved; USING_INTEL is orthogonal)

Add end-to-end support for building and running MFC on Intel Data Center
GPU Max (Ponte Vecchio) using ifx 2025.0+ with OpenMP target offload to
SPIR-V/SPIR64. Verified on GT CRNCH RoboGator (dash4) with Intel GPU
Max 1100. All 161 1D regression tests pass.

## Compiler and build system
- Recognize IntelLLVM compiler ID throughout CMakeLists.txt (was Intel)
- Add -fiopenmp -fopenmp-targets=spir64 compile/link flags for GPU builds
- Add -fp-model=precise to prevent ifx FP reassociation in SPIR-V kernels
- Add -fpp to global compile flags for Intel preprocessor compatibility
- Link MKL parallel, libmkl_sycl_dft, libsycl, libOpenCL for oneMKL FFT
- Strip SPIR-V from mkl_dfti_omp_offload.o via clang-offload-bundler to
  fix zeModuleDynamicLink Level Zero failures
- Add --intel-aot flag: AOT compilation via ocloc to native PVC ISA,
  eliminates ~30 min Level Zero JIT delay (test runs: 30 min -> 14 sec)
- Add IntelLLVM to no-FFTW-from-source list in dependencies/CMakeLists.txt
- Fix LAPACK PIE link error with ifx on Ubuntu 22.04

## GPU kernel fixes
- omp_macros.fpp: add Intel-specific OMP_PARALLEL_LOOP, END_OMP_PARALLEL_LOOP,
  OMP_ROUTINE, OMP_MKL_DISPATCH branches for SPIR-V codegen
- parallel_macros.fpp: add GPU_MKL_DISPATCH() macro for oneMKL dispatch
- shared_parallel_macros.fpp: add USING_INTEL Fypp variable; extend all
  #:if not MFC_CASE_OPTIMIZATION and USING_AMD guards to include USING_INTEL
  and bare #:if USING_AMD guards for dimension(sys_size) in m_cbc/m_compute_cbc
- m_fftw.fpp: oneMKL DFTI + !$omp dispatch GPU FFT path for Intel
- m_compute_levelset.fpp: split single if-else dispatch to fix multi-callee
  phi-node issue and inliner ICE; add -fno-inline workaround
- m_riemann_solvers.fpp, m_variables_conversion.fpp, m_bubbles_EE.fpp,
  m_weno.fpp, m_sim_helpers.fpp, m_pressure_relaxation.fpp, m_boundary_common,
  m_chemistry.fpp, m_phase_change.fpp, m_bubbles_EL.fpp, m_viscous.fpp,
  m_ibm.fpp, m_hyperelastic.fpp, m_acoustic_src.fpp, m_surface_tension.fpp,
  m_data_output.fpp, m_qbmm.fpp, m_compute_cbc.fpp, m_cbc.fpp, m_ib_patches.fpp:
  explicit array sizes in GPU_ROUTINE arguments (no assumed-shape in SPIR-V)
  and extend VLA guards to USING_INTEL for non-case-optimized GPU builds
- m_helper.fpp: Intel-specific workarounds for SPIR-V codegen

## Toolchain
- Add GT CRNCH RoboGator (crnch) module entry with Intel oneAPI 2025.1
- run.py: Intel GPU detection, set LIBOMPTARGET_LEVEL_ZERO_COMMAND_BATCH=256
  and SYCL_PI_LEVEL_ZERO_TRACK_INDIRECT_ACCESS_MEMORY=0 for ~16% speedup
- run/input.py: post-process pyrometheus m_thermochem.f90 for --gpu mp
  (replace C-macro GPU_ROUTINE with literal !$omp declare target)
- build.py, state.py: --intel-aot flag and ocloc device selection
- test.py: --binary mpirun support to bypass SLURM srun slot limits on CRNCH
- bootstrap/modules.sh: crnch module bootstrap
- templates/include/helpers.mako: Intel MPI I_MPI_FABRICS=shm hint
- modules: crnch entry (Intel oneAPI 2025.1, mpiifx, GPU Max 1100)
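
The pyrometheus post-processing step for `--gpu mp` might look roughly like the sketch below. The exact form of the macro in the generated `m_thermochem.f90` is not shown in this PR description, so the code assumes the call sits alone on a line as `GPU_ROUTINE(...)`. The function name `replace_gpu_routine` is hypothetical.

```python
import re

# Assumed form: the macro call is the only thing on its line, e.g.
#     GPU_ROUTINE(get_temperature)
_GPU_ROUTINE = re.compile(
    r"^(?P<indent>[ \t]*)GPU_ROUTINE\([^)\n]*\)[ \t]*$",
    re.MULTILINE,
)


def replace_gpu_routine(source):
    """Swap each GPU_ROUTINE(...) line for a literal OpenMP directive.

    The original indentation is kept; all other lines pass through unchanged.
    """
    return _GPU_ROUTINE.sub(
        lambda m: m.group("indent") + "!$omp declare target",
        source,
    )
```

Applied to the text of the generated file, this leaves ordinary Fortran statements alone and rewrites only the macro lines.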

## Documentation
- docs/documentation/intel-gpu-max.md: full build, run, troubleshoot guide
@github-actions

Claude Code Review

Head SHA: 6b1d0de

Files changed: 39

  • CMakeLists.txt
  • src/common/include/omp_macros.fpp
  • src/common/include/parallel_macros.fpp
  • src/common/m_mpi_common.fpp
  • src/simulation/m_fftw.fpp
  • src/simulation/m_compute_levelset.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_pressure_relaxation.fpp
  • toolchain/mfc/run/input.py
  • toolchain/mfc/run/run.py

Findings:

Banned integer kind literals in src/simulation/m_fftw.fpp

In the new Intel GPU path of s_apply_azimuthal_filter, two integer-kind literal forms appear that are banned by fortran-conventions.md ("Bare integer kind like 2_wp → use 2.0_wp"):

(0_dp, 0_dp) — used to zero data_fltr_cmplx_gpu entries (appears in both the y==0 ring and in the fourier_rings loop body):

data_fltr_cmplx_gpu(...) = (0_dp, 0_dp)

0_dp is an integer literal of kind dp (= 8), not a real literal. Should be (0._dp, 0._dp).

2_dp — used in the Nyquist frequency computation inside the fourier_rings loop:

Nfq = min(floor(2_dp*real(i, dp)*pi), cmplx_size)

2_dp is an integer literal of kind dp. Should be 2._dp or plain 2.

Both appear inside the #if defined(MFC_GPU) && defined(__INTEL_LLVM_COMPILER) guard blocks. The source linter (toolchain/mfc/lint_source.py, run by ./mfc.sh precheck) enforces the "no bare integer kind" rule and would flag these.
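
For illustration, a check of this kind can be approximated with a single regular expression. This is a heuristic sketch and not the contents of toolchain/mfc/lint_source.py: the function name `find_bare_int_kinds` and the set of kind suffixes (`wp`, `dp`, `sp`) are assumptions, and comments and string literals are not treated specially.

```python
import re

# Matches an integer literal carrying a real-kind suffix, such as 2_dp,
# while skipping real literals like 2.0_wp, 0._dp, or 1.5e-3_wp.
_BARE_INT_KIND = re.compile(
    r"(?<![\w.])"            # not inside an identifier or after a decimal point
    r"(?<![\d.][eEdD][+-])"  # not the signed exponent of a real literal
    r"\d+_(?:wp|dp|sp)\b"    # digits followed directly by the kind suffix
)


def find_bare_int_kinds(line):
    """Return every bare integer-kind literal found in one source line."""
    return _BARE_INT_KIND.findall(line)
```

Run on the two statements quoted above, the sketch reports `0_dp` twice for the complex assignment and `2_dp` once for the `Nfq` line, and reports nothing once they are rewritten as `(0._dp, 0._dp)` and `2._dp`.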
