Make cuda version check dynamic #202

Open
casparvl wants to merge 10 commits into EESSI:main from casparvl:make_cuda_version_check_dynamic

@casparvl
Contributor

@casparvl casparvl commented Apr 14, 2026

Fixes #189

First step towards fixing #201

Unblocks #200, which blocks EESSI/software-layer#1462, which blocks EESSI/software-layer#1453 ...

This PR can be tested locally by modifying create_lmodsitepackage.py with something like:

                if not cudaVersion or cudaVersion == "" then
                    -- Hardcode for local testing
                    -- local eessi_prefix = os.getenv("EESSI_PREFIX")
                    local eessi_prefix = pathJoin('/home', 'casparl', 'EESSI', 'software-layer-scripts')
                    local script = pathJoin(eessi_prefix, 'scripts', 'gpu_support', 'nvidia', 'get_cuda_driver_version.sh')

Comment thread create_lmodsitepackage.py Outdated
Comment thread scripts/gpu_support/nvidia/get_cuda_driver_version.sh
Caspar van Leeuwen added 4 commits April 16, 2026 16:52
…ng the get_cuda_driver_script twice, as it's costly. We simply adapt the script to always return a 0 exit, and then do any handling of the case where EESSI_CUDA_DRIVER_VERSION is NOT set by the end in the calling Lmod hook
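
The "run the script at most once, then let the caller handle an unset value" idea from this commit message can be sketched as follows. This is a minimal Python illustration, not the Lua hook from this PR. The function name `resolve_driver_version` and the injected `run_script` callable are hypothetical, and the sketch assumes the script prints an empty string when it cannot determine the version.

```python
from typing import Callable, Mapping, Optional

ENV_VAR = "EESSI_CUDA_DRIVER_VERSION"


def resolve_driver_version(
    env: Mapping[str, str], run_script: Callable[[], str]
) -> Optional[str]:
    """Return the driver CUDA version, calling the costly script at most once.

    `run_script` is a hypothetical stand-in for invoking
    get_cuda_driver_version.sh and capturing what it reports.
    """
    version = env.get(ENV_VAR, "")
    if version:
        # Already set in the environment: skip the script entirely.
        return version
    # Assumption: the script always exits 0, so an empty result (rather than
    # a non-zero exit code) is the only failure signal the caller sees.
    version = run_script().strip()
    return version or None
```

Returning `None` leaves the decision about what to do with a missing version (warn, fail, or continue) to the caller.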
@bedroge
Contributor

bedroge commented Apr 17, 2026

This now correctly sets EESSI_CUDA_DRIVER_VERSION=13.1 on my RTX Pro 6000.

Forcing it to e.g. 12.5 and loading something that needs CUDA 12.9.1 prints a nice error:

Your driver CUDA version is 12.5  but the module you want to load requires CUDA 12.9.1.

Forcing it to 12.9 works fine, as expected, since the check ignores the patch version.
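
The behaviour described here (a 12.9 driver is accepted for a CUDA 12.9.1 module, while an older driver is rejected) corresponds to comparing only the major and minor components. A minimal Python sketch of such a comparison, under the assumption that versions are dot-separated integers; the helper names `major_minor` and `driver_supports` are hypothetical and this is not the Lua code from the PR:

```python
from typing import Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string, ignoring any patch part."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def driver_supports(driver_version: str, toolkit_version: str) -> bool:
    """True if the driver's major.minor is at least the toolkit's major.minor."""
    return major_minor(driver_version) >= major_minor(toolkit_version)
```

Comparing tuples of integers rather than strings avoids ordering mistakes such as "12.10" sorting before "12.9".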

To test how failures of the script are handled, I modified it to hardcode an empty string for the driver version. I then get the following, as expected:

Lmod Warning:  Environment variable EESSI_CUDA_DRIVER_VERSION not found. Cannot ensure that driver version is new enough for CUDA toolkit version: '12.9.1'. This module
will still be loaded, but may not function as expected. Export EESSI_CUDA_DRIVER_VERSION_SUPPRESS_WARNING=1 

Setting that env var allowed me to load the module.
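
The fallback behaviour tested above can be summarised in a small sketch: when the driver version is unknown, warn unless the user has opted out. This is an illustrative Python version, not the Lmod hook itself. The function name `missing_version_warning` is hypothetical, and the message text is an abbreviated paraphrase of the warning quoted above.

```python
from typing import Mapping, Optional

VERSION_VAR = "EESSI_CUDA_DRIVER_VERSION"
SUPPRESS_VAR = "EESSI_CUDA_DRIVER_VERSION_SUPPRESS_WARNING"


def missing_version_warning(
    env: Mapping[str, str], toolkit_version: str
) -> Optional[str]:
    """Return a warning message if the driver version is unknown, else None."""
    if env.get(VERSION_VAR):
        # Driver version is known, so the regular version check applies.
        return None
    if env.get(SUPPRESS_VAR) == "1":
        # Assumption: a value of "1" is what silences the warning.
        return None
    return (
        f"Environment variable {VERSION_VAR} not found. Cannot ensure that "
        f"driver version is new enough for CUDA toolkit version: "
        f"'{toolkit_version}'."
    )
```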

Comment thread create_lmodsitepackage.py Outdated
@bedroge
Contributor

bedroge commented Apr 17, 2026

bot: build repo:eessi.io-2023.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-mc-aws for:arch=x86_64/amd/zen2

@eessi-bot-aws

eessi-bot-aws bot commented Apr 17, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2023.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2026.04/pr_202/148717

date | job status | comment
Apr 17 15:17:46 UTC 2026 | submitted | job id 148717 awaits release by job manager
Apr 17 15:17:58 UTC 2026 | released | job awaits launch by Slurm scheduler
Apr 17 15:24:11 UTC 2026 | running | job 148717 is running
Apr 17 15:28:27 UTC 2026 | finished | 😁 SUCCESS
Details
✅ job output file slurm-148717.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-17764394540.tar.zst
size: 0 MiB (4992 bytes)
entries: 2
modules under 2023.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2023.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2023.06/software/linux/x86_64/amd/zen2
2023.06/scripts/gpu_support/nvidia/get_cuda_driver_version.sh
.lmod/SitePackage.lua
Apr 17 15:28:27 UTC 2026 | test result | 😁 SUCCESS
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos %scale=1_node /aeb2d9df @BotBuildTests:x86-64-zen2+default
P: perf: 436.921 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos %scale=1_node /04ff9ece @BotBuildTests:x86-64-zen2+default
P: perf: 444.711 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /775175bf @BotBuildTests:x86-64-zen2+default
P: latency: 3.04 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /52707c40 @BotBuildTests:x86-64-zen2+default
P: latency: 2.86 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node %device_type=cpu /b1aacda9 @BotBuildTests:x86-64-zen2+default
P: latency: 7.69 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node %device_type=cpu /c6bad193 @BotBuildTests:x86-64-zen2+default
P: latency: 5.69 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /15cad6c4 @BotBuildTests:x86-64-zen2+default
P: latency: 0.77 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /6672deda @BotBuildTests:x86-64-zen2+default
P: latency: 0.84 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %scale=1_node /2a9a47b1 @BotBuildTests:x86-64-zen2+default
P: bandwidth: 6475.1 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %scale=1_node /1b24ab8e @BotBuildTests:x86-64-zen2+default
P: bandwidth: 6511.21 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-148717.out
✅ no message matching ERROR:
✅ no message matching \[\s*FAILED\s*\].*Ran .* test case
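
The build checks listed in the report above amount to scanning the job output for patterns that must be absent and patterns that must be present. A rough Python sketch of that idea follows. It is not the bot's actual implementation: the function names `check_job_output` and `job_succeeded` are hypothetical, and the pattern lists are simply transcribed from the report.

```python
import re
from typing import Dict

# Literal strings that must NOT appear in the job output.
MUST_BE_ABSENT = ["FATAL:", "ERROR:", "FAILED:", "required modules missing:"]

# Regular expressions that MUST match somewhere in the job output.
MUST_BE_PRESENT = [r"No missing installations", r"\.tar\..* created!"]


def check_job_output(text: str) -> Dict[str, bool]:
    """Return one pass/fail entry per check, keyed by a readable label."""
    results: Dict[str, bool] = {}
    for literal in MUST_BE_ABSENT:
        results[f"no message matching {literal}"] = literal not in text
    for pattern in MUST_BE_PRESENT:
        found = re.search(pattern, text) is not None
        results[f"found message matching {pattern}"] = found
    return results


def job_succeeded(text: str) -> bool:
    """A job counts as successful only if every individual check passes."""
    return all(check_job_output(text).values())
```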

@eessi-bot-aws

eessi-bot-aws bot commented Apr 17, 2026

New job on instance eessi-bot-mc-aws for repository eessi.io-2025.06-software
Building on: amd-zen2
Building for: x86_64/amd/zen2
Job dir: /project/def-users/SHARED/jobs/2026.04/pr_202/148718

date | job status | comment
Apr 17 15:17:50 UTC 2026 | submitted | job id 148718 awaits release by job manager
Apr 17 15:17:56 UTC 2026 | released | job awaits launch by Slurm scheduler
Apr 17 15:24:08 UTC 2026 | running | job 148718 is running
Apr 17 15:26:20 UTC 2026 | finished | 😁 SUCCESS
Details
✅ job output file slurm-148718.out
✅ no message matching FATAL:
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen2-17764394440.tar.zst
size: 0 MiB (4998 bytes)
entries: 2
modules under 2025.06/software/linux/x86_64/amd/zen2/modules/all
no module files in tarball
software under 2025.06/software/linux/x86_64/amd/zen2/software
no software packages in tarball
reprod directories under 2025.06/software/linux/x86_64/amd/zen2/reprod
no reprod directories in tarball
other under 2025.06/software/linux/x86_64/amd/zen2
2025.06/scripts/gpu_support/nvidia/get_cuda_driver_version.sh
.lmod/SitePackage.lua
Apr 17 15:26:20 UTC 2026 | test result | 😁 SUCCESS
ReFrame Summary
[ OK ] (1/5) EESSI_LAMMPS_lj %device_type=cpu %module_name=LAMMPS/22Jul2025-foss-2024a-kokkos %scale=1_node /ade8cad7 @BotBuildTests:x86-64-zen2+default
P: perf: 440.368 timesteps/s (r:0, l:None, u:None)
[ OK ] (2/5) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:x86-64-zen2+default
P: latency: 1.26 us (r:0, l:None, u:None)
[ OK ] (3/5) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:x86-64-zen2+default
P: latency: 2.02 us (r:0, l:None, u:None)
[ OK ] (4/5) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:x86-64-zen2+default
P: latency: 0.18 us (r:0, l:None, u:None)
[ OK ] (5/5) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:x86-64-zen2+default
P: bandwidth: 7741.89 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 5/5 test case(s) from 5 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-148718.out
✅ no message matching ERROR:
✅ no message matching \[\s*FAILED\s*\].*Ran .* test case

Development

Successfully merging this pull request may close these issues.

Get CUDA driver version dynamically

2 participants