Set gpu tpb #736

Draft
otbrown wants to merge 2 commits into devel from set_gpu_tpb

Conversation

@otbrown
Collaborator

@otbrown otbrown commented Apr 24, 2026

Creating a facility for users to set the threads per block at runtime, for tuning the GPU implementation. NOTE: this only applies to kernels that are not handled by Thrust, which does its own thing. Resolves #735.

I considered and rejected the idea of creating a symmetric interface for the CPU for users who don't know OMP_NUM_THREADS or omp_set_num_threads() exist, but that's much riskier as the point of truth is external (in the OpenMP runtime).

TODO:

  • Should gpu_getNumThreadsPerBlock return a qindex? Probably.
  • Create a new home for user facing API, as environment doesn't really make sense.
  • Add a compile time default value -- that way expert maintainers can compile a tuned default into a library which is used on a system.
  • Query seemingly unused branch at
    if constexpr (NumTargs != -1) {
  • Add TPB to QuEST GPU environment reporting.
  • Add tests for new interface.
  • @JPRichings To check if this is really worthwhile, but please wait a week to tell me if it isn't.

@otbrown
Collaborator Author

otbrown commented Apr 24, 2026

Rudimentary testing done with:

#include <cstdio>
#include "quest.h"

int main (void)
{
  const int NQUBITS = 24;
  const int TPB = 32;


  initQuESTEnv();
  reportQuESTEnv();

  std::printf("Initial number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());

  setQuESTGpuThreadsPerBlock(TPB);
  std::printf("New number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());

  Qureg qureg = createForcedQureg(NQUBITS);

  std::printf("Initialising Qureg.\n");
  initPlusState(qureg);
  reportQureg(qureg);

  std::printf("Applying Quantum Fourier Transform.\n");
  applyFullQuantumFourierTransform(qureg, false);
  reportQureg(qureg);

  destroyQureg(qureg);
  finalizeQuESTEnv();

  return 0;
}

@otbrown otbrown self-assigned this Apr 24, 2026
@JPRichings
Contributor

Why would gpu_getNumThreadsPerBlock be a qindex? This is not a quantum quantity. uint should be fine (I am sure there is a recommendation from the CUDA API we can match).

@TysonRayJones
Member

Is there an advantage to users having to set this as a runtime hyperparameter? My (mostly undeveloped) belief is we can use occupancy tools (alluded to here) to automate this. I definitely shy away from giving users a greater onus to optimise for their settings (like other prolific software packages), which the v4 overhaul was supposed to avoid (via e.g. the autodeployer).

Note too that the kernels so far are very primitive - each thread handles the updating of the minimum possible number of amplitudes (often just one!). I quite like that because it's very readable and simple (great for an open-source scientific project) but is an obvious site for optimisation.

> Why would gpu_getNumThreadsPerBlock be a qindex? This is not a quantum quantity. uint should be fine (I am sure there is a recommendation from the CUDA API we can match).

It's true that it will never be anywhere near as big as the quantities qindex is expected to store (like the number of basis states), but I have already used it in places where I thought an int might be insufficient. It's inoffensive either way, as uint or qindex, imo.

@JPRichings
Contributor

Hi Tyson,

I just noticed the value is fixed to 128, and have a feeling that it is large.

I just wanted a handle so I could write a benchmark so we can easily automate performance tuning ourselves.

I have not played with the occupancy tools but I should take a proper look as this might solve this automatically.

My other concern is that there are differences between NVIDIA and AMD on optimal sizes due to hardware differences, so we might not be able to rely on the occupancy tuning in all cases unless it becomes available on all platforms.

#pragma omp parallel shared(n)
#pragma omp single
n = omp_get_num_threads();
n = omp_get_max_threads();
Member

Isn't this functionally wrong? We wish to return the number of available threads as set by the user, which is the default adopted by our OpenMP pragmas. If you call omp_get_max_threads() outside a parallel region, won't it just return 1?

Contributor

From an EPCC colleague:

omp_set_num_threads() sets the value of the nthreads internal control variable, but omp_get_num_threads() does not return this value (omp_get_max_threads() does).

Standards aren't immune to issues.

Collaborator Author

Yeah, as James indicates, omp_get_num_threads outside a parallel region will return 1, but omp_get_max_threads returns OMP_NUM_THREADS or whatever was last set using omp_set_num_threads. Parallel regions without a num_threads clause then use that value.

Member

Oh nice, my brain hadn't even noticed the change of `num` to `max` 😅


qindex numThreads = qureg.numAmpsPerNode / powerOf2(qubits.size());
qindex numBlocks = getNumBlocks(numThreads);
const int NUM_THREADS_PER_BLOCK = gpu_getNumThreadsPerBlock();
Member

If we opt for this, why is NUM_THREADS_PER_BLOCK capitalised like a constant? It's set at runtime.

Contributor

I agree, the capitalisation here is bad.

Collaborator Author

It's const in scope 😉 apologies, accidentally following my own style guide there rather than the QuEST one. I'll %s/NUM_THREADS_PER_BLOCK/numThreadsPerBlock/g it.

@TysonRayJones
Member

> I just noticed the value is fixed to 128, and have a feeling that it is large.

I guess it's very GPU specific! I think 128 was motivated by thinking of CC=3, which has a max active blocks per SM of 16, and a max active threads per SM of 2048. So using 128 threads per block perfectly maximally occupies the SMs (when there are enough amplitudes to admit more than 16 blocks per SM, of course!)

For illustration, the next smallest size is 96 (it must be a multiple of 32, else threads within a warp will be idle), which yields a number of active threads of 16 * 96 = 1536, which wastes 2048 - 1536 = 512 threads per SM!

Of course, newer GPUs support more active blocks per SM (even when the max active threads per SM is unchanged). E.g. CC=8 supports up to 32 active blocks per SM, so we could shrink to 64 threads per block while achieving the same occupancy - but I don't have a great intuition for the effect when we're memory-bandwidth bound.

Certainly seems prudent to consult a CUDA runtime API, if that doesn't hurt our AMD compatibility!

@otbrown
Collaborator Author

otbrown commented Apr 27, 2026

Apologies, probably won't get to look at this again this week, but very happy to set this value programmatically if it can be done!

As it's architecture dependent, we definitely do need a way to adjust it, ideally both at runtime and compile time: at compile time, so kindly HPC support teams can compile and maintain a tuned version, and at runtime, so they can scan through values without having to recompile in between. I'll have a chat with James about approaches later this week!

I 100% agree that we don't really want unknowing users messing around with this. I think something like an architecture.h or perftune.h or similar might be the best solution: a set of functionality that we explicitly document as being for users who know what they are doing, to tune the performance of the library for a specific architecture. It might be that this is the only value in there for the time being, but for slingshot-11 reasons we need to add a parameter capping the total in-flight data, and this would be a good spot for that too.

@TysonRayJones
Member

Fair enough - you've convinced me! Being able to adjust this at runtime is of course extremely helpful during development of a user-friendlier adaptive system anyhow.

I like the sound of perftune.h - it could also go into debug.h in the interim, until there is more performance-tuning specific functionality.

}

void setQuESTGpuThreadsPerBlock(const int NEW_TPB) {
// just rely on the internal function to throw an error if there's no GPU support compiled
Member

TODO: validate this is a factor of 32 (and is positive, etc etc)

Member

Doc to user: HIP warpsize is 64!

@otbrown
Collaborator Author

otbrown commented May 3, 2026

Should validate that TPB is a multiple of 32!
