Weaning off std::complex arithmetic #729
just to isolate this big diff - structural changes to gpu_qcomp are coming
and changed &arr[ind] to arr + ind, for visual clarity
but there's so much boilerplate overlap with cpu_qcomp, I wonder if we should unify the two! We can retain separate cpu_qcomp and gpu_qcomp types (for clarity) through a typedef
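For concreteness, here is a minimal sketch of that unification idea: one plain struct with hand-written arithmetic, shared by both backends via typedefs. The names (base_qcomp) and the double-only precision are illustrative, not this PR's actual definitions.

```cpp
// Minimal sketch (illustrative names, double precision only) of sharing one
// hand-rolled complex type between the CPU and GPU backends via typedefs.
#include <type_traits>

struct base_qcomp {
    double re;
    double im;
};

// arithmetic defined once, inline, so neither backend relies on std::complex
inline base_qcomp operator+(base_qcomp a, base_qcomp b) {
    return {a.re + b.re, a.im + b.im};
}
inline base_qcomp operator*(base_qcomp a, base_qcomp b) {
    return {a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re};
}

// distinct names retained for clarity, but one shared definition
typedef base_qcomp cpu_qcomp;
typedef base_qcomp gpu_qcomp;

// being trivially copyable keeps it safe to memcpy and stage to the GPU
static_assert(std::is_trivially_copyable<cpu_qcomp>::value, "qcomp must stay POD-like");
```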
compiler stack-overflows when OpenMP is enabled - possibly due to thread-private instantiation of this 2D array?
to debug MSVC + OpenMP compilation failure in CI
story of my life bruddah
nice one chatgpt
@JPRichings @otbrown Here's a first draft of the proposed … You'll see most of the …
Thanks Tyson! I have 5(!) hours of meetings tomorrow, so unlikely to get to this then, but will make time later in the week!
Hi Tyson, one of my meetings got cancelled 🥳 Moving away from the standard library entirely is philosophically upsetting, but clearly has practical benefits!
I strongly support this proposal. Structurally it would be nice to have the common interface in core, to then be specialised/overwritten in cpu/gpu. I am still a bit nervous about needing to maintain a set of maths functions, but if we can also contain them in one place in core (relying on the concrete implementations of arithmetic primitives elsewhere if needed), that would be good. If it's helpful I can volunteer @eessmann as …
Yea it's outrageous but at least it won't keep us up at night about esoteric performance pitfalls! (We'll trade that for alignment nightmares)
Made common definitions in the previous commit. We can still discretionarily fall back to the …
That'd be super helpful! I preserved the method (used in relation to that comment) in the GPU backend, so I'm hoping things just work on HIP. I don't have an AMD machine handy to test on, so it would be terrific if Erich can give it a whirl! I can also re-set up the (paid) GPU GitHub Actions runners if billing has been sorted out. Note the new …
Heya @TysonRayJones,
Wew that'd be brilliant!
It's a good question! Previously, it was because using thrust complex inside CUDA kernels has some pitfalls and (iirc) it didn't help solve the operator-overload issues already seen with cuComplex (where HIP/rocThrust defines arithmetic operator overloads but CUDA/Thrust doesn't). It was my original intention to use thrust complex for QuEST 4.0, but I was forced to switch to cuComplex (although I've forgotten precisely why). With this PR however, there is a more compelling reason not to use thrust complex: we must already define a custom complex type for the CPU backend to avoid compiler-specific performance pitfalls. Since we have to define its arithmetic overloads anyway, we may as well re-use this (base) type for the GPU backend too, for better symmetry. We should then never have issues related to operator overloads on particular platforms like HIP (but I do live in fear of alignment issues, as alluded to!)
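To illustrate that symmetry argument, here is a hedged sketch (hypothetical names, not this PR's actual code) of how the shared base type's operators can be made visible to both host code and CUDA/HIP kernels, independent of whatever overloads cuComplex or hipComplex do or don't provide; the alignas reflects the alignment worry voiced above rather than a verified requirement.

```cpp
// Sketch only: __host__ __device__ overloads let one type serve CPU code and
// GPU kernels alike. The 16-byte alignment mirrors cuDoubleComplex, to ease
// fears about mixing this struct with vendor complex types (an assumption,
// not something this PR has confirmed is necessary).
struct alignas(2 * sizeof(double)) gpu_qcomp_sketch {
    double re, im;
};

__host__ __device__ inline gpu_qcomp_sketch operator+(gpu_qcomp_sketch a, gpu_qcomp_sketch b) {
    return {a.re + b.re, a.im + b.im};
}
__host__ __device__ inline gpu_qcomp_sketch operator*(gpu_qcomp_sketch a, gpu_qcomp_sketch b) {
    return {a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re};
}
```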
so it's the first thing a developer sees, reducing the likelihood they inefficiently use qcomp overloads
Since Erich's identified the relevant …
Sadly, we don't have a budget we could use for that, but one of the three of us (me, Erich, @JPRichings) can be the CI runner 😉
The failure never triggered in CI nor on my Windows laptop, but did trigger on my Windows desktop. Joy!
I meant to compare the CPU performance of this PR (#729) against Erich's #711, the main branch (v4.2.0), and the final v3 version (v3.7), to check the new … Here, for 6 different precisions and statevector sizes, I plot the runtime ratio of the aforementioned branches to the v3 branch. Good news:
(The single-Z operator seems to blow in v4 too!) Even if this were MSVC specific, it remains worrying to me since most Windows QuEST users (foul beasts) are likely to actually be QuESTlink users, where lower qubit counts will be typical. These results suggest migrating QuESTlink to v4 (in a hypothetical future where I had the time, eheh) would incur a huge slowdown! 😨
@TysonRayJones could you provide plots of absolute performance, as opposed to the ratios? I want to inspect the pattern of performance within each version at different numbers of qubits, and between versions, to see if the landscape has changed. I agree this is a bit odd. I will try to find time to reproduce when I get a minute on a data centre GPU from a Linux environment, to try to isolate this.
Thanks James, definitely worth digging into with a bit of profiling!
I did just want to also highlight that this is still very small -- 16MiB is the size of the L3 cache shared between 4 cores on ARCHER2. On my more recent AMD home computer the L3 is 32MiB and the L2 is 8MiB, so at these sizes I would always expect the CPU to be faster than the GPU when accounting for offload time, so the CPU performance is probably more important for your hypothetical QuESTLink users. I appreciate of course that this is currently GPU vs GPU! We'll do some more detailed profiling of our own.
@JPRichings I've uploaded the data to Google Drive here, as well as the Mathematica (MMA) notebook for plotting, in case helpful. They include the plots you mention:
The "+ -gate" just means initialising the
Beware, that was an incredibly stupid error on my part (relabelling the "ccccU" gate to "c*U" for brevity, leading to missing data treated as '1' by Mathematica in the division, grr).
That'd be super useful, thanks very much! I'm especially curious if AMD GPUs show the same regression.
That's very true, and suggests we should update my naively-motivated "autodeployer thresholds" here, which currently enable GPU-acceleration for …
Thank you for this. This is already reassuring, as the performance scaling for v4 and v3 is very similar for most gates, and there seems to be a common threshold where v4 performance starts to win out over v3, so this looks a bit more global, which is something we can hunt down with profiling. The two exceptions are the + state, which I will look at now, and the Z gate, which I have already delved into in detail today to look at code changes in this gate application. @otbrown had some thoughts around optimisations that didn't make it from v3 to v4 for the pauliZ, around here: QuEST/QuEST/src/QuEST_common.c, line 257 in d4f75f7
V3: QuEST/QuEST/src/GPU/QuEST_gpu.cu, line 434 in d4f75f7
V4.2:
- QuEST/quest/src/api/initialisations.cpp, line 59 in 9d7618d
- QuEST/quest/src/core/localiser.cpp, line 729 in 9d7618d
- QuEST/quest/src/core/accelerator.cpp, line 1131 in 9d7618d
- QuEST/quest/src/gpu/gpu_subroutines.cpp, line 1866 in 9d7618d
- QuEST/quest/src/gpu/gpu_thrust.cuh, line 1051 in 9d7618d

Given that the performance is comparable here, and the kernel is very simple (it just sets values in the statevector), this probably points to some of the changes to GPU kernels in v4. I assume there is a common pattern for gate application across the gates, and something there is adding additional overhead that wasn't present in v3? I need to check, but I guess many of the gates end up here: QuEST/quest/src/gpu/gpu_kernels.cuh, line 517 in 9d7618d
Really dumb guess based on eyeballing the code: QuEST/quest/src/core/bitwise.hpp, line 206 in 9d7618d is common in most of the statevector kernels and is called by each thread. Hmmm, a for loop in a GPU kernel... QuEST/quest/src/core/bitwise.hpp, line 164 in 9d7618d
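For readers unfamiliar with that helper, here is a simplified illustration (not QuEST's exact code) of the kind of bit-insertion loop being referred to: each thread re-inserts bits at the control-qubit positions to map its flat index onto a control-satisfying amplitude index, so the loop length scales with the number of controls.

```cpp
// Simplified illustration of a bit-insertion loop of the style discussed above;
// names and signatures are assumptions, not QuEST's actual definitions.
typedef long long int qindex;

inline qindex insertBit(qindex number, int bitIndex, int bitValue) {
    qindex low  = number & ((qindex(1) << bitIndex) - 1);   // bits below bitIndex
    qindex high = (number >> bitIndex) << (bitIndex + 1);    // bits at/above, shifted up
    return high | (qindex(bitValue) << bitIndex) | low;
}

inline qindex insertBits(qindex number, const int* sortedIndices, int numIndices, int bitValue) {
    // indices must be ascending so earlier insertions don't shift later targets
    for (int i = 0; i < numIndices; i++)
        number = insertBit(number, sortedIndices[i], bitValue);
    return number;
}
```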
I also notice that this masking logic is also present in the CPU-side changes, as noted in the examples here: #638, with … The performance discrepancy in #720 also seems present in the same range of qubits (< 20)... Profiling required!
hmmm... QuEST/quest/src/gpu/gpu_subroutines.cpp, line 700 in 9d7618d
Ok so after a fair amount of digging, I think we are moving data from host to device just before every kernel call that has ctrls. Until this line, QuEST/quest/src/gpu/gpu_subroutines.cpp, line 708 in 9d7618d, I think the ctrls are host-side data, but then getPtr is called on them in the call to the kernel function: QuEST/quest/src/gpu/gpu_subroutines.cpp, line 713 in 9d7618d. Comparing this to v3.7, there the ctrl mask was captured by value in the kernel launch as a …, QuEST/QuEST/src/GPU/QuEST_gpu.cu, line 875 in d4f75f7, and lived in constant memory, which will probably have less of a performance impact than relying on Thrust to move the data on creation of the vector.
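A hedged illustration of the suspected pattern (function and kernel names are hypothetical): constructing a thrust::device_vector from host-side control qubits inside the launch path implies a cudaMalloc plus a host-to-device copy on every call, and a cudaFree when the vector is destroyed.

```cpp
// Illustration only: per-call device_vector construction stages the ctrls
// to the GPU on every invocation, even though it's only a handful of ints.
#include <thrust/device_vector.h>
#include <vector>

__global__ void someKernel(int* ctrls, int numCtrls /*, amps... */) { /* ... */ }

void launchWithCtrls(const std::vector<int>& ctrls) {
    // allocates device memory and copies the ctrls, every single call
    thrust::device_vector<int> devCtrls(ctrls.begin(), ctrls.end());

    someKernel<<<128, 128>>>(thrust::raw_pointer_cast(devCtrls.data()), (int) ctrls.size());
}   // devCtrls is destroyed (cudaFree) when the function returns
```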
report1.nsys-rep.txt
cuda malloc: ~230 microsecs. Run on a Grace Hopper node, so mem copy from host to device will be really quick and much faster than in the above results. qft output:
Apologies for delay - everything's on fire, I'll likely not have time to return until next weekend.
Absolutely right - a common Thrust paradigm but a gross pitfall in my opinion. I hate the implicit staging! Eager to change this if this is such an opportunity.
Ahh nicely done! The logic for handling controls differs between v3 and v4; the latter is asymptotically superior. It'd be amazing if we can perform that "enumerate only control-satisfying amps" logic using a control bitmask, possibly using these primitives?? Otherwise, we can easily maintain a persistent negligibly-sized device workspace, like we already do here.
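A minimal sketch of the persistent-workspace alternative mentioned above (the names and the MAX_CTRLS cap are assumptions): the device buffer is allocated once and reused, so the per-call cost reduces to one small cudaMemcpy.

```cpp
// Sketch only: a once-allocated device buffer for control qubits, reused
// across launches instead of per-call cudaMalloc/cudaFree.
#include <cuda_runtime.h>
#include <vector>

constexpr int MAX_CTRLS = 64;   // illustrative cap on the number of controls

int* getPersistentCtrlBuffer() {
    static int* devBuf = nullptr;
    if (devBuf == nullptr)
        cudaMalloc(&devBuf, MAX_CTRLS * sizeof(int));   // first call only
    return devBuf;
}

void copyCtrlsToWorkspace(const std::vector<int>& ctrls) {
    cudaMemcpy(getPersistentCtrlBuffer(), ctrls.data(),
               ctrls.size() * sizeof(int), cudaMemcpyHostToDevice);
    // kernels then receive the persistent pointer; no allocation per call
}
```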
Noting though that we still see a performance regression when there are no control qubits - could it be that a superfluous zero-size allocation is happening in that scenario? 😅 Certainly there's no algorithmic overhead because of the …
I think I have a fix where we move the ctrls over to the device into constant memory using …
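For reference, a hedged sketch of the constant-memory approach being described (symbol and function names are illustrative, not the actual patch in #739): the controls are written into __constant__ memory with cudaMemcpyToSymbol before the launch.

```cpp
// Sketch only: control qubits staged into __constant__ memory, which is
// cached on-chip and read-only from kernels; updated via cudaMemcpyToSymbol.
#include <cuda_runtime.h>
#include <vector>

constexpr int MAX_CTRLS = 64;          // illustrative cap
__constant__ int d_ctrls[MAX_CTRLS];   // visible to all kernels in this translation unit

void setCtrls(const std::vector<int>& ctrls) {
    // a tiny host-to-device copy; no allocation occurs
    cudaMemcpyToSymbol(d_ctrls, ctrls.data(), ctrls.size() * sizeof(int));
}
```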
I caution this with the fact that I could have made a horrible error in how I am doing the data transfer and messed up the correctness, but based on the attached profile, the performance is much improved:
The call to memcpy-to-symbol only takes 200ns! The kernel now takes ~2-3 microseconds (probably as accessing constant, read-only memory is much quicker). There is now no allocation! Profile: report_symbol.nsys-rep.txt. Great caution until I confirm correctness on Monday! Given there is another large cudaMalloc call (~250 microsec) still in the profile, I think there is more performance to be had in this. Hardware target: same Grace Hopper node as before. qft output:
Same issue in: QuEST/quest/src/gpu/gpu_subroutines.cpp, line 289 in 9d7618d. 22 instances of …
Further improvement if 1 additional kernel is updated. Same qft executable, same Grace Hopper hardware. Profile:
Note to check gpu_thrust.cuh as well: QuEST/quest/src/gpu/gpu_thrust.cuh, line 1015 in 9d7618d and QuEST/quest/src/gpu/gpu_thrust.cuh, line 787 in 9d7618d. Causing …
Wew super neat to have this addressed, nice one! 🎉 For this PR, I'll proceed under the assumption there's no CUDA performance impact of …
The pain of these manual cross-platform performance regression checks begs for some automated benchmarker - but it's of course extremely unsexy work. Maybe a good unitaryHack bounty, or a chore for someone interested in experimenting with AI sloppers ¯\(ツ)/¯
We've been pondering this a bit -- it might be something we can set up through EIDF. I've also been considering reviewing all tests to confirm they're MPI-safe and then running MPI tests through EIDF as well. All very much post-v4.3 activity though I think!
This is the pull request that corrected the performance issues on NVIDIA (not tested on AMD): #739
Note to self: can probably verify …





Experiment with fully custom cpu_qcomp and gpu_qcomp types to avoid:
- qcomp = std::complex performance pitfalls
- cu_qcomp correctness pitfalls