
Weaning off std::complex arithmetic #729

Draft
TysonRayJones wants to merge 20 commits into devel from custom-complex-types

Conversation

@TysonRayJones
Member

@TysonRayJones TysonRayJones commented Apr 17, 2026

Experiment with fully custom cpu_qcomp and gpu_qcomp types to avoid...

  • compiler-specific flags to overcome qcomp = std::complex performance pitfalls
  • HIP-specific functions to overcome cu_qcomp correctness pitfalls

@TysonRayJones TysonRayJones marked this pull request as draft April 17, 2026 19:54
just to isolate this big diff - structural changes to gpu_qcomp are coming
and changed &arr[ind] to arr + ind, for visual clarity
but there's so much boilerplate overlap with cpu_qcomp, I wonder if we should unify the two! We can retain separate cpu_qcomp and gpu_qcomp types (for clarity) through a typedef
compiler stack-overflows when OpenMP is enabled - possibly due to thread-private instantiation of this 2D array?
to debug MSVC + OpenMP compilation failure in CI
story of my life bruddah
nice one chatgpt
@TysonRayJones
Member Author

@JPRichings @otbrown Here's a first draft of the proposed qcomp changes (ignore that it provokes a stack-overflow bug in multithreaded MSVC, grr). One only really needs to look at cpu_types.hpp and gpu_types.cuh, and at an example of how cpu_qcomp is used within cpu_subroutines.cpp, like here.

You'll see most of the cpu_types.hpp and gpu_types.cuh boilerplate is identical, merely expressing the arithmetic operators of (c|g)pu_qcomp via their qreal components. I propose further simplification by defining a new complex.hpp (or similarly named, perhaps basetypes.hpp?) in core/ (rather than in cpu/ or gpu/) which defines all the operator overloads of a templated type, and (c|g)pu_qcomp gets defined as a typedef of that "ancestor type" in cpu/ or gpu/. Then adding a new operator overload to all backends is trivial (just add one function to complex.hpp), as is adding a backend-specific function (define it exclusively within cpu_types.hpp or gpu_types.hpp). Noting this comment here in case you object to the unification (so I can see what to roll back if I find time to implement beforehand - which is alas unlikely!)

@otbrown
Collaborator

otbrown commented Apr 20, 2026

Thanks Tyson!

I have 5(!) hours of meetings tomorrow, so unlikely to get to this then, but will make time later in the week!

@otbrown
Collaborator

otbrown commented Apr 21, 2026

Hi Tyson, one of my meetings got cancelled 🥳

Moving away from the standard library entirely is philosophically upsetting, but clearly has practical benefits!

I propose further simplification by defining a new complex.hpp

I strongly support this proposal. Structurally it would be nice to have the common interface in core to then be specialised/overwritten in cpu/gpu. I am still a bit nervous about needing to maintain a set of maths functions but if we can also contain them in one place in core (relying on the concrete implementations of arithmetic primitives elsewhere if needed), that would be good.

 * TODO:
 * OLD UNPACKERS
 * which I am hesitant to switch to the CPU-style until I better
 * understand why the explicit gpu_qcomp instantiation is necessary
 * (iirc static HIP structs have a different alignment than qcomp?!)

If it's helpful I can volunteer @eessmann as tribute, a second pair of eyes and hands, to dig into this? We now have a task in QATCH (phase 2 of the Quantum Software Lab) which is broadly "help maintain QuEST (and other scalable emulators)", to which Erich is assigned.

@TysonRayJones
Member Author

Moving away from the standard library entirely is philosophically upsetting, but clearly has practical benefits!

Yea it's outrageous but at least it won't keep us up at night about esoteric performance pitfalls! (We'll trade that for alignment nightmares)

I strongly support this proposal. Structurally it would be nice to have the common interface in core to then be specialised/overwritten in cpu/gpu. I am still a bit nervous about needing to maintain a set of maths functions but if we can also contain them in one place in core (relying on the concrete implementations of arithmetic primitives elsewhere if needed), that would be good.

Made common definitions in the previous commit. We can still, at our discretion, fall back to the std::complex operators within the CPU custom maths, paying the NaN performance penalties, when convenient (as pow in cpu_qcomp.hpp currently does).

If it's helpful I can volunteer @eessmann as tribute, a second pair of eyes and hands, to dig into this? We now have a task in QATCH (phase 2 of the Quantum Software Lab) which is broadly "help maintain QuEST (and other scalable emulators)", to which Erich is assigned.

That'd be super helpful! I preserved the method (used in relation to that comment) in the GPU backend, so I'm hoping things just work on HIP. I don't have an AMD machine handy to test on, so it would be terrific if Erich can give it a whirl! I can also re-setup the (paid) GPU Github Action runners if billing has been sorted out.

Note the new cpu_qcomp type might introduce a similar issue (misalignment of static arrays); currently, the qcomp[2] and qcomp[4] within a DiagMatr1 and DiagMatr2 are being converted/decayed to cpu_qcomp pointers. An explicit copy might be necessary, like in the GPU case, but I've so far been unable to trigger the alignment problem.

@TysonRayJones
Member Author

You can see the custom backend-specific maths/overloads here and here - so far, it's just pow!

@eessmann
Contributor

Heya @TysonRayJones,
Happy to help! I have access to HIP on Dirac, so I can give it a test on MI300s. Out of curiosity, why not go with thrust complex for our GPU complex data type?

@TysonRayJones
Member Author

Happy to help! I have access to HIP on Dirac, so I can give it a test on MI300s.

Wew that'd be brilliant!

Out of curiosity, why not go with thrust complex for our gpu complex data type?

It's a good question! Previously, it was because using thrust complex inside CUDA kernels has some pitfalls, and (iirc) it didn't help solve the operator overload issues already seen with cuComplex (where HIP/rocThrust defines arithmetic operator overloads but CUDA/Thrust doesn't). It was my original intention to use thrust complex for QuEST 4.0, but I was forced to switch to cuComplex (although I've forgotten precisely why).

With this PR however, there is a more compelling reason not to use thrust-complex; we must already define a custom complex type for the CPU backend to avoid compiler-specific performance pitfalls. Since we have to define its arithmetic overloads, we may as well re-use this (base) type for the GPU backend too, for better symmetry. We should now never have any issues related to operator overloads on particular platforms like HIP (but I do live in fear of alignment issues, as alluded to!)

so it's the first thing a developer sees, reducing likelihood they inefficiently use qcomp overloads
@otbrown
Collaborator

otbrown commented Apr 23, 2026

We can still, at our discretion, fall back to the std::complex operators within the CPU custom maths, paying the NaN performance penalties, when convenient

Since Erich's identified the relevant fortran-rules flag for all the main compilers, there should be no penalty! We can always just add the flag to the specific relevant source file(s), since it won't be needed elsewhere. Of course, if we can't identify the compiler, we can just omit it.
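For concreteness, wiring that per-file flag up in CMake could look roughly like the following (the exact flag spellings per compiler are assumptions to verify against Erich's findings, and cpu_subroutines.cpp is just an illustrative target):

```cmake
# Apply the relaxed ("fortran-rules") complex-arithmetic flag only to the
# source files that still use std::complex operators, leaving the rest of
# the build untouched.
if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    set_source_files_properties(cpu/cpu_subroutines.cpp
        PROPERTIES COMPILE_OPTIONS "-fcx-fortran-rules")
elseif (CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    set_source_files_properties(cpu/cpu_subroutines.cpp
        PROPERTIES COMPILE_OPTIONS "-fcx-fortran-rules")
endif()
# unknown compiler (e.g. MSVC): omit the flag, as suggested above
```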

I can also re-setup the (paid) GPU Github Action runners if billing has been sorted out.

Sadly, we don't have a budget we could use for that, but one of the three of us (Me, Erich, @JPRichings) can be the CI runner 😉

The failure never triggered in CI nor on my Windows laptop, but did trigger on my Windows desktop. Joy!
@TysonRayJones
Member Author

I meant to compare the CPU performance of this PR (729) against Erich's #711, the main branch (v4.2.0), and the final v3 version (v3.7), to check the new cpu_qcomp doesn't have any performance pitfalls, using the scripts here. I opted to first compare the GPU performance (measuring the impact of gpu_qcomp), using MSVC on my Windows machine with a RTX 4080 Super. The results are pretty baffling, and I hope are specific to MSVC!

Here, for 6 different precisions and statevector sizes, I plot the runtime ratio of the aforementioned branches to the v3 branch. Good news: gpu_qcomp doesn't seem to introduce any problems! Bad news: v4 GPU is up to 30x slower than v3 GPU, even at 20 qubits (a 16 MiB state)!

[image: runtime-ratio plots across the 6 precision/statevector-size combinations]

(The single-Z operator seems to blow in v4 too!)

Even if this were MSVC specific, it remains worrying to me since most Windows QuEST users (foul beasts) are likely to actually be QuESTlink users, where lower qubit counts will be typical. These results suggest migrating QuESTlink to v4 (in a hypothetical future where I had the time, eheh) would incur a huge slowdown! 😨

@JPRichings
Contributor

@TysonRayJones could you provide plots of the absolute performance, as opposed to the ratios? I want to inspect the pattern of performance within each version at different numbers of qubits, and between versions, to see if the landscape has changed.

I agree this is a bit odd. I will try to find time to reproduce when I get a minute on a data-centre GPU in a Linux environment, to try to isolate this.

@otbrown
Collaborator

otbrown commented Apr 24, 2026

Thanks James, definitely worth digging into with a bit of profiling!

even at 20 qubits (a 16 MiB state)!

I did just want to also highlight that this is still very small -- 16MiB is the size of the L3 cache shared between 4 cores on ARCHER2. On my more recent AMD home computer the L3 is 32MiB and the L2 is 8MiB, so at these sizes I would always expect the CPU to be faster than the GPU when accounting for offload time, so the CPU performance is probably more important for your hypothetical QuESTLink users.

I appreciate of course that this is currently GPU vs GPU! We'll do some more detailed profiling of our own.

@TysonRayJones
Member Author

TysonRayJones commented Apr 24, 2026

@JPRichings I've uploaded the data to Google Drive here, as well as the Mathematica (MMA) notebook for plotting, in case helpful. They include the plots you mention:

[two images: absolute performance plots per version and qubit count]

The "+ -gate" just means initialising the |+> state - very stupidly worded 😅

(The single-Z operator seems to blow in v4 too!)

Beware that was an incredibly stupid error on my part (relabelling the "ccccU" gate to "c*U" for brevity, leading to missing data treated as '1' by Mathematica in the division, grr).

I agree this is a bit odd. I will try to find time to reproduce when I get a min on a data centre GPU from a linux environment to try to isolate this.

That'd be super useful, thanks very much! I'm especially curious if AMD GPUs show the same regression.

I did just want to also highlight that this is still very small -- 16MiB is the size of the L3 cache shared between 4 cores on ARCHER2. On my more recent AMD home computer the L3 is 32MiB and the L2 is 8MiB, so at these sizes I would always expect the CPU to be faster than the GPU when accounting for offload time, so the CPU performance is probably more important for your hypothetical QuESTLink users.

That's very true, and suggests we should update my naively-motivated "autodeployer thresholds" here, which currently enable GPU-acceleration for 12+ qubits!

@JPRichings
Contributor

JPRichings commented Apr 24, 2026

Thank you for this. This is already reassuring, as the performance scaling of v4 and v3 is very similar for most gates, and there seems to be a common threshold where v4 performance starts to win out over v3; so this looks a bit more global, which is something we can hunt down with profiling. The two exceptions are the + state, which I will look at now, and the Z gate, whose code changes I have already delved into in detail today. @otbrown had some thoughts around optimisations that didn't make it from v3 to v4 for the pauliZ, around here:

void statevec_pauliZ(Qureg qureg, int targetQubit)

@JPRichings
Contributor

JPRichings commented Apr 24, 2026

+ state:

V3:

void statevec_initPlusState(Qureg qureg)

V4.2:

void initPlusState(Qureg qureg) {

void localiser_statevec_initUniformState(Qureg qureg, qcomp amp) {

void accel_statevec_initUniformState_sub(Qureg qureg, qcomp amp) {

void gpu_statevec_initUniformState_sub(Qureg qureg, qcomp amp) {

void thrust_statevec_initUniformState(Qureg qureg, cu_qcomp amp) {

Given that the performance is comparable here, and that the kernel is very simple (it just sets values in the statevector), this probably points to some of the changes to GPU kernels in v4. I assume there is a common pattern for gate application across the gates, and something there is adding additional overhead that wasn't present in v3?

I need to check but I guess many of the gates end up here:

__global__ void kernel_statevec_anyCtrlAnyTargDiagMatr_sub(

@JPRichings
Contributor

JPRichings commented Apr 24, 2026

Really dumb guess based on eyeballing the code is that:

INLINE qindex insertBitsWithMaskedValues(qindex number, int* bitInds, int numBits, qindex mask) {

is common in most of the statevector kernels and is called by each thread.

hmmm for loop in GPU kernel...

INLINE qindex insertBits(qindex number, int* bitIndices, int numIndices, int bitValue) {

@JPRichings
Contributor

I also notice that this masking logic is present in the CPU-side changes, as noted in the examples here: #638

with insertBitsWithMaskedValues in a parallel region in the v4 CPU side code.

The performance discrepancy in #720 also seems present in the same range of qubits < 20...

Profiling required!

@JPRichings
Contributor

hmmm...

@JPRichings
Contributor

JPRichings commented Apr 24, 2026

Ok so after a fair amount of digging I think we are moving data from host to device just before every kernel call that has ctrls.

Until this line,

devints deviceCtrls = util_getSorted(ctrls);

I think ctrls are host-side data, but on the return of util_getSorted they are placed in a devints type, which is a thrust device_vector; this causes a host-to-device copy of the data just before the call to the kernel function.

using devints = thrust::device_vector<int>;

int* getPtr(devints& qubits) {

    return thrust::raw_pointer_cast(qubits.data());
}

getPtr is called in the call to the kernel function:

getPtr(deviceCtrls), ctrls.size(), ctrlStateMask, getPtr(deviceTargs), targs.size(),

Comparing this to v3.7: the ctrl mask was captured by value in the kernel launch as a long long int, so it was transferred to the device in the kernel call itself (I think):

void statevec_multiControlledTwoQubitUnitary(Qureg qureg, long long int ctrlMask, int q1, int q2, ComplexMatrix4 u)

and lived in constant memory, which probably has less of a performance impact than relying on thrust to move the data on creation of the vector.
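The contrast between the two launch patterns can be sketched as follows (all names here are hypothetical illustrations, not QuEST's actual code):

```cuda
#include <thrust/device_vector.h>
#include <vector>

// v3 pattern: the control mask travels by value in the kernel argument list,
// which CUDA passes through constant memory automatically - no allocation,
// no explicit copy.
__global__ void kernelByValue(double2* amps, long long ctrlMask, long long numAmps) {
    long long i = blockIdx.x * (long long) blockDim.x + threadIdx.x;
    if (i < numAmps && (i & ctrlMask) == ctrlMask)
        amps[i].x = -amps[i].x; // act only on control-satisfying amplitudes
}

// v4 pattern as diagnosed above: constructing a thrust::device_vector from
// host data performs a cudaMalloc plus a host-to-device copy before every
// single gate launch.
void launchWithStaging(double2* amps, std::vector<int>& ctrls) {
    thrust::device_vector<int> deviceCtrls = ctrls; // implicit alloc + copy!
    int* d = thrust::raw_pointer_cast(deviceCtrls.data());
    // kernel<<<blocks, threads>>>(amps, d, (int) ctrls.size(), ...);
    (void) d;
}
```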

@JPRichings
Contributor

JPRichings commented Apr 24, 2026

report1.nsys-rep.txt
nsight systems profile of a qft circuit quest V4:

[image: Nsight Systems profile timeline]

cudaMalloc: ~230 μs
cudaMemcpyAsync: ~8 μs
kernel call: ~9 μs

Run on a Grace Hopper node, so memory copies from host to device will be really quick, and much faster than in the results above.

qft output:

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 1 with probability 0.5
Qubit 4 measured in state 1 with probability 0.5
Qubit 5 measured in state 1 with probability 0.5
Qubit 6 measured in state 1 with probability 0.5
Qubit 7 measured in state 0 with probability 0.5
Qubit 8 measured in state 1 with probability 0.5
Qubit 9 measured in state 1 with probability 0.5
Qubit 10 measured in state 1 with probability 0.5
Qubit 11 measured in state 1 with probability 0.5
Qubit 12 measured in state 1 with probability 0.5
Qubit 13 measured in state 1 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 1 with probability 0.5
Qubit 16 measured in state 1 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 1 with probability 0.5
Qubit 19 measured in state 1 with probability 0.5

Final state:
|11111110111111111011>
QFT run time: 0.0691252s
Total run time: 0.386635s
Generating '/tmp/nsys-report-50e2.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:

@TysonRayJones
Member Author

Apologies for delay - everything's on fire, I'll likely not have time to return until next weekend.
Your analysis looks right to me!

I think Ctrls are host side data but on the return of util_getSorted they are placed in a devints type which is a thrust device_vector this causes a host to device copy of data just before the call to the kernel function.

Absolutely right - a common Thrust paradigm, but a gross pitfall in my opinion. I hate the implicit staging! Eager to change this, given this is such an opportunity.

cudaMalloc: ~230 μs
cudaMemcpyAsync: ~8 μs
kernel call: ~9 μs

Ahh nicely done! The logic for handling controls differs between v3 and v4; the latter is asymptotically superior.
In v3, we enumerate every local amplitude and skip those failing the control conditions (permits the ctrls to be passed as a simple bitmask).
In v4, we enumerate strictly those amplitudes which satisfy the control condition. This leads to a factor 1/2^num_controls fewer iterations (a big asymptotic saving!) but so far requires the controls are kept in a sorted list (for bitwise insertion). You've pointed out re-allocating the GPU memory to stage that list is extremely non-negligible, eep!

It'd be amazing if we can perform that "enumerate only control-satisfying amps" logic using a control bitmask, possibly using these primitives?? Otherwise, we can easily maintain a persistent negligibly-sized device workspace, like we already do here.

@TysonRayJones
Member Author

Noting though that we still see a performance regression when there are no control qubits - could it be that a superfluous zero-size allocation is happening in that scenario? 😅 Certainly there's no algorithmic overhead because of the NumCtrls templating (I thiiiiiink)

@JPRichings
Contributor

I think I have a fix where we move the ctrls over to the device into constant memory using cudaMemcpyToSymbol. Just profiling now. Looks fast, but I could be wrong.
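A hedged sketch of that fix (the names and the MAX_CTRLS bound are hypothetical): a __constant__ array is read-only from kernels and cached, and uploading into it avoids the per-gate cudaMalloc entirely.

```cuda
#include <cuda_runtime.h>

#define MAX_CTRLS 64
__constant__ int d_ctrls[MAX_CTRLS]; // persistent device-side control list

__global__ void kernelUsingConstCtrls(double2* amps, int numCtrls, long long numIts);

void applyGate(double2* amps, const int* hostCtrls, int numCtrls, long long numIts) {
    // upload the sorted controls into constant memory before each launch;
    // no allocation occurs, only a tiny copy
    cudaMemcpyToSymbol(d_ctrls, hostCtrls, numCtrls * sizeof(int));
    // kernelUsingConstCtrls<<<blocks, threads>>>(amps, numCtrls, numIts);
}
```

One caveat worth checking: cudaMemcpyToSymbol writes to a single global symbol, so concurrent kernel launches in different streams would race on it.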

@JPRichings
Contributor

I caution this with: I could have made a horrible error in how I am doing the data transfer and messed up correctness, but based on the attached profile the performance is much improved:

[image: Nsight Systems profile timeline after the fix]

The call to cudaMemcpyToSymbol takes only ~200 ns!

The kernel now takes ~2-3 μs (probably because accessing constant, read-only memory is much quicker).

There is now no allocation!

Profile: report_symbol.nsys-rep.txt

Treat with great caution until I confirm correctness on Monday!

Given there is another large cudaMalloc call (~250 μs) still in the profile, I think there is more performance to be had here.

Hardware target: the same Grace Hopper node as before.

qft output:

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 0 with probability 0.5
Qubit 1 measured in state 0 with probability 0.5
Qubit 2 measured in state 0 with probability 0.5
Qubit 3 measured in state 1 with probability 0.5
Qubit 4 measured in state 1 with probability 0.5
Qubit 5 measured in state 0 with probability 0.5
Qubit 6 measured in state 0 with probability 0.5
Qubit 7 measured in state 1 with probability 0.5
Qubit 8 measured in state 0 with probability 0.5
Qubit 9 measured in state 1 with probability 0.5
Qubit 10 measured in state 1 with probability 0.5
Qubit 11 measured in state 0 with probability 0.5
Qubit 12 measured in state 1 with probability 0.5
Qubit 13 measured in state 0 with probability 0.5
Qubit 14 measured in state 0 with probability 0.5
Qubit 15 measured in state 1 with probability 0.5
Qubit 16 measured in state 0 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 0 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5

Final state:
|00011001011010010000>
QFT run time: 0.00877581s
Total run time: 0.318082s
Generating '/tmp/nsys-report-e8a0.qdstrm'
[1/1] [========================100%] report1.nsys-rep
Generated:

@JPRichings
Contributor

Same issue in:

void gpu_statevec_anyCtrlOneTargDenseMatr_subA(Qureg qureg, vector<int> ctrls, vector<int> ctrlStates, int targ, CompMatr1 matr) {

There are 22 instances of devints in gpu_subroutines.cpp; my guess is that each one is causing the same issue.

@JPRichings
Contributor

JPRichings commented Apr 25, 2026

Further improvement if one additional kernel is updated: kernel_statevec_anyCtrlOneTargDenseMatr_subA

Same QFT executable, same Grace Hopper hardware.

Profile:
report_symbol2.nsys-rep.txt

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 0 with probability 0.5
Qubit 4 measured in state 0 with probability 0.5
Qubit 5 measured in state 1 with probability 0.5
Qubit 6 measured in state 0 with probability 0.5
Qubit 7 measured in state 1 with probability 0.5
Qubit 8 measured in state 0 with probability 0.5
Qubit 9 measured in state 0 with probability 0.5
Qubit 10 measured in state 0 with probability 0.5
Qubit 11 measured in state 0 with probability 0.5
Qubit 12 measured in state 0 with probability 0.5
Qubit 13 measured in state 0 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 0 with probability 0.5
Qubit 16 measured in state 1 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 1 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5

Final state:
|11100101000000101010>
QFT run time: 0.00287859s
Total run time: 0.318814s

@JPRichings
Contributor

Note to check gpu_thrust.cuh as

devints devQubits = qubits;

qreal thrust_statevec_calcProbOfMultiQubitOutcome_sub(Qureg qureg, vector<int> qubits, vector<int> outcomes) {

This causes applyQubitMeasurementAndGetProb to be impacted.

@TysonRayJones
Member Author

Wew super neat to have this addressed, nice one! 🎉

For this PR, I'll proceed under the assumption there's no CUDA performance impact of gpu_qcomp. Still need to...

  • check this is true with HIP also (I'd bet fine!)
  • check cpu_qcomp has no performance impact (i.e. do all my tests again for CPU, with and without multithreading).

The pain of these manual cross-platform performance regression checks begs for some automated benchmarker - but it's of course extremely unsexy work. Maybe a good unitaryHack bounty, or a chore for someone interested in experimenting with AI sloppers ¯\_(ツ)_/¯

@otbrown
Collaborator

otbrown commented Apr 30, 2026

begs for some automated benchmarker

We've been pondering this a bit -- it might be something we can set up through EIDF. I've also been considering reviewing all tests to confirm they're MPI-safe and then running MPI tests through EIDF as well.

All very much post-v4.3 activity though I think!

@JPRichings
Contributor

This is the pull request that corrected the performance issues on NVIDIA (not tested on AMD): #739

@TysonRayJones
Member Author

Note to self: can probably verify cpu_qcomp performance using Meisam's scripts in #720

