The GPU is not always faster
Updated the post with revised AVX (multithreaded) benchmarks as there was a small error in their measurements. Thanks to rnrn for bringing this to my attention.
My experience over the years has shown me that even people who consider themselves professionals are not immune to the subversive allure of a silver bullet, in this article embodied by a coprocessor that supposedly solves all their problems at once.
While graphics processing units have deservedly found astounding success in many general-purpose parallel programming applications in recent years, one must not become fixated on them and try to brute-force complicated solutions to problems that are better served by other approaches.
In this article it will be shown how a discrete GPU approach loses out to a vectorized CPU version in an almost embarrassingly parallel task. In fact, on this computer the GPU version has no hope of catching up to the CPU approach due to the bandwidth limits of PCI-Express itself. Nevertheless, the takeaway is that the results could be completely reversed on different hardware, and that a proper analysis of all components is required to choose the right approach.
The Dot Product
The implementation of the vector dot product
\[x \cdot y = \sum_{i=1}^{n} x_i y_i\]
for two real-valued vectors \(x,y\) is a great example of a seemingly simple mathematical operation that is very nuanced in its details.
At first glance, everything appears perfect.
The vectors can be perfectly partitioned into at most \(\frac{n}{2}\) independent tasks that can be reduced in around \(\log{n}\) reduction steps. The number of additions is therefore a truncated geometric series, \(\sum_{k=1}^{\log_2{n}} \frac{n}{2^k} = n - 1\) for \(n\) a power of two.
As each task within a reduction level is independent, there are no interdependencies to slow one down; only the changes between reduction levels (of which there are only \(\log{n}\)) serve as synchronization points. The task scales well with the number of processors. If memory access were free, a multiplication and an addition each cost only 1 cycle, and \(\frac{n}{2}\) processors were available, it would take only about \(\log{n}\) cycles. Theoretically, through multiprocessing the algorithmic complexity in the number of steps can be reduced to \(O(\log{n})\).
In the real world the situation is less clear.
From a numerical perspective, simply summing the products may lead to rounding errors that are undesirable for one's application. Moreover, even if \(\frac{n}{2}\) processors were available, in reality memory accesses are not free but in fact very expensive; the memory transfers themselves can take up the majority of the time.
Throughout all tests numerical precision will not be a concern of ours.