To establish a baseline, it is always a good idea to test the most straightforward implementation first:
#include <cstddef>

template<class T>
T dotproduct_naive(size_t n, T* a, T* b) {
    T sum = T(0);
    for (size_t i = 0; i < n; i++) {
        sum += (*a) * (*b);
        a++;
        b++;
    }
    return sum;
}
This leads to the following running times for different values of \(n\):
The result is almost a perfect straight line, as evidenced by a Pearson correlation coefficient of around 1, indicating a near-perfect positive linear relationship. This is to be expected, since the algorithm traverses the arrays in a single loop, performing \(O(n)\) operations.
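The benchmark harness used to produce these measurements is not shown; a minimal sketch of how such timings could be collected with std::chrono (the function name and fill values here are illustrative, not the harness actually used):

#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical timing harness; the one used for the plot above is not
// shown. Fills two arrays, times a single call to dotproduct_naive and
// returns the elapsed time in nanoseconds.
template<class T>
double time_dotproduct(size_t n) {
    std::vector<T> a(n, T(1)), b(n, T(1));
    auto t0 = std::chrono::steady_clock::now();
    volatile T sum = dotproduct_naive<T>(n, a.data(), b.data());
    auto t1 = std::chrono::steady_clock::now();
    (void)sum; // volatile keeps the call from being optimized away
    return std::chrono::duration<double, std::nano>(t1 - t0).count();
}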
Normalizing these running times with respect to throughput and the overall number of array elements leads to the following (\(2n\) is used for the throughput calculation, since there are two arrays \(x\), \(y\), each of size \(n\), that must be loaded into memory):
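In code, this normalization could look as follows (a sketch; the exact constants and units used for the plot are assumptions):

#include <cstddef>

// Sketch of the throughput normalization described above: both input
// arrays are read once, so 2n * sizeof(T) bytes move through memory.
template<class T>
double throughput_gib_per_s(size_t n, double seconds) {
    double bytes = 2.0 * static_cast<double>(n) * sizeof(T);
    return bytes / seconds / (1024.0 * 1024.0 * 1024.0); // GiB/s
}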
Interesting behaviour can be observed when the graph is cropped to smaller sizes of \(2n\):
This could be explained by smaller arrays fitting entirely into the L1 cache. It is therefore worthwhile to study what happens when dotproduct_naive is run with the smaller float datatype.
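As a rough sanity check, assuming a typical 32 KiB L1 data cache: both double arrays fit while \(2n \cdot 8 \le 32768\) bytes, i.e. \(n \le 2048\), whereas float halves the element size and doubles that bound to \(n \le 4096\). Actual cache sizes vary by microarchitecture, so these thresholds are indicative only.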
On GCC 14.2 (with the -O3 and -mavx2 flags), the double instantiation of dotproduct_naive is successfully autovectorized; the compiled assembly is:
test rdi, rdi //n == 0?
je .L9
lea rax, [rdi-1]
cmp rax, 2
jbe .L10 //fewer than four elements: skip the SIMD loop
mov r8, rdi
xor eax, eax
vxorpd xmm0, xmm0, xmm0 //zeros the accumulator register
xor ecx, ecx
shr r8, 2 //divides the array length n by 4 to get the SIMD iteration count
.L4: //main loop: processes a and b in batches of four doubles (32 bytes from each array)
vmovupd ymm1, YMMWORD PTR [rdx+rax] //unaligned move into an AVX 256-bit SIMD register
vmulpd ymm1, ymm1, YMMWORD PTR [rsi+rax] //multiplies four element pairs at once
add rcx, 1 //increments the loop counter
add rax, 32 //advances the byte offset by 32 bytes (corresponding to 4 double elements)
vaddsd xmm0, xmm0, xmm1 //adds lane 0 of the products to the accumulator
vunpckhpd xmm2, xmm1, xmm1 //moves lane 1 into the low half of xmm2
vextractf128 xmm1, ymm1, 0x1 //extracts lanes 2 and 3 (the upper 128 bits)
vaddsd xmm0, xmm0, xmm2 //adds lane 1
vaddsd xmm0, xmm0, xmm1 //adds lane 2
vunpckhpd xmm1, xmm1, xmm1 //moves lane 3 into the low half of xmm1
vaddsd xmm0, xmm0, xmm1 //adds lane 3
cmp r8, rcx
jne .L4 //repeats the loop until the counter reaches floor(n/4)
test dil, 3 //checks n mod 4
je .L21 //no remainder: done
mov rcx, rdi
and rcx, -4 //rounds n down to a multiple of 4 (elements already processed)
lea rax, [0+rcx*8] //byte offset of the first unprocessed element
lea r8, [rdx+rax] //tail pointer into b
add rax, rsi //tail pointer into a
vzeroupper //clears upper ymm state to avoid AVX-SSE transition penalties
.L3: //handles the remaining n mod 4 elements
sub rdi, rcx
cmp rdi, 1
je .L7 //exactly one element left: scalar tail
vmovupd xmm1, XMMWORD PTR [rdx+rcx*8] //loads two remaining elements
vmulpd xmm1, xmm1, XMMWORD PTR [rsi+rcx*8]
vaddsd xmm0, xmm0, xmm1
vunpckhpd xmm1, xmm1, xmm1
vaddsd xmm0, xmm1, xmm0
test dil, 1 //one more element after the 128-bit step?
je .L1
and rdi, -2
sal rdi, 3
add r8, rdi //advances the tail pointers past the two elements just handled
add rax, rdi
.L7: //scalar tail: multiply-add of the final element
vmovsd xmm1, QWORD PTR [rax]
vmulsd xmm1, xmm1, QWORD PTR [r8]
vaddsd xmm0, xmm0, xmm1
ret
.L21:
vzeroupper
.L1:
ret
.L9: //n == 0: returns 0.0
vxorpd xmm0, xmm0, xmm0
ret
.L10: //n between 1 and 3: set up pointers and go straight to the remainder code
mov r8, rdx
mov rax, rsi
vxorpd xmm0, xmm0, xmm0
xor ecx, ecx
jmp .L3
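Note that the horizontal additions (vaddsd, vunpckhpd, vextractf128) remain inside the loop: without -ffast-math, GCC must preserve the source order of the floating-point additions, so only the multiplications are vectorized and the four products of each batch are folded into the scalar accumulator one lane at a time. A scalar C++ rendering of this computation order (an illustration, not compiler output):

#include <cstddef>

// Illustrates the addition order of the vectorized main loop above:
// four products are formed at once (vmulpd), then summed serially so
// the result matches the scalar source exactly.
double dotproduct_main_loop_order(size_t n, const double* a, const double* b) {
    double sum = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        double p0 = a[i]     * b[i];      // ymm1 lane 0
        double p1 = a[i + 1] * b[i + 1];  // ymm1 lane 1
        double p2 = a[i + 2] * b[i + 2];  // ymm1 lane 2
        double p3 = a[i + 3] * b[i + 3];  // ymm1 lane 3
        sum += p0; sum += p1; sum += p2; sum += p3; // lane-by-lane vaddsd
    }
    for (; i < n; ++i) sum += a[i] * b[i]; // remainder path (.L3/.L7)
    return sum;
}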
The Microsoft Visual C++ compiler (v19.40, Visual Studio 17.10) translates the code into similarly vectorized instructions. While this is a great start, the performance can be increased by adding more register accumulators, as seen in the next section.