The following plot superimposes the results of a new benchmark run for all versions:

{ "data": [ { "line": { "color": "gray", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "naive (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[4.425678749976214e-05,0.00010617810937219474,0.00024059250000781115,0.0005145170000032522,0.0010458917187861517,0.0020969464476148624,0.004233155624751817,0.008539371651750116,0.017431180802824593,0.03492680711005735,0.06942584821315124,0.1401280433870478,0.2882145535685205,0.6912450892871546,1.3541646428555916,2.5628056817257665,5.280455999891274,10.1988578117016,20.40013823630836,44.914452941156924,82.42278888873342,160.60519999882672] }, { "line": { "color": "gray", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "naive (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[4.551478754004654e-05,0.0001080568906309054,0.0002423484439239047,0.0005229521428451075,0.0010398504688055254,0.002142794999963371,0.0043235653244200435,0.008699351788702781,0.017560001964558442,0.0353881359254739,0.07034419643007693,0.13932926785076102,0.28151795900882925,0.5738746428895476,1.2181573213768258,2.3806739286685894,4.766196644186799,9.545095999880383,18.823276470785085,38.07438947382922,75.53237777513762,159.55677999882028] }, { "line": { "color": "blue", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "AVX (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], 
"y":[1.99497770528988e-05,2.3976836786772427e-05,3.333219895979966e-05,4.947664285802083e-05,8.807629964228309e-05,0.00018274625331114315,0.0006811666294847132,0.0013478530589460145,0.0026914067238994763,0.007842918526616163,0.01677284602268908,0.03369911873637785,0.06836390625072195,0.2225894687580876,0.5498012355223125,1.1899069642822724,2.5950587122210753,5.129495000001043,10.969423437018122,22.445333333841216,44.73162352737478,89.7446714308379] }, { "line": { "color": "blue", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "AVX (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[1.6113281142098564e-05,1.930254534645314e-05,2.545465039370001e-05,4.084304966669401e-05,7.065965178688721e-05,0.00012884469643592767,0.00025875907141848333,0.0006910574777196286,0.0013840453571382179,0.0028092447058446976,0.008149606026433633,0.017567783042671516,0.036227867337029235,0.07383452009110313,0.22627221290857738,0.5519520535537075,1.2309801561968925,2.763269477791291,5.495357142665723,11.244923437516263,23.400806666662294,48.30549375037663] }, { "line": { "color": "green", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "AVX (square) (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], 
"y":[1.6880580244103258e-05,1.909641445624475e-05,2.7170318484597834e-05,4.709800213947314e-05,8.871787103956995e-05,0.0001696535760881342,0.00034063540583850543,0.0007050689732100832,0.0013930794641834967,0.0028207168656978537,0.008761513118471747,0.017358469810424364,0.03606444215396976,0.07198020089163037,0.23267353570970176,0.6639329463983553,1.529777232203092,3.830175847555037,6.215475555250628,14.870712500331658,27.257807693856122,60.56751818290319] }, { "line": { "color": "green", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "AVX (square) (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[1.8833705525300943e-05,1.9968853392437364e-05,2.4598795873159725e-05,4.145098225161807e-05,7.149806919447396e-05,0.00012551114171202633,0.00023706317858471137,0.00047058713629440384,0.0009902518793388157,0.001972870014238961,0.003966643973204295,0.010993643750225601,0.022028006249456666,0.04465527029492835,0.0909347031301877,0.31785368028656175,0.7985571428434923,1.9170823057579132,3.419525698269587,6.629455357012505,13.78973749940217,31.589203998446465] }, { "line": { "color": "#17becf", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "AVX (multithreaded) (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], 
"y":[2.9842175045300422e-05,3.46182482252459e-05,4.582847057573804e-05,5.940219973588242e-05,9.915727081237577e-05,0.00018481258703383728,0.0006778057353602585,0.00133081590719292,0.0026594905666835777,0.008018495604719445,0.11226955244044734,0.16462127398940027,0.34318176554506974,0.6337099082216859,1.1616117948801534,2.2750207791467765,2.6609911196811202,4.65116209161827,8.84630632899183,17.34701499954099,34.05023333228504,66.34875454685904] }, { "line": { "color": "#17becf", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "AVX (multithreaded) (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[2.4752737596200453e-05,2.9096064466073634e-05,3.838361902682763e-05,5.349267661671572e-05,8.17533013882223e-05,0.00014126103525200016,0.00028855685918022396,0.0007302829517056267,0.001440667774710434,0.0028004323160847875,0.11031728573766461,0.16231820203566766,0.3208781249872009,0.5999995598653135,1.1938381443061,2.4771598591157695,3.4975875000236556,3.5966554402682562,5.13411826922003,9.099410811432865,17.623176923594794,34.37024000158999] }, { "line": { "color": "red", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "CUDA (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[2.4434156378600824,2.780104365079365,2.73213937007874,2.768647509578544,2.693645381526104,2.7936759689922477,2.7968353174603173,2.839294466403162,2.80734578313253,2.852645416666667,3.124393991416309,3.4364904040404043,4.574319607843138,4.860166666666667,4.416587837837838,5.45330081300813,8.58126582278481,13.950958333333332,24.477949999999996,46.44868666666667,90.3032875,179.03805] }, { "line": { "color": "red", "shape": "linear", "width": 3 }, 
"marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "CUDA (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[1.9736842105263157,2.220517981072555,2.305371661237785,2.4896347826086953,2.535403157894737,2.653988235294118,2.7364905511811024,2.7392082677165353,2.856065714285714,2.965857021276596,3.135290178571428,3.3847721393034824,4.506410897435897,4.484786754966888,4.805499310344828,4.526466,5.4311,8.522645569620254,13.9450875,26.25911481481482,48.54294,91.2033125] }, { "line": { "color": "#9467bd", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "CUDA (square) (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[2.0337301587301586,2.3721505415162456,2.2305374213836475,2.2268716981132077,2.2539651006711408,2.5065308724832214,2.230206709265176,2.2815196141479097,2.276638834951456,2.3416168316831683,2.375689491525424,2.644920610687023,3.9367004950495055,3.5748664893617024,3.998171856287425,3.954112716763006,5.31875,8.39064875,13.617239999999999,24.578664285714286,45.61686666666667,92.0157875] }, { "line": { "color": "#9467bd", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "CUDA (square) (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], 
"y":[2.0448825503355703,2.253276751592357,2.363061688311688,2.578149290780142,2.506293006993007,2.5945688259109314,2.6953516728624534,2.6734120300751876,2.798325,2.8746704453441296,2.9062642553191487,3.327699528301887,3.8938863157894734,3.797780434782608,3.99015197740113,4.120310112359551,4.169154140127389,5.208818181818182,8.362764634146341,13.778024489795918,24.487157142857143,46.250186666666664] } ], "layout": { "title": "All Results", "xaxis":{ "title":"n" }, "yaxis":{ "title":"Running time [ms]" } }, "frames": [] }

There is a clear winner: for general large input vectors, the multithreaded AVX version achieves the highest bandwidth and the lowest overall running time. For most applications, however, the single-threaded AVX version will suffice, as it takes only 34% longer than the multithreaded version. Its optimization potential is also not yet exhausted.
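
To make the bandwidth claim concrete, here is a minimal sketch that converts a running time into an effective memory bandwidth. It assumes that the y-axis values are milliseconds (as the axis label states), that a dot product reads each of its two input vectors exactly once, and that 1 GB means 10^9 bytes. The function name `effective_bandwidth_gbs` is made up for this illustration, and the times are rounded from the plot data at the largest input size.

```python
# Effective memory bandwidth implied by a measured running time.
# Assumption: a dot product of two length-n vectors reads 2 * n elements once.

def effective_bandwidth_gbs(n, bytes_per_element, time_ms, vectors=2):
    """Return the effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = n * bytes_per_element * vectors
    return total_bytes / (time_ms * 1e-3) / 1e9

N = 134_217_728  # largest input size in the plot

# (bytes per element, running time in ms), rounded from the plot data at n = N.
measured = {
    "AVX (double)": (8, 89.74),
    "AVX (float)": (4, 48.31),
    "AVX (multithreaded) (double)": (8, 66.35),
    "AVX (multithreaded) (float)": (4, 34.37),
}

for name, (size, ms) in measured.items():
    print(f"{name}: {effective_bandwidth_gbs(N, size, ms):.1f} GB/s")
```

Under these assumptions the multithreaded variants come out at a little over 30 GB/s and the single-threaded ones in the low twenties, which is one way to read the gap between the two.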

Every version is severely bottlenecked by memory bandwidth. The CUDA version suffers the most: its PCI-Express 3.0 x16 interface limits it to a practical device-to-host bandwidth of 11.8 GB/s.
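
A rough back-of-the-envelope check shows how much of the CUDA running time the transfer alone can explain. The sketch below assumes that both input vectors are copied over PCI-Express once, that the 11.8 GB/s figure quoted above applies to that copy, and that 1 GB means 10^9 bytes. The helper `transfer_time_ms` is a hypothetical name, and the measured times are rounded from the plot data.

```python
# Estimated PCIe copy time for the input vectors of a dot product.
# Assumption: the practical 11.8 GB/s figure from the text applies to the copy.

PCIE_PRACTICAL_GBS = 11.8

def transfer_time_ms(n, bytes_per_element, bandwidth_gbs, vectors=2):
    """Return the time in ms needed to move `vectors` arrays of n elements."""
    total_bytes = n * bytes_per_element * vectors
    return total_bytes / (bandwidth_gbs * 1e9) * 1e3

N = 134_217_728  # largest input size in the plot

# (bytes per element, measured CUDA time in ms), rounded from the plot data.
cuda_measured = {"float": (4, 91.2), "double": (8, 179.0)}

for name, (size, measured_ms) in cuda_measured.items():
    estimate = transfer_time_ms(N, size, PCIE_PRACTICAL_GBS)
    print(f"CUDA ({name}): estimated copy {estimate:.1f} ms, "
          f"measured total {measured_ms:.1f} ms")
```

The estimates land within a few percent of the measured totals, which suggests that at this size the copy accounts for nearly all of the CUDA running time.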

The biggest advantages of the AVX version are its availability and ease of programming: the source code is much simpler, it requires no specialized compiler, and the instruction set is ubiquitously available.

But nothing is set in stone. If the CPU is throttled to 2.3 GHz instead of boosting to 3.9 GHz as in the previous tests, the results look like this:

{ "data": [ { "line": { "color": "gray", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "naive (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[7.547789394993653e-05,0.00018550609949789053,0.00040951081173442905,0.0008791710360766855,0.0017861249874005876,0.0036152590322789603,0.007255862723728309,0.014571218749292061,0.029236674838817606,0.05843638000078499,0.11709087499184534,0.23420287914615304,0.47349089085095114,0.9847986613184593,2.05283565244273,4.10530542188986,8.226050666222969,16.475158536284255,32.93150909435512,65.76838888900562,131.39998000115156,263.55145004345104] }, { "line": { "color": "gray", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "naive (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[7.847377232142858e-05,0.00018118606081382534,0.0004089066246031707,0.0008779984919160146,0.001785455076288191,0.0036124009797382515,0.0072796450889914665,0.014550154685755294,0.0291747996097598,0.05839778571888538,0.1170125156204449,0.2350440241208105,0.46850750165101335,0.937362500008021,1.9149260443271452,3.8749441340586492,7.780557777732611,15.487851111942696,31.020872728814457,62.408872721293434,124.42858332845692,248.9962000011777] }, { "line": { "color": "blue", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "AVX (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], 
"y":[7.847377232142858e-05,0.00018196726623331065,0.0004089888611572077,0.0008788234374930783,0.0017895214192968334,0.0036261213820407917,0.007262569196037865,0.014578299648543989,0.029178815896396235,0.05857393000042066,0.11709016072148058,0.23516501506754975,0.4711936369734178,0.991337081739103,2.0494026084686965,4.111287790502226,8.253542222600016,16.46949333242244,32.88805238082118,65.74810909064995,131.66211667703465,266.3532999577001] }, { "line": { "color": "blue", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "AVX (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[2.7832031684875495e-05,3.3470381776184495e-05,4.38698124999064e-05,7.090844866460039e-05,0.000122187871770824,0.00022317018749163254,0.00044264144619089854,0.0011890476785733231,0.002419781562721859,0.004838697322286732,0.015837082474590803,0.030007031678272164,0.06174773214817313,0.12791701785837567,0.37293470982798943,0.8870692101744823,1.8827530437324573,3.9859709496396545,8.339617777771005,16.624733332234126,33.351161905253925,65.41696364398707] }, { "line": { "color": "green", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "AVX (square) (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[7.672991071428571e-05,0.00016555705359288757,0.00039490541296670144,0.0008534267856573154,0.0017638397891868974,0.003588431228555474,0.007231058035748512,0.01451991517699623,0.02918836676660083,0.05845689285446757,0.11710960938216886,0.23403150320382352,0.4688470194126215,0.9428062918814151,1.9218879358608265,3.86774860382309,7.743895555742913,15.535682222495476,31.125920834407832,62.13750000196424,124.1070333441409,249.03270000747094] }, { "line": { 
"color": "green", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "AVX (square) (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[3.1494141117095955e-05,3.352794427337876e-05,4.1399643855641675e-05,6.988664063101169e-05,0.00011511603571956844,0.00020943620047783945,0.00039930591518506323,0.0007942055803401413,0.0016517822199835872,0.0032929211926088697,0.006778882236875288,0.019010681329471996,0.038329940221156905,0.07321223214863234,0.14860030131204247,0.4711313392493009,1.114840294487545,2.448629431370459,5.068308000918478,10.082582666849097,20.399364704430543,40.67177647341262] }, { "line": { "color": "#17becf", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "AVX (multithreaded) (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[0.00012551103631372324,0.00023956536037184628,0.000466260243545456,0.0009205864489596441,0.0018369866809523428,0.003653611299172795,0.007301695427911804,0.014589141100719271,0.029278800688105948,0.05858380889053881,0.26149768461315276,0.33937393826779527,0.5893793012987513,1.089459434137006,2.0685357355593785,3.98963684221588,4.272563281119801,5.809922882166328,10.780830158009415,20.776726469835815,40.176417644354785,79.23224444190662] }, { "line": { "color": "#17becf", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "AVX (multithreaded) (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], 
"y":[4.3426536382219484e-05,5.019499142403314e-05,6.262471933867652e-05,8.809432766825356e-05,0.0001454980015897635,0.00024165914739591828,0.0004807194419510844,0.0012062218285134349,0.0023621724041159304,0.004880466555374221,0.18010215564350843,0.2746946343653117,0.5596219869148508,1.0235096240430175,2.073194642781302,4.190268675258092,5.947538392709768,5.9560275222217545,7.002177011307286,11.506012120552247,21.166425713870144,37.642805552523996] }, { "line": { "color": "red", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "CUDA (double)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[3.0924479166666665,3.5953544041450773,3.7248551020408165,3.630168617021277,3.58733112244898,3.751980208333334,3.6965159574468087,3.643735135135135,3.576326178010471,3.6821234042553197,3.8170898876404493,4.399196226415095,6.647574468085106,6.8233644230769235,6.88308811881188,6.974455339805824,9.386486111111111,15.206115217391302,25.868711111111114,46.78848,90.3112125,181.54635] }, { "line": { "color": "red", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "CUDA (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[2.358490566037736,3.305133953488372,3.226904954954955,3.2784216981132075,3.29336338028169,3.291342056074766,3.2676,3.3126551886792455,3.3061582938388625,3.305328217821782,3.5665567708333334,4.244566666666667,6.49910380952381,6.520398969072165,6.617934285714286,6.8236747474747474,6.72720594059406,9.509881428571427,15.289410869565218,25.828151851851853,47.35309333333334,91.1440375] }, { "line": { "color": "#9467bd", "shape": "linear", "width": 3 }, "mode": "lines+markers", "name": "CUDA (square) (double)", "type": "scatter", 
"x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[2.94811320754717,3.267376525821596,3.2975600000000003,3.2491089201877936,3.1894064220183482,3.149660648148148,3.150175113122172,3.205194117647059,3.2160995475113126,3.2815616438356168,3.4932809999999996,3.9423323863636366,5.383573880597014,5.346026153846155,5.467591056910569,5.537370769230769,6.14713982300885,9.305498666666665,14.818768085106381,25.131742857142857,46.45998666666666,89.52754999999999] }, { "line": { "color": "#9467bd", "shape": "linear", "width": 3 }, "marker": {"symbol": "x", "line": {"color": "rgb(0,0,0)"}}, "mode": "lines+markers", "name": "CUDA (square) (float)", "type": "scatter", "x":[64,128,256,512,1024,2048,4096,8192,16384,32768,65536,131072,262144,524288,1048576,2097152,4194304,8388608,16777216,33554432,67108864,134217728], "y":[2.6641705069124426,3.2971990521327017,3.2863161904761906,3.2836551401869163,3.2789431279620858,3.2861830188679244,3.2922279620853083,3.2897356481481483,3.3106695238095236,3.354358048780488,3.526596,3.8438513661202185,5.307573076923077,5.231836434108526,5.2849728682170545,5.3689449612403095,5.462185483870968,6.203222123893806,9.28741095890411,14.861272340425531,25.255837037037036,46.38018666666666] } ], "layout": { "title": "All Results @2.3 GHz CPU", "xaxis":{ "title":"n" }, "yaxis":{ "title":"Running time [ms]" } }, "frames": [] }
With a weakened CPU (as can be the case for a laptop in battery mode), the CUDA versions look more competitive.
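
One way to quantify "more competitive" is the ratio of the CUDA time to the single-threaded AVX time at the largest input size. The following sketch uses values rounded from the two plots for the float variants; the helper name `slowdown` is invented for this illustration, and a ratio above 1 means the CUDA version is slower.

```python
# Ratio of CUDA time to single-threaded AVX time at n = 134,217,728 (float).
# Times in ms are rounded from the plot data for the two CPU clock speeds.

times_ms = {
    "3.9 GHz": {"AVX (float)": 48.31, "CUDA (float)": 91.20},
    "2.3 GHz": {"AVX (float)": 65.42, "CUDA (float)": 91.14},
}

def slowdown(cuda_ms, cpu_ms):
    """Return how many times longer the CUDA version takes than the CPU one."""
    return cuda_ms / cpu_ms

for clock, t in times_ms.items():
    ratio = slowdown(t["CUDA (float)"], t["AVX (float)"])
    print(f"{clock}: CUDA takes {ratio:.2f}x as long as AVX")
```

The CUDA time barely changes between the two runs while the AVX time grows, so the ratio shrinks at the lower clock speed without dropping below 1 for this pair.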

But in normal operation, on this computer, and as long as the vector data originally resides on the host, no amount of tricks such as blocked algorithms, multiple streams, or asynchronous operations will turn the GPU version into a serious competitor to the SIMD approach. The fundamental bandwidth limit of the PCI-Express 3.0 x16 interface, roughly 15.75 GB/s in theory, cannot be exceeded. The discrete GPU is not faster here, and it may even be slower than the integrated one, which would be an interesting experiment to run.

The long list of caveats preceding that conclusion is the point. Even for a problem as simple as computing a large dot product efficiently, the interplay between a processor and its connected components matters more than any single aspect in isolation. A specific piece of hardware informs the process but does not dictate it. For peak performance, look at the big picture and not at the marketing.