C++ - Performance difference between Windows and Linux using the Intel compiler: looking at the assembly


I am running a program on both Windows and Linux (x86-64). It has been compiled with the same compiler (Intel Parallel Studio XE 2017) and the same options, and the Windows version is 3 times faster than the Linux one. The culprit is a call to std::erf, which is resolved in the Intel math library in both cases (by default it is linked dynamically on Windows and statically on Linux, but using dynamic linking on Linux gives the same performance).

Here is a simple program to reproduce the problem.

    #include <cmath>
    #include <cstdio>

    int main() {
      int n = 100000000;
      float sum = 1.0f;

      for (int k = 0; k < n; k++) {
        sum += std::erf(sum);
      }

      std::printf("%7.2f\n", sum);
    }

When I profile this program using VTune, I find that the assembly is a bit different between the Windows and the Linux versions. Here is the call site (the loop) on Windows:

    Block 3:
      vmovaps xmm0, xmm6
      call 0x1400023e0 <erff>
    Block 4:
      inc ebx
      vaddss xmm6, xmm6, xmm0
      cmp ebx, 0x5f5e100
      jl 0x14000103f <Block 3>

And here is the beginning of the erf function called on Windows:

    Block 1:
      push rbp
      sub rsp, 0x40
      lea rbp, ptr [rsp+0x20]
      lea rcx, ptr [rip-0xa6c81]
      movd edx, xmm0
      movups xmmword ptr [rbp+0x10], xmm6
      movss dword ptr [rbp+0x30], xmm0
      mov eax, edx
      and edx, 0x7fffffff
      and eax, 0x80000000
      add eax, 0x3f800000
      mov dword ptr [rbp], eax
      movss xmm6, dword ptr [rbp]
      cmp edx, 0x7f800000
      ...

On Linux, the code is a bit different. The call site is:

    Block 3
      vmovaps %xmm1, %xmm0
      vmovssl  %xmm1, (%rsp)
      callq  0x400bc0 <erff>
    Block 4
      inc %r12d
      vmovssl  (%rsp), %xmm1
      vaddss %xmm0, %xmm1, %xmm1   <-------- hotspot here
      cmp $0x5f5e100, %r12d
      jl 0x400b6b <Block 3>

And the beginning of the called function (erf) is:

      movd %xmm0, %edx
      movssl  %xmm0, -0x10(%rsp)   <-------- hotspot here
      mov %edx, %eax
      and $0x7fffffff, %edx
      and $0x80000000, %eax
      add $0x3f800000, %eax
      movl  %eax, -0x18(%rsp)
      movssl  -0x18(%rsp), %xmm0
      cmp $0x7f800000, %edx
      jnl 0x400dac <Block 8>
      ...

I have marked the 2 points where time is lost on Linux.

Does anyone understand the assembly well enough to explain the difference between the 2 codes and why the Linux version is 3 times slower?

In both cases the arguments and results are passed only in registers, as per the respective calling conventions on Windows and GNU/Linux.

In the GNU/Linux variant, xmm1 is used for accumulating the sum. Since it's a call-clobbered register (a.k.a. caller-saved), it's stored (and restored) in the stack frame of the caller on each call.

In the Windows variant, xmm6 is used for accumulating the sum. This register is callee-saved in the Windows calling convention (but not in the GNU/Linux one).

So, in summary, the GNU/Linux version saves/restores both xmm0 (in the callee[1]) and xmm1 (in the caller), whereas the Windows version saves/restores xmm6 (in the callee).

[1] One would need to look at std::erff to figure out why.

