C++ - Performance difference between Windows and Linux using the Intel compiler: looking at the assembly
I am running a program on both Windows and Linux (x86-64). It has been compiled with the same compiler (Intel Parallel Studio XE 2017) and the same options, and the Windows version is 3 times faster than the Linux one. The culprit is a call to std::erf, which is resolved in the Intel math library in both cases (by default, it is linked dynamically on Windows and statically on Linux, but using dynamic linking on Linux gives the same performance).
Here is a simple program to reproduce the problem.
    #include <cmath>
    #include <cstdio>

    int main() {
        int n = 100000000;
        float sum = 1.0f;

        // Each iteration depends on the previous one through sum.
        for (int k = 0; k < n; k++) {
            sum += std::erf(sum);
        }

        std::printf("%7.2f\n", sum);
    }
When I profile the program using VTune, I find that the assembly is a bit different between the Windows and the Linux versions. Here is the call site (the loop) on Windows:
    Block 3:
        vmovaps xmm0, xmm6
        call 0x1400023e0 <erff>
    Block 4:
        inc ebx
        vaddss xmm6, xmm6, xmm0
        cmp ebx, 0x5f5e100
        jl 0x14000103f <Block 3>
And here is the beginning of the erf function called on Windows:
    Block 1:
        push rbp
        sub rsp, 0x40
        lea rbp, ptr [rsp+0x20]
        lea rcx, ptr [rip-0xa6c81]
        movd edx, xmm0
        movups xmmword ptr [rbp+0x10], xmm6
        movss dword ptr [rbp+0x30], xmm0
        mov eax, edx
        and edx, 0x7fffffff
        and eax, 0x80000000
        add eax, 0x3f800000
        mov dword ptr [rbp], eax
        movss xmm6, dword ptr [rbp]
        cmp edx, 0x7f800000
        ...
On Linux, the code is a bit different. The call site is:
    Block 3:
        vmovaps %xmm1, %xmm0
        vmovssl %xmm1, (%rsp)
        callq 0x400bc0 <erff>
    Block 4:
        inc %r12d
        vmovssl (%rsp), %xmm1
        vaddss %xmm0, %xmm1, %xmm1    <-------- hotspot here
        cmp $0x5f5e100, %r12d
        jl 0x400b6b <Block 3>
And the beginning of the called function (erf) is:

        movd %xmm0, %edx
        movssl %xmm0, -0x10(%rsp)    <-------- hotspot here
        mov %edx, %eax
        and $0x7fffffff, %edx
        and $0x80000000, %eax
        add $0x3f800000, %eax
        movl %eax, -0x18(%rsp)
        movssl -0x18(%rsp), %xmm0
        cmp $0x7f800000, %edx
        jnl 0x400dac <Block 8>
        ...
I have marked the 2 points where time is lost on Linux.
Does anyone understand the assembly well enough to explain the difference between the 2 codes, and why the Linux version is 3 times slower?
In both cases the arguments and results are passed only in registers, as per the respective calling conventions on Windows and GNU/Linux.
In the GNU/Linux variant, xmm1 is used for accumulating the sum. Since it's a call-clobbered register (a.k.a. caller-saved), it's stored (and restored) in the stack frame of the caller on each call.
In the Windows variant, xmm6 is used for accumulating the sum. This register is callee-saved in the Windows calling convention (but not in the GNU/Linux one).
So, in summary, the GNU/Linux version saves/restores both xmm0 (in the callee[1]) and xmm1 (in the caller), whereas the Windows version saves/restores xmm6 (in the callee).
[1] One would need to look at the implementation of erff to figure out why.