Vc has bad performance with Intel C/C++ compiler on Linux #135

Closed
amadio opened this issue Jun 13, 2016 · 31 comments

@amadio
Collaborator

amadio commented Jun 13, 2016

Vc version  Operating System  Compiler & Version  Compiler Flags     Assembler & Version  CPU
1.2.0       Linux             GCC 5.3.0           -O3 -march=native  —                    Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
1.2.0       Linux             ICC 16.0.2          -O3 -march=native  —                    Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz

Vc has much worse performance when compiled with ICC than with GCC, as shown in the example below.
The test case solves many quadratic equations, given coefficients a, b, c. The source code can be found at http://pastebin.com/hr6nPDmJ (quadratic.cc).
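
For context, the kernel being benchmarked is essentially the textbook quadratic formula applied to arrays of coefficients. A minimal sketch of the scalar and Vc variants (illustrative only, not the exact quadratic.cc code; function names are made up) looks like this:

#include <Vc/Vc>
#include <cmath>

// Scalar reference: solve a*x^2 + b*x + c = 0 for one coefficient set.
int QuadSolveScalar(float a, float b, float c, float &x1, float &x2) {
  float delta = b * b - 4.0f * a * c;
  if (delta < 0.0f) return 0;            // no real roots
  float s = std::sqrt(delta);
  x1 = (-b + s) / (2.0f * a);
  x2 = (-b - s) / (2.0f * a);
  return 2;
}

// Vc variant: the same computation on Vc::float_v::Size equations at once,
// using a mask instead of the early return.
void QuadSolveVc(Vc::float_v a, Vc::float_v b, Vc::float_v c,
                 Vc::float_v &x1, Vc::float_v &x2) {
  Vc::float_v delta = b * b - 4.0f * a * c;
  Vc::float_m ok = delta >= 0.0f;        // lanes that have real roots
  Vc::float_v s = Vc::sqrt(delta);       // NaN in masked-off lanes, never used
  x1(ok) = (-b + s) / (2.0f * a);        // write-masked assignments
  x2(ok) = (-b - s) / (2.0f * a);
}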

Testcase

Here is a session on my computer:

 ~ $ icpc -Wall -std=c++11 -O3 -march=native -DUSE_VC=1 quadratic.cc -lVc -o icc.out
 ~ $ g++  -Wall -std=c++11 -O3 -march=native -DUSE_VC=1 quadratic.cc -lVc -o gcc.out
 ~ $ gcc.out
                optimized scalar:  57.000ms
                 AVX2 intrinsics:  12.000ms
                             *Vc:  13.000ms*
 ~ $ icc.out
                optimized scalar:  13.000ms
                 AVX2 intrinsics:  12.000ms
                             *Vc:  49.000ms*
 ~ $ 

Notice that the Vc code is slower than the auto-vectorized scalar code that ICC generates.
The AVX2 intrinsics code is provided for reference.

Note: it used to be the case that even the scalar code would see its performance degraded. While this is no longer true for this particular example, I still see places where merely including Vc degrades the performance of code that does not use Vc at all when using the Intel compiler, probably due to options changed in the Vc headers.

@mattkretz
Member

mattkretz commented Jun 14, 2016

Thanks a lot, @amadio! May I use the code in the Vc benchmark suite (still under construction)? What license and copyright? I'd like to use BSD 3-clause, if possible.

Here's the result running on my Skylake Desktop system (seems to be the same CPU you're using) compiled with GCC 5.2:

Run on (8 X 3400 MHz CPU s)
2016-06-14 12:59:54
Benchmark           Time           CPU Iterations
-------------------------------------------------
scalar       60372055 ns   60333333 ns         12
intrinsics   11747007 ns   11785714 ns         56
vc           11552161 ns   11483871 ns         62

@mattkretz mattkretz self-assigned this Jun 14, 2016
@amadio
Collaborator Author

amadio commented Jun 14, 2016

Hi Matthias,

You can use the code with any open license you want, and modify it to your needs. A very similar version is in VecCore, where I also test performance with fake SIMD classes and UME::SIMD. The performance of Vc with GCC is usually very good, same for Clang, but for ICC I always see performance degradation, and the scalar version of Vc is oddly the fastest.

If you do not have a license for ICC, I can run the benchmarks for you, but I strongly encourage you to also apply for an open source license so that you can test Vc with ICC too.

@mattkretz
Member

Thanks. I have access to ICC - it's just more tedious to get to compared to the system-provided compilers. My main problem is the time constraints: working 12h weeks and prioritizing standardization of SIMD programming in C++ leaves little time for Vc maintenance currently.

@noma

noma commented Jun 17, 2016

Same observation here. I've recently implemented some worst-case SIMD scenarios (branching, nesting, early returns, while-loops) in Vc for a paper in which we compared different implementation strategies for these cases. Vc did well with Clang and GCC, but totally failed with the Intel compiler. The benchmark codes will be public soon.
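
For readers unfamiliar with how such cases map onto Vc: an early return in scalar code typically becomes a mask plus a write-masked assignment. A rough sketch of the pattern (not noma's benchmark code):

#include <Vc/Vc>

// Scalar version with an early return.
float fScalar(float x) {
  if (x < 0.0f) return 0.0f;
  return x * x + 1.0f;
}

// Vc version: all lanes are computed, and results are only written where
// the mask is true; none_of() allows a whole-vector early exit.
Vc::float_v fVc(Vc::float_v x) {
  Vc::float_v result = Vc::float_v::Zero();
  Vc::float_m active = x >= 0.0f;
  if (Vc::none_of(active)) return result;
  result(active) = x * x + 1.0f;
  return result;
}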

@mattkretz
Member

Actually, I'd rather like to say "the Intel compiler totally failed". There are numerous issues with ICC where it fails to parse C++11 (or even C++98) correctly, so Vc contains workarounds specifically for ICC. After all, correctness must come first: it has to compile and produce the correct results, and only then can I look for performance. This is a really frustrating chapter of Vc development. I'll investigate as soon as I can. However, I'd be very thankful for any more in-depth analysis of where the Intel compiler's optimizer fails. Then we might be able to get it fixed via a bug report to Intel and/or a workaround in Vc.

@amadio
Collaborator Author

amadio commented Jun 22, 2016

@noma Do you mind sending a link to your paper? I'm interested in this kind of work.

@noma

noma commented Jun 22, 2016

@mattkretz You are right, let's blame the compiler. :-)
@amadio It's been accepted (EuroPar) but not yet published; I guess we can send you a copy (we can also meet if you happen to be at ISC, too).

@amadio
Collaborator Author

amadio commented Aug 23, 2016

@mattkretz @noma Just to let you know, using -DVc_IMPL=Scalar not only improves performance with ICC, but also lets me compile code using Vc for KNL. Worth a try.

On a separate note, Vc-1.2.0 seems to be broken with ICC-16.0.3. The library compiles and installs, but I cannot compile anything against it.

@pcanal

pcanal commented Aug 29, 2016

Note: it used to be the case that even the scalar code would see its performance degraded. While this is no longer true for this particular example, I still see places where merely including Vc degrades the performance of code that does not use Vc at all when using the Intel compiler, probably due to options changed in the Vc headers.

@mattkretz This seems to indicate that somewhere, somehow, the Vc headers are switching ICC into a mode where it no longer optimizes as well. Do you have any clue what this 'switch' might be? I am hoping that it is a simple 'disable optimization' flag and that the latest version of ICC maybe no longer needs it (to compile Vc). I am also hoping that removing it would then help Vc's performance.

At the moment, this means that any benchmarking done with ICC shows Vc as not being competitive (at all) with other vector libraries (e.g. UME::SIMD) or even with well-auto-vectorized scalar code ... and thus makes Vc look unfairly bad (i.e. I really encourage you to find what is triggering ICC to get into this not-so-great optimization mode).

Cheers,
Philippe.

@pcanal

pcanal commented Aug 31, 2016

@mattkretz FYI, Intel representatives are using this issue to claim to their HPC customers that "this [Vc] library is no longer reliable and must be avoided at all cost" ...

i.e. this makes Vc look very bad ...

So I strongly recommend re-examining this issue, if only to be able to clearly explain why 'whatever is setting ICC into its bad mode' is necessary.

Thanks.

@mattkretz
Member

mattkretz commented Sep 1, 2016

Thanks for the warning, @pcanal. I still don't know exactly what is causing it. But the major difference with ICC is the implementation of scalar element aliasing onto the intrinsic vector objects. With GCC and Clang I can use either vector builtins or the gnu::may_alias attribute. Those are not supported by ICC, and therefore I use a union inside every single Vector and Mask object. This makes it really hard for the optimizer to minimize the loads and stores correctly and might be the main reason for the bad ICC performance. "Fixing" this should be possible now; I first needed to get lvalue-reference assignment via operator[] out of the Vc API to do it.
In any case, I'm working as hard as I can on a cleanup that will enable AVX-512 as well as 8-bit and 64-bit integer vectors, and at the same time shed a lot of baggage, hopefully making ICC perform well again. Doing another hotfix workaround would delay this important work, but I'll see what I can do.
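
To illustrate the two aliasing strategies described above (a simplified sketch, not the actual Vc implementation):

#include <immintrin.h>

// GCC/Clang path: element access through a may_alias typedef; the wrapper
// only needs to store the intrinsic vector itself.
typedef float MayAliasFloat __attribute__((__may_alias__));

struct VectorMayAlias {
  __m256 data;
  float operator[](int i) const {
    return reinterpret_cast<const MayAliasFloat *>(&data)[i];
  }
};

// ICC fallback: a union provides the scalar view. This is accepted by all
// compilers, but tends to force values through memory, making it harder for
// the optimizer to keep everything in registers.
struct VectorUnion {
  union { __m256 v; float m[8]; } data;
  float operator[](int i) const { return data.m[i]; }
};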

@pcanal

pcanal commented Sep 1, 2016

and therefore I use a union inside every single Vector and Mask object

Yes, but in addition to this, 'just' including the Vc headers while not using Vc at all reduces the performance of the code produced by ICC, so there is also something lurking somewhere in the header files that is 'changing the mode' of ICC globally ... at least (re)discovering what that is would be very helpful.

thanks.

@mattkretz
Member

@pcanal do you have a testcase for this? This is very strange. I don't recall any #pragma or similar that could have this effect. Making compilation slower, yes, but making unrelated code slower, no, that is really strange.

@pcanal

pcanal commented Sep 2, 2016

@amadio Can you remind us of the file and compilation options to reproduce the 'icc-is-slowed-down-by-including-Vc' problem?

@amadio
Collaborator Author

amadio commented Sep 2, 2016

Below is an easy way to reproduce the problem. Notice that the performance of the scalarwrapper backend in VecCore is degraded if Vc is enabled, even though it does not depend on Vc. I hope this helps you trace the cause of this issue.

$ git clone https://:@gitlab.cern.ch:8443/VecGeom/VecGeom.git
Cloning into 'VecGeom'...
remote: Counting objects: 44322, done.
remote: Compressing objects: 100% (10869/10869), done.
remote: Total 44322 (delta 33386), reused 43664 (delta 32853)
Receiving objects: 100% (44322/44322), 30.95 MiB | 2.33 MiB/s, done.
Resolving deltas: 100% (33386/33386), done.
Checking connectivity... done.
$ cd VecGeom/VecCore
$ mkdir build && cd build
$ CC=icc CXX=icpc cmake .. -DBUILD_TESTING=ON -DTARGET_ISA=native -DVC=OFF
-- The C compiler identification is Intel 16.0.2.20160204
-- The CXX compiler identification is Intel 16.0.2.20160204
-- Check for working C compiler: /opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icc
-- Check for working C compiler: /opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icpc
-- Check for working CXX compiler: /opt/intel/compilers_and_libraries_2016.2.181/linux/bin/intel64/icpc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Compiling for NATIVE instruction set architecture
-- Found PythonInterp: /usr/bin/python (found version "3.5.2") 
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/gentoo/VecGeom/VecCore/build
$ make quadratic
Scanning dependencies of target quadratic
[ 50%] Building CXX object bench/CMakeFiles/quadratic.dir/quadratic.cc.o
[100%] Linking CXX executable quadratic
[100%] Built target quadratic
$ bench/quadratic 
                    naive scalar:  23.000ms
                optimized scalar:  13.000ms
                 AVX2 intrinsics:  12.000ms
            plain scalar backend:  13.000ms
           scalarwrapper backend:  13.000ms
$ CC=icc CXX=icpc cmake .. -DBUILD_TESTING=ON -DTARGET_ISA=native -DVC=ON
-- Detected Compiler: Intel 16.0.2
-- Performing Test check_cxx_compiler_flag__diag_disable_913
-- Performing Test check_cxx_compiler_flag__diag_disable_913 - Success
-- Performing Test check_cxx_compiler_flag__diag_disable_13211
-- Performing Test check_cxx_compiler_flag__diag_disable_13211 - Success
-- Performing Test check_cxx_compiler_flag__diag_disable_61
-- Performing Test check_cxx_compiler_flag__diag_disable_61 - Success
-- Performing Test check_cxx_compiler_flag__diag_disable_173
-- Performing Test check_cxx_compiler_flag__diag_disable_173 - Success
-- Performing Test check_cxx_compiler_flag__diag_disable_264
-- Performing Test check_cxx_compiler_flag__diag_disable_264 - Success
-- Performing Test check_cxx_compiler_flag__ansi_alias
-- Performing Test check_cxx_compiler_flag__ansi_alias - Success
-- Performing Test check_cxx_compiler_flag__ffp_contract_fast
-- Performing Test check_cxx_compiler_flag__ffp_contract_fast - Failed
-- target changed from "" to "auto"
-- Detected CPU: skylake
CMake Warning at /usr/lib/cmake/Vc/UserWarning.cmake:4 (message):
  FMA4 disabled per default because of old/broken toolchain
Call Stack (most recent call first):
  /usr/lib/cmake/Vc/OptimizeForArchitecture.cmake:365 (UserWarning)
  /usr/lib/cmake/Vc/VcMacros.cmake:342 (OptimizeForArchitecture)
  /usr/lib64/cmake/Vc/VcConfig.cmake:21 (vc_set_preferred_compiler_flags)
  CMakeLists.txt:32 (find_package)


CMake Warning at /usr/lib/cmake/Vc/UserWarning.cmake:4 (message):
  XOP disabled per default because of old/broken toolchain
Call Stack (most recent call first):
  /usr/lib/cmake/Vc/OptimizeForArchitecture.cmake:371 (UserWarning)
  /usr/lib/cmake/Vc/VcMacros.cmake:342 (OptimizeForArchitecture)
  /usr/lib64/cmake/Vc/VcConfig.cmake:21 (vc_set_preferred_compiler_flags)
  CMakeLists.txt:32 (find_package)


-- Performing Test check_cxx_compiler_flag__xCORE_AVX2
-- Performing Test check_cxx_compiler_flag__xCORE_AVX2 - Success
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/gentoo/VecGeom/VecCore/build
$ make quadratic
Scanning dependencies of target quadratic
[ 50%] Building CXX object bench/CMakeFiles/quadratic.dir/quadratic.cc.o
[100%] Linking CXX executable quadratic
[100%] Built target quadratic
$ bench/quadratic 
                    naive scalar:  23.000ms
                optimized scalar:  13.000ms
                 AVX2 intrinsics:  12.000ms
            plain scalar backend:  13.000ms
           scalarwrapper backend:  88.000ms
               Vc scalar backend:  16.000ms
               Vc vector backend:  51.000ms
          VcSimdArray<8> backend:  52.000ms
         VcSimdArray<16> backend:  36.000ms
         VcSimdArray<32> backend:  20.000ms
$

@pcanal

pcanal commented Sep 3, 2016

@mattkretz @amadio

Good news for Vc. It turns out that Guilherme's conclusion was incorrect.

The problem is not linked to Vc at all, but to the number of test loops/functions in Guilherme's example. The slowdown can be reproduced simply by applying the patch below and executing:

$ CC=icc CXX=icpc cmake $SRC_DIR/VecGeom/VecCore -DBUILD_TESTING=ON -DTARGET_ISA=native -DVC=OFF
$ make quadratic
$ ./bench/quadratic
naive scalar: 73.000ms
optimized scalar: 21.000ms
AVX2 intrinsics: 20.000ms
plain scalar backend: 19.000ms
scalarwrapper backend: 121.000ms
scalarwrapper backend: 119.000ms
scalarwrapper backend: 123.000ms
scalarwrapper backend: 119.000ms
scalarwrapper backend: 120.000ms
scalarwrapper backend: 121.000ms
scalarwrapper backend: 40.000ms

Note in particular that the result for the scalarwrapper backend is not even stable (120ms vs 40ms) ...

So this problem is solely a limitation of ICC ...

Cheers,
Philippe.

diff --git a/VecCore/bench/quadratic.cc b/VecCore/bench/quadratic.cc
index 0e3b6ec..af99b1c 100644
--- a/VecCore/bench/quadratic.cc
+++ b/VecCore/bench/quadratic.cc
@@ -236,6 +236,12 @@ int main(int argc, char *argv[])
   TestQuadSolve<backend::Scalar>(a, b, c, x1, x2, roots, N, "plain scalar backend");
   TestQuadSolve<backend::ScalarWrapper>(a, b, c, x1, x2, roots, N, "scalarwrapper backend");
 
+  TestQuadSolve<backend::ScalarWrapper>(a, b, c, x1, x2, roots, N, "scalarwrapper backend");
+  TestQuadSolve<backend::ScalarWrapper>(a, b, c, x1, x2, roots, N, "scalarwrapper backend");
+  TestQuadSolve<backend::ScalarWrapper>(a, b, c, x1, x2, roots, N, "scalarwrapper backend");
+  TestQuadSolve<backend::ScalarWrapper>(a, b, c, x1, x2, roots, N, "scalarwrapper backend");
+  TestQuadSolve<backend::ScalarWrapper>(a, b, c, x1, x2, roots, N, "scalarwrapper backend");
 
 #ifdef VECCORE_ENABLE_VC
   TestQuadSolve<backend::VcScalar>(a, b, c, x1, x2, roots, N, "Vc scalar backend");
   TestQuadSolve<backend::VcVector>(a, b, c, x1, x2, roots, N, "Vc vector backend");

@pcanal

pcanal commented Sep 3, 2016

This means that Vc has no unexpected slowdown; however, as @mattkretz mentioned, Vc is still slower with ICC:

            optimized scalar:  22.000ms
             AVX2 intrinsics:  21.000ms
        plain scalar backend:  22.000ms
       scalarwrapper backend: 136.000ms
           Vc scalar backend:  24.000ms
           Vc vector backend:  86.000ms
      VcSimdArray<8> backend:  86.000ms
     VcSimdArray<16> backend:  58.000ms
     VcSimdArray<32> backend:  37.000ms

whereas with GCC the Vc vector backend is in the same ballpark as the optimized scalar and the intrinsics versions (see the initial post).

@amadio
Collaborator Author

amadio commented Sep 5, 2016

After investigating things a bit deeper, I agree with @pcanal. There is no slowdown in unrelated code, although Vc is still slower with ICC in general. ICC generates streaming stores in some situations, which makes some code seem faster. We had identified this before, but the test was not corrected properly to eliminate these streaming stores from the benchmark; only the intrinsics version was corrected to not use streaming stores. I now have pretty good reason (from other tests and after inspecting the assembly generated in this test) to believe that the performance problem between ICC and Vc is a stack alignment + ABI problem (passing things on the stack rather than in registers). The solution is probably a combination of perfect forwarding in Vc and changes to ICC's ABI conventions for unions and structs.
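
For reference, the streaming-store difference mentioned above is simply the choice between a cached and a non-temporal store (an illustration, not code from the benchmark; both intrinsics require a 32-byte aligned destination):

#include <immintrin.h>

// Regular store: goes through the cache hierarchy.
void StoreRegular(float *dst, __m256 v) { _mm256_store_ps(dst, v); }

// Non-temporal ("streaming") store: bypasses the caches. For large,
// write-only output arrays this can look much faster in a benchmark, so
// mixing the two across variants skews the comparison.
void StoreStreaming(float *dst, __m256 v) { _mm256_stream_ps(dst, v); }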

@mattkretz
Member

The issue I see here is that ICC inserts lots of unnecessary unaligned loads and stores into the critical path. I have no idea yet what confuses ICC that much.

The ABI issue is a good idea, but I have already ensured that this never breaks again. There are ABI unit tests in Vc. I.e. unless the ABI tests fail for you, you can be fairly certain that Vc vector objects passed by value are actually passed via registers.

@mattkretz
Member

This is so frustrating...

__m256 x = {};

compiles to

vmovups 0x2030c(%rip),%ymm0        # 42bc20

instead of

vxorps %ymm0,%ymm0,%ymm0

Likewise, any implicit load/store from/to the stack uses unaligned moves, even though the compiler correctly ensures 32-Byte alignment of the stack pointer.

@mattkretz
Member

mattkretz commented Sep 6, 2016

And to expand the test case:

struct Storage { __m256 data; };

Storage foo() {
  __m256 tmp = {};
  Storage s{tmp};
  s.data = _mm256_add_ps(s.data, s.data);
  asm volatile("vmovaps %0,%0" :"+x"(s.data));
  return s;
}

compiles to:

000000000040b8d0 <foo()>:
  40b8d0:   push   %rbp
  40b8d1:   mov    %rsp,%rbp
  40b8d4:   and    $0xffffffffffffffe0,%rsp
  40b8d8:   vmovups 0x202c0(%rip),%ymm0        # 42bba0 <tmp.119499.0.0.76>
  40b8e0:   vaddps %ymm0,%ymm0,%ymm1
  40b8e4:   vmovups %ymm1,-0x20(%rsp)
  40b8ea:   vmovaps %ymm1,%ymm1
  40b8ee:   vmovups %ymm1,-0x20(%rsp)
  40b8f4:   vmovups -0x20(%rsp),%ymm0
  40b8fa:   mov    %rbp,%rsp
  40b8fd:   pop    %rbp
  40b8fe:   retq

Sorry, but that just shows how confused ICC gets by any minimal abstraction on top of SIMD intrinsics. At this point I have no idea what to do other than write bug reports.

Edit: compiler flags were -O3 -DNDEBUG -std=c++11 -ansi-alias -xCORE-AVX2

@mattkretz
Member

mattkretz commented Sep 6, 2016

And in case the inline-asm example doesn't seem motivating enough, here's another deal-breaker:

struct Storage {
  Storage() : data(_mm256_setzero_ps()) {}
  Storage(__m256 x) : data(x) {}
  __m256 data;
};

struct Vector {
    Vector(const float *mem) { d = _mm256_load_ps(mem); }
    Storage d;
};

Vector foo() {
  const float mem[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  Vector s(mem);
  return s;
}

Compiles to:

000000000040b8d0 <foo()>:
  40b8d0:   push   %rbp
  40b8d1:   mov    %rsp,%rbp
  40b8d4:   and    $0xffffffffffffffe0,%rsp
  40b8d8:   vmovups 0x202c0(%rip),%ymm0        # 42bba0 <mem.119470.0.0.79>
  40b8e0:   vmovups %ymm0,-0x20(%rsp)
  40b8e6:   vmovups -0x20(%rsp),%ymm0
  40b8ec:   mov    %rbp,%rsp
  40b8ef:   pop    %rbp
  40b8f0:   retq

WTF. The following is what I expect:

vmovaps 0xb1a0b1a(%rip),%ymm0
retq

(Note that GCC5 also emits lots of function call boilerplate, but at least the AVX code boils down to a single vmovaps.)

@mattkretz
Member

Alright, I identified one issue that I can avoid: using __m256 as a tag type is not recognized as a no-op by ICC and leads to a load plus a dead store.
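
In other words, something like the following pattern is cheap with GCC/Clang but not with ICC (a reduced sketch, not the actual Vc code; names are illustrative):

#include <immintrin.h>

// Problematic with ICC: a default-constructed __m256 used purely as a tag to
// select this overload. GCC/Clang drop the unused argument; ICC materializes
// it, producing a load of the zero constant plus a dead store.
inline __m256 Load(const float *mem, __m256 /*tag*/ = __m256()) {
  return _mm256_load_ps(mem);
}

// ICC-friendly alternative: an empty struct tag costs nothing to construct
// and pass.
struct AlignedLoadTag {};
inline __m256 Load(const float *mem, AlignedLoadTag) {
  return _mm256_load_ps(mem);
}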

@mattkretz
Member

Here's another finding. Test case:

Vector foo(Vector a, Vector b, Mask k) {
  a(k) = b;
  return a;
}

GCC compiles this to:

0000000000000000 <foo(Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Mask<float, Vc_1::VectorAbi::Avx>)>:
   0:   vblendvps %ymm2,%ymm1,%ymm0,%ymm0
   6:   retq

ICC compiles it to:

0000000000000000 <foo(Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Vector<float, Vc_1::VectorAbi::Avx>, Vc_1::Mask<float, Vc_1::VectorAbi::Avx>)>:
   0:   push   %rbp
   1:   mov    %rsp,%rbp
   4:   and    $0xffffffffffffffe0,%rsp
   8:   sub    $0x100,%rsp
   f:   vmovups %ymm0,0xc0(%rsp)
  18:   lea    0xc0(%rsp),%rax
  20:   vmovups %ymm1,0x20(%rax)
  25:   vmovups %ymm2,(%rsp)
  2a:   vmovups %ymm2,-0x80(%rax)
  2f:   mov    %rax,0x20(%rsp)
  34:   vmovups 0x20(%rsp),%ymm3
  3a:   vmovups %ymm3,-0x60(%rax)
  3f:   vmovups -0x80(%rax),%ymm4
  44:   vmovups %ymm4,-0x40(%rax)
  49:   mov    -0x60(%rax),%rdx
  4d:   mov    %rdx,-0x20(%rax)
  51:   mov    0xa0(%rsp),%rax
  59:   vmovups 0x80(%rsp),%ymm1
  62:   vmovups (%rax),%ymm0
  66:   vblendvps %ymm1,0xe0(%rsp),%ymm0,%ymm2
  71:   vmovups %ymm2,(%rax)
  75:   vmovups 0xc0(%rsp),%ymm0
  7e:   mov    %rbp,%rsp
  81:   pop    %rbp
  82:   retq
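
Semantically, the whole function is just a blend; what GCC emits corresponds to the following equivalent intrinsics formulation (for illustration only):

#include <immintrin.h>

// a(k) = b selects b where the mask k is set and keeps a elsewhere,
// i.e. exactly one vblendvps.
__m256 MaskedAssign(__m256 a, __m256 b, __m256 k) {
  return _mm256_blendv_ps(a, b, k);
}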

mattkretz added a commit that referenced this issue Sep 7, 2016
The use of __m256[id] as default constructed arguments to the load
functions works fine for GCC and clang. ICC, however, generates a dead
store and thus significant overhead.

Refs: gh-135

Signed-off-by: Matthias Kretz <kretz@kde.org>
mattkretz added a commit that referenced this issue Sep 7, 2016
ICC can do proper static propagation when a SIMD object is initialized
with _mm(256)_setzero. It fails to generate proper code for
__m256[id]().

Refs: gh-135

Signed-off-by: Matthias Kretz <kretz@kde.org>
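
The gist of the two commits above, as a minimal illustration (assuming the behavior described in the commit messages):

#include <immintrin.h>

// ICC generates the expected vxorps for the setzero intrinsic ...
__m256 ZeroGood() { return _mm256_setzero_ps(); }

// ... but loads the zero vector from a constant in memory for a
// value-initialized __m256 (see the vmovups examples above).
__m256 ZeroBad() { __m256 x = {}; return x; }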
@mattkretz
Member

Vc master, compiled with ICC 16.0.2:

Run on (8 X 3352.19 MHz CPU s)
2016-09-07 11:19:44
Benchmark           Time           CPU Iterations
-------------------------------------------------
scalar       15150738 ns   15085714 ns         35
intrinsics   11936877 ns   11714286 ns         56
vc           10351496 ns   10387097 ns         62

@mattkretz
Member

@mzyzak FYI. You reported ICC performance issues too. Can you please retest with Vc master?

@noma

noma commented Sep 7, 2016

Another FYI: Version 17.0 of the Intel compiler was released yesterday (2016-09-06).

@mattkretz
Member

Thanks @noma. I'll ask for the new version to get installed on our infrastructure.

@amadio
Collaborator Author

amadio commented Sep 9, 2016

@mattkretz Thanks for working on this. I get much better results with ICC now.

@rolandschulz

Did you report those issues to our compiler team?

@mattkretz
Member

@rolandschulz A colleague at CERN took this up for me. (I still wrote the test cases.) It was as frustrating as every time I did it myself. The support engineer on Premier Support needs hand-holding to understand, reproduce, and escalate the issue. If only reporting such issues didn't require days of work and offered more motivating feedback. :-(

This issue was closed.