Version : 0.1 (29/06/2016)
[TOC]
The official repository is https://gitlab.inria.fr/bramas/inastemp
- **What is Inastemp?** Inastemp provides a set of C++ classes to make vectorization with intrinsics easier. It aims at developing numerical kernels by separating the algorithm from the hardware target. Inastemp comes with several examples and patterns related to widespread use cases.
- **License** The code is published under the MIT license. See the `LICENSE` file or https://opensource.org/licenses/MIT for details.
- **Target audience and use cases** The library requires basic C++ knowledge about classes, templates, etc. It is not mandatory to know what vectorization is all about, but it certainly helps. For example, the increment of loops is usually tied to the underlying size of the vector, which means that one should understand why and where the increment is not equal to 1. A good way to use Inastemp is to have a software engineer manage the inclusion and hide the small complexities of the template process, and to have the scientists work with basic, simple functions templatized against a vector type.
- **Overview** The library itself consists purely of headers. It comes with CMake files, but it is not mandatory to use them. CMake will detect the hardware capacities and turn on some intrinsics by default. One can turn them on/off using `ccmake` or another configuration tool if needed. Unit tests and examples can be compiled, and the headers installed, with the usual CMake stages.
- **Related publication** The Inastemp paper has been published in the Scientific Programming Journal (Hindawi). The PDF can be found at https://www.hindawi.com/journals/sp/2017/5482468/ We would appreciate it if any use or discussion of Inastemp refers to this document.
The main point of Inastemp is to factorize the intrinsics code so that developers can implement kernels without knowing the final destination of the code. It has been common practice to write a kernel for each architecture (partly because, similarly, it has been common practice to develop in C/Fortran). Inastemp fills this gap by taking responsibility for the architecture. When developing a kernel for a specific target, we usually make some specific optimizations. But any optimization usually means being less generic: optimizing by targeting a hardware platform, by using special instructions, by assuming that the input data respect a given pattern, etc. Using C++ templates, the abstraction makes it possible to have a generic algorithm and to choose only at compile time the optimizations that should be applied. This offers a huge benefit in terms of factorization and maintainability of the code.
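This template-based approach can be illustrated with a minimal, self-contained sketch. The `ToyVecScalar` class below is invented for this example (it is not the real Inastemp API); the point is that a kernel written against the template parameter keeps the exact same source code when a wider, intrinsics-backed class is substituted at compile time:

```cpp
#include <cstddef>

// Hypothetical stand-in for an Inastemp vector class (NOT the real API):
// a "vector" of one double exposing the kind of interface described above.
struct ToyVecScalar {
    double v;
    ToyVecScalar(double inV) : v(inV) {}                  // broadcast a scalar
    explicit ToyVecScalar(const double* ptr) : v(*ptr) {} // load from memory
    static constexpr size_t GetVecLength() { return 1; }
    ToyVecScalar operator*(const ToyVecScalar& other) const {
        return ToyVecScalar(v * other.v);
    }
    void storeInArray(double* ptr) const { *ptr = v; }
};

// The kernel is written once, templatized against the vector type.
// Switching ToyVecScalar for an AVX-backed class would change only
// the template argument, not the algorithm.
template <class VecType>
void Scale(double* dest, const double* src, const double coef, const size_t n) {
    const VecType vecCoef(coef); // broadcast hoisted out of the loop
    for (size_t idx = 0; idx < n; idx += VecType::GetVecLength()) {
        (VecType(&src[idx]) * vecCoef).storeInArray(&dest[idx]);
    }
}
```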
```
Generic    <-----------------------> Optimized
All architectures                    One architecture
All input                            Special input
Template C++ source code             Compiled template C++ code
```
The following x86 SIMD types are currently supported:
- SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, AVX512-KNL, AVX512-SKL

The following PowerPC SIMD types are currently supported:
- Power-8 Altivec/VMX

The following ARM SIMD types are currently supported:
- SVE

The following NEC SIMD types are currently supported:
- VEC/SVE for Aurora
- Arithmetic operators `*`, `/`, `+`, `-` are provided
- CPU capacities are detected automatically during the CMake stage
- Compiler capacities are detected automatically during the CMake stage
- The library consists purely of headers; no linkage is necessary
- CPU detection may use Intel® SDE
- Unit tests may use Intel® SDE
- Fast intrinsic `exp()` function (if not supported natively by the compiler)
- Explicit vectorization of branches
- Several patterns that represent many applications are demonstrated
- Using CMake:

```bash
# Create a Build directory
mkdir Build
# Go into the Build directory
cd Build
# Run cmake - considering Inastemp is the upper directory
cmake ..
# OR with more output messages
VERBOSE=1 cmake ..
# Then make will build the unit tests and examples
make
```
- Direct inclusion of the headers: simply include `Src/InastempStaticConfig.h` before anything else. In that case, only the instruction sets that are enabled by the compiler will be enabled in Inastemp. For example, compiling with GCC and `-mavx` will turn on AVX.
The CMake file looks at the variable `${CMAKE_SYSTEM_PROCESSOR}` to detect the type of CPU.
If, for any reason, this variable is not correct, one can override it with `${CUSTOM_SYSTEM_PROCESSOR}`.
For example, the variable is expected to be set to `ARM` or `aarch64` for ARM-based processors; if that is not the case, simply override the CMake value (`cmake .. -DCUSTOM_SYSTEM_PROCESSOR=ARM`).
For example, on Isambard: `VERBOSE=1 CXX=armclang++ cmake -DARCH_NATIVE_FLAG=-march=armv8.2-a+sve ..`
There are two hardware detections:
- the instructions supported by the compiler
- the instructions supported by the current CPU

For each existing vector type in Inastemp, the CMake stage runs both tests.
The results are printed out if `VERBOSE=1`.
If a vector type can be compiled, the configuration creates the CMake variable `INASTEMP_USE_X`, where `X` can be `SSE3`, `AVX2`, etc.
Then, all the vector types that are supported by the CPU are turned to `ON` by default.
So it is possible to turn on specific options even if the CPU does not support them (to execute over Intel SDE, for example).
But if an option does not exist (even if the hardware seems appropriate), it means that the compiler does not support it.
For example, here is a part of the output of the `ccmake ..` command on an AVX2 CPU using GCC 6.1:
```
INASTEMP_USE_AVX        ON
INASTEMP_USE_AVX2       ON
INASTEMP_USE_AVX512KNL  OFF
INASTEMP_USE_AVX512SKL  OFF
INASTEMP_USE_SSE3       ON
INASTEMP_USE_SSE41      ON
INASTEMP_USE_SSE42      ON
INASTEMP_USE_SSSE3      ON
```
`AVX512KNL` and `AVX512SKL` are supported by the compiler but not by the hardware, so they are turned `OFF`, but they can be turned `ON` if needed.
By turning the CMake variable `INASTEMP_ISDE_CPU` to `ON`, the hardware detection is done over Intel SDE.
In this case, one can ask Inastemp to check any hardware (passing the appropriate options to SDE).
Inastemp performs the following test to enable the VMX classes (and to disable all the x86/ARM classes):

```cmake
if(${CMAKE_SYSTEM_PROCESSOR} STREQUAL "ppc64le")
```

If the detection looks incorrect, please override it with the CMake variable `${CUSTOM_SYSTEM_PROCESSOR}`.

Inastemp performs the following test to enable the SVE classes (and to disable all the PowerPC/x86 classes):

```cmake
elseif(${CUSTOM_SYSTEM_PROCESSOR} STREQUAL "ARM" OR ${CUSTOM_SYSTEM_PROCESSOR} STREQUAL "aarch64")
```

If the detection looks incorrect, please override it with the CMake variable `${CUSTOM_SYSTEM_PROCESSOR}`.
Then, the configuration stage will try to compile a tiny example to see if the compiler can work with SVE.
To this end, it passes `-march=native+sve` to the compiler.
If you want to compile for another processor, override `${ARCH_NATIVE_FLAG}`, using `-march=armv8.2-a+sve` for example.
Then the configuration system checks whether the CPU is able to execute SVE; the answer is currently always yes (the ARM emulator always says no, so we override the value; this will change in the future).
NEC Aurora is a device and is not involved in the compilation, which is done on the host CPU.
Therefore, it is mandatory to set `${CUSTOM_SYSTEM_PROCESSOR}`.
Moreover, it is necessary to set the `CXX` compiler, as the compiler for vectorization is usually not the default one.
As a result, here is the `cmake` command to be used:

```bash
# Considering that the compiler is located at /opt/nec/nosupport/llvm-ve-1.9.0/bin/clang++
CXX="/opt/nec/nosupport/llvm-ve-1.9.0/bin/clang++ -target ve-linux -fno-vectorize -fno-slp-vectorize" cmake .. -DCUSTOM_SYSTEM_PROCESSOR=AURORA
```
Considering that the folder `inastemp` exists in the root directory of the project, one could use:
```cmake
## inastemp inclusion in CMakeLists.txt of the other project
# tell inastemp we do not want examples/tests
# (will be detected automatically otherwise because it is included as a subproject)
set(INASTEMP_JUST_LIB TRUE)
# Set the C++ standard required by the master project (default for inastemp is 11)
set(CMAKE_CXX_STANDARD 11)          # Could be 11 or 14 ...
set(CMAKE_CXX_STANDARD_REQUIRED ON) # Could be OFF in specific cases
# add the cmakelist directory
add_subdirectory(inastemp)
# use the filled variables from inastemp
INCLUDE_DIRECTORIES(
    ${INASTEMP_BINARY_DIR}/Src
    ${INASTEMP_SOURCE_DIR}/Src
    ${INASTEMP_INCLUDE_DIR}
)
# propagate the flags to be able to compile
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${INASTEMP_CXX_FLAGS}")
```
The statement `set(INASTEMP_JUST_LIB TRUE)` asks Inastemp not to compile tests/examples and to limit the inclusion of extra warning flags in `INASTEMP_CXX_FLAGS`.
The inclusion of Inastemp as a subproject does not modify the variables of the upper project.
This is why we use the variable `${INASTEMP_CXX_FLAGS}` explicitly to transfer the hardware-specific flags.
When Inastemp detects it is included as a subproject, it turns the variable `INASTEMP_AS_SUBPROJECT` to `ON` and minimizes the additional flags.
As a result, the variable `INASTEMP_CXX_FLAGS` will contain a flag for C++17 and `-funroll-loops` (from `INASTEMP_EXTRA_CXX_FLAGS`), plus the flags needed for the target instructions.
Therefore, if one wants to compile only some files with these specific flags, `INASTEMP_CXX_FLAGS` must be used carefully.
Inastemp was developed and tested using the following compilers on the x86_64 architecture:
- GCC 6.1 (earlier versions work if AVX512-KNL/SKL are not used, provided C++17 is supported)
- Clang 3.5
- Intel 16.0 (earlier versions may work as well)

It was also tested on IBM POWER 8 NVL/4023GHz (OpenPOWER) using the following compilers:
- GCC 6.2.0

It was also tested on ARM (ThunderX2 CN99) using the following compilers:
- Armclang++ 19.0 (build number 69) (based on LLVM 7.0.2)
- Results validated on the emulator armie 19.2-11
- **Intel special flags** We pass `-diag-disable 2304 -diag-disable 10121 -diag-disable 10120` to the Intel compiler to remove the implicit-conversion warnings between Inastemp classes. In fact, we deliberately want implicit conversion because it helps us factorize and drastically reduce the number of lines.
Users can add the flags `-inline-forceinline -no-inline-max-total-size` to push the Intel compiler for more inlining.
However, such options appear not to work with several compiler versions, and thus we do not add them by default.
Inastemp is not designed to compile a binary targeting multiple hardware platforms (that is, with one execution path per possibility). If such a need appears, let us know and we might enable it in a future release; everything is in place to do so, and only a modification of the compilation system is needed.
To ensure good performance, developers should obey some rules:
- Never use an abstract data type above the vector type (in particular, never use `InaVecInterface`). The Inastemp functions must be inlined and known at compile time to be efficient. While in some cases the compiler may know what is behind an abstract type, this is not always true and may lead to very poor performance.
- Do not use the SCALAR classes for real development. SSE is available almost everywhere, so SCALAR classes should be used only when really needed, for example at the end of a loop whose size is not a multiple of the vector length.
- Do not silently convert a scalar float/double to the vector type inside a loop, as shown in the following example:

```cpp
// Do not
for(...){
    // ...
    VecType a = ...;
    VecType v = 1.0 * a;          // Silent conversion of 1.0
    // Or
    VecType v = VecType(1.0) * a; // Explicit conversion
    // In both lines 1.0 is converted to VecType at every iteration.
    // The compiler might optimize it out, but please move it before the loop manually.
}
// Do
const VecType VecOne = 1;
for(...){
    // ...
    VecType a = ...;
    VecType v = VecOne * a;
}
```
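The cost of the silent conversion can be made visible with a small self-contained experiment. The `CountingVec` class below is invented for this sketch (it is not part of Inastemp); it simply counts how often its constructor runs, so the per-iteration broadcast shows up as extra constructions:

```cpp
#include <cstddef>

// Invented counting type: mimics a vector class whose broadcast
// constructor has a per-call cost we can observe.
struct CountingVec {
    static int ctorCalls;
    double v;
    CountingVec(double inV) : v(inV) { ++ctorCalls; }
    CountingVec operator*(const CountingVec& other) const {
        return CountingVec(v * other.v);
    }
};
int CountingVec::ctorCalls = 0;

// Naive: the literal 1.5 is converted to CountingVec in every iteration.
double SumNaive(const double* src, size_t n) {
    double acc = 0;
    for (size_t i = 0; i < n; ++i) acc += (CountingVec(src[i]) * 1.5).v;
    return acc;
}

// Hoisted: the broadcast happens once, before the loop.
double SumHoisted(const double* src, size_t n) {
    double acc = 0;
    const CountingVec coef = 1.5;
    for (size_t i = 0; i < n; ++i) acc += (CountingVec(src[i]) * coef).v;
    return acc;
}
```

With four elements, the naive loop triggers 12 constructor calls versus 9 for the hoisted variant, and the gap grows linearly with the loop size.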
- Never update a variable in a lambda function inside an if-else structure (see the if-else description):

```cpp
a = 1;
b = VecType::If(cond).Then([&](){ // takes 'a' by reference
    a += 1;
    return a;
}).Else([&](){
    return a; // here 'a' is already 1+1, because both branches are evaluated
});
```
Inastemp provides an implementation of the exponential function using intrinsics.
It follows the paper *Fast Exponential Computation on SIMD Architectures*, available at https://www.researchgate.net/publication/272178514_Fast_Exponential_Computation_on_SIMD_Architectures.
The Remez polynomial is used, and the Scilab file to find the coefficients is provided.
When using SDE, Inastemp looks for `sde64`, therefore you must update your path or add a symbolic link:

```bash
# On most systems:
export PATH=/PATH-TO-SDE-BIN/:$PATH
# On Ubuntu, if SDE is not installed in a system folder, one can use
sudo update-alternatives --install /usr/bin/sde64 sde64 /PATH-TO-SDE-BIN/sde64 1
```

When using armie, Inastemp looks for `armie`, therefore you must update your path or add a symbolic link:

```bash
# On most systems:
export PATH=/PATH-TO-ARMIE-BIN/:$PATH
# On Ubuntu, if armie is not installed in a system folder, one can use
sudo update-alternatives --install /usr/bin/armie armie /PATH-TO-ARMIE-BIN/armie 1
```
Taken from the pattern examples, we simply sum two arrays:

```cpp
template <class VecType>
void SumArrays(double* __restrict__ dest, const double* __restrict__ src1,
               const double* __restrict__ src2, const size_t nbToProceed) {
    for (size_t idx = 0; idx < nbToProceed; idx += VecType::GetVecLength()) {
        const VecType v1(&src1[idx]);
        const VecType v2(&src2[idx]);
        const VecType res = v1 + v2;
        res.storeInArray(&dest[idx]);
        // In one line:
        // (VecType(&src1[idx]) + VecType(&src2[idx])).storeInArray(&dest[idx]);
    }
}
```
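The loop above assumes that `nbToProceed` is a multiple of the vector length. A common companion pattern, sketched here with an invented 2-wide toy class (not a real Inastemp type), is to process the largest aligned multiple with the vector type and finish the remainder with scalar code:

```cpp
#include <cstddef>

// Invented fixed-width toy vector (length 2) standing in for an Inastemp type.
struct ToyVec2 {
    double v[2];
    explicit ToyVec2(const double* ptr) : v{ptr[0], ptr[1]} {}
    static constexpr size_t GetVecLength() { return 2; }
    ToyVec2 operator+(const ToyVec2& other) const {
        ToyVec2 r(*this);
        r.v[0] += other.v[0];
        r.v[1] += other.v[1];
        return r;
    }
    void storeInArray(double* ptr) const { ptr[0] = v[0]; ptr[1] = v[1]; }
};

template <class VecType>
void SumArraysWithTail(double* dest, const double* src1,
                       const double* src2, const size_t nbToProceed) {
    // Main SIMD loop over the largest multiple of the vector length.
    const size_t lastVec = (nbToProceed / VecType::GetVecLength()) * VecType::GetVecLength();
    for (size_t idx = 0; idx < lastVec; idx += VecType::GetVecLength()) {
        (VecType(&src1[idx]) + VecType(&src2[idx])).storeInArray(&dest[idx]);
    }
    // Scalar tail for the remaining elements.
    for (size_t idx = lastVec; idx < nbToProceed; ++idx) {
        dest[idx] = src1[idx] + src2[idx];
    }
}
```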
Here we compute the interactions between particles. The scalar sequential code reads:

```cpp
void ScalarFunction(const int nbParticles, const double* __restrict__ positionsX,
                    const double* __restrict__ positionsY, const double* __restrict__ positionsZ,
                    const double* __restrict__ physicalValues, double* __restrict__ potentials,
                    const double cutDistance, const double constantIfCut){
    for(int idxTarget = 0 ; idxTarget < nbParticles ; ++idxTarget){
        for(int idxSource = idxTarget+1 ; idxSource < nbParticles ; ++idxSource){
            const double dx = positionsX[idxTarget] - positionsX[idxSource];
            const double dy = positionsY[idxTarget] - positionsY[idxSource];
            const double dz = positionsZ[idxTarget] - positionsZ[idxSource];
            const double distance = std::sqrt(dx*dx + dy*dy + dz*dz);
            const double inv_distance = 1/distance;
            if(distance < cutDistance){
                potentials[idxTarget] += ( inv_distance * physicalValues[idxSource] );
                potentials[idxSource] += ( inv_distance * physicalValues[idxTarget] );
            }
            else{
                potentials[idxTarget] += ( inv_distance * (physicalValues[idxSource]-constantIfCut) );
                potentials[idxSource] += ( inv_distance * (physicalValues[idxTarget]-constantIfCut) );
            }
        }
    }
}
```
A part of it, rewritten with Inastemp, reads:

```cpp
template <class VecType>
void VectorizedFunction(const int nbParticles, const double* __restrict__ positionsX,
                        const double* __restrict__ positionsY, const double* __restrict__ positionsZ,
                        const double* __restrict__ physicalValues, double* __restrict__ potentials,
                        const double cutDistance, const double constantIfCut){
    const VecType VecOne = 1;
    const VecType VecConstantIfCut = constantIfCut;
    const VecType VecCutDistance = cutDistance;
    for(size_t idxTarget = 0 ; idxTarget < static_cast<size_t>(nbParticles) ; ++idxTarget){
        const VecType targetX = positionsX[idxTarget];
        const VecType targetY = positionsY[idxTarget];
        const VecType targetZ = positionsZ[idxTarget];
        const VecType targetPhysicalValue = physicalValues[idxTarget];
        VecType targetPotential = VecType::GetZero();
        const size_t lastToCompute = ((nbParticles-(idxTarget+1))/VecType::GetVecLength())*VecType::GetVecLength()+(idxTarget+1);
        for(size_t idxSource = idxTarget+1 ; idxSource < lastToCompute ; idxSource += VecType::GetVecLength()){
            const VecType dx = targetX - VecType(&positionsX[idxSource]);
            const VecType dy = targetY - VecType(&positionsY[idxSource]);
            const VecType dz = targetZ - VecType(&positionsZ[idxSource]);
            const VecType distance = VecType(dx*dx + dy*dy + dz*dz).sqrt();
            const VecType inv_distance = VecOne/distance;
            const typename VecType::MaskType testRes = (distance < VecCutDistance);
            const VecType sourcesPhysicalValue = VecType(&physicalValues[idxSource]);
            targetPotential += inv_distance * VecType::IfElse(testRes, sourcesPhysicalValue,
                                                              sourcesPhysicalValue-VecConstantIfCut);
            const VecType resSource = inv_distance * VecType::IfElse(testRes, targetPhysicalValue,
                                                                     targetPhysicalValue-VecConstantIfCut);
            const VecType currentSource = VecType(&potentials[idxSource]);
            (resSource+currentSource).storeInArray(&potentials[idxSource]);
        }
        potentials[idxTarget] += targetPotential.horizontalSum();
        // .....
```
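For reference, the `horizontalSum()` used above reduces a SIMD register to one scalar. With raw SSE2 intrinsics (baseline on x86_64, the architecture the library primarily targets) it can be written as follows; Inastemp wraps the equivalent operation for each instruction set:

```cpp
#include <emmintrin.h> // SSE2 intrinsics, baseline on x86_64

// Sum the two double lanes of an SSE2 register into one scalar,
// the 2-wide analogue of VecType::horizontalSum().
double HorizontalSumSSE2(__m128d v) {
    // Duplicate the high lane into the low position, then add lane-wise.
    const __m128d high = _mm_unpackhi_pd(v, v);
    return _mm_cvtsd_f64(_mm_add_sd(v, high));
}
```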
In many cases applications have branches inside their kernels, which makes them very difficult to vectorize. Inastemp provides tools to explicitly manage the branches. Note that the approach computes all branches! Therefore, large branches may lead to large execution times.

```cpp
// Scalar code
if( a < b )
    res += c * d // Case A
else
    res -= d     // Case B
```

The code example is equivalent to a multiplication by zero for the false case:

```cpp
// Scalar code
conditionA = ( a < b ) ? 1.0 : 0
conditionB = !( a < b ) ? 1.0 : 0
res += conditionA * c * d // Case A
res -= conditionB * d     // Case B
```

It is however faster to use bit masks, as follows:

```cpp
// Scalar code
conditionATrue = ( a < b ) ? 0xFF..FF : 0
res += conditionATrue AND (c * d)   // Case A
res -= (NOT conditionATrue) AND d   // Case B
```
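The bit-mask selection above can be demonstrated in portable scalar C++. This sketch uses `memcpy` for the type punning; the SIMD versions perform the same AND/ANDNOT/OR combination on whole registers:

```cpp
#include <cstdint>
#include <cstring>

// Branch-free select on doubles using an all-ones/all-zeros bit mask,
// the scalar analogue of the SIMD blend described above.
double MaskSelect(double valIfTrue, double valIfFalse, bool cond) {
    const uint64_t mask = cond ? ~uint64_t(0) : uint64_t(0);
    uint64_t tBits, fBits;
    std::memcpy(&tBits, &valIfTrue, sizeof(tBits));
    std::memcpy(&fBits, &valIfFalse, sizeof(fBits));
    // Keep valIfTrue's bits where the mask is set, valIfFalse's elsewhere.
    const uint64_t rBits = (tBits & mask) | (fBits & ~mask);
    double r;
    std::memcpy(&r, &rBits, sizeof(r));
    return r;
}
```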
Using Inastemp, one may explicitly use bit masks. Alternatively, the Inastemp if-else functions can be used.
- Getting a mask:

```cpp
// Get a mask by calling functions
VecType::MaskType mask = VecType().Is{Lower,Greater}[OrEqual](a,b)
VecType::MaskType mask = VecType().Is[Not]Equal(a,b)
// and much more: IsPositive, IsZero, ...
// Or based on C++ operator overloading
VecType::MaskType mask = a < b; // a (<|>)[=] b
// and much more: ==, !=, etc.
```

- Using a mask: the VecType classes provide methods such as AND (`&`), OR (`|`), XOR (`^`), NOT_AND.
The VecType classes provide three methods to deal with conditions. Note that Inastemp computes all sub-results and performs all computation, i.e., even for the false condition.
- `IfTrue(condition, value)`

```cpp
res = VecType::IfTrue(a < b , c);
// res is set to c if a < b, else to zero
```

The IfTrue method is mainly useful when a part of a formula is conditional:

```cpp
res += a * b + VecType::IfTrue(a < b , c);
```

- `IfElse(condition, valueIfTrue, valueIfFalse)`

```cpp
res = VecType::IfElse(a < b , c, d);
// res is set to c if a < b, else to d
```
- If-elseif-else

The vectorizer classes provide an `If` method whose chain accepts either a value or a lambda function to return:

```cpp
// Faster
c = VecType::If(test).Then(a).Else(b);
// Should not be slower using functions
c = VecType::If(test).Then([&](){return a;}).Else([&](){return b;});
```

Also, the lambda/anonymous functions must capture variables by reference using `[&](){}` and not `[=](){}`.
Here are some examples:

```cpp
z = something()
a = VecType::If(cond1).Then( z )
     .ElseIf(cond2).Then([&](){
         VecType x = func(z)
         return x;
     })
     .Else([&](){
         return z*4;
     });
```

The structure can be expressed as follows:

```cpp
VecType::If( a condition )
    .Then(a value or a lambda function) // Then must follow If
    .ElseIf( a condition )              // Optional else if
    .Then(a value or a lambda function) // Then must follow ElseIf
    .Else(a value or a lambda function) // Optional else
```
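To make the evaluation order concrete, here is a minimal scalar mimic of the fluent interface (invented for illustration only; the real Inastemp version is mask-based, vectorized, and supports `ElseIf`). It shows that both lambdas run regardless of the condition, which is why lambdas must not update shared variables:

```cpp
#include <functional>

// Toy scalar re-implementation of the If(...).Then(...).Else(...) chain.
// Note that BOTH lambdas are invoked, exactly as in the vectorized case.
struct ToyIf {
    bool cond;
    double thenVal;
    explicit ToyIf(bool inCond) : cond(inCond), thenVal(0) {}
    ToyIf& Then(std::function<double()> fn) {
        thenVal = fn(); // evaluated unconditionally
        return *this;
    }
    double Else(std::function<double()> fn) {
        const double elseVal = fn(); // evaluated even when cond is true
        return cond ? thenVal : elseVal;
    }
};
ToyIf If(bool cond) { return ToyIf(cond); }
```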
Knowing that all branches are computed, one can think about numerical optimizations.
In the next example we compute `b * c` twice, because both branches are executed:

```cpp
a = VecType::If(cond1)
     .Then([&](){
         return b * c + 1;
     })
     .Else([&](){
         return b * c + 2;
     });
```

The developer should simplify the formulas as much as possible to reduce the number of instructions:

```cpp
bc = b * c
a = VecType::If(cond1)
     .Then([&](){
         return bc + 1;
     })
     .Else([&](){
         return bc + 2;
     });
```
In some cases, some branches are much more probable than others, such that it can make sense to test whether at least one value in the vector requires the exceptional path. It is possible to test if a MaskType is "all true" or "all false" and act accordingly. But this might reduce performance, despite doing less work, because of CPU branch misprediction and pipeline stalls.
```cpp
// Scalar code
if( a < b ){ // this is true 99% of the time
    c = exp(a);
}
else {
    c = exp(b);
}

// Vectorized code with all branches computed (2 exp calls per scalar value)
c = VecType::If(a < b).Then( a.Exp() ).Else( b.Exp() );

// Improved version
VecType::MaskType mask_ab = (a < b);
if( mask_ab.isAllTrue() ){ // This is highly probable
    c = a.Exp();
}
else{ // isAllFalse is not probable here
    c = VecType::If(a < b).Then( a.Exp() ).Else( b.Exp() );
}
```
To add a new intrinsics type, the following steps need to be taken:
- Ensure the detection of the CPU if you want to automatically enable the SIMD types supported by your hardware.
  See `CMakeModules/GetCpuInfos.cmake` and `CMakeModules/getCpuInfos.cpp`.
- Ensure that the compiler supports the selected SIMD types (mandatory!).
  See `CMakeModules/GetCompilerInfos.cmake`.
- Add a folder `TYPE` in `CMakeModules` with the same tests as the others: `compileTestTYPE.cpp` to test if the compiler can compile the code, and `checkTYPEpe.cpp` to test if the compiler provides operators for this type.
- Modify the `GetCompilerInfos.cmake` file where the key `(ADD-NEW-HERE)` is located.
- Add single/double-precision classes following the same syntax as the others: `InaTYPEOperators.cpp` for the operators (a simple modification of an existing file should do the job), and `InaVecTYPE<{double,float}>.hpp` for the class for the new type in double and single precision.
- Add your type in the main `CMakeLists.txt` file.
If an error message similar to `inlining failed in call to always_inline A-SIMD-FUNCTION : target specific option mismatch` is thrown, it means that the flags passed to GCC are not enough to enable the function `A-SIMD-FUNCTION`. One should carefully check the version of SSE/AVX/AVX512 related to the given function. A `VERBOSE=1 make` should show the passed flags.
```
warning: ignoring attributes on template argument ‘__mX {aka __vector(Y) float/double}’ [-Wignored-attributes]
```

This warning may safely be ignored, because it is not a problem for the Inastemp implementation. GCC tells us that the type passed by template has its attributes removed; we do not use the template type to instantiate variables.

Please ensure that lambda functions capture values by reference, that is, using `[&](){}` and not `[=](){}` in the declarations.
It seems that the compiler does not propagate the memory alignment correctly when copying variables for anonymous functions.
You can also try your binary with Intel SDE; you may obtain something like:

```
SDE ERROR: TID: 0 executed instruction with an unaligned memory reference to address 0x1d2e490 INSTR: 0x00042495e: IFORM: VMOVAPS_MEMqq_YMMqq :: vmovaps ymmword ptr [rbx], ymm0
FUNCTION: _ZNSt14_Function_base13_Base_managerIZ24VectorizedFunctionLambdaI10InaVecAVX2IfEfEviPKT0_S6_S6_S6_PS4_S4_S4_EUlvE_E15_M_init_functorERSt9_Any_dataOS8_St17integral_constantIbLb0EE
FUNCTION ADDR: 0x000424923
```
- Berenger Bramas (berenger.bramas@inria.fr)
- Logo Credit to D.M.