Redundant rounding in AVX512 exp d8? #285
Comments
This is because there is no AVX512 intrinsic function for rounding a 512-bit vector. |
What about the cvtpd2dq instruction you're already using (or that the compiler is generating anyway)?
Does this not support the needed/correct rounding modes? |
That's an instruction. https://software.intel.com/sites/landingpage/IntrinsicsGuide/ |
What is needed here is rounding from double to double. |
Ah, of course. I'm closing this issue. |
Intrinsic functions for AVX-512 are weird. |
Because it looks like you need both integer and double versions of the rounded numbers, what about:
? This is probably more efficient than splitting the vector into two 256 bit vectors and applying the avx2 instructions. |
That method does not cover the whole range of double-precision numbers. |
Can you write inline assembly? |
For what it's worth,
Only in the context of using What sort of format do you have in mind? |
Ah, there is no 512-bit version of vroundpd. |
What may be the problem for Being able to strip out all the conversions between 256 and 512 bit vectors (and making the other necessary adjustments, like replacing the 20 bit shifts with 52 bit shifts) helps improve performance a little more, so it'd be great if this were safe. |
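For anyone following along, here is a scalar sketch of why the shift width would change. It assumes the shifts in question move a biased exponent into the exponent field of an IEEE 754 double to build 2^q, which I have not checked against the SLEEF source. The names pow2_shift52 and pow2_shift20 are made up for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Build 2^q (valid for -1022 <= q <= 1023) from a 64-bit integer.
 * The exponent field of a double starts at bit 52, so the biased
 * exponent is shifted left by 52. Assumption: this is what the
 * "52 bit shifts" in the thread refer to. */
double pow2_shift52(int q) {
    uint64_t bits = (uint64_t)(q + 1023) << 52;
    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the bits as a double */
    return d;
}

/* Same value built from a 32-bit integer holding only the high word.
 * Within the high 32 bits the exponent field starts at bit 20,
 * hence the 20-bit shift when working with 32-bit lanes. */
double pow2_shift20(int q) {
    uint32_t hi = (uint32_t)(q + 1023) << 20;
    uint64_t bits = (uint64_t)hi << 32;   /* low word is all zeros */
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}
```

Both functions return the same double for any q in range. With 64-bit integer lanes the high-word bookkeeping goes away, which would fit the remark about stripping out the 256/512-bit conversions.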
Okay, I will try it. |
_mm512_cvt_roundpd_epi64 requires AVX512DQ. |
Knight's Landing and Knight's Mill (the defunct Xeon Phi CPUs) don't have DQ, but Skylake-X and later do. While the Knight's * CPUs lack many of the AVX512 instruction sets (such as DQ), they did have ER, which all the others lack, and which provides accurate reciprocal, reciprocal square root, and exp2 instructions. I don't have access to a Knight's * CPU, but when it comes to those functions, I think the native implementations are going to be hard to beat on that architecture according to Agner Fog's instruction tables (page 347), so odds are optimized implementations are going to want to take advantage of them. |
So, my plan is to add support for avx512fcdvwdqvl. |
It seems that _mm512_roundscale_pd can be used for rounding. https://stackoverflow.com/questions/50854991/instrinsic-mm512-round-ps-is-missing-for-avx512 |
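A minimal sketch of double-to-double rounding with _mm512_roundscale_pd, in case it is useful. This is not the SLEEF code; round8_nearest and round8_avx512 are hypothetical names, and the runtime check via __builtin_cpu_supports plus the target attribute assume GCC or Clang on x86-64.

```c
#include <stddef.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Round 8 doubles to the nearest integer (ties to even), staying in
 * double precision. The upper four bits of the immediate are zero,
 * so the scale is 2^0, i.e. plain rounding to an integer. */
__attribute__((target("avx512f")))
static void round8_avx512(const double *in, double *out) {
    __m512d v = _mm512_loadu_pd(in);
    __m512d r = _mm512_roundscale_pd(
        v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    _mm512_storeu_pd(out, r);
}
#endif

/* Hypothetical wrapper: returns 1 if the AVX-512F path ran,
 * 0 if the CPU (or compiler) does not support it. */
int round8_nearest(const double *in, double *out) {
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512f")) {
        round8_avx512(in, out);
        return 1;
    }
#endif
    (void)in;
    (void)out;
    return 0;
}
```

Because the result stays a double, this covers the full double-precision range, unlike a round trip through a 32-bit integer.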
Merged. |
I haven't looked at the source code, but compiling with -DSLEEF_ENABLE_LLVM_BITCODE=TRUE let me look at some of the LLVM bitcode. Look at the first few lines (%3 through %8):
It uses shufflevector to split a vector of length 8 into two vectors of length 4. It rounds both. Then it uses another shufflevector to combine them, before converting to quadword integers and rounding a second time with the same rounding mode.
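To make the redundancy concrete, here is a rough C sketch of the two paths using intrinsics. It is my reconstruction of the pattern described above, not the generated code itself. two_step, one_step, and convert_both_ways are hypothetical names, and the target attribute and __builtin_cpu_supports assume GCC or Clang on x86-64 with AVX512F and AVX512DQ.

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__GNUC__)
#include <immintrin.h>

/* Round to nearest (ties to even), suppress exceptions. */
#define RN (_MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC)

/* Split into two 256-bit halves, round each, recombine, then convert
 * to 64-bit integers with the same rounding mode. */
__attribute__((target("avx512f,avx512dq")))
static void two_step(const double *in, int64_t *out) {
    __m512d v  = _mm512_loadu_pd(in);
    __m256d lo = _mm256_round_pd(_mm512_castpd512_pd256(v), RN);
    __m256d hi = _mm256_round_pd(_mm512_extractf64x4_pd(v, 1), RN);
    __m512d r  = _mm512_insertf64x4(_mm512_castpd256_pd512(lo), hi, 1);
    _mm512_storeu_si512(out, _mm512_cvt_roundpd_epi64(r, RN));
}

/* Convert directly: the conversion already rounds. */
__attribute__((target("avx512f,avx512dq")))
static void one_step(const double *in, int64_t *out) {
    __m512d v = _mm512_loadu_pd(in);
    _mm512_storeu_si512(out, _mm512_cvt_roundpd_epi64(v, RN));
}
#endif

/* Returns 1 and fills both outputs if AVX512F and AVX512DQ are
 * available at run time, otherwise returns 0 and leaves them alone. */
int convert_both_ways(const double *in, int64_t *a, int64_t *b) {
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("avx512f") &&
        __builtin_cpu_supports("avx512dq")) {
        two_step(in, a);
        one_step(in, b);
        return 1;
    }
#endif
    (void)in; (void)a; (void)b;
    return 0;
}
```

Rounding an already-integral double with the same mode leaves it unchanged, so both paths should agree on the integer results; the first just does extra work.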
I tried modifying the code:
That is, I replaced the %7 with %2 above, to use the unrounded result, and then relied on the compiler to remove the now-dead code on lines 3-7 (it would be a pain to manually rename all the SSA values by actually deleting the lines).
I used llvmcall from Julia on this IR to define vexp and vexpv2 (the latter from the modified IR, which eliminates the redundant 256-bit rounds):
This shows a sizeable performance improvement. (Note that there are "evals/sample" evaluations per sample for more accurate timings of fast functions.)
Answers appear to be identical:
If anyone happens to have Julia installed and a system with AVX512, you can run the above benchmarks via
FWIW, the exp d8 from my system's GLIBC takes only 4.2 nanoseconds.