-
Notifications
You must be signed in to change notification settings - Fork 276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check for adcx instruction in the addcarryx intrinsics instead of just adc #666
Comments
Title should say "addcarryx" |
LLVM doesn't emit this intrinsic anymore: https://reviews.llvm.org/D51754 The rationale given is:
So I'm closing this. If you feel like we should emit this instruction here, feel free to re-open this, and fill an LLVM bug about it pinging the people involved in the submission and review of that pull-request, and link it here. |
The public interface we expose has nothing to do with what LLVM does internally. For simple functions such as this, our public interface needs to match the Intel Intrinsics guide. If we can't implement it properly in LLVM, we should not stabilize the intrinsic. If you didn't care about using ADX, you could use the addcarry intrinsic which always works. |
You are right, I think that llvm diff is unrelated. We should be using |
FWIW, we don't promise which instructions any intrinsic will generate. We promise that the running executable won't be able to tell the difference. |
So looking at the LLVM tests, it appears that indeed addcarryx is replaced with addcarry when possible. AFAICT replacing it would only be incorrect if the user were to inspect the overflow flag which we are not doing (and I am unsure whether that's possible or easy to do in LLVM - cc @rkruppe). @jethrogb I have a PR (#672) that updates these to use the llvm addcarryx intrinsic, but the code generation issue persists. We could workaround that using inline assembly, but before we add any workarounds one would need to fill in an LLVM bug showing an MWE in which the transformation is incorrect. |
@jethrogb do you think you could help with an MWE for an LLVM bug report for this ? |
I came up with this /// Returns `(a + b, c + d)`
#[inline(never)]
#[no_mangle]
pub fn dual_mp_add(a: &[u64], b: &[u64], c: &[u64], d: &[u64]) -> (Vec<u64>, Vec<u64>) {
unsafe {
let len = a.len();
assert_eq!(len, b.len());
assert_eq!(len, c.len());
assert_eq!(len, d.len());
let mut ab = vec![0; len];
let mut cd = vec![0; len];
let mut ab_carry = 0;
let mut cd_carry = 0;
for i in 0..len {
ab_carry = _addcarryx_u64(ab_carry, *a.get_unchecked(i), *b.get_unchecked(i), ab.get_unchecked_mut(i));
cd_carry = _addcarryx_u64(cd_carry, *c.get_unchecked(i), *d.get_unchecked(i), cd.get_unchecked_mut(i));
}
(ab, cd)
}
} But the assembly is so laughably inefficient (the addcarry call isn't even inlined) that not using the ADX instruction isn't really going to do much in terms of performance. |
Sorry, I forgot that you need to add a 34a0: 49 8b 4c f5 00 mov 0x0(%r13,%rsi,8),%rcx
34a5: 40 80 c7 ff add $0xff,%dil
34a9: 49 13 0c f4 adc (%r12,%rsi,8),%rcx
34ad: 40 0f 92 c7 setb %dil
34b1: 48 89 4c f5 00 mov %rcx,0x0(%rbp,%rsi,8)
34b6: 49 8b 0c f7 mov (%r15,%rsi,8),%rcx
34ba: 80 c2 ff add $0xff,%dl
34bd: 49 13 0c f0 adc (%r8,%rsi,8),%rcx
34c1: 0f 92 c2 setb %dl
34c4: 48 89 0c f0 mov %rcx,(%rax,%rsi,8)
34c8: 48 8d 76 01 lea 0x1(%rsi),%rsi
34cc: 48 39 f3 cmp %rsi,%rbx
34cf: 75 cf jne 34a0 <dual_mp_add+0xf0> I think this is supposed to be slower than using ADX. |
Edit: You can using JRCXZ! |
Ok, so I replaced that code with this assembly:
This is basically the same speed with my poor benchmark. |
If the code is correct and the speed is the same, there might be an advantage to not using |
I decided to implement MP multiply, which is documented by Intel as a main usecase for these instructions: #![feature(test, asm)]
extern crate test;
use core::arch::x86_64::{_addcarryx_u64, _mulx_u64};
#[inline(never)]
#[no_mangle]
pub fn mp_mul_intrin(a: &[u64], b: &[u64]) -> Vec<u64> {
unsafe {
let len = a.len();
assert_eq!(len, b.len());
let mut c = vec![0; len * 2];
for b_i in (0..len).into_iter().rev() {
let b_elem = *b.get_unchecked(b_i);
let mut carry_lo = 0;
let mut carry_hi = 0;
for a_i in (0..len).into_iter().rev() {
let mut hi = 0;
let lo = _mulx_u64(*a.get_unchecked(a_i), b_elem, &mut hi);
carry_lo = _addcarryx_u64(carry_lo, lo, *c.get_unchecked(b_i + a_i + 1), c.get_unchecked_mut(b_i + a_i + 1));
carry_hi = _addcarryx_u64(carry_hi, hi, *c.get_unchecked(b_i + a_i), c.get_unchecked_mut(b_i + a_i));
}
*c.get_unchecked_mut(b_i) += carry_lo as u64;
}
c
}
}
#[inline(never)]
#[no_mangle]
pub fn mp_mul_asm(a: &[u64], b: &[u64]) -> Vec<u64> {
unsafe {
let len = a.len();
assert_eq!(len, b.len());
let mut c = vec![0; len * 2];
asm!("
# start iteration of `b_i`: `len`-1 downto 0
lea -1($3), %rsi
1: # every iteration of `b_i`
# load `b_elem`
mov ($1, %rsi, 8), %rdx
# clear `carry_lo`, `carry_hi`
xor %r9, %r9
# start iteration of `a_i`+1: `len` downto 1
mov $3, %rcx
jmp 3f
2: # the end of every iteration of `a_i`+1, except the last iteration
# no displacement, RCX has already been decremented
mov %r10, (%r11, %rcx, 8)
3: # every iteration of `a_i`+1
# compute c + b_i
lea ($2, %rsi, 8), %r11
# compute a[a_i]*b_elem
mulx -8($0, %rcx, 8), %r8, %r9
# add low word
mov (%r11, %rcx, 8), %r10
adcx %r8, %r10
mov %r10, (%r11, %rcx, 8)
# add high word
mov -8(%r11, %rcx, 8), %r10
adox %r9, %r10
# this is done later to be able to add in last carry in last iteration of `a_i`+1
# mov %r10, -8(%r11, %rcx, 8)
# advance `a_i`+1
lea -1(%rcx), %rcx
jrcxz 4f
jmp 2b
4: # the end of the last iteration of `a_i`+1
adc $$0, %r10
mov %r10, (%r11)
# advance `b_i`
dec %rsi
jns 1b
"
:
: "r"(a.as_ptr()), "r"(b.as_ptr()), "r"(c.as_mut_ptr()), "r"(len)
: "rsi", "rcx", "rdx", "r8", "r9", "r10", "r11", "flags"
);
c
}
}
use num_bigint::{RandBigInt, BigUint};
fn bigint_to_slice(i: &BigUint) -> Vec<u64> {
i.to_bytes_be().chunks(8).map(|chunk| {
let mut a = [0u8; 8];
a.copy_from_slice(chunk);
u64::from_be_bytes(a)
}).collect()
}
#[test]
fn test_rand() {
for _ in 0..1_000 {
let a = rand::thread_rng().gen_biguint(1024);
let b = rand::thread_rng().gen_biguint(1024);
if a.bits() + b.bits() <= 2041 {
continue;
}
let c = &a * &b;
assert_eq!(
mp_mul_intrin(&bigint_to_slice(&a), &bigint_to_slice(&b)),
bigint_to_slice(&c)
);
assert_eq!(
mp_mul_asm(&bigint_to_slice(&a), &bigint_to_slice(&b)),
bigint_to_slice(&c)
);
println!("ok");
}
}
#[bench]
fn bench_intrin(bench: &mut test::Bencher) {
loop {
let a = rand::thread_rng().gen_biguint(1024);
let b = rand::thread_rng().gen_biguint(1024);
if a.bits() + b.bits() <= 2041 {
continue;
}
let a = bigint_to_slice(&a);
let b = bigint_to_slice(&b);
bench.iter(|| mp_mul_intrin(&a, &b));
break;
}
}
#[bench]
fn bench_asm(bench: &mut test::Bencher) {
loop {
let a = rand::thread_rng().gen_biguint(1024);
let b = rand::thread_rng().gen_biguint(1024);
if a.bits() + b.bits() <= 2041 {
continue;
}
let a = bigint_to_slice(&a);
let b = bigint_to_slice(&b);
bench.iter(|| mp_mul_asm(&a, &b));
break;
}
} Here you can see a pretty significant difference:
For reference, here's the assembly I get for the intrinsics version:
|
Thank you, I think that's very convincing. Could you fill in an LLVM bug for this and link it here? (otherwise I can do it for you if you want). |
I'd prefer it if you did it, I'm guessing you have a better picture of what should be in such a bug report. |
So investigating for the LLVM bug report I produced the following slightly more minimal example (godbolt - no std, no debug info, no println, no panics, ...): #![feature(adx_target_feature)]
#![no_std]
use core::{arch::x86_64::*, hint::unreachable_unchecked};
#[target_feature(enable = "adx,bmi2")]
pub unsafe fn mp_mul_intrin(a: &[u64], b: &[u64], c: &mut [u64]) {
let len = a.len();
if len != b.len() {
unreachable_unchecked();
}
if c.len() != len * 2 {
unreachable_unchecked();
}
for b_i in (0..len).into_iter().rev() {
let b_elem = *b.get_unchecked(b_i);
let mut carry_lo = 0;
let mut carry_hi = 0;
for a_i in (0..len).into_iter().rev() {
let mut hi = 0;
let lo = _mulx_u64(*a.get_unchecked(a_i), b_elem, &mut hi);
carry_lo = _addcarryx_u64(
carry_lo,
lo,
*c.get_unchecked(b_i + a_i + 1),
c.get_unchecked_mut(b_i + a_i + 1),
);
carry_hi = _addcarryx_u64(
carry_hi,
hi,
*c.get_unchecked(b_i + a_i),
c.get_unchecked_mut(b_i + a_i),
);
}
*c.get_unchecked_mut(b_i) += carry_lo as u64;
}
} Inspecting the LLVM-IR in debug mode, we are not emitting |
This reproduces the issue (godbolt): #![feature(link_llvm_intrinsics, abi_unadjusted, adx_target_feature)]
#[allow(improper_ctypes)]
extern "unadjusted" {
#[link_name = "llvm.x86.addcarryx.u64"]
fn llvm_addcarryx_u64(a: u8, b: u64, c: u64, d: *mut u8) -> u8;
}
#[target_feature(enable = "adx")]
pub unsafe fn _addcarryx_u64(c_in: u8, a: u64, b: u64, out: &mut u64) -> u8 {
llvm_addcarryx_u64(c_in, a, b, out as *mut _ as *mut u8)
} produces ource_filename = "example.3a1fbbbh-cgu.0"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
define i8 @_ZN7example14_addcarryx_u6417hf98d062ffaad3a58E(i8 %c_in, i64 %a, i64 %b, i64* align 8 dereferenceable(8) %out) unnamed_addr #0 {
start:
%0 = bitcast i64* %out to i8*
%1 = call { i8, i64 } @llvm.x86.addcarry.64(i8 %c_in, i64 %a, i64 %b)
%2 = extractvalue { i8, i64 } %1, 1
%3 = bitcast i8* %0 to i64*
store i64 %2, i64* %3, align 1
%4 = extractvalue { i8, i64 } %1, 0
br label %bb1
bb1: ; preds = %start
ret i8 %4
}
declare { i8, i64 } @llvm.x86.addcarry.64(i8, i64, i64) #1
attributes #0 = { nounwind nonlazybind "probe-stack"="__rust_probestack""target-cpu"="x86-64""target-features"="+adx" }
attributes #1 = { nounwind readnone }
!llvm.module.flags = !{!0}
!0 = !{i32 2, !"RtLibUseGOT", i32 1} when compiled with "-C opt-level=0 -C debuginfo=0 -C panic=abort --edition=2018 -C lto=no --emit=llvm-ir". |
It appears that this is just LLVM replacing the |
It looks like the mulx instrinsic also doesn't work correctly? Edit: there we don't even try to call the LLVM intrinsic... |
IIRC the bmi2 mulx is almost never worth it codegen wise. I’m on the phone
but there should be a closed issue here about that.
|
That is #27. In the code I shared, without mulx, it's no use emitting adcx/adox, because it would be overwriting the flags we were so carefully trying to keep around. |
#27 claims the 64-bit version "works fine" but that's not (or no longer) the case. |
Damn, we should probably also fill an LLVM issue about that :/
…On Sat 20. Apr 2019 at 18:54, jethrogb ***@***.***> wrote:
#27 <#27> claims the
64-bit version "works fine" but that's not (or no longer) the case.
—
You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
<#666 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAG43JU5QCOL2LEGV5OEYFLPRNDFLANCNFSM4GTP7DSQ>
.
|
So this is now blocked on LLVM: https://bugs.llvm.org/show_bug.cgi?id=41546 It seems LLVM is not set up to generate code like this. |
Does this affect using the |
This only affects the intrinsics. |
Doesn't u128's |
I saw @gnzlbg mention this earlier:
...and @Amanieu mentioned that inline assembly is not impacted here. It seems that the LLVM bug report has been filed, so is an inline assembly workaround potentially on the table here now? |
AFAICT this is only a perf issue and not a correctness issue. Using inline assembly in the body of just this intrinsic definition will likely only slowdown things as LLVM won't be able to fold an immediate into the instruction. It can't cause a speedup as adcx still clobbers the carry flag and there is no way to say that only the carry flag is clobbered and not the other flags. To get a perf improvement from using inline assembly you will likely need to write the entire code sequence that contains the adcx usage in inline asm. |
@bjorn3 yeah, I was worried about that. As @jethrogb mentioned earlier what would be really nice here is some way to combine ADC/MULX such that flags aren't clobbered. But if they are, it defeats the point. To make matters even more complicated we're looking for options that can potentially work in a const context, which is why we're looking for an intrinsics-based approach as opposed to directly using inline assembly ourselves. But if that isn't possible, we can potentially pursue these sorts of optimizations using inline assembly outside of a const context. |
So I have this prototype, which uses only functions + playground::carry_chain: # @playground::carry_chain
# %bb.0:
pushq %rax
movq %rdx, %rax
xorq %rdx, %rdx
movq %rdi, %rdx
mulxq %rax, %rdx, %rdi
adoxq %rdi, %rdx
adcxq %rax, %rdi
movq %rsi, %rdx
mulxq %rcx, %rax, %rdx
adoxq %rdx, %rax
adcxq %rcx, %rdx
popq %rcx
retq
# -- End function (removed I haven't been able to get rid of the |
cc @jethrogb who reported this here #631 (comment)
The text was updated successfully, but these errors were encountered: