VAES should not be restricted to AVX512 #1343

Closed
SchrodingerZhu opened this issue Oct 19, 2022 · 1 comment · Fixed by #1348

SchrodingerZhu commented Oct 19, 2022

pub unsafe fn _mm256_aesdec_epi128(a: __m256i, round_key: __m256i) -> __m256i {

It seems that the stdarch library unconditionally puts the VAES-related intrinsics in the AVX512VL scope, while VAES is actually available on platforms without AVX-512 support (e.g. AMD Zen 3).

I spotted the issue while investigating VAES with aHash: tkaitchuck/aHash#85

When using _mm256_aesenc_epi128, instead of generating something like

   2cf7e:       c5 d1 6c ee             vpunpcklqdq %xmm6,%xmm5,%xmm5
   2cf82:       c4 c1 f9 6e f2          vmovq  %r10,%xmm6
   2cf87:       c5 c9 6c f7             vpunpcklqdq %xmm7,%xmm6,%xmm6
   2cf8b:       c4 e2 7d dc c4          vaesenc %ymm4,%ymm0,%ymm0
   2cf90:       c4 e2 75 dc cb          vaesenc %ymm3,%ymm1,%ymm1
   2cf95:       c4 e3 7d 38 f6 01       vinserti128 $0x1,%xmm6,%ymm0,%ymm6
   2cf9b:       c4 e3 55 02 ee f0       vpblendd $0xf0,%ymm6,%ymm5,%ymm5
   2cfa1:       c4 e2 55 00 ea          vpshufb %ymm2,%ymm5,%ymm5
   2cfa6:       c5 e5 d4 dd             vpaddq %ymm5,%ymm3,%ymm3
   2cfaa:       c4 e2 65 00 da          vpshufb %ymm2,%ymm3,%ymm3
   2cfaf:       c5 dd d4 db             vpaddq %ymm3,%ymm4,%ymm3
   2cfb3:       c4 e3 7d 39 dc 01       vextracti128 $0x1,%ymm3,%xmm4
   2cfb9:       c4 e3 f9 16 d8 01       vpextrq $0x1,%xmm3,%rax
   2cfbf:       c4 e1 f9 7e d9          vmovq  %xmm3,%rcx
   2cfc4:       c4 c3 f9 16 e1 01       vpextrq $0x1,%xmm4,%r9
   2cfca:       c4 c1 f9 7e e2          vmovq  %xmm4,%r10

the compiler generates

   2cf8d:       c5 fc 29 84 24 80 00    vmovaps %ymm0,0x80(%rsp)
   2cf94:       00 00 
   2cf96:       c5 fd 7f 94 24 20 01    vmovdqa %ymm2,0x120(%rsp)
   2cf9d:       00 00 
   2cf9f:       c5 fc 29 8c 24 40 01    vmovaps %ymm1,0x140(%rsp)
   2cfa6:       00 00 
   2cfa8:       c5 f8 77                vzeroupper
   2cfab:       e8 c0 db ff ff          call   2ab70 <core::core_arch::x86::avx512vaes::_mm256_aesenc_epi128::hdf0bd31f9011eecd>
   2cfb0:       c5 fd 6f 84 24 a0 00    vmovdqa 0xa0(%rsp),%ymm0
   2cfb7:       00 00 
   2cfb9:       c5 fd 6f 1c 24          vmovdqa (%rsp),%ymm3
   2cfbe:       c5 fd d4 44 24 60       vpaddq 0x60(%rsp),%ymm0,%ymm0
   2cfc4:       c4 e2 7d 00 05 d3 ad    vpshufb 0x8add3(%rip),%ymm0,%ymm0        # b7da0 <_fini+0x10a0>
   2cfcb:       08 00 
   2cfcd:       c5 fd d4 44 24 40       vpaddq 0x40(%rsp),%ymm0,%ymm0
   2cfd3:       c4 c3 f9 16 c7 01       vpextrq $0x1,%xmm0,%r15
   2cfd9:       c4 e1 f9 7e c3          vmovq  %xmm0,%rbx
   2cfde:       c4 e3 7d 39 c0 01       vextracti128 $0x1,%ymm0,%xmm0
   2cfe4:       c4 c3 f9 16 c6 01       vpextrq $0x1,%xmm0,%r14
   2cfea:       c4 e1 f9 7e c6          vmovq  %xmm0,%rsi

It is very strange that the compiler does not inline the function call under the release profile, even with target-cpu=native set.
However, when I explicitly write

    extern "C" {
        #[link_name = "llvm.x86.aesni.aesenc.256"]
        fn aesenc_256(a: __m256i, round_key: __m256i) -> __m256i;
    }

    unsafe {
        transmute(aesenc_256(transmute(value), transmute(xor)))
    }

the compiler emits the first assembly listing above, as expected.

I suspect this is because the intrinsic is marked as an avx512vl instruction, so it cannot be inlined into code compiled without that target feature.
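For illustration, here is a minimal sketch of the mechanism suspected above. It uses AVX2 as a stand-in, not the actual stdarch VAES code, and the names `add_lanes_avx2` and `add_lanes` are hypothetical. A function annotated with `#[target_feature(enable = "...")]` is compiled with that feature switched on, and LLVM will generally only inline it into callers that enable the same features. When the caller lacks them, the call stays out of line, which matches the `call` seen in the second listing.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256i, _mm256_add_epi64, _mm256_loadu_si256, _mm256_storeu_si256};

/// Hypothetical stand-in for a feature-gated intrinsic wrapper.
/// It is compiled with AVX2 enabled regardless of the crate-wide target
/// features, so it can only be inlined into callers that also enable AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_lanes_avx2(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    // Unaligned loads: plain arrays carry no 32-byte alignment guarantee.
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    let mut out = [0u64; 4];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, _mm256_add_epi64(va, vb));
    out
}

/// Safe entry point built WITHOUT the feature attribute. The call into
/// `add_lanes_avx2` crosses a target-feature boundary unless the whole
/// crate is compiled with AVX2 enabled (e.g. via target-cpu=native).
pub fn add_lanes(a: [u64; 4], b: [u64; 4]) -> [u64; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was confirmed at runtime just above.
            return unsafe { add_lanes_avx2(a, b) };
        }
    }
    // Portable fallback with the same wrapping semantics as vpaddq.
    let mut out = [0u64; 4];
    for i in 0..4 {
        out[i] = a[i].wrapping_add(b[i]);
    }
    out
}
```

By the same rule, if a 256-bit VAES intrinsic requires an AVX-512 feature that a Zen 3 build never enables, even target-cpu=native would not make the caller's feature set a superset of the callee's, and the inlining would be blocked.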

SchrodingerZhu (Author) commented

My CPU info:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  128
  On-line CPU(s) list:   0-127
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7773X 64-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  64
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU(s) scaling MHz:  63%
    CPU max MHz:         3527.7339
    CPU min MHz:         1500.0000
    BogoMIPS:            4404.54
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 
                         fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3
                         invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr 
                         rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Both versions of the code run without problems despite the unsatisfactory code generation.
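The flags listing above contains `vaes` but no `avx512*` entry, which is the combination this issue is about. As a rough sketch, the check can be expressed as a small helper over a Linux-style flags string. `VaesSupport` and `classify_flags` are hypothetical names introduced here for illustration; they are not part of stdarch or any other library.

```rust
/// Summary of what a Linux `flags` line says about VAES and AVX-512.
#[derive(Debug, PartialEq)]
pub struct VaesSupport {
    /// True if the exact flag `vaes` is present.
    pub vaes: bool,
    /// True if any flag starts with `avx512` (avx512f, avx512vl, ...).
    pub any_avx512: bool,
}

/// Scan a whitespace-separated flags string such as the one printed by
/// `lscpu` or found in /proc/cpuinfo.
pub fn classify_flags(flags: &str) -> VaesSupport {
    let mut support = VaesSupport { vaes: false, any_avx512: false };
    for flag in flags.split_whitespace() {
        if flag == "vaes" {
            support.vaes = true;
        }
        if flag.starts_with("avx512") {
            support.any_avx512 = true;
        }
    }
    support
}
```

Fed the flags from the EPYC 7773X above, this would report `vaes: true` and `any_avx512: false`, i.e. a CPU where the 256-bit VAES instructions are usable but nothing gated on AVX-512 is.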
