Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect avx2 detection #2957

Closed
thrasibule opened this issue Oct 30, 2020 · 6 comments · Fixed by #2960
Closed

Incorrect avx2 detection #2957

thrasibule opened this issue Oct 30, 2020 · 6 comments · Fixed by #2960

Comments

@thrasibule
Copy link
Contributor

thrasibule commented Oct 30, 2020

This is a followup to issue #2933. I'm getting reports from users still having issues to compile after the fix that went in 3.12. Their cpu doesn't support avx2, yet HAVE_AVX2=1 is generated in Makefile.conf. I have the opposite situation: my cpu has the avx2 flag, yet ./get_arch 0 will set HAVE_AVX2 flag to 0. I'm starting to wonder if the support_avx2 function defined here is correct. Based on intel docs https://software.intel.com/content/www/us/en/develop/articles/how-to-detect-new-instruction-support-in-the-4th-generation-intel-core-processor-family.html, I would expect to test something like ebx & (1 <<3 ) & (1 << 5) & (1 << 8) instead of ebx & (1 <<7). gcc also has some intrinsics for testing cpu features which might be easier to use.

@martin-frbg
Copy link
Collaborator

That function only checks if the operating system has disabled avx2 support(which could happen in a vm for example), not if the cpu itself is capable. Can you provide more details for both failure cases please ? (In the case of the user reports, model names or ideally /proc/cpuinfo contents would be necessary - I guess Intel may have marketed some dies with broken or otherwise disabled avx2 units under a haswell-like cpuid that OpenBLAS erroneously lumps in with the real thing)

@thrasibule
Copy link
Contributor Author

See archlinux forum here: https://aur.archlinux.org/packages/openblas-lapack/ There are two cpus which show HAVE_AVX2=1, even though the cpus shouldn't support it. I really think testing ebx & (1 << 5) is the right thing to do, that's what other people are doing see here for instance: https://github.com/google/highwayhash/blob/master/highwayhash/instruction_sets.cc#L105. I've asked the users to test out this fix on their cpu.

@martin-frbg
Copy link
Collaborator

Fairly certain we need to test both the cpu and os capability bit. (Pretty sure there was an old ticket for this which led to the addition of that check, but too tired to search now.)

@thrasibule
Copy link
Contributor Author

From what I can tell, maybe you're referring to this commit when you talk about os support?
735ca38
The function xgetbv from the commit indeed corresponds to the check_xcr0_ymm function from the intel docs I linked above.

int check_xcr0_ymm() 
{
    uint32_t xcr0;
#if defined(_MSC_VER)
    xcr0 = (uint32_t)_xgetbv(0);  /* min VS2010 SP1 compiler is required */
#else
    __asm__ ("xgetbv" : "=a" (xcr0) : "c" (0) : "%edx" );
#endif
    return ((xcr0 & 6) == 6); /* checking if xmm and ymm state are enabled in XCR0 */
}

But I can't figure out what the check ebx & (1<<7) is doing in support_avx2, and I think checking ebx & (1<<5) would be an improvement on the current situation, besides matching with intel docs. I've found a cpu where I can reproduce. This is the output of /proc/cpuinfo:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 58
model name	: Intel(R) Core(TM) i5-3570S CPU @ 3.10GHz
stepping	: 9
microcode	: 0x20
cpu MHz		: 2216.499
cache size	: 6144 KB
physical id	: 0
siblings	: 4
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts flush_l1d
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6188.08
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual

This is your standard sandybridge processor, but this is the output of ./getarch 0:

CORE=SANDYBRIDGE
LIBCORE=sandybridge
NUM_CORES=4
HAVE_MMX=1
HAVE_SSE=1
HAVE_SSE2=1
HAVE_SSE3=1
HAVE_SSSE3=1
HAVE_SSE4_1=1
HAVE_SSE4_2=1
HAVE_AVX=1
HAVE_AVX2=1
MAKE += -j 4

note that HAVE_AVX2 is set to 1, which makes no sense, since this cpu doesn't have the avx2 instructions, OS support or no OS support. Now because of our old friend here: https://github.com/xianyi/OpenBLAS/blob/develop/Makefile.x86_64#L69, -mavx2 doesn't get added, and it fails to compile with CFLAGS="-march=x86-64 -mtune=generic" as in #2933.

Now if I check ebx against 1<<5, then everything works, support_avx2 returns 0 and everything compiles.

I've even found a cpu with the opposite situation. It's a ryzen cpu running in a virtualbox, which lets through the avx2 flag. See the /proc/cpuinfo below:

vendor_id	: AuthenticAMD
cpu family	: 23
model		: 8
model name	: AMD Ryzen 7 2700X Eight-Core Processor
stepping	: 2
microcode	: 0x6000626
cpu MHz		: 3699.996
cache size	: 512 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch ssbd vmmcall fsgsbase avx2 rdseed clflushopt arat
bugs		: fxsave_leak sysret_ss_attrs null_seg spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 7402.32
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management:

However, ./get_arch 0 doesn't detect avx2, which again is fixed by checking the fifth bit. However, I'm not sure what the universal intrinsics really need, but for instance virtualbox doesn't let through the bmi and bmi2 flag which are also part of the instructions (that's what 1<<3 and 1<<8 test), so it fails if we add those.

@thrasibule
Copy link
Contributor Author

And one more data point: https://github.com/xianyi/OpenBLAS/blob/develop/cpuid_x86.c#L238 This lines actually checks ebx against 32 = 1<<5, not 1<<7. So I really think 1<<7 is incorrect.

@martin-frbg
Copy link
Collaborator

#1949 it was - apparently I did not see the full picture back then (nor did the others). The check was meant to disable AVX2 if necessary after it was already inferred from the cpuid - which for family 6, exmodel 3, model A it certainly is not. (And it seems I had actually started out with a cpu capabilty check and changed it to the os flag when that was pointed out to me - the way this
function gets used, it had better do both checks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants