MathLoadTest_autosimd crash Illegal instruction vmState=0x00000000 Compiled_method= #19408

Open
pshipton opened this issue Apr 29, 2024 · 34 comments

@pshipton
Member

pshipton commented Apr 29, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_special.system_x86-64_windows_Personal_testList_0/185 - win2012x64-openj9-1a
MathLoadTest_autosimd_special_5m_12 -Xjit -Xgcpolicy:balanced -Xnocompressedrefs

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk11_j9_special.system_x86-64_windows_Personal_testList_0/185/system_test_output.tar.gz

18:09:33  MLT stderr Type=Illegal instruction vmState=0x00000000
18:09:33  MLT stderr Windows_ExceptionCode=c000001d J9Generic_Signal=00000010 ExceptionAddress=00007FFE7AED4436 ContextFlags=0010005f
18:09:33  MLT stderr Handler1=00007FFE903939B0 Handler2=00007FFE9335ABA0
18:09:33  MLT stderr RDI=0000000000000001 RSI=00007FF649F255C0 RAX=0000000000000000 RBX=0000000000000000
18:09:33  MLT stderr RCX=00007FF649F255C0 RDX=0000000000000018 R8=0000005FBCCC1100 R9=0000000000000008
18:09:33  MLT stderr R10=0000005FBCCC1100 R11=00007FF649F25618 R12=00007FF649F25670 R13=00007FF649F254D8
18:09:33  MLT stderr R14=0000000000000038 R15=00007FF648677468
18:09:33  MLT stderr RIP=00007FFE7AED4436 RSP=0000005FBCCFD450 RBP=0000005FBCCCEB00 EFLAGS=0000000000010297
18:09:33  MLT stderr FS=0053 ES=002B DS=002B
18:09:33  MLT stderr XMM0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:09:33  MLT stderr XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:09:33  MLT stderr XMM2 4048800000000000 (f: 0.000000, d: 4.900000e+01)
18:09:33  MLT stderr XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:09:33  MLT stderr XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:09:33  MLT stderr XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:09:33  MLT stderr XMM6 0000000040076d87 (f: 1074228608.000000, d: 5.307395e-315)
18:09:33  MLT stderr XMM7 0000000040076d87 (f: 1074228608.000000, d: 5.307395e-315)
18:09:33  MLT stderr XMM8 3fc170a704dccd9f (f: 81579424.000000, d: 1.362504e-01)
18:09:33  MLT stderr XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:09:33  MLT stderr XMM10 3fc170a704dccd9f (f: 81579424.000000, d: 1.362504e-01)
18:09:33  MLT stderr XMM11 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
18:09:33  MLT stderr XMM12 3fc170a704dccd9f (f: 81579424.000000, d: 1.362504e-01)
18:09:33  MLT stderr XMM13 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
18:09:33  MLT stderr XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:09:33  MLT stderr XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:09:33  MLT stderr Module=
18:09:33  MLT stderr Module_base_address=00007FFE7AA00000 Offset_in_DLL=00000000004d4436
18:09:33  MLT stderr 
18:09:33  MLT stderr Compiled_method=net/adoptopenjdk/test/autosimd/AutoSIMDTestDouble.testSimpleBinary(Lnet/adoptopenjdk/test/autosimd/BinaryOpSIMDDouble;)V
18:09:33  MLT stderr Target=2_90_20240427_1157 (Windows Server 2012 R2 6.3 build 9600)
18:09:33  MLT stderr CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
18:09:33  MLT stderr ----------- Stack Backtrace -----------
18:09:33  MLT stderr Unhandled exception
18:09:33  MLT stderr Type=Illegal instruction vmState=0x00000000

MathLoadTest_autosimd_special_5m_20
-Xcompressedrefs -Xgcpolicy:gencon -Xjit:counts=- - - - - - 1 1 1 1000 250 250 - - - 10000 100000 10000,gcOnResolve,rtResolve,sampleInterval=2,scorchingSampleThreshold=10000,quickProfile -Xmn512k -Xcheck:gc:vmthreads:all:quiet

18:15:48  MLT stderr Type=Illegal instruction vmState=0x00000000
18:15:48  MLT stderr Windows_ExceptionCode=c000001d J9Generic_Signal=00000010 ExceptionAddress=00007FFE7B0FED43 ContextFlags=0010005f
18:15:48  MLT stderr Handler1=00007FFE903939B0 Handler2=00007FFE9335ABA0
18:15:48  MLT stderr RDI=0000000000000001 RSI=0000000000000007 RAX=0000000000000000 RBX=00000007FFFAEDD8
18:15:48  MLT stderr RCX=00000007FFFAED88 RDX=00000007FFFAED40 R8=0000000000000000 R9=00000007FFFAED90
18:15:48  MLT stderr R10=00000007FFFAED48 R11=000000000000001B R12=0000000000000000 R13=00000007FFFAED40
18:15:48  MLT stderr R14=0000000000000008 R15=0000000000000008
18:15:48  MLT stderr RIP=00007FFE7B0FED43 RSP=0000000000512F00 RBP=0000000000507A00 EFLAGS=0000000000010202
18:15:48  MLT stderr FS=0053 ES=002B DS=002B
18:15:48  MLT stderr XMM0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM1 3fee8524488267d7 (f: 1216505856.000000, d: 9.537527e-01)
18:15:48  MLT stderr XMM2 4048800000000000 (f: 0.000000, d: 4.900000e+01)
18:15:48  MLT stderr XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM7 3fed70da7230c1d8 (f: 1915798016.000000, d: 9.200260e-01)
18:15:48  MLT stderr XMM8 0000000040e00000 (f: 1088421888.000000, d: 5.377519e-315)
18:15:48  MLT stderr XMM9 000000003e000000 (f: 1040187392.000000, d: 5.139209e-315)
18:15:48  MLT stderr XMM10 000000003f800000 (f: 1065353216.000000, d: 5.263544e-315)
18:15:48  MLT stderr XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
18:15:48  MLT stderr Module=
18:15:48  MLT stderr Module_base_address=00007FFE7AA00000 Offset_in_DLL=00000000006fed43
18:15:48  MLT stderr 
18:15:48  MLT stderr Compiled_method=net/adoptopenjdk/test/autosimd/AutoSIMDTestLong.simdAdd([J[J[JI)V
18:15:48  MLT stderr Target=2_90_20240427_1157 (Windows Server 2012 R2 6.3 build 9600)
18:15:48  MLT stderr CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
18:15:48  MLT stderr ----------- Stack Backtrace -----------
18:15:48  MLT stderr (0x00007FFE7B0FED43)
18:15:48  MLT stderr (0x0000000000000006)
18:15:48  MLT stderr (0x0000000000000202)
18:15:48  MLT stderr (0x0000000000000007)
18:15:48  MLT stderr (0x0000000000000213)
18:15:48  MLT stderr (0x00000000004FF600)
18:15:48  MLT stderr (0x0000000000000008)
18:15:48  MLT stderr (0x00000007FFFAED88)
18:15:48  MLT stderr (0x00000007FFFAEDD0)
18:15:48  MLT stderr (0x00000007FFFAED40)
18:15:48  MLT stderr (0x00007FFE7B097435)
18:15:48  MLT stderr (0x0000000700000008)
18:15:48  MLT stderr (0x00000007FFFAEDD0)
18:15:48  MLT stderr (0x00000007FFFAED88)
18:15:48  MLT stderr (0x00000007FFFAED40)
18:15:48  MLT stderr (0x0000000000000206)
18:15:48  MLT stderr (0x00007FFE7B04D37C)
18:15:48  MLT stderr (0x0000000000000008)
18:15:48  MLT stderr (0x00000007FFFAEDD0)
18:15:48  MLT stderr (0x00000007FFFAED88)
18:15:48  MLT stderr (0x00000007FFFAED40)
18:15:48  MLT stderr (0x00000007FFFAED30)
18:15:48  MLT stderr (0x0000000700392B70)
18:15:48  MLT stderr (0x00000007FFFAED30)
18:15:48  MLT stderr (0x000000EBD703F300)
18:15:48  MLT stderr (0x0000000000000007)
18:15:48  MLT stderr (0x00007FFE7AE6B54A)
18:15:48  MLT stderr (0x00000000004FF600)
18:15:48  MLT stderr (0x0000000000000008)
18:15:48  MLT stderr (0x00000007FFFAED88)
18:15:48  MLT stderr (0x00000007FFFAEDD0)
18:15:48  MLT stderr ---------------------------------------

Changes since last special.system build
f44a1c6...c5c5206
eclipse-openj9/openj9-omr@723d2e4...33a1542
ibmruntimes/openj9-openjdk-jdk11@95a3a61...b4574cc

@pshipton
Member Author

@hzongaro fyi

@pshipton
Member Author

pshipton commented Apr 29, 2024

This is a 0.45 release build.
https://openj9-jenkins.osuosl.org/job/Test_openjdk22_j9_special.system_x86-64_windows_Release_testList_3/7/ - win2012x64-openj9-1a
MathLoadTest_autosimd_special_5m_2
-Xgcpolicy:optthruput -Xjit:count=0,optlevel=hot,gcOnResolve,rtResolve -Xnocompressedrefs

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk22_j9_special.system_x86-64_windows_Release_testList_3/7/system_test_output.tar.gz

13:44:42  MLT stderr Type=Illegal instruction vmState=0x00000000
13:44:42  MLT stderr Windows_ExceptionCode=c000001d J9Generic_Signal=00000010 ExceptionAddress=00007FFCFDA0D1C3 ContextFlags=0010005f
13:44:42  MLT stderr Handler1=00007FFD12C715B0 Handler2=00007FFD15B6ABA0
13:44:42  MLT stderr RDI=0000000000000000 RSI=00007FF6A8F03EF8 RAX=00007FF6A8F03EA8 RBX=00007FF6A8F03F58
13:44:42  MLT stderr RCX=0000000000000008 RDX=00007FF6A8F03F48 R8=0000000000000001 R9=00007FF6A8F03F08
13:44:42  MLT stderr R10=00007FF6A8F03EB8 R11=0000000000000000 R12=0000000000000007 R13=0000000000000000
13:44:42  MLT stderr R14=0000000000000007 R15=00007FF6A8F03EA8
13:44:42  MLT stderr RIP=00007FFCFDA0D1C3 RSP=000000C115107910 RBP=000000C113E11500 EFLAGS=0000000000010202
13:44:42  MLT stderr FS=0053 ES=002B DS=002B
13:44:42  MLT stderr XMM0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr XMM2 bf9e6ef750d2d67c (f: 1355994752.000000, d: -2.972018e-02)
13:44:42  MLT stderr XMM3 3f37a76fde89167d (f: 3733526016.000000, d: 3.609322e-04)
13:44:42  MLT stderr XMM4 3fd38a432efa283e (f: 788146240.000000, d: 3.053139e-01)
13:44:42  MLT stderr XMM5 3fa7dd0a579694c2 (f: 1469486336.000000, d: 4.660828e-02)
13:44:42  MLT stderr XMM6 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr XMM7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr XMM8 3fee8524488267d7 (f: 1216505856.000000, d: 9.537527e-01)
13:44:42  MLT stderr XMM9 3f6fb81eaa4a9143 (f: 2857013504.000000, d: 3.871975e-03)
13:44:42  MLT stderr XMM10 3fb7dd0a579694c2 (f: 1469486336.000000, d: 9.321656e-02)
13:44:42  MLT stderr XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
13:44:42  MLT stderr Module=
13:44:42  MLT stderr Module_base_address=00007FFCFD000000 Offset_in_DLL=0000000000a0d1c3
13:44:42  MLT stderr 
13:44:42  MLT stderr Compiled_method=net/adoptopenjdk/test/autosimd/AutoSIMDTestDouble.simdSub([D[D[DI)V
13:44:42  MLT stderr Target=2_90_20240428_11 (Windows Server 2012 R2 6.3 build 9600)
13:44:42  MLT stderr CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
13:44:42  MLT stderr ----------- Stack Backtrace -----------
13:44:42  MLT stderr (0x00007FFCFDA0D1C3)
13:44:42  MLT stderr (0x000000C115117B00)
13:44:42  MLT stderr (0x00007FF6A8F03E98)
13:44:42  MLT stderr (0x00007FF6A8F03F48)
13:44:42  MLT stderr (0x00007FF6A8F03EF8)
13:44:42  MLT stderr (0x00007FF6A8F03EA8)
13:44:42  MLT stderr (0x00007FFCFDA0C815)
13:44:42  MLT stderr (0x00007FF6A945E280)
13:44:42  MLT stderr (0x00007FF6A92A0260)
13:44:42  MLT stderr ---------------------------------------

@pshipton
Member Author

Dup of #19377?

These are all 64-bit JVMs.

@hzongaro
Member

@BradleyWood, may I ask you to look at this one as well?

@pshipton
Member Author

pshipton commented May 1, 2024

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.functional_x86-64_windows_Nightly_testList_0/709 - win2012x64-openj9-1a
SIMDCommonedAddressTest_0

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.functional_x86-64_windows_Nightly_testList_0/709/functional_test_output.tar.gz

23:46:14  Unhandled exception
23:46:14  Type=Illegal instruction vmState=0x00040000
23:46:14  Windows_ExceptionCode=c000001d J9Generic_Signal=00000010 ExceptionAddress=00007FFC82C00102 ContextFlags=0010005f
23:46:14  Handler1=00007FFC9B4DC5D0 Handler2=00007FFC9A41ABA0
23:46:14  RDI=0000000000000532 RSI=00000007FFE70110 RAX=0000000000000000 RBX=0000000000000000
23:46:14  RCX=0000000000000542 RDX=00000007FFE71620 R8=00000007FFE70110 R9=00000007FFE71628
23:46:14  R10=0000000000000541 R11=00007FFC9A645397 R12=0000000000000000 R13=0000003242AD9EA8
23:46:14  R14=0000000000000000 R15=000000324234CC40
23:46:14  RIP=00007FFC82C00102 RSP=0000000000105D20 RBP=0000000000017000 EFLAGS=0000000000010293
23:46:14  FS=0053 ES=002B DS=002B
23:46:14  XMM0 000000324cc6f009 (f: 1288105984.000000, d: 1.067362e-312)
23:46:14  XMM1 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM2 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM5 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM6 0000000000105de8 (f: 1072616.000000, d: 5.299427e-318)
23:46:14  XMM7 0000000000105de8 (f: 1072616.000000, d: 5.299427e-318)
23:46:14  XMM8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  XMM15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
23:46:14  Module=
23:46:14  Module_base_address=00007FFC82C00000 Offset_in_DLL=0000000000000102
23:46:14  
23:46:14  Compiled_method=jit/test/tr/SIMDOpts/SIMDCommonedAddressTest.testSIMDCommonedAddress([I[II)V
23:46:14  Target=2_90_20240430_754 (Windows Server 2012 R2 6.3 build 9600)
23:46:14  CPU=amd64 (4 logical CPUs) (0x3fff77000 RAM)
23:46:14  ----------- Stack Backtrace -----------
23:46:14  (0x00007FFC82C00102)
23:46:14  J9_GetInterface+0x19217 (0x00007FFC9B502D67 [j9vm29+0xf2d67])
23:46:14  (0x00000007FFEECBD0)
23:46:14  ---------------------------------------

@pshipton
Member Author

pshipton commented May 1, 2024

See also #19424 (cmdLineTester_loopReduction_0)
and #19377 (MathLoadTest_autosimd_5m_2 on win32)

All the failures occur on win2012x64-openj9-1a

@BradleyWood
Member

We definitely have a problem with generating AVX-512 instructions on 32-bit JVMs running on a 64-bit machine. I think we have two issues here, since there is also a failure on a 64-bit JVM. @pshipton, could you get me the cpuid info of the machine that failed this test?

@BradleyWood
Member

The javacore states PROCESSOR_IDENTIFIER=Intel64 Family 6 Model 85, which is one of Cooper Lake, Skylake, or Cascade Lake (server). Each of these supports AVX-512, so we must be looking at a problem in addition to #19377.

The javacore also states this, which makes no sense to me. I don't know how avx512dq can exist without avx512f.
JITFEATURE CPU features (JIT): fpu cx8 cmov mmx sse sse2 sse3 ssse3 fma sse4_1 sse4_2 popcnt aesni osxsave avx fdp_excptn_only avx512dq rdseed sha avx512vl null
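
The oddity above (avx512dq and avx512vl reported without avx512f) can be checked mechanically. This is a minimal Python sketch, not OpenJ9 code. It assumes the JITFEATURE list is space-separated and that every flag starting with `avx512`, other than `avx512f` itself, depends on the foundation set. The function name is made up for illustration.

```python
def inconsistent_avx512_flags(feature_line):
    """Return AVX-512 sub-features that are reported without avx512f.

    Assumption: flag names follow the javacore spelling (avx512f, avx512dq, ...).
    """
    flags = feature_line.split()
    if "avx512f" in flags:
        return []
    # Any other avx512* flag is suspicious when the foundation flag is missing.
    return [f for f in flags if f.startswith("avx512")]


# Feature list copied from the javacore line quoted in this comment.
jit_features = (
    "fpu cx8 cmov mmx sse sse2 sse3 ssse3 fma sse4_1 sse4_2 popcnt aesni "
    "osxsave avx fdp_excptn_only avx512dq rdseed sha avx512vl null"
)

print(inconsistent_avx512_flags(jit_features))  # ['avx512dq', 'avx512vl']
```

A non-empty result only says the reported feature set is inconsistent; it does not say which layer (hardware, hypervisor, OS, or the JIT's own detection) produced the inconsistency.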

@pshipton
Member Author

pshipton commented May 1, 2024

I think @AdamBrousseau will have to obtain the cpuid info of win2012x64-openj9-1a

@pshipton
Member Author

pshipton commented May 1, 2024

It is a virtual machine, so perhaps it's messed up somehow.

@pshipton
Member Author

pshipton commented May 1, 2024

I've disabled https://openj9-jenkins.osuosl.org/computer/win2012x64%2Dopenj9%2D1a/ in Jenkins since we don't need tests running on it and crashing.

Also opened infrastructure/issues/9283

@BradleyWood
Member

@pshipton Has anything similar happened on any other machine?

This is the instruction causing problems. It is valid on AVX-512 supported hardware.

62d17e086f441300     vmovdqu32 xmm0, xmmword ptr [r11 + rdx]
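
For reference, the EVEX prefix of those bytes can be decoded by hand. The following is a rough Python sketch, not code from OMR. The helper name and the returned field names are hypothetical, and the field layout follows my reading of the 4-byte EVEX prefix description in the Intel SDM.

```python
def decode_evex_prefix(code):
    """Decode selected fields of the EVEX prefix and return them with the opcode byte."""
    if len(code) < 5 or code[0] != 0x62:
        raise ValueError("not an EVEX-encoded instruction")
    p0, p1, p2, opcode = code[1], code[2], code[3], code[4]
    return {
        "map": p0 & 0x03,        # opcode map: 1 = 0F, 2 = 0F38, 3 = 0F3A
        "W": (p1 >> 7) & 1,      # operand-size / element-width bit
        "pp": p1 & 0x03,         # implied prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2
        # L'L vector length field; the value 3 is reserved and maps to None here.
        "vector_bits": {0: 128, 1: 256, 2: 512}.get((p2 >> 5) & 0x03),
        "mask_reg": p2 & 0x07,   # aaa field: 0 means no opmask is applied
        "opcode": opcode,
    }


# Bytes of the faulting instruction quoted in this comment.
fields = decode_evex_prefix(bytes.fromhex("62d17e086f441300"))
print(fields)
```

Map 1 (0F), pp 2 (F3), W 0 and opcode 0x6F correspond to vmovdqu32, and the vector length field selects the 128-bit (xmm) form, which agrees with the disassembly above.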

@AdamBrousseau Could you get me the cpuid info for win2012x64-openj9-1a?

@pshipton
Member Author

pshipton commented May 2, 2024

Has anything similar happened on any other machine?

No, I checked all the failures and they were on win2012x64-openj9-1a

@AdamBrousseau
Contributor

[image attachment not preserved]

@BradleyWood
Member

@AdamBrousseau I need the list of instruction set extensions supported by that CPU. Whatever command would be equivalent to lscpu on Linux.

@AdamBrousseau
Contributor

Hopefully this helps

$ wmic cpu list /format:list

AddressWidth=64
Architecture=9
Availability=3
Caption=Intel64 Family 6 Model 85 Stepping 4
ConfigManagerErrorCode=
ConfigManagerUserConfig=
CpuStatus=1
CreationClassName=Win32_Processor
CurrentClockSpeed=2300
CurrentVoltage=
DataWidth=64
Description=Intel64 Family 6 Model 85 Stepping 4
DeviceID=CPU0
ErrorCleared=
ErrorDescription=
ExtClock=
Family=1
InstallDate=
L2CacheSize=
L2CacheSpeed=
LastErrorCode=
Level=6
LoadPercentage=1
Manufacturer=GenuineIntel
MaxClockSpeed=2300
Name=Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
OtherFamilyDescription=
PNPDeviceID=
PowerManagementCapabilities=
PowerManagementSupported=FALSE
ProcessorId=1FCBFBFF00050654
ProcessorType=3
Revision=21764
Role=CPU
SocketDesignation=CPU 1
Status=OK
StatusInfo=3
Stepping=
SystemCreationClassName=Win32_ComputerSystem
SystemName=WIN2012R2-X86-1
UniqueId=
UpgradeMethod=1
Version=
VoltageCaps=0

https://www.intel.com/content/www/us/en/products/sku/120485/intel-xeon-gold-6140-processor-24-75m-cache-2-30-ghz/specifications.html
https://en.wikichip.org/wiki/intel/xeon_gold/6140#google_vignette

@BradleyWood
Member

@AdamBrousseau So the CPU in question does support AVX-512, and therefore the instruction in this issue. But that output does not tell me whether it is enabled or not.

@BradleyWood
Member

@pshipton I assume this hasn't been seen since you disabled that machine. Are you going to remove the blocker tag?

@pshipton
Member Author

pshipton commented May 9, 2024

Done.

@JamesKingdon
Contributor

@pshipton I'm looking into another report of this problem. You mentioned that win2012x64-openj9-1a was a virtual machine. Would that have been on VMware?

@pshipton
Member Author

I don't know. @AdamBrousseau might.

@AdamBrousseau
Contributor

I highly doubt they are VMware, given the licensing cost/model they (VMware) have moved to. I have asked in Slack.

@JamesKingdon
Contributor

Thanks, I'll be interested to know how it was set up. We haven't had any confirmation of the cause in my case yet.

@AdamBrousseau
Contributor

That machine (the older 2012 one on classic infra) runs on the Citrix hypervisor. The newer ones on VPC run on KVM.

@JamesKingdon
Contributor

Thanks. I went searching for documentation on Citrix config, but I couldn't find anything about setting CPU features for the VM. I did find an old post about missing AVX-512 support for some processors depending on core and frequency settings, but I was a bit confused by it, as I'd expect that to result in a frequency cap rather than missing feature support.

@JamesKingdon
Contributor

JamesKingdon commented Jun 26, 2024

@BradleyWood could you double check something I'm a bit confused over? In OMR::X86::TreeEvaluator::maskLoadEvaluator ( https://github.com/eclipse/omr/blob/b5ef5eda4680b6b5cf0c2f954362f9f47353ce04/compiler/x/codegen/SIMDTreeEvaluator.cpp#L72-L92 ) we test for avx512f and if available take the body of the if statement. But if it's not available we call SIMDloadEvaluator(node, cg); which doesn't include any further tests of cpu feature flags. In the two cases where we have seen crashes avx512f has not been set, and you mentioned earlier that this seemed odd. I'm wondering if the code path to SIMDloadEvaluator with avx512f not set is a rare case that may not have been well tested, and perhaps is not valid?

@BradleyWood
Member

@BradleyWood could you double check something I'm a bit confused over? In OMR::X86::TreeEvaluator::maskLoadEvaluator ( https://github.com/eclipse/omr/blob/b5ef5eda4680b6b5cf0c2f954362f9f47353ce04/compiler/x/codegen/SIMDTreeEvaluator.cpp#L72-L92 ) we test for avx512f and if available take the body of the if statement. But if it's not available we call SIMDloadEvaluator(node, cg); which doesn't include any further tests of cpu feature flags. In the two cases where we have seen crashes avx512f has not been set, and you mentioned earlier that this seemed odd. I'm wondering if the code path to SIMDloadEvaluator with avx512f not set is a rare case that may not have been well tested, and perhaps is not valid?

That method is for loading vector masks, which I doubt are used without enabling the vector API. Are you seeing opcodes such as mload or mloadi?

SIMDloadEvaluator checks for instruction support via the call to opCode.getSIMDEncoding().
https://github.com/eclipse/omr/blob/b5ef5eda4680b6b5cf0c2f954362f9f47353ce04/compiler/x/codegen/SIMDTreeEvaluator.cpp#L151-L153
https://github.com/eclipse/omr/blob/b5ef5eda4680b6b5cf0c2f954362f9f47353ce04/compiler/x/codegen/OMRInstOpCode.hpp#L504-L627

@JamesKingdon
Contributor

I only have the compiled method body to go on, where we crash on vmovdqu32 xmm0, xmmword ptr [rax + rcx*4 + 0x88]. I thought that would come from VMOVDQU32RegMem, and SIMDloadEvaluator was the only place I could find that referenced that opcode. I guess there's a flaw in my logic somewhere :)

@JamesKingdon
Contributor

Ah, I missed calls to SIMDloadEvaluator from the amd64/codegen directory, so there are more possible paths.

@BradleyWood
Member

@JamesKingdon Well, it could come from other places too. MOVDQURegMem becomes vmovdqu32 when encoded with the EVEX (AVX-512) prefix, but you have to explicitly mark it as EVEX_128 to get the instruction above. Generally that is only done after getting the encoding prefix from calling opCode.getSIMDEncoding().

But I'm pretty sure that processor supports that instruction, so I am bewildered as to how we get an illegal instruction crash.

@JamesKingdon
Contributor

@BradleyWood
I confirmed with a testcase that the compiler is converting movdqu instructions to vmovdqu32. It looks confusing in the log as it incorrectly continues to print movdqu, but checking the hex in a disassembler confirms that the actual instruction has been replaced with vmovdqu32. The comment also looks wrong as it says MOVDQURegReg when this would appear to be MOVDQURegMem:
0x7ec0c741e4f3 0000040b [0x7ec0c5659d90] 62 f1 7e 08 6f 84 90 88 00 00 00 movdqu xmm0, xmmword ptr [rax+4*rdx+0x88] # MOVDQURegReg, SymRef <array-shadow>[#245 Shadow +136] [flags 0x80000613 0x0 ]
Could you point me at the code that makes this substitution?
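
The mismatch between the printed mnemonic and the emitted bytes can be spotted from the first byte alone. This is a small Python sketch, assuming 64-bit mode, where a leading 0x62 always starts an EVEX prefix and 0xC4/0xC5 always start a VEX prefix. The function name is hypothetical, and the legacy byte sequence is my own hand encoding of the same load for comparison, not something taken from the log.

```python
def encoding_kind(code):
    """Classify an x86-64 instruction encoding by its first byte (64-bit mode only)."""
    first = code[0]
    if first == 0x62:
        return "EVEX"
    if first in (0xC4, 0xC5):
        return "VEX"
    return "legacy"


# Bytes from the JIT log line above, printed there as "movdqu".
jit_bytes = bytes.fromhex("62 f1 7e 08 6f 84 90 88 00 00 00")

# Hand-encoded legacy SSE2 form of the same load (F3 0F 6F /r), for comparison.
legacy_bytes = bytes.fromhex("f3 0f 6f 84 90 88 00 00 00")

print(encoding_kind(jit_bytes))     # EVEX
print(encoding_kind(legacy_bytes))  # legacy
```

Both sequences share the same ModRM, SIB and displacement bytes (84 90 88 00 00 00), so only the prefix distinguishes the SSE2 instruction from the AVX-512 one.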

@JamesKingdon
Contributor

The case that triggered my interest was closed, but has just come back to life again so I need to pick this up. I was thinking that the problem was with badly configured virtualisation layers that enable avx512vl but not avx512f, but I've noticed several cases recently with that configuration that haven't been for this problem, so at the very least it's not the only factor in reproducing the issue.

@BradleyWood
Member

We have found that this crash occurs on virtual machines running on AVX-512 capable hardware where these features have been disabled. The question remains: why do we detect support for these features if the hypervisor has disabled them?

I would expect that the hypervisor modifies the behaviour of the cpuid instruction, which is how we gather this information.

@BradleyWood
Member

In all likelihood, Windows Server 2012 does not support AVX-512. I believe we did not check the ZMM OS support flag. This is likely a different issue from the one our internal customer is experiencing, @JamesKingdon.
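
The kind of OS support check referred to here can be sketched as follows. This is illustrative Python operating on raw register values, not the OMR implementation. The function name and the example inputs are hypothetical, while the bit positions are the ones documented in the Intel SDM for CPUID and XCR0.

```python
# CPUID bits.
CPUID1_ECX_OSXSAVE = 1 << 27   # CPUID.1:ECX, OS has enabled XSAVE/XGETBV
CPUID7_EBX_AVX512F = 1 << 16   # CPUID.(EAX=7,ECX=0):EBX, AVX-512 Foundation

# XCR0 state-component bits that the OS must enable for AVX-512.
XCR0_SSE = 1 << 1
XCR0_AVX = 1 << 2
XCR0_OPMASK = 1 << 5      # k0-k7 mask registers
XCR0_ZMM_HI256 = 1 << 6   # upper halves of zmm0-zmm15
XCR0_HI16_ZMM = 1 << 7    # zmm16-zmm31
XCR0_AVX512_STATE = (XCR0_SSE | XCR0_AVX | XCR0_OPMASK
                     | XCR0_ZMM_HI256 | XCR0_HI16_ZMM)


def avx512_usable(cpuid1_ecx, cpuid7_ebx, xcr0):
    """True only if the CPU reports AVX512F and the OS saves all AVX-512 state."""
    if not cpuid1_ecx & CPUID1_ECX_OSXSAVE:
        return False  # XCR0 cannot be read, so assume no OS support
    if not cpuid7_ebx & CPUID7_EBX_AVX512F:
        return False
    return (xcr0 & XCR0_AVX512_STATE) == XCR0_AVX512_STATE


# Hypothetical inputs: hardware reports AVX512F, OS enables only x87/SSE/AVX state.
print(avx512_usable(CPUID1_ECX_OSXSAVE, CPUID7_EBX_AVX512F, 0x07))  # False
# Same hardware with an OS that also enables opmask and ZMM state.
print(avx512_usable(CPUID1_ECX_OSXSAVE, CPUID7_EBX_AVX512F, 0xE7))  # True
```

As far as I understand the SDM, EVEX-encoded instructions raise an invalid-opcode fault when the opmask and ZMM bits of XCR0 are not all set, even when the operands are only xmm registers, which would be consistent with the 128-bit vmovdqu32 faulting in this issue.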
