Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] Intermittent crash accessing thread statics on Windows Arm64 #104500

Closed
jkotas opened this issue Jul 6, 2024 · 5 comments · Fixed by #104516
Closed

[NativeAOT] Intermittent crash accessing thread statics on Windows Arm64 #104500

jkotas opened this issue Jul 6, 2024 · 5 comments · Fixed by #104516
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci-optional Blocking optional rolling runs tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Milestone

Comments

@jkotas
Copy link
Member

jkotas commented Jul 6, 2024

Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=731138&view=ms.vss-test-web.build-test-results-tab&runId=18392824&resultId=183933&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

[SKIP] System.Net.Sockets.Tests.CreateSocket.CtorAndAccept_SocketNotKeptAliveViaInheritance
Process terminated. Access Violation: Attempted to read or write protected memory. This is often an indication that other memory is corrupt. The application will be terminated since this platform does not support throwing an AccessViolationException.
   at System.RuntimeExceptionHelpers.FailFast(String, Exception, String, RhFailFastReason, IntPtr, IntPtr) + 0x1e0
   at System.RuntimeExceptionHelpers.GetRuntimeException(ExceptionIDs) + 0x158
   at System.Runtime.EH.GetClasslibException(ExceptionIDs, IntPtr) + 0x34
   at System.Runtime.EH.RhThrowHwEx(UInt32, EH.ExInfo&) + 0xb8
   at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress) + 0x11c
   at System.Net.Sockets.Socket.Bind(EndPoint localEP) + 0xc8
   at System.Net.Sockets.Tests.DualModeBind.<>c.<Socket_BindV4IPEndPoint_Throws>b__0_0() + 0x78

We crash in the inlined read of PInvokeMarshal.t_lastError:

00007ff7`03d7209c f9402e41 ldr         x1,[xpr,#0x58]
00007ff7`03d720a0 d000d0c0 adrp        x0,_tls_index
00007ff7`03d720a4 9135e000 add         x0,x0,#0xD78
00007ff7`03d720a8 b9400000 ldr         w0,[x0]
00007ff7`03d720ac f8605820 ldr         x0,[x1,w0 uxtw #3]
00007ff7`03d720b0 91400021 add         x1,x1,#0,lsl #0xC
00007ff7`03d720b4 91001021 add         x1,x1,#4
00007ff7`03d720b8 8b010000 add         x0,x0,x1
00007ff7`03d720bc f9400016 ldr         x22,[x0]<------ AccessViolationException here
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 6, 2024
@jkotas jkotas added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Net untriaged New issue has not been triaged by the area owner labels Jul 6, 2024
@jkotas
Copy link
Member Author

jkotas commented Jul 6, 2024

Regression likely introduced by #104282

The most likely explanation of this crash is that the GC info is not right for the TLS access. @kunalspathak Could you please take a look?

cc @VSadov

@jkotas jkotas added the tenet-reliability Reliability/stability related issue (stress, load problems, etc.) label Jul 6, 2024
@jkotas jkotas added this to the 9.0.0 milestone Jul 6, 2024
@jkotas jkotas changed the title [NativeAOT] Intermittent crash accessing thread statics on Windows Arm63 [NativeAOT] Intermittent crash accessing thread statics on Windows Arm64 Jul 6, 2024
@VSadov
Copy link
Member

VSadov commented Jul 6, 2024

GC info is likely ok, the change is not introducing much in that sense, compared to other platforms.
@kunalspathak was looking at some possible issues with relocations though. Those are a bit different pattern from what we have seen so far.

@VSadov
Copy link
Member

VSadov commented Jul 6, 2024

May be related to.
#104441

@kunalspathak
Copy link
Member

kunalspathak commented Jul 6, 2024

May be related to. #104441

#104441 is related to superpmi tool itself and not the crash. It might be that I am missing recording gcinfo for the unique pattern I generated. Will take a look.

@kunalspathak
Copy link
Member

kunalspathak commented Jul 8, 2024

While investigating this, I realized that the relocation type that was getting said has a typo. I was using the one meant for "ldr" of adrp/ldr pair instead of the one meant for "add" for adrp/add pair. That prompted me to be related to this problem, because chances are the relocations wouldn't have happened correctly and hence the native memory address getting loaded here was not getting calculated correctly. However, with that fixed too, it didn't address this problem. Currently, I have a very limited windows/arm64 machine on which I can repro (by downloading the artifacts), but not incremental builds, etc. to test the changes. I have TTD traces for the failure, but need to look closely at why the native memory that is getting populated is wrong.

Edit: So I don't think the gcinfo is wrong as @VSadov mentioned, but something to do the offset after relocation are not getting calculated properly.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI blocking-clean-ci-optional Blocking optional rolling runs tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants