[release/8.0-staging] Fix a set of minor DAC bugs I encountered recently #104919

github-actions · 2024-07-15T21:19:40Z

Backport of #100031 to release/8.0-staging

Customer Impact

Customer reported
Found internally

This fix addresses hangs and crashes reported by the Windbg team diagnosing Watson dumps.

Lee encountered one of these issues in testing. This is specifically the issue listed fixed in src/coreclr/utilcode/collections.cpp. I would be fine with only fixing/changing that file if it reduces the risk overhead. The other two issues I did not encounter.

A crash in the runtime (or just hitting break in a native debugger) can cause us to arbitrarily pause the process in an "inconsistent" state, where we are in the middle of updating some pointers. I ran into this particular issue where (psEntry->iNext == psSrch->iNext) multiple times while debugging a particular issue.

The problem here is particularly annoying for the debuggers calling into the DBI. WinDbg goes into an infinite loop. Visual Studio seems to hang for several minutes, but it did seem to recover in some cases and it hard-crashed in others.

We should fix this since it could potentially affect our ability to debug problems. It may also cause headaches in Watson if this is hit regularly, as putting the debugger into an infinite loop is a really bad state to be in for a service. (That team hasn't reached out, only pointing out the severity when debugging at scale.)

Regression

Yes
No

Windbg behavior is now using different debugger API's which has caused an experience regression for customers.

Testing

Manual use over the last few months of windbg has experienced fewer problems. Manual testing (by @leculver) of a small repro case has shown that this fix addresses at least one case where this failure broke the managed debugger.

Risk

Low, this affects only the DAC, and will result in avoiding infinite loops and doing error checking instead of triggering access violations.

IMPORTANT: If this backport is for a servicing release, please verify that:

The PR target branch is release/X.0-staging, not release/X.0.
If the change touches code that ships in a NuGet package, you have added the necessary package authoring and gotten it explicitly reviewed.

In src/coreclr/debug/daccess/dacdbiimpl.cpp - If g_pDebugger is NULL, produce an error, or NULL result instead of an invalid memory access. - This caused problems when debugging with a debug version of the DAC In src/coreclr/debug/di/module.cpp - Avoid attempting to get the MapAddr if the number of entries in the map is 0, as it may not yet be fully initialized and attempting to call the GetMapAddr function may cause an ASSERT in debug/checked versions of the DAC In src/coreclr/utilcode/collections.cpp - If a CHashTable is not fully initialized, or is in an intermediate state, its possible for the linked list to have an infinite cycle. Tweak the iteration here under the DAC to stop - This is known to cause stack walking to loop infinitely in the debugger when debugging a runtime which has suspended at a non-safe point

carlossanlop · 2024-07-15T22:05:50Z

Friendly reminder that today Monday July 15th is Code Complete day at 4pm PT, which is the deadline to get this included in the August Release.

hoyosjs · 2024-07-16T19:57:21Z

src/coreclr/debug/daccess/dacdbiimpl.cpp

@@ -5650,10 +5650,13 @@ void DacDbiInterfaceImpl::LookupEnCVersions(Module*          pModule,
    DebuggerJitInfo * pDJI = NULL;
    EX_TRY_ALLOW_DATATARGET_MISSING_MEMORY
    {
-        pDMI = g_pDebugger->GetOrCreateMethodInfo(pModule, mdMethod);
-        if (pDMI != NULL)
+        if (g_pDebugger != NULL)


How do we get to this state? This gets allocated in EEStartup. Is the DBI getting called super early on? Feels like this is something that shouldn't happen. Data target missing memory like the catch gets should be the main reason to not have it (or init error)

This is a good overall cleanup for defensive coding. To be clear for this PR, I did not encounter the dacdbiimpl.cpp change with a real world example. It would be fine with me to not backport this particular fix and just leave the fix shipping for v.next. It's only the collections.cpp issue that I'd like to backport since I ran into it.

@hoyosjs I hit those particular issues when debugging CLR startup scenarios. ... I spend a lot of time debugging the runtime startup path, and I regularly was hitting DAC crashes when trying to diagnose the startup path. Typically, the DAC isn't super useful at that point, but its really annoying to be unable to step through the various process creation paths.

Also, yes, the DBI is commonly called this early in the newer Windbg UI.

I withdraw my comment. :) If this happens in CLR startup I do recommend backporting this part to .net 8 as well. We deal with that often in servicing, watson and elsewhere. Even if most CLR devs will be on .Net 9, it would be good to get ahead of this issue for our long term servicing release.

tommcdon · 2024-07-17T13:28:29Z

Do we have an understanding of what the customer scenarios are affected by this issue and how this change might address them?

leculver · 2024-07-17T17:41:13Z

This was an internal find (me).

I encountered one of these issues in testing. This is specifically the issue listed fixed in src/coreclr/utilcode/collections.cpp. I would be fine with only fixing/changing that file if it reduces the risk overhead. The other two issues I did not encounter.

A crash in the runtime (or just hitting break in a native debugger) can cause us to arbitrarily pause the process in an "inconsistent" state, where we are in the middle of updating some pointers. I ran into this particular issue where (psEntry->iNext == psSrch->iNext) multiple times while debugging a particular issue.

The problem here is particularly annoying for the debuggers calling into the DBI. WinDbg goes into an infinite loop. Visual Studio seems to hang for several minutes, but it did seem to recover in some cases and it hard-crashed in others.

We should fix this since it could potentially affect our ability to debug problems. It may also cause headaches in Watson if this is hit regularly, as putting the debugger into an infinite loop is a really bad state to be in for a service. (That team hasn't reached out, only pointing out the severity when debugging at scale.)

carlossanlop · 2024-09-05T17:28:55Z

This is still missing Tactics approval.

Friendly reminder that Code Complete for the October Release is September 9. If we want this fix to be included in that release, please merge this PR before that date.

carlossanlop · 2024-10-14T17:15:09Z

Friendly reminder that today 10/14 is Code Complete for the November Release. If this fix is intended to be included in that release, please make sure to get a Tactics approval and merge it before 4pm PT. Otherwise, it will have to wait until next month.

dotnet-issue-labeler bot added the area-Diagnostics-coreclr label Jul 15, 2024

hoyosjs reviewed Jul 16, 2024

View reviewed changes

davidwrighton self-assigned this Jul 17, 2024

hoyosjs approved these changes Jul 19, 2024

View reviewed changes

tommcdon approved these changes Sep 10, 2024

View reviewed changes

mikem8361 self-assigned this Oct 8, 2024

mikem8361 added the Servicing-approved Approved for servicing release label Oct 15, 2024

davidwrighton merged commit c2c3b54 into release/8.0-staging Oct 15, 2024
110 of 116 checks passed

github-actions bot locked and limited conversation to collaborators Nov 15, 2024

jkotas deleted the backport/pr-100031-to-release/8.0-staging branch December 29, 2024 15:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[release/8.0-staging] Fix a set of minor DAC bugs I encountered recently #104919

[release/8.0-staging] Fix a set of minor DAC bugs I encountered recently #104919

github-actions bot commented Jul 15, 2024 •

edited by mikem8361

Loading

carlossanlop commented Jul 15, 2024

hoyosjs Jul 16, 2024

leculver Jul 17, 2024 •

edited

Loading

davidwrighton Jul 17, 2024

davidwrighton Jul 17, 2024

leculver Jul 17, 2024

tommcdon commented Jul 17, 2024

leculver commented Jul 17, 2024 •

edited

Loading

carlossanlop commented Sep 5, 2024

carlossanlop commented Oct 14, 2024

[release/8.0-staging] Fix a set of minor DAC bugs I encountered recently #104919

[release/8.0-staging] Fix a set of minor DAC bugs I encountered recently #104919

Conversation

github-actions bot commented Jul 15, 2024 • edited by mikem8361 Loading

Customer Impact

Regression

Testing

Risk

carlossanlop commented Jul 15, 2024

hoyosjs Jul 16, 2024

Choose a reason for hiding this comment

leculver Jul 17, 2024 • edited Loading

Choose a reason for hiding this comment

davidwrighton Jul 17, 2024

Choose a reason for hiding this comment

davidwrighton Jul 17, 2024

Choose a reason for hiding this comment

leculver Jul 17, 2024

Choose a reason for hiding this comment

tommcdon commented Jul 17, 2024

leculver commented Jul 17, 2024 • edited Loading

carlossanlop commented Sep 5, 2024

carlossanlop commented Oct 14, 2024

github-actions bot commented Jul 15, 2024 •

edited by mikem8361

Loading

leculver Jul 17, 2024 •

edited

Loading

leculver commented Jul 17, 2024 •

edited

Loading