Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/8.0-staging] Fix a set of minor DAC bugs I encountered recently #104919

Merged

Conversation

github-actions[bot]
Copy link
Contributor

@github-actions github-actions bot commented Jul 15, 2024

Backport of #100031 to release/8.0-staging

/cc @davidwrighton

Customer Impact

  • Customer reported
  • Found internally

This fix addresses hangs and crashes reported by the Windbg team diagnosing Watson dumps.

Lee encountered one of these issues in testing. This is specifically the issue listed fixed in src/coreclr/utilcode/collections.cpp. I would be fine with only fixing/changing that file if it reduces the risk overhead. The other two issues I did not encounter.

A crash in the runtime (or just hitting break in a native debugger) can cause us to arbitrarily pause the process in an "inconsistent" state, where we are in the middle of updating some pointers. I ran into this particular issue where (psEntry->iNext == psSrch->iNext) multiple times while debugging a particular issue.

The problem here is particularly annoying for the debuggers calling into the DBI. WinDbg goes into an infinite loop. Visual Studio seems to hang for several minutes, but it did seem to recover in some cases and it hard-crashed in others.

We should fix this since it could potentially affect our ability to debug problems. It may also cause headaches in Watson if this is hit regularly, as putting the debugger into an infinite loop is a really bad state to be in for a service. (That team hasn't reached out, only pointing out the severity when debugging at scale.)

Regression

  • Yes
  • No

Windbg behavior is now using different debugger API's which has caused an experience regression for customers.

Testing

Manual use over the last few months of windbg has experienced fewer problems. Manual testing (by @leculver) of a small repro case has shown that this fix addresses at least one case where this failure broke the managed debugger.

Risk

Low, this affects only the DAC, and will result in avoiding infinite loops and doing error checking instead of triggering access violations.

IMPORTANT: If this backport is for a servicing release, please verify that:

  • The PR target branch is release/X.0-staging, not release/X.0.

  • If the change touches code that ships in a NuGet package, you have added the necessary package authoring and gotten it explicitly reviewed.

In src/coreclr/debug/daccess/dacdbiimpl.cpp
- If g_pDebugger is NULL, produce an error, or NULL result instead of an invalid memory access.
- This caused problems when debugging with a debug version of the DAC

In src/coreclr/debug/di/module.cpp
- Avoid attempting to get the MapAddr if the number of entries in the map is 0, as it may not yet be fully initialized and attempting to call the GetMapAddr function may cause an ASSERT in debug/checked versions of the DAC

In src/coreclr/utilcode/collections.cpp
- If a CHashTable is not fully initialized, or is in an intermediate state, its possible for the linked list to have an infinite cycle. Tweak the iteration here under the DAC to stop
  - This is  known to cause stack walking to loop infinitely in the debugger when debugging a runtime which has suspended at a non-safe point
@carlossanlop
Copy link
Member

Friendly reminder that today Monday July 15th is Code Complete day at 4pm PT, which is the deadline to get this included in the August Release.

@@ -5650,10 +5650,13 @@ void DacDbiInterfaceImpl::LookupEnCVersions(Module* pModule,
DebuggerJitInfo * pDJI = NULL;
EX_TRY_ALLOW_DATATARGET_MISSING_MEMORY
{
pDMI = g_pDebugger->GetOrCreateMethodInfo(pModule, mdMethod);
if (pDMI != NULL)
if (g_pDebugger != NULL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we get to this state? This gets allocated in EEStartup. Is the DBI getting called super early on? Feels like this is something that shouldn't happen. Data target missing memory like the catch gets should be the main reason to not have it (or init error)

Copy link
Contributor

@leculver leculver Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good overall cleanup for defensive coding. To be clear for this PR, I did not encounter the dacdbiimpl.cpp change with a real world example. It would be fine with me to not backport this particular fix and just leave the fix shipping for v.next. It's only the collections.cpp issue that I'd like to backport since I ran into it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hoyosjs I hit those particular issues when debugging CLR startup scenarios. ... I spend a lot of time debugging the runtime startup path, and I regularly was hitting DAC crashes when trying to diagnose the startup path. Typically, the DAC isn't super useful at that point, but its really annoying to be unable to step through the various process creation paths.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, yes, the DBI is commonly called this early in the newer Windbg UI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I withdraw my comment. :) If this happens in CLR startup I do recommend backporting this part to .net 8 as well. We deal with that often in servicing, watson and elsewhere. Even if most CLR devs will be on .Net 9, it would be good to get ahead of this issue for our long term servicing release.

@tommcdon
Copy link
Member

Do we have an understanding of what the customer scenarios are affected by this issue and how this change might address them?

@davidwrighton davidwrighton self-assigned this Jul 17, 2024
@leculver
Copy link
Contributor

leculver commented Jul 17, 2024

This was an internal find (me).

I encountered one of these issues in testing. This is specifically the issue listed fixed in src/coreclr/utilcode/collections.cpp. I would be fine with only fixing/changing that file if it reduces the risk overhead. The other two issues I did not encounter.

A crash in the runtime (or just hitting break in a native debugger) can cause us to arbitrarily pause the process in an "inconsistent" state, where we are in the middle of updating some pointers. I ran into this particular issue where (psEntry->iNext == psSrch->iNext) multiple times while debugging a particular issue.

The problem here is particularly annoying for the debuggers calling into the DBI. WinDbg goes into an infinite loop. Visual Studio seems to hang for several minutes, but it did seem to recover in some cases and it hard-crashed in others.

We should fix this since it could potentially affect our ability to debug problems. It may also cause headaches in Watson if this is hit regularly, as putting the debugger into an infinite loop is a really bad state to be in for a service. (That team hasn't reached out, only pointing out the severity when debugging at scale.)

@carlossanlop
Copy link
Member

This is still missing Tactics approval.

Friendly reminder that Code Complete for the October Release is September 9. If we want this fix to be included in that release, please merge this PR before that date.

@mikem8361 mikem8361 self-assigned this Oct 8, 2024
@carlossanlop
Copy link
Member

Friendly reminder that today 10/14 is Code Complete for the November Release. If this fix is intended to be included in that release, please make sure to get a Tactics approval and merge it before 4pm PT. Otherwise, it will have to wait until next month.

@mikem8361 mikem8361 added the Servicing-approved Approved for servicing release label Oct 15, 2024
@davidwrighton davidwrighton merged commit c2c3b54 into release/8.0-staging Oct 15, 2024
110 of 116 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Nov 15, 2024
@jkotas jkotas deleted the backport/pr-100031-to-release/8.0-staging branch December 29, 2024 15:53
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants