Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release/8.0-staging] Restore erroneously removed encoding of the argument count in a generic method instantiation #98825

Merged
merged 1 commit into from
Mar 6, 2024

Conversation

github-actions[bot]
Copy link
Contributor

@github-actions github-actions bot commented Feb 22, 2024

Backport of #98731 to release/8.0-staging

/cc @AaronRobinsonMSFT @kg

Customer Impact

  • Customer reported
  • Found internally

Any users of the multicore jit feature would encounter silent encoding failures while recording profiles due to a defect in our encoding for generic method instances that would make them impossible to decode later. Those silent failures would then produce usually-but-not-always silent failures at runtime when performing multicore jitting of methods from the profiles.

The former failures are by their nature silent because we were forgetting to encode some important information, and no validation of the output is performed at encoding time.

The latter failures are silent because we intentionally suppress exceptions during profile playback, (presumably) in an attempt to ensure that a partially or wholly outdated profile (i.e. from an older build of the software during development, where one or more of the assemblies have changed but not all of them) will not prevent the application from starting, and the outdated profile will just not provide as much performance benefit to the user. This exception suppression was not removed as a part of the fix since doing so would imply a significant risk of application crashes for the end user.

The exception suppression is unable to suppress all the potential failures that can be triggered by this encoding error, and in the rare circumstance where the wrongly-encoded bytes were arranged just so, the application would terminate due to an assertion.

#81799 is likely caused by this issue.

Regression

  • Yes
  • No

The regression was caused by an erroneous removal of a write in the generic method instantiation encoder; we removed it because we misread an if-condition such that it looked like we only wrote the count when IBC was active, so it was removed along with IBC support.

PR where regression was introduced: #68717

Testing

Manual review and CI. Still looking into ways to verify it directly and will reach out to affected customers to see if they can test a build from main.

  • It is possible to manually verify the change by inspecting the profile output in a hex editor (I will do this)
  • It is not possible to verify this change using existing profiles, as existing profiles are corrupt due to the existing defect
  • A profile from the fixed code and a profile from the old code are almost completely indistinguishable at runtime, as the impact of the latter is usually just that a smaller set of methods are jitted. Observing this would require running a special build of the runtime with some logging flags enabled; I don't know the actual process for doing this, but I can find out if we need that to happen before we can approve this backport. It could take a very large number of profile recording attempts to reproduce this issue in isolation were it not fixed by this commit.

Risk

[High/Medium/Low. Justify the indication by mentioning how risks were measured and addressed.]
Low risk. The affected system is rarely used, and (for the exception handling reasons mentioned earlier) any potential negative impact from a defect in this change will not meaningfully worsen the experience for users over the previous state of the code. We are restoring the original intended function of this code instead of adding new code/behavior.

IMPORTANT: If this backport is for a servicing release, please verify that:

  • The PR target branch is release/X.0-staging, not release/X.0.

  • If the change touches code that ships in a NuGet package, you have added the necessary package authoring and gotten it explicitly reviewed.

@AaronRobinsonMSFT
Copy link
Member

@kg Can you fill out the rest of this issue? I'm not sure how this was discovered or how it manifests.

@AaronRobinsonMSFT AaronRobinsonMSFT added Servicing-consider Issue for next servicing release review area-VM-coreclr labels Feb 22, 2024
@AaronRobinsonMSFT AaronRobinsonMSFT added this to the 8.0.x milestone Feb 22, 2024
@kg
Copy link
Member

kg commented Feb 22, 2024

I can't manually verify this with a hex editor, but I don't understand multicore jit well enough to be convinced that my repro is working. And none of the tests from our test suite work in my development environment, they complain about missing files (I'm not sure if they are just outdated?). So there is a gap in validation there, but I am pretty confident that this is the right fix.

Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved. we will take for consideration in 8.0.x

@leecow leecow removed the Servicing-consider Issue for next servicing release review label Feb 27, 2024
@leecow leecow modified the milestones: 8.0.x, 8.0.4 Feb 27, 2024
@jeffschwMSFT jeffschwMSFT added the Servicing-approved Approved for servicing release label Feb 27, 2024
@carlossanlop
Copy link
Member

Friendly reminder that Monday March 11th is the Code Complete date for the April Release. If you want this change to be included in that release, please make sure to merge this before that date.

@kg
Copy link
Member

kg commented Mar 6, 2024

Friendly reminder that Monday March 11th is the Code Complete date for the April Release. If you want this change to be included in that release, please make sure to merge this before that date.

Thank you for the reminder. I'm not experienced with this kind of backport, at this point do I just hit the button? "we will take for consideration in 8.0.x" made me think it was waiting on a final approval

@carlossanlop
Copy link
Member

For future reference, the servicing requirements are described in this document: https://github.com/dotnet/runtime/blob/release/8.0-staging/docs/project/library-servicing.md

Which your PR already meets: sign-off from other area owners, approval from Tactics (confirmed with the addition of the servicing-approved label), CI is green. If you don't need to make any additional changes, then yes, you can hit the squash & merge button. 😊

@kg kg merged commit e398dd6 into release/8.0-staging Mar 6, 2024
110 of 118 checks passed
@jkotas jkotas deleted the backport/pr-98731-to-release/8.0-staging branch March 15, 2024 23:36
@github-actions github-actions bot locked and limited conversation to collaborators Apr 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-VM-coreclr Servicing-approved Approved for servicing release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants