Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Native SDK #664

Open
tustanivsky opened this issue Oct 23, 2024 · 11 comments
Open

Update Native SDK #664

tustanivsky opened this issue Oct 23, 2024 · 11 comments

Comments

@tustanivsky
Copy link
Collaborator

Building the Sentry Native SDK is an essential part of Unreal Engine plugin CI workflow which is required to support automatic crash capturing on Linux.

Starting from Native SDK version 0.7.7 we've faced issues with updating the sentry-native submodule caused by multiple build errors in CI (workflow file, build script). Specifically, the upgrade of mini_chromium (one of the crashpad's dependencies) now requires C++20 and that causes compatibility issues with older toolchains used in Unreal Engine and our CI environment.

The attempt to address the above problem was made in scope of #639. Switching to newer clang-13 and libc++-16 in CI fixed the Native SDK build and even allowed us to pass the test checks for UE 5.3/5.4 however it still fails for UE 5.2 (and probably for older versions too) due to mismatched versions of GLIBC and GLIBCXX. This could mean that Docker containers provided by Epic Games which we're using for testing don’t meet Unreal Engine's toolchain requirements. We've already been in a somewhat similar situation in #173 that made us stick to the older Ubuntu-20.04 version to produce Native SDK build artifacts.

Possible alternative workarounds:

  • Consider using breakpad instead of crashpad (not ideal)
  • Investigate if there's a proper way to hook into the engine's error-handling flow to send crashes to Sentry

PR in UE SDK which's also related to this subject: #657

Related issue in Native SDK: getsentry/sentry-native#1041.

@supervacuus
Copy link

If I can add anything to this topic, I would primarily reframe the questions from "workaround" (i.e., accepting something immutable that one must adapt to) to something that invites a long-term perspective on handling these issues in the Unreal SDK. The only immutable thing is the support range for UE versions (and their supported toolchains).

When I suggest breakpad instead of crashpad, that isn't meant as a short-term workaround to the build issue du-jour but as a perspective change on a maintenance burden.

crashpad is a continuously moving target written by Google (a company embracing and contributing to "modern" C++ standards) for chromium (i.e., they write it primarily for themselves, which means they have complete build control over clients). They have zero incentive to maintain long-term compatibility with older toolchains (crashpad also doesn't support gcc while we maintain compatibility in our fork), and their primary goal is runtime reach.

breakpad, on the other hand, hasn't seen any significant updates to its client in over 2 years. When they modernized the code base, they mainly used C++11 features like explicitly defaulted/deleted functions.

So, suppose you want to provide long-term support for "old" toolchains like clang and libc++ 11.0.1 (libc++ being the signifier here because older versions of libstdc++ work fine with the recent mini_chromium since we also build with clang on Ubuntu in CI). In that case, it might be better not to use a backend like crashpad that will continuously adopt the latest standards. Even if you drop support for major 4 and lower 5 (which I think shouldn't be dictated by a dependency), there will soon be a point where Google introduces C++23 features into crashpad, and the search for a "workaround" starts anew.

Given that context, you must ask yourself what is "not ideal" about breakpad, to be willing to accept this compatibility game of catch-up. Immediate error reporting? Support for specific ELF module structures? Same backend across configurations? And if you are willing to commit, would you also be willing to (co-)maintain a version of mini_chromium compatible with the supported toolchain you need to serve (because Unreal targets would be the primary driver to do so)?

And I've only picked this particular trade-off for a question where there should be a goal and commitment behind it, but there are others:

  • the question of whether the test docker containers provide too low a target (vs. documented Unreal requirements) and hinder adoption. Specifically, where there is a difference between having linker-supplied static dependencies from the Unreal build and dynamic dependencies from the OS loader when the crashpad_handler starts. Building an executable with libc++ 16 will break deployment targets with much older standard libraries, but what do you test with that? The crashpad_handler, as long as it isn't self-contained, is entirely independent of UE targets and more bound to what your customers have on their deployment targets.
  • the question of how an additional executable (like the crashpad_handler) should even be built, where you currently are in a situation where you build one part of your Linux package in a GHA workflow and another on the developers machine (and the latter is from then on self-contained vs. the former relies on dependencies at deployment).
  • a story and timeline for Unreal version support because build-breakage shouldn't be the driver and efforts to maintain compatibility should be weighed against support sunsets.
  • whether you need the Native SDK at all or whether its backend should be adapted (which was raised in the context of an Unreal-specific backend for the Native SDK). I cannot say a lot about this, but you will see the same issues with any artifact that you have to build against these constraints; nothing is particularly sentry-native specific (except for using crashpad as a backend, by driving both the toolchain requirements and having another executable to package).

@bruno-garcia
Copy link
Member

bruno-garcia commented Oct 27, 2024

Investigate if there's a proper way to hook into the engine's error-handling flow to send crashes to Sentry

Agree this is worth looking at. If anything we'll have the same level of support for crash handling than folks are used to with the built-in crash reporter.

even allowed us to pass the test checks for UE 5.3/5.4 however it still fails for UE 5.2

How much adoption does 5.2 have? Could we keep 5.2 and lower behind, on older version of sentry-native and offer only security patches and bug fixes? It was released a year and a half ago so I imagine some studios are probably still working on a games on this versions which are yet to be released, if that's the case, sounds like leaving it behind is a bad idea?

crashpad also doesn't support gcc while we maintain compatibility in our fork

We could say "no more gcc support" or "use breakpad instead" but we decided to maintain a fork. I imagine the impact of dropping gcc support is large, and the effort to maintain the fork is relatively small (how often do we need to touch this?)
I see this as a valid answer on the Unreal Engine case too, but we need to have a better understanding of the impact and effort.
How many times did crashpad bump break the Unreal SDK? Sounds like it was 2, in 2 years? Do we expect having to deal with this once a year, and it's a reasonable effort to fix things, while dropping crashpad is a very significant step back wrt support for our users? We need to have a good understanding of those things to make the final call.

Consider using breakpad instead of crashpad (not ideal)

Do we have a clear understanding of the trade-offs of breakpad vs crashpad? We need that to be able to weight in the effort to maintain this or not. Out of process crash dumps are more reliable, is what I heard. But do we have details on this? What else is relevant? Numbers or examples would help.

breakpad, on the other hand, hasn't seen any significant updates to its client in over 2 years. When they modernized the code base, they mainly used C++11 features like explicitly defaulted/deleted functions.

This does make it interesting for us to invest on it. That said, there's a reason why Google dropped this project and invested in crashpad. So we need to understand clearly what we're getting ourselves into if we're going all in on breakpad.

Google introduces C++23 features into crashpad, and the search for a "workaround" starts anew.

That's OK if the product quality we get out of this is significantly better, and we don't have a better option. We'll keep on this game. This is what we're here for. We deal with this crap so our customers don't have to, and they can get the best quality crash reports.

Given that context, you must ask yourself what is "not ideal" about breakpad, to be willing to accept this compatibility game of catch-up. Immediate error reporting? Support for specific ELF module structures? Same backend across configurations? And if you are willing to commit, would you also be willing to (co-)maintain a version of mini_chromium compatible with the supported toolchain you need to serve (because Unreal targets would be the primary driver to do so)?

Agree this is what we need to agree on. It's very possible we'll commit to maintain this, and get the proper resources to do so. Not saying we will, I'm just saying we need to look in detail at the trade offs, and the effort involved.

a story and timeline for Unreal version support because build-breakage shouldn't be the driver and efforts to maintain compatibility should be weighed against support sunsets.

Right, but also the impact on our customers. We value backward compatibility and will go at great lengths to keep that. Support for stuff is only dropped if the effort to maintain is high and the impact on customers is reasonably low. If the effort is very high but impact on customers is not irrelevant, please raise a flag. We can possibly bring in more help to take on some of that work so we continue to support it.

Building an executable with libc++ 16 will break deployment targets with much older standard libraries, but what do you test with that? The crashpad_handler, as long as it isn't self-contained, is entirely independent of UE targets and more bound to what your customers have on their deployment targets.

This is a good point. Can we test that? Is this more of a Linux issue on some distro's (like Amazon Linux)? Could we add a test matrix that tests against the main distros?
Do we know how large of a problem this is or it's more of a hypothetical one (other than issues we had and already fixed)?

@bruno-garcia
Copy link
Member

Lets make sure we talk to some customers too, coincidently just heard this about breakpad:
Image

@PlasmaDev5
Copy link
Collaborator

This is actually something i spoke about on discord last week. If we keep running into issues with crashpad then it makes sense to look for alternatives else we are endlessly waiting for the next issue. When talking about it on discord it was mentioned that if we drop crashpad and breakpad for Unreal then we no longer need native all. This opens up some smaller gains in my eyes such as a simplified and more standard Unreal plugin structure. I believe it would also solve the marketplace CLI executable issues iv seen mentioned. It would also improve compatibility as the crash reporter is not something that changes often. In theory we may be able to open support as far back as we desire as i imagine the API being the same or at least close enough to ifdef it.

I did spend some time looking into the Unreal crash reporting.
From what i could find they have both an in-process and out of process code path and looks like they use both on desktop depending on the nature of the issue. It looks to already handle complex GPU hangs and other useful behaviors of that nature. For the most part it all runs via FGenericPlatformStackWalk . The docs also suggest using Crash Report Client as a starting point to build your own crash reporting tools.

Overall i just feel making use of the Unreal crash just fits better into the goals and expectations of game developers as we always aim to use the most simplified. It will likely need a bit more investigation and planning before this approach is taken.

@bruno-garcia
Copy link
Member

bruno-garcia commented Oct 28, 2024

From what i could find they have both an in-process and out of process code path and looks like they use both on desktop depending on the nature of the issue.

Could you link some code here for reference?

Overall i just feel making use of the Unreal crash just fits better into the goals and expectations of game developers as we always aim to use the most simplified. It will likely need a bit more investigation and planning before this approach is taken.

That will depend on practical terms what we're getting and losing.

looks to already handle complex GPU hangs

This is something that game devs on Unity or private engines could benefit too. So another route is to double down on sentry-native and add support for that in there

@PlasmaDev5
Copy link
Collaborator

Could you link some code here for reference?

So the Code is a in a few places so not practical to link everything else we end up with an essay of links but i can link the 2 key points.
Crash Report Client
https://github.com/EpicGames/UnrealEngine/tree/release/Engine/Source/Programs/CrashReportClient

Root for core level abstractions. This includes the stackwalker, crash context and more
https://github.com/EpicGames/UnrealEngine/tree/release/Engine/Source/Runtime/Core/Public/GenericPlatform

This is something that game devs on Unity or private engines could benefit too. So another route is to double down on sentry-native and add support for that in there

This would likely need to be unique per engine as i believe we would have to build into the renderer itself to attached to the device creation based on the graphics API and everything being used.

@tustanivsky
Copy link
Collaborator Author

Regarding dropping sentry-native entirely:

While this approach might seem appealing from a maintenance perspective, it's important to note that currently Native SDK provides much more than just crash reporting. It's integral for features like message/event and user feedback capturing, Release Health and Performance monitoring, etc. Replicating this functionality purely through Unreal's systems would require significant development effort and time.

On UE version support:

Getting some analytics data can give us a better understanding which engine versions are most commonly adopted among the plugin users and help to inform the decision regarding our UE version support strategy. We're required to support the three latest engine versions according to the UE Marketplace guidelines but it's unclear whether we can/should aim for more?

The most recent experiments with updating sentry-native to 0.7.10 allowed to "fix" errors mentioned in the original message. Using clang-13 for the build should preserve Native SDK compatibility with UE 5.0.3 and newer versions. While this needs additional testing to verify, it could provide good coverage for our user base.

The discovery about Epic's Docker image for UE 5.2 (and older) running Ubuntu 18.04 and thus missing some libs required by crashpad is notable but manageable. We still can address this by updating the CI setup to handle these prerequisites and documenting the additional package requirements.

Can we test that? Is this more of a Linux issue on some distro's (like Amazon Linux)? Could we add a test matrix that tests against the main distros?

Getting access to large GitHub runners finally gives us the opportunity to package sample project in CI and run smoke tests in different Linux environments (i.e. Amazon Linux 2/2023, etc.)

Consider using breakpad instead of crashpad (not ideal)

Our previous attempts to switch to breakpad weren't welcomed by some users (#374, #380). As far as I recall there was a major concern that sometimes they couldn't persist crash info until the next game launch when breakpad is supposed to pick that up and send it to Sentry.

@bruno-garcia
Copy link
Member

Getting some analytics data can give us a better understanding which engine versions are most commonly adopted

One thing to note here is that analytics data will show us the current usage. While this will be very useful, it's only telling us about games that launched, and that are being played/live. But not necessarly being patched/fixed. One proxy for this is comparing the UE version with our SDK version, if folks are on newer versions of our SDK we can assume they still actively developing those titles.

That said, there are still customers (I talked to some) who are building a title they plan to launch only next year, still on UE 4.27.
Since titles often take years to build, we need to be conservative here on what versions we support.

@bruno-garcia
Copy link
Member

Regarding Sentry Native's backends (crashpad, breakpad and inproc) X using Unreal Engine's crash reporting code:
Detailed info about Sentry Native's crash handling strategies on this and this. We need a similar consideration for the UE crash handling code. Ideally combined with what context each is adding (CPU, memory info? tags? etc)

@tustanivsky
Copy link
Collaborator Author

UPD: I've been investigating what are the limits of different Unreal Engine versions in terms of their compatibility with the latest Native SDK and despite the constraints we have to build against according to Epic Games docs things work as expected after applying the changes suggested in #667. Extending the list of supported UE versions by adding 4.27, 5.0 and 5.1 to our CI pipeline (#675) didn't reveal any new issues in this regard either. Additionally, I've managed to successfully run some tests with the Amazon Linux 2023 docker image (related to #635).

While the linked PR should allow us to move forward with the crashpad (at least for now) it still makes sense to keep this discussion open and continue evaluating other backend options.

@bruno-garcia
Copy link
Member

I had a chat with @mitsuhiko who brought up some good points about stackwalking on device (either in-process or perhaps out of process akin to crashpad) as well as creating a minidump, could be the best route in general. Even outside Unreal.
Either way we'd want to take an approach at the shared-SDK level, e.g: sentry-native or perhaps a Rust-built component that shares the stackwalking behavior we have server side already in Sentry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Status: No status
Development

No branches or pull requests

5 participants