RUM-3461 feat: Send fatal App Hang after app restart #1759

@ncreated ncreated commented Apr 3, 2024

What and why?

📦 This PR adds the logic of sending RUM errors for fatal App Hangs. It completes the previous work done in #1751 where fatal hangs were tracked, but not yet sent.

🎁 Because the conditions for uploading fatal App Hangs to the previous RUM session are very similar to the flow of sending RUM errors for app crashes, this PR also refactors existing concepts so that both kinds of fatal error can reuse them. That includes:

  • solving RUM-3115 by dropping RUMCrashEvent and setting error.threads, error.meta, error.binary_images and error.was_truncated directly in RUMErrorEvent;
  • introducing FatalErrorBuilder that is shared between fatal App Hangs handler and Crash receiver.
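
The idea of one builder serving both the hang handler and the crash receiver can be illustrated with a minimal sketch. Every name below (`FatalErrorBuilderSketch`, `ErrorEventSketch`, their fields) is hypothetical and chosen for illustration; only the `crash` and `hang` cases mirror what this PR shows, and the real FatalErrorBuilder API is not reproduced here.

```swift
import Foundation

// Hypothetical sketch: all type and property names are illustrative,
// not the SDK's actual FatalErrorBuilder API.
enum FatalErrorKindSketch: Equatable {
    case crash // an app crash
    case hang  // a fatal App Hang
}

struct ErrorEventSketch: Equatable {
    let kind: FatalErrorKindSketch
    let message: String
    let stack: String
    let timestampMs: Int64
    let sessionID: String // the previous RUM session the error belongs to
}

struct FatalErrorBuilderSketch {
    let kind: FatalErrorKindSketch
    let message: String
    let stack: String
    let date: Date
    let previousSessionID: String

    /// Builds the error event in one place, regardless of the fatal error kind.
    func buildErrorEvent() -> ErrorEventSketch {
        ErrorEventSketch(
            kind: kind,
            message: message,
            stack: stack,
            timestampMs: Int64(date.timeIntervalSince1970 * 1_000),
            sessionID: previousSessionID
        )
    }
}

// Both callers go through the same builder and differ only in their inputs.
let fatalDate = Date(timeIntervalSince1970: 1_700_000_000)
let hangError = FatalErrorBuilderSketch(
    kind: .hang, message: "App Hang", stack: "<hang stack>",
    date: fatalDate, previousSessionID: "session-1"
).buildErrorEvent()
let crashError = FatalErrorBuilderSketch(
    kind: .crash, message: "SIGABRT", stack: "<crash stack>",
    date: fatalDate, previousSessionID: "session-1"
).buildErrorEvent()
```

Keeping event construction in a single type means a change to the error payload only has to be made once for both fatal error kinds.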

To limit scope, the proposed implementation considers only the main scenario, where the hang happens in an existing RUM session with an active RUM view. It ignores the "no view" and "no foreground session" situations that are handled in crash reporting (through RUMOffViewEventsHandlingRule). There is a significant opportunity to reuse the same logic through FatalErrorBuilder, but it must be preceded by additional refactoring, notably model sharing between crash and fatal hang contexts in DatadogInternal. That work is split out into RUM-3840.

How?

When the app is restarted with "pending hang" information:

  • if the "pending hang" was recorded less than 4h ago → send a RUM error + RUM view update to the previous RUM session
  • if the "pending hang" was recorded more than 4h ago → send only a RUM error to the previous RUM session
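
The two branches above can be sketched as a small decision function. The names (`PendingHangAction`, `action(forHangRecordedAt:now:)`, `viewUpdateThreshold`) are hypothetical; only the 4h threshold comes from the description, and the behaviour at exactly 4h is an assumption because the description does not specify it.

```swift
import Foundation

// Hypothetical names; only the 4h threshold is taken from the PR description.
enum PendingHangAction: Equatable {
    case sendErrorAndViewUpdate // hang recorded less than 4h ago
    case sendErrorOnly          // hang recorded more than 4h ago
}

let viewUpdateThreshold: TimeInterval = 4 * 60 * 60

/// Decides what to send for a pending hang found after app restart.
/// Assumption: a hang recorded exactly 4h ago is treated as "error only".
func action(forHangRecordedAt hangDate: Date, now: Date) -> PendingHangAction {
    let elapsed = now.timeIntervalSince(hangDate)
    return elapsed < viewUpdateThreshold ? .sendErrorAndViewUpdate : .sendErrorOnly
}

let restartDate = Date(timeIntervalSince1970: 1_700_000_000)
let recentHang = action(forHangRecordedAt: restartDate.addingTimeInterval(-3_600), now: restartDate)
let oldHang = action(forHangRecordedAt: restartDate.addingTimeInterval(-5 * 3_600), now: restartDate)
```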

This is achieved in a similar fashion to crash reporting. An extra context (tracking consent + NTP offset + last RUM view) is written to DataStore on "hang start". It is deleted on "hang end" or "hang cancelled". If it is found after app restart, the RUM error and RUM view are constructed with FatalErrorBuilder. For code reuse, the FatalErrorBuilder is also used for building RUM errors that transport crash reports.
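
A minimal sketch of that write / delete / read-after-restart lifecycle follows. It uses an in-memory store standing in for DataStore; `PendingHangContextSketch`, `KeyValueStoreSketch`, `PendingHangTrackerSketch` and the `"pending-hang"` key are all hypothetical names, and the field types are assumptions.

```swift
import Foundation

// Hypothetical model of the context described above (field types are assumptions).
struct PendingHangContextSketch: Codable, Equatable {
    let hangStart: Date
    let trackingConsent: String
    let serverTimeOffset: TimeInterval // NTP offset
    let lastViewID: String             // stands in for the last RUM view
}

// Hypothetical stand-in for the SDK's DataStore.
protocol KeyValueStoreSketch: AnyObject {
    func set(_ data: Data, forKey key: String)
    func data(forKey key: String) -> Data?
    func removeValue(forKey key: String)
}

final class InMemoryStoreSketch: KeyValueStoreSketch {
    private var storage: [String: Data] = [:]
    func set(_ data: Data, forKey key: String) { storage[key] = data }
    func data(forKey key: String) -> Data? { storage[key] }
    func removeValue(forKey key: String) { storage[key] = nil }
}

struct PendingHangTrackerSketch {
    static let key = "pending-hang"
    let store: KeyValueStoreSketch

    /// "hang start": persist the context in case the process gets killed.
    func hangStarted(_ context: PendingHangContextSketch) throws {
        store.set(try JSONEncoder().encode(context), forKey: Self.key)
    }

    /// "hang end" or "hang cancelled": the hang was not fatal, so drop the context.
    func hangEndedOrCancelled() {
        store.removeValue(forKey: Self.key)
    }

    /// After restart: a context that is still present means the previous process
    /// died while hanging. It is removed once read so it is reported only once.
    func consumePendingHang() -> PendingHangContextSketch? {
        guard let data = store.data(forKey: Self.key) else { return nil }
        store.removeValue(forKey: Self.key)
        return try? JSONDecoder().decode(PendingHangContextSketch.self, from: data)
    }
}

let sketchStore = InMemoryStoreSketch()
let sketchTracker = PendingHangTrackerSketch(store: sketchStore)
let sketchContext = PendingHangContextSketch(
    hangStart: Date(timeIntervalSince1970: 1_700_000_000),
    trackingConsent: "granted",
    serverTimeOffset: 0.5,
    lastViewID: "view-1"
)
```

The context is written before the outcome of the hang is known, because a process that is killed while hanging never gets a chance to persist anything afterwards.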

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Core, RUM, Trace, Logs, CR and WVT
  • Run unit tests for Session Replay
  • Run integration tests
  • Run smoke tests
  • Run tests for tools/

@datadog-datadog-prod-us1

datadog-datadog-prod-us1 bot commented Apr 3, 2024

Datadog Report

Branch report: ncreated/RUM-3461/send-fatal-app-hangs
Commit report: 52141f0
Test service: dd-sdk-ios

✅ 0 Failed, 2919 Passed, 0 Skipped, 11m 22.99s Wall Time
🔻 Test Sessions change in coverage: 10 decreased, 4 increased

🔻 Code Coverage Decreases vs Default Branch (10)

This report shows up to 5 code coverage decreases.

  • test DatadogInternalTests tvOS 78.88% (-0.84%)
  • test DatadogInternalTests iOS 78.93% (-0.83%)
  • test DatadogTraceTests tvOS 49.06% (-0.68%)
  • test DatadogTraceTests iOS 49% (-0.63%)
  • test DatadogLogsTests tvOS 45.38% (-0.6%)

Base automatically changed from ncreated/RUM-3461/fatal-app-hangs-tracking to ncreated/RUM-3461/fatal-app-hangs April 5, 2024 08:21
@ncreated ncreated force-pushed the ncreated/RUM-3461/send-fatal-app-hangs branch from 0243a15 to b954910 Compare April 5, 2024 08:24
@ncreated ncreated self-assigned this Apr 5, 2024
@ncreated ncreated marked this pull request as ready for review April 5, 2024 08:47
@ncreated ncreated requested review from a team as code owners April 5, 2024 08:47
@ncreated ncreated force-pushed the ncreated/RUM-3461/send-fatal-app-hangs branch from b954910 to 66a117d Compare April 5, 2024 08:53
Comment on lines 11 to -12
@testable import DatadogCrashReporting
@testable import DatadogCore
Member Author

💡 It makes a significant step towards moving CrashReportReceiverTest to RUM tests target, instead of maintaining it in Core tests. What is left is the dependency on DatadogCrashReporting.CrashContext mock. This will be resolved once all CR models are moved to DatadogInternal 👍.

Member

@maxep maxep left a comment

Looks great, well done 👏 👏

Contributor

@ganeshnj ganeshnj left a comment

Some questions, one of them critical on the privacy side.

Reads well 🎉

XCTAssertNotNil(sentRUMError.additionalAttributes?[DDError.meta], "It must contain crash details")
XCTAssertNotNil(sentRUMError.additionalAttributes?[DDError.wasTruncated], "It must contain crash details")
XCTAssertEqual(sentRUMError.model.error.sourceType, .ios, "Must send .ios as the sourceType")
XCTAssertEqual(sentRUMError.dd.session?.plan, .plan1, "All RUM events should use RUM Lite plan")
Contributor

nice cleanup 🧹

/// The crash context
let context: CrashContext
}

private struct CrashReport: Decodable {
Contributor

no longer needed?

Member Author

Yes 👍. This was the CrashReport structure sent from DatadogCrashReporting to DatadogLogs over the message bus. Because we moved DDCrashReport to DatadogInternal in #1687, we no longer need to encode and decode it over the bus.

In other words, DatadogCrashReporting.CrashReportSender sends DatadogInternal.DDCrashReport, so here in DatadogLogs we can receive exactly the same type. This PR also adds PassthroughCodable conformance to DDCrashReport, so we can bypass serialization when passing it over the message bus, which improves performance compared to decoding the removed struct.
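
The general idea of bypassing serialization for types shared by sender and receiver can be sketched as follows. This is not how PassthroughCodable is implemented (its implementation is not shown in this thread); `PassthroughSketch`, `BusPayloadSketch` and `MessageBusSketch` are hypothetical names used only to illustrate the trade-off.

```swift
import Foundation

// Hypothetical marker protocol: conforming types are handed over as-is.
protocol PassthroughSketch {}

enum BusPayloadSketch {
    case value(Any)    // passed through without serialization
    case encoded(Data) // serialized, must be decoded by the receiver
}

enum MessageBusSketch {
    /// Packs a value for the bus, skipping encoding for passthrough types.
    static func pack<T: Encodable>(_ value: T) throws -> BusPayloadSketch {
        if value is PassthroughSketch {
            return .value(value)
        }
        return .encoded(try JSONEncoder().encode(value))
    }

    /// Unpacks a payload; only the `.encoded` case pays the decoding cost.
    static func unpack<T: Decodable>(_ payload: BusPayloadSketch, as type: T.Type) throws -> T? {
        switch payload {
        case .value(let any):
            return any as? T
        case .encoded(let data):
            return try JSONDecoder().decode(T.self, from: data)
        }
    }
}

// A type both modules know (passthrough) versus one that is serialized.
struct SharedReportSketch: Codable, Equatable, PassthroughSketch { let message: String }
struct PlainReportSketch: Codable, Equatable { let message: String }

let sharedPayload = try MessageBusSketch.pack(SharedReportSketch(message: "crash"))
let plainPayload = try MessageBusSketch.pack(PlainReportSketch(message: "crash"))
```

Passing the value through only works when sender and receiver agree on the exact type, which is why moving the model into a shared module is the prerequisite.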

/// A crash with given metadata information.
case crash
/// A fatal App Hang.
case hang
Contributor

How is a hang fatal?

Is this a recoverable hang, or did the customer quit after the hang? Or do we not know?

Member Author

"Fatal" means a hang that was followed by app termination (process kill). The termination is the aspect that differentiates non-fatal and fatal hangs. See:

Currently we don't differentiate between termination caused by the user force-quitting a hung app and termination by the OS watchdog due to a prolonged hang. We should be able to distinguish these once watchdog terminations are implemented, as we will have to track "user force quits" to recognise OOM errors.

Contributor

Thanks for the links.

monitor.stop()

// When
featureScope.contextMock.trackingConsent = .mockRandom() // no matter of the consent in restarted session
Contributor

From a privacy point of view, shouldn't tracking consent have to be granted both before and after?

We might attach extra information to old events (recorded while consent was granted) during a subsequent run of the app in which consent is not granted.

Contributor

Thinking more on this, can we assert that the event we recorded and the event we sent are the same, with nothing added afterwards?

Member Author

From a privacy point of view, shouldn't tracking consent have to be granted both before and after?

What we do here is the same as how it works in Crash Reporting, so it follows earlier product decisions.

We might attach extra information to old events (recorded while consent was granted) during a subsequent run of the app in which consent is not granted.

Thinking more on this, can we assert that the event we recorded and the event we sent are the same, with nothing added afterwards?

This is a fair call 👍. Today no such extra information is added, but I agree with the principle of testing this. As discussed on Slack, we could introduce DDAssertDif() to expect changes only on certain values. I did a spike on adding it in this PR, but it failed because of the extra complexity of the Encodable values carried in RUM events, which are not compatible with the basic reflection comparison that we implement today. It turns out to be a more complex task, so I am detaching it to a separate JIRA: RUM-4011.
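
One possible shape for such an assertion is sketched below. It deliberately swaps the reflection comparison for a different technique: both values are encoded to JSON and compared path by path. This is an assumption-laden illustration, not the approach tracked in RUM-4011; `flattenSketch`, `changedPathsSketch` and the sample event types are hypothetical.

```swift
import Foundation

/// Flattens a JSON object into "key.path" -> value description pairs.
/// Limitation of this sketch: empty dictionaries and arrays produce no entries.
func flattenSketch(_ json: Any, prefix: String = "") -> [String: String] {
    switch json {
    case let dict as [String: Any]:
        var result: [String: String] = [:]
        for (key, value) in dict {
            let path = prefix.isEmpty ? key : "\(prefix).\(key)"
            result.merge(flattenSketch(value, prefix: path)) { _, new in new }
        }
        return result
    case let array as [Any]:
        var result: [String: String] = [:]
        for (index, value) in array.enumerated() {
            result.merge(flattenSketch(value, prefix: "\(prefix)[\(index)]")) { _, new in new }
        }
        return result
    default:
        return [prefix: "\(json)"]
    }
}

/// Returns the key paths whose encoded values differ between `old` and `new`.
func changedPathsSketch<T: Encodable>(from old: T, to new: T) throws -> Set<String> {
    let encoder = JSONEncoder()
    let lhs = flattenSketch(try JSONSerialization.jsonObject(with: encoder.encode(old)))
    let rhs = flattenSketch(try JSONSerialization.jsonObject(with: encoder.encode(new)))
    return Set(lhs.keys).union(rhs.keys).filter { lhs[$0] != rhs[$0] }
}

// Hypothetical event types, standing in for recorded and sent RUM events.
struct ViewSketch: Encodable { var id: String; var crashCount: Int }
struct EventSketch: Encodable { var view: ViewSketch; var message: String }

let recordedEvent = EventSketch(view: ViewSketch(id: "v1", crashCount: 0), message: "hang")
var sentEvent = recordedEvent
sentEvent.view.crashCount = 1

// A test would then allow changes only on an explicit list of paths.
let changedPaths = try changedPathsSketch(from: recordedEvent, to: sentEvent)
let onlyExpectedChanges = changedPaths.isSubset(of: ["view.crashCount"])
```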

Comment on lines +190 to +193
[
"No pending App Hang found", // from hanged process
"Sending fatal App hang as RUM error with issuing RUM view update", // from next process
]
Contributor

🚀 nice assertions.

@ncreated ncreated requested a review from ganeshnj April 10, 2024 10:58
@ncreated ncreated merged commit 5f0f664 into ncreated/RUM-3461/fatal-app-hangs Apr 10, 2024
10 checks passed
@ncreated ncreated deleted the ncreated/RUM-3461/send-fatal-app-hangs branch April 10, 2024 12:19