RUM-3461 feat: Send fatal App Hang after app restart #1759

ncreated · 2024-04-03T16:42:34Z

What and why?

📦 This PR adds the logic of sending RUM errors for fatal App Hangs. It completes the previous work done in #1751 where fatal hangs were tracked, but not yet sent.

🎁 Because the conditions for uploading fatal App Hangs to the previous RUM session are very similar to the flow of sending RUM errors for app crashes, this PR brings additional refactoring to reuse existing concepts for both kinds of fatal errors. That includes:

solving RUM-3115 by dropping RUMCrashEvent and setting error.threads, error.meta, error.binary_images and error.was_truncated directly in RUMErrorEvent;
introducing FatalErrorBuilder, which is shared between the fatal App Hangs handler and the crash receiver.

To limit scope, the proposed implementation considers only the main scenario, where a hang happens in an existing RUM session with an active RUM view. It ignores the complexity of the "no view" and "no foreground session" situations that are handled in crash reporting (through RUMOffViewEventsHandlingRule). There is a significant opportunity to reuse the same logic through FatalErrorBuilder, but it must be preceded by additional refactoring, notably model sharing between crash and fatal hang contexts in DatadogInternal. That work is deferred to RUM-3840.

How?

When the app is restarted with "pending hang" information:

if the "pending hang" was recorded less than 4h ago → send RUM error + RUM view to the previous RUM session
if the "pending hang" was recorded more than 4h ago → send only the RUM error to the previous RUM session

This is achieved in a similar fashion to crash reporting. An extra context (tracking consent + NTP offset + last RUM view) is written to DataStore on "hang start". It is deleted on "hang end" or "hang cancelled". If found after app restart, the RUM error and RUM view are constructed with FatalErrorBuilder. For code reuse, the FatalErrorBuilder is also used for building RUM errors that transport crash reports.
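
For readers following along, here is a minimal sketch of the restart decision described above. It is not the SDK's implementation: `PendingHangContext`, `FatalHangUpload` and `uploadDecision(for:now:)` are hypothetical names, tracking consent and the NTP offset are left out, and treating a hang of exactly 4h as "error only" is an assumption, since the description does not cover that boundary.

```swift
import Foundation

// Hypothetical stand-in for the context persisted on "hang start".
// The real context also carries tracking consent and the NTP offset (omitted here).
struct PendingHangContext {
    let hangStartDate: Date
    let lastViewID: String
}

// What gets sent to the previous RUM session after restart.
enum FatalHangUpload: Equatable {
    case nothing
    case errorOnly
    case errorAndViewUpdate
}

// 4 hours, as stated in the description.
let viewUpdateThreshold: TimeInterval = 4 * 60 * 60

func uploadDecision(for pending: PendingHangContext?, now: Date) -> FatalHangUpload {
    // No pending hang found: the previous process did not end while hanging.
    guard let pending = pending else {
        return .nothing
    }
    let age = now.timeIntervalSince(pending.hangStartDate)
    // Assumption: an age of exactly 4h falls on the "error only" side.
    return age < viewUpdateThreshold ? .errorAndViewUpdate : .errorOnly
}
```

Under these assumptions, a hang recorded one hour before restart yields `.errorAndViewUpdate`, and one recorded five hours before yields `.errorOnly`.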

Review checklist

Feature or bugfix MUST have appropriate tests (unit, integration)
Make sure each commit and the PR mention the Issue number or JIRA reference
Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

Run unit tests for Core, RUM, Trace, Logs, CR and WVT
Run unit tests for Session Replay
Run integration tests
Run smoke tests
Run tests for tools/

datadog-datadog-prod-us1 · 2024-04-03T16:51:51Z

Datadog Report

Branch report: ncreated/RUM-3461/send-fatal-app-hangs
Commit report: 52141f0
Test service: dd-sdk-ios

✅ 0 Failed, 2919 Passed, 0 Skipped, 11m 22.99s Wall Time
🔻 Test Sessions change in coverage: 10 decreased, 4 increased

🔻 Code Coverage Decreases vs Default Branch (10)

This report shows up to 5 code coverage decreases.

test DatadogInternalTests tvOS 78.88% (-0.84%) - Details
test DatadogInternalTests iOS 78.93% (-0.83%) - Details
test DatadogTraceTests tvOS 49.06% (-0.68%) - Details
test DatadogTraceTests iOS 49% (-0.63%) - Details
test DatadogLogsTests tvOS 45.38% (-0.6%) - Details

ncreated · 2024-04-05T10:26:31Z

DatadogCore/Tests/Datadog/RUM/Integrations/CrashReportReceiverTests.swift

 @testable import DatadogCrashReporting
-@testable import DatadogCore


💡 It makes a significant step towards moving CrashReportReceiverTests to the RUM tests target, instead of maintaining it in Core tests. What is left is the dependency on the DatadogCrashReporting.CrashContext mock. This will be resolved once all CR models are moved to DatadogInternal 👍.

maxep

Looks great, well done 👏 👏

ganeshnj

Some questions, one critical on the privacy side.

Reads well 🎉

ganeshnj · 2024-04-09T13:37:15Z

DatadogCore/Tests/Datadog/RUM/Integrations/CrashReportReceiverTests.swift

-            XCTAssertNotNil(sentRUMError.additionalAttributes?[DDError.meta], "It must contain crash details")
-            XCTAssertNotNil(sentRUMError.additionalAttributes?[DDError.wasTruncated], "It must contain crash details")
-            XCTAssertEqual(sentRUMError.model.error.sourceType, .ios, "Must send .ios as the sourceType")
+            XCTAssertEqual(sentRUMError.dd.session?.plan, .plan1, "All RUM events should use RUM Lite plan")


nice cleanup 🧹

ganeshnj · 2024-04-09T13:38:11Z

DatadogLogs/Sources/Feature/MessageReceivers.swift

        /// The crash context
        let context: CrashContext
    }

-    private struct CrashReport: Decodable {


❓
no longer needed?

Yes 👍. This was the CrashReport structure sent from DatadogCrashReporting to DatadogLogs over the message bus. Because we moved DDCrashReport to DatadogInternal in #1687, we no longer need to encode and decode it over the bus.

In other words, the DatadogCrashReporting.CrashReportSender sends DatadogInternal.DDCrashReport, so here in DatadogLogs we can receive exactly the same type. This PR also adds PassthroughCodable conformance to DDCrashReport, so we can bypass serialization when passing it over the message bus - which improves performance compared to decoding the removed struct.
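
To illustrate the idea of bypassing serialization, here is a rough sketch that contrasts an encoded payload with a value passed through as-is. It does not show the SDK's message bus or the actual PassthroughCodable mechanism: `Report`, `BusMessage` and `receive(_:as:)` are hypothetical names made up for this example.

```swift
import Foundation

// Hypothetical report model shared by sender and receiver.
struct Report: Codable, Equatable {
    let kind: String
    let message: String
}

// Hypothetical bus message: either serialized bytes or the original value.
enum BusMessage {
    case encoded(Data)     // sender encoded the value, receiver must decode it
    case passthrough(Any)  // sender handed over the value itself
}

func receive<T: Decodable>(_ message: BusMessage, as type: T.Type) -> T? {
    switch message {
    case .encoded(let data):
        // Serialization path: costs an encode on one side and a decode on the other.
        return try? JSONDecoder().decode(T.self, from: data)
    case .passthrough(let value):
        // Passthrough path: a dynamic cast, with no encoding or decoding.
        return value as? T
    }
}
```

When both sides share the same type, as is the case once the model lives in a common module, the passthrough path returns the original value and the receiver no longer needs its own decodable copy of the struct.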

ganeshnj · 2024-04-09T13:45:16Z

DatadogRUM/Sources/FatalErrorBuilder.swift

+        /// A crash with given metadata information.
+        case crash
+        /// A fatal App Hang.
+        case hang


❓
How is a hang fatal?

Is this a recoverable hang, or did the customer quit after this hang? Or do we not know?

"Fatal" means a hang that was followed by app termination (process kill). The termination is the aspect that differentiates non-fatal and fatal hangs. See:

diagrams for non-fatal hangs tracking

diagram for fatal hangs tracking

Currently we don't differentiate whether the termination was due to the user force quitting a hung app or the app being killed by the OS watchdog due to a prolonged hang. We should be able to distinguish these when watchdog terminations are implemented - we will have to track "user force quits" to recognise OOM errors.
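
As a minimal sketch of that distinction: a record is persisted when a hang starts and removed when it ends or is cancelled, so a record that survives until the next launch marks a hang followed by termination. `PendingHangStore` and `HangTracker` are hypothetical names, the store is an in-memory stand-in for storage that survives restarts, and clearing the record once it is read is an assumption.

```swift
import Foundation

// Hypothetical in-memory stand-in for a store that outlives the process.
final class PendingHangStore {
    private var pendingHangStart: Date?

    func save(_ date: Date) { pendingHangStart = date }
    func clear() { pendingHangStart = nil }
    func load() -> Date? { pendingHangStart }
}

final class HangTracker {
    private let store: PendingHangStore

    init(store: PendingHangStore) {
        self.store = store
    }

    // Persist the record as soon as the hang starts.
    func hangStarted(at date: Date) { store.save(date) }

    // The hang ended or was cancelled, so it was not fatal: drop the record.
    func hangEnded() { store.clear() }
    func hangCancelled() { store.clear() }

    // Called on the next launch. A surviving record means the previous
    // process was terminated while hanging.
    // Assumption: the record is removed once read, so it is reported once.
    func consumeFatalHang() -> Date? {
        let pending = store.load()
        store.clear()
        return pending
    }
}
```

A tracker created after a "process kill" that shares the same store finds the record, while one created after `hangEnded()` finds nothing.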

Thanks for the links.

ganeshnj · 2024-04-09T13:58:55Z

DatadogRUM/Tests/Instrumentation/AppHangs/AppHangsMonitorTests.swift

+        monitor.stop()
+
+        // When
+        featureScope.contextMock.trackingConsent = .mockRandom() // no matter of the consent in restarted session


❓
From a privacy point of view, shouldn't tracking consent have to be granted both before and after?

There could be extra information we might attach to old events (recorded when consent was granted) in the subsequent run of the app, where consent is not granted.

Thinking more on this, can we assert that the event we recorded and the event we sent are the same, with nothing added afterwards?

From a privacy point of view, shouldn't tracking consent have to be granted both before and after?

What we do here is the same as how it works in Crash Reporting, so it follows earlier product decisions.

There could be extra information we might attach to old events (recorded when consent was granted) in the subsequent run of the app, where consent is not granted.

Thinking more on this, can we assert that the event we recorded and the event we sent are the same, with nothing added afterwards?

This is a fair call 👍. Today there is no such extra information added, but I agree with the principle of testing this. As discussed on Slack, we could introduce DDAssertDif() to expect changes only on certain values. I did a spike on adding it in this PR, but it failed because of the extra complexity of encoding Encodable values as part of RUM events, which is not compatible with the basic reflection-based comparison that we implement today. It turns out to be a more complex task, so I am detaching it to a separate JIRA: RUM-4011.
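
For illustration only, one possible direction for "expect changes only on certain values" is to compare the JSON-encoded form of two events and list the key paths that differ. This is not what the PR or RUM-4011 implements, it swaps reflection for JSON comparison, and `flatten(_:prefix:)` and `changedKeyPaths(from:to:)` are hypothetical helpers.

```swift
import Foundation

// Flattens a JSON tree into "a.b[0].c" key paths mapped to leaf descriptions.
func flatten(_ value: Any, prefix: String = "") -> [String: String] {
    var result: [String: String] = [:]
    if let object = value as? [String: Any] {
        for (key, child) in object {
            let path = prefix.isEmpty ? key : "\(prefix).\(key)"
            result.merge(flatten(child, prefix: path)) { _, new in new }
        }
    } else if let array = value as? [Any] {
        for (index, child) in array.enumerated() {
            result.merge(flatten(child, prefix: "\(prefix)[\(index)]")) { _, new in new }
        }
    } else {
        // Leaf value (string, number, bool or null), compared by description.
        result[prefix] = "\(value)"
    }
    return result
}

// Returns the key paths whose values differ between the two encoded values,
// including paths present on only one side.
func changedKeyPaths<T: Encodable>(from old: T, to new: T) throws -> Set<String> {
    let encoder = JSONEncoder()
    let oldJSON = try JSONSerialization.jsonObject(with: encoder.encode(old))
    let newJSON = try JSONSerialization.jsonObject(with: encoder.encode(new))
    let oldFlat = flatten(oldJSON)
    let newFlat = flatten(newJSON)
    let allPaths = Set(oldFlat.keys).union(newFlat.keys)
    return allPaths.filter { oldFlat[$0] != newFlat[$0] }
}
```

A test could then assert that the returned set equals the expected key paths, so that any attribute added in the restarted session shows up as an unexpected path.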

ganeshnj · 2024-04-09T13:59:47Z

DatadogRUM/Tests/Instrumentation/AppHangs/AppHangsMonitorTests.swift

+            [
+                "No pending App Hang found", // from hanged process
+                "Sending fatal App hang as RUM error with issuing RUM view update", // from next process
+            ]


🚀 nice assertions.

Base automatically changed from ncreated/RUM-3461/fatal-app-hangs-tracking to ncreated/RUM-3461/fatal-app-hangs April 5, 2024 08:21

ncreated force-pushed the ncreated/RUM-3461/send-fatal-app-hangs branch from 0243a15 to b954910 Compare April 5, 2024 08:24

ncreated self-assigned this Apr 5, 2024

ncreated marked this pull request as ready for review April 5, 2024 08:47

ncreated requested review from a team as code owners April 5, 2024 08:47

ncreated mentioned this pull request Apr 5, 2024

RUM-3461 feat: Fatal App Hangs monitoring #1763

Merged

8 tasks

ncreated added 2 commits April 5, 2024 10:53

RUM-3461 Send fatal App Hang after app restart

4edfde2

RUM-3461 Test data and conditions behind App Hang uploads

66a117d

ncreated force-pushed the ncreated/RUM-3461/send-fatal-app-hangs branch from b954910 to 66a117d Compare April 5, 2024 08:53

ncreated added 2 commits April 5, 2024 10:59

RUM-3461 Lint

9515f36

RUM-3461 Fix rebase issues

379314d

ncreated commented Apr 5, 2024

maxep approved these changes Apr 9, 2024

ganeshnj reviewed Apr 9, 2024

ncreated requested a review from ganeshnj April 10, 2024 10:58

ganeshnj approved these changes Apr 10, 2024

ncreated merged commit 5f0f664 into ncreated/RUM-3461/fatal-app-hangs Apr 10, 2024
10 checks passed

ncreated deleted the ncreated/RUM-3461/send-fatal-app-hangs branch April 10, 2024 12:19
