RUM-3461 feat: track Fatal App Hangs #1751

ncreated · 2024-03-27T09:18:21Z

What and why?

📦 This PR brings the main part of Fatal App Hangs tracking. It tracks hangs from start to end or cancellation, persisting the fatal error information to be considered after app restart.

🚧 Sending the RUM error and the RUM view update will be added in a following PR, to keep the code review scoped and lean.

How?

Fatal App Hangs tracking has a lot of similarity with Crash tracking. Notably, the context of the fatal error must be persisted in the failed process so it is available after restart (that includes: last RUM view, last RUM session info, current NTP offset and more).

To embrace this similarity and enable later code reuse, this PR introduces FatalErrorContextNotifier, which talks to both DatadogCrashReporting (over the message bus) and RUM's AppHangsMonitor. It is meant to be extended further for later watchdog terminations tracking. This way all fatal errors (crashes, fatal hangs and OOMs) will use the same logic for uploading them to the previous RUM session (covering all edge cases like fail in background vs BET enabled/disabled, fail with no RUM session started, restart more than 4h after the issue, etc.).
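A minimal sketch of the idea, not the PR's implementation: one notifier owns the latest fatal error context and forwards every change to each interested receiver (for example, one that bridges to crash reporting and one for App Hangs). All names here (`FatalErrorContext`, `FatalErrorContextReceiver`, `ContextNotifierSketch`, `RecordingReceiver`) and the two context fields are hypothetical.

```swift
/// Hypothetical context snapshot. The PR mentions the last RUM view,
/// session info, NTP offset and more; two fields are enough for a sketch.
struct FatalErrorContext: Equatable {
    var lastViewID: String?
    var sessionID: String?
}

/// Anything that needs the latest context persisted or forwarded.
protocol FatalErrorContextReceiver: AnyObject {
    func contextDidChange(_ context: FatalErrorContext)
}

/// Single source of truth for the context; fans out each change.
final class ContextNotifierSketch {
    private var receivers: [FatalErrorContextReceiver] = []
    private(set) var context = FatalErrorContext()

    func add(_ receiver: FatalErrorContextReceiver) {
        receivers.append(receiver)
    }

    /// Applies a change and notifies every receiver with the new value.
    func update(_ change: (inout FatalErrorContext) -> Void) {
        change(&context)
        receivers.forEach { $0.contextDidChange(context) }
    }
}

/// Test double that remembers the last context it was given.
final class RecordingReceiver: FatalErrorContextReceiver {
    private(set) var last: FatalErrorContext?
    func contextDidChange(_ context: FatalErrorContext) { last = context }
}
```

Adding a new kind of fatal error (such as watchdog terminations) would then mean adding one more receiver rather than duplicating the context tracking.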

The overall architecture introduced in this PR can be illustrated as follows:

Review checklist

Feature or bugfix MUST have appropriate tests (unit, integration)
Make sure each commit and the PR mention the Issue number or JIRA reference
Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

Run unit tests for Core, RUM, Trace, Logs, CR and WVT
Run unit tests for Session Replay
Run integration tests
Run smoke tests
Run tests for tools/

datadog-datadog-prod-us1 · 2024-03-27T09:25:47Z

Datadog Report

Branch report: ncreated/RUM-3461/fatal-app-hangs-tracking
Commit report: eb272fd
Test service: dd-sdk-ios

✅ 0 Failed, 2919 Passed, 0 Skipped, 12m 42.45s Wall Time
🔻 Test Sessions change in coverage: 10 decreased, 4 increased

🔻 Code Coverage Decreases vs Default Branch (10)

This report shows up to 5 code coverage decreases.

test DatadogInternalTests iOS 78.95% (-0.8%)
test DatadogInternalTests tvOS 78.93% (-0.78%)
test DatadogTraceTests tvOS 49.11% (-0.63%)
test DatadogTraceTests iOS 49.06% (-0.57%)
test DatadogLogsTests iOS 45.37% (-0.57%)

ncreated · 2024-03-28T10:14:05Z

DatadogRUM/Sources/Feature/RUMDataStore.swift

+/// RUM interface for data store.
+///
+/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization.
+/// Serialization errors are logged to telemetry.
+internal struct RUMDataStore {


💡 This adds JSON-coding convenience for interacting with Data Store. It leverages SRP to separate concerns by abstracting error handling and failure reporting. It will be vastly leveraged for watchdog terminations, where the number of stored values will grow.

I like that.
I did manual encoding/decoding in my PR using DataStore.

IMO worth adding similar kind of codable support extension as public. It's going to be very common use case.

cc @maxep

That's a fair call, we shall see when we add more use cases. It is important to let the feature decide the serialization IMHO, but we could provide an extension for JSON in the same fashion as proposed here!

👍 Sounds good. Let's move on with scope.rumDataStore in RUM. I can follow with refactoring PR to introduce scope.jsonDataStore in DatadogInternal.
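To make the discussion concrete, here is a rough sketch of what a JSON-coding convenience over a key-value store could look like. It is not the code from this PR: `InMemoryStore`, `JSONDataStore`, the `Key` case and the `onError` closure are all hypothetical, and the store is simplified to a synchronous in-memory dictionary.

```swift
import Foundation

/// Hypothetical minimal key-value store standing in for the SDK's data store.
final class InMemoryStore {
    private var storage: [String: Data] = [:]
    func setValue(_ data: Data, forKey key: String) { storage[key] = data }
    func value(forKey key: String) -> Data? { storage[key] }
}

/// Illustrative JSON wrapper: typed keys, Codable values, errors routed to one place.
struct JSONDataStore {
    /// Type-safe keys, so call sites cannot misspell a raw string.
    enum Key: String {
        case fatalAppHang = "fatal-app-hang"
    }

    let store: InMemoryStore
    /// Receives any serialization error (the PR's doc comment says these go to telemetry).
    let onError: (Error) -> Void

    func setValue<V: Encodable>(_ value: V, forKey key: Key) {
        do {
            store.setValue(try JSONEncoder().encode(value), forKey: key.rawValue)
        } catch {
            onError(error)
        }
    }

    func value<V: Decodable>(forKey key: Key, as type: V.Type = V.self) -> V? {
        guard let data = store.value(forKey: key.rawValue) else { return nil }
        do {
            return try JSONDecoder().decode(V.self, from: data)
        } catch {
            onError(error) // report the failure, then treat the value as absent
            return nil
        }
    }
}
```

Callers only deal with `Codable` values and typed keys; encoding, decoding and failure reporting stay in one place, which is the separation of concerns the comment above refers to.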

ncreated · 2024-03-28T10:16:16Z

DatadogRUM/Sources/Instrumentation/AppHangs/AppHangsMonitor.swift

+    /// Handles non-fatal App Hangs.
+    internal let nonFatalHangsHandler: NonFatalAppHangsHandler
+    /// Handles fatal App Hangs.
+    internal let fatalHangsHandler: FatalAppHangsHandler


💡 The previous non-fatal App Hang logic moved to NonFatalAppHangsHandler. Fatal hangs are implemented in a separate handler. This separates concerns, as the two will evolve in significantly different ways (with the fatal handler reaching significantly higher complexity).

ncreated · 2024-03-28T10:17:10Z

DatadogRUM/Sources/Instrumentation/AppHangs/FatalAppHangsHandler.swift

+        // TODO: RUM-3461
+        // Similar to how we send Crash report in `CrashReportReceiver`:
+        // - construct RUM error from `fatalHang.hang` information
+        // - update `error.count` in `fatalHang.lastRUMView`


💡 This will come in a following PR. This PR is merged against the feature branch.

ncreated · 2024-03-28T10:29:18Z

DatadogRUM/Sources/Instrumentation/AppHangs/ProcessIdentifier.swift

+/// Example use case in fatal App Hangs tracking:
+/// - SDK started → RUM enabled → [hang occurs] → pending App Hang saved → SDK stopped → SDK started again → RUM enabled again → pending App Hang loaded
+/// - When restarting RUM, the `processID` check ensures dropping pending hang from the previous instance, preventing false "fatal" hang detection.
+internal let currentProcessID = UUID()


💡 In fact, this is to achieve both:

cover the edge case of a pending hang being written before stopping the SDK and being considered "fatal" when restarting it within the same process;

add a safety net for limiting false-positive fatal hangs (by only reporting if their tracked processID is different from currentProcessID).
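The check described above could be sketched roughly as follows. `PendingAppHang` and `isFatal` are hypothetical names used for illustration only; the one thing taken from the diff is a `currentProcessID` that is a `UUID` created once per process.

```swift
import Foundation

/// Created once when the process starts; a new launch yields a different value.
let currentProcessID = UUID()

/// Hypothetical shape of the record persisted when a hang starts.
struct PendingAppHang: Codable {
    let processID: UUID
    let message: String
}

/// A pending hang is treated as fatal only if it was written by another process,
/// i.e. the process that wrote it never got the chance to clear it.
func isFatal(_ pending: PendingAppHang, currentProcessID: UUID) -> Bool {
    pending.processID != currentProcessID
}
```

A record written and loaded within the same process (SDK stopped and started again) carries the same `processID` and is therefore dropped instead of being reported.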

maciejburda

Gave it a general overview. Looks good. Well commented and separated!

maciejburda · 2024-03-28T16:07:16Z

DatadogRUM/Sources/Feature/RUMDataStore.swift

+/// RUM interface for data store.
+///
+/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization.
+/// Serialization errors are logged to telemetry.
+internal struct RUMDataStore {


I like that.
I did manual encoding/decoding in my PR using DataStore.

IMO worth adding similar kind of codable support extension as public. It's going to be very common use case.

cc @maxep

maxep

Looks great! I left some comments regarding locks, LMWYT!

maxep · 2024-04-02T12:48:27Z

DatadogRUM/Sources/Feature/RUMDataStore.swift

+/// RUM interface for data store.
+///
+/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization.
+/// Serialization errors are logged to telemetry.
+internal struct RUMDataStore {


That's a fair call, we shall see when we add more use cases. It is important to let the feature decide the serialization IMHO, but we could provide an extension for JSON in the same fashion as proposed here!

maxep · 2024-04-02T12:50:51Z

DatadogRUM/Sources/Instrumentation/AppHangs/AppHangsWatchdogThread.swift

@@ -7,7 +7,22 @@
 import Foundation
 import DatadogInternal

-internal final class AppHangsWatchdogThread: Thread {
+internal protocol AppHangsObservingThread: AnyObject {


/question Why constraining to AnyObject?

This is to enable mutability for callbacks:

    var onHangStarted: ((AppHang) -> Void)? { set get }
    var onHangCancelled: ((AppHang) -> Void)? { set get }
    var onHangEnded: ((AppHang, TimeInterval) -> Void)? { set get }
    var onBeforeSleep: (() -> Void)? { set get }

Alternatively, we can manage the thread reference as var, not let:

-    private let watchdogThread: AppHangsObservingThread
+    private var watchdogThread: AppHangsObservingThread

AnyObject is no longer required after switch to delegate pattern.
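For readers less familiar with this Swift rule, a small sketch of the point made in this thread: with a class-bound protocol, a settable property can be assigned through a `let` reference, because only the referenced object is mutated. The types below (`HangObserving`, `WatchdogStub`, `Monitor`) are hypothetical, and the hang is modelled as a plain `String`.

```swift
/// Class-bound protocol: every conforming type is a reference type.
protocol HangObserving: AnyObject {
    var onHangStarted: ((String) -> Void)? { get set }
}

final class WatchdogStub: HangObserving {
    var onHangStarted: ((String) -> Void)?
}

final class Monitor {
    // `let` is enough here: the reference stays constant, the object is mutated.
    // Without `: AnyObject` on the protocol, the assignment in `init` would be
    // rejected, because the value could be a struct behind a constant.
    private let thread: HangObserving
    private(set) var started: [String] = []

    init(thread: HangObserving) {
        self.thread = thread
        thread.onHangStarted = { [weak self] hang in
            self?.started.append(hang)
        }
    }
}
```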

maxep · 2024-04-02T13:08:27Z

DatadogRUM/Sources/Instrumentation/AppHangs/AppHangsWatchdogThread.swift

    @ReadWriteLock
    private var mainThreadID: ThreadID? = nil
    /// Telemetry interface.
    private let telemetry: Telemetry
-    /// Closure to be notified when App Hang ends. It will be executed on the watchdog thread.
+    /// Closure to be notified when App Hang starts.
+    /// It is executed on the watchdog thread.
+    @ReadWriteLock
+    internal var onHangStarted: ((AppHang) -> Void)?
+    /// Closure to be notified when App Hang gets cancelled due to possible false-positive.
+    /// It is executed on the watchdog thread.
+    @ReadWriteLock
+    internal var onHangCancelled: ((AppHang) -> Void)?
+    /// Closure to be notified when App Hang ends. It passes the hang and its duration.
+    /// It is executed on the watchdog thread.
    @ReadWriteLock
-    internal var onHangEnded: ((AppHang) -> Void)?
+    internal var onHangEnded: ((AppHang, TimeInterval) -> Void)?
    /// A block called after this thread finished its pass and will become idle.
    @ReadWriteLock
    internal var onBeforeSleep: (() -> Void)?


/suggestion That makes 5 locks :/ The property wrapper might not fit very well here, unless we use a companion object with a single lock.

Another approach would be to create another read-write lock object that is not a property wrapper and that we can use to lock multiple accesses.

There is only one write to each onHang* lock during SDK startup on the main thread. Later, SDK only reads them from the watchdog (background) thread.

I could combine these callbacks to Observer struct, e.g.:

    thread.hangObserver = .init(
        onStarted: { _ in },
        onCancelled: { _ in },
        onEnded: { _, _ in }
    )

Alternatively, we can leverage delegate model (thread.delegate = self). WDYT @maxep ?

The good old delegate pattern :) It is quite appropriate in this case 👍 Do you mind trying out?

Sure, I changed to delegate 👍. It is weak var which implies "atomic", so no locks required.
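A sketch of the delegate shape agreed on in this thread, under assumed names: `HangObserverDelegate`, `WatchdogSketch` and `MonitorSketch` are hypothetical, and the hang is again a plain `String`. It only shows how three separate closures collapse into one weak reference; it does not demonstrate or verify the threading behaviour mentioned above.

```swift
import Foundation

/// One protocol replaces the three `onHang*` closures.
protocol HangObserverDelegate: AnyObject {
    func hangStarted(_ hang: String)
    func hangCancelled(_ hang: String)
    func hangEnded(_ hang: String, duration: TimeInterval)
}

final class WatchdogSketch {
    /// Weak, so the watchdog does not keep its owner alive.
    weak var delegate: HangObserverDelegate?

    /// Stand-in for one pass of the watchdog loop that detected a hang.
    func simulateHang(_ hang: String, duration: TimeInterval) {
        delegate?.hangStarted(hang)
        delegate?.hangEnded(hang, duration: duration)
    }
}

final class MonitorSketch: HangObserverDelegate {
    private(set) var events: [String] = []

    func hangStarted(_ hang: String) { events.append("started \(hang)") }
    func hangCancelled(_ hang: String) { events.append("cancelled \(hang)") }
    func hangEnded(_ hang: String, duration: TimeInterval) { events.append("ended \(hang)") }
}
```

The owner sets itself once (`watchdog.delegate = monitor`), so there is a single property to assign instead of one per callback.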

DatadogRUM/Sources/RUMMonitor/Scopes/FatalErrorContextNotifier.swift

maxep

Looks great, well done!

ncreated self-assigned this Mar 27, 2024

Base automatically changed from ncreated/RUM-3461/refactor-rum-to-depend-on-feature-scope to develop March 28, 2024 07:52

RUM-3461 Track App Hang "start", "end" and "cancellation"

3af24b5

ncreated force-pushed the ncreated/RUM-3461/fatal-app-hangs-tracking branch from 23e4924 to 0d85c4b Compare March 28, 2024 09:04

ncreated changed the base branch from develop to ncreated/RUM-3461/fatal-app-hangs March 28, 2024 09:05

ncreated changed the title ~~RUM-3461 feat: Fatal App Hangs monitoring~~ RUM-3461 feat: track Fatal App Hangs between Mar 28, 2024

RUM-3461 Write fatal App Hang to RUM data store, read upon restart

77a811a

ncreated force-pushed the ncreated/RUM-3461/fatal-app-hangs-tracking branch from 0d85c4b to 77a811a Compare March 28, 2024 09:13

ncreated changed the title ~~RUM-3461 feat: track Fatal App Hangs between~~ RUM-3461 feat: track Fatal App Hangs Mar 28, 2024

ncreated commented Mar 28, 2024

ncreated marked this pull request as ready for review March 28, 2024 10:18

ncreated requested review from a team as code owners March 28, 2024 10:18

ncreated commented Mar 28, 2024

maciejburda approved these changes Mar 28, 2024

maxep reviewed Apr 2, 2024

RUM-3461 CR feedback - change closures to delegate pattern

eb272fd

ncreated requested a review from maxep April 3, 2024 13:58

maxep approved these changes Apr 3, 2024

ncreated merged commit bf5a680 into ncreated/RUM-3461/fatal-app-hangs Apr 5, 2024
8 checks passed

ncreated deleted the ncreated/RUM-3461/fatal-app-hangs-tracking branch April 5, 2024 08:21

This was referenced Apr 5, 2024

RUM-3461 feat: Send fatal App Hang after app restart #1759

Merged

RUM-3461 feat: Fatal App Hangs monitoring #1763

Merged
