Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUM-3461 feat: track Fatal App Hangs #1751

Conversation

ncreated
Copy link
Member

@ncreated ncreated commented Mar 27, 2024

What and why?

📦 This PR brings the main part of Fatal App Hangs tracking. It tracks hangs from start to end or cancellation, persisting the fatal error information to be considered after app restart.

🚧 Sending RUM error and RUM view update will be added in following PR, to keep the code review scoped and lean.

How?

Fatal App Hangs tracking has a lot of similarity with Crash tracking. Notably, the context of the fatal error must be persisted in the failed process so it is available after restart (that includes: last RUM view, last RUM session info, current NTP offset and more).

To embrace this similarity and enable later code reuse this PR introduces FatalErrorContextNotifier that talks to both DatadogCrashReporting (over message bus) and RUM's AppHangsMonitor. It is meant to be extended even more for later watchdog terminations tracking. This way all fatal errors (crashes, fatal hangs and OOMs) will utilize the same logic of uploading them to previous RUM session (covering all edge cases like fail in background vs BET enabled/disabled, fail with no RUM session started, restart after more than 4h of the issue etc.).

The overall architecture introduced in this PR can be illustrated as follows:

Screenshot 2024-03-28 at 11 10 37

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Core, RUM, Trace, Logs, CR and WVT
  • Run unit tests for Session Replay
  • Run integration tests
  • Run smoke tests
  • Run tests for tools/

@ncreated ncreated self-assigned this Mar 27, 2024
@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented Mar 27, 2024

Datadog Report

Branch report: ncreated/RUM-3461/fatal-app-hangs-tracking
Commit report: eb272fd
Test service: dd-sdk-ios

✅ 0 Failed, 2919 Passed, 0 Skipped, 12m 42.45s Wall Time
🔻 Test Sessions change in coverage: 10 decreased, 4 increased

🔻 Code Coverage Decreases vs Default Branch (10)

This report shows up to 5 code coverage decreases.

  • test DatadogInternalTests iOS 78.95% (-0.8%) - Details
  • test DatadogInternalTests tvOS 78.93% (-0.78%) - Details
  • test DatadogTraceTests tvOS 49.11% (-0.63%) - Details
  • test DatadogTraceTests iOS 49.06% (-0.57%) - Details
  • test DatadogLogsTests iOS 45.37% (-0.57%) - Details

Base automatically changed from ncreated/RUM-3461/refactor-rum-to-depend-on-feature-scope to develop March 28, 2024 07:52
@ncreated ncreated force-pushed the ncreated/RUM-3461/fatal-app-hangs-tracking branch from 23e4924 to 0d85c4b Compare March 28, 2024 09:04
@ncreated ncreated changed the base branch from develop to ncreated/RUM-3461/fatal-app-hangs March 28, 2024 09:05
@ncreated ncreated changed the title RUM-3461 feat: Fatal App Hangs monitoring RUM-3461 feat: track Fatal App Hangs between Mar 28, 2024
@ncreated ncreated force-pushed the ncreated/RUM-3461/fatal-app-hangs-tracking branch from 0d85c4b to 77a811a Compare March 28, 2024 09:13
@ncreated ncreated changed the title RUM-3461 feat: track Fatal App Hangs between RUM-3461 feat: track Fatal App Hangs Mar 28, 2024
Comment on lines +24 to +28
/// RUM interface for data store.
///
/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization.
/// Serialization errors are logged to telemetry.
internal struct RUMDataStore {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 This adds JSON-coding convenience for interacting with Data Store. It leverages SRP to separate concerns by abstracting error handling and failure reporting. It will be vastly leveraged for watchdog terminations, where the number of stored values will grow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that.
I did manual encoding/decoding in my PR using DataStore.

IMO worth adding similar kind of codable support extension as public. It's going to be very common use case.

cc @maxep

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair call, we shall see when we add more use cases. It is important to let the feature decide the serialization IMHO, but we could provide an extension for JSON in the same fashion as proposed here!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Sounds good. Let's move on with scope.rumDataStore in RUM. I can follow with refactoring PR to introduce scope.jsonDataStore in DatadogInternal.

Comment on lines +24 to +27
/// Handles non-fatal App Hangs.
internal let nonFatalHangsHandler: NonFatalAppHangsHandler
/// Handles non-fatal App Hangs.
internal let fatalHangsHandler: FatalAppHangsHandler
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 The previous non-fatal App Hang logic went to NonFatalAppHangsHandler. Fatals are implemented in separate handler. This is to separate concerns as both will evolve significantly different ways (with fatal handler reaching significantly higher complexity).

Comment on lines +73 to +76
// TODO: RUM-3461
// Similar to how we send Crash report in `CrashReportReceiver`:
// - construct RUM error from `fatalHang.hang` information
// - update `error.count` in `fatalHang.lastRUMView`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 This will come in following PR. This PR is merged against feature branch.

@ncreated ncreated marked this pull request as ready for review March 28, 2024 10:18
@ncreated ncreated requested review from a team as code owners March 28, 2024 10:18
Comment on lines +16 to +19
/// Example use case in fatal App Hangs tracking:
/// - SDK started → RUM enabled → [hang occurs] → pending App Hang saved → SDK stopped → SDK started again → RUM enabled again → pending App Hang loaded
/// - When restarting RUM , the `processID` check ensures dropping pending hang from the previous instance, preventing false "fatal" hang detection.
internal let currentProcessID = UUID()
Copy link
Member Author

@ncreated ncreated Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 In fact, this is to achieve both:

  • cover the edge case of pending hang being written before stopping the SDK and being considered a "fatal" when restarting it within the same process;
  • add a safety net for limiting false-positive fatal hangs (by only reporting if their tracked processID is different from currentProcessID).

Copy link
Member

@maciejburda maciejburda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave it general overview. Looks good. Well commented and separated!

Comment on lines +24 to +28
/// RUM interface for data store.
///
/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization.
/// Serialization errors are logged to telemetry.
internal struct RUMDataStore {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that.
I did manual encoding/decoding in my PR using DataStore.

IMO worth adding similar kind of codable support extension as public. It's going to be very common use case.

cc @maxep

Copy link
Member

@maxep maxep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I left some comments regarding locks, LMWYT!

Comment on lines +24 to +28
/// RUM interface for data store.
///
/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization.
/// Serialization errors are logged to telemetry.
internal struct RUMDataStore {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair call, we shall see when we add more use cases. It is important to let the feature decide the serialization IMHO, but we could provide an extension for JSON in the same fashion as proposed here!

@@ -7,7 +7,22 @@
import Foundation
import DatadogInternal

internal final class AppHangsWatchdogThread: Thread {
internal protocol AppHangsObservingThread: AnyObject {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/question Why constraining to AnyObject?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to enable mutability for callbacks:

var onHangStarted: ((AppHang) -> Void)? { set get }
var onHangCancelled: ((AppHang) -> Void)? { set get }
var onHangEnded: ((AppHang, TimeInterval) -> Void)? { set get }
var onBeforeSleep: (() -> Void)? { set get }

Alternatively, we can manage the thread reference as var, not let:

- private let watchdogThread: AppHangsObservingThread
+ private var watchdogThread: AppHangsObservingThread

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnyObject is no longer required after switch to delegate pattern.

Comment on lines 64 to 82
@ReadWriteLock
private var mainThreadID: ThreadID? = nil
/// Telemetry interface.
private let telemetry: Telemetry
/// Closure to be notified when App Hang ends. It will be executed on the watchdog thread.
/// Closure to be notified when App Hang starts.
/// It is executed on the watchdog thread.
@ReadWriteLock
internal var onHangStarted: ((AppHang) -> Void)?
/// Closure to be notified when App Hang gets cancelled due to possible false-positive.
/// It is executed on the watchdog thread.
@ReadWriteLock
internal var onHangCancelled: ((AppHang) -> Void)?
/// Closure to be notified when App Hang ends. It passes the hang and its duration.
/// It is executed on the watchdog thread.
@ReadWriteLock
internal var onHangEnded: ((AppHang) -> Void)?
internal var onHangEnded: ((AppHang, TimeInterval) -> Void)?
/// A block called after this thread finished its pass and will become idle.
@ReadWriteLock
internal var onBeforeSleep: (() -> Void)?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/suggestion That makes 5 locks :/ The property wrapper might not fit very well here, unless we use a companion object with a single lock.

Another approach would be to create another read-write lock object that is not a property wrapper and that we can use to lock multiple accesses.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one write to each onHang* lock during SDK startup on the main thread. Later, SDK only reads them from the watchdog (background) thread.

I could combine these callbacks to Observer struct, e.g.:

thread.hangObserver = .init(
   onStarted: { _ in },
   onCancelled: { _ in }
   onEnded: { _, _ in }
)

Alternatively, we can leverage delegate model (thread.delegate = self). WDYT @maxep ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The good old delegate pattern :) It is quite appropriate in this case 👍 Do you mind trying out?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I changed to delegate 👍. It is weak var which implies "atomic", so no locks required.

@ncreated ncreated requested a review from maxep April 3, 2024 13:58
Copy link
Member

@maxep maxep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, well done!

@ncreated ncreated merged commit bf5a680 into ncreated/RUM-3461/fatal-app-hangs Apr 5, 2024
8 checks passed
@ncreated ncreated deleted the ncreated/RUM-3461/fatal-app-hangs-tracking branch April 5, 2024 08:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants