-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RUM-3461 feat: track Fatal App Hangs #1751
RUM-3461 feat: track Fatal App Hangs #1751
Conversation
Datadog ReportBranch report: ✅ 0 Failed, 2919 Passed, 0 Skipped, 12m 42.45s Wall Time 🔻 Code Coverage Decreases vs Default Branch (10)
|
23e4924
to
0d85c4b
Compare
0d85c4b
to
77a811a
Compare
/// RUM interface for data store. | ||
/// | ||
/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization. | ||
/// Serialization errors are logged to telemetry. | ||
internal struct RUMDataStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 This adds JSON-coding convenience for interacting with Data Store. It leverages SRP to separate concerns by abstracting error handling and failure reporting. It will be vastly leveraged for watchdog terminations, where the number of stored values will grow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that.
I did manual encoding/decoding in my PR using DataStore.
IMO worth adding similar kind of codable support extension as public. It's going to be very common use case.
cc @maxep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair call, we shall see when we add more use cases. It is important to let the feature decide the serialization IMHO, but we could provide an extension for JSON in the same fashion as proposed here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Sounds good. Let's move on with scope.rumDataStore
in RUM
. I can follow with refactoring PR to introduce scope.jsonDataStore
in DatadogInternal
.
/// Handles non-fatal App Hangs. | ||
internal let nonFatalHangsHandler: NonFatalAppHangsHandler | ||
/// Handles non-fatal App Hangs. | ||
internal let fatalHangsHandler: FatalAppHangsHandler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 The previous non-fatal App Hang logic went to NonFatalAppHangsHandler
. Fatals are implemented in separate handler. This is to separate concerns as both will evolve significantly different ways (with fatal handler reaching significantly higher complexity).
// TODO: RUM-3461 | ||
// Similar to how we send Crash report in `CrashReportReceiver`: | ||
// - construct RUM error from `fatalHang.hang` information | ||
// - update `error.count` in `fatalHang.lastRUMView` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 This will come in following PR. This PR is merged against feature branch.
/// Example use case in fatal App Hangs tracking: | ||
/// - SDK started → RUM enabled → [hang occurs] → pending App Hang saved → SDK stopped → SDK started again → RUM enabled again → pending App Hang loaded | ||
/// - When restarting RUM , the `processID` check ensures dropping pending hang from the previous instance, preventing false "fatal" hang detection. | ||
internal let currentProcessID = UUID() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 In fact, this is to achieve both:
- cover the edge case of pending hang being written before stopping the SDK and being considered a "fatal" when restarting it within the same process;
- add a safety net for limiting false-positive fatal hangs (by only reporting if their tracked
processID
is different fromcurrentProcessID
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave it general overview. Looks good. Well commented and separated!
/// RUM interface for data store. | ||
/// | ||
/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization. | ||
/// Serialization errors are logged to telemetry. | ||
internal struct RUMDataStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that.
I did manual encoding/decoding in my PR using DataStore.
IMO worth adding similar kind of codable support extension as public. It's going to be very common use case.
cc @maxep
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I left some comments regarding locks, LMWYT!
/// RUM interface for data store. | ||
/// | ||
/// It stores values in JSON format and implements convenience for type-safe key referencing and data serialization. | ||
/// Serialization errors are logged to telemetry. | ||
internal struct RUMDataStore { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair call, we shall see when we add more use cases. It is important to let the feature decide the serialization IMHO, but we could provide an extension for JSON in the same fashion as proposed here!
@@ -7,7 +7,22 @@ | |||
import Foundation | |||
import DatadogInternal | |||
|
|||
internal final class AppHangsWatchdogThread: Thread { | |||
internal protocol AppHangsObservingThread: AnyObject { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/question Why constraining to AnyObject
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to enable mutability for callbacks:
var onHangStarted: ((AppHang) -> Void)? { set get }
var onHangCancelled: ((AppHang) -> Void)? { set get }
var onHangEnded: ((AppHang, TimeInterval) -> Void)? { set get }
var onBeforeSleep: (() -> Void)? { set get }
Alternatively, we can manage the thread reference as var
, not let
:
- private let watchdogThread: AppHangsObservingThread
+ private var watchdogThread: AppHangsObservingThread
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AnyObject
is no longer required after switch to delegate pattern.
@ReadWriteLock | ||
private var mainThreadID: ThreadID? = nil | ||
/// Telemetry interface. | ||
private let telemetry: Telemetry | ||
/// Closure to be notified when App Hang ends. It will be executed on the watchdog thread. | ||
/// Closure to be notified when App Hang starts. | ||
/// It is executed on the watchdog thread. | ||
@ReadWriteLock | ||
internal var onHangStarted: ((AppHang) -> Void)? | ||
/// Closure to be notified when App Hang gets cancelled due to possible false-positive. | ||
/// It is executed on the watchdog thread. | ||
@ReadWriteLock | ||
internal var onHangCancelled: ((AppHang) -> Void)? | ||
/// Closure to be notified when App Hang ends. It passes the hang and its duration. | ||
/// It is executed on the watchdog thread. | ||
@ReadWriteLock | ||
internal var onHangEnded: ((AppHang) -> Void)? | ||
internal var onHangEnded: ((AppHang, TimeInterval) -> Void)? | ||
/// A block called after this thread finished its pass and will become idle. | ||
@ReadWriteLock | ||
internal var onBeforeSleep: (() -> Void)? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/suggestion That makes 5 locks :/ The property wrapper might not fit very well here, unless we use a companion object with a single lock.
Another approach would be to create another read-write lock object that is not a property wrapper and that we can use to lock multiple accesses.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is only one write to each onHang*
lock during SDK startup on the main thread. Later, SDK only reads them from the watchdog (background) thread.
I could combine these callbacks to Observer
struct, e.g.:
thread.hangObserver = .init(
onStarted: { _ in },
onCancelled: { _ in }
onEnded: { _, _ in }
)
Alternatively, we can leverage delegate
model (thread.delegate = self
). WDYT @maxep ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The good old delegate pattern :) It is quite appropriate in this case 👍 Do you mind trying out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I changed to delegate 👍. It is weak var
which implies "atomic", so no locks required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, well done!
What and why?
📦 This PR brings the main part of Fatal App Hangs tracking. It tracks hangs from start to end or cancellation, persisting the fatal error information to be considered after app restart.
🚧 Sending RUM error and RUM view update will be added in following PR, to keep the code review scoped and lean.
How?
Fatal App Hangs tracking has a lot of similarity with Crash tracking. Notably, the context of the fatal error must be persisted in the failed process so it is available after restart (that includes: last RUM view, last RUM session info, current NTP offset and more).
To embrace this similarity and enable later code reuse this PR introduces
FatalErrorContextNotifier
that talks to bothDatadogCrashReporting
(over message bus) and RUM'sAppHangsMonitor
. It is meant to be extended even more for later watchdog terminations tracking. This way all fatal errors (crashes, fatal hangs and OOMs) will utilize the same logic of uploading them to previous RUM session (covering all edge cases like fail in background vs BET enabled/disabled, fail with no RUM session started, restart after more than 4h of the issue etc.).The overall architecture introduced in this PR can be illustrated as follows:
Review checklist
Custom CI job configuration (optional)
tools/