RUM-4911 feat(watchdog-termination): send Watchdog Termination #1917
Conversation
Force-pushed from 153a90d to 9970203.
The code looks great 👌. I left a few comments on integrating watchdog error events into RUM - notably, we must consider session availability constraints in RUM intake. Everything is described in in-line comments.
On top of that, as this is the final PR on this matter, we must add integration-unit tests for watchdog termination monitoring.
Resolved (outdated) review threads:
- DatadogCore/Tests/Datadog/Mocks/DatadogInternal/DatadogCoreProxy.swift
- DatadogInternal/Sources/Models/CrashReporting/LaunchReport.swift
- DatadogRUM/Sources/Instrumentation/AppHangs/ProcessIdentifier.swift
```swift
let checker: WatchdogTerminationChecker
let appStateManager: WatchdogTerminationAppStateManager
let feature: FeatureScope
let reporter: WatchdogTerminationReporting
weak var core: DatadogCoreProtocol?
```
suggestion/ This goes against our goals from #1744, where we started removing direct core references from Features code. Direct refs are error prone (they can leak if not weakified) and less scalable (they don't isolate core capabilities and instead bloat `DatadogCoreProtocol` with new things).
While I totally get the idea of having this reference here, and `.mostRecentModifiedFileAt(before:)` doesn't fit into `FeatureScope`, I wonder if we could avoid an arbitrary `weak var core` by introducing a `CoreUtils` type that manages the `core` reference, similar to how it is done for `CoreTelemetry`. Just like with `Telemetry`, it would be safe to pass the `CoreUtils` reference around and retain it on a non-optional value.
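For illustration only, a minimal sketch of what such a (currently non-existing) `CoreUtils` wrapper could look like, assuming `DatadogCoreProtocol` stays class-bound so it can be held weakly:

```swift
// Hypothetical sketch — `CoreUtils` does not exist yet. It owns the weak core
// reference so features can keep it on a non-optional property, similar to how
// `CoreTelemetry` backs `Telemetry`.
public struct CoreUtils {
    private weak var core: DatadogCoreProtocol?

    public init(core: DatadogCoreProtocol) {
        self.core = core
    }

    /// Runs `block` with the core if it is still available; no-op after the core is deallocated.
    public func withCore(_ block: (DatadogCoreProtocol) throws -> Void) rethrows {
        if let core = core {
            try block(core)
        }
    }
}
```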
cc @maxep what would be your take on this?
This is a recurrent problem and I think we can do better here. It is perhaps beyond the scope of this PR, but one idea I have is to introduce something like:

```swift
public struct Weak<T> {
    public var value: T?

    public init(value: T? = nil) {
        self.value = value
    }
}
```

and update the protocols to not conform to `AnyObject`:

```swift
public protocol DatadogCoreProtocol: MessageSending, BaggageSharing, CoreStorage {
```

This would be enforced at the compiler level and we could reference the core as `Weak<DatadogCoreProtocol>`. This is a much better approach than wrapping the core into different structs like we did for telemetry.
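For illustration, a feature component could then hold the container on a non-optional property instead of declaring `weak var core` itself (a rough sketch; the monitor's real initializer and members may differ):

```swift
// Sketch only: holding the proposed `Weak<DatadogCoreProtocol>` container
// instead of a direct `weak var core: DatadogCoreProtocol?` property.
internal final class WatchdogTerminationMonitor {
    private let core: Weak<DatadogCoreProtocol>

    init(core: Weak<DatadogCoreProtocol>) {
        self.core = core
    }

    func reportIfNeeded() {
        guard let core = core.value else {
            return // no core available (e.g. the SDK instance was stopped)
        }
        // ... use `core` for the operations that don't fit `FeatureScope`,
        // such as the file lookup discussed above ...
        _ = core
    }
}
```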
Let's sync on this.
Action: for now, we go with the wrapper mechanism as we do today, and let's try to simplify it afterwards.
```swift
let error = builder.createRUMError(with: viewEvent)
let view = builder.updateRUMViewWithError(viewEvent)
writer.write(value: error)
writer.write(value: view)
```
change-request/ We need to consider event timings - RUM intake doesn't accept events in certain situations. Notably, the `viewEventAvailabilityThreshold` defines the last moment when we can issue a view update to the previous session.
It should be okay for the product to only support the main flow initially, like it was done in `FatalAppHangsHandler`. We can cover edge-case situations later, ideally by moving this logic to `FatalErrorBuilder` (see the sketch after this list) instead of duplicating it between all kinds of fatal errors:

- `RUM.CrashReportReceiver` - supports all edge cases for sending crashes
- `FatalAppHangsHandler` - supports only the happy path for sending fatal hangs
- now `WatchdogTerminationReporter`
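For illustration, a hedged sketch of what centralizing that decision in `FatalErrorBuilder` could look like; only `Constants.viewEventAvailabilityThreshold` is existing code, the helper name is an assumption:

```swift
// Sketch only: one shared helper instead of per-feature copies of the threshold check.
// `canUpdateView(errorDate:now:)` is a hypothetical name, not existing API.
extension FatalErrorBuilder {
    /// `true` if the backend still accepts a RUM view update for the session the fatal
    /// error belongs to, i.e. less than `viewEventAvailabilityThreshold` (4h) elapsed
    /// between the error and now.
    static func canUpdateView(errorDate: Date, now: Date) -> Bool {
        return now.timeIntervalSince(errorDate) < Constants.viewEventAvailabilityThreshold
    }
}
```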
Can you elaborate on how this works:
dd-sdk-ios/DatadogRUM/Sources/Instrumentation/AppHangs/FatalAppHangsHandler.swift
Lines 114 to 124 in 380bdf4
```swift
if realDateNow.timeIntervalSince(realErrorDate) < FatalErrorBuilder.Constants.viewEventAvailabilityThreshold {
    DD.logger.debug("Sending fatal App hang as RUM error with issuing RUM view update")
    // It is still OK to send RUM view to previous RUM session.
    writer.write(value: error)
    writer.write(value: view)
} else {
    // We know it is too late for sending RUM view to previous RUM session as it is now stale on backend.
    // To avoid inconsistency, we only send the RUM error.
    DD.logger.debug("Sending fatal App hang as RUM error without updating RUM view")
    writer.write(value: error)
}
```
What makes it confusing is that the `else` branch only writes the error, but this case should already be covered by sending view + error.
Currently we are sending both the view update and the error. Which one is taken or rejected is up to the backend logic.
> Which one is taken or rejected is up to the backend logic.

If we want to challenge the existing logic, it must first be verified with RUM Ingestion. The decision to send the error with no view once 4h have elapsed since the crash was based on inputs from the BE team. I found the first minimal record of this decision here (see: PROBLEM 1). It is dated 2020, so it might be worth revisiting this topic. If this constraint is no longer there, we can vastly simplify the App Hangs and Crash Reporting code.
I consider this a blocker as it introduces a divergence in fatal error handling (WDT / CR / AH).
Next: we should add the if-else with the 4-hour window (constant) to unblock this PR, and continue the conversation with RUM BE on whether we can simplify this.
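For reference, a rough sketch of that interim if-else applied in the watchdog termination reporter, reusing the variables from the diff above; `realDateNow` / `realErrorDate` mirror the `FatalAppHangsHandler` snippet and are assumptions about the final naming:

```swift
let error = builder.createRUMError(with: viewEvent)
let view = builder.updateRUMViewWithError(viewEvent)

if realDateNow.timeIntervalSince(realErrorDate) < FatalErrorBuilder.Constants.viewEventAvailabilityThreshold {
    // Still within the 4h window: the previous session can accept a view update.
    writer.write(value: error)
    writer.write(value: view)
} else {
    // Too late to update the (now stale) view on the backend: send only the error.
    writer.write(value: error)
}
```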
👌 Looks good, with the remaining blocker on sending the view event more than 4h after the session was closed (see my comment). I'd rather not end up in a situation where one feature implements different view upload logic than another. If the 4h constraint is no longer a thing (please verify), we must remove it from Crash Reporting and App Hangs monitoring too (this can happen through a separate JIRA).
Resolved review thread:
- Datadog/IntegrationUnitTests/RUM/WatchdogTerminationsMonitoringTests.swift
```swift
/// Example use case in watchdog termination tracking:
/// - SDK started -> RUM enabled -> [watchdog termination] -> SDK stopped -> SDK started again -> RUM enabled again -> check if the app was terminated by watchdog
/// - If true, check any file updates that were done before the current process started, that is closest to the watchdog termination.
internal let runningSince = Date()
```
Action: we can use `LaunchTime`.
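A rough sketch of that direction, assuming `LaunchTime` from the SDK context exposes the process launch date; the exact property name is an assumption here:

```swift
import Foundation
import DatadogInternal // assumed: `LaunchTime` lives here and exposes `launchDate`

// Sketch only: compare file modification dates against the process launch date from
// `LaunchTime` instead of an ad-hoc `runningSince = Date()` captured at type creation.
func wasModifiedBeforeCurrentProcess(_ fileModificationDate: Date, launchTime: LaunchTime) -> Bool {
    // Files modified before the current process started are the candidates
    // closest to the suspected watchdog termination.
    return fileModificationDate < launchTime.launchDate
}
```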
🎯 Great efforts!
What and why?
We already have the ability to detect Watchdog Terminations (#1889). Now we want to report them.
How?
Related PR: DataDog/rum-events-format#213
We already added the capability to detect a watchdog termination in #1889; this PR implements the event sending mechanism. It uses the basic building blocks added during App Hang reporting.
There are two fundamental changes in this PR:
1. The crash reporter now sets the core context baggage instead of using point-to-point messaging. Point-to-point messaging introduced a strict ordering of feature enablement, i.e. RUM must be enabled first and then Crash Reporting. This is a lot of cognitive load for our customers and we would like to make it simpler to use, even though this eventually makes the implementation more complex for the monitor.
2. How we keep track of the RUM view event, which is used to build the error event. In the current implementation, the monitor keeps track of the most recent view event, which is simpler than the initial concept (a rough sketch of this bookkeeping follows below).
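For illustration only, a minimal sketch of that "most recent view event" bookkeeping; the type and member names are hypothetical, and persisting the event across launches (needed to report a termination detected on the next launch) is not shown:

```swift
import Foundation

// Hypothetical sketch — not the PR's actual implementation.
// Keeps the latest RUM view event so it can later be used to build the error event.
internal final class LastViewEventHolder {
    private let queue = DispatchQueue(label: "last-view-event-holder")
    private var lastViewEvent: RUMViewEvent?

    /// Called whenever a new RUM view event is written.
    func update(with viewEvent: RUMViewEvent) {
        queue.async { self.lastViewEvent = viewEvent }
    }

    /// The most recent view event, if any.
    func current() -> RUMViewEvent? {
        queue.sync { lastViewEvent }
    }
}
```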
The `WatchdogTerminationAppStateManager` is now simpler, as `WatchdogTerminationMonitor` takes care of message processing. The monitor starts when the launch report is available and checks for the occurrence of a watchdog termination; if found, it writes the event, otherwise it starts observing the application lifecycle and RUM view events. A `WatchdogTerminationMonitor.State` enum is introduced for state management.

Review checklist
Custom CI job configuration (optional)