Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUM-4911 feat(watchdog-termination): send Watchdog Termination #1917

Merged
merged 14 commits into from
Jun 27, 2024

Conversation

ganeshnj
Copy link
Contributor

@ganeshnj ganeshnj commented Jun 21, 2024

What and why?

We already have ability to detect the WT #1889

Now we want to report them.

How?

Related PR: DataDog/rum-events-format#213

We already added capability to detect a watchdog termination #1889, this PR implements the event sending mechanism. It uses basic building blocks added during the App hang reporting.

There are two fundamental changes done in this

  1. The crash reporter now sets the core context baggage instead using point-to-point messaging. Point-to-point messaging introduced a strict ordering of feature enablement. ie first RUM must be enabled and then CR. This is a lot of cognitive load for our customers and we would like to make it simpler to use but this eventually makes the implementation complex for the monitor.

  2. How we keep track of the RUM view event, which is used to build the error event. In current implementation, the monitor keeps track of most recent view event which is simpler than the initial concept.

    In the initial concept, view events were read from the written batch which is more complex from implementation point of view but better for customer.

The WatchdogTerminationAppStateManager is now simpler as WatchdogTerminationMonitor takes care of message processing. The monitor starts when the launch report is available and checks for the occurrence of the watchdog termination, if found, it writes the event else start observing the application lifecycle and RUM view events. WatchdogTerminationMonitor.State enum is introduced for the state management.

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Session Replay
  • Run smoke tests
  • Run tests for tools/

@ganeshnj ganeshnj force-pushed the ganeshnj/feat/RUM-4911-wt-report branch from 153a90d to 9970203 Compare June 21, 2024 14:04
@ganeshnj ganeshnj marked this pull request as ready for review June 21, 2024 14:15
@ganeshnj ganeshnj requested review from a team as code owners June 21, 2024 14:15
Copy link
Member

@ncreated ncreated left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks great 👌. I left few comments on integrating watchdog error events into RUM - notably we must consider session availability constraints in RUM intake. All described through in-line comments.

On top of that, as this is final PR on this matter, we must add integration-unit tests for watchdog terminations monitoring.

DatadogCore/Sources/Core/DatadogCore.swift Outdated Show resolved Hide resolved
DatadogObjc/Sources/RUM/RUMDataModels+objc.swift Outdated Show resolved Hide resolved
}

let checker: WatchdogTerminationChecker
let appStateManager: WatchdogTerminationAppStateManager
let feature: FeatureScope
let reporter: WatchdogTerminationReporting
weak var core: DatadogCoreProtocol?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion/ This looks against our goals from #1744 where we started removing direct core references from Features code. Direct refs are error prone (can leak if not weakified) and less scalable (they do not isolate core capabilities, instead bloat DatadogCoreProtocol with new things).

While I totally get the idea of having this reference here and .mostRecentModifiedFileAt(before:) doesn't fit into FeatureScope, I wonder if we could avoid arbitrary weak var core by introducing CoreUtils type that manages the core reference similar to how it is done for CoreTelemetry. Just like with Telemetry, it can be safe to pass CoreUtils reference around and retain it on non-optional value.

cc @maxep what would be your take on this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is recurrent problem and I think we can do better here, this perhaps beyond the scope of this PR but one idea that I have is to introduce something like

public struct Weak<T> {
    public var value: T?
    
    public init(value: T? = nil) {
        self.value = value
    }
}

and update the protocols to not conform AnyObject

public protocol DatadogCoreProtocol: MessageSending, BaggageSharing, CoreStorage {

this will be enforced at compiler level and we can use core using Weak<DatadogCoreProtocol>. This is must better approach than keep wrapping the core into different structs like we did for telemetry.

Let's sync on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action: for now, we do with wrapper mechanism as we do today and let's try to simplify after.

Comment on lines 48 to 51
let error = builder.createRUMError(with: viewEvent)
let view = builder.updateRUMViewWithError(viewEvent)
writer.write(value: error)
writer.write(value: view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change-request/ We need to consider event timings - RUM intake doesn't accept events in certain situation. Notably, the viewEventAvailabilityThreshold defines the last moment when we can issue a view update to previous session.

It should be okay for the product to only support the main flow initially, like it was done in FatalAppHangsHandler. We can cover edge-case situations later, ideally by moving this logic to FatalErrorBuilder instead of duplicating it between all kind of fatal errors:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate how does

if realDateNow.timeIntervalSince(realErrorDate) < FatalErrorBuilder.Constants.viewEventAvailabilityThreshold {
DD.logger.debug("Sending fatal App hang as RUM error with issuing RUM view update")
// It is still OK to send RUM view to previous RUM session.
writer.write(value: error)
writer.write(value: view)
} else {
// We know it is too late for sending RUM view to previous RUM session as it is now stale on backend.
// To avoid inconsistency, we only send the RUM error.
DD.logger.debug("Sending fatal App hang as RUM error without updating RUM view")
writer.write(value: error)
}
logic helps here?

What makes it confusing is that else part only writes the error but this should be covered with view + error already.

Currently we are sending both view update and error. Which one is taken or rejected, is up to the backend logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which one is taken or rejected, is up to the backend logic.

If we want to challenge existing logic it must be first verified with RUM Ingestion. The decision to send error with no view if 4h lasted since crash was based on inputs from BE team. I found first minimal record of this decision here (see: PROBLEM 1). It is dated 2020, so it might be worth revisiting this topic. If this constraint is no longer there, we can vastly simplify App Hangs and Crash Reporting code.

I consider this a blocker as it introduces a divergence in fatal errors handling (WDT / CR / AH).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next: we should do if-else with 4 hours window (constant) and unblock this PR and continue the conversation with RUM BE if we can simplify this.

@ganeshnj ganeshnj requested a review from ncreated June 26, 2024 12:03
Copy link
Member

@ncreated ncreated left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌 Looks good with the remaining blocker on sending view event more than 4h after session was closed (see my comment). I'd rather not end up in situation where one feature implements different view upload logic than other. If 4h constraint is no longer a thing (please verify), we must remove it from Crash Reporting and App Hangs monitoring too (this can happen through separate JIRA).

Comment on lines 48 to 51
let error = builder.createRUMError(with: viewEvent)
let view = builder.updateRUMViewWithError(viewEvent)
writer.write(value: error)
writer.write(value: view)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which one is taken or rejected, is up to the backend logic.

If we want to challenge existing logic it must be first verified with RUM Ingestion. The decision to send error with no view if 4h lasted since crash was based on inputs from BE team. I found first minimal record of this decision here (see: PROBLEM 1). It is dated 2020, so it might be worth revisiting this topic. If this constraint is no longer there, we can vastly simplify App Hangs and Crash Reporting code.

I consider this a blocker as it introduces a divergence in fatal errors handling (WDT / CR / AH).

/// Example use case in watchdog termination tracking:
/// - SDK started -> RUM enabled -> [watchdog termination] -> SDK stopped -> SDK started again -> RUM enabled again -> check if the app was terminated by watchdog
/// - If true, check any file updates that were done before current process started, that is most close to the watchdog termination.
internal let runningSince = Date()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Action: we can use LaunchTime

@ganeshnj ganeshnj requested a review from ncreated June 27, 2024 11:31
Copy link
Member

@ncreated ncreated left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Great efforts!

@ganeshnj ganeshnj merged commit 3686191 into develop Jun 27, 2024
11 checks passed
@ganeshnj ganeshnj deleted the ganeshnj/feat/RUM-4911-wt-report branch June 27, 2024 11:45
@maxep maxep mentioned this pull request Jul 4, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants