RUM-4911 feat(watchdog-termination): send Watchdog Termination #1917

ganeshnj · 2024-06-21T11:18:15Z

What and why?

We already have the ability to detect watchdog terminations (WT), added in #1889.

Now we want to report them.

How?

Related PR: DataDog/rum-events-format#213

We already added the capability to detect a watchdog termination in #1889; this PR implements the event sending mechanism. It uses the basic building blocks added during the App hang reporting work.

There are two fundamental changes in this PR:

1. The crash reporter now sets the core context baggage instead of using point-to-point messaging. Point-to-point messaging introduced a strict ordering of feature enablement, i.e. RUM had to be enabled first and then CR. That is a lot of cognitive load for our customers and we would like to make the SDK simpler to use, even though this makes the implementation more complex for the monitor.
2. How we keep track of the RUM view event, which is used to build the error event. In the current implementation, the monitor keeps track of the most recent view event, which is simpler than the initial concept.

In the initial concept, view events were read from the written batch, which is more complex from an implementation point of view but better for the customer.
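To illustrate why baggage removes the enablement-ordering constraint, here is a minimal sketch. `SharedContextSketch` and its methods are hypothetical names made up for this example and are not the SDK's API; the sketch is also not thread-safe. The point is only that a value stored in a shared context is still visible to a feature that subscribes later, whereas a point-to-point message sent before the receiver exists would be lost.

```swift
import Foundation

/// Hypothetical stand-in for a shared core context: a key-value store
/// that features can write to and observe in any order.
final class SharedContextSketch {
    private var baggages: [String: String] = [:]
    private var observers: [String: [(String) -> Void]] = [:]

    /// Stores the value and notifies observers that are already registered.
    func set(baggage: String, forKey key: String) {
        baggages[key] = baggage
        observers[key]?.forEach { $0(baggage) }
    }

    /// Registers an observer; a late subscriber still receives the stored value.
    func observe(key: String, _ receive: @escaping (String) -> Void) {
        observers[key, default: []].append(receive)
        if let existing = baggages[key] {
            receive(existing)
        }
    }
}

// The "crash reporter" publishes before "RUM" subscribes, and RUM still gets the value.
let context = SharedContextSketch()
var receivedByRUM: [String] = []
context.set(baggage: "launch-report", forKey: "launch-report-key")
context.observe(key: "launch-report-key") { receivedByRUM.append($0) }
```

With point-to-point messaging, the same sequence would require the receiver to be registered before the sender publishes.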

The WatchdogTerminationAppStateManager is now simpler, as WatchdogTerminationMonitor takes care of message processing. The monitor starts when the launch report is available and checks whether a watchdog termination occurred. If one is found, it writes the event; otherwise it starts observing the application lifecycle and RUM view events. The WatchdogTerminationMonitor.State enum is introduced for state management.
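The flow described above can be sketched roughly as follows. This follows the description literally and is not the actual implementation: the type names, the state cases and the `writtenEvents` array are all hypothetical, and the real WatchdogTerminationMonitor.State may have different cases.

```swift
import Foundation

/// Hypothetical states; the real WatchdogTerminationMonitor.State may differ.
enum MonitorStateSketch: Equatable {
    case stopped
    case reportedTermination
    case observing
}

final class WatchdogMonitorSketch {
    private(set) var state: MonitorStateSketch = .stopped
    /// Stand-in for events handed to the RUM writer.
    private(set) var writtenEvents: [String] = []

    /// Called once the launch report is available.
    /// The Bool stands in for the checker's verdict on the previous run.
    func start(previousRunEndedInWatchdogTermination: Bool) {
        guard state == .stopped else { return } // ignore repeated starts
        if previousRunEndedInWatchdogTermination {
            writtenEvents.append("watchdog-termination-error")
            state = .reportedTermination
        } else {
            // Here the monitor would begin observing app lifecycle and RUM view events.
            state = .observing
        }
    }
}

let reportingMonitor = WatchdogMonitorSketch()
reportingMonitor.start(previousRunEndedInWatchdogTermination: true)

let observingMonitor = WatchdogMonitorSketch()
observingMonitor.start(previousRunEndedInWatchdogTermination: false)
```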

Review checklist

Feature or bugfix MUST have appropriate tests (unit, integration)
Make sure each commit and the PR mention the Issue number or JIRA reference
Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

Run unit tests for Session Replay
Run smoke tests
Run tests for tools/

ncreated

The code looks great 👌. I left a few comments on integrating watchdog error events into RUM; notably, we must consider session availability constraints in RUM intake. All of it is described in the in-line comments.

On top of that, as this is the final PR on this matter, we must add integration-unit tests for watchdog terminations monitoring.

DatadogCore/Tests/Datadog/Mocks/DatadogInternal/DatadogCoreProxy.swift

DatadogCore/Sources/Core/DatadogCore.swift

DatadogInternal/Sources/Models/CrashReporting/LaunchReport.swift

DatadogObjc/Sources/RUM/RUMDataModels+objc.swift

DatadogRUM/Sources/Instrumentation/AppHangs/ProcessIdentifier.swift

ncreated · 2024-06-25T07:05:24Z

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/WatchdogTerminationMonitor.swift

    }

    let checker: WatchdogTerminationChecker
    let appStateManager: WatchdogTerminationAppStateManager
    let feature: FeatureScope
    let reporter: WatchdogTerminationReporting
+    weak var core: DatadogCoreProtocol?


suggestion/ This looks against our goals from #1744, where we started removing direct core references from Features code. Direct refs are error-prone (they can leak if not weakified) and less scalable (they do not isolate core capabilities, and instead bloat DatadogCoreProtocol with new things).

While I totally get the idea of having this reference here, and .mostRecentModifiedFileAt(before:) doesn't fit into FeatureScope, I wonder if we could avoid an arbitrary weak var core by introducing a CoreUtils type that manages the core reference, similar to how it is done for CoreTelemetry. Just like with Telemetry, it can be safe to pass the CoreUtils reference around and retain it as a non-optional value.

cc @maxep what would be your take on this?

This is a recurrent problem and I think we can do better here. This is perhaps beyond the scope of this PR, but one idea I have is to introduce something like

    public struct Weak<T> {
        public var value: T?

        public init(value: T? = nil) {
            self.value = value
        }
    }

and update the protocols to not conform to AnyObject

public protocol DatadogCoreProtocol: MessageSending, BaggageSharing, CoreStorage {

This will be enforced at the compiler level, and we can use the core via Weak<DatadogCoreProtocol>. This is a much better approach than continuing to wrap the core into different structs like we did for telemetry.

Let's sync on this.

Action: for now, we go with the wrapper mechanism as we do today, and try to simplify afterwards.
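For reference, the wrapper mechanism agreed on above could look roughly like the sketch below. `CoreSketch`, `StorageWrapperSketch` and `FakeCore` are hypothetical names, and the signature of `mostRecentModifiedFileAt(before:)` is an assumption; only the method name comes from the discussion. The wrapper owns the weak reference, so feature code can hold the wrapper as a non-optional value without retaining the core.

```swift
import Foundation

/// Hypothetical slice of the core's capabilities; class-bound so it can be held weakly.
protocol CoreSketch: AnyObject {
    func mostRecentModifiedFileAt(before date: Date) -> Date?
}

/// Wrapper that isolates one capability and hides the weak core reference.
struct StorageWrapperSketch {
    private weak var core: CoreSketch?

    init(core: CoreSketch) {
        self.core = core
    }

    /// Returns nil when the core is gone or no matching file exists.
    func mostRecentModifiedFileAt(before date: Date) -> Date? {
        return core?.mostRecentModifiedFileAt(before: date)
    }
}

/// Test double that always reports the same modification date.
final class FakeCore: CoreSketch {
    func mostRecentModifiedFileAt(before date: Date) -> Date? {
        return Date(timeIntervalSince1970: 0)
    }
}

var fakeCore: FakeCore? = FakeCore()
let storage = StorageWrapperSketch(core: fakeCore!)
let whileCoreAlive = storage.mostRecentModifiedFileAt(before: Date())
fakeCore = nil // the wrapper does not keep the core alive
let afterCoreReleased = storage.mostRecentModifiedFileAt(before: Date())
```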

ncreated · 2024-06-25T07:17:30Z

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/WatchdogTerminationReporter.swift

+            let error = builder.createRUMError(with: viewEvent)
+            let view = builder.updateRUMViewWithError(viewEvent)
+            writer.write(value: error)
+            writer.write(value: view)


change-request/ We need to consider event timings: RUM intake doesn't accept events in certain situations. Notably, the viewEventAvailabilityThreshold defines the last moment when we can issue a view update to the previous session.

It should be okay for the product to only support the main flow initially, like it was done in FatalAppHangsHandler. We can cover edge-case situations later, ideally by moving this logic to FatalErrorBuilder instead of duplicating it between all kinds of fatal errors:

RUM.CrashReportReceiver - supports all edge cases for sending crashes

FatalAppHangsHandler - supports only the happy-path for sending fatal hangs

now WatchdogTerminationReporter

Can you elaborate on how the following logic from dd-sdk-ios/DatadogRUM/Sources/Instrumentation/AppHangs/FatalAppHangsHandler.swift (lines 114 to 124 in 380bdf4) helps here?

    if realDateNow.timeIntervalSince(realErrorDate) < FatalErrorBuilder.Constants.viewEventAvailabilityThreshold {
        DD.logger.debug("Sending fatal App hang as RUM error with issuing RUM view update")
        // It is still OK to send RUM view to previous RUM session.
        writer.write(value: error)
        writer.write(value: view)
    } else {
        // We know it is too late for sending RUM view to previous RUM session as it is now stale on backend.
        // To avoid inconsistency, we only send the RUM error.
        DD.logger.debug("Sending fatal App hang as RUM error without updating RUM view")
        writer.write(value: error)
    }

What makes it confusing is that the else branch only writes the error, but that case should already be covered by sending view + error.

Currently we are sending both the view update and the error. Which one is taken or rejected, is up to the backend logic.

> Which one is taken or rejected, is up to the backend logic.

If we want to challenge the existing logic, it must first be verified with RUM Ingestion. The decision to send the error with no view if more than 4h have passed since the crash was based on input from the BE team. I found the first minimal record of this decision here (see: PROBLEM 1). It is dated 2020, so it might be worth revisiting this topic. If this constraint is no longer there, we can vastly simplify the App Hangs and Crash Reporting code.

I consider this a blocker as it introduces a divergence in fatal errors handling (WDT / CR / AH).

Next: we should add the if-else with the 4-hour window (as a constant) to unblock this PR, and continue the conversation with RUM BE on whether we can simplify this.
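The agreed next step can be sketched as a small pure function. The names `FatalErrorWriteSketch`, `viewUpdateWindowSketch` and `writesForFatalError` are hypothetical; the 4-hour value mirrors the window mentioned in this thread and is only assumed to match FatalErrorBuilder.Constants.viewEventAvailabilityThreshold.

```swift
import Foundation

/// What gets written for a fatal error detected on the next launch.
enum FatalErrorWriteSketch: Equatable {
    case errorAndViewUpdate
    case errorOnly
}

/// Assumed 4-hour window, taken from the discussion above.
let viewUpdateWindowSketch: TimeInterval = 4 * 60 * 60

/// Sends the view update only while the previous session can still accept it.
func writesForFatalError(
    occurredAt errorDate: Date,
    now: Date,
    window: TimeInterval = viewUpdateWindowSketch
) -> FatalErrorWriteSketch {
    if now.timeIntervalSince(errorDate) < window {
        return .errorAndViewUpdate
    } else {
        return .errorOnly
    }
}

let terminationDate = Date(timeIntervalSince1970: 1_000_000)
let oneHourLater = terminationDate.addingTimeInterval(60 * 60)
let fiveHoursLater = terminationDate.addingTimeInterval(5 * 60 * 60)
let recentDecision = writesForFatalError(occurredAt: terminationDate, now: oneHourLater)
let staleDecision = writesForFatalError(occurredAt: terminationDate, now: fiveHoursLater)
```

Keeping the decision in one function like this is also what would make it easy to share between watchdog terminations, crash reports and fatal App hangs.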

ncreated

👌 Looks good, with the remaining blocker on sending a view event more than 4h after the session was closed (see my comment). I'd rather not end up in a situation where one feature implements different view upload logic than another. If the 4h constraint is no longer a thing (please verify), we must remove it from Crash Reporting and App Hangs monitoring too (this can happen through a separate JIRA).

Datadog/IntegrationUnitTests/RUM/WatchdogTerminationsMonitoringTests.swift

ncreated · 2024-06-27T08:04:33Z

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/WatchdogTerminationReporter.swift

+            let error = builder.createRUMError(with: viewEvent)
+            let view = builder.updateRUMViewWithError(viewEvent)
+            writer.write(value: error)
+            writer.write(value: view)


> Which one is taken or rejected, is up to the backend logic.

If we want to challenge the existing logic, it must first be verified with RUM Ingestion. The decision to send the error with no view if more than 4h have passed since the crash was based on input from the BE team. I found the first minimal record of this decision here (see: PROBLEM 1). It is dated 2020, so it might be worth revisiting this topic. If this constraint is no longer there, we can vastly simplify the App Hangs and Crash Reporting code.

I consider this a blocker as it introduces a divergence in fatal errors handling (WDT / CR / AH).

ganeshnj · 2024-06-27T08:57:42Z

DatadogRUM/Sources/Instrumentation/AppHangs/ProcessIdentifier.swift

+/// Example use case in watchdog termination tracking:
+/// - SDK started -> RUM enabled -> [watchdog termination] -> SDK stopped -> SDK started again -> RUM enabled again -> check if the app was terminated by watchdog
+/// - If true, check any file updates that were done before current process started, that is most close to the watchdog termination.
+internal let runningSince = Date()


Action: we can use LaunchTime
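The idea in the doc comment above (pick the file update closest to the watchdog termination, i.e. the latest one made before the current process started) can be sketched as below. The function name and the plain `[Date]` input are hypothetical simplifications; how the launch date is obtained (for example from LaunchTime) is left out.

```swift
import Foundation

/// Returns the latest modification date strictly before `launchDate`, or nil if there is none.
func mostRecentModification(in modificationDates: [Date], before launchDate: Date) -> Date? {
    return modificationDates.filter { $0 < launchDate }.max()
}

let launchDate = Date(timeIntervalSince1970: 1_000)
let modificationDates = [
    Date(timeIntervalSince1970: 100),   // written by the previous process
    Date(timeIntervalSince1970: 900),   // last write before the termination
    Date(timeIntervalSince1970: 1_200)  // written by the current process, must be ignored
]
let closestToTermination = mostRecentModification(in: modificationDates, before: launchDate)
```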

ncreated

🎯 Great efforts!

RUM-4911 feat(watchdog-termination): send Watchdog Termination

9970203

ganeshnj force-pushed the ganeshnj/feat/RUM-4911-wt-report branch from 153a90d to 9970203 Compare June 21, 2024 14:04

ganeshnj marked this pull request as ready for review June 21, 2024 14:15

ganeshnj requested review from a team as code owners June 21, 2024 14:15

ncreated requested changes Jun 25, 2024

ganeshnj and others added 8 commits June 25, 2024 11:44

RUM-4911 feat(watchdog-termination): passthrough core

69581c1

RUM-4911 feat(watchdog-termination): use rw queue for file IO

147eec3

RUM-4911 feat(watchdog-termination): regen models

51e8381

Update DatadogInternal/Sources/Models/CrashReporting/LaunchReport.swift

db59759

Co-authored-by: Maciek Grzybowski <maciek.grzybowski@datadoghq.com>

Update DatadogRUM/Sources/Instrumentation/AppHangs/ProcessIdentifier.…

4e02368

…swift Co-authored-by: Maciek Grzybowski <maciek.grzybowski@datadoghq.com>

RUM-4911 feat(watchdog-termination): add test integ unit test

5952126

RUM-4911 feat(watchdog-termination): use CoreStorage

4df1ae2

RUM-4911 feat(watchdog-termination): fix compilation issue

c62f177

ganeshnj requested a review from ncreated June 26, 2024 12:03

ncreated reviewed Jun 27, 2024

ganeshnj commented Jun 27, 2024

ganeshnj and others added 5 commits June 27, 2024 11:14

RUM-4911 feat(watchdog-termination): fix 4 hours after view updates a…

cbbd2ef

…re not encouraged

RUM-4911 feat(watchdog-termination): wrap Storage like Telemetry

83ca2e8

RUM-4911 feat(watchdog-termination): use LaunchTime

b8dce7f

Update Datadog/IntegrationUnitTests/RUM/WatchdogTerminationsMonitorin…

3ec0569

…gTests.swift Co-authored-by: Maciek Grzybowski <maciek.grzybowski@datadoghq.com>

RUM-4911 feat(watchdog-termination): fix various test compilation issue

27af706

ganeshnj requested a review from ncreated June 27, 2024 11:31

ncreated approved these changes Jun 27, 2024

ganeshnj merged commit 3686191 into develop Jun 27, 2024
11 checks passed

ganeshnj deleted the ganeshnj/feat/RUM-4911-wt-report branch June 27, 2024 11:45

maxep mentioned this pull request Jul 4, 2024

Release 2.14.0 #1941

Merged

4 tasks
