RUM-3588 fix: Always update crash context with RUM attributes #1834

ncreated · 2024-05-14T14:13:26Z

What and why?

📦 With this PR, the RUM error and RUM view sent after a crash will always include the up-to-date attributes set on RUMMonitor.

Because we don't update the crash context with the "last RUM view" info on every change to global attributes, the global RUM attributes held in the "last view" were not always correct. A global attribute change that happened during the last active view was never included in the crash error.

How?

The RUM.FatalErrorContextNotifier is updated with a globalAttributes property. Upon change, it sends a baggage over the message bus holding DatadogInternal.GlobalRUMAttributes (new type). The baggage is received in DatadogCrashReporting and encoded into CrashContext.

Upon restarting the app after a crash, the value is decoded into RUM.CrashContext in the usual way.
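The flow described above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: `MessageBusSketch`, `FatalErrorContextNotifierSketch`, `GlobalRUMAttributesSketch` and the `"rum-global-attributes"` key are hypothetical names, attribute values are simplified to strings, and JSON is assumed as the wire format.

```swift
import Foundation

// Hypothetical stand-in for the shared type exchanged between RUM and Crash Reporting.
// Attribute values are simplified to strings for this sketch.
struct GlobalRUMAttributesSketch: Codable, Equatable {
    let attributes: [String: String]
}

// Hypothetical message bus: keeps the latest encoded payload for each key.
final class MessageBusSketch {
    private(set) var baggages: [String: Data] = [:]
    private(set) var sendCount = 0

    func send<T: Encodable>(_ value: T, forKey key: String) throws {
        baggages[key] = try JSONEncoder().encode(value)
        sendCount += 1
    }

    func decode<T: Decodable>(_ type: T.Type, forKey key: String) throws -> T? {
        guard let data = baggages[key] else { return nil }
        return try JSONDecoder().decode(type, from: data)
    }
}

// Notifier sketch: every change to `globalAttributes` is published on the bus.
final class FatalErrorContextNotifierSketch {
    static let attributesKey = "rum-global-attributes" // hypothetical key
    private let bus: MessageBusSketch

    init(bus: MessageBusSketch) {
        self.bus = bus
    }

    var globalAttributes: [String: String] = [:] {
        didSet {
            // Encoding errors are ignored in this sketch.
            try? bus.send(
                GlobalRUMAttributesSketch(attributes: globalAttributes),
                forKey: FatalErrorContextNotifierSketch.attributesKey
            )
        }
    }
}

let bus = MessageBusSketch()
let notifier = FatalErrorContextNotifierSketch(bus: bus)
notifier.globalAttributes["user.plan"] = "premium"
notifier.globalAttributes["cart.items"] = "3"

// Receiver side (crash reporting in this sketch): decode the latest attributes.
let received = try bus.decode(
    GlobalRUMAttributesSketch.self,
    forKey: FatalErrorContextNotifierSketch.attributesKey
)
```

Each attribute change produces one small message, and the receiver only ever needs the latest one.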

🎁 On the testing front, I separated tests for updating the FatalErrorContextNotifier (1 and 2) from tests for sending the encoded baggage over the message bus (3 and 4).

♻️ Considered Alternative Approach

In my initial attempt, I tried a simpler approach of updating crashContext.lastRUMView with the latest RUM attributes. That only required producing view updates on every global attribute change and updating the view event in FatalErrorContext inside RUMViewScope. This, however, added linear backpressure to the RUM queue (red line), growing with the number of calls to monitor.addAttribute(forKey:). Without an "add multiple attributes" API, this performance hit is not acceptable.

The solution implemented in this PR adds much less to the RUM backpressure (yellow line). The remaining impact comes from message bus communication, which is done on a shared thread.

Review checklist

Feature or bugfix MUST have appropriate tests (unit, integration)
Make sure each commit and the PR mention the Issue number or JIRA reference
Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

Run unit tests for Core, RUM, Trace, Logs, CR and WVT
Run unit tests for Session Replay
Run integration tests
Run smoke tests
Run tests for tools/

datadog-datadog-prod-us1 · 2024-05-14T15:08:47Z

Datadog Report

Branch report: ncreated/RUM-3588/update-crash-context-with-rum-attributes
Commit report: 99b72b8
Test service: dd-sdk-ios

✅ 0 Failed, 3188 Passed, 0 Skipped, 2m 18.87s Total Time
🔻 Test Sessions change in coverage: 9 decreased, 4 increased

🔻 Code Coverage Decreases vs Default Branch (9)

This report shows up to 5 code coverage decreases.

test DatadogCrashReportingTests tvOS 27.44% (-0.27%) - Details
test DatadogTraceTests tvOS 55.39% (-0.24%) - Details
test DatadogInternalTests tvOS 79.8% (-0.23%) - Details
test DatadogCrashReportingTests iOS 27.43% (-0.23%) - Details
test DatadogInternalTests iOS 79.8% (-0.2%) - Details

ncreated · 2024-05-16T15:57:09Z

DatadogCore/Tests/Datadog/CrashReporting/CrashContext/CrashContextProviderTests.swift

💡 I revamped the way we test CrashContextCoreProvider to pay down tech debt taken on during the V2 migration. The debt was in testing the provider by sending messages to a mocked core.

Full Rationale (from my Slack thread):

The way we manage CrashContext updates is super fragile 🤹. Here we replace the whole context on every DatadogContext update. SDK context updates are very frequent, meaning we do it often. To get it right, a contributor needs a deep understanding of the following points:

The custom Equatable implementation for CrashContext must be updated with the new value added.

The convenience CrashContext.init() must be updated to carry over the previous value; otherwise the DatadogContext overwrite will erase the previous state.

The initial CrashContext is nil until the first DatadogContext arrives. This was not visible in previous tests, as they followed the pattern of installing it inside a core mock, which accidentally guarantees the availability of the first DatadogContext.

With the testing convention I introduced in this file, all of the above criteria are now covered.
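The three pitfalls listed above can be illustrated with a small sketch. All types here (`SDKContextSketch`, `CrashContextSketch`, `CrashContextProviderSketch`) are hypothetical, heavily simplified stand-ins and do not mirror the real DatadogContext or CrashContext definitions.

```swift
import Foundation

// Hypothetical stand-in for the frequently updated SDK context.
struct SDKContextSketch {
    var appVersion: String
}

struct CrashContextSketch: Equatable {
    var appVersion: String
    var lastRUMAttributes: [String: String]?

    // Convenience initializer used on every SDK context update. It must carry
    // over state that does not come from the SDK context, or that state is lost.
    init(sdkContext: SDKContextSketch, lastRUMAttributes: [String: String]?) {
        self.appVersion = sdkContext.appVersion
        self.lastRUMAttributes = lastRUMAttributes
    }

    // Custom Equatable: every newly added property has to be compared here too.
    static func == (lhs: CrashContextSketch, rhs: CrashContextSketch) -> Bool {
        return lhs.appVersion == rhs.appVersion
            && lhs.lastRUMAttributes == rhs.lastRUMAttributes
    }
}

final class CrashContextProviderSketch {
    // Stays nil until the first SDK context arrives.
    private(set) var current: CrashContextSketch?
    private var rumAttributes: [String: String]?

    func receive(sdkContext: SDKContextSketch) {
        // Replaces the whole context, transporting the previously received attributes.
        current = CrashContextSketch(sdkContext: sdkContext, lastRUMAttributes: rumAttributes)
    }

    func receive(rumAttributes: [String: String]) {
        self.rumAttributes = rumAttributes
        current?.lastRUMAttributes = rumAttributes
    }
}

let provider = CrashContextProviderSketch()
provider.receive(rumAttributes: ["key": "value"])
let contextBeforeFirstUpdate = provider.current // no SDK context received yet

provider.receive(sdkContext: SDKContextSketch(appVersion: "1.0"))
provider.receive(sdkContext: SDKContextSketch(appVersion: "1.1"))
let contextAfterUpdates = provider.current
```

A test that only ever runs after a first SDK context has been delivered would never observe the `nil` state shown by `contextBeforeFirstUpdate`.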

ncreated · 2024-05-16T15:57:41Z

DatadogCore/Tests/Datadog/CrashReporting/CrashContext/CrashContextTests.swift

+        )
+    }
+
+    func testGivenContextWithLastLogttributesSet_whenItGetsEncoded_thenTheValueIsPreservedAfterDecoding() throws {


💡 Adding coverage for encoding and decoding global Log attributes, as it was missing.

ncreated · 2024-05-16T16:00:34Z

DatadogRUM/Tests/RUMMonitor/Scopes/RUMSessionScopeTests.swift

-        let baggageSent = try XCTUnwrap(featureScope.messagesSent().lastBaggage(withKey: RUMBaggageKeys.sessionState))
-        let sessionStateSent: RUMSessionState = try baggageSent.decode()
-        XCTAssertEqual(sessionStateSent, expectedSessionState, "It must send 'session state' message")
+        let actualSessionState = try XCTUnwrap(fatalErrorContext.sessionState)
+        XCTAssertEqual(actualSessionState, expectedSessionState)


💡 This change in RUMSessionScopeTests asserts the update on FatalErrorContext instead of inspecting the baggage sent over the message bus. The baggage is now tested in FatalErrorContextTests, which streamlines things and better separates concerns.

ncreated · 2024-05-16T16:02:12Z

DatadogRUM/Tests/RUMMonitor/Scopes/RUMSessionScopeTests.swift

-    func testWhenScopeEnded_itDoesNotSendViewResetMessage() {
-        let featureScope = FeatureScopeMock()
-
-        // Given
-        let scope: RUMSessionScope = .mockWith(
-            parent: parent,
-            startTime: Date(),
-            dependencies: .mockWith(featureScope: featureScope)
-        )
-
-        let startViewCommand = RUMStartViewCommand.mockWith(time: Date(), identity: .mockViewIdentifier())
-        _ = scope.process(command: startViewCommand, context: context, writer: writer)
-        let startResourceCommand = RUMStartResourceCommand.mockWith(time: Date())
-        _ = scope.process(command: startResourceCommand, context: context, writer: writer)
-
-        // When
-        _ = scope.process(command: RUMStopSessionCommand.mockWith(time: Date()), context: context, writer: writer)
-        let stopResourceCommand = RUMStopResourceCommand.mockWith(resourceKey: startResourceCommand.resourceKey, time: Date())
-        _ = scope.process(command: stopResourceCommand, context: context, writer: writer)
-
-        // Then
-        let viewResetMessages = featureScope.messagesSent().filter { $0.asBaggage?.key == RUMBaggageKeys.viewReset }
-        XCTAssertEqual(viewResetMessages.count, 1, "It must send only one 'view reset' message")
-    }


💡 I removed this test as it covers non-existent behaviour. Once RUMStopSessionCommand is received, the RUM monitor deallocates its scope, so it can no longer receive future commands (RUMStopResourceCommand in this case).

ncreated · 2024-05-16T16:02:31Z

DatadogRUM/Tests/RUMMonitor/Scopes/RUMViewScopeTests.swift

-        let baggageSent = try XCTUnwrap(featureScope.messagesSent().firstBaggage(withKey: RUMBaggageKeys.viewEvent))
-        let rumViewSent: RUMViewEvent = try baggageSent.decode()
-        DDAssertReflectionEqual(rumViewSent, rumViewWritten, "It must sent written event over message bus")
+        let rumViewInFatalErrorContext = try XCTUnwrap(fatalErrorContext.view)
+        DDAssertReflectionEqual(rumViewWritten, rumViewInFatalErrorContext, "It must update fatal error context with the view event written")


💡 This change in RUMViewScopeTests asserts the update on FatalErrorContext instead of inspecting the baggage sent over the message bus. The baggage is now tested in FatalErrorContextTests, which streamlines things and better separates concerns.

maciejburda

Looks good! Great coverage!

Some minor comments

maciejburda · 2024-05-20T13:51:06Z

DatadogCrashReporting/Sources/CrashContext/CrashContextProvider.swift

+    private func updateRUMAttributes(with baggage: FeatureBaggage, to core: DatadogCoreProtocol) {
+        queue.async { [weak core] in
+            do {
+                self.rumAttributes = try baggage.decode(type: GlobalRUMAttributes.self)


shouldn't we weakify self as well? 🤔

👍 Good point, ✅ updated in the whole file
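For illustration, the weak capture discussed here might look like the sketch below. `AttributesReceiverSketch` and its queue label are hypothetical, and the payload is assumed to be JSON-encoded `[String: String]` rather than the SDK's baggage type.

```swift
import Foundation

final class AttributesReceiverSketch {
    private let queue = DispatchQueue(label: "sketch.crash-context") // serial queue
    private var rumAttributes: [String: String]?

    func update(with encoded: Data) {
        // Capture `self` weakly so a pending block does not extend the receiver's lifetime.
        queue.async { [weak self] in
            guard let self = self else { return }
            // Decoding errors are swallowed in this sketch.
            self.rumAttributes = try? JSONDecoder().decode([String: String].self, from: encoded)
        }
    }

    func currentAttributes() -> [String: String]? {
        // Runs after any previously enqueued block, because the queue is serial.
        return queue.sync { self.rumAttributes }
    }
}

let receiver = AttributesReceiverSketch()
let payload = try JSONEncoder().encode(["key": "value"])
receiver.update(with: payload)
let decoded = receiver.currentAttributes()
```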

maciejburda · 2024-05-20T13:52:08Z

DatadogInternal/Sources/Models/RUM/GlobalRUMAttributes.swift

+
+import Foundation
+
+public struct GlobalRUMAttributes: Codable, PassthroughAnyCodable {


would it make sense to write extension for [AttributeKey: AttributeValue] instead?

Extending a standard type would bring much less clarity to the whole concept. GlobalRUMAttributes is a shared type (it is defined in DatadogInternal) that stands for the contract between RUM and Crash Reporting. Whenever we come across this type in code, we know it is meant for message bus communication. The same could not be achieved with a ubiquitous standard type.

Also, the type defines two aspects that could not be supported in a dictionary:

Codable - so we can code it back and forth ([String: Encodable] can't be Codable);

PassthroughAnyCodable - to avoid marshalling the type when passing it over the message bus ([String: Encodable] can't be managed this way as we use it transitively in multiple other types exchanged over MB, e.g. for encoding user attributes in the RUM view event);
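As a rough illustration of the Codable point: a dictionary of existential `Encodable` values cannot be decoded back, whereas a dedicated wrapper type can define how its values round-trip. The sketch below assumes a hypothetical `AttributeValueSketch` enum limited to strings and integers; it is not how the SDK encodes attribute values, and PassthroughAnyCodable is not modelled.

```swift
import Foundation

// Hypothetical closed set of attribute values, so that decoding is well defined.
enum AttributeValueSketch: Codable, Equatable {
    case string(String)
    case int(Int)

    init(from decoder: Decoder) throws {
        let container = try decoder.singleValueContainer()
        if let value = try? container.decode(Int.self) {
            self = .int(value)
        } else {
            self = .string(try container.decode(String.self))
        }
    }

    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        switch self {
        case .string(let value): try container.encode(value)
        case .int(let value): try container.encode(value)
        }
    }
}

// Dedicated wrapper type: its name signals the contract, and it is Codable as a whole.
struct GlobalRUMAttributesSketch: Codable, Equatable {
    let attributes: [String: AttributeValueSketch]
}

let original = GlobalRUMAttributesSketch(
    attributes: ["user.plan": .string("premium"), "cart.items": .int(3)]
)
let data = try JSONEncoder().encode(original)
let roundTripped = try JSONDecoder().decode(GlobalRUMAttributesSketch.self, from: data)
```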

but do not send view update event (to avoid performance penalties such as adding backpressure to writer queue, increasing RUM directory size with more event writes and effectively extending the backlog of uploadable files for RUM)

maxep

It looks great! Thanks for tackling the debt, it is way better now.
I have left a small suggestion on naming but it can be merged as is.

maxep · 2024-05-21T08:34:46Z

DatadogRUM/Sources/RUMMonitor/Scopes/FatalErrorContextNotifier.swift

 /// It tracks value changes and notifies updates on message bus.
-internal final class FatalErrorContextNotifier {
+internal final class FatalErrorContextNotifier: FatalErrorContextNotifying {


/suggestion I think the names should reflect responsibility (interface) vs. behavior (implementation). Here it would be sending the context vs. sending on the bus.

internal final class FatalErrorContextBus: FatalErrorContextSender

That's true for the strategy pattern, but it is not the case here. The protocol is made purely for dependency injection and testability. It is not for introducing multiple strategies of sending FatalErrorContext.

If at any point we want to notify FEC through different implementations, then I'll be 100% on the side of this suggestion. For now I'll keep it as it is 🙂.
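A protocol introduced purely for dependency injection, as described in this reply, could be used in tests along these lines. The names `FatalErrorContextNotifyingSketch`, `NotifierSpy` and `MonitorSketch` are hypothetical and only illustrate the idea; they are not the SDK's types.

```swift
// The protocol exists only so that tests can substitute the notifier.
protocol FatalErrorContextNotifyingSketch: AnyObject {
    var globalAttributes: [String: String] { get set }
}

// Test double that records every update it receives.
final class NotifierSpy: FatalErrorContextNotifyingSketch {
    private(set) var recordedUpdates: [[String: String]] = []

    var globalAttributes: [String: String] = [:] {
        didSet { recordedUpdates.append(globalAttributes) }
    }
}

// Hypothetical consumer that receives the notifier as a dependency.
final class MonitorSketch {
    private let notifier: FatalErrorContextNotifyingSketch
    private var attributes: [String: String] = [:]

    init(notifier: FatalErrorContextNotifyingSketch) {
        self.notifier = notifier
    }

    func addAttribute(forKey key: String, value: String) {
        attributes[key] = value
        notifier.globalAttributes = attributes
    }
}

let spy = NotifierSpy()
let monitor = MonitorSketch(notifier: spy)
monitor.addAttribute(forKey: "a", value: "1")
monitor.addAttribute(forKey: "b", value: "2")
```

With this split, scope tests can assert on the state recorded by the spy, while the message bus encoding is covered separately in the notifier's own tests.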

ncreated self-assigned this May 14, 2024

Base automatically changed from ncreated/RUM-3588/send-view-update-on-attributes-change-part1 to develop May 16, 2024 14:57

ncreated force-pushed the ncreated/RUM-3588/update-crash-context-with-rum-attributes branch from 5a893cd to 13d7263 Compare May 16, 2024 15:49

ncreated commented May 16, 2024

ncreated marked this pull request as ready for review May 16, 2024 16:52

ncreated requested review from a team as code owners May 16, 2024 16:52

maciejburda previously approved these changes May 20, 2024

ncreated added 9 commits May 21, 2024 09:30

RUM-3588 Abstract FatalErrorContextNotifier and add tests

8913560

RUM-3588 Fix linter

3de34ee

RUM-3588 Send global RUM attributes direclty to CrashContext

dbade4a

An alternative approach was sending attributes as part of last RUMView, however causing view updates on every attribute change is producing significant backpressure on RUM queue.

RUM-3588 Write RUM attributes to RUM view and error created for crash…

7b55807

… report

RUM-3588 Cleanup

3fae9b8

RUM-3588 Add integration test for adding global RUM attributes to cra…

6ef10e9

…sh error

RUM-3588 Update CHANGELOG.md

eb610b9

RUM-3588 CR feedback + rebase

99b72b8

ncreated dismissed maciejburda’s stale review via 99b72b8 May 21, 2024 07:45

ncreated force-pushed the ncreated/RUM-3588/update-crash-context-with-rum-attributes branch from 5b17904 to 99b72b8 Compare May 21, 2024 07:45

maxep approved these changes May 21, 2024

ncreated requested a review from maciejburda May 21, 2024 12:54

ncreated merged commit b7552c0 into develop May 21, 2024
15 checks passed

ncreated deleted the ncreated/RUM-3588/update-crash-context-with-rum-attributes branch May 21, 2024 16:54

This was referenced May 29, 2024

Dogfood recent changes #1865

Merged

Release 2.12.0 #1871

Merged
