Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUM-3588 fix: Always update crash context with RUM attributes #1834

Merged

Conversation

ncreated
Copy link
Member

@ncreated ncreated commented May 14, 2024

What and why?

📦 With this PR, RUM error and RUM view sent after crash will always include up-to-date attributes set on RUMMonitor.

Because we don't update crash context with the "last RUM view" info on every change to global attributes, the global RUM attributes held in "last view" were not always correct. Global attributes change that happened during the last active view was never included in the crash error.

How?

The RUM.FatalErrorContextNotifier is updated with globalAttributes property. Upon change, it sends a baggage over message bus holding DatadogInternal.GlobalRUMAttributes (new type). It is received in DatadogCrashReporting and gets encoded into CrashContext.

Upon restarting the app from crash, the value is decoded to RUM.CrashContext on regular basis.

🎁 On testing front, I separated tests for updating the FatalErrorContextNotifier (1 and 2) from sending encoded baggage over message bus (3 and 4):

Screenshot 2024-05-16 at 18 19 42

♻️ Considered Alternative Approach

In my initial attempt, I tried a simpler approach of updating crashContext.lastRUMView with latest RUM attributes. That only required producing view updates on every global attribute change and updating the view event in FatalErrorContext inside RUMViewScope. This however added a linear backpressure to the RUM queue (red line), growing with the number of calls to monitor.addAttribute(forKey:). With lack of "add multiple attributes" API, this perf hit is not acceptable.

The solution implemented in this PR adds much less to the RUM backpressure (yellow line). The impact is related to message bus communication which is done on shared thread.

Screenshot 2024-05-16 at 18 33 20

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Core, RUM, Trace, Logs, CR and WVT
  • Run unit tests for Session Replay
  • Run integration tests
  • Run smoke tests
  • Run tests for tools/

@ncreated ncreated self-assigned this May 14, 2024
@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented May 14, 2024

Datadog Report

Branch report: ncreated/RUM-3588/update-crash-context-with-rum-attributes
Commit report: 99b72b8
Test service: dd-sdk-ios

✅ 0 Failed, 3188 Passed, 0 Skipped, 2m 18.87s Total Time
🔻 Test Sessions change in coverage: 9 decreased, 4 increased

🔻 Code Coverage Decreases vs Default Branch (9)

This report shows up to 5 code coverage decreases.

  • test DatadogCrashReportingTests tvOS 27.44% (-0.27%) - Details
  • test DatadogTraceTests tvOS 55.39% (-0.24%) - Details
  • test DatadogInternalTests tvOS 79.8% (-0.23%) - Details
  • test DatadogCrashReportingTests iOS 27.43% (-0.23%) - Details
  • test DatadogInternalTests iOS 79.8% (-0.2%) - Details

Base automatically changed from ncreated/RUM-3588/send-view-update-on-attributes-change-part1 to develop May 16, 2024 14:57
@ncreated ncreated force-pushed the ncreated/RUM-3588/update-crash-context-with-rum-attributes branch from 5a893cd to 13d7263 Compare May 16, 2024 15:49
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 I did revamp of the way we test CrashContextCoreProvider to pay down tech debt taken during V2-migration. The debt was on testing the provider through sending messages to mocked core.

Full Rationale (from my Slack thread):

The way we manage CrashContext updates is super fragile 🤹. Here we replace the whole context on every DatadogContext update. SDK context updates are very frequent, meaning we do it often. To make it correct, there are few elements that require contributor to have deep understanding of following bits:


With the testing convention I introduced to this file, now all above criteria are covered.

)
}

func testGivenContextWithLastLogttributesSet_whenItGetsEncoded_thenTheValueIsPreservedAfterDecoding() throws {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Adding coverage to coding global Log attributes as it was missing.

Comment on lines -364 to +367
let baggageSent = try XCTUnwrap(featureScope.messagesSent().lastBaggage(withKey: RUMBaggageKeys.sessionState))
let sessionStateSent: RUMSessionState = try baggageSent.decode()
XCTAssertEqual(sessionStateSent, expectedSessionState, "It must send 'session state' message")
let actualSessionState = try XCTUnwrap(fatalErrorContext.sessionState)
XCTAssertEqual(actualSessionState, expectedSessionState)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 This change in RUMSessionScopeTests is around asserting the update on FatalErrorContext instead of inspecting the baggage sent over message bus. The baggage is now tested in FatalErrorContextTests which streamlines things and better separates concerns.

Comment on lines -589 to -612
func testWhenScopeEnded_itDoesNotSendViewResetMessage() {
let featureScope = FeatureScopeMock()

// Given
let scope: RUMSessionScope = .mockWith(
parent: parent,
startTime: Date(),
dependencies: .mockWith(featureScope: featureScope)
)

let startViewCommand = RUMStartViewCommand.mockWith(time: Date(), identity: .mockViewIdentifier())
_ = scope.process(command: startViewCommand, context: context, writer: writer)
let startResourceCommand = RUMStartResourceCommand.mockWith(time: Date())
_ = scope.process(command: startResourceCommand, context: context, writer: writer)

// When
_ = scope.process(command: RUMStopSessionCommand.mockWith(time: Date()), context: context, writer: writer)
let stopResourceCommand = RUMStopResourceCommand.mockWith(resourceKey: startResourceCommand.resourceKey, time: Date())
_ = scope.process(command: stopResourceCommand, context: context, writer: writer)

// Then
let viewResetMessages = featureScope.messagesSent().filter { $0.asBaggage?.key == RUMBaggageKeys.viewReset }
XCTAssertEqual(viewResetMessages.count, 1, "It must send only one 'view reset' message")
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 I remove this test as it covers non-existing behaviour. Once RUMStopSessionCommand is received, the RUM monitor deallocates its scope, so it can no longer receive future commands (RUMStopResourceCommand in this case).

Comment on lines -2540 to +2586
let baggageSent = try XCTUnwrap(featureScope.messagesSent().firstBaggage(withKey: RUMBaggageKeys.viewEvent))
let rumViewSent: RUMViewEvent = try baggageSent.decode()
DDAssertReflectionEqual(rumViewSent, rumViewWritten, "It must sent written event over message bus")
let rumViewInFatalErrorContext = try XCTUnwrap(fatalErrorContext.view)
DDAssertReflectionEqual(rumViewWritten, rumViewInFatalErrorContext, "It must update fatal error context with the view event written")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 This change in RUMViewScopeTests is around asserting the update on FatalErrorContext instead of inspecting the baggage sent over message bus. The baggage is now tested in FatalErrorContextTests which streamlines things and better separates concerns.

@ncreated ncreated marked this pull request as ready for review May 16, 2024 16:52
@ncreated ncreated requested review from a team as code owners May 16, 2024 16:52
maciejburda
maciejburda previously approved these changes May 20, 2024
Copy link
Member

@maciejburda maciejburda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Great coverage!

Some minor comments

private func updateRUMAttributes(with baggage: FeatureBaggage, to core: DatadogCoreProtocol) {
queue.async { [weak core] in
do {
self.rumAttributes = try baggage.decode(type: GlobalRUMAttributes.self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we weakify self as well? 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Good point, ✅ updated in the whole file


import Foundation

public struct GlobalRUMAttributes: Codable, PassthroughAnyCodable {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to write extension for [AttributeKey: AttributeValue] instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extending standard type would bring much less clarity to the whole concept. GlobalRUMAttributes is a shared type (it is defined in DatadogInternal), standing for contract between RUM and Crash Reporting. Whenever we step on this type in code, we know it is meant for message bus communication. Same could not be achieved with using standard ubiquitous type.

Also, the type defines two aspects that could not be supported in dictionary:

  • Codable - so we can code it back and forth ([String: Endocable] can't be Codable);
  • PassthroughAnyCodable - to avoid marshalling the type when passing it over message bus ([String: Encodable] can't be managed this way as we use it transitively in multiple other types exchanged over MB, e.g. for encoding user attributes in RUM view event);

ncreated added 9 commits May 21, 2024 09:30
but do not send view update event (to avoid performance penalties such as adding
backpressure to writer queue, increasing RUM directory size with more event writes
and effectively extending the backlog of uploadable files for RUM)
An alternative approach was sending attributes as part of last RUMView, however
causing view updates on every attribute change is producing significant backpressure
on RUM queue.
@ncreated ncreated force-pushed the ncreated/RUM-3588/update-crash-context-with-rum-attributes branch from 5b17904 to 99b72b8 Compare May 21, 2024 07:45
Copy link
Member

@maxep maxep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great! Thanks for tackling the debt, it is way bette now.
I have left a small suggestion on naming but it can be merged as is.

/// It tracks value changes and notifies updates on message bus.
internal final class FatalErrorContextNotifier {
internal final class FatalErrorContextNotifier: FatalErrorContextNotifying {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/suggestion I think the names should reflect responsibility (interface) vs. behavior (imp). Here it would sending the context vs. sending on the bus.

internal final class FatalErrorContextBus: FatalErrorContextSender

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's true for strategy pattern, but is not the case here. The protocol is made purely for dependency injection and testability. It is not for introducing multiple strategies of sending FatalErrorContext.

If at any point we want to notify FEC through different implementations, then I'll be 100% on the side of this suggestion. For now I'll keep it as it is 🙂.

@ncreated ncreated requested a review from maciejburda May 21, 2024 12:54
@ncreated ncreated merged commit b7552c0 into develop May 21, 2024
15 checks passed
@ncreated ncreated deleted the ncreated/RUM-3588/update-crash-context-with-rum-attributes branch May 21, 2024 16:54
This was referenced May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants