Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUM-2925 feat: Add backtrace generation capability to DatadogCoreProtocol #1687

Merged

Conversation

ncreated
Copy link
Member

@ncreated ncreated commented Feb 19, 2024

What and why?

📦 This PR adds a new capability to the SDK core: generating backtrace. Before, stack traces were only captured for crashes, now we enable the SDK to dump all threads and binary images at any moment at runtime.

This will be used in App Hangs monitoring to associate main thread's stack trace to pending hang. The actual work will be done in next PR, here we only focus on adding "backtrace reporting" capability to core.

How?

An entry point to this functionality is generateLiveReport API in PLCR:

/**
 * Generate a live crash report, without triggering an actual crash condition. This may be used to log
 * current process state without actually crashing. The crash report data will be returned on
 * success.
 * 
 * @return Returns nil if the crash report data could not be loaded.
 */
- (NSData *) generateLiveReport {

It dumps a snapshot of all running threads in the process. Because it uses the exact format of PLCrashReport, we're able to map it into our DDCrashReport representation (that includes threads and binary images).

Because the use case here is different than crash reporting, this PR adds BacktraceReport type that holds essential information for later stack traces symbolication in Datadog app:

public struct BacktraceReport {
    public let stack: String
    public let threads: [DDThread]
    public let binaryImages: [BinaryImage]
    public let wasTruncated: Bool
}

Internal types are shared with DDCrashReport for code reuse.

🏗️ Architecture change

Architecture-wise, the challenge in this PR was to expose backtrace generation (living in PLCR and abstracted by our Crash Reporting) to other feature modules (RUM in particular, but Logs and Trace can follow).

To solve this, when enabled, DatadogCrashReporting registers BacktraceRecordingFeature to the core. This is similar how NetworkInstrumentationFeature operates. Later on, when requesting backtrace, RUM will call core and the call will be delegated to Crash Reporting. Picture:

Screenshot 2024-02-19 at 12 46 38

Following on the action points from our last iOS Dev Chat, I moved DatadogCrashReporting APIs to DatadogInternal in this PR. This way both RUM and Crash Reporting can use the same model definition without requiring type erasing through "baggage" concept:

Screenshot 2024-02-19 at 12 56 13

Moving also the DDCrashReport which unlocks the performance optimisation that we have planned for RUM-2971 as a follow up action from (internal) SDK latency investigation. By sharing the model definition between features, there is no longer a need for coding it through message-bus "baggage".

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Core, RUM, Trace, Logs, CR and WVT
  • Run unit tests for Session Replay
  • Run integration tests
  • Run smoke tests
  • Run tests for tools/

@ncreated ncreated changed the title Ncreated/rum 2925/record the stack trace of app hang RUM-2925 feat: Add backtrace generation capability to DatadogCoreProtocol Feb 19, 2024
@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented Feb 19, 2024

Datadog Report

Branch report: ncreated/RUM-2925/record-the-stack-trace-of-app-hang
Commit report: 1568791
Test service: dd-sdk-ios

✅ 0 Failed, 2758 Passed, 0 Skipped, 10m 39.79s Wall Time

Comment on lines -232 to +233
let featureDirectories = try directory.getFeatureDirectories(forFeatureNamed: T.name)
if let feature = feature as? DatadogRemoteFeature {
let featureDirectories = try directory.getFeatureDirectories(forFeatureNamed: T.name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a miss - we need to create storage directories only for DatadogRemoteFeature. Otherwise we will be creating it for NetworkInstrumentationFeature and BacktraceReportingFeature unnecessarily. Added extra unit tests for this change.

Comment on lines +41 to +43
if let plcr = PLCrashReporterPlugin.thirdPartyCrashReporter {
try core.register(backtraceReporter: BacktraceReporter(reporter: plcr))
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be more elegant, but would require revamping the startup sequence for DatadogCrashReporting. It still depends on CrashReportingPlugin, which was V1 dependency-inverter, no longer required in V2. This refactor is medium-size, hence I didn't want to do it for this PR.

@ncreated ncreated marked this pull request as ready for review February 19, 2024 12:04
@ncreated ncreated requested review from a team as code owners February 19, 2024 12:04
Copy link
Contributor

@ganeshnj ganeshnj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a small question 👌

import Foundation

/// Crash Report format supported by Datadog SDK.
@objc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this to be inter-operable with objc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair call, I traced this back to this commit 2f892c2 where we were adding Obj-c compatibility for Crash Reporting. This was back in V1. With V2 aligning module dependencies much better, this can be now internal to Swift domain, so it is not required. Changed here and consequently in few other places. ✅

Copy link
Member

@maxep maxep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!! I really appreciate the extension pattern applied to the core protocol 👌

Copy link
Member

@maciejburda maciejburda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


/// Tests integration of `DatadogCore` and `DatadogCrashReporting` for backtrace generation.
class GeneratingBacktraceTests: XCTestCase {
private var core: DatadogCoreProxy! // swiftlint:disable:this implicitly_unwrapped_optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably worth having this lint rule global to all the tests

Comment on lines +65 to +68
// This is quite opportunistic - we map PLCR's live report through existing `DDCrashReport` builder to
// then extract essential elements for assembling `BacktraceReport`. It works for now, but be careful
// with how this evolves. We may need a dedicated `BacktraceReport` builder that only shares some code
// with `DDCrashReport` builder.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's mostly affected by the PLCR update?

Do we expect to have failed unit tests when something like this happen, or we should have some extra measures?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is PLCR-safe 👍. Basically, both for Crash Reports and Backtrace Reports, the value returned by PLCR is PLCrashReport. We then map it through our DDCrashReportBuilder, whereas implementing distinct BacktraceReportBuilder would be more correct and scalable. We don't need it today, hence I'm leaving this decision context in comment.

Base automatically changed from ncreated/RUM-2925/report-non-fatal-app-hangs-in-RUM to app-hangs February 23, 2024 14:05
@ncreated ncreated merged commit 91da074 into app-hangs Feb 23, 2024
8 checks passed
@ncreated ncreated deleted the ncreated/RUM-2925/record-the-stack-trace-of-app-hang branch February 23, 2024 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants