-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RUM-3461 feat: Send fatal App Hang after app restart #1759
RUM-3461 feat: Send fatal App Hang after app restart #1759
Conversation
Datadog ReportBranch report: ✅ 0 Failed, 2919 Passed, 0 Skipped, 11m 22.99s Wall Time 🔻 Code Coverage Decreases vs Default Branch (10)
|
0243a15
to
b954910
Compare
b954910
to
66a117d
Compare
@testable import DatadogCrashReporting | ||
@testable import DatadogCore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 It makes a significant step towards moving CrashReportReceiverTest
to RUM tests target, instead of maintaining it in Core tests. What is left is the dependency on DatadogCrashReporting.CrashContext
mock. This will be resolved once all CR models are moved to DatadogInternal
👍.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, well done 👏 👏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions, one critical on privacy side.
Reads well 🎉
XCTAssertNotNil(sentRUMError.additionalAttributes?[DDError.meta], "It must contain crash details") | ||
XCTAssertNotNil(sentRUMError.additionalAttributes?[DDError.wasTruncated], "It must contain crash details") | ||
XCTAssertEqual(sentRUMError.model.error.sourceType, .ios, "Must send .ios as the sourceType") | ||
XCTAssertEqual(sentRUMError.dd.session?.plan, .plan1, "All RUM events should use RUM Lite plan") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice cleanup 🧹
/// The crash context | ||
let context: CrashContext | ||
} | ||
|
||
private struct CrashReport: Decodable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓
no longer needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes 👍. This was the CrashReport
structure sent from DatadogCrashReporting
to DatadogLogs
over message bus. Because in #1687 we moved DDCrashReport
to DatadogInternal
we no longer need to encode and decode it over the bus.
In other words, the DatadogCrashReporting.CrashReportSender
sends DatadogInternal.DDCrashReport
, so here in DatadogLogs
we can receive exactly the same type. This PR also adds PassthroughCodable
conformance to DDCrashReport
, so we can bypass serialization when passing it over message bus - which brings perf improvement when compared to decoding removed struct.
/// A crash with given metadata information. | ||
case crash | ||
/// A fatal App Hang. | ||
case hang |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓
How is an hang fatal?
Is this a recoverable hang or customer quit after this hang? or we don't know about such thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Fatal" means a hang that was followed by app termination (process kill). The termination is the aspect that differentiates non-fatal and fatal hangs. See:
Currently we don't differentiate if the termination was due to user force quitting a hanged app or if the app was killed by OS watchdog due to prolonging hang. We should be able to distinguish these when watchdog terminations are implemented - we will have to track "user force quits" to recognise OOM errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the links.
monitor.stop() | ||
|
||
// When | ||
featureScope.contextMock.trackingConsent = .mockRandom() // no matter of the consent in restarted session |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❓
from privacy point of view, shouldn't both before and after tracing must be granted
?
There could be extra information we might attach to old events (at the time granted) in the subsequent run of the app where the consent is not granted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking more on this, can we assert on the event that we recorded and sent is same, nothing gets added after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from privacy point of view, shouldn't both before and after tracing must be
granted
?
What we do here is the same how it works in Crash Reporting, so it follows earlier product decisions.
There could be extra information we might attach to old events (at the time granted) in the subsequent run of the app where the consent is not granted.
thinking more on this, can we assert on the event that we recorded and sent is same, nothing gets added after
This is fair call 👍. Today there is no such extra information added, but I agree with the principle of testing this. As discussed on Slack, we could introduce DDAssertDif()
to expect changes only on certain values. I did a spike on adding it in this PR, but I failed due to an extra complexity of coding Encodable
values as part of RUM events which is not compatible with the basic reflection comparison that we implement today. It turns out to be more complex task - detaching to separate JIRA: RUM-4011
.
[ | ||
"No pending App Hang found", // from hanged process | ||
"Sending fatal App hang as RUM error with issuing RUM view update", // from next process | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀 nice assertions.
What and why?
📦 This PR adds the logic of sending RUM errors for fatal App Hangs. It completes the previous work done in #1751 where fatal hangs were tracked, but not yet sent.
🎁 Because the condition of uploading fatal App Hangs to previous RUM session is very similar to the flow of sending RUM errors for app crashes, this PR brings additional refactoring to reuse existing concepts for both kinds of fatal error. That includes:
RUM-3115
by droppingRUMCrashEvent
and settingerror.threads
,error.meta
,error.binary_images
anderror.was_truncated
directly inRUMErrorEvent
;FatalErrorBuilder
that is shared between fatal App Hangs handler and Crash receiver.To cut corners, proposed implementation considers only the main scenario when hang happens in existing RUM session with an active RUM view. It ignores the complexity of "no view" and "no foreground session" situations that are handled in crash reporting (through
RUMOffViewEventsHandlingRule
). There is a massive opportunity for reusing the same logic throughFatalErrorBuilder
but it must be preceded by additional refactoring, notably model sharing between crash and fatal hang contexts inDatadogInternal
. That work is detached toRUM-3840
.How?
When app is restarted with "pending hang" information:
This is achieved in similar fashion as in crash reporting. An extra context (tracking consent + NTP offset + last RUM view) is written to
DataStore
on "hang start". It is deleted on "hang end" or "hang cancelled". If found after app restart, the RUM error and RUM view are constructed withFatalErrorBuilder
. For code reuse, theFatalErrorBuilder
is also used for buildling RUM errors that transport crash reports.Review checklist
Custom CI job configuration (optional)
tools/