RUM-1843 Method Call telemetry #1723
Conversation
59427c9 to 3d9523a
Datadog Report
Branch report: ✅ 0 Failed, 2821 Passed, 0 Skipped, 13m 7.38s Wall Time
🔻 Code Coverage Decreases vs Default Branch (7)
🚀 Great to see it! Well done 👍, I left a few minor remarks (non-blocking). Let's get more opinions on the sampling value 🙏.
let methodCalledTrace = telemetry.startMethodCalled(
    operationName: MethodCallConstants.captureRecordOperationName,
    callerClass: MethodCallConstants.className,
    samplingRate: methodCallTelemetrySamplingRate // Effectively 3% * 15% = 0.45% of calls
question/ The sampling here is crucial, so it may be worth having more opinions.
Given the SR interval of 100ms, this equals (60s / 100ms) * 0.45%, so 2.7 metrics per minute. With the "only 100 telemetry events per session" limit, SR will exhaust it after ~37 minutes (100 / 2.7), assuming no other telemetry was sent. It sounds rational to me, but I wouldn't mind even lower sampling, say 2x lower (15 → 7.5, so 0.225% effectively). Thoughts @maciejburda @plousada @maxep @ganeshnj ?
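To make the budget explicit, here is a quick sanity check of that arithmetic (plain Swift, not SDK code; the 100 ms interval and the 100-events-per-session limit come from the comment above):

```swift
// Sanity check of the sampling budget discussed above.
let recordsPerMinute = 60.0 / 0.1            // SR captures a record every 100 ms → 600 per minute
let effectiveSamplingRate = 0.03 * 0.15      // 3% combined telemetry sampling * 15% call-side = 0.45%
let metricsPerMinute = recordsPerMinute * effectiveSamplingRate   // 2.7
let sessionBudget = 100.0                    // "only 100 telemetry events per session"
let minutesToExhaustBudget = sessionBudget / metricsPerMinute     // ≈ 37 minutes
```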
It can definitely be lower. Even 5% seems enough (effectively 0.15%).
As discussed, maybe all telemetries should be sampled on the call side instead. This gives more control for future metrics and removes the need to remember the math of 20% * 15% * Y% = Desired%.
Right now, whatever % we choose for call-side sampling will have to travel through the message bus unnecessarily, only to be rejected by the top-level sampling.
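A rough sketch of that idea, assuming the whole effective rate is applied where the metric is produced so rejected events never reach the message bus (the `CallSideSampler` type and `reportMethodCalled` function below are illustrative, not the SDK's actual API):

```swift
// Illustrative only: decide sampling where the metric is produced,
// so unsampled events never travel through the message bus.
struct CallSideSampler {
    /// Combined rate expressed as 0...100, e.g. 20% * 15% * 15% → 0.45.
    let effectiveRate: Double

    func sample() -> Bool {
        Double.random(in: 0..<100) < effectiveRate
    }
}

func reportMethodCalled(
    _ attributes: [String: Any],
    sampler: CallSideSampler,
    sendToBus: ([String: Any]) -> Void
) {
    guard sampler.sample() else { return } // rejected locally, no bus traffic
    sendToBus(attributes)
}
```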
> Right now, whatever % we choose for call-side sampling will have to travel through the message bus unnecessarily, only to be rejected by the top-level sampling.

Let's create a backlog item 🙏, we can accept it for now.
My bigger concern about such telemetry is turning it off.
Once this code is out in the app, we will keep receiving this telemetry even when we no longer need it, because the concern has been addressed, it is too much data, etc.
This brings up the topic of remote config, but we are not there yet.
/// - samplingRate: The sampling rate of the method call. Value between `0.0` and `100.0`, where `0.0` means NO event will be processed and `100.0` means ALL events will be processed. Note that this value is multiplied by the telemetry sampling rate (20% by default) and the metric events sampling rate (hardcoded to 15%), making the effective sampling rate 3% when this value is set to `100`.
///
/// - Returns: A `MethodCalledTrace` instance to be used to stop the method call and measure its execution time. It can be `nil` if the method call is not sampled.
func startMethodCalled(
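For context, an illustrative call site pairing the start and stop calls. The stop call is sketched from the doc comment above; its exact name and parameters are not quoted in this thread, and `recorder` stands in for the Session Replay recorder:

```swift
let trace = telemetry.startMethodCalled(
    operationName: "Capture Record",
    callerClass: String(reflecting: Recorder.self),
    samplingRate: 100 // still multiplied by the 20% telemetry and 15% metric sampling
)

recorder.captureNextRecord()

if let trace = trace { // `nil` means this particular call was not sampled
    telemetry.stopMethodCalled(trace) // hypothetical stop call reporting the elapsed time
}
```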
Identifying a method by just its name and class seems naive, but it is probably okay for the current requirements.
For example, the RequestBuilder protocol we have has implementations with the same name everywhere.
There is the concept of a fully qualified name, which gives the unique name of a method in its current context.
Thanks! Some good points there.
operation_name is agnostic of the platform, because we can measure the same "behaviour" on both platforms even though they might have totally different function signatures. For the class name we currently do this:
static let className = { String(reflecting: Recorder.self) }()
Which produces: "DatadogSessionReplay.Recorder"
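For comparison (standard Swift behaviour, not SDK-specific), `String(reflecting:)` includes the module name, which is what disambiguates same-named implementations such as the RequestBuilder case mentioned above:

```swift
String(describing: Recorder.self)  // "Recorder" – ambiguous across modules
String(reflecting: Recorder.self)  // "DatadogSessionReplay.Recorder" – module-qualified
```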
Capture Record is indeed the first, simplistic example of Method Called, but we can expand this model for future use cases if needed.
What and why?
This PR adds telemetry for Method Called (internal).
This telemetry allows measuring the execution time of methods and adds call-side sampling for heavily called functions.
Along with these interfaces, we implement the first function to utilise this: SessionReplay.Recorder.captureNextRecord().
How?
It mostly reuses the existing strategy for Batch metrics, adding a few convenience methods for measuring execution time.
Review checklist
Custom CI job configuration (optional)
tools/