Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUM-1843 Method Call telemetry #1723

Merged
merged 6 commits into from
Mar 15, 2024

Conversation

maciejburda
Copy link
Member

@maciejburda maciejburda commented Mar 13, 2024

What and why?

This PR adds telemetry for Method Called (internal).

This telemetry allows measuring execution time of methods + adds call side sampling for heavily called functions.

Along with these interfaces, we implement first function to utilise this: SessionReplay.Recorder.captureNextRecord().

How?

It's mostly reusing existing strategy for Batch metrics adding a little bit of convenience methods for measuring execution time.

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Core, RUM, Trace, Logs, CR and WVT
  • Run unit tests for Session Replay
  • Run integration tests
  • Run smoke tests
  • Run tests for tools/

@maciejburda maciejburda force-pushed the maciey/RUM-1843-capture-next-record-telemetry branch from 59427c9 to 3d9523a Compare March 13, 2024 11:22
@datadog-datadog-prod-us1
Copy link

datadog-datadog-prod-us1 bot commented Mar 13, 2024

Datadog Report

Branch report: maciey/RUM-1843-capture-next-record-telemetry
Commit report: 1d9837f
Test service: dd-sdk-ios

✅ 0 Failed, 2821 Passed, 0 Skipped, 13m 7.38s Wall Time
🔻 Test Sessions change in coverage: 7 decreased, 7 increased

🔻 Code Coverage Decreases vs Default Branch (7)

This report shows up to 5 code coverage decreases.

  • test DatadogTraceTests tvOS 49.74% (-0.22%) - Details
  • test DatadogTraceTests iOS 49.71% (-0.17%) - Details
  • test DatadogLogsTests tvOS 45.70% (-0.16%) - Details
  • test DatadogLogsTests iOS 45.64% (-0.16%) - Details
  • test DatadogCrashReportingTests iOS 28.91% (-0.12%) - Details

@maciejburda maciejburda requested a review from ncreated March 13, 2024 15:26
@maciejburda maciejburda marked this pull request as ready for review March 13, 2024 16:14
@maciejburda maciejburda requested review from a team as code owners March 13, 2024 16:14
ncreated
ncreated previously approved these changes Mar 14, 2024
Copy link
Member

@ncreated ncreated left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Great to see it! Well done 👍, I left few minor remarks (non-blocking). Let's have more opinions on sampling value 🙏 .

DatadogInternal/Sources/Telemetry/Telemetry.swift Outdated Show resolved Hide resolved
DatadogRUM/Sources/Integrations/TelemetryReceiver.swift Outdated Show resolved Hide resolved
let methodCalledTrace = telemetry.startMethodCalled(
operationName: MethodCallConstants.captureRecordOperationName,
callerClass: MethodCallConstants.className,
samplingRate: methodCallTelemetrySamplingRate // Effectively 3% * 15% = 0.45% of calls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question/ The sampling here is crucial, so it may be worth having more opinions.

Given the SR interval of 100ms, this equals (60s / 100ms) * 0.45% so 2.7 metrics per minute. Having the "only 100 telemetry events per session" limit, it means that SR will exhaust it after 37 minutes (100 / 2.7) assuming no other telemetries were sent. It sounds rationale to me, but I wouldn't mind having even lower sampling, let's say 2x lower (15 → 7.5, so 0.225% effectively). Thoughts @maciejburda @plousada @maxep @ganeshnj ?

Copy link
Member Author

@maciejburda maciejburda Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can definitely be lower. Even 5% seems enough (effectively 0.15%).

As discussed - maybe all telemetries should be sampled on the call side instead. This gives more control for future metrics, and removes the need to remember the math of 20% * 15% * Y% = Desired%.

Right now whatever % we choose for the call side sampling will have to unnecessarily travel through the message bus to be rejected by top level sampling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now whatever % we choose for the call side sampling will have to unnecessarily travel through the message bus to be rejected by top level sampling.

Let's create backlog item 🙏, we can accept it for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my bigger concern about such telemetries are about turning them off.

Once this code is out in the app, we will continue getting this telemetry even though we no longer need them because the concern has been addressed or this is way too much of data or etc.

This brings the topic of remote configs but we are not there yet.

DatadogSessionReplay/Sources/Recorder/Recorder.swift Outdated Show resolved Hide resolved
/// - samplingRate: The sampling rate of the method call. Value between `0.0` and `100.0`, where `0.0` means NO event will be processed and `100.0` means ALL events will be processed. Note that this value is multiplicated by telemetry sampling (by default 20%) and metric events sampling (hardcoded to 15%). Making it effectively 3% sampling rate for sending events, when this value is set to `100`.
///
/// - Returns: A `MethodCalledTrace` instance to be used to stop the method call and measure it's execution time. It can be `nil` if the method call is not sampled.
func startMethodCalled(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

identifying method with just name and class seems naive but probably okay for current requirements.

For example RequestBuilder protocol we have which has same impl name everywhere.

There is concept of fully qualified name which gives the unique name of the method in the current context.

Copy link
Member Author

@maciejburda maciejburda Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Some good points there

operation_name is agnostic of the platform because we can measure the same "behaviour" in both platforms, but they might have totally different function signature. For the class name we currently do this:

static let className = { String(reflecting: Recorder.self) }()

Which produces: "DatadogSessionReplay.Recorder"

Capture Record is indeed first and simplistic example Method Called, but if we need we can expand this model for the future use cases.

@maciejburda maciejburda merged commit 6acd10b into develop Mar 15, 2024
8 checks passed
@maciejburda maciejburda deleted the maciey/RUM-1843-capture-next-record-telemetry branch March 15, 2024 14:15
@maciejburda maciejburda mentioned this pull request Mar 15, 2024
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants