Trace sample rate does not work for child traces #2064

Closed
shagedorn opened this issue Sep 25, 2024 · 6 comments
Labels
question Further information is requested

Comments

@shagedorn

Describe the bug

In our app, we have automatic network tracking/tracing enabled, but also send a few custom traces. We apply the trace sample rate in several places in the Datadog setup (shortened for brevity):

let traceSampleRate = 5.0

RUM.enable(with: .init(
  …
  urlSessionTracking: .init(
    firstPartyHostsTracing: .trace(
      hosts: […],
      // apply sample rate to network tracking
      sampleRate: traceSampleRate
    )
  )
))

Trace.enable(with: .init(
  // apply sample rate to custom traces
  sampleRate: traceSampleRate,
  urlSessionTracking: .init(
    firstPartyHostsTracing: .trace(
      hosts: […],
      // apply sample rate to network traces
      sampleRate: traceSampleRate
    )
  ),
  …
))

We noticed in production that some of our custom traces show up much more often than others and suspected an issue with the sampling rate.

Reproduction steps

We were able to track it down by setting the sample rate to 0 (in our debug environment), and could see that a few traces always come through regardless. It turns out all custom traces with a parent trace are affected, regardless of whether we set the parent explicitly or Datadog detects it automatically. So, in pseudocode:

trace1.setActive()
  trace2.setActive()
  trace2.finish()
trace1.finish()

…would lead to trace2 escaping the downsampling.

trace1.setActive()
trace1.finish()

trace2.setActive()
trace2.finish()

…whereas this works as expected – given a sample rate of 0, neither of them would be sent.

SDK logs

Expected behavior

The trace sample rate should govern all traces to effectively limit cost.

We will need to disable custom traces entirely until this is fixed because it produces costs we cannot control, so we hope for a quick solution.

Affected SDK versions

2.14.2 - 2.17.0

Latest working SDK version

unknown

Did you confirm if the latest SDK version fixes the bug?

Yes

Integration Methods

SPM

Xcode Version

Xcode 15.4

Swift Version

Swift 5.9

MacOS Version

macOS Sonoma 14.7 (23H124)

Deployment Target

iOS 16

Device Information

Reproduces in simulators and devices

Other relevant information

No response

@shagedorn shagedorn added the bug Something isn't working label Sep 25, 2024
@ncreated
Member

Hey @shagedorn 👋.

Thanks a lot for providing the reproduction steps; that’s always incredibly helpful. However, in this case, I couldn't reproduce the issue. I tried the following minimal setup with a trace sampleRate of 0:

import DatadogCore
import DatadogTrace

Datadog.initialize(
    with: .init(clientToken: "<client-token>", env: "<env>"),
    trackingConsent: .granted
)
Datadog.verbosityLevel = .debug

Trace.enable(
    with: .init(sampleRate: 0)
)

for _ in (0..<10) {
    let span1 = Tracer.shared().startSpan(operationName: "parent").setActive()
    let span2 = Tracer.shared().startSpan(operationName: "child").setActive()
    span2.finish()
    span1.finish()
}

No spans were indexed.

Here are a few important points to keep in mind about APM sampling, particularly when using a sampleRate of 0 or higher:

sampleRate: 0

Even with a sampleRate of 0, all spans are sent from the SDK and ingested by Datadog APM. These spans will appear in the LIVE view (APM Traces), but because none of them will be indexed, they won’t contribute to your billing:

  • To view ingested spans:
  • To view indexed spans, e.g., from the last 30 minutes:

sampleRate: 0-100

Since version 2.11.0, DatadogTrace supports head-based sampling (see also: #1713). With this feature:

  • If the root span is sampled, all child spans will be sampled.
  • If the root span is not sampled, none of the child spans will be sampled.

For this reason, it's important not to assume that a 50% sample rate will index 50% of spans. Instead, it will index 50% of traces.
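To make the trace-versus-span distinction concrete, here is a minimal, self-contained sketch of head-based sampling. It is a toy model, not the SDK's implementation: `ToyTrace`, `indexedSpans`, and the injected `roll` closure are hypothetical names introduced only for illustration.

```swift
/// Toy model of a trace: one root span plus any number of child spans.
/// (Hypothetical type, unrelated to the DatadogTrace API.)
struct ToyTrace {
    let root: String
    let children: [String]
}

/// Head-based sampling: one keep/drop decision per trace, taken at the root.
/// Every child span inherits that decision.
/// - Parameters:
///   - sampleRate: percentage in 0...100
///   - roll: returns a value in 0..<100; injected so the sketch is deterministic
func indexedSpans(in traces: [ToyTrace], sampleRate: Double, roll: () -> Double) -> [String] {
    var kept: [String] = []
    for trace in traces {
        // The decision is made once, for the whole trace.
        let keepTrace = roll() < sampleRate
        guard keepTrace else { continue }
        kept.append(trace.root)
        kept.append(contentsOf: trace.children)
    }
    return kept
}

let traces = [
    ToyTrace(root: "parent-1", children: ["child-1a", "child-1b"]),
    ToyTrace(root: "parent-2", children: []),
]

// Fixed rolls instead of random numbers: the first trace is kept, the second dropped.
var rolls = [10.0, 90.0].makeIterator()
let kept = indexedSpans(in: traces, sampleRate: 50, roll: { rolls.next()! })
print(kept)
```

In this run, half of the traces are kept (1 of 2) but three quarters of the spans are kept (3 of 4), because the kept trace happens to be the larger one. With a rate of 0 no roll can pass the check, so nothing is kept.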

sampleRate in Distributed Tracing

The sampleRate set in the urlSessionTracking API applies to distributed traces — the traces our SDK automatically creates for intercepted network requests. This sampling rate is also respected by compatible downstream backend tracers.

⚠ In your code snippet, I noticed that urlSessionTracking is configured for both DatadogRUM and DatadogTrace. This isn't necessary. In fact, DatadogTrace instrumentation skips creating a span if the request was already instrumented by RUM. I recommend only configuring urlSessionTracking for DatadogRUM in your setup, as the DatadogTrace option is mainly useful for non-RUM users.
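As a rough sketch of that recommendation, the setup from the report could be reduced to something like the following. It mirrors the calls already shown in this thread; the application ID placeholder and the host `api.example.com` are hypothetical stand-ins, and the `Float` annotation on the sample rate is an assumption about the parameter type in the SDK version in use.

```swift
import DatadogCore
import DatadogRUM
import DatadogTrace

// Assumption: sample rates are Float percentages (0...100).
let traceSampleRate: Float = 5

// Network tracing is configured once, on RUM only.
RUM.enable(
    with: .init(
        applicationID: "<rum-application-id>",  // placeholder
        urlSessionTracking: .init(
            firstPartyHostsTracing: .trace(
                hosts: ["api.example.com"],     // hypothetical host
                sampleRate: traceSampleRate
            )
        )
    )
)

// DatadogTrace keeps only the sample rate for custom traces;
// its urlSessionTracking option is left out.
Trace.enable(
    with: .init(sampleRate: traceSampleRate)
)
```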


@shagedorn, I hope this clears things up and aligns with your expectations. Let me know how it sounds or if you need further clarification.

@ncreated ncreated added the awaiting response Waiting for response / confirmation from the reporter label Oct 18, 2024
@shagedorn
Author

Hi @ncreated, thanks a lot for all that context 🙏🏻 There's some new information in here, we're re-evaluating our initial report and will follow up with an update shortly!

@shagedorn
Author

> Even with a sampleRate of 0, all spans are sent from the SDK and ingested by Datadog APM. These spans will appear in the LIVE view (APM Traces), but because none of them will be indexed, they won’t contribute to your billing

Soo… we were blissfully unaware of head-based sampling. Thank you for clarifying this!

> For this reason, it's important not to assume that a 50% sample rate will index 50% of spans. Instead, it will index 50% of traces.

This is also helpful to know, thank you!

> I recommend only configuring urlSessionTracking for DatadogRUM in your setup, as the DatadogTrace option is mainly useful for non-RUM users.

Thank you, we'll clean this up 🙏🏻


✅ With this context, our original bug report isn't valid anymore.

❓ However, I tried your example (and our own code) and I feel like there's an opposite bug, or another misunderstanding:

When I set the sample rate to 0, it seems that only the child span is displayed in the LIVE view:

Image

…whereas when I set it to 100, I see both child & parent, in the LIVE view and also in the indexed view:

Image

Is this expected behaviour?

@maxep maxep removed the awaiting response Waiting for response / confirmation from the reporter label Oct 24, 2024
@maxep
Member

maxep commented Oct 24, 2024

Hello @shagedorn 👋

That is unexpected. I can easily replicate and I can confirm that the SDK is sending both span events when sampled out.
I will keep you posted when we get to the bottom of this.

Thank you for the report!

@maxep
Member

maxep commented Nov 4, 2024

Hello 👋

So I have more insights from the APM intake team: as @ncreated explained, the Live view presents events between ingestion and indexing. The sampling decision for keeping a trace is based on the root span (the parent in your example), so when a child span is ingested, its sampling is ignored because it has a parent. Therefore, it is ingested (not yet indexed) and you see it in the Live view.
But when the root span arrives, the intake decides to drop it because it's sampled out, which explains why it is not seen in the LIVE view. Ultimately, the entire trace is dropped.

Here is a more detailed description of the ingestion flow: https://docs.datadoghq.com/tracing/trace_pipeline/ingestion_controls/

So this behaviour is by design in our ingestion flow and the LIVE view.
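
The behaviour described above can be sketched as a small simulation. This is only a toy model of the explanation in this comment, not the actual intake implementation; `IncomingSpan` and `simulateIntake` are hypothetical names, and the model covers a single trace whose child span arrives before its root.

```swift
/// A span as it arrives at the (simulated) intake.
/// `parentName == nil` marks the root span of the trace.
struct IncomingSpan {
    let name: String
    let parentName: String?
}

/// Toy model of the flow described above, for a single trace:
/// - a child span is ingested and shown in the Live view regardless of sampling,
/// - the root span is shown only if the trace is sampled in,
/// - spans are indexed only if the root is sampled in.
func simulateIntake(
    arrivals: [IncomingSpan],
    rootIsSampled: Bool
) -> (liveView: [String], indexed: [String]) {
    var liveView: [String] = []
    for span in arrivals {
        if span.parentName != nil {
            // Child span: its own sampling is ignored because it has a parent.
            liveView.append(span.name)
        } else if rootIsSampled {
            // Root span: shown only when the trace is kept.
            liveView.append(span.name)
        }
    }
    // The root's decision governs the entire trace.
    let indexed = rootIsSampled ? arrivals.map { $0.name } : []
    return (liveView, indexed)
}

// The child finishes first, so it reaches the intake before its parent.
let arrivals = [
    IncomingSpan(name: "child", parentName: "parent"),
    IncomingSpan(name: "parent", parentName: nil),
]

let sampledOut = simulateIntake(arrivals: arrivals, rootIsSampled: false)
let sampledIn = simulateIntake(arrivals: arrivals, rootIsSampled: true)
print(sampledOut.liveView, sampledOut.indexed)
print(sampledIn.liveView, sampledIn.indexed)
```

Under these assumptions, the sampled-out case shows only the child in the Live view and indexes nothing, while the sampled-in case shows and indexes both spans, which matches the two screenshots discussed earlier in the thread.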

I hope it's clear, let me know if you need more info!

@maxep maxep added question Further information is requested and removed bug Something isn't working labels Nov 4, 2024
@shagedorn
Author

Thanks a lot for digging into this and providing additional details! I will close this issue then, it seems that there's no bug, just a misinterpretation.

3 participants