data loss submitting metrics to datadog #18123

Closed
seankhliao opened this issue Jul 31, 2023 · 17 comments
Labels: sink: datadog_metrics (Anything `datadog_metrics` sink related), source: datadog_agent (Anything `datadog_agent` source related), type: bug (A code related bug)

Comments

@seankhliao

seankhliao commented Jul 31, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We have a mildly complex set of metrics collection pipelines:

  • app -otlp-> opentelemetry collector -datadog protocol-> vector -> datadog
  • app -prometheus-> opentelemetry collector -datadog protocol-> vector -> datadog
  • datadog agent -datadog protocol-> vector -> datadog

For months we've been observing that our metric data would sometimes show lower than expected values when graphed in datadog. Teams reported metric values being lower compared to our previous vendor, two counters incremented at the same place showing different values, etc.

We have a canary application that emits a counter (and logs) at a constant rate, and for this counter we also see lower than expected values. Sometimes it's an unstable minor loss, but often we would see a consistently lower value for extended periods of time (hours to days).
A restart of vector occasionally recovers or triggers the issue, but it's not a necessary event; we've observed sudden extended drops with no change to our deployments.

We previously ruled out an error within the application and opentelemetry collector components of our pipelines: direct submission of counters from both places produced the expected values, even as we observed the same metric going through vector dropping points. We also counted events within vector using vector tap, and the data points were all present.

We recently added a proxy written in go that deserializes the metrics submission requests sent by vector,
and 1) resubmits the metrics under a new name to our metrics pipeline, 2) logs the value (which still ends up in datadog), and 3) logs the response status.
We observe a consistent 202 Accepted response from datadog.

Below I've included a selection of graphs of the data we have. An explanation of the metric names:

  • polaris.o11y.canary.logs: the canary application emitting logs
  • polaris.o11y.canary.otlp: the canary application emitting a counter through our OTLP pipeline
  • polaris.o11y.canary.prom.total: the canary exposing the metric over prometheus exposition
  • ddproxy.canary.otlp: the proxy extracting the polaris.o11y.canary.otlp counter from vector's submission request and resubmitting it as a histogram over our OTLP pipeline
  • ddproxy.canary.prom: the proxy extracting the polaris.o11y.canary.prom.total counter from vector's submission request and resubmitting it as a histogram over our OTLP pipeline
  • ddproxy.datadog_logged.otlp: the same extracted value logged and generated as a custom metric in datadog
  • ddproxy.datadog_logged.prom: the same extracted value logged and generated as a custom metric in datadog

Notes:

  • we chose to resubmit as distributions to take a different path (/api/v1/series vs /api/beta/sketches)

Example 1 week view of our canaries:
[screenshot: 2023-07-31-202137]

Dark red is our original canary metric; resubmitted as orange, it has a much higher value:
[screenshot: 2023-07-31-201846]

Sometimes our resubmitted metric is lower:
[screenshot: 2023-07-31-201752]

Configuration

sinks:
  # https://vector.dev/docs/reference/configuration/sinks/datadog_metrics/
  # Push internal_metrics for Vector agent to Datadog
  # Known issue: https://github.com/vectordotdev/vector/issues/10870
  datadog_metrics:
    type: datadog_metrics
    default_api_key: "${DD_API_KEY}"
    site: "${DD_SITE}"
    endpoint: "http://localhost:8090"
    inputs:
      - send_to_datadog.enabled

    buffer:
      max_events: 8000
      when_full: drop_newest

    request:
      # value chosen from observing production instances
      # increase pods count to scale
      # disables adaptive concurrency
      concurrency: 200

    batch:
      # if we don't limit this,
      # it appears that Datadog will just close the connection on too large payloads
      # Datadog payload limits are 5242880 bytes raw, 512000 compressed,
      # but vector calculates sizes before serialization and compression
      # https://docs.datadoghq.com/api/latest/metrics/#submit-metrics
      max_events: 4000

Version

0.29.1-distroless-libc

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

seankhliao added the type: bug label on Jul 31, 2023
@neuronull
Contributor

Hi @seankhliao, thanks for this report and for including so many data points; very helpful. I have a couple of questions:

  1. Of the 3 pipeline scenarios, does the last one (just DD Agent -> Vector -> Datadog) express the data loss in the same way as the other two? I just want to boil it down to a minimal setup that reproduces the issue, and understand whether there is anything specific about having the otel collector in the mix (because that scenario hasn't been thoroughly tested by us).

    a. If the answer is that otel does have an impact, any details you could share about your configuration on that end would be very useful to help us try to reproduce the issue.

    b. Similarly, is there anything significant about your datadog_agent source config in vector? Any transforms?

  2. What versions of the DD Agent and Otel collector are you utilizing?

  3. Would you be able to share your Agent config?

  4. Just to be certain: you aren't seeing any errors along the path? I.e., this is a "silent" loss?

  5. This might be a shot in the dark, but did you by chance monitor any resource utilization (cpu/memory mainly) for vector during any of the experiments?

Thanks!

neuronull added the sink: datadog_metrics and source: datadog_agent labels on Aug 1, 2023
@seankhliao
Author

seankhliao commented Aug 1, 2023

I can share our config for the vector / otel collectors: https://github.com/seankhliao/testrepo0298
Note that for vector we use the same config but deploy separate instances for each pipeline.
(side note: our vector unit tests take 2 minutes to run; they're really slow for some reason).

  1. tbh I am not as familiar with the DD Agent -> Vector pipeline (it's a very recent addition; we were previously using an otel collector scraping kubelet) or the data it produces (system metrics), though my understanding was that the loss was observed as irregular (missing) data points when zoomed in to a series.

  2. current versions
    a. Vector 0.29.1
    b. DD Agent 7.43.1
    c. opentelemetry collector contrib 0.79.0

  3. I've included it in the repo as values we pass to the helm chart

  4. Given that our proxy's measurement point is after the vector datadog_metrics sink, and that the data points it finds and logs are the ones we expect to see, I am fairly certain the data has passed through our entire pipeline without getting dropped. Perhaps the diagram below explains it better?
    a. side note: when we did have in-pipeline data loss, we found vector's internal metrics were pretty useless in diagnosing the issue, and it was more observable from watching the retry queues in the opentelemetry collectors

  5. Sure; also included below are the resource requests / limits we run vector with. The proxy runs as a sidecar to vector without resource limits.
    a. side note: we previously also tried running vector on 60 core / 240GB machines; it did not help. Vector generally didn't go beyond utilizing 10 cores / 1GB memory.

requests:
  cpu: 4000m
  memory: 1Gi
limits:
  cpu: 5000m
  memory: 1Gi

[screenshot: 2023-08-01-213517]


Diagram of data paths:
[diagram: dd-2023-08-01-2150]

@neuronull
Contributor

Thanks for providing all those extra details, the diagram, and the repo; that's great.

though my understanding was that the loss was observed as irregular (missing) data points when zoomed in to a series.

This could be due to aggregation being done by either the Agent or Vector. I believe that Vector's aggregation algorithm should not be invoked if the Agent is part of the pipeline, since the Agent would already be aggregating these metrics. If we are doing that in Vector with the Agent upstream, that's probably a bug.
But if the totals are accurate, then the missing data points when zoomed in are likely expected due to aggregation. It would be a problem if there was a net reduction in the counts.

Given that our proxy's measurement point is after the vector datadog_metrics sink, and that the data points it finds and logs are the ones we expect to see, I am fairly certain the data has passed through our entire pipeline without getting dropped. Perhaps the diagram below explains it better?

Roger. Yeah, I was mostly wondering whether you are seeing Vector/Agent/Otel reporting errors, but it sounds like that is not the case (i.e. Vector "thinks" it is operating correctly).

side note: when we did have in-pipeline data loss, we found vector's internal metrics were pretty useless in diagnosing the issue, and it was more observable from watching the retry queues in the opentelemetry collectors

That's helpful feedback, thank you. By this, do you mean our internal metrics were not accurately reflecting the errors / dropped events?

Thanks for providing the resource details. I was mostly curious whether you observed vector with excessive CPU or memory usage, and that seems not to be the case.

@neuronull
Contributor

We have a canary application that emits a counter (and logs) at a constant rate,

Any chance you'd be able to share this? We'll likely want to repro this problem, and using a setup as close to yours as possible will increase our chances of getting the repro.

@seankhliao
Author

That's helpful feedback, thank you. By this, do you mean our internal metrics were not accurately reflecting the errors / dropped events?

Yes, we looked through every metric and none of them indicated anything was wrong even though our otel collectors were having some of their requests rejected.

canary code: https://github.com/seankhliao/testrepo0299

I'm not sure how easy it is to reproduce in isolation, since our vector instances are currently running with full production load with a lot of other metric data in the pipelines.

@neuronull
Contributor

Yes, we looked through every metric and none of them indicated anything was wrong even though our otel collectors were having some of their requests rejected.

Yikes, OK that's something we'll need to look into.

canary code: https://github.com/seankhliao/testrepo0299

❤️

I'm not sure how easy it is to reproduce in isolation, since our vector instances are currently running with full production load with a lot of other metric data in the pipelines.

Yeah, I was going to ask about that ... whether it mostly expresses in high-volume scenarios.

This is not the first time we are seeing reports similar to yours (though it is the first time the Otel Collector is in the mix). Prior efforts to reproduce the issue did not pan out, so it remains a bit of a mystery. But it's pretty clear there is something wrong. I'm hopeful that we'll be able to break ground with this new data point you have provided.

@jszwedko
Member

jszwedko commented Aug 2, 2023

Hi @seankhliao !

We'd like to take a look at the metrics directly in Datadog. If that's alright with you, would you mind opening a support case through Datadog so that we can track it?

@seankhliao
Author

Do I need to put any details in there, or can I just link this thread?

@jszwedko
Member

jszwedko commented Aug 2, 2023

Do I need to put any details in there, or can I just link this thread?

You can just link this thread there and we can follow up.

@seankhliao
Author

case 1288375

@seankhliao
Author

I have an example of what losing data from the datadog agent looks like.
If this is aggregation, it's not being done properly.

[screenshot: 2023-08-10-105313]

@lukesteensen
Member

To summarize our findings here, the primary issue seemed to be the configured removal of tags like pod.uid that would uniquely identify a series coming from a single running instance of Vector. The Datadog backend does not do aggregation of incoming values beyond what is done on the client side, so values that "conflict" (i.e. have the same series name, tag values, and timestamp) will be overwritten. The fix for this problem is to retain a differentiating tag such that data from one instance does not overwrite data from another.
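
For illustration only, a minimal sketch of what retaining a differentiating tag could look like with a remap transform placed ahead of the datadog_metrics sink; the component names, tag name, and POD_UID environment variable are hypothetical, not taken from the reporter's setup:

transforms:
  keep_pod_uid:                 # hypothetical component name
    type: remap
    inputs:
      - datadog_agent_source    # hypothetical upstream source/transform
    source: |
      # Ensure each series carries a tag that uniquely identifies the
      # emitting instance, so that series with the same name, tags, and
      # timestamp from different instances do not overwrite each other
      # in the Datadog backend.
      if !exists(.tags.pod_uid) {
        .tags.pod_uid = get_env_var!("POD_UID")
      }

The datadog_metrics sink's inputs would then point at this transform; any tag that is unique per emitting instance (pod uid, hostname, replica index) serves the same purpose.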

@seankhliao
Author

I would dispute that: from our email threads, even after adding back the pod uid tags, we still see ~10% loss (better than 50%, but still non-zero).

Our pure integration sources, which we pass through vector without stripping any tags, also see intermittent loss of data.

This also doesn't explain the loss in distribution metrics.

@jszwedko
Member

Hey!

I would dispute that: from our email threads, even after adding back the pod uid tags, we still see ~10% loss (better than 50%, but still non-zero).

Happy to follow up on the email thread. We had been waiting for additional response.

Our pure integration sources, which we pass through vector without stripping any tags, also see intermittent loss of data.

This would be new, I think. I realize it will likely be difficult, but what would really help here is an MRE (minimal reproducible example) showing the data loss with just Vector by itself that we could run to reproduce. One difficulty we've had with this issue is the number of moving parts.

This also doesn't explain the loss in distribution metrics.

I think we put this on the email thread, but we discovered during the investigation that distribution metrics have the same requirement as other metric types: data points (timestamp / value pairs) need to be unique going into the Datadog back-end, or they will overwrite each other.

@paulkirby-hotjar

Hey, so we've been dealing with a problem very similar to the decided root cause of this issue: we send distribution metrics but remove some tags from them, such that duplicate timestamp/value pairs are sent, which causes a loss of accuracy. It seems like Datadog is not handling this properly somehow, because I would have expected a duplicate submission to just be added as an additional data point to the distribution.

In any case, are there any suggestions on how to mitigate this? In our case, we are dropping tags based on an automated script that generates a "blocklist" for us based on what tags are actually used on the DD end, per metric. Our only alternative right now seems to be to exclude distribution metrics from this automation, which means higher costs for us.

Perhaps the aggregation component? I'm not sure if that would aggregate sketch metrics properly though, seeing as we aren't able to reference the sketch field from VRL...

@jszwedko
Member

Hi @paulkirby-hotjar !

I think using the aggregate transform would help with this situation (it should correctly aggregate sketches). Would you mind opening this as a GitHub Discussion? I think that'll make it easier for others that have the same question to find it along with the answer.
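
As a rough sketch only (component names here are hypothetical, and interval_ms is just an example value), routing metrics through the aggregate transform before the sink could look like this:

transforms:
  aggregate_metrics:            # hypothetical component name
    type: aggregate
    inputs:
      - drop_blocked_tags       # hypothetical transform that strips tags
    interval_ms: 10000          # flush aggregated metrics every 10 seconds

sinks:
  datadog_metrics:
    type: datadog_metrics
    default_api_key: "${DD_API_KEY}"
    inputs:
      - aggregate_metrics

The idea is that events which would otherwise collide (same series name, tags, and timestamp after tag removal) get combined inside Vector before they reach the Datadog backend.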

@paulkirby-hotjar

Let me give that a try and see if it works, and if it does then I will!
