data loss submitting metrics to datadog #18123

Closed
seankhliao opened this issue Jul 31, 2023 · 17 comments
Labels: sink: datadog_metrics (Anything `datadog_metrics` sink related), source: datadog_agent (Anything `datadog_agent` source related), type: bug (A code related bug)

Comments

@seankhliao

seankhliao commented Jul 31, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

We have a mildly complex set of metrics collection pipelines:

  • app -otlp-> opentelemetry collector -datadog protocol-> vector -> datadog
  • app -prometheus-> opentelemetry collector -datadog protocol-> vector -> datadog
  • datadog agent -datadog protocol-> vector -> datadog

For months we've been observing that our metric data would sometimes show lower than expected values when graphed in datadog. Teams reported metric values being lower compared to our previous vendor, two counters incremented at the same place showing different values, etc.

We have a canary application that emits a counter (and logs) at a constant rate, and for this counter we also see lower than expected values. Sometimes it's an unstable minor loss, but often we would see a consistently lower value for extended periods of time (hours to days).
A restart of vector occasionally recovers or triggers the issue, but it's not a necessary event; we've observed sudden extended drops with no change to our deployments.

We previously ruled out an error within the application and opentelemetry collector components of our pipelines: direct submission of counters from both places produced the expected values, even as we observed the same metric going through vector dropping points. We also counted events within vector using vector tap, and the data points were all present.

We recently added a proxy written in go that deserializes the metrics submission requests sent by vector,
and 1) resubmits the metrics under a new name to our metrics pipeline, 2) logs the value (which still ends up in datadog), and 3) logs the response status.
We observe a consistent 202 Accepted response from datadog.

Below I've included a selection of graphs of the data we have. An explanation of the metric names:

  • polaris.o11y.canary.logs: the canary application emitting logs
  • polaris.o11y.canary.otlp: the canary application emitting a counter through our OTLP pipeline
  • polaris.o11y.canary.prom.total: the canary exposing the metric over prometheus exposition
  • ddproxy.canary.otlp: the proxy extracting the polaris.o11y.canary.otlp counter from vector's submission request and resubmitting it as a histogram over our OTLP pipeline
  • ddproxy.canary.prom: the proxy extracting the polaris.o11y.canary.prom.total counter from vector's submission request and resubmitting it as a histogram over our OTLP pipeline
  • ddproxy.datadog_logged.otlp: the same extracted value logged and generated as a custom metric in datadog
  • ddproxy.datadog_logged.prom: the same extracted value logged and generated as a custom metric in datadog

Notes:

  • we chose to resubmit as distributions to take a different path (/api/v1/series vs /api/beta/sketches)

Example 1 week view of our canaries:
[screenshot: 2023-07-31-202137]

Dark red is our original canary metric; resubmitted as orange, it has a much higher value:
[screenshot: 2023-07-31-201846]

Sometimes our resubmitted metric is lower:
[screenshot: 2023-07-31-201752]

Configuration

sinks:
  # https://vector.dev/docs/reference/configuration/sinks/datadog_metrics/
  # Push internal_metrics for Vector agent to Datadog
  # Known issue: https://github.com/vectordotdev/vector/issues/10870
  datadog_metrics:
    type: datadog_metrics
    default_api_key: "${DD_API_KEY}"
    site: "${DD_SITE}"
    endpoint: "http://localhost:8090"
    inputs:
      - send_to_datadog.enabled

    buffer:
      max_events: 8000
      when_full: drop_newest

    request:
      # value chosen from observing production instances
      # increase pods count to scale
      # disables adaptive concurrency
      concurrency: 200

    batch:
      # if we don't limit this,
      # it appears that Datadog will just close the connection on too large payloads
      # Datadog payload limits are 5242880 bytes raw, 512000 compressed,
      # but vector calculates sizes before serialization and compression
      # https://docs.datadoghq.com/api/latest/metrics/#submit-metrics
      max_events: 4000

Version

0.29.1-distroless-libc

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

seankhliao added the type: bug label on Jul 31, 2023
@neuronull
Contributor

Hi @seankhliao, thanks for this report and for including so many data points; very helpful. I have a couple of questions:

  1. Of the 3 pipeline scenarios, does the last one (just DD Agent -> Vector -> Datadog) express the data loss in the same way as the other two? I just want to boil it down to a minimal setup that reproduces the issue, and understand whether there is anything specific about having the otel collector in the mix (because that scenario hasn't been thoroughly tested by us).

    a. If the answer is that otel does have an impact, any details you could share about your configuration on that end would be very useful to help us try to reproduce the issue.

    b. Similarly, is there anything significant about your datadog_agent source config in vector? Any transforms?

  2. What versions of the DD Agent and Otel collector are you utilizing?

  3. Would you be able to share your Agent config?

  4. Just to be certain: you aren't seeing any errors along the path? I.e., this is a "silent" loss?

  5. This might be a shot in the dark, but did you by chance monitor any resource utilization (cpu/memory mainly) for vector during any of the experiments?

Thanks!

neuronull added the sink: datadog_metrics and source: datadog_agent labels on Aug 1, 2023
@seankhliao
Author

seankhliao commented Aug 1, 2023

I can share our config for the vector / otel collectors: https://github.com/seankhliao/testrepo0298
Note that for vector we use the same config but deploy separate instances for each pipeline.
(side note: our vector unit tests take 2 minutes to run; they're really slow for some reason).

  1. tbh I am not as familiar with the DD Agent -> Vector pipeline (it's a very recent addition; we were previously using an otel collector scraping kubelet) or the data it produces (system metrics), though my understanding was that the loss was observed as irregular (missing) data points when zoomed in to a series.

  2. current versions
    a. Vector 0.29.1
    b. DD Agent 7.43.1
    c. opentelemetry collector contrib 0.79.0

  3. I've included it in the repo as values we pass to the helm chart

  4. Given that our proxy's measurement point is after the vector datadog_metrics sink, and that the data points it finds and logs are the ones we expect to see, I am fairly certain the data has passed through our entire pipeline without getting dropped. Perhaps the diagram below explains it better?
    a. side note: when we did have in-pipeline data loss, we found vector's internal metrics were pretty useless in diagnosing the issue, and it was more observable from watching the retry queues in the opentelemetry collectors

  5. Sure; also included below are the resource requests / limits we run vector with. The proxy runs as a sidecar to vector without resource limits.
    a. side note: we previously also tried running vector on 60 core / 240GB machines; it did not help. Vector generally didn't go beyond utilizing 10 cores / 1GB memory.

requests:
  cpu: 4000m
  memory: 1Gi
limits:
  cpu: 5000m
  memory: 1Gi

[screenshot: 2023-08-01-213517]


Diagram of data paths:
[diagram: dd-2023-08-01-2150]

@neuronull
Contributor

Thanks for providing all those extra details, the diagram, and the repo; that's great.

though my understanding was that the loss was observed as irregular (missing) data points when zoomed in to a series.

This could be due to aggregation being done by either the Agent or Vector. I believe that Vector's aggregation algorithm should not be invoked if the Agent is part of the pipeline, since the Agent would already be aggregating these metrics. If we are doing that in Vector with the Agent upstream, that's probably a bug.
But if the totals are accurate, then the missing data points when zoomed in are likely expected due to aggregation. It would be a problem if there was a net reduction in the counts.

Given that our proxy's measurement point is after the vector datadog_metrics sink, and that the data points it finds and logs are the ones we expect to see, I am fairly certain the data has passed through our entire pipeline without getting dropped. Perhaps the diagram below explains it better?

Roger. Yeah, I was mostly wondering whether you are seeing Vector/Agent/Otel reporting errors, but it sounds like that is not the case (i.e. Vector "thinks" it is operating correctly).

side note: when we did have in-pipeline data loss, we found vector's internal metrics were pretty useless in diagnosing the issue, and it was more observable from watching the retry queues in the opentelemetry collectors

That's helpful feedback, thank you. By this, do you mean our internal metrics were not accurately reflecting the errors / dropped events?

Thanks for providing the resource details. I was mostly curious whether you observed vector with excessive CPU or memory usage, and that seems not to be the case.

@neuronull
Contributor

We have a canary application that emits a counter (and logs) at a constant rate,

Any chance you'd be able to share this? We'll likely want to repro this problem, and using a setup as close to yours as possible will increase our chances of getting the repro.

@seankhliao
Author

That's helpful feedback, thank you. By this, do you mean our internal metrics were not accurately reflecting the errors / dropped events?

Yes, we looked through every metric and none of them indicated anything was wrong even though our otel collectors were having some of their requests rejected.

canary code: https://github.com/seankhliao/testrepo0299

I'm not sure how easy it is to reproduce in isolation, since our vector instances are currently running with full production load with a lot of other metric data in the pipelines.

@neuronull
Contributor

Yes, we looked through every metric and none of them indicated anything was wrong even though our otel collectors were having some of their requests rejected.

Yikes, OK that's something we'll need to look into.

canary code: https://github.com/seankhliao/testrepo0299

❤️

I'm not sure how easy it is to reproduce in isolation, since our vector instances are currently running with full production load with a lot of other metric data in the pipelines.

Yeah, I was going to ask about that ... whether it mostly expresses in high-volume scenarios.

This is not the first time we are seeing reports similar to yours (though it is the first time the Otel Collector is in the mix). Prior efforts to reproduce the issue did not pan out, so it remains a bit of a mystery. But it's pretty clear there is something wrong. I'm hopeful that we'll be able to break ground with this new data point you have provided.

@jszwedko
Member

jszwedko commented Aug 2, 2023

Hi @seankhliao !

We'd like to take a look at the metrics directly in Datadog. If that's alright with you, would you mind opening a support case through Datadog so that we can track it?

@seankhliao
Author

Do I need to put any details in there, or can I just link this thread?

@jszwedko
Member

jszwedko commented Aug 2, 2023

Do I need to put any details in there, or can I just link this thread?

You can just link this thread there and we can follow up.

@seankhliao
Author

case 1288375

@seankhliao
Author

I have an example of what losing data from the datadog agent looks like.
If this is aggregation, it's not being done properly.

[screenshot: 2023-08-10-105313]

@lukesteensen
Member

To summarize our findings here, the primary issue seemed to be the configured removal of tags like pod.uid that would uniquely identify a series coming from a single running instance of Vector. The Datadog backend does not do aggregation of incoming values beyond what is done on the client side, so values that "conflict" (i.e. have the same series name, tag values, and timestamp) will be overwritten. The fix for this problem is to retain a differentiating tag such that data from one instance does not overwrite data from another.
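
For illustration only, a minimal sketch of what retaining a differentiating tag could look like with a remap transform placed ahead of the datadog_metrics sink; the component names, tag name, and POD_UID environment variable are hypothetical, not taken from the reporter's setup:

transforms:
  keep_pod_uid:                 # hypothetical component name
    type: remap
    inputs:
      - datadog_agent_source    # hypothetical upstream source/transform
    source: |
      # Ensure each series carries a tag that uniquely identifies the
      # emitting instance, so that series with the same name, tags, and
      # timestamp from different instances do not overwrite each other
      # in the Datadog backend.
      if !exists(.tags.pod_uid) {
        .tags.pod_uid = get_env_var!("POD_UID")
      }

The datadog_metrics sink's inputs would then point at this transform; any tag that is unique per emitting instance (pod uid, hostname, replica index) serves the same purpose.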

@seankhliao
Author

I would dispute that: from our email threads, even after adding back the pod uid tags, we still see ~10% loss (better than 50%, but still non-zero).

Our pure integration sources, which we pass through vector without stripping any tags, also see intermittent loss of data.

This also doesn't explain the loss in distribution metrics.

@jszwedko
Member

Hey!

I would dispute that: from our email threads, even after adding back the pod uid tags, we still see ~10% loss (better than 50%, but still non-zero).

Happy to follow up on the email thread. We had been waiting for additional response.

Our pure integration sources, which we pass through vector without stripping any tags, also see intermittent loss of data.

This would be new, I think. I realize it will likely be difficult, but what would really help here is an MRE (minimal reproducible example) showing the data loss with just Vector by itself that we could run to reproduce. One difficulty we've had with this issue is the number of moving parts.

This also doesn't explain the loss in distribution metrics.

I think we put this on the email thread, but we discovered during the investigation that distribution metrics have the same requirement as other metric types: data points (timestamp / value pairs) need to be unique going into the Datadog back-end, or they will overwrite each other.

@paulkirby-hotjar

Hey, so we've been dealing with a problem very similar to the decided root cause of this issue: we send distribution metrics but remove some tags from them, such that duplicate timestamp/value pairs are sent, which causes a loss of accuracy. It seems like Datadog is not handling this properly somehow, because I would have expected a duplicate submission to just be added as an additional data point to the distribution.

In any case, are there any suggestions on how to mitigate this? In our case, we are dropping tags based on an automated script that generates a "blocklist" for us based on what tags are actually used on the DD end, per metric. Our only alternative right now seems to be to exclude distribution metrics from this automation, which means higher costs for us.

Perhaps the aggregation component? I'm not sure if that would aggregate sketch metrics properly though, seeing as we aren't able to reference the sketch field from VRL...

@jszwedko
Member

Hi @paulkirby-hotjar !

I think using the aggregate transform would help with this situation (it should correctly aggregate sketches). Would you mind opening this as a GitHub Discussion? I think that'll make it easier for others that have the same question to find it along with the answer.
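
As a rough sketch only (component names here are hypothetical, and interval_ms is just an example value), routing metrics through the aggregate transform before the sink could look like this:

transforms:
  aggregate_metrics:            # hypothetical component name
    type: aggregate
    inputs:
      - drop_blocked_tags       # hypothetical transform that strips tags
    interval_ms: 10000          # flush aggregated metrics every 10 seconds

sinks:
  datadog_metrics:
    type: datadog_metrics
    default_api_key: "${DD_API_KEY}"
    inputs:
      - aggregate_metrics

The idea is that events which would otherwise collide (same series name, tags, and timestamp after tag removal) get combined inside Vector before they reach the Datadog backend.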

@paulkirby-hotjar

Let me give that a try and see if it works, and if it does then I will!
