data loss submitting metrics to datadog #18123
Comments
Hi @seankhliao, thanks for this report and for including many data points, very helpful. I have a couple questions:
Thanks!
I can share our config for vector / otel collectors https://github.com/seankhliao/testrepo0298
```yaml
requests:
  cpu: 4000m
  memory: 1Gi
limits:
  cpu: 5000m
  memory: 1Gi
```
Thanks for providing all those extra details, the diagram, and the repo; that's great.
This could be due to aggregation being done by either the Agent or Vector. I believe that Vector's aggregation algorithm should not be invoked if the Agent is part of the pipeline, since the Agent would already be aggregating these metrics. If we are doing that in Vector with the Agent upstream, that's probably a bug.
Roger. Yeah, I was mostly wondering if you are seeing Vector/Agent/Otel reporting errors, but it sounds like that is not the case (e.g. Vector "thinks" it is operating correctly).
That's helpful feedback, thank you. By this do you mean our internal metrics were not accurately reflecting the errors / dropped events? Thanks for providing the resource details; I was mostly curious whether you observed vector with excessive CPU or memory usage, and that seems not to be the case.
Any chance you'd be able to share this? We'll likely want to repro this problem, and using a setup as close to yours as possible will increase our chances of getting the repro.
Yes, we looked through every metric and none of them indicated anything was wrong, even though our otel collectors were having some of their requests rejected.

Canary code: https://github.com/seankhliao/testrepo0299

I'm not sure how easy it is to reproduce in isolation, since our vector instances are currently running with full production load and a lot of other metric data in the pipelines.
Yikes, OK that's something we'll need to look into.
❤️
Was going to ask about that, yeah... whether it mostly expresses in high-volume scenarios. This is not the first time we are seeing reports similar to yours (though it is the first time the Otel Collector is in the mix). Prior efforts to reproduce the issue did not pan out, so it remains a bit of a mystery. But it's pretty clear there is something wrong. I'm hopeful that we'll be able to break ground with this new data point you have provided.
Hi @seankhliao! We'd like to take a look at the metrics directly in Datadog. If that's alright with you, would you mind opening a support case through Datadog so that we can track it?
Do I need to put any details in there or can I just link this thread?
You can just link this thread there and we can follow up.
Case 1288375
To summarize our findings here, the primary issue seemed to be the configured removal of tags like the pod UID, which left duplicate timestamp/value data points that overwrite each other in the Datadog backend.
I would dispute that: from our email threads, even after adding back the pod uid tags, we still see ~10% loss (better than 50%, but still non-zero). Our pure integration sources, which we pass through vector without stripping any tags, also see intermittent loss of data. This also doesn't explain the loss in distribution metrics.
Hey!
Happy to follow up on the email thread. We had been waiting for an additional response.
This would be new, I think. I realize it will likely be difficult, but what would really help here is an MRE showing the data loss with just Vector by itself that we could run to reproduce. One difficulty we've had with this issue is the number of moving parts.
I think we put this on the email thread, but we discovered during the investigation that distribution metrics have the same requirement as other metric types: data points (timestamp/value pairs) need to be unique going into the Datadog back-end or they will overwrite each other.
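As a rough, self-contained illustration of that uniqueness requirement (the metric name, tags, timestamp, and keying scheme below are made up for the example; this sketches the observed behaviour, not Datadog's actual implementation), here is what happens when a tag is stripped and two points collapse onto the same identity:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// point is one submitted data point. The identity Datadog appears to key on is
// (metric name, tag set, timestamp).
type point struct {
	metric string
	tags   []string
	ts     int64
	value  float64
}

// key builds that identity with the tags sorted for a stable comparison.
func key(metric string, tags []string, ts int64) string {
	sorted := append([]string(nil), tags...)
	sort.Strings(sorted)
	return fmt.Sprintf("%s|%s|%d", metric, strings.Join(sorted, ","), ts)
}

func main() {
	// Two canary pods report the same counter for the same second.
	points := []point{
		{"canary.requests", []string{"env:prod", "pod_uid:aaa"}, 1700000000, 5},
		{"canary.requests", []string{"env:prod", "pod_uid:bbb"}, 1700000000, 7},
	}
	const dropPrefix = "pod_uid:" // tag stripped by a relabelling rule

	overwrite := map[string]float64{} // duplicates overwrite: data is lost
	merged := map[string]float64{}    // what an aggregation step would need to do

	for _, p := range points {
		var kept []string
		for _, t := range p.tags {
			if !strings.HasPrefix(t, dropPrefix) {
				kept = append(kept, t)
			}
		}
		k := key(p.metric, kept, p.ts)
		overwrite[k] = p.value // the second point silently replaces the first
		merged[k] += p.value   // summing preserves the counter's total
	}

	fmt.Println("last write wins:", overwrite) // 7: the 5 requests are gone
	fmt.Println("merged:         ", merged)    // 12: the expected total
}
```

For plain counters, summing colliding points (the `merged` map above) would preserve the totals; for distribution metrics the colliding sketches themselves would need to be merged rather than summed as scalars.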
Hey, so we've been dealing with a very similar problem to the decided root cause of this issue: we send distribution metrics but we remove some tags from them, such that duplicate timestamp/values are sent, which causes a loss of accuracy. This seems like Datadog is not handling this properly somehow, because I would have expected that a duplicate submission would just be added as an additional data point to the distribution.

In any case, do we have any suggestions on how to mitigate this? In our case, we are dropping tags based on an automated script that generates a "blocklist" for us based on what tags are actually used on the DD end, per metric. Our only alternative right now seems to be to prevent distribution metrics from being analyzed by this automation, which means higher costs for us. Perhaps the aggregation component? I'm not sure if that would aggregate sketch metrics properly though, seeing as we aren't able to reference the sketch field from VRL...
Hi @paulkirby-hotjar! I think using the aggregation component you mentioned is worth a try here; if it works for you, please let us know.
Let me give that a try and see if it works, and if it does then I will! |
A note for the community
Problem
We have a mildly complex set of metrics collection pipelines.
For months we've been observing that sometimes our metric data would have lower-than-expected values when graphed in datadog. Teams reported metric values being lower than with our previous vendor, two counters incremented at the same place showing different values, etc.
We have a canary application that emits a counter (and logs) at a constant rate, and for this counter we also see lower than expected values. Sometimes it's an unstable minor loss, but often we would see a consistently lower value for extended periods of time (hours ~ days).
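Roughly, the canary is just a fixed-rate loop. The sketch below is simplified, and `emitCounter` is a hypothetical stand-in for the real OTLP/Prometheus instrumentation:

```go
package main

import (
	"log"
	"time"
)

// emitCounter is a hypothetical stand-in for the real instrumentation; in the
// actual canary the increment goes out over the OTLP pipeline and is also
// exposed for Prometheus scraping.
func emitCounter(name string) {
	_ = name // e.g. an OTLP counter Add(1) and a Prometheus counter Inc()
}

func main() {
	// Emit at a fixed, known rate so the expected totals in Datadog can be
	// computed exactly and any loss downstream is immediately visible.
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		emitCounter("polaris.o11y.canary.otlp")
		log.Println("canary tick") // log line backing the log-derived metric
	}
}
```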
A restart of vector occasionally recovers from or triggers the issue, but it isn't a necessary event; we've observed sudden extended drops with no change to our deployments.
We previously ruled out an error within the application and opentelemetry collector components of our pipelines: direct submission of counters from both places produced the expected values, even as we observed the same metric going through vector dropping points. We also counted events within vector using `vector tap`, and the data points were all present.

We recently added a proxy written in Go that deserializes the metrics submission requests sent by vector and 1) resubmits the metrics under a new name to our metrics pipeline, 2) logs the value (which still ends up in datadog), and 3) logs the response status. We observe a consistent 202 Accepted response from datadog.
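A stripped-down sketch of the status-logging part of that proxy (the listen address and upstream URL are placeholders, and the payload deserialization / resubmission under new names is omitted):

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Placeholder upstream; vector's datadog_metrics traffic is pointed at
	// this proxy instead of going to Datadog directly.
	upstream, err := url.Parse("https://api.datadoghq.com")
	if err != nil {
		log.Fatal(err)
	}

	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// Rewrite the Host header so the forwarded request is accepted upstream.
	director := proxy.Director
	proxy.Director = func(req *http.Request) {
		director(req)
		req.Host = upstream.Host
	}

	// Log the status Datadog returns for every forwarded submission; this is
	// where the consistent "202 Accepted" observation comes from.
	proxy.ModifyResponse = func(resp *http.Response) error {
		log.Printf("%s %s -> %s", resp.Request.Method, resp.Request.URL.Path, resp.Status)
		return nil
	}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```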
Below I've included a selection of graphs of the data we have, with an explanation of the metric names:

- `polaris.o11y.canary.logs`: the canary application emitting logs
- `polaris.o11y.canary.otlp`: the canary application emitting a counter through our OTLP pipeline
- `polaris.o11y.canary.prom.total`: the canary exposing the metric over prometheus exposition
- `ddproxy.canary.otlp`: the proxy extracting the `polaris.o11y.canary.otlp` counter from vector's submission request and resubmitting it as a histogram over our OTLP pipeline
- `ddproxy.canary.prom`: the proxy extracting the `polaris.o11y.canary.prom.total` counter from vector's submission request and resubmitting it as a histogram over our OTLP pipeline
- `ddproxy.datadog_logged.otlp`: the same extracted value logged and generated as a custom metric in datadog
- `ddproxy.datadog_logged.prom`: the same extracted value logged and generated as a custom metric in datadog

Notes: the original counters and the resubmitted histograms are submitted via different endpoints (`/api/v1/series` vs `/api/beta/sketches`).

Example 1 week view of our canaries:

Dark red is our original canary metric; resubmitted (orange) it has a much higher value.
Sometimes our resubmitted metric is lower
Configuration
Version
0.29.1-distroless-libc
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response