Internal Metrics Hangs Forever #16561
Comments
Hi @smitthakkar96! Thanks for this report! Could you provide the configuration file you are using? That would be helpful in reproducing.
Here is the configuration; sorry I forgot to include it earlier. Please let me know if you need anything else :)
data_dir = "/vector-data-dir"

[api]
enabled = true
address = "127.0.0.1:8686"

type = "datadog_agent"
address = "[::]:8080"
multiple_outputs = false

type = "internal_logs"

type = "remap"
inputs = [ "internal_metrics" ]
source = """
.tags.foo = "bar"
.tags.foo1 = "bar1"
"""

type = "remap"
inputs = [ "datadog_agents" ]
source = """
.host = del(.hostname)
.ddtags = parse_key_value!(.ddtags, key_value_delimiter: ":",
field_delimiter: ",")
.ddtags.vector_aggregator = get_hostname!()
.ddtags.env = "staging"
# Re-encode Datadog tags as a string for the `datadog_logs` sink
.ddtags = encode_key_value(.ddtags, key_value_delimiter: ":",
field_delimiter: ",")
# Datadog Agents pass a "status" field that is stripped when ingested
del(.status)
"""

type = "datadog_logs"
inputs = [ "remap_logs_for_datadog", "internal_logs" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488

type = "datadog_metrics"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488

I've also added …
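The TOML table headers appear to have been dropped when this configuration was pasted, so here is a hedged reconstruction of the likely layout. The source and transform names are inferred from the `inputs` references above; the two sink table names (`dd_logs`, `dd_metrics`) are placeholders of my choosing, not the reporter's:

data_dir = "/vector-data-dir"

[api]
enabled = true
address = "127.0.0.1:8686"

# Logs received from Datadog Agents
[sources.datadog_agents]
type = "datadog_agent"
address = "[::]:8080"
multiple_outputs = false

# Vector's own logs and metrics
[sources.internal_logs]
type = "internal_logs"

[sources.internal_metrics]
type = "internal_metrics"

# Adds static tags to the internal metrics
[transforms.remap_enrich_internal_metrics_with_static_tags]
type = "remap"
inputs = [ "internal_metrics" ]
source = """
.tags.foo = "bar"
.tags.foo1 = "bar1"
"""

# Reshapes Agent logs for the Datadog Logs intake
[transforms.remap_logs_for_datadog]
type = "remap"
inputs = [ "datadog_agents" ]
source = """
.host = del(.hostname)
.ddtags = parse_key_value!(.ddtags, key_value_delimiter: ":", field_delimiter: ",")
.ddtags.vector_aggregator = get_hostname!()
.ddtags.env = "staging"
# Re-encode Datadog tags as a string for the `datadog_logs` sink
.ddtags = encode_key_value(.ddtags, key_value_delimiter: ":", field_delimiter: ",")
# Datadog Agents pass a "status" field that is stripped when ingested
del(.status)
"""

# Placeholder name: ships logs to Datadog, buffered on disk
[sinks.dd_logs]
type = "datadog_logs"
inputs = [ "remap_logs_for_datadog", "internal_logs" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488

# Placeholder name: ships internal metrics to Datadog, buffered on disk
[sinks.dd_metrics]
type = "datadog_metrics"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488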
Just checking the graphs again: during all these load tests there were stages where the buffers got full. It's expected that the buffers block incoming events, but after the buffer is flushed the other sinks start to behave normally again, while the datadog_metrics sink stops receiving metrics from internal_metrics.
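For context on the blocking behavior mentioned here: what a sink does when its buffer fills is configurable per sink, and blocking upstream components is the default. A minimal sketch, reusing the placeholder sink name from the reconstruction above:

# "block" (the default) applies backpressure upstream when the disk buffer
# is full; "drop_newest" sheds new events instead of blocking.
[sinks.dd_metrics]
type = "datadog_metrics"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488
buffer.when_full = "block"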
We did another load test today and observed that when the pods get saturated, the …
Interesting, thanks for the details and configuration! I wouldn't expect backpressure in the pipeline that is sending from the internal_metrics source.

The other behavior seems normal if Vector is not keeping up with the input: the buffer will grow. Here it seems to be bumping up against the CPU limits you set.

I'm also curious about the volumes you are backing the disk buffers with. You seem to be in AWS: are these EBS volumes? I'm wondering if they could be constraining the throughput.
@jszwedko thanks for your response. We have some more findings that I think might help bring some clarity. We deployed Vector and didn't run any load tests this time; instead we just let Vector run for a few hours with regular staging traffic. We saw a few weird things:
Yes, these EBS volumes are attached to the Vector Aggregator pods. Would it be helpful if I reproduce this in a kind cluster and send over the manifests? For the time being we have switched to the Prometheus exporter sink and added autodiscovery annotations on the pod so the Datadog Agent can scrape these metrics, but it would be good if we could send these metrics directly.
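A hedged sketch of the workaround described here, with assumed component names and address: internal metrics are exposed on a Prometheus scrape endpoint instead of being pushed through the datadog_metrics sink, and the Datadog Agent scrapes them via its autodiscovery annotations:

[sources.internal_metrics]
type = "internal_metrics"

# Serves metrics at http://<pod-ip>:9598/metrics for the Agent's
# Prometheus/OpenMetrics check to scrape (the port is an assumption).
[sinks.prometheus]
type = "prometheus_exporter"
inputs = [ "internal_metrics" ]
address = "0.0.0.0:9598"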
Yesterday there was an outage at Datadog, and we were curious how Vector behaved, so we opened the dashboard. As you can see, at one point when the buffer got full, the throughput went to …
Interesting, thanks for sharing that @smitthakkar96. One guess I have is that, though throughput recovered, the sink still isn't sending events fast enough to drain the buffer. You could verify that by comparing the …
@z2665 do you see anything interesting in the log output? One thought would be that the buffer reader got stuck. Also, which version of Vector? |
The logs are normal; there's no output of any value. This issue has occurred several times in our production environment with versions 0.23 and 0.28, but the frequency is very low.
Unfortunately, we observed the same problem after upgrading Vector to 0.31. This issue is not resolved.
Hi @smitthakkar96, the issue originally reported here should be resolved in the new v0.34.0 release. There was a performance bottleneck in the Datadog Metrics sink that has now been fixed. Can you try it out and let us know if it resolves your issue?
Problem
Hello there,
We (Delivery Hero) are in the process of adopting Vector. We deployed Vector in our staging environment and started load testing it. We noticed that internal_metrics stops updating when the pods get saturated. We use the datadog_metrics sink to send internal_metrics to Datadog. Running vector top shows that internal_metrics is no longer flushing metrics, whereas other sinks, like the Datadog Logs and S3 sinks we use to ship Kubernetes logs to Datadog and S3, keep working fine. The impact is a lack of internal_metrics until Vector is restarted.

Attached is a screenshot of our monitoring dashboard. As you can see, logs from Vector keep coming in, but the panels that use internal metrics are blank.

Possible duplicate of #13718, however I am not sure.
How did you generate the load?

We ran the load test on an EKS cluster running c6g.xlarge nodes, with 3 Vector pods each having CPU requests and limits of 2000m. To generate load, we deployed another 120-150 pods running Vector with the demo_logs source and the stdout (console) sink. These demo logs were tailed by the Datadog Agent and then forwarded to Vector.
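A hedged sketch of what each load-generator pod's Vector config might look like, based on this description; the format, interval, and component names are assumptions, and Vector's sink for writing to stdout is the console sink:

# Synthetic logs written to stdout, where the Datadog Agent tails them and
# forwards them to the Vector aggregator.
[sources.load]
type = "demo_logs"
format = "json"
interval = 0.01   # assumed rate; a lower interval means more events per second

[sinks.stdout]
type = "console"
inputs = [ "load" ]
target = "stdout"
encoding.codec = "json"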
Configuration
No response
Version
0.27.0
Debug Output
No response
Example Data
No response
Additional Context
No response
References
No response