Internal Metrics Hangs Forever #16561

Open
smitthakkar96 opened this issue Feb 23, 2023 · 14 comments
Labels
source: internal_metrics (Anything `internal_metrics` source related) · type: bug (A code related bug)

Comments

@smitthakkar96

smitthakkar96 commented Feb 23, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Hello there,

We (Delivery Hero) are in the process of adopting Vector. We deployed Vector in our staging environment and started load testing it. We noticed that internal_metrics stops updating when the pods get saturated. We use the datadog_metrics sink to send internal_metrics to Datadog. Running vector top shows that the internal metrics stop flushing, whereas the other sinks, such as the Datadog Logs and S3 sinks we use to ship Kubernetes logs, keep working fine. The impact is a lack of internal_metrics until Vector is restarted. Attached is a screenshot of our monitoring dashboard: logs from Vector keep coming, but the panels that use internal metrics are blank.

image

Possibly a duplicate of #13718, but I am not sure.

How did you generate the load?

We ran the load test on an EKS cluster of c6g.xlarge nodes with 3 Vector pods, each with CPU requests and limits of 2000m. To generate load, we deployed another 120-150 pods running a Vector demo_logs source with a console (stdout) sink. These demo logs were tailed by the Datadog Agent and forwarded to Vector.
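For reference, the load-generator pods ran a configuration roughly like the sketch below; the interval, format, and component names are illustrative assumptions, not the exact values from our test.

# demo_logs emits synthetic log lines at a fixed interval
[sources.demo_logs]
type = "demo_logs"
format = "json"
interval = 0.01

# the console sink writes the generated lines to stdout, where the
# Datadog Agent tails them and forwards them to the Vector aggregator
[sinks.stdout]
type = "console"
inputs = [ "demo_logs" ]
encoding.codec = "json"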

Configuration

No response

Version

0.27.0

Debug Output

No response

Example Data

No response

Additional Context

No response

References

No response

@smitthakkar96 smitthakkar96 added the type: bug A code related bug. label Feb 23, 2023
@jszwedko
Member

Hi @smitthakkar96 !

Thanks for this report! Could you provide the configuration file you are using? That would be helpful in reproducing.

@smitthakkar96
Author

smitthakkar96 commented Feb 23, 2023

Here is the configuration; sorry, I forgot to include it earlier. Please let me know if you need anything else :)

vector.toml

data_dir = "/vector-data-dir"

[api]
enabled = true
address = "127.0.0.1:8686"

sources/datadog_agents.toml

type = "datadog_agent"
address = "[::]:8080"
multiple_outputs = false

sources/internal_logs.toml

type = "internal_logs"

transforms/remap_enrich_internal_metrics_with_static_tags.toml

type = "remap"
inputs = [ "internal_metrics" ]
source = """
.tags.foo = "bar"
.tags.foo1 = "bar1"
"""

transforms/remap_logs_for_datadog.toml

type = "remap"

inputs = [ "datadog_agents" ]

source = """

.host = del(.hostname)

.ddtags = parse_key_value!(.ddtags, key_value_delimiter: ":",
field_delimiter: ",")

.ddtags.vector_aggregator = get_hostname!()

.ddtags.env = "staging"

# Re-encode Datadog tags as a string for the `datadog_logs` sink

.ddtags = encode_key_value(.ddtags, key_value_delimiter: ":",
field_delimiter: ",")

# Datadog Agents pass a "status" field that is stripped when ingested

del(.status)
"""

sinks/datadog_logs.toml

type = "datadog_logs"
inputs = [ "remap_logs_for_datadog", "internal_logs" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488

sinks/datadog_metrics.toml

type = "datadog_metrics"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"
buffer.type = "disk"
buffer.max_size = 268435488

I've also added a "How did you generate the load?" section to the original description.

@smitthakkar96
Author

Screenshot 2023-02-23 at 22 07 34

Checking the graphs again: during all these load tests there were stages where the buffers got full. It's expected that a full buffer blocks incoming events, but once the buffer is flushed the other sinks go back to behaving normally, whereas the datadog_metrics sink stops receiving metrics from the internal_metrics source. I will do another load test tomorrow to verify whether this happens exclusively when the buffer gets full.

@smitthakkar96
Author

We did another load test today and observed that when the pods get saturated, internal_metrics keep coming until the buffer is full. Once we stop the load test and the buffer has free space again, we would expect it to accept new events and internal_metrics to keep flowing.

@smitthakkar96
Author

Screenshot 2023-02-27 at 10 16 36

I also noticed today that, after restarting Vector, our buffer byte size keeps growing.

@jszwedko
Member

jszwedko commented Mar 1, 2023

Interesting, thanks for the details and configuration! I wouldn't expect backpressure in the pipeline that is sending from the datadog_agent source to the datadog_logs sink to affect your pipeline that is sending internal_metrics to datadog_metrics since they are completely separate pipelines. My one guess is that that pipeline is somehow being starved, but I'm not sure how that could happen.

The other behavior seems normal if Vector is not keeping up with the input: that the buffer will grow. Here it seems to be bumping up against the CPU limits you set. I'm curious also about the volumes you are backing the disk buffers with. You seem to be in AWS: are these EBS volumes? I'm wondering if they could be constraining the throughput.
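One way to check the starvation guess would be to run the metrics pipeline in isolation for a while. A minimal sketch based on your configuration (same API key and site placeholders); if this isolated pipeline stays healthy under the same load, that would point at contention inside the shared process rather than the configuration itself:

# isolated test: only internal_metrics -> datadog_metrics, no log pipeline
[sources.internal_metrics]
type = "internal_metrics"

[sinks.datadog_metrics]
type = "datadog_metrics"
inputs = [ "internal_metrics" ]
default_api_key = "${DATADOG_API_KEY}"
site = "datadoghq.eu"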

@smitthakkar96
Author

smitthakkar96 commented Mar 1, 2023

@jszwedko thanks for your response. We have some more findings that I think might help bring some clarity.

Screenshot 2023-03-01 at 16 30 13

Screenshot 2023-03-01 at 16 32 38

We deployed Vector and didn't run any load tests this time; instead we just let Vector run for a few hours with regular staging traffic. We saw a few weird things:

  • Incoming is much higher than outgoing, and memory keeps growing despite the disk buffers. (Is it possible there is a memory leak? Is it expected that 2 CPUs under normal load aren't able to flush all internal_metrics to Datadog?)
  • After the disk buffer gets full (because incoming bytes are higher than outgoing), no new metric from Vector is sent. From what I understood, the events already in the buffer should keep flushing; only new events should be blocked, and only while the buffer is full. In our case the buffer never drains after it gets full (checked with vector top as well; there is no update, it just shows --/s).

I'm also curious about the volumes you are backing the disk buffers with. You seem to be in AWS: are these EBS volumes? Could they be constraining the throughput?

Yes, these EBS volumes are attached to Vector Aggregator pods.

Would it be helpful if I reproduced this in a kind cluster and sent over the manifests?

For the time being we have switched to the prometheus_exporter sink and added autodiscovery annotations on the pod so the Datadog Agent can scrape these metrics, but it would be good if we could send them to Datadog directly.
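For reference, the workaround sink looks roughly like this (a sketch; the listen address is an assumption, and the Datadog autodiscovery annotations live in the pod spec, not shown here):

sinks/prometheus_exporter.toml

type = "prometheus_exporter"
inputs = [ "remap_enrich_internal_metrics_with_static_tags" ]
# the Datadog Agent scrapes this endpoint via autodiscovery annotations
address = "0.0.0.0:9598"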

@smitthakkar96
Author

Yesterday there was an outage on Datadog

Screenshot 2023-03-09 at 09 45 15

We were curious how Vector behaved, so we opened the dashboard. As you can see, at one point when the buffer got full the throughput went to 0, and when Datadog recovered the throughput increased again. Checking the logs, we can see that logs are also being sent just fine, so everything works as expected. However, the buffer byte size never went down, and as you can see in the screenshot below, CPU and memory usage also never went down. I would like to know whether this is a telemetry issue or a problem in Vector itself.

Screenshot 2023-03-09 at 09 48 36

@jszwedko
Member

jszwedko commented Mar 9, 2023

Interesting, thanks for sharing that @smitthakkar96 .

One guess I have is that, though throughput recovered, the sink still isn't sending events fast enough to drain the buffer. You could verify that by comparing the component_sent_events_total metric for the sink with buffer_received_events_total to see if the sent events are exceeding the incoming events.
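If those metrics are scraped into Prometheus, a comparison along these lines should show whether the sink drains faster than the buffer fills (a sketch; the component_type label values follow the configuration above):

# events leaving the sink per minute
increase(vector_component_sent_events_total{component_type="datadog_metrics"}[1m])
# events entering the sink's buffer per minute
increase(vector_buffer_received_events_total{component_type="datadog_metrics"}[1m])

If the second rate consistently exceeds the first, the buffer can only grow.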

@z2665

z2665 commented Jul 10, 2023

We have what appears to be the same problem in our Vector agent, with kubernetes_logs as the source and loki as the sink.
The loki sink uses a 1 GiB disk buffer.
The symptom is that the Vector agent uses all the CPU of the pod limit, but the backend doesn't receive any logs.
image
At the same time we can see that the event buffer fills up quickly and stops falling:

vector_buffer_byte_size{component_type="loki"}/vector_buffer_max_byte_size{component_type="loki"} 

image
Both component_sent_events_total and buffer_received_events_total drop to 0:

increase(vector_buffer_received_events_total{component_type="loki"}[1m])

image

increase(vector_component_sent_events_total{component_type="kubernetes_logs"}[1m])

image

This state persists until we restart the Vector agent.
We have verified that Loki is working, that the network is working, and that the other Vector agents in the same cluster are working properly.

@jszwedko
Member

@z2665 do you see anything interesting in the log output? One thought would be that the buffer reader got stuck. Also, which version of Vector?

@z2665

z2665 commented Jul 19, 2023

@z2665 do you see anything interesting in the log output? One thought would be that the buffer reader got stuck. Also, which version of Vector?

The logs are normal; there's no output of any value. This issue has occurred several times in our production environment, with versions 0.23 and 0.28, but the frequency is very low.

@z2665

z2665 commented Aug 9, 2023

Unfortunately, we observed the same problem after upgrading Vector to 0.31. This issue is not resolved.

@dsmith3197
Contributor

dsmith3197 commented Nov 7, 2023

Hi @smitthakkar96,

The issue originally reported here should be resolved in the new v0.34.0 release. There was a performance bottleneck in the Datadog Metrics sink that has now been fixed. Can you try it out and let us know if it resolves your issue?

#18759

@dsmith3197 dsmith3197 added the source: internal_metrics Anything `internal_metrics` source related label Dec 21, 2023