Unbounded memory growth for high throughput kubernetes_logs #17313
Comments
It seems likely this could be a combination of file handles as well as internal metric growth. To confirm: it's 32 pods to the single Vector pod collecting their logs?
@spencergilbert ah, actually 30, not 32. But yes, all 30 pods are colocated on the same node as the Vector daemon and are logging to file via the `json-file` log driver. Each deployment is labeled according to the size of logs it's producing. They each emit 100 logs/s, except for the 128K deployment, which emits 1 log/s. I believe EKS has a default rotation size of 10MB, so the rate at which we'd see file rotations should be fairly high.
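A rough back-of-envelope with those numbers (my own estimate, using the smallest 1.25K log size and the 10MB rotation threshold):

$$
100\ \text{logs/s} \times 1.25\ \text{KB} \approx 125\ \text{KB/s per container}
\quad\Rightarrow\quad
\frac{10\ \text{MB}}{125\ \text{KB/s}} \approx 80\ \text{s per rotation}
$$

With roughly 30 containers on the node, and the larger-log deployments rotating even faster, that works out to a rotation somewhere on the node every few seconds, which is a lot of churn for the source's file handle and checkpoint tracking to keep up with.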
@spencergilbert any updates here?
This sounds like it could partially be caused by the known issue around runaway internal metric cardinality. Separately, it may be that Vector is holding a lot of open file handles (have you confirmed this using `lsof`?). Do you have a rough sense of the volume of logs per node?
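If the metric cardinality angle turns out to be the culprit, one mitigation sketch is to expire internal metric series that have gone stale, so per-file tags from rotated files don't accumulate forever. This assumes the global `expire_metrics_secs` option is available in the Vector release being run; check the docs for your version:

```yaml
# Sketch only: drop internal metric series that have not been updated for
# the given number of seconds. Availability and exact behavior depend on
# the Vector release in use.
expire_metrics_secs: 300

sources:
  kubernetes_logs:
    type: kubernetes_logs
```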
@jszwedko thanks for the extra info! Re: internal metrics cardinality, I'm including what we captured there: I haven't actually investigated open file handle counts yet. Do you know if there's an easy way to run `lsof` inside the Vector container?
If you're using the distroless image you'd need to use an ephemeral container; otherwise I'd expect you can exec into the container and run it directly.
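To make the ephemeral container route concrete, here is a sketch of what `kubectl debug` effectively adds to the pod spec. The image, names, and fd-counting command are illustrative, and `targetContainerName` requires a runtime that supports process namespace targeting:

```yaml
# Illustrative only: an ephemeral debug container targeting the Vector
# container so its /proc/<pid>/fd entries (or lsof, if the image ships it)
# can be inspected. Normally added via `kubectl debug`, not authored by hand.
ephemeralContainers:
  - name: fd-debugger
    image: busybox:1.36
    command: ["sh", "-c", "ls /proc/1/fd | wc -l && sleep 3600"]
    targetContainerName: vector
```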
Hello @jszwedko, could this also be related to the memory leak issue #11025 that I've been seeing since version 0.19? I'm using the `kubernetes_logs` source as well. The only solution to avoid an OOM crash is to restart the Vector process periodically.
I double-checked my assumption that the Vector pods grow memory usage faster when many containers are started on their nodes. The following picture shows an aggregation per node (one Vector instance running on each), counting how many distinct container IDs logged something on it over the past 7 days. The first two bars, which are significantly higher than the others, are exactly the two nodes where the Vector pods got restarted in my earlier post #17313 (comment). File descriptors checked with lsof are increasing as well, but confusingly the fd counts are only in the hundreds. I would have assumed it takes multiple thousands before those become an issue? EDIT: I gave the file descriptor counts another look and collected them from all running pods. This is the list:
Confusingly, this does not show a clear correlation between the number of open file descriptors and memory used. The correlation with the number of started containers holds true, though.
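One way to line these numbers up over time is to scrape Vector's own telemetry next to the lsof counts. A minimal sketch (component names are illustrative, and the relevant metric names vary by version):

```yaml
sources:
  vector_metrics:
    type: internal_metrics
sinks:
  vector_prom:
    type: prometheus_exporter
    inputs: ["vector_metrics"]
    # Scrape this endpoint and graph memory- and file-related series
    # alongside the per-pod fd counts collected above.
    address: "0.0.0.0:9598"
```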
Hi folks, I have upgraded to the new Vector version. My Vector spec:

```yaml
vector.yaml: |
  api:
    enabled: true
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      glob_minimum_cooldown_ms: 500
      max_read_bytes: 2048000
      oldest_first: true
  ...
cpu:
  request: 50m
  limit: 100m
memory:
  request: 1000Mib
  limit: 1000Mib
```
It's an important clue to help locate the root cause! @sharonx @jszwedko @spencergilbert
Re:
Sadly, I have to say that my previous conclusion was wrong. The pod's memory was stable for a while (~2h) and then started growing again. After that, the container was OOMKilled every 15 minutes or so.
I'm seeing the same thing, and probably have an even larger log load per node in Kubernetes. I thought it was due to the S3 sink not sending to S3, or sending files that were too small, but when I switch it to the blackhole sink the growth continues. So it seems like the cause is the kubernetes_logs source. I've even tried removing the simple transform that was there.
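For anyone trying to isolate this, a minimal repro sketch along those lines (component names are illustrative) that takes sinks out of the picture entirely:

```yaml
# Read pod logs and discard them, so any memory growth can be attributed
# to the kubernetes_logs source rather than sink backpressure or batching.
sources:
  k8s:
    type: kubernetes_logs
sinks:
  discard:
    type: blackhole
    inputs: ["k8s"]
    print_interval_secs: 10
```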
cc @wanjunlei
A note for the community
Problem
We're seeing this behavior for Vector 0.29.1 on EKS 1.24. We are load testing Vector by running it as a DaemonSet against pods writing files via the `json-file` log driver for containerd. There are 32 pods writing 100 logs/s each, so we're pushing 3200 logs/s through the Vector daemon on the node. Log sizes vary from 1.25K to 128K, with mostly smaller writes. We see unbounded memory growth consistently, even when setting a blackhole sink. Based on log throughput and EKS's rotation size, I'd imagine we're rotating through a ton of files, so we believe that Vector is probably hoarding open file handles.

Setup
Chart.yaml
Values.yaml
Vector Config
The redacted remaps below are just adding additional static tags.
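Since the actual remaps are redacted, here is a hedged sketch of the shape they take, assuming they only attach static tags as described (tag names and values are placeholders):

```yaml
transforms:
  add_static_tags:
    type: remap
    inputs: ["kubernetes_logs"]
    # VRL that only sets fixed fields on every event; no parsing or lookups.
    source: |
      .tags.environment = "load-test"
      .tags.cluster = "example-cluster"
```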
Configuration
No response
Version
0.29.1
Debug Output
No response
Example Data
Additional Context
No response
References
#14750