Vector sink with disk buffer stuck after pod restart #19155
Comments
We run Vector as StatefulSets in EKS, so each Vector pod has a dedicated PV provisioned for it. Recently we observed a similar issue with orphaned .dat files when some pods got OOM killed. After we increased the memory limit, the Vector pods restarted successfully and continued to create and process new .dat files for a while. However, because the old .dat files were still around, Vector failed to increment the number in the buffer file name due to a conflict with an old .dat file, and stopped sending logs to sinks with disk buffers. Once we deleted the old .dat files and restarted the affected pod, it started working again. I think this might have been the case for you too, since in your example it would not have been possible to create a newer .dat file named "buffer-data-21.dat" because an old one already existed with the same filename. We didn't see any errors related to "Unexpected status: 307 Temporary Redirect" though. The only errors we saw were dropped events with "reason=Source send cancelled", along with CPU and buffer usage flatlining.
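For context on the naming scheme the comment above refers to: the disk_v2 buffer writes data files named buffer-data-<id>.dat, where the id comes from a counter that eventually wraps around. Below is a minimal sketch of the collision being described, assuming a hypothetical wraparound limit and helper names; it is illustrative only, not Vector's actual code.

```rust
use std::path::{Path, PathBuf};

// Hypothetical wraparound limit, for illustration only.
const MAX_FILE_ID: u32 = 32;

/// Builds the data file path for a given writer file ID.
fn data_file_path(buffer_dir: &Path, file_id: u32) -> PathBuf {
    buffer_dir.join(format!("buffer-data-{}.dat", file_id))
}

/// Advances the writer to the next file ID, wrapping around at the limit.
fn next_file_id(current: u32) -> u32 {
    (current + 1) % MAX_FILE_ID
}

fn main() {
    // Hypothetical buffer directory path.
    let buffer_dir = Path::new("/var/lib/vector/buffer/my_sink");
    let next_id = next_file_id(20);
    let next_path = data_file_path(buffer_dir, next_id);

    // If an orphaned buffer-data-21.dat from a previous run is still on disk,
    // the writer cannot safely create a "new" file under that name and instead
    // has to wait for the reader to consume and delete it.
    if next_path.exists() {
        println!("{} already exists; writer must wait", next_path.display());
    }
}
```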
It looks to me like if there are any old .dat files on startup that don't get cleaned up, the writer silently gets stuck due to the waiting at vector/lib/vector-buffers/src/variants/disk_v2/writer.rs Lines 1147 to 1150 in 1579627
There's also an interesting explanation in a comment just a bit before that line about the writer waiting on the reader: vector/lib/vector-buffers/src/variants/disk_v2/writer.rs Lines 1045 to 1058 in 1579627
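To paraphrase that explanation as a sketch: before reusing a file ID, the writer waits for the reader to finish with, and delete, any existing file under that name. If the file is an orphan from a previous run, the reader never touches it, so the wait never completes. A rough, synchronous approximation follows; Vector's real code is async and notification-driven, and the names here are my assumptions, not the actual implementation.

```rust
use std::{path::Path, thread, time::Duration};

/// Rough sketch (not Vector's actual code) of the wait described in writer.rs:
/// the writer wants to reuse a file ID, but a file with that name still exists,
/// so it waits for the reader to finish reading it and delete it.
fn wait_for_reader_to_delete(next_path: &Path) {
    while next_path.exists() {
        // In Vector this is an async wait on a reader notification rather than
        // polling; either way, if the file is orphaned the reader will never
        // read or delete it, so the writer stays here indefinitely, which
        // matches the "silently stuck" symptom with no error logged.
        thread::sleep(Duration::from_millis(500));
    }
}
```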
Does this mean that some kind of old buffer file cleanup needs to be done on startup when loading the old ledger? How would Vector be able to tell which buffer files are still valid versus old/orphaned ones? I'm not really familiar with the data structure used here, although I'm interested in learning more about it!
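Purely as a sketch of what such a startup check could look like, assuming the ledger records the reader's and writer's current file IDs (the field meanings and range logic here are my assumptions, not Vector's actual on-disk format): any buffer-data-<id>.dat whose ID falls outside the reader-to-writer range would be an orphan candidate.

```rust
/// Illustrative only: decide whether a data file ID is "live" given the reader
/// and writer file IDs recorded in the ledger, accounting for wraparound.
/// IDs outside this range would be orphaned leftovers from a previous run.
fn is_live_file(file_id: u32, reader_id: u32, writer_id: u32) -> bool {
    if reader_id <= writer_id {
        (reader_id..=writer_id).contains(&file_id)
    } else {
        // The writer has wrapped around past the maximum file ID.
        file_id >= reader_id || file_id <= writer_id
    }
}

fn main() {
    // Example: the ledger says the reader is on file 19 and the writer on
    // file 20, but buffer-data-5.dat is still on disk from before a crash.
    assert!(is_live_file(20, 19, 20));
    assert!(!is_live_file(5, 19, 20)); // orphan candidate
    println!("orphan check ok");
}
```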
That's great info! cc @tobz, who might be able to provide more guidance or suggestions for potential fixes.
A note for the community
Problem
We are using Vector version 0.33.1 to process and forward Datadog metrics. Because of HPA and the Kubernetes node rotation mechanism, we enabled the disk buffer to be sure that all metrics will be forwarded to the Datadog server. Unfortunately, after a pod/node restart we observed a situation where the Vector pod starts and the process is up, but data is either not processed at all, or processed but never sent to the sink. In the correctly mounted persistent volume we see old buffer-data-*.dat files. Since we use Kubernetes HPA, multiple orphaned .dat files exist in the PVC store.
Also, is there a chance to add a detailed warn in the logs when Vector blocks the sink and cannot forward metrics? From an observability perspective, we can currently only catch this via Vector internal metrics and Datadog agent metrics. It would be useful to have this information directly in the logs.
Configuration
Version
0.33.1
Debug Output
I also found this in debug:
Example Data
pod logs:
Additional Context
- Removing the *.lock file doesn't help.
- Flows that do not use this sink, such as internal metrics, are forwarded correctly.
- Vector doesn't create additional *.dat files during start.
- When we remove all orphaned *.dat files from the persistent volume and restart the pod, data processing resumes.
- Not all restarts reproduce this issue; some pods have been restarted by us multiple times and always continue processing data correctly.
References
No response