Vector agent stops watching logs from new pods #8616
Comments
@uthng I've added some details to #7527
I am also seeing the same errors. @spencergilbert, we are about to migrate our apps into EKS using the Vector Kubernetes plugin; any ETA on when this will be fixed? This is critical for us.
I regularly get a similar problem on talos.dev.
Just noting that we have plans to re-address this issue this quarter.
Hi, I have the same on 0.16 and 0.17, but 0.15.2 works flawlessly in my case. It repeats on a few k8s clusters.
Hello, we also have this issue.
Hi,
This issue still exists in 0.18.1.
We are facing the same issue on version 0.15.2 on GCP.
Having the same issue.
The error occurs while using Vector 0.18.1 (I also tested other previous versions, but the same issue occurs). The pods work for about 1-2 hours and then produce the following errors:
Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Other { source: BadStatus { status: 401 } } internal_log_rate_secs=5
Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }
Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Namespace reflector process exited with an error. error=watch invocation failed
Jan 11 12:55:49.760 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Reflector process completed gracefully.
Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Event processing loop completed gracefully.
Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: File server completed gracefully.
Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Done.
@tomer-epstein @BredSt It seems that Vector doesn't support token rotation, which is why it stops working after the token's expiration time. You can work around this by either disabling the feature flag (impossible after 1.22) or manually mounting the service account token. Vector freezing after the 401 error happens is still related to this bug, though.
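For anyone considering the second workaround (manually mounting a service account token), here is a minimal, hypothetical sketch of what that can look like. It is not taken from the official Vector chart; the namespace, ServiceAccount, and Secret names are assumptions:

```yaml
# Hypothetical example: give the Vector agent a long-lived (non-rotating) token.
# Assumes a ServiceAccount named "vector" in the "vector" namespace.
apiVersion: v1
kind: Secret
metadata:
  name: vector-static-token
  namespace: vector
  annotations:
    kubernetes.io/service-account.name: vector  # token controller populates this Secret
type: kubernetes.io/service-account-token
---
# Relevant fragment of the Vector DaemonSet pod template:
# disable the projected (expiring) token and mount the static one in its place.
spec:
  serviceAccountName: vector
  automountServiceAccountToken: false
  containers:
    - name: vector
      volumeMounts:
        - name: static-token
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          readOnly: true
  volumes:
    - name: static-token
      secret:
        secretName: vector-static-token
```

A long-lived token trades away rotation entirely, so this is only a stopgap until Vector handles rotated tokens itself.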
@eplightning
@eplightning @tomer-epstein FWIW the planned rewrite to
@spencergilbert when do you expect it will be released?
It's scheduled to be worked on this quarter, but as far as I know hasn't been planned more precisely than that.
We are facing the same issue with k8s 1.22.x; is there something we can do? This is a blocker, and removing the token rotation from the Vector chart doesn't help. Please advise.
If possible, I would ask that this be a top priority this quarter. We had to downgrade due to this issue, since the kubernetes_logs source is our primary use case with Vector and this is quite critical.
Hey all! Thanks to everyone who provided additional details. We are currently in the process of migrating to kube-rs, which we hope will either resolve this issue with Vector not picking up new pods or make it easier to track down. Regarding the token rotation, we are tracking that in #11146, but it will also be resolved by the migration to kube-rs.
We ran into the same issue; logs for tracking: https://gist.github.com/MaxRink/52fbe0037ff2710eb57a668da2ef71d6
Hi, I think there are two separate issues here. And there are logs where Vector just stops watching for new pods. This happens on all clusters for me with the 0.20.0 version - there's no token rotation, and the "watch stream failed" message starts to appear regularly after the Vector pod starts up.
We've merged a PR replacing our in-house implementation with kube-rs. We'd love to get feedback from anyone who upgrades to the new code!
Waiting for kube-rs (if I noticed correctly)... but when should we expect the release?
We're working on cutting that release this week 👍
@spencergilbert already tested
The upgrade guide and highlights can be seen here: https://vector.dev/highlights/2022-03-22-0-21-0-upgrade-guide/#kubernetes-logs and https://vector.dev/highlights/2022-03-28-kube-for-kubernetes_logs/
0.21.0 has been released! We'd appreciate it if people affected by this issue could try it out and let us know if you still see it.
@jszwedko More than 1 day of operation without any desync issue.
Cleaning up some issues. I'll close this since we believe it to be resolved, but please re-open if you still see this issue with Vector >= 0.21.1. |
Vector Version
Vector Configuration File
Debug Output
The issue reproduces in production where Vector cannot be run in Debug mode.
Expected Behavior
Vector doesn't ignore logs from new pods.
This is quite disturbing given it's difficult to detect, since there are no errors/metrics I can alert on.
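Not from this thread, but one hedged way to detect this condition is to alert when the kubernetes_logs source stops receiving events, assuming Vector's internal_metrics source and a prometheus_exporter sink are enabled. The metric and label names below (vector_component_received_events_total, component_id) are assumptions and vary between Vector versions:

```yaml
# Hypothetical PrometheusRule: fire when a Vector agent's kubernetes_logs
# source has received no events for 30 minutes.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vector-kubernetes-logs-stalled
spec:
  groups:
    - name: vector.rules
      rules:
        - alert: VectorKubernetesLogsStalled
          expr: |
            sum by (instance) (
              rate(vector_component_received_events_total{component_id="kubernetes_logs"}[10m])
            ) == 0
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Vector agent {{ $labels.instance }} has received no Kubernetes logs for 30 minutes"
```

On quiet clusters this will false-positive, so the rate window and `for` duration need tuning per environment.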
Actual Behavior
Vector is deployed in EKS (Kubernetes 1.17+) as an agent (DaemonSet). I do releases on a regular basis (meaning pods get deleted/re-created at least weekly). I noticed that after one such release, multiple clusters stopped delivering logs. Although containers were running and logging (nothing really changed), Vector was simply ignoring new pods. I upgraded Vector to 0.15 (from 0.13) after seeing a few similar issues and some desync errors in the logs. However, it seems to have happened again - a cluster stopped delivering logs (except for a single service that wasn't released). In the logs I see lots of desync errors, but they occurred days before Vector started ignoring logs.
And 3 days ago Vector just stopped watching logs from old pods and didn't start watching new ones. There were no other errors prior to this. Those are the last logs from Vector. Once I restarted the DaemonSet, it detected new logs and started consuming them.
Additional Context
References
#7934 seems to show the same symptoms.