Vector agent stops watching logs from new pods #8616

Closed
Tracked by #10016 ...
alexgavrisco opened this issue Aug 6, 2021 · 32 comments
Labels
source: kubernetes_logs (Anything `kubernetes_logs` source related)
type: bug (A code related bug.)

Comments

@alexgavrisco
Contributor

alexgavrisco commented Aug 6, 2021

Vector Version

version="0.15.0" arch="x86_64" build_id="994d812 2021-07-16"

Vector Configuration File

# Configuration for vector.
# Docs: https://vector.dev/docs/

data_dir = "/vector-data-dir"

[api]
  enabled = false
  address = "0.0.0.0:8686"
  playground = true

[log_schema]
  host_key = "host"
  message_key = "log"
  source_type_key = "source_type"
  timestamp_key = "time"

# Ingest logs from Kubernetes.
[sources.kubernetes_logs]
  type = "kubernetes_logs"
  extra_field_selector = "metadata.namespace==default"
  max_line_bytes = 262144



# Emit internal Vector metrics.
[sources.internal_metrics]
  type = "internal_metrics"

# Expose metrics for scraping in the Prometheus format.
[sinks.prometheus_sink]
  address = "0.0.0.0:2020"
  inputs = ["internal_metrics"]
  type = "prometheus"


[transforms.cluster_tagging]
  inputs = ["kubernetes_logs"]
  source = "...parsing json...adding some fields"
  type = "remap"

[sinks.splunk]
  type = "splunk_hec"
  inputs = ["cluster_tagging"]
  # ...
  batch.timeout_secs = 10
  request.concurrency = "adaptive"
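
Purely as an illustration (not the reporter's actual program), the elided remap source above ("...parsing json...adding some fields") could contain VRL along these lines; the .parsed and .cluster fields and values are hypothetical:

  # Hypothetical VRL standing in for the elided program above.
  # The raw container line lives in .log, per log_schema.message_key above.
  .parsed, err = parse_json(.log)
  if err != null {
    # Drop the null placeholder for lines that are not valid JSON.
    del(.parsed)
  }
  # Tag each event with its cluster (illustrative field and value).
  .cluster = "example-cluster"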

Debug Output

The issue reproduces in production, where Vector cannot be run in debug mode.

Expected Behavior

Vector doesn't ignore logs from new pods.
This is quite concerning given how difficult it is to detect: there are no errors or metrics I can alert on.

Actual Behavior

Vector is deployed in EKS (Kubernetes 1.17+) as an agent (DaemonSet). I do releases on a regular basis (meaning pods get deleted/re-created at least weekly). I noticed that after one such release, multiple clusters stopped delivering logs. Although containers were running and logging (nothing really changed), Vector was simply ignoring the new pods. I upgraded Vector to 0.15 (from 0.13) because I saw a few similar issues and some desync errors in the logs. However, it seems to have happened again: a cluster stopped delivering logs (except for a single service which wasn't released). In the logs I see lots of desync errors, but they appeared days before Vector started ignoring logs.

Jul 28 11:18:52.156 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Jul 28 11:18:52.156  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync

And 3 days ago Vector just stopped watching logs from old pods and didn't start watching new ones. There were no other errors prior to this; these are the last logs from Vector. Once I restarted the DaemonSet, it detected new logs and started consuming them.

Aug 03 10:21:46.149  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new file to watch. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Aug 03 10:21:46.149  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Aug 03 10:43:54.928  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log

Additional Context

References

#7934 seems to show the same symptoms.

alexgavrisco added the `type: bug` label on Aug 6, 2021
@alexgavrisco
Contributor Author

Vector's pods weren't throttled or OOMKilled. From a resource usage perspective, they just flat-lined once they stopped watching logs.
[image: resource usage graph flat-lining]

@alexgavrisco
Contributor Author

The "files added" / "files unwatched" metrics show the same pattern: at some point Vector just stopped watching for new log files.
[image: files added / files unwatched metrics]

spencergilbert added the `source: kubernetes_logs` label on Aug 6, 2021
@uthng

uthng commented Aug 11, 2021

Hi, I get the same behavior. For some pods, and for a reason I don't know, Vector stops watching logs. I thought this had been fixed by issues #6053 and #5846. It is quite critical, I think. Thanks for your help.

@alexgavrisco
Contributor Author

@uthng I've added some details to #7527.
Unfortunately the problem is still there, and it doesn't seem that there's an immediate fix. So far I have two options: roll back to a version which doesn't have the problem (we'd been using 0.8 with "experimental" Kubernetes support before upgrading to this one), or restart Vector once it enters this state (e.g. #7401 (comment)).

@gartemiev

gartemiev commented Sep 21, 2021

I am also seeing the same errors:
Sep 21 09:02:48.653  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Sep 21 09:02:49.228 ERROR source{component_kind="source" component_name=k8s_all component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5

@spencergilbert we are about to migrate our apps to EKS using the Vector Kubernetes source. Is there any ETA for a fix? This is critical for us.

@reyvonger

I regularly get a similar problem on talos.dev

Sep 28 22:00:58.358 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5

@jszwedko
Member

Just noting that we have plans to re-address this issue this quarter.

@imcitius

Hi, I have the same issue on 0.16 and 0.17, but 0.15.2 works flawlessly in my case. It repeats on a few k8s clusters.

@mickaelrecoquillay

Hello, we also have this issue.
It's not only for new pods: 5 minutes after starting vector-agent, it stops watching logs and we get this error message: error=Desync.
It works correctly in 0.15.2.

@ajaygupta978

Hi,
I am also facing this issue. Here is the link on Discord:
https://discord.com/channels/742820443487993987/746070591097798688/904730829983469649

@igor-nikiforov

This issue still exists in 0.18.1.

2021-12-06T19:57:05.723299Z ERROR source{component_kind="source" component_id=java-logs component_type=kubernetes_logs component_name=java-logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
2021-12-06T19:57:05.723335Z  WARN source{component_kind="source" component_id=java-logs component_type=kubernetes_logs component_name=java-logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync

@tomer-epstein

tomer-epstein commented Dec 8, 2021

We are facing the same issue with version 0.15.2 on GCP.
The pod is marked as running, but no logs are available and it isn't functional.
On AWS it worked.

Dec 08 10:02:49.560 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Other { source: BadStatus { status: 401 } } internal_log_rate_secs=5
Dec 08 10:02:49.560 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }
Dec 08 10:02:49.560 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Reflector process exited with an error. error=watch invocation failed

@karlmartink

Having the same issue.
version="0.18.1" arch="x86_64" build_id="c4adb60 2021-11-30"

2021-12-09T13:50:10.928892Z ERROR source{component_kind="source" component_id=kube_logs_1 component_type=kubernetes_logs component_name=kube_logs_1}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
2021-12-09T13:50:10.929016Z  WARN source{component_kind="source" component_id=kube_logs_1 component_type=kubernetes_logs component_name=kube_logs_1}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync

@BredSt

BredSt commented Jan 11, 2022

The error occurs while using Vector 0.18.1 (I also tested other previous versions, but the same issue occurs).
K8s on GCP 1.21.6

The pods work for about 1-2 hours and then produce the following errors.

Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Other { source: BadStatus { status: 401 } } internal_log_rate_secs=5

Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }

Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Namespace reflector process exited with an error. error=watch invocation failed

Jan 11 12:55:49.760 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Reflector process completed gracefully.

Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Event processing loop completed gracefully.

Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: File server completed gracefully.

Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Done.

@eplightning

@tomer-epstein @BredSt
The 401 errors are most likely caused by https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume, which is the default since K8s 1.21.

It seems that Vector doesn't support token rotation, which is why it stops working once the token expires. You can work around this by either disabling the feature gate (impossible after 1.22) or manually mounting the service account token, as sketched below.

Vector freezing after the 401 error happens is still related to this bug, though.
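
For anyone who needs the manual mount in the meantime, a minimal sketch of that workaround, assuming the DaemonSet runs under a ServiceAccount named "vector" (all resource names and the image tag below are illustrative). A Secret of type kubernetes.io/service-account-token is populated by the token controller with a non-expiring token, and mounting it at the default token path makes Vector read it instead of the rotating projected token:

# Sketch only: long-lived token for a hypothetical "vector" ServiceAccount.
apiVersion: v1
kind: Secret
metadata:
  name: vector-token
  annotations:
    kubernetes.io/service-account.name: vector
type: kubernetes.io/service-account-token
---
# Relevant excerpt of the agent DaemonSet: mount the Secret where Vector
# expects the token, and skip the auto-mounted (expiring) projected token.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vector
spec:
  selector:
    matchLabels: {app: vector}
  template:
    metadata:
      labels: {app: vector}
    spec:
      serviceAccountName: vector
      automountServiceAccountToken: false
      containers:
        - name: vector
          image: timberio/vector:0.18.1-debian   # illustrative tag
          volumeMounts:
            - name: vector-token
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              readOnly: true
      volumes:
        - name: vector-token
          secret:
            secretName: vector-token

Keep in mind that a long-lived token gives up the security benefit of rotation, so this is only a stopgap until Vector refreshes tokens itself.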

@tomer-epstein

tomer-epstein commented Jan 13, 2022

@eplightning
Disabling the feature flag solved the issue on k8s 1.21 (as you said, this will be impossible on k8s 1.22).
I haven't yet tried manually mounting the service account token.

@spencergilbert
Contributor

spencergilbert commented Jan 13, 2022

@eplightning @tomer-epstein FWIW the planned rewrite to kube should cover this as they're adding support for it (kube-rs/kube#768)

@tomer-epstein

@spencergilbert when do you expect it will be released?

@spencergilbert
Contributor

@spencergilbert when do you expect it will be released?

It's scheduled to be worked on this quarter, but as far as I know hasn't been planned more precisely than that.

@JohanJermey

We are facing the same issue with k8s 1.22.x. Is there something we can do? This is a blocker, and removing the token rotation from the Vector chart doesn't help. Please advise.

@treethought

If possible, I would ask that this be made a top priority this quarter. We had to downgrade due to this issue; the kubernetes_logs source is our primary use case with Vector, and this is quite critical.

@jszwedko
Member

jszwedko commented Feb 14, 2022

Hey all!

Thanks to everyone who provided additional details. We are currently in the process of migrating to kube-rs, which we hope will either resolve this issue of Vector not picking up new pods or make it easier to track down.

Regarding the token rotation, we are tracking that in #11146, but it will also be resolved by the migration to kube-rs, given that the crate has support for refreshing the token.

@MaxRink

MaxRink commented Mar 7, 2022

We ran into the same issue. Logs for tracking: https://gist.github.com/MaxRink/52fbe0037ff2710eb57a668da2ef71d6

@rati3l

rati3l commented Mar 21, 2022

Hi,

I think there are two separate issues here.

There are logs with:
vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }
which are caused by the token rotation.

And there are logs where Vector just stops watching for new pods, e.g.:
vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5

The latter happens on all of my clusters with version 0.20.0: there is no token rotation involved, and the "Watch stream failed" message starts to appear regularly after the Vector pod starts up.

@spencergilbert
Contributor

We've merged in a PR replacing our in-house implementation with kube's library. The new code will be available in the 0.21 release (or in the nightly releases now). We're hoping this change solves some of the failure cases and isolates the rest so they'll be easier to diagnose and resolve.

We'd love to get feedback from anyone who upgrades to the new code!

@up-to-you

Waiting for kube-rs (if I understood correctly)... but when can we expect 0.21? Impatient to evaluate it.

@spencergilbert
Contributor

Waiting for kube-rs (if I understood correctly)... but when can we expect 0.21? Impatient to evaluate it.

We're working on cutting that release this week 👍

@up-to-you

up-to-you commented Apr 14, 2022

@spencergilbert I already tested 0.21, but I'm waiting for Helm chart 0.10 to be published. I'm getting a connection error on pod boot (os error 111) during the ...v1/namespaces? request. My guess is that it's due to the RBAC policy.

@spencergilbert
Contributor

spencergilbert commented Apr 14, 2022

@spencergilbert I already tested 0.21, but I'm waiting for Helm chart 0.10 to be published. I'm getting a connection error on pod boot (os error 111) during the ...v1/namespaces? request. My guess is that it's due to the RBAC policy.

The upgrade guide and highlights can be seen here: https://vector.dev/highlights/2022-03-22-0-21-0-upgrade-guide/#kubernetes-logs and https://vector.dev/highlights/2022-03-28-kube-for-kubernetes_logs/

@jszwedko
Member

0.21.0 has been released! We'd appreciate it if people affected by this issue could try it out and let us know if you still see it.

@up-to-you

@jszwedko More than one day of operation without any Desync issue.
But I encountered another issue, in the Clickhouse sink: a connection or worker thread hangs without emitting any log (frequency roughly every hour at 20 MB/s of log traffic).
Setting concurrency: 1 and timeout_secs: 2, and removing batch.max_bytes (it was 100 MB), solved it.
As mentioned, we don't have any logs to file an issue for that case.
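
For reference, a sketch of where those settings live in a Vector clickhouse sink in TOML; the sink name, inputs, endpoint, and table below are illustrative, and only the request options reflect the values mentioned above:

[sinks.clickhouse_out]
  type = "clickhouse"
  inputs = ["kubernetes_logs"]                  # illustrative input
  endpoint = "http://clickhouse.example:8123"   # illustrative endpoint
  table = "logs"                                # illustrative table
  # Workaround described above: cap in-flight requests and fail fast
  # rather than letting a hung connection stall the sink.
  request.concurrency = 1
  request.timeout_secs = 2
  # batch.max_bytes is left at its default (it was previously set to 100 MB).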

@jszwedko
Member

Cleaning up some issues. I'll close this since we believe it to be resolved, but please re-open if you still see this issue with Vector >= 0.21.1.
