Vector agent stops watching logs from new pods #8616

Closed
Tracked by #10016 ...
alexgavrisco opened this issue Aug 6, 2021 · 32 comments
Labels
source: kubernetes_logs (Anything `kubernetes_logs` source related)
type: bug (A code related bug.)

Comments

@alexgavrisco
Contributor

alexgavrisco commented Aug 6, 2021

Vector Version

version="0.15.0" arch="x86_64" build_id="994d812 2021-07-16"

Vector Configuration File

# Configuration for vector.
# Docs: https://vector.dev/docs/

data_dir = "/vector-data-dir"

[api]
  enabled = false
  address = "0.0.0.0:8686"
  playground = true

[log_schema]
  host_key = "host"
  message_key = "log"
  source_type_key = "source_type"
  timestamp_key = "time"

# Ingest logs from Kubernetes.
[sources.kubernetes_logs]
  type = "kubernetes_logs"
  extra_field_selector = "metadata.namespace==default"
  max_line_bytes = 262144



# Emit internal Vector metrics.
[sources.internal_metrics]
  type = "internal_metrics"

# Expose metrics for scraping in the Prometheus format.
[sinks.prometheus_sink]
  address = "0.0.0.0:2020"
  inputs = ["internal_metrics"]
  type = "prometheus"


[transforms.cluster_tagging]
  inputs = ["kubernetes_logs"]
  source = "...parsing json...adding some fields"
  type = "remap"

[sinks.splunk]
  type = "splunk_hec"
  inputs = ["cluster_tagging"]
  # ...
  batch.timeout_secs = 10
  request.concurrency = "adaptive"
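
Purely as an illustration (not the reporter's actual program), the elided remap source above ("...parsing json...adding some fields") could contain VRL along these lines; the .parsed and .cluster fields and values are hypothetical:

  # Hypothetical VRL standing in for the elided program above.
  # The raw container line lives in .log, per log_schema.message_key above.
  .parsed, err = parse_json(.log)
  if err != null {
    # Drop the null placeholder for lines that are not valid JSON.
    del(.parsed)
  }
  # Tag each event with its cluster (illustrative field and value).
  .cluster = "example-cluster"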

Debug Output

The issue reproduces in production, where Vector cannot be run in debug mode.

Expected Behavior

Vector doesn't ignore logs from new pods.
This is quite concerning given how difficult it is to detect: there are no errors or metrics I can alert on.

Actual Behavior

Vector is deployed in EKS (Kubernetes 1.17+) as an agent (DaemonSet). I do releases on a regular basis (meaning pods get deleted/re-created at least weekly). I noticed that after one such release, multiple clusters stopped delivering logs. Although containers were running and logging (nothing really changed), Vector was simply ignoring the new pods. I upgraded Vector to 0.15 (from 0.13) because I saw a few similar issues and some desync errors in the logs. However, it seems to have happened again: a cluster stopped delivering logs (except for a single service which wasn't released). In the logs I see lots of desync errors, but they appeared days before Vector started ignoring logs.

Jul 28 11:18:52.156 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
Jul 28 11:18:52.156  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync

And 3 days ago Vector just stopped watching logs from old pods and didn't start watching new ones. There were no other errors prior to this; these are the last logs from Vector. Once I restarted the DaemonSet, it detected new logs and started consuming them.

Aug 03 10:21:46.149  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new file to watch. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Aug 03 10:21:46.149  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log
Aug 03 10:43:54.928  INFO source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. path=/var/log/pods/default_foo-856498f4fc-x795d_dc8d9e9e-f0a8-4594-9fc9-e3a83e5cbb77/foo/0.log

Additional Context

References

#7934 seems to show the same symptoms.

alexgavrisco added the `type: bug` label on Aug 6, 2021
@alexgavrisco
Contributor Author

Vector's pods weren't throttled or OOMKilled. From a resource usage perspective, they just flat-lined once they stopped watching logs.
[image: resource usage graph flat-lining]

@alexgavrisco
Contributor Author

The "files added" / "files unwatched" metrics show the same pattern: at some point Vector just stopped watching for new log files.
[image: files added / files unwatched metrics]

spencergilbert added the `source: kubernetes_logs` label on Aug 6, 2021
@uthng

uthng commented Aug 11, 2021

Hi, I get the same behavior. For some pods, and for a reason I don't know, Vector stops watching logs. I thought this had been fixed by issues #6053 and #5846. It is quite critical, I think. Thanks for your help.

@alexgavrisco
Contributor Author

@uthng I've added some details to #7527.
Unfortunately the problem is still there, and it doesn't seem that there's an immediate fix. So far I have two options: roll back to a version which doesn't have the problem (we'd been using 0.8 with "experimental" Kubernetes support before upgrading to this one), or restart Vector once it enters this state (e.g. #7401 (comment)).

@gartemiev

gartemiev commented Sep 21, 2021

I am also seeing the same errors:
Sep 21 09:02:48.653  WARN source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
Sep 21 09:02:49.228 ERROR source{component_kind="source" component_name=k8s_all component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5

@spencergilbert we are about to migrate our apps to EKS using the Vector Kubernetes source. Is there any ETA for a fix? This is critical for us.

@reyvonger

I regularly get a similar problem on talos.dev

Sep 28 22:00:58.358 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5

@jszwedko
Member

Just noting that we have plans to re-address this issue this quarter.

@imcitius

Hi, I have the same issue on 0.16 and 0.17, but 0.15.2 works flawlessly in my case. It repeats on a few k8s clusters.

@mickaelrecoquillay

Hello, we also have this issue.
It's not only for new pods: 5 minutes after starting vector-agent, it stops watching logs and we get this error message: error=Desync.
It works correctly in 0.15.2.

@ajaygupta978

Hi,
I am also facing this issue. Here is the link on Discord:
https://discord.com/channels/742820443487993987/746070591097798688/904730829983469649

@igor-nikiforov

This issue still exists in 0.18.1.

2021-12-06T19:57:05.723299Z ERROR source{component_kind="source" component_id=java-logs component_type=kubernetes_logs component_name=java-logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
2021-12-06T19:57:05.723335Z  WARN source{component_kind="source" component_id=java-logs component_type=kubernetes_logs component_name=java-logs}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync

@tomer-epstein

tomer-epstein commented Dec 8, 2021

We are facing the same issue with version 0.15.2 on GCP.
The pod is marked as running, but no logs are available and it isn't functional.
On AWS it worked.

Dec 08 10:02:49.560 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Other { source: BadStatus { status: 401 } } internal_log_rate_secs=5
Dec 08 10:02:49.560 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }
Dec 08 10:02:49.560 ERROR source{component_kind="source" component_name=kubernetes_logs component_type=kubernetes_logs}: vector::sources::kubernetes_logs: Reflector process exited with an error. error=watch invocation failed

@karlmartink

Having the same issue.
version="0.18.1" arch="x86_64" build_id="c4adb60 2021-11-30"

2021-12-09T13:50:10.928892Z ERROR source{component_kind="source" component_id=kube_logs_1 component_type=kubernetes_logs component_name=kube_logs_1}: vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5
2021-12-09T13:50:10.929016Z  WARN source{component_kind="source" component_id=kube_logs_1 component_type=kubernetes_logs component_name=kube_logs_1}: vector::internal_events::kubernetes::reflector: Handling desync. error=Desync

@BredSt

BredSt commented Jan 11, 2022

The error occurs while using Vector 0.18.1 (I also tested other previous versions, but the same issue occurs).
K8s on GCP 1.21.6

The pods work for about 1-2 hours and then produce the following errors.

Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes::instrumenting_watcher: Watch invocation failed. error=Other { source: BadStatus { status: 401 } } internal_log_rate_secs=5

Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }

Jan 11 12:55:49.760 ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Namespace reflector process exited with an error. error=watch invocation failed

Jan 11 12:55:49.760 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Reflector process completed gracefully.

Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Event processing loop completed gracefully.

Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: File server completed gracefully.

Jan 11 12:55:49.762 INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::sources::kubernetes_logs: Done.

@eplightning

@tomer-epstein @BredSt
The 401 errors are most likely caused by https://kubernetes.io/docs/reference/access-authn-authz/service-accounts-admin/#bound-service-account-token-volume, which is the default since K8s 1.21.

It seems that Vector doesn't support token rotation, which is why it stops working once the token expires. You can work around this by either disabling the feature gate (impossible after 1.22) or manually mounting the service account token, as sketched below.

Vector freezing after the 401 error happens is still related to this bug, though.
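
For anyone who needs the manual mount in the meantime, a minimal sketch of that workaround, assuming the DaemonSet runs under a ServiceAccount named "vector" (all resource names and the image tag below are illustrative). A Secret of type kubernetes.io/service-account-token is populated by the token controller with a non-expiring token, and mounting it at the default token path makes Vector read it instead of the rotating projected token:

# Sketch only: long-lived token for a hypothetical "vector" ServiceAccount.
apiVersion: v1
kind: Secret
metadata:
  name: vector-token
  annotations:
    kubernetes.io/service-account.name: vector
type: kubernetes.io/service-account-token
---
# Relevant excerpt of the agent DaemonSet: mount the Secret where Vector
# expects the token, and skip the auto-mounted (expiring) projected token.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: vector
spec:
  selector:
    matchLabels: {app: vector}
  template:
    metadata:
      labels: {app: vector}
    spec:
      serviceAccountName: vector
      automountServiceAccountToken: false
      containers:
        - name: vector
          image: timberio/vector:0.18.1-debian   # illustrative tag
          volumeMounts:
            - name: vector-token
              mountPath: /var/run/secrets/kubernetes.io/serviceaccount
              readOnly: true
      volumes:
        - name: vector-token
          secret:
            secretName: vector-token

Keep in mind that a long-lived token gives up the security benefit of rotation, so this is only a stopgap until Vector refreshes tokens itself.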

@tomer-epstein

tomer-epstein commented Jan 13, 2022

@eplightning
Disabling the feature flag solved the issue on k8s 1.21 (as you said, this will be impossible on k8s 1.22).
I haven't yet tried manually mounting the service account token.

@spencergilbert
Contributor

spencergilbert commented Jan 13, 2022

@eplightning @tomer-epstein FWIW the planned rewrite to kube should cover this as they're adding support for it (kube-rs/kube#768)

@tomer-epstein

@spencergilbert when do you expect it will be released?

@spencergilbert
Contributor

@spencergilbert when do you expect it will be released?

It's scheduled to be worked on this quarter, but as far as I know hasn't been planned more precisely than that.

@JohanJermey

We are facing the same issue with k8s 1.22.x. Is there something we can do? This is a blocker, and removing the token rotation from the Vector chart doesn't help. Please advise.

@treethought

If possible, I would ask that this be made a top priority this quarter. We had to downgrade due to this issue; the kubernetes_logs source is our primary use case with Vector, and this is quite critical.

@jszwedko
Member

jszwedko commented Feb 14, 2022

Hey all!

Thanks to everyone who provided additional details. We are currently in the process of migrating to kube-rs, which we hope will either resolve this issue of Vector not picking up new pods or make it easier to track down.

Regarding the token rotation, we are tracking that in #11146, but it will also be resolved by the migration to kube-rs, given that the crate has support for refreshing the token.

@MaxRink

MaxRink commented Mar 7, 2022

We ran into the same issue. Logs for tracking: https://gist.github.com/MaxRink/52fbe0037ff2710eb57a668da2ef71d6

@rati3l

rati3l commented Mar 21, 2022

Hi,

I think there are two separate issues here.

There are logs with:
vector::kubernetes::reflector: Watcher error. error=BadStatus { status: 401 }
which are caused by the token rotation.

And there are logs where Vector just stops watching for new pods, e.g.:
vector::internal_events::kubernetes::reflector: Handling desync. error=Desync
vector::internal_events::kubernetes::instrumenting_watcher: Watch stream failed. error=Desync { source: Desync } internal_log_rate_secs=5

The latter happens on all of my clusters with version 0.20.0: there is no token rotation involved, and the "Watch stream failed" message starts to appear regularly after the Vector pod starts up.

@spencergilbert
Contributor

We've merged in a PR replacing our in-house implementation with kube's library. The new code will be available in the 0.21 release (or in the nightly releases now). We're hoping this change solves some of the failure cases and isolates the rest so they'll be easier to diagnose and resolve.

We'd love to get feedback from anyone who upgrades to the new code!

@up-to-you

Waiting for kube-rs (if I understood correctly)... but when can we expect 0.21? Impatient to evaluate it.

@spencergilbert
Contributor

Waiting for kube-rs (if I understood correctly)... but when can we expect 0.21? Impatient to evaluate it.

We're working on cutting that release this week 👍

@up-to-you

up-to-you commented Apr 14, 2022

@spencergilbert I already tested 0.21, but I'm waiting for Helm chart 0.10 to be published. I'm getting a connection error on pod boot (os error 111) during the ...v1/namespaces? request. My guess is that it's due to the RBAC policy.

@spencergilbert
Contributor

spencergilbert commented Apr 14, 2022

@spencergilbert I already tested 0.21, but I'm waiting for Helm chart 0.10 to be published. I'm getting a connection error on pod boot (os error 111) during the ...v1/namespaces? request. My guess is that it's due to the RBAC policy.

The upgrade guide and highlights can be seen here: https://vector.dev/highlights/2022-03-22-0-21-0-upgrade-guide/#kubernetes-logs and https://vector.dev/highlights/2022-03-28-kube-for-kubernetes_logs/

@jszwedko
Member

0.21.0 has been released! We'd appreciate it if people affected by this issue could try it out and let us know if you still see it.

@up-to-you

@jszwedko More than one day of operation without any Desync issue.
But I encountered another issue, in the Clickhouse sink: a connection or worker thread hangs without emitting any log (frequency roughly every hour at 20 MB/s of log traffic).
Setting concurrency: 1 and timeout_secs: 2, and removing batch.max_bytes (it was 100 MB), solved it.
As mentioned, we don't have any logs to file an issue for that case.
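
For reference, a sketch of where those settings live in a Vector clickhouse sink in TOML; the sink name, inputs, endpoint, and table below are illustrative, and only the request options reflect the values mentioned above:

[sinks.clickhouse_out]
  type = "clickhouse"
  inputs = ["kubernetes_logs"]                  # illustrative input
  endpoint = "http://clickhouse.example:8123"   # illustrative endpoint
  table = "logs"                                # illustrative table
  # Workaround described above: cap in-flight requests and fail fast
  # rather than letting a hung connection stall the sink.
  request.concurrency = 1
  request.timeout_secs = 2
  # batch.max_bytes is left at its default (it was previously set to 100 MB).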

@jszwedko
Member

Cleaning up some issues. I'll close this since we believe it to be resolved, but please re-open if you still see this issue with Vector >= 0.21.1.
