
Fluentd stops exporting logs after some time #25819

Closed
astraldawn opened this issue Mar 6, 2020 · 11 comments
Labels: area/logging, internal, kind/bug

astraldawn commented Mar 6, 2020

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (fewest steps possible):

  • Create two clusters with RKE via the Rancher UI (Add Cluster - Custom): a main cluster and an es cluster
  • Deploy logging on the main cluster, sending logs to ElasticSearch on the es cluster
  • Send logs over a period of time (between 33k and 110k log messages over 24 hours)

Result:

  • Fluentd (rancher-logging-fluentd) stops sending logs to ElasticSearch with the following error:
Failed to flush the buffer, 
retry_time=0, next_retry_seconds=xxxx, chunk=CHUNK_ID, 
error_class=Fluent::Plugin::ElasticSearchOutput::RecoverableRequestFailure 
error="could not push logs to ElasticSearch cluster 
({:host=>HOST_NAME, :port=>443, :scheme=>\"https\"}): connect_write timeout reached"

This error message repeats, with retry_time increasing and next_retry_seconds moving further and further into the future.

Observed these failures on 6 out of the 18 nodes running fluentd, starting between 12 and 20 hours after redeployment of the rancher-logging-fluentd-linux daemon set.
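
For context on the growing next_retry_seconds: fluentd retries failed buffer flushes with exponential backoff by default, so once the connection to ElasticSearch is wedged, each failed flush roughly doubles the wait before the next attempt. A minimal sketch of the buffer parameters that govern this behaviour (parameter names are from the fluentd v1 buffer documentation; the values and file path are illustrative, not what the Rancher chart ships):

<buffer>
  @type file
  path /var/log/fluentd-buffers/es.buffer   # illustrative path
  retry_type exponential_backoff            # each failed flush roughly doubles the wait
  retry_wait 1s                             # wait before the first retry
  retry_max_interval 60s                    # cap on the backoff interval
  retry_timeout 72h                         # give up on a chunk after this long
  flush_thread_count 2
</buffer>

Because retry_max_interval and retry_timeout bound the backoff, the repeated "Failed to flush the buffer" lines above are expected while the connection is down; the problem reported here is that the connection never recovers.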

On es cluster

  • Ingress controller logs show no requests received from any of the main-cluster nodes where fluentd has stopped sending logs
  • No errors from ElasticSearch; ElasticSearch CPU usage is low throughout

Other details that may be helpful:

Environment information

  • Rancher version: v2.3.5
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type: Custom
  • Machine type: VM / 16 vCPU / 64 GB RAM
  • Kubernetes version: 1.16.6 (main cluster), 1.17.2 (es cluster)
  • Docker version: 18.9.8 (both)
  • Fluentd image: rancher/fluentd:v0.1.19

gyf304 commented Mar 10, 2020

Same issue here. Logging stops and the fluentd buffer backs up after a while.

astraldawn (Author) commented

Have had success using the following (Cluster logging > Edit as a form):

<match *>
  @type elasticsearch
  include_tag_key true
  hosts ES_HOST
  logstash_prefix "LOGSTASH_PREFIX"
  logstash_format true
  logstash_dateformat LOGSTASH_FORMAT
  type_name "container_log"
  ssl_verify false
  ssl_version TLSv1_2
  reload_connections false
  reconnect_on_error true
  reload_on_failure true
</match>

The last 3 lines are from the advice given in uken/fluent-plugin-elasticsearch#525.

The rest are the defaults from the Rancher ES template.
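
For reference, a commented sketch of what those three options change, based on a reading of the fluent-plugin-elasticsearch documentation (the noted defaults are general plugin defaults and are assumptions about the exact plugin version bundled in rancher/fluentd:v0.1.19):

  # Do not periodically re-fetch the node list from the ES _nodes API;
  # stale or unreachable addresses returned there can stall the output
  # (plugin default is true).
  reload_connections false

  # Reconnect to ES after a transport error instead of reusing the broken
  # connection (plugin default is false).
  reconnect_on_error true

  # Reload the connection list when a request fails (plugin default is false).
  reload_on_failure true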


Tejeev commented Mar 26, 2020

I see that reload_connections, reconnect_on_error, and reload_on_failure are now part of fluent/fluentd-kubernetes-daemonset. What are the chances we can get them tested and into the Rancher Helm charts?

https://github.com/fluent/fluentd-kubernetes-daemonset/blob/04122c95689ad2e7b106023b9e4b9894f2ab6426/templates/conf/fluent.conf.erb#L31-L33

Tejeev added the internal, area/logging, and kind/bug labels on Mar 26, 2020

Tejeev commented Mar 27, 2020

@astraldawn and @gyf304, is your ES instance deployed in AWS?

astraldawn (Author) commented

Deployment is on-prem (VMware VMs)


Tejeev commented Apr 17, 2020

I believe this is a duplicate of #21744


Tejeev commented Apr 17, 2020

@astraldawn Do you use any proxy settings?

maggieliu changed the milestone from v2.4 - Backlog to v2.4.4 on Apr 20, 2020
astraldawn (Author) commented

We do not use any proxy settings


Tejeev commented Apr 20, 2020

Thanks @astraldawn,
Check out #21744. I think Logan is now going to submit a PR for setting the following by default for Rancher Logging:

  reload_connections false
  reconnect_on_error true
  reload_on_failure true


loganhz commented Apr 21, 2020

Closing in favor of #21744.

gaochundong commented

+1 nice
