
Fluentd stops exporting logs after some time #25819

Closed
astraldawn opened this issue Mar 6, 2020 · 11 comments
Labels: area/logging, internal, kind/bug

astraldawn commented Mar 6, 2020

What kind of request is this (question/bug/enhancement/feature request): Bug

Steps to reproduce (fewest steps possible):

  • Create two clusters with RKE via the Rancher UI (Add Cluster - Custom): a main cluster and an es cluster
  • Deploy logging on the main cluster, sending logs to ElasticSearch on the es cluster
  • Send logs over a period of time (between 33k and 110k log messages over 24 hours)

Result:

  • Fluentd (rancher-logging-fluentd) stops sending logs to ElasticSearch with the following error:
Failed to flush the buffer, 
retry_time=0, next_retry_seconds=xxxx, chunk=CHUNK_ID, 
error_class=Fluent::Plugin::ElasticSearchOutput::RecoverableRequestFailure 
error="could not push logs to ElasticSearch cluster 
({:host=>HOST_NAME, :port=>443, :scheme=>\"https\"}): connect_write timeout reached"

This error message repeats, with retry_time increasing and next_retry_seconds moving further and further into the future.

Observed these failures on 6 out of the 18 nodes running fluentd, starting between 12 and 20 hours after redeployment of the rancher-logging-fluentd-linux daemon set.
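
For context on the growing next_retry_seconds: fluentd retries failed buffer flushes with exponential backoff by default, so once the connection to ElasticSearch is wedged, each failed flush roughly doubles the wait before the next attempt. A minimal sketch of the buffer parameters that govern this behaviour (parameter names are from the fluentd v1 buffer documentation; the values and file path are illustrative, not what the Rancher chart ships):

<buffer>
  @type file
  path /var/log/fluentd-buffers/es.buffer   # illustrative path
  retry_type exponential_backoff            # each failed flush roughly doubles the wait
  retry_wait 1s                             # wait before the first retry
  retry_max_interval 60s                    # cap on the backoff interval
  retry_timeout 72h                         # give up on a chunk after this long
  flush_thread_count 2
</buffer>

Because retry_max_interval and retry_timeout bound the backoff, the repeated "Failed to flush the buffer" lines above are expected while the connection is down; the problem reported here is that the connection never recovers.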

On es cluster

  • Ingress controller logs show no requests received from any of the main-cluster nodes where fluentd has stopped sending logs
  • No errors from ElasticSearch; ElasticSearch CPU usage is low throughout

Other details that may be helpful:

Environment information

  • Rancher version: v2.3.5
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type: Custom
  • Machine type: VM / 16 vCPU / 64 GB RAM
  • Kubernetes version: 1.16.6 (main cluster), 1.17.2 (es cluster)
  • Docker version: 18.9.8 (both)
  • Fluentd image: rancher/fluentd:v0.1.19

gyf304 commented Mar 10, 2020

Same issue here. Logging stops and the fluentd buffer backs up after a while.

astraldawn (Author) commented

Have had success using the following (Cluster logging > Edit as a form):

<match *>
  @type elasticsearch
  include_tag_key true
  hosts ES_HOST
  logstash_prefix "LOGSTASH_PREFIX"
  logstash_format true
  logstash_dateformat LOGSTASH_FORMAT
  type_name "container_log"
  ssl_verify false
  ssl_version TLSv1_2
  reload_connections false
  reconnect_on_error true
  reload_on_failure true
</match>

The last 3 lines are from the advice given in uken/fluent-plugin-elasticsearch#525.

The rest are the defaults from the Rancher ES template.
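
For reference, a commented sketch of what those three options change, based on a reading of the fluent-plugin-elasticsearch documentation (the noted defaults are general plugin defaults and are assumptions about the exact plugin version bundled in rancher/fluentd:v0.1.19):

  # Do not periodically re-fetch the node list from the ES _nodes API;
  # stale or unreachable addresses returned there can stall the output
  # (plugin default is true).
  reload_connections false

  # Reconnect to ES after a transport error instead of reusing the broken
  # connection (plugin default is false).
  reconnect_on_error true

  # Reload the connection list when a request fails (plugin default is false).
  reload_on_failure true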


Tejeev commented Mar 26, 2020

I see that reload_connections, reconnect_on_error, and reload_on_failure are now part of fluent/fluentd-kubernetes-daemonset. What are the chances we can get them tested and into the Rancher Helm charts?

https://github.com/fluent/fluentd-kubernetes-daemonset/blob/04122c95689ad2e7b106023b9e4b9894f2ab6426/templates/conf/fluent.conf.erb#L31-L33

Tejeev added the internal, area/logging, and kind/bug labels on Mar 26, 2020

Tejeev commented Mar 27, 2020

@astraldawn and @gyf304, is your ES instance deployed in AWS?

astraldawn (Author) commented

Deployment is on-prem (VMware VMs)


Tejeev commented Apr 17, 2020

I believe this is a duplicate of #21744


Tejeev commented Apr 17, 2020

@astraldawn Do you use any proxy settings?

maggieliu changed the milestone from v2.4 - Backlog to v2.4.4 on Apr 20, 2020
astraldawn (Author) commented

We do not use any proxy settings


Tejeev commented Apr 20, 2020

Thanks @astraldawn,
Check out #21744. I think Logan is now going to submit a PR for setting the following by default for Rancher Logging:

  reload_connections false
  reconnect_on_error true
  reload_on_failure true


loganhz commented Apr 21, 2020

Closing in favor of #21744.

gaochundong commented

+1 nice
