Fluentd stopped sending data to ES for a while. #525

Closed
hustshawn opened this issue Jan 10, 2019 · 36 comments

@hustshawn

hustshawn commented Jan 10, 2019

Problem

I used fluentd with your plugin to collect logs from Docker containers and send them to ES. It worked at the very beginning, but later ES stopped receiving logs from fluentd. ES itself is always running fine, and I find there are no indices for the new day (e.g. fluentd-20190110; only the old index fluentd-20190109 exists) in ES.

However, if I restart my Docker containers with fluentd, it starts sending logs to ES again.

...

Steps to replicate

The fluentd config

# fluentd/conf/fluent.conf
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>
<match *.**>
  @type copy
  <store>
    @type elasticsearch
    host my-es-host
    port 9200
    logstash_format true
    logstash_prefix fluentd
    logstash_dateformat %Y%m%d
    include_tag_key true
    type_name access_log
    tag_key @log_name
    flush_interval 5s
  </store>
  <store>
    @type stdout
  </store>
</match>

Expected Behavior or What you need to ask

Fluentd should keep sending logs to ES.

Using Fluentd and ES plugin versions

  • OS version
  • Bare Metal or within Docker or Kubernetes or others?
    Docker
  • Fluentd v0.12 or v0.14/v1.0
    • paste result of fluentd --version or td-agent --version
      v1.3.2-1.0
  • ES plugin 2.x.y or 1.x.y
    • paste boot log of fluentd or td-agent
    • paste result of fluent-gem list, td-agent-gem list or your Gemfile.lock
  • ES version (optional)
    6.5.4
@cosmo0920
Collaborator

Could you provide your Fluentd Docker log?

<match *.**>

The above setting is very dangerous.
This blackhole pattern causes a flood of declined logs:
https://github.com/uken/fluent-plugin-elasticsearch#declined-logs-are-resubmitted-forever-why
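
For reference, here is a minimal sketch (not from this thread, based on the FAQ linked above) of one common way to keep Fluentd's own internal events out of a catch-all match, so declined-log warnings are not fed back into the Elasticsearch output:

```aconf
# Hypothetical example: discard Fluentd's internal events (tagged fluent.**)
# before they reach the catch-all match below.
<match fluent.**>
  @type null
</match>

<match *.**>
  @type copy
  # ... elasticsearch and stdout stores as in the original config ...
</match>
```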

@hustshawn
Author

hustshawn commented Jan 10, 2019

Hi @cosmo0920, the Fluentd logs look like this:

fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: 'flush_interval' is configured at out side of <buffer>. 'flush_mode' is set to 'interval' to keep existing behaviour
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: Detected ES 6.x: ES 7.x will only accept `_doc` in type_name.
fluentd_1        | 2019-01-09 03:15:52 +0000 [warn]: To prevent events traffic jam, you should specify 2 or more 'flush_thread_count'.
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: using configuration file: <ROOT>
fluentd_1        |   <source>
fluentd_1        |     @type forward
fluentd_1        |     port 24224
fluentd_1        |     bind "0.0.0.0"
fluentd_1        |   </source>
fluentd_1        |   <match *.**>
fluentd_1        |     @type copy
fluentd_1        |     <store>
fluentd_1        |       @type "elasticsearch"
fluentd_1        |       host my-es-host
fluentd_1        |       port 9200
fluentd_1        |       logstash_format true
fluentd_1        |       logstash_prefix "fluentd"
fluentd_1        |       logstash_dateformat "%Y%m%d"
fluentd_1        |       include_tag_key true
fluentd_1        |       type_name "access_log"
fluentd_1        |       tag_key "@log_name"
fluentd_1        |       flush_interval 1s
fluentd_1        |       <buffer>
fluentd_1        |         flush_interval 1s
fluentd_1        |       </buffer>
fluentd_1        |     </store>
fluentd_1        |     <store>
fluentd_1        |       @type "stdout"
fluentd_1        |     </store>
fluentd_1        |   </match>
fluentd_1        | </ROOT>
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: starting fluentd-1.3.2 pid=5 ruby="2.5.2"
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins", "--under-supervisor"]
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '3.0.1'
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: gem 'fluentd' version '1.3.2'
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: adding match pattern="*.**" type="copy"
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 'flush_interval' is configured at out side of <buffer>. 'flush_mode' is set to 'interval' to keep existing behaviour
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 Detected ES 6.x: ES 7.x will only accept `_doc` in type_name.
fluentd_1        | 2019-01-09 03:15:53 +0000 [warn]: #0 To prevent events traffic jam, you should specify 2 or more 'flush_thread_count'.
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: adding source type="forward"
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 starting fluentd worker pid=13 ppid=5 worker=0
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 listening port port=24224 bind="0.0.0.0"
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 fluentd worker is now running worker=0
fluentd_1        | 2019-01-09 03:15:53.601732394 +0000 fluent.info: {"worker":0,"message":"fluentd worker is now running worker=0"}

....

@cosmo0920
Collaborator

cosmo0920 commented Jan 10, 2019

Umm..., could you share the Fluentd error log from 2019-01-10 2:00 to 2019-01-10 11:00?

The shared log is only the boot log. It just says that Fluentd was launched normally.

@hustshawn
Author

@cosmo0920 I found something like this:

fluentd_1        | 2019-01-10 02:16:45 +0000 [warn]: #0 failed to flush the buffer. retry_time=15 next_retry_seconds=2019-01-10 07:21:51 +0000 chunk="57f0d689aeefe7b1ef1da592fed4d444" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"my-es-host\", :port=>9200, :scheme=>\"http\"}): Connection refused - connect(2) for 172.18.0.2:9200 (Errno::ECONNREFUSED)"
fluentd_1        |   2019-01-10 02:16:45 +0000 [warn]: #0 suppressed same stacktrace
fluentd_1        | 2019-01-10 02:16:45.424613201 +0000 fluent.warn: {"retry_time":15,"next_retry_seconds":"2019-01-10 07:21:51 +0000","chunk":"57f0d689aeefe7b1ef1da592fed4d444","error":"#<Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure: could not push logs to Elasticsearch cluster ({:host=>\"my-es-host\", :port=>9200, :scheme=>\"http\"}): Connection refused - connect(2) for 172.18.0.2:9200 (Errno::ECONNREFUSED)>","message":"failed to flush the buffer. retry_time=15 next_retry_seconds=2019-01-10 07:21:51 +0000 chunk=\"57f0d689aeefe7b1ef1da592fed4d444\" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error=\"could not push logs to Elasticsearch cluster ({:host=>\\\"my-es-host\\\", :port=>9200, :scheme=>\\\"http\\\"}): Connection refused - connect(2) for 172.18.0.2:9200 (Errno::ECONNREFUSED)\""}

@cosmo0920
Collaborator

It seems that the ES plugin cannot push events due to ECONNREFUSED.
This error comes from the network stack.
Could you check your Docker networking settings or the ES-side logs?

@hustshawn
Author

@cosmo0920 My ES is set up on AWS EC2, and the networking should be fine, with no disconnects or DNS issues.
I also found some extra logs just above the previous ones.

fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:645:in `rescue in send_bulk'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:627:in `send_bulk'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:534:in `block in write'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:533:in `each'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:533:in `write'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:1123:in `try_flush'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:1423:in `flush_thread_run'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:452:in `block (2 levels) in start'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.3.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

@hustshawn
Author

@cosmo0920 Here are more logs from ES:

elasticsearch_1  | [2019-01-10T04:41:01,689][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,689][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,795][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,795][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,823][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,823][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,833][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,833][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,835][INFO ][o.e.c.r.a.AllocationService] [-utwWeF] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[fluentd-20190108][2]] ...]).
elasticsearch_1  | [2019-01-10T04:41:01,843][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,847][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:08,712][INFO ][o.e.c.m.MetaDataMappingService] [-utwWeF] [fluentd-20190110/j4oWJJa8Rla-l48sMgHLog] update_mapping [access_log]
elasticsearch_1  | [2019-01-10T04:41:08,724][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:08,724][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T06:18:09,832][INFO ][o.e.c.m.MetaDataMappingService] [-utwWeF] [fluentd-20190110/j4oWJJa8Rla-l48sMgHLog] update_mapping [access_log]
elasticsearch_1  | [2019-01-10T06:18:09,843][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T06:18:09,843][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T06:18:09,859][INFO ][o.e.c.m.MetaDataMappingService] [-utwWeF] [fluentd-20190110/j4oWJJa8Rla-l48sMgHLog] update_mapping [access_log]
elasticsearch_1  | [2019-01-10T06:18:09,867][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T06:18:09,868][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata

Also, I actually have two nodes/hosts with the same configuration that collect logs from my application servers. Do you think that could be relevant to this issue?

If so, is there any way in the Fluentd configuration to distinguish which node the logs were collected from, e.g. using the hostname or host IP as metadata?

@cosmo0920
Collaborator

do you think that could be relevant to this issue?

You should check Docker networking.
A bare metal environment might not hit this networking issue.
Here is another case caused by Docker networking: #416

That issue also only occurred within Docker, not in a bare metal environment.

If so, is there any way in the Fluentd configuration to distinguish which node the logs were collected from, e.g. using the hostname or host IP as metadata?

in_forward has an option that adds the hostname:
https://docs.fluentd.org/v1.0/articles/in_forward#source_hostname_key
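
For illustration, a minimal sketch of that option applied to the in_forward source from the config above (the record key name `hostname` is an arbitrary choice):

```aconf
<source>
  @type forward
  port 24224
  bind 0.0.0.0
  # Store the sending host's name in each record under the "hostname" key,
  # so records from different nodes can be told apart in Elasticsearch.
  source_hostname_key hostname
</source>
```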

@hustshawn
Author

@cosmo0920 Thanks for your advice. But I have to run Fluentd in Docker, and it looks like the issue is still there. The services in my Docker environment are always running well, so it is probably not a Docker networking issue.

@emmayang

I met a similar issue, but I have Fluentd deployed as a DaemonSet under the kube-system namespace.

I can confirm ES is running well all the time, since Fluentd is only one of my logging sources, and the other sources work well and show their logs correctly in ES.

@hustshawn
Author

@emmayang Same issue on my kube platform.

@cosmo0920
Collaborator

Hmmm..., could you try the typhoeus backend instead of excon?
typhoeus handles keep-alive by default.
https://github.com/uken/fluent-plugin-elasticsearch#http_backend
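
For reference, a minimal sketch of switching the backend in the elasticsearch store (this assumes the typhoeus gem is installed in the Fluentd image):

```aconf
<store>
  @type elasticsearch
  host my-es-host
  port 9200
  # Use the typhoeus HTTP client instead of the default excon backend.
  http_backend typhoeus
  # ... remaining elasticsearch settings as in the original config ...
</store>
```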

@twittyc

twittyc commented Feb 18, 2019

I'm also seeing this same issue when running Fluentd with the ES plugin in Kubernetes. I tried both backends: typhoeus didn't work at all, while the default backend would work on the initial connection (fresh deploy) and then stop sending data almost immediately.

EDIT: I believe my issues were not caused by the ES plugin but by performance tuning that I needed to do on Fluentd.

@aaron1989041

I have a similar problem. I also get a huge number of warnings like the one below:
"failed to flush the buffer. retry_time=0 next_retry_seconds=2019-03-19 01:30:36 +0000 chunk="584686c3d47849db61228ea7e6f29bb5" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"es-cn-v0h10rbfl000kfon8..com\", :port=>9200, :scheme=>\"http\", :user=>\"elastic\", :password=>\"obfuscated\"}): connect_write timeout reached""
When this error happens, the only fix is to restart the Fluentd container, but then a gap in the logs appears.

@ChSch3000

ChSch3000 commented Mar 19, 2019

Same problem here. I'm using fluentd-kubernetes-daemonset.
I already opened an issue there: fluent/fluentd-kubernetes-daemonset#280
After deployment the plugin works fine and ships all logs to ES, but after a few hours it stops with the following error:

2019-03-19 08:24:32 +0000 : #0 [out_es] failed to flush the buffer. retry_time=2810 next_retry_seconds=2019-03-19 08:25:05 +0000 chunk="5846b2b0d6d06c398eee3540256d465d" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elastic.xyz.com\", :port=>443, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): connect_write timeout reached"

The only solution is to restart the pod, but this isn't an acceptable solution.

@cosmo0920
Collaborator

cosmo0920 commented Mar 19, 2019

Does setting reload_connections to false help with this issue?
I launched a docker-compose environment with the fluent/fluentd#2334 (comment) settings, but I couldn't reproduce the issue locally.
Do we need to handle a massive number of events to reproduce it?
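
For reference, a sketch of where that parameter would go in the original config's elasticsearch store (only `reload_connections` is the suggested change; the other lines are from the config above):

```aconf
<store>
  @type elasticsearch
  host my-es-host
  port 9200
  # Disable periodic node re-discovery (the elasticsearch-ruby transport
  # reloads connections after 10000 requests by default).
  reload_connections false
</store>
```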

@bidiudiu

bidiudiu commented Mar 20, 2019

Does setting reload_connections to false help with this issue?
I launched a docker-compose environment with the fluent/fluentd#2334 (comment) settings, but I couldn't reproduce the issue locally.
Do we need to handle a massive number of events to reproduce it?

@cosmo0920, I'm afraid so... In my case, once the hits reach 100,000+ the issue happens.

In Fluentd, here's the error info:

2019-03-20 02:07:53 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2019-03-20 02:07:54 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f880ef7f118"
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:249:in `perform_request'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/client.rb:128:in `perform_request'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/elasticsearch-api-1.0.18/lib/elasticsearch/api/actions/bulk.rb:90:in `bulk'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch.rb:353:in `send_bulk'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch.rb:339:in `write_objects'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/output.rb:490:in `write'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/buffer.rb:354:in `write_chunk'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/buffer.rb:333:in `pop'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/output.rb:342:in `try_flush'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/output.rb:149:in `run'

I'll try 'reconnect_on_error true' and give feedback.

@ChSch3000

Does setting reload_connections to false help with this issue?
I launched a docker-compose environment with the fluent/fluentd#2334 (comment) settings, but I couldn't reproduce the issue locally.
Do we need to handle a massive number of events to reproduce it?

Maybe this is the solution for me. I set reload_connections to false, and it has now been working for about 18 hours without trouble. I will monitor it for the next few hours/days.

@cosmo0920
Collaborator

@bidiudiu @ChSch3000 Thank you for your issue confirmations and clarifications!

fluentd-kubernetes-daemonset provides the following environment variable:

  • FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS (default: true)

This should be specified:

  • FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS=false

@cosmo0920
Collaborator

I've added an FAQ entry for this situation: #564

Is any information still missing to solve this issue?

@bidiudiu

bidiudiu commented Mar 22, 2019

Thanks @cosmo0920. I added the settings below and it works fine:

reconnect_on_error true
reload_on_failure true
reload_connections false

@cosmo0920
Collaborator

cosmo0920 commented Mar 22, 2019

reconnect_on_error true
reload_on_failure true
reload_connections false

OK. Thanks for confirming, @bidiudiu!
I'll add more description of this issue to the FAQ.

cosmo0920 added a commit to cosmo0920/fluentd-kubernetes-daemonset that referenced this issue Mar 22, 2019
This is reported in
uken/fluent-plugin-elasticsearch#525.

Invalid sniffer information is obtained by default, but we can avoid
it with the following configuration:

```aconf
reload_connections false
reconnect_on_error true
reload_on_failure true
```

To specify reload_on_failure on fluentd-kubernetes-daemonset,
we should introduce a new envvar for it.

Signed-off-by: Hiroshi Hatake <hatake@clear-code.com>
cosmo0920 added a commit to cosmo0920/fluentd-kubernetes-daemonset that referenced this issue Apr 12, 2019
fluent-plugin-elasticsearch reloads connections after 10000 requests. (This does not correspond to the event count, because the ES plugin uses the bulk API.)

This functionality, which originates from the elasticsearch-ruby gem, is enabled by default.

Sometimes this reloading functionality prevents users from sending events with the ES plugin.

On the k8s platform, users sometimes need to specify the following settings:

```aconf
reload_connections false
reconnect_on_error true
reload_on_failure true
```

This is originally reported at
uken/fluent-plugin-elasticsearch#525.

On k8s, Fluentd sometimes handles a flood of events.
This is a pitfall of using fluent-plugin-elasticsearch on k8s,
so this parameter set should be the default.

Signed-off-by: Hiroshi Hatake <hatake@clear-code.com>
@dogzzdogzz

Can we change the default value of those settings for fluentd-kubernetes-daemonset? I think everyone who uses fluentd-kubernetes-daemonset will easily run into this issue.

@hustshawn
Author

hustshawn commented May 9, 2019

@dogzzdogzz If you are using Helm to install, e.g. helm upgrade --install logging-fluentd -f your-values.yml kiwigrid/fluentd-elasticsearch --namespace your-namespace, you can just modify the Fluentd config in your-values.yml.

Part of my snippet looks like this:

  output.conf: |
    # Enriches records with Kubernetes metadata
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      type_name _doc
      host "#{ENV['OUTPUT_HOST']}"
      port "#{ENV['OUTPUT_PORT']}"
      scheme "#{ENV['OUTPUT_SCHEME']}"
      ssl_version "#{ENV['OUTPUT_SSL_VERSION']}"
      logstash_format true
      logstash_prefix "#{ENV['LOGSTASH_PREFIX']}"
      reload_connections false
      reconnect_on_error true
      reload_on_failure true
      slow_flush_log_threshold 25.0
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        flush_interval 5s
        flush_thread_count 4
        chunk_full_threshold 0.9
        # retry_forever
        retry_type exponential_backoff
        retry_timeout 1m
        retry_max_interval 30
        chunk_limit_size "#{ENV['OUTPUT_BUFFER_CHUNK_LIMIT']}"
        queue_limit_length "#{ENV['OUTPUT_BUFFER_QUEUE_LIMIT']}"
        overflow_action drop_oldest_chunk
      </buffer>
    </match>

@cosmo0920
Collaborator

@dogzzdogzz The latest fluentd-kubernetes-daemonset includes the above settings by default.

@darthchudi

I tried using the exact same config as #525 (comment), but the issue still persists. Fluentd stops shipping logs to Elasticsearch after some time.

@amulyamalla

@cosmo0920
The same issue persists; Fluentd is unable to send logs after a while.
From my observation, Fluentd runs absolutely fine until a restart; the problem occurs when the pod gets restarted.

2020-08-05 09:58:12 +0000 [warn]: [sample-service] failed to flush the buffer. retry_time=2 next_retry_seconds=2020-08-05 09:58:14 +0000 chunk="5ac1e67bde2f323981d71058390e5ebe" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"192.168.0.15\", :port=>9500, :scheme=>\"http\", :user=>\"fluentd\", :password=>\"obfuscated\"}, {:host=>\"192.168.0.16\", :port=>9500, :scheme=>\"http\", :user=>\"fluentd\", :password=>\"obfuscated\"}): read timeout reached"

**Resolution:** the only solution I found is to forcefully restart the Fluentd pod; the new container then sends logs immediately.

@cosmo0920
Collaborator

You should add the simple sniffer loading code and specify the loaded simple sniffer class:
https://github.com/uken/fluent-plugin-elasticsearch#sniffer-class-name
The default sniffer class causes this issue.
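
For illustration, a minimal sketch of that README section (the gem path passed to `-r` is an assumption; adjust it to wherever the gem is actually installed):

```aconf
# The simple sniffer class must be preloaded when Fluentd starts, e.g.:
#   fluentd -r /path/to/fluent-plugin-elasticsearch/lib/fluent/plugin/elasticsearch_simple_sniffer.rb -c /fluentd/etc/fluent.conf
<match **>
  @type elasticsearch
  host my-es-host
  port 9200
  reload_connections false
  reconnect_on_error true
  reload_on_failure true
  # Use the simple sniffer, which keeps using the configured hosts
  # instead of re-discovered cluster nodes.
  sniffer_class_name "Fluent::Plugin::ElasticsearchSimpleSniffer"
</match>
```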

@hari819

hari819 commented Mar 16, 2021

You should add the simple sniffer loading code and specify the loaded simple sniffer class:
https://github.com/uken/fluent-plugin-elasticsearch#sniffer-class-name
The default sniffer class causes this issue.

Did this work to solve the "failed to flush the buffer" error? If so, could you post the configuration?
I have tried running Fluentd with the sniffer class, but I still get the same error.

Thanks,

@Brian-McM

Brian-McM commented Mar 18, 2021

Yes, me too: I've loaded the sniffer class and it's still giving me that error. I'm using version 4.0.5, and I get the error as soon as the Fluentd pods restart; there's no grace period where sending logs succeeds. Initially it was working, though: the scheme is set to https, and I double-checked that it was actually sending successfully on restart.

@mokhos

mokhos commented Mar 31, 2022

Same issue here. Did anyone find a concrete solution?
I tried these, but no luck:

reconnect_on_error true
reload_on_failure true
reload_connections false

Also, the sniffer_class solution doesn't work for me at all and throws an error.

@mokhos

mokhos commented Apr 4, 2022

I found a solution 4 days ago and I've been testing it ever since. After the change I made, my Fluentd hasn't stopped or crashed sending logs to Elasticsearch.

My solution was to change the buffer path in the way shown in the Fluentd documentation:

path /opt/bitnami/fluentd/logs/buffers/logs.*.buffer

instead of

path /opt/bitnami/fluentd/logs/buffers/logs.buffer

This worked for me.
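
For context, a minimal sketch of where that path sits (the surrounding lines are illustrative, not from this comment):

```aconf
<match **>
  @type elasticsearch
  host my-es-host
  port 9200
  <buffer>
    @type file
    # The "*" is replaced by Fluentd with a unique identifier, so buffer
    # chunks get distinct files instead of colliding on one fixed path.
    path /opt/bitnami/fluentd/logs/buffers/logs.*.buffer
  </buffer>
</match>
```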

@hari819

hari819 commented Apr 5, 2022

I found a solution 4 days ago and I've been testing it ever since. After the change I made, my Fluentd hasn't stopped or crashed sending logs to Elasticsearch.

My solution was to change the buffer path in the way shown in the Fluentd documentation:

path /opt/bitnami/fluentd/logs/buffers/logs.*.buffer

instead of

path /opt/bitnami/fluentd/logs/buffers/logs.buffer

This worked for me.

@mokhos, could you please let us know the versions of fluentd / fluent-plugin-elasticsearch you were using to test this configuration?

@mokhos

mokhos commented Apr 5, 2022

I found a solution 4 days ago and I've been testing it ever since. After the change I made, my Fluentd hasn't stopped or crashed sending logs to Elasticsearch.
My solution was to change the buffer path in the way shown in the Fluentd documentation:
path /opt/bitnami/fluentd/logs/buffers/logs.*.buffer
instead of
path /opt/bitnami/fluentd/logs/buffers/logs.buffer
This worked for me.

@mokhos, could you please let us know the versions of fluentd / fluent-plugin-elasticsearch you were using to test this configuration?

I have used the versions below:

2022-03-30 11:56:59 +0000 [info]: gem 'fluentd' version '1.14.5' 
2022-03-30 11:56:59 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.1.5'

@srujith07

srujith07 commented May 25, 2022

Hi @cosmo0920, I am also facing the same issue; it would be helpful if you could share your solution with me.

If I restart my td-agent.service, logs arrive in Elasticsearch for some time, but after 3-6 minutes they stop automatically and no error shows up in the td-agent logs.

Here is my configuration:

<match "mytopicname">
      @type elasticsearch
      hosts        "my_IP_address_here"
      ca_file       "my_path_here"
      client_cert  "my_path_here" 
      client_key  " my_path_here" 
      ssl_verify  true
      user   "my_username"
      password "my_password"
      logstash_format true
      logstash_prefix "my_index_name"
      logstash_date_format  my_date_format
      time_key_format  "my time format"
      type_name  fluentd
      log_es_400_reason true
      include_timestamp true
      reconnect_on_error true
      reload_on_failure true
      reload_connections false
     <buffer>
          @type file
           path     "my path here"
           chunk_limit_size 10m
     </buffer>
</match>

also tried.

<match "mytopicname">
      @type elasticsearch
      hosts        "my_IP_address_here"
      ca_file       "my_path_here"
      client_cert  "my_path_here" 
      client_key  " my_path_here" 
      ssl_verify  true
      user   "my_username"
      password "my_password"
      logstash_format true
      logstash_prefix "my_index_name"
      logstash_date_format  my_date_format
      time_key_format  "my time format"
      type_name  fluentd
      log_es_400_reason true
      include_timestamp true
      reconnect_on_error true
      reload_on_failure true
      reload_connections false
      slow_flush_log_threshold  25.0
     <buffer>
          @type file
           path     "syslog.*.buffer"
           chunk_limit_size 50m
           flush_mode interval
           flush_interval  5s
           flush_thread_count 4
          overflow_action drop_oldest_chunk
          retry_timeout 1m
          retry_max_interval 30
          chunk_full_threshold 0.9
      </buffer>
</match>

Please help !!!!!

Note: the above configuration was not copy-pasted, so ignore any indentation differences.

gacyberrange pushed a commit to gacybercenter/kinetic that referenced this issue Dec 20, 2022
@xgbt

xgbt commented Dec 7, 2023

Thanks @cosmo0920. I added the settings below and it works fine:

reconnect_on_error true
reload_on_failure true
reload_connections false

It works for me.
