Fluentd stopped sending data to ES for a while. #525

Closed
hustshawn opened this issue Jan 10, 2019 · 36 comments

@hustshawn

hustshawn commented Jan 10, 2019

Problem

I used fluentd with your plugin to collect logs from Docker containers and send them to ES. It worked at the very beginning, but later ES stopped receiving logs from fluentd. ES itself is always running fine, and I find there are no indices for the new day (e.g. fluentd-20190110; only the old index fluentd-20190109 exists) in ES.

However, if I restart my Docker containers with fluentd, it starts sending logs to ES again.

...

Steps to replicate

The fluentd config

# fluentd/conf/fluent.conf
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>
<match *.**>
  @type copy
  <store>
    @type elasticsearch
    host my-es-host
    port 9200
    logstash_format true
    logstash_prefix fluentd
    logstash_dateformat %Y%m%d
    include_tag_key true
    type_name access_log
    tag_key @log_name
    flush_interval 5s
  </store>
  <store>
    @type stdout
  </store>
</match>

Expected Behavior or What you need to ask

Fluentd should keep sending logs to ES.

Using Fluentd and ES plugin versions

  • OS version
  • Bare Metal or within Docker or Kubernetes or others?
    Docker
  • Fluentd v0.12 or v0.14/v1.0
    • paste result of fluentd --version or td-agent --version
      v1.3.2-1.0
  • ES plugin 2.x.y or 1.x.y
    • paste boot log of fluentd or td-agent
    • paste result of fluent-gem list, td-agent-gem list or your Gemfile.lock
  • ES version (optional)
    6.5.4
@cosmo0920
Collaborator

Could you provide your Fluentd Docker log?

<match *.**>

The above setting is very dangerous.
This blackhole pattern causes a flood of declined logs:
https://github.com/uken/fluent-plugin-elasticsearch#declined-logs-are-resubmitted-forever-why
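
For reference, here is a minimal sketch (not from this thread, based on the FAQ linked above) of one common way to keep Fluentd's own internal events out of a catch-all match, so declined-log warnings are not fed back into the Elasticsearch output:

```aconf
# Hypothetical example: discard Fluentd's internal events (tagged fluent.**)
# before they reach the catch-all match below.
<match fluent.**>
  @type null
</match>

<match *.**>
  @type copy
  # ... elasticsearch and stdout stores as in the original config ...
</match>
```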

@hustshawn
Author

hustshawn commented Jan 10, 2019

Hi @cosmo0920, the Fluentd logs look like this:

fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: 'flush_interval' is configured at out side of <buffer>. 'flush_mode' is set to 'interval' to keep existing behaviour
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: Detected ES 6.x: ES 7.x will only accept `_doc` in type_name.
fluentd_1        | 2019-01-09 03:15:52 +0000 [warn]: To prevent events traffic jam, you should specify 2 or more 'flush_thread_count'.
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: using configuration file: <ROOT>
fluentd_1        |   <source>
fluentd_1        |     @type forward
fluentd_1        |     port 24224
fluentd_1        |     bind "0.0.0.0"
fluentd_1        |   </source>
fluentd_1        |   <match *.**>
fluentd_1        |     @type copy
fluentd_1        |     <store>
fluentd_1        |       @type "elasticsearch"
fluentd_1        |       host my-es-host
fluentd_1        |       port 9200
fluentd_1        |       logstash_format true
fluentd_1        |       logstash_prefix "fluentd"
fluentd_1        |       logstash_dateformat "%Y%m%d"
fluentd_1        |       include_tag_key true
fluentd_1        |       type_name "access_log"
fluentd_1        |       tag_key "@log_name"
fluentd_1        |       flush_interval 1s
fluentd_1        |       <buffer>
fluentd_1        |         flush_interval 1s
fluentd_1        |       </buffer>
fluentd_1        |     </store>
fluentd_1        |     <store>
fluentd_1        |       @type "stdout"
fluentd_1        |     </store>
fluentd_1        |   </match>
fluentd_1        | </ROOT>
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: starting fluentd-1.3.2 pid=5 ruby="2.5.2"
fluentd_1        | 2019-01-09 03:15:52 +0000 [info]: spawn command to main:  cmdline=["/usr/bin/ruby", "-Eascii-8bit:ascii-8bit", "/usr/bin/fluentd", "-c", "/fluentd/etc/fluent.conf", "-p", "/fluentd/plugins", "--under-supervisor"]
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '3.0.1'
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: gem 'fluentd' version '1.3.2'
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: adding match pattern="*.**" type="copy"
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 'flush_interval' is configured at out side of <buffer>. 'flush_mode' is set to 'interval' to keep existing behaviour
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 Detected ES 6.x: ES 7.x will only accept `_doc` in type_name.
fluentd_1        | 2019-01-09 03:15:53 +0000 [warn]: #0 To prevent events traffic jam, you should specify 2 or more 'flush_thread_count'.
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: adding source type="forward"
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 starting fluentd worker pid=13 ppid=5 worker=0
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 listening port port=24224 bind="0.0.0.0"
fluentd_1        | 2019-01-09 03:15:53 +0000 [info]: #0 fluentd worker is now running worker=0
fluentd_1        | 2019-01-09 03:15:53.601732394 +0000 fluent.info: {"worker":0,"message":"fluentd worker is now running worker=0"}

....

@cosmo0920
Collaborator

cosmo0920 commented Jan 10, 2019

Umm..., could you share the Fluentd error log from 2019-01-10 2:00 to 2019-01-10 11:00?

The shared log is only the boot log. It just says that Fluentd was launched normally.

@hustshawn
Author

@cosmo0920 I found something like this:

fluentd_1        | 2019-01-10 02:16:45 +0000 [warn]: #0 failed to flush the buffer. retry_time=15 next_retry_seconds=2019-01-10 07:21:51 +0000 chunk="57f0d689aeefe7b1ef1da592fed4d444" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"my-es-host\", :port=>9200, :scheme=>\"http\"}): Connection refused - connect(2) for 172.18.0.2:9200 (Errno::ECONNREFUSED)"
fluentd_1        |   2019-01-10 02:16:45 +0000 [warn]: #0 suppressed same stacktrace
fluentd_1        | 2019-01-10 02:16:45.424613201 +0000 fluent.warn: {"retry_time":15,"next_retry_seconds":"2019-01-10 07:21:51 +0000","chunk":"57f0d689aeefe7b1ef1da592fed4d444","error":"#<Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure: could not push logs to Elasticsearch cluster ({:host=>\"my-es-host\", :port=>9200, :scheme=>\"http\"}): Connection refused - connect(2) for 172.18.0.2:9200 (Errno::ECONNREFUSED)>","message":"failed to flush the buffer. retry_time=15 next_retry_seconds=2019-01-10 07:21:51 +0000 chunk=\"57f0d689aeefe7b1ef1da592fed4d444\" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error=\"could not push logs to Elasticsearch cluster ({:host=>\\\"my-es-host\\\", :port=>9200, :scheme=>\\\"http\\\"}): Connection refused - connect(2) for 172.18.0.2:9200 (Errno::ECONNREFUSED)\""}

@cosmo0920
Collaborator

It seems that the ES plugin cannot push events due to ECONNREFUSED.
This error comes from the network stack.
Could you check your Docker networking settings or the ES-side logs?

@hustshawn
Author

@cosmo0920 My ES is set up on AWS EC2, and the networking should be fine, with no disconnects or DNS issues.
I also found some extra logs just above the previous ones.

fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:645:in `rescue in send_bulk'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:627:in `send_bulk'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:534:in `block in write'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:533:in `each'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluent-plugin-elasticsearch-3.0.1/lib/fluent/plugin/out_elasticsearch.rb:533:in `write'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:1123:in `try_flush'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:1423:in `flush_thread_run'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.3.2/lib/fluent/plugin/output.rb:452:in `block (2 levels) in start'
fluentd_1        |   2019-01-09 21:47:30 +0000 [warn]: #0 /usr/lib/ruby/gems/2.5.0/gems/fluentd-1.3.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'

@hustshawn
Author

@cosmo0920 Here are more logs from ES:

elasticsearch_1  | [2019-01-10T04:41:01,689][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,689][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,795][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,795][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,823][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,823][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,833][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,833][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,835][INFO ][o.e.c.r.a.AllocationService] [-utwWeF] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[fluentd-20190108][2]] ...]).
elasticsearch_1  | [2019-01-10T04:41:01,843][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:01,847][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:08,712][INFO ][o.e.c.m.MetaDataMappingService] [-utwWeF] [fluentd-20190110/j4oWJJa8Rla-l48sMgHLog] update_mapping [access_log]
elasticsearch_1  | [2019-01-10T04:41:08,724][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T04:41:08,724][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T06:18:09,832][INFO ][o.e.c.m.MetaDataMappingService] [-utwWeF] [fluentd-20190110/j4oWJJa8Rla-l48sMgHLog] update_mapping [access_log]
elasticsearch_1  | [2019-01-10T06:18:09,843][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T06:18:09,843][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T06:18:09,859][INFO ][o.e.c.m.MetaDataMappingService] [-utwWeF] [fluentd-20190110/j4oWJJa8Rla-l48sMgHLog] update_mapping [access_log]
elasticsearch_1  | [2019-01-10T06:18:09,867][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[fluentd-20190109/JvyIBQfkQZGjNEXy0you4A]] can not be imported as a dangling index, as index with same name already exists in cluster metadata
elasticsearch_1  | [2019-01-10T06:18:09,868][WARN ][o.e.g.DanglingIndicesState] [-utwWeF] [[.kibana_1/1rFuKeKfRDel1FPUWShc4w]] can not be imported as a dangling index, as index with same name already exists in cluster metadata

Also, I actually have two nodes/hosts with the same configuration that collect logs from my application servers. Do you think that could be relevant to this issue?

If so, is there any way in the Fluentd configuration to distinguish which node the logs were collected from, e.g. using the hostname or host IP as metadata?

@cosmo0920
Collaborator

do you think that could be relevant to this issue?

You should check Docker networking.
A bare metal environment might not hit this networking issue.
Here is another case caused by Docker networking: #416

That issue also only occurred within Docker, not in a bare metal environment.

If so, is there any way in the Fluentd configuration to distinguish which node the logs were collected from, e.g. using the hostname or host IP as metadata?

in_forward has an option that adds the hostname:
https://docs.fluentd.org/v1.0/articles/in_forward#source_hostname_key
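
For illustration, a minimal sketch of that option applied to the in_forward source from the config above (the record key name `hostname` is an arbitrary choice):

```aconf
<source>
  @type forward
  port 24224
  bind 0.0.0.0
  # Store the sending host's name in each record under the "hostname" key,
  # so records from different nodes can be told apart in Elasticsearch.
  source_hostname_key hostname
</source>
```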

@hustshawn
Author

@cosmo0920 Thanks for your advice. But I have to run Fluentd in Docker, and it looks like the issue is still there. The services in my Docker environment are always running well, so it is probably not a Docker networking issue.

@emmayang

I met a similar issue, but I have Fluentd deployed as a DaemonSet under the kube-system namespace.

I can confirm ES is running well all the time, since Fluentd is only one of my logging sources, and the other sources work well and show their logs correctly in ES.

@hustshawn
Author

@emmayang Same issue on my kube platform.

@cosmo0920
Collaborator

Hmmm..., could you try the typhoeus backend instead of excon?
typhoeus handles keep-alive by default.
https://github.com/uken/fluent-plugin-elasticsearch#http_backend
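
For reference, a minimal sketch of switching the backend in the elasticsearch store (this assumes the typhoeus gem is installed in the Fluentd image):

```aconf
<store>
  @type elasticsearch
  host my-es-host
  port 9200
  # Use the typhoeus HTTP client instead of the default excon backend.
  http_backend typhoeus
  # ... remaining elasticsearch settings as in the original config ...
</store>
```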

@twittyc

twittyc commented Feb 18, 2019

I'm also seeing this same issue when running Fluentd with the ES plugin in Kubernetes. I tried both backends: typhoeus didn't work at all, while the default backend would work on the initial connection (fresh deploy) and then stop sending data almost immediately.

EDIT: I believe my issues were not caused by the ES plugin but by performance tuning that I needed to do on Fluentd.

@aaron1989041

I have a similar problem. I also get a huge number of warnings like the one below:
"failed to flush the buffer. retry_time=0 next_retry_seconds=2019-03-19 01:30:36 +0000 chunk="584686c3d47849db61228ea7e6f29bb5" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"es-cn-v0h10rbfl000kfon8..com\", :port=>9200, :scheme=>\"http\", :user=>\"elastic\", :password=>\"obfuscated\"}): connect_write timeout reached""
When this error happens, the only fix is to restart the Fluentd container, but then a gap in the logs appears.

@ChSch3000

ChSch3000 commented Mar 19, 2019

Same problem here. I'm using fluentd-kubernetes-daemonset.
I already opened an issue there: fluent/fluentd-kubernetes-daemonset#280
After deployment the plugin works fine and ships all logs to ES, but after a few hours it stops with the following error:

2019-03-19 08:24:32 +0000 : #0 [out_es] failed to flush the buffer. retry_time=2810 next_retry_seconds=2019-03-19 08:25:05 +0000 chunk="5846b2b0d6d06c398eee3540256d465d" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elastic.xyz.com\", :port=>443, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): connect_write timeout reached"

The only solution is to restart the pod, but this isn't an acceptable solution.

@cosmo0920
Collaborator

cosmo0920 commented Mar 19, 2019

Does setting reload_connections to false help with this issue?
I launched a docker-compose environment with the fluent/fluentd#2334 (comment) settings, but I couldn't reproduce the issue locally.
Do we need to handle a massive number of events to reproduce it?
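
For reference, a sketch of where that parameter would go in the original config's elasticsearch store (only `reload_connections` is the suggested change; the other lines are from the config above):

```aconf
<store>
  @type elasticsearch
  host my-es-host
  port 9200
  # Disable periodic node re-discovery (the elasticsearch-ruby transport
  # reloads connections after 10000 requests by default).
  reload_connections false
</store>
```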

@bidiudiu

bidiudiu commented Mar 20, 2019

Does setting reload_connections to false help with this issue?
I launched a docker-compose environment with the fluent/fluentd#2334 (comment) settings, but I couldn't reproduce the issue locally.
Do we need to handle a massive number of events to reproduce it?

@cosmo0920, I'm afraid so... In my case, once the hits reach 100,000+ the issue happens.

In Fluentd, here's the error info:

2019-03-20 02:07:53 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2019-03-20 02:07:54 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f880ef7f118"
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:249:in `perform_request'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/client.rb:128:in `perform_request'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/elasticsearch-api-1.0.18/lib/elasticsearch/api/actions/bulk.rb:90:in `bulk'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch.rb:353:in `send_bulk'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluent-plugin-elasticsearch-1.9.2/lib/fluent/plugin/out_elasticsearch.rb:339:in `write_objects'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/output.rb:490:in `write'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/buffer.rb:354:in `write_chunk'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/buffer.rb:333:in `pop'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/output.rb:342:in `try_flush'
2019-03-20 02:07:53 +0000 [warn]: /var/lib/gems/2.3.0/gems/fluentd-0.12.43/lib/fluent/output.rb:149:in `run'

I'll try 'reconnect_on_error true' and give feedback.

@ChSch3000

Does setting reload_connections to false help with this issue?
I launched a docker-compose environment with the fluent/fluentd#2334 (comment) settings, but I couldn't reproduce the issue locally.
Do we need to handle a massive number of events to reproduce it?

Maybe this is the solution for me. I set reload_connections to false, and it has now been working for about 18 hours without trouble. I will monitor it for the next few hours/days.

@cosmo0920
Collaborator

@bidiudiu @ChSch3000 Thank you for your issue confirmations and clarifications!

fluentd-kubernetes-daemonset provides the following environment variable:

  • FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS (default: true)

This should be specified:

  • FLUENT_ELASTICSEARCH_RELOAD_CONNECTIONS=false

@cosmo0920
Collaborator

I've added an FAQ entry for this situation: #564

Is any information still missing to solve this issue?

@bidiudiu

bidiudiu commented Mar 22, 2019

Thanks @cosmo0920. I added the settings below and it works fine:

reconnect_on_error true
reload_on_failure true
reload_connections false

@cosmo0920
Collaborator

cosmo0920 commented Mar 22, 2019

reconnect_on_error true
reload_on_failure true
reload_connections false

OK. Thanks for confirming, @bidiudiu!
I'll add more description of this issue to the FAQ.

cosmo0920 added a commit to cosmo0920/fluentd-kubernetes-daemonset that referenced this issue Mar 22, 2019
This is reported in
uken/fluent-plugin-elasticsearch#525.

Invalid sniffer information is obtained by default, but we can avoid
it with the following configuration:

```aconf
reload_connections false
reconnect_on_error true
reload_on_failure true
```

To specify reload_on_failure on fluentd-kubernetes-daemonset,
we should introduce a new envvar for it.

Signed-off-by: Hiroshi Hatake <hatake@clear-code.com>
cosmo0920 added a commit to cosmo0920/fluentd-kubernetes-daemonset that referenced this issue Apr 12, 2019
fluent-plugin-elasticsearch reloads connections after 10000 requests. (This does not correspond to the event count, because the ES plugin uses the bulk API.)

This functionality, which originates from the elasticsearch-ruby gem, is enabled by default.

Sometimes this reloading functionality prevents users from sending events with the ES plugin.

On the k8s platform, users sometimes need to specify the following settings:

```aconf
reload_connections false
reconnect_on_error true
reload_on_failure true
```

This is originally reported at
uken/fluent-plugin-elasticsearch#525.

On k8s, Fluentd sometimes handles a flood of events.
This is a pitfall of using fluent-plugin-elasticsearch on k8s,
so this parameter set should be the default.

Signed-off-by: Hiroshi Hatake <hatake@clear-code.com>
@dogzzdogzz

Can we change the default value of those settings for fluentd-kubernetes-daemonset? I think everyone who uses fluentd-kubernetes-daemonset will easily run into this issue.

@hustshawn
Author

hustshawn commented May 9, 2019

@dogzzdogzz If you are using Helm to install, e.g. helm upgrade --install logging-fluentd -f your-values.yml kiwigrid/fluentd-elasticsearch --namespace your-namespace, you can just modify the Fluentd config in your-values.yml.

Part of my snippet looks like this:

  output.conf: |
    # Enriches records with Kubernetes metadata
    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>
    <match **>
      @id elasticsearch
      @type elasticsearch
      @log_level info
      include_tag_key true
      type_name _doc
      host "#{ENV['OUTPUT_HOST']}"
      port "#{ENV['OUTPUT_PORT']}"
      scheme "#{ENV['OUTPUT_SCHEME']}"
      ssl_version "#{ENV['OUTPUT_SSL_VERSION']}"
      logstash_format true
      logstash_prefix "#{ENV['LOGSTASH_PREFIX']}"
      reload_connections false
      reconnect_on_error true
      reload_on_failure true
      slow_flush_log_threshold 25.0
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        flush_interval 5s
        flush_thread_count 4
        chunk_full_threshold 0.9
        # retry_forever
        retry_type exponential_backoff
        retry_timeout 1m
        retry_max_interval 30
        chunk_limit_size "#{ENV['OUTPUT_BUFFER_CHUNK_LIMIT']}"
        queue_limit_length "#{ENV['OUTPUT_BUFFER_QUEUE_LIMIT']}"
        overflow_action drop_oldest_chunk
      </buffer>
    </match>

@cosmo0920
Collaborator

@dogzzdogzz The latest fluentd-kubernetes-daemonset includes the above settings by default.

@darthchudi

I tried using the exact same config as #525 (comment), but the issue still persists. Fluentd stops shipping logs to Elasticsearch after some time.

@amulyamalla

@cosmo0920
The same issue persists; Fluentd is unable to send logs after a while.
From my observation, Fluentd runs absolutely fine until a restart; the problem occurs when the pod gets restarted.

2020-08-05 09:58:12 +0000 [warn]: [sample-service] failed to flush the buffer. retry_time=2 next_retry_seconds=2020-08-05 09:58:14 +0000 chunk="5ac1e67bde2f323981d71058390e5ebe" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"192.168.0.15\", :port=>9500, :scheme=>\"http\", :user=>\"fluentd\", :password=>\"obfuscated\"}, {:host=>\"192.168.0.16\", :port=>9500, :scheme=>\"http\", :user=>\"fluentd\", :password=>\"obfuscated\"}): read timeout reached"

**Resolution:** the only solution I found is to forcefully restart the Fluentd pod; the new container then sends logs immediately.

@cosmo0920
Collaborator

You should add the simple sniffer loading code and specify the loaded simple sniffer class:
https://github.com/uken/fluent-plugin-elasticsearch#sniffer-class-name
The default sniffer class causes this issue.
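
For illustration, a minimal sketch of that README section (the gem path passed to `-r` is an assumption; adjust it to wherever the gem is actually installed):

```aconf
# The simple sniffer class must be preloaded when Fluentd starts, e.g.:
#   fluentd -r /path/to/fluent-plugin-elasticsearch/lib/fluent/plugin/elasticsearch_simple_sniffer.rb -c /fluentd/etc/fluent.conf
<match **>
  @type elasticsearch
  host my-es-host
  port 9200
  reload_connections false
  reconnect_on_error true
  reload_on_failure true
  # Use the simple sniffer, which keeps using the configured hosts
  # instead of re-discovered cluster nodes.
  sniffer_class_name "Fluent::Plugin::ElasticsearchSimpleSniffer"
</match>
```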

@hari819

hari819 commented Mar 16, 2021

You should add the simple sniffer loading code and specify the loaded simple sniffer class:
https://github.com/uken/fluent-plugin-elasticsearch#sniffer-class-name
The default sniffer class causes this issue.

Did this work to solve the "failed to flush the buffer" error? If so, could you post the configuration?
I have tried running Fluentd with the sniffer class, but I still get the same error.

Thanks,

@Brian-McM

Brian-McM commented Mar 18, 2021

Yes, me too: I've loaded the sniffer class and it's still giving me that error. I'm using version 4.0.5, and I get the error as soon as the Fluentd pods restart; there's no grace period where sending logs succeeds. Initially it was working, though: the scheme is set to https, and I double-checked that it was actually sending successfully on restart.

@mokhos

mokhos commented Mar 31, 2022

Same issue here. Did anyone find a concrete solution?
I tried these, but no luck:

reconnect_on_error true
reload_on_failure true
reload_connections false

Also, the sniffer_class solution doesn't work for me at all and throws an error.

@mokhos

mokhos commented Apr 4, 2022

I found a solution 4 days ago and I've been testing it ever since. After the change I made, my Fluentd hasn't stopped or crashed sending logs to Elasticsearch.

My solution was to change the buffer path in the way shown in the Fluentd documentation:

path /opt/bitnami/fluentd/logs/buffers/logs.*.buffer

instead of

path /opt/bitnami/fluentd/logs/buffers/logs.buffer

This worked for me.
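
For context, a minimal sketch of where that path sits (the surrounding lines are illustrative, not from this comment):

```aconf
<match **>
  @type elasticsearch
  host my-es-host
  port 9200
  <buffer>
    @type file
    # The "*" is replaced by Fluentd with a unique identifier, so buffer
    # chunks get distinct files instead of colliding on one fixed path.
    path /opt/bitnami/fluentd/logs/buffers/logs.*.buffer
  </buffer>
</match>
```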

@hari819

hari819 commented Apr 5, 2022

I found a solution 4 days ago and I've been testing it ever since. After the change I made, my Fluentd hasn't stopped or crashed sending logs to Elasticsearch.

My solution was to change the buffer path in the way shown in the Fluentd documentation:

path /opt/bitnami/fluentd/logs/buffers/logs.*.buffer

instead of

path /opt/bitnami/fluentd/logs/buffers/logs.buffer

This worked for me.

@mokhos, could you please let us know the versions of fluentd / fluent-plugin-elasticsearch you were using to test this configuration?

@mokhos

mokhos commented Apr 5, 2022

I found a solution 4 days ago and I've been testing it ever since. After the change I made, my Fluentd hasn't stopped or crashed sending logs to Elasticsearch.
My solution was to change the buffer path in the way shown in the Fluentd documentation:
path /opt/bitnami/fluentd/logs/buffers/logs.*.buffer
instead of
path /opt/bitnami/fluentd/logs/buffers/logs.buffer
This worked for me.

@mokhos, could you please let us know the versions of fluentd / fluent-plugin-elasticsearch you were using to test this configuration?

I have used the versions below:

2022-03-30 11:56:59 +0000 [info]: gem 'fluentd' version '1.14.5' 
2022-03-30 11:56:59 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '5.1.5'

@srujith07

srujith07 commented May 25, 2022

Hi @cosmo0920, I am also facing the same issue; it would be helpful if you could share your solution with me.

If I restart my td-agent.service, logs arrive in Elasticsearch for some time, but after 3-6 minutes they stop automatically and no error shows up in the td-agent logs.

Here is my configuration:

<match "mytopicname">
      @type elasticsearch
      hosts        "my_IP_address_here"
      ca_file       "my_path_here"
      client_cert  "my_path_here" 
      client_key  " my_path_here" 
      ssl_verify  true
      user   "my_username"
      password "my_password"
      logstash_format true
      logstash_prefix "my_index_name"
      logstash_date_format  my_date_format
      time_key_format  "my time format"
      type_name  fluentd
      log_es_400_reason true
      include_timestamp true
      reconnect_on_error true
      reload_on_failure true
      reload_connections false
     <buffer>
          @type file
           path     "my path here"
           chunk_limit_size 10m
     </buffer>
</match>

also tried.

<match "mytopicname">
      @type elasticsearch
      hosts        "my_IP_address_here"
      ca_file       "my_path_here"
      client_cert  "my_path_here" 
      client_key  " my_path_here" 
      ssl_verify  true
      user   "my_username"
      password "my_password"
      logstash_format true
      logstash_prefix "my_index_name"
      logstash_date_format  my_date_format
      time_key_format  "my time format"
      type_name  fluentd
      log_es_400_reason true
      include_timestamp true
      reconnect_on_error true
      reload_on_failure true
      reload_connections false
      slow_flush_log_threshold  25.0
     <buffer>
          @type file
           path     "syslog.*.buffer"
           chunk_limit_size 50m
           flush_mode interval
           flush_interval  5s
           flush_thread_count 4
          overflow_action drop_oldest_chunk
          retry_timeout 1m
          retry_max_interval 30
          chunk_full_threshold 0.9
      </buffer>
</match>

Please help !!!!!

Note: the above configuration was not copy-pasted, so ignore any indentation differences.

gacyberrange pushed a commit to gacybercenter/kinetic that referenced this issue Dec 20, 2022
@xgbt

xgbt commented Dec 7, 2023

Thanks @cosmo0920. I added the settings below and it works fine:

reconnect_on_error true
reload_on_failure true
reload_connections false

It works for me.
