fluentd stops forwarding logs to ES #280

Closed
ChSch3000 opened this issue Mar 19, 2019 · 6 comments

Comments


ChSch3000 commented Mar 19, 2019

I've deployed the fluentd-kubernetes-daemonset and configured it to send the logs to Elasticsearch. Everything worked so far; I can see the logs in Kibana.
But after 8-10 hours Elasticsearch doesn't receive logs anymore.

2019-03-19 08:24:32 +0000 : #0 [out_es] failed to flush the buffer. retry_time=2810 next_retry_seconds=2019-03-19 08:25:05 +0000 chunk="5846b2b0d6d06c398eee3540256d465d" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elastic.ionedev.com\", :port=>443, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): connect_write timeout reached"

When I restart the pods, they'll work for the next few hours, until they stop with the same error.

Here is my fluentd configuration:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-logging
  labels:
    app: fluentd
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
  labels:
    app: fluentd
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - get
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: kube-logging
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-logging
  labels:
    app: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccount: fluentd
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.3.3-debian-elasticsearch-1.3
        env:
          - name:  FLUENT_ELASTICSEARCH_HOST
            value: "<HIDDEN>"
          - name:  FLUENT_ELASTICSEARCH_PORT
            value: "443"
          - name: FLUENT_ELASTICSEARCH_SCHEME
            value: "https"
          - name: FLUENT_UID
            value: "0"
          - name: FLUENT_ELASTICSEARCH_SSL_VERSION
            value: TLSv1_2
          - name: FLUENT_ELASTICSEARCH_USER
            value: "elastic"
          - name: FLUENT_ELASTICSEARCH_PASSWORD
            value: "<HIDDEN>"
          - name: FLUENTD_SYSTEMD_CONF
            value: "disable"
          - name: FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX
            value: "ionedev"
          - name: FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH
            value: "64"
          - name: FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL
            value: "3s"
          - name: FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE
            value: "25M"
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluentd/etc/conf.d
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: ia-config

Maybe the HTTP timeout is too low, but I can't see an option to configure this with environment variables.

@repeatedly
Member

Maybe the HTTP timeout is too low, but I can't see an option to configure this with environment variables.

Does this mean that ES accepts the request from fluentd but can't handle it?
If the built-in configuration doesn't support all the parameters, you can use a ConfigMap for it.
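
For reference, a minimal sketch of what such an override could look like as a file shipped via the mounted ConfigMap. This is illustrative only: request_timeout and reconnect_on_error are fluent-plugin-elasticsearch options (request_timeout defaults to 5s), the host/credential settings simply reuse the environment variables from the DaemonSet above, and exactly where the file must be placed so it wins over the image's built-in <match **> output depends on the image version.

# Sketch of an Elasticsearch output with a longer HTTP timeout.
<match **>
  @type elasticsearch
  host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
  port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
  scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
  user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
  password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
  logstash_format true
  logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
  request_timeout 30s        # default is 5s; raise it if ES is slow to accept the bulk request
  reconnect_on_error true    # rebuild the connection after a failure instead of reusing it
  <buffer>
    flush_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL'] || '5s'}"
    chunk_limit_size "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '2M'}"
    retry_max_interval 30
    retry_forever true
  </buffer>
</match>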

@shinebayar-g
Contributor

shinebayar-g commented Apr 4, 2019

  • Is your Elasticsearch actually handling all logs fast enough? Are CPU and RAM usage okay?
  • I think you're hitting the retry_max_interval limit, which is 30 by default. Actually I don't know whether retry_forever true disables this limit or not.
   <buffer>
     flush_thread_count "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_THREAD_COUNT'] || '8'}"
     flush_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL'] || '5s'}"
     chunk_limit_size "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '2M'}"
     queue_limit_length "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '32'}"
     retry_max_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_RETRY_MAX_INTERVAL'] || '30'}"
     retry_forever true
   </buffer>

Documentation says:

If the bottom chunk write out fails, it will remain in the queue and Fluentd will retry after waiting several seconds (retry_wait). If the retry limit has not been disabled (retry_forever is false) and the retry count exceeds the specified limit (retry_max_times), all chunks in the queue are discarded. The retry wait time doubles each time (1.0sec, 2.0sec, 4.0sec, …) until retry_max_interval is reached. If the queue length exceeds the specified limit (queue_limit_length), new events are rejected.

queue_limit_length is also mentioned there, but I couldn't find its default value.

Edit: Doc says:

If true, plugin will ignore retry_timeout and retry_max_times options and retry flushing forever.
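
For illustration only, a buffer section that bounds retries instead of retrying forever would look roughly like this (the parameter names come from the documentation quoted above; the values are arbitrary):

<buffer>
  flush_interval 5s
  retry_forever false      # honour retry_timeout / retry_max_times again
  retry_wait 1s            # first retry after 1s, then 2s, 4s, ... (doubling each time)
  retry_max_interval 30s   # cap the exponential backoff at 30s
  retry_max_times 10       # give up on a chunk after 10 failed flush attempts
  queue_limit_length 64    # reject new events once 64 chunks are queued
</buffer>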

@Kefa7y

Kefa7y commented May 20, 2019

@ChSch3000 Just curious, are you running the AWS Elasticsearch service as your current ES log storage?

@arunv707

arunv707 commented Jul 3, 2019

Hey @Kefa7y, I am having the same issue in my AWS EKS cluster where fluentd is configured to push logs using in_tail. The CPU of the pod is maxed at 100% and then it stops sending logs to ES. And yes, I am using AWS Elasticsearch Service.

kubectl exec -it fluentd-c5gzl -- top
Tasks:   4 total,   2 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 53.4 us,  0.7 sy,  0.0 ni, 45.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7865400 total,  2766144 free,   967288 used,  4131968 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6540908 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   13 root      20   0  268320  87652  10088 R 100.0  1.1  13:57.37 ruby
    1 root      20   0   15828   1696   1572 S   0.0  0.0   0:00.02 tini
    6 root      20   0  197328  52088   9584 S   0.0  0.7   0:00.79 ruby
   35 root      20   0   50136   3820   3208 R   0.0  0.0   0:00.01 top

The instance type I am using is t2.medium.elasticsearch and I am just setting things up, so there are not many logs yet. I am using the multi-format parser plugin and my kubernetes.conf file looks like this:

<match fluent.**>
  @type null
</match>

<source>
  @type tail
  @id envoy_logs
  path /var/log/containers/envoy-*.log
  pos_file /var/log/envoy-containers.log.pos
  tag envoy_logs.*
  read_from_head true
  <parse>
    @type multi_format
    <pattern>
      format regexp
      expression /(?<envoy_logs>.*)/
    </pattern>
  </parse>
</source>

<source>
  @type tail
  @id xxx_logs
  path /var/log/containers/xxx*.log
  # exclude_path ["/var/log/containers/envoy.**", "/var/log/containers/coredns.**"]
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  # format multi_format
  <parse>
    @type multi_format
    <pattern>
      format regexp
      expression /^{.*\[(?<k8s_time_local>.*)\]\s\\"(?<k8s_method>.*)\s(?<k8s_url>\/.*)\s(?<k8s_protocol>.*)\\"\s(?<k8s_status>\d{3})(?<k8s_response_time>.*)\s\\"-\\"\s\\"(?<k8s_agent>.*)\\"\s\\"(?<k8s_request_id>.*)\\"\s(?<k8s_time2>.*)\s-\s\.\\n",(?<k8s_end>.*)}$/
    </pattern>
    <pattern>
      format regexp
      expression /^{\"log\":\"(?<k8s_remote_addr>[^ ]*)\s-(?<k8s_remote_user>.[^ ]*)\s\[(?<k8s_time_local>[^\]]*)\]\s\\\"(?<k8s_method>.[^ ]*)(?<k8s_request>.[^ ]*)(?<k8s_protocol>.[^ ]*)\\\"(?<k8s_status>.[^ ]*)(?<k8s_body_bytes_sent>.[^ ]*)\s\\\"\-\\\"\s\\\"(?<k8s_agent>.*)\\\"\s(?<k8s_request_id>.[^ ]*)(?<k8s_request_time>.[^ ]*)(?<k8s_upstream_connect_time>.[^ ]*)(?<k8s_upstream_header_time>.[^ ]*)(?<k8s_upstream_response_time>.[^ ]*)\\.\"\,(?<k8s_end>.*)}$/ #"
    </pattern>
    <pattern>
      format regexp
      expression /^{\"log\":\"(?<k8s_remote_addr>[^ ]*)\s-(?<k8s_remote_user>.[^ ]*)\s\[(?<k8s_time_local>[^\]]*)\]\s\\\"(?<k8s_method>.[^ ]*)(?<k8s_request>.[^ ]*)(?<k8s_protocol>.[^ ]*)\\\"(?<k8s_status>.[^ ]*)(?<k8s_body_bytes_sent>.[^ ]*)\s\\"(?<k8s_URL>.[^ ]*)\\"\s\\"(?<k8s_agent>.*)\\\"\s(?<k8s_request_id>.[^ ]*)(?<k8s_request_time>.[^ ]*)(?<k8s_upstream_connect_time>.[^ ]*)(?<k8s_upstream_header_time>.[^ ]*)(?<k8s_upstream_response_time>.[^ ]*)\\.\"\,(?<k8s_end>.*)}$/ #"
    </pattern>
    <pattern>
      format regexp
      expression /^{"log":"(?<k8s_log_type>.*)\s{\\\"time\\":\\"(?<k8s_time_local>.*)\\",\\"id\\":\\"(?<k8s_request_id>.*)\\",\\"ip\\":\\"(?<k8s_remote_addr>.*)\\",\\"host\\":\\"(?<k8s_host>.*)\\",\\"method\\":\\"(?<k8s_method>.*)\\",\\"uri\\":\\"(?<k8s_uri>.*)\\",\s\\"agent\\":\\"(?<k8s_agent>.*)\\",\s\\"status\\":(?<k8s_status>\d{0,}),\\"latency\\":\\"(?<k8s_latency>.*)\\",\\"in\\":(?<k8s_in>.*),\\"out\\":(?<k8s_out>.*)}\\n",(?<k8s_end>.*)}$/ #"
    </pattern>
    <pattern>
      format regexp
      expression /^{"log":"\s(?<k8s_remote_addr>((\d{1,3}.){0,}(,\s|(\d{1,3}){1,3})){0,})\s-\s(?<k8s_remote_user>.*)\s\[(?<k8s_time_local>[^\]]*)\]\s\\\"(?<k8s_method>.[^ ]*)\s(?<k8s_request>.*)\s(?<k8s_protocol>.*)\\"\s(?<k8s_status>\d{3})\s(?<k8s_body_bytes_sent>.*)\s\\"(?<k8s_URL>.*\/)\\"\s\\"(?<k8s_agent>.*)\\"\s\\"(?<k8s_request_id>.*)\\\"\s(?<k8s_request_time>.*)\s(?<k8s_upstream_response_time>.*)\s\.\\n",(?<k8s_end>.*)/
    </pattern>
    <pattern>
      format regexp
      expression /^{"log":"\s(?<k8s_remote_addr>((\d{1,3}.){0,}(,\s|(\d{1,3}){1,3})){0,})\s-\s(?<k8s_remote_user>.*)\s\[(?<k8s_time_local>[^\]]*)\]\s\\\"(?<k8s_method>.[^ ]*)\s(?<k8s_request>.*)\s(?<k8s_protocol>.*)\\"\s(?<k8s_status>\d{3})\s(?<k8s_body_bytes_sent>.*)\s\\"-\\"\s\\"(?<k8s_agent>.*)\\"\s\\"(?<k8s_request_id>.*)\\\"\s(?<k8s_request_time>.*)\s(?<k8s_upstream_response_time>.*)\s\.\\n",(?<k8s_end>.*)/
    </pattern>
    <pattern>
      #catch all regexp that DO NOT MATCH the regexps above and store it in the field - catchall_log
      format regexp
      expression /(?<catchall_log>.*)/
    </pattern>
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<match kubernetes.var.log.containers.fluentd**>
  @type null
</match>
<match fluent.info>
  @type null
</match>
<match fluent.warn>
  @type null
</match>

@repeatedly
Member

@arunv707 JFYI, you can use the none parser instead of a single-pattern regexp for the catch-all.
It is much faster.

  • result
Warming up --------------------------------------
              regexp    44.454k i/100ms
                none   250.743k i/100ms
Calculating -------------------------------------
              regexp    513.904k (± 1.9%) i/s -      2.578M in   5.019090s
                none      5.734M (± 2.8%) i/s -     28.835M in   5.033127s

  • code
require 'benchmark/ips'

TEXT = '192.168.0.1 - - [28/Feb/2013:12:00:00 +0900] [14/Feb/2013:12:00:00 +0900] "true /,/user HTTP/1.1" 200 777'

RE = /(?<catchall_log>.*)/
# extract from parser_regexp
def regexp(text)
  m = RE.match(text)
  unless m
    return
  end
  r = {}
  m.names.each do |name|
    if value = m[name]
      r[name] = value
    end
  end
  r
end

KEY = 'catchall_log'
# extract from parser_none
def none(text)
  {KEY => text}
end

Benchmark.ips do |x|
  x.report "regexp" do
    regexp(TEXT)
  end

  x.report "none" do
    none(TEXT)
  end
end
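
Applied to the multi_format block from the earlier comment, the catch-all pattern could then look roughly like this (a sketch; message_key is the none parser's option for the field name, and it is assumed here that multi_format passes it through to that parser):

<pattern>
  # catch-all: store the raw line under "catchall_log" without evaluating a regexp
  format none
  message_key catchall_log
</pattern>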

@cosmo0920
Contributor

With the current master ES images for the daemonset, I couldn't reproduce this; it has been running for over 4 days now.
This should be fixed. Closing.
