fluentd stops forwarding logs to ES #280

Closed
ChSch3000 opened this issue Mar 19, 2019 · 6 comments

Comments


ChSch3000 commented Mar 19, 2019

I've deployed the fluentd-kubernetes-daemonset and configured it to send the logs to Elasticsearch. Everything worked so far; I can see the logs in Kibana.
But after 8-10 hours Elasticsearch doesn't receive logs anymore.

2019-03-19 08:24:32 +0000 : #0 [out_es] failed to flush the buffer. retry_time=2810 next_retry_seconds=2019-03-19 08:25:05 +0000 chunk="5846b2b0d6d06c398eee3540256d465d" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elastic.ionedev.com\", :port=>443, :scheme=>\"https\", :user=>\"elastic\", :password=>\"obfuscated\", :path=>\"\"}): connect_write timeout reached"

When I restart the pods, they'll work for the next few hours, until they stop with the same error.

Here is my fluentd configuration:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: kube-logging
  labels:
    app: fluentd
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
  labels:
    app: fluentd
rules:
- apiGroups:
  - ""
  resources:
  - pods
  - namespaces
  verbs:
  - get
  - list
  - watch
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: fluentd
roleRef:
  kind: ClusterRole
  name: fluentd
  apiGroup: rbac.authorization.k8s.io
subjects:
- kind: ServiceAccount
  name: fluentd
  namespace: kube-logging
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-logging
  labels:
    app: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccount: fluentd
      serviceAccountName: fluentd
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.3.3-debian-elasticsearch-1.3
        env:
          - name:  FLUENT_ELASTICSEARCH_HOST
            value: "<HIDDEN>"
          - name:  FLUENT_ELASTICSEARCH_PORT
            value: "443"
          - name: FLUENT_ELASTICSEARCH_SCHEME
            value: "https"
          - name: FLUENT_UID
            value: "0"
          - name: FLUENT_ELASTICSEARCH_SSL_VERSION
            value: TLSv1_2
          - name: FLUENT_ELASTICSEARCH_USER
            value: "elastic"
          - name: FLUENT_ELASTICSEARCH_PASSWORD
            value: "<HIDDEN>"
          - name: FLUENTD_SYSTEMD_CONF
            value: "disable"
          - name: FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX
            value: "ionedev"
          - name: FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH
            value: "64"
          - name: FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL
            value: "3s"
          - name: FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE
            value: "25M"
        resources:
          limits:
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 200Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluentd/etc/conf.d
      terminationGracePeriodSeconds: 30
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: ia-config

Maybe the HTTP timeout is too low, but I can't see an option to configure this with environment variables.

@repeatedly
Member

Maybe the HTTP timeout is too low, but I can't see an option to configure this with environment variables.

Does this mean that ES accepts the request from fluentd but can't handle it?
If the built-in configuration doesn't support all the parameters, you can use a ConfigMap for it.
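
For reference, a minimal sketch of what such an override could look like as a file shipped via the mounted ConfigMap. This is illustrative only: request_timeout and reconnect_on_error are fluent-plugin-elasticsearch options (request_timeout defaults to 5s), the host/credential settings simply reuse the environment variables from the DaemonSet above, and exactly where the file must be placed so it wins over the image's built-in <match **> output depends on the image version.

# Sketch of an Elasticsearch output with a longer HTTP timeout.
<match **>
  @type elasticsearch
  host "#{ENV['FLUENT_ELASTICSEARCH_HOST']}"
  port "#{ENV['FLUENT_ELASTICSEARCH_PORT']}"
  scheme "#{ENV['FLUENT_ELASTICSEARCH_SCHEME'] || 'http'}"
  user "#{ENV['FLUENT_ELASTICSEARCH_USER']}"
  password "#{ENV['FLUENT_ELASTICSEARCH_PASSWORD']}"
  logstash_format true
  logstash_prefix "#{ENV['FLUENT_ELASTICSEARCH_LOGSTASH_PREFIX'] || 'logstash'}"
  request_timeout 30s        # default is 5s; raise it if ES is slow to accept the bulk request
  reconnect_on_error true    # rebuild the connection after a failure instead of reusing it
  <buffer>
    flush_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL'] || '5s'}"
    chunk_limit_size "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '2M'}"
    retry_max_interval 30
    retry_forever true
  </buffer>
</match>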

@shinebayar-g
Contributor

shinebayar-g commented Apr 4, 2019

  • Is your Elasticsearch actually handling all logs fast enough? Are CPU and RAM usage okay?
  • I think you're hitting the retry_max_interval limit, which is 30 by default. Actually I don't know whether retry_forever true disables this limit or not.
   <buffer>
     flush_thread_count "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_THREAD_COUNT'] || '8'}"
     flush_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_FLUSH_INTERVAL'] || '5s'}"
     chunk_limit_size "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_CHUNK_LIMIT_SIZE'] || '2M'}"
     queue_limit_length "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_QUEUE_LIMIT_LENGTH'] || '32'}"
     retry_max_interval "#{ENV['FLUENT_ELASTICSEARCH_BUFFER_RETRY_MAX_INTERVAL'] || '30'}"
     retry_forever true
   </buffer>

Documentation says:

If the bottom chunk write out fails, it will remain in the queue and Fluentd will retry after waiting several seconds (retry_wait). If the retry limit has not been disabled (retry_forever is false) and the retry count exceeds the specified limit (retry_max_times), all chunks in the queue are discarded. The retry wait time doubles each time (1.0sec, 2.0sec, 4.0sec, …) until retry_max_interval is reached. If the queue length exceeds the specified limit (queue_limit_length), new events are rejected.

queue_limit_length is also mentioned there, but I couldn't find its default value.

Edit: Doc says:

If true, plugin will ignore retry_timeout and retry_max_times options and retry flushing forever.
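
For illustration only, a buffer section that bounds retries instead of retrying forever would look roughly like this (the parameter names come from the documentation quoted above; the values are arbitrary):

<buffer>
  flush_interval 5s
  retry_forever false      # honour retry_timeout / retry_max_times again
  retry_wait 1s            # first retry after 1s, then 2s, 4s, ... (doubling each time)
  retry_max_interval 30s   # cap the exponential backoff at 30s
  retry_max_times 10       # give up on a chunk after 10 failed flush attempts
  queue_limit_length 64    # reject new events once 64 chunks are queued
</buffer>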

@Kefa7y

Kefa7y commented May 20, 2019

@ChSch3000 Just curious, are you running the AWS Elasticsearch service as your current ES log storage?

@arunv707

arunv707 commented Jul 3, 2019

Hey @Kefa7y, I am having the same issue in my AWS EKS cluster where fluentd is configured to push logs using in_tail. The CPU of the pod is maxed at 100% and then it stops sending logs to ES. And yes, I am using AWS Elasticsearch Service.

kubectl exec -it fluentd-c5gzl -- top
Tasks:   4 total,   2 running,   2 sleeping,   0 stopped,   0 zombie
%Cpu(s): 53.4 us,  0.7 sy,  0.0 ni, 45.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  7865400 total,  2766144 free,   967288 used,  4131968 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  6540908 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
   13 root      20   0  268320  87652  10088 R 100.0  1.1  13:57.37 ruby
    1 root      20   0   15828   1696   1572 S   0.0  0.0   0:00.02 tini
    6 root      20   0  197328  52088   9584 S   0.0  0.7   0:00.79 ruby
   35 root      20   0   50136   3820   3208 R   0.0  0.0   0:00.01 top

The instance type I am using is t2.medium.elasticsearch and I am just setting things up, so there are not many logs yet. I am using the multi-format parser plugin and my kubernetes.conf file looks like this:

<match fluent.**>
  @type null
</match>

<source>
  @type tail
  @id envoy_logs
  path /var/log/containers/envoy-*.log
  pos_file /var/log/envoy-containers.log.pos
  tag envoy_logs.*
  read_from_head true
  <parse>
    @type multi_format
    <pattern>
      format regexp
      expression /(?<envoy_logs>.*)/
    </pattern>
  </parse>
</source>

<source>
  @type tail
  @id xxx_logs
  path /var/log/containers/xxx*.log
  # exclude_path ["/var/log/containers/envoy.**", "/var/log/containers/coredns.**"]
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  # format multi_format
  <parse>
    @type multi_format
    <pattern>
      format regexp
      expression /^{.*\[(?<k8s_time_local>.*)\]\s\\"(?<k8s_method>.*)\s(?<k8s_url>\/.*)\s(?<k8s_protocol>.*)\\"\s(?<k8s_status>\d{3})(?<k8s_response_time>.*)\s\\"-\\"\s\\"(?<k8s_agent>.*)\\"\s\\"(?<k8s_request_id>.*)\\"\s(?<k8s_time2>.*)\s-\s\.\\n",(?<k8s_end>.*)}$/
    </pattern>
    <pattern>
      format regexp
      expression /^{\"log\":\"(?<k8s_remote_addr>[^ ]*)\s-(?<k8s_remote_user>.[^ ]*)\s\[(?<k8s_time_local>[^\]]*)\]\s\\\"(?<k8s_method>.[^ ]*)(?<k8s_request>.[^ ]*)(?<k8s_protocol>.[^ ]*)\\\"(?<k8s_status>.[^ ]*)(?<k8s_body_bytes_sent>.[^ ]*)\s\\\"\-\\\"\s\\\"(?<k8s_agent>.*)\\\"\s(?<k8s_request_id>.[^ ]*)(?<k8s_request_time>.[^ ]*)(?<k8s_upstream_connect_time>.[^ ]*)(?<k8s_upstream_header_time>.[^ ]*)(?<k8s_upstream_response_time>.[^ ]*)\\.\"\,(?<k8s_end>.*)}$/ #"
    </pattern>
    <pattern>
      format regexp
      expression /^{\"log\":\"(?<k8s_remote_addr>[^ ]*)\s-(?<k8s_remote_user>.[^ ]*)\s\[(?<k8s_time_local>[^\]]*)\]\s\\\"(?<k8s_method>.[^ ]*)(?<k8s_request>.[^ ]*)(?<k8s_protocol>.[^ ]*)\\\"(?<k8s_status>.[^ ]*)(?<k8s_body_bytes_sent>.[^ ]*)\s\\"(?<k8s_URL>.[^ ]*)\\"\s\\"(?<k8s_agent>.*)\\\"\s(?<k8s_request_id>.[^ ]*)(?<k8s_request_time>.[^ ]*)(?<k8s_upstream_connect_time>.[^ ]*)(?<k8s_upstream_header_time>.[^ ]*)(?<k8s_upstream_response_time>.[^ ]*)\\.\"\,(?<k8s_end>.*)}$/ #"
    </pattern>
    <pattern>
      format regexp
      expression /^{"log":"(?<k8s_log_type>.*)\s{\\\"time\\":\\"(?<k8s_time_local>.*)\\",\\"id\\":\\"(?<k8s_request_id>.*)\\",\\"ip\\":\\"(?<k8s_remote_addr>.*)\\",\\"host\\":\\"(?<k8s_host>.*)\\",\\"method\\":\\"(?<k8s_method>.*)\\",\\"uri\\":\\"(?<k8s_uri>.*)\\",\s\\"agent\\":\\"(?<k8s_agent>.*)\\",\s\\"status\\":(?<k8s_status>\d{0,}),\\"latency\\":\\"(?<k8s_latency>.*)\\",\\"in\\":(?<k8s_in>.*),\\"out\\":(?<k8s_out>.*)}\\n",(?<k8s_end>.*)}$/ #"
    </pattern>
    <pattern>
      format regexp
      expression /^{"log":"\s(?<k8s_remote_addr>((\d{1,3}.){0,}(,\s|(\d{1,3}){1,3})){0,})\s-\s(?<k8s_remote_user>.*)\s\[(?<k8s_time_local>[^\]]*)\]\s\\\"(?<k8s_method>.[^ ]*)\s(?<k8s_request>.*)\s(?<k8s_protocol>.*)\\"\s(?<k8s_status>\d{3})\s(?<k8s_body_bytes_sent>.*)\s\\"(?<k8s_URL>.*\/)\\"\s\\"(?<k8s_agent>.*)\\"\s\\"(?<k8s_request_id>.*)\\\"\s(?<k8s_request_time>.*)\s(?<k8s_upstream_response_time>.*)\s\.\\n",(?<k8s_end>.*)/
    </pattern>
    <pattern>
      format regexp
      expression /^{"log":"\s(?<k8s_remote_addr>((\d{1,3}.){0,}(,\s|(\d{1,3}){1,3})){0,})\s-\s(?<k8s_remote_user>.*)\s\[(?<k8s_time_local>[^\]]*)\]\s\\\"(?<k8s_method>.[^ ]*)\s(?<k8s_request>.*)\s(?<k8s_protocol>.*)\\"\s(?<k8s_status>\d{3})\s(?<k8s_body_bytes_sent>.*)\s\\"-\\"\s\\"(?<k8s_agent>.*)\\"\s\\"(?<k8s_request_id>.*)\\\"\s(?<k8s_request_time>.*)\s(?<k8s_upstream_response_time>.*)\s\.\\n",(?<k8s_end>.*)/
    </pattern>
    <pattern>
      #catch all regexp that DO NOT MATCH the regexps above and store it in the field - catchall_log
      format regexp
      expression /(?<catchall_log>.*)/
    </pattern>
  </parse>
</source>

<filter kubernetes.**>
  @type kubernetes_metadata
</filter>

<match kubernetes.var.log.containers.fluentd**>
  @type null
</match>
<match fluent.info>
  @type null
</match>
<match fluent.warn>
  @type null
</match>

@repeatedly
Member

@arunv707 JFYI, you can use the none parser instead of a single-pattern regexp for the catch-all.
It is much faster.

  • result
Warming up --------------------------------------
              regexp    44.454k i/100ms
                none   250.743k i/100ms
Calculating -------------------------------------
              regexp    513.904k (± 1.9%) i/s -      2.578M in   5.019090s
                none      5.734M (± 2.8%) i/s -     28.835M in   5.033127s

  • code
require 'benchmark/ips'

TEXT = '192.168.0.1 - - [28/Feb/2013:12:00:00 +0900] [14/Feb/2013:12:00:00 +0900] "true /,/user HTTP/1.1" 200 777'

RE = /(?<catchall_log>.*)/
# extract from parser_regexp
def regexp(text)
  m = RE.match(text)
  unless m
    return
  end
  r = {}
  m.names.each do |name|
    if value = m[name]
      r[name] = value
    end
  end
  r
end

KEY = 'catchall_log'
# extract from parser_none
def none(text)
  {KEY => text}
end

Benchmark.ips do |x|
  x.report "regexp" do
    regexp(TEXT)
  end

  x.report "none" do
    none(TEXT)
  end
end
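
Applied to the multi_format block from the earlier comment, the catch-all pattern could then look roughly like this (a sketch; message_key is the none parser's option for the field name, and it is assumed here that multi_format passes it through to that parser):

<pattern>
  # catch-all: store the raw line under "catchall_log" without evaluating a regexp
  format none
  message_key catchall_log
</pattern>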

@cosmo0920
Contributor

With the current master ES images for the daemonset, I couldn't reproduce this; it has been running for over 4 days now.
This should be fixed. Closing.
