Filestream data duplication in filebeat 8.9.1 #36379
Comments
Hi @germain05, I have a few questions to better understand the problem you're facing:
config.yaml:
rbac.yaml:
Hope the additional information can help.
Hi @germain05, Thanks for all the information. If there is duplicated data, you will see the duplicated documents in Kibana. The metrics for the duplicated input will not be collected because we can't have two metrics instances with the same ID.
The ID is not duplicated; if the bug is still there, it is an odd race condition that affects our bookkeeping code when inputs are started/stopped. If I have time this week, I'll try again to reproduce it.
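For context, a minimal sketch of what explicitly unique IDs look like on standalone filestream inputs (the IDs and paths here are illustrative, not taken from this issue):

```yaml
filebeat.inputs:
  # Each filestream input needs its own unique ID; reusing an ID
  # triggers the "already exists" warning and disables metrics
  # collection for the second instance.
  - type: filestream
    id: app-logs          # illustrative name
    paths:
      - /var/log/app/*.log
  - type: filestream
    id: nginx-logs        # illustrative name
    paths:
      - /var/log/nginx/*.log
```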
Hi @belimawr, unfortunately I didn't have time to try to reproduce it in a smaller, controlled experiment. I will wait and see if you can reproduce it, and hopefully get some explanation of why it still happens. Thanks for the efforts.
@belimawr, could you please share the part of the code causing this bug? It would be nice if I could share with my teammates what exactly the root cause of the issue is, so it is clear for everyone.
Hi @germain05, of course I can :D The PR fixing the issue is this one: #35134. Here is the explanation; bear in mind that the code I'm linking already contains the fix.
The duplicated-ID message happens because, during config validation, the input is fully instantiated (beats/filebeat/input/v2/compat/compat.go, lines 73 to 86 in 3161fc0, particularly line 74). When the Filestream input is instantiated, we add its ID to a map in the input manager (beats/filebeat/input/filestream/internal/input-logfile/manager.go, lines 181 to 192 in 3161fc0). This entry was only removed when the input was stopped (beats/filebeat/input/filestream/internal/input-logfile/manager.go, lines 251 to 255 in 3161fc0). An input instantiated just for validation is never started or stopped, so its ID entry stayed in the map, and the real input then looked like a duplicate.
The PR I linked above fixes it. At least in our tests we didn't manage to reproduce the issue after the fix, and we have seen it working in many deployments. Another possibility for this issue was that the autodiscover code was generating too many "duplicated" events, which could lead to inputs being started/stopped multiple times within a few seconds and leave Filebeat in an odd state (mostly related to the Log input). This has also been resolved: #35645.
Hi @belimawr, I want to let you know that we could confirm we actually have duplicate data, and it is shipped to Kibana. Is it possible to investigate this issue further?
Hi, could it be that this issue, #36378, is related to the issue I have here?
I managed to do some testing, and I can confirm I'm able to reproduce the log messages about duplicated IDs with Filebeat. However, I did not manage to reproduce any data duplication. #36378 is an interesting new data point on your case, @germain05. Are you migrating from the log/container input to the filestream input using the take-over mode? The take-over mode is definitely a new variable that is worth investigating. It would be very helpful if you could consistently reproduce the problem and describe how to reproduce it in a controlled environment.
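For readers unfamiliar with the take-over mode, here is a minimal sketch of what it looks like in a standalone config; the ID and paths are illustrative, and the exact syntax may vary across 8.x releases:

```yaml
filebeat.inputs:
  - type: filestream
    id: app-logs-migrated   # illustrative, must be unique
    # Take over the registry state of a previous log/container
    # input that harvested the same paths, so files are not
    # re-read from the beginning (which would duplicate data).
    take_over: true
    paths:
      - /var/log/app/*.log
```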
@belimawr I am migrating from container input to filestream |
That really changes things and gives us a clear path to investigate. If you look at your data, were the duplicated events ingested far apart in time? One of my guesses is that the state migration from the log (container) input to filestream is not working as expected in this case.
@germain05 I talked with the team and there are a few things to notice:
If I have time this week, I'll try to dig a bit more into this issue.
The PR introducing the take_over mode might also be useful context. If you still have some logs from when you first changed to filestream, they will likely help you understand whether the state migration is happening correctly.
No, they are next to each other, even at the same time.
I was already aware of the issue regarding the migration from the log (container) input to filestream. Your willingness to investigate this issue further is much appreciated. We'll be eagerly awaiting any updates or findings you might uncover.
@rdner pointed out what is happening: the take_over mode, as it is implemented, only works while Filebeat is starting, before the inputs are started, because it needs exclusive access to the registry. This means it will not work with any dynamic loading of inputs (as autodiscover does); the take_over simply does not run. That leads to data duplication because the file states are not migrated from the log input to the filestream input.
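To make the failure mode concrete, here is a minimal sketch of the kind of autodiscover config involved (provider settings, the ID template, and paths are illustrative); because these inputs are loaded dynamically, the take_over migration described above would not run for them:

```yaml
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      templates:
        - config:
            - type: filestream
              # A unique, container-scoped ID avoids the
              # "already exists" warning quoted in this issue.
              id: filestream-${data.kubernetes.pod.name}-${data.kubernetes.container.id}
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
              parsers:
                - container: ~
              # Container logs are usually symlinks on Kubernetes nodes.
              prospector.scanner.symlinks: true
```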
I was pretty much convinced that the issue was lying somewhere within the take_over. What will be the next step then? Is there any near-term plan for a smooth migration from the container input to filestream? And what about the log messages with duplicated IDs that you could reproduce? Any plan to fix that, even if the data in that case is not actually duplicated?
Yes, we have an open issue for that: #34393. It should be done at some point in the near future, but I can't give any specific deadline or target release.
Yes, I can reproduce it, and I re-opened the original issue, #31767, with a description of how I reproduced it. It should also be fixed in the near future.
Closing in favor of #31767 |
Hello, I read various issues regarding the data duplication error messages in Filebeat logs; however, I haven't really understood what the root cause is. Can someone please explain what the root cause actually is? Below is the error message:
{"log.level":"error","@timestamp":"2023-08-21T10:05:27.167Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'filestream-kubernetes-pod-1170a1d331b2efe193b3816759e44d789b1841d8bed791f9859d442098341d9d' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}
Below is my daemonset.yaml: