
Filestream data duplication in filebeat 8.9.1 #36379

Closed

germain05 opened this issue Aug 21, 2023 · 19 comments
Labels
Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Comments

@germain05

Hello, I have read various issues regarding the data duplication error messages in the Filebeat logs, but I haven't really understood what the root cause is. Can someone please explain what the root cause actually is? Below is the error message:

{"log.level":"error","@timestamp":"2023-08-21T10:05:27.167Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'filestream-kubernetes-pod-1170a1d331b2efe193b3816759e44d789b1841d8bed791f9859d442098341d9d' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}

Below is my Filebeat configuration:

    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          hints.enabled: true
          hints.default_config:
            # Using explicit filestream input until container input is switched to filestream, see https://github.com/elastic/beats/issues/34393
            type: filestream
            prospector.scanner.symlinks: true
            id: filestream-kubernetes-pod-${data.kubernetes.container.id}
            take_over: true
            paths:
              - /var/log/containers/*${data.kubernetes.container.id}.log
            parsers:
            # telling Filebeat to use its default container log parser configuration for processing logs from Kubernetes containers
            - container: ~ 
          add_resource_metadata:
            cronjob: false # disabled on purpose to resolve a memory leakage issue. See: https://discuss.elastic.co/t/filebeat-memory-leak-via-filebeat-autodiscover-and-200-000-goroutines/322082
            deployment: false # disabled on purpose to resolve a memory leakage issue. See: https://discuss.elastic.co/t/filebeat-memory-leak-via-filebeat-autodiscover-and-200-000-goroutines/322082
            namespace:
              enabled: true
    fields_under_root: true
    fields:
      kubernetes.cluster: {{ .Values.name }}
      kubernetes.stage: {{ (split "-" .Values.name)._1 }}
    processors:
      - add_host_metadata:
          netinfo.enabled: false
          when.not.equals.kubernetes.namespace_labels.namespace-type: application
      - drop_fields:
          fields: ['ecs.version', 'kubernetes.namespace_uid']
          ignore_missing: true
          when.not.equals.kubernetes.namespace_labels.namespace-type: application
      - drop_fields:
          fields: ['kubernetes.node.uid', 'kubernetes.pod.ip', '/^kubernetes.node.labels.*/']
          ignore_missing: true
      # the "index-name" field is used by ELK to determine the effective index
      # the effective index is "index-name" suffixed by the current day
      - copy_fields:
          fields:
            - from: kubernetes.labels.xxx/index-name
              to: index-name
          fail_on_error: false
          ignore_missing: true
          when.not.has_fields: ['index-name']
      # all applications in our namespaces will use the xxx index, if not overwritten by a label
      - add_fields:
          target: ''
          fields:
            index-name: xxx
          when:
            and:
            - not.has_fields: ['index-name']
            - or:
              - equals.kubernetes.namespace_labels.namespace-type: shared
              - equals.kubernetes.namespace_labels.namespace-type: helper
      - add_fields:
          fields:
            agent.hostname: ${HOSTNAME}
          target: ""
      - copy_fields:
          fields:
            - from: container.image.name
              to: kubernetes.container.image
          fail_on_error: false
          ignore_missing: true
          target: "kubernetes"
      - decode_json_fields:
          fields: ['message']
          overwrite_keys: true
          target: ""
      # Keep only the mentioned fields and drop the rest for index-name yyyy. 
      - include_fields:
          when:
            contains:
              index-name: yyyy
          fields: ["reason", "message", "source", "firstTimestamp", "lastTimestamp", "count", "type", "involvedObject.kind", "involvedObject.namespace", "involvedObject.name", "involvedObject.labels", "index-name", "kubernetes.cluster", "kubernetes.stage"]
      # the "tenant" field is just for convenience
      - copy_fields:
          fields:
            - from: kubernetes.namespace_labels.tenant
              to: tenant
          fail_on_error: false
          ignore_missing: true
          when.not.has_fields: ['tenant']
      # drop events without index-name, because ELK can't handle them anyway
      - drop_event:
          when.not.has_fields: ['index-name']
    output.logstash:
      hosts:
      - {{ printf "%s:%d" .Values.log_sink.address (.Values.log_sink.port | int) }}
      ssl:
        certificate_authorities:
          - "/etc/certs/ca.pem"
@belimawr
Contributor

Hi @germain05, I have a few questions to better understand the problem you're facing:

  1. What version of Filebeat are you running?
  2. Could you also share the whole manifest that deploys and configures Filebeat? You shared only the Filebeat configuration
  3. Are you actually experiencing data duplication or are you just seeing the log messages you mentioned?
  4. Have you managed to reproduce it in a small, controlled experiment? Like using Kind or Minikube with just a few pods running?
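
For reference, such a controlled experiment could start from a stripped-down version of the configuration shared above. The sketch below is only a hypothetical starting point: it keeps the autodiscover provider and the filestream hint from the original config, drops all processors, and swaps the Logstash output for a console output so any duplicates are easy to spot locally.

# Hypothetical minimal filebeat.yml for a Kind/Minikube reproduction attempt
filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      hints.enabled: true
      hints.default_config:
        type: filestream
        id: filestream-kubernetes-pod-${data.kubernetes.container.id}
        prospector.scanner.symlinks: true
        paths:
          - /var/log/containers/*${data.kubernetes.container.id}.log
        parsers:
          - container: ~

# Print events instead of shipping them, so duplicated documents are visible in the Filebeat logs
output.console:
  pretty: true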

@germain05
Author

germain05 commented Aug 23, 2023

  1. I am running the latest version, that is, 8.9.1.
  2. Below are the resources:
    daemonset.yaml:
{{- range $type, $data := .Values.addons.filebeat }}
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat-{{ $type }}
  namespace: logging
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
        # TODO Add component, version labels
    spec:
      priorityClassName: "cluster-essential"
      tolerations:
        - key: node.kubernetes.io/role
          value: master
          effect: "NoSchedule"
        - key: "CriticalAddonsOnly"
          operator: "Exists"
      nodeSelector:
        node.kubernetes.io/role: {{ $type }}
      serviceAccountName: filebeat-platform
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
      - name: filebeat
        image: docker.elastic.co/beats/filebeat:{{ $.Values.sw_versions.filebeat.version }}
        args: [
          "-c", "/etc/filebeat.yml",
          "-e",
        ]
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        securityContext:
          runAsUser: 0
          privileged: true
        resources:
        {{ toYaml $data.resources | nindent 12 }}
        volumeMounts:
        - name: config
          mountPath: /etc/filebeat.yml
          readOnly: true
          subPath: filebeat.yml
        - name: config
          mountPath: /etc/fill_index_fallback_processor.js
          readOnly: true
          subPath: fill_index_fallback_processor.js
        - name: data
          mountPath: /usr/share/filebeat/data
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: ca-certificates
          mountPath: /etc/ca
          readOnly: true
      volumes:
      - name: config
        configMap:
          defaultMode: 0600
          name: filebeat-config
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: varlog
        hostPath:
          path: /var/log
      # data folder stores a registry of read status for all files, so we don't send everything again on a Filebeat pod restart
      - name: data
        hostPath:
          path: /var/lib/filebeat-data
          type: DirectoryOrCreate
      - name: ca-certificates
        configMap:
          name: ca-certificates
{{- end }}

config.yaml:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: logging
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    {{ tpl (.Files.Get "configuration/filebeat_config.yaml") . }}

rbac.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat-platform
  namespace: logging
  labels:
    k8s-app: filebeat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: platform-filebeat
subjects:
- kind: ServiceAccount
  name: filebeat-platform
  namespace: logging
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
- apiGroups: [""]
  resources:
  - namespaces
  - pods
  - nodes
  verbs:
  - get
  - watch
  - list
- apiGroups: ["apps"]
  resources:
    - replicasets
  verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
  resources:
    - jobs
  verbs: ["get", "list", "watch"]
  3. I guess from the log message that even if there were duplicated data, we would not see it in our Kibana, since the log message seems to have been introduced by [filebeat] Check for duplicate ID for filestream metrics #35972, which only stops collecting metrics for the duplicated ID. I am not sure if I got that right. However, my root question is why we should have a duplicated ID in the first place in a config with unique IDs.

  4. Unfortunately I didn't reproduce it in a small, controlled experiment.

I hope the additional information helps.

@belimawr
Contributor

Hi @germain05,

Thanks for all the information. v8.9.1 does have the fix, so it's pretty puzzling to me that you're still facing this issue. There is always the chance that it is still a bug, triggered by another code path. The fact that you cannot reproduce it in a smaller, controlled experiment makes me believe it is a hard-to-reproduce race condition.

If there is duplicated data, you will see the duplicated documents in Kibana. The metrics for the duplicated input will not be collected because we can't have two metrics instances with the same ID.

However my root question is why in the first place should we have duplicated ID in a config with unique ID.

The ID is not duplicated; if the bug is still there, it is an odd race condition when inputs are started/stopped that is affecting our bookkeeping code.

If I have time this week, I'll try again to reproduce it.

@germain05
Author

Hi @belimawr, unfortunately I haven't had time to try to reproduce it in a small, controlled experiment. I will wait and see if you can reproduce it and can offer some explanation of why it still happens. Thanks for the effort.

@germain05
Author

@belimawr, can you please share the part of your code causing this bug? It would be nice if I could show my teammates what exactly the root cause of the issue is, so it is clear for everyone.

@belimawr
Contributor

Hi @germain05, of course I can :D

The PR fixing the issue is this one: #35134.

Here is the explanation. Bear in mind that the code I'm linking already contains the fix.

This duplicated-ID message happens because, during config validation,

func (f *factory) CheckConfig(cfg *conf.C) error {
    _, err := f.loader.Configure(cfg)
    if err != nil {
        return fmt.Errorf("runner factory could not check config: %w", err)
    }
    if err = f.loader.Delete(cfg); err != nil {
        return fmt.Errorf(
            "runner factory failed to delete an input after config check: %w",
            err)
    }
    return nil
}
an instance of the input is created by this line:
    _, err := f.loader.Configure(cfg)
However, because the input is never properly started, it was also never stopped.

When the Filestream input is instantiated, we add its ID to a map in the input manager:

cim.idsMux.Lock()
if _, exists := cim.ids[settings.ID]; exists {
    cim.Logger.Errorf("filestream input with ID '%s' already exists, this "+
        "will lead to data duplication, please use a different ID. Metrics "+
        "collection has been disabled on this input.", settings.ID)
    metricsID = ""
}
// TODO: improve how inputs with empty IDs are tracked.
// https://github.com/elastic/beats/issues/35202
cim.ids[settings.ID] = struct{}{}
cim.idsMux.Unlock()
This entry was only removed when the input was stopped:
func (cim *InputManager) StopInput(id string) {
    cim.idsMux.Lock()
    delete(cim.ids, id)
    cim.idsMux.Unlock()
}

The PR I linked above fixes it. At least in our tests we didn't manage to reproduce the issue after the fix, and we have seen it working in many deployments.
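
To make the bookkeeping problem easier to picture in isolation, here is a simplified, self-contained sketch (not the actual Beats code) of how an ID that is registered during a config check, but never unregistered, later produces the spurious duplicate-ID error when the same input is started for real:

package main

import (
    "fmt"
    "sync"
)

// idRegistry mimics the input manager's bookkeeping: it remembers which
// filestream input IDs have been seen so it can warn about duplicates.
type idRegistry struct {
    mu  sync.Mutex
    ids map[string]struct{}
}

// register is called whenever an input instance is created.
func (r *idRegistry) register(id string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    if _, exists := r.ids[id]; exists {
        fmt.Printf("filestream input with ID '%s' already exists, this will lead to data duplication\n", id)
    }
    r.ids[id] = struct{}{}
}

// unregister is called when an input is stopped.
func (r *idRegistry) unregister(id string) {
    r.mu.Lock()
    defer r.mu.Unlock()
    delete(r.ids, id)
}

func main() {
    reg := &idRegistry{ids: map[string]struct{}{}}

    // Config validation creates a throw-away input instance...
    reg.register("filestream-kubernetes-pod-abc")
    // ...but before the fix that instance was never started, so unregister
    // was never called and the ID stayed behind in the map.

    // When the input is later started for real, the stale entry triggers the
    // "already exists" error even though only one input is actually running.
    reg.register("filestream-kubernetes-pod-abc")
}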

Another possible cause was the autodiscover code generating too many "duplicated" events, which could lead to inputs being started/stopped multiple times within a few seconds and leave Filebeat in an odd state (mostly related to the Log input). This has also been resolved: #35645.

@germain05
Author

Hi @belimawr,

I want to let you know that we could confirm we actually have duplicate data, and it is shipped to Kibana. Is it possible to investigate this issue further?

@germain05
Author

Hi, could it be that this issue, #36378, is related to the issue I have here?

@belimawr
Contributor

I managed to do some testing, and I can confirm that I'm able to reproduce the log messages about duplicated IDs with Filebeat v8.9.2:

{"log.level":"error","@timestamp":"2023-09-12T10:29:19.964Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'filestream-kubernetes-pod-9a085d002a005aa2f45719f19c614e65c66a95e4470be00cb1084a813f145331' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-09-12T10:29:19.965Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'filestream-kubernetes-pod-fe8d74e8ac8625dfd2e38e82b48d8f7601c209ff33e6587c8b7e2e482e9031bb' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-09-12T10:29:19.965Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'filestream-kubernetes-pod-1a419927532441b1caa4025edc7ac059beb299de38e1bd8fa6b53ea2aff76682' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-09-12T10:29:19.965Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'filestream-kubernetes-pod-73771409bfd55231e2aac5cef85e0a8250a241fcafe0ed342f1d3f063b532590' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-09-12T10:29:19.966Z","log.logger":"input","log.origin":{"file.name":"input-logfile/manager.go","file.line":183},"message":"filestream input with ID 'filestream-kubernetes-pod-acb252d8f2ed802e6b1d12291af93262018973d307049bde875e43634a7d6307' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}

However, I did not manage to reproduce any data duplication.

#36378 is an interesting new data point on your case, @germain05.

Are you migrating from the log/container input to the filestream input using the take_over mode? The take_over mode is definitely a new variable that is worth investigating.

It would be very helpful if you could consistently reproduce it and describe how to reproduce it in a controlled environment.

@germain05
Author

@belimawr I am migrating from the container input to filestream.

@belimawr
Contributor

That really changes things and gives us a clear path to investigate.

If you look at your data, were the duplicated events ingested far apart in time? One of my guesses is that the state migration from the log (container) input to filestream is not working as expected in this case.

@belimawr
Contributor

@germain05 I talked with the team and there are a few things to note:

  1. Using the take_over mode to migrate from the container input to filestream is not officially supported, hence it's not mentioned in our documentation.
  2. We have an issue to migrate the container input to filestream; it will happen transparently to users, since how those inputs run under the hood is an implementation detail. Here is the issue: Migrate container input to use filestream instead of log input #34393
  3. As mentioned in the issue, the container input extends the state; this extra information is not supported by the take_over mode yet, so I could see this causing some unexpected behaviour.

If I have time this week, I'll try to dig a bit more into this issue.

@belimawr
Contributor

The PR introducing the take_over mode details how to test it and lists many debug log messages that show the migration process working: #34292

If you still have some logs from when you first changed to filestream, that will likely help you to understand if the migration of the state is happening correctly.

@germain05
Author

That really changes things and gives us a clear path to investigate.

If you look at your data, were the duplicated events ingested far apart in time? One of my guesses is that the state migration from the log (container) input to filestream is not working as expected in this case.

No, they are next to each other, even at the same time.

@germain05
Author

@germain05 I talked with the team and there are a few things to note:

  1. Using the take_over mode to migrate from the container input to filestream is not officially supported, hence it's not mentioned in our documentation.
  2. We have an issue to migrate the container input to filestream; it will happen transparently to users, since how those inputs run under the hood is an implementation detail. Here is the issue: Migrate container input to use filestream instead of log input #34393
  3. As mentioned in the issue, the container input extends the state; this extra information is not supported by the take_over mode yet, so I could see this causing some unexpected behaviour.

If I have time this week, I'll try to dig a bit more into this issue.

I was already aware of the issue regarding the migration from the container input to filestream; you can see that it is mentioned in a comment in the Filebeat config I shared with you. We tried this because we have been stuck at Filebeat v7.12.0 for a very long time. We've been dealing with high CPU and memory consumption issues on Filebeat versions beyond 7.12.0, which led us to explore alternative solutions. We also looked into the solutions for most of the issues regarding CPU and memory leakage in Filebeat, and none of them was really helpful. Apart from the data duplication, the resource consumption looks quite good with the filestream input, so it would be nice if you could really look deeper into the issue.

Your willingness to investigate this issue further is much appreciated. We'll be eagerly awaiting any updates or findings you might uncover.

@belimawr
Contributor

@rdner pointed out what is happening: The take_over mode, as it is implemented, only works when Filebeat is starting, before the inputs are started, because it needs exclusive access to the registry. This means that any dynamic loading of inputs (like autodiscover does) will not work; the take_over will simply not run.

This will lead to data duplication because the file states are not migrated from the log input to the filestream input.
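
For context, this means take_over only has a chance to run for inputs that already exist in the configuration when Filebeat starts, for example a statically defined filestream input like the hypothetical sketch below, whereas inputs created on the fly by autodiscover (as in the configuration at the top of this issue) are loaded dynamically and skip the state migration entirely.

# Hypothetical static input: because it is defined directly in filebeat.yml,
# it is present when Filebeat starts, so take_over can attempt to migrate the
# old log/container input states for these paths.
filebeat.inputs:
  - type: filestream
    id: static-container-logs
    take_over: true
    paths:
      - /var/log/containers/*.log
    parsers:
      - container: ~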

@germain05
Author

@rdner pointed out what is happening: The take_over mode, as it is implemented, only works when Filebeat is starting, before the inputs are started, because it needs exclusive access to the registry. This means that any dynamic loading of inputs (like autodiscover does) will not work; the take_over will simply not run.

This will lead to data duplication because the file states are not migrated from the log input to the filestream input.

I was pretty much convinced that the issue was lying somewhere within the take_over. What will be the next step then? Is there any near-term plan for a smooth migration from the container input to filestream?

What about the log messages with duplicated IDs that you could reproduce? Is there any plan to fix that, even if the data in that case is not actually duplicated?

@belimawr
Contributor

belimawr commented Sep 13, 2023

What will be the next step then? Is there any near-term plan for a smooth migration from the container input to filestream?

Yes, we have an open issue for that: #34393. It should be done at some point in the near future, but I can't give any specific deadline or target release.

What about the log messages with duplicated IDs that you could reproduce? Any plan to fix that?

Yes, I can reproduce it, and I re-opened the original issue (#31767) with a description of how I reproduced it. It should also be fixed in the near future.

@pierrehilbert
Collaborator

Closing in favor of #31767

@pierrehilbert closed this as not planned on Nov 10, 2023