
filebeat: input v2 compat uses random ID for CheckConfig #41585

Merged (10 commits, Nov 14, 2024)

Conversation

AndersonQ
Member

@AndersonQ AndersonQ commented Nov 11, 2024

Proposed commit message

filebeat: input v2 compat uses random ID for CheckConfig

The CheckConfig function validates a configuration by creating and immediately discarding an input. However, a potential conflict arises when CheckConfig is used with autodiscover in Kubernetes.

Autodiscover accumulates configuration changes and applies them in batches. This can be problematic if a stop event for a pod is closely followed by a start event for the same pod (e.g., during a pod restart) before the inputs are reloaded. In this scenario, autodiscover might attempt to validate the configuration for the start event while the input for the pod is still running. This would lead the filestream input manager to see two inputs with the same ID, triggering a log warning.

Although this situation generates a warning, it doesn't result in data duplication, since the second input is only created to validate the configuration and is discarded afterwards. Also, the reload process ensures that only new inputs are created; any input already running won't be duplicated.
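The idea behind the fix can be sketched as below: when CheckConfig builds its throwaway validation input, the configured ID gets a random digit appended so it can never collide with an already-running input. This is a minimal sketch only (the helper name `checkConfigID` is made up for illustration); the real change sets the randomized ID on the parsed config before creating the validation input.

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
)

// checkConfigID derives the ID used for the temporary input that
// CheckConfig creates and immediately discards: the user-configured ID
// plus a random digit, so the validation input never shares an ID with
// a running input. (Hypothetical helper name, for illustration.)
func checkConfigID(configuredID string) string {
	// math/rand is sufficient here; crypto-strength randomness isn't needed,
	// the suffix only has to break the ID collision during validation.
	return configuredID + strconv.Itoa(rand.Intn(10))
}

func main() {
	fmt.Println(checkConfigID("filestream-kubernetes-pod-abc123"))
}
```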

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

  • N/A

How to test this PR locally

Run a kind cluster:

  • kind create cluster
  • select it: kubectl config use-context kind-kind
  • build filebeat docker image: DEV=true SNAPSHOT=true TEST_PLATFORMS="linux/amd64" TEST_PACKAGES="docker" mage package
  • add the docker image to kind: kind load docker-image docker.elastic.co/beats/filebeat:9.0.0-SNAPSHOT
  • start filebeat in the cluster: kubectl apply -f k8s.yaml
k8s.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - namespaces
      - pods
      - nodes
    verbs:
      - get
      - watch
      - list
  - apiGroups: ["apps"]
    resources:
      - replicasets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: filebeat
  # should be the namespace where filebeat is running
  namespace: kube-system
  labels:
    k8s-app: filebeat
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: filebeat-kubeadm-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    resourceNames:
      - kubeadm-config
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
  - kind: ServiceAccount
    name: filebeat
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: filebeat
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: filebeat
    namespace: kube-system
roleRef:
  kind: Role
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: filebeat-kubeadm-config
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: filebeat
    namespace: kube-system
roleRef:
  kind: Role
  name: filebeat-kubeadm-config
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          hints.enabled: true
          hints.default_config:
            type: filestream
            prospector.scanner.symlinks: true
            id: filestream-kubernetes-pod-${data.kubernetes.container.id}
            take_over: true
            paths:
              - /var/log/containers/*${data.kubernetes.container.id}.log
            parsers:
              - container: ~
    processors:
      - add_cloud_metadata:
      - add_host_metadata:
    logging:
      level: debug
    output.elasticsearch:
      hosts: ["https://localhost:9200"] # you can leave it failing, no need for a real ES
      protocol: "https"
      allow_older_versions: true
    #      username: "elastic"
    #      password: "changeme"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:9.0.0-SNAPSHOT
          args: [
            "-c", "/etc/filebeat.yml",
            "-e",
          ]
          env:
            - name: ELASTICSEARCH_HOST
              value: elasticsearch
            - name: ELASTICSEARCH_PORT
              value: "9200"
            - name: ELASTICSEARCH_USERNAME
              value: elastic
            - name: ELASTICSEARCH_PASSWORD
              value: changeme
            - name: ELASTIC_CLOUD_ID
              value: "cloud-id"
            - name: ELASTIC_CLOUD_AUTH
              value: "aa:bb"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            #privileged: true
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: config
              mountPath: /etc/filebeat.yml
              readOnly: true
              subPath: filebeat.yml
            - name: data
              mountPath: /usr/share/filebeat/data
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: config
          configMap:
            defaultMode: 0640
            name: filebeat-config
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: varlog
          hostPath:
            path: /var/log
        # data folder stores a registry of read status for all files, so we don't send everything again on a Filebeat pod restart
        - name: data
          hostPath:
            # When filebeat runs as non-root user, this directory needs to be writable by group (g+w).
            path: /var/lib/filebeat-data
            type: DirectoryOrCreate
  • start 2 busybox pods: kubectl run busybox1 --image=busybox, kubectl run busybox2 --image=busybox
  • wait, you should not see an error like:
filestream input with ID 'filestream-kubernetes-pod-3f834ee8c5d20f465e3a75433c9bd58a9968e154c32c92d1ef1948797820a64a' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.

{"log.level":"error","@timestamp":"2024-11-08T14:14:53.021Z","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.(*InputManager).Create","file.name":"input-logfile/manager.go","file.line":174},"message":"filestream input with ID 'filestream-kubernetes-pod-3f834ee8c5d20f465e3a75433c9bd58a9968e154c32c92d1ef1948797820a64a' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}
  • repeat the process with the current filebeat release, you'll see the warning
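For convenience, the reproduction steps above end to end (image tag taken from the steps; assumes kind, kubectl, and mage are installed; the final logs command is an addition to watch for the warning):

```shell
# spin up a local cluster and point kubectl at it
kind create cluster
kubectl config use-context kind-kind

# build the filebeat docker image and load it into kind
DEV=true SNAPSHOT=true TEST_PLATFORMS="linux/amd64" TEST_PACKAGES="docker" mage package
kind load docker-image docker.elastic.co/beats/filebeat:9.0.0-SNAPSHOT

# deploy filebeat and two busybox pods, then watch filebeat's logs
kubectl apply -f k8s.yaml
kubectl run busybox1 --image=busybox
kubectl run busybox2 --image=busybox
kubectl logs -n kube-system -l k8s-app=filebeat -f
```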

Related issues

Use cases

filebeat on Kubernetes using autodiscover

@AndersonQ AndersonQ requested a review from a team as a code owner November 11, 2024 15:07
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 11, 2024
@AndersonQ AndersonQ self-assigned this Nov 11, 2024
@AndersonQ AndersonQ added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 11, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 11, 2024
@AndersonQ AndersonQ added bug needs_team Indicates that the issue/PR needs a Team:* label labels Nov 11, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 11, 2024
Contributor

mergify bot commented Nov 11, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @AndersonQ? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch (\d is the minor version digit)

Contributor

mergify bot commented Nov 11, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 11, 2024
@AndersonQ AndersonQ force-pushed the 31767-filestream-id-already-exists branch from 12cc1ab to 32f04e3 Compare November 11, 2024 15:12
@@ -48,6 +48,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Change log.file.path field in awscloudwatch input to nested object. {pull}41099[41099]
- Remove deprecated awscloudwatch field from Filebeat. {pull}41089[41089]
- The performance of ingesting SQS data with the S3 input has improved by up to 60x for queues with many small events. `max_number_of_messages` config for SQS mode is now ignored, as the new design no longer needs a manual cap on messages. Instead, use `number_of_workers` to scale ingestion rate in both S3 and SQS modes. The increased efficiency may increase network bandwidth consumption, which can be throttled by lowering `number_of_workers`. It may also increase number of events stored in memory, which can be throttled by lowering the configured size of the internal queue. {pull}40699[40699]
- Fixes filestream logging the error "filestream input with ID 'ID' already exists, this will lead to data duplication[...]" on Kubernetes when using autodiscover,
Contributor

The issue/PR links are missing, also the line ends in a comma, did you forget to add/commit the rest of the line?

}

// using math/rand for performance (crypto-strength randomness isn't needed); append a random digit 0-9
err = testCfg.SetString("inputID", -1, inputID+strconv.Itoa(rand.Intn(10)))
Contributor

Question:
Do you think adding a single digit at the end of the string is enough? I believe it's better to use something longer to reduce the chance of collision.

@belimawr
Contributor

I agree with the fix in the sense that CheckConfig should not conflict and log errors even if the ID is duplicated, because having a unique ID is not required. So I believe the scope of this PR is correct.

However the scenario you described to test the PR makes me wonder if the Kubernetes autodiscover is working as expected. Starting a new pod should not re-trigger the input start for an existing pod. That reminds me of an old issue where pod start/stop events were emitted when processing other Kubernetes events: #34717. I wonder if there was a regression, or there is another cause for this behaviour now.

@AndersonQ
Copy link
Member Author

I agree with the fix in the sense that CheckConfig should not conflict and log errors even if the ID is duplicated, because having a unique ID is not required. So I believe the scope of this PR is correct.

However the scenario you described to test the PR makes me wonder if the Kubernetes autodiscover is working as expected. Starting a new pod should not re-trigger the input start for an existing pod. That reminds me of an old issue where pod start/stop events were emitted when processing other Kubernetes events: #34717. I wonder if there was a regression, or there is another cause for this behaviour now.

that is a good point. But here, in the scenario used to reproduce this issue, it isn't a new pod start triggering a start for an existing pod. It's the same pod: a stop and then a start event are triggered while it's in CrashLoopBackOff.

In order to investigate a possible issue with autodiscover, first it'd be necessary to find out whether the start/stop events for the pod in CrashLoopBackOff are expected.

Anyway, that's a separate issue.

@AndersonQ AndersonQ force-pushed the 31767-filestream-id-already-exists branch from 95c0b33 to 67bcd91 Compare November 12, 2024 10:56
Review threads on filebeat/input/v2/compat/compat.go resolved.
@AndersonQ AndersonQ merged commit 697ede4 into elastic:main Nov 14, 2024
31 checks passed
@AndersonQ AndersonQ deleted the 31767-filestream-id-already-exists branch November 14, 2024 13:08
@AndersonQ AndersonQ added the backport-8.16 Automated backport with mergify label Nov 14, 2024
mergify bot pushed a commit that referenced this pull request Nov 14, 2024
(cherry picked from commit 697ede4)
mergify bot pushed a commit that referenced this pull request Nov 14, 2024
(cherry picked from commit 697ede4)
Kavindu-Dodan pushed a commit to Kavindu-Dodan/beats that referenced this pull request Nov 14, 2024
AndersonQ added a commit that referenced this pull request Nov 18, 2024
…1641)

(cherry picked from commit 697ede4)

Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
AndersonQ added a commit that referenced this pull request Nov 22, 2024
…CheckConfig (#41642)

* filebeat: input v2 compat uses random ID for CheckConfig (#41585)

(cherry picked from commit 697ede4)


Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Development

Successfully merging this pull request may close these issues.

filestream input logs an error when an existing input is reloaded with the same ID
5 participants