
filebeat: input v2 compat uses random ID for CheckConfig #41585

Merged (10 commits, Nov 14, 2024)

Conversation

AndersonQ
Member

@AndersonQ AndersonQ commented Nov 11, 2024

Proposed commit message

filebeat: input v2 compat uses random ID for CheckConfig

The CheckConfig function validates a configuration by creating and immediately discarding an input. However, a potential conflict arises when CheckConfig is used with autodiscover in Kubernetes.

Autodiscover accumulates configuration changes and applies them in batches. This can be problematic if a stop event for a pod is closely followed by a start event for the same pod (e.g., during a pod restart) before the inputs are reloaded. In this scenario, autodiscover might attempt to validate the configuration for the start event while the input for the pod is still running. This would lead the filestream input manager to see two inputs with the same ID, triggering a log warning.

Although this situation generates a warning, it doesn't result in data duplication, since the second input is only created to validate the configuration and is discarded afterwards. Also, the reload process ensures that only new inputs are created; any input already running won't be duplicated.
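The idea behind the fix can be sketched as below: when CheckConfig builds its throwaway validation input, the configured ID gets a random digit appended so it can never collide with an already-running input. This is a minimal sketch only (the helper name `checkConfigID` is made up for illustration); the real change sets the randomized ID on the parsed config before creating the validation input.

```go
package main

import (
	"fmt"
	"math/rand"
	"strconv"
)

// checkConfigID derives the ID used for the temporary input that
// CheckConfig creates and immediately discards: the user-configured ID
// plus a random digit, so the validation input never shares an ID with
// a running input. (Hypothetical helper name, for illustration.)
func checkConfigID(configuredID string) string {
	// math/rand is sufficient here; crypto-strength randomness isn't needed,
	// the suffix only has to break the ID collision during validation.
	return configuredID + strconv.Itoa(rand.Intn(10))
}

func main() {
	fmt.Println(checkConfigID("filestream-kubernetes-pod-abc123"))
}
```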

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Disruptive User Impact

  • N/A

How to test this PR locally

Run a kind cluster:

  • kind create cluster
  • select it: kubectl config use-context kind-kind
  • build filebeat docker image: DEV=true SNAPSHOT=true TEST_PLATFORMS="linux/amd64" TEST_PACKAGES="docker" mage package
  • add the docker image to kind: kind load docker-image docker.elastic.co/beats/filebeat:9.0.0-SNAPSHOT
  • start filebeat in the cluster: kubectl apply -f k8s.yaml
k8s.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
  labels:
    k8s-app: filebeat
rules:
  - apiGroups: [""] # "" indicates the core API group
    resources:
      - namespaces
      - pods
      - nodes
    verbs:
      - get
      - watch
      - list
  - apiGroups: ["apps"]
    resources:
      - replicasets
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: filebeat
  # should be the namespace where filebeat is running
  namespace: kube-system
  labels:
    k8s-app: filebeat
rules:
  - apiGroups:
      - coordination.k8s.io
    resources:
      - leases
    verbs: ["get", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: filebeat-kubeadm-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
rules:
  - apiGroups: [""]
    resources:
      - configmaps
    resourceNames:
      - kubeadm-config
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
  - kind: ServiceAccount
    name: filebeat
    namespace: kube-system
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: filebeat
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: filebeat
    namespace: kube-system
roleRef:
  kind: Role
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: filebeat-kubeadm-config
  namespace: kube-system
subjects:
  - kind: ServiceAccount
    name: filebeat
    namespace: kube-system
roleRef:
  kind: Role
  name: filebeat-kubeadm-config
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-config
  namespace: kube-system
  labels:
    k8s-app: filebeat
data:
  filebeat.yml: |-
    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          hints.enabled: true
          hints.default_config:
            type: filestream
            prospector.scanner.symlinks: true
            id: filestream-kubernetes-pod-${data.kubernetes.container.id}
            take_over: true
            paths:
              - /var/log/containers/*${data.kubernetes.container.id}.log
            parsers:
              - container: ~
    processors:
      - add_cloud_metadata:
      - add_host_metadata:
    logging:
      level: debug
    output.elasticsearch:
      hosts: ["https://localhost:9200"] # you can leave it failing, no need for a real ES
      protocol: "https"
      allow_older_versions: true
    #      username: "elastic"
    #      password: "changeme"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: filebeat
  namespace: kube-system
  labels:
    k8s-app: filebeat
spec:
  selector:
    matchLabels:
      k8s-app: filebeat
  template:
    metadata:
      labels:
        k8s-app: filebeat
    spec:
      serviceAccountName: filebeat
      terminationGracePeriodSeconds: 30
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: filebeat
          image: docker.elastic.co/beats/filebeat:9.0.0-SNAPSHOT
          args: [
            "-c", "/etc/filebeat.yml",
            "-e",
          ]
          env:
            - name: ELASTICSEARCH_HOST
              value: elasticsearch
            - name: ELASTICSEARCH_PORT
              value: "9200"
            - name: ELASTICSEARCH_USERNAME
              value: elastic
            - name: ELASTICSEARCH_PASSWORD
              value: changeme
            - name: ELASTIC_CLOUD_ID
              value: "cloud-id"
            - name: ELASTIC_CLOUD_AUTH
              value: "aa:bb"
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            #privileged: true
          resources:
            limits:
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          volumeMounts:
            - name: config
              mountPath: /etc/filebeat.yml
              readOnly: true
              subPath: filebeat.yml
            - name: data
              mountPath: /usr/share/filebeat/data
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: config
          configMap:
            defaultMode: 0640
            name: filebeat-config
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: varlog
          hostPath:
            path: /var/log
        # data folder stores a registry of read status for all files, so we don't send everything again on a Filebeat pod restart
        - name: data
          hostPath:
            # When filebeat runs as non-root user, this directory needs to be writable by group (g+w).
            path: /var/lib/filebeat-data
            type: DirectoryOrCreate
  • start 2 busybox pods: kubectl run busybox1 --image=busybox, kubectl run busybox2 --image=busybox
  • wait, you should not see an error like:
filestream input with ID 'filestream-kubernetes-pod-3f834ee8c5d20f465e3a75433c9bd58a9968e154c32c92d1ef1948797820a64a' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.

{"log.level":"error","@timestamp":"2024-11-08T14:14:53.021Z","log.logger":"input","log.origin":{"function":"github.com/elastic/beats/v7/filebeat/input/filestream/internal/input-logfile.(*InputManager).Create","file.name":"input-logfile/manager.go","file.line":174},"message":"filestream input with ID 'filestream-kubernetes-pod-3f834ee8c5d20f465e3a75433c9bd58a9968e154c32c92d1ef1948797820a64a' already exists, this will lead to data duplication, please use a different ID. Metrics collection has been disabled on this input.","service.name":"filebeat","ecs.version":"1.6.0"}
  • repeat the process with the current filebeat release, you'll see the warning
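For convenience, the reproduction steps above end to end (image tag taken from the steps; assumes kind, kubectl, and mage are installed; the final logs command is an addition to watch for the warning):

```shell
# spin up a local cluster and point kubectl at it
kind create cluster
kubectl config use-context kind-kind

# build the filebeat docker image and load it into kind
DEV=true SNAPSHOT=true TEST_PLATFORMS="linux/amd64" TEST_PACKAGES="docker" mage package
kind load docker-image docker.elastic.co/beats/filebeat:9.0.0-SNAPSHOT

# deploy filebeat and two busybox pods, then watch filebeat's logs
kubectl apply -f k8s.yaml
kubectl run busybox1 --image=busybox
kubectl run busybox2 --image=busybox
kubectl logs -n kube-system -l k8s-app=filebeat -f
```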

Related issues

Use cases

filebeat on Kubernetes using autodiscover

@AndersonQ AndersonQ requested a review from a team as a code owner November 11, 2024 15:07
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Nov 11, 2024
@AndersonQ AndersonQ self-assigned this Nov 11, 2024
@AndersonQ AndersonQ added the Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team label Nov 11, 2024
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 11, 2024
@AndersonQ AndersonQ added bug needs_team Indicates that the issue/PR needs a Team:* label labels Nov 11, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Nov 11, 2024
Contributor

mergify bot commented Nov 11, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @AndersonQ? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8.\d is the label to automatically backport to the 8.\d branch (\d is the minor version digit)

Contributor

mergify bot commented Nov 11, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Nov 11, 2024
@AndersonQ AndersonQ force-pushed the 31767-filestream-id-already-exists branch from 12cc1ab to 32f04e3 Compare November 11, 2024 15:12
@@ -48,6 +48,7 @@ https://github.com/elastic/beats/compare/v8.8.1\...main[Check the HEAD diff]
- Change log.file.path field in awscloudwatch input to nested object. {pull}41099[41099]
- Remove deprecated awscloudwatch field from Filebeat. {pull}41089[41089]
- The performance of ingesting SQS data with the S3 input has improved by up to 60x for queues with many small events. `max_number_of_messages` config for SQS mode is now ignored, as the new design no longer needs a manual cap on messages. Instead, use `number_of_workers` to scale ingestion rate in both S3 and SQS modes. The increased efficiency may increase network bandwidth consumption, which can be throttled by lowering `number_of_workers`. It may also increase number of events stored in memory, which can be throttled by lowering the configured size of the internal queue. {pull}40699[40699]
- Fixes filestream logging the error "filestream input with ID 'ID' already exists, this will lead to data duplication[...]" on Kubernetes when using autodiscover,
Contributor

The issue/PR links are missing, also the line ends in a comma, did you forget to add/commit the rest of the line?

}

// using math/rand for performance (crypto-strength randomness isn't needed); append a random digit 0-9
err = testCfg.SetString("inputID", -1, inputID+strconv.Itoa(rand.Intn(10)))
Contributor

Question:
Do you think adding a single digit at the end of the string is enough? I believe it's better to use something longer to reduce the chance of collision.

@belimawr
Contributor

I agree with the fix in the sense that CheckConfig should not conflict and log errors even if the ID is duplicated, because having a unique ID is not required. So I believe the scope of this PR is correct.

However the scenario you described to test the PR makes me wonder if the Kubernetes autodiscover is working as expected. Starting a new pod should not re-trigger the input start for an existing pod. That reminds me of an old issue where pod start/stop events were emitted when processing other Kubernetes events: #34717. I wonder if there was a regression, or there is another cause for this behaviour now.

@AndersonQ
Copy link
Member Author

I agree with the fix in the sense that CheckConfig should not conflict and log errors even if the ID is duplicated, because having a unique ID is not required. So I believe the scope of this PR is correct.

However the scenario you described to test the PR makes me wonder if the Kubernetes autodiscover is working as expected. Starting a new pod should not re-trigger the input start for an existing pod. That reminds me of an old issue where pod start/stop events were emitted when processing other Kubernetes events: #34717. I wonder if there was a regression, or there is another cause for this behaviour now.

that is a good point. But here, in the scenario used to reproduce this issue, it isn't a new pod start triggering a start for an existing pod. It's the same pod: a stop and then a start event are triggered while it's in CrashLoopBackOff.

In order to investigate a possible issue with autodiscover, first it'd be necessary to find out whether the start/stop events for the pod in CrashLoopBackOff are expected.

Anyway, that's a separate issue.

@AndersonQ AndersonQ force-pushed the 31767-filestream-id-already-exists branch from 95c0b33 to 67bcd91 Compare November 12, 2024 10:56
Review threads on filebeat/input/v2/compat/compat.go resolved.
@AndersonQ AndersonQ merged commit 697ede4 into elastic:main Nov 14, 2024
31 checks passed
@AndersonQ AndersonQ deleted the 31767-filestream-id-already-exists branch November 14, 2024 13:08
@AndersonQ AndersonQ added the backport-8.16 Automated backport with mergify label Nov 14, 2024
mergify bot pushed a commit that referenced this pull request Nov 14, 2024
(cherry picked from commit 697ede4)
mergify bot pushed a commit that referenced this pull request Nov 14, 2024
(cherry picked from commit 697ede4)
Kavindu-Dodan pushed a commit to Kavindu-Dodan/beats that referenced this pull request Nov 14, 2024
AndersonQ added a commit that referenced this pull request Nov 18, 2024
…1641)

(cherry picked from commit 697ede4)

Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
AndersonQ added a commit that referenced this pull request Nov 22, 2024
…CheckConfig (#41642)

* filebeat: input v2 compat uses random ID for CheckConfig (#41585)

(cherry picked from commit 697ede4)


Co-authored-by: Anderson Queiroz <anderson.queiroz@elastic.co>
Labels
backport-8.x Automated backport to the 8.x branch with mergify backport-8.16 Automated backport with mergify bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Development

Successfully merging this pull request may close these issues.

filestream input logs an error when an existing input is reloaded with the same ID
5 participants