Make Autodiscover more selective when watching K8s events #34717

gbanasiak · 2023-03-02T12:54:27Z

Setup

Observed in Filebeat 8.5.3 running in Azure Kubernetes v1.24.6 with the following autodiscover configuration:

filebeat:
  autodiscover:
    providers:
    - templates:
      - condition:
          or:
          - equals:
              kubernetes:
                container:
                  name: elasticsearch
          - equals:
              kubernetes:
                container:
                  name: kibana
        config:
        - paths:
          - /var/log/pods/${data.kubernetes.namespace}_${data.kubernetes.pod.name}_${data.kubernetes.pod.uid}/${data.kubernetes.container.name}/*.log
          processors:
          - decode_json_fields:
              add_error_key: true
              fields:
              - message
              max_depth: 1
              overwrite_keys: true
              process_array: false
              target: ""
          type: container
      type: kubernetes

Context

While addressing #23139 Node and Namespace watchers were added in Autodicover to make Node/Namespace metadata changes instantly available to the related Pods. Currently, when a Node modification is received via a watcher, all Pods located on this Node are retrieved from a watcher store (src) and stop and start Pod events are emitted (src). A single Pod can emit multiple start and stop events (src). This happens regardless of the nature of the change of the Node object.

In the environment specified in Setup section, it was observed that Node objects can be updated as frequently as every 10 seconds. Filebeat logs do not allow to determine what exactly was received via watch APIs, but this can be verified as follows:

kubectl proxy &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/namespaces?allowWatchBookmarks=true&watch=true&pretty=false' > namespaces.log &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/nodes?allowWatchBookmarks=true&watch=true&pretty=false' > nodes.log &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/pods?allowWatchBookmarks=true&watch=true&pretty=false' > pods.log &

Node objects were modified in .status.conditions[] array which is related to Node monitoring process. The only field modified was lastHeartbeatTime. Here's an example of message received via watch API:

{
    "type": "MODIFIED",
    "object": {
        "kind": "Node",
        "apiVersion": "v1",
        "metadata": {
            "name": "<REDACTED>",  ...
        },
        "spec": { ...
        },
        "status": {
            "capacity": { ...
            },
            "allocatable": { ...
            },
            "conditions": [
                {
                    "type": "FilesystemCorruptionProblem",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "FilesystemIsOK",
                    "message": "Filesystem is healthy"
                },
                {
                    "type": "ContainerRuntimeProblem",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:37:25Z",
                    "reason": "ContainerRuntimeIsUp",
                    "message": "container runtime service is up"
                },
                {
                    "type": "KubeletProblem",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:37:25Z",
                    "reason": "KubeletIsUp",
                    "message": "kubelet service is up"
                },
                {
                    "type": "FrequentDockerRestart",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "NoFrequentDockerRestart",
                    "message": "docker is functioning properly"
                },
                {
                    "type": "FrequentUnregisterNetDevice",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "NoFrequentUnregisterNetDevice",
                    "message": "node is functioning properly"
                },
                {
                    "type": "VMEventScheduled",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:37:45Z",
                    "reason": "NoVMEventScheduled",
                    "message": "VM has no scheduled event"
                },
                {
                    "type": "ReadonlyFilesystem",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "FilesystemIsNotReadOnly",
                    "message": "Filesystem is not read-only"
                },
                {
                    "type": "FrequentContainerdRestart",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "NoFrequentContainerdRestart",
                    "message": "containerd is functioning properly"
                },
                {
                    "type": "KernelDeadlock",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "KernelHasNoDeadlock",
                    "message": "kernel has no deadlock"
                },
                {
                    "type": "FrequentKubeletRestart",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "NoFrequentKubeletRestart",
                    "message": "kubelet is functioning properly"
                },
                {
                    "type": "MemoryPressure",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:38:18Z",
                    "reason": "KubeletHasSufficientMemory",
                    "message": "kubelet has sufficient memory available"
                },
                {
                    "type": "DiskPressure",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:38:18Z",
                    "reason": "KubeletHasNoDiskPressure",
                    "message": "kubelet has no disk pressure"
                },
                {
                    "type": "PIDPressure",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:38:18Z",
                    "reason": "KubeletHasSufficientPID",
                    "message": "kubelet has sufficient PID available"
                },
                {
                    "type": "Ready",
                    "status": "True",
                    "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:38:18Z",
                    "reason": "KubeletReady",
                    "message": "kubelet is posting ready status. AppArmor enabled"
                }
            ],
            "addresses": [ ...
            ],
            "daemonEndpoints": { ...
            },
            "nodeInfo": { ...
            },
            "images": [ ...
            ],
            "volumesInUse": [ ...
            ],
            "volumesAttached": [ ...
            ]
        }
    }
}

As Node metadata is not modified, the entire configuration reload process that follows is unnecessary.

Here's an excerpt from Filebeat debug log that shows that every 10s a sequence of 4 stop and 4 start events is emitted, just for a single Pod (there are many Pods on a single Node, so this is multiplied):

2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.605Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.607Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.091Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.092Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.093Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.093Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.099Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.100Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.101Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.102Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.415Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.415Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.416Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.417Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.715Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.716Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.717Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.718Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984

With the above template configuration most of these start/stop should be ignored, but not events for Kibana Pods. Here is an interesting debug log excerpt which shows a sequence of stop and start events that lead to a runner error potentially caused by log input race condition (see #34388 (comment)). Once such runner error occurs Autodiscover worker initiates a configuration reload on every received event (src).

2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 3   
2023-02-14T12:03:02.406Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 2   
2023-02-14T12:03:02.407Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 1   
2023-02-14T12:03:02.407Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.605Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 0   
2023-02-14T12:03:02.605Z                       cfgfile/list.go  82      Start list: 1, Stop list: 0     
2023-02-14T12:03:02.605Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.605Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.606Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error     
2023-02-14T12:03:02.606Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 0   
2023-02-14T12:03:02.606Z                       cfgfile/list.go  82      Start list: 2, Stop list: 0     
2023-02-14T12:03:02.607Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.607Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.608Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error     
2023-02-14T12:03:02.608Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 1   
2023-02-14T12:03:02.608Z                       cfgfile/list.go  82      Start list: 2, Stop list: 0     
2023-02-14T12:03:02.609Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.609Z          autodiscover/autodiscover.go  267     Got a stop event.       aad-pod-identity-nmi-5jm82
2023-02-14T12:03:02.609Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error <--- HERE 
2023-02-14T12:03:02.609Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 2   
2023-02-14T12:03:02.609Z                       cfgfile/list.go  82      Start list: 1, Stop list: 0     
2023-02-14T12:03:02.610Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileS
tateOS: 4927659-2049}    
2023-02-14T12:03:02.610Z          autodiscover/autodiscover.go  267     Got a stop event.       aad-pod-identity-nmi-5jm82
2023-02-14T12:03:02.610Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error <--- HERE

Ultimately, the following frequency of configuration reloads was observed:

   1 2023-02-14T12:03:01
  66 2023-02-14T12:03:02
 113 2023-02-14T12:03:12 <--- HERE
  27 2023-02-14T12:03:13
   6 2023-02-14T12:03:22
   5 2023-02-14T12:03:32
  79 2023-02-14T12:03:33
   1 2023-02-14T12:03:42
  28 2023-02-14T12:03:43
   1 2023-02-14T12:03:52
  66 2023-02-14T12:03:53
...

Enhancement request

Autodiscover should filter Node and Namespace object modifications that are not changing its metadata.

Workarounds

If hints are not used, Node and Namespace watchers can be disabled entirely (src) with the following configuration:

filebeat:
  autodiscover:
    providers:
    - add_resource_metadata:
        namespace:
          enabled: false
        node:
          enabled: false
      hints.enabled: false

The text was updated successfully, but these errors were encountered:

ChrsMark · 2023-03-06T03:23:29Z

FYI @gizas @rameshelastic @mlunadia

gizas · 2024-01-15T14:56:35Z

@gbanasiak closing this as completed after #37338

gbanasiak added enhancement autodiscovery labels Mar 2, 2023

botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Mar 2, 2023

ChrsMark added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Mar 6, 2023

botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Mar 6, 2023

belimawr mentioned this issue Sep 27, 2023

Debounce input reload on autodiscover #35645

Merged

4 tasks

This was referenced Nov 23, 2023

[Meta]Investigate resource consumption of Elastic Agent with K8s Integration elastic/elastic-agent#3801

Open

Kubernetes secret provider: Caching it to avoid performance issues elastic/elastic-agent#3594

Closed

ChrsMark mentioned this issue Nov 29, 2023

[Metricbeat autodiscover][Provider Kubernetes] Add condition to node/namespace watchers #37181

Closed

6 tasks

constanca-m mentioned this issue Dec 4, 2023

[Kubernetes secret provider] Add cache for the secrets elastic/elastic-agent#3822

Merged

7 tasks

gizas mentioned this issue Dec 7, 2023

Check Pod Autodiscover conditions for node/namespace updates #37338

Closed

4 tasks

gizas closed this as completed Jan 15, 2024

belimawr mentioned this issue Mar 22, 2024

Investigate number of debug messages of add_kubernetes_metadata processor #38529

Closed

belimawr mentioned this issue Nov 11, 2024

filebeat: input v2 compat uses random ID for CheckConfig #41585

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Autodiscover more selective when watching K8s events #34717

Make Autodiscover more selective when watching K8s events #34717

gbanasiak commented Mar 2, 2023 •

edited

Loading

ChrsMark commented Mar 6, 2023

gizas commented Jan 15, 2024

Make Autodiscover more selective when watching K8s events #34717

Make Autodiscover more selective when watching K8s events #34717

Comments

gbanasiak commented Mar 2, 2023 • edited Loading

Setup

Context

Enhancement request

Workarounds

ChrsMark commented Mar 6, 2023

gizas commented Jan 15, 2024

gbanasiak commented Mar 2, 2023 •

edited

Loading