Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Autodiscover more selective when watching K8s events #34717

Closed
Tracked by #3801
gbanasiak opened this issue Mar 2, 2023 · 2 comments
Closed
Tracked by #3801

Make Autodiscover more selective when watching K8s events #34717

gbanasiak opened this issue Mar 2, 2023 · 2 comments
Labels
autodiscovery enhancement Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team

Comments

@gbanasiak
Copy link
Contributor

gbanasiak commented Mar 2, 2023

Setup

Observed in Filebeat 8.5.3 running in Azure Kubernetes v1.24.6 with the following autodiscover configuration:

filebeat:
  autodiscover:
    providers:
    - templates:
      - condition:
          or:
          - equals:
              kubernetes:
                container:
                  name: elasticsearch
          - equals:
              kubernetes:
                container:
                  name: kibana
        config:
        - paths:
          - /var/log/pods/${data.kubernetes.namespace}_${data.kubernetes.pod.name}_${data.kubernetes.pod.uid}/${data.kubernetes.container.name}/*.log
          processors:
          - decode_json_fields:
              add_error_key: true
              fields:
              - message
              max_depth: 1
              overwrite_keys: true
              process_array: false
              target: ""
          type: container
      type: kubernetes

Context

While addressing #23139 Node and Namespace watchers were added in Autodicover to make Node/Namespace metadata changes instantly available to the related Pods. Currently, when a Node modification is received via a watcher, all Pods located on this Node are retrieved from a watcher store (src) and stop and start Pod events are emitted (src). A single Pod can emit multiple start and stop events (src). This happens regardless of the nature of the change of the Node object.

In the environment specified in Setup section, it was observed that Node objects can be updated as frequently as every 10 seconds. Filebeat logs do not allow to determine what exactly was received via watch APIs, but this can be verified as follows:

kubectl proxy &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/namespaces?allowWatchBookmarks=true&watch=true&pretty=false' > namespaces.log &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/nodes?allowWatchBookmarks=true&watch=true&pretty=false' > nodes.log &
stdbuf -oL curl -s '127.0.0.1:8001/api/v1/pods?allowWatchBookmarks=true&watch=true&pretty=false' > pods.log &

Node objects were modified in .status.conditions[] array which is related to Node monitoring process. The only field modified was lastHeartbeatTime. Here's an example of message received via watch API:

{
    "type": "MODIFIED",
    "object": {
        "kind": "Node",
        "apiVersion": "v1",
        "metadata": {
            "name": "<REDACTED>",  ...
        },
        "spec": { ...
        },
        "status": {
            "capacity": { ...
            },
            "allocatable": { ...
            },
            "conditions": [
                {
                    "type": "FilesystemCorruptionProblem",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "FilesystemIsOK",
                    "message": "Filesystem is healthy"
                },
                {
                    "type": "ContainerRuntimeProblem",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:37:25Z",
                    "reason": "ContainerRuntimeIsUp",
                    "message": "container runtime service is up"
                },
                {
                    "type": "KubeletProblem",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:37:25Z",
                    "reason": "KubeletIsUp",
                    "message": "kubelet service is up"
                },
                {
                    "type": "FrequentDockerRestart",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "NoFrequentDockerRestart",
                    "message": "docker is functioning properly"
                },
                {
                    "type": "FrequentUnregisterNetDevice",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "NoFrequentUnregisterNetDevice",
                    "message": "node is functioning properly"
                },
                {
                    "type": "VMEventScheduled",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:37:45Z",
                    "reason": "NoVMEventScheduled",
                    "message": "VM has no scheduled event"
                },
                {
                    "type": "ReadonlyFilesystem",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "FilesystemIsNotReadOnly",
                    "message": "Filesystem is not read-only"
                },
                {
                    "type": "FrequentContainerdRestart",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "NoFrequentContainerdRestart",
                    "message": "containerd is functioning properly"
                },
                {
                    "type": "KernelDeadlock",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "KernelHasNoDeadlock",
                    "message": "kernel has no deadlock"
                },
                {
                    "type": "FrequentKubeletRestart",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:10:59Z", <--- HERE
                    "lastTransitionTime": "2023-02-09T07:41:54Z",
                    "reason": "NoFrequentKubeletRestart",
                    "message": "kubelet is functioning properly"
                },
                {
                    "type": "MemoryPressure",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:38:18Z",
                    "reason": "KubeletHasSufficientMemory",
                    "message": "kubelet has sufficient memory available"
                },
                {
                    "type": "DiskPressure",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:38:18Z",
                    "reason": "KubeletHasNoDiskPressure",
                    "message": "kubelet has no disk pressure"
                },
                {
                    "type": "PIDPressure",
                    "status": "False",
                    "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:38:18Z",
                    "reason": "KubeletHasSufficientPID",
                    "message": "kubelet has sufficient PID available"
                },
                {
                    "type": "Ready",
                    "status": "True",
                    "lastHeartbeatTime": "2023-02-27T08:15:15Z", <--- HERE
                    "lastTransitionTime": "2023-02-10T20:38:18Z",
                    "reason": "KubeletReady",
                    "message": "kubelet is posting ready status. AppArmor enabled"
                }
            ],
            "addresses": [ ...
            ],
            "daemonEndpoints": { ...
            },
            "nodeInfo": { ...
            },
            "images": [ ...
            ],
            "volumesInUse": [ ...
            ],
            "volumesAttached": [ ...
            ]
        }
    }
}

As Node metadata is not modified, the entire configuration reload process that follows is unnecessary.

Here's an excerpt from Filebeat debug log that shows that every 10s a sequence of 4 stop and 4 start events is emitted, just for a single Pod (there are many Pods on a single Node, so this is multiplied):

2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.605Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.607Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.091Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.092Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.093Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.093Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.099Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.100Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.101Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:12.102Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.415Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.415Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.416Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.417Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.715Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.716Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.717Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:22.718Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984

With the above template configuration most of these start/stop should be ignored, but not events for Kibana Pods. Here is an interesting debug log excerpt which shows a sequence of stop and start events that lead to a runner error potentially caused by log input race condition (see #34388 (comment)). Once such runner error occurs Autodiscover worker initiates a configuration reload on every received event (src).

2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.406Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 3   
2023-02-14T12:03:02.406Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 2   
2023-02-14T12:03:02.407Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.407Z          autodiscover/autodiscover.go  267     Got a stop event.       esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.407Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 1   
2023-02-14T12:03:02.407Z                       cfgfile/list.go  82      Start list: 0, Stop list: 1     
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.604Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.605Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 0   
2023-02-14T12:03:02.605Z                       cfgfile/list.go  82      Start list: 1, Stop list: 0     
2023-02-14T12:03:02.605Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.605Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.606Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error     
2023-02-14T12:03:02.606Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 0   
2023-02-14T12:03:02.606Z                       cfgfile/list.go  82      Start list: 2, Stop list: 0     
2023-02-14T12:03:02.607Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.607Z          autodiscover/autodiscover.go  182     Got a start event.      esdevazk8swe1-3-kb-84795bfcd6-cd984
2023-02-14T12:03:02.608Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error     
2023-02-14T12:03:02.608Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 1   
2023-02-14T12:03:02.608Z                       cfgfile/list.go  82      Start list: 2, Stop list: 0     
2023-02-14T12:03:02.609Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileStateOS: 4927659-2049}    
2023-02-14T12:03:02.609Z          autodiscover/autodiscover.go  267     Got a stop event.       aad-pod-identity-nmi-5jm82
2023-02-14T12:03:02.609Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error <--- HERE 
2023-02-14T12:03:02.609Z                       cfgfile/list.go  64      Starting reload procedure, current runners: 2   
2023-02-14T12:03:02.609Z                       cfgfile/list.go  82      Start list: 1, Stop list: 0     
2023-02-14T12:03:02.610Z                       cfgfile/list.go  107     Error creating runner from config: Can only start an input when all related states are finished: {Id: native::4927659-2049, Finished: false, Fileinfo: &{0.log 31895141 416 {696363713 63811972980 0xaaaad7612520} {2049 4927659 33184 1 0 0 0 0 31895141 4096 0 62304 {1676357648 659620958} {1676376180 696363713} {1676376180 696363713} [0 0]}}, Source: /var/log/pods/elastic_esdevazk8swe1-3-kb-84795bfcd6-cd984_340bb25a-a329-4d4b-b934-e8605c3d4ee8/kibana/0.log, Offset: 29901889, Timestamp: 2023-02-14 12:03:01.723335032 +0000 UTC m=+149.690897691, TTL: -1ns, Type: container, Meta: map[], FileS
tateOS: 4927659-2049}    
2023-02-14T12:03:02.610Z          autodiscover/autodiscover.go  267     Got a stop event.       aad-pod-identity-nmi-5jm82
2023-02-14T12:03:02.610Z          autodiscover/autodiscover.go  156     Reloading existing autodiscover configs after error <--- HERE

Ultimately, the following frequency of configuration reloads was observed:

   1 2023-02-14T12:03:01
  66 2023-02-14T12:03:02
 113 2023-02-14T12:03:12 <--- HERE
  27 2023-02-14T12:03:13
   6 2023-02-14T12:03:22
   5 2023-02-14T12:03:32
  79 2023-02-14T12:03:33
   1 2023-02-14T12:03:42
  28 2023-02-14T12:03:43
   1 2023-02-14T12:03:52
  66 2023-02-14T12:03:53
...

Enhancement request

Autodiscover should filter Node and Namespace object modifications that are not changing its metadata.

Workarounds

If hints are not used, Node and Namespace watchers can be disabled entirely (src) with the following configuration:

filebeat:
  autodiscover:
    providers:
    - add_resource_metadata:
        namespace:
          enabled: false
        node:
          enabled: false
      hints.enabled: false
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Mar 2, 2023
@ChrsMark ChrsMark added the Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team label Mar 6, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Mar 6, 2023
@ChrsMark
Copy link
Member

ChrsMark commented Mar 6, 2023

FYI @gizas @rameshelastic @mlunadia

@gizas
Copy link
Contributor

gizas commented Jan 15, 2024

@gbanasiak closing this as completed after #37338

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autodiscovery enhancement Team:Cloudnative-Monitoring Label for the Cloud Native Monitoring team
Projects
None yet
Development

No branches or pull requests

3 participants