
[Agent] Support Node and Service autodiscovery in k8s provider #26801

Merged · 25 commits · Jul 22, 2021

Conversation

@ChrsMark (Member) commented Jul 9, 2021

What does this PR do?

This PR adds support for more resource types in the Kubernetes dynamic provider.

Why is it important?

It enables Node and Service discovery via Agent's Kubernetes dynamic provider.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  - [ ] I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Test with the `node` resource:

  1. Use the following config:
providers.kubernetes:
  scope: cluster
  kube_config: /Users/chrismark/.kube/config
  cleanup_timeout: 360s
  resources:
    node: 
      enabled: true

inputs:
  - type: logfile
    streams:
      - paths: ${kubernetes.node.name}/another.log
  2. Run the inspect command to check the compiled configuration (use the proper path for the elastic-agent.yml config file):
    ./elastic-agent -c /Users/ubuntu/Desktop/elastic-agent.yml inspect output -o default
  3. Verify that something similar to the following is produced:
filebeat:
  inputs:
  - index: logs-generic-default
    paths: kind-control-plane/another.log
    processors:
    - add_fields:
        fields:
          node:
            annotations:
              kubeadm:
                alpha:
                  kubernetes:
                    io/cri-socket: unix:///run/containerd/containerd.sock
              node:
                alpha:
                  kubernetes:
                    io/ttl: "0"
              volumes:
                kubernetes:
                  io/controller-managed-attach-detach: "true"
            ip: 172.18.0.3
            labels:
              beta.kubernetes.io/arch: amd64
              beta.kubernetes.io/os: linux
              kubernetes.io/arch: amd64
              kubernetes.io/hostname: kind-control-plane
              kubernetes.io/os: linux
              node-role.kubernetes.io/master: ""
            name: kind-control-plane
            uid: 65e26851-66d2-44f9-948d-37e2c43f50f7
        target: kubernetes

Test with the `service` resource:

providers.kubernetes:
  scope: cluster
  kube_config: /Users/chrismark/.kube/config
  cleanup_timeout: 360s
  resources:
    service: 
      enabled: true

inputs:
  - type: logfile
    streams:
      - paths: ${kubernetes.service.name}/another.log

Test with the `pod` resource:

providers.kubernetes:
  scope: cluster
  kube_config: /Users/chrismark/.kube/config
  cleanup_timeout: 360s
  resources:
    pod:
      enabled: true

inputs:
  - type: logfile
    streams:
      - paths: ${kubernetes.pod.name}/another.log

Test with the `service` resource and `node` scope:

providers.kubernetes:
  scope: node
  kube_config: /Users/chrismark/.kube/config
  cleanup_timeout: 360s
  resources:
    service:
      enabled: true
      

inputs:
  - type: logfile
    streams:
      - paths: ${kubernetes.service.name}/another.log
  1. Verify from the logs that the scope was enforced to `cluster` (`can not set scope to node when using resource Service. resetting scope to cluster`) and that the `kubernetes.scope` field is populated with the `cluster` value.

Test with the `pod` resource at `node` scope and an explicit node:

providers.kubernetes:
  scope: node
  kube_config: /Users/chrismark/.kube/config
  cleanup_timeout: 360s
  node: "kind-control-plane"
  resources:
    pod:
      enabled: true
      

inputs:
  - type: logfile
    streams:
      - paths: ${kubernetes.pod.name}/another.log

Related issues

ChrsMark added 3 commits July 8, 2021 17:20
@ChrsMark added the Team:Integrations, autodiscovery, kubernetes, v7.15.0, and backport-v7.15.0 labels on Jul 9, 2021
@ChrsMark ChrsMark self-assigned this Jul 9, 2021
ChrsMark added 2 commits July 9, 2021 12:08
@elasticmachine (Collaborator) commented Jul 9, 2021

💚 Build Succeeded


Build stats

  • Start Time: 2021-07-22T07:12:23.231+0000

  • Duration: 98 min 0 sec

  • Commit: 701236c

Test stats 🧪

  • Failed: 0
  • Passed: 7012
  • Skipped: 16
  • Total: 7028


💚 Flaky test report

Tests succeeded.


@MichaelKatsoulis (Contributor) left a comment

LGTM. It is nice splitting the different resource watchers into their respective files.

@ChrsMark ChrsMark marked this pull request as ready for review July 12, 2021 07:39
@elasticmachine (Collaborator)

Pinging @elastic/integrations (Team:Integrations)

@ChrsMark ChrsMark requested a review from jsoriano July 12, 2021 07:40
@ChrsMark (Member, Author)

Opening this for review. I tested it manually with the scenarios mentioned in this PR's description. I plan to add unit tests for the provider too, but the implementation should be ready for review now.

ChrsMark added 6 commits July 12, 2021 11:44
@ChrsMark (Member, Author) commented Jul 13, 2021

Note: the latest commit (43f5136) tries to improve data emission handling in a yield-like way so as to help with test coverage too. We can revert it if we decide it is not to our liking.

@jsoriano (Member)

@ChrsMark thanks for working on this. Before going deep into implementation details, I would like to raise a concern: do we have plans to support discovery of multiple resources?

With the current configuration it doesn't seem to be possible. There is no way to declare multiple providers of the same kind as in Beats, and resource supports a single kind of resource.

providers.kubernetes:
  resource: pod
  scope: cluster
  kube_config: /Users/chrismark/.kube/config

I would see two options for that. One would be to make resource a list, and then make one provider able to discover multiple kinds of resources. But then for each kind of resource we may need to be able to define different scopes or even configs.
The other option would be to have different providers, one for each kind of resource, so we would have providers such as kubernetes_pod, kubernetes_node, kubernetes_service and so on; then we would have one provider per kind of resource. But this may be confusing when using common kubernetes fields such as the namespace in the config (should it be kubernetes_pod.namespace and kubernetes_service.namespace, or kubernetes.namespace for both providers?).

Of course, with the current implementation there is also the option of running multiple agents, one for each kind of resource, but this may be a bit overkill and cumbersome for users.

@ChrsMark (Member, Author) commented Jul 13, 2021

That's a fair point @jsoriano, thanks for bringing this up! Especially now that the same Agent will run collectors for logs+metrics+uptime at the same time, using the same config, we should be able to provide such flexibility to our users.

I think that splitting into different providers per k8s resource would make sense, but fields should be reported under the same namespace, like kubernetes.*. Maybe this is doable by setting the target to kubernetes at the referenced line. But in dynamic variable resolution we would need to refer to each namespace explicitly, i.e. ${kubernetes_service.service.name} == 'kube-controller-manager', which I'm not sure is something we would like.

Another way to tackle this is to enable support for defining the k8s provider multiple times, but that should most probably happen at the controller provider's layer.

@exekias any thoughts on this?

// Check if resource is service. If yes then default the scope to "cluster".
if c.Resources.Service != nil {
	if c.Scope == "node" {
		logp.L().Warnf("can not set scope to `node` when using resource `Service`. resetting scope to `cluster`")
Contributor:
Should this logger be namespaced? I also see logger widely used in agent, not sure if there is any preference here
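For context, this is roughly what the difference looks like with libbeat's logp package (a minimal sketch; the selector string below is illustrative, not the one chosen in the PR):

package main

import "github.com/elastic/beats/v7/libbeat/logp"

func main() {
	// logp.L() is the global, unnamed logger; the warning carries no selector.
	logp.L().Warnf("can not set scope to `node` when using resource `Service`. resetting scope to `cluster`")

	// A namespaced logger tags every line with its selector, which makes it
	// easier to filter the provider's logs. The selector name is illustrative.
	log := logp.NewLogger("composable.providers.kubernetes")
	log.Warnf("can not set scope to `node` when using resource `Service`. resetting scope to `cluster`")
}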

Comment on lines 39 to 46
type ResourceConfig struct {
	KubeConfig     string        `config:"kube_config"`
	Namespace      string        `config:"namespace"`
	SyncPeriod     time.Duration `config:"sync_period"`
	CleanupTimeout time.Duration `config:"cleanup_timeout" validate:"positive"`

	// Needed when resource is a Pod or Node
	Node string `config:"node"`
Contributor:

I'm wondering about use cases for resource-specific settings, do you have any in mind?

Member (Author):

Now that I'm thinking of it again, maybe it is over-engineering to provide this option at the moment, since the base config shared by all the resources should cover the cases. Flexibility for different access or settings per resource would be nice to think of, but we can wait and see if users actually need it. So, I will change it and move to a single config for all of the resources.

func (c *Config) Validate() error {
	// Check if resource is service. If yes then default the scope to "cluster".
	if c.Resources.Service != nil {
		if c.Scope == "node" {
Contributor:
It's interesting that you can override almost all settings per resource, but not scope

		Node: c.Node,
	}
	if c.Resources.Pod == nil {
		c.Resources.Pod = baseCfg
Contributor:

What if the user only overrides resources.pod.namespace? Does that mean that the rest of the settings will be empty?
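To illustrate the concern, a field-level fallback merge would avoid leaving the other settings empty when only one field is overridden. This is a hypothetical sketch with made-up types and helper, not code from the PR (which, as noted above, moved to a single shared config instead):

package main

import "fmt"

// Hypothetical type for illustration only; it is not the PR's exact struct.
type ResourceConfig struct {
	KubeConfig string
	Namespace  string
	Node       string
}

// mergeWithBase fills zero-valued fields of the per-resource override with the
// base values, so overriding only resources.pod.namespace keeps the rest intact.
func mergeWithBase(base, override ResourceConfig) ResourceConfig {
	out := base
	if override.KubeConfig != "" {
		out.KubeConfig = override.KubeConfig
	}
	if override.Namespace != "" {
		out.Namespace = override.Namespace
	}
	if override.Node != "" {
		out.Node = override.Node
	}
	return out
}

func main() {
	base := ResourceConfig{KubeConfig: "/root/.kube/config", Namespace: "default"}
	podOverride := ResourceConfig{Namespace: "kube-system"} // only namespace set
	fmt.Printf("%+v\n", mergeWithBase(base, podOverride))
	// {KubeConfig:/root/.kube/config Namespace:kube-system Node:}
}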

func (n *node) emitRunning(node *kubernetes.Node) {
	data := generateNodeData(node)
	data.mapping["scope"] = n.scope
	if data == nil {
Contributor:
Can this happen taking the previous line into account?

Member (Author):
Good catch
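For clarity, the fix is to move the nil check before the write into the mapping. A minimal self-contained sketch of the intended ordering, using stand-in types rather than the provider's real ones:

package main

// nodeData is a stand-in for the provider's data structure; names are illustrative.
type nodeData struct {
	mapping map[string]interface{}
}

// generateNodeData may return nil, which is why emitRunning must check before use.
func generateNodeData(name string) *nodeData {
	if name == "" {
		return nil
	}
	return &nodeData{mapping: map[string]interface{}{"name": name}}
}

// emitRunning checks the nil return before touching data.mapping,
// which is the ordering the review comment asks for.
func emitRunning(name, scope string) {
	data := generateNodeData(name)
	if data == nil {
		return
	}
	data.mapping["scope"] = scope
	// ... emit data to the provider ...
}

func main() {
	emitRunning("kind-control-plane", "cluster")
	emitRunning("", "cluster") // no panic: the nil result is handled before the write
}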

Comment on lines 89 to 90
n.emitStopped(node)
n.emitRunning(node)
Contributor:
In theory emitRunning should be enough here, right? This will AddOrUpdate

	return false
}

func getAddress(node *kubernetes.Node) string {
Contributor:

An explanatory comment here would help.
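For illustration, a helper like getAddress typically walks node.Status.Addresses and prefers an internal IP, falling back to the hostname. The sketch below is written against the upstream k8s.io/api/core/v1 types and may not match the PR's exact preference order:

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// getAddress returns a usable address for the node: the first InternalIP if
// present, otherwise the first Hostname, otherwise an empty string.
func getAddress(node *v1.Node) string {
	for _, addr := range node.Status.Addresses {
		if addr.Type == v1.NodeInternalIP && addr.Address != "" {
			return addr.Address
		}
	}
	for _, addr := range node.Status.Addresses {
		if addr.Type == v1.NodeHostName && addr.Address != "" {
			return addr.Address
		}
	}
	return ""
}

func main() {
	node := &v1.Node{Status: v1.NodeStatus{Addresses: []v1.NodeAddress{
		{Type: v1.NodeHostName, Address: "kind-control-plane"},
		{Type: v1.NodeInternalIP, Address: "172.18.0.3"},
	}}}
	fmt.Println(getAddress(node)) // 172.18.0.3
}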

// Pass annotations to all events so that it can be used in templating and by annotation builders.
annotations := common.MapStr{}
for k, v := range node.GetObjectMeta().GetAnnotations() {
	safemapstr.Put(annotations, k, v)
Contributor:
Should we be dedotting these?

Member (Author):

I was thinking about adding this when we deal with metadata in general, but it's ok to add it now. Adding.

"node": map[string]interface{}{
"uid": string(node.GetUID()),
"name": node.GetName(),
"labels": node.GetLabels(),
Contributor:
Same for labels, should we dedot?

Member (Author):
Same as above!
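For readers unfamiliar with the term: dedotting replaces the dots in label and annotation keys with underscores so that they are indexed as single flat fields instead of being expanded into nested objects (as seen in the inspect output above). A minimal sketch, with the helper written inline rather than taken from libbeat:

package main

import (
	"fmt"
	"strings"
)

// dedot replaces dots in a key with underscores so that a label such as
// "beta.kubernetes.io/arch" stays one flat field instead of being expanded
// into nested "beta" -> "kubernetes" -> "io/arch" objects.
func dedot(key string) string {
	return strings.ReplaceAll(key, ".", "_")
}

func main() {
	labels := map[string]string{
		"beta.kubernetes.io/arch": "amd64",
		"kubernetes.io/hostname":  "kind-control-plane",
	}
	dedotted := map[string]string{}
	for k, v := range labels {
		dedotted[dedot(k)] = v
	}
	fmt.Println(dedotted)
	// map[beta_kubernetes_io/arch:amd64 kubernetes_io/hostname:kind-control-plane]
}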


providerDataChan := make(chan providerData)
done := make(chan bool, 1)
go generateContainerData(pod, containers, containerstatuses, providerDataChan, done)
Contributor:

Why use a channel for this? What would you think about emitting directly from the function?

Member (Author):

Channel usage helps isolate the generator function so that it can be tested with unit tests, following a yield-like approach.

Contributor:

Understood. You could also build a mocked emitter passed as a parameter to retrieve the results in the tests, right?

Member (Author):
Ah, good idea. Something like 6f39378?
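The pattern being discussed, an emitter function injected as a parameter so the generator can be unit-tested without channels, looks roughly like this. The names are illustrative and not taken from commit 6f39378:

package main

import "fmt"

// providerData is a stand-in for the data structure the provider emits.
type providerData struct {
	uid     string
	mapping map[string]interface{}
}

// generateContainerData calls emit for every piece of data it produces,
// instead of writing to a channel. In production the emitter forwards data
// to the provider; in tests a closure simply records what was emitted.
func generateContainerData(containers []string, emit func(providerData)) {
	for _, name := range containers {
		emit(providerData{
			uid:     name,
			mapping: map[string]interface{}{"container": map[string]interface{}{"name": name}},
		})
	}
}

func main() {
	// "Mocked emitter": collect results for inspection, as a unit test would.
	var got []providerData
	generateContainerData([]string{"nginx", "sidecar"}, func(d providerData) {
		got = append(got, d)
	})
	fmt.Println(len(got), got[0].uid) // 2 nginx
}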

ChrsMark added 2 commits July 21, 2021 12:09
@ChrsMark (Member, Author)
Comments addressed and tested locally following the updated scenarios listed in the PR's description.

@ChrsMark ChrsMark requested a review from exekias July 21, 2021 09:27
ChrsMark added 2 commits July 21, 2021 12:45
@jsoriano (Member) left a comment
This is looking good. Added a question about the future of dedotting and the possibility of using flattened type instead.

Also, I think we still need to polish some use cases that we recently polished in Beats, such as discovery of short-lived pods, crashing containers, ephemeral containers and so on, but this can be done as follow-ups. elastic/e2e-testing#1090 will help to validate these use cases 😇

// Scope of the provider (cluster or node)
Scope string `config:"scope"`
LabelsDedot bool `config:"labels.dedot"`
AnnotationsDedot bool `config:"annotations.dedot"`
Member:

@exekias @ChrsMark wdyt about starting to use the flattened type in Agent instead of dedotting? There are still some trade-offs but they should be eventually addressed. Hopefully flattened is the future for this kind of fields.
More context here: https://github.com/elastic/obs-dc-team/issues/461

Member (Author):

I'm not completely aware of the pros/cons, but at a quick glance the flattened type looks better than dedotting to me, and timing-wise it sounds like a good idea to make this change now.

Contributor:
Taking this into account, I'm inclined to leave dedotting out of this PR and investigate the experience when using flattened for these fields. Any thoughts? Also, in case we end up introducing it, I would like to be opinionated here and avoid adding any config parameter for it.

I'm particularly concerned about doing things like grouping some metric by a label, which is a valid use. I'm less concerned about annotations...

Member (Author):

I'm +1 on leaving this out for now and opening a follow-up issue to work on dedotting/flattening.

}

// InitDefaults initializes the default values for the config.
func (c *Config) InitDefaults() {
	c.SyncPeriod = 10 * time.Minute
	c.CleanupTimeout = 60 * time.Second
Member:
We may be hitting this issue #20543, not sure how we can do something now depending on the kind of data we are collecting.

Member (Author):
Yeap, the provider is not really aware of the inputs right now, but maybe it could be handled better in the future if we introduce a "smart" controller which enables providers according to the inputs.

p.emitContainers(pod, pod.Spec.Containers, pod.Status.ContainerStatuses)

// TODO deal with init containers stopping after initialization
p.emitContainers(pod, pod.Spec.InitContainers, pod.Status.InitContainerStatuses)
Member:
TODO: add support for ephemeral containers.
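As a pointer for that follow-up: ephemeral containers live in their own slices on the Pod object, separate from the regular and init containers handled above. A small sketch against the upstream k8s.io/api/core/v1 types (the helper name is made up):

package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// listEphemeralContainers shows where ephemeral containers live on the Pod
// object: pod.Spec.EphemeralContainers and pod.Status.EphemeralContainerStatuses,
// which a future emitEphemeralContainers would need to walk as well.
func listEphemeralContainers(pod *v1.Pod) []string {
	names := make([]string, 0, len(pod.Spec.EphemeralContainers))
	for _, c := range pod.Spec.EphemeralContainers {
		names = append(names, c.Name) // Name is promoted from EphemeralContainerCommon
	}
	return names
}

func main() {
	pod := &v1.Pod{}
	pod.Spec.EphemeralContainers = []v1.EphemeralContainer{
		{EphemeralContainerCommon: v1.EphemeralContainerCommon{Name: "debugger"}},
	}
	fmt.Println(listEphemeralContainers(pod)) // [debugger]
}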

@ChrsMark (Member, Author), replying to @jsoriano's review comment above:

Thanks! Testing should be improved for sure, and in a more complete/e2e way. We have elastic/e2e-testing#1090 in our backlog, which I think can cover this need, and I think it would be better to implement it after we have a more complete codebase, including metadata handling.

@ChrsMark (Member, Author)

@exekias @jsoriano I removed the dedotting settings for now so as to better evaluate the possible usage of flattened in a separate issue. I will create follow-up issues for all the TODOs added in this PR.

Let me know if there is anything else missing :).

@jsoriano (Member) left a comment
I think this can be merged, and we can iterate on details and specific use cases in future PRs.

@blakerouse (Contributor) left a comment
Looks good, thanks for all the fixes and changes!

@ChrsMark ChrsMark merged commit 6635acb into elastic:master Jul 22, 2021
mergify bot pushed a commit that referenced this pull request Jul 22, 2021
ChrsMark added a commit that referenced this pull request Jul 23, 2021
… (#27014)

(cherry picked from commit 6635acb)

Co-authored-by: Chris Mark <chrismarkou92@gmail.com>
Labels: autodiscovery, backport-v7.15.0 (Automated backport with mergify), kubernetes (Enable builds in the CI for kubernetes), Team:Integrations (Label for the Integrations team), v7.15.0
Development: Successfully merging this pull request may close these issues: Add more resources in kubernetes composable provider
6 participants