
Automatically adjust Elastic Agent hostPath permissions #6599

Closed
naemono wants to merge 11 commits

Conversation

naemono (Contributor) commented Mar 27, 2023

closes #6239
closes #6543

Background

Currently, the following must be set when running Elastic Agent with a hostPath volume:

    podTemplate:
      spec:
        containers:
          - name: agent
            securityContext:
              runAsUser: 0

The only way to avoid this is to configure an emptyDir volume instead of a hostPath volume.

What this proposes

As detailed further in #6239, we want to automatically add an initContainer that adjusts the Agent's hostPath permissions, so the Agent is not required to run perpetually as root.

If all of the following are true, an initContainer that adjusts permissions is automatically added to Elastic Agent (see the sketch after this list):

  1. Agent volume is not set to emptyDir.
  2. Agent version is above 7.15.
  3. Agent spec is not configured to run as root.
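
A minimal Go sketch of that gate, assuming a simplified signature; the PR's actual implementation lives in pkg/controller/agent/volume.go and derives these values from the AgentSpec, so the boolean parameters here are illustrative only:

package agent

import "github.com/blang/semver/v4"

// needsPermissionsInitContainer sketches the three conditions listed above.
// The parameters are illustrative stand-ins for inspecting the AgentSpec.
func needsPermissionsInitContainer(usesEmptyDir bool, agentVersion semver.Version, runsAsRoot bool) bool {
	// Version gate per the PR description ("above 7.15").
	minVersion := semver.MustParse("7.15.0")
	return !usesEmptyDir && agentVersion.GTE(minVersion) && !runsAsRoot
}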

Additional notable change

  • Removes runAsUser: 0 from all e2e Agent tests, as it is no longer necessary.
  • Agent in Fleet mode previously had to run as root to update the CA trust store; as of Agent >= 7.14.0 this is no longer required, so the Agent does not need to run as root.
  • When running in OpenShift, privileged: true must be enabled for the initContainer, and chcon -Rt svirt_sandbox_file_t /usr/share/elastic-agent/state must be run to set the SELinux context properly (see the sketch after this list).
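
A hedged sketch of how the init container's command might gain the chcon step only when OpenShift is detected; the isOpenShift flag and helper name are assumptions, not this PR's final API:

package agent

// hostPathInitCommand assembles the init container command. Sketch only:
// the chcon relabel is appended solely for OpenShift/SELinux environments.
func hostPathInitCommand(isOpenShift bool) []string {
	script := "set -e; chmod g+rw /usr/share/elastic-agent/state"
	if isOpenShift {
		// Relabel so SELinux permits the non-root agent container to write.
		script += "; chcon -Rt svirt_sandbox_file_t /usr/share/elastic-agent/state"
	}
	return []string{"/usr/bin/env", "bash", "-c", script}
}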

Testing

TODO

  • Additional fleet testing
  • Getting some validation around the root requirement in Agent
  • Update documentation to clarify what happens in these new circumstances

naemono added 6 commits March 22, 2023 13:21
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Disable global CA test when running locally.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
naemono added the >bug (Something isn't working) and >enhancement (Enhancement of existing functionality) labels on Mar 27, 2023
naemono (Contributor, Author) commented Mar 28, 2023

buildkite test this -f p=gke,s=7.17.8

thbkrkr (Contributor) commented Mar 28, 2023

buildkite test this -f p=gke,s=7.17.8

naemono (Contributor, Author) commented Mar 28, 2023

buildkite test this -f p=gke,s=8.6.2

naemono added 2 commits March 28, 2023 14:36
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
naemono marked this pull request as ready for review on March 29, 2023 03:03
@@ -97,11 +95,7 @@ spec:
podTemplate:
spec:
serviceAccountName: elastic-agent
hostNetwork: true

Why are we removing hostNetwork: true?

@@ -79,8 +79,6 @@ spec:
spec:
serviceAccountName: fleet-server
automountServiceAccountToken: true
securityContext:
runAsUser: 0
barkbay (Contributor) commented Apr 5, 2023

I think this is still required for the container's CA bundle to be updated, see:

barkbay (Contributor) left a comment

I didn't manage to get it working on OpenShift:

pod/elastic-agent-agent-mdgdt    0/1     Init:Error   2 (26s ago)   29s   10.128.2.24   barkbay-ocp-kwdmv-worker-d-84dkv.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-p9vql    0/1     Init:Error   2 (26s ago)   30s   10.131.0.17   barkbay-ocp-kwdmv-worker-c-5q6vm.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-r5dxn    0/1     Init:Error   2 (26s ago)   30s   10.129.2.18   barkbay-ocp-kwdmv-worker-b-6sh5v.c.elastic-cloud-dev.internal   <none>           <none>
k logs pod/elastic-agent-agent-p9vql -c permissions
chmod: changing permissions of '/usr/share/elastic-agent/state': Permission denied

Setting privileged: true helps the init container to run:

--- a/pkg/controller/agent/volume.go
+++ b/pkg/controller/agent/volume.go
@@ -7,6 +7,7 @@ package agent
 import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
+       ptr "k8s.io/utils/pointer"
 
        "github.com/blang/semver/v4"
 
@@ -57,7 +58,8 @@ func maybeAgentInitContainerForHostpathVolume(spec *agentv1alpha1.AgentSpec, v s
                        Command: hostPathVolumeInitContainerCommand(),
                        Name:    hostPathVolumeInitContainerName,
                        SecurityContext: &corev1.SecurityContext{
-                               RunAsUser: pointer.Int64(0),
+                               RunAsUser:  pointer.Int64(0),
+                               Privileged: ptr.Bool(true),
                        },
                        Resources: hostPathVolumeInitContainerResources,
                        VolumeMounts: []corev1.VolumeMount{

But then the agent container fails to start:

pod/elastic-agent-agent-78bjs    0/1     CrashLoopBackOff   5 (95s ago)    4m36s   10.128.2.25   barkbay-ocp-kwdmv-worker-d-84dkv.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-r8gmq    0/1     CrashLoopBackOff   5 (89s ago)    4m35s   10.131.0.18   barkbay-ocp-kwdmv-worker-c-5q6vm.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-w2ztb    0/1     CrashLoopBackOff   5 (102s ago)   4m36s   10.129.2.20   barkbay-ocp-kwdmv-worker-b-6sh5v.c.elastic-cloud-dev.internal   <none>           <none>
kubectl logs elastic-agent-agent-r8gmq -n agent -f
Defaulted container "agent" out of: agent, permissions (init)
Error: preparing STATE_PATH(/usr/share/elastic-agent/state) failed: mkdir /usr/share/elastic-agent/state/data: permission denied
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.6/fleet-troubleshooting.html

(the ServiceAccount I'm using for Agent is in the privileged SCC)

set -e
if [[ -d /usr/share/elastic-agent/state ]]; then
chmod g+rw /usr/share/elastic-agent/state
chgrp 1000 /usr/share/elastic-agent/state

Why 1000?

barkbay (Contributor) commented Apr 5, 2023

But then the agent container fails to start:

The agent container still has to run in privileged mode:

  daemonSet:
    podTemplate:
      spec:
        automountServiceAccountToken: true
        serviceAccountName: elastic-agent
        containers:
        - name: agent
          securityContext:
            privileged: true
elastic-agent@elastic-agent-agent-4w562:~/state$ id
uid=1000(elastic-agent) gid=1000(elastic-agent) groups=1000(elastic-agent),0(root)

I guess we use 1000 as a default value for chgrp because it is the default user/group in the Docker image?

naemono (Contributor, Author) commented Apr 5, 2023

I guess we use 1000 as a default value for chgrp because it is the default user/group in the Docker image?

That is correct.

I didn't manage to get it working on OpenShift:

That is odd, as I tested on OpenShift and still have the Agent running successfully in my cluster:

❯ kc get agent -n elastic elastic-agent -o yaml | yq e '.spec.daemonSet' -
podTemplate:
  metadata:
    creationTimestamp: null
  spec:
    automountServiceAccountToken: true
    containers:
      - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        name: agent
        resources: {}
    serviceAccountName: elastic-agent
updateStrategy: {}

❯ kc get pod -n elastic elastic-agent-agent-9mtrh -o yaml | yq e '.spec.containers' -
- args:
    - -e
    - -c
    - /etc/agent.yml
  env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
  image: docker.elastic.co/beats/elastic-agent:8.6.1
  imagePullPolicy: IfNotPresent
  name: agent
  resources:
    limits:
      cpu: 200m
      memory: 350Mi
    requests:
      cpu: 200m
      memory: 350Mi
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /usr/share/elastic-agent/state
      name: agent-data
    - mountPath: /etc/agent.yml
      name: config
      readOnly: true
      subPath: agent.yml
    - mountPath: /mnt/elastic-internal/elasticsearch-association/elastic/elasticsearch/certs
      name: elasticsearch-certs-0
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5jg8f
      readOnly: true

# Note this is a previous version where I was doing some debugging....
❯ kc get pod -n elastic elastic-agent-agent-9mtrh -o yaml | yq e '.spec.initContainers' -
- command:
    - /usr/bin/env
    - bash
    - -c
    - |
      #!/usr/bin/env bash
      set -e
      find /usr/share/elastic-agent -ls
      if [[ -d /usr/share/elastic-agent/state ]]; then
        echo "Adjusting g+rw of /usr/share/elastic-agent/state"
        chmod g+rw /usr/share/elastic-agent/state
        echo "Adjusting group ownership of /usr/share/elastic-agent/state"
        chgrp 1000 /usr/share/elastic-agent/state
        if [ -n "$(ls -A /usr/share/elastic-agent/state 2>/dev/null)" ]; then
          echo "Adjusting group ownership of /usr/share/elastic-agent/state/*"
          chgrp 1000 /usr/share/elastic-agent/state/*
          echo "Adjusting g+rw of /usr/share/elastic-agent/state/*"
          chmod g+rw /usr/share/elastic-agent/state/*
        fi
      fi
  image: docker.elastic.co/beats/elastic-agent:8.6.1
  imagePullPolicy: IfNotPresent
  name: permissions
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 128Mi
  securityContext:
    runAsUser: 0
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /usr/share/elastic-agent/state
      name: agent-data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5jg8f
      readOnly: true

❯ kc exec -it -n elastic elastic-agent-agent-9mtrh -- id
Defaulted container "agent" out of: agent, permissions (init)
uid=1000(elastic-agent) gid=1000(elastic-agent) groups=1000(elastic-agent),0(root)

I did have to run this:

oc adm policy add-scc-to-user privileged -z elastic-agent -n elastic

Maybe it's something to do with OpenShift versions? What version were you running, @barkbay?

❯ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.9

I'm going to wipe this out, start fresh, and see if I can replicate what you're seeing.

barkbay (Contributor) commented Apr 5, 2023

Maybe it's something to do with Openshift versions?

Server Version: 4.10.12
Kubernetes Version: v1.23.5+70fb84c

To be honest, I'm a bit surprised that it's possible to change permissions on the host file system from a non-privileged container, given SELinux. My understanding is that the purpose of SELinux is to prevent changes on the host even if a process is running as root (but maybe I'm wrong, and being in the privileged SCC should allow that).

barkbay (Contributor) commented Apr 5, 2023

I guess we use 1000 as a default value for chgrp because it is the default user/group in the Docker image?

That is correct.

I have to admit that I'm not a big fan of depending on such an implementation detail. We should assume that a container can run as any user ID.

naemono (Contributor, Author) commented Apr 5, 2023

I have to admit that I'm not a big fan of depending on such an implementation detail. We should assume that a container can run as any user ID.

Good point. I'll work to deduce the group from the configuration so this can run as any user ID, and will update when implementation/testing is complete.
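
(For illustration, a hypothetical sketch of that deduction; the helper name, field precedence, and fallback are assumptions, not the eventual implementation:)

package agent

import corev1 "k8s.io/api/core/v1"

// targetGroup deduces the group to chgrp to from the Pod template instead
// of hard-coding 1000. Hypothetical sketch; the precedence is an assumption.
func targetGroup(spec corev1.PodSpec) int64 {
	if sc := spec.SecurityContext; sc != nil && sc.FSGroup != nil {
		return *sc.FSGroup
	}
	for _, c := range spec.Containers {
		if csc := c.SecurityContext; csc != nil && csc.RunAsGroup != nil {
			return *csc.RunAsGroup
		}
	}
	return 1000 // fall back to the image's default group discussed above
}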

barkbay (Contributor) commented Apr 6, 2023

I'm going to wipe this out, and start fresh and see if I can replicate what you're seeing....

I did the same test again on a brand new cluster with the same result:

pod/elastic-agent-agent-bth8q    0/1     Init:Error              4 (58s ago)   2m11s   10.129.2.12   barkbay-ocp-2xw5s-worker-d-pjpgf.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-pbsbc    0/1     Init:Error              4 (54s ago)   2m11s   10.128.2.13   barkbay-ocp-2xw5s-worker-b-sr7kw.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-wnqxt    0/1     Init:CrashLoopBackOff   3 (50s ago)   2m11s   10.131.0.30   barkbay-ocp-2xw5s-worker-c-bbf8b.c.elastic-cloud-dev.internal   <none>           <none>
pod/elasticsearch-es-default-0   1/1     Running                 0             2m51s   10.131.0.29   barkbay-ocp-2xw5s-worker-c-bbf8b.c.elastic-cloud-dev.internal   <none>           <none>
k logs pod/elastic-agent-agent-pbsbc -c permissions
chmod: changing permissions of '/usr/share/elastic-agent/state': Permission denied

The original resources manifest I used is here: https://gist.github.com/barkbay/e9c240ea1a7333d428e5508a155de66c#file-kubernetes-integration-yaml

Note that I adjusted the namespace as resources are usually never deployed in the default one (and I think this is even more true on OpenShift).

barkbay (Contributor) commented Apr 6, 2023

As mentioned in one of my previous messages, I suspect SELinux is the culprit:

type=AVC msg=audit(1680773014.792:68): avc:  denied  { setattr } for  pid=199848 comm="chmod" name="state" dev="sda4" ino=44055922 scontext=system_u:system_r:container_t:s0:c472,c648 tcontext=system_u:object_r:container_var_lib_t:s0 tclass=dir permissive=0
type=SYSCALL msg=audit(1680773014.792:68): arch=c000003e syscall=268 success=no exit=-13 a0=ffffff9c a1=5615e593f3b0 a2=1fd a3=fffff3ff items=0 ppid=199829 pid=199848 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="chmod" exe="/usr/bin/chmod" subj=system_u:system_r:container_t:s0:c472,c648 key=(null)ARCH=x86_64 SYSCALL=fchmodat AUID="unset" UID="root" GID="root" EUID="root" SUID="root" FSUID="root" EGID="root" SGID="root" FSGID="root"
type=PROCTITLE msg=audit(1680773014.792:68): proctitle=63686D6F6400672B7277002F7573722F73686172652F656C61737469632D6167656E742F7374617465
sh-4.4# ls -laZ /var/lib/elastic-agent/my-agent-project/elastic-agent
total 0
drwxr-xr-x. 3 root root system_u:object_r:container_var_lib_t:s0 19 Apr  6 09:17 .
drwxr-xr-x. 3 root root system_u:object_r:container_var_lib_t:s0 27 Apr  6 09:17 ..
drwxr-xr-x. 2 root root system_u:object_r:container_var_lib_t:s0  6 Apr  6 09:17 state

The permissions container runs with the following label: system_u:system_r:container_t

naemono (Contributor, Author) commented Apr 6, 2023

Thanks for the follow-up. I replicated the same behavior late yesterday, after having massive issues with my OpenShift cluster and finally just rebuilding it. I'm still a bit baffled as to why it was working before, but will move forward with what I'm seeing now.

naemono added 2 commits April 6, 2023 11:57
…econciliation.

Detect openshift when adding agent init container to be able to add 'privileged: true' automatically.
Run 'chcon' on Agent state directory when running within Openshift.
Add bool func to our utils/pointer package.
Update tests for new functionality

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
naemono (Contributor, Author) commented Apr 6, 2023

@barkbay With the recent changes, this now fully works within OpenShift.

I've also made some changes to try and handle #6543, but I'm still doing some testing around that feature.

pebrc (Collaborator) commented Apr 11, 2023

Drive-by comment: is it a good idea to special-case OpenShift here? Wouldn't the same restrictions that we are trying to work around for OpenShift also apply to any non-OpenShift cluster that has SELinux set up?


const (
hostPathVolumeInitContainerName = "permissions"
chconCmd = "chcon -Rt svirt_sandbox_file_t /usr/share/elastic-agent/state"

Using chcon in a privileged container without explicit user consent seems wrong to me from a security point of view.

barkbay (Contributor) commented Apr 11, 2023

Would not the same restrictions that we are trying to work around for OpenShift apply to any non-OpenShift cluster as well if it has SELinux set up?

Yes. I'm also wondering if the opposite is possible: when running OpenShift on a file system without SELinux, what would be the result of the chcon command?

More generally, I'm a bit puzzled by the idea of building a feature on something that I considered a "best effort" (using isOpenShift(...) to detect OpenShift) until now, and for which a flag could be used as an escape hatch.
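
(For reference, best-effort detection along the lines mentioned above typically probes the discovery API for an OpenShift-only group; a hypothetical sketch, not this PR's code:)

package agent

import "k8s.io/client-go/discovery"

// isOpenShift reports whether an OpenShift-only API group is served.
// Best-effort: a discovery error is treated as "not OpenShift".
func isOpenShift(dc discovery.DiscoveryInterface) bool {
	groups, err := dc.ServerGroups()
	if err != nil {
		return false
	}
	for _, g := range groups.Groups {
		if g.Name == "security.openshift.io" {
			return true
		}
	}
	return false
}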

naemono (Contributor, Author) commented Apr 18, 2023

This is being closed in favor of documenting a DaemonSet that can be used to prepare the Agent directory for running Elastic Agent without the need to run as root. This decision was made after discussing the security concerns around automatically managing these permissions without explicit user consent; requiring a DaemonSet to be applied before running Agent makes that consent explicit.

naemono closed this on Apr 18, 2023
naemono mentioned this pull request on Apr 18, 2023
Successfully merging this pull request may close these issues:

  • Potentially chown Elastic Agent hostpath data directory
  • Revisit Elastic Agent certificate handling