
Automatically adjust Elastic Agent hostPath permissions #6599

Closed
naemono wants to merge 11 commits

Conversation

naemono (Contributor) commented Mar 27, 2023

closes #6239
closes #6543

Background

Currently, the following must be set when running Elastic Agent with a hostPath volume:

    podTemplate:
      spec:
        containers:
          - name: agent
            securityContext:
              runAsUser: 0

The only way to avoid this is to configure an emptyDir volume instead of a hostPath volume.

What this proposes

As detailed further in #6239, we want to automatically add an initContainer that adjusts the Agent's hostPath permissions, so the Agent is not required to run perpetually as root.

If all of the following are true, an initContainer that adjusts permissions is automatically added to Elastic Agent (see the sketch after this list):

  1. Agent volume is not set to emptyDir.
  2. Agent version is above 7.15.
  3. Agent spec is not configured to run as root.
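
A minimal Go sketch of that gate, assuming a simplified signature; the PR's actual implementation lives in pkg/controller/agent/volume.go and derives these values from the AgentSpec, so the boolean parameters here are illustrative only:

package agent

import "github.com/blang/semver/v4"

// needsPermissionsInitContainer sketches the three conditions listed above.
// The parameters are illustrative stand-ins for inspecting the AgentSpec.
func needsPermissionsInitContainer(usesEmptyDir bool, agentVersion semver.Version, runsAsRoot bool) bool {
	// Version gate per the PR description ("above 7.15").
	minVersion := semver.MustParse("7.15.0")
	return !usesEmptyDir && agentVersion.GTE(minVersion) && !runsAsRoot
}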

Additional notable change

  • Removes runAsUser: 0 from all e2e Agent tests, as it is no longer necessary.
  • Agent in Fleet mode previously had to run as root to update the CA trust store; as of Agent >= 7.14.0 this is no longer required, so the Agent does not need to run as root.
  • When running in OpenShift, privileged: true must be enabled for the initContainer, and chcon -Rt svirt_sandbox_file_t /usr/share/elastic-agent/state must be run to set the SELinux context properly (see the sketch after this list).
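
A hedged sketch of how the init container's command might gain the chcon step only when OpenShift is detected; the isOpenShift flag and helper name are assumptions, not this PR's final API:

package agent

// hostPathInitCommand assembles the init container command. Sketch only:
// the chcon relabel is appended solely for OpenShift/SELinux environments.
func hostPathInitCommand(isOpenShift bool) []string {
	script := "set -e; chmod g+rw /usr/share/elastic-agent/state"
	if isOpenShift {
		// Relabel so SELinux permits the non-root agent container to write.
		script += "; chcon -Rt svirt_sandbox_file_t /usr/share/elastic-agent/state"
	}
	return []string{"/usr/bin/env", "bash", "-c", script}
}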

Testing

TODO

  • Additional fleet testing
  • Getting some validation around the root requirement in Agent
  • Update documentation to clarify what happens in these new circumstances

naemono added 6 commits March 22, 2023 13:21
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Disable global CA test when running locally.

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
naemono added the >bug (Something isn't working) and >enhancement (Enhancement of existing functionality) labels on Mar 27, 2023
naemono (Contributor, Author) commented Mar 28, 2023

buildkite test this -f p=gke,s=7.17.8

thbkrkr (Contributor) commented Mar 28, 2023

buildkite test this -f p=gke,s=7.17.8

naemono (Contributor, Author) commented Mar 28, 2023

buildkite test this -f p=gke,s=8.6.2

naemono added 2 commits March 28, 2023 14:36
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
naemono marked this pull request as ready for review on March 29, 2023 03:03
@@ -97,11 +95,7 @@ spec:
podTemplate:
spec:
serviceAccountName: elastic-agent
hostNetwork: true

Why are we removing hostNetwork: true?

@@ -79,8 +79,6 @@ spec:
spec:
serviceAccountName: fleet-server
automountServiceAccountToken: true
securityContext:
runAsUser: 0
barkbay (Contributor) commented Apr 5, 2023

I think this is still required for the container's CA bundle to be updated, see:

barkbay (Contributor) left a comment

I didn't manage to get it working on OpenShift:

pod/elastic-agent-agent-mdgdt    0/1     Init:Error   2 (26s ago)   29s   10.128.2.24   barkbay-ocp-kwdmv-worker-d-84dkv.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-p9vql    0/1     Init:Error   2 (26s ago)   30s   10.131.0.17   barkbay-ocp-kwdmv-worker-c-5q6vm.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-r5dxn    0/1     Init:Error   2 (26s ago)   30s   10.129.2.18   barkbay-ocp-kwdmv-worker-b-6sh5v.c.elastic-cloud-dev.internal   <none>           <none>
k logs pod/elastic-agent-agent-p9vql -c permissions
chmod: changing permissions of '/usr/share/elastic-agent/state': Permission denied

Setting privileged: true helps the init container to run:

--- a/pkg/controller/agent/volume.go
+++ b/pkg/controller/agent/volume.go
@@ -7,6 +7,7 @@ package agent
 import (
        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
+       ptr "k8s.io/utils/pointer"
 
        "github.com/blang/semver/v4"
 
@@ -57,7 +58,8 @@ func maybeAgentInitContainerForHostpathVolume(spec *agentv1alpha1.AgentSpec, v s
                        Command: hostPathVolumeInitContainerCommand(),
                        Name:    hostPathVolumeInitContainerName,
                        SecurityContext: &corev1.SecurityContext{
-                               RunAsUser: pointer.Int64(0),
+                               RunAsUser:  pointer.Int64(0),
+                               Privileged: ptr.Bool(true),
                        },
                        Resources: hostPathVolumeInitContainerResources,
                        VolumeMounts: []corev1.VolumeMount{

But then the agent container fails to start:

pod/elastic-agent-agent-78bjs    0/1     CrashLoopBackOff   5 (95s ago)    4m36s   10.128.2.25   barkbay-ocp-kwdmv-worker-d-84dkv.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-r8gmq    0/1     CrashLoopBackOff   5 (89s ago)    4m35s   10.131.0.18   barkbay-ocp-kwdmv-worker-c-5q6vm.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-w2ztb    0/1     CrashLoopBackOff   5 (102s ago)   4m36s   10.129.2.20   barkbay-ocp-kwdmv-worker-b-6sh5v.c.elastic-cloud-dev.internal   <none>           <none>
kubectl logs elastic-agent-agent-r8gmq -n agent -f
Defaulted container "agent" out of: agent, permissions (init)
Error: preparing STATE_PATH(/usr/share/elastic-agent/state) failed: mkdir /usr/share/elastic-agent/state/data: permission denied
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.6/fleet-troubleshooting.html

(the ServiceAccount I'm using for Agent is in the privileged SCC)

set -e
if [[ -d /usr/share/elastic-agent/state ]]; then
chmod g+rw /usr/share/elastic-agent/state
chgrp 1000 /usr/share/elastic-agent/state

Why 1000?

barkbay (Contributor) commented Apr 5, 2023

But then the agent container fails to start:

The agent container still has to run in privileged mode:

  daemonSet:
    podTemplate:
      spec:
        automountServiceAccountToken: true
        serviceAccountName: elastic-agent
        containers:
        - name: agent
          securityContext:
            privileged: true
elastic-agent@elastic-agent-agent-4w562:~/state$ id
uid=1000(elastic-agent) gid=1000(elastic-agent) groups=1000(elastic-agent),0(root)

I guess we use 1000 as a default value for chgrp because it is the default user/group in the Docker image?

naemono (Contributor, Author) commented Apr 5, 2023

I guess we use 1000 as a default value for chgrp because it is the default user/group in the Docker image?

That is correct.

I didn't manage to get it working on OpenShift:

That is odd, as I tested on OpenShift and still have the Agent running successfully in my cluster:

❯ kc get agent -n elastic elastic-agent -o yaml | yq e '.spec.daemonSet' -
podTemplate:
  metadata:
    creationTimestamp: null
  spec:
    automountServiceAccountToken: true
    containers:
      - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        name: agent
        resources: {}
    serviceAccountName: elastic-agent
updateStrategy: {}

❯ kc get pod -n elastic elastic-agent-agent-9mtrh -o yaml | yq e '.spec.containers' -
- args:
    - -e
    - -c
    - /etc/agent.yml
  env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
  image: docker.elastic.co/beats/elastic-agent:8.6.1
  imagePullPolicy: IfNotPresent
  name: agent
  resources:
    limits:
      cpu: 200m
      memory: 350Mi
    requests:
      cpu: 200m
      memory: 350Mi
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /usr/share/elastic-agent/state
      name: agent-data
    - mountPath: /etc/agent.yml
      name: config
      readOnly: true
      subPath: agent.yml
    - mountPath: /mnt/elastic-internal/elasticsearch-association/elastic/elasticsearch/certs
      name: elasticsearch-certs-0
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5jg8f
      readOnly: true

# Note this is a previous version where I was doing some debugging....
❯ kc get pod -n elastic elastic-agent-agent-9mtrh -o yaml | yq e '.spec.initContainers' -
- command:
    - /usr/bin/env
    - bash
    - -c
    - |
      #!/usr/bin/env bash
      set -e
      find /usr/share/elastic-agent -ls
      if [[ -d /usr/share/elastic-agent/state ]]; then
        echo "Adjusting g+rw of /usr/share/elastic-agent/state"
        chmod g+rw /usr/share/elastic-agent/state
        echo "Adjusting group ownership of /usr/share/elastic-agent/state"
        chgrp 1000 /usr/share/elastic-agent/state
        if [ -n "$(ls -A /usr/share/elastic-agent/state 2>/dev/null)" ]; then
          echo "Adjusting group ownership of /usr/share/elastic-agent/state/*"
          chgrp 1000 /usr/share/elastic-agent/state/*
          echo "Adjusting g+rw of /usr/share/elastic-agent/state/*"
          chmod g+rw /usr/share/elastic-agent/state/*
        fi
      fi
  image: docker.elastic.co/beats/elastic-agent:8.6.1
  imagePullPolicy: IfNotPresent
  name: permissions
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 100m
      memory: 128Mi
  securityContext:
    runAsUser: 0
  terminationMessagePath: /dev/termination-log
  terminationMessagePolicy: File
  volumeMounts:
    - mountPath: /usr/share/elastic-agent/state
      name: agent-data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-5jg8f
      readOnly: true

❯ kc exec -it -n elastic elastic-agent-agent-9mtrh -- id
Defaulted container "agent" out of: agent, permissions (init)
uid=1000(elastic-agent) gid=1000(elastic-agent) groups=1000(elastic-agent),0(root)

I did have to run this:

oc adm policy add-scc-to-user privileged -z elastic-agent -n elastic

Maybe it's something to do with OpenShift versions? What version were you running, @barkbay?

❯ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.9

I'm going to wipe this out, start fresh, and see if I can replicate what you're seeing.

barkbay (Contributor) commented Apr 5, 2023

Maybe it's something to do with Openshift versions?

Server Version: 4.10.12
Kubernetes Version: v1.23.5+70fb84c

To be honest, I'm a bit surprised that it's possible to change permissions on the host file system from a non-privileged container, given SELinux. My understanding is that the purpose of SELinux is to prevent changes on the host even if a process is running as root (but maybe I'm wrong, and being in the privileged SCC should allow that).

barkbay (Contributor) commented Apr 5, 2023

I guess we use 1000 as a default value for chgrp because it is the default user/group in the Docker image?

That is correct.

I have to admit that I'm not a big fan of depending on such an implementation detail. We should assume that a container can run as any user ID.

naemono (Contributor, Author) commented Apr 5, 2023

I have to admit that I'm not a big fan of depending on such an implementation detail. We should assume that a container can run as any user ID.

Good point. I'll work to deduce the group from the configuration so this can run as any user ID, and will update when implementation/testing is complete.
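
(For illustration, a hypothetical sketch of that deduction; the helper name, field precedence, and fallback are assumptions, not the eventual implementation:)

package agent

import corev1 "k8s.io/api/core/v1"

// targetGroup deduces the group to chgrp to from the Pod template instead
// of hard-coding 1000. Hypothetical sketch; the precedence is an assumption.
func targetGroup(spec corev1.PodSpec) int64 {
	if sc := spec.SecurityContext; sc != nil && sc.FSGroup != nil {
		return *sc.FSGroup
	}
	for _, c := range spec.Containers {
		if csc := c.SecurityContext; csc != nil && csc.RunAsGroup != nil {
			return *csc.RunAsGroup
		}
	}
	return 1000 // fall back to the image's default group discussed above
}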

barkbay (Contributor) commented Apr 6, 2023

I'm going to wipe this out, and start fresh and see if I can replicate what you're seeing....

I did the same test again on a brand new cluster with the same result:

pod/elastic-agent-agent-bth8q    0/1     Init:Error              4 (58s ago)   2m11s   10.129.2.12   barkbay-ocp-2xw5s-worker-d-pjpgf.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-pbsbc    0/1     Init:Error              4 (54s ago)   2m11s   10.128.2.13   barkbay-ocp-2xw5s-worker-b-sr7kw.c.elastic-cloud-dev.internal   <none>           <none>
pod/elastic-agent-agent-wnqxt    0/1     Init:CrashLoopBackOff   3 (50s ago)   2m11s   10.131.0.30   barkbay-ocp-2xw5s-worker-c-bbf8b.c.elastic-cloud-dev.internal   <none>           <none>
pod/elasticsearch-es-default-0   1/1     Running                 0             2m51s   10.131.0.29   barkbay-ocp-2xw5s-worker-c-bbf8b.c.elastic-cloud-dev.internal   <none>           <none>
k logs pod/elastic-agent-agent-pbsbc -c permissions
chmod: changing permissions of '/usr/share/elastic-agent/state': Permission denied

The original resources manifest I used is here: https://gist.github.com/barkbay/e9c240ea1a7333d428e5508a155de66c#file-kubernetes-integration-yaml

Note that I adjusted the namespace as resources are usually never deployed in the default one (and I think this is even more true on OpenShift).

barkbay (Contributor) commented Apr 6, 2023

As mentioned in one of my previous messages, I suspect SELinux is the culprit:

type=AVC msg=audit(1680773014.792:68): avc:  denied  { setattr } for  pid=199848 comm="chmod" name="state" dev="sda4" ino=44055922 scontext=system_u:system_r:container_t:s0:c472,c648 tcontext=system_u:object_r:container_var_lib_t:s0 tclass=dir permissive=0
type=SYSCALL msg=audit(1680773014.792:68): arch=c000003e syscall=268 success=no exit=-13 a0=ffffff9c a1=5615e593f3b0 a2=1fd a3=fffff3ff items=0 ppid=199829 pid=199848 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="chmod" exe="/usr/bin/chmod" subj=system_u:system_r:container_t:s0:c472,c648 key=(null)ARCH=x86_64 SYSCALL=fchmodat AUID="unset" UID="root" GID="root" EUID="root" SUID="root" FSUID="root" EGID="root" SGID="root" FSGID="root"
type=PROCTITLE msg=audit(1680773014.792:68): proctitle=63686D6F6400672B7277002F7573722F73686172652F656C61737469632D6167656E742F7374617465
sh-4.4# ls -laZ /var/lib/elastic-agent/my-agent-project/elastic-agent
total 0
drwxr-xr-x. 3 root root system_u:object_r:container_var_lib_t:s0 19 Apr  6 09:17 .
drwxr-xr-x. 3 root root system_u:object_r:container_var_lib_t:s0 27 Apr  6 09:17 ..
drwxr-xr-x. 2 root root system_u:object_r:container_var_lib_t:s0  6 Apr  6 09:17 state

The permissions container runs with the following label: system_u:system_r:container_t

naemono (Contributor, Author) commented Apr 6, 2023

Thanks for the follow-up. I replicated the same behavior late yesterday, after having massive issues with my OpenShift cluster and finally just rebuilding it. I'm still a bit baffled as to why it was working before, but will move forward with what I'm seeing now.

naemono added 2 commits April 6, 2023 11:57
…econciliation.

Detect openshift when adding agent init container to be able to add 'privileged: true' automatically.
Run 'chcon' on Agent state directory when running within Openshift.
Add bool func to our utils/pointer package.
Update tests for new functionality

Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
Signed-off-by: Michael Montgomery <mmontg1@gmail.com>
naemono (Contributor, Author) commented Apr 6, 2023

@barkbay With the recent changes, this now fully works within OpenShift.

I've also made some changes to try and handle #6543, but I'm still doing some testing around that feature.

pebrc (Collaborator) commented Apr 11, 2023

Drive-by comment: is it a good idea to special-case OpenShift here? Wouldn't the same restrictions that we are trying to work around for OpenShift also apply to any non-OpenShift cluster that has SELinux set up?


const (
hostPathVolumeInitContainerName = "permissions"
chconCmd = "chcon -Rt svirt_sandbox_file_t /usr/share/elastic-agent/state"

Using chcon in a privileged container without explicit user consent seems wrong to me from a security point of view.

barkbay (Contributor) commented Apr 11, 2023

Would not the same restrictions that we are trying to work around for OpenShift apply to any non-OpenShift cluster as well if it has SELinux set up?

Yes. I'm also wondering if the opposite is possible: when running OpenShift on a file system without SELinux, what would be the result of the chcon command?

More generally, I'm a bit puzzled by the idea of building a feature on something that I considered a "best effort" (using isOpenShift(...) to detect OpenShift) until now, and for which a flag could be used as an escape hatch.
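
(For reference, best-effort detection along the lines mentioned above typically probes the discovery API for an OpenShift-only group; a hypothetical sketch, not this PR's code:)

package agent

import "k8s.io/client-go/discovery"

// isOpenShift reports whether an OpenShift-only API group is served.
// Best-effort: a discovery error is treated as "not OpenShift".
func isOpenShift(dc discovery.DiscoveryInterface) bool {
	groups, err := dc.ServerGroups()
	if err != nil {
		return false
	}
	for _, g := range groups.Groups {
		if g.Name == "security.openshift.io" {
			return true
		}
	}
	return false
}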

naemono (Contributor, Author) commented Apr 18, 2023

This is being closed in favor of documenting a DaemonSet that can be used to prepare the Agent directory for running Elastic Agent without the need to run as root. This decision was made after discussing the security concerns around automatically managing these permissions without explicit user consent; requiring a DaemonSet to be applied before running Agent makes that consent explicit.

naemono closed this on Apr 18, 2023
naemono mentioned this pull request on Apr 18, 2023
Successfully merging this pull request may close these issues:

  • Potentially chown Elastic Agent hostpath data directory
  • Revisit Elastic Agent certificate handling