
chown/chgrp on dynamic provisioned pvc fails #300

Closed
gazal-k opened this issue Jan 11, 2021 · 76 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@gazal-k
Contributor

gazal-k commented Jan 11, 2021

/kind bug

What happened?

Using the dynamic provisioning feature introduced by #274 fails for applications that try to chown their PV.

What you expected to happen?

With the old efs-provisioner, this caused no issues.
But with dynamic provisioning in this CSI driver, the chown command fails. I must admit, I don't understand how the uid/gid thing works with EFS access points. The pod user does not seem to have any association with the uid/gid on the access point, yet pods can read & write the mounted PV just fine.

How to reproduce it (as minimally and precisely as possible)?

For the first chart (grafana), we saw the initContainer failing. We tried disabling the initContainer with these chart overrides:

initChownData:
  enabled: false

And the application worked fine.

For the nexus chart, the logs show a whole bunch of errors like:

chgrp: changing group of '/nexus-data/elasticsearch/nexus': Operation not permitted

Perhaps these charts don't really need that step with dynamic provisioning. Another nexus chart seems to have an option to skip the chown step: https://github.com/travelaudience/docker-nexus/blob/a86261e35734ae514c0236e8f371402e2ea0feec/run#L3-L6

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-23T02:22:53Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version:
    amazon/aws-efs-csi-driver:master
    quay.io/k8scsi/csi-node-driver-registrar:v1.3.0
    k8s.gcr.io/sig-storage/csi-provisioner:v2.0.2
    
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 11, 2021
gazal-k added a commit to gazal-k/nexus that referenced this issue Jan 11, 2021
Similar to the option here: travelaudience/docker-nexus#33. This may be required to address this issue: kubernetes-sigs/aws-efs-csi-driver#300.
gazal-k added a commit to gazal-k/nexus that referenced this issue Jan 11, 2021
Similar to what's done here: travelaudience/docker-nexus#33. This may be required to address this issue: kubernetes-sigs/aws-efs-csi-driver#300.

Signed-off-by: Gazal K <mohamed.gazal@target.com.au>
@gabegorelick

I'm also seeing this. This means that things like postgres, which fails to start if its data directory is not owned by the postgres user, don't work with dynamic provisioning.

From https://docs.aws.amazon.com/efs/latest/ug/accessing-fs-nfs-permissions.html,

By default, root squashing is disabled on EFS file systems. Amazon EFS behaves like a Linux NFS server with no_root_squash. If a user or group ID is 0, Amazon EFS treats that user as the root user, and bypasses permissions checks (allowing access and modification to all file system objects). Root squashing can be enabled on a client connection when the AWS Identity and Access Management (AWS IAM) identity or resource policy does not allow access to the ClientRootAccess action. When root squashing is enabled, the root user is converted to a user with limited permissions on the NFS server.

I can reproduce this without a file system policy, and adding a file system policy that grants elasticfilesystem:ClientRootAccess doesn't seem to make a difference. Granting elasticfilesystem:ClientRootAccess to the driver's and pod's IAM roles also doesn't help.

@wochanda

Thanks for reporting this. We're investigating possible fixes, but in the meantime let me explain the reason this is happening.

Dynamic provisioning shares a single file system among multiple PVs by using EFS Access Points. Access Points allow server-side overrides of user/group information, overriding whatever user/group the app/container runs as. When we create an AP with dynamic provisioning, we allocate a unique UID/GID (for instance 50000:50000) that all file operations are mapped to, and create a unique directory (e.g. /ap50000) owned by that user/group. This ensures that no matter how the container is configured, it has read/write access to its root directory.

What is happening in this case is that the application is trying to take its own steps to make its root directory writable. For instance, if the container has an application user with UID/GID 100:100, when it runs ls -la on its FS root directory it sees the directory is owned by 50000:50000, not 100:100, so it assumes it needs to chown/chmod for things to work. However, even if we allowed this command to go through, the application would lose access to its own volume.

This is why the original issue above was resolved by disabling the chown/chgrp checks. This method can be used as a workaround for any application, since you can trust the PVs to be writable out of the box.
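
For context, a StorageClass for this mode looks roughly like the following sketch (fs-12345678 is a placeholder; gidRangeStart/gidRangeEnd bound the UID/GID pool the driver allocates from, following the driver's dynamic provisioning example):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap           # one EFS Access Point per PV
  fileSystemId: fs-12345678          # placeholder: your EFS file system ID
  directoryPerms: "700"              # mode of the per-PV root directory
  gidRangeStart: "50000"             # optional: start of the UID/GID pool
  gidRangeEnd: "70000"               # optional: end of the UID/GID pool
  basePath: "/dynamic_provisioning"  # optional: parent dir for per-PV directories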

@gabegorelick

This is why the original issue above was resolved by disabling the chown/chgrp checks. This method can be used as a workaround for any application

Some applications don't support disabling ownership checks. E.g. I'm not aware of any way to disable it in Postgres. In such cases, the only workaround I've found is to create a user for the UID assigned by the driver (something like useradd --uid "$(stat -c '%u' "$mountpath")") and then run the application as that user.
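
A rough sketch of that workaround as a container spec fragment (the /data mount path, the volume-owner user name, and the final app command are hypothetical; it assumes the image ships useradd and su, and that the container starts as root so useradd can succeed):

command: ["bash", "-c"]
args:
  - |
    # Look up the UID the driver assigned to the volume root
    uid="$(stat -c '%u' /data)"
    # Create a matching user, then demote to it for the real process
    useradd --no-create-home --uid "$uid" volume-owner
    exec su -s /bin/bash -c "/usr/local/bin/app" volume-owner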

@wongma7
Contributor

wongma7 commented Apr 22, 2021

Part of the reason efs-provisioner worked seamlessly is that it relied on a beta annotation "pv.beta.kubernetes.io/gid" https://github.com/kubernetes-retired/external-storage/blob/201f40d78a9d3fd57d8a441cfc326988d88f35ec/nfs/pkg/volume/provision.go#L62 that silently does basically what your workaround does: it ensures that the Pod using the PV has the annotated group in its supplemental groups (i.e. if the annotation says '100' and you execute groups as the pod user, '100' will be among them).

This feature is very old and predates the rigorous KEP/feature tracking system that exists today, and I think it's been forgotten by sig-storage. Certainly I am culpable for relying on it while doing nothing to make it more than a beta annotation. I'll try to bring it up in sig-storage and see if we can rely on it as an alternative solution.

@gabegorelick

Part of the reason efs-provisioner worked seamlessly is that it relied on a beta annotation "pv.beta.kubernetes.io/gid" https://github.com/kubernetes-retired/external-storage/blob/201f40d78a9d3fd57d8a441cfc326988d88f35ec/nfs/pkg/volume/provision.go#L62 that silently does basically what your workaround does: it ensures that the Pod using the PV has the annotated group in its supplemental groups

Does EFS still reject the chown call when using efs-provisioner, or is it that applications are expected not to call chown since their group already owns the directory?

If I'm reading the Postgres code correctly, it seems to check the UID of the directory. So I'm not sure supplemental groups would work for all use cases.

@wongma7
Contributor

wongma7 commented Apr 23, 2021

Does EFS still reject the chown call when using efs-provisioner, or is it that applications are expected not to call chown since their group already owns the directory?

It's the latter, so you are right, the supplemental group approach won't work in the postgres case.

Does a subdirectory of the access point have the same restriction on chown'ing? If not, I suppose one could create the subdirectory before starting the application, maybe with an initContainer, and then let the application chown it; an untested sketch follows. Still not a very elegant workaround, though.
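
Sketch of that idea (image, volume name, and paths are made up; whether chown is then permitted inside the subdirectory is exactly the open question):

initContainers:
  - name: make-subdir
    image: busybox:1.36
    # Create a subdirectory under the access point root before the app starts
    command: ["sh", "-c", "mkdir -p /data/appdata"]
    volumeMounts:
      - name: efs-volume      # hypothetical PVC-backed volume name
        mountPath: /data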

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2021
@MarcoMatarazzo

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2021
@ankitbansal-gc

ankitbansal-gc commented Sep 7, 2021

Can someone please suggest a workaround for the postgres chown issue when using EFS with dynamic provisioning via access points? (I honestly hope I don't have to fall back to static provisioning to make postgres work on EKS!)

@ankitbansal-gc

This is why the original issue above was resolved by disabling the chown/chgrp checks. This method can be used as a workaround for any application

Some applications don't support disabling ownership checks. E.g. I'm not aware of any way to disable it in Postgres. In such cases, the only workaround I've found is to create a user for the UID assigned by the driver (something like useradd --uid "$(stat -c '%u' "$mountpath")") and then run the application as that user.

@gabegorelick - if I set up runAsUser and runAsGroup in the pod security context, the postgres pod fails with a FATAL error:

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data/pgdata ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 20
selecting default shared_buffers ... 400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
2021-09-07 11:06:38.315 UTC [68] FATAL:  data directory "/var/lib/postgresql/data/pgdata" has wrong ownership
2021-09-07 11:06:38.315 UTC [68] HINT:  The server must be started by the user that owns the data directory.
child process exited with exit code 1
initdb: removing contents of data directory "/var/lib/postgresql/data/pgdata"
running bootstrap script ...

Here is the Kubernetes manifest I am using:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-service
  selector:
    matchLabels:
      app: postgres
  replicas: 2
  template:
    metadata:
      labels:
        app: postgres
    spec:
      securityContext:
        fsGroup: 999
        runAsUser: 999
        runAsGroup: 999
      containers:
        - name: postgres
          image: postgres:latest
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: postgres-persistent-storage
              mountPath: /var/lib/postgresql/data
          env:
            - name: POSTGRES_DB
              value: crm
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: user
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: pwd
      # Volume Claim
      volumes:
      - name: postgres-persistent-storage
        persistentVolumeClaim:
          claimName: efs-claim

@gazal-k
Contributor Author

gazal-k commented Sep 7, 2021

This issue probably needs to be addressed, but I'm not sure about using EFS for postgres. I have seen less demanding storage use cases where EFS struggles.

@gabegorelick

Can someone please suggest a workaround for postgres chown issue when using EFS with dynamic provisioning via access points?

I used #434. That's still not merged though, so you'll have to run a fork if you want it. But that is the only way that I'm aware of to specify a UID for the provisioned volumes.

if I setup runAsUser and runAsGroup in pod security context then postgres pod fails with a FATAL error

One workaround is to not do this. Instead, use runAsUser: 0 and fsGroup: 0. Then in your container's entrypoint, invoke postgres (or whatever process you want to run) as the UID of the dynamic volume.
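
In manifest terms, that is roughly this sketch (the entrypoint change itself looks like the code in a later comment below):

securityContext:
  runAsUser: 0   # start as root so the entrypoint can switch to the volume owner's UID
  fsGroup: 0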

@ankitbansal-gc

ankitbansal-gc commented Sep 7, 2021

One workaround is to not do this. Instead, use runAsUser: 0 and fsGroup: 0. Then in your container's entrypoint, invoke postgres (or whatever process you want to run) as the UID of the dynamic volume.

@gabegorelick can you please guide me on what I need to change the container's entrypoint command from cmd [postgres] to?

@wernich-vg

I see the PR fixing this is still not merged. I'm not running postgres, but trying to run freshclam with EFS storage for the signatures database. In the entrypoint they chown the database directory with no prior checks or options to override.

@nickperkins

I would love to see this PR merged in. Currently hitting this issue while trying to run docker:dind with a PV.

@raghulkrishna

I am also having the same issue with some of my applications.

@Colbize

Colbize commented Oct 14, 2021

Ran into this issue. As gabegorelick mentioned, I only got it to work by chowning the mount path with the UID and GID of the dynamic volume and then setting the postgres user to the same IDs. Not ideal, but it works...

Helm code:

command: ["bash", "-c"]
args: ["usermod -u $(stat -c '%u' '{{ .Values.postgres_volume_mount_path }}')  postgres && \
        groupmod -g $(stat -c '%u' '{{ .Values.postgres_volume_mount_path }}')  postgres && \
        chown -R postgres:postgres {{ .Values.postgres_volume_mount_path }} && \
        /usr/local/bin/docker-entrypoint.sh postgres"]

@srudin

srudin commented Mar 9, 2023

How can this be closed? I understand there are not enough resources to work on it now, but it seems severe enough that it ought to be kept in the backlog rather than just closed.
And considering the feedback, it would probably make sense to assign it a higher priority and use some of the available resources to actually solve it.

@RyanStan
Contributor

/reopen

@k8s-ci-robot
Contributor

@RyanStan: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mskanth972
Contributor

/reopen

@k8s-ci-robot
Contributor

@mskanth972: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this May 23, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Jun 22, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@z0rc
Contributor

z0rc commented Jun 22, 2023

I hate this bot.

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@z0rc: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

I hate this bot.

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 22, 2023
@lotyp

lotyp commented Sep 25, 2023

So? Any solution? I can't install rancher monitoring, which uses grafana and a persistence stack on EFS, because of this error.

The Grafana deployment supplied by rancher tries to execute chown and fails:

chown: /var/lib/grafana: Operation not permitted
2023-09-25T11:54:17.450552187Z chown: /var/lib/grafana: Operation not permitted

@wsj31013

wsj31013 commented Dec 7, 2023

I am also experiencing the same issue while using EKS and the EFS CSI driver...

@aschaber1

/reopen

@k8s-ci-robot
Contributor

@aschaber1: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jetndra

jetndra commented Jan 18, 2024

Please reopen it; I'm still facing the same issue with EFS.

@davidgiffin

/reopen

@k8s-ci-robot
Contributor

@davidgiffin: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@DylanWard14

DylanWard14 commented Feb 5, 2024

@jetndra

Please reopen it; I'm still facing the same issue with EFS.

Did you find a solution? I am running into this issue also.

@gazal-k
Contributor Author

gazal-k commented Feb 5, 2024

For anybody else facing this issue, please see this comment to understand why this is happening.

The EFS CSI driver does implement the storage interface itself quite well. It's just that some applications expect to be able to then run commands like chown on the PV, which are not supported. Making such steps optional is one way to avoid application failure when using this driver, e.g. jenkins-x-charts/nexus#65.

@armhart

armhart commented Jun 28, 2024

Using a fixed group ID works for me:

#1202 (comment)
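
For reference, pinning the IDs at provisioning time looks roughly like this StorageClass fragment (assumes a driver version that supports the uid/gid parameters; all values are examples):

parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-12345678   # placeholder: your EFS file system ID
  directoryPerms: "700"
  uid: "999"                  # fixed POSIX user ID applied to the access point
  gid: "999"                  # fixed POSIX group ID applied to the access point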

@jetndra

jetndra commented Aug 9, 2024

@jetndra

Please reopen it; I'm still facing the same issue with EFS.

Did you find a solution? I am running into this issue also.

This workaround works for me:

command: ["bash", "-c"]
        args: ["usermod -u $(stat -c '%u' '/var/lib/postgresql/data')  postgres && \
                groupmod -g $(stat -c '%u' '/var/lib/postgresql/data')  postgres && \
                chown -R postgres:postgres /var/lib/postgresql && \
                /usr/local/bin/docker-entrypoint.sh postgres"]
