
chown/chgrp on dynamic provisioned pvc fails #300

Closed
gazal-k opened this issue Jan 11, 2021 · 76 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@gazal-k
Contributor

gazal-k commented Jan 11, 2021

/kind bug

What happened?

Using the dynamic provisioning feature introduced by #274 fails for applications that try to chown their PV.

What you expected to happen?

With the old efs-provisioner, this caused no issues.
But with dynamic provisioning in this CSI driver, the chown command fails. I must admit, I don't understand how the uid/gid thing works with EFS access points. The pod user does not seem to have any association with the uid/gid on the access point, yet pods can read & write the mounted PV just fine.

How to reproduce it (as minimally and precisely as possible)?

For the first chart (grafana), we saw the initContainer failing. We tried disabling the initContainer with these chart overrides:

initChownData:
  enabled: false

And the application worked fine.

For the nexus chart, the logs show a whole bunch of errors like:

chgrp: changing group of '/nexus-data/elasticsearch/nexus': Operation not permitted

Perhaps these charts don't really need that step with dynamic provisioning. Another nexus chart seems to have an option to skip the chown step: https://github.com/travelaudience/docker-nexus/blob/a86261e35734ae514c0236e8f371402e2ea0feec/run#L3-L6

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.1", GitCommit:"c4d752765b3bbac2237bf87cf0b1c2e307844666", GitTreeState:"clean", BuildDate:"2020-12-23T02:22:53Z", GoVersion:"go1.15.6", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.12-eks-7684af", GitCommit:"7684af4ac41370dd109ac13817023cb8063e3d45", GitTreeState:"clean", BuildDate:"2020-10-20T22:57:40Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version:
    amazon/aws-efs-csi-driver:master
    quay.io/k8scsi/csi-node-driver-registrar:v1.3.0
    k8s.gcr.io/sig-storage/csi-provisioner:v2.0.2
    
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 11, 2021
gazal-k added a commit to gazal-k/nexus that referenced this issue Jan 11, 2021
Similar to the option here: travelaudience/docker-nexus#33. This may be required to address this issue: kubernetes-sigs/aws-efs-csi-driver#300.
gazal-k added a commit to gazal-k/nexus that referenced this issue Jan 11, 2021
Similar to what's done here: travelaudience/docker-nexus#33. This may be required to address this issue: kubernetes-sigs/aws-efs-csi-driver#300.

Signed-off-by: Gazal K <mohamed.gazal@target.com.au>
@gabegorelick

I'm also seeing this. This means that things like postgres, which fails to start if its data directory is not owned by the postgres user, don't work with dynamic provisioning.

From https://docs.aws.amazon.com/efs/latest/ug/accessing-fs-nfs-permissions.html,

By default, root squashing is disabled on EFS file systems. Amazon EFS behaves like a Linux NFS server with no_root_squash. If a user or group ID is 0, Amazon EFS treats that user as the root user, and bypasses permissions checks (allowing access and modification to all file system objects). Root squashing can be enabled on a client connection when the AWS Identity and Access Management (AWS IAM) identity or resource policy does not allow access to the ClientRootAccess action. When root squashing is enabled, the root user is converted to a user with limited permissions on the NFS server.

I can reproduce this without a file system policy, and adding a file system policy that grants elasticfilesystem:ClientRootAccess doesn't seem to make a difference. Granting elasticfilesystem:ClientRootAccess to the driver's and pod's IAM roles also doesn't help.

@wochanda

Thanks for reporting this. We're investigating possible fixes, but in the meantime let me explain the reason this is happening.

Dynamic provisioning shares a single file system among multiple PVs by using EFS Access Points. Access Points allow server-side overrides of user/group information, overriding whatever user/group the app/container runs as. When we create an AP with dynamic provisioning, we allocate a unique UID/GID (for instance 50000:50000) that all file operations are mapped to, and create a unique directory (e.g. /ap50000) owned by that user/group. This ensures that no matter how the container is configured, it has read/write access to its root directory.

What is happening in this case is that the application is trying to take its own steps to make its root directory writable. For instance, if the container has an application user with UID/GID 100:100, when it runs ls -la on its FS root directory it sees the directory is owned by 50000:50000, not 100:100, so it assumes it needs to chown/chmod for things to work. However, even if we allowed this command to go through, the application would lose access to its own volume.

This is why the original issue above was resolved by disabling the chown/chgrp checks. This method can be used as a workaround for any application, since you can trust the PVs to be writable out of the box.
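
For context, a StorageClass for this mode looks roughly like the following sketch (fs-12345678 is a placeholder; gidRangeStart/gidRangeEnd bound the UID/GID pool the driver allocates from, following the driver's dynamic provisioning example):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap           # one EFS Access Point per PV
  fileSystemId: fs-12345678          # placeholder: your EFS file system ID
  directoryPerms: "700"              # mode of the per-PV root directory
  gidRangeStart: "50000"             # optional: start of the UID/GID pool
  gidRangeEnd: "70000"               # optional: end of the UID/GID pool
  basePath: "/dynamic_provisioning"  # optional: parent dir for per-PV directories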

@gabegorelick

This is why the original issue above was resolved by disabling the chown/chgrp checks. This method can be used as a workaround for any application

Some applications don't support disabling ownership checks. E.g. I'm not aware of any way to disable it in Postgres. In such cases, the only workaround I've found is to create a user for the UID assigned by the driver (something like useradd --uid "$(stat -c '%u' "$mountpath")") and then run the application as that user.
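
A rough sketch of that workaround as a container spec fragment (the /data mount path, the volume-owner user name, and the final app command are hypothetical; it assumes the image ships useradd and su, and that the container starts as root so useradd can succeed):

command: ["bash", "-c"]
args:
  - |
    # Look up the UID the driver assigned to the volume root
    uid="$(stat -c '%u' /data)"
    # Create a matching user, then demote to it for the real process
    useradd --no-create-home --uid "$uid" volume-owner
    exec su -s /bin/bash -c "/usr/local/bin/app" volume-owner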

@wongma7
Contributor

wongma7 commented Apr 22, 2021

Part of the reason efs-provisioner worked seamlessly is that it relied on a beta annotation "pv.beta.kubernetes.io/gid" https://github.com/kubernetes-retired/external-storage/blob/201f40d78a9d3fd57d8a441cfc326988d88f35ec/nfs/pkg/volume/provision.go#L62 that silently does basically what your workaround does: it ensures that the Pod using the PV has the annotated group in its supplemental groups (i.e. if the annotation says '100' and you execute groups as the pod user, '100' will be among them).

This feature is very old and predates the rigorous KEP/feature tracking system that exists today, and I think it's been forgotten by sig-storage. Certainly I am culpable for relying on it while doing nothing to make it more than a beta annotation. I'll try to bring it up in sig-storage and see if we can rely on it as an alternative solution.

@gabegorelick

Part of the reason efs-provisioner worked seamlessly is that it relied on a beta annotation "pv.beta.kubernetes.io/gid" https://github.com/kubernetes-retired/external-storage/blob/201f40d78a9d3fd57d8a441cfc326988d88f35ec/nfs/pkg/volume/provision.go#L62 that silently does basically what your workaround does: it ensures that the Pod using the PV has the annotated group in its supplemental groups

Does EFS still reject the chown call when using efs-provisioner, or is it that applications are expected not to call chown since their group already owns the directory?

If I'm reading the Postgres code correctly, it seems to check the UID of the directory. So I'm not sure supplemental groups would work for all use cases.

@wongma7
Contributor

wongma7 commented Apr 23, 2021

Does EFS still reject the chown call when using efs-provisioner, or is it that applications are expected not to call chown since their group already owns the directory?

It's the latter, so you are right, the supplemental group approach won't work in the postgres case.

Does a subdirectory of the access point have the same restriction on chown'ing? If not, I suppose one could create the subdirectory before starting the application, maybe with an initContainer, and then let the application chown it; an untested sketch follows. Still not a very elegant workaround, though.
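
Sketch of that idea (image, volume name, and paths are made up; whether chown is then permitted inside the subdirectory is exactly the open question):

initContainers:
  - name: make-subdir
    image: busybox:1.36
    # Create a subdirectory under the access point root before the app starts
    command: ["sh", "-c", "mkdir -p /data/appdata"]
    volumeMounts:
      - name: efs-volume      # hypothetical PVC-backed volume name
        mountPath: /data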

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2021
@MarcoMatarazzo

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 22, 2021
@ankitbansal-gc

ankitbansal-gc commented Sep 7, 2021

Can someone please suggest a workaround for the postgres chown issue when using EFS with dynamic provisioning via access points? (I honestly hope I don't have to fall back to static provisioning to make postgres work on EKS!)

@ankitbansal-gc

This is why the original issue above was resolved by disabling the chown/chgrp checks. This method can be used as a workaround for any application

Some applications don't support disabling ownership checks. E.g. I'm not aware of any way to disable it in Postgres. In such cases, the only workaround I've found is to create a user for the UID assigned by the driver (something like useradd --uid "$(stat -c '%u' "$mountpath")") and then run the application as that user.

@gabegorelick - if I set up runAsUser and runAsGroup in the pod security context, the postgres pod fails with a FATAL error:

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data/pgdata ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 20
selecting default shared_buffers ... 400kB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
2021-09-07 11:06:38.315 UTC [68] FATAL:  data directory "/var/lib/postgresql/data/pgdata" has wrong ownership
2021-09-07 11:06:38.315 UTC [68] HINT:  The server must be started by the user that owns the data directory.
child process exited with exit code 1
initdb: removing contents of data directory "/var/lib/postgresql/data/pgdata"
running bootstrap script ...

Here is the Kubernetes manifest I am using:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres-service
  selector:
    matchLabels:
      app: postgres
  replicas: 2
  template:
    metadata:
      labels:
        app: postgres
    spec:
      securityContext:
        fsGroup: 999
        runAsUser: 999
        runAsGroup: 999
      containers:
        - name: postgres
          image: postgres:latest
          imagePullPolicy: IfNotPresent
          volumeMounts:
            - name: postgres-persistent-storage
              mountPath: /var/lib/postgresql/data
          env:
            - name: POSTGRES_DB
              value: crm
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: user
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: pwd
      # Volume Claim
      volumes:
      - name: postgres-persistent-storage
        persistentVolumeClaim:
          claimName: efs-claim

@gazal-k
Contributor Author

gazal-k commented Sep 7, 2021

This issue probably needs to be addressed, but I'm not sure about using EFS for postgres. I have seen less demanding storage use cases where EFS struggles.

@gabegorelick

Can someone please suggest a workaround for postgres chown issue when using EFS with dynamic provisioning via access points?

I used #434. That's still not merged though, so you'll have to run a fork if you want it. But that is the only way that I'm aware of to specify a UID for the provisioned volumes.

if I setup runAsUser and runAsGroup in pod security context then postgres pod fails with a FATAL error

One workaround is to not do this. Instead, use runAsUser: 0 and fsGroup: 0. Then in your container's entrypoint, invoke postgres (or whatever process you want to run) as the UID of the dynamic volume.
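
In manifest terms, that is roughly this sketch (the entrypoint change itself looks like the code in a later comment below):

securityContext:
  runAsUser: 0   # start as root so the entrypoint can switch to the volume owner's UID
  fsGroup: 0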

@ankitbansal-gc

ankitbansal-gc commented Sep 7, 2021

One workaround is to not do this. Instead, use runAsUser: 0 and fsGroup: 0. Then in your container's entrypoint, invoke postgres (or whatever process you want to run) as the UID of the dynamic volume.

@gabegorelick can you please guide me on what I need to change the container's entrypoint command from cmd [postgres] to?

@wernich-vg

I see the PR fixing this is still not merged. I'm not running postgres, but trying to run freshclam with EFS storage for the signatures database. In the entrypoint they chown the database directory with no prior checks or options to override.

@nickperkins

I would love to see this PR merged in. Currently hitting this issue while trying to run docker:dind with a PV.

@raghulkrishna

I am also having the same issue with some of my applications.

@Colbize

Colbize commented Oct 14, 2021

Ran into this issue. As gabegorelick mentioned, I only got it to work by chowning the mount path with the UID and GID of the dynamic volume and then setting the postgres user to the same IDs. Not ideal, but it works...

Helm code:

command: ["bash", "-c"]
args: ["usermod -u $(stat -c '%u' '{{ .Values.postgres_volume_mount_path }}')  postgres && \
        groupmod -g $(stat -c '%u' '{{ .Values.postgres_volume_mount_path }}')  postgres && \
        chown -R postgres:postgres {{ .Values.postgres_volume_mount_path }} && \
        /usr/local/bin/docker-entrypoint.sh postgres"]

@srudin

srudin commented Mar 9, 2023

How can this be closed? I understand there are not enough resources to work on it now, but it seems severe enough that it ought to be kept in the backlog rather than just closed.
And considering the feedback, it would probably make sense to assign it a higher priority and use some of the available resources to actually solve it.

@RyanStan
Contributor

/reopen

@k8s-ci-robot
Contributor

@RyanStan: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mskanth972
Contributor

/reopen

@k8s-ci-robot
Contributor

@mskanth972: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this May 23, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Jun 22, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@z0rc
Contributor

z0rc commented Jun 22, 2023

I hate this bot.

/reopen
/remove-lifecycle rotten

@k8s-ci-robot
Contributor

@z0rc: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

I hate this bot.

/reopen
/remove-lifecycle rotten

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 22, 2023
@lotyp

lotyp commented Sep 25, 2023

So? Any solution? I can't install rancher monitoring, which uses grafana and a persistence stack on EFS, because of this error.

The Grafana deployment supplied by rancher tries to execute chown and fails:

chown: /var/lib/grafana: Operation not permitted
2023-09-25T11:54:17.450552187Z chown: /var/lib/grafana: Operation not permitted

@wsj31013

wsj31013 commented Dec 7, 2023

I am also experiencing the same issue while using EKS and the EFS CSI driver...

@aschaber1

/reopen

@k8s-ci-robot
Contributor

@aschaber1: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jetndra

jetndra commented Jan 18, 2024

Please reopen it; I'm still facing the same issue with EFS.

@davidgiffin

/reopen

@k8s-ci-robot
Contributor

@davidgiffin: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@DylanWard14

DylanWard14 commented Feb 5, 2024

@jetndra

Please reopen it; I'm still facing the same issue with EFS.

Did you find a solution? I am running into this issue also.

@gazal-k
Contributor Author

gazal-k commented Feb 5, 2024

For anybody else facing this issue, please see this comment to understand why this is happening.

The EFS CSI driver does implement the storage interface itself quite well. It's just that some applications expect to be able to then run commands like chown on the PV, which are not supported. Making such steps optional is one way to avoid application failure when using this driver, e.g. jenkins-x-charts/nexus#65.

@armhart

armhart commented Jun 28, 2024

Using a fixed group ID works for me:

#1202 (comment)
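
For reference, pinning the IDs at provisioning time looks roughly like this StorageClass fragment (assumes a driver version that supports the uid/gid parameters; all values are examples):

parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-12345678   # placeholder: your EFS file system ID
  directoryPerms: "700"
  uid: "999"                  # fixed POSIX user ID applied to the access point
  gid: "999"                  # fixed POSIX group ID applied to the access point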

@jetndra

jetndra commented Aug 9, 2024

@jetndra

Please reopen it; I'm still facing the same issue with EFS.

Did you find a solution? I am running into this issue also.

This workaround works for me:

command: ["bash", "-c"]
        args: ["usermod -u $(stat -c '%u' '/var/lib/postgresql/data')  postgres && \
                groupmod -g $(stat -c '%u' '/var/lib/postgresql/data')  postgres && \
                chown -R postgres:postgres /var/lib/postgresql && \
                /usr/local/bin/docker-entrypoint.sh postgres"]
