
[4.10] CephFS seems to be broken with FCOS 35 upgrade. image registry writes fail. #1160

Closed
fortinj66 opened this issue Mar 16, 2022 · 31 comments


@fortinj66
Contributor

Describe the bug
Writing to the image registry fails after upgrading from OKD 4.9 to 4.10 when using CephFS-backed volumes.

Version
OKD 4.10, VMware IPI

How reproducible
100%

See #1153 for details

@fortinj66 fortinj66 changed the title CephFS seems to be broken with FCOS 35 upgrade. image registry writes fail. [4.10] CephFS seems to be broken with FCOS 35 upgrade. image registry writes fail. Mar 16, 2022
@schuemann

I have the same problem, and not only with the image registry! It happens both with a clean OKD 4.10 installation with Rook 1.8.7 / Ceph 16.2.7 and with an OKD 4.9 cluster updated to 4.10.

Creating more than one new file on a CephFS volume leads to permission denied errors. Waiting more than a minute between the writes seems to help (but is obviously not a solution).

sh-4.4$ echo 1 > /test/1.txt
sh-4.4$ echo 2 > /test/2.txt
sh: /test/2.txt: Permission denied
sh-4.4$ echo 3 > /test/3.txt
sh: /test/3.txt: Permission denied
sh-4.4$ ls -la /test/
ls: cannot access '/test/3.txt': Permission denied
ls: cannot access '/test/2.txt': Permission denied
total 1
drwxrwxrwx. 2 root root  3 Mar 17 22:46 .
dr-xr-xr-x. 1 root root 40 Mar 17 22:44 ..
-rw-r--r--. 1 rook rook  2 Mar 17 22:46 1.txt
-?????????? ? ?    ?     ?            ? 2.txt
-?????????? ? ?    ?     ?            ? 3.txt
sh-4.4$ sleep 120
sh-4.4$ echo > /test/4.txt
sh-4.4$ ls -la /test/
ls: cannot access '/test/3.txt': Permission denied
ls: cannot access '/test/2.txt': Permission denied
total 1
drwxrwxrwx. 2 root root  4 Mar 17 22:48 .
dr-xr-xr-x. 1 root root 40 Mar 17 22:44 ..
-rw-r--r--. 1 rook rook  2 Mar 17 22:46 1.txt
-?????????? ? ?    ?     ?            ? 2.txt
-?????????? ? ?    ?     ?            ? 3.txt
-rw-r--r--. 1 rook rook  1 Mar 17 22:48 4.txt

No problems with RBD block volumes. No hints in the logs or on the Ceph status dashboard.

@mschaefers

Same problem here. OKD 4.9 worked, OKD 4.10 doesn't:

Version Details:
Updated from 4.9.0-0.okd-2022-02-12-140851 to 4.10.0-0.okd-2022-03-07-131213 (bare metal installation)
CephFS CSI Driver: Helm Chart 3.5.1 https://ceph.github.io/csi-charts
Ceph MDS: ceph version 16.2.7 (f9aa029788115b5df5eeee328f584156565ee5b7) pacific (stable)

@LorbusChris
Contributor

@fortinj66
Contributor Author

Re-categorized the Bugzilla as a Fedora/FCOS and Ceph issue rather than an image registry issue.

@SriRamanujam

I had some time tonight to look into this a bit further. I first tried to reproduce the issue from a bare CoreOS VM. Since my CephFS can be mounted from outside of my OKD cluster, I was able to mount my image registry's CephFS volume and do some tests.

I could not reproduce the issue from a bare FCOS VM. I tried creating single files one at a time, creating many thousands of files in a loop from the shell, etc. Nothing triggered the failure.

Okay, that puts this firmly back into the realm of OKD. The next step I took was to start up a standalone pod with the registry cephfs mounted into it and try to reproduce the issue there. I was careful to ensure the security context in my standalone pod matched the image registry pod, just to be extra sure. I could not reproduce the issue from a standalone pod.
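For illustration, a minimal sketch of the kind of standalone pod I mean (the namespace, PVC name, UID, and MCS level here are assumptions matching the registry pod output shown below in this comment):

# debug-pod.yaml (sketch only; adjust claim name, UID and SELinux level to your cluster)
apiVersion: v1
kind: Pod
metadata:
  name: cephfs-debug
  namespace: openshift-image-registry
spec:
  securityContext:
    runAsUser: 1000330000
    seLinuxOptions:
      level: "s0:c12,c18"
  containers:
  - name: shell
    image: registry.fedoraproject.org/fedora:35
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: registry
      mountPath: /data
  volumes:
  - name: registry
    persistentVolumeClaim:
      claimName: image-registry-storage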

As a last resort, I swapped my image registry pod back to using the CephFS mount, and the issue was immediately reproducible. While poking around at that, I found something rather interesting. Check this out:

A broken directory, as seen from inside the image registry pod:

sh-4.4$ pwd
/registry/docker/registry/v2/repositories/media/jackett/_uploads/1a31f11f-b7bd-413f-a9fd-829f4a996ec7
sh-4.4$ ls -lashZ
ls: cannot access 'data': Permission denied
total 512
  0 drwxr-xr-x.   2 1000330000 root system_u:object_r:container_file_t:s0:c12,c18   2 Apr 17 04:04 .
  0 drwxr-xr-x. 270 1000330000 root system_u:object_r:container_file_t:s0:c12,c18 268 Apr 17 04:04 ..
  ? -??????????   ? ?          ?                                                0   ?            ? data
512 -rw-r--r--.   1 1000330000 root system_u:object_r:container_file_t:s0:c12,c18  20 Apr 17 04:04 startedat
sh-4.4$

That same directory, as seen from my standalone debug pod:

sh-5.1$ pwd
/data/docker/registry/v2/repositories/media/jackett/_uploads/1a31f11f-b7bd-413f-a9fd-829f4a996ec7
sh-5.1$ ls -lashZ
total 512
  0 drwxr-xr-x.   2 1000330000 root system_u:object_r:container_file_t:s0:c12,c18   2 Apr 17 04:04 .
  0 drwxr-xr-x. 270 1000330000 root system_u:object_r:container_file_t:s0:c12,c18 268 Apr 17 04:04 ..
  0 -rw-r--r--.   1 1000330000 root system_u:object_r:container_file_t:s0:c12,c18   0 Apr 17 04:04 data
512 -rw-r--r--.   1 1000330000 root system_u:object_r:container_file_t:s0:c12,c18  20 Apr 17 04:04 startedat
sh-5.1$

Note that from the standalone pod, the SELinux context is absolutely fine! I can touch the file, edit it as usual, everything works. However, you know what's really interesting? If I use the standalone pod to delete and re-create the data file, then the registry pod sees the file with the correct context data!

I am unsure where to go from here. SELinux contexts are stored as xattrs, so perhaps the Ceph MDS is messing up? But in that case I would have seen this regardless of OKD version, because I haven't upgraded my Ceph version in a while. This very clearly started with OKD 4.10, though. So perhaps the image registry is doing something new, and that new something is interacting badly with Ceph?
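One way to dig further would be to read the raw xattr directly, from the host or from the debug pod (a sketch; getfattr comes from the attr package, and the path is the broken data file from above):

getfattr -n security.selinux data
# healthy output looks roughly like this (values illustrative):
# # file: data
# security.selinux="system_u:object_r:container_file_t:s0:c12,c18"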

@SriRamanujam

From the perspective of the host, here is what a failing data file looks like:

[root@worker2 e4d8b3ba-8232-4a0e-8cfd-ad2c51b43e61]# pwd
/var/lib/kubelet/pods/8a3acbb1-e7b2-4d01-ad6b-31f1301ed2c8/volumes/kubernetes.io~csi/pvc-a6b40fa4-01c6-489e-8991-eeecae63ff78/mount/docker/registry/v2/repositories/media/jackett/_uploads/e4d8b3ba-8232-4a0e-8cfd-ad2c51b43e61
[root@worker2 e4d8b3ba-8232-4a0e-8cfd-ad2c51b43e61]# ls -alhsZ
total 512
  0 drwxr-xr-x.   2 1000330000 root system_u:object_r:container_file_t:s0:c12,c18   2 Apr 17 05:43 .
  0 drwxr-xr-x. 282 1000330000 root system_u:object_r:container_file_t:s0:c12,c18 280 Apr 17 05:43 ..
  0 -rw-r--r--.   1 1000330000 root system_u:object_r:unlabeled_t:s0                0 Apr 17 05:43 data
512 -rw-r--r--.   1 1000330000 root system_u:object_r:container_file_t:s0:c12,c18  20 Apr 17 05:43 startedat

Why it has a proper label outside of the pod but question marks inside, I don't know. I'm also wondering if an xattr of all zeroes corresponds to system_u:object_r:unlabeled_t:s0.

Another note for debugging and/or theorycrafting: my debug pod and my registry pod are on two different hosts.

@fortinj66
Contributor Author

@SriRamanujam Take a look at this: coreos/fedora-coreos-tracker#1167

I can recreate without using the image registry...

What I find odd is that I am running the exact same container image and the exact same cephfs code. The only difference is OKD 4.10 and FCOS 35.

Unfortunately, I don't have an external Ceph cluster :(

@SriRamanujam

@SriRamanujam Take a look at this: coreos/fedora-coreos-tracker#1167

I saw that issue, and in fact I was all ready to write up reproduction steps for that issue until I couldn't reproduce it with a bare FCOS VM :(

I can recreate without using the image registry...

What were your reproduction steps? I was doing `for i in $(seq 1 1000); do echo "testing123" > test$i.txt; done` inside the CephFS mount and hoping to get a denial, but it never happened.

What I find odd is that I am running the exact same container image and the exact same cephfs code. The only difference is OKD 4.10 and FCOS 35.

Same :(

Unfortunately, I don't have an external Ceph cluster :(

If you are using Rook, you can set hostNetwork: true in your CephCluster resource and that will change the mons and MDSes to listen on the host's IP address rather than the ClusterIP. I have mine set up that way, and I also have a one-liner that can mount the CephFS root from outside the cluster. I'm not sure if that's disruptive to existing PVC mounts, though, as I've had mine set up like this from the start.
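The relevant CephCluster fragment looks roughly like this (a sketch; recent Rook releases use network.provider: host, while older ones used network.hostNetwork: true):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  network:
    provider: host   # mons and MDSes listen on host IPs instead of ClusterIPs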

@fortinj66
Contributor Author

I wonder if it could be a CRI-O/SELinux issue on FCOS 35. Can you try mounting the CephFS volume into a container on the standalone FCOS node?
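Something like this, assuming the CephFS is already mounted at /mnt/cephfs on the node (just a sketch; the mount point and image are placeholders):

sudo podman run --rm -it -v /mnt/cephfs:/test:Z registry.fedoraproject.org/fedora:35 bash
# then, inside the container:
for i in $(seq 1 100); do echo "testing123" > /test/test$i.txt; done
ls -laZ /test | head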

@depouill

Hi,
as mentioned in coreos/fedora-coreos-tracker#1167 (comment), same problem here with an external Ceph cluster after upgrading to OKD 4.10.

@SriRamanujam

Okay, so I have made a breakthrough of sorts. I am able to reproduce the issue from within my registry container. The important thing is that I have to make a new directory beforehand, then cd into there and try to touch a thousand files as I described in #1160 (comment).

When I do this, I can see that exactly one of the 1000 files gets a proper context and the rest are question marks. This exactly matches the behavior I see in the registry-managed folders, where startedat (presumably created first) gets a proper label and data (the important one) gets all question marks.

I am still unable to reproduce from outside of the container, neither from my laptop nor from a bare FCOS install (both running Fedora 35).

@SriRamanujam

This is from my workstation. I have mounted the registry CephFS volume on my workstation, using the same options the CSI driver does. When I create a bunch of files on this mount, the file contexts differ from what I expect. Unmounting and re-mounting the CephFS volume magically makes the file contexts line up with the expected value.

sramanujam@sriramanujam.lan@hapes /tmp/cephfs/docker/registry/v2/test/testing2 
❯ for i in $(seq 1 1000); do echo "testing123" > test$i.txt; done
sramanujam@sriramanujam.lan@hapes /tmp/cephfs/docker/registry/v2/test/testing2 
❯ ls -lashZ | head
total 500K
  0 drwxr-xr-x. 2 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18 1000 Apr 21 22:05 .
  0 drwxrwxrwx. 3 root                        root                          unconfined_u:object_r:container_file_t:s0:c12,c18    1 Apr 21 22:05 ..
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan system_u:object_r:unlabeled_t:s0                    11 Apr 21 22:05 test1000.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan system_u:object_r:unlabeled_t:s0                    11 Apr 21 22:11 test100.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan system_u:object_r:unlabeled_t:s0                    11 Apr 21 22:11 test101.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan system_u:object_r:unlabeled_t:s0                    11 Apr 21 22:11 test102.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan system_u:object_r:unlabeled_t:s0                    11 Apr 21 22:11 test103.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan system_u:object_r:unlabeled_t:s0                    11 Apr 21 22:11 test104.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan system_u:object_r:unlabeled_t:s0                    11 Apr 21 22:11 test105.txt
sramanujam@sriramanujam.lan@hapes /tmp/cephfs/docker/registry/v2/test/testing2 
❯ cd /tmp
sramanujam@sriramanujam.lan@hapes /tmp 
❯ sudo umount cephfs/
sramanujam@sriramanujam.lan@hapes /tmp 
❯ sudo mount -t ceph "$(oc -n rook-ceph get configmap/rook-ceph-mon-endpoints -o jsonpath={.data.csi-cluster-config-json} | jq -r '.[0].monitors | @csv' | sed 's/"//g')":/volumes/csi/csi-vol-4decf389-d18c-11eb-94dd-0a580a81040a/fb85b670-c8c0-44b2-98db-1c093796145e -o name="$(oc -n rook-ceph get secret/rook-csi-cephfs-node -o jsonpath={.data.adminID} | base64 -d)",secret="$(oc -n rook-ceph get secret/rook-csi-cephfs-node -o jsonpath={.data.adminKey} | base64 -d)",mds_namespace=library-cephfs ./cephfs
sramanujam@sriramanujam.lan@hapes /tmp 
❯ cd /tmp/cephfs/docker/registry/v2/test/testing2
sramanujam@sriramanujam.lan@hapes /tmp/cephfs/docker/registry/v2/test/testing2 
❯ ls -lashZ | head
total 500K
  0 drwxr-xr-x. 2 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18 1000 Apr 21 22:05 .
  0 drwxrwxrwx. 3 root                        root                          unconfined_u:object_r:container_file_t:s0:c12,c18    1 Apr 21 22:05 ..
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18   11 Apr 21 22:05 test1000.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18   11 Apr 21 22:11 test100.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18   11 Apr 21 22:11 test101.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18   11 Apr 21 22:11 test102.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18   11 Apr 21 22:11 test103.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18   11 Apr 21 22:11 test104.txt
512 -rw-r--r--. 1 sramanujam@sriramanujam.lan domain users@sriramanujam.lan unconfined_u:object_r:container_file_t:s0:c12,c18   11 Apr 21 22:11 test105.txt

I think this could go some way towards explaining why I am unable to reproduce the denials outside of the container. The security context system_u:object_r:unlabeled_t:s0 matches the context of a question mark file as seen from the container host's context (see #1160 (comment)). However, the container user is extremely restricted. Permissions on an administrator user outside of the container probably still allow access to files with the system_u:object_r:unlabeled_t:s0 context on them.

@SriRamanujam

Okay, I figured it out. This commit was merged into the kernel during the 5.16 release cycle, changing the Ceph driver's default to asynchronous dirops. This asynchronicity seems to be the root cause of the missing/incorrect labels. Setting the wsync mount option changes it back to the old behavior, which fixes the issue for me.
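One way to check which behaviour a node is actually using is to look at the live mount options (a sketch; worker2 is a placeholder node name, and nowsync is the flag I'd expect the kernel client to report when async dirops are enabled):

oc debug node/worker2 -- chroot /host grep ' ceph ' /proc/mounts
# if the option list contains "nowsync", async dirops are active;
# with wsync set (or a fixed kernel) it should not appear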

However, getting your StorageClass to actually use the wsync option is more than a bit annoying. Apparently the Ceph CSI plugin reads this from a custom parameter set on the StorageClass, which then gets stored on the actual PersistentVolumes under /spec/csi/volumeAttributes. Neither of these fields is editable after creation. Therefore, in order to fix this, we're all going to have to delete and re-create our CephFS StorageClasses and every PV that needs this mount option set (which, imo, is all of them).
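To check whether a given PV actually carries the option (a sketch; the PV name here is just the registry PV from an earlier comment):

oc get pv pvc-a6b40fa4-01c6-489e-8991-eeecae63ff78 \
  -o jsonpath='{.spec.csi.volumeAttributes.kernelMountOptions}{"\n"}'
# empty output means the PV was provisioned without the option and would
# need to be re-created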

I also think this might be worth reporting as a bug against the kernel itself, since at heart this probably affects propagation of extended attributes and SELinux contexts for everyone, not just OKD+Ceph users.

@takyon77

takyon77 commented Apr 22, 2022

@SriRamanujam When installing ceph-csi-cephfs, there's an option 'kernelMountOptions' which allows passing kernel mount options. Setting it to 'wsync' does indeed appear to remediate the issue!
Thanks!
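For anyone else using the chart, the value goes roughly here (a sketch; the exact nesting of the storageClass section may differ between chart versions):

# values.yaml fragment for the ceph-csi-cephfs Helm chart
storageClass:
  create: true
  name: csi-cephfs-sc
  kernelMountOptions: wsync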

@vrutkovs
Member

@SriRamanujam great investigation, would you mind reporting this to Fedora / upstream? Bonus points for making a KNOWN_ISSUE.md update PR.

vrutkovs added a commit to vrutkovs/okd that referenced this issue Apr 22, 2022
vrutkovs added a commit that referenced this issue Apr 22, 2022
@vrutkovs vrutkovs pinned this issue Apr 22, 2022
@mschaefers

@SriRamanujam I can confirm that the issue can be solved by passing kernelMountOptions: wsync to the StorageClass

@darren-oxford

Great work tracking down the problem, @SriRamanujam. I had delayed updating to 4.10 to avoid running into this issue, but when I delete the cephfs StorageClass and add it back with kernelMountOptions: wsync in the parameters stanza, the option is ignored. What am I doing wrong? I am trying to use the Rook/Ceph installed as part of the OpenShift Data Foundation operator (formerly OpenShift Container Storage), if that makes a difference.

@SriRamanujam

@vrutkovs I have updated the existing bug ticket @fortinj66 filed with this information, will that be sufficient?

@darren-oxford You have to delete and re-create your PV once you've added the option to your StorageClass.

@darren-oxford

darren-oxford commented Apr 22, 2022

@SriRamanujam Unfortunately recreating the StorageClass as follows...

apiVersion: storage.k8s.io/v1
metadata:
  name: ocs-storagecluster-cephfs
  annotations:
    description: Provides RWO and RWX Filesystem volumes
provisioner: openshift-storage.cephfs.csi.ceph.com
parameters:
  kernelMountOptions: wsync
  clusterID: openshift-storage
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: openshift-storage
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: openshift-storage
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: openshift-storage
  fsName: ocs-storagecluster-cephfilesystem
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate

results in kernelMountOptions: wsync being ignored when the StorageClass is created, so it is missing from the re-created StorageClass.

@SriRamanujam

@darren-oxford You might be missing kind: StorageClass in your YAML. Apart from that, that's more or less identical to what I have.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-cephfs
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  # clusterID is the namespace where operator is deployed.
  clusterID: rook-ceph

  # CephFS filesystem name into which the volume shall be created
  fsName: library-cephfs

  # Ceph pool into which the volume shall be created
  # Required for provisionVolume: "true"
  pool: library-cephfs-data0

  # Root path of an existing CephFS volume
  # Required for provisionVolume: "false"
  # rootPath: /absolute/path

  kernelMountOptions: wsync

  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph

  # (optional) The driver can use either ceph-fuse (fuse) or ceph kernel client (kernel)
  # If omitted, default volume mounter will be used - this is determined by probing for ceph-fuse
  # or by setting the default mounter explicitly via --volumemounter command-line argument.
  # mounter: kernel
reclaimPolicy: Delete
mountOptions:
  # uncomment the following line for debugging
  #- debug

@darren-oxford

Thanks @SriRamanujam, kind: StorageClass is there; I just missed it when pasting it here. I tried again and the OpenShift Data Foundation operator is overwriting it. Thanks for confirming that I am doing it right, though; it seems this workaround may not work with ODF.

I guess I should switch to standard Rook Ceph, as I have found ODF to be incredibly frustrating.

@mschaefers

@darren-oxford if your StorageClass is controlled/maintained by some operator, you need to find a way of configuring the StorageClass through your operator config. That would be your only option, aside from not using the ODF operator at all.

@fortinj66
Contributor Author

I can also confirm that the fix works on ceph-rook

@pflaeging

pflaeging commented Apr 25, 2022

Hi, I've got the same problem with my 4.10 test cluster (baremetal UPI on KVM with rook-ceph).

Here's my fix:

  • login as cluster admin ;-)
  • export your storage class oc get sc/rook-cephfs -o yaml > sc-temp.yaml
  • delete rook-cephfs storage class oc delete sc/rook-cephfs
  • edit sc-temp.yaml
    • strip it down to the required lines (like this)
    • add the new parameter: kernelMountOptions: wsync
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com # driver:namespace:operator
parameters:
  ### THIS is the new option
  kernelMountOptions: wsync
  ###
  clusterID: rook-ceph # namespace:cluster
  fsName: myfs
  pool: myfs-replicated
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph 
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph 
reclaimPolicy: Delete
allowVolumeExpansion: true
mountOptions:
  • deploy it with oc apply -f sc-temp.yaml
  • reboot all nodes (one after another) (just to be sure ;-)

After this, everything works like it did before 4.10.

@depouill

> Hi, I've got the same problem with my 4.10 test cluster (baremetal UPI on KVM with rook-ceph). Here's my fix: … [full steps quoted from the previous comment]

I'm not sure it works every time. Mount options are not modified for me.

@depouill

depouill commented Apr 25, 2022

I can also confirm that the fix works on ceph-rook

It works for me too, but it is a little bit hard to reconstruct each PV.

@vrutkovs
Member

openshift/okd-machine-os#364 should have a new kernel with a fix; a new nightly is on the way.

@vrutkovs
Member

Fix included in https://amd64.origin.releases.ci.openshift.org/releasestream/4-stable/release/4.10.0-0.okd-2022-05-28-062148

@Gjonni

Gjonni commented Jun 21, 2022

Maybe it's a stupid question, but what happens to the volumes on which we applied the previous fix (kernelMountOptions: wsync)?

Thank you

@darren-oxford

...what happens to the volumes on which we applied the previous fix?

AFAIK, nothing: the problem was caused by a decision to change the default behaviour to wsync off, and the fix was to make it default to on again. Specifying kernelMountOptions: wsync is now essentially superfluous and will not change the behaviour.

@Gjonni

Gjonni commented Jun 21, 2022

Perfect, thank you very much.
