
fix(zfspv): potential data loss in case of pod deletion #89

Merged
merged 1 commit into openebs:master from dataloss on Apr 22, 2020

Conversation

@pawanpraka1 (Contributor) commented Apr 22, 2020

fixes: #88

This looks like a bug in ZFS: when you change the mountpoint property to none, ZFS is supposed to automatically unmount the file system. When we delete the pod, we get an unmount request for the old pod and a mount request for the new pod. The driver performs the unmount by setting mountpoint to none, assumes the unmount has completed, and proceeds to delete the mountpath; but here ZFS has not actually unmounted the dataset:

```
$ sudo zfs get all zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765 | grep mount
zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765  mounted               yes                                                                                                -
zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765  mountpoint            none                                                                                               local
zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765  canmount              on
```

The above output says that the dataset is mounted (mounted = yes) while the mountpoint is none, which should not be possible.

The system also reports that the dataset is still mounted:

```
$ sudo mount | grep pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765
zfspv-pool/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765 on /var/lib/kubelet/pods/3dbd0a03-c7e5-44f1-9f94-407f6ac96316/volumes/kubernetes.io~csi/pvc-3fe69b0e-9f91-4c6e-8e5c-eb4218468765/mount type zfs (rw,xattr,noacl)
```

Here the driver assumes that the dataset has been unmounted and proceeds to delete the mountpath, deleting the data as part of the cleanup for the NodeUnpublish request.
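
To make the failure mode concrete, here is a minimal sketch of the cleanup path described above. The package, function, and parameter names are placeholders for illustration, not the driver's actual code:

```go
// Illustrative sketch only: the risky cleanup path described above.
// "zfs set mountpoint=none" can return while the dataset is still
// mounted, and os.RemoveAll then walks into the live filesystem and
// deletes the data.
package cleanup

import (
	"fmt"
	"os"
	"os/exec"
)

// riskyUnmount is a hypothetical helper, not the driver's real function.
func riskyUnmount(dataset, mountPath string) error {
	// Old approach: rely on setting mountpoint=none to trigger the unmount.
	if out, err := exec.Command("zfs", "set", "mountpoint=none", dataset).CombinedOutput(); err != nil {
		return fmt.Errorf("zfs set mountpoint=none failed: %v: %s", err, out)
	}
	// If the dataset is in fact still mounted here (as in the outputs
	// above), RemoveAll recursively deletes the mounted data.
	return os.RemoveAll(mountPath)
}
```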

The fix switches to zfs umount instead of zfs set mountpoint=none for unmounting the dataset. The driver was also using os.RemoveAll, which is risky because it removes any children as well; since the mountpoint is not supposed to contain anything, os.Remove is sufficient, and it will fail if anything is still there.
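
A minimal sketch of the safer flow, again with placeholder names rather than the driver's actual functions:

```go
// Sketch of the fix described above: unmount explicitly with
// "zfs umount" and remove the mount path with os.Remove, which only
// deletes an empty directory.
package cleanup

import (
	"fmt"
	"os"
	"os/exec"
)

// safeUnmount is a hypothetical helper illustrating the new behavior.
func safeUnmount(dataset, mountPath string) error {
	// Explicit unmount: the command fails if ZFS cannot unmount the dataset,
	// so the driver never proceeds with a still-mounted path.
	if out, err := exec.Command("zfs", "umount", dataset).CombinedOutput(); err != nil {
		return fmt.Errorf("zfs umount %s failed: %v: %s", dataset, err, out)
	}
	// os.Remove refuses to delete a non-empty or busy directory and returns
	// an error (e.g. "directory not empty" or "device or resource busy")
	// instead of recursively wiping its contents.
	if err := os.Remove(mountPath); err != nil {
		return fmt.Errorf("failed to remove mount path %s: %v", mountPath, err)
	}
	return nil
}
```

With this ordering, a failed unmount or a non-empty mount path surfaces as an error, like the "device or resource busy" log below, instead of silently deleting data.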

This is the log we got from the node agent:

```
43212 time="2020-04-21T17:01:06Z" level=error msg="zfspv: failed to remove mount path Error: unlinkat /var/lib/kubelet/pods/2f9fe907-53b9-46c0-9b9b-d6bed9467836/volumes/kubernetes.io~csi/pvc-8120d443-c6a7-4f2a-9240-05deb3e05850/mount: device or resource busy
```

Signed-off-by: Pawan <pawan@mayadata.io>

@pawanpraka1 added the bug (Something isn't working.) label Apr 22, 2020
@pawanpraka1 added this to the v0.6.1 milestone Apr 22, 2020
@pawanpraka1 requested a review from kmova April 22, 2020 08:26
@pawanpraka1 force-pushed the dataloss branch 2 times, most recently from 212416e to b0c0f87 April 22, 2020 08:29
@codecov-io commented Apr 22, 2020

Codecov Report

Merging #89 into master will not change coverage.
The diff coverage is n/a.


```
@@           Coverage Diff           @@
##           master      #89   +/-   ##
=======================================
  Coverage   23.57%   23.57%           
=======================================
  Files          14       14           
  Lines         475      475           
=======================================
  Hits          112      112           
  Misses        362      362           
  Partials        1        1           
```

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 280949c...08e6a90.

@pawanpraka1 force-pushed the dataloss branch 2 times, most recently from 08e6a90 to 91ee5ae April 22, 2020 11:51
@kmova merged commit d57976e into openebs:master Apr 22, 2020
@kmova changed the title from "fix(zfspv): fixing data loss in case of pod deletion" to "fix(zfspv): potential data loss in case of pod deletion" Apr 22, 2020
@kmova modified the milestones: v0.6.1, v0.7.0 Apr 22, 2020
@pawanpraka1 deleted the dataloss branch April 23, 2020 06:50
@pawanpraka1 added the pr/release-note (PR has to be added in the release note.) label Apr 27, 2020
Labels: bug (Something isn't working.), pr/release-note (PR has to be added in the release note.)

Successfully merging this pull request may close these issues:

pod restarts on a loaded setup seen to cause data loss.

3 participants