v4.2.0 snapshot-controller stops working after a while #580

Closed
jsafrane opened this issue Aug 19, 2021 · 5 comments · Fixed by #581
Labels
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@jsafrane
Contributor

As seen in CI, snapshot-controller stops processing VolumeSnapshots after a while. For example, this CI run failed to delete a few VolumeSnapshotContents; the controller stopped processing them around 15:06:00.

The Kubernetes CI job became flaky on 08-17: https://testgrid.k8s.io/sig-storage-csi-ci#1.21-on-master
Similarly, the external-snapshotter 1-21-on-kubernetes-1-21 tests on master are flaky too: https://testgrid.k8s.io/sig-storage-csi-external-snapshotter#1-21-on-kubernetes-1-21

@jsafrane
Contributor Author

Looking at the snapshot-controller container, all its snapshotWorker goroutines get stuck at:

goroutine 113 [semacquire, 13 minutes]:
sync.runtime_SemacquireMutex(0xc0004f59ec, 0x0, 0x1)
        /usr/lib/golang/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc0004f59e8)
        /usr/lib/golang/src/sync/mutex.go:138 +0x105
sync.(*Mutex).Lock(...)
        /usr/lib/golang/src/sync/mutex.go:81
github.com/kubernetes-csi/external-snapshotter/v4/pkg/metrics.(*operationMetricsManager).OperationStart(0xc0004f59e0, 0x19aba21, 0xe, 0xc000703e60, 0x24, 0xc0004b6050, 0xf, 0x19a600c, 0x7, 0x0, ...)
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/metrics/metrics.go:186 +0x217
github.com/kubernetes-csi/external-snapshotter/v4/pkg/common-controller.(*csiSnapshotCommonController).processSnapshotWithDeletionTimestamp(0xc0005980f0, 0xc000732f00, 0xc000732f00, 0x0)
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller.go:252 +0x3cf
github.com/kubernetes-csi/external-snapshotter/v4/pkg/common-controller.(*csiSnapshotCommonController).syncSnapshot(0xc0005980f0, 0xc000732f00, 0x19723a0, 0xc000732f00)
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller.go:205 +0x1175
github.com/kubernetes-csi/external-snapshotter/v4/pkg/common-controller.(*csiSnapshotCommonController).updateSnapshot(0xc0005980f0, 0xc000732f00, 0x0, 0x0)
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller_base.go:374 +0x326
github.com/kubernetes-csi/external-snapshotter/v4/pkg/common-controller.(*csiSnapshotCommonController).syncSnapshotByKey(0xc0005980f0, 0xc000703f50, 0x24, 0xc00044e000, 0x1)
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller_base.go:230 +0xed7
github.com/kubernetes-csi/external-snapshotter/v4/pkg/common-controller.(*csiSnapshotCommonController).snapshotWorker(0xc0005980f0)
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller_base.go:195 +0xf8
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0xc0003c8000)
        /go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155 +0x5f
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0003c8000, 0x1b91080, 0xc00024c030, 0x1, 0xc00044e060)
        /go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156 +0x9b
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0003c8000, 0x0, 0x0, 0x1, 0xc00044e060)
        /go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x98
k8s.io/apimachinery/pkg/util/wait.Until(0xc0003c8000, 0x0, 0xc00044e060)
        /go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:90 +0x4d
created by github.com/kubernetes-csi/external-snapshotter/v4/pkg/common-controller.(*csiSnapshotCommonController).Run
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller_base.go:146 +0x305

@jsafrane
Contributor Author

This goroutine looks suspicious:

sync.(*Mutex).Lock(...)
        /usr/lib/golang/src/sync/mutex.go:81
github.com/kubernetes-csi/external-snapshotter/v4/pkg/metrics.(*operationMetricsManager).recordCancelMetric(0xc0004f59e0, 0xc000586340, 0xf, 0x19ad312, 0xf, 0xc03fab6813e766f3, 0x2b8624670b, 0x265d1e0, 0x19ab9e9, 0xe, ...)
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/metrics/metrics.go:252 +0x21d
github.com/kubernetes-csi/external-snapshotter/v4/pkg/metrics.(*operationMetricsManager).RecordMetrics(0xc0004f59e0, 0x19aba21, 0xe, 0xc0004e4090, 0x24, 0x1b92240, 0xc00026bae0, 0xc0005a5cb0, 0xf)
        /go/src/github.com/kubernetes-csi/external-snapshotter/pkg/metrics/metrics.go:234 +0x7ab

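Both RecordMetrics and recordCancelMetric try to acquire the same mutex in the same goroutine. Mutexes in Go are not recursive, so recordCancelMetric blocks forever while its caller, RecordMetrics, still holds the mutex. Every other goroutine that needs the mutex, such as the snapshotWorker goroutines stuck in OperationStart, then blocks as well.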

@jsafrane
Contributor Author

/assign

@jsafrane
Contributor Author

/priority critical-urgent
I can deadlock snapshot-controller by just running Kubernetes tests in parallel.

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 19, 2021
@jsafrane
Contributor Author

cc @ggriffiths @xing-yang see above, we may need a new patch release soon-ish. v4.2.0 is not really usable in my test environment. See #581 for a simple fix, I hope I got the right deadlock.
