OCPBUGS-32678: Fix a panic caused by a data race #1070
Conversation
Skipping CI for Draft Pull Request.
/test all
…ynchronization

After the CVO has started and its worker goroutines begin their work, a panic may occur due to a data race.

Originally (at 961873d), the `SyncWorker.Start()` method waited exclusively for a signal on the `SyncWorker.notify` channel from the `SyncWorker.Update()` method before attempting the first application of a payload. By the time the `Start()` method received the signal, the values it required, such as the `SyncWorker.work` field, had already been initialized by the `SyncWorker.Update()` method.

However, 9222fa9 introduced another part of the CVO that can signal the `SyncWorker.notify` channel. The `SyncWorker.NotifyAboutManagedResourceActivity()` method now signals the channel upon activity on a managed resource. This signal may occur before `SyncWorker.Update()` has finished, and it can therefore cause the `SyncWorker.Start()` method to attempt to apply a payload before the `SyncWorker.work` field has been set by the `SyncWorker.Update()` method. This currently results in an "invalid memory address or nil pointer dereference" panic in the `SyncWork.calculateNextFrom()` method, because a nil pointer is passed to it by the `SyncWorker.Start()` method.

This panic can be seen in the following logs [1, 2]. Note the messages such as `Notify the sync worker: Cluster operator etcd changed Degraded from "False" to "True"`, produced by the `NotifyAboutManagedResourceActivity` method right before the panic occurs.

This commit adds a test that checks whether a panic occurs when the notify channel already contains an element at startup of the `Start()` method. The desired behavior is for the `Start()` method not to panic when a notify signal is received before the `Update()` method has had time to finish. Cancellation of the `Start()` method's context must remain functional, and this commit tests that as well.

It is possible that a nonempty notify channel at the `Start()` method's startup will not have time to cause a panic in the implemented test. However, such a case is improbable given the simplicity of the underlying code, and a three-second timeout should be enough time to trigger the panic. The alternative would be to implement a fine-grained status for the sync worker, which would add extra code and complexity.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-workers-rhel8/1804819848023248896/artifacts/e2e-aws-workers-rhel8/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-545d8cb5db-v2bf4_cluster-version-operator_previous.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-ovirt/1805174927066664960/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-849c49896b-wnn8v_cluster-version-operator_previous.log
c95db38 to f5a6202 (Compare)
After a test for the bug is introduced without its fix:

$ git checkout 7f9cf621ce33f182319bd57146b45a178a1ed16d
HEAD is now at 7f9cf621 pkg/cvo/sync_worker_test.go: Add test for panic caused by goroutine synchronization
$ go test ./... -count=1
...
I0715 14:42:41.027363 348647 sync_worker.go:565] Start: starting sync worker
I0715 14:42:41.027465 348647 sync_worker.go:584] Notify the sync worker: Cluster operator A changed versions
E0715 14:42:41.028015 348647 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 596 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c91a20, 0x32c6ed0})
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000693500?})
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1c91a20?, 0x32c6ed0?})
/home/dhurta/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.1.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc00083bc20, 0x0)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:730 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:589 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006a0f68, {0x228dfe0, 0xc0007b4570}, 0x1, 0xc0007be000)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006a0f68, 0x989680, 0x0, 0x1, 0xc0007be000)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc0007a3088, {0x22ac5e8, 0xc0004b1590}, 0x1)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:569 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp.func1()
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:490 +0x2d
created by github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp in goroutine 594
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:489 +0x12e
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x19e80d8]
goroutine 596 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000693500?})
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd
panic({0x1c91a20?, 0x32c6ed0?})
/home/dhurta/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.1.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc00083bc20, 0x0)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:730 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:589 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006a0f68, {0x228dfe0, 0xc0007b4570}, 0x1, 0xc0007be000)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006a0f68, 0x989680, 0x0, 0x1, 0xc0007be000)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc0007a3088, {0x22ac5e8, 0xc0004b1590}, 0x1)
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:569 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp.func1()
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:490 +0x2d
created by github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp in goroutine 594
/home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:489 +0x12e
FAIL github.com/openshift/cluster-version-operator/pkg/cvo 3.818s
ok github.com/openshift/cluster-version-operator/pkg/cvo/internal 0.022s
ok github.com/openshift/cluster-version-operator/pkg/featuregates 4.423s
ok github.com/openshift/cluster-version-operator/pkg/payload 1.068s
ok github.com/openshift/cluster-version-operator/pkg/payload/precondition 0.008s
ok github.com/openshift/cluster-version-operator/pkg/payload/precondition/clusterversion 0.010s
ok github.com/openshift/cluster-version-operator/pkg/start 0.011s
FAIL

After the fix is introduced as well:
@Davoska: This pull request references Jira Issue OCPBUGS-32678, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@Davoska: This pull request references Jira Issue OCPBUGS-32678, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/hold
The bug is explained in more detail in the commit that is responsible for testing the bug's behavior. For more information, see the commit 7f9cf62.

This commit fixes the issue by introducing a new channel that exclusively signals the start of the initial attempt to apply the payload after `SyncWorker.Update()` has finished. It may also be used for any subsequent signals to attempt an application. This ensures that `Update()` has finished at least once before we attempt to apply the payload.

A new local channel is added to the `Start()` method so as not to cause a potential deadlock. The `Start()` method is the sole consumer of the `notify` and `startApply` channels. In a potential scenario, the `Start()` goroutine may consume an element from one of the channels and get starved. The goroutine would not be able to produce the element back into the channel and break out of the loop, because the channel could already be full. This would create a deadlock, as there would not be any consumers of the channels left.

An alternative is to utilize other Go structs; however, using channels requires minimal changes to the present code.
f5a6202 to 1505f20 (Compare)
/unhold
/cc @dis016
/retest
@Davoska: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade
@dis016: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4f8e4820-433f-11ef-90f9-8507aca5b27e-0
/payload-job periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade
@dis016: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/08f505b0-4368-11ef-90f7-126d5667830a-0
/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade
@petr-muller: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.
/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade 10
@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/019ba380-451d-11ef-9f6b-fa7282a932c1-0
Looks good to me. I am not that happy about increased complexity, but I do not have a good counterproposal.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Davoska, petr-muller. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
I think this can wait for |
Thank you, Petr, for the review. This can definitely wait for the
Testing this PR is not as easy, unfortunately. The original panic occurred only once in a while, and the new changes are small and technical. The new tests check for the panic that was happening due to the data race. Tests do seem to report that everything is working as expected, but I appreciate any testing I can get, as the worst-case scenario is a potential deadlock.
Hello, @dis016 👋 What's your take on the PR? I am not sure what to recommend for testing such a PR. The CI is passing, and the time window to check for the data race in a live cluster is quite small. A static analysis is always appreciated, but I'll leave it up to you.
My opinion is that because the original panic was very rare, we do not have a reproducer (and it would be very hard to create one) and you actually added a new unit test that exercises the behavior, additional testing before merge does not bring too much value. Basic scenarios (install, upgrade in different directions, hypershift) are covered by CI. So we likely haven't broken anything obvious, and we do not change any externally observable behaviors that the testing could focus on. That means the remaining factor that we could make worse is overall, large-scale reliability, and that factor is best tested by letting the PR merge so that the code reaches the payload testing battery of jobs.
Well said. I have nothing to add. /label no-qe
@Davoska: Jira Issue OCPBUGS-32678: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-32678 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[ART PR BUILD NOTIFIER] Distgit: cluster-version-operator
After the CVO has started and its worker goroutines are starting to work, a panic may occur due to a data race.

Originally (at 961873d), the `SyncWorker.Start()` method was exclusively waiting for a signal on the `SyncWorker.notify` channel from the `SyncWorker.Update()` method before attempting the first application of a payload. After the `Start()` method got the signal, its required values were already initialized by the `SyncWorker.Update()` method, such as the `SyncWorker.work` field.

However, 9222fa9 introduced another section of the CVO that can signal the `SyncWorker.notify` channel. The `SyncWorker.NotifyAboutManagedResourceActivity()` method now signals the channel upon an activity on a managed resource. This signal may occur before `SyncWorker.Update()` has finished. The signal thus starts an attempt of the `SyncWorker.Start()` method to apply a payload before the `SyncWorker.work` field could have been set by the `SyncWorker.Update()` method.

This currently results in an "invalid memory address or nil pointer dereference" panic in the `SyncWork.calculateNextFrom()` method because a nil pointer is passed by the `SyncWorker.Start()` method.

This panic can be seen in the following logs [1, 2]. Be sure to notice messages such as `Notify the sync worker: Cluster operator etcd changed Degraded from "False" to "True"` caused by the `NotifyAboutManagedResourceActivity` method right before a panic occurs.

This PR will fix the issue by introducing a new channel to exclusively signal the start of the initial attempt of the application after `SyncWorker.Update()` has finished. It may also be used for any consecutive signals to attempt an application. This ensures `Update()` has finished at least once before we attempt to apply the payload.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-workers-rhel8/1804819848023248896/artifacts/e2e-aws-workers-rhel8/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-545d8cb5db-v2bf4_cluster-version-operator_previous.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-ovirt/1805174927066664960/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-849c49896b-wnn8v_cluster-version-operator_previous.log
This pull request references https://issues.redhat.com/browse/OCPBUGS-32678