
OCPBUGS-32678: Fix a panic caused by a data race #1070

Merged

Conversation

Davoska
Contributor

@Davoska Davoska commented Jul 12, 2024

After the CVO starts and its worker goroutines begin their work, a panic may occur due to a data race.

Originally (at 961873d), the SyncWorker.Start() method was exclusively waiting for a signal on the SyncWorker.notify channel from the SyncWorker.Update() method before attempting the first application of a payload.

By the time the Start() method got the signal, the values it requires, such as the SyncWorker.work field, had already been initialized by the SyncWorker.Update() method.

However, commit 9222fa9 introduced another part of the CVO that can signal the SyncWorker.notify channel.

The SyncWorker.NotifyAboutManagedResourceActivity() method now signals the channel whenever activity occurs on a managed resource. This signal may arrive before SyncWorker.Update() has finished, so the SyncWorker.Start() method may start an attempt to apply a payload before the SyncWorker.work field has been set by the SyncWorker.Update() method.

This currently results in an "invalid memory address or nil pointer dereference" panic in the SyncWork.calculateNextFrom() method because a nil pointer is passed by the SyncWorker.Start() method.

This panic can be seen in the following logs [1, 2]. Note messages such as Notify the sync worker: Cluster operator etcd changed Degraded from "False" to "True", emitted by the NotifyAboutManagedResourceActivity method right before the panic occurs.
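
Stripped of detail, the race boils down to the pattern below. This is a deliberately simplified, hypothetical sketch: the Generation field, the method bodies, and the main() driver are placeholders, not the actual code in pkg/cvo/sync_worker.go.

```go
// Simplified, hypothetical illustration of the race. Only the channel/field
// interaction relevant to the panic is modelled here.
package main

import (
	"fmt"
	"time"
)

type SyncWork struct {
	Generation int64 // placeholder field, not the real struct layout
}

type SyncWorker struct {
	notify chan string
	work   *SyncWork // set by Update(), read by Start()
}

// Update initializes the work to be applied and then signals the worker.
func (w *SyncWorker) Update() {
	w.work = &SyncWork{Generation: 1}
	w.notify <- "new work"
}

// NotifyAboutManagedResourceActivity signals the same channel but never sets work.
func (w *SyncWorker) NotifyAboutManagedResourceActivity(msg string) {
	w.notify <- msg
}

// Start waits for any signal and then dereferences work. If the first signal
// came from NotifyAboutManagedResourceActivity, work is still nil and this
// panics, analogous to the nil *SyncWork passed into SyncWork.calculateNextFrom().
func (w *SyncWorker) Start() {
	<-w.notify
	fmt.Println(w.work.Generation) // nil pointer dereference when Update() has not run yet
}

func main() {
	w := &SyncWorker{notify: make(chan string, 1)}
	w.NotifyAboutManagedResourceActivity("Cluster operator etcd changed Degraded") // lands before Update()
	go w.Start()
	time.Sleep(time.Second) // give Start() time to consume the early signal and panic
}
```

Before commit 9222fa9, Update() was the only producer on the notify channel, so the dereference in Start() was always safe.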

This PR fixes the issue by introducing a new channel whose sole purpose is to signal the initial attempt to apply a payload after SyncWorker.Update() has finished. The channel may also carry any subsequent signals to attempt an application. This ensures Update() has finished at least once before a payload application is attempted.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-workers-rhel8/1804819848023248896/artifacts/e2e-aws-workers-rhel8/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-545d8cb5db-v2bf4_cluster-version-operator_previous.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-ovirt/1805174927066664960/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-849c49896b-wnn8v_cluster-version-operator_previous.log

This pull request references https://issues.redhat.com/browse/OCPBUGS-32678

Contributor

openshift-ci bot commented Jul 12, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 12, 2024
@Davoska
Contributor Author

Davoska commented Jul 12, 2024

/test all

pkg/cvo/sync_worker_test.go: Add test for panic caused by goroutine synchronization

After the CVO starts and its worker goroutines begin their work, a
panic may occur due to a data race.

Originally (at 961873d), the `SyncWorker.Start()` method was
exclusively waiting for a signal on the `SyncWorker.notify` channel from
the `SyncWorker.Update()` method before attempting the first
application of a payload.

By the time the `Start()` method got the signal, the values it
requires, such as the `SyncWorker.work` field, had already been
initialized by the `SyncWorker.Update()` method.

However, commit 9222fa9 introduced another part of the CVO that can
signal the `SyncWorker.notify` channel.

The `SyncWorker.NotifyAboutManagedResourceActivity()` method now signals
the channel whenever activity occurs on a managed resource. This signal
may arrive before `SyncWorker.Update()` has finished, so the
`SyncWorker.Start()` method may start an attempt to apply a payload
before the `SyncWorker.work` field has been set by the
`SyncWorker.Update()` method.

This currently results in an "invalid memory address or nil pointer
dereference" panic in the `SyncWork.calculateNextFrom()` method because
a nil pointer is passed by the `SyncWorker.Start()` method.

This panic can be seen in the following logs [1, 2]. Note messages such
as `Notify the sync worker: Cluster operator etcd changed Degraded from
"False" to "True"`, emitted by the `NotifyAboutManagedResourceActivity`
method right before the panic occurs.

This commit adds a test that checks whether a panic occurs when the
notify channel already contains an element at the startup of the
Start() method.

The desired behavior for the Start() method is not to panic when a
notify signal is received before the Update() method has had time to
finish. Cancellation of the Start() method's context must also remain
functional; this commit tests that as well.

It is possible that a nonempty notify channel at the Start() method's
startup will not have time to cause a panic within the implemented
test. However, such a case is improbable given the simplicity of the
underlying code, and a three-second timeout should be enough to trigger
the panic. The alternative would be to implement a fine-grained status
for the sync worker, which would add extra code and complexity.
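
Roughly, such a test could take the shape of the sketch below (hypothetical: `newTestSyncWorker` is a placeholder constructor, not a real helper, and the `Start()` signature is assumed; the actual test in pkg/cvo/sync_worker_test.go is built on the package's own helpers).

```go
package cvo

import (
	"context"
	"testing"
	"time"
)

func Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp(t *testing.T) {
	// newTestSyncWorker is a hypothetical helper standing in for however the
	// package actually constructs a SyncWorker for its unit tests.
	worker := newTestSyncWorker(t)

	// Pre-load the notify channel so Start() sees a signal before Update() runs.
	worker.NotifyAboutManagedResourceActivity("Cluster operator A changed versions")

	// Three seconds should be enough for Start() to consume the early signal
	// and panic if the bug is present.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	done := make(chan struct{})
	go func() {
		defer close(done)
		worker.Start(ctx, 1) // assumed signature; panics here without the fix
	}()

	// Context cancellation must keep working: Start() has to return once the
	// timeout fires, and it must do so without panicking.
	select {
	case <-done:
	case <-time.After(5 * time.Second):
		t.Fatal("Start() did not return after its context expired")
	}
}
```

With the unfixed code, the goroutine running Start() panics, which crashes the whole test binary and fails the package.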

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-workers-rhel8/1804819848023248896/artifacts/e2e-aws-workers-rhel8/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-545d8cb5db-v2bf4_cluster-version-operator_previous.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-ovirt/1805174927066664960/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-849c49896b-wnn8v_cluster-version-operator_previous.log
@Davoska
Contributor Author

Davoska commented Jul 15, 2024

After a test for the bug is introduced without its fix:

$ git checkout 7f9cf621ce33f182319bd57146b45a178a1ed16d
HEAD is now at 7f9cf621 pkg/cvo/sync_worker_test.go: Add test for panic caused by goroutine synchronization
$ go test ./... -count=1
...
I0715 14:42:41.027363  348647 sync_worker.go:565] Start: starting sync worker
I0715 14:42:41.027465  348647 sync_worker.go:584] Notify the sync worker: Cluster operator A changed versions
E0715 14:42:41.028015  348647 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 596 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c91a20, 0x32c6ed0})
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000693500?})
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1c91a20?, 0x32c6ed0?})
        /home/dhurta/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.1.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc00083bc20, 0x0)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:730 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:589 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006a0f68, {0x228dfe0, 0xc0007b4570}, 0x1, 0xc0007be000)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006a0f68, 0x989680, 0x0, 0x1, 0xc0007be000)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc0007a3088, {0x22ac5e8, 0xc0004b1590}, 0x1)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:569 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp.func1()
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:490 +0x2d
created by github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp in goroutine 594
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:489 +0x12e
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x19e80d8]

goroutine 596 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000693500?})
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd
panic({0x1c91a20?, 0x32c6ed0?})
        /home/dhurta/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.1.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc00083bc20, 0x0)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:730 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:589 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006a0f68, {0x228dfe0, 0xc0007b4570}, 0x1, 0xc0007be000)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006a0f68, 0x989680, 0x0, 0x1, 0xc0007be000)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc0007a3088, {0x22ac5e8, 0xc0004b1590}, 0x1)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:569 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp.func1()
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:490 +0x2d
created by github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp in goroutine 594
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:489 +0x12e
FAIL    github.com/openshift/cluster-version-operator/pkg/cvo   3.818s
ok      github.com/openshift/cluster-version-operator/pkg/cvo/internal  0.022s
ok      github.com/openshift/cluster-version-operator/pkg/featuregates  4.423s
ok      github.com/openshift/cluster-version-operator/pkg/payload       1.068s
ok      github.com/openshift/cluster-version-operator/pkg/payload/precondition  0.008s
ok      github.com/openshift/cluster-version-operator/pkg/payload/precondition/clusterversion   0.010s
ok      github.com/openshift/cluster-version-operator/pkg/start 0.011s
FAIL

After the fix is introduced as well:

$ git checkout f5a6202ab90d4c9400cc04eedf75fe810b291068
Previous HEAD position was 7f9cf621 pkg/cvo/sync_worker_test.go: Add test for panic caused by goroutine synchronization
HEAD is now at f5a6202a pkg/cvo/sync_worker.go: Fix panic caused by data race
$ go test ./... -count=1
...
ok      github.com/openshift/cluster-version-operator/pkg/cvo   13.906s
ok      github.com/openshift/cluster-version-operator/pkg/cvo/internal  0.025s
ok      github.com/openshift/cluster-version-operator/pkg/featuregates  4.418s
ok      github.com/openshift/cluster-version-operator/pkg/payload       1.067s
ok      github.com/openshift/cluster-version-operator/pkg/payload/precondition  0.011s
ok      github.com/openshift/cluster-version-operator/pkg/payload/precondition/clusterversion   0.011s
ok      github.com/openshift/cluster-version-operator/pkg/start 0.013s

@Davoska Davoska changed the title WIP: Fix a panic caused by a data race OCPBUGS-32678: Fix a panic caused by a data race Jul 15, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 15, 2024
@openshift-ci-robot
Contributor

@Davoska: This pull request references Jira Issue OCPBUGS-32678, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Davoska Davoska marked this pull request as ready for review July 15, 2024 12:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 15, 2024
@Davoska
Contributor Author

Davoska commented Jul 15, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jul 15, 2024
@openshift-ci-robot
Contributor

@Davoska: This pull request references Jira Issue OCPBUGS-32678, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 15, 2024
@Davoska
Contributor Author

Davoska commented Jul 15, 2024

/hold
There is a potential deadlock in the new changes.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2024
The bug is explained in more detail in the commit that adds the test
for the bug's behavior. For more information, see commit 7f9cf62.

This commit fixes the issue by introducing a new channel whose sole
purpose is to signal the initial attempt to apply a payload after
`SyncWorker.Update()` has finished. The channel may also carry any
subsequent signals to attempt an application. This ensures `Update()`
has finished at least once before a payload application is attempted.

A new local channel is added to the `Start()` method to avoid a
potential deadlock. The `Start()` method is the sole consumer of the
`notify` and `startApply` channels. In one scenario, the `Start()`
goroutine could consume an element from one of the channels and then
get starved: it would be unable to produce the element back into the
channel and break out of the loop, because the channel could already be
full. This would create a deadlock, as there would no longer be any
consumer of the channels. An alternative would be to use other Go
constructs; however, using channels requires minimal changes to the
existing code.
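
One way to picture the resulting channel layout is the sketch below. This is a hypothetical illustration of the idea, not the merged code; the actual loop in pkg/cvo/sync_worker.go also manages the work payload, backoff and status reporting.

```go
// Hypothetical sketch of the channel layout described above.
package main

import "context"

type SyncWork struct{}

type SyncWorker struct {
	notify     chan string // activity on managed resources; may fire at any time
	startApply chan string // signalled only by Update(), after work has been set
	work       *SyncWork
}

// Update sets the work first and only then signals startApply, so a consumer
// of startApply can safely assume work is non-nil.
func (w *SyncWorker) Update(work *SyncWork) {
	w.work = work
	select {
	case w.startApply <- "payload loaded":
	default: // channel (capacity 1) already holds a pending signal
	}
}

func (w *SyncWorker) apply(ctx context.Context) {
	_ = w.work // guaranteed non-nil by the time apply is reached below
}

// Start is the sole consumer of notify and startApply. It never produces back
// into those shared channels (they could already be full); instead it records
// "Update() has completed at least once" in a local channel.
func (w *SyncWorker) Start(ctx context.Context) {
	started := make(chan string, 1)
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-w.startApply:
			select {
			case started <- msg: // remember that Update() has completed at least once
			default: // marker already recorded
			}
		case <-w.notify:
			select {
			case msg := <-started:
				started <- msg // work is ready; keep the marker and fall through
			default:
				continue // early notify before any Update(); do not touch w.work
			}
		}
		w.apply(ctx)
	}
}

func main() {
	w := &SyncWorker{notify: make(chan string, 1), startApply: make(chan string, 1)}
	ctx, cancel := context.WithCancel(context.Background())
	go w.Start(ctx)
	w.Update(&SyncWork{}) // only now can an apply attempt begin
	cancel()
}
```

Whether or not this matches the merged implementation in every detail, the invariant it illustrates is the one described above: nothing dereferences `SyncWorker.work` until `Update()` has signalled `startApply` at least once, and `Start()` never writes back into the shared channels it consumes.
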
@Davoska
Contributor Author

Davoska commented Jul 15, 2024

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2024
@jiajliu

jiajliu commented Jul 16, 2024

/cc @dis016

@openshift-ci openshift-ci bot requested a review from dis016 July 16, 2024 01:13
@dis016

dis016 commented Jul 16, 2024

/retest

Contributor

openshift-ci bot commented Jul 16, 2024

@Davoska: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dis016

dis016 commented Jul 16, 2024

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade

Contributor

openshift-ci bot commented Jul 16, 2024

@dis016: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4f8e4820-433f-11ef-90f9-8507aca5b27e-0

@dis016

dis016 commented Jul 16, 2024

/payload-job periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade

Contributor

openshift-ci bot commented Jul 16, 2024

@dis016: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/08f505b0-4368-11ef-90f7-126d5667830a-0

@petr-muller
Member

/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade

Contributor

openshift-ci bot commented Jul 18, 2024

@petr-muller: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@petr-muller
Member

/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade 10

Contributor

openshift-ci bot commented Jul 18, 2024

@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/019ba380-451d-11ef-9f6b-fa7282a932c1-0

Member

@petr-muller petr-muller left a comment


Looks good to me. I am not that happy about increased complexity, but I do not have a good counterproposal.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 19, 2024
Contributor

openshift-ci bot commented Jul 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Davoska, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Davoska,petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller
Member

I think this can wait for the acknowledge-critical-fixes-only lift. @Davoska I will leave it up to you whether you want to ask QE to pre-merge test this or not.

@Davoska
Contributor Author

Davoska commented Jul 19, 2024

Thank you, Petr, for the review.

This can definitely wait for the acknowledge-critical-fixes-only lift 👍

Testing this PR is unfortunately not that easy. The original panic occurred only once in a while, and the new changes are small and technical. The new tests check for the panic that was happening due to the data race. The tests do seem to report that everything is working as expected, but I appreciate any testing I can get, as the worst-case scenario is a potential deadlock.

Hello, @dis016 👋 What's your take on the PR?

I am not sure what to recommend for testing such a PR. The CI is passing, and the time window to check for the data race in a live cluster is quite small. A static analysis is always appreciated, but I'll leave it up to you.

@petr-muller
Member

My opinion is that because the original panic was very rare, we do not have a reproducer (and it would be very hard to create one) and you actually added a new unit test that exercises the behavior, additional testing before merge does not bring too much value. Basic scenarios (install, upgrade in different directions, hypershift) are covered by CI. So we likely haven't broken anything obvious, and we do not change any externally observable behaviors that the testing could focus on. That means the remaining factor that we could make worse is overall, large-scale reliability, and that factor is best tested by letting the PR merge so that the code reaches the payload testing battery of jobs.

@Davoska
Contributor Author

Davoska commented Jul 21, 2024

Well said. I have nothing to add.

/label no-qe

@openshift-ci openshift-ci bot added the no-qe Allows PRs to merge without qe-approved label label Jul 21, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 8244b52 into openshift:master Jul 22, 2024
12 checks passed
@openshift-ci-robot
Contributor

@Davoska: Jira Issue OCPBUGS-32678: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32678 has been moved to the MODIFIED state.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.17.0-202407221718.p0.g8244b52.assembly.stream.el9.
All builds following this will include this PR.
