
OCPBUGS-32678: Fix a panic caused by a data race #1070

Merged

Conversation

Davoska
Contributor

@Davoska Davoska commented Jul 12, 2024

After the CVO starts and its worker goroutines begin their work, a panic may occur due to a data race.

Originally (at 961873d), the SyncWorker.Start() method was exclusively waiting for a signal on the SyncWorker.notify channel from the SyncWorker.Update() method before attempting the first application of a payload.

By the time the Start() method got the signal, the values it requires, such as the SyncWorker.work field, had already been initialized by the SyncWorker.Update() method.

However, commit 9222fa9 introduced another part of the CVO that can signal the SyncWorker.notify channel.

The SyncWorker.NotifyAboutManagedResourceActivity() method now signals the channel whenever activity occurs on a managed resource. This signal may arrive before SyncWorker.Update() has finished, so the SyncWorker.Start() method may start an attempt to apply a payload before the SyncWorker.work field has been set by the SyncWorker.Update() method.

This currently results in an "invalid memory address or nil pointer dereference" panic in the SyncWork.calculateNextFrom() method because a nil pointer is passed by the SyncWorker.Start() method.

This panic can be seen in the following logs [1, 2]. Note messages such as Notify the sync worker: Cluster operator etcd changed Degraded from "False" to "True", emitted by the NotifyAboutManagedResourceActivity method right before the panic occurs.
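
Stripped of detail, the race boils down to the pattern below. This is a deliberately simplified, hypothetical sketch: the Generation field, the method bodies, and the main() driver are placeholders, not the actual code in pkg/cvo/sync_worker.go.

```go
// Simplified, hypothetical illustration of the race. Only the channel/field
// interaction relevant to the panic is modelled here.
package main

import (
	"fmt"
	"time"
)

type SyncWork struct {
	Generation int64 // placeholder field, not the real struct layout
}

type SyncWorker struct {
	notify chan string
	work   *SyncWork // set by Update(), read by Start()
}

// Update initializes the work to be applied and then signals the worker.
func (w *SyncWorker) Update() {
	w.work = &SyncWork{Generation: 1}
	w.notify <- "new work"
}

// NotifyAboutManagedResourceActivity signals the same channel but never sets work.
func (w *SyncWorker) NotifyAboutManagedResourceActivity(msg string) {
	w.notify <- msg
}

// Start waits for any signal and then dereferences work. If the first signal
// came from NotifyAboutManagedResourceActivity, work is still nil and this
// panics, analogous to the nil *SyncWork passed into SyncWork.calculateNextFrom().
func (w *SyncWorker) Start() {
	<-w.notify
	fmt.Println(w.work.Generation) // nil pointer dereference when Update() has not run yet
}

func main() {
	w := &SyncWorker{notify: make(chan string, 1)}
	w.NotifyAboutManagedResourceActivity("Cluster operator etcd changed Degraded") // lands before Update()
	go w.Start()
	time.Sleep(time.Second) // give Start() time to consume the early signal and panic
}
```

Before commit 9222fa9, Update() was the only producer on the notify channel, so the dereference in Start() was always safe.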

This PR fixes the issue by introducing a new channel whose sole purpose is to signal the initial attempt to apply a payload after SyncWorker.Update() has finished. The channel may also carry any subsequent signals to attempt an application. This ensures Update() has finished at least once before a payload application is attempted.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-workers-rhel8/1804819848023248896/artifacts/e2e-aws-workers-rhel8/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-545d8cb5db-v2bf4_cluster-version-operator_previous.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-ovirt/1805174927066664960/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-849c49896b-wnn8v_cluster-version-operator_previous.log

This pull request references https://issues.redhat.com/browse/OCPBUGS-32678

Contributor

openshift-ci bot commented Jul 12, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 12, 2024
@Davoska
Contributor Author

Davoska commented Jul 12, 2024

/test all

pkg/cvo/sync_worker_test.go: Add test for panic caused by goroutine synchronization

After the CVO starts and its worker goroutines begin their work, a
panic may occur due to a data race.

Originally (at 961873d), the `SyncWorker.Start()` method was
exclusively waiting for a signal on the `SyncWorker.notify` channel from
the `SyncWorker.Update()` method before attempting the first
application of a payload.

By the time the `Start()` method got the signal, the values it
requires, such as the `SyncWorker.work` field, had already been
initialized by the `SyncWorker.Update()` method.

However, commit 9222fa9 introduced another part of the CVO that can
signal the `SyncWorker.notify` channel.

The `SyncWorker.NotifyAboutManagedResourceActivity()` method now signals
the channel whenever activity occurs on a managed resource. This signal
may arrive before `SyncWorker.Update()` has finished, so the
`SyncWorker.Start()` method may start an attempt to apply a payload
before the `SyncWorker.work` field has been set by the
`SyncWorker.Update()` method.

This currently results in an "invalid memory address or nil pointer
dereference" panic in the `SyncWork.calculateNextFrom()` method because
a nil pointer is passed by the `SyncWorker.Start()` method.

This panic can be seen in the following logs [1, 2]. Note messages such
as `Notify the sync worker: Cluster operator etcd changed Degraded from
"False" to "True"`, emitted by the `NotifyAboutManagedResourceActivity`
method right before the panic occurs.

This commit adds a test that checks whether a panic occurs when the
notify channel already contains an element at the startup of the
Start() method.

The desired behavior for the Start() method is not to panic when a
notify signal is received before the Update() method has had time to
finish. Cancellation of the Start() method's context must also remain
functional; this commit tests that as well.

It is possible that a nonempty notify channel at the Start() method's
startup will not have time to cause a panic within the implemented
test. However, such a case is improbable given the simplicity of the
underlying code, and a three-second timeout should be enough to trigger
the panic. The alternative would be to implement a fine-grained status
for the sync worker, which would add extra code and complexity.
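
Roughly, such a test could take the shape of the sketch below (hypothetical: `newTestSyncWorker` is a placeholder constructor, not a real helper, and the `Start()` signature is assumed; the actual test in pkg/cvo/sync_worker_test.go is built on the package's own helpers).

```go
package cvo

import (
	"context"
	"testing"
	"time"
)

func Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp(t *testing.T) {
	// newTestSyncWorker is a hypothetical helper standing in for however the
	// package actually constructs a SyncWorker for its unit tests.
	worker := newTestSyncWorker(t)

	// Pre-load the notify channel so Start() sees a signal before Update() runs.
	worker.NotifyAboutManagedResourceActivity("Cluster operator A changed versions")

	// Three seconds should be enough for Start() to consume the early signal
	// and panic if the bug is present.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	done := make(chan struct{})
	go func() {
		defer close(done)
		worker.Start(ctx, 1) // assumed signature; panics here without the fix
	}()

	// Context cancellation must keep working: Start() has to return once the
	// timeout fires, and it must do so without panicking.
	select {
	case <-done:
	case <-time.After(5 * time.Second):
		t.Fatal("Start() did not return after its context expired")
	}
}
```

With the unfixed code, the goroutine running Start() panics, which crashes the whole test binary and fails the package.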

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-workers-rhel8/1804819848023248896/artifacts/e2e-aws-workers-rhel8/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-545d8cb5db-v2bf4_cluster-version-operator_previous.log
[2] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-ovirt/1805174927066664960/artifacts/e2e-ovirt/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-849c49896b-wnn8v_cluster-version-operator_previous.log
@Davoska
Contributor Author

Davoska commented Jul 15, 2024

After a test for the bug is introduced without its fix:

$ git checkout 7f9cf621ce33f182319bd57146b45a178a1ed16d
HEAD is now at 7f9cf621 pkg/cvo/sync_worker_test.go: Add test for panic caused by goroutine synchronization
$ go test ./... -count=1
...
I0715 14:42:41.027363  348647 sync_worker.go:565] Start: starting sync worker
I0715 14:42:41.027465  348647 sync_worker.go:584] Notify the sync worker: Cluster operator A changed versions
E0715 14:42:41.028015  348647 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 596 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c91a20, 0x32c6ed0})
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000693500?})
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1c91a20?, 0x32c6ed0?})
        /home/dhurta/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.1.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc00083bc20, 0x0)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:730 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:589 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006a0f68, {0x228dfe0, 0xc0007b4570}, 0x1, 0xc0007be000)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006a0f68, 0x989680, 0x0, 0x1, 0xc0007be000)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc0007a3088, {0x22ac5e8, 0xc0004b1590}, 0x1)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:569 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp.func1()
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:490 +0x2d
created by github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp in goroutine 594
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:489 +0x12e
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x58 pc=0x19e80d8]

goroutine 596 [running]:
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000693500?})
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:56 +0xcd
panic({0x1c91a20?, 0x32c6ed0?})
        /home/dhurta/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.1.linux-amd64/src/runtime/panic.go:770 +0x132
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWork).calculateNextFrom(0xc00083bc20, 0x0)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:730 +0x58
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start.func1()
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:589 +0x2f2
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc0006a0f68, {0x228dfe0, 0xc0007b4570}, 0x1, 0xc0007be000)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0006a0f68, 0x989680, 0x0, 0x1, 0xc0007be000)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).Start(0xc0007a3088, {0x22ac5e8, 0xc0004b1590}, 0x1)
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker.go:569 +0x135
github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp.func1()
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:490 +0x2d
created by github.com/openshift/cluster-version-operator/pkg/cvo.Test_SyncWorkerShouldNotPanicDueToNotifySignalAtStartUp in goroutine 594
        /home/dhurta/work/davoska-cluster-version-operator/cluster-version-operator/pkg/cvo/sync_worker_test.go:489 +0x12e
FAIL    github.com/openshift/cluster-version-operator/pkg/cvo   3.818s
ok      github.com/openshift/cluster-version-operator/pkg/cvo/internal  0.022s
ok      github.com/openshift/cluster-version-operator/pkg/featuregates  4.423s
ok      github.com/openshift/cluster-version-operator/pkg/payload       1.068s
ok      github.com/openshift/cluster-version-operator/pkg/payload/precondition  0.008s
ok      github.com/openshift/cluster-version-operator/pkg/payload/precondition/clusterversion   0.010s
ok      github.com/openshift/cluster-version-operator/pkg/start 0.011s
FAIL

After the fix is introduced as well:

$ git checkout f5a6202ab90d4c9400cc04eedf75fe810b291068
Previous HEAD position was 7f9cf621 pkg/cvo/sync_worker_test.go: Add test for panic caused by goroutine synchronization
HEAD is now at f5a6202a pkg/cvo/sync_worker.go: Fix panic caused by data race
$ go test ./... -count=1
...
ok      github.com/openshift/cluster-version-operator/pkg/cvo   13.906s
ok      github.com/openshift/cluster-version-operator/pkg/cvo/internal  0.025s
ok      github.com/openshift/cluster-version-operator/pkg/featuregates  4.418s
ok      github.com/openshift/cluster-version-operator/pkg/payload       1.067s
ok      github.com/openshift/cluster-version-operator/pkg/payload/precondition  0.011s
ok      github.com/openshift/cluster-version-operator/pkg/payload/precondition/clusterversion   0.011s
ok      github.com/openshift/cluster-version-operator/pkg/start 0.013s

@Davoska Davoska changed the title WIP: Fix a panic caused by a data race OCPBUGS-32678: Fix a panic caused by a data race Jul 15, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 15, 2024
@openshift-ci-robot
Contributor

@Davoska: This pull request references Jira Issue OCPBUGS-32678, which is invalid:

  • expected the bug to target the "4.17.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@Davoska Davoska marked this pull request as ready for review July 15, 2024 12:50
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 15, 2024
@Davoska
Contributor Author

Davoska commented Jul 15, 2024

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jul 15, 2024
@openshift-ci-robot
Contributor

@Davoska: This pull request references Jira Issue OCPBUGS-32678, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.17.0) matches configured target version for branch (4.17.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 15, 2024
@Davoska
Contributor Author

Davoska commented Jul 15, 2024

/hold
There is a potential deadlock in the new changes.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2024
The bug is explained in more detail in the commit that adds the test
for the bug's behavior. For more information, see commit 7f9cf62.

This commit fixes the issue by introducing a new channel whose sole
purpose is to signal the initial attempt to apply a payload after
`SyncWorker.Update()` has finished. The channel may also carry any
subsequent signals to attempt an application. This ensures `Update()`
has finished at least once before a payload application is attempted.

A new local channel is added to the `Start()` method to avoid a
potential deadlock. The `Start()` method is the sole consumer of the
`notify` and `startApply` channels. In one scenario, the `Start()`
goroutine could consume an element from one of the channels and then
get starved: it would be unable to produce the element back into the
channel and break out of the loop, because the channel could already be
full. This would create a deadlock, as there would no longer be any
consumer of the channels. An alternative would be to use other Go
constructs; however, using channels requires minimal changes to the
existing code.
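
One way to picture the resulting channel layout is the sketch below. This is a hypothetical illustration of the idea, not the merged code; the actual loop in pkg/cvo/sync_worker.go also manages the work payload, backoff and status reporting.

```go
// Hypothetical sketch of the channel layout described above.
package main

import "context"

type SyncWork struct{}

type SyncWorker struct {
	notify     chan string // activity on managed resources; may fire at any time
	startApply chan string // signalled only by Update(), after work has been set
	work       *SyncWork
}

// Update sets the work first and only then signals startApply, so a consumer
// of startApply can safely assume work is non-nil.
func (w *SyncWorker) Update(work *SyncWork) {
	w.work = work
	select {
	case w.startApply <- "payload loaded":
	default: // channel (capacity 1) already holds a pending signal
	}
}

func (w *SyncWorker) apply(ctx context.Context) {
	_ = w.work // guaranteed non-nil by the time apply is reached below
}

// Start is the sole consumer of notify and startApply. It never produces back
// into those shared channels (they could already be full); instead it records
// "Update() has completed at least once" in a local channel.
func (w *SyncWorker) Start(ctx context.Context) {
	started := make(chan string, 1)
	for {
		select {
		case <-ctx.Done():
			return
		case msg := <-w.startApply:
			select {
			case started <- msg: // remember that Update() has completed at least once
			default: // marker already recorded
			}
		case <-w.notify:
			select {
			case msg := <-started:
				started <- msg // work is ready; keep the marker and fall through
			default:
				continue // early notify before any Update(); do not touch w.work
			}
		}
		w.apply(ctx)
	}
}

func main() {
	w := &SyncWorker{notify: make(chan string, 1), startApply: make(chan string, 1)}
	ctx, cancel := context.WithCancel(context.Background())
	go w.Start(ctx)
	w.Update(&SyncWork{}) // only now can an apply attempt begin
	cancel()
}
```

Whether or not this matches the merged implementation in every detail, the invariant it illustrates is the one described above: nothing dereferences `SyncWorker.work` until `Update()` has signalled `startApply` at least once, and `Start()` never writes back into the shared channels it consumes.
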
@Davoska
Contributor Author

Davoska commented Jul 15, 2024

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2024
@jiajliu

jiajliu commented Jul 16, 2024

/cc @dis016

@openshift-ci openshift-ci bot requested a review from dis016 July 16, 2024 01:13
@dis016

dis016 commented Jul 16, 2024

/retest

Contributor

openshift-ci bot commented Jul 16, 2024

@Davoska: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dis016

dis016 commented Jul 16, 2024

/payload-job periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade

Contributor

openshift-ci bot commented Jul 16, 2024

@dis016: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.16-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4f8e4820-433f-11ef-90f9-8507aca5b27e-0

@dis016

dis016 commented Jul 16, 2024

/payload-job periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade

Contributor

openshift-ci bot commented Jul 16, 2024

@dis016: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/08f505b0-4368-11ef-90f7-126d5667830a-0

@petr-muller
Member

/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade

Contributor

openshift-ci bot commented Jul 18, 2024

@petr-muller: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@petr-muller
Member

/payload-aggregate periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade 10

Contributor

openshift-ci bot commented Jul 18, 2024

@petr-muller: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/019ba380-451d-11ef-9f6b-fa7282a932c1-0

Member

@petr-muller petr-muller left a comment


Looks good to me. I am not that happy about increased complexity, but I do not have a good counterproposal.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 19, 2024
Contributor

openshift-ci bot commented Jul 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Davoska, petr-muller

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Davoska,petr-muller]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller
Member

I think this can wait for the acknowledge-critical-fixes-only lift. @Davoska I will leave it up to you whether you want to ask QE to pre-merge test this or not.

@Davoska
Contributor Author

Davoska commented Jul 19, 2024

Thank you, Petr, for the review.

This can definitely wait for the acknowledge-critical-fixes-only lift 👍

Testing this PR is unfortunately not that easy. The original panic occurred only once in a while, and the new changes are small and technical. The new tests check for the panic that was happening due to the data race. The tests do seem to report that everything is working as expected, but I appreciate any testing I can get, as the worst-case scenario is a potential deadlock.

Hello, @dis016 👋 What's your take on the PR?

I am not sure what to recommend for testing such a PR. The CI is passing, and the time window to check for the data race in a live cluster is quite small. A static analysis is always appreciated, but I'll leave it up to you.

@petr-muller
Member

My opinion is that because the original panic was very rare, we do not have a reproducer (and it would be very hard to create one) and you actually added a new unit test that exercises the behavior, additional testing before merge does not bring too much value. Basic scenarios (install, upgrade in different directions, hypershift) are covered by CI. So we likely haven't broken anything obvious, and we do not change any externally observable behaviors that the testing could focus on. That means the remaining factor that we could make worse is overall, large-scale reliability, and that factor is best tested by letting the PR merge so that the code reaches the payload testing battery of jobs.

@Davoska
Contributor Author

Davoska commented Jul 21, 2024

Well said. I have nothing to add.

/label no-qe

@openshift-ci openshift-ci bot added the no-qe Allows PRs to merge without qe-approved label label Jul 21, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 8244b52 into openshift:master Jul 22, 2024
12 checks passed
@openshift-ci-robot
Contributor

@Davoska: Jira Issue OCPBUGS-32678: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-32678 has been moved to the MODIFIED state.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: cluster-version-operator
This PR has been included in build cluster-version-operator-container-v4.17.0-202407221718.p0.g8244b52.assembly.stream.el9.
All builds following this will include this PR.
