
Fix panic in nova-controller if cell0 DB creation is late #356

Conversation

@gibizer (Contributor) commented May 5, 2023

We intermittently saw a panic during cell creation in the nova-controller:

goroutine 882 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:119 +0x1fa
panic({0x1814fe0, 0x295e110})
	/usr/local/go/src/runtime/panic.go:884 +0x212
github.com/openstack-k8s-operators/nova-operator/controllers.(*NovaReconciler).Reconcile(0xc000386cd0, {0x1ccf298, 0xc001d5a120}, {{{0xc001cbb4e0?, 0x10?}, {0xc001cbb4f0?, 0x40da67?}}})
	/remote-source/controllers/nova_controller.go:390 +0x3162

After chasing this for a while I figured out that the panic happens when both the cell1 DB and MQ creation finish before the cell0 DB creation. So this patch adds a test case that reproduces the panic in a stable way.

The troubleshooting showed that there was a bug in the ensureCell loop of the nova-controller: if cell0 was still waiting for its DB to be created while cell1 already had all of its prerequisites (DB, MQ) done, the code tried to look up cell0 in the internal cell map, where it was not yet present, and panicked.
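
To illustrate the failure mode, here is a minimal, self-contained Go sketch. It is not the actual nova-operator code: the Cell type, its fields and the cells map are hypothetical stand-ins. The point is that looking up a missing key in a map of pointers silently yields nil, and dereferencing that nil pointer produces the kind of panic shown in the trace above; guarding the lookup and treating a missing cell as "not ready yet" avoids it.

package main

import "fmt"

// Cell is a hypothetical stand-in for the controller's per-cell bookkeeping entry.
type Cell struct {
    DBReady bool
    MQReady bool
}

// lookupBuggy mimics the failing pattern: a missing key in a map of pointers
// returns nil, and dereferencing that nil pointer panics.
func lookupBuggy(cells map[string]*Cell, name string) bool {
    return cells[name].DBReady // nil pointer dereference if name is absent
}

// lookupGuarded mimics the fixed pattern: check presence first and treat a
// missing cell as "not ready yet" so the reconciler can simply wait and retry.
func lookupGuarded(cells map[string]*Cell, name string) (ready bool, found bool) {
    cell, ok := cells[name]
    if !ok {
        return false, false
    }
    return cell.DBReady, true
}

func main() {
    // cell1 is already registered because its DB and MQ are ready; cell0 is
    // not in the map yet because its DB creation has not finished.
    cells := map[string]*Cell{"cell1": {DBReady: true, MQReady: true}}

    if ready, ok := lookupGuarded(cells, "cell0"); !ok {
        fmt.Println("cell0 not registered yet, waiting for its DB")
    } else {
        fmt.Println("cell0 DB ready:", ready)
    }

    // The unguarded lookup panics with a nil pointer dereference, the same
    // class of crash seen in the reconciler stack trace above.
    defer func() {
        if r := recover(); r != nil {
            fmt.Println("recovered from:", r)
        }
    }()
    _ = lookupBuggy(cells, "cell0")
}

The real ensureCell logic is more involved; the sketch only shows the class of bug and the guard that prevents it.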

openshift-ci bot (Contributor) commented May 5, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the approved label May 5, 2023
@gibizer gibizer changed the title Reproduce panic if cell0 DB creation is late Fix panic in nova-controller if cell0 DB creation is late May 5, 2023
@gibizer gibizer marked this pull request as ready for review May 5, 2023 16:32
@openshift-ci openshift-ci bot requested review from bogdando and SeanMooney May 5, 2023 16:32
@gibizer gibizer force-pushed the nova-controller-panic-if-cell1-early branch from 78da53a to 7394e43 on May 5, 2023 16:45
Comment on lines 803 to 806
// We cannot merge this test as is because if the tests run
// in parallel then this test hangs forever (not even the global
// ginkgo --timeout stops it). If you run it with --procs 1 locally
// then it shows the panic and terminates
gibizer (Contributor, Author):

ginkgo 2.9.4 might have a fix for this hang: onsi/ginkgo#1192. And I'm trying to bump nova-operator to that version in #361.

A reviewer (Contributor) replied:

hum interesting that kind of sucks but ok

controllers/nova_controller.go (review thread resolved)
@@ -758,4 +758,59 @@ var _ = Describe("Nova multi cell", func() {
)
})
})
When("cell1 DB and MQ create finishes before cell0 DB create", func() {
A reviewer (Contributor) commented:

Nice. I would be tempted to put this in a regression test module, but we didn't actually have a bug filed for this.

For this instance I'm OK with leaving it here, but how would you feel, in general, about introducing the two-commit workflow and a regression module into the nova-operator like we have in nova?


@bogdando's comment was marked as resolved.

@gibizer gibizer force-pushed the nova-controller-panic-if-cell1-early branch from 7394e43 to 12ff016 on May 9, 2023 10:52
@openshift-ci openshift-ci bot removed the lgtm label May 9, 2023
@gibizer gibizer force-pushed the nova-controller-panic-if-cell1-early branch from 12ff016 to 342b263 on May 9, 2023 10:55
Comment on lines +695 to +698
keystoneAPI := th.GetKeystoneAPI(keystoneAPIName)
keystoneAPI.Status.APIEndpoints["internal"] = "http://keystone-internal-openstack.testing"
Eventually(func(g Gomega) {
g.Expect(k8sClient.Status().Update(ctx, keystoneAPI.DeepCopy())).Should(Succeed())
@bogdando (Contributor) commented May 9, 2023:

In some other places like this we don't do DeepCopy(). It would be nice to align those getter interfaces.

gibizer (Contributor, Author) replied:

We don't need the DeepCopy() here; it is probably a copy-paste error. We would need the DeepCopy(), though, if there were other things inside the Eventually that depend on the keystoneAPI variable. As far as I understand, the Update() call stores the result of the update in the object passed to it.
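
For reference, a sketch of the quoted block without the DeepCopy(); this mirrors the lines quoted above rather than the final test code, and any Eventually polling settings are left at gomega's defaults here:

keystoneAPI := th.GetKeystoneAPI(keystoneAPIName)
keystoneAPI.Status.APIEndpoints["internal"] = "http://keystone-internal-openstack.testing"
Eventually(func(g Gomega) {
    // As discussed above, Update() stores the result of the update in the
    // object passed to it, so keystoneAPI itself is refreshed on success.
    // A DeepCopy() would only be needed if something else in this closure
    // had to keep working with the original, unmodified object.
    g.Expect(k8sClient.Status().Update(ctx, keystoneAPI)).Should(Succeed())
}).Should(Succeed())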

@bogdando bogdando self-requested a review May 10, 2023 11:11
openshift-ci bot (Contributor) commented May 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bogdando, gibizer, SeanMooney

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [SeanMooney,bogdando,gibizer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bogdando (Contributor) commented May 10, 2023

/test functional

@openshift-ci