Fix gateway datapath disrupt on update #301

@elvgarrui elvgarrui commented Jun 10, 2024

This reimplements the vswitchd reload using separate start and stop scripts so that it can be executed partially across consecutive instances of the pod.

TODO:

  • Semaphore
  • Kuttl testing

openshift-ci bot commented Jun 10, 2024

Hi @elvgarrui. Thanks for your PR.

I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

booxter commented Jun 10, 2024

This will need a kuttl test that would validate that flows injected by OVN are retained after ovs pod restart.

booxter commented Jun 11, 2024

/ok-to-test

@elvgarrui

The new version addresses the review comments from @ralonsoh. I have tested this in my environment and it works better now. However, the flows are still not written correctly to the DB by the start script.

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/33635dc070e14fe3b0deb94ad63b556d

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 06m 49s
ovn-operator-tempest-multinode FAILURE in 1h 48m 30s

@elvgarrui elvgarrui force-pushed the vswitchd_restart_split branch 4 times, most recently from 3ac4451 to a677d49 Compare June 14, 2024 13:03

# Start vswitchd by asking it to wait till flow restore is finished.
ovs-vsctl --no-wait set open_vswitch . other_config:flow-restore-wait=true
/usr/sbin/ovs-vswitchd --pidfile --mlockall --detach

I note that ovs-ctl does some additional things like setting MAXFDs and setting up logging. Do we need to do that here?

Copy link
Contributor

@otherwiseguy Maybe. It would be nice if you could check and report back. In the meantime, this PR does not attempt to change how the service is started. This command was already used to start the container, so there's no change here (except that --detach is added).
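
For readers following the thread, the start sequence under discussion can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the script from this PR: the `FLOWS_RESTORE_SCRIPT` path and the `OVS_VSCTL_CMD` / `OVS_VSWITCHD_CMD` override variables are hypothetical and exist only so the sequence can be dry-run without OVS installed. It does not reproduce the extra setup (MAXFDs, logging) that ovs-ctl performs.

```shell
#!/bin/sh
# Sketch only: the restore-script path and the *_CMD override variables are
# assumptions for illustration, not names taken from this PR.
FLOWS_RESTORE_SCRIPT="${FLOWS_RESTORE_SCRIPT:-/var/lib/openvswitch/flows-restore/flows-script}"

start_vswitchd() {
    vsctl="${OVS_VSCTL_CMD:-ovs-vsctl}"
    vswitchd="${OVS_VSWITCHD_CMD:-/usr/sbin/ovs-vswitchd}"

    # Start vswitchd by asking it to wait till flow restore is finished.
    "$vsctl" --no-wait set open_vswitch . other_config:flow-restore-wait=true || return 1
    "$vswitchd" --pidfile --mlockall --detach || return 1

    # Replay the flows saved by the stop script, if a backup exists.
    if [ -f "$FLOWS_RESTORE_SCRIPT" ]; then
        sh "$FLOWS_RESTORE_SCRIPT" || return 1
        # Drop the backup so stale flows are not replayed on a later start.
        rm -f "$FLOWS_RESTORE_SCRIPT"
    fi

    # Clear the flag to tell vswitchd that the restore is done.
    "$vsctl" --no-wait remove open_vswitch . other_config flow-restore-wait
}
```

Calling `start_vswitchd` with the defaults runs the real commands; pointing the two override variables at stub functions allows the ordering to be checked on a machine without OVS.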

@elvgarrui

I have removed the sleep from the ovsdb-server preStop. We still need the kuttl testing.

@booxter booxter marked this pull request as ready for review June 14, 2024 16:29

booxter commented Jun 14, 2024

/hold

@booxter booxter left a comment

Looking forward to the kuttl scenario.

@@ -0,0 +1,36 @@
#!/bin/sh
Copy link
Contributor

(no action required) I know other scripts here also use sh (and not bash), but I wonder if we should just use bash here. I, for one, am not knowledgeable enough to confirm that this script doesn't use any bashisms.

@elvgarrui elvgarrui force-pushed the vswitchd_restart_split branch 3 times, most recently from 7ae5f03 to 16ba507 Compare June 17, 2024 16:58
@karelyatin

@ralonsoh can you rebase and fix finalizers in assert file as per c2b1320

@ralonsoh

> @ralonsoh can you rebase and fix finalizers in assert file as per c2b1320

Right now. Actually, I'll also squash the 3 commits in this PR into one.

@ralonsoh

/retest-required

@karelyatin karelyatin left a comment

Overall LGTM, some nits and clarifications inline

booxter commented Jun 21, 2024

Please also update the commit message:

  • remove WIP
  • add some details as to how this patch avoids disruption (mention that flows are restored via ovs-save.)

@ralonsoh

Hi folks, I think this patch is ready. I don't think I can change the PR name (and remove the WIP tag), but WIP is removed from the commit message now. Don't hesitate to ping me here, by mail, or on IRC. Thanks!

@booxter booxter changed the title [WIP] Fix gateway datapath disrupt on update Fix gateway datapath disrupt on update Jun 24, 2024

booxter commented Jun 24, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jun 24, 2024

booxter commented Jun 24, 2024

/hold off

@karelyatin karelyatin left a comment

/lgtm

Just one cleanup comment inline. It can be addressed now if a revision is needed; otherwise it can be a follow-up.

node=$(oc get nodes -o name|sort|head -1| sed "s|node/||g")
controller_pod=$(oc get pod -n $NAMESPACE -l service=ovn-controller-ovs --field-selector spec.nodeName=$node -o name | head -1)
expected_flows="table=100, priority=200 actions=drop"
oc rsh -n $NAMESPACE --container ovs-vswitchd $controller_pod ovs-ofctl dump-flows br-test-flows --no-stats | grep -q "$expected_flows" || exit 1
Copy link
Contributor

let's also delete the test bridge after test

Copy link
Contributor

Let me push a new patch set (PS).

@ralonsoh ralonsoh Jun 24, 2024

I've pushed a new PS. I was thinking about adding a trap in this script, but that would mean always deleting the bridge whenever the script exits. We only want to remove the bridge if the script first reads the flows correctly.

Copy link
Contributor

> We only want to remove the bridge if the script first reads the flows correctly.

Why?

Re: trap: you can set a trap and then remove it with trap - EXIT.
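
The trap mechanics mentioned here can be sketched as follows. This is an illustration only: `cleanup_bridge` and `run_check` are hypothetical names, and the cleanup just prints a message where the real test would delete the test bridge inside the pod. Which of the two behaviours the kuttl script should use is the open question in this thread; the sketch only shows that an EXIT trap can be armed and later cleared.

```shell
#!/bin/sh
# Sketch only: cleanup_bridge and run_check are hypothetical names; a real
# cleanup would delete the test bridge (br-test-flows) instead of echoing.
cleanup_bridge() {
    echo "cleanup: deleting test bridge"
}

run_check() {
    # $1 = "keep" clears the trap before exiting; any other value leaves it armed.
    (
        trap cleanup_bridge EXIT   # armed: runs on any exit from this subshell
        echo "checking flows"
        if [ "$1" = "keep" ]; then
            trap - EXIT            # cleared: the cleanup no longer runs on exit
        fi
        exit 0
    )
}
```

Running `run_check armed` prints the check line followed by the cleanup line, while `run_check keep` prints only the check line.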

Reimplement the vswitchd reload using separate start and stop scripts
so that it can be executed partially across consecutive instances of
the pod. To avoid disruptions, the stop-vswitchd script runs the
ovs-save save-flows command and saves the resulting backup in a folder
mounted from the host; the start script then loads that backup. For
this to succeed, the start script needs to set the OVS
flow-restore-wait flag to true while loading the flows.

Note that the ovs-save save-flows command needs ovsdb-server to be
running, so a semaphore-like mechanism was added to ensure the
ovsdb-server container is never deleted before ovs-vswitchd.

Closes-Issue: OSPRH-6326
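
The stop-side behaviour described in this commit message can be sketched as below. It is a minimal sketch under stated assumptions, not the merged scripts: the directory, the file names, the marker-file form of the "semaphore", and the `*_CMD` override variables are all hypothetical. The ovs-save path shown is only a commonly used install location, and ovs-save is assumed to keep its flow dumps under `TMPDIR`, which is why that variable points at the persistent directory.

```shell
#!/bin/sh
# Sketch only: every path and variable name below is an assumption for
# illustration; the real scripts live under templates/ovncontroller/bin/.
FLOWS_RESTORE_DIR="${FLOWS_RESTORE_DIR:-/var/lib/openvswitch/flows-restore}"
FLOWS_RESTORE_SCRIPT="$FLOWS_RESTORE_DIR/flows-script"
SEMAPHORE_FILE="$FLOWS_RESTORE_DIR/vswitchd-stopped"

# ovs-vswitchd side: save flows while ovsdb-server is still up, stop the
# daemon, then signal that ovsdb-server may stop.
stop_vswitchd() {
    vsctl="${OVS_VSCTL_CMD:-ovs-vsctl}"
    appctl="${OVS_APPCTL_CMD:-ovs-appctl}"
    ovs_save="${OVS_SAVE_CMD:-/usr/share/openvswitch/scripts/ovs-save}"

    mkdir -p "$FLOWS_RESTORE_DIR" || return 1
    bridges=$("$vsctl" list-br) || return 1
    # $bridges is intentionally unquoted: one argument per bridge.
    ( TMPDIR="$FLOWS_RESTORE_DIR"; export TMPDIR
      "$ovs_save" save-flows $bridges ) > "$FLOWS_RESTORE_SCRIPT" || return 1

    "$appctl" -t ovs-vswitchd exit || return 1
    : > "$SEMAPHORE_FILE"
}

# ovsdb-server side: block until the marker exists, for at most $1 seconds.
wait_for_vswitchd_stop() {
    remaining="${1:-60}"
    while [ ! -f "$SEMAPHORE_FILE" ]; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    rm -f "$SEMAPHORE_FILE"
}
```

A marker file on the shared host mount is one simple way to order two containers' preStop hooks; the timeout keeps ovsdb-server from blocking indefinitely if the marker never appears.
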
@karelyatin

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jun 24, 2024

booxter commented Jun 24, 2024

/approve

There's only so much yak to shave. Some improvements to kuttl may still be useful, but I don't think we should wait for them to be addressed here.

openshift-ci bot commented Jun 24, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: booxter, elvgarrui

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit 9f91003 into openstack-k8s-operators:main Jun 24, 2024
6 checks passed