Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configure OVS NIC for OVN #1860

Merged
merged 2 commits into from
Jul 29, 2020

Conversation

trozet
Copy link
Contributor

@trozet trozet commented Jun 22, 2020

OVN-k8s needs to be able to move the primary NIC of the host into the
OVS bridge, making the physical NIC essentially a Layer 2 port. This
patch uses nmcli to configure OVS with the NIC on it, as well as bring
up the Layer 3 OVS interface. The configuration can be auto-detected
based on the existing default gateway interface or NM configuration key
files may be provided during ignition and placed on the host.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 22, 2020
@trozet
Copy link
Contributor Author

trozet commented Jun 22, 2020

@runcom FYI

@trozet
Copy link
Contributor Author

trozet commented Jun 22, 2020

/test e2e-ovn-step-registry

@openshift-ci-robot
Copy link
Contributor

@trozet: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test cluster-bootimages
  • /test e2e-aws
  • /test e2e-aws-disruptive
  • /test e2e-aws-scaleup-rhel7
  • /test e2e-azure
  • /test e2e-gcp-op
  • /test e2e-gcp-upgrade
  • /test e2e-metal-ipi
  • /test e2e-openstack
  • /test e2e-ovirt
  • /test e2e-vsphere
  • /test e2e-vsphere-upi
  • /test images
  • /test unit
  • /test verify

Use /test all to run the following jobs:

  • pull-ci-openshift-machine-config-operator-master-e2e-aws
  • pull-ci-openshift-machine-config-operator-master-e2e-aws-scaleup-rhel7
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-op
  • pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade
  • pull-ci-openshift-machine-config-operator-master-e2e-metal-ipi
  • pull-ci-openshift-machine-config-operator-master-images
  • pull-ci-openshift-machine-config-operator-master-unit
  • pull-ci-openshift-machine-config-operator-master-verify

In response to this:

/test e2e-ovn-step-registry

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@trozet
Copy link
Contributor Author

trozet commented Jun 22, 2020

@dcbw fyi

@ashcrow ashcrow requested review from yuqi-zhang and sinnykumari and removed request for ashcrow June 23, 2020 16:38
@kikisdeliveryservice
Copy link
Contributor

Is there any documentation/summary about this work? Jira? BZ? Enhancement?

@trozet
Copy link
Contributor Author

trozet commented Jun 23, 2020

Is there any documentation/summary about this work? Jira? BZ? Enhancement?

https://issues.redhat.com/browse/SDN-1030 and more specifically https://issues.redhat.com/browse/SDN-1032

path: "/usr/local/bin/configure-ovs.sh"
contents:
inline: |
#!/usr/bin/env bash
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW after #1766 lands, it will become possible to add Go code into the MCD binary itself and have it execute on the host reliably.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. I had planned on moving to go after I got the bash working. I didn't know about this option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1766 landed. Now as I understand things this PR is about controlling the SDN, which we do not need during "firstboot" which is when we perform os updates, only after kubelet is ready to start.

I think all of this bash code could become a machine-config-daemon initialize-ovs subcommand or so, then the MCD just invokes systemd-run --unit mco-ovs /run/bin/machine-config-daemon initialize-ovs when it starts up the first time on a node. (Note that you will need to handle the case of the MCD being restarted without a node restart, e.g. new daemonset rollout, so your code should be idempotent)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK after some real-time discussion, it sounds like we need this before kubelet, which does not fit neatly into the current architecture. Today we have:

  • Code shipped as part of the OS that runs before and up to kubelet (podman, crio, kubelet)
  • The special MCD code introduced for Use MCD binary from container in /run/bin #1766 - but that only applies on the firstboot, after that the MCD pod takes over (i.e. it's after kubelet)

(We're also implicitly having a larger debate about how much "host management" should happen in a pod)

One thing we could do is ship this code as part of a container image (actually, it could be the SDN container), and build upon the "host binary" trick we're already doing there, but instead of extracting it from the pod, have our systemd unit pull and extract it directly and run Before=kubelet.service. Or, another variant of this is to basically just podman run --privileged -v /:/host and have the container perform the same trick the MCD does and extract its binary to the host, where it can then be run via systemd-run.

@cgwalters
Copy link
Member

https://issues.redhat.com/browse/SDN-1030 and more specifically https://issues.redhat.com/browse/SDN-1032

This is the problem with using a private JIRA instance to track work on public Free/Open Source software...

@trozet
Copy link
Contributor Author

trozet commented Jun 23, 2020

/retest

@trozet
Copy link
Contributor Author

trozet commented Jun 25, 2020

I think I'm going to take a different approach to this PR and try to configure all of the NICs before networking comes up.

@trozet trozet force-pushed the configure_ovs_nic branch 3 times, most recently from adaeb21 to 55c97cf Compare June 25, 2020 14:20
@trozet
Copy link
Contributor Author

trozet commented Jun 25, 2020

/test e2e-ovn-step-registry

@trozet trozet force-pushed the configure_ovs_nic branch 3 times, most recently from 21fc932 to 39564eb Compare June 25, 2020 22:50
@kikisdeliveryservice
Copy link
Contributor

/test e2e-ovn-step-registry

Is this PR intended to fix the ovn-step-registry job? Bc it has a very spotty passage rate.

@trozet
Copy link
Contributor Author

trozet commented Jun 26, 2020

/test e2e-ovn-step-registry

Is this PR intended to fix the ovn-step-registry job? Bc it has a very spotty passage rate.

No, this PR is not intended to fix anything. If you have some links to other jobs where it is failing @dcbw or I can take a look.

@abhat
Copy link
Contributor

abhat commented Jul 28, 2020

/lgtm

Other review comments can be handled later in a follow-up PR.

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 28, 2020
Copy link
Contributor

@danwinship danwinship left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have recommendations for simplifications but no blocking objections to the current state. (ie, all of this could potentially be addressed in followups)

@trozet
Copy link
Contributor Author

trozet commented Jul 28, 2020

/retest

@danwinship
Copy link
Contributor

/lgtm

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhat, cgwalters, danwinship, trozet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@trozet
Copy link
Contributor Author

trozet commented Jul 28, 2020

@kikisdeliveryservice can you please remove the hold?

@kikisdeliveryservice
Copy link
Contributor

@kikisdeliveryservice can you please remove the hold?

👍

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 28, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

10 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@trozet
Copy link
Contributor Author

trozet commented Jul 29, 2020

@kikisdeliveryservice is there anyway we can override e2e-gcp-upgrade? This patch only affects the ovn-kubernetes path (not openshift-sdn). The e2e-gcp-upgrade job is failing due to a known bug in openshift-sdn:

https://bugzilla.redhat.com/show_bug.cgi?id=1828858

The only job actually testing this this patch is e2e-ovn-step-registry which passed.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Jul 29, 2020

@trozet: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-proxy 6eda617360404442bac8e1756d8be9978f512b8b link /test e2e-aws-proxy
ci/prow/e2e-metal-ipi 7fe59c9 link /test e2e-metal-ipi

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@runcom
Copy link
Member

runcom commented Jul 29, 2020

As explained, freeing up some jobs, this can't affect e2e-gcp-upgrade (maybe some other future upgarde job will)

EDIT: to further confirm https://prow.ci.openshift.org/pr-history/?org=openshift&repo=machine-config-operator&pr=1860 (this commit passed already)

/override ci/prow/e2e-gcp-upgrade

@openshift-ci-robot
Copy link
Contributor

@runcom: Overrode contexts on behalf of runcom: ci/prow/e2e-gcp-upgrade

In response to this:

As explained, freeing up some jobs, this can't affect e2e-gcp-upgrade (maybe some other future upgarde job will)

/override ci/prow/e2e-gcp-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@runcom runcom added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 29, 2020
@openshift-merge-robot openshift-merge-robot merged commit 0234512 into openshift:master Jul 29, 2020
@@ -5,7 +5,8 @@ contents: |
Description=Update GCP routes for forwarded IPs.
ConditionKernelCommandLine=|ignition.platform.id=gce
ConditionKernelCommandLine=|ignition.platform.id=gcp
After=network.target
Wants=network-online.target
After=network-online.target
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

triple checking on this service - @squeed you wrote the original https://github.com/openshift/machine-config-operator/blob/release-4.5/templates/master/00-master/gcp/units/openshift-gcp-routes.service#L8 w/o the wants but it seems Tim found an issue requiring it for gcp+ovn - just triple checking as this service caused some weird issues in the past

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as a follow up, looks like Casey is on PTO also - talked to Stefan, we're not fully sure but Tim pointed out this is needed anyway in OVN+GCP

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes from what I found the gcp routes thing relies on network connectivity and finding an interface to use. With these changes that interface will change sometime after network.target but before network-online.target...meaning there could be a race where gcp routes uses the wrong interface. I see GCP just passed with this patch on

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/216/pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn/1288509043568021504

So I think we are OK.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.