
Describing steps to support out-of-tree providers #463

Merged

Conversation

@Danil-Grigorev (Contributor) commented Sep 1, 2020

This enhancement proposal describes the migration of cloud platforms from the deprecated in-tree cloud providers to Cloud Controller Manager services that implement the external cloud provider interface.
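Concretely, the switch means kubelets run with `--cloud-provider=external` while the cloud-specific control loops move out of the kube-controller-manager into a standalone cloud-controller-manager workload. A minimal sketch of such a deployment, assuming a placeholder namespace, image, and flags (the real manifests are defined by the enhancement itself):

```yaml
# Illustrative sketch only: namespace, image, and flags are assumptions,
# not the manifests this enhancement actually ships.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloud-controller-manager
  namespace: openshift-cloud-controller-manager   # assumed namespace name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloud-controller-manager
  template:
    metadata:
      labels:
        app: cloud-controller-manager
    spec:
      containers:
      - name: cloud-controller-manager
        image: registry.example.com/openstack-cloud-controller-manager:latest
        args:
        - --cloud-provider=openstack   # replaces the in-tree code compiled into KCM
      tolerations:
      # the CCM must be schedulable before it has initialized any nodes
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
```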

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 1, 2020
@JoelSpeed (Contributor) left a comment

This is a really really good start! I've left a bunch of comments, suggestions and questions throughout.

I think it would be good to expand on the CCM operator, as this isn't really explained in detail. I assume we are going to need an operator that determines what to deploy, DaemonSet-wise, based on the Infrastructure?
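For reference, the platform signal such an operator would presumably key off comes from the cluster-scoped Infrastructure resource; a trimmed example showing only the relevant fields:

```yaml
# Trimmed example of the cluster Infrastructure resource an operator
# could inspect to decide which manifests to deploy.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  platformStatus:
    type: Azure   # e.g. select the Azure CCM (and CNM DaemonSet) based on this
```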

@Danil-Grigorev Danil-Grigorev changed the title [WIP] Describing steps to support out-of-tree providers Describing steps to support out-of-tree providers Sep 12, 2020
@Danil-Grigorev Danil-Grigorev marked this pull request as ready for review September 12, 2020 20:43
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2020
@michaelgugino (Contributor) left a comment

So, from discussions on a recent call: I think this component should be deployed by whatever operator controls the kube-controller-manager. The lifecycles of the two components are tightly coupled, regardless of where the source code lives.

It's likely we'll have to stagger the rollout of out-of-tree providers, as some will be ready while others will not, and the ones that are ready to go out of tree will no longer be receiving in-tree support. At least for the first iteration, we should make this component part of the same deployment as the KCM, IMO, as it should be a simple enough set of logic to determine whether or not to deploy the out-of-tree provider and the relevant options for the KCM.

@derekwaynecarr (Member)

@michaelgugino I agree. It should just run as a container next to the KCM in the same pod. We have experience in this approach with managed cloud services.

@smarterclayton (Contributor), replying to @derekwaynecarr:

> @michaelgugino I agree. It should just run as a container next to the KCM in the same pod. We have experience in this approach with managed cloud services.

I agree, and I expect this to be hidden and abstracted by the infrastructure (where the code runs is not an operational detail for teams).
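A hypothetical sketch of the sidecar arrangement being discussed here; the images, namespace, and flags are placeholders, not the actual operator manifests (and the final design may differ):

```yaml
# Sketch: CCM running as a container beside the KCM in the same pod.
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: openshift-kube-controller-manager
spec:
  containers:
  - name: kube-controller-manager
    image: registry.example.com/kube-controller-manager:latest
    args:
    - --cloud-provider=external   # KCM skips its in-tree cloud controller loops
  - name: cloud-controller-manager
    image: registry.example.com/aws-cloud-controller-manager:latest
    args:
    - --cloud-provider=aws        # the out-of-tree provider runs beside the KCM
```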

@JoelSpeed (Contributor) left a comment

I think we need to add more detail about the CNM on Windows machines, plus I've added a few suggestions to improve the wording as below.

For the CNM on Windows we need to add a section, something along the lines of:

#### Windows Machine Config Operator

Most changes described in this document should be transparent to WMCO.
WMCO reads the output of MCO's rendered configuration and as such, will pick up the changes we need to make to Kubelet config via that mechanism.

However, on certain platforms (e.g. Azure), the Cloud Node Manager (CNM, responsible for initialising the Node) must be run on the Node itself via a DaemonSet.

Since Red Hat cannot supply or support Windows container images, we cannot run a DaemonSet for the CNM targeted at Windows Nodes as we would do on Linux Nodes.
Instead, we must adapt the WMCO to, on these particular platforms, deploy a new Windows service that runs the CNM on the Node.

This pattern is already in place for other components that are required to run on the host (e.g. CNI and kube-proxy), so we will be able to reuse the existing pattern to add support for CNM on platforms that require a CNM per host.

@aravindhp Could you check over this paragraph above, do you think that captures the content we need for WMCO or have I missed something?

@aravindhp (Contributor)

> Most changes described in this document should be transparent to WMCO.
> WMCO reads the output of MCO's rendered configuration and as such, will pick up the changes we need to make to Kubelet config via that mechanism.

We don't blindly apply the Linux configuration, so we should ensure that WMCO/WMCB picks up the required pieces during the implementation.

> [remainder of the suggested "Windows Machine Config Operator" section as quoted above]
>
> @aravindhp Could you check over this paragraph above, do you think that captures the content we need for WMCO or have I missed something?

Yes, in general this looks good.

/cc @openshift/openshift-team-windows-containers

@JoelSpeed (Contributor)

I've made some updates to my previous suggestion around WMCO to make it clearer that there are two changes to consider based on feedback from Aravindh and Danil

#### Windows Machine Config Operator

##### Kubelet Changes

To generate the configuration for Windows nodes, WMCO reads the output of MCO's rendered configuration as a basis for its own config for Kubelet on the Windows node.
As we are making changes to the output of the Kubelet service in MCO (namely changing the value of the `cloud-provider` flag), we will need to verify that WMCO reads this flag and copies its value to the Kubelet Windows service.

##### Node Initialization

On most platforms, Node initialization is handled centrally by the CCM, specifically the Cloud Node Manager (CNM) running within it.
However, on certain platforms (e.g. Azure), the CNM must be run on the Node itself, typically via a DaemonSet.

Since Red Hat cannot supply or support Windows container images, we cannot run a DaemonSet for the CNM targeted at Windows Nodes as we would do on Linux Nodes.
Instead, we must adapt the WMCO to, on these particular platforms, deploy a new Windows service that runs the CNM on the Node.

This pattern is already in place for other components that are required to run on the host (e.g. CNI and kube-proxy), so we will be able to reuse the existing pattern to add support for CNM on platforms that require a CNM per host.
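For contrast with the Windows-service approach, a hedged sketch of what the per-node CNM DaemonSet on Linux might look like; the name, namespace, and image are illustrative assumptions:

```yaml
# Sketch of a Linux-side CNM DaemonSet; on Windows, WMCO would instead
# install the equivalent binary as a host service.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: azure-cloud-node-manager
  namespace: openshift-cloud-controller-manager   # assumed namespace name
spec:
  selector:
    matchLabels:
      k8s-app: azure-cloud-node-manager
  template:
    metadata:
      labels:
        k8s-app: azure-cloud-node-manager
    spec:
      nodeSelector:
        kubernetes.io/os: linux   # Windows nodes are excluded; WMCO handles them
      tolerations:
      # must run on nodes the CNM has not yet initialized
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      containers:
      - name: cloud-node-manager
        image: registry.example.com/azure-cloud-node-manager:latest
        command: ["cloud-node-manager"]
        args:
        - --node-name=$(NODE_NAME)
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
```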

@JoelSpeed (Contributor)

I think this is ready to go now. @aravindhp, could I get you to do another pass on the WMCO section? We added it to the proposal with the latest commit on this branch.

@Danil-Grigorev If everything is OK from the WINC team, then I'll approve this and we can get it merged. Thanks for all your work on this over the last year!

@sebsoto (Contributor) left a comment

Thanks for all the detail! Overall looks good to me

@elmiko (Contributor) left a comment

thanks for the updates Danil
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2021
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2021
On this excerpt from the proposal:

> - GCP
>
> ##### General availability - 4.10
>
> - OpenStack upgrade and installation with the out-of-tree CCM by default. (GA)
a reviewer (Contributor) commented:

Do we need AWS and Azure in this list too, with the above description as well?

The PR author replied:

Yep, I added a general direction but missed AWS and Azure in 4.10. Does it look better now? @JoelSpeed

@elmiko (Contributor) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2021
@Fedosin (Contributor) commented Jul 15, 2021

/lgtm

@JoelSpeed (Contributor)

My email on June 30th proposed that this enhancement would be merged on the 9th of July if there were no further comments, and updates have since been made to address the remaining few comments. I am therefore happy at this point that we have addressed the concerns of the engineers who have reviewed this project so far, that this proposal now reflects the implementation as it is today, and that it captures the concerns and questions others may have if they need to work out why or how we implemented this project in the future.

Thank you to all those who have been involved in this effort, this is a much larger project than we had anticipated and we have had to work with many teams to get this through.

Congrats @Danil-Grigorev

/approve

@openshift-ci bot commented Jul 15, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, JoelSpeed, rtheis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 15, 2021
@JoelSpeed (Contributor)

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2021
@openshift-merge-robot openshift-merge-robot merged commit 7fc7e1c into openshift:master Jul 15, 2021
@kikisdeliveryservice (Contributor)

It seems really problematic that authors are approving their own enhancement

```yaml
approvers:
  - "@enxebre"
  - "@JoelSpeed"
  - "@crawford"
  - "@derekwaynecarr"
  - "@enxebre"
  - "@eparis"
  - "@mrunalp"
  - "@sttts"
authors:
  - "@Danil-Grigorev"
  - "@Fedosin"
  - "@JoelSpeed"
```

This doesn't seem to have approvals from the people it says it needs approvals from?

@Danil-Grigorev Danil-Grigorev deleted the out-of-tree-support branch July 15, 2021 20:08
@JoelSpeed (Contributor)

> It seems really problematic that authors are approving their own enhancement

I agree! But I think this is a wider problem with the enhancement process and something I know Doug (who was looking at this before IIRC) is aware of.

The problem, in my opinion, is a lack of clear ownership for enhancements. Whose responsibility is it to drive an enhancement forward, and whose responsibility is it to say it can merge? This has been unclear to me since I joined RH.

For projects that only affect a single team, is the tech lead allowed to approve it? What if the tech lead wrote it? They're still the only one on the team able to approve, so do they have to get someone outside the team to approve?

When it crosses team boundaries, do we need to get a /approve from every single tech lead whose team might be affected? Will one architect's approval suffice? There doesn't seem to be a consistent pattern or guidance (that I've seen) on what to do here.

What I disagree with, however, is that we've done anything wrong here.

In this particular case:

- This enhancement spans a number of teams.
- It has been open for 10 months and has been reviewed by all of the teams affected at least once (some more than once).
- There have been hundreds and hundreds of comments on this (IIRC about 700 in total, based on the "This week in enhancements" stats).
- Several of the architects have read this and left their own feedback, which we have resolved.
- The project is 95% implemented at this point and, bar some comments from WMCO, this has been stale for a long time; no one was giving further feedback.
- I reached out to the team leads who were affected by this change during the 4.9 cycle and got approval from them to go ahead with the changes. The main teams affected were Workloads, API, MCO and Storage; I have spoken to each of them many times about this, they are aware, and they told me they were happy with the changes.

Finally, I gave everyone a heads up about this on aos-devel and invited final feedback, as per my comment above. It was also mentioned in the "This week in enhancements" that same week.

I believe we have given people plenty of time to object to the changes or point out problems. Since there were no further objections, and the recent changes had already been signed off by forum-arch before they were implemented, I felt, and still feel, it was time to merge the enhancement.

If you have any additional feedback, I'd be happy to discuss that with you and propose an adjustment to this enhancement in a new PR.
