
Describing steps to support out-of-tree providers #463

Merged

Conversation

@Danil-Grigorev (Contributor) commented Sep 1, 2020

This enhancement proposal describes the migration of cloud platforms from the deprecated in-tree cloud providers to Cloud Controller Manager services that implement the external cloud provider interface.
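Concretely, the switch means kubelets run with `--cloud-provider=external` while the cloud-specific control loops move out of the kube-controller-manager into a standalone cloud-controller-manager workload. A minimal sketch of such a deployment, assuming a placeholder namespace, image, and flags (the real manifests are defined by the enhancement itself):

```yaml
# Illustrative sketch only: namespace, image, and flags are assumptions,
# not the manifests this enhancement actually ships.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloud-controller-manager
  namespace: openshift-cloud-controller-manager   # assumed namespace name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloud-controller-manager
  template:
    metadata:
      labels:
        app: cloud-controller-manager
    spec:
      containers:
      - name: cloud-controller-manager
        image: registry.example.com/openstack-cloud-controller-manager:latest
        args:
        - --cloud-provider=openstack   # replaces the in-tree code compiled into KCM
      tolerations:
      # the CCM must be schedulable before it has initialized any nodes
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
```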

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 1, 2020
@JoelSpeed (Contributor) left a comment

This is a really really good start! I've left a bunch of comments, suggestions and questions throughout.

I think it would be good to expand on the CCM operator, as this isn't really explained in detail. I assume we are going to need an operator that determines what to deploy, DaemonSet-wise, based on the Infrastructure?
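For reference, the platform signal such an operator would presumably key off comes from the cluster-scoped Infrastructure resource; a trimmed example showing only the relevant fields:

```yaml
# Trimmed example of the cluster Infrastructure resource an operator
# could inspect to decide which manifests to deploy.
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  name: cluster
status:
  platformStatus:
    type: Azure   # e.g. select the Azure CCM (and CNM DaemonSet) based on this
```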

@Danil-Grigorev Danil-Grigorev changed the title [WIP] Describing steps to support out-of-tree providers Describing steps to support out-of-tree providers Sep 12, 2020
@Danil-Grigorev Danil-Grigorev marked this pull request as ready for review September 12, 2020 20:43
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2020
@michaelgugino (Contributor) left a comment

So, from discussions on a recent call: I think this component should be deployed by whatever operator controls the kube-controller-manager. The lifecycles of the two components are tightly coupled, regardless of where the source code lives.

It's likely we'll have to stagger the rollout of out-of-tree providers, as some will be ready while others will not, and the ones that are ready to go out of tree will no longer be receiving in-tree support. At least for the first iteration, we should make this component part of the same deployment as the KCM, IMO, as it should be a simple enough set of logic to determine whether or not to deploy the out-of-tree provider and the relevant options for the KCM.

@derekwaynecarr (Member)

@michaelgugino I agree. It should just run as a container next to the KCM in the same pod. We have experience in this approach with managed cloud services.

@smarterclayton (Contributor), replying to @derekwaynecarr:

> @michaelgugino I agree. It should just run as a container next to the KCM in the same pod. We have experience in this approach with managed cloud services.

I agree, and I expect this to be hidden and abstracted by the infrastructure (where the code runs is not an operational detail for teams).
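A hypothetical sketch of the sidecar arrangement being discussed here; the images, namespace, and flags are placeholders, not the actual operator manifests (and the final design may differ):

```yaml
# Sketch: CCM running as a container beside the KCM in the same pod.
apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: openshift-kube-controller-manager
spec:
  containers:
  - name: kube-controller-manager
    image: registry.example.com/kube-controller-manager:latest
    args:
    - --cloud-provider=external   # KCM skips its in-tree cloud controller loops
  - name: cloud-controller-manager
    image: registry.example.com/aws-cloud-controller-manager:latest
    args:
    - --cloud-provider=aws        # the out-of-tree provider runs beside the KCM
```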

@JoelSpeed (Contributor) left a comment

I think we need to add more detail about the CNM on Windows machines, plus I've added a few suggestions to improve the wording as below.

For the CNM on Windows we need to add a section, something along the lines of:

#### Windows Machine Config Operator

Most changes described in this document should be transparent to WMCO.
WMCO reads the output of MCO's rendered configuration and as such, will pick up the changes we need to make to Kubelet config via that mechanism.

However, on certain platforms (e.g. Azure), the Cloud Node Manager (CNM, responsible for initialising the Node) must be run on the Node itself via a DaemonSet.

Since Red Hat cannot supply or support Windows container images, we cannot run a DaemonSet for the CNM targeted at Windows Nodes as we would do on Linux Nodes.
Instead, we must adapt the WMCO to, on these particular platforms, deploy a new Windows service that runs the CNM on the Node.

This pattern is already in place for other components that are required to run on the host (e.g. CNI and kube-proxy), so we will be able to reuse the existing pattern to add support for CNM on platforms that require a CNM per host.

@aravindhp Could you check over this paragraph above, do you think that captures the content we need for WMCO or have I missed something?

@aravindhp (Contributor)

> Most changes described in this document should be transparent to WMCO.
> WMCO reads the output of MCO's rendered configuration and as such, will pick up the changes we need to make to Kubelet config via that mechanism.

We don't blindly apply the Linux configuration, so we should ensure that WMCO/WMCB picks up the required pieces during the implementation.

> [remainder of the suggested "Windows Machine Config Operator" section as quoted above]
>
> @aravindhp Could you check over this paragraph above, do you think that captures the content we need for WMCO or have I missed something?

Yes, in general this looks good.

/cc @openshift/openshift-team-windows-containers

@JoelSpeed (Contributor)

I've made some updates to my previous suggestion around WMCO to make it clearer that there are two changes to consider based on feedback from Aravindh and Danil

#### Windows Machine Config Operator

##### Kubelet Changes

To generate the configuration for Windows nodes, WMCO reads the output of MCO's rendered configuration as a basis for its own config for Kubelet on the Windows node.
As we are making changes to the output of the Kubelet service in MCO (namely changing the value of the `cloud-provider` flag), we will need to verify that WMCO reads this flag and copies its value to the Kubelet Windows service.

##### Node Initialization

On most platforms, Node initialization is handled centrally by the CCM, specifically the Cloud Node Manager (CNM) running within it.
However, on certain platforms (e.g. Azure), the CNM must be run on the Node itself, typically via a DaemonSet.

Since Red Hat cannot supply or support Windows container images, we cannot run a DaemonSet for the CNM targeted at Windows Nodes as we would do on Linux Nodes.
Instead, we must adapt the WMCO to, on these particular platforms, deploy a new Windows service that runs the CNM on the Node.

This pattern is already in place for other components that are required to run on the host (e.g. CNI and kube-proxy), so we will be able to reuse the existing pattern to add support for CNM on platforms that require a CNM per host.
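For contrast with the Windows-service approach, a hedged sketch of what the per-node CNM DaemonSet on Linux might look like; the name, namespace, and image are illustrative assumptions:

```yaml
# Sketch of a Linux-side CNM DaemonSet; on Windows, WMCO would instead
# install the equivalent binary as a host service.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: azure-cloud-node-manager
  namespace: openshift-cloud-controller-manager   # assumed namespace name
spec:
  selector:
    matchLabels:
      k8s-app: azure-cloud-node-manager
  template:
    metadata:
      labels:
        k8s-app: azure-cloud-node-manager
    spec:
      nodeSelector:
        kubernetes.io/os: linux   # Windows nodes are excluded; WMCO handles them
      tolerations:
      # must run on nodes the CNM has not yet initialized
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      containers:
      - name: cloud-node-manager
        image: registry.example.com/azure-cloud-node-manager:latest
        command: ["cloud-node-manager"]
        args:
        - --node-name=$(NODE_NAME)
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
```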

@JoelSpeed (Contributor)

I think this is ready to go now. @aravindhp, could I get you to do another pass on the WMCO section? We added it to the proposal with the latest commit on this branch.

@Danil-Grigorev If everything is OK from the WINC team, then I'll approve this and we can get it merged. Thanks for all your work on this over the last year!

@sebsoto (Contributor) left a comment

Thanks for all the detail! Overall looks good to me

@elmiko (Contributor) left a comment

thanks for the updates Danil
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2021
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2021
On this excerpt from the proposal:

> - GCP
>
> ##### General availability - 4.10
>
> - OpenStack upgrade and installation with the out-of-tree CCM by default. (GA)
a reviewer (Contributor) commented:

Do we need AWS and Azure in this list too, with the above description as well?

The PR author replied:

Yep, I added a general direction but missed AWS and Azure in 4.10. Does it look better now? @JoelSpeed

@elmiko (Contributor) left a comment

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 15, 2021
@Fedosin (Contributor) commented Jul 15, 2021

/lgtm

@JoelSpeed (Contributor)

My email on June 30th proposed that this enhancement would be merged on the 9th of July if there were no further comments, and updates have since been made to address the remaining few comments. I am therefore happy at this point that we have addressed the concerns of the engineers who have reviewed this project so far, that this proposal now reflects the implementation as it is today, and that it captures the concerns and questions others may have if they need to work out why or how we implemented this project in the future.

Thank you to all those who have been involved in this effort, this is a much larger project than we had anticipated and we have had to work with many teams to get this through.

Congrats @Danil-Grigorev

/approve

@openshift-ci bot commented Jul 15, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, JoelSpeed, rtheis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 15, 2021
@JoelSpeed (Contributor)

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 15, 2021
@openshift-merge-robot openshift-merge-robot merged commit 7fc7e1c into openshift:master Jul 15, 2021
@kikisdeliveryservice (Contributor)

It seems really problematic that authors are approving their own enhancement

```yaml
approvers:
  - "@enxebre"
  - "@JoelSpeed"
  - "@crawford"
  - "@derekwaynecarr"
  - "@enxebre"
  - "@eparis"
  - "@mrunalp"
  - "@sttts"
authors:
  - "@Danil-Grigorev"
  - "@Fedosin"
  - "@JoelSpeed"
```

This doesn't seem to have approvals from the people it says it needs approvals from?

@Danil-Grigorev Danil-Grigorev deleted the out-of-tree-support branch July 15, 2021 20:08
@JoelSpeed (Contributor)

> It seems really problematic that authors are approving their own enhancement

I agree! But I think this is a wider problem with the enhancement process and something I know Doug (who was looking at this before IIRC) is aware of.

The problem, in my opinion, is a lack of clear ownership for enhancements. Whose responsibility is it to drive an enhancement forward, and whose responsibility is it to say it can merge? This has been unclear to me since I joined RH.

For projects that only affect a single team, is the tech lead allowed to approve it? What if the tech lead wrote it? They're still the only one on the team able to approve, so do they have to get someone outside the team to approve?

When it crosses team boundaries, do we need to get a /approve from every single tech lead whose team might be affected? Will one architect's approval suffice? There doesn't seem to be a consistent pattern or guidance (that I've seen) on what to do here.

What I disagree with, however, is that we've done anything wrong here.

In this particular case:

- This enhancement spans a number of teams.
- It has been open for 10 months and has been reviewed by all of the teams affected at least once (some more than once).
- There have been hundreds and hundreds of comments on this (IIRC about 700 in total, based on the "This week in enhancements" stats).
- Several of the architects have read this and left their own feedback, which we have resolved.
- The project is 95% implemented at this point and, bar some comments from WMCO, this has been stale for a long time; no one was giving further feedback.
- I reached out to the team leads who were affected by this change during the 4.9 cycle and got approval from them to go ahead with the changes. The main teams affected were Workloads, API, MCO and Storage; I have spoken to each of them many times about this, they are aware, and they told me they were happy with the changes.

Finally, I gave everyone a heads up about this on aos-devel and invited final feedback, as per my comment above. It was also mentioned in the "This week in enhancements" that same week.

I believe we have given people plenty of time to object to the changes or point out problems. Since there were no further objections, and the recent changes had already been signed off by forum-arch before they were implemented, I felt, and still feel, it was time to merge the enhancement.

If you have any additional feedback, I'd be happy to discuss that with you and propose an adjustment to this enhancement in a new PR.
