Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heterogeneous architecture clusters #1014

Merged
merged 1 commit into from
Jul 21, 2022

Conversation

Prashanth684
Copy link
Contributor

Introduces support for provisioning and upgrading heterogenous architecture clusters in phases

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 20, 2022

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot requested review from mandre and runcom January 20, 2022 22:41
Copy link
Contributor

@bparees bparees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first pass

there's also a fair bit of formatting and typo cleanup needed that i didn't explicilty scrub for

also in general "manifestlist" is written as one word, it's effectively a noun.

enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
### Machine Config Operator

No changes are needed in the MCO for Phase 1 as the machine-os-content image would be a manifest listed image and the machine config daemon would extract the relevant architecture's machine-os-content based on the node it runs on. In the future architecture
specific configurations should not be required, instead they need to be templatized and be generalized for certain items, for example kubelet system reserved memory could scale based on page size.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so:
initially, we'll expect the user to provide any additional args/config in the machineset config
middle-state, we'll explicitly configure those things based on the arch
final-state, we'll auto-configure them based on generalized analysis

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still have to understand from Colin/Mrunal the changes involved here for MCO.

Copy link
Contributor Author

@Prashanth684 Prashanth684 Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mrunalp could you please clarify what specifically would change in MCO when we support heterogeneous clusters a opposed to what we are doing today ?

enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogenous-architecture.md Outdated Show resolved Hide resolved
@Prashanth684 Prashanth684 force-pushed the hetero-arch branch 3 times, most recently from a46117c to e8348e1 Compare January 24, 2022 00:06
@Prashanth684 Prashanth684 changed the title Heterogenous architecture clusters Heterogeneous architecture clusters Jan 24, 2022
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2022
@Prashanth684
Copy link
Contributor Author

first pass

there's also a fair bit of formatting and typo cleanup needed that i didn't explicilty scrub for

also in general "manifestlist" is written as one word, it's effectively a noun.

did a fair bit of cleanup on the typos and formatting.

@Prashanth684 Prashanth684 force-pushed the hetero-arch branch 2 times, most recently from 46074ba to 6469fdc Compare January 26, 2022 18:12
@Prashanth684 Prashanth684 force-pushed the hetero-arch branch 8 times, most recently from 3a6fa8d to b925a5f Compare January 27, 2022 20:00
@Prashanth684 Prashanth684 force-pushed the hetero-arch branch 2 times, most recently from 68d577b to a54b492 Compare January 27, 2022 22:37
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved
enhancements/multi-arch/heterogeneous-architecture.md Outdated Show resolved Hide resolved

As part of this migration, the ClusterVersion API's spec would need to have an additional "architecture" field to indicate the architecture of the cluster.
This field would indicate whether the cluster is a homogeneous or heterogeneous cluster with the individual arches denoting a homogeneous cluster and
"multi" indicating a heterogeneous cluster. This field would also be part of the ClusterVersionStatus to indicate to consumers of the API who would want to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there going to be a metric exposing this so that we can slice our telemetry data based on this information?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes in the future we do want to have this. Today for collecting data on the number of clusters per arch we have to look at a node's arch label. This will make it easier going forward to identify the cluster arch from the ClusterVersionStatus field rather than looking at a node's arch in that cluster.

Copy link
Contributor

@bparees bparees left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a little unclear on the intent of this EP at this point. Given 4.11 is about to ship, i'd expect this EP to be a lot more concrete about exactly what is/was done for 4.11/phase1, and then reserving the more "design" bits for the work that still needs to be done.

@Prashanth684
Copy link
Contributor Author

i'm a little unclear on the intent of this EP at this point. Given 4.11 is about to ship, i'd expect this EP to be a lot more concrete about exactly what is/was done for 4.11/phase1, and then reserving the more "design" bits for the work that still needs to be done.

this enhancement was opened in January and since then there are a lot of changes and lot of implementation details that have been worked on and changed. that being said, the deliverables for 4.11 (and other phases) are reflected accurately i believe. I have tried to keep most of the component sections up to date, but i see that there are several sections which are outdated. It would be good to get this merged soon, so incremental changes can be made by the individual component teams themselves as they progress. In the meanwhile I will work on updating all sections to reflect the latest so we can get it merged by end of next week.

@Prashanth684 Prashanth684 force-pushed the hetero-arch branch 3 times, most recently from a5db613 to 67c679a Compare July 19, 2022 16:25
@Prashanth684
Copy link
Contributor Author

i'm a little unclear on the intent of this EP at this point. Given 4.11 is about to ship, i'd expect this EP to be a lot more concrete about exactly what is/was done for 4.11/phase1, and then reserving the more "design" bits for the work that still needs to be done.

this enhancement was opened in January and since then there are a lot of changes and lot of implementation details that have been worked on and changed. that being said, the deliverables for 4.11 (and other phases) are reflected accurately i believe. I have tried to keep most of the component sections up to date, but i see that there are several sections which are outdated. It would be good to get this merged soon, so incremental changes can be made by the individual component teams themselves as they progress. In the meanwhile I will work on updating all sections to reflect the latest so we can get it merged by end of next week.

@bparees i've cleaned up some of the sections which did not reflect the current plan. Are there any other concerns from your side as i look to merge this week?

@bparees
Copy link
Contributor

bparees commented Jul 19, 2022

i've cleaned up some of the sections which did not reflect the current plan. Are there any other concerns from your side as i look to merge this week?

@Prashanth684 no further concerns, the cleanup has significantly improved readability and made it clearer what work is being proposed in what areas/components(and why). Thanks!

- When creating the hosted cluster, the latest x86 release payload is picked by default. This would have to change to pick a heterogeneous payload
when creating the hosted cluster on mixed architecture workers.
- For the hosted control plane to work on non x86 architecture worker nodes, the following image references would need changing:
- the release image fixtures referencing x86 imagestreams. these would need to be changed to manifestlist references or there would need
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which images is this referring this exactly? we infer everything else from hostedCluster.spec.release.image, so as far as that's heterogeneous, any component should pick the right arch from there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm referring to these: https://github.com/openshift/hypershift/tree/main/support/releaseinfo/fixtures . shouldn't there be a fixture added with manifestlisted pullspecs for heterogeneous?

Introduces support for provisioning and upgrading heterogenous architecture clusters in phases

Co-authored-by: Ben Parees <bparees@users.noreply.github.com>
@bparees
Copy link
Contributor

bparees commented Jul 21, 2022

/approve
/lgtm

i expect this EP will continue to be a living document as we work our way through the implementation phases, but for now it outlines the big picture steps we need to take and the implications.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 21, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bparees

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 21, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jul 21, 2022

@Prashanth684: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet