
WIP: doc: Begin a document on adding a new OpenShift platform #1112

Closed

Conversation

smarterclayton (Contributor):

This covers the minimal steps and process to go from "nothing" to
"OpenShift is fully capable of running on your platform". Heavily
work in progress, but should capture the why, our support levels,
and our target config, as well as mechanical steps to get down the
line.

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 22, 2019
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 22, 2019
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 22, 2019

### Enable core platform

1. **Boot** - Ensure RH CoreOS boots on the desired platform, that Ignition works, and that you have VM / machine images to test with
Member:

I'd also note here that for new cloud platforms, Ignition may need support upstream. For example here's a PR for a non-top-tier cloud: coreos/ignition#667


To boot RHCoS to a new platform, you must:

1. Ensure ignition supports that platform via an OEM ID
Member:

Ahh I see you cover this here. I'd point to coreos/fedora-coreos-tracker#95
and actually in an ideal world patches land in FCOS first and we later backport them.

Contributor Author:

I'd prefer to reference ignition directly for now so as to make it clear what the priority ordering is.

Contributor Author:

The link is what I was alluding to
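As a rough illustration of the OEM ID mechanism being discussed, Ignition-style platform detection reads a platform identifier associated with the VM image, for example from the kernel command line. This is a hedged sketch only: the function and fallback value are invented for illustration, and the flag names (`ignition.platform.id` for newer Ignition, `coreos.oem.id` historically) are the author's best understanding rather than something stated in this thread.

```python
# Illustrative sketch of an Ignition-style OEM/platform ID lookup.
# The helper name and "metal" fallback are assumptions for illustration.

def parse_platform_id(cmdline: str) -> str:
    """Extract a platform/OEM ID from a kernel command line string."""
    # Check both the newer and the legacy key names.
    for key in ("ignition.platform.id", "coreos.oem.id"):
        for token in cmdline.split():
            if token.startswith(key + "="):
                return token.split("=", 1)[1]
    return "metal"  # no key present: assume a bare-metal default

print(parse_platform_id("ro ignition.platform.id=aws console=ttyS0"))
```

The point of the sketch is simply that each new platform needs its own recognized identifier before Ignition can select platform-specific fetch behavior.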

5. **Enable Provisioning** Add a hidden installer option to this repo for the desired platform as a PR and implement the minimal features for bootstrap as well as a reliable teardown
6. **Enable Platform** Ensure all operators treat your platform as a no-op
7. **CI Job** Add a new CI job to the installer that uses the credentials above to run the installer against the platform and correctly tear down resources
8. **Publish Images** Ensure RH CoreOS images on the platform are being published to a location CI can test
Member:
Should publish images be before #7?

Contributor Author:

not actually required to get the PR up, which is just why I ordered it (you can publish one yourself into the CI infra)

5. Do *not* have automatic cloud provider permissions to perform infrastructure API calls
6. Have a domain name pointing to the load balancer IP(s) that is `api.<BASE_DOMAIN>`
7. Has an internal DNS CNAME pointing to each master called `etcd-N.<BASE_DOMAIN>` that
8. Has an optional internal load balancer that TCP load balances all master nodes, with a DNS name `internal-api.<BASE_DOMAIN>` pointing to the load balancer.
Member:

Is the DNS name optional too (or just the load balancer)? Would the external DNS need the internal-api name registered for use by the cluster without internal DNS?

Contributor Author:

DNS isn't optional for cert signing, but I guess you could technically sign your IP.
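To make the naming requirements above concrete, here is a small sketch that generates the record names a new platform would need, assuming the `api.`/`etcd-N.`/`internal-api.` scheme from the requirements list. The helper is hypothetical, not installer code, and the example domain is a placeholder.

```python
# Hypothetical helper: enumerate the DNS names the requirements call for.

def expected_dns_records(base_domain: str, masters: int) -> dict:
    """Return the DNS names a platform must provide for a cluster."""
    return {
        "api": f"api.{base_domain}",                    # public API load balancer
        "internal-api": f"internal-api.{base_domain}",  # optional internal LB
        # One internal CNAME per master for etcd discovery: etcd-0, etcd-1, ...
        "etcd": [f"etcd-{i}.{base_domain}" for i in range(masters)],
    }

recs = expected_dns_records("mycluster.example.com", 3)
print(recs["api"])
print(recs["etcd"])
```

As the thread notes, the DNS names themselves are not optional (certificates are signed against them), even where the internal load balancer is.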

2. **Arch** - Identify the correct opinionated configuration for a desired platform supporting the default features.
3. **CI** - Identify credentials and setup for a CI environment, ensure those credentials exist and can be used in the CI environment
4. **Name** - Identify and get approved the correct naming for adding a new platform to the core API objects (specifically the [infrastructure config](https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go) and the installer config (https://github.com/openshift/installer/blob/master/pkg/types/aws/doc.go)) so that we are consistent
5. **Enable Provisioning** Add a hidden installer option to this repo for the desired platform as a PR and implement the minimal features for bootstrap as well as a reliable teardown
Member:

Can we chronicle those out in a separate doc or section? Below, we identify the DNS and load balancer requirements ( L48-L76). We should be able to identify those, bucket and networking reqs for the current product and identify the IPI and UPI behaviors/expectations of those components.

Contributor Author:

Those should be in Enable Provisioning.

3. Have low-latency interconnects (<5ms RTT) and persistent disks that survive reboot and are provisioned for at least 300 IOPS
4. Have cloud or infrastructure firewall rules that at minimum allow the standard ports to be opened (see AWS provider)
5. Do *not* have automatic cloud provider permissions to perform infrastructure API calls
6. Have a domain name pointing to the load balancer IP(s) that is `api.<BASE_DOMAIN>`
Member:

<CLUSTER_NAME>-api.<BASE_DOMAIN>

4. Have cloud or infrastructure firewall rules that at minimum allow the standard ports to be opened (see AWS provider)
5. Do *not* have automatic cloud provider permissions to perform infrastructure API calls
6. Have a domain name pointing to the load balancer IP(s) that is `api.<BASE_DOMAIN>`
7. Has an internal DNS CNAME pointing to each master called `etcd-N.<BASE_DOMAIN>` that
Member:

<CLUSTER_NAME>-etcd-N.<BASE_DOMAIN>

@openshift-ci-robot (Contributor):
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

3. **CI** - Identify credentials and setup for a CI environment, ensure those credentials exist and can be used in the CI environment
4. **Name** - Identify and get approved the correct naming for adding a new platform to the core API objects (specifically the [infrastructure config](https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go) and the installer config (https://github.com/openshift/installer/blob/master/pkg/types/aws/doc.go)) so that we are consistent
5. **Enable Provisioning** Add a hidden installer option to this repo for the desired platform as a PR and implement the minimal features for bootstrap as well as a reliable teardown
6. **Enable Platform** Ensure all operators treat your platform as a no-op
Member:

If we have a general policy for "operators treat unrecognized platforms as if they were none", then this step would not be required when adding a new platform.

Member:

Ah, you have that policy down here. I think you can drop this list entry, and we can file bugs with any operators that are currently non-compliant.

Once the platform can be launched and tested, system features must be implemented. The sections below are roughly independent:

* General requirements:
* Replace the installer terraform destroy with one that doesn't rely on terraform state
Member:

nit: "terraform" -> "Terraform".

And maybe mention that this is because, once cluster components can create additional resources on the target platform, we'll still need to clean them up, and Terraform won't know about them.
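A state-free destroy, as the comment above suggests, typically means discovering resources by a cluster-ownership tag rather than reading Terraform state, so that resources created after installation (by cluster components Terraform never saw) are cleaned up too. The sketch below is hypothetical: the tag key mirrors the `kubernetes.io/cluster/<name>` ownership convention, and the resource representation stands in for whatever a real cloud API returns.

```python
# Hypothetical sketch of tag-based cluster teardown (not installer code).
CLUSTER_TAG = "kubernetes.io/cluster/{name}"

def destroy_cluster(resources: list, cluster_name: str):
    """Partition resources into (deleted, remaining) by ownership tag.

    `resources` is a list of dicts with a "tags" mapping, standing in
    for a real cloud inventory API.
    """
    tag_key = CLUSTER_TAG.format(name=cluster_name)
    deleted = [r for r in resources if r.get("tags", {}).get(tag_key) == "owned"]
    remaining = [r for r in resources if r not in deleted]
    return deleted, remaining
```

Because deletion is driven by what the platform reports as tagged, the approach works even when local Terraform state is missing or stale.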

1. Runs RH CoreOS
2. Is reachable by control plane nodes over the network
3. Is part of the control plane load balancer until it is removed
4. Can reach a network endpoint that hosts the bootstrap ignition file securely, or has the bootstrap ignition injected
wking (Member), Jan 30, 2019:

nit: "ignition" -> "Ignition" here and elsewhere in this doc.

The following clarifications to configurations are noted:

1. The control plane load balancer does not need to be exposed to the public internet, but the DNS entry must be visible from the location the installer is run.
2. Master nodes are not required to expose external IPs for SSH access, but can instead allow SSH from a bastion inside a protected network.
Member:

Drop "Master" and the following list entry? This applies equally to master and compute nodes; I don't see an upside to splitting over two entries.


Red Hat CoreOS uses ignition to receive initial configuration from a remote source. Ignition has platform specific behavior to read that configuration that is determined by the `oemID` embedded in the VM image.

To boot RHCoS to a new platform, you must:
wking (Member), Jan 30, 2019:

nit: "RHCoS" -> "RHCOS", here and elsewhere in this doc? I think the acronym is [R]ed [H]at [C]ore [OS], not, [R]ed [H]at [Co]re O[S].

Continuous Integration
----------------------

To enable a new platform, require a core continuous integration testing loop that verifies that new changes do not regress our support for the platform. The minimum steps required are:
Member:

nit: "require" -> "we require", or similar.


To enable a new platform, require a core continuous integration testing loop that verifies that new changes do not regress our support for the platform. The minimum steps required are:

1. Have an infrastructure that can receive API calls from the OpenShift CI system to provision/destroy instances
Member:

"instances" -> "infrastructure".


1. Add a new hidden provisioner
2. Define the minimal platform parameters that the provisioner must support
3. Use Terraform or direct Go code to provision that platform via the credentials provided to the installer.
Member:

"provision" -> "provision and destroy"? One benefit of Terraform is that it makes centralized bootstrap teardown fairly straightforward, although you could certainly switch on the platform to invoke platform-specific Go bootstrap-teardown code. And we need to destroy resources for destroy cluster to keep the account from filling with cruft, although that doesn't need to be as specific as bootstrap teardown.

2. Define the minimal platform parameters that the provisioner must support
3. Use Terraform or direct Go code to provision that platform via the credentials provided to the installer.

A minimal provisioner must be able to launch the control plane and bootstrap node via an API call and accept any "environmental" settings like network or region as inputs. The installer should use the Route53 DNS provisioning code to set up round robin to the bootstrap and control plane nodes if necessary.
Member:

Is this Route 53 reference intentional? For example, libvirt uses its own DNS configuration for RRDNS, and doesn't involve Route 53.

@smarterclayton (Contributor Author):

/retest


1. The control plane nodes:
1. Run RH CoreOS, allowing in-place updates
2. Are fronted by a load balancer that allows raw TCP connections to port 6443 and exposes port 443
Contributor:

To be IaaS-neutral, wouldn't it be possible to use keepalived (within Kube, since RHCOS is immutable)? It could be used either as an LB or as a failover handler. Not using AWS doesn't automatically mean having a hardware LB in front of a cluster.
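The keepalived idea in the comment above might look roughly like the minimal VRRP fragment below. This is an untested illustration, not configuration from the PR: the instance name, interface, and virtual IP are all placeholders, and the design choice being sketched is a floating VIP elected among control-plane nodes instead of a cloud or hardware load balancer.

```
vrrp_instance api_vip {
    state BACKUP          # all nodes start as BACKUP; VRRP elects a MASTER
    interface eth0        # placeholder interface name
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24     # placeholder VIP fronting the API on port 6443
    }
}
```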

@openshift-ci-robot (Contributor):

openshift-ci-robot commented Oct 4, 2019

@smarterclayton: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Rerun command |
| --- | --- | --- |
| ci/prow/e2e-aws-rhel8 | 974d6cc | /test e2e-aws-rhel8 |
| ci/prow/e2e-aws-upgrade | 974d6cc | /test e2e-aws-upgrade |
| ci/prow/e2e-aws | 974d6cc | /test e2e-aws |
| ci/prow/e2e-aws-disruptive | 974d6cc | /test e2e-aws-disruptive |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya (Contributor):

Closing due to this being open for a long time. Please feel free to reopen.

/close

@openshift-ci-robot (Contributor):

@abhinavdahiya: Closed this PR.

In response to this:

Closing due to this being open for a long time. Please feel free to reopen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@displague mentioned this pull request on Jul 17, 2020 and Dec 10, 2020.