doc: Begin a document on adding a new OpenShift platform
This covers the minimal steps and process to go from "nothing" to
"OpenShift is fully capable of running on your platform". Heavily
work in progress, but should capture the why, our support levels,
and our target config, as well as mechanical steps to get down the
line.

smarterclayton committed Jan 22, 2019

docs/dev/adding-new-platform.md

How to add a new platform to OpenShift
======================================

This document describes the process for adding a new platform to OpenShift in stages. Because there are many components, the process is defined in terms of incremental safe spots that can be built up over time and delivered individually. It is *impossible* to add a new platform to OpenShift all in one chunk.

The general order of adding a new platform for OpenShift is:

### Enable core platform

1. **Boot** - Ensure RH CoreOS boots on the desired platform, that Ignition works, and that you have VM / machine images to test with
2. **Arch** - Identify the correct opinionated configuration for a desired platform supporting the default features.
3. **CI** - Identify the credentials and setup needed for a CI environment, and ensure those credentials exist and can be used in the CI environment
4. **Name** - Identify and get approval for the correct naming when adding a new platform to the core API objects (specifically the [infrastructure config](https://github.com/openshift/api/blob/master/config/v1/types_infrastructure.go) and the [installer config](https://github.com/openshift/installer/blob/master/pkg/types/aws/doc.go)) so that we are consistent
5. **Enable Provisioning** - Add a hidden installer option to this repo for the desired platform as a PR, and implement the minimal features for bootstrap as well as a reliable teardown
6. **Enable Platform** - Ensure all operators treat your platform as a no-op
7. **CI Job** - Add a new CI job to the installer that uses the credentials above to run the installer against the platform and correctly tear down resources
8. **Publish Images** - Ensure RH CoreOS images on the platform are being published to a location CI can test
9. **Enable** - Merge the platform PR to the installer with a green, passing CI job

At this point the platform is said to be an `unsupported IPI` (installer provided infrastructure) install - work can begin on enabling it in other repositories. Once documentation is completed for step 2 as part of the official platform bring up, the platform can be said to be `supported UPI` (user provided infrastructure) for the set of options in **Arch**.

### Enable day 2

Once the platform can be booted and tested, day 2 automation can be enabled in individual operators. The items below are unordered because they can be completed in parallel.

* Ensure the cloud provider is enabled in Kubernetes for your platform
* Enable dynamic compute (MachineSets) by adding a cloud actuator for that platform (a simplified actuator sketch appears at the end of this section)
* Enable the cloud credential operator for the platform
* Enable cluster-ingress-operator to provision the wildcard domain names
* Enable cluster-image-registry-operator to provision a storage bucket (if your platform supports object storage)
* Enable cluster-storage-operator to set the correct default storage class

Other important day 2 changes:

* Replace the installer terraform destroy with one that doesn't rely on terraform state
* Ensure the installer IPI support is consistent with other platform features (private config, etc)
* Enable a CI job that verifies the e2e suite for the given platform runs

At this point the platform is said to be a `supported IPI with Dynamic Compute` if the platform supports
MachineSets, or `supported IPI without Dynamic Compute` if it does not.
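
For the dynamic compute item above, the cloud actuator is the per-platform piece the machine controller calls to turn Machine objects into real instances. The following is a simplified sketch of the kind of interface involved; the real signatures live in the cluster-api machine controller and the existing AWS/libvirt actuators, and the `Machine` type here is only a stand-in.

```go
package actuator

import "context"

// Machine stands in for the cluster-api Machine object; a real actuator
// receives the full API type from the machine controller.
type Machine struct {
	Name string
	// ProviderSpec carries the platform-specific fields (instance type,
	// image, network, ...) that the actuator knows how to interpret.
	ProviderSpec map[string]string
}

// Actuator is the per-cloud interface behind MachineSets. A new platform
// implements these operations against its own infrastructure API.
type Actuator interface {
	// Create provisions a new instance backing the machine.
	Create(ctx context.Context, machine *Machine) error
	// Delete tears the backing instance down.
	Delete(ctx context.Context, machine *Machine) error
	// Update reconciles mutable instance state (tags, load balancer membership, ...).
	Update(ctx context.Context, machine *Machine) error
	// Exists reports whether the backing instance is present.
	Exists(ctx context.Context, machine *Machine) (bool, error)
}
```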


Defining the supported architecture for a platform
--------------------------------------------------

An OpenShift cluster programs the infrastructure it runs on to provide operational simplicity. For every platform, the minimum requirements are:

1. The control plane nodes:
    1. Run RH CoreOS, allowing in-place updates
    2. Are fronted by a load balancer that allows raw TCP connections to port 6443 and exposes port 443
    3. Have low-latency interconnections (<5ms RTT) and persistent disks that survive reboot and are provisioned for at least 300 IOPS
    4. Have cloud or infrastructure firewall rules that at minimum allow the standard ports to be opened (see the AWS provider)
    5. Do *not* have automatic cloud provider permissions to perform infrastructure API calls
    6. Have a domain name `api.<BASE_DOMAIN>` pointing to the load balancer IP(s)
    7. Have an internal DNS CNAME `etcd-N.<BASE_DOMAIN>` pointing to each master
    8. Have an optional internal load balancer that TCP load balances all master nodes, with a DNS name `internal-api.<BASE_DOMAIN>` pointing to the load balancer
2. The bootstrap node:
    1. Runs RH CoreOS
    2. Is reachable by control plane nodes over the network
    3. Is part of the control plane load balancer until it is removed
    4. Can reach a network endpoint that hosts the bootstrap Ignition file securely, or has the bootstrap Ignition injected
3. All other compute nodes:
    1. Must be able to reach the internal IPs reported by the master nodes directly
    2. Have cloud or infrastructure firewall rules that at minimum allow ports 4789, 6443, 9000-10000, and 10250-10255 to be reachable

The following clarifications to configurations are noted:

1. The control plane load balancer does not need to be exposed to the public internet, but the DNS entry must be visible from the location the installer is run.
2. Master nodes are not required to expose external IPs for SSH access, but can instead allow SSH from a bastion inside a protected network.
3. Compute nodes do not require external IPs

For dynamic infrastructure, the following permissions are required to be provided as part of the install:

1. Service LoadBalancer - Load balancers can be created and removed, infrastructure nodes can be queried
2. Dynamic Storage - New volumes can be created, deleted, attached, and detached from nodes. Snapshot creation is optional if the platform supports snapshotting
3. Dynamic Compute - New instances can be created, deleted, and restarted inside of the cluster's network / infrastructure, plus any platform specific constructs like programming instance groups for master load balancing on GCP.
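
To make the expected endpoints and ports concrete, the sketch below collects the DNS names and firewall ranges from the lists above as Go helpers and constants, the sort of thing a new platform's provisioning or validation code might reference. The identifiers are illustrative, not taken from the installer.

```go
package platformarch

import "fmt"

// DNS names every platform is expected to provide, relative to the cluster's
// base domain.
func APIDNSName(baseDomain string) string         { return "api." + baseDomain }
func InternalAPIDNSName(baseDomain string) string { return "internal-api." + baseDomain }
func EtcdDNSName(baseDomain string, index int) string {
	return fmt.Sprintf("etcd-%d.%s", index, baseDomain)
}

// requiredNodePorts lists the ports that must be reachable between nodes,
// per the compute node requirements above.
var requiredNodePorts = []struct {
	From, To int
	Purpose  string
}{
	{6443, 6443, "Kubernetes API, also exposed through the control plane load balancer"},
	{4789, 4789, "SDN overlay (VXLAN)"},
	{9000, 10000, "host-level services"},
	{10250, 10255, "kubelet"},
}
```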


Booting RH CoreOS
-----------------

Red Hat CoreOS uses Ignition to receive its initial configuration from a remote source. Ignition has platform-specific behavior for reading that configuration, determined by the `oemID` embedded in the VM image.

To boot RHCoS on a new platform, you must:

1. Ensure Ignition supports that platform via an OEM ID
2. Ensure that RHCoS has any necessary platform-specific code to communicate with the host (for instance, on Azure the instance must periodically report its health).
3. Have an RHCoS image with the appropriate oemID tag set.

The [gf-oemid script in the coreos-assembler repo](https://github.com/coreos/coreos-assembler/blob/master/src/gf-oemid) assists in converting the generic VM image into one with a specific oemID set. See the instructions there to create an image with the appropriate ID.

Once you have uploaded the image to your platform and a machine booted from it stays up, you can begin porting the installer to provide a minimal IPI.


Continuous Integration
----------------------

To enable a new platform, we require a core continuous integration testing loop that verifies new changes do not regress our support for the platform. The minimum steps required are:

1. Have an infrastructure that can receive API calls from the OpenShift CI system to provision/destroy instances
2. Support at minimum 3 concurrent clusters on that infrastructure for "per release image" testing (https://origin-release.svc.ci.openshift.org) that verifies a release supports that platform
3. Also support a per-PR target that can be selectively run on the installer, core, and operator repositories in OpenShift in order to allow developers to test incremental changes to those components

No PR will be merged to openshift/installer for platform support that cannot satisfy the above steps.


Naming
------

The platform name will be part of our public API and must go through standard API review. The name
should be consistent with common usage of the platform and be recognizable to a consumer.

The following names for platforms are good examples of what is expected:

* Amazon Web Services -> `aws` or `AWS`
* Google Cloud Platform -> `gcp` or `GCP`
* Azure -> `azure` or `Azure`
* Libvirt -> `libvirt` or `Libvirt`
* OpenStack -> `openstack` or `OpenStack`
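
In practice these names surface as string constants on the infrastructure global config and in the installer's platform types. The sketch below shows the general shape only; the exact identifiers and values are owned by the API review process and the linked types_infrastructure.go, so treat it as illustrative.

```go
package config

// PlatformType names the underlying platform an OpenShift cluster runs on.
type PlatformType string

// Illustrative values following the naming convention above; defer to
// openshift/api for the canonical set.
const (
	AWSPlatformType       PlatformType = "AWS"
	GCPPlatformType       PlatformType = "GCP"
	AzurePlatformType     PlatformType = "Azure"
	LibvirtPlatformType   PlatformType = "Libvirt"
	OpenStackPlatformType PlatformType = "OpenStack"
	// NonePlatformType is what operators fall back to for platforms they do
	// not recognize (see "Enable Platform" below).
	NonePlatformType PlatformType = "None"
)
```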


Enable Provisioning
-------------------

Since CI testing requires the ability to provision via an API, we define the basic path for supporting a platform as having a minimal provisioning path in the OpenShift installer. Not all platforms we support will have full infrastructure provisioning, but the basic path must be invokable via Go code in openshift-install before a platform can be certified. This ensures we have at least one path to installation.

The OpenShift installer has normal and hidden provisioners. The hidden provisioners are explicitly unsupported for production use but are supported for testing.

1. Add a new hidden provisioner
2. Define the minimal platform parameters that the provisioner must support
3. Use Terraform or direct Go code to provision that platform via the credentials provided to the installer.

A minimal provisioner must be able to launch the control plane and bootstrap node via an API call and accept any "environmental" settings like network or region as inputs. The installer should use the Route53 DNS provisioning code to set up round robin to the bootstrap and control plane nodes if necessary.
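
As a rough illustration of step 2 above, a new platform's package under the installer's `pkg/types` would define the handful of "environmental" inputs the hidden provisioner needs. Everything below is hypothetical (the package and field names are placeholders, not real installer code); the AWS package linked earlier shows the real pattern.

```go
// Package newcloud is a hypothetical example of the minimal platform
// parameters a hidden provisioner would accept. It is not a real installer
// package.
package newcloud

// Platform stores the environmental settings the installer cannot discover on
// its own and must be given as input for this hypothetical platform.
type Platform struct {
	// Region is the geographic region to create the cluster in.
	Region string `json:"region"`

	// BaseImage is the RH CoreOS image (with this platform's oemID set) used
	// to boot the bootstrap and control plane machines.
	BaseImage string `json:"baseImage"`

	// NetworkID optionally places the cluster in an existing network instead
	// of provisioning a new one.
	NetworkID string `json:"networkID,omitempty"`
}

// MachinePool holds per-pool knobs, mirroring how existing platforms expose
// instance sizing.
type MachinePool struct {
	// InstanceType is the flavor or size used for machines in this pool.
	InstanceType string `json:"instanceType,omitempty"`
}
```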


Enable Platform
---------------

OpenShift handles platform functionality as a set of operators running on the platform that interface with users, admins, and infrastructure. Because operators handle day 2 reconfiguration of the cluster, many "installation" related duties are delegated to the operators.

Operators derive their configuration from top level API objects called `global configuration`. One such object is the `Infrastructure` global config, which reports which platform the cluster is running on.

All operators that react to infrastructure must support a `None` option, and any unrecognized infrastructure platform *MUST* be treated as `None`. When an operator starts, it should log a single warning if the infrastructure provider is not recognized and then fall back to `None`.

When adding a new platform to the installer, the infrastructure setting should happen automatically during bootstrapping, and if a component does not correctly treat your new platform as `None` it should be fixed immediately.
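
A minimal sketch of the fallback rule above: the operator looks at the platform reported by the `Infrastructure` status, warns about anything it does not recognize, and then behaves exactly as it would for `None`. Plain strings are used here instead of the real openshift/api types to keep the example self-contained.

```go
package main

import "log"

// normalizePlatform maps the platform string reported by the Infrastructure
// global config onto the set this operator knows how to handle. Anything
// unrecognized gets a warning and is treated as "None".
func normalizePlatform(platform string) string {
	switch platform {
	case "AWS", "Libvirt", "OpenStack", "None":
		return platform
	default:
		log.Printf("warning: unrecognized platform %q, treating it as None", platform)
		return "None"
	}
}

func main() {
	// A brand-new platform falls through to the None behavior until the
	// operator explicitly adds support for it.
	log.Printf("effective platform: %s", normalizePlatform("NewCloud"))
}
```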


CI Job
------

The initial CI job for a new platform PR to `openshift/installer` must use the `cluster-installer-e2e` template but with an alternate profile, and the CI infrastructure should be configured with the credentials for your infrastructure in a `cluster-secrets-PLATFORM` secret. Talk to the testplatform team.
This CI job will then be reused whenever a repo wants to test, or when we add new release tests.

A new platform should pass many of the Kubernetes conformance tests, so the default job would run the e2e suite `kubernetes/conformance`. We may define a more scoped job if the platform cannot pass.

The teardown behavior of the cluster is the hardest part of this process - because we run so many tests a day, it must be 100% reliable from the beginning. You should implement a reliable teardown mechanism in your `destroy` method, leveraging the OpenStack and AWS examples.
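
One common way to get that reliability (and roughly what the AWS destroy code does in spirit) is to keep retrying deletion of every resource tagged for the cluster until nothing remains, logging and backing off on errors such as rate limits instead of aborting. A generic sketch, with the platform API hidden behind a small assumed interface:

```go
package destroy

import (
	"log"
	"time"
)

// ResourceDeleter is the thin slice of a cloud API the teardown loop needs:
// list everything tagged for the cluster, and attempt to delete one item.
type ResourceDeleter interface {
	ListClusterResources(clusterID string) ([]string, error)
	Delete(resourceID string) error
}

// DestroyAll loops until no cluster-tagged resources remain. Individual
// failures (dependency ordering, rate limits, transient API errors) are
// logged and retried on the next pass rather than failing the run; callers
// are expected to bound the overall run with their own timeout.
func DestroyAll(api ResourceDeleter, clusterID string, interval time.Duration) error {
	for {
		resources, err := api.ListClusterResources(clusterID)
		if err != nil {
			log.Printf("listing resources for %s failed, retrying: %v", clusterID, err)
			time.Sleep(interval)
			continue
		}
		if len(resources) == 0 {
			return nil // everything tagged for the cluster is gone
		}
		for _, id := range resources {
			if err := api.Delete(id); err != nil {
				log.Printf("deleting %s failed, will retry: %v", id, err)
			}
		}
		time.Sleep(interval)
	}
}
```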

We **will not** merge a new job if it does not have reliable cleanup in the face of failures, rate limits, etc, because it blocks other work.
