
[OCPCLOUD-1171] Implement CCCMO render to run CCM on bootstrap node #4947

Closed
wants to merge 2 commits

Conversation

@Danil-Grigorev Danil-Grigorev commented May 20, 2021

This allows master nodes to be initialized with a CCM replica from the
bootstrap node, speeding up the process of cluster installation.

Without CCM running on the bootstrap node, most of the cluster operator
and operand pods would have to tolerate a new taint in addition to the
node.kubernetes.io/not-ready (NoSchedule) taint:

  • effect: NoSchedule
    key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
  • effect: NoSchedule
    key: node.kubernetes.io/not-ready

Without this bootstrap addition or taint tolerations, a cluster with CCM
would never become ready.
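For illustration only (this manifest fragment is not part of the PR), a pod that had to schedule onto a control-plane node before CCM initializes it would need tolerations matching both taints, along these lines:

```yaml
# Hypothetical pod spec fragment: tolerations for the two bootstrap-time taints.
tolerations:
- key: node.cloudprovider.kubernetes.io/uninitialized
  operator: Equal
  value: "true"
  effect: NoSchedule
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoSchedule
```

Running CCM on the bootstrap node avoids having to add such tolerations across every operator and operand.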

To test this feature, build the installer and deploy on AWS.

To create a cluster, run:

```bash
./hack/build.sh
./bin/openshift-install create manifests
cat <<EOF > manifests/manifest_feature_gate.yaml
---
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  customNoUpgrade:
    enabled:
    - ExternalCloudProvider
    - CSIMigrationAWS
    - CSIMigrationOpenStack
  featureSet: CustomNoUpgrade
EOF

OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE=quay.io/dgrigore/release:cccmo-render ./bin/openshift-install create cluster
```

Gist with the CCCMO render output files for AWS: https://gist.github.com/Danil-Grigorev/5732fd375dba7affff54eda267c4cd47

```bash
# Source config map is located in https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/master/manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml
bootkube_podman_run \
--name copy-cloud-controller-manager-images \
--rm -it \
```
Member

You probably don't need an interactive terminal on the bootstrap machine, right? And since in the happy case we will shortly remove the whole machine, I think we want to keep the metadata for debug logging. Can we drop this line?

```bash
# Copy the CCCMO images configMap to resolve pod images from internal registry in CCCMO render run
# Source config map is located in https://github.com/openshift/cluster-cloud-controller-manager-operator/blob/master/manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml
bootkube_podman_run \
--name copy-cloud-controller-manager-images \
```
Member


nit: existing style seems to be an extra level of indentation for these wrapped lines.

Contributor


Error: error creating container storage: the container name "copy-cloud-controller-manager-images" is already in use by "5ca8294f8b2b2089541e267b4d46aeed08295f1ca16e14a45dd80807bf8e03ca". You have to remove that container to be able to reuse that name.: that name is already in use

Contributor


See my comment below; that is the root cause: https://github.com/openshift/installer/pull/4947/files#r662375716

```bash
--dest-dir=/assets/cloud-controller-manager-bootstrap \
--cluster-infrastructure-file=/assets/manifests/cluster-infrastructure-02-config.yml

cp cloud-controller-manager-bootstrap/bootstrap/* bootstrap-manifests//
```
Member


nit: // -> /

```bash
cp cloud-controller-manager-bootstrap/bootstrap/* bootstrap-manifests//

touch cloud-controller-manager-bootstrap.done
record_service_stage_success
```
@wking wking May 20, 2021

nit: inconsistent indent, although that appears to be the existing style for record_service_stage_success in this file.

wking commented May 20, 2021

> ...most of the cluster operator and operand pods...

Since a fair amount of thought and testing has gone into this, I think it's worth explaining the deadlock this avoids in more detail. Which operators and pods? Why would their being unable to come up on the tainted control-plane nodes block the cloud-controller-manager from coming up?

@Danil-Grigorev

@wking In my testing I saw CCCMO stuck in the ContainerCreating state, as it was not able to pull the image. It might be that we should start with image-registry; the rest is still under investigation. I will definitely document the findings.

wking commented May 20, 2021

> It might be that we should start with image-registry...

Image-registry is for local images, and the release image and its references all come from external registries during bootstrapping. The local registry is one of the last components to come up, because it runs on compute nodes, which are in turn provisioned by machine-API components hosted on control-plane nodes.

@JoelSpeed

@Danil-Grigorev Just want to make sure you are aware that this PR, in its current state, shouldn't merge until the CCCMO image has been added to the full payload, and that you are accounting for that in your planning.

If you want to merge this before CCCMO is added to the full payload, you will need to ensure that the bootstrap script accounts for the case where there is no CCCMO image.
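One way to handle a payload without the CCCMO image (a sketch with illustrative names, not code from this PR) is to gate the bootstrap render step on the presence of the CCCMO images manifest:

```shell
#!/bin/sh
# Hypothetical guard: only run the CCM render when the release payload
# actually ships the CCCMO images manifest.
render_ccm_if_present() {
  manifest="$1/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml"
  if [ -f "$manifest" ]; then
    echo "rendering CCM bootstrap assets"
  else
    echo "CCCMO manifest missing, skipping CCM bootstrap render"
  fi
}

# Simulate a payload directory that lacks the CCCMO manifest.
payload=$(mktemp -d)
result=$(render_ccm_if_present "$payload")
echo "$result"
rm -rf "$payload"
```

The real script would substitute the extracted `/release-manifests` directory for the temporary path.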

@Danil-Grigorev

@JoelSpeed We will definitely look into this again once the render implementation is merged in CCCMO and the release addition request is satisfied. The whole feature won't work until the OpenStack, AWS and Azure cloud-provider repos are included in the release too.

@elmiko elmiko left a comment

ran into a bug while testing

```bash
# Copy rendered resources to manifests folder
cp cloud-controller-manager-bootstrap/bootstrap/* bootstrap-manifests/
# Copy cloud config to /etc/kubernetes/bootstrap-configs
cp cloud-controller-manager-bootstrap/config/* /etc/kubernetes/bootstrap-configs
```
Contributor


This will fail if no cloud config is generated. When I run on AWS and there is no cloud config, I see this in the bootkube.service output:

```
Jul 01 13:29:18 ip-10-0-12-89 bootkube.sh[2207]: cp: cannot stat 'cloud-controller-manager-bootstrap/config/*': No such file or directory
```

@elmiko elmiko Jul 1, 2021

this patch is allowing me to proceed

```diff
diff --git a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
index a36dbee28..3c3c61d6c 100755
--- a/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
+++ b/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template
@@ -245,8 +245,10 @@ then
        -c "cat /release-manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml" > manifests/ccm-images.yaml
 
        ADDITIONAL_FLAGS=""
+       CLOUD_CONFIG_EXISTS=""
        if [ -f "$PWD/manifests/cloud-provider-config.yaml" ]; then
            ADDITIONAL_FLAGS="--cloud-config-file=/assets/manifests/cloud-provider-config.yaml"
+           CLOUD_CONFIG_EXISTS="yes"
        fi
 
        bootkube_podman_run \
@@ -262,7 +264,9 @@ then
        # Copy rendered resources to manifests folder
        cp cloud-controller-manager-bootstrap/bootstrap/* bootstrap-manifests/
        # Copy cloud config to /etc/kubernetes/bootstrap-configs
-       cp cloud-controller-manager-bootstrap/config/* /etc/kubernetes/bootstrap-configs
+       if [ ! -z "$CLOUD_CONFIG_EXISTS" ]; then
+           cp cloud-controller-manager-bootstrap/config/* /etc/kubernetes/bootstrap-configs
+       fi
 
 
        touch cloud-controller-manager-bootstrap.done
```
Author

Ahh... Bash weirdness. This throws the error you saw if the directory is empty, because the unmatched wildcard is passed to cp as a literal string:

```bash
cp cloud-controller-manager-bootstrap/config/* /etc/kubernetes/bootstrap-configs
```

This works instead, even if the directory is empty:

```bash
cp -r cloud-controller-manager-bootstrap/config/. /etc/kubernetes/bootstrap-configs
```
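The glob-versus-dot behavior is easy to reproduce outside the installer; this standalone sketch shows `cp dir/*` failing on an empty directory while `cp -r dir/.` succeeds:

```shell
#!/bin/sh
# Demonstrate the failure mode: with an empty source directory the shell
# leaves the unmatched glob as a literal 'src/*', which cp cannot stat.
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/dst"

if cp "$tmp/src"/* "$tmp/dst" 2>/dev/null; then
  glob_copy=ok
else
  glob_copy=failed
fi

# Copying the directory contents via '.' succeeds even when src is empty.
if cp -r "$tmp/src/." "$tmp/dst" 2>/dev/null; then
  dot_copy=ok
else
  dot_copy=failed
fi

echo "glob copy: $glob_copy"   # failed
echo "dot copy: $dot_copy"     # ok
rm -rf "$tmp"
```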


openshift-ci bot commented Jul 1, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign patrickdillon after the PR has been reviewed.
You can assign the PR to them by writing /assign @patrickdillon in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

```bash
bootkube_podman_run \
--name copy-cloud-controller-manager-images \
--entrypoint /bin/bash "${RELEASE_IMAGE_DIGEST}" \
-c "cat /release-manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml" > manifests/ccm-images.yaml
```
Contributor


```
/release-manifests/0000_26_cloud-controller-manager-operator_01_images.configmap.yaml: No such file or directory
```

Author


Was it a CI build? It seems CCCMO is not present in the payload you were using.



openshift-ci bot commented Jul 2, 2021

@Danil-Grigorev: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-libvirt | a45cffc | link | /test e2e-libvirt |
| ci/prow/e2e-aws-workers-rhel7 | a45cffc | link | /test e2e-aws-workers-rhel7 |
| ci/prow/e2e-aws-upgrade | a45cffc | link | /test e2e-aws-upgrade |
| ci/prow/e2e-crc | a45cffc | link | /test e2e-crc |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@JoelSpeed

/close

We have decided to take an alternate approach to bootstrapping, which means this is no longer required.

See openshift/enhancements#463 for more details on the updated approach


openshift-ci bot commented Jul 15, 2021

@JoelSpeed: Closed this PR.

In response to this:

/close

We have decided to take an alternate approach to bootstrapping which means this is no longer required.

See openshift/enhancements#463 for more details on the updated approach

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci bot closed this Jul 15, 2021