Inject OSImageURL from CVO into templated MachineConfigs #363

Closed

cgwalters
Member

@cgwalters cgwalters commented Feb 1, 2019

This injects the OSImageURL into the "base"
config (e.g. 00-worker, 00-master). This differs from
previous pull requests which made it a separate MC, but that
adds visual noise and will exacerbate renderer race conditions.

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 1, 2019
-	return MachineConfigFromIgnConfig(role, name, ignCfg), nil
+	mcfg := MachineConfigFromIgnConfig(role, name, ignCfg)
+	if osUpdatesEnabledForRole(config, role) {
+		mcfg.Spec.OSImageURL = config.OSImageURL
Member

Let's maybe also log something in the else branch here so it's clear why it's not picking it up for some pools?

Member Author

Problem is we're going to see that message every time something in the cluster changes...I've learned my lesson there about adding debug prints.

Member Author

See #348
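
One way to keep a message like the one requested above without repeating it on every render is to log only when the decision for a role changes. The following is a minimal sketch under that assumption; `decisionLogger` and its methods are hypothetical names and are not part of the machine-config-operator.

```go
package main

import (
	"fmt"
	"sync"
)

// decisionLogger is a hypothetical helper (not part of the MCO). It emits a
// message only when the OS-update decision for a role differs from the last
// one it saw, so repeated renders do not repeat the same log line.
type decisionLogger struct {
	mu   sync.Mutex
	last map[string]bool
	logf func(format string, args ...interface{})
}

func newDecisionLogger(logf func(format string, args ...interface{})) *decisionLogger {
	return &decisionLogger{last: make(map[string]bool), logf: logf}
}

// record logs on the first decision for a role and on every change after
// that. It reports whether a message was emitted.
func (d *decisionLogger) record(role string, enabled bool) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if prev, seen := d.last[role]; seen && prev == enabled {
		return false
	}
	d.last[role] = enabled
	d.logf("OS updates enabled for role %q: %v", role, enabled)
	return true
}

func main() {
	logger := newDecisionLogger(func(format string, args ...interface{}) {
		fmt.Printf(format+"\n", args...)
	})
	logger.record("worker", false) // logged: first decision for this role
	logger.record("worker", false) // silent: nothing changed
	logger.record("worker", true)  // logged: the decision flipped
}
```

The trade-off is a small amount of state in the renderer; whether that is preferable to no message at all is a judgment call.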

Member

@ashcrow ashcrow left a comment

Looks sane

@cgwalters
Member Author

Moving discussion of bootstrap over here.

Now that I think about it more...before or after this lands, in fact nothing will be reacting to the osimageurl. So we should be able to land the installer PR now.

@cgwalters
Member Author

OK I started on using labels but am currently hitting a weird error with image corruption trying to deploy my updated controller image that is almost certainly unrelated. And I'm still learning the label API/semantics.

From a6faeec6dda4ea2c89eed9a3da945c355ef27769 Mon Sep 17 00:00:00 2001
From: Colin Walters <walters@verbum.org>
Date: Fri, 1 Feb 2019 16:17:07 +0000
Subject: [PATCH] Add code to inject OSImageURL, but disable it by default

First, this supports injecting the `OSImageURL` into the "base"
config (e.g. `00-worker`, `00-master`).  This differs from
previous pull requests which made it a separate MC, but that
adds visual noise and will exacerbate renderer race conditions.

However, in order to gain experience with this code, add a
`ControllerConfig` option which can disable injecting it for certain
roles.  This is set to `*` by default, so effectively we won't
do OS updates.

My idea here is that anyone who wants to test things out can
`oc edit controllerconfig` and empty that out.  Another useful
thing would be to change it to e.g. `{"master"}` and test OS updates
on workers without affecting the master.
---
 .../v1/types.go                               |  3 +++
 pkg/controller/template/render.go             | 25 ++++++++++++++++++-
 pkg/operator/operator.go                      |  6 +++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/pkg/apis/machineconfiguration.openshift.io/v1/types.go b/pkg/apis/machineconfiguration.openshift.io/v1/types.go
index cc38d0d..4f053f6 100644
--- a/pkg/apis/machineconfiguration.openshift.io/v1/types.go
+++ b/pkg/apis/machineconfiguration.openshift.io/v1/types.go
@@ -141,6 +141,9 @@ type ControllerConfigSpec struct {
 	// Images is map of images that are used by the controller.
 	Images map[string]string `json:"images"`
 
+	// Configure which pools receive OS updates from the CVO
+	OSUpdatesEnabledForPools *metav1.LabelSelector `json:"osUpdatesEnabledForPools,omitempty"`
+
 	// Sourced from configmap/machine-config-osimageurl
 	OSImageURL string `json:"osImageURL"`
 }
diff --git a/pkg/controller/template/render.go b/pkg/controller/template/render.go
index c63ddb0..46a7e31 100644
--- a/pkg/controller/template/render.go
+++ b/pkg/controller/template/render.go
@@ -15,6 +15,7 @@ import (
 	ctconfig "github.com/coreos/container-linux-config-transpiler/config"
 	cttypes "github.com/coreos/container-linux-config-transpiler/config/types"
 	ignv2_2types "github.com/coreos/ignition/config/v2_2/types"
+	"k8s.io/apimachinery/pkg/labels"
 	"github.com/ghodss/yaml"
 	"github.com/golang/glog"
 	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
@@ -131,6 +132,20 @@ func platformFromControllerConfigSpec(ic *mcfgv1.ControllerConfigSpec) (string,
 	}
 }
 
+// osUpdatesEnabledForRole parses the OSUpdatesEnabledForPools flag, which is a
+// way to control injection of the OSImageURL into rendered machine configs.
+// Primarily intended for development/testing.
+func osUpdatesEnabledForRole(config *RenderConfig, role string) (bool, error) {
+	selector, err := metav1.LabelSelectorAsSelector(config.OSUpdatesEnabledForPools)
+	if err != nil {
+		return false, fmt.Errorf("invalid label selector: %v", err)
+	}
+
+	roleLabels := make(map[string]string)
+	roleLabels[role] = ""
+	return selector.Empty() || selector.Matches(labels.Set(roleLabels)), nil
+}
+
 func generateMachineConfigForName(config *RenderConfig, role, name, path string) (*mcfgv1.MachineConfig, error) {
 	platform, err := platformFromControllerConfigSpec(config.ControllerConfigSpec)
 	if err != nil {
@@ -233,7 +248,15 @@ func generateMachineConfigForName(config *RenderConfig, role, name, path string)
 		return nil, fmt.Errorf("error transpiling ct config to Ignition config: %v", err)
 	}
 
-	return MachineConfigFromIgnConfig(role, name, ignCfg), nil
+	mcfg := MachineConfigFromIgnConfig(role, name, ignCfg)
+	osUpdatesEnabled, err := osUpdatesEnabledForRole(config, role)
+	if err != nil {
+		return nil, err
+	} else if osUpdatesEnabled {
+		mcfg.Spec.OSImageURL = config.OSImageURL
+	}
+
+	return mcfg, nil
 }
 
 const (
diff --git a/pkg/operator/operator.go b/pkg/operator/operator.go
index 446a6c3..2955d01 100644
--- a/pkg/operator/operator.go
+++ b/pkg/operator/operator.go
@@ -343,6 +343,11 @@ func icFromClusterConfig(cm *v1.ConfigMap) (installertypes.InstallConfig, error)
 }
 
 func getRenderConfig(mc *mcfgv1.MCOConfig, etcdCAData, rootCAData []byte, ps *v1.ObjectReference, imgs Images) renderConfig {
+	// For now we disable OS updates until we've done more testing
+	osUpdateSelector, err := metav1.ParseToLabelSelector("")
+	if err != nil {
+		panic(err)
+	}
 	controllerconfig := mcfgv1.ControllerConfigSpec{
 		ClusterDNSIP:        mc.Spec.ClusterDNSIP,
 		CloudProviderConfig: mc.Spec.CloudProviderConfig,
@@ -354,6 +359,7 @@ func getRenderConfig(mc *mcfgv1.MCOConfig, etcdCAData, rootCAData []byte, ps *v1
 		PullSecret:          ps,
 		SSHKey:              mc.Spec.SSHKey,
 		OSImageURL:          imgs.MachineOSContent,
+		OSUpdatesEnabledForPools: osUpdateSelector,
 		Images: map[string]string{
 			templatectrl.EtcdImageKey:    imgs.Etcd,
 			templatectrl.SetupEtcdEnvKey: imgs.SetupEtcdEnv,
-- 
2.20.1
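
The selector semantics the patch relies on can be illustrated without the Kubernetes libraries. The sketch below uses a hypothetical `roleSelector` type as a simplified stand-in for a label selector (it only supports "this key must exist" requirements). It mirrors one property of `metav1.LabelSelectorAsSelector`: a nil selector matches nothing, while an empty selector matches everything. As in the patch, the role name is used as a label key with an empty value.

```go
package main

import "fmt"

// roleSelector is a simplified, hypothetical stand-in for a Kubernetes label
// selector. It only supports "this label key must exist" requirements.
// A nil *roleSelector matches nothing; an empty one matches everything.
type roleSelector struct {
	requiredKeys []string
}

// matches reports whether the given label set satisfies the selector.
func (s *roleSelector) matches(labels map[string]string) bool {
	if s == nil {
		return false
	}
	for _, key := range s.requiredKeys {
		if _, ok := labels[key]; !ok {
			return false
		}
	}
	return true
}

// osUpdatesEnabledForRole follows the shape of the function in the patch:
// the role name becomes a label key with an empty value.
func osUpdatesEnabledForRole(sel *roleSelector, role string) bool {
	return sel.matches(map[string]string{role: ""})
}

func main() {
	everything := &roleSelector{}
	workersOnly := &roleSelector{requiredKeys: []string{"worker"}}

	for _, role := range []string{"master", "worker"} {
		fmt.Printf("%s: empty=%v workersOnly=%v nil=%v\n",
			role,
			osUpdatesEnabledForRole(everything, role),
			osUpdatesEnabledForRole(workersOnly, role),
			osUpdatesEnabledForRole(nil, role))
	}
}
```

Under these semantics, restricting updates to workers means supplying a selector that names the `worker` key, and an empty selector enables injection for every role.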

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2019
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 7, 2019
	osUpdatesEnabled, err := osUpdatesEnabledForRole(config, role)
	if err != nil {
		return nil, err
	} else if osUpdatesEnabled {
Member

No need for this else, you're returning on the branch above anyway.

Member Author

I'm confused, the other one is usual error handling? Are you saying it'd e.g. be clearer like this?

diff --git a/pkg/controller/template/render.go b/pkg/controller/template/render.go
index 40aa9a8..97cf0e4 100644
--- a/pkg/controller/template/render.go
+++ b/pkg/controller/template/render.go
@@ -258,7 +258,8 @@ func generateMachineConfigForName(config *RenderConfig, role, name, path string)
 	osUpdatesEnabled, err := osUpdatesEnabledForRole(config, role)
 	if err != nil {
 		return nil, err
-	} else if osUpdatesEnabled {
+	}
+	if osUpdatesEnabled {
 		mcfg.Spec.OSImageURL = config.OSImageURL
 	}
 

Member

Yup, that's how it's usually done when you have an if branch which returns

@runcom
Member

runcom commented Feb 7, 2019

My idea here is that anyone who wants to test things out can
oc edit controllerconfig and empty that out. Another useful
thing would be to change it to e.g. {"master"} and test OS updates
on workers without affecting the master.

to clear out confusion, I believe this comment no longer stands, it's the other way around right?

@cgwalters
Member Author

to clear out confusion, I believe this comment no longer stands, it's the other way around right?

Yeah, I updated the PR description to match the current commit message.

@cgwalters
Member Author

/hold

For me to add some tests here that verify that the MCD has a target osimageurl.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 7, 2019
@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 8, 2019
@cgwalters
Member Author

/test e2e-aws-op

@cgwalters cgwalters changed the title from "Add code to inject OSImageURL, but disable it by default" to "Add code to inject OSImageURL, but just for workers by default" Feb 8, 2019
@cgwalters
Member Author

I0208 18:18:26.188194    5356 daemon.go:516] Bootstrap pivot required
I0208 18:18:26.188204    5356 update.go:655] Updating OS to registry.svc.ci.openshift.org/ci-op-l08p2kyf/stable@sha256:21eba43a81fd1a6f9e114b2b957e398aeeb75cfd9a8f74c3b62fb714ba42e23c
I0208 18:18:26.188212    5356 run.go:13] Running: /bin/pivot registry.svc.ci.openshift.org/ci-op-l08p2kyf/stable@sha256:21eba43a81fd1a6f9e114b2b957e398aeeb75cfd9a8f74c3b62fb714ba42e23c
pivot version 0.0.2 (f55cf7b5d1b832ad3fecfff1aca09aa0f6969fc7)

...

I0208 18:20:58.639732    4944 start.go:52] Version: 3.11.0-586-g95a3071d-dirty
I0208 18:20:58.640449    4944 start.go:88] starting node writer
I0208 18:20:58.648801    4944 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0208 18:20:58.690876    4944 daemon.go:155] Booted osImageURL: registry.svc.ci.openshift.org/ci-op-l08p2kyf/stable@sha256:21eba43a81fd1a6f9e114b2b957e398aeeb75cfd9a8f74c3b62fb714ba42e23c (47.291)
I0208 18:20:58.692434    4944 daemon.go:227] Managing node: ip-10-0-174-241.ec2.internal
I0208 18:21:12.781634    4944 start.go:146] Calling chroot("/rootfs")
I0208 18:21:12.781706    4944 daemon.go:404] In bootstrap mode
I0208 18:21:13.551441    4944 daemon.go:432] Current+desired config: worker-095a512e0970e036a0f262d657a327da
I0208 18:21:13.551544    4944 daemon.go:520] No bootstrap pivot required; unlinking bootstrap node annotations
I0208 18:21:13.554818    4944 daemon.go:547] Validated on-disk state
I0208 18:21:13.554889    4944 daemon.go:579] In desired config worker-095a512e0970e036a0f262d657a327da
I0208 18:21:13.554993    4944 start.go:165] Starting MachineConfigDaemon
I0208 18:21:13.555048    4944 daemon.go:248] Enabling Kubelet Healthz Monitor
                    "worker": "3 out of 3 nodes have updated to latest configuration worker-095a512e0970e036a0f262d657a327da"

🎊

@cgwalters
Member Author

Ah but here's the next problem, the machine-os-content needs to be updated:

We went from 47.308 to 47.291:

Upgraded:
  nss-altfiles 0-2.atomic.git20131217gite2a80593.el7 -> 2.18.1-11.el7
  ostree 2019.1-1.el7_6 -> 2019.1.5-6649032a375238255052a43adb8bc56faac989ca.8cbd7fc123ad6d6e4e8216211aee6f7dd6264886.el7
  ostree-grub2 2019.1-1.el7_6 -> 2019.1.5-6649032a375238255052a43adb8bc56faac989ca.8cbd7fc123ad6d6e4e8216211aee6f7dd6264886.el7
  pivot 0.0.2-0.1.el7 -> 0.0.2.11-f1ed664ed83e73268464e81019f213366d961bb2
  rpm-ostree 2019.1-3.atomic.el7 -> 2019.1.4-fa5be441b177a40b285ed1abc539c6f7770ab231.091833c72cefe9fcbb3af2b42dd07d8a8c9f63d2.el7
  rpm-ostree-libs 2019.1-3.atomic.el7 -> 2019.1.4-fa5be441b177a40b285ed1abc539c6f7770ab231.091833c72cefe9fcbb3af2b42dd07d8a8c9f63d2.el7
Downgraded:
  atomic-openshift-clients 4.0.0-0.164.0.git.0.88cca3f.el7 -> 4.0.0-0.150.0.git.0.f39ab66.el7
  atomic-openshift-hyperkube 4.0.0-0.164.0.git.0.88cca3f.el7 -> 4.0.0-0.150.0.git.0.f39ab66.el7
  atomic-openshift-node 4.0.0-0.164.0.git.0.88cca3f.el7 -> 4.0.0-0.150.0.git.0.f39ab66.el7
  cri-o 1.12.5-5.rhaos4.0.git9076a33.el7 -> 1.12.5-2.rhaos4.0.gitd4191df.el7
  glusterfs 3.12.2-40.el7 -> 3.12.2-32.el7
  glusterfs-client-xlators 3.12.2-40.el7 -> 3.12.2-32.el7
  glusterfs-fuse 3.12.2-40.el7 -> 3.12.2-32.el7
  glusterfs-libs 3.12.2-40.el7 -> 3.12.2-32.el7
  redhat-release-coreos 4.0-20180515.0.atomic.el7.0 -> 0-4749e1e9959e9dcb53804ed103dfde64c813ecd7.el7
  runc 1.0.0-58.dev.rhaos4.0.git2abd837.el7 -> 1.0.0-57.dev.git2abd837.el7
Removed:
  ostree-fuse-2019.1-1.el7_6.x86_64
Added:
  bubblewrap-0.3.1.6-94147e233fe200d1fe43a9a18c52475188b22798.el7.centos.x86_64
  ostree-libs-2019.1.5-6649032a375238255052a43adb8bc56faac989ca.8cbd7fc123ad6d6e4e8216211aee6f7dd6264886.el7.x86_64
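
The "47.308 to 47.291" observation above can be checked mechanically. This is a minimal sketch, assuming build versions are dotted decimal strings whose components compare numerically (an assumption about the versioning scheme, not something the thread states). `compareVersions` and `component` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions compares dotted numeric versions such as "47.308".
// It returns -1, 0, or 1 when a is older than, equal to, or newer than b.
// Missing components are treated as 0, so "47" equals "47.0".
func compareVersions(a, b string) (int, error) {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) || i < len(bs); i++ {
		av, err := component(as, i)
		if err != nil {
			return 0, err
		}
		bv, err := component(bs, i)
		if err != nil {
			return 0, err
		}
		if av < bv {
			return -1, nil
		}
		if av > bv {
			return 1, nil
		}
	}
	return 0, nil
}

// component parses the i-th dotted component, defaulting to 0 when absent.
func component(parts []string, i int) (int, error) {
	if i >= len(parts) {
		return 0, nil
	}
	n, err := strconv.Atoi(parts[i])
	if err != nil {
		return 0, fmt.Errorf("invalid version component %q: %v", parts[i], err)
	}
	return n, nil
}

func main() {
	booted, target := "47.308", "47.291"
	cmp, err := compareVersions(target, booted)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if cmp < 0 {
		fmt.Printf("pivoting from %s to %s is a downgrade\n", booted, target)
	}
}
```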

@cgwalters
Member Author

OMG the CI doesn't really want to merge this patch

Other PRs are going in though. I am still worried that something we're doing in the updates is affecting the cluster or later tests.

I'm not seeing a consistent pattern yet though.

@ashcrow
Member

ashcrow commented Feb 15, 2019

HAProxy and Prometheus flakes

/retest

@cgwalters
Member Author

Hmm. Other PRs are merging...still worried there's something "residual" we're doing to the cluster. But I just logged into this current e2e-aws cluster, and it looks fine...oc get clusteroperator is all clean, same for oc get pods --all-namespaces.

@cgwalters
Member Author

Well that last run was a huge set of failures. Reading the logs, the pools and the operator are reporting ready/done before everything is updated, because they key off currentConfig which we set quickly. I think we should change the MCD to only set currentConfig when it's done pivoting. Otherwise the pools are lying.

But even then, the systems seem to be updated. E.g. one of the master MCDs says:
I0215 22:24:35.343092 6783 daemon.go:647] In desired config master-14bca73029fbcab80e7b5d11d7f131b9

And the tests start much later:

2019/02/15 22:35:25 Container setup in pod e2e-aws completed successfully
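
The concern raised in this comment is that a pool counts a node as updated as soon as currentConfig matches, even if the OS pivot has not finished. A stricter readiness check might look like the sketch below. It is an illustration only: `nodeState`, `poolTarget`, and their fields are hypothetical and do not correspond to the MCD's actual annotations or types.

```go
package main

import "fmt"

// nodeState is a hypothetical summary of what one node reports.
type nodeState struct {
	name          string
	currentConfig string // rendered config the node claims to be on
	bootedOSImage string // OS image the node actually booted into
}

// poolTarget is what the pool wants every node to reach. An empty
// osImageURL means the rendered config does not pin an OS image.
type poolTarget struct {
	config     string
	osImageURL string
}

// nodeUpdated counts a node as done only when its config name matches and,
// if an OS image is requested, the node has actually booted into it.
func nodeUpdated(n nodeState, target poolTarget) bool {
	if n.currentConfig != target.config {
		return false
	}
	return target.osImageURL == "" || n.bootedOSImage == target.osImageURL
}

// updatedCount returns how many nodes satisfy nodeUpdated.
func updatedCount(nodes []nodeState, target poolTarget) int {
	count := 0
	for _, n := range nodes {
		if nodeUpdated(n, target) {
			count++
		}
	}
	return count
}

func main() {
	target := poolTarget{config: "worker-new", osImageURL: "example.invalid/os@sha256:new"}
	nodes := []nodeState{
		{"node-a", "worker-new", "example.invalid/os@sha256:new"},
		{"node-b", "worker-new", "example.invalid/os@sha256:old"}, // config set, pivot pending
		{"node-c", "worker-old", "example.invalid/os@sha256:old"},
	}
	fmt.Printf("%d out of %d nodes updated\n", updatedCount(nodes, target), len(nodes))
}
```

The alternative proposed in the comment, having the MCD set currentConfig only after the pivot completes, would make the simpler name-only comparison truthful again.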

@cgwalters
Member Author

/retest

@kikisdeliveryservice
Contributor

@cgwalters there are some bigger CI fixes that are in progress (in other repos for e2e-aws) that should be going in in the next few days. We can always hold this until you are back on Tuesday and see if the CI resolves itself.

@ashcrow
Member

ashcrow commented Feb 16, 2019

/retest

@cgwalters
Member Author

Looks like that last run hit this.

@cgwalters
Member Author

I0216 05:45:03.659689    6232 start.go:52] Version: 3.11.0-670-gb2bebde8-dirty
I0216 05:45:03.660529    6232 start.go:88] starting node writer
I0216 05:45:03.664584    6232 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0216 05:45:03.749243    6232 daemon.go:168] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:660061d6eae3ee6d93ca836cd52e6033f1d611c629c1ce47cf272c9e9bda2488 (47.318)                        
I0216 05:45:03.749539    6232 daemon.go:240] Managing node: ip-10-0-156-180.ec2.internal
F0216 05:45:03.770509    6232 start.go:142] binding pod mounts: exec: "mount": executable file not found in $PATH

wha 🤔

@cgwalters
Member Author

/retest

@runcom
Member

runcom commented Feb 16, 2019

network failures in the last run (Haproxy being always there) + one about security context which is the first time I see it

@cgwalters
Member Author

/test images
/test e2e-aws

@runcom
Member

runcom commented Feb 16, 2019

I wonder how #442 impacts this PR as well, tests seem to be quite stable there now (especially e2e-aws despite the usual flakes).

@smarterclayton
Contributor

On the mount thing you somehow caught the new 4.0 base image (based on UBI and has no util-linux).

#445

will fix your issue and also allow that to get pulled.
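
A failure like `exec: "mount": executable file not found in $PATH` can be surfaced earlier, with a clearer message, by checking for required executables at startup. The sketch below uses the standard library's `exec.LookPath`; `requireBinaries` is a hypothetical helper and is not what the referenced fix does.

```go
package main

import (
	"fmt"
	"os/exec"
)

// requireBinaries is a hypothetical preflight helper. It returns an error
// naming every listed executable that cannot be found in $PATH.
func requireBinaries(names ...string) error {
	var missing []string
	for _, name := range names {
		if _, err := exec.LookPath(name); err != nil {
			missing = append(missing, name)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("required executables not found in $PATH: %v", missing)
	}
	return nil
}

func main() {
	if err := requireBinaries("mount"); err != nil {
		fmt.Println("preflight failed:", err)
		return
	}
	fmt.Println("preflight ok")
}
```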

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@smarterclayton
Contributor

@runcom
Member

runcom commented Feb 17, 2019

I think this is working because
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/363/pull-ci-openshift-machine-config-operator-master-e2e-aws/1957 failed with the cri-o bug that they supposedly fixed?

yeah, that should be fixed assuming we're testing with the right crio package /cc @mrunalp

@smarterclayton
Contributor

Yeah, machine-os-content was 313 which didn't have the fix. I just bumped to 318

/retest

@cgwalters
Member Author

cgwalters commented Feb 17, 2019

OK my view of status here. First...I think I fixed #426 and I'd like to go with that (which is an additional commit on top) because this version isn't correctly having the controller manage updates, basically a node comes up and joins the cluster, MCD lands then we immediately reboot (not gated to 1 at a time by the controller) - I see that happening rapidly for the workers e.g.

I think this is working

It's definitely downgrading the cluster. If some of these e2e failures come down to e.g. tests assuming slightly newer kubelet or cri-o then indeed we should bump machine-os-content again.
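
The "1 at a time" gating mentioned above can be sketched as a controller-side selection step: the controller, not the node, decides which nodes may start updating. This is an illustration under assumptions; `node`, `pickNextToUpdate`, and the `maxUnavailable` parameter are hypothetical names and not the MCO's API.

```go
package main

import "fmt"

// node is a hypothetical view of a node as seen by the controller.
type node struct {
	name          string
	currentConfig string // config the node has finished applying
	desiredConfig string // config the controller has told the node to apply
}

// pickNextToUpdate returns the names of nodes that may be given the target
// config now, keeping at most maxUnavailable nodes mid-update at once.
func pickNextToUpdate(nodes []node, targetConfig string, maxUnavailable int) []string {
	// Nodes whose desired config differs from their current one are
	// already updating and count against the budget.
	inProgress := 0
	for _, n := range nodes {
		if n.desiredConfig != n.currentConfig {
			inProgress++
		}
	}
	var picked []string
	for _, n := range nodes {
		if inProgress+len(picked) >= maxUnavailable {
			break
		}
		// Only idle nodes that are not yet on the target are candidates.
		if n.currentConfig != targetConfig && n.desiredConfig == n.currentConfig {
			picked = append(picked, n.name)
		}
	}
	return picked
}

func main() {
	nodes := []node{
		{"worker-0", "old", "old"},
		{"worker-1", "old", "old"},
		{"worker-2", "old", "old"},
	}
	fmt.Println(pickNextToUpdate(nodes, "new", 1))
}
```

With this shape, a node that has just joined the cluster stays on its current config until the controller picks it, rather than rebooting as soon as the daemon lands.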

@cgwalters
Member Author

/lgtm cancel

Since I'd like to keep trying for #426
(but it's OK if this goes in as the other is a superset)

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Feb 17, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlebon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cgwalters
Member Author

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 17, 2019
@openshift-ci-robot
Contributor

@cgwalters: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws | 60d70a6 | link | /test e2e-aws |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cgwalters
Member Author

Closing this one in favor of #426 which includes it.

@cgwalters cgwalters closed this Feb 17, 2019
osherdp pushed a commit to osherdp/machine-config-operator that referenced this pull request Apr 13, 2021
Add IBM cloud managed profile manifest patch
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
9 participants