
Bug 1928581: validate the proxy by trying oc image info #2539

Closed
wants to merge 3 commits

Conversation

QiWang19
Member

@QiWang19 QiWang19 commented Apr 20, 2021

Try to inspect an image with skopeo using the HTTP proxy config for proxy validation. If the skopeo command fails, do not render the proxy.

- What I did

Close Bug 1928581 https://bugzilla.redhat.com/show_bug.cgi?id=1928581

- How to verify it

- Description for the changelog

Add HTTP proxy validation.
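As a rough illustration of the approach (a sketch only, assuming a skopeo binary on the PATH; the proxy endpoint and image pull spec below are placeholders, not values taken from this PR):

package main

import (
	"fmt"
	"os"
	"os/exec"
)

// validateProxy tries to inspect an image through the given proxy settings.
// If skopeo cannot reach the registry via the proxy, an error is returned and
// the caller can choose not to render the proxy configuration.
func validateProxy(httpProxy, httpsProxy, noProxy, image string) error {
	cmd := exec.Command("skopeo", "inspect", "docker://"+image)
	cmd.Env = append(os.Environ(),
		"HTTP_PROXY="+httpProxy,
		"HTTPS_PROXY="+httpsProxy,
		"NO_PROXY="+noProxy,
	)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("invalid http proxy: %w: %s", err, out)
	}
	return nil
}

func main() {
	// Placeholder proxy endpoint and image pull spec, for illustration only.
	err := validateProxy(
		"http://proxy.example.com:3128",
		"http://proxy.example.com:3128",
		".cluster.local",
		"quay.io/openshift-release-dev/ocp-release:4.8.0-x86_64",
	)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}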

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 20, 2021
@openshift-ci-robot
Contributor

@QiWang19: This pull request references Bugzilla bug 1928581, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (schoudha@redhat.com), skipping review request.

In response to this:

Bug 1928581: validate the proxy by trying image pull

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: QiWang19
To complete the pull request process, please assign kikisdeliveryservice after the PR has been reviewed.
You can assign the PR to them by writing /assign @kikisdeliveryservice in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sgreene570
Contributor

sgreene570 commented Apr 21, 2021

What about proxy config changes made as a day 2 operation? Does the podman "test pull" code in this PR verify that image pulls work when modifying the cluster proxy config after installation?

@QiWang19
Member Author

/retest

@QiWang19 QiWang19 force-pushed the valid-httpproxy branch 2 times, most recently from 643480d to 9a31c9a, July 7, 2021 17:12
@QiWang19
Member Author

QiWang19 commented Jul 7, 2021

/retest

@sinnykumari
Contributor

Considering that the proxy config is a global config in the cluster, shouldn't this be fixed at the source, i.e. checking the proxy config when it gets applied/updated to the cluster? Consumers like the MCO come into the picture later on.

@QiWang19
Member Author

QiWang19 commented Jul 8, 2021

/retest

@QiWang19
Member Author

QiWang19 commented Jul 8, 2021

Considering that the proxy config is a global config in the cluster, shouldn't this be fixed at the source, i.e. checking the proxy config when it gets applied/updated to the cluster? Consumers like the MCO come into the picture later on.

@sinnykumari From the Bugzilla discussion, when applying to the cluster the proxy will be validated by the CNO (comment 17). It is not an MCO change.
Or could you give me a pointer to where a proper place to validate it would be? Something like:

func ApplyMachineConfig(client mcfgclientv1.MachineConfigsGetter, required *mcfgv1.MachineConfig) (*mcfgv1.MachineConfig, bool, error) {

@sinnykumari
Contributor

@sinnykumari From the Bugzilla discussion, when applying to the cluster the proxy will be validated by the CNO (comment 17). It is not an MCO change.

ah ok, I may have been confused then because this PR is making changes to MCO's MCC bootstrap mode. If CNO is going to validate the proxy (which makes sense to me), shouldn't this validation code be in the CNO repo?

@QiWang19
Member Author

QiWang19 commented Jul 8, 2021

If the bootstrap mode has an invalid proxy, the CNO pod will fail to launch since the CNO images cannot be pulled down.

@QiWang19
Member Author

QiWang19 commented Jul 9, 2021

@sinnykumari From the Bugzilla discussion, when applying to the cluster the proxy will be validated by the CNO (comment 17). It is not an MCO change.

ah ok, I may have been confused then because this PR is making changes to MCO's MCC bootstrap mode. If CNO is going to validate the proxy (which makes sense to me), shouldn't this validation code be in the CNO repo?

@sinnykumari PTAL. Added validation for the code paths that run after installation.

@QiWang19
Member Author

/retest

@QiWang19 QiWang19 changed the title Bug 1928581: validate the proxy by trying image pull WIP: Bug 1928581: validate the proxy by trying image pull Aug 7, 2021
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 7, 2021
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 6, 2022
Signed-off-by: Qi Wang <qiwan@redhat.com>
@QiWang19
Member Author

/retest-required

@QiWang19
Member Author

@yuqi-zhang @palonsoro Could you review? The new commit installs openshift-clients and execs the oc command to fetch the CNO image pull spec.

@palonsoro
Contributor

@QiWang19 it looks good to me. Thanks!

@@ -12,6 +12,10 @@ COPY --from=builder /go/src/github.com/openshift/machine-config-operator/instroo
RUN cd / && tar xf /tmp/instroot.tar && rm -f /tmp/instroot.tar
COPY install /manifests

RUN dnf -y update && dnf -y reinstall shadow-utils && \
dnf -y install skopeo && dnf -y install openshift-clients && \
Member

It's a bit unfortunate to add a whole new copy of skopeo and oc into the image, because we have them right there on the host too...
(And for that matter, it's actually useful to validate the proxy configuration from the host network namespace since that's where most image pulling will be happening)
Tricky to deal with without making the MCC privileged enough for host mounts though. But OTOH, the MCC really is privileged in a cluster sense anyways, so making it a privileged container (at least enough for host mounts) isn't really adding any new attack surface.

Contributor

@cgwalters I agree with the host network part, you made a great point here, that may be something worth considering.

However, regarding including the binaries, that's a usual burden we are already paying many times across many components; nothing that should surprise us, given that OCP4 consists of a number of clusteroperators which deploy a number of operands, almost all of them running inside containers.

Making the MCC deployment require access to the host, and require the host to always have these binaries available, even if not crazy in practice, goes against the spirit of having every component in a container so it is, well, self-contained.

A possible way of improvement here would be to make MCO image derive from the tools image shipped in the release instead of base, because the tools image includes the correct version of oc already. Other images that require oc already do it and benefit from image layer de-duplication with regard to the oc client.

Member

A possible way of improvement here would be to make MCO image derive from the tools image shipped in the release instead of base, because the tools image includes the correct version of oc already.

Yeah, that'd help, but doesn't get us out of also shipping skopeo, which today also vendors large parts of the container runtime again.

Hmm, do we actually need to use skopeo vs just forking oc?

Contributor

Yes, we do. oc is only used to find out which image must be checked (CNO image). Once found, the check is done with skopeo (oc doesn't provide a way to do it).

Member

Wouldn't just running e.g. oc image info be sufficient?

Contributor

mmm, as long as executing this successfully guarantees a successful pull, it would.

Member

oc today vendors the docker Go library for interacting with registries, whereas skopeo uses the github.com/containers/image bits. But ultimately...I can't imagine a case where one worked but not the other.

Today oc's fetching is kind of load-bearing because it's where we have e.g. oc image mirror etc that people use for disconnected.

I can't imagine a case where oc succeeds but skopeo (and/or podman/cri-o) would fail with respect to the proxy.

Member Author

I agree, we can just run oc image info
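For illustration, a minimal sketch of that simpler check, assuming an oc binary on the PATH and the proxy already exported in the process environment (the helper name is made up for this example):

package validation

import (
	"fmt"
	"os/exec"
)

// checkImageInfo runs `oc image info` against the CNO image pull spec.
// With HTTP_PROXY/HTTPS_PROXY/NO_PROXY set in the environment, a failure
// here suggests the registry is not reachable through the configured proxy.
func checkImageInfo(cnoImage string) error {
	out, err := exec.Command("oc", "image", "info", cnoImage).CombinedOutput()
	if err != nil {
		return fmt.Errorf("oc image info %s failed: %w: %s", cnoImage, err, out)
	}
	return nil
}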

@yuqi-zhang
Contributor

Sorry for the delay. I think I would need to check in with the team on this as a more general discussion on how to proceed.

@yuqi-zhang
Contributor

/retest-required

} else if err != nil {
return err
}
if err := ctrlcommon.ProxyValidation(&proxy.Status, clusterversionCfg.Status.Desired.Version, icspRules); err != nil {
Contributor

@jkyros jkyros Oct 20, 2022

I feel like you could talk me into putting this in a separate place for "sanity checks" before config rollout/eviction/drain, but I have concerns about the proposed location in SyncRenderConfig() -- if the proxyconfig fails this test for whatever reason, we don't get a RenderConfig, which prevents the rest of the sync functions from running, and that is problematic for general cluster stability during normal operations (among other things, it affects certificate rotations).

Specifically, in a case where the proxy was "valid" when it was configured, but is down/unreachable/etc at the time of the check, the MCO would degrade, wouldn't it?

I don't know that I have a spot picked out where it should go, because the MCO has typically not accepted preflight checks of this nature, but I'm kind of sympathetic here 😄

Member Author

@yuqi-zhang could you help locate where the MCO deploys the proxy settings to the nodes, so that the proxy validation can be done there?

Contributor

Right, so the exact details are a bit up to debate. The way it's set up in this PR right now is blocking at the operator level which is potentially dangerous for reasons John has listed, and I agree that we should probably think about moving this to a consolidated "checking" location.

So, if we want to achieve the point of "don't roll out the proxy to nodes unless it most likely works", then it likely will have to happen at the controller level, between:

  1. https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/render/render_controller.go#L546, where the rendered MC is generated for a pool
  2. https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller.go#L846, where the config is rolled out to the MCP

So then this would be something like validateIncomingRenderedConfig before it gets rolled out to the pool, somehow, where right now we just validate the proxy, but could be extended as a node-specific pre-flight check of some sort. We could even have a flag that enables/disables this, if we don't want to change default behaviour.

The other side of this issue is, as we move towards layering, what if I built a new format OS image with a proxy built into it somehow? This validation path would not catch that if done directly in the image.

One last extension thought on validation that's a bit more encompassing: have a (flag enabled?) option to create an extra canary node on incoming updates to see if that node would break, before upgrading the rest of the nodes. That's a bit too far though.

In summary, I think this is a pretty complex topic. Right now I think maybe the safest option is at the controller level, but how exactly that would be done is a bit up in the air

Member

The other side of this issue is, as we move towards layering, what if I built a new format OS image with a proxy built into it somehow? This validation path would not catch that if done directly in the image.

Ultimately I think what we want is ostreedev/ostree#2725 - basically, we try booting the new configuration, and roll back if kubelet isn't able to start.

Member Author

@yuqi-zhang Please review. Need help with the operator code auto-generation regarding func (f *fixture) newController() (pkg/controller/render/render_controller_test.go).

Contributor

Need help with the operator code auto-generation regarding func (f *fixture) newController() (pkg/controller/render/render_controller_test.go)

Sorry, I don't quite follow, what is the issue with the code autogen?

Signed-off-by: Qi Wang <qiwan@redhat.com>
@openshift-ci
Contributor

openshift-ci bot commented Oct 24, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: QiWang19, rphillips
Once this PR has been reviewed and has the lgtm label, please assign cgwalters for approval by writing /assign @cgwalters in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -183,6 +183,12 @@ func (b *Bootstrap) Run(destDir string) error {
configs = append(configs, kconfigs...)
}

if releaseVersion, ok := cconfig.Annotations[ctrlcommon.ReleaseImageVersionAnnotationKey]; ok {
if err := ctrlcommon.ProxyValidation(cconfig.Spec.Proxy, releaseVersion, icspRules); err != nil {
Contributor

Hmm, just for my own curiosity, this will check via the bootstrap network on the bootstrap node right?

Is there a possibility that the bootstrap network is different? Does it even use the proxy you provide to the cluster?

Contributor

Bootstrap network will be the masters network in most if not all the cases.
For example, in on-prem environments where the keepalived VIP is deployed, the kube-apiserver VIP is first assigned to the bootstrap and eventually moves to one of the masters, so bootstrap and masters must be in the same subnet for that to happen.

const (
tagName = "cluster-network-operator"
imageInfo = "adm release info %s --image-for %s"
imageInfoWithICSP = "adm release info %s --image-for %s --icsp-file %s"
Contributor

Could you explain how the ICSP affects the imagespec here?

Member Author

The oc command can accept the ICSP file as an alternative source to retrieve the release image: https://github.com/openshift/oc/blob/3cdf3c29f0c109c94eb67124548a6b21fc5f6a22/pkg/cli/admin/release/info.go#L136.
If ICSPs have been configured on the cluster, I think we should pass them to oc to get the release image and the CNO pull spec.
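A sketch of what that lookup could look like, reusing the command template from the diff above; the release pull spec and ICSP file path passed by callers are placeholders:

package validation

import (
	"fmt"
	"os/exec"
	"strings"
)

const imageInfoWithICSP = "adm release info %s --image-for %s --icsp-file %s"

// cnoImageFor resolves the CNO pull spec from a release image, optionally
// honouring mirrors from an ICSP file written out from the cluster's ICSPs.
func cnoImageFor(releaseImage, icspFile string) (string, error) {
	args := strings.Fields(fmt.Sprintf(imageInfoWithICSP, releaseImage, "cluster-network-operator", icspFile))
	out, err := exec.Command("oc", args...).CombinedOutput()
	if err != nil {
		return "", fmt.Errorf("resolving CNO image failed: %w: %s", err, out)
	}
	return strings.TrimSpace(string(out)), nil
}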

@@ -301,6 +303,25 @@ func (optr *Operator) syncRenderConfig(_ *renderConfig) error {
}
}
spec.AdditionalTrustBundle = trustBundle
clusterversionCfg, err := optr.configClient.ConfigV1().ClusterVersions().Get(context.TODO(), "version", metav1.GetOptions{})
Contributor

We are looking to remove these changes right?

}
if releaseVersion, ok := cc.Annotations[ctrlcommon.ReleaseImageVersionAnnotationKey]; ok {
if err := ctrlcommon.ProxyValidation(cc.Spec.Proxy, releaseVersion, icspRules); err != nil {
return err
Contributor

So if we err out here, I think we just don't generate the rendered config, right? I feel like maybe we should still generate the rendered config, then have the node controller do the validation and fail there, so we can reference which rendered MC is failing.

Contributor

Put another way, we still would have an issue where the rendered config doesn't get generated if we do it here, I think?

Member Author

Yes, the error is returned before the render config is generated and synced.
To let the node controller do the check, we can drop the validation from render_controller, and in the node controller add validation before this line https://github.com/openshift/machine-config-operator/blob/master/pkg/controller/node/node_controller.go#L846, something like:

cconfigs, err := ctrl.ccLister.List(labels.Everything())
if err != nil {
	return err
}
for _, cc := range cconfigs {
	// retry the proxy validation for each controller config (pseudocode)
	retry(func() error { return validation(cc) })
}

what do you think?

Contributor

I think somewhere in the sync MCP function could work.

Although, hmm, this does mean that every time we perform an update of any sort, for every node that gets synced, we re-check the proxy, and even for scenarios that don't have any changes to the proxy, we re-validate, which seems like... a lot of unnecessary work.

So in that case, maybe having it as we do now is better, but only validate on a change to the proxy between old and new?

What do you think @jkyros ? I'm leaning towards reducing the # of times we validate if there isn't a change in the proxy, just not sure where the best place to do so would be. From a logic perspective, I think maybe render controller is easier, but comes with the downside of not generating a new rendered MC.

In my view, the best place would be: after we generate the rendered MC and before we roll out to an MCP, we do a one-time check of whether the current->desired MC contains a proxy change, before allowing the node controller to roll out. That way, if there is an error, the user would see a new rendered MC, but the MCP would not start an upgrade because it is degraded on the proxy check (with a certain amount of retries). But I don't know how well that plugs into what we have now without it either 1. looking clunky or 2. adding a whole new interface of some sort to do so.
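A rough sketch of the "only validate when the proxy actually changed" idea (the helper and its caller are hypothetical; the comparison is reduced to the proxy status for brevity):

package validation

import (
	configv1 "github.com/openshift/api/config/v1"
)

// maybeValidateProxy skips the (potentially slow) external check when the
// proxy settings did not change between the old and new configuration.
func maybeValidateProxy(oldProxy, newProxy *configv1.ProxyStatus, validate func(*configv1.ProxyStatus) error) error {
	if newProxy == nil || *newProxy == (configv1.ProxyStatus{}) {
		return nil // no proxy configured, nothing to validate
	}
	if oldProxy != nil && *oldProxy == *newProxy {
		return nil // proxy unchanged, skip re-validation
	}
	return validate(newProxy)
}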

if strings.Contains(string(rawOut), proxyErr) {
return fmt.Errorf("invalid http proxy: %w: error running %s %s: %s", err, oc, strings.Join(args, " "), string(rawOut))
}
return fmt.Errorf("%w: error running %s %s: %s", err, oc, strings.Join(args, " "), string(rawOut))
Contributor

Should we have some kind of retry here, for transient failures?

I guess we always retry via re-syncing technically, maybe it would be worth adding a requeue somewhere...

The issue I'm thinking about is, let's say the network is unstable, and we happen to fail the one validation, but the proxy is otherwise valid, what is the user experience there?

Member Author

Yes, I agree. We can retry to deal with the risks.
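A minimal sketch of such a retry, using the wait helpers from k8s.io/apimachinery; the backoff values are arbitrary examples:

package validation

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// validateWithRetry retries the proxy check with exponential backoff so a
// single transient network blip does not immediately degrade the operator.
func validateWithRetry(validate func() error) error {
	backoff := wait.Backoff{Duration: 10 * time.Second, Factor: 2, Steps: 4}
	var lastErr error
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		lastErr = validate()
		return lastErr == nil, nil
	})
	if err != nil {
		return lastErr
	}
	return nil
}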

@openshift-ci
Contributor

openshift-ci bot commented Oct 24, 2022

@QiWang19: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-techpreview-featuregate a6bc3173636256e2df7a32c9f3691503dcaaf9f6 link /test e2e-aws-techpreview-featuregate
ci/prow/e2e-aws-single-node a3b414110c85473a66c4b8f1b5a72b259dd43205 link false /test e2e-aws-single-node
ci/prow/e2e-vsphere-upgrade a3b414110c85473a66c4b8f1b5a72b259dd43205 link false /test e2e-vsphere-upgrade
ci/prow/e2e-aws-serial a3b414110c85473a66c4b8f1b5a72b259dd43205 link false /test e2e-aws-serial
ci/prow/e2e-aws-workers-rhel8 a3b414110c85473a66c4b8f1b5a72b259dd43205 link false /test e2e-aws-workers-rhel8
ci/prow/e2e-aws-upgrade-single-node a3b414110c85473a66c4b8f1b5a72b259dd43205 link false /test e2e-aws-upgrade-single-node
ci/prow/e2e-aws-disruptive a3b414110c85473a66c4b8f1b5a72b259dd43205 link false /test e2e-aws-disruptive
ci/prow/e2e-aws-workers-rhel7 a3b414110c85473a66c4b8f1b5a72b259dd43205 link false /test e2e-aws-workers-rhel7
ci/prow/okd-e2e-aws a3b414110c85473a66c4b8f1b5a72b259dd43205 link false /test okd-e2e-aws
ci/prow/4.12-upgrade-from-stable-4.11-images a3b414110c85473a66c4b8f1b5a72b259dd43205 link true /test 4.12-upgrade-from-stable-4.11-images
ci/prow/okd-scos-e2e-gcp-op 30a0c20 link false /test okd-scos-e2e-gcp-op
ci/prow/okd-scos-e2e-upgrade 30a0c20 link false /test okd-scos-e2e-upgrade
ci/prow/okd-scos-e2e-vsphere 30a0c20 link false /test okd-scos-e2e-vsphere
ci/prow/unit a49df6a link true /test unit
ci/prow/okd-scos-e2e-aws a49df6a link false /test okd-scos-e2e-aws
ci/prow/e2e-gcp-op a49df6a link true /test e2e-gcp-op
ci/prow/e2e-agnostic-upgrade a49df6a link true /test e2e-agnostic-upgrade
ci/prow/e2e-aws a49df6a link true /test e2e-aws
ci/prow/verify a49df6a link true /test verify

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@QiWang19
Member Author

Sorry, I don't quite follow, what is the issue with the code autogen?

The function signature change made for getting the ICSP: https://github.com/openshift/machine-config-operator/pull/2539/files#diff-d38c494535eacf2f0876136ce2b6a6329c78e91d238f7cb2b8f75379427747c0R80
So in the test we need to match the arguments in the call; the arguments for this function are auto-generated by informer-gen: https://github.com/QiWang19/machine-config-operator/blob/a49df6a2bcf2803f77aff5c2247d549fbdc62fff/pkg/controller/render/render_controller_test.go#L69
I have run make update but it did not regenerate them.

@yuqi-zhang
Contributor

I have run make update but it did not generate.

Hmm, it's been a long time since we last updated that test.

What happens if you just try to manually update the test function with the additional necessary items?

i.e. a f.operatorClient = fakeoperatorclient.NewSimpleClientset(f.operatorObjects...) -> oi := operatorinformer.NewSharedInformerFactory(f.operatorClient, noResyncPeriodFunc()) -> oi.Operator().V1alpha1().ImageContentSourcePolicies(), like we do for e.g. containerruntimeconfigcontroller?
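For reference, a sketch of that manual wiring (a fragment, not the real test file; the package aliases and the New(...) argument order are assumptions modelled on the containerruntimeconfig controller test):

// Assumed imports:
//   fakeoperatorclient "github.com/openshift/client-go/operator/clientset/versioned/fake"
//   operatorinformer  "github.com/openshift/client-go/operator/informers/externalversions"

// ...inside (f *fixture) newController():
f.operatorClient = fakeoperatorclient.NewSimpleClientset(f.operatorObjects...)
oi := operatorinformer.NewSharedInformerFactory(f.operatorClient, noResyncPeriodFunc())

c := New(
	// ...existing informer and client arguments stay as they are...
	oi.Operator().V1alpha1().ImageContentSourcePolicies(), // pass the new ICSP informer
)

oi.Start(stopCh)
oi.WaitForCacheSync(stopCh)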

@yuqi-zhang
Contributor

I've spent some time thinking through this general problem (validation of the configuration on the nodes), and I'd like to bring this up more for general discussion.

First, I'd like to go back to the MCO mission statement:

The MCO keeps the underlying CoreOS system up-to-date and applies configs. The team chooses:
Simplicity over sophistication
Being OPEN and Transparent over being opinionated
Verbosity and clarity over imposed safety
Empowering debugging over preventing bugs

The MCO was never designed to be, and I believe should still not be, the place where we provide imposed safety. The MCO does not check whether your configuration is correct today (other than syntax); instead it is simply the bridge between your configuration and the nodes. If we wanted to ensure configuration safety, there is simply too large of a matrix to ensure every configuration you provide is "safe", and the validation complexity will only increase as we move towards CoreOS Layering and providing image based updates. Simply put, this is a tradeoff we have made.

Side note: there is a sort of mitigation in place for "breaking updates": we roll out changes one node at a time (generally), and any singular node should always be replaceable.

Back to the point of this PR, proxy has always been a contentious issue. Fundamentally, the proxy object is not owned by the MCO. If any validation were to happen, the root object owner should be validating the object changes before they are provided to the cluster for consumption. If a user provides a broken proxy, shouldn't the change be rejected in the first place, instead of having it get all the way to the MCO generating a new config before saying: actually, your proxy changes aren't valid because the MCC container can't pull the CNO image? The CNO could have done that before we even got here, reducing the complexity of transit. The MCO would then react to it via the existing check

if proxy.Status == (configv1.ProxyStatus{}) {

and not lay down the "bad proxy".

Side note again: more broadly speaking, validation at the source doesn't cover some other cases we've run into in the past, such as important secrets/certs etc. being deleted. Some of these objects are created at install time and never "managed", and the MCO simply consumes them.

I would also like to revisit the bug for a moment: up until this point https://bugzilla.redhat.com/show_bug.cgi?id=1928581#c12 we were still discussing how to properly validate at a CNO level, but right after, we did some component switching for which there is no context in the bug. Trevor makes some good points in https://bugzilla.redhat.com/show_bug.cgi?id=1928581#c17 and then we flipped it back to node. Did we ever get a chance to discuss this at a higher level?

Now, to also look at the flip side: "the MCO does not do validation" is not a view that cannot be changed. OpenShift is constantly growing and adapting, such that if there is sufficient need to tackle a problem, I think we should consider it. As I see it, there are a few alternatives floating in mind:

  1. the MCO does validation as we see fit approach: this would be basically this PR: we add validations for issues that are seen a lot and are annoying to deal with, and we scatter it across the code. This obviously does not scale well, but can solve some immediate issues
  2. the MCO creates a new validation schema/API/controller approach: we would spend time designing and crafting a whole new method (controller?) that allows us to create extendable methods to validate configuration, with potential options to allow users to specify extra validation schema. This would probably take more design and I am not sure what's the best way to do so today
  3. the root owner takes responsibility approach, where the validation happens at a higher level before it reaches the MCO, and the MCO continues to be a consumer. This would likely also work better with layering, since the validation would happen pre-build
  4. the validate coreos layered images approach, where we create a new image validation schema specifically for the new layered image update workflow
  5. the create OCP config validator flow, where a new operator is generally responsible for watching important configuration changes

And lastly, I feel bad for this writeup, since many people have put a lot of work into this PR, but after all the back and forth I am still leaning towards "this isn't something we should do in the MCO". I am happy to discuss this further in any context, and I am willing to change my mind.

@QiWang19 QiWang19 changed the title Bug 1928581: validate the proxy by trying skopeo inspect image Bug 1928581: validate the proxy by trying oc image info Oct 26, 2022
@cgwalters
Member

cgwalters commented Oct 26, 2022

Excellent writeup, I agree with most of it. My view on this is what we really want is automated rollbacks. Basically in this scenario:

  • node boots into new config
  • we fail to contact the proxy (I don't think kubelet fails in this scenario, but we can't fetch OS updates anymore? Presumably other pod workloads on the node fail)
  • This is detected by a health check
  • Roll back to previous config
  • Error from previous state is saved and reported

A question here is whether we then try to reconcile again later. I think it'd make sense to do so, with a backoff so we only try the change again e.g. once a day at most or so?

@cgwalters
Member

But to be clear I agree in this specific instance it'd make sense to have the proxy config be validated by an owning component before it gets rolled out.

@sinnykumari
Contributor

Thank you Jerry for adding all the context and reasoning, great explanation! 100% agree with it, and will echo again that validation should be done at the source, not at the consumer level. This scales better and is less error prone, as the provider has better knowledge of what is correct.
On a similar note, we recently backed off a proposal which involved the MCO updating the infra object: openshift/enhancements#1102 (comment)

@palonsoro
Contributor

I agree that it should have been the CNO and not the MCO doing this test. Honestly, I don't understand how the bug ended up in the MCO in the first place. But if this is to be returned to the CNO, we need some higher-level coordination to make this possible.

BTW, the PR may not be the best place to discuss all of this; the bugzilla would be.

@rphillips
Contributor

I am in agreement with the latest discussion. Let's close this PR and document a procedure to test the settings. The proxy settings should be tested in a staging environment.

@QiWang19 QiWang19 closed this Oct 31, 2022
@openshift-ci
Contributor

openshift-ci bot commented Oct 31, 2022

@QiWang19: This pull request references Bugzilla bug 1928581. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
Warning: Failed to comment on Bugzilla bug with reason for changed state.

In response to this:

Bug 1928581: validate the proxy by trying oc image info

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@palonsoro
Contributor

palonsoro commented Nov 1, 2022

I am in agreement with the latest discussion. Let's close this PR and document a procedure to test the settings. The proxy settings should be tested in a staging environment.

I agree with closing this as far as the MCO is concerned, because this should not be (and never should have been) checked by the MCO.

However, just relying on customer validation in a staging environment is not a correct approach, because mistakes will always happen. The main point of this bug was not to protect from the error itself, but from the fact that there is no sane way to recover from it once it has happened.

If this issue arises again, we should open a bug against the CNO, which is where a proper solution should be placed.

@yuqi-zhang
Contributor

Thanks everyone for the work and comments! Will try to continue tracking this in Jira so we don't lose the context.
