
πŸƒβ€β™‚οΈ Provide a way in KCP to disable CoreDNS management via annotations #3023

Merged

Conversation

@sethp-nr (Contributor) commented May 6, 2020

What this PR does / why we need it:

We're seeing a lot of these log lines in our cluster(s):

```
E0506 19:19:39.493104       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to update CoreDNS deployment: failed to validate CoreDNS: toVersion \"1.6.2\" must be greater than fromVersion \"1.6.2\""  "controller"="kubeadmcontrolplane" "request"={"Namespace":"md1a","Name":"md1a-controlplane"}
```

It looks like CoreDNS is being reconciled even though we haven't set up any of the DNS properties to opt in to that behavior.


@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 6, 2020
@sethp-nr sethp-nr changed the title fix: avoid reconciling CoreDNS when no fields are set πŸ› avoid reconciling CoreDNS when no fields are set May 6, 2020
@k8s-ci-robot k8s-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label May 6, 2020
@sethp-nr (Contributor, Author) commented May 6, 2020

cc @vincepri – not sure if this is the way we should be going, or if we want to look at adding in a controller-level flag to turn off the behavior.

Short version is that right now the KCP is failing to update our CoreDNS deployments and we're a little scared of when that bug is fixed and it is able to successfully change things :)

@sethp-nr (Contributor, Author) commented May 6, 2020

also cc @dthorsen

@vincepri (Member) commented May 7, 2020

@sethp-nr If there are no changes, that should be detected when we check whether the from/to images are the same. The comparison uses the whole image specified in the deployment, so I guess the code is somehow detecting a change.

Can we add a test case, or do you have a reproducible case we can debug? I'd also suggest looking at the logic in

```go
func (w *Workload) getCoreDNSInfo(ctx context.Context, clusterConfig *kubeadmv1.ClusterConfiguration) (*coreDNSInfo, error) {
```

which should be the method gathering this information.

```diff
@@ -8,5 +8,5 @@ spec:
     spec:
       containers:
       # Change the value of image field below to your controller image URL
-      - image: gcr.io/k8s-staging-capi-docker/capd-manager:dev
+      - image: gcr.io/k8s-staging-capi-docker/capd-manager-amd64:dev
```
Review comment (Member):

I don't think this should be part of the commit.

Reply (Contributor, Author):

Ah, good catch, stung again by `git commit -a`.

```diff
@@ -78,6 +78,12 @@ func (w *Workload) UpdateCoreDNS(ctx context.Context, kcp *controlplanev1.Kubead
 		return nil
 	}
 
+	// If we haven't been asked to do anything, return early
+	if clusterConfig.DNS.ImageRepository == "" && clusterConfig.ImageRepository == "" &&
```
Review comment (Member):

I think we should also return early if `clusterConfig.DNS.Type != CoreDNS`.
@vincepri is it written down somewhere that KCP does not support kube-dns as a DNS option?

Reply (Member):

This is already the case, on line 77

Reply (Member):

Thanks for pointing that out.
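
(For readers following the thread outside the review UI: a paraphrased sketch of the guard being discussed. The exact code is the KCP workload-cluster CoreDNS logic; the `kubeadmv1.CoreDNS` constant name is an assumption based on the kubeadm API types, not a quote of line 77.)

```go
// Paraphrased sketch, not the actual source: skip CoreDNS reconciliation
// entirely when the cluster is not using CoreDNS as its DNS add-on.
if clusterConfig.DNS.Type != kubeadmv1.CoreDNS {
	return nil
}
```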

@sethp-nr (Contributor, Author) commented May 7, 2020

@vincepri I'll look into adding a test; at this point I suspect it's because we're modifying the CoreDNS deployment out of band from kubeadm, so some of the stored state makes it look like there is something to change.

What I was hoping for was some way to avoid having the KCP even check in these cases: we want to change things about the shape of the deployment (at this moment specifically it's tolerations) & would rather use one tool to manage both the version and the Corefile / image versions.

From looking into getCoreDNSInfo it looks like we could break the KCP's check flow, either intentionally or unintentionally, by changing the name of the Deployment or the container. It looks like things are working OK for now just because updating CoreDNS is the last thing happening in reconcile.

@sethp-nr (Contributor, Author) commented May 7, 2020

Ah, yup, that was it. In v0.3.2:

```
 FromImage: (string) (len=21) "coredns/coredns:1.6.2",
 ToImage: (string) (len=31) "docker.io/coredns/coredns:1.6.2"
```

And in current master, `getCoreDNSInfo` is failing with an error:

```
unable to parse "coredns/coredns:1.6.2" deployment image: couldn't parse image name: repository name must be canonical
```
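
A minimal standalone sketch of the parse failure, assuming the "repository name must be canonical" message originates from the docker distribution reference package; none of the KCP helper names appear here, only the parsing behavior:

```go
package main

import (
	"fmt"

	"github.com/docker/distribution/reference"
)

func main() {
	// The short name written by an out-of-band tool is not canonical,
	// so strict parsing rejects it.
	if _, err := reference.ParseNamed("coredns/coredns:1.6.2"); err != nil {
		fmt.Println("short form:", err) // repository name must be canonical
	}

	// The fully qualified form that kubeadm defaults to parses fine.
	if named, err := reference.ParseNamed("docker.io/coredns/coredns:1.6.2"); err == nil {
		fmt.Println("canonical form:", named.String())
	}
}
```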

@vincepri (Member) commented May 7, 2020

@sethp-nr Is the FromImage (the one that comes from the Deployment) changed outside CABPK?

@sethp-nr (Contributor, Author) commented May 7, 2020

Yeah, that's right – we use an ArgoCD app for managing CoreDNS.

@vincepri (Member) commented May 7, 2020

Got it, I think it's fair to add a check, although the one in this PR could impact the ability to go from an explicit value back to the implicit defaults, if that makes sense.

@sethp-nr (Contributor, Author) commented May 7, 2020

Hmm, yeah – it's the age old default-or-unset issue.

I'm open to suggestions, my bad ideas are:

  1. Change the kubeadm type to have an additional way to express "I want you to manage this" vs "please treat this as bootstrap-only", which I think requires participation from the kubeadm folks
  2. Add a controller-scoped flag to turn the DNS reconcile behavior off globally
  3. Add something to the KubeadmControlPlane to make this explicit on a per-control-plane level (maybe a doReconcile: [etcd, coredns, machines] with commensurate defaults in webhook?)

@vincepri (Member) commented May 7, 2020

I wouldn't be opposed to adding a flag or annotation to opt out.

@ncdc thoughts?

@ncdc (Contributor) commented May 7, 2020

The choice to enable or disable CoreDNS management is probably at the provider level and not per KCP, right? Or do we think individual cluster owners would want to make that decision?

@sethp-nr (Contributor, Author) commented May 7, 2020

The main benefit of putting it on the KCP is that it's not somewhere else: it saves me from having to check controller flags on a pod in a different namespace if I'm wondering whether the KCP will be attempting to reconcile some DNS config I'm looking at.

I'll re-work this PR to use an annotation – I'm not sure if there's an underlying kubeadm issue that'd still be worth filing, though.
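
For illustration, a hypothetical sketch of what the annotation opt-out could look like on a KubeadmControlPlane; the annotation key and apiVersion shown here are assumptions, and the authoritative names are whatever the merged change defines:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: md1a-controlplane
  namespace: md1a
  annotations:
    # Assumed key: its presence tells KCP to leave the CoreDNS Deployment alone.
    controlplane.cluster.x-k8s.io/skip-coredns: ""
spec:
  # ...rest of the control plane spec unchanged
```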

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 7, 2020
@ncdc (Contributor) commented May 7, 2020

What's the potential kubeadm issue?

@ncdc (Contributor) commented May 7, 2020

I'm also fine with a field or annotation on KCP

@k8s-ci-robot (Contributor) commented May 7, 2020

@sethp-nr: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-cluster-api-capd-e2e | 5f76d95 | link | `/test pull-cluster-api-capd-e2e` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sethp-nr (Contributor, Author) commented:

The kubeadm issue that I'm musing on is something around teasing apart the ongoing management of CoreDNS from bootstrapping – as it stands, I think the KCP is in an awkward place because turning up a cluster sometimes requires the ability to override things like registry and image tag, but that's not really sufficiently expressive to take full ownership of the CoreDNS pieces.

@vincepri (Member) commented:

/milestone v0.3.7

@k8s-ci-robot k8s-ci-robot added this to the v0.3.7 milestone May 22, 2020
We're seeing a lot of these log lines in our clusters:

```
E0506 19:19:39.493104       1 controller.go:258] controller-runtime/controller "msg"="Reconciler error" "error"="failed to update CoreDNS deployment: failed to validate CoreDNS: toVersion \"1.6.2\" must be greater than fromVersion \"1.6.2\""  "controller"="kubeadmcontrolplane" "request"={"Namespace":"md1a","Name":"md1a-controlplane"}
```

We're managing the CoreDNS Deployment outside of the KCP, and have set
the repository to something it doesn't know about. We'd like to be able
to separate out the "use the default repository" cases from the
"do not manage CoreDNS" cases.
@vincepri (Member) left a review:

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 22, 2020
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sethp-nr, vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 22, 2020
@vincepri (Member) commented:

/retitle πŸƒβ€β™‚οΈ Provide a way in KCP to disable CoreDNS management via annotations

@k8s-ci-robot k8s-ci-robot changed the title πŸ› avoid reconciling CoreDNS when no fields are set πŸƒβ€β™‚οΈ Provide a way in KCP to disable CoreDNS management via annotations May 22, 2020
@k8s-ci-robot k8s-ci-robot merged commit d51a360 into kubernetes-sigs:master May 22, 2020
@sethp-nr sethp-nr deleted the fix/too-much-coredns branch May 22, 2020 16:15
tvs added a commit to tvs/cluster-api that referenced this pull request Jul 9, 2020
For air-gapped systems that cannot pull container images, the rolling
update of add-ons needs to be coordinated in a more particular fashion.
This allows users of the KCP to opt out of kube-proxy management with a
similar mechanism to how they would opt out of CoreDNS management (kubernetes-sigs#3023)
jzhoucliqr pushed a commit to spectrocloud/cluster-api that referenced this pull request Aug 27, 2020
For air-gapped systems that cannot pull container images, the rolling
update of add-ons needs to be coordinated in a more particular fashion.
This allows users of the KCP to opt out of kube-proxy management with a
similar mechanism to how they would opt out of CoreDNS management (kubernetes-sigs#3023)