
node level DNS may need override #284

Closed

BenTheElder opened this issue Feb 8, 2019 · 11 comments

@BenTheElder
Member

cc @floreks

TL;DR: floreks was running kind in dind on kubeadm, in which case the kind "node" DNS was not configured in a working way. We may need to allow DNS to be overridden at the node level to fix e.g. image pulling.

@BenTheElder BenTheElder added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Feb 11, 2019
@BenTheElder
Member Author

cc @alejandrox1 I think we should probably add something about DNS mismatch to the known-issues.
In short:

  • DNS from the host environment will propagate to the kind "nodes"
  • If the host DNS is not appropriate for the nodes (e.g. the in-cluster DNS of an outer Kubernetes cluster when kind runs in a dind pod), this will cause problems

The workaround for the pod case is to configure the outer dind pod's DNS to use upstream resolvers instead of the in-cluster resolvers.
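
For illustration, a minimal sketch of that workaround; the pod name, image, and resolver address below are placeholders, not taken from this issue:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kind-dind            # placeholder name
spec:
  dnsPolicy: "None"          # don't inherit the outer cluster's DNS (e.g. 10.96.0.10)
  dnsConfig:
    nameservers:
      - 8.8.8.8              # placeholder; use your network's upstream resolver
  containers:
    - name: dind
      image: docker:dind     # placeholder dind image
      securityContext:
        privileged: true
```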

A more general solution will probably require kind to support configuring node-level DNS options.

@mikkeloscar

I just hit this issue when trying to run https://github.com/kubernetes-incubator/metrics-server in a kind cluster.

The metrics server tries to reach the kubelets via the node hostnames and fails:

$ kubectl --namespace kube-system logs -f metrics-server-cd4946b-fxgzf 
I0321 13:44:56.710455       1 serving.go:273] Generated self-signed cert (apiserver.local.config/certificates/apiserver.crt, apiserver.local.config/certificates/apiserver.key)
[restful] 2019/03/21 13:44:57 log.go:33: [restful/swagger] listing is available at https://:443/swaggerapi
[restful] 2019/03/21 13:44:57 log.go:33: [restful/swagger] https://:443/swaggerui/ is mapped to folder /swagger-ui/
I0321 13:44:57.461103       1 serve.go:96] Serving securely on [::]:443
E0321 13:45:07.834706       1 reststorage.go:144] unable to fetch pod metrics for pod e2e/es-operator-74b79684df-kkfjl: no metrics known for pod
E0321 13:45:07.834733       1 reststorage.go:144] unable to fetch pod metrics for pod e2e/es-master-c895c6f54-nzzfw: no metrics known for pod
E0321 13:45:08.733047       1 reststorage.go:144] unable to fetch pod metrics for pod e2e/es-operator-74b79684df-kkfjl: no metrics known for pod
E0321 13:45:08.733074       1 reststorage.go:144] unable to fetch pod metrics for pod e2e/es-master-c895c6f54-nzzfw: no metrics known for pod
E0321 13:45:57.623974       1 manager.go:102] unable to fully collect metrics: [unable to fully scrape metrics from source kubelet_summary:e2e-worker2: unable to fetch metrics from Kubelet e2e-worker2 (e2e-worker2): Get https://e2e-worker2:10250/stats/summary/: dial tcp: lookup e2e-worker2 on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:e2e-worker3: unable to fetch metrics from Kubelet e2e-worker3 (e2e-worker3): Get https://e2e-worker3:10250/stats/summary/: dial tcp: lookup e2e-worker3 on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:e2e-control-plane: unable to fetch metrics from Kubelet e2e-control-plane (e2e-control-plane): Get https://e2e-control-plane:10250/stats/summary/: dial tcp: lookup e2e-control-plane on 10.96.0.10:53: server misbehaving, unable to fully scrape metrics from source kubelet_summary:e2e-worker: unable to fetch metrics from Kubelet e2e-worker (e2e-worker): Get https://e2e-worker:10250/stats/summary/: dial tcp: lookup e2e-worker on 10.96.0.10:53: server misbehaving]

It would be nice if we could configure CoreDNS to resolve the node names.
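
(Not something kind sets up today, but as a sketch of the idea: CoreDNS's hosts plugin could map the node names to the node containers' IPs. The addresses below are placeholders; the real ones come from `docker inspect` on the node containers.)

```
.:53 {
    hosts {
        172.17.0.2 e2e-control-plane   # placeholder container IPs
        172.17.0.3 e2e-worker
        172.17.0.4 e2e-worker2
        172.17.0.5 e2e-worker3
        fallthrough
    }
    # ...rest of the default Corefile (kubernetes, forward, cache, ...)
}
```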

@swachter

swachter commented Mar 25, 2019

I had a similar problem where a pod in a kind cluster could not resolve github.com. The reason seems to be that the default DNS server (8.8.8.8) is blocked in our network. After I added an /etc/docker/daemon.json to the kind node image with a dns entry that includes our DNS server, the pod could resolve github.com.

cf. https://development.robinwinslow.uk/2016/06/23/fix-docker-networking-dns/
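
For reference, the daemon.json swachter describes has roughly this shape (the addresses are placeholders for your own resolvers):

```json
{
  "dns": ["10.0.0.2", "8.8.8.8"]
}
```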

@BenTheElder
Member Author

@mikkeloscar that's an interesting problem, I wonder why metrics-server is using hostnames rather than going via the node IP 🤔

@swachter that's a good workaround. FYI, though: instead of editing /etc/docker/daemon.json on the kind nodes for this, you should be able to use one of these more portable options:

https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/
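
As a sketch of the first option, per the linked dns-custom-nameservers doc, the coredns ConfigMap's forward line can point at explicit upstreams instead of the node's /etc/resolv.conf; the upstream addresses below are placeholders:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . 10.0.0.2 8.8.8.8   # placeholder upstream resolvers
        cache 30
        loop
        reload
        loadbalance
    }
```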

@BenTheElder
Member Author

For the metrics-server side of this, see kubernetes-sigs/metrics-server#184 (comment) 🤔

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 23, 2019
@BenTheElder
Member Author

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2019
@BenTheElder BenTheElder added priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. kind/design Categorizes issue or PR as related to design. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 15, 2019
@BenTheElder
Member Author

In general the problem is knowing what to override with. Ideally the host's DNS should be suitable; the only recurring case is sticking kind in a Kubernetes pod from another cluster, and in that case you can configure DNS on that pod.

It might be possible to create a wrapper script that detects the upstream DNS servers somehow and patches the pod's DNS config before running kind, to make this more automatic. For now we can just document this particular case (which is also the original one in this issue) with #303.
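
A hypothetical sketch of that wrapper idea; KIND_UPSTREAM_DNS is a made-up variable, and actually detecting the upstreams is the open question:

```sh
#!/bin/sh
set -eu

# Rewrite the pod's resolv.conf with known upstream resolvers, then start kind.
: "${KIND_UPSTREAM_DNS:=8.8.8.8}"   # placeholder default; space-separated resolver IPs

{
  for ns in $KIND_UPSTREAM_DNS; do
    echo "nameserver $ns"
  done
} > /etc/resolv.conf

exec kind create cluster "$@"
```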

@dlipovetsky
Contributor

I mentioned this in Slack some time back, but this seems like a good place to leave a note.

When kind creates a node, it inherits the DNS configuration (/etc/resolv.conf) of the host. Changes to the host's configuration are not reflected on the kind node.

For example, if you start a kind cluster on your laptop, then switch to a different network, the DNS configuration on the kind nodes might not work anymore. In that case, you might see kubelet report errors pulling images, etc.

This is not a kind bug, but a consequence of how Docker manages /etc/resolv.conf. The Docker v17.09 docs describe how this works; later docs omit the details, but I don't think the implementation has changed.
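
A quick way to see what a node inherited (the container name assumes the default cluster name; `docker ps` shows the actual node names):

```sh
docker exec kind-control-plane cat /etc/resolv.conf
```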

Having node level DNS override would also address this issue, I think.

@BenTheElder
Member Author

#1508 pretty much obviates this: instead, docker's embedded DNS is made to work, which means the host resolver is used via a docker DNS proxy socket for upstream requests.

This also means the host's /etc/resolv.conf is not really picked up; instead the internal resolver is used.

@BenTheElder
Member Author

Should be fixed by #1508; no more awkward resolv.conf inheritance.

stg-0 added a commit to stg-0/kind that referenced this issue Sep 13, 2023