dns resolution fails on k8s where api server is down, no caching at all? #2629

damienleger · 2019-02-27T10:31:30Z

After an upgrade of k8s control plane that lead to lots of dns erros I've noticed that when api server is down dns resolution are not working. Seem like there are no cache at all whereas configuration suggests otherwise.

CoreDNS config is the one constructed by the migration script there https://github.com/coredns/deployment/tree/master/kubernetes

  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          upstream
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 8.8.8.8 8.8.4.4
        cache 30
        loop
        reload
        loadbalance
    }

You can reproduce this behavior with the following Vagrant box:
N.B. microk8s is still using kubedns, a migration to coredns is performed then...

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/bionic64"

  config.vm.provider "virtualbox" do |v|
    v.linked_clone = true
  end

  config.vm.provider "virtualbox" do |v|
    v.memory = 2048
    v.cpus = 4
  end

  config.vm.provision "shell", inline: <<-SCRIPT
set +x
apt-get update && sudo apt-get install -y jq
sudo snap install microk8s --classic
sleep 10
sudo microk8s.start
sleep 10
microk8s.enable dns
sleep 10
git clone https://github.com/coredns/deployment.git
cd deployment/kubernetes/
cat deploy.sh | sed -e 's|kubectl|microk8s.kubectl|g' | bash | microk8s.kubectl apply -f -

cat <<-EOF | microk8s.kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  containers:
  - name: debug
    image: yolsgens/alpine-with-tools:2
    # Just sleep forever
    command: [ "/bin/ash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
EOF

microk8s.kubectl delete --namespace=kube-system deployment kube-dns

cat <<-EOF | microk8s.kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    run: nginx
spec:
  ports:
  - port: 80
    protocol: TCP
  selector:
    run: nginx
EOF

SCRIPT

end

On ssh session n°1 we spam dns resolution:

microk8s.docker exec -ti $(microk8s.docker ps | fgrep debug |  fgrep alpine | awk '{print $1'}) ash
while true; do date; host nginx; sleep 1; done

On ssh session n°2 we stop api server:

sudo systemctl stop snap.microk8s.daemon-apiserver.service

You should instantly see dns errors on session n°1.

I've performed the same test:

by using the old kubedns, just remove the line cat deploy.sh | sed -e 's|kubectl|microk8s.kubectl|g' | bash | microk8s.kubectl apply -f - in Vagrantfile. Dns resolution keep working (fun fact, cache seems to never expires!?)
by putting a ttl on kubernetes section of coredns config as suggested here https://coredns.io/plugins/kubernetes/ , result are the same

I found this behavior really disturbing from HA point of view, plus it's not matching the behavior of old kubedns. Am I missing something in the configuration or anything else?

The text was updated successfully, but these errors were encountered:

chrisohaver · 2019-02-27T14:18:31Z

Initial thoughts...

Cache should be working here, and increasing ttl in kubernetes plugin should make those entries last longer in cache.

Sometimes it takes a couple of minutes for a Configmap change to be pushed to the Pods.

Is CoreDNS crashing and getting restarted at the time of the API stoppage?

What version of CoreDNS are you using.

Possibly related to #2464

damienleger · 2019-02-27T14:47:13Z

hello @chrisohaver,

Cache should be working here, and increasing ttl in kubernetes plugin should make those entries last longer in cache.

Sometimes it takes a couple of minutes for a Configmap change to be pushed to the Pods.

I've tried with ttl 30, delete the pod after to reload CoreDNS, same behavior.

Is CoreDNS crashing and getting restarted at the time of the API stoppage?

Indeed. Before API stop:

vagrant@ubuntu-bionic:~$ microk8s.docker ps | grep coredns
4705d17ae085        eb516548c180                 "/coredns -conf /etc…"   About a minute ago   Up About a minute                       k8s_coredns_coredns-5d668bd598-96kjx_kube-system_a24aceba-3a99-11e9-aab2-02888028663a_5
0188944cae22        eb516548c180                 "/coredns -conf /etc…"   About a minute ago   Up About a minute                       k8s_coredns_coredns-5d668bd598-7hgpz_kube-system_a5e4ef07-3a99-11e9-aab2-02888028663a_5
cf8830d60243        k8s.gcr.io/pause:3.1         "/pause"                 12 minutes ago       Up 12 minutes                           k8s_POD_coredns-5d668bd598-7hgpz_kube-system_a5e4ef07-3a99-11e9-aab2-02888028663a_0
7956e197f3b0        k8s.gcr.io/pause:3.1         "/pause"                 12 minutes ago       Up 12 minutes                           k8s_POD_coredns-5d668bd598-96kjx_kube-system_a24aceba-3a99-11e9-aab2-02888028663a_0

After stoppping API server

vagrant@ubuntu-bionic:~$ microk8s.docker ps | grep coredns
cf8830d60243        k8s.gcr.io/pause:3.1         "/pause"                 12 minutes ago      Up 12 minutes                           k8s_POD_coredns-5d668bd598-7hgpz_kube-system_a5e4ef07-3a99-11e9-aab2-02888028663a_0
7956e197f3b0        k8s.gcr.io/pause:3.1         "/pause"                 12 minutes ago      Up 12 minutes                           k8s_POD_coredns-5d668bd598-96kjx_kube-system_a24aceba-3a99-11e9-aab2-02888028663a_0

What version of CoreDNS are you using.

1.1.3

chrisohaver · 2019-02-27T14:53:47Z

So, CoreDNS is crashing when then API restarts. That would explain the empty cache. Cache does not persist past CoreDNS restarts.
You should be able to retrieve the logs for the previous instance to see what the error is. Using kubectl logs -p for example.

damienleger · 2019-02-27T14:57:44Z

Yes with docker cli:

vagrant@ubuntu-bionic:~$ microk8s.docker logs f122c912292b
.:53
2019-02-27T14:32:30.982Z [INFO] CoreDNS-1.3.1
2019-02-27T14:32:30.982Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-02-27T14:32:30.982Z [INFO] plugin/reload: Running configuration MD5 = b5780bf263aace648ab23ad78ecbed26
2019-02-27T14:32:36.994Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:56809->8.8.8.8:53: i/o timeout
2019-02-27T14:32:40.010Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:44697->8.8.4.4:53: i/o timeout
2019-02-27T14:32:41.012Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:37702->8.8.4.4:53: i/o timeout
2019-02-27T14:32:42.014Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:39593->8.8.8.8:53: i/o timeout
2019-02-27T14:32:45.015Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:49238->8.8.8.8:53: i/o timeout
2019-02-27T14:32:48.016Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:48542->8.8.4.4:53: i/o timeout
2019-02-27T14:32:51.019Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:60882->8.8.8.8:53: i/o timeout
2019-02-27T14:32:54.022Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:34502->8.8.8.8:53: i/o timeout
2019-02-27T14:32:57.023Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:60149->8.8.8.8:53: i/o timeout
2019-02-27T14:33:00.024Z [ERROR] plugin/errors: 2 7097044966940628890.8660328651737545829. HINFO: read udp 10.1.1.7:52974->8.8.4.4:53: i/o timeout
E0227 14:40:20.983229       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
E0227 14:40:20.983229       1 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=13, ErrCode=NO_ERROR, debug=""
log: exiting because of error: log: cannot create log: open /tmp/coredns.coredns-5d668bd598-7hgpz.unknownuser.log.ERROR.20190227-144020.1: no such file or directory
vagrant@ubuntu-bionic:~$

chrisohaver · 2019-02-27T15:05:40Z

yes, this looks to be the same problem as #2464.

I think this is fixed with #2529 ... which isn't included in the latest release (1.3.1), but should be fixed if you build from master.

damienleger · 2019-02-27T15:41:38Z

OK I've just built and tested the current master version, I confirm the issue is gone.

I've tested with and without the extra 'ttl 60' config in kubernetes section, in both case dns resolution is still working after 100+ seconds. I don't really get the use of this ttl parameter to be honest, but I suppose cache is never invalidated when api server is down which is fine for me (and match with the old kubedns behavior).

Thanks!

chrisohaver · 2019-02-27T16:02:25Z

I don't really get the use of this ttl parameter to be honest, but I suppose cache is never invalidated when api server is down which is fine for me (and match with the old kubedns behavior).

Just FYI, ttl controls the ttl on the record, and thus how long the entry is cached in the coredns cache plugin (and any other intermediate caching dns server).
The kubernetes api client (client-go), maintains an object store of all services/endpoints/namespaces which is updated in real-ish time via a watch. This is separate form the cache plugin, and that store does not go away when the API connection is lost. So, after a service's cache entry expires, a subsequent request for the service re-creates the record from the k8s api object store.

kubedns behaves the same way in this regard because it uses the same kubernetes api client.

See kubernetes/kubernetes#74137 See kubernetes/kubernetes#73267 See coredns/coredns#2629

vitobotta · 2019-08-15T17:00:04Z

Hi, I just deployed a cluster with Rancher and coredns refuses to start with log: exiting because of error: log: cannot create log: - any idea of what I can do? It seems the fix isn't out yet or have I misunderstood? Thanks

chrisohaver · 2019-08-15T17:04:33Z

@vitobotta, If it's the same issue, it is fixed in any version after 1.3.1.
A work-around for verison 1.3.1 is to mount /tmp in the deployment pod spec (e.g. coredns/deployment#138).

vitobotta · 2019-08-15T17:17:59Z

Hi @chrisohaver , I have mounted /tmp and it seems to help but now in the logs I see E0815 17:16:01.234957 1 reflector.go:134] github.com/coredns/coredns/plugin/kubernetes/controller.go:315: Failed to list *v1.Service: Get https://10.43.0.1:443/api/v1/services?limit=500&resourceVersion=0: dial tcp 10.43.0.1:443: i/o timeout and it's stuck at NotReady. What else can I check?

vitobotta · 2019-08-15T17:19:39Z

Is the fact that I am using Weave relevant?

chrisohaver · 2019-08-15T18:36:54Z

@vitobotta, that error usually means there is something wrong with the CNI, or other network problem preventing Pods (in this case coredns) from connecting to a Service (in this case the kubernetes api).

vitobotta · 2019-08-15T18:38:05Z

I recreated the cluster using Canal instead of Weave and the CoreDNS issue didn't occur

stp-ip assigned johnbelamaric Feb 27, 2019

damienleger closed this as completed Feb 27, 2019

stp-ip unassigned johnbelamaric Feb 27, 2019

johnbelamaric mentioned this issue Mar 18, 2019

[1.14-beta2] coredns imediatelly crash if Kubernetes API is unavailable kubernetes/kubernetes#75414

Closed

fturib mentioned this issue Mar 18, 2019

1.14 Release Notes: "Known Issues" kubernetes/kubernetes#74425

Closed

Cynerva mentioned this issue Mar 21, 2019

Add CoreDNS version to release notes and upgrade notes charmed-kubernetes/kubernetes-docs#125

Merged

mboersma mentioned this issue Mar 25, 2019

feat: add support for Kubernetes 1.14.0 Azure/aks-engine#876

Merged

5 tasks

rfranzke added a commit to gardener/gardener that referenced this issue Mar 26, 2019

Adapt CoreDNS to recommended default configuration

fbcf14e

See kubernetes/kubernetes#74137 See kubernetes/kubernetes#73267 See coredns/coredns#2629

mboersma mentioned this issue Mar 29, 2019

dns-liveness pod restarts failing k8s 1.14 tests Azure/aks-engine#919

Closed

zerkms mentioned this issue May 17, 2019

Upgrade coredns to v1.4.0 or v1.5.0 kubernetes/kubernetes#78003

Closed

Pajk mentioned this issue May 20, 2019

Allow for some coredns customization kubernetes/kops#5839

Closed

vivekgarg20 mentioned this issue Jun 10, 2019

etcd v2 to v3 migration instructions cause cluster downtime when using CoreDNS kubernetes/kops#7065

Closed

zerkms mentioned this issue Jun 24, 2019

Upgrade coredns to v1.4.0 or v1.5.0 kubernetes/kubernetes#79319

Closed

moonek mentioned this issue Aug 5, 2019

kubeadm: revert CoreDNS to 1.3.1 kubernetes/kubernetes#78545

Merged

miekg added the kubernetes label Aug 15, 2019

miekg added the question label Aug 15, 2019

coredns locked as resolved and limited conversation to collaborators Aug 15, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dns resolution fails on k8s where api server is down, no caching at all? #2629

dns resolution fails on k8s where api server is down, no caching at all? #2629

damienleger commented Feb 27, 2019 •

edited

Loading

chrisohaver commented Feb 27, 2019 •

edited

Loading

damienleger commented Feb 27, 2019

chrisohaver commented Feb 27, 2019

damienleger commented Feb 27, 2019

chrisohaver commented Feb 27, 2019 •

edited

Loading

damienleger commented Feb 27, 2019

chrisohaver commented Feb 27, 2019

vitobotta commented Aug 15, 2019

chrisohaver commented Aug 15, 2019

vitobotta commented Aug 15, 2019

vitobotta commented Aug 15, 2019

chrisohaver commented Aug 15, 2019

vitobotta commented Aug 15, 2019

dns resolution fails on k8s where api server is down, no caching at all? #2629

dns resolution fails on k8s where api server is down, no caching at all? #2629

Comments

damienleger commented Feb 27, 2019 • edited Loading

chrisohaver commented Feb 27, 2019 • edited Loading

damienleger commented Feb 27, 2019

chrisohaver commented Feb 27, 2019

damienleger commented Feb 27, 2019

chrisohaver commented Feb 27, 2019 • edited Loading

damienleger commented Feb 27, 2019

chrisohaver commented Feb 27, 2019

vitobotta commented Aug 15, 2019

chrisohaver commented Aug 15, 2019

vitobotta commented Aug 15, 2019

vitobotta commented Aug 15, 2019

chrisohaver commented Aug 15, 2019

vitobotta commented Aug 15, 2019

damienleger commented Feb 27, 2019 •

edited

Loading

chrisohaver commented Feb 27, 2019 •

edited

Loading

chrisohaver commented Feb 27, 2019 •

edited

Loading