coredns-monitor creates Corefile with 127.0.0.53 forwarder #1073

jwhb · 2022-01-17T11:14:29Z

Describe the bug

In my OKD cluster, the Corefile on node master-1 is faulty: its forward directive points to 127.0.0.53 instead of a DNS resolver external to the cluster.

One of my customers sees the very same symptom (wrong Corefile on master-1) in their cluster, along with very high CPU load (~2.3 cores) for this exact pod and frequent "i/o timeout" messages in the coredns container logs.

When I manually correct the Corefile by replacing 127.0.0.53 with an actual DNS resolver IP (in my case 10.1.0.1), these messages disappear and the CPU load drops back to a normal 0.002 cores.
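
To spot this condition without reading the whole file, a small check can flag any forward upstream that is a loopback address. The following is a minimal Python sketch, not part of runtimecfg or CoreDNS. The function name `loopback_forwarders` and the sample strings are made up for illustration, and it only understands plain IP upstreams written on the `forward` line itself.

```python
import ipaddress


def loopback_forwarders(corefile_text):
    """Return the upstreams of `forward` directives that are loopback IPs."""
    bad = []
    for line in corefile_text.splitlines():
        tokens = line.split()
        if len(tokens) < 3 or tokens[0] != "forward":
            continue
        # tokens[1] is the zone ("."); upstreams follow until the opening brace.
        for tok in tokens[2:]:
            if tok == "{":
                break
            try:
                addr = ipaddress.ip_address(tok)
            except ValueError:
                continue  # not a bare IP (file path, host:port, tls://...); ignored here
            if addr.is_loopback:
                bad.append(tok)
    return bad


# Illustrative samples modelled on the forward rules in this report.
BAD = """. {
    forward . 127.0.0.53 {
        policy sequential
    }
}"""
GOOD = BAD.replace("127.0.0.53", "10.1.0.1")

print(loopback_forwarders(BAD))   # ['127.0.0.53']
print(loopback_forwarders(GOOD))  # []
```

The text to check could be the content of /etc/coredns/Corefile on the affected node.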

Related to /issues/978.

Version

I am running OKD 4.9.0 IPI on vSphere 6.7:

[root@localhost ocp-install]# oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.okd-2021-11-28-035710   True        False         45d     Cluster version is 4.9.0-0.okd-2021-11-28-035710

How reproducible

The CPU load stays constantly high, even after a pod restart. Only one master node is affected.

Log bundle

master-1 (bad config)

The coredns-monitor logs of the master-1 pod show that its runtimecfg utility renders a faulty Corefile with 127.0.0.53 in the forward rule.

$ oc logs coredns-lab4-h9zq6-master-1 coredns-monitor
time="2022-01-12T12:59:19Z" level=info msg="Runtimecfg rendering template" path=/etc/coredns/Corefile
time="2022-01-12T13:08:20Z" level=info msg="Node change detected, rendering Corefile" Node Addresses="[{10.1.2.189 lab4-h9zq6-master-0 false} {10.1.2.190 lab4-h9zq6-master-1 false} {10.1.2.188 lab4-h9zq6-master-2 false} {10.1.2.205 lab4-h9zq6-worker-dlr5x false} {10.1.2.203 lab4-h9zq6-worker-k8dfd false} {10.1.2.207 lab4-h9zq6-worker-m5lqk false} {10.1.2.209 lab4-h9zq6-worker-v95g9 false}]"
time="2022-01-12T13:08:20Z" level=info msg=". {"
time="2022-01-12T13:08:20Z" level=info msg="    errors"
time="2022-01-12T13:08:20Z" level=info msg="    bufsize 512"
time="2022-01-12T13:08:20Z" level=info msg="    health :18080"
time="2022-01-12T13:08:20Z" level=info msg="    forward . 127.0.0.53 {"
time="2022-01-12T13:08:20Z" level=info msg="        policy sequential"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="    cache 30"
time="2022-01-12T13:08:20Z" level=info msg="    reload"
time="2022-01-12T13:08:20Z" level=info msg="    template IN A lab4.company.corp {"
time="2022-01-12T13:08:20Z" level=info msg="        match .*.apps.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.1.4.2\""
time="2022-01-12T13:08:20Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="    template IN AAAA lab4.company.corp {"
time="2022-01-12T13:08:20Z" level=info msg="        match .*.apps.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="    template IN A lab4.company.corp {"
time="2022-01-12T13:08:20Z" level=info msg="        match api.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.1.4.1\""
time="2022-01-12T13:08:20Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="    template IN AAAA lab4.company.corp {"
time="2022-01-12T13:08:20Z" level=info msg="        match api.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="    template IN A lab4.company.corp {"
time="2022-01-12T13:08:20Z" level=info msg="        match api-int.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.1.4.1\""
time="2022-01-12T13:08:20Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="    template IN AAAA lab4.company.corp {"
time="2022-01-12T13:08:20Z" level=info msg="        match api-int.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="    hosts {"
time="2022-01-12T13:08:20Z" level=info msg="        10.1.2.189 lab4-h9zq6-master-0 lab4-h9zq6-master-0.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        10.1.2.190 lab4-h9zq6-master-1 lab4-h9zq6-master-1.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        10.1.2.188 lab4-h9zq6-master-2 lab4-h9zq6-master-2.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        10.1.2.205 lab4-h9zq6-worker-dlr5x lab4-h9zq6-worker-dlr5x.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        10.1.2.203 lab4-h9zq6-worker-k8dfd lab4-h9zq6-worker-k8dfd.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        10.1.2.207 lab4-h9zq6-worker-m5lqk lab4-h9zq6-worker-m5lqk.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        10.1.2.209 lab4-h9zq6-worker-v95g9 lab4-h9zq6-worker-v95g9.lab4.company.corp"
time="2022-01-12T13:08:20Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="}"
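
The rendered Corefile can be pulled back out of log lines like these for inspection. The sketch below is hypothetical tooling, not something shipped with OKD. It assumes the `msg="..."` log format shown above, assumes the server block starts with `. {` and ends with an unindented `}`, and only undoes simple backslash escapes such as `\"`. The name `rendered_corefile` and the shortened sample are illustrative.

```python
import re

# Matches the quoted msg field, allowing backslash-escaped characters inside.
MSG_RE = re.compile(r'msg="((?:[^"\\]|\\.)*)"')


def rendered_corefile(log_text):
    """Return the most recently logged Corefile as plain text."""
    lines, inside = [], False
    for raw in log_text.splitlines():
        match = MSG_RE.search(raw)
        if not match:
            continue  # e.g. a bare `level=info` line without msg
        msg = re.sub(r"\\(.)", r"\1", match.group(1))  # drop the escaping backslash
        if msg == ". {":
            lines, inside = [], True  # a new render replaces any earlier one
        if inside:
            lines.append(msg)
            if msg == "}":
                inside = False
    return "\n".join(lines)


# Shortened sample in the same shape as the master-1 log above.
SAMPLE = r'''time="2022-01-12T13:08:20Z" level=info msg="Runtimecfg rendering template" path=/etc/coredns/Corefile
time="2022-01-12T13:08:20Z" level=info msg=". {"
time="2022-01-12T13:08:20Z" level=info msg="    forward . 127.0.0.53 {"
time="2022-01-12T13:08:20Z" level=info msg="        policy sequential"
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="    template IN A lab4.company.corp {"
time="2022-01-12T13:08:20Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.1.4.2\""
time="2022-01-12T13:08:20Z" level=info msg="    }"
time="2022-01-12T13:08:20Z" level=info msg="}"
'''

print(rendered_corefile(SAMPLE))
```

The input could be the saved output of `oc logs coredns-lab4-h9zq6-master-1 coredns-monitor`.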

When I manually run the coredns-monitor command inside the running container, it renders a correct configuration:

$ oc exec -it coredns-lab4-h9zq6-master-1 -c coredns-monitor -- bash
[root@lab4-h9zq6-master-1 /]# runtimecfg render --verbose /var/lib/kubelet/kubeconfig  --api-vip 10.1.4.1 --ingress-vip 10.1.4.2 /config --out-dir /tmp/test/
INFO[0000] . {
INFO[0000]     errors
INFO[0000]     bufsize 512
INFO[0000]     health :18080
INFO[0000]     forward . 10.1.0.1 {
INFO[0000]         policy sequential
INFO[0000]     }
INFO[0000]     cache 30
INFO[0000]     reload
INFO[0000]     template IN A lab4.company.corp {
INFO[0000]         match .*.apps.lab4.company.corp
INFO[0000]         answer "{{ .Name }} 60 in {{ .Type }} 10.1.4.2"
INFO[0000]         fallthrough
INFO[0000]     }
INFO[0000]     template IN AAAA lab4.company.corp {
INFO[0000]         match .*.apps.lab4.company.corp
INFO[0000]         fallthrough
INFO[0000]     }
INFO[0000]     template IN A lab4.company.corp {
INFO[0000]         match api.lab4.company.corp
INFO[0000]         answer "{{ .Name }} 60 in {{ .Type }} 10.1.4.1"
INFO[0000]         fallthrough
INFO[0000]     }
INFO[0000]     template IN AAAA lab4.company.corp {
INFO[0000]         match api.lab4.company.corp
INFO[0000]         fallthrough
INFO[0000]     }
INFO[0000]     template IN A lab4.company.corp {
INFO[0000]         match api-int.lab4.company.corp
INFO[0000]         answer "{{ .Name }} 60 in {{ .Type }} 10.1.4.1"
INFO[0000]         fallthrough
INFO[0000]     }
INFO[0000]     template IN AAAA lab4.company.corp {
INFO[0000]         match api-int.lab4.company.corp
INFO[0000]         fallthrough
INFO[0000]     }
INFO[0000]     hosts {
INFO[0000]         fallthrough
INFO[0000]     }
INFO[0000] }
INFO[0000]
INFO[0000] Runtimecfg rendering template                 path=/tmp/test/Corefile
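
This report does not establish where the monitor picked up 127.0.0.53. That address is the one used by the systemd-resolved stub listener, and on hosts running systemd-resolved the real upstream servers are listed in /run/systemd/resolve/resolv.conf. So one thing worth checking is which nameservers each resolv.conf on the node contains. The following Python sketch separates loopback entries from real upstreams. It is an illustration only and makes no claim about how runtimecfg actually selects its forwarders; `split_nameservers` and the sample contents are hypothetical.

```python
import ipaddress


def split_nameservers(resolv_conf_text):
    """Split `nameserver` entries into (loopback, non_loopback) lists."""
    stub, upstream = [], []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#;":
            continue  # blank line or comment
        fields = stripped.split()
        if len(fields) < 2 or fields[0] != "nameserver":
            continue  # search, options, ...
        try:
            addr = ipaddress.ip_address(fields[1])
        except ValueError:
            continue  # unparsable address, skipped in this sketch
        (stub if addr.is_loopback else upstream).append(fields[1])
    return stub, upstream


# Illustrative file contents, not taken from the affected node.
STUB_STYLE = """# stub resolver
nameserver 127.0.0.53
options edns0 trust-ad
search lab4.company.corp
"""
UPSTREAM_STYLE = """nameserver 10.1.0.1
search lab4.company.corp
"""

print(split_nameservers(STUB_STYLE))      # (['127.0.0.53'], [])
print(split_nameservers(UPSTREAM_STYLE))  # ([], ['10.1.0.1'])
```

If a file yields only loopback entries, a forwarder list built from it would point CoreDNS at the local stub rather than at 10.1.0.1.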

master-0 (good config) (same for master-2)

For comparison, this is what the logs tell me for coredns-monitor on the other masters. The configuration looks good.

$ oc logs coredns-lab4-h9zq6-master-0 coredns-monitor
time="2022-01-12T12:59:43Z" level=info msg="Runtimecfg rendering template" path=/etc/coredns/Corefile
time="2022-01-12T13:08:43Z" level=info msg="Node change detected, rendering Corefile" Node Addresses="[{10.1.2.189 lab4-h9zq6-master-0 false} {10.1.2.190 lab4-h9zq6-master-1 false} {10.1.2.188 lab4-h9zq6-master-2 false} {10.1.2.205 lab4-h9zq6-worker-dlr5x false} {10.1.2.203 lab4-h9zq6-worker-k8dfd false} {10.1.2.207 lab4-h9zq6-worker-m5lqk false} {10.1.2.209 lab4-h9zq6-worker-v95g9 false}]"
time="2022-01-12T13:08:43Z" level=info msg=". {"
time="2022-01-12T13:08:43Z" level=info msg="    errors"
time="2022-01-12T13:08:43Z" level=info msg="    bufsize 512"
time="2022-01-12T13:08:43Z" level=info msg="    health :18080"
time="2022-01-12T13:08:43Z" level=info msg="    forward . 10.1.0.1 {"
time="2022-01-12T13:08:43Z" level=info msg="        policy sequential"
time="2022-01-12T13:08:43Z" level=info msg="    }"
time="2022-01-12T13:08:43Z" level=info msg="    cache 30"
time="2022-01-12T13:08:43Z" level=info msg="    reload"
time="2022-01-12T13:08:43Z" level=info msg="    template IN A company.corp {"
time="2022-01-12T13:08:43Z" level=info msg="        match .*.apps.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.1.4.2\""
time="2022-01-12T13:08:43Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:43Z" level=info msg="    }"
time="2022-01-12T13:08:43Z" level=info msg="    template IN AAAA company.corp {"
time="2022-01-12T13:08:43Z" level=info msg="        match .*.apps.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:43Z" level=info msg="    }"
time="2022-01-12T13:08:43Z" level=info msg="    template IN A company.corp {"
time="2022-01-12T13:08:43Z" level=info msg="        match api.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.1.4.1\""
time="2022-01-12T13:08:43Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:43Z" level=info msg="    }"
time="2022-01-12T13:08:43Z" level=info msg="    template IN AAAA company.corp {"
time="2022-01-12T13:08:43Z" level=info msg="        match api.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:43Z" level=info msg="    }"
time="2022-01-12T13:08:43Z" level=info msg="    template IN A company.corp {"
time="2022-01-12T13:08:43Z" level=info msg="        match api-int.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        answer \"{{ .Name }} 60 in {{ .Type }} 10.1.4.1\""
time="2022-01-12T13:08:43Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:43Z" level=info msg="    }"
time="2022-01-12T13:08:43Z" level=info msg="    template IN AAAA company.corp {"
time="2022-01-12T13:08:43Z" level=info msg="        match api-int.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:43Z" level=info msg="    }"
time="2022-01-12T13:08:43Z" level=info msg="    hosts {"
time="2022-01-12T13:08:43Z" level=info msg="        10.1.2.189 lab4-h9zq6-master-0 lab4-h9zq6-master-0.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        10.1.2.190 lab4-h9zq6-master-1 lab4-h9zq6-master-1.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        10.1.2.188 lab4-h9zq6-master-2 lab4-h9zq6-master-2.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        10.1.2.205 lab4-h9zq6-worker-dlr5x lab4-h9zq6-worker-dlr5x.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        10.1.2.203 lab4-h9zq6-worker-k8dfd lab4-h9zq6-worker-k8dfd.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        10.1.2.207 lab4-h9zq6-worker-m5lqk lab4-h9zq6-worker-m5lqk.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        10.1.2.209 lab4-h9zq6-worker-v95g9 lab4-h9zq6-worker-v95g9.company.corp"
time="2022-01-12T13:08:43Z" level=info msg="        fallthrough"
time="2022-01-12T13:08:43Z" level=info msg="    }"
time="2022-01-12T13:08:43Z" level=info msg="}"
time="2022-01-12T13:08:43Z" level=info
time="2022-01-12T13:08:43Z" level=info msg="Runtimecfg rendering template" path=/etc/coredns/Corefile
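
To compare the configuration rendered on two masters line by line, a plain text diff is enough. Below is a minimal Python sketch using the standard `difflib` module. The helper name `corefile_diff` and the shortened samples are illustrative. Note that in the logs as pasted here, the template and hosts lines also differ in their domain suffix (lab4.company.corp on master-1, company.corp on master-0), so a diff of the full output would list those lines too, not only the forward rule.

```python
import difflib


def corefile_diff(a_text, b_text, a_name="master-1", b_name="master-0"):
    """Return only the changed lines between two Corefiles."""
    diff = difflib.unified_diff(
        a_text.splitlines(),
        b_text.splitlines(),
        fromfile=a_name,
        tofile=b_name,
        lineterm="",
        n=0,  # no context lines, only the differences
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Shortened samples that differ only in the forward rule.
MASTER_1 = ". {\n    forward . 127.0.0.53 {\n        policy sequential\n    }\n    cache 30\n}"
MASTER_0 = MASTER_1.replace("127.0.0.53", "10.1.0.1")

for line in corefile_diff(MASTER_1, MASTER_0):
    print(line)
# -    forward . 127.0.0.53 {
# +    forward . 10.1.0.1 {
```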

jwhb mentioned this issue Jan 17, 2022

coredns-monitor creates Corefile with 127.0.0.53 forwarder openshift/master-dns-operator#12

Closed

okd-project locked and limited conversation to collaborators Jan 17, 2022

vrutkovs converted this issue into discussion #1074 Jan 17, 2022
