CoreDNS uses a lot of CPU #978

Closed
Gjonni opened this issue Nov 17, 2021 · 43 comments · Fixed by openshift/okd-machine-os#323

Comments


Gjonni commented Nov 17, 2021

Describe the bug
I noticed that the static coredns pods constantly consume a lot of CPU, roughly one core per pod.
This anomaly is not present in version 4.7 and is especially noticeable when little CPU is available on the hypervisor.
Is this normal?
I had to fall back to OKD 4.7 or OpenShift 4.8 and cannot update, because the problem recurs.
The problem occurs on different oVirt 4.8 hosts, and DNS itself is working correctly.

Version
oVirt 4.8
IPI installation - 4.8.0-0.okd-2021-11-14-052418, 4.8.0-0.okd-2021-10-24-061736; I think all 4.8 versions are affected.

How reproducible
Install OKD 4.8 on any oVirt 4.8.

Log bundle
NAME CPU(cores) MEMORY(bytes)
coredns-okd4-rckvz-master-0 1105m 1062Mi
coredns-okd4-rckvz-master-1 456m 1049Mi
coredns-okd4-rckvz-master-2 632m 910Mi
coredns-okd4-rckvz-worker-gbbvk 621m 1052Mi
coredns-okd4-rckvz-worker-whtfz 861m 1312Mi
coredns-okd4-rckvz-worker-zpt4s 1074m 989Mi

[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:45491->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:59614->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:49738->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54284->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:49187->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:56365->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:34011->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57325->127.0.0.53:53: i/o timeout

@vrutkovs (Member)

Please attach (or upload to a public file-sharing service) the must-gather archive.


Gjonni commented Nov 17, 2021

Yes, of course. I need to reinstall version 4.8 first.


Gjonni commented Nov 17, 2021

Do you need everything? The archives are several MB.


@vrutkovs (Member)

read udp 127.0.0.1:45491->127.0.0.53:53: i/o timeout

This is odd: it's forwarding requests to systemd-resolved, although we effectively disable it.
@sandrobonazzola could we assign someone to look into this?
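
For reference, one way to check what the node-level Corefile actually forwards to (a sketch, assuming cluster-admin access; <node-name> is a placeholder):

oc debug node/<node-name> -- chroot /host grep -A3 "forward" /etc/coredns/Corefile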

@sandrobonazzola

@janosdebugs @eslutsky can you please look into this one?


ghost commented Nov 19, 2021

Unfortunately, I don't have a test setup for OKD at the moment. These issues happen fairly frequently with OpenShift on RHV, but only when the underlying infrastructure has a problem (e.g. packet loss, upstream DNS issues, etc.). Please make sure none of these apply; worst case, we can set up a call to remote-debug the problem.
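
One quick sanity check along those lines could be to query each upstream resolver directly from a node (a sketch; 10.1.1.1 stands in for one of your upstream DNS servers):

dig @10.1.1.1 quay.io +time=2 +tries=1
ping -c 4 10.1.1.1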


Gjonni commented Nov 19, 2021

Hello,
if you prefer, I could arrange access to my workplace (contact me privately in that case).
In any case, the problems you mentioned do not seem to be present: in the same environment I have installed both OKD 4.7 (working) and OpenShift 4.8 (working).
The problem occurs only with OKD 4.8.
DNS is managed by FreeIPA in a standard configuration.
How can I get details of these errors if they do not occur on the other systems?


ghost commented Nov 19, 2021

Hey @Gjonni, I don't have any contact details for you. Could you please send me a calendar invite for next week to janos at redhat dot com? I work in the CET timezone.


lvlts commented Nov 19, 2021

@Gjonni I'm in a similar situation, but on VMware IPI, OKD 4.8.0-0.okd-2021-11-14-052418.

Same scenario: the coredns pods in the openshift-vsphere-infra namespace use far more CPU than they did on 4.7. The pod logs show MBs of:

[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:53139->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:51260->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54047->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:40334->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:32778->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:37877->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:36334->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:34354->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:50381->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:58928->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:52770->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:50407->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:51454->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:38480->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:58405->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:48860->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:34984->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57840->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:52732->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57393->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:52942->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:52834->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54305->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54458->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:58097->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:56115->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57762->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:46201->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54120->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:33597->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:56843->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:59435->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 grafana.com. AAAA: read udp 127.0.0.1:34326->127.0.0.53:53: i/o timeout


lvlts commented Nov 30, 2021

@Gjonni @janosdebugs I have since my last comment upgraded to 4.9.0-0.okd-2021-11-28-035710 from the stable okd channel. Same issues with CoreDNS as before.


Gjonni commented Nov 30, 2021

thanks, I'll try 4.9 right away


Gjonni commented Nov 30, 2021

Yes, it looks like the same problem to me. As soon as the installation completes, I will check again.

[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:48966->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:37669->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:43367->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:36889->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:47946->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:32902->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:38567->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:60056->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:34626->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:42197->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:49373->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:39438->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:39690->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:56093->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:53995->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:56841->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:37436->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:58107->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:40612->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:58691->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:59744->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:56944->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:45086->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:45324->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:43065->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:35424->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:57397->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:58627->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:40900->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:46024->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:46255->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:43513->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:41718->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:40535->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:40107->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:49389->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:51633->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:50546->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:47564->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:34757->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:33659->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:43360->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:51432->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:32883->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:43803->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:40362->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:57521->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:38082->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:41273->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:34101->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:52929->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:46132->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:38198->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:38447->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:55319->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:52467->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:46716->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:39586->127.0.0.53:53: i/o timeout


lvlts commented Nov 30, 2021

thanks, I'll try 4.9 right away

Yes, I also have the same issue in 4.9. This is what I was reporting earlier.


dnlwgnd commented Dec 8, 2021

Hi,
our fresh 4.9.0-0.okd-2021-11-28-035710 cluster on oVirt 4.4.9 (IPI) is also affected, with the same symptoms.


dnlwgnd commented Dec 9, 2021

Hi,
we use bind (named) on a VM in the same oVirt environment to provide DNS. That machine already has only its external IP set in resolv.conf. It was listening on both the loopback and the external interface; I changed it to listen only on the external interface, but this did not change the behavior of coredns.
Did you restart nodes or pods after your change?


Gjonni commented Dec 9, 2021

No, in fact it doesn't work: it is fine for about an hour and then it starts again.

I don't know what else to check


Gjonni commented Dec 11, 2021

Hi,
OK, maybe this is it:
IPv6 must be disabled on the CoreOS nodes.
I have created the following MachineConfig:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-openshift-machineconfig-master-kargs
spec:
  kernelArguments:
  - ipv6.disable=1

and for workers
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-openshift-machineconfig-worker-kargs
spec:
  kernelArguments:
  - ipv6.disable=1

and so far (after 2 hours) the problem has not reappeared.
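
Applying the two MachineConfigs and waiting for the pools to roll out looks roughly like this (the YAML filenames are placeholders):

oc apply -f 99-openshift-machineconfig-master-kargs.yaml
oc apply -f 99-openshift-machineconfig-worker-kargs.yaml
oc get mcp    # wait until the master and worker pools report UPDATED=True; nodes reboot one at a time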


dnlwgnd commented Dec 13, 2021

Hi,
today I applied the available update to 4.9.0-0.okd-2021-12-12-025847, but that did not improve the situation.
I then applied your proposed change (disabling IPv6) and indeed my coredns pods now use almost no CPU.

However, the HAProxy pods in the openshift-ovirt-infra namespace are now crash-looping, possibly because they try to bind to an IPv6 address:

[ALERT] 346/185535 (10) : Starting frontend main: cannot create listening socket [:::9445]
[ALERT] 346/185535 (10) : Starting proxy health_check_http_url: cannot create listening socket [:::50936]

Did you observe the same?


Gjonni commented Dec 13, 2021

Yes, I observed the same situation.

My next step, if possible, is to try to configure DHCP and DNS to use IPv6.


dnlwgnd commented Dec 14, 2021

I have now reverted the IPv6 changes by deleting the MachineConfig objects for both workers and masters, and interestingly the coredns pods now operate normally without excessive CPU. The HAProxy pods of course also went back to normal operation.
It looks like it is working now, but I don't know what the problem was or how it was solved.


jwhb commented Jan 17, 2022

@Gjonni can you please check the following?

Get the name of the pod where CPU load is high, then run:

POD_NAME=changeme
NODE_NAME=`oc get pod $POD_NAME -ojson | jq -r '.spec.nodeName'`
CONF_COMMAND=`oc get pod $POD_NAME -ojson | jq -r '.spec.initContainers[0].command | join(" ")'`
oc exec -it $POD_NAME -c coredns-monitor -- $CONF_COMMAND

Please compare that with the output of:

oc logs <POD_NAME> -c coredns-monitor

Do you see a block like this in the output of both commands?

    forward . 10.1.0.1 {
        policy sequential
    }


Gjonni commented Jan 17, 2022

I see the line, but it looks different between the two outputs:

oc exec -it $POD_NAME -c coredns-monitor -- $CONF_COMMAND
INFO[0000] forward . xx.xxx.xx.xx xx.xxx.xx.xx { <- i have 2 external dns
INFO[0000] policy sequential
INFO[0000] }

oc logs <POD_NAME> -c coredns-monitor

time="2022-01-04T14:12:21Z" level=info msg=" forward . 127.0.0.53 {"
time="2022-01-04T14:12:21Z" level=info msg=" policy sequential"
time="2022-01-04T14:12:21Z" level=info msg=" }"

@lukeelten

We have the same issues with all of our OKD clusters. We use vSphere 6.5.

CoreDNS generates a configuration that uses systemd-resolved as its upstream, and resolved in turn uses CoreDNS as its upstream, so we get a lookup loop that does not work.
A few machines generate the correct Corefile, but most don't. Provisioning a completely new machine also does not work properly. I have the feeling this is something like a race condition.
I deleted the Corefile on one machine and forced a reboot; after that it worked fine. I did the same on another node, where it didn't work.

I observed that the initContainer which generates the initial config works properly. For some reason the "coredns-monitor" container then detects a change on the node and generates a new Corefile which contains the faulty upstream DNS server.

Logs from coredns-monitor:

time="2022-01-18T15:50:41Z" level=info msg="Node change detected, rendering Corefile" Node Addresses="[{10.194.66.11 okd-adm-staging01-m8b6k-master-0 false} {10.194.66.12 okd-adm-staging01-m8b6k-master-1 false} {10.194.66.13 okd-adm-staging01-m8b6k-master-2 false} {10.194.66.26 okd-adm-staging01-m8b6k-worker-b5mth false} {10.194.66.20 okd-adm-staging01-m8b6k-worker-j7gh4 false} {10.194.66.25 okd-adm-staging01-m8b6k-worker-qc9b5 false}]"

... 

time="2022-01-18T15:50:41Z" level=info msg="    forward . 127.0.0.53 {"
time="2022-01-18T15:50:41Z" level=info msg="        policy sequential"
time="2022-01-18T15:50:41Z" level=info msg="    }"

...

time="2022-01-18T15:50:41Z" level=info msg="Runtimecfg rendering template" path=/etc/coredns/Corefile

Logs from init container:

...
time="2022-01-18T15:49:53Z" level=info msg="    forward . 195.***.***.*** 91.***.***.*** {"
time="2022-01-18T15:49:53Z" level=info msg="        policy sequential"
time="2022-01-18T15:49:53Z" level=info msg="    }"
...

@gabrilabs75

Hi all,
we noticed the same behaviour on our OKD 4.9 platform (4.9.0-0.okd-2022-01-14-230113), installed on RHV infrastructure
(4.4.8.5-0.4.el8ev).

$ oc adm top pod -n openshift-ovirt-infra

NAME CPU(cores) MEMORY(bytes)
coredns-9rg7l-infra-j6mbp 1377m 1443Mi
coredns-9rg7l-worker-pmdjx 659m 1503Mi
coredns-9rg7l-worker-s6lmh 572m 1245Mi

$ oc logs coredns-9rg7l-worker-pmdjx -n openshift-ovirt-infra -c coredns

[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:33658->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:33245->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:41446->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:34867->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:42423->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:48844->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:60112->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:39168->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:50727->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:56105->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:57345->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:55625->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:38818->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:47743->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:33922->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:37359->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:51707->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:55553->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:40122->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:37993->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:35273->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:51337->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:50352->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:59529->127.0.0.53:53: i/o timeout

We would appreciate any suggestions.
Regards,
Gabriele


aalgera commented Jan 19, 2022

I found that this problem affects only some of the worker nodes here.

As a workaround I tried the following:
edit the file /etc/coredns/Corefile on the affected nodes, changing line 5
from forward . 127.0.0.53 { to forward . 10.x.x.1 10.x.x.2 {. After this change the coredns pod on that node has to be recreated.

So far I have found that this workaround has to be applied again whenever nodes are added or removed, because the file /etc/coredns/Corefile gets rewritten by coredns-monitor.

@lukeelten

I found a (hacky) workaround which persists across node changes and updates. It may have some side effects on updates, but for now it solves the problem.

The Corefile template (located on each host at /etc/kubernetes/static-pod-resources/coredns/Corefile.tmpl) is written by a machine config, so I simply added my own machine config which overwrites the Corefile template and hardcodes the upstream DNS servers. Note that this is specific to each cluster, because the template contains cluster-specific domain names and IPs.
New Corefile template (pay attention to the cluster-specific parameters):

. {
    errors
    bufsize 512
    health :18080
    forward . 195.***.***.*** 91.***.***.*** {
        policy sequential
    }
    cache 30
    reload
    template IN {{ .Cluster.IngressVIPRecordType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match .*.apps.okd-adm-staging01.**CLUSTER DOMAIN**
        answer "{{"{{ .Name }}"}} 60 in {{"{{ .Type }}"}} 10.194.66.4"
        fallthrough
    }
    template IN {{ .Cluster.IngressVIPEmptyType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match .*.apps.okd-adm-staging01.**CLUSTER DOMAIN**
        fallthrough
    }
    template IN {{ .Cluster.APIVIPRecordType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match api.okd-adm-staging01.**CLUSTER DOMAIN**
        answer "{{"{{ .Name }}"}} 60 in {{"{{ .Type }}"}} 10.194.66.3"
        fallthrough
    }
    template IN {{ .Cluster.APIVIPEmptyType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match api.okd-adm-staging01.**CLUSTER DOMAIN**
        fallthrough
    }
    template IN {{ .Cluster.APIVIPRecordType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match api-int.okd-adm-staging01.**CLUSTER DOMAIN**
        answer "{{"{{ .Name }}"}} 60 in {{"{{ .Type }}"}} 10.194.66.3"
        fallthrough
    }
    template IN {{ .Cluster.APIVIPEmptyType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match api-int.okd-adm-staging01.**CLUSTER DOMAIN**
        fallthrough
    }
    hosts {
        {{- range .Cluster.NodeAddresses }}
        {{ .Address }} {{ .Name }} {{ .Name }}.{{ $.Cluster.Name }}.{{ $.Cluster.Domain }}
        {{- end }}
        fallthrough
    }
}

Machine Config

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 90-worker-fix-corefile
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,<URL ENCODED DATA OF COREFILE TEMPLATE>
        mode: 420
        overwrite: true
        path: /etc/kubernetes/static-pod-resources/coredns/Corefile.tmpl

It does not solve the root problem, but it is a stable workaround for now until the bug is fixed. It does carry the risk that the original Corefile template is changed in a future update and gets overwritten by this old version.
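
One possible way to produce the URL-encoded payload for the source: data:, field (a sketch, assuming the template is saved locally as Corefile.tmpl):

python3 -c 'import sys,urllib.parse; print("data:," + urllib.parse.quote(open(sys.argv[1]).read()))' Corefile.tmpl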


aalgera commented Jan 20, 2022

It looks like coredns-monitor obtains the nameserver information for the template from the file /var/run/NetworkManager/resolv.conf.

On the nodes with a faulty coredns this file contains just one nameserver entry:

nameserver 127.0.0.53

whereas on the other nodes it contains entries for the external IP of the node and the IPs of the external nameservers:

nameserver 10.0.10.101
nameserver 10.1.1.1
nameserver 10.1.1.2

I found that after correcting /var/run/NetworkManager/resolv.conf on the nodes with a faulty coredns, /etc/coredns/Corefile is updated automatically with the correct information and coredns starts behaving as expected.
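
To spot the affected nodes, a loop like the following could be used (a sketch, assuming cluster-admin access and oc debug):

for node in $(oc get nodes -o name); do
  echo "== $node"
  oc debug "$node" -- chroot /host cat /var/run/NetworkManager/resolv.conf 2>/dev/null
done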


Gjonni commented Feb 10, 2022

Indeed, that makes the problem disappear.
I also created the MachineConfig, specifying the infrastructure DNS servers, and the problem is solved.

Thank you


lvlts commented Feb 14, 2022

In my case, the upstream DNS did not know about the api-int.oscp.DOMAIN hostnames. This led to the weird behavior of coredns-monitor rendering an invalid template (using 127.0.0.53 as the upstream DNS instead of the cluster-defined ones).

Adding api-int.oscp.DOMAIN to the upstream DNS, configured identically to the api.oscp.DOMAIN endpoint, fixed the problem (after gracefully restarting the entire OKD cluster).

@uselessidbr

Hello!

Any update on this?

It is causing a lot of problems on our cluster.

As a workaround, I've changed the upstream DNS servers in the DNS cluster operator:

https://access.redhat.com/solutions/4765861
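
For reference, custom upstream forwarding on the DNS cluster operator is configured roughly like this (a sketch; the zone name and upstream IPs are placeholders, and note this tunes the cluster DNS CoreDNS rather than the node-level static coredns pods discussed above):

apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
  - name: example-upstream
    zones:
    - example.com
    forwardPlugin:
      upstreams:
      - 10.1.1.1
      - 10.1.1.2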


lvlts commented Mar 10, 2022

It's happening again in 4.10.0-0.okd-2022-03-07-131213

@vrutkovs would it be possible to investigate this after so many months of this issue occurring on so many different installations and versions? thank you!

@vrutkovs (Member)

Not sure what we need to investigate; IIUC,

In my case, the upstream DNS did not know about the api-int.oscp.DOMAIN hostnames

was the issue you were hitting.


lvlts commented Mar 11, 2022

Not sure what we need to investigate; IIUC,

In my case, the upstream DNS did not know about the api-int.oscp.DOMAIN hostnames

was the issue you were hitting.

I assumed it was the issue, but it was not. The issue started happening again, shortly after appearing fixed.

The issue is identical to what others describe here:

  • For the coredns pods, high CPU load due to the forwarder being set to 127.0.0.53 (systemd-resolved) instead of the upstream DNS servers
  • The init container (render-config-coredns) has the correct Corefile content
  • The coredns-monitor container renders it incorrectly, with 127.0.0.53 being set as the upstream DNS server

This causes error messages like the following in the coredns container:

[ERROR] plugin/errors: 2 2.fedora.pool.ntp.org. A: read udp 127.0.0.1:50317->127.0.0.53:53: i/o timeout

And really high CPU load for all coredns pods, high power consumption and a bunch of other issues.

@jcpowermac

I experienced this as well. Since I was testing the vmxnet3 driver, I wanted to make sure OpenShiftSDN (VXLAN) still worked correctly. If you build a cluster with OpenShiftSDN instead of OVN, the problem doesn't occur.
I have a story (https://issues.redhat.com/browse/SPLAT-446) in our backlog to investigate it further, but it might be a little while.

@vrutkovs if you want a cluster to take a look at I can spin one up for you in VMC.
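
For anyone who wants to try the OpenShiftSDN route mentioned above: the network plugin is chosen at install time in install-config.yaml, roughly like this sketch.

networking:
  networkType: OpenShiftSDN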

@vrutkovs (Member)

@fortinj66 has suggested running fix-resolv-conf-search.service after NetworkManager has built resolv.conf - see the linked PRs above.

@vrutkovs (Member)

Keeping this open to confirm that it's fixed.

@fortinj66 (Contributor)

Since we don't know when the next release will be, I came up with the following MachineConfigs, which do the equivalent. I have applied them successfully to all my 4.10 environments:

Masters:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-okd-fix-network-manager-resolv-conf
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - contents: "[Unit]\nDescription=Reset /var/run/NetworkManager/resolv.conf to use systemd created version\nWants=network-online.target \nAfter=network-online.target\nBefore=kubelet.service crio.service\n[Service]\n# Need oneshot to delay kubelet\nType=oneshot\nRemainAfterExit=yes\nExecStart=/usr/bin/cp /run/systemd/resolve/resolv.conf /var/run/NetworkManager/resolv.conf\n[Install]\nWantedBy=multi-user.target\n"
        enabled: true
        name: reset-nm-resolv-conf.service

Workers:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-okd-fix-network-manager-resolv-conf
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - contents: "[Unit]\nDescription=Reset /var/run/NetworkManager/resolv.conf to use systemd created version\nWants=network-online.target \nAfter=network-online.target\nBefore=kubelet.service crio.service\n[Service]\n# Need oneshot to delay kubelet\nType=oneshot\nRemainAfterExit=yes\nExecStart=/usr/bin/cp /run/systemd/resolve/resolv.conf /var/run/NetworkManager/resolv.conf\n[Install]\nWantedBy=multi-user.target\n"
        enabled: true
        name: reset-nm-resolv-conf.service

One difference is that I changed the unit scheduling to the following:

Wants=network-online.target 
After=network-online.target
Before=kubelet.service crio.service
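
For readability, the escaped contents string in both MachineConfigs decodes to this systemd unit:

[Unit]
Description=Reset /var/run/NetworkManager/resolv.conf to use systemd created version
Wants=network-online.target
After=network-online.target
Before=kubelet.service crio.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/cp /run/systemd/resolve/resolv.conf /var/run/NetworkManager/resolv.conf

[Install]
WantedBy=multi-user.target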


fortinj66 commented Apr 23, 2022

This is fixed with openshift/okd-machine-os#350.

Release 4.10.0-0.okd-2022-04-23-131357 includes this fix.


Gjonni commented May 11, 2022

I confirm that the problem is no longer present on 4.10.0-0.okd-2022-04-23-131357

thanks

Gjonni closed this as completed May 11, 2022
@alexanderbystrom

Hi,

We have the same issue again in 4.11.0-0.okd-2022-12-02-145640

coredns-monitor logs

2023-01-02T19:51:08.901000964Z time="2023-01-02T19:51:08Z" level=info msg="Resolv.conf change detected, rendering Corefile" DNS upstreams="[127.0.0.53]"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg=". {"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="    errors"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="    bufsize 512"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="    health :18080"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="    forward . 127.0.0.53 {"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="        policy sequential"
2023-01-02T19:51:08.905772908Z time="2023-01-02T19:51:08Z" level=info msg="    }"


klzsysy commented Jan 15, 2023

We have the same issue in 4.11.0-0.okd-2022-10-28-153352.

My temporary workaround:

  1. SSH into the node(s) running the failing coredns pod (there can be more than one)
  2. vi /etc/coredns/Corefile
  3. Replace forward . 127.0.0.53 with forward . <your-expected-dns>; the coredns pod reloads the config automatically
  4. systemctl restart NetworkManager-wait-online.service

@alexanderbystrom

(screenshot attached)
