CoreDNS uses a lot of CPU #978

Closed
Gjonni opened this issue Nov 17, 2021 · 43 comments · Fixed by openshift/okd-machine-os#323

Comments


Gjonni commented Nov 17, 2021

Describe the bug
I noticed that the static coredns pods constantly consume a lot of CPU, roughly one core per pod.
This anomaly is not present in version 4.7 and is especially noticeable when little CPU is available on the hypervisor.
Is this normal?
I had to fall back to OKD 4.7 or OpenShift 4.8 and cannot update, because the problem recurs.
The problem occurs on different oVirt 4.8 hosts, and DNS itself is working correctly.

Version
oVirt 4.8
IPI installation - 4.8.0-0.okd-2021-11-14-052418, 4.8.0-0.okd-2021-10-24-061736; I think all 4.8 versions are affected.

How reproducible
Install OKD 4.8 on any oVirt 4.8.

Log bundle
NAME CPU(cores) MEMORY(bytes)
coredns-okd4-rckvz-master-0 1105m 1062Mi
coredns-okd4-rckvz-master-1 456m 1049Mi
coredns-okd4-rckvz-master-2 632m 910Mi
coredns-okd4-rckvz-worker-gbbvk 621m 1052Mi
coredns-okd4-rckvz-worker-whtfz 861m 1312Mi
coredns-okd4-rckvz-worker-zpt4s 1074m 989Mi

[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:45491->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:59614->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:49738->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54284->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:49187->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:56365->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:34011->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57325->127.0.0.53:53: i/o timeout

@vrutkovs (Member)

Please attach (or upload to a public file-sharing service) the must-gather archive.


Gjonni commented Nov 17, 2021

Yes, of course. I need to reinstall version 4.8 first.


Gjonni commented Nov 17, 2021

Do you need everything? The archives are several MB.


@vrutkovs (Member)

read udp 127.0.0.1:45491->127.0.0.53:53: i/o timeout

This is odd: it's forwarding requests to systemd-resolved, although we effectively disable it.
@sandrobonazzola could we assign someone to look into this?
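
For reference, one way to check what the node-level Corefile actually forwards to (a sketch, assuming cluster-admin access; <node-name> is a placeholder):

oc debug node/<node-name> -- chroot /host grep -A3 "forward" /etc/coredns/Corefile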

@sandrobonazzola

@janosdebugs @eslutsky can you please look into this one?


ghost commented Nov 19, 2021

Unfortunately, I don't have a test setup for OKD at the moment. These issues happen fairly frequently with OpenShift on RHV, but only when the underlying infrastructure has a problem (e.g. packet loss, upstream DNS issues, etc.). Please make sure none of these apply; worst case, we can set up a call to remote-debug the problem.
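
One quick sanity check along those lines could be to query each upstream resolver directly from a node (a sketch; 10.1.1.1 stands in for one of your upstream DNS servers):

dig @10.1.1.1 quay.io +time=2 +tries=1
ping -c 4 10.1.1.1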


Gjonni commented Nov 19, 2021

Hello,
if you prefer, I could arrange access to my workplace (contact me privately in that case).
In any case, the problems you mentioned do not seem to be present: in the same environment I have installed both OKD 4.7 (working) and OpenShift 4.8 (working).
The problem occurs only with OKD 4.8.
DNS is managed by FreeIPA in a standard configuration.
How can I get details of these errors if they do not occur on the other systems?


ghost commented Nov 19, 2021

Hey @Gjonni, I don't have any contact details for you. Could you please send me a calendar invite for next week to janos at redhat dot com? I work in the CET timezone.


lvlts commented Nov 19, 2021

@Gjonni I'm in a similar situation, but on VMware IPI, OKD 4.8.0-0.okd-2021-11-14-052418.

Same scenario: the coredns pods in the openshift-vsphere-infra namespace use far more CPU than they did on 4.7. The pod logs show MBs of:

[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:53139->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:51260->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54047->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:40334->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:32778->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:37877->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:36334->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:34354->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:50381->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:58928->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:52770->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:50407->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:51454->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:38480->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:58405->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:48860->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:34984->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57840->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:52732->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57393->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:52942->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:52834->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54305->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54458->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:58097->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:56115->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57762->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:46201->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54120->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:33597->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:56843->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:59435->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 grafana.com. AAAA: read udp 127.0.0.1:34326->127.0.0.53:53: i/o timeout


lvlts commented Nov 30, 2021

@Gjonni @janosdebugs I have since my last comment upgraded to 4.9.0-0.okd-2021-11-28-035710 from the stable okd channel. Same issues with CoreDNS as before.


Gjonni commented Nov 30, 2021

thanks, I'll try 4.9 right away


Gjonni commented Nov 30, 2021

Yes, it looks like the same problem to me. As soon as the installation completes, I will check again.

[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:48966->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:37669->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:43367->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:36889->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:47946->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:32902->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:38567->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:60056->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:34626->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:42197->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:49373->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:39438->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:39690->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:56093->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:53995->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:56841->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:37436->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:58107->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:40612->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:58691->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:59744->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:56944->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:45086->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:45324->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:43065->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:35424->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:57397->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:58627->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:40900->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:46024->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:46255->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:43513->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:41718->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:40535->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:40107->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:49389->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:51633->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:50546->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:47564->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:34757->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:33659->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:43360->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:51432->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:32883->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:43803->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:40362->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:57521->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:38082->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:41273->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:34101->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:52929->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:46132->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:38198->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:38447->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:55319->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:52467->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. AAAA: read udp 127.0.0.1:46716->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:39586->127.0.0.53:53: i/o timeout


lvlts commented Nov 30, 2021

thanks, I'll try 4.9 right away

Yes, I also have the same issue in 4.9. This is what I was reporting earlier.


dnlwgnd commented Dec 8, 2021

Hi,
our fresh 4.9.0-0.okd-2021-11-28-035710 cluster on oVirt 4.4.9 (IPI) is also affected, with the same symptoms.


dnlwgnd commented Dec 9, 2021

Hi,
we use bind (named) on a VM in the same oVirt environment to provide DNS. That machine already has only its external IP set in resolv.conf. It was listening on both the loopback and the external interface; I changed it to listen only on the external interface, but this did not change the behavior of coredns.
Did you restart nodes or pods after your change?


Gjonni commented Dec 9, 2021

No, in fact it doesn't work: it is fine for about an hour and then it starts again.

I don't know what else to check


Gjonni commented Dec 11, 2021

Hi,
OK, maybe this is it:
IPv6 must be disabled on the CoreOS nodes.
I have created the following MachineConfig:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-openshift-machineconfig-master-kargs
spec:
  kernelArguments:
  - ipv6.disable=1

and for workers
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-openshift-machineconfig-worker-kargs
spec:
  kernelArguments:
  - ipv6.disable=1

and so far (after 2 hours) the problem has not reappeared.
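
Applying the two MachineConfigs and waiting for the pools to roll out looks roughly like this (the YAML filenames are placeholders):

oc apply -f 99-openshift-machineconfig-master-kargs.yaml
oc apply -f 99-openshift-machineconfig-worker-kargs.yaml
oc get mcp    # wait until the master and worker pools report UPDATED=True; nodes reboot one at a time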


dnlwgnd commented Dec 13, 2021

Hi,
today I applied the available update to 4.9.0-0.okd-2021-12-12-025847, but that did not improve the situation.
I then applied your proposed change (disabling IPv6) and indeed my coredns pods now use almost no CPU.

However, the HAProxy pods in the openshift-ovirt-infra namespace are now crash-looping, possibly because they try to bind to an IPv6 address:

[ALERT] 346/185535 (10) : Starting frontend main: cannot create listening socket [:::9445]
[ALERT] 346/185535 (10) : Starting proxy health_check_http_url: cannot create listening socket [:::50936]

Did you observe the same?


Gjonni commented Dec 13, 2021

Yes, I observed the same situation.

My next step, if possible, is to try to configure DHCP and DNS to use IPv6.


dnlwgnd commented Dec 14, 2021

I have now reverted the IPv6 changes by deleting the MachineConfig objects for both workers and masters, and interestingly the coredns pods now operate normally without excessive CPU. The HAProxy pods of course also went back to normal operation.
It looks like it is working now, but I don't know what the problem was or how it was solved.


jwhb commented Jan 17, 2022

@Gjonni can you please check the following?

Get the name of the pod where CPU load is high, then run:

POD_NAME=changeme
NODE_NAME=`oc get pod $POD_NAME -ojson | jq -r '.spec.nodeName'`
CONF_COMMAND=`oc get pod $POD_NAME -ojson | jq -r '.spec.initContainers[0].command | join(" ")'`
oc exec -it $POD_NAME -c coredns-monitor -- $CONF_COMMAND

Please compare that with the output of:

oc logs <POD_NAME> -c coredns-monitor

Do you see a block like this in the output of both commands?

    forward . 10.1.0.1 {
        policy sequential
    }


Gjonni commented Jan 17, 2022

I see the line, but it looks different between the two outputs:

oc exec -it $POD_NAME -c coredns-monitor -- $CONF_COMMAND
INFO[0000] forward . xx.xxx.xx.xx xx.xxx.xx.xx { <- i have 2 external dns
INFO[0000] policy sequential
INFO[0000] }

oc logs <POD_NAME> -c coredns-monitor

time="2022-01-04T14:12:21Z" level=info msg=" forward . 127.0.0.53 {"
time="2022-01-04T14:12:21Z" level=info msg=" policy sequential"
time="2022-01-04T14:12:21Z" level=info msg=" }"

@lukeelten

We have the same issues with all of our OKD clusters. We use vSphere 6.5.

CoreDNS generates a configuration that uses systemd-resolved as its upstream, and resolved in turn uses CoreDNS as its upstream, so we get a lookup loop that does not work.
A few machines generate the correct Corefile, but most don't. Provisioning a completely new machine also does not work properly. I have the feeling this is something like a race condition.
I deleted the Corefile on one machine and forced a reboot; after that it worked fine. I did the same on another node, where it didn't work.

I observed that the initContainer which generates the initial config works properly. For some reason the "coredns-monitor" container then detects a change on the node and generates a new Corefile which contains the faulty upstream DNS server.

Logs from coredns-monitor:

time="2022-01-18T15:50:41Z" level=info msg="Node change detected, rendering Corefile" Node Addresses="[{10.194.66.11 okd-adm-staging01-m8b6k-master-0 false} {10.194.66.12 okd-adm-staging01-m8b6k-master-1 false} {10.194.66.13 okd-adm-staging01-m8b6k-master-2 false} {10.194.66.26 okd-adm-staging01-m8b6k-worker-b5mth false} {10.194.66.20 okd-adm-staging01-m8b6k-worker-j7gh4 false} {10.194.66.25 okd-adm-staging01-m8b6k-worker-qc9b5 false}]"

... 

time="2022-01-18T15:50:41Z" level=info msg="    forward . 127.0.0.53 {"
time="2022-01-18T15:50:41Z" level=info msg="        policy sequential"
time="2022-01-18T15:50:41Z" level=info msg="    }"

...

time="2022-01-18T15:50:41Z" level=info msg="Runtimecfg rendering template" path=/etc/coredns/Corefile

Logs from init container:

...
time="2022-01-18T15:49:53Z" level=info msg="    forward . 195.***.***.*** 91.***.***.*** {"
time="2022-01-18T15:49:53Z" level=info msg="        policy sequential"
time="2022-01-18T15:49:53Z" level=info msg="    }"
...

@gabrilabs75

Hi all,
we noticed the same behaviour on our OKD 4.9 platform (4.9.0-0.okd-2022-01-14-230113), installed on RHV infrastructure
(4.4.8.5-0.4.el8ev).

$ oc adm top pod -n openshift-ovirt-infra

NAME CPU(cores) MEMORY(bytes)
coredns-9rg7l-infra-j6mbp 1377m 1443Mi
coredns-9rg7l-worker-pmdjx 659m 1503Mi
coredns-9rg7l-worker-s6lmh 572m 1245Mi

$ oc logs coredns-9rg7l-worker-pmdjx -n openshift-ovirt-infra -c coredns

[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:33658->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:33245->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:41446->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:34867->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:42423->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:48844->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:60112->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:39168->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:50727->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:56105->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:57345->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:55625->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:38818->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:47743->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:33922->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:37359->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:51707->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:55553->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:40122->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:37993->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:35273->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:51337->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:50352->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: A: read udp 127.0.0.1:59529->127.0.0.53:53: i/o timeout

We would appreciate any suggestions.
Regards,
Gabriele


aalgera commented Jan 19, 2022

I found that this problem affects only some of the worker nodes here.

As a workaround I tried the following:
edit the file /etc/coredns/Corefile on the affected nodes, changing line 5
from forward . 127.0.0.53 { to forward . 10.x.x.1 10.x.x.2 {. After this change the coredns pod on that node has to be recreated.

So far I have found that this workaround has to be applied again whenever nodes are added or removed, because the file /etc/coredns/Corefile gets rewritten by coredns-monitor.

@lukeelten

I found a (hacky) workaround which persists across node changes and updates. It may have some side effects on updates, but for now it solves the problem.

The Corefile template (located on each host at /etc/kubernetes/static-pod-resources/coredns/Corefile.tmpl) is written by a machine config, so I simply added my own machine config which overwrites the Corefile template and hardcodes the upstream DNS servers. Note that this is specific to each cluster, because the template contains cluster-specific domain names and IPs.
New Corefile template (pay attention to the cluster-specific parameters):

. {
    errors
    bufsize 512
    health :18080
    forward . 195.***.***.*** 91.***.***.*** {
        policy sequential
    }
    cache 30
    reload
    template IN {{ .Cluster.IngressVIPRecordType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match .*.apps.okd-adm-staging01.**CLUSTER DOMAIN**
        answer "{{"{{ .Name }}"}} 60 in {{"{{ .Type }}"}} 10.194.66.4"
        fallthrough
    }
    template IN {{ .Cluster.IngressVIPEmptyType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match .*.apps.okd-adm-staging01.**CLUSTER DOMAIN**
        fallthrough
    }
    template IN {{ .Cluster.APIVIPRecordType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match api.okd-adm-staging01.**CLUSTER DOMAIN**
        answer "{{"{{ .Name }}"}} 60 in {{"{{ .Type }}"}} 10.194.66.3"
        fallthrough
    }
    template IN {{ .Cluster.APIVIPEmptyType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match api.okd-adm-staging01.**CLUSTER DOMAIN**
        fallthrough
    }
    template IN {{ .Cluster.APIVIPRecordType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match api-int.okd-adm-staging01.**CLUSTER DOMAIN**
        answer "{{"{{ .Name }}"}} 60 in {{"{{ .Type }}"}} 10.194.66.3"
        fallthrough
    }
    template IN {{ .Cluster.APIVIPEmptyType }} okd-adm-staging01.**CLUSTER DOMAIN** {
        match api-int.okd-adm-staging01.**CLUSTER DOMAIN**
        fallthrough
    }
    hosts {
        {{- range .Cluster.NodeAddresses }}
        {{ .Address }} {{ .Name }} {{ .Name }}.{{ $.Cluster.Name }}.{{ $.Cluster.Domain }}
        {{- end }}
        fallthrough
    }
}

Machine Config

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 90-worker-fix-corefile
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,<URL ENCODED DATA OF COREFILE TEMPLATE>
        mode: 420
        overwrite: true
        path: /etc/kubernetes/static-pod-resources/coredns/Corefile.tmpl

It does not solve the root problem, but it is a stable workaround for now until the bug is fixed. It does carry the risk that the original Corefile template is changed in a future update and gets overwritten by this old version.
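
One possible way to produce the URL-encoded payload for the source: data:, field (a sketch, assuming the template is saved locally as Corefile.tmpl):

python3 -c 'import sys,urllib.parse; print("data:," + urllib.parse.quote(open(sys.argv[1]).read()))' Corefile.tmpl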


aalgera commented Jan 20, 2022

It looks like coredns-monitor obtains the nameserver information for the template from the file /var/run/NetworkManager/resolv.conf.

On the nodes with a faulty coredns this file contains just one nameserver entry:

nameserver 127.0.0.53

whereas on the other nodes it contains entries for the external IP of the node and the IPs of the external nameservers:

nameserver 10.0.10.101
nameserver 10.1.1.1
nameserver 10.1.1.2

I found that after correcting /var/run/NetworkManager/resolv.conf on the nodes with a faulty coredns, /etc/coredns/Corefile is updated automatically with the correct information and coredns starts behaving as expected.
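
To spot the affected nodes, a loop like the following could be used (a sketch, assuming cluster-admin access and oc debug):

for node in $(oc get nodes -o name); do
  echo "== $node"
  oc debug "$node" -- chroot /host cat /var/run/NetworkManager/resolv.conf 2>/dev/null
done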


Gjonni commented Feb 10, 2022

Indeed, that makes the problem disappear.
I also created the MachineConfig, specifying the infrastructure DNS servers, and the problem is solved.

Thank you


lvlts commented Feb 14, 2022

In my case, the upstream DNS did not know about the api-int.oscp.DOMAIN hostnames. This led to the weird behavior of coredns-monitor rendering an invalid template (using 127.0.0.53 as the upstream DNS instead of the cluster-defined ones).

Adding api-int.oscp.DOMAIN to the upstream DNS, configured identically to the api.oscp.DOMAIN endpoint, fixed the problem (after gracefully restarting the entire OKD cluster).

@uselessidbr

Hello!

Any update on this?

It is causing a lot of problems on our cluster.

As a workaround, I've changed the upstream DNS servers in the DNS cluster operator:

https://access.redhat.com/solutions/4765861
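
For reference, custom upstream forwarding on the DNS cluster operator is configured roughly like this (a sketch; the zone name and upstream IPs are placeholders, and note this tunes the cluster DNS CoreDNS rather than the node-level static coredns pods discussed above):

apiVersion: operator.openshift.io/v1
kind: DNS
metadata:
  name: default
spec:
  servers:
  - name: example-upstream
    zones:
    - example.com
    forwardPlugin:
      upstreams:
      - 10.1.1.1
      - 10.1.1.2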


lvlts commented Mar 10, 2022

It's happening again in 4.10.0-0.okd-2022-03-07-131213

@vrutkovs would it be possible to investigate this after so many months of this issue occurring on so many different installations and versions? thank you!

@vrutkovs (Member)

Not sure what we need to investigate; IIUC,

In my case, the upstream DNS did not know about the api-int.oscp.DOMAIN hostnames

was the issue you were hitting.


lvlts commented Mar 11, 2022

Not sure what we need to investigate; IIUC,

In my case, the upstream DNS did not know about the api-int.oscp.DOMAIN hostnames

was the issue you were hitting.

I assumed it was the issue, but it was not. The issue started happening again, shortly after appearing fixed.

The issue is identical to what others describe here:

  • For the coredns pods, high CPU load due to the forwarder being set to 127.0.0.53 (systemd-resolved) instead of the upstream DNS servers
  • The init container (render-config-coredns) has the correct Corefile content
  • The coredns-monitor container renders it incorrectly, with 127.0.0.53 being set as the upstream DNS server

This causes error messages like the following in the coredns container:

[ERROR] plugin/errors: 2 2.fedora.pool.ntp.org. A: read udp 127.0.0.1:50317->127.0.0.53:53: i/o timeout

And really high CPU load for all coredns pods, high power consumption and a bunch of other issues.

@jcpowermac

I experienced this as well. Since I was testing the vmxnet3 driver, I wanted to make sure OpenShiftSDN (VXLAN) still worked correctly. If you build a cluster with OpenShiftSDN instead of OVN, the problem doesn't occur.
I have a story (https://issues.redhat.com/browse/SPLAT-446) in our backlog to investigate it further, but it might be a little while.

@vrutkovs if you want a cluster to take a look at I can spin one up for you in VMC.
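
For anyone who wants to try the OpenShiftSDN route mentioned above: the network plugin is chosen at install time in install-config.yaml, roughly like this sketch.

networking:
  networkType: OpenShiftSDN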

@vrutkovs (Member)

@fortinj66 has suggested running fix-resolv-conf-search.service after NetworkManager has built resolv.conf - see the linked PRs above.

@vrutkovs (Member)

Keeping this open to confirm that it's fixed.

@fortinj66 (Contributor)

Since we don't know when the next release will be, I came up with the following MachineConfigs, which do the equivalent. I have applied them successfully to all my 4.10 environments:

Masters:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-okd-fix-network-manager-resolv-conf
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - contents: "[Unit]\nDescription=Reset /var/run/NetworkManager/resolv.conf to use systemd created version\nWants=network-online.target \nAfter=network-online.target\nBefore=kubelet.service crio.service\n[Service]\n# Need oneshot to delay kubelet\nType=oneshot\nRemainAfterExit=yes\nExecStart=/usr/bin/cp /run/systemd/resolve/resolv.conf /var/run/NetworkManager/resolv.conf\n[Install]\nWantedBy=multi-user.target\n"
        enabled: true
        name: reset-nm-resolv-conf.service

Workers:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-okd-fix-network-manager-resolv-conf
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - contents: "[Unit]\nDescription=Reset /var/run/NetworkManager/resolv.conf to use systemd created version\nWants=network-online.target \nAfter=network-online.target\nBefore=kubelet.service crio.service\n[Service]\n# Need oneshot to delay kubelet\nType=oneshot\nRemainAfterExit=yes\nExecStart=/usr/bin/cp /run/systemd/resolve/resolv.conf /var/run/NetworkManager/resolv.conf\n[Install]\nWantedBy=multi-user.target\n"
        enabled: true
        name: reset-nm-resolv-conf.service

One difference is that I changed the unit scheduling to the following:

Wants=network-online.target 
After=network-online.target
Before=kubelet.service crio.service
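
For readability, the escaped contents string in both MachineConfigs decodes to this systemd unit:

[Unit]
Description=Reset /var/run/NetworkManager/resolv.conf to use systemd created version
Wants=network-online.target
After=network-online.target
Before=kubelet.service crio.service

[Service]
# Need oneshot to delay kubelet
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/cp /run/systemd/resolve/resolv.conf /var/run/NetworkManager/resolv.conf

[Install]
WantedBy=multi-user.target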


fortinj66 commented Apr 23, 2022

This is fixed with openshift/okd-machine-os#350.

Release 4.10.0-0.okd-2022-04-23-131357 includes this fix.


Gjonni commented May 11, 2022

I confirm that the problem is no longer present on 4.10.0-0.okd-2022-04-23-131357

thanks

Gjonni closed this as completed May 11, 2022
@alexanderbystrom

Hi,

We have the same issue again in 4.11.0-0.okd-2022-12-02-145640

coredns-monitor logs

2023-01-02T19:51:08.901000964Z time="2023-01-02T19:51:08Z" level=info msg="Resolv.conf change detected, rendering Corefile" DNS upstreams="[127.0.0.53]"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg=". {"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="    errors"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="    bufsize 512"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="    health :18080"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="    forward . 127.0.0.53 {"
2023-01-02T19:51:08.905723361Z time="2023-01-02T19:51:08Z" level=info msg="        policy sequential"
2023-01-02T19:51:08.905772908Z time="2023-01-02T19:51:08Z" level=info msg="    }"


klzsysy commented Jan 15, 2023

We have the same issue in 4.11.0-0.okd-2022-10-28-153352.

My temporary workaround:

  1. SSH into the node(s) running the failing coredns pod (there can be more than one)
  2. vi /etc/coredns/Corefile
  3. Replace forward . 127.0.0.53 with forward . <your-expected-dns>; the coredns pod reloads the config automatically
  4. systemctl restart NetworkManager-wait-online.service

@alexanderbystrom

(screenshot attached)
