CoreDNS uses a lot of CPU #978
Comments
Please attach (or upload to a public file-sharing service) the must-gather archive. |
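For reference, a minimal sketch of how such an archive is usually collected with standard oc tooling; the destination directory and archive names are just examples:
```bash
# Collect a must-gather archive and pack it for upload.
oc adm must-gather --dest-dir=./must-gather
tar czf must-gather.tar.gz ./must-gather
```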
Yes, of course. I need to reinstall version 4.8 first. |
Do you need everything? The files are several MB. |
This is odd; it's forwarding requests to systemd-resolved although we effectively disable it. |
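For reference, a quick way to spot this symptom in the logs; the namespace and pod name follow the examples elsewhere in this thread and are assumptions for your cluster:
```bash
# CoreDNS forwarding to systemd-resolved's stub listener shows up as i/o
# timeouts against 127.0.0.53 in the coredns container logs.
oc -n openshift-ovirt-infra logs coredns-okd4-rckvz-master-0 -c coredns | grep '127.0.0.53' | tail
```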
@janosdebugs @eslutsky can you please look into this one? |
Unfortunately, I don't have a test setup for OKD currently. Typically, these issues happen with OpenShift on RHV fairly frequently, but only when the underlying infrastructure has a problem (e.g. packet loss, upstream DNS issues). Please make sure none of these apply, and worst case we can set up a call to remote-debug the problem. |
Hello, |
Hey @Gjonni, I don't have any contact details for you. Could you please send me a calendar invite for next week to janos at redhat dot com? I work in the CET timezone. |
@Gjonni I'm in a similar situation, but on VMware IPI OKD. Same scenario:
|
@Gjonni @janosdebugs I have since my last comment upgraded to |
thanks, I'll try 4.9 right away |
Yes, it seems to be the same problem. As soon as the installation completes, I check again and see: [ERROR] plugin/errors: 2 cdn02.quay.io. A: read udp 127.0.0.1:48966->127.0.0.53:53: i/o timeout |
Yes, I also have the same issue in 4.9. This is what I was reporting earlier. |
hi, |
hi, |
No, in fact it doesn't work. I don't know what else to check |
Hi, apiVersion: machineconfiguration.openshift.io/v1
and for workers
and at the moment the problem does not occur (after 2 h). |
Hi. However, I now have a problem with the HAProxy pods in the openshift-ovirt-infra namespace crashlooping, possibly because they try to bind to an IPv6 address:
Did you observe the same? |
Yes, I observed the same situation. My next step, if possible, is to try to configure DHCP and DNS to use IPv6. |
I have now reverted the IPv6 changes by deleting the MachineConfig objects for both workers and masters, and interestingly the coredns pods now operate normally without excessive CPU. The haproxy pods also went back to normal operation. |
@Gjonni can you please check the following? Get the name of the pod where CPU load is high, then run: POD_NAME=changeme
NODE_NAME=`oc get pod $POD_NAME -ojson | jq -r '.spec.nodeName'`
CONF_COMMAND=`oc get pod $POD_NAME -ojson | jq -r '.spec.initContainers[0].command | join(" ")'`
oc exec -it $POD_NAME -c coredns-monitor -- $CONF_COMMAND Please compare that with the output of:
Do you see a line like this in the output of both commands:
|
I do see the line, but it looks different to me.
oc exec -it $POD_NAME -c coredns-monitor -- $CONF_COMMAND
oc logs <POD_NAME> -c coredns-monitor
time="2022-01-04T14:12:21Z" level=info msg=" forward . 127.0.0.53 {" |
We have the same issues with all of our OKD clusters. We use vSphere 6.5. CoreDNS generates a configuration which uses systemd-resolved as upstream, and resolved uses CoreDNS as upstream, so we get a lookup loop which does not work. I observed that the initContainer which generates the initial config works properly. For some reason the coredns-monitor pod then detects a change on the node and generates a new Corefile which contains the faulty upstream DNS server. Logs from coredns-monitor:
Logs from init container:
|
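For reference, a hedged sketch of how to check what the rewritten Corefile forwards to on a given node, using the host path /etc/coredns/Corefile mentioned later in this thread; the node name is a placeholder:
```bash
# On a looping node the forward plugin points at systemd-resolved
# ("forward . 127.0.0.53"); on a healthy node it lists the real upstreams.
oc debug node/okd4-rckvz-worker-gbbvk -- chroot /host grep -A3 'forward' /etc/coredns/Corefile
```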
Hi all,
NAME CPU(cores) MEMORY(bytes)
[ERROR] plugin/errors: AAAA: read udp 127.0.0.1:33658->127.0.0.53:53: i/o timeout We would appreciate any suggestions. |
I found that this problem is affecting only some of the worker nodes here. As a workaround I tried the following: So far I have found that this workaround has to be applied again whenever nodes change (adding/removing), because the file /etc/coredns/Corefile appears to be rewritten by coredns-monitor. |
I found a (hacky) workaround which is persistent across node changes and updates. It may have some side effects on updates, but for now it solves the problem. The Corefile template (located on each host at /etc/kubernetes/static-pod-resources/coredns/Corefile.tmpl) is written by a machine config, so I simply added my own MachineConfig which overwrites the Corefile template and has the upstream DNS servers hardcoded in it. Note that this is individual for each cluster, because the template contains cluster-specific domain names and IPs.
Machine Config
It does not solve the root problem, but it is a stable workaround for now until the bug is fixed. It does carry the risk that the original Corefile template may change during an update and would then be overwritten with an outdated version. |
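For anyone wanting to reproduce this approach, a hypothetical sketch of such a MachineConfig applied from a shell; the object name, role label, and the base64-encoded template contents are cluster-specific placeholders:
```bash
# Wrap a locally edited Corefile.tmpl in a MachineConfig so the MCO writes it
# to every master. Only the wrapping is shown; the template itself must come
# from your own cluster.
TEMPLATE_B64=$(base64 -w0 Corefile.tmpl)
cat <<EOF | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-coredns-corefile-tmpl
  labels:
    machineconfiguration.openshift.io/role: master
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/kubernetes/static-pod-resources/coredns/Corefile.tmpl
          mode: 420
          overwrite: true
          contents:
            source: data:text/plain;charset=utf-8;base64,${TEMPLATE_B64}
EOF
```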
It looks like coredns-monitor obtains the nameserver information for the template from the file /var/run/NetworkManager/resolv.conf. On the nodes with a faulty coredns this file contains just one nameserver entry,
whereas on the other nodes it contains entries with the external IP of the node and the IPs of the external nameservers.
I found that after correcting /var/run/NetworkManager/resolv.conf on the nodes with a faulty coredns, /etc/coredns/Corefile is updated automatically with the correct information and coredns starts behaving as expected. |
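A simple way to compare this file across nodes, as a sketch; the node name is a placeholder:
```bash
# Dump NetworkManager's generated resolv.conf on a suspect node and compare it
# with a healthy one; faulty nodes here showed a single wrong nameserver entry.
oc debug node/okd4-rckvz-worker-gbbvk -- chroot /host cat /var/run/NetworkManager/resolv.conf
```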
thus the problem actually disappears. Thank you |
In my case, the upstream DNS did not know about the Adding the |
Hello! Any update on this? It is causing a lot of problems on our cluster. I've changed the upstream DNS servers in the DNS cluster operator as a workaround: |
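For reference, a sketch of that kind of workaround against the DNS operator CR; the field names follow the dns.operator.openshift.io/v1 API and the address is a placeholder for your own upstream server:
```bash
# Point the cluster DNS operator at an explicit upstream instead of the
# node-provided resolver.
oc patch dns.operator/default --type=merge \
  -p '{"spec":{"upstreamResolvers":{"upstreams":[{"type":"Network","address":"192.0.2.10","port":53}]}}}'
```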
It's happening again in 4.10.0-0.okd-2022-03-07-131213. @vrutkovs, would it be possible to investigate this after so many months of this issue occurring on so many different installations and versions? Thank you! |
Not sure what we need to investigate; IIUC
was the issue you were hitting. |
I assumed it was the issue, but it was not. The issue started happening again, shortly after appearing fixed. The issue is identical to what others describe here:
This causes error messages like the following in the
And really high CPU load for all coredns pods, high power consumption and a bunch of other issues. |
I experienced this as well. Since I was testing the vmxnet3 driver, I wanted to make sure OpenShiftSDN (VXLAN) still worked correctly too. If you build a cluster with OpenShiftSDN instead of OVN, the problem doesn't occur. @vrutkovs, if you want a cluster to take a look at, I can spin one up for you in VMC. |
I think it's an MCO bug; the NM resolv prepender is supposed to set the correct DNS host there: https://github.com/openshift/machine-config-operator/blob/5094a10a1ba443cac399e44185e25674635328a6/templates/common/on-prem/files/NetworkManager-resolv-prepender.yaml#L55-L67 |
@fortinj66 has suggested running fix-resolv-conf-search.service after NM has built resolv.conf; see the linked PRs above. |
Keeping this open to confirm that it's fixed. |
Since we don't know when the next release will be, I came up with the following MachineConfigs, which do the equivalent. I have applied them successfully to all my 4.10 environments: Masters:
Workers:
One difference is that I changed the scheduling to the following:
|
This is fixed with openshift/okd-machine-os#350. Release 4.10.0-0.okd-2022-04-23-131357 includes this fix |
I confirm that the problem is no longer present on 4.10.0-0.okd-2022-04-23-131357. Thanks! |
Hi, we have the same issue again in 4.11.0-0.okd-2022-12-02-145640. coredns-monitor logs:
|
We have the same issue in 4.11.0-0.okd-2022-10-28-153352. My temporary solution:
|
Describe the bug
I noticed that the coredns static pods always consume a lot of CPU, about one core per pod.
This anomaly is not present in version 4.7 and is especially noticeable if you have little CPU available on the hypervisor.
Is this normal?
I had to install OKD 4.7 or OpenShift 4.8, and I cannot update because the problem recurs.
The problem occurs on different oVirt 4.8 servers, and DNS itself is working correctly.
Version
oVirt 4.8
IPI installation - 4.8.0-0.okd-2021-11-14-052418, 4.8.0-0.okd-2021-10-24-061736; I think all 4.8 versions are affected.
How reproducible
Install OKD 4.8 on any oVirt 4.8.
Log bundle
NAME CPU(cores) MEMORY(bytes)
coredns-okd4-rckvz-master-0 1105m 1062Mi
coredns-okd4-rckvz-master-1 456m 1049Mi
coredns-okd4-rckvz-master-2 632m 910Mi
coredns-okd4-rckvz-worker-gbbvk 621m 1052Mi
coredns-okd4-rckvz-worker-whtfz 861m 1312Mi
coredns-okd4-rckvz-worker-zpt4s 1074m 989Mi
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:45491->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:59614->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:49738->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:54284->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:49187->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:56365->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:34011->127.0.0.53:53: i/o timeout
[ERROR] plugin/errors: 2 . NS: read udp 127.0.0.1:57325->127.0.0.53:53: i/o timeout
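For reference, a sketch of the commands that produce measurements like the ones above; the namespace follows the oVirt on-prem infra pods mentioned earlier in this thread and is an assumption:
```bash
# CPU/memory of the coredns static pods (assumed namespace for oVirt IPI).
oc adm top pods -n openshift-ovirt-infra | grep coredns

# Count forward timeouts against systemd-resolved (127.0.0.53) per pod.
for p in $(oc -n openshift-ovirt-infra get pods -o name | grep coredns); do
  echo "$p: $(oc -n openshift-ovirt-infra logs "$p" -c coredns | grep -c '127.0.0.53:53')"
done
```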