
coredns UDP service port point to TCP pod port when upgrading from 23.* #10860

Closed
Tracked by #10882
amogilny opened this issue Jan 31, 2024 · 34 comments · Fixed by #11185 · May be fixed by #10701
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@amogilny

What happened?

Upgraded Kubespray to v2.24.0, then applied the upgrade-cluster.yml playbook on a coredns-enabled Kubernetes cluster.

Checked issues regarding DNS and found report #10816

After examining the cluster configuration, I found that the coredns service has the following misconfiguration:

ports:
- name: dns
  port: 53
  protocol: UDP
  targetPort: dns-tcp

What did you expect to happen?

The cluster to be upgraded without any issues.

How can we reproduce it (as minimally and precisely as possible)?

Set dns_mode: coredns in group_vars/k8s_cluster/k8s-cluster.yml, then install or upgrade a cluster.
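
A sketch of that flow (inventory paths are illustrative; the exact upgrade command used is shown further down in this thread):

# Illustrative reproduction steps, assuming a standard Kubespray inventory layout.
grep dns_mode inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml   # expect: dns_mode: coredns
# Install with the previous release (v2.23.x), then upgrade with v2.24.0:
ansible-playbook -i inventory/mycluster/hosts.yaml --become cluster.yml
ansible-playbook -i inventory/mycluster/hosts.yaml --become upgrade-cluster.yml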

OS

MacOS X Sonoma

Version of Ansible

ansible --version                                                             
ansible [core 2.15.8]
  config file = /Users/bonemancer/git/kubespray-2.24.0/ansible.cfg
  configured module search path = ['/Users/bonemancer/git/kubespray-2.24.0/library']
  ansible python module location = /Users/bonemancer/git/kubespray-2.24.0/.venv/lib/python3.11/site-packages/ansible
  ansible collection location = /Users/bonemancer/.ansible/collections:/usr/share/ansible/collections
  executable location = /Users/bonemancer/git/kubespray-2.24.0/.venv/bin/ansible
  python version = 3.11.7 (main, Dec  4 2023, 18:10:11) [Clang 15.0.0 (clang-1500.1.0.2.5)] (/Users/bonemancer/git/kubespray-2.24.0/.venv/bin/python)
  jinja version = 3.1.2
  libyaml = True

Version of Python

Python 3.11.7

Version of Kubespray (commit)

64447e7

Network plugin used

calico

Full inventory with variables

No response

Command used to invoke ansible

No response

Output of ansible run

No response

Anything else we need to know

No response

@amogilny added the kind/bug label on Jan 31, 2024
@simon-wessel
Contributor

simon-wessel commented Jan 31, 2024

We are running into the same issue. The issue seems to have been introduced in #10617, but the problem does not originate there. There is an underlying problem that leads to a corrupted coredns service during the upgrade process.

From what we gathered, the process looks like this:

  1. kubespray patches CoreDNS
  2. kubeadm patches CoreDNS <- CoreDNS service gets corrupted
  3. kubespray patches CoreDNS again <- CoreDNS is back to normal

The theory here is that kubeadm patches the coredns service, which leaves it corrupted until kubespray overwrites it again. There might be some odd YAML merging going on that produces the corrupted result.
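
One plausible mechanism (an assumption, not confirmed in this thread): the Kubernetes API merges Service.spec.ports as a strategic-merge list keyed on port alone, so the two entries that both use port 53 (UDP and TCP) can be conflated during a patch, leaving the UDP entry with the TCP entry's targetPort. Roughly:

  # Illustrative only: both entries share the merge key (port: 53), so a patch
  # keyed on "port" alone may match the wrong entry and copy the TCP entry's
  # targetPort onto the UDP port.
  ports:
  - name: dns
    port: 53        # merge key collides with dns-tcp below
    protocol: UDP
  - name: dns-tcp
    port: 53        # same merge key
    protocol: TCP
    targetPort: dns-tcp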

The corrupted service contains a port that looks like this:

  ports:
  - name: dns
    port: 53
    protocol: UDP
    targetPort: dns-tcp # Here is the issue
  - name: dns-tcp
    port: 53
    protocol: TCP
    targetPort: 53
  - name: metrics
    port: 9153
    protocol: TCP
    targetPort: 9153

Our quick fix was removing the targetPort attributes in the coredns service template that were added in #10617.
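
A quick way to check whether a live cluster is affected (a sketch; assumes the service is named coredns in the kube-system namespace, as in a default Kubespray deployment):

  # Prints the targetPort of the UDP "dns" port; anything other than 53 or "dns" indicates the corruption.
  kubectl -n kube-system get svc coredns -o jsonpath='{.spec.ports[?(@.name=="dns")].targetPort}{"\n"}'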

@amogilny
Author

@simon-wessel Thanks for your fast reply.

I have two k8s clusters, one of which is a test cluster.
I will test the quick fix you provided and report back with the results.

@amogilny
Author

@simon-wessel

Hello again.
After commenting out the targetPort attributes in roles/kubernetes-apps/ansible/templates/coredns-svc.yml.j2, I get the following coredns configuration:

spec:
  ports:
    - name: dns
      protocol: UDP
      port: 53
      targetPort: 53
    - name: dns-tcp
      protocol: TCP
      port: 53
      targetPort: 53
    - name: metrics
      protocol: TCP
      port: 9153
      targetPort: 9153

As you said, it is a quick fix, and everything is working as expected. This workaround is suitable since there is no cluster interruption during the upgrade process. (It works because Kubernetes defaults targetPort to the value of port when targetPort is omitted.)

Here is the relevant part of roles/kubernetes-apps/ansible/templates/coredns-svc.yml.j2 with the commented-out lines:

spec:
  selector:
    k8s-app: kube-dns{{ coredns_ordinal_suffix }}
  clusterIP: {{ clusterIP }}
  ports:
    - name: dns
      port: 53
      protocol: UDP
#      targetPort: "dns"
    - name: dns-tcp
      port: 53
      protocol: TCP
#      targetPort: "dns-tcp"
    - name: metrics
      port: 9153
      protocol: TCP

@VannTen
Contributor

VannTen commented Feb 1, 2024

@simon-wessel do you have the logs of the kubeadm invocation? The addon/coredns phase should be skipped (see

## List of kubeadm init phases that should be skipped during control plane setup
## By default 'addon/coredns' is skipped
## 'addon/kube-proxy' gets skipped for some network plugins
kubeadm_init_phases_skip_default: [ "addon/coredns" ]
kubeadm_init_phases_skip: >-
  {%- if kube_network_plugin == 'kube-router' and (kube_router_run_service_proxy is defined and kube_router_run_service_proxy) -%}
  {{ kubeadm_init_phases_skip_default + ["addon/kube-proxy"] }}
  {%- elif kube_network_plugin == 'cilium' and (cilium_kube_proxy_replacement is defined and cilium_kube_proxy_replacement == 'strict') -%}
  {{ kubeadm_init_phases_skip_default + ["addon/kube-proxy"] }}
  {%- elif kube_network_plugin == 'calico' and (calico_bpf_enabled is defined and calico_bpf_enabled) -%}
  {{ kubeadm_init_phases_skip_default + ["addon/kube-proxy"] }}
  {%- elif kube_proxy_remove is defined and kube_proxy_remove -%}
  {{ kubeadm_init_phases_skip_default + ["addon/kube-proxy"] }}
  {%- else -%}
  {{ kubeadm_init_phases_skip_default }}
  {%- endif -%}
)

Do I understand correctly that the error is transient (only during the upgrade)? That would explain why it was not caught by CI.

Note: I think this was exposed by the cleanup in #10695, in addition to the dual-coredns changes.
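
For reference, a sketch of the kind of invocation being asked about (paths are illustrative; at init time the skip list is passed via --skip-phases, and the open question is whether the upgrade path applies the equivalent skip):

kubeadm init --config /etc/kubernetes/kubeadm-config.yaml --skip-phases=addon/coredns
# If addon/coredns is skipped, kubeadm should not touch the coredns Service at all.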

@VannTen
Contributor

VannTen commented Feb 6, 2024

I could not reproduce this on top of master by creating a new cluster and then upgrading with upgrade-cluster.yml (watching the coredns service with -w during the upgrade did not reveal any changes or mismatched ports).
@amogilny or @simon-wessel, could you share the exact steps to reproduce this?

@amogilny
Author

amogilny commented Feb 7, 2024

@VannTen Hello, I will try to test it this evening on my test cluster.

@amogilny
Author

amogilny commented Feb 8, 2024

@VannTen Hello, I have checked with master. No, it's still broken for the coredns setup.

@VannTen
Contributor

VannTen commented Feb 9, 2024

Could you share the exact steps to reproduce this?

(I haven't been able to reproduce it so far.)

@amogilny
Author

amogilny commented Feb 9, 2024

kubespray.tgz

@VannTen please find my configuration attached

The command I used to upgrade the cluster is as follows:

./.venv/bin/ansible-playbook -i inventory/drs/hosts.yaml --become upgrade-cluster.yml -e "@custom_vars.yaml"

@VannTen
Contributor

VannTen commented Feb 16, 2024

You're using a seven-node etcd cluster whose members are also the cluster nodes? Is that intended?

@amogilny
Author

You're using a seven-node etcd cluster whose members are also the cluster nodes? Is that intended?

Yes, that is intended. I wanted to increase the redundancy of the cluster that way, as I have only two master nodes.
Does this affect the DNS settings somehow?

@VannTen
Contributor

VannTen commented Feb 16, 2024 via email

@amogilny
Author

Not to my knowledge, that's just a bit unusual^.

I just read the documentation on this. I will rebuild my cluster so that it has 3 etcd nodes.

@dansou901

I can confirm the bug on a cluster with 3 control planes and 3 worker nodes, also deployed with dns_mode: coredns. I can also confirm the "workaround" of deleting the targetPorts from the coredns svc template. But after poking around in kubespray a bit, I still don't have a clue what is causing this...

@VannTen
Contributor

VannTen commented Mar 1, 2024

I can confirm the bug on a cluster with 3 control planes and 3 worker nodes, also deployed with dns_mode: coredns. I can also confirm the "workaround" of deleting the targetPorts from the coredns svc template. But after poking around in kubespray a bit, I still don't have a clue what is causing this...

On which version? Can you provide an inventory and a more precise description?

@VannTen
Contributor

VannTen commented Mar 1, 2024 via email

@dansou901

dansou901 commented Mar 1, 2024

I upgraded from the release-2.23 branch to the recently released v2.24.1 tag.
In the inventory, the group kube_control_plane contains three control plane nodes, the group etcd contains the same three control plane nodes and kube_node contains three worker nodes.
I've got two other clusters which are installed the same way, but with more nodes. I didn't upgrade them yet, but plan to do so next week (as they have production workloads on them). Those clusters are still on release-2.23 (k8s version 1.27.7).

@VannTen
Contributor

VannTen commented Mar 1, 2024

I'll try with that, see if I can finally get a reproducer, thanks 👍

@VannTen
Contributor

VannTen commented Mar 1, 2024

So the transition (which never fully completes) goes something like this:

[
  [
    {
      "name": "dns",
      "port": 53,
      "protocol": "UDP",
      "targetPort": 53
    },
    {
      "name": "dns-tcp",
      "port": 53,
      "protocol": "TCP",
      "targetPort": 53
    },
    {
      "name": "metrics",
      "port": 9153,
      "protocol": "TCP",
      "targetPort": 9153
    }
  ],
  "13:20:59"
]
[
  [
    {
      "name": "dns",
      "port": 53,
      "protocol": "UDP",
      "targetPort": "dns"
    },
    {
      "name": "dns-tcp",
      "port": 53,
      "protocol": "TCP",
      "targetPort": 53
    },
    {
      "name": "metrics",
      "port": 9153,
      "protocol": "TCP",
      "targetPort": 9153
    }
  ],
  "13:21:57"
]
[
  [
    {
      "name": "dns",
      "port": 53,
      "protocol": "UDP",
      "targetPort": "dns-tcp"
    },
    {
      "name": "dns-tcp",
      "port": 53,
      "protocol": "TCP",
      "targetPort": 53
    },
    {
      "name": "metrics",
      "port": 9153,
      "protocol": "TCP",
      "targetPort": 9153
    }
  ],
  "13:32:21"
]

The only problematic one is the last one. I guess the patching logic does not work as planned 🤔
I still need to pinpoint the exact tasks doing this.
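
For anyone wanting to capture the same kind of trace, a minimal sketch (assumes kubectl and jq are available; the timestamp format mirrors the output above):

# Streams every change to the coredns Service and prints [ports, time-of-change].
kubectl -n kube-system get svc coredns -w -o json | jq --unbuffered '[.spec.ports, (now | strftime("%H:%M:%S"))]'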

@VannTen
Contributor

VannTen commented Mar 1, 2024 via email

@VannTen
Contributor

VannTen commented Mar 1, 2024 via email

@VannTen
Contributor

VannTen commented Mar 4, 2024 via email

@poblahblahblah
Contributor

We're also seeing this, but one of our clusters ended up stuck in a broken state, with the coredns service's UDP TargetPort set to dns-tcp.

Here's how we found it:

~ ❯ k describe service coredns
Name:              coredns
Namespace:         kube-system
Labels:            addonmanager.kubernetes.io/mode=Reconcile
                   k8s-app=kube-dns
                   kubernetes.io/name=coredns
Annotations:       createdby: kubespray
                   prometheus.io/port: 9153
                   prometheus.io/scrape: true
Selector:          k8s-app=kube-dns
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.128.0.3
IPs:               10.128.0.3
Port:              dns  53/UDP
TargetPort:        dns-tcp/UDP
Endpoints:
Port:              dns-tcp  53/TCP
TargetPort:        53/TCP
Endpoints:         172.16.0.162:53,172.16.0.202:53,172.16.1.112:53 + 1 more...
Port:              metrics  9153/TCP
TargetPort:        9153/TCP
Endpoints:         172.16.0.162:9153,172.16.0.202:9153,172.16.1.112:9153 + 1 more...
Session Affinity:  None
Events:            <none>

Notice that the TargetPort is dns-tcp/UDP and its Endpoints are empty. We had to correct this manually.
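
For anyone stuck in the same state, a sketch of a one-off manual correction (it assumes the UDP dns port is the first entry in .spec.ports, as in the examples above; verify the index before patching):

# Point the UDP port back at container port 53.
kubectl -n kube-system patch svc coredns --type=json -p='[{"op": "replace", "path": "/spec/ports/0/targetPort", "value": 53}]'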

@VannTen
Contributor

VannTen commented Mar 6, 2024 via email

@k8s-ci-robot changed the title from "coredns setup is broken in Kubespray v2.24.0" to "coredns UDP service port point to TCP pod port when upgrading from 23.*" on Mar 6, 2024
@dansou901

[Quotes @amogilny's earlier comment above, showing the workaround with the targetPort lines commented out in coredns-svc.yml.j2.]

This is the workaround, as suggested by @amogilny.

@VannTen
Contributor

VannTen commented Mar 14, 2024 via email

@dv29

dv29 commented Mar 14, 2024

I think I have a different problem: I removed the target ports from the live manifest and it's still in a crash loop. I'm also using dns_mode: coredns.

For some reason, the coredns and nodelocaldns pods are in CrashLoopBackOff, and there are no logs as to why. I also set resolvconf_mode: none so that there is no resolver loop, but I'm still not sure what's going on.
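
Some generic triage steps that may help narrow that down (plain kubectl, nothing Kubespray-specific; the pod name is a placeholder):

kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide
kubectl -n kube-system describe pod <coredns-pod-name>     # check Events and Last State for the exit reason
kubectl -n kube-system logs <coredns-pod-name> --previous  # logs from the previous (crashed) container, if any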

@sathieu
Contributor

sathieu commented Apr 4, 2024

What is the status of this issue? This is very annoying and causes major impact on cluster upgrades.

There is #10701, which would fit as a long-term solution, but what about using server-side apply as a short-term one?
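
For illustration, the short-term idea would look something like this (a sketch; Kubespray applies manifests through its own module, so the real change would live there):

# Server-side apply lets the API server perform the merge and take field ownership,
# avoiding the client-side strategic-merge behaviour suspected above.
kubectl apply --server-side --force-conflicts -f coredns-svc.yml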

@VannTen
Contributor

VannTen commented Apr 4, 2024 via email

@sathieu
Contributor

sathieu commented Apr 9, 2024

Can the CoreDNS setup be moved to kubeadm?

EDIT: this is currently skipped

@VannTen
Contributor

VannTen commented Apr 9, 2024 via email

@sathieu
Contributor

sathieu commented Apr 10, 2024

@VannTen You are right.

@VannTen
Contributor

VannTen commented May 13, 2024

So, this should be fixed in master by #10701.
However, backporting that PR to release-2.24 could be tricky. WDYT of reverting #10617 (the root cause) for the release-2.24 branch (and only that branch) instead? @mzaian @yankay @MrFreezeex @cyclinder

Alternatively, we could also tweak the current kube custom Ansible module, but I have no idea how much work that would involve.
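
A sketch of the branch-only revert being proposed (the commit reference is a placeholder):

git checkout release-2.24
git revert <commit-of-PR-10617>   # add -m 1 if the PR landed as a merge commit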

@mzaian
Contributor

mzaian commented May 13, 2024

+1 to reverting #10617 for the release-2.24 branch. It's a lot of work to backport #10701 to the soon-to-be-old releases.
