
external-dns v0.13.5 trying to create CNAME records after upgrading, leading to CrashLoopBackOff #3714

Closed
rl0nergan opened this issue Jun 20, 2023 · 17 comments
Labels
kind/bug, lifecycle/rotten

Comments

@rl0nergan

What happened: After upgrading external-dns from 0.13.4 to 0.13.5, it began trying to create CNAME records instead of A records like it had been previously. The external-dns pod then went into CrashLoopBackOff due to a "Modification Conflict" error.

What you expected to happen: External-dns would continue to create A records after an upgrade and not crash.

How to reproduce it (as minimally and precisely as possible): Have multiple

Anything else we need to know?:

Environment: Kubernetes cluster on v1.26

  • External-DNS version (use external-dns --version): 0.13.5
  • DNS provider: Akamai
  • Others:
    Logs:
time="2023-06-15T20:01:45Z" level=info msg="Instantiating new Kubernetes client"
time="2023-06-15T20:01:45Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2023-06-15T20:01:45Z" level=info msg="Created Kubernetes client https://10.233.0.1:443"
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=argocd.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=prometheus.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=loki.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=teleport.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=argocd.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/argocd/argocd-empty-ingress\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-argocd.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/argocd/argocd-empty-ingress\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=prometheus.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/prometheus/kube-prometheus-kube-prome-prometheus\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-prometheus.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/prometheus/kube-prometheus-kube-prome-prometheus\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=loki.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/loki/loki\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-loki.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/loki/loki\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=teleport.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/teleport-cluster/teleport-cluster\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-teleport.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/teleport-cluster/teleport-cluster\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=error msg="Failed to create endpoints for DNS zone example.org. Error: Modification Confict: [Duplicate record set found with name loki.example.org and type TXT]"
time="2023-06-15T20:01:47Z" level=fatal msg="Modification Confict: [Duplicate record set found with name loki.example.org and type TXT]"
@szuecs
Contributor

szuecs commented Jun 21, 2023

Please share all of the args used to start external-dns and the resources that lead external-dns to create these records. We also need the ingress status, since it contains the target, and we need to know whether there are two resources that want different targets and which kind of source you use.

@rl0nergan
Author

rl0nergan commented Jun 26, 2023

Args used to start external-dns:

    Args:
      --log-level=debug
      --log-format=text
      --interval=1m
      --source=service
      --source=ingress
      --policy=sync
      --registry=txt
      --domain-filter=example.org
      --provider=akamai

Some example resources:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: vault-production
    cert-manager.io/common-name: prometheus.example.com
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls.options: traefik-mtls@kubernetescrd
  creationTimestamp: "2023-06-05T21:46:46Z"
  generation: 1
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 46.4.1
    chart: kube-prometheus-stack-46.4.1
    heritage: Helm
    release: kube-prometheus
  name: kube-prometheus-kube-prome-prometheus
  namespace: prometheus
  resourceVersion: "1806450"
  uid: 8dd5092c-c323-4437-ad24-45dcd2f31cf8
spec:
  ingressClassName: traefik
  rules:
  - host: prometheus.example.org
    http:
      paths:
      - backend:
          service:
            name: kube-prometheus-kube-prome-prometheus
            port:
              number: 9090
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - prometheus.example.org
    secretName: prometheus-tls
status:
  loadBalancer:
    ingress:
    - hostname: 12-34-567-89.example.org
      ip: 12.34.567.89
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: teleport.example.org
  creationTimestamp: "2023-06-05T20:38:56Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app.kubernetes.io/component: proxy
    app.kubernetes.io/instance: teleport-cluster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: teleport-cluster
    app.kubernetes.io/version: 13.0.0-alpha.2-amd64
    helm.sh/chart: teleport-cluster-13.0.3
    teleport.dev/majorVersion: "13"
  name: teleport-cluster
  namespace: teleport-cluster
  resourceVersion: "1784576"
  uid: d7b0f713-a259-403c-a77b-5286d9afb1cf
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.233.51.128
  clusterIPs:
  - 10.233.51.128
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: tls
    nodePort: 31767
    port: 443
    protocol: TCP
    targetPort: 3080
  selector:
    app.kubernetes.io/component: proxy
    app.kubernetes.io/instance: teleport-cluster
    app.kubernetes.io/name: teleport-cluster
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - hostname: 12-34-34-123.example.org
      ip: 12.34.34.123

We've seen it fail when trying to create records for both Ingress and Service type objects, without us making any changes other than upgrading the external-dns version.

@johngmyers
Contributor

I don't see loki.example.org in the provided example resources. So I don't see how those resources could have created those records.

@amold1

amold1 commented Jun 28, 2023

@johngmyers I see similar behavior.

In my case, I create a KIND cluster with a Service that has an annotation (external-dns.alpha.kubernetes.io/hostname: some.example.org) for external-dns to create the record. Then I delete the cluster completely. I then recreate the cluster with the same Service and annotation.

But because external-dns did not have a chance to delete the previously created entry, it goes into a CrashLoopBackOff state.

If I delete the Service first, let external-dns delete the entry, and then destroy and recreate the cluster, it works as expected.
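A minimal sketch of that sequence (service.yaml here is a placeholder for a manifest containing the annotated Service; the cluster name is arbitrary):

    kind create cluster --name edns-test
    kubectl apply -f service.yaml   # Service annotated with external-dns.alpha.kubernetes.io/hostname: some.example.org
    # wait for external-dns to create the records, then tear down without letting it clean up
    kind delete cluster --name edns-test
    kind create cluster --name edns-test
    kubectl apply -f service.yaml   # same Service again; external-dns v0.13.5 then crashloops on the leftover records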

@johngmyers
Contributor

@amold1 Please supply a reproducible test case, complete with server arguments, Kubernetes resources, any other initial conditions, actual behavior, and expected behavior.

@nakamume

nakamume commented Jul 7, 2023

@johngmyers I was also affected by this on v0.13.5, here are the steps to reproduce

  1. Use external-dns v0.11.0 (that's what I used previously; v0.13.4 might work as well, as others have pointed out). To reproduce, we start with an older version and then upgrade to v0.13.5.

    # external-dns args
      - args:
        - --log-level=info
        - --log-format=json
        - --interval=30s
        - --source=service
        - --source=ingress
        - --policy=sync
        - --registry=txt
        - --txt-owner-id=xxxxxxxxxxx
        - --domain-filter=example.com
        - --provider=aws
    
  2. Set ip-address-type: dualstack for the ingress (if you don't have dualstack networking set up, you can first create the ingress without this annotation and, once the LB is provisioned, add it - the alb-controller will fail to reconcile further, but that should be okay)

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      annotations:
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/ip-address-type: dualstack
        kubernetes.io/ingress.class: alb
      name: external-dns-test-failure
    spec:
      rules:
      - host: external-dns-test.example.com
        http:
          paths:
          - backend:
              service:
                name: external-dns-test-canary
                port:
                  name: http
            path: /*
            pathType: ImplementationSpecific
  3. Once the dualstack annotation is added, external-dns v0.11.0 creates three records: TXT, A, and AAAA

    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com A [Id: /hostedzone/XXXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com AAAA [Id: /hostedzone/XXXXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    
  4. Now upgrade the controller to v0.13.5; it tries to create two TXT records with a cname- prefix, fails, and goes into CrashLoopBackOff (see the diagnostic sketch after the environment details below)

    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=info msg="Desired change: CREATE cname-external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXXX]"
    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=info msg="Desired change: CREATE cname-external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXXXX]"
    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=error msg="Failure in zone example.com. [Id: /hostedzone/XXXXXXXXXX] when submitting change batch: InvalidChangeBatch: [The request contains an invalid set of changes for a resource record set 'TXT cname-external-dns-test.example.com.']\n\tstatus code: 400, request id: d78cd4b1-0514-4eac-bfe1-bae08e3c071d"
    

EKS: 1.23
aws-load-balancer-controller: v2.5.3
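To confirm whether the type-prefixed registry record already exists in the zone before the failing change batch, a hedged check with the AWS CLI (the hosted zone ID is a placeholder):

    aws route53 list-resource-record-sets \
      --hosted-zone-id ZXXXXXXXXXXX \
      --query "ResourceRecordSets[?Type=='TXT' && contains(Name, 'external-dns-test')]"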

@nakamume

nakamume commented Jul 7, 2023

The two TXT records it was trying to CREATE were exactly the same (I tested using a custom image with additional logic) - so maybe some issue with the deduplication logic?

BTW, I tried master (commit: 92824f4) and that didn't result in this behavior.

@johngmyers
Contributor

If this isn't reproducing on master, there's little reason to investigate.

@nakamume

nakamume commented Jul 7, 2023

Did a little more digging; it seems commit 1bd3834 fixed the issue for me.

@rl0nergan
Author

Hey @johngmyers, sorry about that. Here's the loki ingress resource we're using.

apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    annotations:
      cert-manager.io/cluster-issuer: vault-production
      cert-manager.io/common-name: loki.example.org
      traefik.ingress.kubernetes.io/router.entrypoints: websecure
      traefik.ingress.kubernetes.io/router.tls.options: traefik-mtls@kubernetescrd
    creationTimestamp: "2023-05-11T18:17:42Z"
    generation: 1
    labels:
      app.kubernetes.io/instance: loki
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: loki
      app.kubernetes.io/version: 2.8.2
      helm.sh/chart: loki-5.8.4
    name: loki
    namespace: loki
    resourceVersion: "18129523"
    uid: cd8da11a-f2be-418b-b15c-d4e3c1be4eae
  spec:
    ingressClassName: traefik
    rules:
    - host: loki.example.org
      http:
        paths:
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /api/prom/tail
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api/v1/tail
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /api/prom/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api/v1/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /prometheus/api/v1/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /prometheus/api/v1/alerts
          pathType: Prefix
        - backend:
            service:
              name: loki-write
              port:
                number: 3100
          path: /api/prom/push
          pathType: Prefix
        - backend:
            service:
              name: loki-write
              port:
                number: 3100
          path: /loki/api/v1/push
          pathType: Prefix
    tls:
    - hosts:
      - loki.example.org
      secretName: loki-distributed-tls
  status:
    loadBalancer:
      ingress:
      - hostname: 12-34-567-89.example.org
        ip: 12.34.567.89

In our case, we're running multiple clusters with workloads provisioned via Argo CD and have seen the same error occur, but with different resources mentioned depending on what external-dns tries to reconcile first.

@maxkokocom

Same issue with the Google provider on v0.13.5 as well. Downgrading to v0.13.4 helped.
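For reference, a minimal sketch of pinning the image back (assuming the deployment, container, and namespace are all named external-dns; adjust to your setup):

    kubectl -n external-dns set image deployment/external-dns \
      external-dns=registry.k8s.io/external-dns/external-dns:v0.13.4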

@johngmyers
Contributor

@joaocc That's not a CNAME record, as reported in the initial description. That's a TXT record and is expected behavior.

@joaocc

joaocc commented Sep 27, 2023

@johngmyers You are correct. Will remove my comment to avoid future confusion. Sorry for the misunderstanding. Thx

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Mar 29, 2024