
external-dns v0.13.5 trying to create CNAME records after upgrading, leading to CrashLoopBackOff #3714

Closed
rl0nergan opened this issue Jun 20, 2023 · 17 comments
Labels
kind/bug, lifecycle/rotten

Comments

@rl0nergan

What happened: After upgrading external-dns from 0.13.4 to 0.13.5, it began trying to create CNAME records instead of A records like it had been previously. The external-dns pod then went into CrashLoopBackOff due to a "Modification Conflict" error.

What you expected to happen: External-dns would continue to create A records after an upgrade and not crash.

How to reproduce it (as minimally and precisely as possible): Have multiple

Anything else we need to know?:

Environment: Kubernetes cluster on v1.26

  • External-DNS version (use external-dns --version): 0.13.5
  • DNS provider: Akamai
  • Others:
    Logs:
time="2023-06-15T20:01:45Z" level=info msg="Instantiating new Kubernetes client"
time="2023-06-15T20:01:45Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2023-06-15T20:01:45Z" level=info msg="Created Kubernetes client https://10.233.0.1:443"
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=argocd.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=prometheus.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=loki.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=teleport.example.org target="[cname.example.org]" ttl=600 type=CNAME zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=argocd.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/argocd/argocd-empty-ingress\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-argocd.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/argocd/argocd-empty-ingress\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=prometheus.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/prometheus/kube-prometheus-kube-prome-prometheus\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-prometheus.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/prometheus/kube-prometheus-kube-prome-prometheus\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=loki.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/loki/loki\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-loki.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=ingress/loki/loki\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=teleport.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/teleport-cluster/teleport-cluster\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=info msg="Creating recordsets" record=cname-teleport.example.org target="[\"heritage=external-dns,external-dns/owner=default,external-dns/resource=service/teleport-cluster/teleport-cluster\"]" ttl=600 type=TXT zone=example.org
time="2023-06-15T20:01:47Z" level=error msg="Failed to create endpoints for DNS zone example.org. Error: Modification Confict: [Duplicate record set found with name loki.example.org and type TXT]"
time="2023-06-15T20:01:47Z" level=fatal msg="Modification Confict: [Duplicate record set found with name loki.example.org and type TXT]"
@szuecs
Contributor

szuecs commented Jun 21, 2023

Please share all of the args used to start external-dns and the resources that lead external-dns to create these records. We also need the ingress status, since it contains the target, and we need to know whether there are two resources that want different targets and which kind of source you use.

@rl0nergan
Author

rl0nergan commented Jun 26, 2023

Args used to start external-dns:

    Args:
      --log-level=debug
      --log-format=text
      --interval=1m
      --source=service
      --source=ingress
      --policy=sync
      --registry=txt
      --domain-filter=example.org
      --provider=akamai

Some example resources:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    cert-manager.io/cluster-issuer: vault-production
    cert-manager.io/common-name: prometheus.example.com
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls.options: traefik-mtls@kubernetescrd
  creationTimestamp: "2023-06-05T21:46:46Z"
  generation: 1
  labels:
    app: kube-prometheus-stack-prometheus
    app.kubernetes.io/instance: kube-prometheus
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 46.4.1
    chart: kube-prometheus-stack-46.4.1
    heritage: Helm
    release: kube-prometheus
  name: kube-prometheus-kube-prome-prometheus
  namespace: prometheus
  resourceVersion: "1806450"
  uid: 8dd5092c-c323-4437-ad24-45dcd2f31cf8
spec:
  ingressClassName: traefik
  rules:
  - host: prometheus.example.org
    http:
      paths:
      - backend:
          service:
            name: kube-prometheus-kube-prome-prometheus
            port:
              number: 9090
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - prometheus.example.org
    secretName: prometheus-tls
status:
  loadBalancer:
    ingress:
    - hostname: 12-34-567-89.example.org
      ip: 12.34.567.89
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    external-dns.alpha.kubernetes.io/hostname: teleport.example.org
  creationTimestamp: "2023-06-05T20:38:56Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  labels:
    app.kubernetes.io/component: proxy
    app.kubernetes.io/instance: teleport-cluster
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: teleport-cluster
    app.kubernetes.io/version: 13.0.0-alpha.2-amd64
    helm.sh/chart: teleport-cluster-13.0.3
    teleport.dev/majorVersion: "13"
  name: teleport-cluster
  namespace: teleport-cluster
  resourceVersion: "1784576"
  uid: d7b0f713-a259-403c-a77b-5286d9afb1cf
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 10.233.51.128
  clusterIPs:
  - 10.233.51.128
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: tls
    nodePort: 31767
    port: 443
    protocol: TCP
    targetPort: 3080
  selector:
    app.kubernetes.io/component: proxy
    app.kubernetes.io/instance: teleport-cluster
    app.kubernetes.io/name: teleport-cluster
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
    - hostname: 12-34-34-123.example.org
      ip: 12.34.34.123

We've seen it fail when trying to create records for both Ingress and Service type objects, without us making any changes other than upgrading the external-dns version.

@johngmyers
Contributor

I don't see loki.example.org in the provided example resources. So I don't see how those resources could have created those records.

@amold1

amold1 commented Jun 28, 2023

@johngmyers I see similar behavior.

In my case, I create a KIND cluster with a Service that has an annotation (external-dns.alpha.kubernetes.io/hostname: some.example.org) for external-dns to create the record. Then I delete the cluster completely. I then recreate the cluster with the same Service and annotation.

But because external-dns did not have a chance to delete the previously created entry, it goes into a CrashLoopBackOff state.

If I delete the Service first, let external-dns delete the entry, and then destroy and recreate the cluster, it works as expected.
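A minimal sketch of that sequence (service.yaml here is a placeholder for a manifest containing the annotated Service; the cluster name is arbitrary):

    kind create cluster --name edns-test
    kubectl apply -f service.yaml   # Service annotated with external-dns.alpha.kubernetes.io/hostname: some.example.org
    # wait for external-dns to create the records, then tear down without letting it clean up
    kind delete cluster --name edns-test
    kind create cluster --name edns-test
    kubectl apply -f service.yaml   # same Service again; external-dns v0.13.5 then crashloops on the leftover records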

@johngmyers
Contributor

@amold1 Please supply a reproducible test case, complete with server arguments, Kubernetes resources, any other initial conditions, actual behavior, and expected behavior.

@nakamume

nakamume commented Jul 7, 2023

@johngmyers I was also affected by this on v0.13.5, here are the steps to reproduce

  1. Use external-dns v0.11.0 (that's what I used previously; v0.13.4 might work as well, as others have pointed out). To reproduce, we start with an older version and then upgrade to v0.13.5.

    # external-dns args
      - args:
        - --log-level=info
        - --log-format=json
        - --interval=30s
        - --source=service
        - --source=ingress
        - --policy=sync
        - --registry=txt
        - --txt-owner-id=xxxxxxxxxxx
        - --domain-filter=example.com
        - --provider=aws
    
  2. Set ip-address-type: dualstack for the ingress (if you don't have dualstack networking set up, you can first create the ingress without this annotation and, once the LB is provisioned, add it - the alb-controller will fail to reconcile further, but that should be okay)

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      annotations:
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}]'
        alb.ingress.kubernetes.io/scheme: internet-facing
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/ip-address-type: dualstack
        kubernetes.io/ingress.class: alb
      name: external-dns-test-failure
    spec:
      rules:
      - host: external-dns-test.example.com
        http:
          paths:
          - backend:
              service:
                name: external-dns-test-canary
                port:
                  name: http
            path: /*
            pathType: ImplementationSpecific
  3. Once the dualstack annotation is added, external-dns v0.11.0 creates three records: TXT, A, and AAAA

    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com A [Id: /hostedzone/XXXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com AAAA [Id: /hostedzone/XXXXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    external-dns-f4f649bdf-2swsp external-dns {"level":"info","msg":"Desired change: CREATE external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXX]","time":"2023-07-07T02:04:50Z"}
    
  4. Now upgrade the controller to v0.13.5; it tries to create two TXT records with a cname- prefix, fails, and goes into CrashLoopBackOff (see the diagnostic sketch after the environment details below)

    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=info msg="Desired change: CREATE cname-external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXXX]"
    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=info msg="Desired change: CREATE cname-external-dns-test.example.com TXT [Id: /hostedzone/XXXXXXXXXXX]"
    external-dns-57f9b9d9d7-dkp6q external-dns time="2023-07-07T01:40:00Z" level=error msg="Failure in zone example.com. [Id: /hostedzone/XXXXXXXXXX] when submitting change batch: InvalidChangeBatch: [The request contains an invalid set of changes for a resource record set 'TXT cname-external-dns-test.example.com.']\n\tstatus code: 400, request id: d78cd4b1-0514-4eac-bfe1-bae08e3c071d"
    

EKS: 1.23
aws-load-balancer-controller: v2.5.3
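To confirm whether the type-prefixed registry record already exists in the zone before the failing change batch, a hedged check with the AWS CLI (the hosted zone ID is a placeholder):

    aws route53 list-resource-record-sets \
      --hosted-zone-id ZXXXXXXXXXXX \
      --query "ResourceRecordSets[?Type=='TXT' && contains(Name, 'external-dns-test')]"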

@nakamume

nakamume commented Jul 7, 2023

The two TXT records it was trying to CREATE were exactly the same (I tested using a custom image with additional logic) - so maybe some issue with the deduplication logic?

BTW, I tried master (commit: 92824f4) and that didn't result in this behavior.

@johngmyers
Contributor

If this isn't reproducing on master, there's little reason to investigate.

@nakamume

nakamume commented Jul 7, 2023

Did a little more digging; it seems commit 1bd3834 fixed the issue for me.

@rl0nergan
Author

Hey @johngmyers, sorry about that. Here's the loki ingress resource we're using.

apiVersion: v1
items:
- apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    annotations:
      cert-manager.io/cluster-issuer: vault-production
      cert-manager.io/common-name: loki.example.org
      traefik.ingress.kubernetes.io/router.entrypoints: websecure
      traefik.ingress.kubernetes.io/router.tls.options: traefik-mtls@kubernetescrd
    creationTimestamp: "2023-05-11T18:17:42Z"
    generation: 1
    labels:
      app.kubernetes.io/instance: loki
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: loki
      app.kubernetes.io/version: 2.8.2
      helm.sh/chart: loki-5.8.4
    name: loki
    namespace: loki
    resourceVersion: "18129523"
    uid: cd8da11a-f2be-418b-b15c-d4e3c1be4eae
  spec:
    ingressClassName: traefik
    rules:
    - host: loki.example.org
      http:
        paths:
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /api/prom/tail
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api/v1/tail
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /api/prom/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /loki/api/v1/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /prometheus/api/v1/rules
          pathType: Prefix
        - backend:
            service:
              name: loki-read
              port:
                number: 3100
          path: /prometheus/api/v1/alerts
          pathType: Prefix
        - backend:
            service:
              name: loki-write
              port:
                number: 3100
          path: /api/prom/push
          pathType: Prefix
        - backend:
            service:
              name: loki-write
              port:
                number: 3100
          path: /loki/api/v1/push
          pathType: Prefix
    tls:
    - hosts:
      - loki.example.org
      secretName: loki-distributed-tls
  status:
    loadBalancer:
      ingress:
      - hostname: 12-34-567-89.example.org
        ip: 12.34.567.89

In our case, we're running multiple clusters with workloads provisioned via Argo CD and have seen the same error occur, but with different resources mentioned depending on what external-dns tries to reconcile first.

@maxkokocom

Same issue with the Google provider on v0.13.5 as well. Downgrading to v0.13.4 helped.
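For reference, a minimal sketch of pinning the image back (assuming the deployment, container, and namespace are all named external-dns; adjust to your setup):

    kubectl -n external-dns set image deployment/external-dns \
      external-dns=registry.k8s.io/external-dns/external-dns:v0.13.4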

@johngmyers
Contributor

@joaocc That's not a CNAME record, as reported in the initial description. That's a TXT record and is expected behavior.

@joaocc

joaocc commented Sep 27, 2023

@johngmyers You are correct. Will remove my comment to avoid future confusion. Sorry for the misunderstanding. Thx

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 28, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned on Mar 29, 2024