
external-dns continuously deleting and adding the same A and TXT records #4059

Open
paul-at-cybr opened this issue Nov 20, 2023 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@paul-at-cybr

paul-at-cybr commented Nov 20, 2023

What happened:
After upgrading the external-dns image from 0.13.4 => 0.14.0, external-dns seems to have gotten stuck trying to continuously delete and recreate a subset of the records it manages.

A common thread among the domains external-dns misbehaves on is that they all have NS records that are not managed by external-dns. In our case, these records are manually configured.

What you expected to happen:
A low rate of log output, allowing me to breathe easy in the knowledge that we are not hammering the Google Cloud DNS API.

How to reproduce it (as minimally and precisely as possible):

  1. Tell external-dns to manage a domain that has existing NS records.
  2. If that isn't enough on its own, try adding other factors:
  • Cloud dns provider: Google
  • external-dns args (attached; a rough example flag set is sketched after this list)
  • K8S platform: GKE
  • Ingress controller: traefik
  • Ingress apiVersion: networking.k8s.io/v1
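
For reference, a minimal hypothetical flag set for such a setup (not the reporter's actual configuration, which is in the attachment; the project, owner id, and domain below are placeholders) might look like:

args:
  - '--provider=google'
  - '--google-project=my-gcp-project'    # placeholder project
  - '--source=ingress'
  - '--policy=sync'
  - '--registry=txt'
  - '--txt-owner-id=my-cluster'          # placeholder owner id
  - '--domain-filter=example.com'        # a zone that also carries externally managed NS records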

Anything else we need to know?:
I've collected a series of debug-level log entries, scrubbed them for public consumption, and attached them. I hope they prove useful.

Environment:

  • External-DNS version: v0.14.0
  • DNS provider: Google
  • Ingress controller: Traefik
@paul-at-cybr paul-at-cybr added the kind/bug Categorizes issue or PR as related to a bug. label Nov 20, 2023
@szuecs
Contributor

szuecs commented Nov 21, 2023

Do I understand correctly that you have 1 zone that has some subdomains that delegate the subdomain to other NS servers?
Can you add which subdomain you delegate and which record should have been created?

@paul-at-cybr
Author

paul-at-cybr commented Nov 21, 2023

Hi, thanks for following up!

There are no records missing in our case, and our availability has (so far) not been impacted. The primary symptom is that we see a lot of delete + add activity on A and TXT records associated with zones that have externally managed NS records.

Edit: The rest of this comment was based on an erroneous read of the logs on a particular environment. See my latest comment for an update.

I've had a closer look at the logs across environments, and while the problem seemed to reliably impact all zones with externally managed NS records, it appears to have resolved itself on zones with simpler setups.

What remains is environments where the setup is more complex, with several layers of zones across multiple Google Cloud projects. In the environment from which the example log was gathered, there are three nested zones, where each nested zone also exists as a set of A/TXT records in the parent zone.

The external-dns instance from which the example log was gathered manages a single zone, stage.apps.cybr.ai, but this zone has a parent zone (apps.cybr.ai) which resides in a separate project and is managed by a separate instance of external-dns.
My new suspicion is that this somehow causes a conflict, but the logs on the external-dns instance managing the parent zone are very quiet.

@eyvind

eyvind commented Nov 21, 2023

We also see external-dns delete/recreate zone apex records on every reconciliation. I suspect that this is related to the new TXT registry format where the record name contains the record type:

By default, the new TXT registry name is <record-type>-<dns name>, which, when creating an A record for example.com, becomes a-example.com. That hostname is outside the example.com zone, so the TXT record creation fails, apparently causing a delete/recreate on the next reconciliation. This will happen in any delegated zone, such as stage.apps.cybr.ai in your example.

This happens even if you have a prefix defined for registry TXT records, since the record type is prepended to the DNS name of the A record, not to the prefixed name of the TXT record.

I think #3774 would fix this problem.

@paul-at-cybr
Author

paul-at-cybr commented Nov 22, 2023

Follow-up: Turns out the issue is still present on the simpler setups; I was just using the wrong log filters.
This should simplify the process of reproducing the problem.

We have several domains where there's only one zone and no nesting, and the root zone is affected by the TXT / A issue.
Each of these domains has an externally managed NS record.
I've attached a log snippet with output filtered for one of these domains, litly.io.

@szuecs
Contributor

szuecs commented Nov 29, 2023

Thanks!
So from @paul-at-cybr's logs:

2023-11-22T14:43:36Z/info: Add records: litly.io. TXT [\"heritage=external-dns,external-dns/owner=dns-frontend-prod-9347b6ff,external-dns/resource=ingress/litly-api/litly-app\"] 300
2023-11-22T14:43:36Z/info: Add records: litly.io. A [34.147.116.243] 300
2023-11-22T14:43:36Z/info: Del records: litly.io. TXT [\"heritage=external-dns,external-dns/owner=dns-frontend-prod-9347b6ff,external-dns/resource=ingress/litly-api/litly-app\"] 300
2023-11-22T14:43:36Z/info: Del records: litly.io. A [34.147.116.243] 300
2023-11-22T14:43:36Z/info: Change zone: litly-io-root batch #0

@paul-at-cybr Can you confirm that's only on APEX records as @eyvind wrote?

Maybe a workaround to try is to use a subdomain like tags.litly.io to store the ownership TXT records (--txt-suffix="-%{record_type}.tags"), so that the ownership is set correctly without creating APEX records for it. A sketch of where that flag would go is below.
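
For illustration only (the flag value is copied from the suggestion above, the owner id is taken from the logs earlier in the thread, and the exact resulting TXT names depend on how the suffix template is expanded for your records):

args:
  - '--registry=txt'
  - '--txt-owner-id=dns-frontend-prod-9347b6ff'   # owner id as seen in the logs above
  - '--txt-suffix=-%{record_type}.tags'           # suffix suggested above; %{record_type} expands to the record type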

@Evesy
Contributor

Evesy commented Nov 30, 2023

@szuecs We've observed the same thing (just with the Google provider), affecting all records, not just apex domains.

E.g. Our zone is testing.k8.tld and records such as app.grafana.testing.k8.tld were being recreated constantly. Worth noting we also use a subdomain --txt-suffix, i.e. meta.

(.tld is in Cloudflare, .testing.k8.tld is delegated to Google Cloud DNS)

@Evesy
Contributor

Evesy commented Dec 1, 2023

Looks like the regression was introduced between 0.13.5 and 0.13.6

In our case, it is affecting any records that have the following provider configuration:

Ingress:

metadata:
  annotations:
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"

CRD:

    providerSpecific:
    - name: external-dns.alpha.kubernetes.io/cloudflare-proxied
      value: "false"

Arguably this config shouldn't be present on the CRD endpoint when the dnsName is one that will be managed by the Google provider, but in an Ingress resource with mixed domains it can't be avoided (a sketch of such a mixed Ingress is below).
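
For illustration, a hypothetical Ingress mixing a Cloudflare-hosted and a Google-hosted name (all hostnames and the backend service here are invented for the sketch); external-dns attaches the annotation to every endpoint generated from the resource, including the Google-managed one:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mixed-domains
  annotations:
    # Intended for the Cloudflare-hosted name, but applied to all hosts below.
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
spec:
  rules:
    - host: app.example.com          # hypothetical; zone hosted in Cloudflare
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
    - host: app.testing.k8.tld       # hypothetical; zone delegated to Google Cloud DNS
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80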

@hungran

hungran commented Dec 15, 2023

We have the same issue.

  • Cloud dns provider: Azure
  • external-dns args: traefik-proxy
  • K8S platform: AKS
  • Ingress Controller: Traefik
  • IngressRoute: traefik.io/v1alpha1

It seems that the same two values, a CNAME and a TXT record, keep being updated by external-dns:

  • the CNAME is the goal, but it shouldn't be updated after it has been created,
  • and the TXT record should not need to be created in this case.

@paul-at-cybr
Author

paul-at-cybr commented Dec 15, 2023

Terribly sorry, I've struggled to find the time to properly follow up on this.

Can you confirm that's only on APEX records as @eyvind wrote?

We're seeing the issue on subdomains as well, such as stage.apps.cybr.ai, though those domains are in the nested zone situation that I initially suspected to be the determining factor for this bug: stage.apps.cybr.ai is a subdomain of apps.cybr.ai, which has its own google_dns_managed_zone while also being a subdomain of another google_dns_managed_zone (cybr.ai).

We have not observed the bug in any other subdomains. Only on apex records and on subdomains with nested managed zones.

I've considered introducing a txt suffix to see if this fixes things, but this section from the txt registry readme indicates I might not want to do that:

The prefix or suffix may not be changed after initial deployment, lest the registry records be orphaned and the metadata be lost.

@dboreham

Quick note that I found this bug while beginning to look into what seems to be a similar syndrome observed in our dev setup. Any pointers on how to debug would be appreciated. So far all I've seen is that in DO (where the zone is hosted), records appear and vanish at random. There's nothing obviously related in the external-dns container logs (it isn't saying "I'm creating this record...", "I'm deleting this record...", although clearly it is). So presumably the default log level is not very informative?

@Evesy
Contributor

Evesy commented Jan 2, 2024

Looks like the regression was introduced between 0.13.5 and 0.13.6 […]

I've tracked the regression in this instance down to 5339c0c

@johngmyers Hoping you might be able to advise on the changes in that commit. I see the key difference is that previously the provider-specific logic:

  • updated if current and desired mismatched
  • updated if there is a current property (unless the value is "") that is not in the desired

Whereas now there is an additional rule:

  • update if there's a desired provider-specific config that is not in the current

This logic change does make sense on the surface, but it feels like a breaking change from the old behaviour. Having multiple DNS entries that use different providers in the same resource (e.g. Ingress, CRD, etc.) will cause constant deletes/recreates if that resource uses any provider-specific annotations. (A simplified sketch of the old vs. new comparison is included at the end of this comment.)

cc @szuecs for your thoughts too

Feels like with this change there needs to be a way to tie provider-specific config to providers, and not have them be considered for records that are ultimately handled by a different provider.
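
A simplified sketch of the comparison described above (invented types and function names, not the actual external-dns plan code), assuming both behaviours operate on a list of provider-specific name/value pairs:

package plan

// ProviderSpecificProperty is an invented stand-in for a provider-specific
// name/value pair attached to an endpoint (e.g. cloudflare-proxied: "false").
type ProviderSpecificProperty struct {
	Name  string
	Value string
}

// lookup returns the value of the named property and whether it exists.
func lookup(props []ProviderSpecificProperty, name string) (string, bool) {
	for _, p := range props {
		if p.Name == name {
			return p.Value, true
		}
	}
	return "", false
}

// shouldUpdateOld sketches the pre-0.13.6 behaviour described above: only
// current-side properties drive the decision.
func shouldUpdateOld(current, desired []ProviderSpecificProperty) bool {
	for _, c := range current {
		d, ok := lookup(desired, c.Name)
		if !ok {
			if c.Value != "" {
				return true // current has a property that desired lacks
			}
			continue
		}
		if c.Value != d {
			return true // current and desired mismatch
		}
	}
	return false
}

// shouldUpdateNew adds the extra rule from the commit in question: a desired
// property missing from current also forces an update. A cloudflare-proxied
// annotation on a record ultimately served by the Google provider never
// appears in current, so this returns true on every reconciliation.
func shouldUpdateNew(current, desired []ProviderSpecificProperty) bool {
	if shouldUpdateOld(current, desired) {
		return true
	}
	for _, d := range desired {
		if _, ok := lookup(current, d.Name); !ok {
			return true
		}
	}
	return false
}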

szuecs added a commit that referenced this issue Jan 17, 2024
@szuecs
Contributor

szuecs commented Jan 17, 2024

@Evesy thanks for your comment!
In our clusters we run v0.13.6 without provider-specific values and are fine. Thanks for bisecting to the commit.
I tried to create a test case (#4189), but I can't reproduce it even if I check out the v0.13.6 tag and add the test case.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2024
@paul-at-cybr
Author

This issue does not seem to be resolved for us as of 0.14.1.
Will consider looking into it myself if time permits, though neither DNS, networking, nor k8s internals are among my strong suits.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2024
@Evesy
Contributor

Evesy commented Apr 19, 2024

@paul-at-cybr Which providers are you using and which are affected? Are the resources responsible for the records affected using any provider specific config?

@thecmdradama

thecmdradama commented May 2, 2024

Seeing the same thing here too with either 0.13.6 or 0.14.1 on Azure Kubernetes using Azure Private DNS.

Cloud dns provider: Azure
external-dns args:

args:
  - '--log-level=info'
  - '--log-format=text'
  - '--interval=1m'
  - '--source=ingress'
  - '--source=pod'
  - '--policy=sync'
  - '--registry=txt'
  - '--txt-owner-id=external-dns'
  - '--domain-filter=domain1.example.com'
  - '--domain-filter=domain2.example.com'
  - '--provider=azure-private-dns'

K8S platform: AKS
Ingress Controller: ingress
ingressRoute: networking.k8s.io/v1

@eldios

eldios commented Jul 8, 2024

Same issue happening for me too.
Kubernetes version: 1.29.3 with k3s on bare metal
external-dns version: v0.14.2
external-dns helm chart: v1.14.5

installed with the following flags:

export CLUSTER_NAME="vu-ams-02"
helm repo add external-dns https://kubernetes-sigs.github.io/external-dns/

export GOOGLE_PROJECT="MY_GOOGLE_PROJECT_REDACTED"
export CLUSTER_DOMAIN="${CLUSTER_NAME}.switchboard-oracles.xyz"

helm upgrade --install                                                 \
  external-dns external-dns/external-dns                               \
  -n external-dns --create-namespace                                   \
  --version 1.14.5                                                     \
  --set provider=google                                                \
  --set policy=sync                                                    \
  --set sources[0]="ingress"                                           \
  --set domainFilters[0]="${CLUSTER_DOMAIN}"                           \
  --set txtOwnerId="${CLUSTER_NAME}"                                   \
  --set extraArgs[0]='--google-project='"${GOOGLE_PROJECT}"            \
  --set extraVolumes[0].name="google-service-account"                  \
  --set extraVolumes[0].secret.secretName="external-dns"               \
  --set extraVolumeMounts[0].name="google-service-account"             \
  --set extraVolumeMounts[0].mountPath="/etc/secrets/service-account/" \
  --set env[0].name="GOOGLE_APPLICATION_CREDENTIALS"                   \
  --set env[0].value="/etc/secrets/service-account/credentials.json"

you can check the DNS record here:

$ dig +short A vu-ams-02.switchboard-oracles.xyz
136.244.110.43
$ dig +short TXT vu-ams-02.switchboard-oracles.xyz
"heritage=external-dns,external-dns/owner=vu-ams-02,external-dns/resource=ingress/switchboard-oracle-devnet/switchboard-ingress"

Let me know if you need any other hints or logs, or want me to dig in any direction in the code... this is kinda annoying-ish 😬
