
external-dns continuously deleting and adding the same A and TXT records #4059

Open
paul-at-cybr opened this issue Nov 20, 2023 · 17 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@paul-at-cybr

paul-at-cybr commented Nov 20, 2023

What happened:
After upgrading the external-dns image from 0.13.4 => 0.14.0, external-dns seems to have gotten stuck trying to continuously delete and recreate a subset of the records it manages.

A common thread among the domains external-dns misbehaves on is that they all have NS records that are not managed by external-dns. In our case, these records are manually configured.

What you expected to happen:
A low rate of log output, allowing me to breathe easy in the knowledge that we are not hammering the Google Cloud DNS API.

How to reproduce it (as minimally and precisely as possible):

  1. Tell external-dns to manage a domain that has existing NS records.
  2. If that isn't enough on its own, try adding other factors:
  • Cloud dns provider: Google
  • external-dns args (attached; a rough example flag set is sketched after this list)
  • K8S platform: GKE
  • Ingress controller: traefik
  • Ingress apiVersion: networking.k8s.io/v1
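
For reference, a minimal hypothetical flag set for such a setup (not the reporter's actual configuration, which is in the attachment; the project, owner id, and domain below are placeholders) might look like:

args:
  - '--provider=google'
  - '--google-project=my-gcp-project'    # placeholder project
  - '--source=ingress'
  - '--policy=sync'
  - '--registry=txt'
  - '--txt-owner-id=my-cluster'          # placeholder owner id
  - '--domain-filter=example.com'        # a zone that also carries externally managed NS records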

Anything else we need to know?:
I've collected a series of debug-level log entries, scrubbed them for public consumption, and attached them. I hope they prove useful.

Environment:

  • External-DNS version: v0.14.0
  • DNS provider: Google
  • Ingress controller: Traefik
@paul-at-cybr paul-at-cybr added the kind/bug Categorizes issue or PR as related to a bug. label Nov 20, 2023
@szuecs
Contributor

szuecs commented Nov 21, 2023

Do I understand correctly that you have 1 zone that has some subdomains that delegate the subdomain to other NS servers?
Can you add which subdomain you delegate and which record should have been created?

@paul-at-cybr
Author

paul-at-cybr commented Nov 21, 2023

Hi, thanks for following up!

There are no records missing in our case, and our availability has (so far) not been impacted. The primary symptom is that we see a lot of delete + add activity on A and TXT records associated with zones that have externally managed NS records.

Edit: The rest of this comment was based on an erroneous read of the logs on a particular environment. See my latest comment for an update.

I've had a closer look at the logs across environments, and while the problem seemed to reliably impact all zones with externally managed NS records, it appears to have resolved itself on zones with simpler setups.

What remains is environments where the setup is more complex, with several layers of zones across multiple Google Cloud projects. In the environment from which the example log was gathered, there are three nested zones, where each nested zone also exists as a set of A/TXT records in the parent zone.

The external-dns instance from which the example log was gathered manages a single zone, stage.apps.cybr.ai, but this zone has a parent zone (apps.cybr.ai) which resides in a separate project and is managed by a separate instance of external-dns.
My new suspicion is that this somehow causes a conflict, but the logs on the external-dns instance managing the parent zone are very quiet.

@eyvind

eyvind commented Nov 21, 2023

We also see external-dns delete/recreate zone apex records on every reconciliation. I suspect that this is related to the new TXT registry format where the record name contains the record type:

By default, the new TXT registry name is <record-type>-<dns name>, which, when creating an A record for example.com, becomes a-example.com. That hostname is outside the example.com zone, so the TXT record creation fails, apparently causing a delete/recreate on the next reconciliation. This will happen in any delegated zone, such as stage.apps.cybr.ai in your example.

This happens even if you have a prefix defined for registry TXT records, since the record type is prepended to the DNS name of the A record, not to the prefixed name of the TXT record.

I think #3774 would fix this problem.

@paul-at-cybr
Author

paul-at-cybr commented Nov 22, 2023

Follow-up: Turns out the issue is still present on the simpler setups; I was just using the wrong log filters.
This should simplify the process of reproducing the problem.

We have several domains where there's only one zone and no nesting, and the root zone is affected by the TXT / A issue.
Each of these domains has an externally managed NS record.
I've attached a log snippet with output filtered for one of these domains, litly.io.

@szuecs
Contributor

szuecs commented Nov 29, 2023

Thanks!
So from @paul-at-cybr's logs:

2023-11-22T14:43:36Z/info: Add records: litly.io. TXT [\"heritage=external-dns,external-dns/owner=dns-frontend-prod-9347b6ff,external-dns/resource=ingress/litly-api/litly-app\"] 300
2023-11-22T14:43:36Z/info: Add records: litly.io. A [34.147.116.243] 300
2023-11-22T14:43:36Z/info: Del records: litly.io. TXT [\"heritage=external-dns,external-dns/owner=dns-frontend-prod-9347b6ff,external-dns/resource=ingress/litly-api/litly-app\"] 300
2023-11-22T14:43:36Z/info: Del records: litly.io. A [34.147.116.243] 300
2023-11-22T14:43:36Z/info: Change zone: litly-io-root batch #0

@paul-at-cybr Can you confirm that's only on APEX records as @eyvind wrote?

Maybe a workaround to try is to use a subdomain like tags.litly.io to store the ownership TXT records (--txt-suffix="-%{record_type}.tags"), so that the ownership is set correctly without creating APEX records for it. A sketch of where that flag would go is below.
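
For illustration only (the flag value is copied from the suggestion above, the owner id is taken from the logs earlier in the thread, and the exact resulting TXT names depend on how the suffix template is expanded for your records):

args:
  - '--registry=txt'
  - '--txt-owner-id=dns-frontend-prod-9347b6ff'   # owner id as seen in the logs above
  - '--txt-suffix=-%{record_type}.tags'           # suffix suggested above; %{record_type} expands to the record type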

@Evesy
Contributor

Evesy commented Nov 30, 2023

@szuecs We've observed the same thing (just with the Google provider), affecting all records, not just apex domains.

E.g. Our zone is testing.k8.tld and records such as app.grafana.testing.k8.tld were being recreated constantly. Worth noting we also use a subdomain --txt-suffix, i.e. meta.

(.tld is in Cloudflare, .testing.k8.tld is delegated to Google Cloud DNS)

@Evesy
Contributor

Evesy commented Dec 1, 2023

Looks like the regression was introduced between 0.13.5 and 0.13.6

In our case, it is affecting any records that have the following provider configuration:

Ingress:

metadata:
  annotations:
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"

CRD:

    providerSpecific:
    - name: external-dns.alpha.kubernetes.io/cloudflare-proxied
      value: "false"

Arguably this config shouldn't be present on the CRD endpoint when the dnsName is one that will be managed by the Google provider, but in an Ingress resource with mixed domains it can't be avoided (a sketch of such a mixed Ingress is below).
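
For illustration, a hypothetical Ingress mixing a Cloudflare-hosted and a Google-hosted name (all hostnames and the backend service here are invented for the sketch); external-dns attaches the annotation to every endpoint generated from the resource, including the Google-managed one:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mixed-domains
  annotations:
    # Intended for the Cloudflare-hosted name, but applied to all hosts below.
    external-dns.alpha.kubernetes.io/cloudflare-proxied: "false"
spec:
  rules:
    - host: app.example.com          # hypothetical; zone hosted in Cloudflare
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
    - host: app.testing.k8.tld       # hypothetical; zone delegated to Google Cloud DNS
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80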

@hungran

hungran commented Dec 15, 2023

We have the same issue.

  • Cloud dns provider: Azure
  • external-dns args: traefik-proxy
  • K8S platform: AKS
  • Ingress Controller: Traefik
  • IngressRoute: traefik.io/v1alpha1

It seems that the same two values, a CNAME and a TXT record, keep being updated by external-dns:

  • the CNAME is the goal, but it shouldn't be updated after it has been created,
  • and the TXT record should not need to be created in this case.

@paul-at-cybr
Author

paul-at-cybr commented Dec 15, 2023

Terribly sorry, I've struggled to find the time to properly follow up on this.

Can you confirm that's only on APEX records as @eyvind wrote?

We're seeing the issue on subdomains as well, such as stage.apps.cybr.ai, though those domains are in the nested zone situation that I initially suspected to be the determining factor for this bug: stage.apps.cybr.ai is a subdomain of apps.cybr.ai, which has its own google_dns_managed_zone while also being a subdomain of another google_dns_managed_zone (cybr.ai).

We have not observed the bug in any other subdomains. Only on apex records and on subdomains with nested managed zones.

I've considered introducing a txt suffix to see if this fixes things, but this section from the txt registry readme indicates I might not want to do that:

The prefix or suffix may not be changed after initial deployment, lest the registry records be orphaned and the metadata be lost.

@dboreham

Quick note that I found this bug while beginning to look into what seems to be a similar syndrome observed in our dev setup. Any pointers on how to debug would be appreciated. So far all I've seen is that in DO (where the zone is hosted), records appear and vanish at random. There's nothing obviously related in the external-dns container logs (it isn't saying "I'm creating this record...", "I'm deleting this record...", although clearly it is). So presumably the default log level is not very informative?

@Evesy
Contributor

Evesy commented Jan 2, 2024

Looks like the regression was introduced between 0.13.5 and 0.13.6 […]

I've tracked the regression in this instance down to 5339c0c

@johngmyers Hoping you might be able to advise on the changes in that commit. I see the key difference is that previously the provider-specific logic:

  • updated if current and desired mismatched
  • updated if there is a current property (unless the value is "") that is not in the desired

Whereas now there is an additional rule:

  • update if there's a desired provider-specific config that is not in the current

This logic change does make sense on the surface, but it feels like a breaking change from the old behaviour. Having multiple DNS entries that use different providers in the same resource (e.g. Ingress, CRD, etc.) will cause constant deletes/recreates if that resource uses any provider-specific annotations. (A simplified sketch of the old vs. new comparison is included at the end of this comment.)

cc @szuecs for your thoughts too

Feels like with this change there needs to be a way to tie provider-specific config to providers, and not have them be considered for records that are ultimately handled by a different provider.
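
A simplified sketch of the comparison described above (invented types and function names, not the actual external-dns plan code), assuming both behaviours operate on a list of provider-specific name/value pairs:

package plan

// ProviderSpecificProperty is an invented stand-in for a provider-specific
// name/value pair attached to an endpoint (e.g. cloudflare-proxied: "false").
type ProviderSpecificProperty struct {
	Name  string
	Value string
}

// lookup returns the value of the named property and whether it exists.
func lookup(props []ProviderSpecificProperty, name string) (string, bool) {
	for _, p := range props {
		if p.Name == name {
			return p.Value, true
		}
	}
	return "", false
}

// shouldUpdateOld sketches the pre-0.13.6 behaviour described above: only
// current-side properties drive the decision.
func shouldUpdateOld(current, desired []ProviderSpecificProperty) bool {
	for _, c := range current {
		d, ok := lookup(desired, c.Name)
		if !ok {
			if c.Value != "" {
				return true // current has a property that desired lacks
			}
			continue
		}
		if c.Value != d {
			return true // current and desired mismatch
		}
	}
	return false
}

// shouldUpdateNew adds the extra rule from the commit in question: a desired
// property missing from current also forces an update. A cloudflare-proxied
// annotation on a record ultimately served by the Google provider never
// appears in current, so this returns true on every reconciliation.
func shouldUpdateNew(current, desired []ProviderSpecificProperty) bool {
	if shouldUpdateOld(current, desired) {
		return true
	}
	for _, d := range desired {
		if _, ok := lookup(current, d.Name); !ok {
			return true
		}
	}
	return false
}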

szuecs added a commit that referenced this issue Jan 17, 2024
@szuecs
Contributor

szuecs commented Jan 17, 2024

@Evesy thanks for your comment!
In our clusters we run v0.13.6 without provider-specific values and are fine. Thanks for bisecting to the commit.
I tried to create a test case (#4189), but I can't reproduce it even if I check out the v0.13.6 tag and add the test case.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2024
@paul-at-cybr
Author

This issue does not seem to be resolved for us as of 0.14.1.
Will consider looking into it myself if time permits, though neither DNS, networking, nor k8s internals are among my strong suits.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 16, 2024
@Evesy
Contributor

Evesy commented Apr 19, 2024

@paul-at-cybr Which providers are you using and which are affected? Are the resources responsible for the records affected using any provider specific config?

@thecmdradama

thecmdradama commented May 2, 2024

Seeing the same thing here too with either 0.13.6 or 0.14.1 on Azure Kubernetes using Azure Private DNS.

Cloud dns provider: Azure
external-dns args:

args:
  - '--log-level=info'
  - '--log-format=text'
  - '--interval=1m'
  - '--source=ingress'
  - '--source=pod'
  - '--policy=sync'
  - '--registry=txt'
  - '--txt-owner-id=external-dns'
  - '--domain-filter=domain1.example.com'
  - '--domain-filter=domain2.example.com'
  - '--provider=azure-private-dns'

K8S platform: AKS
Ingress Controller: ingress
ingressRoute: networking.k8s.io/v1

@eldios

eldios commented Jul 8, 2024

Same issue happening for me too.
Kubernetes version: 1.29.3 with k3s on bare metal
external-dns version: v0.14.2
external-dns helm chart: v1.14.5

installed with the following flags:

export CLUSTER_NAME="vu-ams-02"
helm repo add external-dns https://kubernetes-sigs.github.io/external-dns/

export GOOGLE_PROJECT="MY_GOOGLE_PROJECT_REDACTED"
export CLUSTER_DOMAIN="${CLUSTER_NAME}.switchboard-oracles.xyz"

helm upgrade --install                                                 \
  external-dns external-dns/external-dns                               \
  -n external-dns --create-namespace                                   \
  --version 1.14.5                                                     \
  --set provider=google                                                \
  --set policy=sync                                                    \
  --set sources[0]="ingress"                                           \
  --set domainFilters[0]="${CLUSTER_DOMAIN}"                           \
  --set txtOwnerId="${CLUSTER_NAME}"                                   \
  --set extraArgs[0]='--google-project='"${GOOGLE_PROJECT}"            \
  --set extraVolumes[0].name="google-service-account"                  \
  --set extraVolumes[0].secret.secretName="external-dns"               \
  --set extraVolumeMounts[0].name="google-service-account"             \
  --set extraVolumeMounts[0].mountPath="/etc/secrets/service-account/" \
  --set env[0].name="GOOGLE_APPLICATION_CREDENTIALS"                   \
  --set env[0].value="/etc/secrets/service-account/credentials.json"

you can check the DNS record here:

$ dig +short A vu-ams-02.switchboard-oracles.xyz
136.244.110.43
$ dig +short TXT vu-ams-02.switchboard-oracles.xyz
"heritage=external-dns,external-dns/owner=vu-ams-02,external-dns/resource=ingress/switchboard-oracle-devnet/switchboard-ingress"

Let me know if you need any other hints or logs, or want me to dig in any direction in the code... this is kinda annoying-ish 😬
