
rfc2136 continually removing then adding A records that have more than 1 target #1596

Closed
stefanlasiewski opened this issue May 23, 2020 · 24 comments · May be fixed by #4613
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@stefanlasiewski
Contributor

stefanlasiewski commented May 23, 2020

What happened:

External DNS with the rfc2136 provider is continually removing and then re-adding DNS records.

What you expected to happen:

I expect External DNS to add records once, and then only update them on occasion when it detects a change.

How to reproduce it (as minimally and precisely as possible):

  1. Install the cluster:
  • Install Rancher 2.4.4 (this shouldn't be necessary)
  • Install a v1.16.8 Kubernetes cluster
  • Add external-dns via the Bitnami chart at https://charts.bitnami.com/bitnami
  • Configure with settings similar to the following:
---
  domainFilters: ["cluster.example.org"]
  interval: "1m"
  policy: "sync"
  provider: "rfc2136"
  rfc2136: 
    host: "ns1.example.org"
    port: "53"
    tsigAxfr: "true"
    tsigKeyname: "keyname"
    tsigSecret: "key/key"
    tsigSecretAlg: "hmac-sha512"
    zone: "cluster.example.org."
  sources: [ingress]
  txtOwnerId: "cluster1"
  2. Configure an ingress.

  3. Wait a couple of days. Do normal things like upgrading external-dns to the next version of the chart, etc. Come back and notice that existing DNS records are being updated every minute:

time="2020-05-23T01:17:10Z" level=debug msg="Record=lb.app.example.org.\t60\tIN\tA\t10.11.12.99"                                                                                                                   
time="2020-05-23T01:17:10Z" level=debug msg="Record=lb.app.example.org.\t60\tIN\tA\t10.11.12.101"                                                                                                                  
time="2020-05-23T01:17:10Z" level=debug msg="Record=lb.app.example.org.\t60\tIN\tTXT\t\"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\""                                  time="2020-05-23T01:17:10Z" level=debug msg="Endpoints generated from ingress: app/lb: [lb.app.example.org 0 IN A  10.11.12.101 []]"                                                                               
time="2020-05-23T01:17:10Z" level=debug msg="RemoveRecord.ep=lb.app.example.org 0 IN A  10.11.12.101 []"                                                                                                           
time="2020-05-23T01:17:10Z" level=info msg="Removing RR: lb.app.example.org 0 A 10.11.12.101"                                                                                                                      
time="2020-05-23T01:17:10Z" level=debug msg="AddRecord.ep=lb.app.example.org 0 IN A  10.11.12.101 []"                                                                                                              
time="2020-05-23T01:17:10Z" level=info msg="Adding RR: lb.app.example.org 60 A 10.11.12.101"                                                                                                                       
time="2020-05-23T01:17:10Z" level=debug msg="RemoveRecord.ep=lb.app.example.org 0 IN TXT  \"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\" []"                           
time="2020-05-23T01:17:10Z" level=info msg="Removing RR: lb.app.example.org 0 TXT \"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\""                                      time="2020-05-23T01:17:10Z" level=debug msg="AddRecord.ep=lb.app.example.org 0 IN TXT  \"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\" []"                              
time="2020-05-23T01:17:10Z" level=info msg="Adding RR: lb.app.example.org 60 TXT \"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\""                                                                                                                                                                                                                                                          
time="2020-05-23T01:18:10Z" level=debug msg="Record=lb.app.example.org.\t60\tIN\tA\t10.11.12.99"                                                                                                                   
time="2020-05-23T01:18:10Z" level=debug msg="Record=lb.app.example.org.\t60\tIN\tA\t10.11.12.101"                                                                                                                  
time="2020-05-23T01:18:10Z" level=debug msg="Record=lb.app.example.org.\t60\tIN\tTXT\t\"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\""                                  time="2020-05-23T01:18:10Z" level=debug msg="Endpoints generated from ingress: app/lb: [lb.app.example.org 0 IN A  10.11.12.101 []]"                                                                               
time="2020-05-23T01:18:10Z" level=debug msg="RemoveRecord.ep=lb.app.example.org 0 IN A  10.11.12.101 []"                                                                                                           
time="2020-05-23T01:18:10Z" level=info msg="Removing RR: lb.app.example.org 0 A 10.11.12.101"                                                                                                                      
time="2020-05-23T01:18:10Z" level=debug msg="AddRecord.ep=lb.app.example.org 0 IN A  10.11.12.101 []"                                                                                                              
time="2020-05-23T01:18:10Z" level=info msg="Adding RR: lb.app.example.org 60 A 10.11.12.101"                                                                                                                       time="2020-05-23T01:18:10Z" level=debug msg="RemoveRecord.ep=lb.app.example.org 0 IN TXT  \"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\" []"                           
time="2020-05-23T01:18:10Z" level=info msg="Removing RR: lb.app.example.org 0 TXT \"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\""                                      
time="2020-05-23T01:18:10Z" level=debug msg="AddRecord.ep=lb.app.example.org 0 IN TXT  \"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\" []"                              
time="2020-05-23T01:18:10Z" level=info msg="Adding RR: lb.app.example.org 60 TXT \"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app/lb\""  

The resulting command line is:

external-dns --log-level=info --log-format=text --domain-filter=cluster.example.org --policy=sync --provider=rfc2136 --registry=txt --interval=1m --txt-owner-id=cluster1-dev --source=ingress --rfc2136-host=ns1.example.org --rfc2136-port=53 --rfc2136-zone=cluster.example.org. --rfc2136-min-ttl=20s --rfc2136-tsig-secret-alg=... --rfc2136-tsig-keyname=... --rfc2136-tsig-axfr

Anything else we need to know?:

Environment:

  • External-DNS version (use external-dns --version): v0.7.1
  • DNS provider: rfc2136
  • Others:
@stefanlasiewski stefanlasiewski added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2020
@stefanlasiewski
Contributor Author

I created another new record, and the same problem does not happen with it:

time="2020-05-23T01:30:10Z" level=debug msg="Record=lb.app2.example.org.\t60\tIN\tTXT\t\"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app2/lb\""                                                   
time="2020-05-23T01:30:10Z" level=debug msg="Record=lb.app2.example.org.\t60\tIN\tA\t10.11.12"                                                                                                                                        
time="2020-05-23T01:30:10Z" level=debug msg="Endpoints generated from ingress: app2/lb: [lb.app2.example.org 0 IN A  10.11.12 []]"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
time="2020-05-23T01:31:10Z" level=debug msg="Record=lb.app2.example.org.\t60\tIN\tTXT\t\"heritage=external-dns,external-dns/owner=cluster1,external-dns/resource=ingress/app2/lb\""                                                   time="2020-05-23T01:31:10Z" level=debug msg="Record=lb.app2.example.org.\t60\tIN\tA\t10.11.12"                                                                                                                                        time="2020-05-23T01:31:10Z" level=debug msg="Endpoints generated from ingress: app2/lb: [lb.app2.example.org 0 IN A  10.11.12 []]"   

@stefanlasiewski
Contributor Author

stefanlasiewski commented May 23, 2020

Hm, this is strange.

The records being removed show a zero-second TTL:

"Removing RR: lb.app.example.org 0 A 10.11.12.101"

But the records being added have a 60-second TTL, which is what I configured a couple of days ago:

"Adding RR: lb.app.example.org 60 A 10.11.12.101"

Is that wrong?

@brpaz

brpaz commented May 23, 2020

The same is happening with the Cloudflare provider.

Example:

time="2020-05-07T18:35:52Z" level=info msg="Changing record." action=UPDATE record=labs.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:35:52Z" level=info msg="Changing record." action=UPDATE record=directus.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:35:53Z" level=info msg="Changing record." action=UPDATE record=golang-api.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:35:53Z" level=info msg="Changing record." action=UPDATE record=labs.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:35:54Z" level=info msg="Changing record." action=UPDATE record=directus.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:35:54Z" level=info msg="Changing record." action=UPDATE record=golang-api.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:36:51Z" level=info msg="Changing record." action=UPDATE record=directus.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:36:52Z" level=info msg="Changing record." action=UPDATE record=labs.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:36:52Z" level=info msg="Changing record." action=UPDATE record=golang-api.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:36:53Z" level=info msg="Changing record." action=UPDATE record=directus.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:36:53Z" level=info msg="Changing record." action=UPDATE record=labs.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:36:54Z" level=info msg="Changing record." action=UPDATE record=golang-api.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:37:51Z" level=info msg="Changing record." action=UPDATE record=directus.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:37:52Z" level=info msg="Changing record." action=UPDATE record=golang-api.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:37:52Z" level=info msg="Changing record." action=UPDATE record=labs.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:37:53Z" level=info msg="Changing record." action=UPDATE record=directus.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:37:53Z" level=info msg="Changing record." action=UPDATE record=golang-api.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:37:54Z" level=info msg="Changing record." action=UPDATE record=labs.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:38:51Z" level=info msg="Changing record." action=UPDATE record=golang-api.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:38:52Z" level=info msg="Changing record." action=UPDATE record=directus.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:38:52Z" level=info msg="Changing record." action=UPDATE record=labs.brunopaz.dev targets=1 ttl=1 type=A zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:38:53Z" level=info msg="Changing record." action=UPDATE record=golang-api.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:38:53Z" level=info msg="Changing record." action=UPDATE record=directus.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2
time="2020-05-07T18:38:54Z" level=info msg="Changing record." action=UPDATE record=labs.brunopaz.dev targets=1 ttl=1 type=TXT zone=1bb86879fb7ea053bd1352c7a9b60ca2

This is an old issue; please see the discussion in #992. The problem is not fixed.

This issue can cause random major outages where DNS cannot be resolved at all. I had outages of more than a day, fortunately only in a personal cluster.

@stefanlasiewski stefanlasiewski changed the title rfc2136 continually removing then adding rfc2136 continually removing then adding A records May 24, 2020
@masterkain

The same is happening with the Cloudflare provider.
This is an old issue. Please see all the discussion of #992. The problem is not fixed.

This issue can cause random major outages where the DNS cannot be resolved at all. I had outages of more than a day! fortunately in a personal cluster.

Correct. I had to revert back to Route 53 because of these DNS name resolution outages; not pleasant.

@sheerun
Contributor

sheerun commented Jun 4, 2020

This should be fixed in 0.7.2, at least for the Cloudflare provider. Could you check?

@stefanlasiewski
Contributor Author

I will check, hopefully next week.

@JoaoBraveCoding
Contributor

This is still broken for rfc2136 with CNAME records and the openshift-routes provider.

@masterkain

Seems solved with Cloudflare for me.

@Elegant996

Still having issues with CNAME records and Cloudflare.

@sheerun
Contributor

sheerun commented Jun 27, 2020

@Elegant996 What version of external-dns are you using? Can you write a test to reproduce in cloudflare_test.go?

@Elegant996

Currently using 0.7.2; I may be able to write something and associate it with a dummy service.

The worst part about this issue is that Cloudflare eventually ignored the CNAME. It appears that after the record was added and removed enough times, Cloudflare treated it as non-existent and returned NXDOMAIN. The record had to be removed for a few hours and then manually re-added before it worked again.

@stefanlasiewski
Contributor Author

stefanlasiewski commented Jul 9, 2020

Hi folks, just to be clear: this ticket concerns the rfc2136 provider, not Cloudflare. For the Cloudflare bug, please see #992. While the behavior looks similar, I suspect the actual bug may be subtly different.

@stefanlasiewski
Contributor Author

stefanlasiewski commented Jul 9, 2020

So, for my rfc2136 setup, what's happening now is that the DNS record itself has a TTL of 20 seconds, but external-dns thinks it should have a TTL of 0 seconds, which was the default until a recent update.

The same mismatch seems to happen for users who specify their own TTL with external-dns.alpha.kubernetes.io/ttl=60.

The following record actually has a TTL of 20, but for some reason, external-dns thinks it is 0.

time="2020-07-09T23:02:48Zlevel=info msg="Removing RR: lb.project.example.org 0 A 10.10.10.100"
time="2020-07-09T23:02:48Z level=info msg="Adding RR: lb.project.example.org 20 A 10.10.10.100"
time="2020-07-09T23:02:49Z" level=info msg="Removing RR: lb.project.example.org 0 TXT \"heritage=external-dns,external-dns/owner=cluster-dev,external-dns/resource=ingress/project/lb\""
time="2020-07-09T23:02:49Z" level=info msg="Adding RR: lb.project.example.org 20 TXT \"heritage=external-dns,external-dns/owner=cluster-dev,external-dns/resource=ingress/project/lb\""
$  dig +noall +answer -t a lb.project.example.org
lb.project.example.org. 20 IN A 10.10.10.100
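
For illustration only, here is a minimal, self-contained Go sketch of why a desired TTL of 0 compared against a zone record with a 20-second TTL would register as a difference on every sync loop. The types and names here are hypothetical stand-ins, not external-dns's actual plan code:

package main

import "fmt"

// rr is a hypothetical, simplified record representation for this example only.
type rr struct {
    name string
    ttl  int // 0 stands for "no TTL configured"
    addr string
}

func main() {
    existing := rr{"lb.project.example.org", 20, "10.10.10.100"} // what the zone actually serves
    desired := rr{"lb.project.example.org", 0, "10.10.10.100"}   // what the controller believes it should be

    // If the TTLs are compared literally, an unset (0) desired TTL never matches
    // the 20-second record in the zone, so a remove-and-add is planned every interval.
    if existing.ttl != desired.ttl {
        fmt.Println("TTL mismatch detected -> record removed and re-added on every sync")
    }
}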

@stefanlasiewski
Contributor Author

Here's an updated log containing the debug messages for this server:

time="2020-07-16T20:47:10Z" level=debug msg="Record=web.service.example.org.\t20\tIN\tA\t192.168.100.100"
time="2020-07-16T20:47:10Z" level=debug msg="Record=web.service.example.org.\t20\tIN\tA\t192.168.100.101"
time="2020-07-16T20:47:10Z" level=debug msg="Record=web.service.example.org.\t20\tIN\tTXT\t\"heritage=external-dns,external-dns/ownertag,external-dns/resource=ingress/namespace1/workload1\""
time="2020-07-16T20:47:10Z" level=debug msg="Record=web.service.example.org.\t20\tIN\tRRSIG\tTXT 8 7 20 20200824185808 20200709215244 34551 service.example.org. big-l
ong-key=="
time="2020-07-16T20:47:10Z" level=debug msg="Record=web.service.example.org.\t20\tIN\tRRSIG\tA 8 7 20 20200830034531 20200716194208 34551 service.example.org. big-lon
g-key=="
time="2020-07-16T20:47:10Z" level=debug msg="Endpoints generated from ingress: namespace1/workload1: [web.service.example.org 20 IN A  192.168.100.100 []]"
time="2020-07-16T20:47:10Z" level=debug msg="RemoveRecord.ep=web.service.example.org 20 IN A  192.168.100.100 []"
time="2020-07-16T20:47:10Z" level=info msg="Removing RR: web.service.example.org 20 A 192.168.100.100"
time="2020-07-16T20:47:10Z" level=debug msg="AddRecord.ep=web.service.example.org 20 IN A  192.168.100.100 []"
time="2020-07-16T20:47:10Z" level=info msg="Adding RR: web.service.example.org 20 A 192.168.100.100"
time="2020-07-16T20:47:10Z" level=debug msg="RemoveRecord.ep=web.service.example.org 0 IN TXT  \"heritage=external-dns,external-dns/ownertag,external-dns/resource=ingress/namespace1/workload1\" []"
time="2020-07-16T20:47:10Z" level=info msg="Removing RR: web.service.example.org 0 TXT \"heritage=external-dns,external-dns/ownertag,external-dns/resource=ingress/namespace1/workload1\""
time="2020-07-16T20:47:10Z" level=debug msg="AddRecord.ep=web.service.example.org 0 IN TXT  \"heritage=external-dns,external-dns/ownertag,external-dns/resource=ingress/namespace1/workload1\" []"
time="2020-07-16T20:47:10Z" level=info msg="Adding RR: web.service.example.org 20 TXT \"heritage=external-dns,external-dns/ownertag,external-dns/resource=ingress/namespace1/workload1\""

@stefanlasiewski
Contributor Author

I see that our DNS servers return two IPs for this record (.100 and .101), whereas external-dns tries to update only one (.100).
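
In other words, the zone reports two A targets while the generated endpoint carries only one, so every comparison comes out unequal. A minimal, hypothetical Go illustration of that mismatch (not external-dns's actual reconciliation code):

package main

import (
    "fmt"
    "reflect"
    "sort"
)

func main() {
    // Targets the zone transfer returns for the record...
    zoneTargets := []string{"192.168.100.101", "192.168.100.100"}
    // ...versus the single target generated from the ingress.
    desiredTargets := []string{"192.168.100.100"}

    sort.Strings(zoneTargets)
    sort.Strings(desiredTargets)

    // The lists never match, so a change (remove + add) is planned on every interval.
    if !reflect.DeepEqual(zoneTargets, desiredTargets) {
        fmt.Println("targets differ -> record is removed and re-added every sync loop")
    }
}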

@k8s-ci-robot
Contributor

@stefanlasiewski: The label(s) /label provider/rfc2136 cannot be applied. These labels are supported: api-review, community/discussion, community/maintenance, community/question, cuj/build-train-deploy, cuj/multi-user, platform/aws, platform/azure, platform/gcp, platform/minikube, platform/other, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash

In response to this:

/label provider/rfc2136

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stefanlasiewski
Contributor Author

stefanlasiewski commented Jul 16, 2020

After talking to @tdyas, it seems that #1595 (for DigitalOcean) might be the fix for this as well. Unfortunately, I will have trouble submitting a PR because I'm having trouble with the CNCF CLA.

Code is detailed here:

https://github.com/kubernetes-sigs/external-dns/blame/7505f29e4cec80ca20468b38c03b660a8481277d/provider/digitalocean/digital_ocean.go#L116-L145
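
For context, the DigitalOcean code linked above merges the per-RR endpoints returned by the API into a single endpoint per (name, record type) pair, so a name with two A targets is represented once with both targets. A self-contained sketch of that idea, using simplified stand-in types rather than external-dns's real endpoint package:

package main

import "fmt"

// Endpoint is a simplified stand-in for the external-dns endpoint type.
type Endpoint struct {
    DNSName    string
    RecordType string
    Targets    []string
}

// mergeEndpointsByNameType collapses one-target-per-record endpoints into a
// single endpoint per (DNSName, RecordType), mirroring the approach taken in
// the DigitalOcean provider code linked above.
func mergeEndpointsByNameType(endpoints []Endpoint) []Endpoint {
    merged := map[string]*Endpoint{}
    var order []string
    for _, e := range endpoints {
        key := e.DNSName + "/" + e.RecordType
        if existing, ok := merged[key]; ok {
            existing.Targets = append(existing.Targets, e.Targets...)
            continue
        }
        ep := e
        merged[key] = &ep
        order = append(order, key)
    }
    out := make([]Endpoint, 0, len(order))
    for _, key := range order {
        out = append(out, *merged[key])
    }
    return out
}

func main() {
    eps := []Endpoint{
        {DNSName: "web.service.example.org", RecordType: "A", Targets: []string{"192.168.100.100"}},
        {DNSName: "web.service.example.org", RecordType: "A", Targets: []string{"192.168.100.101"}},
    }
    // Prints a single endpoint carrying both targets.
    fmt.Printf("%+v\n", mergeEndpointsByNameType(eps))
}

With the endpoints merged this way, the observed and desired states would carry the same target set for a multi-target record, so a change should no longer be planned on every interval.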

@stefanlasiewski
Contributor Author

I believe this bug only affects records that have two IP addresses. It appears that the rfc2136 provider has trouble reconciling those two targets, so it loops every interval (--interval=1m).

@stefanlasiewski stefanlasiewski changed the title rfc2136 continually removing then adding A records rfc2136 continually removing then adding A records that have more than 1 target Jul 16, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 14, 2020
@seanmalloy
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2021
@seanmalloy
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 13, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 13, 2021
@stefanlasiewski
Contributor Author

This isn't affecting us anymore, as far as I can tell. I'll go ahead and close this issue.
