Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After update from 0.7.4 to 0.7.5 external-dns try to delete ns record #1895

Closed
Riley-sohn opened this issue Dec 16, 2020 · 21 comments · Fixed by #1915
Closed

After update from 0.7.4 to 0.7.5 external-dns try to delete ns record #1895

Riley-sohn opened this issue Dec 16, 2020 · 21 comments · Fixed by #1915
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@Riley-sohn
Copy link

What happened: After update external-dns image from 0.7.4-debian-10-r67 to 0.7.5-debian-10-r0, external-dns try to delete a record and ns record of domain-filter DNS name. After we change image from 0.7.5 to 0.7.4, it's ok.

How to reproduce it (as minimally and precisely as possible): make hostedzone with filtered-domain name and use external-dns to manage that hosted zone

Anything else we need to know?:

Environment:

  • External-DNS version (use external-dns --version): 0.7.5
  • DNS provider: google
  • Others: I got external-dns images from docker.io/bitnami. update command: helm upgrade ~~ bitnami/external-dns

logs:

time="2020-12-16T06:32:45Z" level=info msg="config: {APIServerURL: KubeConfig: RequestTimeout:30s ContourLoadBalancerService:heptio-contour/contour SkipperRouteGroupVersion:zalando.org/v1 Sources:[service ingress] Namespace: AnnotationFilter: LabelFilter: FQDNTemplate: CombineFQDNAndAnnotation:false IgnoreHostnameAnnotation:false IgnoreIngressTLSSpec:false Compatibility: PublishInternal:false PublishHostIP:false AlwaysPublishNotReadyAddresses:false ConnectorSourceServer:localhost:8080 Provider:google GoogleProject:develop-project GoogleBatchChangeSize:1000 GoogleBatchChangeInterval:1s DomainFilter:[a.bc.def.com] ExcludeDomains:[] ZoneNameFilter:[] ZoneIDFilter:[] AlibabaCloudConfigFile:/etc/kubernetes/alibaba-cloud.json AlibabaCloudZoneType: AWSZoneType: AWSZoneTagFilter:[] AWSAssumeRole: AWSBatchChangeSize:1000 AWSBatchChangeInterval:1s AWSEvaluateTargetHealth:true AWSAPIRetries:3 AWSPreferCNAME:false AWSZoneCacheDuration:0s AzureConfigFile:/etc/kubernetes/azure.json AzureResourceGroup: AzureSubscriptionID: AzureUserAssignedIdentityClientID: CloudflareProxied:false CloudflareZonesPerPage:50 CoreDNSPrefix:/skydns/ RcodezeroTXTEncrypt:false AkamaiServiceConsumerDomain: AkamaiClientToken: AkamaiClientSecret: AkamaiAccessToken: InfobloxGridHost: InfobloxWapiPort:443 InfobloxWapiUsername:admin InfobloxWapiPassword: InfobloxWapiVersion:2.3.1 InfobloxSSLVerify:true InfobloxView: InfobloxMaxResults:0 DynCustomerName: DynUsername: DynPassword: DynMinTTLSeconds:0 OCIConfigFile:/etc/kubernetes/oci.yaml InMemoryZones:[] OVHEndpoint:ovh-eu OVHApiRateLimit:20 PDNSServer:http://localhost:8081 PDNSAPIKey: PDNSTLSEnabled:false TLSCA: TLSClientCert: TLSClientCertKey: Policy:sync Registry:txt TXTOwnerID:external-dns TXTPrefix: TXTSuffix: Interval:1m0s Once:false DryRun:false UpdateEvents:false LogFormat:text MetricsAddress::7979 LogLevel:info TXTCacheInterval:0s TXTWildcardReplacement: ExoscaleEndpoint:https://api.exoscale.ch/dns ExoscaleAPIKey: ExoscaleAPISecret: CRDSourceAPIVersion:externaldns.k8s.io/v1alpha1 CRDSourceKind:DNSEndpoint ServiceTypeFilter:[] CFAPIEndpoint: CFUsername: CFPassword: RFC2136Host: RFC2136Port:0 RFC2136Zone: RFC2136Insecure:false RFC2136TSIGKeyName: RFC2136TSIGSecret: RFC2136TSIGSecretAlg: RFC2136TAXFR:false RFC2136MinTTL:0s NS1Endpoint: NS1IgnoreSSL:false NS1MinTTLSeconds:0 TransIPAccountName: TransIPPrivateKeyFile: DigitalOceanAPIPageSize:50}"
time="2020-12-16T06:32:45Z" level=info msg="Instantiating new Kubernetes client"
time="2020-12-16T06:32:45Z" level=info msg="Using inCluster-config based on serviceaccount-token"
time="2020-12-16T06:32:45Z" level=info msg="Created Kubernetes client https://10.110.0.1:443"
time="2020-12-16T06:32:56Z" level=info msg="Change zone: dev-hostedzone batch #0"
time="2020-12-16T06:32:56Z" level=info msg="Del records: a.bc.def.com. NS [ns-cloud-d100.googledomains.com ns-cloud-d12.googledomains.com ns-cloud-d43.googledomains.com ns-cloud-d22.googledomains.com] 21600"
time="2020-12-16T06:32:56Z" level=info msg="Del records: a.bc.def.com. TXT [\"heritage=external-dns,external-dns/owner=external-dns,external-dns/resource=ingress/dev/dev\"] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: a.bc.def.com. A [11.111.11.111] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: a.bc.def.com. TXT [\"heritage=external-dns,external-dns/owner=external-dns,external-dns/resource=ingress/dev/dev\"] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: p66.a.bc.def.com. A [11.111.11.111] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: p66.a.bc.def.com. TXT [\"heritage=external-dns,external-dns/owner=external-dns,external-dns/resource=ingress/dev/p66\"] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: p67.a.bc.def.com. A [11.111.11.111] 300"
time="2020-12-16T06:32:56Z" level=info msg="Add records: p67.a.bc.def.com. TXT [\"heritage=external-dns,external-dns/owner=external-dns,external-dns/resource=ingress/dev/p67\"] 300"
time="2020-12-16T06:32:57Z" level=error msg="googleapi: Error 400: Invalid value for 'entity.change.deletions[0].rrdata[0]': 'ns-cloud-d100.googledomains.com'\nMore details:\nReason: invalid, Message: Invalid value for 'entity.change.deletions[0].rrdata[0]': 'ns-cloud-d100.googledomains.com'\nReason: invalid, Message: Invalid value for 'entity.change.deletions[0].rrdata[1]': 'ns-cloud-d12.googledomains.com'\nReason: invalid, Message: Invalid value for 'entity.change.deletions[0].rrdata[2]': 'ns-cloud-d43.googledomains.com'\nReason: invalid, Message: Invalid value for 'entity.change.deletions[0].rrdata[3]': 'ns-cloud-d22.googledomains.com'\n"
@Riley-sohn Riley-sohn added the kind/bug Categorizes issue or PR as related to a bug. label Dec 16, 2020
@Raffo
Copy link
Contributor

Raffo commented Dec 21, 2020

Thanks for the report, we'll look into this issue.

@Raffo
Copy link
Contributor

Raffo commented Dec 21, 2020

@Riley-sohn can you also try the official image mentioned in the release (k8s.gcr.io/external-dns/external-dns:v0.7.5) instead of the bitnami one? I don't know what kind of release process bitnami has and thus not sure of what version they were actually running, given that 0.7.5 is officially available as of today.

Can you also provide more info on when you see it and other possible information on how to fully reproduce the bug?

@backjo
Copy link

backjo commented Dec 21, 2020

We are also seeing this bug on AWS - using k8s.gcr.io/external-dns/external-dns:v0.7.5 after upgrading from k8s.gcr.io/external-dns/external-dns:v0.7.4.

@backjo
Copy link

backjo commented Dec 21, 2020

904d8e4 (#1813) seems to have included support for NS records being returned from the registry. In our case, we are managing a hosted zone for foo.bar.com. We also have an A record for the same domain name, pointing to an AWS Load Balancer.

It looks like External DNS is seeing the NS record being returned from the registry, not finding any resources which map to it (since NS records are usually managed outside of external DNS), and tries to delete it since it thinks it's 'managed'. There doesn't appear to be enough metadata for External DNS to distinguish whether it's supposed to manage the NS record (which is a new feature) or if it is managed externally. (as would be the case for anyone previously using 0.7.4)

@Raffo
Copy link
Contributor

Raffo commented Dec 21, 2020

@backjo I will take a deeper look at the problem tomorrow. We can think of feature-flagging it for now, but I'd be happy to hear suggestions before I analyze the problem.

@backjo
Copy link

backjo commented Dec 21, 2020

@Raffo haven't thought about it too deeply, but the first thought that comes to mind is that for this use case, there should probably be metadata in the registry that tracks which specific records for a domain name are being managed by external-dns and which records are not.

@Raffo
Copy link
Contributor

Raffo commented Dec 22, 2020

Thank you for the context @backjo , that seems to be quite bad. I will try to reproduce it and fix it as soon as possible which means we'll likely rush 0.7.6 (till we have a better versioning convention, coming next year).

@Raffo
Copy link
Contributor

Raffo commented Dec 23, 2020

@backjo are you also using google as provider? I don't have a GKE cluster to use to reproduce the bug, but I tried deploying to AWS and I was not able to reproduce the error. Can you please provide the full manifest that you use to deploy ExternalDNS and the full example of what you are doing?

@backjo
Copy link

backjo commented Dec 24, 2020

@Raffo unlike the original poster, we are using AWS / EKS , not GKE - it appears to be provider agnostic.

Try the following setup:

  • Setup external dns 0.7.4 on your cluster
  • Create a hosted zone in Route53 for foo.yourdomain.com
  • Setup NS records in foo.yourdomain.com using AWS, not ExternalDNS
  • Create an ingress with the external dns annotation for foo.yourdomain.com
  • Upgrade to 0.7.5 - and you should see logs for ExternalDNS attempting to delete the NS records

@ytsarev
Copy link
Member

ytsarev commented Dec 25, 2020

@Raffo @backjo maybe we should provide a way to skip a specific DNS record type from plan? E.g. make plan condition list configurable, somewhere around https://github.com/kubernetes-sigs/external-dns/pull/1813/files#diff-0967da8a495108f825298b0bc67c95fe9d301a1bd86baadc65a7cb3623686cc3R245

@backjo
Copy link

backjo commented Dec 28, 2020

WDYT about storing information in the registry around which specific record types are being managed for a given domain - I could see an eventual case where I might want to manage some domain NS records via external DNS, but others only maybe the A record.

@Raffo
Copy link
Contributor

Raffo commented Dec 30, 2020

Ok, we took some time with @njuettner to try to reproduce the bug and we managed to reproduce it following the steps that @backjo provided. The issue is clearly created by #1813 and we discussed at length how we want to fix it. ExternalDNS was not designed to handle custom records as we are doing with the DNSEndpoint resource and it never had the goal to be the one solution to manage everything DNS related in the cloud by using Kubernetes for everything. If the community is trying to achieve this with ExternalDNS, we'll have to talk more about this, as it could be a significant shift for the project.
That said, we will take another week to consider the fix for this, with the final decision to be taken on Jan the 6th.
We're now considering the following options:

  • revert Add NS record support #1813 and release 0.7.6 without it. This will allow people to safely use the changes that we shipped in 0.7.5 without this bug.
  • feature flag the functionality of Add NS record support #1813 and release 0.7.6 with this change. The side effects could be still surprising for users that decide to opt in for "managing NS records"
  • add the type of record to the TXT record. We'll have to assess the impact of that to the rest of the functionality that we provide.

Of course, a revert is to ensure our priority which is stability, not to say that features should not be added. If we ended up deciding for such a solution, we will start a discussion with the community to understand the need to manage custom DNS records with ExternalDNS.

@backjo: I have one more question for you: given that a NS record is delegating to another zone, can you please tell me what is the use case to mix A/CNAME and TXT records with the same name?

@backjo
Copy link

backjo commented Jan 4, 2021

@Raffo In this case, these NS records are in the delegated zone, not the root zone - i.e. this zone is being delegated to from another, higher level hosted zone.

foo.com Zone
NS bar.foo.com ns1.org

bar.foo.com Zone
NS bar.foo.com
A bar.foo.com 192.168.1.1

@Raffo
Copy link
Contributor

Raffo commented Jan 5, 2021

Thank you for the update @backjo , very appreciated. As I said, I will give an update tomorrow with a plan to move forward.

@universam1
Copy link

We are surprised by this bug happening in all of our clusters in AWS Route53 as well!

This is pretty scary that it tries to delete the NS records!

time="2021-01-11T09:16:25Z" level=info msg="Desired change: DELETE zone.reducted.org NS [Id: /hostedzone/xxxxxxxxxx]"
time="2021-01-11T09:16:25Z" level=info msg="Desired change: CREATE zone.reducted.org A [Id: /hostedzone/xxxxxxxxxx]"
time="2021-01-11T09:16:25Z" level=error msg="Failure in zone zone.reducted.org. [Id: /hostedzone/xxxxxxxxxx]"
time="2021-01-11T09:16:25Z" level=error msg="InvalidChangeBatch: [Tried to create resource record set [name='zone.reducted.org.', type='A'] but it already exists, A Hosted Zone must contain at least one NS record for the zone itself.]\n\tstatus code: 400, request id: 7a331ff2-dee8-4f9f-a31e-xxxxx"
time="2021-01-11T09:16:25Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/xxxxxxxxxx]"

@Raffo
Copy link
Contributor

Raffo commented Jan 11, 2021

@universam1 I'm a bit late with a fix, but I'm going to be publishing an update today.

@universam1
Copy link

Thank you for quick response!

Guess what surprised us most that such a change came with a patch version only- we would expect that to be reflected by a feature or major version bump at least. Our review process is based on semantic versioning principle trust.

Thank you for considering that.

@Raffo
Copy link
Contributor

Raffo commented Jan 11, 2021

For sure we've been lacking a proper versioning strategy that could give better ideas of the change contained in a release. We did however take care of putting good information in the release notes and in fact the release for 0.7.5 contained information on this bug as soon as it was discovered. Again, that's a bug, not a breaking change designed on purpose to be one. I'm in the process of working around it.

@Raffo
Copy link
Contributor

Raffo commented Jan 12, 2021

This is what I came up with to work around this issue: #1915 . We basically feature flag the NS records support, thus "protecting" the default functionality from ExternalDNS.

@Raffo
Copy link
Contributor

Raffo commented Jan 12, 2021

All right, the fix was merged, I will start the release process later today, we will have an official 0.7.6 image released by the end of the week.

@Raffo
Copy link
Contributor

Raffo commented Jan 14, 2021

I manually tested the change on AWS and I can confirm that the regression is fixed. We proceed with the release of v0.7.6.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants