Why does external-dns poll? Polling causes too many API requests #484
Comments
At some stage it might make sense to integrate "watch" capabilities, but polling is probably required anyway. For example, if External-DNS has not been running for a while, the services and ingresses created during that period still need to be handled. I am not entirely sure how well Kubernetes handles watching, but a year or so ago I found the API to be buggy. The problem with "watching" is that we cannot simply make an API call to the DNS provider on every single event, because those calls usually cost money and are normally rate limited. So with "watching" we would have to do some aggregation and batching. We could make the polling interval configurable to reduce the number of API calls; however, I don't believe "watching" is a better solution to the "problem", especially in big clusters with lots of ingresses and services. |
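For illustration, the "aggregation and batching" mentioned in this comment could look roughly like the sketch below. It is not external-dns code: `Event`, `Coalesce` and `Chunk` are hypothetical names, and the event type is a simplified stand-in for what a Kubernetes watch would deliver.

```go
package main

import "sort"

// Event is a hypothetical, simplified change notification from a watch.
type Event struct {
	Host   string // DNS name the event refers to
	Target string // desired target; empty means the record should be removed
}

// Coalesce keeps only the most recent event per host, so a burst of
// updates to the same record results in a single provider change.
func Coalesce(events []Event) []Event {
	latest := make(map[string]Event, len(events))
	for _, e := range events {
		latest[e.Host] = e // later events overwrite earlier ones
	}
	out := make([]Event, 0, len(latest))
	for _, e := range latest {
		out = append(out, e)
	}
	// Sort by host so the output is deterministic; map order is random.
	sort.Slice(out, func(i, j int) bool { return out[i].Host < out[j].Host })
	return out
}

// Chunk splits changes into batches of at most size entries, so each
// batch can be sent to the DNS provider in a single API call.
func Chunk(changes []Event, size int) [][]Event {
	if size < 1 {
		size = 1
	}
	var batches [][]Event
	for len(changes) > size {
		batches = append(batches, changes[:size])
		changes = changes[size:]
	}
	if len(changes) > 0 {
		batches = append(batches, changes)
	}
	return batches
}
```

A watcher would buffer events for a short window, call `Coalesce` on the buffer, and then send each batch from `Chunk` as one provider request. How long to buffer is a trade-off between latency and API usage.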
I'm not seeing in the code where the polling is necessary, you can watch the event stream and just append changes as they come in and call The main improvement is that you eliminate API calls altogether until a change actually needs to be made. If you're concerned about a fresh pod not being aware of changes that happened since starting you can do one initial poll to get the current state and then update as necessary. I can contribute the code changes necessary to make this happen if that's a concern. Just to clarify my issue and why I think this is a major problem. In our environment we use AWS and we have several clusters where |
I don't believe it is as simple as you described. With the concepts of ownership and multi-target records, you have to maintain information such as who owns the record and whether it can be modified, either in memory (a cache) or by making a get call to the DNS provider. You want to avoid the latter, but with in-memory storage you might as well diff against the previous state to see whether an update is required. I would make this optional and not recommended for use anyway. However, I would love to see a proposal on how to use "watch" first, with a proper description of how external-dns will operate and preserve all the features it currently has. |
Are we even talking about the same thing? Are we talking about polling the Kubernetes API or the AWS API? @azuretek mentions hitting the rate limits of AWS. Maybe we should identify the actual problem first before discussing potential solutions or improvements? Is the problem "External DNS hits AWS API rate limits"? |
@hjacobs I think he means using Kubernetes API events to watch for changes and then making the AWS API call, otherwise staying idle. Currently the problem is that we fetch the list of records from AWS even if no changes are required, and this is the API call we want to prevent. However, External DNS is smart enough not to "post" changes to the AWS API if no changes were detected. External DNS hitting AWS API rate limiting is a problem, but I think it should be addressed in other ways, e.g. by caching results. #178 |
How about having the controller trigger off informers watching Service/Ingress with the informer resync periods set to The resync period/TTL cache would ensure that we maintained the current functionality (i.e. always ensuring state is reconciled between the provider and the cluster at least once per API rate limits could be handled by exposing Related: #14 |
I've run into this when running in an AWS account with a large number of Route53 zones. For whatever reason, it polls zones even if there are no ingress/service/etc manifests referencing that zone. Is there any way (besides filtering on domain name param) to optimise things such that it doesn't look at zones not relevant to anything configured inside kubernetes? (In my case the account had 250+ zones... and with no filter, despite the cluster coming up with maybe a half-dozen records on just a single zone, all 249 other zones are getting scanned, confirmed by looking at CloudTrail logs, resulting in the API throttling so badly it sometimes took 10-20 minutes before external-dns could get any records provisioned.) For the moment I've worked around it by specifying a whitelist of domains that can get managed by external-dns to keep how much it's scanning to a minimum. |
Some things to add to this thread: Watching on k8s events and batching seems fine but those aren't your only events, yeah? What happens if a record gets modified outside of external-dns' scope? A regular poll as @prydie suggests would still be wise. @jhohertz to your point I thought that was unintuitive too but external-dns has to delete records too. That said, whitelisting domains is the way to go and that's what we do. We include all our public domains, and then only the private domains for the VPC we're running external-dns in, for each VPC. Just ranting here, but honestly the problem here is with Amazon's APIs, which I understand we can't easily change... ideally they would give you the ability to post to an SNS topic or something like that when Route53 calls are made so we could watch on AWS events the same as we can on K8s events. |
Adds a new flag `--aws-api-retries` which allows overriding the number of retries that API calls will attempt before giving up. This somewhat mitigates the issues discussed in kubernetes-sigs#484 by allowing the current sync attempt to complete vs. failing and starting anew. Defaults to 3, which is what the aws-sdk-go defaults to where not specified. Signed-off-by: Joe Hohertz <joe@viafoura.com>
We're seeing similar things with the Cloudflare provider. Our account has approximately 10,000 zones, which means (with the maximum pagination allowed) that's 200 API calls just to return the zones. --domain-filter dictates that we're only actually interested in two of those zones, and in those zones there are only about 75-100 pages of records. Cloudflare limits 1200 requests per 5 minutes, which with external-dns' default interval of 1m gives room for about 240 requests a minute, which based on the above means we're hitting the limit (the issue is exacerbated if you reuse client credentials on more than one cluster running external-dns). Lengthening the interval is certainly a workaround, but of course it does mean provisioning of services is impacted. Would restructuring so that --domain-filter is used at the time records/zones are queried in the provider to only look at said zones, rather than just being used to filter records after they have been retrieved from the provider, solve this, or are there other considerations needed? |
Do you confirm that this is happening with the latest version released (v0.5.11)? |
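The arithmetic in the Cloudflare report can be checked with a few lines: 10,000 zones at 50 zones per page (the page size implied by the 200 calls mentioned) is 200 list calls, and 1200 requests per 5 minutes is 240 per minute. The sketch below also shows the suggested restructuring in its simplest form, selecting zones by filter before any per-zone record queries are made. All function names are hypothetical, and the sketch makes no claim about what any external-dns provider actually does.

```go
package main

import "strings"

// PagesNeeded returns how many paginated list calls are required to
// enumerate items when each call returns at most perPage entries.
func PagesNeeded(items, perPage int) int {
	if items <= 0 || perPage <= 0 {
		return 0
	}
	return (items + perPage - 1) / perPage // ceiling division
}

// PerMinute converts a limit of `limit` requests per `windowMinutes`
// minutes into a per-minute request budget.
func PerMinute(limit, windowMinutes int) int {
	if windowMinutes <= 0 {
		return 0
	}
	return limit / windowMinutes
}

// MatchesFilter reports whether zone equals one of the filters or is a
// subdomain of one. An empty filter list matches every zone.
func MatchesFilter(zone string, filters []string) bool {
	if len(filters) == 0 {
		return true
	}
	zone = strings.TrimSuffix(strings.ToLower(zone), ".")
	for _, f := range filters {
		f = strings.TrimSuffix(strings.ToLower(f), ".")
		// Require a label boundary so "notexample.org" does not match "example.org".
		if zone == f || strings.HasSuffix(zone, "."+f) {
			return true
		}
	}
	return false
}

// SelectZones keeps only the zones worth querying for records.
func SelectZones(zones, filters []string) []string {
	var keep []string
	for _, z := range zones {
		if MatchesFilter(z, filters) {
			keep = append(keep, z)
		}
	}
	return keep
}
```

Filtering this way only saves the per-zone record queries. Enumerating the zones themselves still costs the full 200 calls unless the provider's API can filter zones by name on the server side.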
Correct, 0.5.11 |
@Evesy it won't solve your problem completely, but we've been using the new |
From looking over things and testing on larger AWS deployments, there are several key problems
|
In our environment, we too are hitting rate limits on AWS. I have already increased our AWS retries to 10, although I am now considering 13 with a much longer interval. We have added the --events support to compensate for the longer interval, but that too can be rate limited, which puts us back in the same situation.
|
In our case we settled for one AWS account per cluster. Putting even just two k8s clusters on the same AWS account easily triggers the default rate limit. Thankfully we don't have that many so it's manageable this way. It also provides us with greater isolation and accounting across clusters so it's not like we did this solely for external-dns, but just saying... |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
/remove-lifecycle stale |
Probably not the best solution for everyone, but I ended up working around this by spinning up two $5/mo VPS instances at DigitalOcean in two different regions. I installed PowerDNS with a sqlite3 backend, enabled the webserver, set an API key, and reconfigured external-dns. It synced around 350 domains in ~2 seconds. Goodbye provider rate limits. |
Please create a proposal that outlines the problem (I see the problem but this is needed to clarify it) and how a solution would work for all the cases outlined here. Thanks for understanding the maintainers need some help here. |
Proposal: It might be easier to allow users to configure a "sleep between API calls" setting. Sleeping for 0.5 seconds between API calls would cap the rate at ~120 API calls per minute. Example: Workarounds:
|
+1 to this. The API clients need to be able to back off when getting 429s; currently the pod will just crash with:
This then causes it to come back up and sync again immediately, which somewhat exacerbates the issue 🤷 |
Is there a reason external-dns is polling? Why not watch the event stream and trigger updates that way? There's no reason to poll on an interval if you can just watch for changes. It would drastically reduce the number of API requests and also be a lot quicker to reflect changes as services and ingresses are deployed.