
Clarify AWS BatchSize and Record Sets limitations #3203

Closed
briantward opened this issue Nov 30, 2022 · 12 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@briantward

What would you like to be added:

In the Throttling section, the external-dns docs suggest adjusting --aws-batch-change-size to help manage API rate limiting. Specifically, they suggest increasing the value to a higher number.

https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/aws.md#throttling

We found this terminology somewhat misleading. Perhaps a small code change or a documentation/terminology change would make this better; I'm not sure yet, and I'm open to suggestions. We had to reduce aws-batch-change-size to 10 in order not to hit the AWS error InvalidChangeBatch: [Number of records limit of 1000 exceeded.]. This is because each batch change that external-dns submitted contained thousands of record changes. Speaking with AWS support, we learned that 1000 is a hard limit on record changes per Route53 batch.

For background, we had a cluster of 107 nodes and 34-37 NodePort Services annotated for external-dns. In the external-dns pod logs we saw TXT and A record CREATEs and UPSERTs. We also saw from the AWS console that the change record set count was high, as there appeared to be a record for each IP address of each node in the cluster. If my math is correct, that would be 34 x 2 x 107 = 7276 records (if all records were created new, though of course some were probably just updates/reconciles). It appeared to me that it was all being submitted to AWS in one batch, though there is probably more to it than that. We originally tried increasing the value to 4000 as in the doc, which continued to fail with the same error. Then we tried 250 and it still failed. When we got down to 10, things started working again. If we know the hard limit on records in each AWS batch, why not automatically calculate the appropriate batch size based on that record limit? Also, what happens on even busier clusters, where a batch size of 1 might still contain more than 1000 record changes? I suspect it would fail.
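For reference, this is roughly how we are setting the flag now (illustrative sketch only; most of our other flags are omitted, and 10 is just the value that happened to work for our cluster):

$ external-dns \
    --provider=aws \
    --source=service \
    --aws-batch-change-size=10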

Why is this needed:

To improve the quality of life. :-)

@briantward briantward added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 30, 2022
@benjimin

benjimin commented Nov 30, 2022

https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNSLimitations.html#limits-api-requests

Looks like batching is recommended to avoid the 5-requests-per-second API rate limit. But since batches are capped by AWS at 1000 records (or 500 upserts, or perhaps even fewer if the value strings are typically longer than about 62 characters), it seems very odd that the external-dns docs suggest trying 4000.

@briantward are you saying that the number of records per batch can be greater than the specified batch size? (I would have expected only 3 records per DNS address, e.g. 3 x 37 in your example, although presumably for NodePorts they would all get updated every time any of your 107 nodes goes offline or gets replaced? But if you had 107 IP addresses in your value strings and have to upsert all of the records at once, you'd surely be close to the character-length cap; I wonder if that produces the same error message? I'm curious now, looking in the Route53 console, how long your value strings on the TXT and other [probably A] records are?)
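If it's easy, something like this should dump the record sets for a given name so you can see how long the value strings are (the hosted zone ID and hostname here are just placeholders):

$ aws route53 list-resource-record-sets --hosted-zone-id ZZZ \
    --query "ResourceRecordSets[?Name=='myservice.example.com.']" \
    --output json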

@briantward
Author

briantward commented Dec 1, 2022 via email

@benjimin

benjimin commented Dec 4, 2022

I think you're right: each annotated NodePort Service should imply a separate A record for every node (i.e. with the same hostname but a different IP), although they are grouped together into a single record in the AWS console, or a resource record set in the Route53 API. But the resource limit refers to 1000 ResourceRecord elements (not Sets), with UPSERTs counting twice.

I can confirm that in v0.13.1 a NodePort Service (tested with about a dozen nodes) is referred to in its log output as one single record (or rather 3 records, including both versions of the TXT provenance label) even when the entire service is being created or deleted. Every time a node is added or removed from the cluster, external-dns logs another UPSERT of those 3 records. (So with ~100 nodes these "3 records" actually contain ~102 ResourceRecords, which count twice as UPSERTs, i.e. roughly 20% of the AWS request size limit.)

@briantward just curious but why do you want external names for multiple NodePort services on the same cluster? Isn't each different NodePort hostname going to be equivalent (routed identically) anyway? Is there a reason you're not using ClusterIP DNS internally and some kind of gateway (such as a load balancer) externally, to avoid needing clients to know unique port numbers for each service?

@benjimin

benjimin commented Dec 4, 2022

I don't think it is possible to separate the individual DNS records into different batches.

Let's say we want to use the Route53 API to mimic assigning a hostname to a NodePort, that is, to manage multiple A entries with the same host domain name.

$ function route () {
    aws route53 change-resource-record-sets --hosted-zone-id ZZZ \
        --change-batch "{\"Changes\":[{\"Action\":\"$1\",\"ResourceRecordSet\":{\"Name\":\"test.example\",\"Type\":\"A\",\"ResourceRecords\":$2,\"TTL\":300}}]}"
}
$ route CREATE '[{"Value":"4.4.4.4"}]'                        # succeeds

$ route CREATE '[{"Value":"1.2.3.4"},{"Value":"3.3.3.3"}]'    # cannot CREATE again for the same name
InvalidChangeBatch: Tried to create resource record set .. but it already exists

$ route UPSERT '[{"Value":"1.2.3.4"},{"Value":"3.3.3.3"}]'    # succeeds, replacing the entire record set

$ route DELETE '[{"Value":"3.3.3.3"}]'                        # cannot remove a single value from the set
InvalidChangeBatch: Tried to delete resource record set .. but the values provided do not match the current values

$ route DELETE '[{"Value":"3.3.3.3"},{"Value":"1.2.3.4"}]'    # succeeds only when all current values are re-specified

In other words, it seems impossible to update any of the records (within the record set corresponding to a single host name) without re-specifying all of the desired records in a single API request. If you can't incrementally update the set, then you can't partition the records between batches; you can only decide how many record sets (how many services) are batched together.

@briantward
Author

briantward commented Dec 5, 2022

@benjimin I'm still working on permission to post scrubbed logs. But here's essentially what I see:

An UPSERT is made of 2 type=TXT ResourceRecordSets and 1 type=A ResourceRecordSet. The TXT ResourceRecordSets each contain a single ResourceRecord, while the A ResourceRecordSet contains N ResourceRecords, where N is the number of nodes in the cluster. The latter is what is causing our problem, because we have 107 nodes and 36 NodePorts (36 x 2 + 36 x 107 = 3924 ResourceRecords once the batch size is anything much larger than 10).

"changeBatch":  {
  "changes": [
    {
      "action": "UPSERT",
      "resourceRecordSet": {
        "name": "a-myservice.mycluster.com".
        "type": "TXT",
        "tTL": 300,
        "resourceRecords": [
            {
              "value": "\"heritage=external-dns......"
            }
         ]
      }
    },
    {
      "action": "UPSERT",
      "resourceRecordSet": {
        "name": "myservice.mycluster.com".
        "type": "TXT",
        "tTL": 300,
        "resourceRecords": [
            {
              "value": "\"heritage=external-dns......"
            }
         ]
      }
    },
    {
      "action": "UPSERT",
      "resourceRecordSet": {
        "name": "myservice.mycluster.com".
        "type": "A",
        "tTL": 300,
        "resourceRecords": [
            {
              "value": "192.168.1.2"
            },
            {
              "value": "192.168.1.3"
            },
            {
              "value": "192.168.1.4"
            }
         ]
      }
    },
...

So this seems to come down to the batch size being counted in ResourceRecordSets, while AWS imposes a separate limitation on the total number of ResourceRecords. There would definitely be a problem, for example, if a single ResourceRecordSet exceeded the 1000-ResourceRecord limitation. Thankfully that's not our problem.

I could see external-dns working around this quirk of the AWS API by looping through the pending changes and sizing each delivered batch to however many ResourceRecordSets fit before going over the 1000-ResourceRecord limit. But that comes with two thoughts: 1) this is an AWS API problem and such a change is only a mitigation, and 2) is building that workaround worth it (does it cost us more compute to optimize the batch being sent?) as opposed to just setting the batch size very low. One can argue that a dynamic calculation, at the least, means no one ever has to think about the batch size again. I may take a stab at it if I have time and you think it worth considering.
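To sketch what I mean (illustrative shell only, not external-dns code; recordsets.txt is a made-up input of "name record-count" pairs, and the 1000/x2 accounting is based on the AWS limits discussed above):

# Greedily group record sets into batches so that the total ResourceRecord
# count per batch stays under the Route53 limit.
LIMIT=1000
total=0
batch=()
while read -r name count; do
    weight=$((2 * count))    # assume UPSERT, which counts each record twice
    if (( ${#batch[@]} > 0 && total + weight > LIMIT )); then
        echo "flush batch of ${#batch[@]} record sets ($total records): ${batch[*]}"
        batch=()
        total=0
    fi
    batch+=("$name")
    total=$((total + weight))
done < recordsets.txt
(( ${#batch[@]} > 0 )) && echo "flush batch of ${#batch[@]} record sets ($total records): ${batch[*]}"

(Note that a record set which is individually over the limit would still be attempted in a batch by itself.)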

Finally, yes, I completely agree that a single "gateway" hostname/ELB could suffice for all these services, and that has been suggested. I was told that the LoadBalancer-per-Service option caused issues with hitting limits on AWS SecurityGroup rules. I have also suggested and confirmed that using external-dns with externalTrafficPolicy=Local would reduce the number of A records to just the nodes where the pods are actually hosted, which may also be a reasonable option if we must continue to create a custom external DNS hostname for each application (while accepting the other implications of the Local externalTrafficPolicy).
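(For reference, switching an existing Service over is just something like the following, with a placeholder service name:)

$ kubectl patch service myservice -p '{"spec":{"externalTrafficPolicy":"Local"}}'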

Thanks!

@benjimin

benjimin commented Dec 8, 2022

So I think we would like:

  • The logs already output for each record-set should include a (parenthetical) mention of how many records each record-set contains.
  • The default limit setting should be 500 (the aws/route53 max no. of upsert records per change-set), and the docs should not encourage a higher value than this.
  • Unrelated record-sets should not be combined into the same change-set if this will cause the total record count to exceed the size limit.

I'm less confident about the behaviour we want if a single record-set individually has too many records. I think it should still be attempted, so that users can continue to specify --aws-batch-change-size=1 as a work-around for other faults that cause some batches to fail. Also, I wondered if related record-sets (such as the corresponding TXT records) should be combined in the same change-set for sake of transactional/atomic behaviour, but I think such complication is unnecessary.

@briantward I still think your use-case is a bit unusual as NodePorts are often discouraged; a common pattern is to use ingresses to control a single gateway (such as an ALB or even just nginx) that proxies traffic to multiple services with different external hostnames. Whereas if you have services A, B, etc and you want your clients to access them like hostA.mydomain:30001, hostB.mydomain:30002 etc, how do your clients know which port number to use for each service, and are you worried about your DNS record sets being too large for UDP queries?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 8, 2023
@nipr-jdoenges

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 10, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 8, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned Jan 19, 2024