
Performance regression with the AWS provider in 0.5.9 #869

Closed · lbernail opened this issue Jan 27, 2019 · 17 comments

lbernail commented Jan 27, 2019

Hello

We tried to update our deployment to 0.5.9, but we noticed a significant performance regression due to #742.
After this PR, Records() is called for each change instead of once per plan (https://github.com/kubernetes-incubator/external-dns/blob/v0.5.9/provider/aws.go#L367), and each call retrieves all records for the zone.

A simple fix could be to store the result of the Records() call in the AWSProvider. I can open a PR for this if you think it makes sense.

In addition, the call to ListResourceRecordSetsPages in the Records() function is paginated, but the AWS Go SDK does not take rate limits into account. My understanding is that each page holds 100 records, which means that on large zones we can quickly hit the rate limit of 5 requests per second (see https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNSLimitations.html#limits-api-requests).

We could retry and back off on rate-limit errors, but given how low the limit is and how simple the call pattern is, we could also simply sleep for 200ms (or 250ms to be safe, or even make it configurable) at the end of the callback.
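For illustration, a minimal sketch of that sleep-in-the-callback approach, assuming the aws-sdk-go v1 Route 53 pager; the helper name and the fixed 200ms pause are illustrative, not code that ships in external-dns:

```go
package provider

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/route53"
	"github.com/aws/aws-sdk-go/service/route53/route53iface"
)

// listRecordsThrottled is a hypothetical helper: it pages through a hosted
// zone and pauses between pages to stay under Route 53's ~5 req/s API limit.
func listRecordsThrottled(client route53iface.Route53API, zoneID string) ([]*route53.ResourceRecordSet, error) {
	var records []*route53.ResourceRecordSet
	input := &route53.ListResourceRecordSetsInput{
		HostedZoneId: aws.String(zoneID),
	}
	err := client.ListResourceRecordSetsPages(input,
		func(resp *route53.ListResourceRecordSetsOutput, lastPage bool) bool {
			records = append(records, resp.ResourceRecordSets...)
			if !lastPage {
				// Each page returns up to 100 records; sleeping ~200ms keeps
				// us at or below 5 requests per second for this zone.
				time.Sleep(200 * time.Millisecond)
			}
			return true // keep paging
		})
	return records, err
}
```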

Laurent

lbernail changed the title from "Performance regression on AWS in 0.5.9" to "Performance regression with the AWS provider in 0.5.9" on Jan 27, 2019
linki (Member) commented Jan 28, 2019

Hi @lbernail,

Thanks for your report. That looks like a serious problem.

For a quick fix, you could also try whether using the registry cache helps.

Caching in the provider would be fine, but it would add another layer of caching, which could become very confusing if done wrong. We could also think about fixing the code from #742 to avoid making a call to AWS.

linki (Member) commented Jan 28, 2019

From looking at the code it seems that using the txt-registry cache doesn't help.

Regarding a fix: it looks like newChange needs the list of current records in order to do its job. We could query for the list of records outside of newChange and just pass it in, which would reduce the number of requests to one per cycle (or rather, one more compared to v0.5.8).

We could add this here and pass the list of records to each call of newChanges.
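For illustration, a hedged sketch of that idea against provider/aws.go; the extra "current" parameter is hypothetical and the signatures are simplified, so this is the shape of the change rather than the actual external-dns code:

```go
// Sketch only: list the zone once per reconcile cycle and thread that slice
// into every newChanges call, instead of letting each change trigger its own
// Records() lookup. AWSProvider, newChanges and submitChanges are assumed to
// come from provider/aws.go; the signatures here are simplified.
func (p *AWSProvider) ApplyChanges(changes *plan.Changes) error {
	current, err := p.Records() // a single Route 53 listing per cycle
	if err != nil {
		return err
	}

	combined := p.newChanges(route53.ChangeActionCreate, changes.Create, current)
	combined = append(combined, p.newChanges(route53.ChangeActionUpsert, changes.UpdateNew, current)...)
	combined = append(combined, p.newChanges(route53.ChangeActionDelete, changes.Delete, current)...)

	return p.submitChanges(combined)
}
```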

lbernail (Author) commented:

Yes, the cache won't work because we are calling the provider's Records method and not the one from the registry.

Doing two calls would be a lot better, but the second one still feels unnecessary. In our case we have zones with thousands of records, and we run a slightly modified version that sleeps in the ListResourceRecordSetsPages callback to avoid triggering AWS rate limits; a full zone sync takes 10+ seconds.

linki (Member) commented Jan 28, 2019

It's not ideal but much better than the current behaviour and fairly safe to implement. We could improve it with proper caching later. What do you think?

lbernail (Author) commented:

Definitely a lot better

Another idea: what about passing the recordsCache to the provider's ApplyChanges call?
https://github.com/kubernetes-incubator/external-dns/blob/master/registry/txt.go#L174

We could just ignore it in the other providers (I assume some could make use of it too).
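For illustration, a hypothetical shape of that interface change; this is not the actual external-dns Provider interface, just a sketch of the idea:

```go
package provider

import (
	"github.com/kubernetes-incubator/external-dns/endpoint"
	"github.com/kubernetes-incubator/external-dns/plan"
)

// Hypothetical variant of the Provider interface: the registry passes its
// cached endpoints down so a provider can apply changes without re-listing
// the zone. Providers that don't need the hint could simply ignore it.
type Provider interface {
	Records() ([]*endpoint.Endpoint, error)
	ApplyChanges(changes *plan.Changes, cachedRecords []*endpoint.Endpoint) error
}
```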

nesl247 commented Jan 28, 2019

I think this is affecting us as well. However, we didn't experience this until 0.5.10: upon upgrading, ALL domains in the cluster were updated. I saw this happen in both our development and staging clusters, but we didn't experience any issues until I deployed this in staging. Now both clusters are complaining with "getting records failed: Throttling: Rate exceeded".

I'm not sure why any of the domains needed to be updated anyway, given there were no changes other than upgrading from 0.5.9 to 0.5.10. In 0.5.9 we only really saw "All records are already up to date", as nothing was changed.

Update: downgrading to 0.5.9 resolves the issue I am experiencing. Downgrading returned us to the "all records are already up to date" message instead of it trying to update every record.

linki (Member) commented Jan 29, 2019

@lbernail I worked on a rather quick fix to reduce the number of requests from O(#records) to O(#zones) which should already help: #876.

Using the txt registry cache could lead to confusion and would only work for people using the txt registry.

It probably makes sense to also cache in the individual providers as they see fit. Subsequent calls to Records() could just serve cached data for a moment without hitting the API, and ApplyChanges could invalidate the cache in addition to a regular cache-expiry timeout.
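For illustration, a minimal sketch of such a provider-level cache, assuming a hypothetical cachedProvider wrapper with a TTL; none of these names exist in external-dns:

```go
package provider

import (
	"sync"
	"time"

	"github.com/kubernetes-incubator/external-dns/endpoint"
	"github.com/kubernetes-incubator/external-dns/plan"
)

// cachedProvider is a hypothetical wrapper: Records() serves cached data for
// a short TTL, and ApplyChanges invalidates the cache after writing.
type cachedProvider struct {
	Provider // the wrapped provider (e.g. the AWS provider)

	ttl       time.Duration
	mu        sync.Mutex
	records   []*endpoint.Endpoint
	refreshed time.Time
}

func (c *cachedProvider) Records() ([]*endpoint.Endpoint, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.records != nil && time.Since(c.refreshed) < c.ttl {
		return c.records, nil // serve from cache, no API call
	}
	records, err := c.Provider.Records()
	if err != nil {
		return nil, err
	}
	c.records, c.refreshed = records, time.Now()
	return records, nil
}

func (c *cachedProvider) ApplyChanges(changes *plan.Changes) error {
	err := c.Provider.ApplyChanges(changes)
	c.mu.Lock()
	c.records = nil // force a fresh listing on the next Records() call
	c.mu.Unlock()
	return err
}
```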

linki (Member) commented Jan 29, 2019

@nesl247 It's another bug where the planning logic thinks it needs to update the records when in fact no update is needed. It persists from interval to interval and results in a lot of updates, which then trigger the calls to AWS. We tried to fix it here, but I'm not sure we want to accept the PR in its current form.

There might be a simpler way to implement it. Alternatively, we could use the "old" way of detecting changes (basically reverting parts of #650) or drop the --aws-evaluate-target-health flag in favour of using only per-ingress annotations.

lbernail (Author) commented:

@linki: thanks a lot, this is definitely a lot better. I don't think it is enough for us, though, because of the size of our zones on AWS (4x10+s is very long compared to <1ms with the registry cache).

I agree there is a risk of confusion with the registry cache, but I also wonder whether having two different caches might make things complicated. Today the registry cache is very efficient: we only resync it once an hour, so calls to Records() take <1ms except when a full sync is performed.

linki (Member) commented Jan 30, 2019

@lbernail Thanks a lot for your input. I agree that having two caches can be confusing as well. If you think it makes sense to go through the TXT registry's cache, we're happy to review it.

For now, we intend to get #876 and #880 merged to mitigate the current problems.

tsuna commented Feb 5, 2019

Can we update the release notes to warn people about this performance regression in v0.5.10? I upgraded to v0.5.10 yesterday and wasted a bunch of time before concluding that the new version of external-dns was buggy and eventually finding this issue. It would be nice to save others some time by putting a big fat warning that Route53 will throttle them on v0.5.10 :P

Raffo (Contributor) commented Feb 6, 2019

@linki @njuettner What about modifying the release notes for v0.5.10 and/or publishing v0.5.9 as v0.5.11 to make sure it's safe for users? This problem has been around too long and has caused too many issues to ignore.

jhohertz (Contributor) commented Feb 7, 2019

I believe #880 and #886 together will address the problems of v0.5.10 for the majority of people.

Raffo pinned this issue Feb 8, 2019
szuecs (Contributor) commented Feb 8, 2019

@jhohertz thanks for the investigation. @linki is not available for the next three months, and @njuettner is back next week.
It would be good if @Raffo, @hjacobs, or @ideahitme could review and approve the code change. I am not on the approver list, but I can do a code review.

szuecs (Contributor) commented Feb 8, 2019

I did the code review and both seem OK to me.

Raffo (Contributor) commented Feb 8, 2019

I also reviewed and added /lgtm. We might need to rebase #886 once we merge #880, and then we can test the build for #886. If we don't see anything major, we can cut a release with the fixes.

szuecs (Contributor) commented Feb 8, 2019

Thanks @Raffo
