
Performance regression with the AWS provider in 0.5.9 #869

Closed · lbernail opened this issue Jan 27, 2019 · 17 comments

lbernail commented Jan 27, 2019

Hello

We tried to update our deployment to 0.5.9, but we noticed a significant performance regression due to #742.
After this PR, Records() is called for each change instead of once per plan (https://github.com/kubernetes-incubator/external-dns/blob/v0.5.9/provider/aws.go#L367), and each call retrieves all records for the zone.

A simple fix could be to store the result of the Records() call in the AWSProvider. I can open a PR for this if you think it makes sense.

In addition, the call to ListResourceRecordSetsPages in the Records() function is paginated, but the AWS Go SDK does not take rate limits into account. My understanding is that each page holds 100 records, which means that on large zones we can quickly hit the rate limit of 5 requests per second (see https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/DNSLimitations.html#limits-api-requests).

We could retry and back off on rate-limit errors, but given how low the limit is and how simple the call pattern is, we could also simply sleep for 200ms (or 250ms to be safe, or even make it configurable) at the end of the callback.
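For illustration, a minimal sketch of that sleep-in-the-callback approach, assuming the aws-sdk-go v1 Route 53 pager; the helper name and the fixed 200ms pause are illustrative, not code that ships in external-dns:

```go
package provider

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/route53"
	"github.com/aws/aws-sdk-go/service/route53/route53iface"
)

// listRecordsThrottled is a hypothetical helper: it pages through a hosted
// zone and pauses between pages to stay under Route 53's ~5 req/s API limit.
func listRecordsThrottled(client route53iface.Route53API, zoneID string) ([]*route53.ResourceRecordSet, error) {
	var records []*route53.ResourceRecordSet
	input := &route53.ListResourceRecordSetsInput{
		HostedZoneId: aws.String(zoneID),
	}
	err := client.ListResourceRecordSetsPages(input,
		func(resp *route53.ListResourceRecordSetsOutput, lastPage bool) bool {
			records = append(records, resp.ResourceRecordSets...)
			if !lastPage {
				// Each page returns up to 100 records; sleeping ~200ms keeps
				// us at or below 5 requests per second for this zone.
				time.Sleep(200 * time.Millisecond)
			}
			return true // keep paging
		})
	return records, err
}
```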

Laurent

lbernail changed the title from "Performance regression on AWS in 0.5.9" to "Performance regression with the AWS provider in 0.5.9" on Jan 27, 2019
linki (Member) commented Jan 28, 2019

Hi @lbernail,

Thanks for your report. That looks like a serious problem.

For a quick fix, you could also try whether using the registry cache helps.

Caching in the provider would be fine, but it would add another layer of caching, which could become very confusing if done wrong. We could also think about fixing the code from #742 to avoid making a call to AWS.

linki (Member) commented Jan 28, 2019

From looking at the code it seems that using the txt-registry cache doesn't help.

Regarding a fix: it looks like newChange needs the list of current records in order to do its job. We could query for the list of records outside of newChange and just pass it in, which would reduce the number of requests to one per cycle (or rather, one more compared to v0.5.8).

We could add this here and pass the list of records to each call of newChanges.
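For illustration, a hedged sketch of that idea against provider/aws.go; the extra "current" parameter is hypothetical and the signatures are simplified, so this is the shape of the change rather than the actual external-dns code:

```go
// Sketch only: list the zone once per reconcile cycle and thread that slice
// into every newChanges call, instead of letting each change trigger its own
// Records() lookup. AWSProvider, newChanges and submitChanges are assumed to
// come from provider/aws.go; the signatures here are simplified.
func (p *AWSProvider) ApplyChanges(changes *plan.Changes) error {
	current, err := p.Records() // a single Route 53 listing per cycle
	if err != nil {
		return err
	}

	combined := p.newChanges(route53.ChangeActionCreate, changes.Create, current)
	combined = append(combined, p.newChanges(route53.ChangeActionUpsert, changes.UpdateNew, current)...)
	combined = append(combined, p.newChanges(route53.ChangeActionDelete, changes.Delete, current)...)

	return p.submitChanges(combined)
}
```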

lbernail (Author) commented:

Yes, the cache won't work because we are calling the provider's Records method and not the one from the registry.

Doing two calls would be a lot better, but the second one still feels unnecessary. In our case we have zones with thousands of records, and we run a slightly modified version that sleeps in the ListResourceRecordSetsPages callback to avoid triggering AWS rate limits; a full zone sync takes 10+ seconds.

linki (Member) commented Jan 28, 2019

It's not ideal but much better than the current behaviour and fairly safe to implement. We could improve it with proper caching later. What do you think?

lbernail (Author) commented:

Definitely a lot better

Another idea: what about passing the recordsCache to the provider's ApplyChanges call?
https://github.com/kubernetes-incubator/external-dns/blob/master/registry/txt.go#L174

We could just ignore it in the other providers (I assume some could make use of it too).
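For illustration, a hypothetical shape of that interface change; this is not the actual external-dns Provider interface, just a sketch of the idea:

```go
package provider

import (
	"github.com/kubernetes-incubator/external-dns/endpoint"
	"github.com/kubernetes-incubator/external-dns/plan"
)

// Hypothetical variant of the Provider interface: the registry passes its
// cached endpoints down so a provider can apply changes without re-listing
// the zone. Providers that don't need the hint could simply ignore it.
type Provider interface {
	Records() ([]*endpoint.Endpoint, error)
	ApplyChanges(changes *plan.Changes, cachedRecords []*endpoint.Endpoint) error
}
```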

nesl247 commented Jan 28, 2019

I think this is affecting us as well. However, we didn't experience this until 0.5.10: upon upgrading, ALL domains in the cluster were updated. I saw this happen in both our development and staging clusters, but we didn't experience any issues until I deployed this in staging. Now both clusters are complaining with "getting records failed: Throttling: Rate exceeded".

I'm not sure why any of the domains needed to be updated anyway, given there were no changes other than upgrading from 0.5.9 to 0.5.10. In 0.5.9 we only really saw "All records are already up to date", as nothing was changed.

Update: downgrading to 0.5.9 resolves the issue I am experiencing. Downgrading returned us to the "all records are already up to date" message instead of it trying to update every record.

linki (Member) commented Jan 29, 2019

@lbernail I worked on a rather quick fix to reduce the number of requests from O(#records) to O(#zones) which should already help: #876.

Using the txt registry cache could lead to confusion and would only work for people using the txt registry.

It probably makes sense to also cache in the individual providers as they see fit. Subsequent calls to Records() could just serve cached data for a moment without hitting the API, and ApplyChanges could invalidate the cache in addition to a regular cache-expiry timeout.
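For illustration, a minimal sketch of such a provider-level cache, assuming a hypothetical cachedProvider wrapper with a TTL; none of these names exist in external-dns:

```go
package provider

import (
	"sync"
	"time"

	"github.com/kubernetes-incubator/external-dns/endpoint"
	"github.com/kubernetes-incubator/external-dns/plan"
)

// cachedProvider is a hypothetical wrapper: Records() serves cached data for
// a short TTL, and ApplyChanges invalidates the cache after writing.
type cachedProvider struct {
	Provider // the wrapped provider (e.g. the AWS provider)

	ttl       time.Duration
	mu        sync.Mutex
	records   []*endpoint.Endpoint
	refreshed time.Time
}

func (c *cachedProvider) Records() ([]*endpoint.Endpoint, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.records != nil && time.Since(c.refreshed) < c.ttl {
		return c.records, nil // serve from cache, no API call
	}
	records, err := c.Provider.Records()
	if err != nil {
		return nil, err
	}
	c.records, c.refreshed = records, time.Now()
	return records, nil
}

func (c *cachedProvider) ApplyChanges(changes *plan.Changes) error {
	err := c.Provider.ApplyChanges(changes)
	c.mu.Lock()
	c.records = nil // force a fresh listing on the next Records() call
	c.mu.Unlock()
	return err
}
```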

linki (Member) commented Jan 29, 2019

@nesl247 It's another bug where the planning logic thinks it needs to update the records when in fact no update is needed. It persists from interval to interval and results in a lot of updates, which then trigger the calls to AWS. We tried to fix it here, but I'm not sure we want to accept the PR in its current form.

There might be a simpler way to implement it. Alternatively, we could use the "old" way of detecting changes (basically reverting parts of #650) or drop the --aws-evaluate-target-health flag in favour of using only per-ingress annotations.

lbernail (Author) commented:

@linki: thanks a lot, this is definitely a lot better. I don't think it is enough for us, though, because of the size of our zones on AWS (4x10+s is very long compared to <1ms with the registry cache).

I agree there is a risk of confusion with the registry cache, but I also wonder whether having two different caches might make things complicated. Today the registry cache is very efficient: we only resync it once an hour, so calls to Records() take <1ms except when a full sync is performed.

linki (Member) commented Jan 30, 2019

@lbernail Thanks a lot for your input. I agree that having two caches can be confusing as well. If you think it makes sense to go through the TXT registry's cache, we're happy to review it.

For now, we intend to get #876 and #880 merged to mitigate the current problems.

tsuna commented Feb 5, 2019

Can we update the release notes to warn people about this performance regression in v0.5.10? I upgraded to v0.5.10 yesterday and wasted a bunch of time before concluding that the new version of external-dns was buggy and eventually finding this issue. It would be nice to save others some time by putting a big fat warning that Route53 will throttle them on v0.5.10 :P

Raffo (Contributor) commented Feb 6, 2019

@linki @njuettner What about modifying the release notes for v0.5.10 and/or publishing v0.5.9 as v0.5.11 to make sure it's safe for users? This problem has been around too long and has caused too many issues to ignore.

jhohertz (Contributor) commented Feb 7, 2019

I believe #880 and #886 together will address the problems of v0.5.10 for the majority of people.

Raffo pinned this issue Feb 8, 2019
szuecs (Contributor) commented Feb 8, 2019

@jhohertz thanks for the investigation. @linki is not available for the next three months, and @njuettner is back next week.
It would be good if @Raffo, @hjacobs, or @ideahitme could review and approve the code change. I am not on the approver list, but I can do a code review.

szuecs (Contributor) commented Feb 8, 2019

I did the code review and both seem OK to me.

Raffo (Contributor) commented Feb 8, 2019

I also reviewed and added /lgtm. We might need to rebase #886 once we merge #880, and then we can test the build for #886. If we don't see anything major, we can cut a release with the fixes.

szuecs (Contributor) commented Feb 8, 2019

Thanks @Raffo
