Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Route53: retry single changes in a batch if the batch fails #1209

Conversation

alfredkrohmer
Copy link
Contributor

@alfredkrohmer alfredkrohmer commented Sep 24, 2019

If a single change fails during the retry, it will be added to a queue.
In the next iteration, changes from this queue will be submitted after
all other changes.

When submitting single changes, they are always submitted as batches of
changes with the same DNS name to avoid inconsistency between the record
created and the TXT records.

This closes #421.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 24, 2019
@alfredkrohmer
Copy link
Contributor Author

/assign @njuettner

@alfredkrohmer
Copy link
Contributor Author

Example log:

time="2019-09-25T10:52:53Z" level=info msg="Desired change: CREATE ok.external-dns-test.example.com A [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:52:53Z" level=info msg="Desired change: CREATE .external-dns-test.example.com A [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:52:53Z" level=info msg="Desired change: CREATE ok.external-dns-test.example.com TXT [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:52:53Z" level=info msg="Desired change: CREATE .external-dns-test.example.com TXT [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:52:53Z" level=error msg="Failure in zone example.com. [Id: /hostedzone/Z2XXXXXXXXXXK] when submitting change batch"
time="2019-09-25T10:52:53Z" level=error msg="InvalidChangeBatch: [FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.external-dns-test.example.com', FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.external-dns-test.example.com']\n\tstatus code: 400, request id: 73ba2c80-45b7-438f-a191-f94386d85ed5"
time="2019-09-25T10:52:53Z" level=error msg="Trying to submit changes one-by-one instead"
time="2019-09-25T10:52:53Z" level=info msg="Desired change: CREATE ok.external-dns-test.example.com A [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:52:53Z" level=info msg="Desired change: CREATE ok.external-dns-test.example.com TXT [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:52:53Z" level=info msg="Change successful"
time="2019-09-25T10:52:53Z" level=info msg="Desired change: CREATE .external-dns-test.example.com A [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:52:53Z" level=info msg="Desired change: CREATE .external-dns-test.example.com TXT [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:52:53Z" level=error msg="Failed submitting change, it will be retried in a separate change batch in the next iteration"
time="2019-09-25T10:52:53Z" level=info msg="2 record(s) in zone example.com. [Id: /hostedzone/Z2XXXXXXXXXXK] were successfully updated"
time="2019-09-25T10:52:54Z" level=error msg="Failed to submit all changes for the following zones: [/hostedzone/Z2XXXXXXXXXXK]"

next iteration:

time="2019-09-25T10:54:53Z" level=info msg="Desired change: CREATE .external-dns-test.example.com A [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:54:53Z" level=info msg="Desired change: CREATE .external-dns-test.example.com TXT [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:54:53Z" level=error msg="Failure in zone example.com. [Id: /hostedzone/Z2XXXXXXXXXXK] when submitting change batch"
time="2019-09-25T10:54:53Z" level=error msg="InvalidChangeBatch: [FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.external-dns-test.example.com', FATAL problem: DomainLabelEmpty (Domain label is empty) encountered with '.external-dns-test.example.com']\n\tstatus code: 400, request id: e7dc3ae0-70e7-4090-9050-5b1a53df4782"
time="2019-09-25T10:54:53Z" level=error msg="Trying to submit changes one-by-one instead"
time="2019-09-25T10:54:53Z" level=info msg="Desired change: CREATE .external-dns-test.example.com A [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:54:53Z" level=info msg="Desired change: CREATE .external-dns-test.example.com TXT [Id: /hostedzone/Z2XXXXXXXXXXK]"
time="2019-09-25T10:54:53Z" level=error msg="Failed submitting change, it will be retried in a separate change batch in the next iteration"
time="2019-09-25T10:54:54Z" level=error msg="Failed to submit all changes for the following zones: [/hostedzone/Z2XXXXXXXXXXK]"

provider/aws.go Outdated
failedUpdate = true

if len(b) > 1 {
log.Error("Trying to submit changes one-by-one instead")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks more like a debug log to me, or delete it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm are you sure about this? If there is a transaction failing, in my opinion this should output more verbose logs than debug because it needs a manual intervention to fix it.

provider/aws.go Outdated
}
if _, err := p.client.ChangeResourceRecordSets(params); err != nil {
failedUpdate = true
log.Error("Failed submitting change, it will be retried in a separate change batch in the next iteration")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks more like a debug log to me, or delete it

provider/aws.go Outdated
p.failedChangesQueue[z] = append(p.failedChangesQueue[z], groupedChanges...)
}
} else {
log.Info("Change successful")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks more like a debug log to me, or delete it

@szuecs
Copy link
Contributor

szuecs commented Sep 28, 2019

@devkid Thanks for the PR only small comments about the logs from my side.

@linki
Copy link
Member

linki commented Sep 30, 2019

@devkid Nice idea to group the smaller changes into transactions with the same hostname. Could you double-check if that also works with the --txt-prefix feature?

In order to allow TXT records for CNAMEs which cannot co-exist with the same hostname we map between them using a user provided prefix on the TXT record (See here). In that configuration the CNAME and its corresponding TXT don't have the same hostname but rather differ by the fixed prefix.

I think the transaction grouping needs to take this into account in order to have the corresponding records in the same transaction.

@alfredkrohmer
Copy link
Contributor Author

@linki Good catch. Would it make sense to attach some kind of meta information to the TXT records (like: "I belong to this record")? Or is there even some information already?

@alfredkrohmer
Copy link
Contributor Author

@linki The existing batchChangeSet function also doesn't take the TXT prefix into consideration (which I suppose is quite bad as this can bring things out of sync if a TXT prefix is used). Also, as far as I can tell from looking at the code, there is currently no meta information or so added to the TXT records that associates them with the records they are "managing".

What would you suggest going forward? Shall we merge this pull request first and tackle the "group batches batches so that TXT records are changed in the same transaction" afterwards or should I come up with a solution in this pull request?

For a solution I though about two options:

  1. Add a BelongsTo (or similar named) fielt to the Endpoint struct that can be set to point to another record in the registry implementation (via pointer or name as string).
  2. Add an item to the Labels map of the TXT record itself with a well-known key that contains the name of the record it's managing. As far as I can tell those labels are not persisted anywhere, so this should work as a temporary and transparent storage to pass information from the registry implementation to the provider implementation.
  3. Try to get the name of the managed record in the provider implementation by calling the toEndpointName function from the TXT registry. I think this would kind-of break the interface between the registry and the provider.

With any of these options we could improve the grouping in the batchChangeSet function and then also replace my newly added snipped (which does the grouping when submitting single changes) with a call to this function.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 22, 2020
@szuecs
Copy link
Contributor

szuecs commented Feb 22, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2020
@szuecs
Copy link
Contributor

szuecs commented Feb 22, 2020

@devkid do you mind rebasing, please?
@linki @njuettner or @Raffo will review.

@alfredkrohmer alfredkrohmer force-pushed the feature/retry-single-changes-on-batch-failure branch from bd0a4ab to ec4233b Compare February 25, 2020 21:44
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 25, 2020
@alfredkrohmer alfredkrohmer force-pushed the feature/retry-single-changes-on-batch-failure branch from ec4233b to 09e8994 Compare February 25, 2020 21:53
@alfredkrohmer
Copy link
Contributor Author

@szuecs done.

@fejta-bot
Copy link

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 25, 2020
@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 3, 2023
@szuecs
Copy link
Contributor

szuecs commented Jan 3, 2023

/assign @njuettner

provider/aws/aws.go Outdated Show resolved Hide resolved
provider/aws/aws.go Outdated Show resolved Hide resolved
provider/aws/aws.go Outdated Show resolved Hide resolved
provider/aws/aws_test.go Outdated Show resolved Hide resolved
registry/txt.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 6, 2023
@Raffo
Copy link
Contributor

Raffo commented Jan 11, 2023

@alfredkrohmer would you mind addressing @linki's comments. We would love to get this finally merged as it would be addressing some longstanding issues with the AWS provider.

If a single change fails during the retry, it will be added to a queue.
In the next iteration, changes from this queue will be submitted after
all other changes.

When submitting single changes, they are always submitted as batches of
changes with the same DNS name and ownership relation to avoid
inconsistency between the record created and the TXT records.
@alfredkrohmer alfredkrohmer force-pushed the feature/retry-single-changes-on-batch-failure branch from a295d1e to 7dd84a5 Compare January 16, 2023 15:06
@alfredkrohmer
Copy link
Contributor Author

Rebased and split into two commits for better reviewability. Only open discussion is around Labels vs ProviderSpecific (see my comment above in the thread).

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 16, 2023
@linki
Copy link
Member

linki commented Jan 17, 2023

@Raffo @szuecs @njuettner LGTM

@alfredkrohmer Thanks a lot for persistently working on this! 🙏 It will solve a long-standing issue with ExternalDNS on AWS.

@linki
Copy link
Member

linki commented Jan 17, 2023

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 17, 2023
@k8s-ci-robot k8s-ci-robot merged commit a18bf2b into kubernetes-sigs:master Jan 17, 2023
@linki
Copy link
Member

linki commented Jan 18, 2023

@Raffo Would you mind cutting a new release with this included?

@Raffo
Copy link
Contributor

Raffo commented Jan 18, 2023

@linki i did a release last weekend. I'm totally up to do a release, but maybe let's aim for the beginning of February so that we have a few more changes. Wdyt?

@linki
Copy link
Member

linki commented Jan 20, 2023

That's OK.

The sooner the better for me. I tested this fix through our pipeline but I would like to run the official version going forward: zalando-incubator/kubernetes-on-aws#5604

@Raffo
Copy link
Contributor

Raffo commented Jan 20, 2023

@linki sounds good. I'll try to get some more fixes in and then get a release cooking at the beginning of February.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

external-dns stops updating route53 on error
10 participants