
One invalid record in ChangeBatch stops all others from updating #1517

Closed
richstokes opened this issue Apr 13, 2020 · 37 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/backlog Higher priority than priority/awaiting-more-evidence.

Comments

@richstokes

richstokes commented Apr 13, 2020

What happened:

time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo.example.io A [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo-us-west-1.example.io A [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo-host-il.example.io A [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo.example.io TXT [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo-us-west-1.example.io TXT [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=info msg="Desired change: CREATE demo-host-il.example.io TXT [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=error msg="Failure in zone example.io. [Id: /hostedzone/ZVEABCZXYZ123]"
time="2020-04-13T22:23:40Z" level=error msg="InvalidChangeBatch: [RRSet of type A with DNS name demo.example.io. is not permitted because a conflicting RRSet of type  CNAME with the same DNS name already exists in zone example.io., RRSet of type TXT with DNS name demo.example.io. is not permitted because a conflicting RRSet of type  CNAME with the same DNS name already exists in zone example.io.]\n\tstatus code: 400, request id: ca31ed28-2fef-4429-b769-ae04d297da51"
time="2020-04-13T22:23:40Z" level=error msg="Failed to submit all changes for the following zones: [/hostedzone/ZVEABCZXYZ123]"

What you expected to happen:
Ignore the invalid record and process the others.

The use case here is that the record demo.example.io is created outside of the K8s cluster, but the K8s ingress still needs to handle traffic for this host, since the CNAME is set up to fail over between two K8s clusters.

In previous versions of external-dns (<= v0.5.17) everything worked, since it simply ignored any records that already existed. Now it batches the changes and fails all of them, even when only one of the records is "invalid".
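
For context, a Route53 ChangeBatch is atomic: if the API rejects any change in the batch, none of the changes in that batch are applied. A minimal sketch of the underlying aws-sdk-go call (the zone ID and record name are taken from the log above; the record value and TTL are made up for illustration, and this is not the actual external-dns code):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/route53"
)

func main() {
	svc := route53.New(session.Must(session.NewSession()))

	// All desired changes travel in one ChangeBatch. Route53 validates the
	// whole batch and rejects it entirely (InvalidChangeBatch) if any single
	// RRSet conflicts, e.g. an A record colliding with an existing CNAME.
	_, err := svc.ChangeResourceRecordSets(&route53.ChangeResourceRecordSetsInput{
		HostedZoneId: aws.String("/hostedzone/ZVEABCZXYZ123"),
		ChangeBatch: &route53.ChangeBatch{
			Changes: []*route53.Change{
				{
					Action: aws.String(route53.ChangeActionCreate),
					ResourceRecordSet: &route53.ResourceRecordSet{
						Name: aws.String("demo.example.io"),
						Type: aws.String(route53.RRTypeA),
						TTL:  aws.Int64(300), // placeholder TTL
						ResourceRecords: []*route53.ResourceRecord{
							{Value: aws.String("192.0.2.10")}, // placeholder value
						},
					},
				},
				// ...the other desired changes from the same sync loop...
			},
		},
	})
	if err != nil {
		fmt.Println("whole batch rejected:", err) // one bad RRSet fails them all
	}
}
```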

Perhaps we need an "ignore" configuration option that would tell external-dns to continue on failure of N records instead of trying to do bulk, atomic submissions?

Environment:
external-dns: v0.7.1
K8s v1.16.2

@richstokes richstokes added the kind/bug Categorizes issue or PR as related to a bug. label Apr 13, 2020
@richstokes richstokes changed the title from "One invalid record in ChangeBatch stops all others from updatng" to "One invalid record in ChangeBatch stops all others from updating" Apr 13, 2020
@szuecs
Contributor

szuecs commented Apr 14, 2020

Also a similar problem: #731

@szuecs
Contributor

szuecs commented Apr 14, 2020

I think as a workaround you could try batch size 1 or 2

@richstokes
Author

Thanks @szuecs - adding --aws-batch-change-size=2 as an arg works. Am I missing something or is there no documentation for this?

@szuecs
Contributor

szuecs commented Apr 14, 2020

It’s not really meant to fix your issue.
It’s meant to reduce the bytes sent, because of AWS API limits.


@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2020
@seanmalloy
Member

Can someone confirm this is still an issue with the latest release (v0.7.3)?

@Blanko2

Blanko2 commented Oct 1, 2020

/remove-lifecycle stale
This seems to still be pending, at least that's what the changelogs show.

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 1, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 4, 2021
@szuecs
Contributor

szuecs commented Jan 11, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2021
@funkypenguin

I can confirm this is still an issue on 0.7.6:

time="2021-03-30T09:40:38Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=info msg="Desired change: CREATE lorem.elpenguino.net CNAME [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=info msg="Desired change: CREATE loremipsumdolorsitametconsecteturadipiscingelitloremipsumdolorsitametconsecteturadipiscingelit.elpenguino.net CNAME [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=info msg="Desired change: CREATE txtlorem.elpenguino.net TXT [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=info msg="Desired change: CREATE txtloremipsumdolorsitametconsecteturadipiscingelitloremipsumdolorsitametconsecteturadipiscingelit.elpenguino.net TXT [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=error msg="Failure in zone elpenguino.net. [Id: /hostedzone/Z00-redacted--VC]"
time="2021-03-30T09:41:39Z" level=error msg="InvalidChangeBatch: [FATAL problem: DomainLabelTooLong (Domain label is too long) encountered with 'loremipsumdolorsitametconsecteturadipiscingelitloremipsumdolorsi', FATAL problem: DomainLabelTooLong (Domain label is too long) encountered with 'txtloremipsumdolorsitametconsecteturadipiscingelitloremipsumdolo']\n\tstatus code: 400, request id: 065e58ea-8383-4987-8b99-ff2f76634d8c"
time="2021-03-30T09:41:39Z" level=error msg="failed to submit all changes for the following zones: [/hostedzone/Z00-redacted--VC]"

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 28, 2021
@szuecs
Contributor

szuecs commented Jun 29, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@szuecs
Contributor

szuecs commented Sep 27, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 27, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 29, 2022
@szuecs
Contributor

szuecs commented Jun 30, 2022

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 30, 2022
@allamand

Do we have any input on this issue?

@szuecs
Contributor

szuecs commented Sep 14, 2022

I will work on that when I have time to fix it. 😀
In general I would be happy to review a PR here, if it uses binary search to figure out the bad records: always split the batch in two, one half hopefully works and the other not, then follow up with the failing half, split it again and retry. We need to be careful with AWS API limits.
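
For illustration, a rough sketch of that binary-search idea, with a placeholder Change type and submit function (this is not the actual external-dns provider code):

```go
package provider

// Change is a placeholder for a single DNS change (e.g. CREATE of one record);
// in external-dns this would be the provider's own change type.
type Change struct {
	Action, Name, Type, Value string
}

// submitWithSplit submits a batch; on failure it recursively splits the batch
// in half so that valid changes still land and a single bad change is
// eventually isolated. It returns the changes that could not be applied.
// Every split costs extra API calls, so the caller must respect AWS limits.
func submitWithSplit(submit func([]Change) error, changes []Change) []Change {
	if len(changes) == 0 {
		return nil
	}
	if err := submit(changes); err == nil {
		return nil
	}
	if len(changes) == 1 {
		return changes // isolated the bad record; skip it and keep going
	}
	mid := len(changes) / 2
	failed := submitWithSplit(submit, changes[:mid])
	return append(failed, submitWithSplit(submit, changes[mid:])...)
}
```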

@jegeland

jegeland commented Sep 17, 2022

This should have been resolved with the contribution @knackaron and I (then knackjeff) made back in 2021 with #2127, included in 0.9.0. Changing the batch size back to 1 reverts to the old-style behavior and simply submits the DNS change requests one at a time, rather than limiting the "number of bytes sent" as @szuecs stated above. We have tested this in production environments and it solved this issue for us.

@szuecs
Contributor

szuecs commented Sep 17, 2022

@jegeland I don't think it's a great solution, but it works for us as well. I meant that reducing the batch size is not an optimal solution because the user has to calculate the max bytes herself and judge how big the average DNS record might be. So it's not great for the user, and it does not reduce API calls to cloud providers. Having had too many issues with the number of API calls to AWS in the past, with several incidents, I want to fix it when I have a bit more time to invest in coding and testing it.

@dudicoco

@jegeland how is that different from the workaround suggested in #1517 (comment)?

I must stress that this is a workaround and does not resolve the issue.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2022
@paritoshparmar14

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 20, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 20, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 19, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) May 19, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@szuecs
Contributor

szuecs commented May 19, 2023

/reopen
/priority backlog

@k8s-ci-robot k8s-ci-robot added the priority/backlog Higher priority than priority/awaiting-more-evidence. label May 19, 2023
@k8s-ci-robot k8s-ci-robot reopened this May 19, 2023
@k8s-ci-robot
Contributor

@szuecs: Reopened this issue.

In response to this:

/reopen
/priority backlog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@szuecs szuecs removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label May 19, 2023
@jukie

jukie commented May 25, 2023

Isn't this resolved in later versions now? Failed records are lumped into their own batch and retried.

@szuecs
Contributor

szuecs commented Jun 26, 2023

@jukie yeah that's true, but not completely. IIRC it will split into two chunks, one will be applied and the other not, and in the next iteration it will do the same, so it fixes itself after some time. Maybe we can do better than that, maybe it's fine.
Let's close for now and see if we get issues suggesting we need to do better here.
/close

@k8s-ci-robot
Contributor

@szuecs: Closing this issue.

In response to this:

@jukie yeah that's true, but not completely. IIRC it will split into two chunks, one will be applied and the other not, and in the next iteration it will do the same, so it fixes itself after some time. Maybe we can do better than that, maybe it's fine.
Let's close for now and see if we get issues suggesting we need to do better here.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
