[WIP] - Smart bisect faild change batch #1641

Conversation

OmerKahani
Contributor

@OmerKahani OmerKahani commented Jun 22, 2020

fixes #1517
fixes #731

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jun 22, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: OmerKahani
To complete the pull request process, please assign hjacobs
You can assign the PR to them by writing /assign @hjacobs in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 22, 2020
@OmerKahani OmerKahani changed the title WIP - Smart bisect faild change batch [WIP] - Smart bisect faild change batch Jun 23, 2020
@OmerKahani
Contributor Author

OmerKahani commented Jun 23, 2020

Hi, I started working on this but ran into a problem:
what should we do with TXT records?

A TXT record should be in the same batch as the record it belongs to. Batches are grouped by DNS name, but a TXT record can have a prefix or suffix, so its name differs from that of the A/CNAME record.

So far I have come up with two solutions:

  1. Add a label to the TXT record with the original name, and change the code in the AWS provider so that changes are split by name from the moment they are created (change the newChanges function). This would result in a big change to the code.

  2. Pass the provider the mapper function toEndPointName and use it when splitting into batches. This would result in a change to the provider API.

I also think this is already an issue today.

Both solutions are big and will make the PR even bigger 😞, so I would be happy to hear your input before I solve it.
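
For illustration, here is a minimal sketch of the idea behind solution 2: group changes by the endpoint name a mapper returns, so a prefixed TXT record lands in the same group as its A/CNAME record. The `change` type, `prefixMapper`, and `groupByEndpoint` are hypothetical names invented for this sketch and are not the provider's actual types or API; it also assumes the TXT name is simply the prefix followed by the record name.

```go
package main

import (
	"fmt"
	"strings"
)

// change is a hypothetical, simplified stand-in for a provider change.
type change struct {
	Name string // DNS name, e.g. "txt-a.example.org"
	Type string // record type, e.g. "A", "CNAME", "TXT"
}

// nameMapper maps a change back to the endpoint name it belongs to.
type nameMapper func(c change) string

// prefixMapper returns a mapper that strips txtPrefix from TXT record names
// and leaves all other record names untouched (an assumed naming scheme).
func prefixMapper(txtPrefix string) nameMapper {
	return func(c change) string {
		if c.Type == "TXT" {
			return strings.TrimPrefix(c.Name, txtPrefix)
		}
		return c.Name
	}
}

// groupByEndpoint groups changes by their mapped name, preserving the order
// in which each endpoint name was first seen.
func groupByEndpoint(changes []change, toName nameMapper) [][]change {
	index := map[string]int{}
	var groups [][]change
	for _, c := range changes {
		key := toName(c)
		i, ok := index[key]
		if !ok {
			i = len(groups)
			index[key] = i
			groups = append(groups, nil)
		}
		groups[i] = append(groups[i], c)
	}
	return groups
}

func main() {
	changes := []change{
		{"txt-a.example.org", "TXT"},
		{"txt-b.example.org", "TXT"},
		{"a.example.org", "A"},
		{"b.example.org", "CNAME"},
	}
	// Each printed group holds a record together with its TXT record.
	for _, g := range groupByEndpoint(changes, prefixMapper("txt-")) {
		fmt.Println(g)
	}
}
```

Batches could then be built from whole groups rather than from individual changes, so a record and its TXT record are never split across batches.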

@OmerKahani
Contributor Author

Tested on master:
with a batch size of 2 and 4 changes, the first batch has only the TXT record and the second batch has the A/CNAME records.

Maybe it would be better to solve this bug before merging this PR?

time.Sleep(p.batchChangeInterval)
}
for i, b := range batchCs {
if p.ChangeZone(ctx, b, z, zoneName) {
Member

Let's take the opportunity to invert the return value here (or return an error). It's unexpected that p.ChangeZone() returning true means that an error happened.
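
A rough sketch of the error-returning variant suggested above. The signature is simplified and hypothetical (the real method also takes a context and zone ID), and `submitFunc` is an invented stand-in for the call that submits one batch to the DNS API.

```go
package main

import (
	"errors"
	"fmt"
)

// submitFunc is a hypothetical stand-in for the call that submits one batch.
type submitFunc func(batch []string) error

// changeZone returns nil on success and a wrapped error on failure, so the
// caller no longer has to remember that "true" meant "failed".
func changeZone(zone string, batch []string, submit submitFunc) error {
	if err := submit(batch); err != nil {
		return fmt.Errorf("applying %d changes to zone %q: %w", len(batch), zone, err)
	}
	return nil
}

// applyBatches applies every batch and returns the batches that failed.
func applyBatches(zone string, batches [][]string, submit submitFunc) (failed [][]string) {
	for _, b := range batches {
		if err := changeZone(zone, b, submit); err != nil {
			failed = append(failed, b)
		}
	}
	return failed
}

var errInvalid = errors.New("invalid change batch")

func main() {
	// Fake submitter that rejects any batch containing one specific name.
	submit := func(batch []string) error {
		for _, name := range batch {
			if name == "bad.example.org" {
				return errInvalid
			}
		}
		return nil
	}
	batches := [][]string{{"a.example.org"}, {"bad.example.org", "b.example.org"}}
	failed := applyBatches("example.org", batches, submit)
	fmt.Println(len(failed), "batch(es) failed")
}
```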

}

size := len(names) / 2
Member

We might want to ensure the minimum size is 2, because the main record and its ownership TXT record should be applied in the same transaction so that they are either both applied or not applied at all.

Member

For safety, I guess we should also ensure that the length of names is even. Otherwise this logic might become incorrect.
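
One way the two constraints above (minimum size of 2, even length) could look in code. This is only a sketch under a strong assumption that is not confirmed by the PR: that each record is immediately followed by its ownership TXT record in the slice. `splitEven` is a hypothetical helper name.

```go
package main

import "fmt"

// splitEven splits items into two parts at an even index, so that adjacent
// (record, ownership TXT) pairs are never separated. ok is false when the
// slice has odd length or holds a single pair and cannot be split further.
func splitEven(items []string) (left, right []string, ok bool) {
	n := len(items)
	if n%2 != 0 || n <= 2 {
		return nil, nil, false
	}
	mid := n / 2
	if mid%2 != 0 {
		mid++ // move the boundary so it falls between pairs
	}
	return items[:mid], items[mid:], true
}

func main() {
	names := []string{
		"a.example.org", "txt-a.example.org",
		"b.example.org", "txt-b.example.org",
		"c.example.org", "txt-c.example.org",
	}
	left, right, ok := splitEven(names)
	fmt.Println(ok, len(left), len(right))
}
```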

time.Sleep(p.batchChangeInterval)
}
for i, b := range batchCs {
if p.ChangeZone(ctx, b, z, zoneName) {
Member

We could introduce a feature flag here that can be used to toggle between the old approach and the better bisect handling. I think it's fairly easy to do and it would greatly help with getting this merged.

@linki
Member

linki commented Jul 1, 2020

There's an alternative approach worth trying that solves the TXT record problem, enables this feature for all providers and might be even easier to do.

You could implement this as part of the planning phase here. The Plan holds the current and desired records and calculates the raw diff of creates, updates and deletes needed to turn the current into the desired records. After the diff is calculated the changes are post-processed by different policies. We use that for instance to strip out all deletions when running in upsert-only mode.

You could very easily write and test a "DoHalf" policy that strips out half the endpoints from all creates, updates and deletes. Then you could use that policy to reduce the change set in case ApplyChanges fails:

        for {
          err = c.Registry.ApplyChanges(ctx, plan.Changes)
          if err == nil {
            break // needs some better logic to avoid infinite loops.
          }
          plan.Changes = DoHalf{}.Apply(plan.Changes) // strips out some of the desired changes, e.g. half of everything.
        }

Note that this happens before any of the TXT record handling and works for all providers. It should even be fine to continue once one batch has been processed successfully: the reconciling nature of ExternalDNS will pick up the remaining records in the next loop. That might avoid the complicated logic of keeping track of the partitions or recursing. However, with so many providers down the processing line, I suspect there are some that won't play nicely with this DoHalf policy.

This is just an idea how it could also be solved. Feel free to continue with the current approach as well.

@linki
Member

linki commented Jul 1, 2020

@OmerKahani About the bug you see, where one batch contains only the TXT records: does that happen only when you use txt-prefix, or even without a prefix?

@OmerKahani
Contributor Author

OmerKahani commented Jul 2, 2020

Do you think we can push this kind of change (moving the bisect into the planning phase)? It feels like changing this code would result in a much bigger change, not in code but in concept. It is a better approach; I just want to make sure I will be able to merge the code :)

@seanmalloy
Member

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 17, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2020
@k8s-ci-robot
Contributor

@OmerKahani: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 15, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 15, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closed this PR.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
5 participants