[WIP] - Smart bisect faild change batch #1641

OmerKahani · 2020-06-22T13:15:44Z

k8s-ci-robot · 2020-06-22T13:15:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: OmerKahani
To complete the pull request process, please assign hjacobs
You can assign the PR to them by writing /assign @hjacobs in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

OmerKahani · 2020-06-23T15:36:14Z

Hi, I started working on this but ran into a problem:
what should we do with TXT records?

A TXT record should be in the same batch as the record it belongs to, and batches are grouped by DNS name. The problem is that a TXT record can have a prefix or a suffix, so its name differs from the name of the A/CNAME record.

So far I came up with two solutions:

1. Add a label to the TXT record with the original name, and change the code in the AWS provider to split the changes by name from the moment they are created (change the newChanges function). This will result in a big change in the code.
2. Pass the provider the mapper function toEndPointName and use it when splitting into batches. This will result in a change in the provider API.

I also think this is an issue today.

Both solutions are big and will make the PR even bigger 😞, so I would be happy to hear your input before I start solving it.
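The second option (group changes by the endpoint they belong to rather than by their literal DNS name) could look roughly like the sketch below. Everything in it is hypothetical: the Change type, ownerName, and groupByOwner are stand-ins, not the PR's code, and the sketch assumes the prefix is prepended to the whole name and the suffix is appended to the first label, which may not match the real mapper.

```go
package main

import (
	"fmt"
	"strings"
)

// Change is a simplified, hypothetical stand-in for a provider change.
type Change struct {
	Name string // DNS name of the record
	Type string // "A", "CNAME", "TXT", ...
}

// ownerName returns the name of the endpoint a change belongs to.
// Assumption: TXT names are built as prefix + name, with the suffix
// appended to the first label. Non-TXT records map to themselves.
func ownerName(c Change, prefix, suffix string) string {
	if c.Type != "TXT" {
		return c.Name
	}
	name := strings.TrimPrefix(c.Name, prefix)
	parts := strings.SplitN(name, ".", 2)
	parts[0] = strings.TrimSuffix(parts[0], suffix)
	return strings.Join(parts, ".")
}

// groupByOwner groups changes so that a record and its TXT record end up
// in the same group. Groups keep the order of first appearance.
func groupByOwner(changes []Change, prefix, suffix string) [][]Change {
	index := map[string]int{}
	var groups [][]Change
	for _, c := range changes {
		key := ownerName(c, prefix, suffix)
		i, ok := index[key]
		if !ok {
			i = len(groups)
			index[key] = i
			groups = append(groups, nil)
		}
		groups[i] = append(groups[i], c)
	}
	return groups
}

func main() {
	changes := []Change{
		{Name: "txt-a.example.com", Type: "TXT"},
		{Name: "txt-b.example.com", Type: "TXT"},
		{Name: "a.example.com", Type: "A"},
		{Name: "b.example.com", Type: "CNAME"},
	}
	for _, g := range groupByOwner(changes, "txt-", "") {
		fmt.Println(g)
	}
}
```

Batches would then be filled group by group, so a group is never split across two batches.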

OmerKahani · 2020-06-24T17:08:49Z

Tested on master:
using a batch size of 2 and 4 changes, the first batch contains only the TXT records and the second batch contains the A/CNAME records.

Maybe it would be better to fix this bug before merging this PR?

linki · 2020-07-01T15:10:20Z

provider/aws/aws.go

-					time.Sleep(p.batchChangeInterval)
-				}
+	for i, b := range batchCs {
+		if p.ChangeZone(ctx, b, z, zoneName) {


Let's take the opportunity to invert the return value here (or return an error). It's unexpected that p.ChangeZone() returning true means that an error happened.
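A minimal sketch of the "return an error" variant follows. The names Batch, submit, changeZone, and applyAll are hypothetical stand-ins and do not mirror the real signatures in provider/aws/aws.go; the point is only that nil means success and the caller collects the batches that failed.

```go
package main

import (
	"errors"
	"fmt"
)

// Batch is a hypothetical stand-in for one batch of record changes.
type Batch []string

// submit is a stub for the API call that applies a batch.
// Here it rejects any batch that contains an empty record name.
func submit(b Batch) error {
	for _, name := range b {
		if name == "" {
			return errors.New("empty record name")
		}
	}
	return nil
}

// changeZone applies one batch and returns nil on success, so callers
// can use the usual `if err != nil` check instead of a boolean.
func changeZone(zone string, b Batch) error {
	if err := submit(b); err != nil {
		return fmt.Errorf("zone %s: applying batch of %d changes: %w", zone, len(b), err)
	}
	return nil
}

// applyAll applies every batch and returns the ones that failed.
func applyAll(zone string, batches []Batch) []Batch {
	var failed []Batch
	for _, b := range batches {
		if err := changeZone(zone, b); err != nil {
			fmt.Println("error:", err)
			failed = append(failed, b)
		}
	}
	return failed
}

func main() {
	batches := []Batch{{"a.example.com"}, {"b.example.com", ""}}
	failed := applyAll("example.com", batches)
	fmt.Println("failed batches:", len(failed))
}
```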

linki · 2020-07-01T15:22:49Z

provider/aws/aws.go

 			}
+
+			size := len(names) / 2


We might want to ensure the minimum size is 2, because the main record and its ownership TXT record should be applied in the same transaction so that they are either both applied or not applied at all.

For safety we should also ensure that len(names) is even, I guess. Otherwise this logic might become incorrect.
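One way to honor both constraints is to cut only at pair boundaries. The sketch below is an assumption-laden illustration, not the PR's code: splitEven is a hypothetical helper, and it assumes the slice is ordered so that every record is immediately followed by its ownership TXT record.

```go
package main

import "fmt"

// splitEven splits items into two parts at an even index so that adjacent
// pairs (record, ownership TXT) stay together. It reports false when the
// slice cannot be split: odd length, or only a single pair left.
func splitEven(items []string) (left, right []string, ok bool) {
	if len(items)%2 != 0 || len(items) < 4 {
		return nil, nil, false
	}
	mid := len(items) / 2
	if mid%2 != 0 {
		mid++ // move the cut to the next pair boundary
	}
	return items[:mid], items[mid:], true
}

func main() {
	items := []string{
		"a.example.com", "txt-a.example.com",
		"b.example.com", "txt-b.example.com",
		"c.example.com", "txt-c.example.com",
	}
	left, right, ok := splitEven(items)
	fmt.Println(left, right, ok)
}
```

A batch holding a single pair is treated as indivisible, which gives the bisection a natural stopping point.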

linki · 2020-07-01T15:27:48Z

provider/aws/aws.go

-					time.Sleep(p.batchChangeInterval)
-				}
+	for i, b := range batchCs {
+		if p.ChangeZone(ctx, b, z, zoneName) {


We could introduce a feature flag here that can be used to toggle between the old approach and the better bisect handling. I think it's fairly easy to do and it would greatly help with getting this merged.

linki · 2020-07-01T15:53:15Z

There's an alternative approach worth trying that solves the TXT record problem, enables this feature for all providers and might be even easier to do.

You could implement this as part of the planning phase here. The Plan holds the current and desired records and calculates the raw diff of creates, updates and deletes needed to turn the current into the desired records. After the diff is calculated the changes are post-processed by different policies. We use that for instance to strip out all deletions when running in upsert-only mode.

You could very easily write and test a "DoHalf" policy that strips out half the endpoints from all creates, updates and deletes. Then you could use that policy to reduce the change set in case ApplyChanges fails:

        for {
          err = c.Registry.ApplyChanges(ctx, plan.Changes)
          if err == nil {
            break // needs some better logic to avoid infinite loops.
          }
          plan.Changes = DoHalf{}.Apply(plan.Changes) // strips out some of the desired changes, e.g. half of everything.
        }

Note that this happens before any of the TXT record handling and works for all providers. It should even be fine to continue once one batch has been processed successfully: the reconciling nature of ExternalDNS will pick up the remaining records in the next loop. That might avoid the complicated logic of keeping track of partitions or recursion. However, with so many providers down the processing line, I suspect there are some that won't play nice with this DoHalf policy.

This is just an idea how it could also be solved. Feel free to continue with the current approach as well.
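The loop above could be given a termination condition roughly as follows. This is only a sketch under stated assumptions: Changes here is a simplified stand-in that uses plain strings instead of endpoints, UpdateOld and UpdateNew are assumed to have equal length, and fakeApply and applyWithHalving are hypothetical helpers rather than existing ExternalDNS functions.

```go
package main

import (
	"errors"
	"fmt"
)

// Changes is a simplified stand-in for the plan's change set.
type Changes struct {
	Create, UpdateOld, UpdateNew, Delete []string
}

func (c Changes) size() int {
	return len(c.Create) + len(c.UpdateNew) + len(c.Delete)
}

// firstHalf keeps the first half of s, rounding up.
func firstHalf(s []string) []string { return s[:(len(s)+1)/2] }

// DoHalf keeps roughly half of every kind of change.
type DoHalf struct{}

func (DoHalf) Apply(c Changes) Changes {
	return Changes{
		Create:    firstHalf(c.Create),
		UpdateOld: firstHalf(c.UpdateOld),
		UpdateNew: firstHalf(c.UpdateNew),
		Delete:    firstHalf(c.Delete),
	}
}

// fakeApply stands in for Registry.ApplyChanges: it rejects any change
// set that tries to create the record named "bad".
func fakeApply(c Changes) error {
	for _, name := range c.Create {
		if name == "bad" {
			return errors.New("change set rejected")
		}
	}
	return nil
}

// applyWithHalving retries with a smaller change set until one attempt
// succeeds or the set can no longer shrink, which rules out infinite loops.
func applyWithHalving(apply func(Changes) error, c Changes) (Changes, error) {
	for {
		err := apply(c)
		if err == nil {
			return c, nil
		}
		next := DoHalf{}.Apply(c)
		if next.size() >= c.size() {
			return c, fmt.Errorf("cannot reduce change set further: %w", err)
		}
		c = next
	}
}

func main() {
	c := Changes{Create: []string{"a", "b", "c", "bad"}}
	applied, err := applyWithHalving(fakeApply, c)
	fmt.Println(applied.Create, err)
}
```

Changes dropped by the halving are not tracked here; as described above, the next reconciliation loop is expected to pick them up.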

linki · 2020-07-01T16:01:59Z

@OmerKahani The bug you see with one batch containing only the TXT records: Is that only when you use txt-prefix or even without prefix?

OmerKahani · 2020-07-02T09:47:38Z

Regarding the alternative approach in the planning phase: do you think we can push this kind of change? It feels like changing this code will result in a much bigger change (not in code but in concept). It is a better approach, I just want to make sure I will be able to merge the code :)

seanmalloy · 2020-08-17T04:11:47Z

/kind bug

fejta-bot · 2020-11-15T04:45:22Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot · 2020-11-15T04:45:29Z

@OmerKahani: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fejta-bot · 2020-12-15T05:30:06Z

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

fejta-bot · 2021-01-14T06:15:23Z

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot · 2021-01-14T06:15:31Z

@fejta-bot: Closed this PR.

OmerKahani added 5 commits June 22, 2020 03:04

WIP - basic implementation

603296f

fix tests

80cb847

refactor tests

869fc5e

refactor

294e373

refactor testing

3a96181

k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels Jun 22, 2020

k8s-ci-robot requested review from linki and njuettner June 22, 2020 13:15

k8s-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) Jun 22, 2020

fmt

5b9398d

OmerKahani changed the title ~~WIP - Smart bisect faild change batch~~ [WIP] - Smart bisect faild change batch Jun 23, 2020

linki reviewed Jul 1, 2020

k8s-ci-robot added the kind/bug label (categorizes issue or PR as related to a bug) Aug 17, 2020

k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR has remained open with no activity and has become stale) Nov 15, 2020

k8s-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) Nov 15, 2020

k8s-ci-robot added the lifecycle/rotten label (denotes an issue or PR that has aged beyond stale and will be auto-closed) and removed the lifecycle/stale label Dec 15, 2020

k8s-ci-robot closed this Jan 14, 2021
