Sharded router based on namespace labels should notice routes immediately #16039

pravisankar · 2017-08-29T15:52:08Z

Currently, a sharded router based on namespace labels can take 2 resync
intervals (10 to 15 mins) to notice new routes, which may not be acceptable
to some customers. This change allows routes to work immediately, just like
the non-sharded router behavior.
Watching the project resource may not guarantee the order of events,
so there is no behavior change to the sharded router based on project labels.

Trello card: https://trello.com/c/Q0puUQOT
Rebased on top of #16315
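
The idea in the description above, reacting to a namespace event instead of waiting for the next resync, can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `shardFilter`, `OnNamespace` and `Admits` are hypothetical names, and label matching is simplified to exact key/value equality.

```go
package main

import "fmt"

// shardFilter is a hypothetical stand-in for the router's namespace filter.
type shardFilter struct {
	selector map[string]string // labels a namespace must carry to belong to this shard
	allowed  map[string]bool   // namespaces currently admitted to this shard
}

func newShardFilter(selector map[string]string) *shardFilter {
	return &shardFilter{selector: selector, allowed: map[string]bool{}}
}

// matches reports whether labels satisfy every key/value pair in the selector.
func (f *shardFilter) matches(labels map[string]string) bool {
	for k, v := range f.selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// OnNamespace handles a namespace add/update event. It returns true when the
// admitted set changed, which is the caller's cue to reprocess that
// namespace's routes right away rather than at the next resync.
func (f *shardFilter) OnNamespace(name string, labels map[string]string) bool {
	want := f.matches(labels)
	if f.allowed[name] == want {
		return false
	}
	if want {
		f.allowed[name] = true
	} else {
		delete(f.allowed, name)
	}
	return true
}

// Admits reports whether routes in the given namespace belong to this shard.
func (f *shardFilter) Admits(namespace string) bool {
	return f.allowed[namespace]
}

func main() {
	f := newShardFilter(map[string]string{"router": "shard-a"})
	fmt.Println(f.Admits("blue")) // false: no event seen yet
	changed := f.OnNamespace("blue", map[string]string{"router": "shard-a"})
	fmt.Println(changed, f.Admits("blue")) // true true
}
```

The boolean return value is a design choice for the sketch: it lets the caller skip route reprocessing when a namespace update did not change shard membership.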

pravisankar · 2017-08-29T15:54:43Z

@openshift/networking @knobunc @rajatchopra PTAL

smarterclayton · 2017-08-29T17:44:11Z

This adds a lot of complexity to a simple code path. Watching projects is also not reliable, so you can't guarantee total visibility. So this is going to result in inconsistent results showing up.

At best, you can watch for events and refresh more quickly. But it still needs to be refresh driven.
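
One way to read the suggestion above: events only trigger an earlier refresh, and the refresh itself stays a full re-list. A minimal sketch of that shape, assuming hypothetical names (`runRefresher`, `nudge`) that do not come from the router code:

```go
package main

import (
	"fmt"
	"time"
)

// runRefresher performs a full refresh on every periodic tick and also
// whenever a watch event nudges it. It returns once stop is closed.
func runRefresher(tick <-chan time.Time, nudge, stop <-chan struct{}, refresh func()) {
	for {
		select {
		case <-stop:
			return
		case <-tick:
			refresh()
		case <-nudge:
			refresh()
		}
	}
}

// demo drives the loop with one nudge and one tick and returns how many
// refreshes ran. Unbuffered channels keep the sequence deterministic.
func demo() int {
	tick := make(chan time.Time)
	nudge := make(chan struct{})
	stop := make(chan struct{})
	refreshed := make(chan struct{})
	count := 0
	go runRefresher(tick, nudge, stop, func() {
		count++
		refreshed <- struct{}{}
	})

	nudge <- struct{}{} // a watch event arrived: refresh early
	<-refreshed
	tick <- time.Now() // the periodic resync fired
	<-refreshed
	close(stop)
	return count
}

func main() {
	fmt.Println("refreshes:", demo())
}
```

In a real loop `tick` would typically be the channel of a `time.Ticker`; it is passed in here so the example does not depend on wall-clock timing.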

smarterclayton · 2017-08-29T17:45:45Z

We've never tried to do a project based informer cache, I'm not sure we should start here without a lot more review and time. This is effectively uncharted territory @liggitt @deads2k

pravisankar · 2017-08-29T21:59:45Z

We did not add a lot of code for this functionality. More than half of the code in this PR is dependent code (#15916), and the core logic, other than using informers, is calling a couple of existing methods in NextNamespace().

We got 2 customers requesting this functionality (https://access.redhat.com/support/cases/01903093 and https://access.redhat.com/support/cases/01714223) and the request seems reasonable to me. Refreshing more quickly (namespace, endpoints and route resources) instead of watching will create a lot of etcd calls, which is not desirable.

During my testing, namespace and project watching via informers worked as expected. I also wrote an extended router test case that exercises the sharded router based on namespace/project labels (close to completion, similar to scoped-router).
What's the issue with project informers? If there is a known issue that needs more thought, at least allowing this functionality for namespace labels will be very helpful.

@smarterclayton @knobunc @rajatchopra @eparis @liggitt @deads2k

liggitt · 2017-08-29T22:18:26Z

pkg/cmd/infra/router/router.go

@@ -221,19 +219,25 @@ func (o *RouterSelection) Complete() error {
 }

 // NewFactory initializes a factory that will watch the requested routes
-func (o *RouterSelection) NewFactory(routeclient routeclient.RoutesGetter, projectclient projectclient.ProjectResourceInterface, kc kclientset.Interface) *controllerfactory.RouterControllerFactory {
-	factory := controllerfactory.NewDefaultRouterControllerFactory(routeclient, kc)
+func (o *RouterSelection) NewFactory(routeClient routeclient.RoutesGetter, projectClient *projectclientset.Clientset, kc kclientset.Interface) *controllerfactory.RouterControllerFactory {


Switching from a targeted interface to a concrete (broader) clientset is a code smell. What extra API surface is required here?
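
To illustrate the concern with a generic example: a function that accepts a narrow interface can only reach the calls it declares, and any small fake satisfies it. The types below (`projectLister`, `fullClientset`, `fakeLister`) are hypothetical and are not the actual `projectclient` or `projectclientset` APIs.

```go
package main

import "fmt"

// projectLister is a hypothetical narrow interface: only the one call needed.
type projectLister interface {
	ListProjectNames() ([]string, error)
}

// fullClientset stands in for a broad concrete client with unrelated methods.
type fullClientset struct{ names []string }

func (c *fullClientset) ListProjectNames() ([]string, error) { return c.names, nil }

// DeleteEverything represents API surface the caller below never needs.
func (c *fullClientset) DeleteEverything() error { return nil }

// countProjects depends only on projectLister, so it cannot call
// DeleteEverything even when handed a *fullClientset.
func countProjects(p projectLister) (int, error) {
	names, err := p.ListProjectNames()
	if err != nil {
		return 0, err
	}
	return len(names), nil
}

// fakeLister is a tiny test double that satisfies the narrow interface.
type fakeLister []string

func (f fakeLister) ListProjectNames() ([]string, error) { return f, nil }

func main() {
	n, _ := countProjects(&fullClientset{names: []string{"a", "b"}})
	m, _ := countProjects(fakeLister{"x"})
	fmt.Println(n, m) // 2 1
}
```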

deads2k · 2017-08-30T16:01:47Z

> We've never tried to do a project based informer cache, I'm not sure we should start here without a lot more review and time. This is effectively uncharted territory @liggitt @deads2k

Projects won't work properly in an informer. It will appear to work, but it will break down on the edges as permissions are created and removed.

rajatchopra · 2017-09-01T16:58:01Z

> Projects won't work properly in an informer. It will appear to work, but it will break down on the edges as permissions are created and removed.

@deads2k
Can you please elaborate with an example. This may be critical at other places.

deads2k · 2017-09-01T17:12:54Z

> @deads2k
> Can you please elaborate with an example. This may be critical at other places.

The "watch from resource version X" on projects does not actually watch from version X. The endpoint is logically a combination of five types: namespaces, clusterroles, clusterrolebindings, roles, rolebindings. This means that the concept of resource version becomes very confused, since there is no guaranteed ordering amongst different resources. Because the project watch was targeted at users (web console), we decided that making it "watch from now" was an acceptable semantic.

All that means that a reflector doing a list/watch won't get the behavior it's expecting. And that means that a cache that is built from it won't be reliable either. I think you could write something that deals in "watch from now", but I don't think that thing looks like a stock reflector/informer/indexer/lister stack.
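
The gap described above can be shown with a toy model. This sketch uses a single in-memory event log, so it deliberately ignores the five-resource ordering problem and only demonstrates the list/watch gap; `server`, `eventsSince` and `buildCache` are hypothetical names, not API server or client-go code.

```go
package main

import "fmt"

// event is one entry in the toy log; version is its position (1-based).
type event struct {
	version int
	name    string
}

// server is a toy, single-resource event log.
type server struct{ log []event }

func (s *server) add(name string) {
	s.log = append(s.log, event{version: len(s.log) + 1, name: name})
}

// list returns a snapshot of all names and the version that snapshot reflects.
func (s *server) list() ([]string, int) {
	names := make([]string, 0, len(s.log))
	for _, e := range s.log {
		names = append(names, e.name)
	}
	return names, len(s.log)
}

// eventsSince returns every event with a version greater than rv.
func (s *server) eventsSince(rv int) []event {
	return append([]event(nil), s.log[rv:]...)
}

// buildCache lists, lets one object appear before the watch is established,
// then applies the watch. With honorRV the watch resumes from the listed
// version; without it the watch starts "from now" and ignores that version.
func buildCache(honorRV bool) []string {
	s := &server{}
	s.add("ns-1")
	cache, rv := s.list()
	s.add("ns-2") // created between the list and the watch
	start := rv
	if !honorRV {
		start = len(s.log) // watch from now
	}
	s.add("ns-3") // created after the watch is established
	for _, e := range s.eventsSince(start) {
		cache = append(cache, e.name)
	}
	return cache
}

func main() {
	fmt.Println(buildCache(true))  // [ns-1 ns-2 ns-3]
	fmt.Println(buildCache(false)) // [ns-1 ns-3]
}
```

With "watch from now", `ns-2` never reaches the cache until something re-lists, which is why a cache built this way still needs a periodic refresh.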

pravisankar · 2017-09-26T23:38:00Z

@smarterclayton @knobunc @rajatchopra @openshift/networking PTAL
Rebased on top of #16315, last commit daa6c64 is relevant to this pr.

pravisankar · 2017-09-27T02:19:44Z

/retest

knobunc

This looks good to me. @deads2k are you okay with this?

pravisankar · 2017-09-29T19:31:39Z

@smarterclayton @knobunc PTAL

pravisankar · 2017-10-02T20:33:51Z

/test integration

knobunc · 2017-10-11T13:18:57Z

@rajatchopra PTAL

rajatchopra

/lgtm

rajatchopra · 2017-10-11T19:16:46Z

pkg/router/controller/router_controller.go

-	for i := 0; i < c.NamespaceRetries; i++ {
-		namespaces, err := c.Namespaces.NamespaceNames()
+func (c *RouterController) HandleProjects() {
+	for i := 0; i < c.ProjectRetries; i++ {


The loop may be unnecessary

rajatchopra · 2017-10-11T19:17:21Z

/approve

pravisankar · 2017-10-11T19:35:42Z

Needs approval from pkg/cmd/OWNERS and test/integration/OWNERS
@knobunc @smarterclayton please review/approve

knobunc · 2017-10-13T18:06:13Z

/approve

knobunc · 2017-10-13T18:08:24Z

@smarterclayton @liggitt @enj -- Would someone please approve the pkg/cmd changes for this.

eparis · 2017-10-16T16:19:25Z

/approve

openshift-merge-robot · 2017-10-16T16:19:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, knobunc, pravisankar, rajatchopra

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~pkg/cmd/OWNERS~~ [eparis]
~~pkg/router/OWNERS~~ [eparis,knobunc,rajatchopra]
~~test/integration/OWNERS~~ [eparis,knobunc]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

openshift-bot · 2017-10-16T19:51:06Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-merge-robot · 2017-10-16T22:40:09Z

Automatic merge from submit-queue.

Automatic merge from submit-queue (batch tested with PRs 17012, 17243).

Router: Changed default resource resync interval from 10mins to 30mins

- Resyncs are mainly intended for robustness, mainly to handle the case where the resource handler failed to process an item and we hope this will be fixed if we process the item again after some time (the resync interval). Yes, this may fix some transient errors, but if we resync frequently there could be big penalties.
- Currently the router watches these resources: routes, endpoints, nodes, namespaces, ingresses and secrets. When we have many routes (like several thousand in the online case), processing these items takes a long time, and a router reload itself takes a few seconds (not milliseconds). Due to the short resync interval there will be constant churn from reprocessing all the items for all these resources.
- Earlier we needed a shorter resync interval because the sharded router depended on this interval, but with #16039 that limitation is removed. 10 mins seems aggressive for some rare transient errors, so the default is changed to 30 mins. Admins can edit the router deployment config if they need a custom resync interval.

Fixed project sync interval in router

pravisankar added the component/routing label Aug 29, 2017

openshift-merge-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 29, 2017

openshift-merge-robot assigned pecameron and rajatchopra Aug 29, 2017

liggitt reviewed Aug 29, 2017

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2017

pravisankar force-pushed the fix-route-delay branch from f54f796 to daa6c64 Compare September 26, 2017 23:26

openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 26, 2017

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2017

pravisankar changed the title ~~Sharded router should notice routes immediately~~ Sharded router based on namespace labels should notice routes immediately Sep 26, 2017

pravisankar force-pushed the fix-route-delay branch from daa6c64 to 5f6dc52 Compare September 27, 2017 02:11

knobunc approved these changes Sep 27, 2017

pravisankar force-pushed the fix-route-delay branch from 5f6dc52 to b8657dc Compare September 29, 2017 19:18

openshift-merge-robot added the needs-api-review label Sep 29, 2017

pravisankar force-pushed the fix-route-delay branch from b8657dc to aa3d8e4 Compare September 29, 2017 19:22

openshift-merge-robot removed the needs-api-review label Sep 29, 2017

pravisankar force-pushed the fix-route-delay branch from aa3d8e4 to c9f43f7 Compare October 2, 2017 19:04

openshift-ci-robot removed the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 2, 2017

openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 2, 2017

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2017

rajatchopra approved these changes Oct 11, 2017

openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2017

openshift-merge-robot merged commit fb7cdc9 into openshift:master Oct 16, 2017

pravisankar mentioned this pull request Oct 23, 2017

Router: Changed default resource resync interval from 10mins to 30mins #17012

Merged

pravisankar mentioned this pull request Oct 24, 2017

[Router] Namespace sync not guaranteed to run before route sync #13437

Closed
