Sharded router based on namespace labels should notice routes immediately #16039
Conversation
@openshift/networking @knobunc @rajatchopra PTAL
This adds a lot of complexity to a simple code path. Watching projects is also not reliable, so you can't guarantee total visibility. So this is going to result in inconsistent results showing up. At best, you can watch for events and refresh more quickly. But it still needs to be refresh driven.
We did not add a lot of code for this functionality. More than half of the code in this PR is dependent code (#15916), and the core logic is calling a couple of existing methods in NextNamespace() in addition to using informers. We have 2 customers requesting this functionality (https://access.redhat.com/support/cases/01903093 and https://access.redhat.com/support/cases/01714223), and the request seems reasonable to me. Refreshing more quickly (namespace, endpoints, and route resources) instead of watching would create a lot of etcd calls, which is not desirable. During my testing, namespace and project watching via informers worked as expected. I also wrote an extended router test case that exercises the sharded router based on namespace/project labels (close to completion, similar to scoped-router). @smarterclayton @knobunc @rajatchopra @eparis @liggitt @deads2k
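Conceptually, the change makes namespace admission event-driven instead of resync-driven: the shard's allowed-namespace set is recomputed as namespace events arrive rather than on the next resync. A minimal sketch of that idea follows; all type, field, and method names here (RouterShard, HandleNamespace) are illustrative stand-ins, not the PR's actual identifiers.

```go
package main

import "fmt"

// Namespace is a simplified stand-in for a Kubernetes namespace object.
type Namespace struct {
	Name   string
	Labels map[string]string
}

// RouterShard is a hypothetical sketch of a sharded router that serves
// only namespaces matching its label selector.
type RouterShard struct {
	selector map[string]string // label selector the shard is configured with
	allowed  map[string]bool   // namespaces the shard currently serves
}

// matches reports whether the namespace carries every selector label.
func (r *RouterShard) matches(ns Namespace) bool {
	for k, v := range r.selector {
		if ns.Labels[k] != v {
			return false
		}
	}
	return true
}

// HandleNamespace is invoked from an informer-style event handler on
// add/update/delete, so a newly matching namespace is noticed immediately
// instead of waiting up to two resync intervals.
func (r *RouterShard) HandleNamespace(ns Namespace, deleted bool) {
	if deleted || !r.matches(ns) {
		delete(r.allowed, ns.Name)
		return
	}
	r.allowed[ns.Name] = true
}

func main() {
	shard := &RouterShard{
		selector: map[string]string{"router": "shard-a"},
		allowed:  map[string]bool{},
	}
	// A namespace event arrives; the shard admits it right away.
	shard.HandleNamespace(Namespace{Name: "web", Labels: map[string]string{"router": "shard-a"}}, false)
	fmt.Println(shard.allowed["web"])
}
```

This mirrors the non-sharded router's behavior: routes in a newly labeled namespace become effective on the next event, not the next resync.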
pkg/cmd/infra/router/router.go
Outdated
@@ -221,19 +219,25 @@ func (o *RouterSelection) Complete() error {
}

// NewFactory initializes a factory that will watch the requested routes
func (o *RouterSelection) NewFactory(routeclient routeclient.RoutesGetter, projectclient projectclient.ProjectResourceInterface, kc kclientset.Interface) *controllerfactory.RouterControllerFactory {
	factory := controllerfactory.NewDefaultRouterControllerFactory(routeclient, kc)
func (o *RouterSelection) NewFactory(routeClient routeclient.RoutesGetter, projectClient *projectclientset.Clientset, kc kclientset.Interface) *controllerfactory.RouterControllerFactory {
Switching from a targeted interface to a concrete (broader) clientset is a code smell. What extra API surface is required here?
Projects won't work properly in an informer. It will appear to work, but it will break down on the edges as permissions are created and removed.
@deads2k
The "watch from resource version X" on projects does not actually watch from version X. The endpoint is logically a combination of five types: namespaces, clusterroles, clusterrolebindings, roles, and rolebindings. This means that the concept of resource version becomes very confused, since there is no guaranteed ordering amongst different resources. Because the project watch was targeted at users (web console), we decided that making it "watch from now" was an acceptable semantic. All that means that a reflector doing a list/watch won't get the behavior it's expecting. And that means that a cache built from it won't be reliable either. I think you could write something that deals in "watch from now", but I don't think that thing looks like a stock reflector/informer/indexer/lister stack.
@smarterclayton @knobunc @rajatchopra @openshift/networking PTAL
/retest
This looks good to me. @deads2k are you okay with this?
@smarterclayton @knobunc PTAL
…tely - Currently, a sharded router based on namespace labels could take 2 resync intervals (10 to 15 mins) to notice new routes, which may not be acceptable to some customers. This change allows routes to work immediately, just like the non-sharded router behavior. - Watching the project resource may not guarantee the order of events, so there is no behavior change for a sharded router based on project labels.
/test integration
@rajatchopra PTAL
/lgtm
for i := 0; i < c.NamespaceRetries; i++ {
	namespaces, err := c.Namespaces.NamespaceNames()
func (c *RouterController) HandleProjects() {
	for i := 0; i < c.ProjectRetries; i++ {
The loop may be unnecessary
/approve
Needs approval from pkg/cmd/OWNERS and test/integration/OWNERS
/approve
@smarterclayton @liggitt @enj -- Would someone please approve the pkg/cmd changes for this.
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: eparis, knobunc, pravisankar, rajatchopra The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these OWNERS files:
You can indicate your approval by writing
/retest Please review the full test history for this PR and help us cut down flakes.
Automatic merge from submit-queue.
Automatic merge from submit-queue (batch tested with PRs 17012, 17243). Router: Changed default resource resync interval from 10mins to 30mins - Resyncs are mainly intended for robustness: they handle the case where a resource handler failed to process an item, in the hope that processing the item again after some time (the resync interval) will succeed. This may fix some transient errors, but frequent resyncs carry big penalties. - Currently the router watches these resources: routes, endpoints, nodes, namespaces, ingresses and secrets. When there are many routes (several thousand in the online case), processing these items takes a long time, and a router reload itself takes a few seconds (not milliseconds). With a short resync interval there is constant churn from reprocessing all items for all these resources. - Earlier we needed a shorter resync interval because the sharded router depended on it, but with #16039 that limitation is removed. 10 mins seems aggressive for rare transient errors, so the default was changed to 30 mins. Admins can edit the router deployment config if they need a custom resync interval. Fixed project sync interval in router
Currently, a sharded router based on namespace labels could take 2 resync
intervals (10 to 15 mins) to notice new routes, which may not be acceptable
to some customers. This change allows routes to work immediately, just like
the non-sharded router behavior.
Watching the project resource may not guarantee the order of events,
so there is no behavior change for a sharded router based on project labels.
Trello card: https://trello.com/c/Q0puUQOT
Rebased on top of #16315