Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sharded router based on namespace labels should notice routes immediately #16039

Merged
merged 1 commit into from
Oct 16, 2017

Conversation

pravisankar
Copy link

@pravisankar pravisankar commented Aug 29, 2017

  • Currently, sharded router based on namespace labels could take 2 resync
    intervals (10 to 15 mins) to notice new routes which may not be acceptable
    to some customers. This change allows routes to work immediately just like
    the non-sharded router behavior.

  • Watching project resource may not guarantee the order of the events,
    so there is no behavior change to shared router based on project labels.

Trello card: https://trello.com/c/Q0puUQOT
Rebased on top of #16315

@openshift-merge-robot openshift-merge-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 29, 2017
@pravisankar
Copy link
Author

@openshift/networking @knobunc @rajatchopra PTAL

@smarterclayton
Copy link
Contributor

This adds a lot of complexity to a simple code path. Watching projects is also not reliable, so you can't guarantee total visibility. So this is going to result in inconsistent results showing up.

At best, you can watch for events and refresh more quickly. But it still needs to be refresh driven.

@smarterclayton
Copy link
Contributor

We've never tried to do a project based informer cache, I'm not sure we should start here without a lot more review and time. This is effectively uncharted territory @liggitt @deads2k

@pravisankar
Copy link
Author

We did not add a lot of code for this functionality. More than half of the code in this pr is dependent code (#15916) and the core logic is calling couple of existing methods in NextNamespace() other than using informers.

We got 2 customers requesting this functionality (https://access.redhat.com/support/cases/01903093 and https://access.redhat.com/support/cases/01714223) and the request seems reasonable to me. Refreshing more quickly(namespace, endpoints and route resources) instead of watch will create a lot of etcd calls which is not desirable.

During my testing, namespace and project watching via informers worked as expected. I also wrote a extended router test case that exercises sharded router based on namespace/project labels (close to completion, similar to scoped-router).
What's the issue with project informers? If there is a known issue that needs more thought, at least allowing this functionality for namespace labels will be very helpful.

@smarterclayton @knobunc @rajatchopra @eparis @liggitt @deads2k

@@ -221,19 +219,25 @@ func (o *RouterSelection) Complete() error {
}

// NewFactory initializes a factory that will watch the requested routes
func (o *RouterSelection) NewFactory(routeclient routeclient.RoutesGetter, projectclient projectclient.ProjectResourceInterface, kc kclientset.Interface) *controllerfactory.RouterControllerFactory {
factory := controllerfactory.NewDefaultRouterControllerFactory(routeclient, kc)
func (o *RouterSelection) NewFactory(routeClient routeclient.RoutesGetter, projectClient *projectclientset.Clientset, kc kclientset.Interface) *controllerfactory.RouterControllerFactory {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching from a targeted interface to a concrete (broader) clientset is a code smell. What extra API surface is required here?

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 30, 2017
@deads2k
Copy link
Contributor

deads2k commented Aug 30, 2017

We've never tried to do a project based informer cache, I'm not sure we should start here without a lot more review and time. This is effectively uncharted territory @liggitt @deads2k

Projects won't work properly in an informer. It will appear to work, but it will break down on the edges as permissions are created and removed.

@rajatchopra
Copy link
Contributor

Projects won't work properly in an informer. It will appear to work, but it will break down on the edges as permissions are created and removed.

@deads2k
Can you please elaborate with an example. This may be critical at other places.

@deads2k
Copy link
Contributor

deads2k commented Sep 1, 2017

@deads2k
Can you please elaborate with an example. This may be critical at other places.

The "watch from resource version X" on projects does not actually watch from version X. The endpoint is logically a combination of five types: namespaces, clusterroles, clusterrolebindings, roles, rolebindings. This means that the concept of resource version becomes very confused, since there is no guaranteed ordering amongst different resources. Because the project watch was targeted at users (web console), we decided that making it "watch from now" was an acceptable semantic.

All that means that a reflector doing a list/watch won't get the behavior its expecting. And that means that cache that is built from it won't be reliable either. I think you could write something that deals in "watch from now", but I don't think that thing looks like a stock reflector/informer/indexer/lister stack.

@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 26, 2017
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 26, 2017
@pravisankar pravisankar changed the title Sharded router should notice routes immediately Sharded router based on namespace labels should notice routes immediately Sep 26, 2017
@pravisankar
Copy link
Author

@smarterclayton @knobunc @rajatchopra @openshift/networking PTAL
Rebased on top of #16315, last commit daa6c64 is relevant to this pr.

@pravisankar
Copy link
Author

/retest

Copy link
Contributor

@knobunc knobunc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. @deads2k are you okay with this?

@pravisankar
Copy link
Author

@smarterclayton @knobunc PTAL

…tely

- Currently, sharded router based on namespace labels could take 2 resync
intervals (10 to 15 mins) to notice new routes which may not be acceptable
to some customers. This change allows routes to work immediately just like
the non-sharded router behavior.

- Watching project resource may not guarantee the order of the events,
so there is no behavior change to shared router based on project labels.
@openshift-ci-robot openshift-ci-robot removed the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 2, 2017
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 2, 2017
@pravisankar
Copy link
Author

/test integration

@knobunc
Copy link
Contributor

knobunc commented Oct 11, 2017

@rajatchopra PTAL

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 11, 2017
Copy link
Contributor

@rajatchopra rajatchopra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

for i := 0; i < c.NamespaceRetries; i++ {
namespaces, err := c.Namespaces.NamespaceNames()
func (c *RouterController) HandleProjects() {
for i := 0; i < c.ProjectRetries; i++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop may be unnecessary

@rajatchopra
Copy link
Contributor

/approve

@pravisankar
Copy link
Author

Needs approval from pkg/cmd/OWNERS and test/integration/OWNERS
@knobunc @smarterclayton please review/approve

@knobunc
Copy link
Contributor

knobunc commented Oct 13, 2017

/approve

@knobunc
Copy link
Contributor

knobunc commented Oct 13, 2017

@smarterclayton @liggitt @enj -- Would someone please approve the pkg/cmd changes for this.

@eparis
Copy link
Member

eparis commented Oct 16, 2017

/approve

@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, knobunc, pravisankar, rajatchopra

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 16, 2017
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit fb7cdc9 into openshift:master Oct 16, 2017
pravisankar pushed a commit to pravisankar/origin that referenced this pull request Oct 23, 2017
Rationale:

- Resyncs are mainly intended for robustness. Mainly to handle the case
where the resource handler failed to process the item and we hope this
will be fixed if we process the item again after sometime(resync interval).
Yes, this may fix some transient errors but if we resync frequently then
there could be big penalities.

- Currently router watches these resources: routes, endpoints, nodes,
namespaces, ingresses and secrets. When we have several many routes
(like several thousand in online case), processing these items takes
long time, router reload itself takes few seconds (not milliseconds).
Due to short resync interval there will be constant churn of reprocessing
of all the items for all these resources.

- Earlier we needed shorter resync interval because sharded router was depending
on this interval but with openshift#16039
that limitation is removed.

10 mins seems aggressive for some rare transient errors, changed defaults
to 30 mins. Admin can edit router deployment config if they need custom resync interval.
pravisankar pushed a commit to pravisankar/origin that referenced this pull request Oct 23, 2017
Rationale:

- Resyncs are mainly intended for robustness. Mainly to handle the case
where the resource handler failed to process the item and we hope this
will be fixed if we process the item again after sometime(resync interval).
Yes, this may fix some transient errors but if we resync frequently then
there could be big penalities.

- Currently router watches these resources: routes, endpoints, nodes,
namespaces, ingresses and secrets. When we have many routes
(like several thousand in online case), processing these items takes
long time, router reload itself takes few seconds (not milliseconds).
Due to short resync interval there will be constant churn of reprocessing
of all the items for all these resources.

- Earlier we needed shorter resync interval because sharded router was depending
on this interval but with openshift#16039
that limitation is removed.

10 mins seems aggressive for some rare transient errors, changed defaults
to 30 mins. Admin can edit router deployment config if they need custom resync interval.
openshift-merge-robot added a commit that referenced this pull request Nov 18, 2017
Automatic merge from submit-queue (batch tested with PRs 17012, 17243).

Router: Changed default resource resync interval from 10mins to 30mins

- Resyncs are mainly intended for robustness. Mainly to handle the case
where the resource handler failed to process the item and we hope this
will be fixed if we process the item again after sometime(resync interval).
Yes, this may fix some transient errors but if we resync frequently then
there could be big penalities.

- Currently router watches these resources: routes, endpoints, nodes,
namespaces, ingresses and secrets. When we have many routes
(like several thousand in online case), processing these items takes
long time, router reload itself takes few seconds (not milliseconds).
Due to short resync interval there will be constant churn of reprocessing
of all the items for all these resources.

- Earlier we needed shorter resync interval because sharded router was depending
on this interval but with #16039 that limitation is removed.

10 mins seems aggressive for some rare transient errors, changed defaults
to 30 mins. Admin can edit router deployment config if they need custom resync interval.

Fixed project sync interval in router
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/routing lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.