Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replaced event queue based watching resources in router with shared informers #16315

Merged

Conversation

pravisankar
Copy link

@pravisankar pravisankar commented Sep 13, 2017

  • Custom shared informer is used to leverage namespace, label and field filtering.
    (Auto generated shared informer does not allow this)

  • Listing resources by shared informers doesn't order by resource version/creation time.
    So custom lister for routes is used to order the route list by creation time and this
    will allow oldest route to be processed before new route to claim the host name.

  • Synchronization with the informer queue and cache is a bit difficult as the cache could
    have newer changes than what was pushed on to the queue. Luckily We only care about the
    first sync to avoid 503 status code for routes.

  • Handling first sync:

    • Informers are started with no registered event handlers
    • Wait for all informers to be synced
    • Block router reload
    • Get list of items from informers store and process manually
    • Perform router reload
    • Register router event handlers
      This guarantees first router sync is performed after processing all existing items.
  • Subsequent router syncs rely on informer syncing sate and uses rate limiter to coalesce changes.

  • Deleted eventQueue, no longer used

Trello card: https://trello.com/c/y6SFvOA7

@pravisankar
Copy link
Author

[test]

@openshift-ci-robot openshift-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Sep 13, 2017
@pravisankar
Copy link
Author

@openshift/networking @knobunc @rajatchopra PTAL

@dcbw
Copy link
Contributor

dcbw commented Sep 13, 2017

/test integration issue #16312

@knobunc
Copy link
Contributor

knobunc commented Sep 19, 2017

/lgtm Thanks @pravisankar.

@smarterclayton does this look sane to you? Obviously we still need to look into some of the test case failures (especially the one around the router reload).

@smarterclayton
Copy link
Contributor

I don't like the live calls. Let's do this the proper way and use the cache correctly. Wait for sync, then flip a bool and force a refresh on the cache.

Also, don't use the resource name switch, just embed five informer inits. Strong typing is better

@smarterclayton
Copy link
Contributor

Note that I'm really glad the event queue is gone, just want to get the last extra mile to make this "normal". Live calls are bad because they won't take advantage of API chunking when we turn that on in 3.8

@smarterclayton
Copy link
Contributor

smarterclayton commented Sep 20, 2017 via email

@deads2k
Copy link
Contributor

deads2k commented Sep 20, 2017

@deads2k re: how to safely have "only do this after sync" with the informer

In general, "do this after sync" is expressed by filling a work queue while the cache is priming, but not starting any workers. After all caches have sync, you can synchronously do some work (fill a secondary cache perhaps?) and then start a single worker.

Doing it like that would ensure let you trigger based off of a shared informer, do some work before you consume and process any update, and process resources in order. If you must process individual watch notifications (not resources), then you could fill your own queue (or super deep channel).

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 23, 2017
@smarterclayton
Copy link
Contributor

smarterclayton commented Sep 23, 2017 via email

@pravisankar pravisankar force-pushed the router-change-to-informer branch 2 times, most recently from ff000e8 to 3dc6842 Compare September 25, 2017 23:20
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 25, 2017
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 25, 2017
@pravisankar
Copy link
Author

[test]

@pravisankar
Copy link
Author

@smarterclayton @knobunc @rajatchopra Updated, PTAL

routes, err := lw.client.Routes(lw.namespace).List(opts)
if err != nil {
return nil, err
rc.FirstSyncDone = func() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once has synced has returned true, this is unnecessary.

Maybe we should talk in person, I still think this is much more complicated than it needs to be.

  1. Set "syncing" boolean true (never commit while this is true)
  2. have your config loops update internal structs
  3. wait for all informers synced
  4. set syncing to be true
  5. trigger a refresh in informers

3-5 should be able to be done in a single method.

Copy link
Contributor

@smarterclayton smarterclayton Sep 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the issue here is that you have two caches:

  1. in the informer
  2. caches in the route plugin

The second cache is filled by the first cache. The code that fills the second cache is not done under the lock that maintains the first cache, and therefore while order of events can be guaranteed, there is no guarantee that the second cache is up to date with the first cache.

You need to determine when the second cache has observed all of the events from the first cache once it has synced once.

I think you can address this by starting the informers, waiting for synced (on all of them) and then call each cache as informer.GetStore().List() and send those to the route controller as adds. Once all of them are done, then call commit. Then register the route controller as a listener on the informer, and every update from then on is safe.

@pravisankar
Copy link
Author

/test extended_conformance_gce

@smarterclayton
Copy link
Contributor

smarterclayton commented Sep 27, 2017 via email

@smarterclayton
Copy link
Contributor

smarterclayton commented Sep 27, 2017 via email

// event handlers have the same view of sync state.
c.endpointsListConsumed = c.EndpointsListConsumed()
c.commit()
c.updateConsumedCount(endpoints)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Counting isn't enough to tell you whether you've got everything. If an endpoint is deleted while you're doing the sync you will never reach your target number. You can't synchronize with your handlers like this unfortunately.

@pravisankar
Copy link
Author

@smarterclayton @knobunc Updated, PTAL

}
time.Sleep(50 * time.Millisecond)
}
c.StartInformers(utilwait.NeverStop)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem to belong here, but in the factory. Why should this be a concern of the route controller?

Registering functions into another type is a code smell. Just have a higher level method that calls public methods on the controller from the factory

glog.Fatalf("Failed to sync router informer cache: %v", err)
}
c.processExistingItems()
c.firstSyncDone = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only part of this method that really belongs here is setting this Boolean (which needs to be under a lock).

c.HandleEndpoints(watch.Added, item.(*kapi.Endpoints))
}

for _, item := range c.InformerCacheList(&routeapi.Route{}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstraction is unnecessary. Do this from the factory and just call the store directly.

@pravisankar
Copy link
Author

@smarterclayton @knobunc rearranged the code as suggested, please review

@pravisankar pravisankar force-pushed the router-change-to-informer branch 2 times, most recently from ed94e3d to f5ddc5e Compare September 28, 2017 18:15
field fields.Selector
namespace string
func (f *RouterControllerFactory) initCallbacks(rc *routercontroller.RouterController) {
rc.HasSyncedInformers = func() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When re-sync is in progress, we want to reduce the number of reloads. We could fully rely on router coalescing and can get rid of this informer synced check. @knobunc @rajatchopra what do you think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with 1000 routes to check whether router coalescing is sufficient or informer synced check is necessary to reduce reloads. Router coalescing without sync check worked fine, so removed this unnecessary check.

}
if lw.field != nil {
field = lw.field.String()
func (f *RouterControllerFactory) processExistingItems(rc *routercontroller.RouterController) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

godoc on this function and how it is used (and reason)

Ravi Sankar Penta added 2 commits September 29, 2017 12:21
…nformers

- Custom shared informer is used to leverage namespace, label and field filtering.
  (Auto generated shared informer does not allow this)
- Listing resources by shared informers doesn't order by resource version/creation time.
  So custom lister for routes is used to order the route list by creation time and this
  will allow oldest route to be processed before new route to claim the host name.
- Synchronization with the informer queue and cache is a bit difficult as the cache could
  have newer changes than what was pushed on to the queue. Luckily We only care about the
  first sync to avoid 503 status code for routes.
- Handling first sync:
  * Informers are started with no registered event handlers
  * Wait for all informers to be synced
  * Block router reload
  * Get list of items from informers store and process manually
  * Perform router reload
  * Register router event handlers
  This guarantees first router sync is performed after processing all existing items.
- Subsequent router syncs rely on informer syncing sate and uses rate limiter to coalesce changes.
@pravisankar
Copy link
Author

@smarterclayton @knobunc @rajatchopra can you please take another look?

@smarterclayton
Copy link
Contributor

Looks phenomenal, thanks for cleaning up. Very easy to read now.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 29, 2017
@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, smarterclayton

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 29, 2017
@pravisankar
Copy link
Author

/retest

@smarterclayton smarterclayton added the kind/bug Categorizes issue or PR as related to a bug. label Sep 29, 2017
@smarterclayton
Copy link
Contributor

Was preapproved due to importance to getting this fixed

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 2af8e92 into openshift:master Sep 30, 2017
openshift-merge-robot added a commit that referenced this pull request Oct 16, 2017
Automatic merge from submit-queue.

Sharded router based on namespace labels should notice routes immediately

- Currently, sharded router based on namespace labels could take 2 resync
  intervals (10 to 15 mins) to notice new routes which may not be acceptable
  to some customers. This change allows routes to work immediately just like
  the non-sharded router behavior.

- Watching project resource may not guarantee the order of the events,
  so there is no behavior change to shared router based on project labels.

Trello card: https://trello.com/c/Q0puUQOT
Rebased on top of #16315
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/routing kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants