Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the router reload suppression so that it doesn't block updates #17049

Conversation

knobunc
Copy link
Contributor

@knobunc knobunc commented Oct 26, 2017

Change the router reload suppression so that it doesn't block updates

This changes the locking so that a reload doesn't hold a lock of the router object for the duration of the reload so that updates that happen while the router is reloading can be processed immediately and batched up, then included when the next reload occurs. Before this, if a reload ran longer than the reload interval, only one event would be processed before triggering a new reload. Which would then lock out other event processing. This caused the router to not make any meaningful progress consuming events.

A new module to do the rate limiting has been added.

The module has have a top and bottom half. The top half simply calls the bottom half with a flag indicating the user has made a change. The flag simply tells the bottom half to register the desire to reload (so we can do it under a single lock acquisition).

The bottom half is in charge of determining if it can immediately reload or if it has to wait. If it must wait, then it works out the earliest time it can reload and schedules a callback to itself for that time.

If it determines it can reload, then it runs the reload code immediately. When the reload is complete, it calls itself again to make sure there was no other pending reload that had come in while the reload was running.

Whenever the bottom half calls itself, it does it without the flag indicating the user made a change.

Fixes bug 1471899 -- https://bugzilla.redhat.com/show_bug.cgi?id=1471899

@openshift/networking PTAL

Added new environment variables to help with debugging:
 - OPENSHIFT_LOG_LEVEL: Defaults to 4, but sets the debug level to the given value
 - OPENSHIFT_GET_ALL_DOCKER_LOGS: A boolean that enables dumping of all container logs if any container failed (rather than just giving the logs from the failure)
@knobunc knobunc self-assigned this Oct 26, 2017
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 26, 2017
@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2017
@knobunc
Copy link
Contributor Author

knobunc commented Oct 26, 2017

/test

@knobunc knobunc force-pushed the bug/bz1471899-change-router-locking branch 2 times, most recently from 2c3828a to 3893117 Compare October 26, 2017 15:02
This changes the locking so that a reload doesn't hold a lock of the
router object for the duration of the reload so that updates that
happen while the router is reloading can be processed immediately and
batched up, then included when the next reload occurs. Before this, if
a reload ran longer than the reload interval, only one event would be
processed before triggering a new reload. Which would then lock out
other event processing.  This caused the router to not make any
meaningful progress consuming events.

A new module to do the rate limiting has been added.

The module has have a top and bottom half.  The top half simply calls
the bottom half with a flag indicating the user has made a change.
The flag simply tells the bottom half to register the desire to reload
(so we can do it under a single lock acquisition).

The bottom half is in charge of determining if it can immediately
reload or if it has to wait.  If it must wait, then it works out the
earliest time it can reload and schedules a callback to itself for
that time.

If it determines it can reload, then it runs the reload code
immediately.  When the reload is complete, it calls itself again to
make sure there was no other pending reload that had come in while the
reload was running.

Whenever the bottom half calls itself, it does it without the flag
indicating the user made a change.

Fixes bug 1471899 -- https://bugzilla.redhat.com/show_bug.cgi?id=1471899
@knobunc knobunc force-pushed the bug/bz1471899-change-router-locking branch from 3893117 to dac5ce6 Compare October 26, 2017 15:44
@eparis
Copy link
Member

eparis commented Oct 26, 2017

This scares me a lot coming in so late. Can you confirm with QA that they have time to run all their router tests on this, at scale? I'm starting to lean to pushing this out of 3.7, even though we know it would be huge win for us...

@pravisankar
Copy link

pravisankar commented Oct 27, 2017

Changes are very easy to understand, thanks Ben!
With this we guarantee only one router reload is executed for the given interval.
In here we chose the interval as period between start of last reload and start of next reload. This seems good if the handler function commitAndReload allows enough coalescing i.e. other router ops are not blocked when router is reloading.

In router commitAndReload, we do:
(1) Write the state and config holding the lock
(2) Reload script is called outside the lock.
If time taken by (1) is substantial, then we won't see much coalescing during reload. If that is the case then end-to-start interval might be better than start-to-start interval.

@pravisankar
Copy link

pravisankar commented Oct 27, 2017

After discussing with @knobunc and @rajatchopra
In commitAndReload, time taken writing the state/config holding the lock will be very small compared to the time taken by reload script. So this is not an issue.

Copy link
Contributor

@rajatchopra rajatchopra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Non blocking comments posted above.

if untilNextCallback > 0 {
// We want to reload... but can't yet because some window is not satisfied
if csrl.callbackTimer == nil {
csrl.callbackTimer = time.AfterFunc(untilNextCallback, func() { csrl.changeWorker(false) })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point we should have a for loop instead of a recursive call, just to avoid the remote possibility of constant changes causing stackoverflow.


return csrl.handlerFunc()
}
if err := runHandler(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: The func variable is the same as the function it resides in. Maybe choose a different name here. Spun me around a bit because I thought we are making a recursive call :).

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2017
@openshift-merge-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, rajatchopra

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@knobunc knobunc added the kind/bug Categorizes issue or PR as related to a bug. label Oct 27, 2017
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 6f125e4 into openshift:master Oct 28, 2017
@knobunc knobunc deleted the bug/bz1471899-change-router-locking branch June 7, 2018 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. component/routing kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants