Change the router reload suppression so that it doesn't block updates #17049

knobunc · 2017-10-26T13:53:41Z

Change the router reload suppression so that it doesn't block updates

This changes the locking so that a reload no longer holds the lock on the router object for its entire duration. Updates that arrive while the router is reloading can now be processed immediately and batched up, then included when the next reload occurs. Before this, if a reload ran longer than the reload interval, only one event would be processed before a new reload was triggered, which would then lock out other event processing. As a result, the router made no meaningful progress consuming events.

A new module to do the rate limiting has been added.

The module has a top half and a bottom half. The top half simply calls the bottom half with a flag indicating that the user has made a change. The flag tells the bottom half to register the desire to reload (so we can do it under a single lock acquisition).

The bottom half is in charge of determining if it can immediately reload or if it has to wait. If it must wait, then it works out the earliest time it can reload and schedules a callback to itself for that time.

If it determines it can reload, then it runs the reload code immediately. When the reload is complete, it calls itself again to make sure there was no other pending reload that had come in while the reload was running.

Whenever the bottom half calls itself, it does it without the flag indicating the user made a change.

Fixes bug 1471899 -- https://bugzilla.redhat.com/show_bug.cgi?id=1471899

@openshift/networking PTAL

Added new environment variables to help with debugging:
- OPENSHIFT_LOG_LEVEL: Defaults to 4, but sets the debug level to the given value
- OPENSHIFT_GET_ALL_DOCKER_LOGS: A boolean that enables dumping of all container logs if any container failed (rather than just giving the logs from the failure)

knobunc · 2017-10-26T13:55:52Z

/test

eparis · 2017-10-26T21:20:59Z

This scares me a lot coming in so late. Can you confirm with QA that they have time to run all their router tests on this, at scale? I'm starting to lean toward pushing this out of 3.7, even though we know it would be a huge win for us...

pravisankar · 2017-10-27T17:32:21Z

Changes are very easy to understand, thanks Ben!
With this we guarantee that only one router reload is executed in a given interval.
Here the interval is the period between the start of the last reload and the start of the next reload. This seems good if the handler function commitAndReload allows enough coalescing, i.e. other router operations are not blocked while the router is reloading.

In the router's commitAndReload, we do:
(1) Write the state and config while holding the lock.
(2) Call the reload script outside the lock.
If the time taken by (1) is substantial, then we won't see much coalescing during the reload. In that case, an end-to-start interval might be better than a start-to-start interval.

pravisankar · 2017-10-27T18:31:11Z

After discussing with @knobunc and @rajatchopra:
In commitAndReload, the time spent writing the state/config while holding the lock will be very small compared to the time taken by the reload script. So this is not an issue.

rajatchopra

/lgtm
/approve

Non blocking comments posted above.

rajatchopra · 2017-10-27T20:03:57Z

pkg/router/template/limiter/limiter.go

+	if untilNextCallback > 0 {
+		// We want to reload... but can't yet because some window is not satisfied
+		if csrl.callbackTimer == nil {
+			csrl.callbackTimer = time.AfterFunc(untilNextCallback, func() { csrl.changeWorker(false) })


At some point we should have a for loop instead of a recursive call, just to avoid the remote possibility of constant changes causing a stack overflow.
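One possible shape for the loop-based variant suggested here is sketched below. It is deliberately simplified: the rate-limiting window and timer are left out, and the type `looper` with its `Trigger` method is a hypothetical name for this example, not something from limiter.go.

```go
package main

import "sync"

// looper is a hypothetical, simplified stand-in for the limiter
// (no interval handling; names are assumptions).
type looper struct {
	mu      sync.Mutex
	pending bool   // a change is waiting to be handled
	running bool   // some caller is already draining changes
	handler func() // the reload; runs without mu held
}

// Trigger records a change. If nobody else is draining, it runs the
// handler in a loop until no change remains pending, so the stack
// depth stays constant no matter how many changes arrive.
func (l *looper) Trigger() {
	l.mu.Lock()
	l.pending = true
	if l.running {
		l.mu.Unlock()
		return // the draining caller will see pending and loop again
	}
	l.running = true
	for l.pending {
		l.pending = false
		l.mu.Unlock()
		l.handler()
		l.mu.Lock()
	}
	l.running = false
	l.mu.Unlock()
}
```

A change that arrives while the handler is running only sets `pending`, and the loop picks it up on the next iteration instead of through a nested call.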

rajatchopra · 2017-10-27T20:05:54Z

pkg/router/template/limiter/limiter.go

+
+		return csrl.handlerFunc()
+	}
+	if err := runHandler(); err != nil {


Nit: The func variable has the same name as the function it resides in. Maybe choose a different name here. It spun me around a bit because I thought we were making a recursive call :).

openshift-merge-robot · 2017-10-27T20:07:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: knobunc, rajatchopra

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

~~pkg/router/OWNERS~~ [knobunc,rajatchopra]
~~test/end-to-end/OWNERS~~ [knobunc]
~~test/integration/OWNERS~~ [knobunc]

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

openshift-bot · 2017-10-28T18:10:01Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-merge-robot · 2017-10-28T22:18:42Z

Automatic merge from submit-queue.

knobunc added the component/routing label Oct 26, 2017

knobunc self-assigned this Oct 26, 2017

knobunc requested review from DirectXMan12 and rajatchopra October 26, 2017 13:53

openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 26, 2017

openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 26, 2017

knobunc mentioned this pull request Oct 26, 2017

Change the router reload supression, and add some more controls #16564

Closed

knobunc force-pushed the bug/bz1471899-change-router-locking branch 2 times, most recently from 2c3828a to 3893117 Compare October 26, 2017 15:02

knobunc force-pushed the bug/bz1471899-change-router-locking branch from 3893117 to dac5ce6 Compare October 26, 2017 15:44

knobunc requested a review from pravisankar October 26, 2017 18:55

pravisankar approved these changes Oct 27, 2017

openshift-ci-robot assigned rajatchopra Oct 27, 2017

rajatchopra approved these changes Oct 27, 2017

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2017

knobunc added the kind/bug Categorizes issue or PR as related to a bug. label Oct 27, 2017

openshift-merge-robot merged commit 6f125e4 into openshift:master Oct 28, 2017

jmencak mentioned this pull request Dec 13, 2017

Optimizations for HAProxy reloads openshift/openshift-docs#6744

Merged

knobunc deleted the bug/bz1471899-change-router-locking branch June 7, 2018 12:39
