Reduce the rate of route recalculation to reduce CPU load #111

Merged (1 commit) on Oct 11, 2019

bboreham (Contributor) commented Aug 29, 2019

Fixes #105
This is a competing solution to #106.

Introduce a timer to defer recalculation by up to 100ms. This should help in situations where changes are happening so rapidly that recalculation was effectively continuous.

We should probably shut down the ticker when no requests are pending, but this is a first step to see how it behaves.

murali-reddy (Contributor) commented Aug 30, 2019

For reference, here are goroutine dumps from various nodes in a 175-node cluster running the combined patches #110 and #111:

https://gist.github.com/murali-reddy/d60850ac2ffa4080b5cbd0ce86f51bde

Connections between mesh peers across the cluster are stuck in the retrying/pending state.

Snippet of logs related to connection processing between peers 172.20.106.10 <-> 172.20.126.119:

on 172.20.106.10

INFO: 2019/08/30 06:14:53.239099 ->[172.20.126.119:6783|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection shutting down due to error: write tcp4 172.20.106.10:46813->172.20.126.119:6783: write: connection reset by peer
INFO: 2019/08/30 06:15:06.716285 ->[172.20.126.119:6783|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection deleted
INFO: 2019/08/30 06:15:08.188144 ->[172.20.126.119:56475|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection added
INFO: 2019/08/30 06:15:08.817049 ->[172.20.126.119:56475|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection shutting down due to error: write tcp4 172.20.106.10:6783->172.20.126.119:56475: write: connection reset by peer
INFO: 2019/08/30 06:15:53.175466 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] using sleeve
INFO: 2019/08/30 06:15:53.175757 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/08/30 06:16:08.722177 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] using sleeve
INFO: 2019/08/30 06:16:08.722250 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/08/30 06:16:21.112587 ->[172.20.126.119:47697] connection accepted
INFO: 2019/08/30 06:16:21.113292 ->[172.20.126.119:47697|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:16:21.113413 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] using fastdp
INFO: 2019/08/30 06:16:25.178730 ->[172.20.126.119:42787|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal) added to ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)
INFO: 2019/08/30 06:17:29.202348 ->[172.20.126.119:42829|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal) added to ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)
INFO: 2019/08/30 06:17:50.687064 ->[172.20.126.119:56475|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection deleted
INFO: 2019/08/30 06:17:51.262805 ->[172.20.126.119:6783] attempting connection
INFO: 2019/08/30 06:17:51.472519 ->[172.20.126.119:6783|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:17:51.472612 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] using fastdp
INFO: 2019/08/30 06:18:10.629911 ->[172.20.126.119:50127] connection accepted
INFO: 2019/08/30 06:18:10.851159 ->[172.20.126.119:50127|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:18:10.851387 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] using fastdp
INFO: 2019/08/30 06:19:22.212912 ->[172.20.126.119:47697|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection added
INFO: 2019/08/30 06:19:23.179128 ->[172.20.126.119:47697|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection shutting down due to error: write tcp4 172.20.106.10:6783->172.20.126.119:47697: write: connection reset by peer
INFO: 2019/08/30 06:20:23.177227 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/08/30 06:20:38.093480 ->[172.20.126.119:32783] connection accepted
INFO: 2019/08/30 06:20:38.109158 ->[172.20.126.119:32783|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:20:38.109252 overlay_switch ->[7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)] using fastdp
INFO: 2019/08/30 06:21:20.190622 ->[172.20.126.119:47697|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection deleted
INFO: 2019/08/30 06:21:20.678640 ->[172.20.126.119:6783|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection added
INFO: 2019/08/30 06:21:21.676987 ->[172.20.126.119:6783|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection shutting down due to error: write tcp4 172.20.106.10:59989->172.20.126.119:6783: write: connection reset by peer
INFO: 2019/08/30 06:21:40.178522 ->[172.20.126.119:50127|7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to 7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal) added to ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)

on 172.20.126.119

INFO: 2019/08/30 06:16:00.094880 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection shutting down due to error: read tcp4 172.20.126.119:42829->172.20.106.10:6783: i/o timeout
INFO: 2019/08/30 06:16:00.095200 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/08/30 06:16:20.579041 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection deleted
INFO: 2019/08/30 06:16:21.108108 ->[172.20.106.10:6783] attempting connection
INFO: 2019/08/30 06:16:21.444311 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:16:21.444399 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] using fastdp
INFO: 2019/08/30 06:16:38.593541 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection added
INFO: 2019/08/30 06:17:39.101056 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/08/30 06:17:39.101226 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection shutting down due to error: no working forwarders to ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)
INFO: 2019/08/30 06:17:51.259388 ->[172.20.106.10:59989] connection accepted
INFO: 2019/08/30 06:17:51.260205 ->[172.20.106.10:59989|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:17:51.260310 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] using fastdp
INFO: 2019/08/30 06:18:10.083534 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection deleted
INFO: 2019/08/30 06:18:10.624555 ->[172.20.106.10:6783] attempting connection
INFO: 2019/08/30 06:18:10.936664 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:18:10.936756 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] using fastdp
INFO: 2019/08/30 06:18:26.083601 ->[172.20.106.10:59989|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection added
INFO: 2019/08/30 06:18:47.583999 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection shutting down due to error: Multiple connections to ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal) added to 7a:7d:86:1b:1b:5d(ip-172-20-126-119.us-west-2.compute.internal)
INFO: 2019/08/30 06:19:27.084839 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/08/30 06:19:27.084933 ->[172.20.106.10:59989|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection shutting down due to error: no working forwarders to ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)
INFO: 2019/08/30 06:20:37.083296 ->[172.20.106.10:59989|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection deleted
INFO: 2019/08/30 06:20:38.087617 ->[172.20.106.10:6783] attempting connection
INFO: 2019/08/30 06:20:38.134591 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:20:38.134823 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] using fastdp
INFO: 2019/08/30 06:21:38.082668 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection added
INFO: 2019/08/30 06:22:39.099695 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] using sleeve
INFO: 2019/08/30 06:22:39.099845 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection shutting down due to error: read tcp4 172.20.126.119:32783->172.20.106.10:6783: i/o timeout
INFO: 2019/08/30 06:22:39.099995 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] sleeve timed out waiting for UDP heartbeat
INFO: 2019/08/30 06:24:07.079211 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection deleted
INFO: 2019/08/30 06:24:07.770630 ->[172.20.106.10:6783] attempting connection
INFO: 2019/08/30 06:24:08.570231 ->[172.20.106.10:6783|ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)]: connection ready; using protocol version 2
INFO: 2019/08/30 06:24:08.570351 overlay_switch ->[ce:82:96:ce:3d:60(ip-172-20-106-10.us-west-2.compute.internal)] using fastdp

Introduce a timer to defer recalculation by up to 100ms.
This should help in situations where changes are happening very
rapidly so the recalculation was continuous.
bboreham force-pushed the limit-route-calculation-with-ticker branch from 42dabcd to d743780 on October 11, 2019 09:33

murali-reddy (Contributor)

LGTM

bboreham merged commit 8889a80 into master on Oct 11, 2019
bboreham deleted the limit-route-calculation-with-ticker branch on October 11, 2019 11:57
Development

Successfully merging this pull request may close these issues.

rate-limit routes calculation done when a gossip topology update is received