Discussion: ResourceManager defaults when active by default #9322
@ajnavarro : I have spent some time looking at this and will type up some thoughts once I get out of meetings this morning.
The goal of this PR is to show a set of resource manager defaults that is easier to reason about. This is an attempt to address #9322. The basic idea is:

1. Use these inputs:
   - maxMemory: can be set by the user, or defaults to 1/8th of the system memory
   - maxFD: can be set by the user, or defaults to 1/2 of the system limit
   - maxConns: can be set as the connection manager high water mark, or defaults to infinity
2. Only set limits at the system and transient scope, and even there, mostly just focus on memory, FD, and inbound connections. Ignore outbound connections and stream limits.
3. Apply any limits that libp2p has for its protocols/services.

This PR is not intended to be merged as is. It's not complete, undoubtedly has syntax errors, I haven't run tests, etc. It was done as a starting point to communicate specifically how I think we can simplify the default story.
@ajnavarro : I was finding it easiest to convey my ideas in code. Here's a PR for discussion: #9351 Basically I imagine 3 options for users:
I'm also good if we want to simplify further.
I think the cases where you're seeing limits == 0 are cases where that value won't be consulted. For example, take "libp2p.autonat"'s Conn limit of 0: that makes sense because that scope won't be consulted for connection limiting.
We are getting these ideas actualized in #9338
This PR adds several new features to make ResourceManager easier to use:

- Resource manager logs for exceeded resources are now at ERROR level instead of warning.
- The "resources exceeded" error now shows what kind of limit was reached and in which scope.
- When limits stop being exceeded, we print a message telling the user that limits are no longer exceeded.
- Added the `swarm limit all` command to show all set limits in the same format as `swarm stats all`.
- Added the `min-used-limit-perc` option to `swarm stats all` to only show stats that are above a specific percentage.
- Greatly simplified the default values.
- **Enable ResourceManager by default.**

Output example:

```
2022-11-09T10:51:40.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:51:50.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 483095 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:51:50.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:00.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 455294 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:00.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:10.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 471384 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:10.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:20.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 8 times with error "peer:12D3KooWKqcaBtcmZKLKCCoDPBuA6AXGJMNrLQUPPMsA5Q6D1eG6: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 192 times with error "peer:12D3KooWPjetWPGQUih9LZTGHdyAM9fKaXtUxDyBhA93E3JAWCXj: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 469746 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:20.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:30.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 484137 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:30.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 29 times with error "peer:12D3KooWPjetWPGQUih9LZTGHdyAM9fKaXtUxDyBhA93E3JAWCXj: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:30.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:40.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 468843 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:40.566+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:52:50.566+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 366638 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:52:50.566+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:00.566+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 405526 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:00.566+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 107 times with error "peer:12D3KooWQZQCwevTDGhkE9iGYk5sBzWRDUSX68oyrcfM9tXyrs2Q: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:00.566+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:10.566+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 336923 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:10.566+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:20.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:55 Resource limits were exceeded 71 times with error "transient: cannot reserve inbound stream: resource limit exceeded".
2022-11-09T10:53:20.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:59 Consider inspecting logs and raising the resource manager limits. Documentation: https://github.com/ipfs/kubo/blob/master/docs/config.md#swarmresourcemgr
2022-11-09T10:53:30.565+0100 ERROR resourcemanager libp2p/rcmgr_logging.go:64 Resrouce limits are no longer being exceeded.
```

## Validation tests

- The accelerated DHT client runs with no errors when ResourceManager is active. No problems were observed.
- Ran an attack with 200 connections and 1M streams using the yamux protocol. The node was usable during the attack. With ResourceManager deactivated, the node was killed by the OS because of the amount of memory consumed.
  - Actions performed while the attack was active:
    - Add files
    - Force a reprovide
    - Use the gateway to resolve an IPNS address.

It closes #9001
It closes #9351
It closes #9322
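As a rough illustration of what a `min-used-limit-perc` style filter does, here is a minimal Go sketch. It is not the Kubo implementation: the `scopeStat` type, the `filterByUsage` function, the integer percentage arithmetic, and the choice to treat "above" as "at or above" are all assumptions.

```go
package main

import "fmt"

// scopeStat is a hypothetical pairing of a resource's current usage with its limit.
type scopeStat struct {
	Name  string
	Used  int
	Limit int
}

// filterByUsage keeps the stats whose usage is at least minPerc percent of the limit.
// Entries with a non-positive limit are skipped to avoid dividing by zero.
func filterByUsage(stats []scopeStat, minPerc int) []scopeStat {
	var out []scopeStat
	for _, s := range stats {
		if s.Limit <= 0 {
			continue
		}
		if s.Used*100/s.Limit >= minPerc {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	stats := []scopeStat{
		{Name: "System.StreamsInbound", Used: 950, Limit: 1000},
		{Name: "System.Conns", Used: 10, Limit: 1000},
	}
	// Only entries at or above 90% of their limit are printed.
	for _, s := range filterByUsage(stats, 90) {
		fmt.Printf("%s: %d/%d\n", s.Name, s.Used, s.Limit)
	}
}
```

Filtering this way keeps the output focused on the scopes that are close to triggering "resource limit exceeded" errors like the ones in the log above.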
This is the default resource manager configuration on a machine with 32 GB of memory and 65536 file descriptors:
Default ResourceManager config
Right now, auto-scaling functionality is getting half of the file descriptors and 1/8 of the memory.
On the Kubo side, we modify several specific parameters:

- `SystemBaseLimit.ConnsOutbound`: set to 65536 if the default value is lower. This allows the accelerated DHT to load its routing table.
- `SystemBaseLimit.FD`: set to 4096 if the default value is lower than that.
- If `ConnMgr.Type` is `basic`, we set some extra parameters. We use `ConnMgr.HighWater` (900 by default) as a base to configure the following parameters if `System.ConnsInbound` is smaller than `2*HighWater`.

All the following values are set using a function that takes the provided `HighWater` with a multiplier, counts the number of bits needed to represent that number, and shifts 1 left by that many bits. For example, 0 needs 0 bits to represent it, so the output is 1. 10 needs 4 bits, so the output is 1 << 4 = 16, and so on (do not ask me why we are doing this).

- `System.ConnsInbound`: 2*HighWater
- `System.ConnsOutbound`: 2*HighWater
- `System.Conns`: 4*HighWater
- `System.StreamsInbound`: 16*HighWater
- `System.StreamsOutbound`: 64*HighWater
- `System.Streams`: 64*HighWater
- `System.FD`: 2*HighWater
- `ServiceDefault.StreamsInbound`: 8*HighWater
- `ServiceDefault.StreamsOutbound`: 32*HighWater
- `ServiceDefault.Streams`: 32*HighWater
- `ProtocolDefault.StreamsInbound`: 8*HighWater
- `ProtocolDefault.StreamsOutbound`: 32*HighWater
- `ProtocolDefault.Streams`: 32*HighWater
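The bit-counting conversion described above can be sketched in Go as follows. The function name `logScale` is an assumption for illustration, and so is applying it to the already multiplied value (for example `2*HighWater`); the sketch only reproduces the arithmetic from the description.

```go
package main

import (
	"fmt"
	"math/bits"
)

// logScale returns 1 shifted left by the number of bits needed to represent val.
// The name is hypothetical; the behavior follows the description in this issue.
func logScale(val int) int {
	return 1 << bits.Len(uint(val))
}

func main() {
	// 0 and 10 are the examples from the text; 900 is the default HighWater
	// and 1800 is 2*HighWater.
	for _, v := range []int{0, 10, 900, 1800} {
		fmt.Printf("logScale(%d) = %d\n", v, logScale(v))
	}
}
```

In effect, the function rounds its input up to the next power of two strictly greater than it, so with the default `HighWater` of 900 an input of 1800 becomes 2048.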
These values will be the default ones when we activate ResourceManager by default. Right now, ResourceManager is not managing resources but mostly limiting them, throwing errors internally when limits are reached.
Also, as you can see in the default configuration, some values are set to 0. It is still not clear to me whether that value means no limit or limit == 0. Some conversation about that here: https://filecoinproject.slack.com/archives/C03FFEVK30F/p1664184608359269
Related issue: #8761
CC: @guseggert , @Jorropo , @lidel , @BigLep , @galargh WDYT about these defaults? Do they look good to you? Thx.