
Goroutine leak leads to OOM #6166

Closed · requilence opened this issue Apr 2, 2019 · 18 comments

Labels: kind/bug (A bug in existing code, including security flaws), topic/perf (Performance)

Comments

@requilence (Contributor) commented Apr 2, 2019

Version information:

Tried both 0.4.19 and the latest master:

go-ipfs version: 0.4.20-dev-fd15c62
Repo version: 7
System version: amd64/darwin
Golang version: go1.11.4

Type:

bug

Description:

I created a fresh repo this morning. It was working fine for some time, but now every time I run ipfs daemon I get a huge goroutine leak that leads to OOM within a few minutes. I set HighWater = 60 and LowWater = 30 to make sure it doesn't depend on swarm size.
https://gist.github.com/requilence/8f81663a95bec7a4083e2600ff24aeda

I had the same problem a few days ago (I recreated the repo afterwards).

It is a really huge list to check manually one by one. Maybe someone has an idea where this could be coming from?
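
One way to make such a list manageable is to group the full dump by top stack frame, which is roughly what stack-grouping tools do. A minimal sketch, assuming the dump was saved to a file named ipfs.stacks; the parsing of the pprof debug=2 text format here is approximate:

    package main

    import (
        "fmt"
        "io/ioutil"
        "sort"
        "strings"
    )

    // Crude grouping of a full goroutine dump (goroutine?debug=2):
    // count goroutines by their top stack frame to see what dominates.
    func main() {
        data, err := ioutil.ReadFile("ipfs.stacks")
        if err != nil {
            panic(err)
        }
        counts := map[string]int{}
        for _, block := range strings.Split(string(data), "\n\n") {
            lines := strings.Split(strings.TrimSpace(block), "\n")
            if len(lines) < 2 || !strings.HasPrefix(lines[0], "goroutine ") {
                continue
            }
            frame := lines[1]
            if i := strings.LastIndex(frame, "("); i > 0 {
                frame = frame[:i] // drop the argument list, keep the function name
            }
            counts[frame]++
        }
        type group struct {
            frame string
            n     int
        }
        var groups []group
        for f, n := range counts {
            groups = append(groups, group{f, n})
        }
        sort.Slice(groups, func(i, j int) bool { return groups[i].n > groups[j].n })
        for _, g := range groups {
            fmt.Printf("%6d  %s\n", g.n, g.frame)
        }
    }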

@requilence (Contributor, Author) commented Apr 3, 2019

I have more details to share. I added debug logging at this line in go-bitswap's peermanager:
https://github.com/ipfs/go-bitswap/blob/85e3f43f0b3b6859434b16a59c36bae6abf5d29e/peermanager/peermanager.go#L131

After 2 minutes of uptime I see:
PeerManager.getOrCreate(QmRnTcjn29vbepLtQoUJdS8cYiNYUnMSrfTsTCJZUaPFRJ) times = 3, len(pm.peerQueues) = 8299, len(uniquePeersMap) = 13437

I count unique peers and per-peer call counts this way:

    // counter state (declarations not shown above): per-peer call counts guarded by a mutex
    var uniquePeersMap = make(map[peer.ID]int)
    var uniquePeersMapMutex sync.Mutex

    uniquePeersMapMutex.Lock()
    times := uniquePeersMap[p] + 1
    uniquePeersMap[p] = times
    uniquePeersMapMutex.Unlock()

Please note that I have HighWater = 60 and LowWater = 30. Despite this, it connects to 8,299 peers.

requilence changed the title from "Goroutine leak" to "Goroutine leak leads to OOM" on Apr 3, 2019
@Stebalien (Member):

@requilence could you create a dump as described here: https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning? We have a tool called stackparse for exactly this.
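
For reference, the steps in that guide amount to pulling the pprof endpoints from the daemon's HTTP API. A minimal sketch in Go that saves the full goroutine dump, assuming the default API address 127.0.0.1:5001 (debug=1 instead of debug=2 yields the smaller, grouped form):

    package main

    import (
        "io"
        "net/http"
        "os"
    )

    func main() {
        // debug=2 returns every goroutine with its full stack trace
        resp, err := http.Get("http://127.0.0.1:5001/debug/pprof/goroutine?debug=2")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        out, err := os.Create("ipfs.stacks")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        // stream to disk; the dump can be tens of megabytes on a leaking node
        if _, err := io.Copy(out, resp.Body); err != nil {
            panic(err)
        }
    }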

Stebalien added the kind/bug label on Apr 3, 2019
@requilence (Contributor, Author) commented Apr 3, 2019

@Stebalien thanks. It was challenging to capture all of them before the OOM, as it gets worse quickly and eats 3 GB within a minute :-)
0.4.19.tar.gz

0.4.20@74d07eff35965a3f635d03aedaa43561c73679e2:
0.4.20.tar.gz

I have also added ipfs.stacks_grouped (captured with goroutine?debug=1) because the full stack dump is 64 MB.

Stebalien mentioned this issue on Apr 3, 2019
@Stebalien (Member):

Could you post your config, minus your private keys? It looks like you're running a relay, which would explain all the peers.

Note: the connection manager tries to keep the number of connections within the target range, but it doesn't stop new connections from being created. That's what's killing your CPU (constantly creating and removing connections). We definitely need better back-pressure; this looks like a bit of a runaway process.
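
To illustrate the back-pressure point, here is a toy model (not the real go-libp2p connection manager; peer IDs are plain strings for brevity): trimming only reacts once HighWater is exceeded, so new connections are never refused and the node just churns.

    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    // Toy high/low-water manager: it trims back down when there are too many
    // connections, but nothing here ever refuses an incoming connection.
    type connMgr struct {
        lowWater, highWater int
        conns               map[string]time.Time // peer -> connected-at
        trims               int
    }

    func (c *connMgr) add(p string) {
        c.conns[p] = time.Now()
        if len(c.conns) > c.highWater {
            c.trim()
        }
    }

    func (c *connMgr) trim() {
        c.trims++
        // close the oldest connections until we are back at lowWater
        peers := make([]string, 0, len(c.conns))
        for p := range c.conns {
            peers = append(peers, p)
        }
        sort.Slice(peers, func(i, j int) bool {
            return c.conns[peers[i]].Before(c.conns[peers[j]])
        })
        for _, p := range peers[:len(peers)-c.lowWater] {
            delete(c.conns, p) // the real manager closes the network connection here
        }
    }

    func main() {
        cm := &connMgr{lowWater: 30, highWater: 60, conns: map[string]time.Time{}}
        for i := 0; i < 10000; i++ {
            cm.add(fmt.Sprintf("peer-%d", i)) // every dial is accepted, then churned out
        }
        fmt.Printf("connections kept: %d, trims performed: %d\n", len(cm.conns), cm.trims)
    }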

Stebalien added the topic/perf label on Apr 4, 2019
@requilence (Contributor, Author) commented Apr 4, 2019

@Stebalien
You are right, I have EnableRelayHop = true and EnableAutoRelay = true
https://gist.github.com/requilence/0d713de5a8e52d666830b696a10b6264

> That's what's killing your CPU

Actually, the main problem is that it eats 3 GB of RAM while the heap profile only shows about 500 MB. As far as I know, a goroutine is pretty cheap (2 KB of memory), so 200k goroutines should take around 390 MB. Where could the rest be coming from?

@Stebalien (Member):

> You are right, I have EnableRelayHop = true

EnableAutoRelay is fine, it's EnableRelayHop that's causing everyone to use you as a relay.

> Actually, the main problem is that it eats 3 GB of RAM while the heap profile only shows about 500 MB. As far as I know, a goroutine is pretty cheap (2 KB of memory), so 200k goroutines should take around 390 MB. Where could the rest be coming from?

It could be allocation velocity (#5530). Basically, we're allocating and deallocating really fast, so Go reserves a bunch of memory it thinks it might need. That's my best guess.
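
One way to sanity-check that guess on a live process is to compare the live heap against what the runtime has actually reserved from the OS. A small standard-library-only sketch (nothing go-ipfs specific):

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        mib := func(b uint64) uint64 { return b >> 20 }
        fmt.Println("goroutines:", runtime.NumGoroutine())
        fmt.Println("HeapAlloc: ", mib(m.HeapAlloc), "MiB  (live heap, roughly what the heap profile shows)")
        fmt.Println("HeapIdle:  ", mib(m.HeapIdle), "MiB  (heap reserved from the OS but currently unused)")
        fmt.Println("StackSys:  ", mib(m.StackSys), "MiB  (memory used for goroutine stacks)")
        fmt.Println("Sys:       ", mib(m.Sys), "MiB  (total obtained from the OS; closer to what RSS reports)")
    }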

@requilence (Contributor, Author):

> EnableRelayHop that's causing everyone to use you as a relay.

That was intentional. I guess that after the EnableAutoRelay option was introduced, demand for relays increased dramatically, but the supply of relays is still very thin, and this imbalance is the core reason.

@Stebalien (Member):

Likely, yes. Basically, this is a combination of two issues:

  1. You're a relay, so many nodes are trying to use you to talk to other peers.
  2. You have a very low connection limit, so you're rapidly killing these connections.

Ideally, the connection manager and the relay would actually talk to each other, and the relay would stop accepting new connections at some point... (libp2p/go-libp2p-circuit#65).

@Stebalien (Member):

@requilence, has disabling relay helped?

@vyzo (Contributor) commented Apr 6, 2019

If you want to enable relay hop, you will need to set limits in the connection manager.
Otherwise you will quickly be inundated with connections (our relays currently have 40k-50k active connections), which will lead to OOMs.
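
For reference, those limits live under Swarm.ConnMgr in the ipfs config, next to the relay options; a sketch of the relevant excerpt (the values are purely illustrative, not a recommendation):

    "Swarm": {
        "EnableRelayHop": true,
        "ConnMgr": {
            "Type": "basic",
            "LowWater": 2000,
            "HighWater": 2500,
            "GracePeriod": "20s"
        }
    }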

@vyzo (Contributor) commented Apr 8, 2019

See also libp2p/go-libp2p-circuit#69
We've identified the biggest culprit in relay memory usage, and this should make it much better.

@requilence (Contributor, Author):

@Stebalien disabling relay doesn't help, probably because I have already advertised my peer as a relay through the DHT and the record needs some time to expire.

@requilence (Contributor, Author):

@vyzo sounds cool, I will try this patch on the leaking setup and come back here with the results.

@vyzo (Contributor) commented Apr 9, 2019

We have identified the culprit behind the goroutine buildup: the identify protocol. There is a series of patches that should fix the issues:

@Stebalien (Member):

@requilence could you try the latest master?

@leerspace (Contributor):

I think I'm hitting an issue similar to this one, where at some point connection counts start climbing rapidly past the default HighWater threshold, but I don't have the exact same configuration: while I have EnableAutoRelay = true, I have EnableRelayHop = false, and I also have QUIC enabled.

Should I create a separate issue, or would it be worth uploading the debug logs (e.g., heap dump, stacks, config, ipfs swarm peers snapshots, etc.) here?

@Stebalien (Member):

@leerspace please file a new issue. Also, try disabling the DHT with --routing=dhtclient (your node may now be dialable where it wasn't before).

@Stebalien (Member):

I'm going to close this issue as "solved" for now. If that's not the case, please yell and I'll reopen it.
