CPU Utilization Issues #5613

Closed
lifeBCE opened this issue Oct 18, 2018 · 37 comments · Fixed by #6240
Labels
topic/perf Performance

Comments

@lifeBCE

lifeBCE commented Oct 18, 2018

First, great job to all involved!

I am super excited about this project and am close to releasing a new project of my own built on IPFS. My project will encourage users to operate an IPFS node, but I have a concern about a CPU utilization issue I am seeing that could seriously discourage people from running a node.

Version information:

go-ipfs version: 0.4.17-
Repo version: 7
System version: amd64/linux
Golang version: go1.10.3

☝️
$ cat /etc/issue
Ubuntu 18.04.1 LTS \n \l

The same is witnessed on my MacBook Pro (Sierra 10.12.6) running the latest ipfs-desktop (not sure which version of core comes bundled; I am having a difficult time finding the version info).

Type:

Bug

Description:

When IPFS core (go) is launched and running, the CPU utilization is generally fine: it idles around 5% with momentary spikes up to 10-15%. When I launch the web GUI (or the desktop GUI), the CPU utilization jumps to idling at over 100% with spikes above 200%.

This is seen on both Ubuntu and Mac OS X. I have managed to narrow it down on the desktop version to the peers tab. If any other view is accessed, the utilization remains fine. As soon as the peers view is accessed, the CPU jumps.

In both Ubuntu and Mac OS X, if I close out either management interface, the CPU utilization eventually calms down but this takes quite a bit of time as well. I usually just kill the daemon and restart it to recover.
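For reference, a quick way to watch the daemon's CPU while toggling the Peers view (a rough sketch, assuming a Linux host where the daemon process is named ipfs; pidstat comes from the sysstat package):

$ top -b -n 1 -p "$(pgrep -x ipfs)" | tail -n 2
$ pidstat -u -p "$(pgrep -x ipfs)" 5    # sample CPU usage every 5 seconds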

I am submitting this here first as it seems consistent across OS and client, which suggested to me a core issue, but I can file this with each client if it is felt to be an issue on that side. Apologies for not providing more info; I have not had a decent chance to dig into it myself, but I can after next week if needed.

@Stebalien
Member

Is IPFS's CPU usage jumping, or the web browser's? The fact that the webui has an extremely inefficient peers view is, unfortunately, known and, fortunately, fixed, but, unfortunately, not yet merged/released. I've just submitted a PR to use the in-development version (#5614).

@lifeBCE
Author

lifeBCE commented Oct 18, 2018

It is actually the IPFS node itself. Both Chrome/Firefox (for web) and IPFS-Desktop (for desktop) show no signs of stress during the spike from the node. This happens for both local and remote nodes (accessed via a local SSH tunnel).

(screenshot: ipfs_cpu)

@Stebalien
Member

So, it looks like this is due to the geoip database. For each peer we're connected to, we look that peer up in an IPLD-based geoip database. Unfortunately, each of these requests is independent, and they are likely forcing us to independently try to find providers for each object due to ipfs/go-bitswap#16, which is, in turn, connecting us to more peers, which is, in turn, causing us to download more of this graph; rinse, repeat.

I'll see if I can delay finding providers. We shouldn't need that as we're already likely to be connected to peers with the blocks we need.
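A quick way to watch the runaway connection growth described above is to poll the swarm peer count while the Peers page is open (a sketch, assuming the daemon is running locally and the watch utility is available):

$ watch -n 1 'ipfs swarm peers | wc -l'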

@Stebalien
Member

So, this is even worse than I thought. I have the first half of a fix in ipfs/go-bitswap#17. However, not only are we fetching each of these blocks, we're also announcing that we have them. This announcement also forces us to connect to a bunch of peers, leading to this runaway CPU issue.

@lanzafame
Contributor

@olizilla Well, now we know the reason behind my little trick of using the WebUI to boost my peer count 😛

@Stebalien I am in no way suggesting that you don't fix this but I wouldn't mind another way of forcing my peer count up?

@bonekill

The network usage itself also goes crazy...
While loading the peer page on a fresh ipfs node, according to a Wireshark capture:

  • 5.8 million packets
  • 1007 MB of traffic
  • peak packets per second of ~175,000
  • for ipfs to download 808 blocks containing 7.04 MB of data

(graphs: packets per second, packet count)

@Stebalien
Member

I am in no way suggesting that you don't fix this but I wouldn't mind another way of forcing my peer count up?

Personally, I just use ipfs dht findprovs "Qm...." where Qm... is some random CID. If you repeatedly call this, you'll massively increase your peer list.
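A rough sketch of that trick: hashing random data with --only-hash yields a valid but almost certainly unprovided CID, and each lookup walks a different part of the DHT:

$ for i in $(seq 1 10); do ipfs dht findprovs "$(head -c 262144 /dev/urandom | ipfs add -q --only-hash)"; done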

While loading the peer page on a fresh ipfs node, according to a Wireshark capture:

Yeah... that's because it's downloading a bunch of tiny blocks and wasting a bunch of traffic just telling the network about them/finding them. @keks will be working on this issue this quarter.

@markg85
Contributor

markg85 commented Nov 22, 2018

I'll just chime in with a rather useless "me too" comment.
I do that because I've been keeping an eye on IPFS for years now. A couple of years ago I tried running it but was let down by its quite demanding, consistently high CPU usage. I call an average of 10%, with peaks up to 30% while doing absolutely nothing, way too high.

That was a few years ago. So a few days ago I tried it out again. And immediately I noticed it again: high CPU usage. That is just the stock go-ipfs package as it comes on Arch Linux. Nothing special turned on, not a thousand users on it. Nope, just plain and simple sitting idle, eating away at my CPU.

In my mind this project has the potential to really make a difference in how we use the internet, to really make it much more efficient. The more users join, the more stable the whole network becomes; you do not need a monstrous server setup for a highly popular website, because the internet as a whole is that monstrous giant server, with each user contributing a tiny piece. I quite like that idea! But not if it's eating my desktop and server CPU. For a CPU usage graph: https://image.ibb.co/eDPtHA/ipfs-cpuusabe.png (and that is on a quite decent VPS!)

This CPU usage issue is a real showstopper for me at the moment. I wish I could help with profiling and fixing things, but Go isn't quite my language. I would have if it were C++ :)

@Stebalien
Member

Please try running the daemon with --routing=dhtclient. That won't stop bitswap but should help quite a bit.
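For a locally installed daemon, that's simply:

$ ipfs daemon --routing=dhtclient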

@markg85
Contributor

markg85 commented Dec 6, 2018

Please try running the daemon with --routing=dhtclient. That won't stop bitswap but should help quite a bit.

I will give that a shot. I have no problem passing that argument on my local machine.
But on a VPS, things are slightly more complicated. There IPFS is running inside a Docker container, so how do I pass that argument? Preferably without rebuilding it... (rebuilding in this case means cloning go-ipfs and changing the file https://github.com/ipfs/go-ipfs/blob/master/bin/container_daemon)

@Stebalien
Member

You should be able to pass that on the docker command line. That is:

docker run -d --name ipfs_host ... ipfs/go-ipfs:latest daemon --routing=dhtclient

@markg85
Contributor

markg85 commented Dec 11, 2018

You should be able to pass that on the docker command line. That is:

docker run -d --name ipfs_host ... ipfs/go-ipfs:latest daemon --routing=dhtclient

I have now tried that (locally and on a VPS). I think it helped somewhat, even though the CPU usage is still high and now really wonky: https://i.imgur.com/bUdbFv6.png

Also, due to the sheer number of peers it connects to (I'm using the default config), it triggers network intrusion detection. Thus I get notified by my provider that the IPFS IP is possibly making hacking attempts... Right. What is the proper way to limit IPFS to, say, 100 peers or so? That alone will probably also reduce CPU usage significantly.

@Stebalien
Member

To limit the number of connections, use the connection manager: https://github.com/ipfs/go-ipfs/blob/master/docs/config.md#connmgr

To avoid connecting to private IP addresses (what's likely triggering the port scanner), apply the server profile: https://github.com/ipfs/go-ipfs/blob/master/docs/config.md#profiles.
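For example, something along these lines (the specific numbers are just an illustrative cap, not a recommendation; restart the daemon afterwards):

$ ipfs config --json Swarm.ConnMgr.LowWater 50
$ ipfs config --json Swarm.ConnMgr.HighWater 100
$ ipfs config profile apply server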

@markg85
Contributor

markg85 commented Dec 12, 2018

@Stebalien
Thank you for your reply!

I do have more questions though, sorry :)
I applied the server profile. It gave me some diff-like output of the config lines it changed. Afterwards, I restarted IPFS. Now how do I figure out whether the IPFS instance is running with the server profile?
I would expect an "ipfs config profile" command to print what the current profile is, but instead it prints the help text, with no apparent command to get the current profile.

Regarding the connmgr: that is really vaguely worded! "LowWater" and "HighWater"... seriously? Even then, it still doesn't tell me whether those are the maximum number of connections it will open, or whether it will open a thousand and only keep open what is specified.

@Stebalien
Member

Now how do i figure out if the ipfs instance runs in the server profile?

Profiles patch/transform your config itself.
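In other words, there is no stored "current profile" to query. One way to check that the server profile took effect is to look at the address filters it adds:

$ ipfs config Swarm.AddrFilters    # should list private IP ranges after applying the server profile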

Regarding the connmgr.

You're right, that documentation is pretty terrible. Fixed by: #5839. New version: https://github.com/ipfs/go-ipfs/blob/716f69b8f8a3abbaa2fdcacc7827eba00e3470de/docs/config.md#connmgr

@markg85
Contributor

markg85 commented Mar 3, 2019

New release, new screenshot!
I'm now running on 0.4.19
(screenshot: CPU load)

If anything, it got even worse!

I am running it in stock mode though: no --routing=dhtclient, but with the server profile applied (as the docker commands suggest I should do).

Please make this CPU load issue the No. 1 priority. For you folks, the IPFS devs, it's only annoying that you get complaints about it. And for us, the users, it's also annoying: an app that constantly uses much of the CPU (even on high-end CPUs!) is bound to get us into trouble if it's hosted at some provider.

@Stebalien
Member

Performance and resource utilization is a very high priority issue and we're really doing the best we can. This looks like a regression we've also noticed in the gateways, we'll try to get a fix out as soon as possible.

Can you give us a CPU profile? That is, curl 'https://localhost:5001/debug/pprof/profile?seconds=30' > cpu.profile. That will help me verify that we're talking about the same issue.

Also, what kind of load are you putting on this node? Is it doing nothing? Is it fetching content? Acting as a gateway? Serving content over bitswap?

@markg85
Contributor

markg85 commented Mar 10, 2019

Performance and resource utilization is a very high priority issue and we're really doing the best we can. This looks like a regression we've also noticed in the gateways, we'll try to get a fix out as soon as possible.

Can you give us a CPU profile? That is, curl 'https://localhost:5001/debug/pprof/profile?seconds=30' > cpu.profile. That will help me verify that we're talking about the same issue.

Also, what kind of load are you putting on this node? Is it doing nothing? Is it fetching content? Acting as a gateway? Serving content over bitswap?

Hi,

Sorry for the quite late reply.

It's doing nothing at all! Just a plain and simple docker run to let it run. Nothing more.
The command I used:
docker run -d --name ipfs_host -e IPFS_PROFILE=server -v $ipfs_staging:/export -v $ipfs_data:/data/ipfs -p 4001:4001 -p 127.0.0.1:8080:8080 -p 127.0.0.1:5001:5001 ipfs/go-ipfs --routing=dht

This is on a Hetzner VPS instance. If you want, I can give you a VPS instance to play with for a month or so. Yes, I'll pay for it; see it as my little contribution to this project ;)

As of a few days ago, it even decided to take up 100% CPU usage.
(screenshot: CPU load)

I had to kill my VPS just to get it responding again and run the command you asked for.
As for that command: I don't know why you added https to it; it makes no sense for a server (the webui is not on it). I'm not going to send you the cpu.profile output because it's binary. I don't know what is in there; it might as well be my private SSH keys. I know that I can trust this, but I do just want to see the output.

Lastly, I really don't understand why others are apparently not running into this CPU insanity, along with this bug: #5977. That bug gives me freakishly unstable swarm connections, with them mostly not working; sometimes they do.

If you have commands for me to run and help with debugging, I'd be happy to help :) But... no binaries! Right now I'm shutting off IPFS as it's already sitting at between 20 and 30% CPU usage. Please, let's figure this out!

Last note: I'd really consider pulling this last IPFS release (0.4.19). I know it sucks; especially as a fellow developer I know that's about the last thing you'd want to do! But this CPU issue is really getting out of control IMHO, and together with #5977 it just doesn't give a good IPFS experience at all, something nobody involved wants.

@ivilata

ivilata commented Apr 4, 2019

@markg85, I've also been running into this issue since at least mid-February (v0.4.18 I guess, but it also happens with v0.4.19). The daemon quietly but steadily eats nearly all remaining system memory and maxes the CPU at 100%. ipfs cat or ipfs dht findprovs then become quite unreliable until the daemon is restarted (an action which, BTW, takes nearly a minute to complete).

@Stebalien
Member

I had to kill my VPS just to get it responding again and run the command you asked for.
As for that command: I don't know why you added https to it; it makes no sense for a server (the webui is not on it). I'm not going to send you the cpu.profile output because it's binary. I don't know what is in there; it might as well be my private SSH keys. I know that I can trust this, but I do just want to see the output.

The CPU profile is generated by pprof. The https was muscle memory.

If you don't want to share the binary blob, you can generate a list of the top 20 consumers by running:

> go tool pprof /path/to/your/ipfs/binary /path/to/the/profile

At the prompt, type top20 to list the top 20 functions. Even better, you can run svg to spit out a call graph.
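A non-interactive equivalent, assuming your binary is named ipfs and the profile is the cpu.profile from the curl command above (the -svg output needs graphviz installed):

$ go tool pprof -top ipfs cpu.profile | head -n 25
$ go tool pprof -svg ipfs cpu.profile > callgraph.svg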

Lastly, I really don't understand why others are apparently not running into this CPU insanity, along with this bug: #5977. That bug gives me freakishly unstable swarm connections, with them mostly not working; sometimes they do.

I've seen unstable swarm connections (issues with the connection manager) but I have absolutely no idea why your CPU pegged to 100%. My best guess is that you ran out of memory.

@markg85
Contributor

markg85 commented Apr 4, 2019

I've seen unstable swarm connections (issues with the connection manager) but I have absolutely no idea why your CPU pegged to 100%. My best guess is that you ran out of memory.

It has 4 GB available... How much more does IPFS need?
Again, I can give you access to the server so you can profile whatever you want to see or know. It's just a plain simple account on Hetzner, the CX21 one to be precise.

If memory were an issue, then there is a massive leak in IPFS somewhere.

@ivilata

ivilata commented Apr 4, 2019

Just as a side comment, I switched off QUIC support and swarm addresses in case it had something to do with quic-go/quic-go#1811, but the behaviour stayed the same, so it may indeed be related to running out of memory.

The VPS the daemon is running on doesn't have a lot of memory, but the 100% CPU usage issue didn't happen until recently and it did fine until then.

@Stebalien
Member

@ivilata could I get a CPU profile (https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning)?

@markg85 if that works for you, my SSH keys are https://github.com/Stebalien.keys. I'm also happy receiving the CPU profile by email (encrypted if you'd like).

@Kubuxu
Member

Kubuxu commented Apr 4, 2019

You can also use: https://github.com/ipfs/go-ipfs/blob/master/bin/collect-profiles.sh to collect profiles.

@lifeBCE
Author

lifeBCE commented Apr 7, 2019

Just to pile it on here... :-)

I don't think my issue has anything to do with a lack of RAM, as I have 32GB on my server with 25GB+ available, but the CPU and system load climb steadily over time to the point where all data transfer is halted. CPU jumps to over 100% utilization and no clients can access anything.

A quick restart of the ipfs daemon brings it back to being able to transfer data again. I have to do this numerous times a day to the point where I am thinking of creating a crontab entry to just reboot the daemon every hour.
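If it comes to that, here is a minimal sketch of such a stopgap, assuming go-ipfs runs under a systemd unit called ipfs (the unit name is an assumption about this setup):

# /etc/cron.d/ipfs-restart: blunt hourly daemon restart until the underlying issue is fixed
0 * * * * root systemctl restart ipfs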

I do have upwards of 500GB worth of blocks being shared, but I'm not sure whether it has anything to do with that, since after a restart things work fine and the size of the data store has not changed. My application also does an hourly addition of content synced from other sources. This job takes about 10-15 minutes to run, during which time data access is also halted or very slow. It seems like running ipfs add with 500GB+ of data already added causes some performance issues.

I don't know yet if this hourly addition of content is what triggers the non-responsiveness causing the need to reboot or if it is just an additional issue which the daemon eventually recovers from. I am still attempting to isolate the issues.

FYI, I do have numerous other services running on the same server and during these times where IPFS is non-responsive, all other services are super responsive. The server has plenty of resources and ability to execute on other requests, just not IPFS.

@ivilata

ivilata commented Apr 10, 2019

@ivilata could I get a CPU profile (https://github.com/ipfs/go-ipfs/blob/master/docs/debug-guide.md#beginning)?

Sent to @Stebalien by email. The script linked by Kubuxu failed with:

Fetching profile from http://localhost:5001/debug/pprof/goroutine
parsing profile: unrecognized profile format

@Kubuxu
Member

Kubuxu commented Apr 10, 2019

It might be an old Golang version (which Go version are you running?), or profile incompatibilities between Golang versions.

@Stebalien
Member

@markg85 Looks like the issue is with storing provider records. TL;DR, your node's peer ID is probably "close" to some very popular content so everyone is telling you about that content.

The actual issue here is garbage collection. It looks like the process of walking through and removing expired provider records is eating 50% of your CPU.

Issue filed as: libp2p/go-libp2p-kad-dht#316

@markg85
Contributor

markg85 commented Apr 10, 2019

@Stebalien that would be a response to @ivilata :)

Also, CPU usage is always high with IPFS. That is on a low-end crappy machine, but also on a high-end multi-core powerhouse. You just notice it way less since it's often one core that is being used the most, which you hardly notice if you have 16. That doesn't make the issue any smaller, it merely "masks" it.

This is the case for me when hosting it locally (lots of cores) and on a remote VPS (just 2 cores).

The rationale of being close to a popular source would be very troublesome, as IPFS is most certainly a niche product at this point in time. So if I'm close to a popular source (both locally in The Netherlands and on the VPS in Germany), then there is a whole big issue lurking right around the corner for when IPFS does become popular.

You also reference a DHT issue, whereas earlier (months ago) it was suggested to use the DHT client mode as it could lighten the CPU stress. And it does (or did) reduce CPU load somewhat compared to not using it. But it's just always high.

Also, sorry for not yet sending the information you requested. I will do that later today.

@Stebalien
Member

@Stebalien that would be a response to @ivilata :)

(oops)

The rationale of being close to a popular source would be very troublesome, as IPFS is most certainly a niche product at this point in time. So if I'm close to a popular source (both locally in The Netherlands and on the VPS in Germany), then there is a whole big issue lurking right around the corner for when IPFS does become popular.

By "close to" I mean in terms of the DHT keyspace, not physical space. His peer ID is such that it sits near the keys of some very popular content, so everyone announces that content to his node.

And yes, this is an issue; that's why I filed one.


Note: I run go-ipfs on my laptop all the time and barely notice it (albeit with the --routing=dhtclient option set).

@ivilata

ivilata commented Apr 12, 2019

[…] your node's peer ID is probably "close" to some very popular content so everyone is telling you about that content.

Now that's (literally) quite unfortunate. I hope that the issue you opened gets fixed fast… In the worst case I guess I can just replace keys…

Thanks!

@Stebalien
Member

@ivan386 for now, I recommend running go-ipfs with the --routing=dhtclient option set.

@ivan386
Contributor

ivan386 commented Apr 12, 2019

@Stebalien ok

@ivilata

ivilata commented Apr 19, 2019

@ivan386 for now, I recommend running go-ipfs with the --routing=dhtclient option set.

@Stebalien I guess this was addressed to me… 😛 I upgraded the daemon to v0.4.20 but CPU usage eventually rose to 100%. Then I restarted the daemon with --routing=dhtclient but it has eventually reached 100% CPU again.

Do you suggest anything else I could test?

Stebalien added a commit that referenced this issue Apr 20, 2019
* fixes #5613 (comment)
* fixes some relay perf issues
* improves datastore query performance

License: MIT
Signed-off-by: Steven Allen <steven@stebalien.com>
@Stebalien
Member

@ivilata, could you try the latest master?

Everyone else, please open new issues if you're still noticing high CPU usage in the current master. This issue has several separate bug reports that are becoming hard to untangle.


WRT the original issue (opening the peers page leading to a bunch of CPU usage), I believe we've mostly fixed the issue:

  1. We've reduced the dialing in the DHT.
  2. We've added a delay before searching for providers in bitswap.
  3. We've reduced the parallel provide workers in the DHT.

Altogether, this has significantly reduced the issue.

@ivilata

ivilata commented Jun 3, 2019

@Stebalien: After some hours with 0.4.20, CPU usage went back to 15%-30% and it's stayed like that. I upgraded to 0.4.21 and CPU usage continued to stay in the same range (though IPv4 traffic doubled and IPv6 was cut to a quarter, but that's another story).

Thanks a lot again for taking care of this! 😄

@Stebalien
Member

Thanks for the report! (although I'm not sure what would have caused the IP traffic changes)
