High Idle CPU usage (fresh install) #15542
I'm seeing the same behaviour on my machine. I have activated the access log to make sure no one is accessing my Gitea instance and triggering something. On Log Level "Info" I can see a log like this every 10 seconds (:05, :15, :25, ...):
Can you confirm that? Cheers,
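(For reference, the access log mentioned above is switched on in app.ini; the option name below is my recollection of the Gitea logging settings, so treat it as an assumption and check the logging docs.)

```ini
[log]
; Assumption: writes one line per handled HTTP request to a separate access log,
; which is what lets you confirm that nobody is hitting the instance.
ENABLE_ACCESS_LOG = true
```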
I am also noticing a CPU usage that seems high for gitea being idle with no usage at all. On my server, which also hosts samba shares that are used every day and other standard services like chrony, rsyslogd, crond, and firewalld (a Python daemon!), gitea had the highest TIME value after 24 days, by a large amount. I also see the SQL query lines every 10 seconds, but I don't think that is the issue, since that only runs every 10 seconds, and the CPU usage spikes seem to be pretty constant, though it jumps between threads. The parent process sits at 1% or 2% CPU usage according to htop with the refresh set to 1 second. With htop's refresh set much shorter, like 0.1 seconds, it jumps from 0% to 8% back and forth constantly. What may be more telling is the strace output, which shows constant polling with epoll_pwait() on several threads: [pid 944513] 14:39:28.794308 epoll_pwait(3, <unfinished ...>
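(A minimal way to reproduce this kind of trace against a running instance, assuming strace is available on the host and gitea runs as a native process; the flags shown are just one reasonable choice.)

```sh
# Attach to the running gitea process and all of its threads, timestamp each
# syscall, and only show the calls discussed here; Ctrl-C detaches.
strace -f -tt -e trace=epoll_pwait,futex -p "$(pidof gitea)"
```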
There are various scheduled tasks that happen periodically. What would be most helpful is information from pprof (specifically the diagram it provides), as then the CPU usage could be traced throughout the codebase.
big surprise... i guess strace doesn't lie. i just wasted over an hour of my time for that stupid pprof that strace already clearly showed us.
@ddan39 if you had given us the out file instead of the svg we could have looked at what was causing the poll waits. You may find that changing your queue configurations changes the amount of polling. In terms of the ...
Ah, yeah, it was getting late and I was getting a bit frustrated trying to get pprof to even work when I have zero go knowledge. Sorry about responding a bit harshly. I was going to upload more files (well, to be honest, I was trying to upload the svg first, which was denied) but github was only letting me attach certain file types to the comments. I probably should've just zipped them both together... it looks like .zip files are allowed. When I get off work I will attach some more files. I will look into the queue settings, thanks. To be honest, I was surprised to see gitea seemingly polling epoll_wait that fast. With go being all about concurrency I figured it could just do a long blocking call... but again, I'm a newb here.
I got pprof and strace running inside the gitea docker container:
But whenever I try to kill the running gitea process to start it again under pprof or strace, I get booted out of the container. How do I run this inside the docker container?
edit: removed bad info about pprof; it's easy to use it as described below to get a profile - simply add the option to your app.ini [server] section
Gitea has pprof http endpoints built in. See https://docs.gitea.io/en-us/config-cheat-sheet/#server-server
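(For anyone following along, the relevant [server] option appears to be the one below; this is a sketch based on the linked cheat sheet, so double-check the option name and port there.)

```ini
[server]
; Assumption: enables Gitea's built-in pprof HTTP listener on 127.0.0.1:6060.
ENABLE_PPROF = true
```

A CPU profile can then be captured with something like `go tool pprof http://127.0.0.1:6060/debug/pprof/profile?seconds=60`.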
well, shit.
So thanks to Etzelia on Discord I got pprof to run. Here is my output over 10 minutes; this is a fresh rootless docker install, no repo, one user. Docker stats showed around 4-5% CPU usage of my host system (with a 1 sec refresh rate of docker stats):
and a second run:
I attached the pprof output. Loadable with:
@jolheiser Can you do anything with those pprofs?
I mean, these are pretty low-level go runtime issues to do with selects and polling. I've put up a PR that might(?) help if having too many go-routines waiting on selects is to blame. Could you try building #15686 to see if this helps?
Yeah, I fully understand that this might not be an easy fix, nor one that has a high priority. I've built the rootless docker image based on your PR changes @zeripath, but sadly I did not see any change in the idle CPU usage on a fresh install (apk add takes ages inside the docker build container...). Still around 5% after booting it up, configuring it and creating one user. If it helps, I attached another pprof (2x10min) of the docker container running #15686
OK, so I'm really not certain what to do here - we could be chasing a ghost that is fundamentally not fixable, but I think it's likely that the above pprofs and straces aren't really capturing what the real problem is, simply because by measuring they cause wakeups. My suspicion is that the wakeups are caused by the level queues' work loop checking if there is work for it to do - it currently does this every 100ms, and there are more than a few queues that all wake up and check. So just to prove that, set in your app.ini:

```ini
[queue]
TYPE=channel

[queue.issue_indexer]
TYPE=channel
```

(Remove or change any other `[queue.*]` sections accordingly.) Then just check the idle CPU usage again.

Now - the presumption when this code was written was that this is a minimal and insignificant potential issue. As this is the third report, clearly that presumption is incorrect. So what can be done? Well... the majority of the queues are actually using the persistable-queue type, which means that once the level queue is empty it should never get any more work. So any further polling in this case is actually unnecessary - but there are significant concurrency edge cases that make it hard to assert when that further polling can stop. However, a leveldb can only be opened by one process at a time, so... we could realistically check the length of the queue and, if it is 0, block waiting for a push that will give us some work to do. The trouble is getting the concurrency correct here and handling stopping properly. For redis queues, though, I'm not certain what we can do about polling. In both cases 100ms was chosen as a "reasonable" default fallback polling time, rather than doing some more complicated exponential backoff, as a balance between responsiveness and backend load.
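(To make the trade-off concrete, here is a simplified illustrative sketch in Go - not Gitea's actual queue code - contrasting a 100ms polling work loop with a loop that blocks until work is pushed to it.)

```go
package main

import "time"

// pollLoop sketches the polling approach: it wakes up every 100ms to ask the
// backend whether there is work, even when the queue stays empty for hours.
func pollLoop(hasWork func() bool, handle func(), stop <-chan struct{}) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if hasWork() {
				handle()
			}
		case <-stop:
			return
		}
	}
}

// blockLoop sketches the blocking approach: the goroutine sleeps until a push
// arrives on the channel, so an idle queue costs essentially nothing.
func blockLoop(work <-chan func(), stop <-chan struct{}) {
	for {
		select {
		case handle := <-work:
			handle()
		case <-stop:
			return
		}
	}
}

func main() {} // stub so the sketch compiles on its own
```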
Hi @zeripath, I don't have time to look into pprof (this is a whole new topic for me), but setting the queue type to `channel` does help: where the idle CPU usage of Gitea was around 6% before the change, it is now down to around 3%. Should I reset the queue type after testing, or what is the implication of leaving it set to `channel`? Cheers,
Here's another PR that will cause level db queues to simply wait when empty instead of polling, at the cost of another select
@zeripath I have managed to build your branch. After initial setup with one user the load oscillates around 2.5%. I then ran a fresh release version and created a new user, etc., and found the CPU load oscillating around 6% again.
Hey @zeripath I appreciate the help!
I will be running pprofs and posting them later.
OK, so I think it's now worth checking whether the combination https://github.com/zeripath/gitea/tree/prs-15693-15686-15696 helps.
So https://github.com/zeripath/gitea/tree/prs-15693-15686-15696 has a CPU usage of around 2%-3% in my case (without queue set to channel). This is better than the previous PRs :) Do you still need any pprofs from the previous or this PR / queue channels?
top from #15693 (pprof.gitea.samples.cpu.002.pb.gz):
top from queue set to channel + #15693 (pprof.gitea.samples.cpu.001.pb.gz):
top from prs-15693-15686-15696 (pprof.gitea.samples.cpu.003.pb.gz):
top from prs-15693-15686-15696 + queue set to channel (pprof.gitea.samples.cpu.004.pb.gz):
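(The top listings above, whose contents are attached rather than inlined, can be reproduced from the attached profiles with an invocation along these lines; the file name is just one of the ones mentioned above.)

```sh
# Print the flat "top" of a saved CPU profile against the gitea binary.
go tool pprof -top gitea pprof.gitea.samples.cpu.003.pb.gz
```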
OK I've just pushed another little change up to that branch and #15696 that would allow:

```ini
[queue]
WORKERS=0
BOOST_WORKERS=1

[queue.issue_indexer]
WORKERS=0
BOOST_WORKERS=1
```

(Which could still be changed to ...) The point is that this will not have worker go-routines running unless they're actually needed. BTW this ...
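(A rough way to watch the effect on idle go-routine counts, assuming the pprof listener from earlier is enabled; this uses the standard net/http/pprof debug endpoint.)

```sh
# The first line of the goroutine dump reports the total count,
# e.g. "goroutine profile: total 80".
curl -s 'http://127.0.0.1:6060/debug/pprof/goroutine?debug=1' | head -n 1
```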
sorry for the delay, i saw someone else had already posted their profile so figured there was no rush to get mine, but i've attached it anyway, and the pprof top output is further down. this is actually the first time i've gotten back to my home PC since my last comment. i have tried the code from prs-15693-15686-15696; the profile is attached and pprof top looks like:

File: gitea ...

and in case it helps, here is the head of my original profile top output:

File: gitea ...
Running the code from zeripath@b1f6a0c with the 2 queue settings set to WORKERS=0 and BOOST_WORKERS=1, and an otherwise default config:

File: gitea ...
I was hoping that ... I think we (well, mostly @zeripath ;) ) are making some good progress here.

With ... it is around 2%-3%.

With ... it is around 0.5%-1%.

And with ... it is a steady 0%!
i see that too when setting ...
i get near constant 0% cpu usage at idle. i can also see with strace that the constant rapid calling of epoll_pwait() and futex() no longer happens. i only see a group of calls roughly every 10 seconds that are pretty minimal. are there any possible side-effects of using these settings?
So I think we can stop with the pprofs just now. It looks like there are two fundamental issues for your idle CPU usage:
#15696 & #15693 reduce around half of the work of point 1, but if that is still not enough then you will have to set `TYPE=channel`. With the three PRs together, using channel queues - which does have the cost of a potential loss of data on shutdown at times of load, so you should just flush the queues before you shut down - and the below config, you should be able to get gitea down to 30 goroutines when absolutely idle:

```ini
[queue]
TYPE=channel
WORKERS=0
BOOST_WORKERS=1

[queue.issue_indexer]
TYPE=channel
WORKERS=0
BOOST_WORKERS=1
```

I think there may be a way of getting the persistable channel queue to shut down its runner once it's empty, so I will continue to see if that can be improved. Another option is to see if during shutdown we can flush the channel queues to reduce the risk of leaving something in them. I'm not otherwise certain whether I can reduce the number of basal goroutines further, but it looks like that's enough. There's likely some Linux resource setting you can increase that would allow go to avoid futex cycling with more goroutines, but I'm no expert here and don't know.
I've let it run for a bit longer and computed the average idle CPU usage of the docker container, which was around 0.14% (with all three queue settings and all three PRs). That's already a lot better than the 5% we started with. Loss of data would obviously not be good. However, when running the flush command I get this:
This sounds like an error? I've run this as gitea and as root inside the rootless container - both the same. Thanks for your patient and awesome work so far!
I suspect your manager call needs you to set your config option correctly.
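(In other words, point the manager command at the same app.ini the running instance uses. A hedged example, with the config path taken from this issue's description, would be:)

```sh
# Flush all queues of the running instance before shutting it down.
gitea --config /etc/gitea/app.ini manager flush-queues
```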
Indeed, that fixed it. It was quite late yesterday... Anyway, is there any more testing/feedback that you require, or is this the maximum reduction in goroutines for now? (Not complaining, I'm already quite happy that you were able to reduce it this much.)
I've changed the persistable-channel queues to shut down their level dbs once they're empty, or not run them at all if they're empty, which brings the number of baseline goroutines on the three-PR branch with default configuration to 80, and 74 with ... I'll have a think about what we can do about reducing these further.

With the current state of prs-15693-15686-15696 (76e34f05fbcc3d0069e708a440f882cf4da77a01) and app.ini:

```ini
[queue]
; TYPE=channel
WORKERS=0
BOOST_WORKERS=1

[queue.issue_indexer]
; TYPE=channel
WORKERS=0
BOOST_WORKERS=1
```

the starting number of go-routines is down to 64-65 ish. I can think of potentially one more small improvement in the queues infrastructure (just use contexts instead of channels) but it may not be worth it. It may not be possible to reduce the number of goroutines in the persistable channel further, though. I still need to figure out some way of improving polling to reduce the baseline load with the RedisQueue - likely this will be with Redis Signals but I'm not sure. In terms of other baseline go-routines - there may be a few other places where the CancelFunc trick may apply.
Awesome work! I tested the prs one earlier today but I will test the new one tomorrow. |
So I think if you set:

```ini
[queue]
WORKERS=0
BOOST_WORKERS=1
CONN_STR=leveldb://queues/common

[queue.issue_indexer]
WORKERS=0
BOOST_WORKERS=1
CONN_STR=leveldb://queues/common
```

then the persistable channel queues will all share a leveldb instead of opening their own. That should significantly reduce the number of goroutines and, on the current state of that PR, probably drop it to around 29 goroutines at rest.
That's probably the event source. We currently poll the db even if there's no one connected, but I think we can get it to stop when no one is connected.
On 1.14.1, using the below settings ... does reduce the gitea idle CPU to almost 0%. However, when I create a new repository and push (using SSH), the web UI does not show the repository code (it still shows the "Quick Guide"), even though the code is pushed to Git successfully. Changing the workers to 1 fixes this.
The above config is not meant to work on 1.14.1 and requires changes made in 1.15, in particular the PRs I have linked. These changes will almost certainly not be backported to 1.14.x.
I have tested it with 1.14.1 as well, but the CPU spikes are a bit higher. However, if you run this with 1.14.1 you risk data loss and other undefined behavior, if I understood correctly the changes which were made in #15693 (and are not present in 1.14.x).

So with #15693, using:
I had ~26 goroutines and a CPU usage of 0-0.5%, with spikes usually at the same time as the DB spikes. And using:
I had ~57 goroutines and a CPU usage of 0.2-0.7%.
OK, looks like the CONN_STR isn't doing the trick of forcing the use of a common leveldb, otherwise the goroutines should have been around half that. I'll take a look - but it might be that we have to actually specifically tell the queues to use a common db.
The below is the current configuration for #15693 using common queues:

```ini
[queue]
WORKERS=0
BOOST_WORKERS=1

[queue.mail]
DATADIR=queues/common

[queue.notification-service]
DATADIR=queues/common

[queue.pr_patch_checker]
DATADIR=queues/common

[queue.push_update]
DATADIR=queues/common

[queue.repo_stats_update]
DATADIR=queues/common

[queue.task]
DATADIR=queues/common

[queue.issue_indexer]
WORKERS=0
BOOST_WORKERS=1
DATADIR=queues/common
```

That has a baseline goroutine count of 29.

Those on 1.14.1 who want reductions should find that:

```ini
[queue.mail]
DATADIR=queues/common

[queue.notification-service]
DATADIR=queues/common

[queue.pr_patch_checker]
DATADIR=queues/common

[queue.push_update]
DATADIR=queues/common

[queue.repo_stats_update]
DATADIR=queues/common

[queue.task]
DATADIR=queues/common

[queue.issue_indexer]
DATADIR=queues/common
```

should cause significant reductions in baseline goroutine numbers - although to nowhere near as low levels as #15693 allows.

I had intended to make it possible to force a common leveldb connection by setting CONN_STR appropriately, but I suspect handling the people that insist on copying the app.example.ini as their app.ini has caused this to break. I suspect just adding a new option to ...
Yeah, with that configuration for #15693 in docker I got ~32 goroutines and a CPU usage of around 0.05%-0.8%.
Whether the channel-only queue is appropriate for you is a decision about your data safety and how you start and shut down your instance. With #15693 the channel queue will at least attempt to flush itself at time of shutdown, but... there's still no guarantee that every piece of data in the queue is dealt with. If you have an absolutely low load - and if you flush before shutting down - then it should be fine to use channel-only queues, but as I've shown above it's possible to get the baseline goroutines down to 29 using a leveldb backend, which doesn't have this gotcha. It's up to you really - persistable channels probably have to be the default, which is why I've been working to reduce the baseline load there. Other Queue implementations are possible too.
Isn't there a way to reduce the idle load to 0%, possibly giving up some functionality? Background: I'm running it on my extra-slow NAS (an N40L) where I get an idle load of ~3%. This is not acceptable; I'm kicking out any service that raises my electricity bill by sucking CPU permanently. My current workaround is "systemctl { stop | start } gitea". Btw, the performance of Gitea is incredible on this slow NAS! Probably because there is tight object code instead of lame web scripts or java bloat.
I am guessing that you're running 1.14.x and not 1.15.x/main. Please try main. I think, since all the PRs mentioned above (and now explicitly linked to this issue) have been merged, we can actually close this issue, as the baseline load on an idle gitea is now back to almost zero.
Description
I have a high idle CPU usage with my upgraded gitea server, so I tried a fresh deployment and have the same issue.
After a fresh install (even 1 hour after) the CPU% usage is about 5% of my system. One user, no repository.
In my testing VM (also Debian 10, 1 CPU thread only) the idle CPU% usage is about 0.1% (with the exact same docker-compose.yml and configuration).
This happens with the 1.14.1 root or rootless version (haven't tried others).
I know 5% is not that high, but should it really be that much with a fresh install and while idle?
According to htop the CPU is used by /usr/local/bin/gitea -c /etc/gitea/app.ini web. I've set the logging to debug but there are no logs when idle.
Any help is appreciated :)
Screenshots