[BUG] Memory Leak in EventPublisher process #61565
Comments
Does the number of open file descriptors appear to increase too? I'm wondering if this is related to #61521 |
Here is a patch that will address the issue for 3004. I've tested this against the master branch and I don't see the memory leak. Looks like the recent transport refactor inadvertently addressed the issue. 0001-Slow-memory-leak-fix.patch.txt @frebib I don't believe this is the same issue as #61521. I believe the root of #61521 is caused by this commit. We're probably creating multiple instances of events and transports, which need to be addressed. #61468 should be a step in the right direction. |
Is there any progress on this? |
It looks a lot better in 3006.5 when using this: af12352 |
Fantastic news, thanks for the update. |
How do you debug a memory leak like this? With tracemalloc or some other tools? |
In the past we've relied on the … We've also been working towards better debugging of running salt processes with better tooling. We've added debug symbol packages in 3006.x, and there is a newer tool, relenv-gdb-dbg, to help debug these kinds of issues. |
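For anyone wanting to try tracemalloc, here is a minimal sketch (not from this thread; it assumes you can run it inside, or inject it into, the suspect Python process, and the 60-second interval is arbitrary) that compares snapshots over time to see which allocations keep growing:

```python
# Minimal tracemalloc sketch: compare snapshots over time to spot growth.
import time
import tracemalloc

tracemalloc.start(25)                    # keep 25 frames of traceback per allocation
baseline = tracemalloc.take_snapshot()   # snapshot to diff against

while True:
    time.sleep(60)
    snapshot = tracemalloc.take_snapshot()
    # Group allocation deltas by source line and print the biggest growers
    for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
        print(stat)
    print('-' * 60)
```

Lines that keep climbing across successive diffs are the usual suspects for a slow leak like this one.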
@dwoz I needed to flip this line in the minion.py code |
@Zpell82 Sounds like a bug that should be its own issue? |
@Zpell82 Can you open a separate issue for this? |
We are running Salt 3006.6 and are seeing this memory leak. I examined salt/minion.py and it already has the fix suggested in #61565 (comment). Further, after the upgrade from 3006.5 to 3006.6 we are now seeing this problem present much faster than before. It used to take about 20 days; now we've noticed it after just 7 days. |
Yeah, seeing this in 3006.6 in testing as well. |
I believe this is resolved with |
I attempted to install 3006.7, but I got an error while trying to salt-pip install pygit2 to use with a git backend.
OS is Debian Bullseye. Maybe a new dependency needs to be added to the deb? |
I still see 3006.7 leaking in our environment... the EventPublisher process is over 8GB of memory in ~48 hours since upgrading (edited from originally saying "less than 24 hours"; I lost track of what day it was /facepalm). A master still at 3006.4 has its EventPublisher up to 24GB in about 10 days since its last restart. These masters are both using an external job cache. Masters not using an external job cache don't seem to leak noticeably - are others seeing the leak in 3006.6/etc. using an external job cache? Maybe the external job cache is a red herring; I haven't dug into how that is all glued together and whether it is handled by the EventPublisher process... |
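As an aside, one way to track that growth over time is to sample the EventPublisher's RSS periodically. A minimal sketch using psutil (assuming the process is identifiable by the "EventPublisher" string in its command line; the 5-minute interval is arbitrary):

```python
# Minimal sketch: periodically log the RSS of the EventPublisher process.
import time
import psutil

def find_event_publisher():
    """Return the first process whose cmdline mentions EventPublisher, if any."""
    for proc in psutil.process_iter(['pid', 'cmdline']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if 'EventPublisher' in cmdline:
            return proc
    return None

while True:
    proc = find_event_publisher()
    if proc is not None:
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        print('{} EventPublisher pid={} rss={:.1f} MiB'.format(
            time.strftime('%Y-%m-%d %H:%M:%S'), proc.pid, rss_mb))
    time.sleep(300)
```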
@lomeroe as a matter of fact, ours is using an external job cache (redis). |
Interesting... maybe there is something there with the external job cache and the EventPublisher process leaking (we are not using redis as our external job cache, so it would seem not to be specifically related to a single external job cache type, at least). @dwoz - thoughts? |
This is still a problem in 3006.7 for us, and we have a cronjob to restart the Salt master once a day. |
@johje349 @jheiselman - do either of you run orchestrations or other background jobs that utilize batch mode? After restarting a master with the patch I mention in #66249, memory for the EventPublisher process hasn't gone over 400MB in almost 24 hours, which seems considerably better than what we have been seeing - typically it is several GB in a day. Obviously we need more time monitoring to really tell if it has something to do with it, so it could just be coincidental... We've got quite a few orchestration/API-initiated jobs that use batch mode (all initiated on the masters exhibiting the issue), so if they were hanging up/never fully returning, I suppose it's possible that could cause memory issues in the EventPublisher process somehow. I wouldn't have ever guessed that, but... |
We do not have very many, but yes, we do have a few scheduled jobs (started via the salt-api) that utilize batch mode. None of our orchestrations use batch mode. At this point in time, both of our salt-masters were last restarted four days ago. One of our masters is currently using 3.5 GB of memory while the other is using less than 1 GB. They share the same workloads/minions; there's no difference between the two. Their salt-apis are behind a load balancer, so one of them may be getting heavier jobs than the other purely by chance. But the load balancer is configured for simple round-robin, so they should be getting the same number of jobs. |
Yes, we use orchestration states with batch set to 20%. |
After 5 days, EventPublisher memory usage is still hovering around 400MB. The only change made was applying the patch mentioned in #66249. It seems fairly likely to me that that issue causes memory usage increases in the EventPublisher process, due to the jobs run in batch mode never actually ending. |
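For context, a batch-mode job kicked off from Salt's Python client API looks roughly like the sketch below (the target glob, function, and batch size are placeholders, not taken from this thread):

```python
# Minimal sketch of running a job in batch mode through Salt's Python API.
import salt.client

client = salt.client.LocalClient()

# cmd_batch yields results as each batch of minions returns; if a batch
# never fully returns, the job (and its event traffic) lingers on the master.
for result in client.cmd_batch('*', 'test.ping', batch='20%'):
    print(result)
```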
Found the issue: I had master: ["mastername"] in my minion config. Removed the brackets and it started working just fine. |
Description
We've observed a memory leak in the EventPublisher process on the master. This can be seen by adding a simple engine to a salt-minion that sends lots of auth requests. It looks like the leak isn't specific to auth events. A large number of any events should trigger it.
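A minimal sketch of such a minion engine (the module name, event tag, and interval below are made up for illustration, not taken from the original reproducer) that floods the master's event bus:

```python
# Minimal sketch of a minion engine that floods the master event bus.
# Drop it into the minion's engines directory and enable it with
# `engines: [event_flood]` in the minion config; tag and interval are arbitrary.
import time

def start(interval=0.05, tag='test/event_flood'):
    count = 0
    while True:
        # __salt__ is injected by the loader; event.send forwards the
        # event from the minion to the master's event bus.
        __salt__['event.send']('{}/{}'.format(tag, count), {'count': count})
        count += 1
        time.sleep(interval)
```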
Setup
Versions
v3004