Memory fragmentation prevents memory release on Linux #287
I got more data in production. I have two instances of https://github.com/Rongronggg9/RSS-to-Telegram-Bot on the same VPS, one with ~4000 feeds and another with ~3000 feeds. The bot checks the feeds for updates frequently. I noticed that the relation between the number of feeds and the amount of memory leakage looks logarithmic. Also, parsing the same feed multiple times (whether it stays unchanged or is updated) leaks less than parsing different feeds once each, and once the same feed has been parsed enough times, the leakage hardly increases any further. That is to say, the relation between the number of parses and the amount of memory leakage also looks logarithmic. I guess the leaked objects can somehow be reused? If that's true, it would be a helpful clue for figuring out the cause of the memory leakage. Related: #302 (comment)
Hi, coming here from your comment on #302. I ran a few tests where I called feedparser.parse() in a loop and measured memory usage (details below). I tried two feeds, one 2M and one 50K, both loaded from disk; I did this both on macOS and on Ubuntu. The results are as you describe: the max RSS increases in what looks like a logarithmic curve; that is, after enough iterations (10-100), the max RSS remains almost horizontal/stable.

However, I am not convinced this is a memory leak in feedparser. Rather, I think it's a side effect of how Python memory allocation works. Specifically, Python never releases allocated memory back to the operating system (1, 2, 3), but keeps it around and reuses it. (Because of this, running gc.collect() will never decrease RSS.) I assume the initial sharper memory increase is due to fragmentation (even if there's enough memory available, it's not in a contiguous chunk, so the allocator has to allocate additional memory); as more and more memory is allocated and then released (into the pool), it becomes easier to find a contiguous chunk.

It makes sense for #302 to make max RSS stabilize faster, since it reduces the number of allocations – and more importantly, the number of big (whole-feed) allocations – which reduces the impact of fragmentation. It might be possible to confirm this 100% by measuring the used memory as seen by the Python allocator, instead of max RSS.

Script:

import sys, resource
import feedparser

print(" loop maxrss")
for i in range(10 ** 3 + 1):
    with open(sys.argv[1], 'rb') as file:
        feedparser.parse(file)
    maxrss = (
        resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        / 2 ** (20 if sys.platform == 'darwin' else 10)
    )
    if (i <= 10) or (i <= 100 and i % 10 == 0) or (i <= 1000 and i % 100 == 0):
        print(f"{i:>8} {maxrss:>8.3f}")

Output:
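One way to check this hypothesis is to compare the memory that the Python allocator itself reports against the process RSS: if RSS grows while the Python-level figure stays flat, the growth comes from the allocator and heap fragmentation rather than from objects feedparser keeps alive. The sketch below uses the standard-library tracemalloc module for that comparison; it is an illustration under those assumptions, not part of the original thread.

import sys
import resource
import tracemalloc

import feedparser

# Track allocations made through the Python memory allocator.
tracemalloc.start()

def rss_mib():
    # ru_maxrss is reported in bytes on macOS and in KiB on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 2 ** (
        20 if sys.platform == "darwin" else 10
    )

for i in range(101):
    with open(sys.argv[1], "rb") as file:
        feedparser.parse(file)
    if i % 10 == 0:
        # If rss keeps growing while traced ("Python-level") memory stays flat,
        # the growth is coming from allocator behavior/fragmentation.
        current, _peak = tracemalloc.get_traced_memory()
        print(f"{i:>4}  rss={rss_mib():8.1f} MiB  traced={current / 2 ** 20:8.1f} MiB")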
Hi, @lemon24. Thanks for sharing. I can confirm that your statement "I am not convinced this is a memory leak in feedparser" is true. However, after confirming that, I did a deep dive, and I believe your statement "Python never releases allocated memory back to the operating system, but keeps it around and reuses it" is incorrect. In conclusion, your PR (#302) does help reduce the "leakage", but only to a fairly limited extent. My final solution is shown below.

Prohibiting the usage of

As a solution in production, 1.
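For readers looking for something concrete: one widely used mitigation for glibc heap fragmentation on Linux is to explicitly ask glibc to return freed heap pages to the operating system via malloc_trim(3). The sketch below shows that technique through ctypes; it is only an illustration of the general approach, not necessarily the exact solution adopted above.

import ctypes
import ctypes.util

def trim_glibc_heap() -> bool:
    # Ask glibc to release free heap memory back to the OS. Returns True if
    # any memory was actually released. Only meaningful with the glibc
    # allocator on Linux; other libcs do not provide malloc_trim.
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return False
    libc = ctypes.CDLL(libc_path)
    if not hasattr(libc, "malloc_trim"):
        return False
    return bool(libc.malloc_trim(0))

# Example: call it after a large batch of feeds has been parsed.
if __name__ == "__main__":
    print("trimmed:", trim_glibc_heap())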
A better workaround for multithreaded programs is to replace the
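On Linux, replacing or tuning the allocator is usually done through environment variables set before the interpreter starts, for example by preloading jemalloc or by limiting glibc to a single arena. The snippet below is a hedged sketch: the jemalloc library path and the bot.py entry point are assumptions and will differ per system.

import os
import subprocess

env = dict(os.environ)
# Option 1: preload jemalloc so it replaces glibc malloc in the child process.
# The library path is an assumption (Debian/Ubuntu libjemalloc2); adjust it
# for your distribution.
env["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
# Option 2 (alternative): keep glibc malloc but restrict it to one arena,
# which reduces per-thread arena fragmentation in multithreaded programs.
# env["MALLOC_ARENA_MAX"] = "1"

# "bot.py" stands in for the actual program being launched.
subprocess.run(["python3", "bot.py"], env=env, check=True)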
I've changed the title of the issue and would like to keep it open as a guide for developers facing the same issue. It would be better if this could be documented in the docs. My conclusion is that to "solve" the issue at the
Code to reproduce
feeds.tar.gz
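The archive and the exact reproduction script are attachments and are not reproduced here; the sketch below is a stand-in that follows the behaviour described in this thread, with would_leak_1 parsing many different feeds once each and would_leak_2 parsing the same feed repeatedly. The function bodies are assumptions, not the original code.

import glob
import resource
import sys

import feedparser

def rss_kib():
    # On Linux, ru_maxrss is reported in KiB.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

def would_leak_1(feed_paths):
    # Assumed behaviour: parse many *different* feeds once each.
    for path in feed_paths:
        with open(path, "rb") as file:
            feedparser.parse(file)
    print("after different feeds:", rss_kib(), "KiB")

def would_leak_2(feed_path, times=100):
    # Assumed behaviour: parse the *same* feed repeatedly.
    for _ in range(times):
        with open(feed_path, "rb") as file:
            feedparser.parse(file)
    print("after same feed x", times, ":", rss_kib(), "KiB")

if __name__ == "__main__":
    # sys.argv[1]: directory containing the extracted feed files.
    paths = sorted(glob.glob(sys.argv[1] + "/*.xml"))
    would_leak_1(paths)
    would_leak_2(paths[0])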
My tests
feedparser 6.0.8
Debian GNU/Linux 11 (bullseye) on WSL (CPython 3.9.2) - Leaked!
Debian GNU/Linux 11 (bullseye) on Azure b1s (CPython 3.9.2) - Leaked!
AOSC OS aarch64 (CPython 3.8.6) - Leaked!
Armbian bullseye (21.08.2) aarch64 (CPython 3.9.2) - Leaked!
Windows 11 22000.194 (CPython 3.9.2) - Leaked only a little; negligible.
Windows 11 22000.194 (PyPy 7.3.5, Python 3.7.10) - Leaked!
Note
If I run would_leak_1 and would_leak_2 separately, their leaking behavior seems the same. However, running them sequentially in the same process does make whichever runs second leak less under some conditions, as you can see.