Unbounded increases in GC time #7176
Comments
It's hard to see why GC time would increase when memory usage and the number of objects being collected stay the same. The only change in 1.12.0 which is vaguely related to GC is #6953. Could you patch out that change and see if it helps at all?
After something like a week of uptime, synapse managed to use all the RAM on my server. A restart later (a little after 12:00), all was smooth again. I can provide more precise information if needed. I am the only user on my server and I am in a few big rooms. Also, I only started sending messages today; it was in #synapse:matrix.org.
I think I reverted it in https://mau.dev/maunium/synapse, but it still seems to grow quickly:
@tulir yeah; it was a long shot tbh. I have no real idea what could cause this. Are you sure it didn't happen with 1.12.0rc1? It feels like something has changed outside of synapse. I don't suppose you've had a python version change or something? I guess it could just reflect a different traffic pattern from your clients/federated servers.
I am seeing similar issues. At first I thought it might be related to open file handles, or that the version of Python we are running might not be a good fit for the Twisted update (we are on Debian 9 with Python 3.5.3 and matrix-synapse-py3 1.12.3). Unfortunately, we weren't scraping Prometheus metrics until we got onto the 1.12.x branch, so we cannot compare GC collection time against a prior release.
So, my GC times are still growing linearly; it seems that mostly the federation reader and federation sender are affected. My caches are already full and are not growing any more, but memory usage for the federation workers is still climbing. This may suggest that there is a resource leak around DNS or connections or something similar. It could of course be something else, but I would expect the memory usage of the federation sender to flatten off after a few days. In any case, GC times shouldn't grow linearly over time, which they do at the moment.
As I've said previously, I don't think this is a regression: it's more likely that there was some change in the federation which coincided with the upgrade and which made synapse want to chew more memory. We're not seeing any sort of leaking on the servers we administer, so I guess the issues being seen here are either due to misconfigured caches, or due to some strange interaction of environment/configuration.

It is certainly true that synapse can be a memory hog, and it's a bit of a black box in terms of what it is doing. Making the caching more intelligent is on the medium-term roadmap, and as part of that work I'd like to add some analytics which reveal how much memory is being used in caches (and how much elsewhere). However, in the shorter term, it's hard for us to do much to investigate individual cases of unexpectedly high memory usage. The only thing I can really suggest is that folks have a play with python memory-profiling tools to see if you can find out where the memory is being used.
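(Not part of the original thread, but as an illustration of the kind of memory profiling suggested above, here is a minimal sketch using the standard-library `tracemalloc` module. The frame depth and output limit are arbitrary choices, not anything synapse-specific; the snapshot could be taken from any debug shell attached to the running process.)

```python
import tracemalloc

# keep 25 frames of traceback per allocation so call sites are identifiable
tracemalloc.start(25)

# ... let the process run until memory has grown, then, e.g. from a debug shell:
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    # the top allocation sites by total size currently allocated
    print(stat)
```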
@richvdh How often are you restarting your federation workers? In my case I can simply plot a line through my increasing GC times and memory usage for my federation workers. This is while cache size has reached its maximum value after a few hours; the GC increases are over a period of a few days. A week is pretty much the limit, since at some point my server simply runs out of memory. I can work around that by restarting my synapse/workers every few days. Of the people I've asked, about half of them see the same pattern. Some don't, but many of them don't even have metrics enabled. So while this may not affect everyone, it does at least affect a few people with very different configurations. I also wouldn't jump to this being cache related. As I said, my caches don't grow after a certain size, so this could be any kind of resource leak, or it could be an unbounded cache. I'll see if I can figure something out with a memory profiler at some point. I'm not sure whether this is a regression or has only become noticeable recently, but an unbounded increase in memory use and GC times does sound like a bug to me.
Yeah, I think I linked to a similar issue before. We'll see; the graphs don't match up exactly, but both are growing, that is true. If you can run your workers for a few weeks, then it really seems to be system specific. Usually synapse updates tend to restart my workers often enough that memory use doesn't become an issue, so I thought that may mask the effects.
I'm fairly certain now that #7828 is unrelated, after running the 1.20 RCs for a while (unless the fix didn't actually land yet). Didn't have the time to profile yet, though.
I've found some interesting things. Firstly, as time goes by, there is an accumulation of objects being tracked by the garbage collector. This is consistent with GC taking more and more time, as the GC must inspect each of those objects on each gen-2 garbage collection. The majority of these objects are tuples:
And more interestingly still, objgraph fails to find any reference to most of those tuples:
The leaking tuples are almost all 1- or 2-tuples:
the 1-tuples almost all contain a
and the 2-tuples are almost all
So, to recap: of the 1.4M objects being tracked by the garbage collector in my synapse process, 845,000 of them (60%) are tuples containing
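(The exact commands and output from this comment were lost from the scrape; the following is only a sketch of how a similar census of GC-tracked objects, and the back-reference check with objgraph, could be reproduced. The sample size and depth limits are arbitrary.)

```python
import collections
import gc

import objgraph  # third-party: pip install objgraph

# count everything the cyclic GC currently tracks, broken down by type
objs = gc.get_objects()
print(f"{len(objs)} objects currently tracked by the GC")
print(collections.Counter(type(o).__name__ for o in objs).most_common(10))

# for the suspicious small tuples, see whether anything still refers to them
suspects = [t for t in objgraph.by_type("tuple") if len(t) <= 2]
for t in suspects[:5]:
    chain = objgraph.find_backref_chain(t, objgraph.is_proper_module, max_depth=10)
    # a chain of length 1 means objgraph found nothing referring to the tuple
    print(len(chain), repr(t)[:80])
```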
It's maybe a little early to say, but disabling zope's C extensions (by moving the compiled extension modules out of the way) looks like it makes a difference. About 15 hours after restart:
Next step is to see if I can repro this behaviour outside of synapse, I think. |
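(Not from the thread itself, but a sketch of how one might check whether zope.interface's C optimizations are installed and, assuming a zope.interface release that honours the `PURE_PYTHON` environment variable, force the pure-Python fallback without moving any files. `_zope_interface_coptimizations` is the extension module shipped by zope.interface.)

```python
import importlib.util
import os

# Must be set before zope.interface is imported anywhere in the process;
# assumed to be honoured by the installed zope.interface version.
os.environ["PURE_PYTHON"] = "1"

spec = importlib.util.find_spec("_zope_interface_coptimizations")
if spec is None:
    print("zope.interface C extension module is not installed")
else:
    print("C extension present at", spec.origin,
          "- PURE_PYTHON decides whether it is used")
```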
@tulir aha, seems we have a culprit then. Will continue investigating and raise an issue with zope once I have more details. |
Looks like zope released a 5.1.1 with the fix. Do we want to do anything about this? (Blacklist versions or anything?) Or is this "fixed" from our POV? |
I'm just running with updated zope to confirm. Assuming it works I don't think we need to go as far as blacklisting versions; the most we might want to do is suggest in a changelog or somewhere that people upgrade. |
Yup looks good to me. Thanks to @jamadden for jumping on this from the zope side. |
Thanks @richvdh and everyone here for finding and reporting this! Apologies for the trouble it caused. |
Description
After updating to 1.12.0, average event send times have gone up. Metrics are showing GC time going up a lot more than before, around 1 second per day. Memory usage hasn't changed significantly though. @lub:imninja.net's metrics showed the same problem when I asked in #synapse:matrix.org.
The problem did not happen with the 1.12 release candidate, which leads me to believe it could have something to do with the Twisted update.
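(The numbers above come from synapse's Prometheus metrics. Purely as an illustration, and not synapse's actual metrics code, here is how time spent in gen-2 collections can be measured in any CPython process using the stdlib `gc.callbacks` hook; the dict and list names are arbitrary.)

```python
import gc
import time

gen2_stats = {"time": 0.0, "runs": 0}
_start_time = []

def _gc_timer(phase, info):
    # gc invokes callbacks with phase "start"/"stop" and an info dict
    # containing the generation being collected
    if info["generation"] != 2:
        return
    if phase == "start":
        _start_time.append(time.perf_counter())
    elif _start_time:
        gen2_stats["time"] += time.perf_counter() - _start_time.pop()
        gen2_stats["runs"] += 1

gc.callbacks.append(_gc_timer)
```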
Version information