Unbounded increases in GC time #7176

tulir · 2020-03-30T16:07:38Z

Description

After updating to 1.12.0, average event send times have gone up. Metrics are showing GC time going up a lot more than before, around 1 second per day. Memory usage hasn't changed significantly though. @lub:imninja.net's metrics showed the same problem when I asked in #synapse:matrix.org.

The problem did not happen with the 1.12 release candidate, which leads me to believe it could have something to do with the Twisted update.

Version information

Version: 1.12.0 (https://mau.dev/maunium/synapse)
Install method: Docker
Platform: buster-slim, tested both with and without jemalloc

lub · 2020-03-31T07:23:12Z

Although I don't see such a clear difference in other metrics, but maybe I just have less load

I use matrixdotorg/synapse:latest. Also had the rc1 on release day, but it seems it didn't cause that behaviour

tulir · 2020-04-08T16:26:26Z

I tried downgrading to the previous Twisted version, but that didn't seem to help, so I guess it wasn't that after all.

New graphs

For comparison, here are 30 day graphs from before 1.12. The GC time still went up and didn't come down, but the growth was much slower:

Graphs from before 1.12

richvdh · 2020-04-08T16:46:08Z

It's hard to see why GC time would increase when memory usage and the number of objects being collected stays the same.

The only change in 1.12.0 which is vaguely related to GC is #6953. Could you patch out that change and see if it helps at all?

casimir · 2020-04-08T19:01:07Z

After something like a week of uptime synapse managed to use all the RAM on my server. A restart later (a little after 12:00) all was smooth again. I can provide more precise information if needed.

I am the only user on my server and I am in a few big rooms. Also I only started sending messages today, it was on #synapse:matrix.org

tulir · 2020-04-09T10:39:12Z

I think I reverted it in https://mau.dev/maunium/synapse, but still seems to grow quickly:

richvdh · 2020-04-09T11:18:31Z

@tulir yeah; it was a long shot tbh. I have no real idea what could cause this. Are you sure it didn't happen with 1.12.0rc1 ?

It feels like something has changed outside of synapse. I don't suppose you've had a python version change or something?

I guess it could just reflect a different traffic pattern from your clients/federated servers.

twoequaldots · 2020-04-09T15:47:08Z

I am seeing similar issues, at first I thought it might have been related to open file handles and/or use_presence: true (now false) in our configuration but those only seemed to help a bit. The only way around this is periodically restarting matrix-synapse.

I was starting to suspect maybe the version of Python we are running may not be good for the Twisted update (we are on Debian 9 with Python v3.5.3 and matrix-synapse-py3 on 1.12.3).

Unfortunately, we weren't scraping Prometheus metrics until we got onto the 1.12.x branch so we cannot compare GC collection time from a prior release.

deepbluev7 · 2020-06-04T11:29:45Z

So, my gc times are still growing linearly:

It seems mostly the federation reader and federation sender are affected. My caches are already full and are not growing anymore:

But it seems like memory usage for the federation workers is still climbing:

This may suggest, that there is a resource leak regarding DNS or connections or similar. Could of course be something else, but I would expect the memory usage of the federation sender to flatten off after a few days. In any case, the GC times shouldn't grow linearly over time, which they do at the moment.

richvdh · 2020-09-07T11:51:12Z

as I've said previously, I don't think this is a regression: it's more likely that there was some change in the federation which coincided with the upgrade and which made synapse want to chew more memory.

We're not seeing any sort of leaking on the servers we administer, so I guess the issues being seen here are either due to misconfigured caches, or due to some strange interaction of environment/configuration.

It is certainly true that synapse can be a memory hog, and it's a bit of a black-box in terms of what it is doing. Making the caching more intelligent is on the medium-term roadmap, and as part of that work I'd like to add some analytics which reveal how much memory is being used in caches (and how much elsewhere).

However, in the shorter term, it's hard for us to do much to investigate individual cases of unexpectedly high memory usage. The only thing I can really suggest is that folks have a play with python memory-profiling tools to see if you can find out where the memory is being used.

deepbluev7 · 2020-09-07T13:44:45Z

@richvdh How often are you restarting your federation workers? In my case I can simply plot a line through my increased GC times and memory usage for my federation workers. This is while cache size is up to its max value after a few hours, the GC increases are over the period of a few days. A week is pretty much the limit, since at some point my server simply runs out of memory. I can work around that by restarting my synapse/workers every few days. Of the people I've asked, about half of them see the same pattern. Some don't, but many of them don't even have metrics enabled. So while this may not affect everyone, it does at least affect a few people with very different configurations.

I also wouldn't jump to this being cache related. As I said, my caches don't grow after a certain size, so this can be any kind of resource leak or it is an unbounded cache. I'll see, if I can figure something out with a memory profiler at some point. I'm not sure, if this is a regression or has only become noticeable recently, but unbounded increase in memory use and GC times does sound like a bug to me.

richvdh · 2020-09-07T14:11:46Z

@richvdh How often are you restarting your federation workers?

we have no problems with running them for up to a couple of weeks. There's no particular change in GC time over this period.

If its limited to federation senders, it could be #7828 which has a fix in the forthcoming v1.20.0 release.

deepbluev7 · 2020-09-07T14:16:48Z

If its limited to federation senders, it could be #7828 which has a fix in the forthcoming v1.20.0 release.

Yeah, I think I linked to a similar issue before. We'll see, the graphs don't match up exactly, but both are growing, that is true.

If you can run you workers for a few weeks, then it really seems to be system specific. Usually synapse updates tend to restart my workers often enough, that memory use doesn't become an issue, so I thought that may mask the effects.

deepbluev7 · 2020-09-16T21:15:12Z

I'm fairly certain now, that #7828 is unrelated, after running the 1.20 rcs for a while now (unless the fix didn't actually land yet):

Didn't have the time to profile yet though.

richvdh · 2020-09-25T13:28:18Z

I think I'm actually seeing something similar to this on my personal HS:

I'll have a poke about to see if I can figure out what's going on.

richvdh · 2020-09-26T17:04:43Z

I've found some interesting things:

Firstly, as time goes by, there is an accumulation of objects being tracked by the garbage collector, as returned by len(gc.get_objects()):

This is consistent with GC taking more and more time, as the GC must inspect each of those objects on each gen2 garbage collection.

The majority of these objects are tuples:

>>> objgraph.show_most_common_types(5,gc.get_objects())
tuple       879076
dict        73675
set         62514
list        60363
ProfileInfo 43990

And more interestingly still, objgraph fails to find any reference to most of those tuples:

>>> unreffed = objgraph.get_leaking_objects()
>>> sum(1 for t in unreffed if isinstance(t, tuple))
844957

Of those leaking tuples, they are almost all 1- or 2-tuples:

>>> collections.Counter(len(t) for t in unreffed if isinstance(t, tuple))
Counter({1: 531264, 2: 313618, 3: 30, 4: 26, 5: 9, 7: 3, 6: 2, 8: 1, 9: 1, 11: 1, 13: 1, 15: 1})

the 1-tuples almost all contain a zope.interface.declarations.Implements:

>>> onetups = [t for t in unreffed if isinstance(t, tuple) and len(t) == 1]
>>> implonetups = [t for t in onetups if isinstance(t[0], zope.interface.declarations.Implements)]
>>> len(implonetups)
531032

and the 2-tuples are almost all (zope.interface.interface.InterfaceClass, zope.interface.declarations.Implements) pairs:

>>> impltwotups = [
...     t
...     for t in twotups
...     if isinstance(t[0], zope.interface.interface.InterfaceClass)
...     and isinstance(t[1], zope.interface.declarations.Implements)
... ]
>>> len(impltwotups)
313596

so, to recap, of the 1.4M objects being tracked by the garbage collector in my synapse process, 845000 of them (60%) are tuples containing zope.interface.declarations.Implements objects which appear to be unreferenced by any other python object. This suggests some sort of leak in the C parts of zope.interface, though it's also possible that we or Twisted are misusing zope in some way that causes this.

richvdh · 2020-09-27T07:44:48Z

It's maybe a little early to say, but disabling zope's C extensions (by moving _zope_interface_coptimizations.cpython-35m-x86_64-linux-gnu.so aside) seems to stop the leak, or at least significantly change the behaviour:

About 15 hours after restart:

>>> len(gc.get_objects())
540302
>>> unreffed = objgraph.get_leaking_objects()
>>> objgraph.show_most_common_types(5,unreffed)
dict   1696
list   949
CType  103
tuple  95
method 46

Next step is to see if I can repro this behaviour outside of synapse, I think.

tulir · 2020-09-27T09:24:00Z

Downgrading to zope.interface<5 also seems to solve it. The release date of v5 and this issue starting match pretty much exactly

richvdh · 2020-09-27T09:33:20Z

@tulir aha, seems we have a culprit then. Will continue investigating and raise an issue with zope once I have more details.

richvdh · 2020-09-27T19:41:28Z

zopefoundation/zope.interface#216

clokep · 2020-09-30T17:07:58Z

Looks like zope released a 5.1.1 with the fix. Do we want to do anything about this? (Blacklist versions or anything?) Or is this "fixed" from our POV?

richvdh · 2020-09-30T17:10:08Z

I'm just running with updated zope to confirm. Assuming it works I don't think we need to go as far as blacklisting versions; the most we might want to do is suggest in a changelog or somewhere that people upgrade.

richvdh · 2020-10-01T11:46:52Z

Yup looks good to me. Thanks to @jamadden for jumping on this from the zope side.

jamadden · 2020-10-01T12:39:14Z

Thanks @richvdh and everyone here for finding and reporting this! Apologies for the trouble it caused.

clokep added the A-Performance Performance, both client-facing and admin-facing label Apr 2, 2020

deepbluev7 mentioned this issue Jul 14, 2020

in-memory federation transaction transmission queues build up indefinitely for offline servers #7828

Closed

richvdh changed the title ~~Memory/GC time regression~~ Unbounded increases in GC time Sep 7, 2020

clokep added the Z-Upstream-Bug This issue requires a fix in an upstream dependency. label Sep 29, 2020

richvdh mentioned this issue Sep 30, 2020

synapse takes ages to restart #7968

Open

richvdh closed this as completed Oct 1, 2020

ajs124 mentioned this issue Jan 26, 2021

[20.09] python3Packages.zope.interface: 5.1.0 -> 5.1.2 NixOS/nixpkgs#110868

Merged

10 tasks

richvdh mentioned this issue Oct 25, 2021

Unbounded increases in GC time #11172

Closed

clokep mentioned this issue Jul 21, 2022

Update locked frozendict version to 2.3.3 #13352

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unbounded increases in GC time #7176

Unbounded increases in GC time #7176

tulir commented Mar 30, 2020 •

edited

Loading

lub commented Mar 31, 2020 •

edited

Loading

tulir commented Apr 8, 2020

richvdh commented Apr 8, 2020 •

edited

Loading

casimir commented Apr 8, 2020 •

edited

Loading

tulir commented Apr 9, 2020

richvdh commented Apr 9, 2020

twoequaldots commented Apr 9, 2020

deepbluev7 commented Jun 4, 2020

richvdh commented Sep 7, 2020

deepbluev7 commented Sep 7, 2020

richvdh commented Sep 7, 2020

deepbluev7 commented Sep 7, 2020

deepbluev7 commented Sep 16, 2020

richvdh commented Sep 25, 2020

richvdh commented Sep 26, 2020

richvdh commented Sep 27, 2020

tulir commented Sep 27, 2020

richvdh commented Sep 27, 2020

richvdh commented Sep 27, 2020

clokep commented Sep 30, 2020

richvdh commented Sep 30, 2020

richvdh commented Oct 1, 2020

jamadden commented Oct 1, 2020

Unbounded increases in GC time #7176

Unbounded increases in GC time #7176

Comments

tulir commented Mar 30, 2020 • edited Loading

Description

Version information

lub commented Mar 31, 2020 • edited Loading

tulir commented Apr 8, 2020

richvdh commented Apr 8, 2020 • edited Loading

casimir commented Apr 8, 2020 • edited Loading

tulir commented Apr 9, 2020

richvdh commented Apr 9, 2020

twoequaldots commented Apr 9, 2020

deepbluev7 commented Jun 4, 2020

richvdh commented Sep 7, 2020

deepbluev7 commented Sep 7, 2020

richvdh commented Sep 7, 2020

deepbluev7 commented Sep 7, 2020

deepbluev7 commented Sep 16, 2020

richvdh commented Sep 25, 2020

richvdh commented Sep 26, 2020

richvdh commented Sep 27, 2020

tulir commented Sep 27, 2020

richvdh commented Sep 27, 2020

richvdh commented Sep 27, 2020

clokep commented Sep 30, 2020

richvdh commented Sep 30, 2020

richvdh commented Oct 1, 2020

jamadden commented Oct 1, 2020

tulir commented Mar 30, 2020 •

edited

Loading

lub commented Mar 31, 2020 •

edited

Loading

richvdh commented Apr 8, 2020 •

edited

Loading

casimir commented Apr 8, 2020 •

edited

Loading