
Unbounded increases in GC time #11172

Closed
hyossing opened this issue Oct 25, 2021 · 11 comments
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@hyossing

hyossing commented Oct 25, 2021

Description

After updating to 1.35.0, the average GC time has been increasing much faster than before, by around 0.5 secs per week.

  • It increased by almost 6 secs over the last 3 months, before we restarted Synapse.

[screenshot: Screen Shot 2021-10-25 at 9 01 57 AM]

I'm trying to reproduce this in my test environment.

  • So far it has only been reproduced in our commercial environment.

Do you have any ideas on how to handle this problem?

Additional Synapse Metrics

[screenshot: Screen Shot 2021-10-25 at 9 01 39 AM]

[screenshot: Screen Shot 2021-10-25 at 9 02 18 AM]

Environment Information

What version of Synapse is running?
: 1.35.0 (single instance, no workers)

Synapse Cache Factor
: 10.0

Event Cache Size
: 100k

Sync (long polling) timeout & request timeout in our client
: 10 secs & 15 secs

  • The Element client syncs with 30 secs & 108 secs.

Synapse Users
: Around 3000

Active User Devices
: 1000 Devices

The top 3 API Endpoints

  • /sync
  • /sendToDevice
  • /metrics

Platform

  • Ubuntu
    : 20.04

  • Container
    : customized Docker container FROM phusion/baseimage:0.11
    : Python 3.6

  • VM (KVM)
    : CPU cores: 24, memory: 61 GB

@richvdh
Member

richvdh commented Oct 25, 2021

related: #7176

@hyossing
Author

Hello, richvdh.

I checked, and it is the same problem as #7176.

I ran the same command you used in #7176.

However, it seems the problem is not related to zope.interface.

I took a snapshot of the objects 2 days after restarting Synapse.

[screenshot: Screen Shot 2021-10-23 at 12 05 38 PM]

@squahtx
Contributor

squahtx commented Oct 26, 2021

Can you zoom in on the Currently allocated objects graph, showing synapse-gen 2? Does synapse-gen 2 trend upwards over time, just like the average GC time?

@squahtx added the X-Needs-Info (blocked awaiting information from the reporter) label and removed the X-Needs-Discussion label on Oct 26, 2021
@hyossing
Author

Hello, Sean Quah.

Nice to meet you.

I attached two images.

  • Last 30 days.

[screenshot: Screen Shot 2021-10-27 at 5 53 25 AM]

  • 2021-10-14 (Peak duration for GC Time)

[screenshot: Screen Shot 2021-10-27 at 5 54 08 AM]

@richvdh
Member

richvdh commented Oct 29, 2021

I think your 'currently allocated objects' graph is incorrect - it's reporting the number of gen1 GCs since the last gen2 GC, which is why it is cycling in the range 0-10. You should consider updating your Grafana dashboard.

Unfortunately we don't expose the total number of active objects as a Prometheus metric. (I used a local patch to generate the graph in #7176 (comment); calculating the count requires the Python runtime to build a list of all the active objects, which is expensive.)

The easiest way to get the metric we're after here is probably to do gc.set_debug(gc.DEBUG_STATS) in the manhole, and then to grep the logs:

grep 'gc: objects in each generation' homeserver.log
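
For reference, in the manhole that is roughly:

import gc

# Print per-collection statistics (including the "objects in each generation"
# lines) to stderr; the output lands in homeserver.log, hence the grep above.
gc.set_debug(gc.DEBUG_STATS)

# To turn the debug output off again later:
gc.set_debug(0)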

@hyossing
Author

hyossing commented Nov 1, 2021

Hello, Richard.

I reproduced it in my test environment.

I ran the command gc.set_debug(gc.DEBUG_STATS).

I sampled some logs.

2021-10-30

2021-10-30 07:59:58,823 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1518 5816 31794
2021-10-30 07:59:56,699 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1621 5371 31794
2021-10-30 07:59:55,425 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1894 4666 31806
2021-10-30 07:59:54,228 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1636 3792 32428
2021-10-30 07:59:50,618 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1543 2559 33271

2021-10-31

2021-10-31 07:59:59,830 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1887 4097 162277
2021-10-31 07:59:57,592 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1618 2826 163332
2021-10-31 07:59:55,243 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1891 1290 164193
2021-10-31 07:59:52,300 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1637 0 165252
2021-10-31 07:59:50,095 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 9 9189 159962

2021-11-01

2021-11-01 08:00:00,954 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 980 0 222327
2021-11-01 07:59:59,865 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 4 0 243758
2021-11-01 07:59:59,635 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 19 8526 237368
2021-11-01 07:59:59,586 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1893 6967 237368
2021-11-01 07:59:57,325 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1459 7211 237384
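
(The gen 2 figures can be pulled out of the grepped lines with a small script along these lines; the script name gen2.py is just illustrative and the regex assumes the log format shown above:)

import re
import sys

# Pull the third number (gen 2) out of lines like:
#   "... gc: objects in each generation: 1518 5816 31794"
# Usage: grep 'gc: objects in each generation' homeserver.log | python3 gen2.py
PATTERN = re.compile(r"objects in each generation: (\d+) (\d+) (\d+)")

for line in sys.stdin:
    match = PATTERN.search(line)
    if match:
        gen0, gen1, gen2 = (int(g) for g in match.groups())
        print(gen2)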

Gen 2 objects are increasing continuously.

Something is not being garbage collected.

Is there any way to find out what that something is?

Actually, I did some customization in sync.py.

I'll test the official Synapse docker image (1.35.0) to check whether it has the same problem.

Thank you.

@richvdh
Member

richvdh commented Nov 1, 2021

I wouldn't say 200K objects over a period of several days is a huge number, particularly given your cache configuration settings. Indeed, given your event_cache_size is set to 100K, I'm surprised it is as few as that. Maybe try reducing your cache sizes and see what difference it makes.
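
If you do want to see which object types dominate the heap, a rough sketch (stdlib only; note that gc.get_objects() builds a list of every tracked object, so it is expensive and best run sparingly in the manhole):

import gc
from collections import Counter

# Count live, GC-tracked objects by type name. This walks the whole heap,
# so expect it to pause the process for a moment on a large heap.
counts = Counter(type(obj).__name__ for obj in gc.get_objects())
for name, count in counts.most_common(20):
    print(name, count)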

@hyossing
Author

hyossing commented Nov 2, 2021

Hello, Richard.

I changed the cache configuration.

  • Synapse Cache Factor: 10.0 -> 0.5 (default)
  • Event Cache Size: 100K -> 10K (default)

It seems to have had an effect.

The average GC time stays bounded at around 1.0 secs even though there are many requests (/sync, /sendToDevice).

I will keep watching the average GC time for another 2~3 days.

While looking through the Synapse code, I became curious about the cache configuration.

event_cache_size

self._get_event_cache = LruCache(
    cache_name="*getEvent*",
    max_size=hs.config.caches.event_cache_size,
)

I thought more cache would mean more performance, as long as the machine has enough memory.

But it seems event_cache_size has an optimal value and there is a trade-off between event_cache_size and GC time.

Is that right?

Synapse Cache Factor

# the pusher makes lots of concurrent SSL connections to sygnal, and
# tends to do so in batches, so we need to allow the pool to keep
# lots of idle connections around.
pool = HTTPConnectionPool(self.reactor)

# XXX: The justification for using the cache factor here is that larger instances
# will need both more cache and more connections.
# Still, this should probably be a separate dial
pool.maxPersistentPerHost = max((100 * hs.config.caches.global_factor, 5))

I thought increasing global_factor would help the overall performance of Synapse.

But here it seems to be used just for the HTTP client's request pool.

Almost all outbound HTTP requests from Synapse go to a push server (Sygnal), and pushing happens in the background.

So it seems like `global_factor` is for pusher performance.

Is that right?

// Best Regards
// Always appreciate your efforts for synapse.
// Hyosung

@reivilibre
Contributor

reivilibre commented Nov 3, 2021

I thought more cache would mean more performance, as long as the machine has enough memory.

But it seems event_cache_size has an optimal value and there is a trade-off between event_cache_size and GC time.

Is that right?

This sounds right to me. Accessing memory takes time — if you have more of it, then (making simplifications) it will take more time to access it all in order to perform garbage collection.

There is probably a sweet spot for your server between 0.5 and 10.0.

So it seems like `global_factor` is for pusher performance.

global_factor seems to control the same property as the SYNAPSE_CACHE_FACTOR environment variable (default: 0.5), which is also used to scale the sizes of Least-Recently-Used (LRU) caches and 'expiring caches'.
I think, as the comment explains, the factor was chosen to control the HTTP pusher pool because it's a good proxy for whether the homeserver deployment is big or small. The main intention is to control the cache sizes, however.
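
To put concrete numbers on the pool sizing you quoted (this is just the arithmetic from that snippet, with your two global_factor values plugged in):

# pool.maxPersistentPerHost = max((100 * global_factor, 5))
print(max((100 * 10.0, 5)))  # global_factor = 10.0 -> 1000.0 idle connections allowed per host
print(max((100 * 0.5, 5)))   # global_factor = 0.5 (default) -> 50.0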

@hyossing
Author

hyossing commented Nov 4, 2021

I got it.

I missed the code below.

Now everything makes sense.

Thank you very much.

Best Regards
Hyosung

        # Set the global one so that it's reflected in new caches
        properties.default_factor_size = self.global_factor

    def resize_all_caches(self):
        """Ensure all cache sizes are up to date

        For each cache, run the mapped callback function with either
        a specific cache factor or the default, global one.
        """
        # block other threads from modifying _CACHES while we iterate it.
        with _CACHES_LOCK:
            for cache_name, callback in _CACHES.items():
                new_factor = self.cache_factors.get(cache_name, self.global_factor)
                callback(new_factor)

@reivilibre
Contributor

Glad to help!
