
Unbounded increases in GC time #11172

Closed
hyossing opened this issue Oct 25, 2021 · 11 comments
Labels
X-Needs-Info This issue is blocked awaiting information from the reporter

Comments

@hyossing

hyossing commented Oct 25, 2021

Description

After updating to 1.35.0, the average GC time has been increasing much faster than before, by around 0.5 secs per week.

  • It increased by almost 6 secs over the last 3 months, before we restarted Synapse.

[screenshot: Screen Shot 2021-10-25 at 9 01 57 AM]

I'm trying to reproduce this in my test environment.

  • So far it has only been reproduced in our commercial environment.

Do you have any ideas on how to handle this problem?

Additional Synapse Metrics

[screenshot: Screen Shot 2021-10-25 at 9 01 39 AM]

[screenshot: Screen Shot 2021-10-25 at 9 02 18 AM]

Environment Information

What version of Synapse is running?
: 1.35.0 (single instance, no workers)

Synapse Cache Factor
: 10.0

Event Cache Size
: 100k

Sync (long polling) timeout & request timeout in our client
: 10 secs & 15 secs

  • The Element client syncs with 30 secs & 108 secs.

Synapse Users
: Around 3000

Active User Devices
: 1000 Devices

The top 3 API Endpoints

  • /sync
  • /sendToDevice
  • /metrics

Platform

  • Ubuntu
    : 20.04

  • Container
    : customized Docker container FROM phusion/baseimage:0.11
    : Python 3.6

  • VM (KVM)
    : CPU cores: 24, memory: 61 GB

@richvdh
Member

richvdh commented Oct 25, 2021

related: #7176

@hyossing
Author

Hello, richvdh.

I checked, and it is the same problem as #7176.

I ran the same command you used in #7176.

However, it seems the problem is not related to zope.interface.

I took a snapshot of the objects 2 days after restarting Synapse.

[screenshot: Screen Shot 2021-10-23 at 12 05 38 PM]

@squahtx
Contributor

squahtx commented Oct 26, 2021

Can you zoom in on the Currently allocated objects graph, showing synapse-gen 2? Does synapse-gen 2 trend upwards over time, just like the average GC time?

@squahtx added the X-Needs-Info (blocked awaiting information from the reporter) label and removed the X-Needs-Discussion label on Oct 26, 2021
@hyossing
Author

Hello, Sean Quah.

Nice to meet you.

I attached two images.

  • Last 30 days.

[screenshot: Screen Shot 2021-10-27 at 5 53 25 AM]

  • 2021-10-14 (Peak duration for GC Time)

[screenshot: Screen Shot 2021-10-27 at 5 54 08 AM]

@richvdh
Member

richvdh commented Oct 29, 2021

I think your 'currently allocated objects' graph is incorrect - it's reporting the number of gen1 GCs since the last gen2 GC, which is why it is cycling in the range 0-10. You should consider updating your Grafana dashboard.

Unfortunately we don't expose the total number of active objects as a Prometheus metric. (I used a local patch to generate the graph in #7176 (comment); calculating the count requires the Python runtime to build a list of all the active objects, which is expensive.)

The easiest way to get the metric we're after here is probably to do gc.set_debug(gc.DEBUG_STATS) in the manhole, and then to grep the logs:

grep 'gc: objects in each generation' homeserver.log
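
For reference, in the manhole that is roughly:

import gc

# Print per-collection statistics (including the "objects in each generation"
# lines) to stderr; the output lands in homeserver.log, hence the grep above.
gc.set_debug(gc.DEBUG_STATS)

# To turn the debug output off again later:
gc.set_debug(0)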

@hyossing
Author

hyossing commented Nov 1, 2021

Hello, Richard.

I reproduced it in my test environment.

I ran the command gc.set_debug(gc.DEBUG_STATS).

I sampled some logs.

2021-10-30

2021-10-30 07:59:58,823 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1518 5816 31794
2021-10-30 07:59:56,699 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1621 5371 31794
2021-10-30 07:59:55,425 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1894 4666 31806
2021-10-30 07:59:54,228 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1636 3792 32428
2021-10-30 07:59:50,618 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1543 2559 33271

2021-10-31

2021-10-31 07:59:59,830 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1887 4097 162277
2021-10-31 07:59:57,592 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1618 2826 163332
2021-10-31 07:59:55,243 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1891 1290 164193
2021-10-31 07:59:52,300 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1637 0 165252
2021-10-31 07:59:50,095 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 9 9189 159962

2021-11-01

2021-11-01 08:00:00,954 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 980 0 222327
2021-11-01 07:59:59,865 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 4 0 243758
2021-11-01 07:59:59,635 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 19 8526 237368
2021-11-01 07:59:59,586 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1893 6967 237368
2021-11-01 07:59:57,325 - twisted - 258 - ERROR - sentinel- gc: objects in each generation: 1459 7211 237384
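
(The gen 2 figures can be pulled out of the grepped lines with a small script along these lines; the script name gen2.py is just illustrative and the regex assumes the log format shown above:)

import re
import sys

# Pull the third number (gen 2) out of lines like:
#   "... gc: objects in each generation: 1518 5816 31794"
# Usage: grep 'gc: objects in each generation' homeserver.log | python3 gen2.py
PATTERN = re.compile(r"objects in each generation: (\d+) (\d+) (\d+)")

for line in sys.stdin:
    match = PATTERN.search(line)
    if match:
        gen0, gen1, gen2 = (int(g) for g in match.groups())
        print(gen2)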

Gen 2 objects are increasing continuously.

Something is not being garbage collected.

Is there any way to find out what that something is?

Actually, I did some customization in sync.py.

I'll test the official Synapse docker image (1.35.0) to check whether it has the same problem.

Thank you.

@richvdh
Member

richvdh commented Nov 1, 2021

I wouldn't say 200K objects over a period of several days is a huge number, particularly given your cache configuration settings. Indeed, given your event_cache_size is set to 100K, I'm surprised it is as few as that. Maybe try reducing your cache sizes and see what difference it makes.
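
If you do want to see which object types dominate the heap, a rough sketch (stdlib only; note that gc.get_objects() builds a list of every tracked object, so it is expensive and best run sparingly in the manhole):

import gc
from collections import Counter

# Count live, GC-tracked objects by type name. This walks the whole heap,
# so expect it to pause the process for a moment on a large heap.
counts = Counter(type(obj).__name__ for obj in gc.get_objects())
for name, count in counts.most_common(20):
    print(name, count)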

@hyossing
Author

hyossing commented Nov 2, 2021

Hello, Richard.

I changed the cache configuration.

  • Synapse Cache Factor: 10.0 -> 0.5 (default)
  • Event Cache Size: 100K -> 10K (default)

It seems to have had an effect.

The average GC time stays bounded at around 1.0 secs even though there are many requests (/sync, /sendToDevice).

I will keep watching the average GC time for another 2~3 days.

While looking through the Synapse code, I became curious about the cache configuration.

event_cache_size

self._get_event_cache = LruCache(
    cache_name="*getEvent*",
    max_size=hs.config.caches.event_cache_size,
)

I thought more cache would mean more performance, as long as the machine has enough memory.

But it seems event_cache_size has an optimal value and there is a trade-off between event_cache_size and GC time.

Is that right?

Synapse Cache Factor

# the pusher makes lots of concurrent SSL connections to sygnal, and
# tends to do so in batches, so we need to allow the pool to keep
# lots of idle connections around.
pool = HTTPConnectionPool(self.reactor)

# XXX: The justification for using the cache factor here is that larger instances
# will need both more cache and more connections.
# Still, this should probably be a separate dial
pool.maxPersistentPerHost = max((100 * hs.config.caches.global_factor, 5))

I thought increasing global_factor would help the overall performance of Synapse.

But here it seems to be used just for the HTTP client's request pool.

Almost all outbound HTTP requests from Synapse go to a push server (Sygnal), and pushing happens in the background.

So it seems like `global_factor` is for pusher performance.

Is that right?

// Best Regards
// Always appreciate your efforts for synapse.
// Hyosung

@reivilibre
Contributor

reivilibre commented Nov 3, 2021

I thought more cache would mean more performance, as long as the machine has enough memory.

But it seems event_cache_size has an optimal value and there is a trade-off between event_cache_size and GC time.

Is that right?

This sounds right to me. Accessing memory takes time — if you have more of it, then (making simplifications) it will take more time to access it all in order to perform garbage collection.

There is probably a sweet spot for your server between 0.5 and 10.0.

So it seems like `global_factor` is for pusher performance.

global_factor seems to control the same property as the SYNAPSE_CACHE_FACTOR environment variable (default: 0.5), which is also used to scale the sizes of Least-Recently-Used (LRU) caches and 'expiring caches'.
I think, as the comment explains, the factor was chosen to control the HTTP pusher pool because it's a good proxy for whether the homeserver deployment is big or small. The main intention is to control the cache sizes, however.
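
To put concrete numbers on the pool sizing you quoted (this is just the arithmetic from that snippet, with your two global_factor values plugged in):

# pool.maxPersistentPerHost = max((100 * global_factor, 5))
print(max((100 * 10.0, 5)))  # global_factor = 10.0 -> 1000.0 idle connections allowed per host
print(max((100 * 0.5, 5)))   # global_factor = 0.5 (default) -> 50.0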

@hyossing
Author

hyossing commented Nov 4, 2021

I got it.

I missed the code below.

Now everything makes sense.

Thank you very much.

Best Regards
Hyosung

        # Set the global one so that it's reflected in new caches
        properties.default_factor_size = self.global_factor

    def resize_all_caches(self):
        """Ensure all cache sizes are up to date

        For each cache, run the mapped callback function with either
        a specific cache factor or the default, global one.
        """
        # block other threads from modifying _CACHES while we iterate it.
        with _CACHES_LOCK:
            for cache_name, callback in _CACHES.items():
                new_factor = self.cache_factors.get(cache_name, self.global_factor)
                callback(new_factor)

@reivilibre
Contributor

Glad to help!
