many duplicate _get_state_groups_from_groups queries leading to OOMs #10301
Comments
this is currently complicated by the fact that the code does some batching of lookups. It's not obvious that the batching achieves much (at least on postgres) so we could maybe strip it out
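(As a side note for readers: below is a rough sketch of the kind of batching being referred to, assuming hypothetical helper names and a made-up chunk size; it is not Synapse's actual code. The point is only that a single logical lookup fans out into one query per chunk.)

```python
from typing import Dict, Iterable, Sequence


def chunks(items: Sequence[int], size: int) -> Iterable[Sequence[int]]:
    """Yield successive fixed-size slices of items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


def get_state_for_groups_batched(
    db, group_ids: Sequence[int], batch_size: int = 100
) -> Dict[int, dict]:
    """Illustrative fan-out: one database query per chunk of state groups.

    db.fetch_state_for_groups is a hypothetical helper standing in for the
    real storage call; a single call here issues roughly
    len(group_ids) / batch_size separate queries.
    """
    results: Dict[int, dict] = {}
    for batch in chunks(group_ids, batch_size):
        results.update(db.fetch_state_for_groups(list(batch)))
    return results
```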
Not sure if this could be related, but in EMS land we've noticed that with the last few Synapse releases, joining HQ sometimes puts a small host into an OOM loop from which it never recovers. Previously, hosts took a much longer time joining HQ and OOMed for a while trying to do it, but eventually stabilized. Lately it feels like the host sometimes never recovers until we give it more headroom in RAM. The limit we have for these kinds of small hosts is 1GB before the cluster kills the host to protect server stability. The issue we've seen lately is resolved by raising that limit to 1.5GB, which seems to be enough for it to process whatever it is trying (and otherwise immediately failing) to do.
During the last two days this has impacted my single-user server, too. I'm not in big rooms like HQ (the biggest one has around 200 users). I can't connect this to any event, like me joining a room or something; it just happened out of the blue. Since then, I can barely send messages to any rooms, and my sync requests take a really long time, if they succeed at all. It also seems to come to its senses every now and then; during these times everything works as if nothing had happened, but it doesn't take long, maybe a few minutes, before it goes back to PostgreSQL hell. I tried reindexing and vacuuming my DB hoping that it would speed up these queries, but to no avail. Until this gets fixed, is it safe (and useful) to downgrade to 1.46 (the version I used before 1.48.0)? Also, if I can help with any debug data, let me know.
The issue predates 1.46, so I wouldn't assume downgrading is going to help.
That's strange, because our company HS works just fine with 1.46. Also, does this mean there's no workaround available? Anything one can do to use Synapse until it gets fixed?
FTR this also seems to affect the WhatsApp bridge somehow, as not all my messages get forwarded to WhatsApp.
Have you verified that the queries causing you trouble are the same as the ones in the description of this issue? Although this issue existed before 1.46, it's always possible that something new has aggravated it further for you, so you're welcome to try and downgrade — we always try to keep at least one version of rollback possible. Synapse won't start up if you roll back too far, so it's harmless to try returning to 1.46 (or 1.47). It seems like you can roll back to 1.46 (and further, if you wanted to) as the database is compatible. If you'd like to try and report back, that could be useful (and if you're lucky you might get your server back, giving you some time to try and investigate what's going on).
The only thing that doesn't match in my query is that ::bigint part; mine has
@gergelypolonkai Do you think you could try rolling back to 1.46 and seeing how that works for you?
I just did that. After starting it, it feels better (at least message sending doesn't time out), but let me use it for a few hours before I jump to conclusions.
Nope, after like 15 minutes it's the same 😔 Let me know if I can help with anything else; I'm happy to help when I'm behind my keyboard.
Thanks for trying! How are you installing Synapse? (wondering in case you'd be willing to try a branch to see if something improves the situation for you)
I'm using virtualenv/pip install on bare metal. So sure, shoot me the branch name and I can easily try it.
@gergelypolonkai The branch
@reivilibre I'm also surprised that I'm affected, not just because my HS is single-user, but because all the rooms I participate in are small (<250 users) and most of them don't have a long state history. FTR, here's what I used if someone with less Python-fu wants to give it a try:
It installed smoothly and started up. I'll check back within a few hours to let you know if it looks good from my server.
Sorry for not coming back earlier; yesterday I had a terrible migraine. I can still see the query occasionally firing up (almost every time I switch rooms in Nheko). However, it feels smoother; at least sending messages isn't slowed down, which is a great win in my book.
Let's be clear: this issue is about the fact that we run multiple identical queries in parallel, which is inefficient even if the queries themselves perform correctly. If the queries aren't terminating (or if you're not seeing identical queries with identical constants in the
@richvdh: Thanks for the clarification. I also have multiple queries running, but the constants are different. I will open a new issue for this then, as they are not terminating.
@reivilibre Does your branch get updated from master occasionally? I just upgraded to 1.54 and this issue persists; can I use your branch without essentially downgrading? (Not that I mind if it does mean a downgrade, though.)
Upgrading to 1.54 (and thus reverting this change) caused a significant performance regression, so I went back to this branch.
This was biting again, so I've updated the branch to 1.57.0 (as a new branch: Edit: it seemed to help a bit, but we still had OOMing afterwards. |
We had a federation sender instance which died. On restart, it rapidly consumed all available RAM and OOMed again.
Inspection from the postgres side shows it is doing many duplicate queries of the form:
This query is _get_state_groups_from_groups, which is called from _get_state_for_groups. Although the latter has a cache, if many threads hit it at the same time, all will find the cache empty and go on to hit the database. I think we need a zero-timeout ResponseCache on _get_state_groups_from_groups.
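To make that suggestion concrete, here is a minimal sketch of the deduplication idea, with invented class and function names (this is not Synapse's actual ResponseCache): concurrent calls with the same key share one in-flight lookup, and because the effective timeout is zero the entry is forgotten as soon as the lookup completes, so only overlapping callers benefit.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Hashable, TypeVar

T = TypeVar("T")


class InFlightCache:
    """Coalesce concurrent calls with the same key onto a single lookup.

    Illustrative only: the real ResponseCache in Synapse differs in detail.
    """

    def __init__(self) -> None:
        self._in_flight: Dict[Hashable, "asyncio.Future"] = {}

    async def wrap(self, key: Hashable, fetch: Callable[[], Awaitable[T]]) -> T:
        existing = self._in_flight.get(key)
        if existing is not None:
            # An identical lookup is already running: wait for its result
            # instead of issuing another database query.
            return await existing

        task = asyncio.ensure_future(fetch())
        self._in_flight[key] = task
        try:
            return await task
        finally:
            # "Zero timeout": drop the entry as soon as the lookup finishes,
            # so later (non-overlapping) callers hit the database again.
            self._in_flight.pop(key, None)


# Hypothetical usage: deduplicate concurrent state-group lookups.
# _state_cache = InFlightCache()
#
# async def get_state_groups_from_groups(db, groups, state_filter):
#     key = (tuple(sorted(groups)), state_filter)
#     return await _state_cache.wrap(
#         key, lambda: db.fetch_state_groups(groups, state_filter)
#     )
```

The zero timeout matters because holding results any longer would just duplicate the existing higher-level cache; the only goal is to stop many simultaneous cache misses from each issuing the same expensive query.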