many duplicate _get_state_groups_from_groups queries leading to OOMs #10301
Comments
this is currently complicated by the fact that the code does some batching of lookups. It's not obvious that the batching achieves much (at least on postgres) so we could maybe strip it out
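(As a side note for readers: below is a rough sketch of the kind of batching being referred to, assuming hypothetical helper names and a made-up chunk size; it is not Synapse's actual code. The point is only that a single logical lookup fans out into one query per chunk.)

```python
from typing import Dict, Iterable, Sequence


def chunks(items: Sequence[int], size: int) -> Iterable[Sequence[int]]:
    """Yield successive fixed-size slices of items."""
    for i in range(0, len(items), size):
        yield items[i : i + size]


def get_state_for_groups_batched(
    db, group_ids: Sequence[int], batch_size: int = 100
) -> Dict[int, dict]:
    """Illustrative fan-out: one database query per chunk of state groups.

    db.fetch_state_for_groups is a hypothetical helper standing in for the
    real storage call; a single call here issues roughly
    len(group_ids) / batch_size separate queries.
    """
    results: Dict[int, dict] = {}
    for batch in chunks(group_ids, batch_size):
        results.update(db.fetch_state_for_groups(list(batch)))
    return results
```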
Not sure if this could be related, but in EMS land we've noticed that with the last few Synapse releases, joining HQ sometimes puts a small host into an OOM loop from which it never recovers. Previously, hosts took a much longer time joining HQ and OOMed for a while trying to do it, but eventually stabilized. Lately it feels like the host sometimes never recovers until we give it more headroom in RAM. The limit we have for these kinds of small hosts is 1GB before the cluster kills the host to protect server stability. The issue we've seen lately is resolved by raising that limit to 1.5GB, which seems to be enough for it to process whatever it is trying (and otherwise immediately failing) to do.
During the last two days this has impacted my single-user server, too. I'm not in big rooms like HQ (the biggest one has around 200 users). I can't connect this to any event, like me joining a room or something; it just happened out of the blue. Since then, I can barely send messages to any rooms, and my sync requests take a really long time, if they succeed at all. It also seems to come to its senses every now and then; during these times everything works as if nothing had happened, but it doesn't take long, maybe a few minutes, before it goes back to PostgreSQL hell. I tried reindexing and vacuuming my DB hoping that it would speed up these queries, but to no avail. Until this gets fixed, is it safe (and useful) to downgrade to 1.46 (the version I used before 1.48.0)? Also, if I can help with any debug data, let me know.
The issue predates 1.46, so I wouldn't assume downgrading is going to help.
That's strange, because our company HS works just fine with 1.46. Also, does this mean there's no workaround available? Anything one can do to use Synapse until it gets fixed?
FTR this also seems to affect the WhatsApp bridge somehow, as not all my messages get forwarded to WhatsApp.
Have you verified that the queries causing you trouble are the same as the ones in the description of this issue? Although this issue existed before 1.46, it's always possible that something new has aggravated it further for you, so you're welcome to try and downgrade — we always try to keep at least one version of rollback possible. Synapse won't start up if you roll back too far, so it's harmless to try returning to 1.46 (or 1.47). It seems like you can roll back to 1.46 (and further, if you wanted to) as the database is compatible. If you'd like to try and report back, that could be useful (and if you're lucky you might get your server back, giving you some time to try and investigate what's going on).
The only thing that doesn't match in my query is that ::bigint part; mine has
@gergelypolonkai Do you think you could try rolling back to 1.46 and seeing how that works for you?
I just did that. After starting it, it feels better (at least message sending doesn't time out), but let me use it for a few hours before I jump to conclusions.
Nope, after like 15 minutes it's the same 😔 Let me know if I can help with anything else; I'm happy to help when I'm behind my keyboard.
Thanks for trying! How are you installing Synapse? (wondering in case you'd be willing to try a branch to see if something improves the situation for you)
I'm using virtualenv/pip install on bare metal. So sure, shoot me the branch name and I can easily try it.
@gergelypolonkai The branch
@reivilibre I'm also surprised that I'm affected, not just because my HS is single-user, but because all the rooms I participate in are small (<250 users) and most of them don't have a long state history. FTR, here's what I used if someone with less Python-fu wants to give it a try:
It installed smoothly and started up. I'll check back within a few hours to let you know if it looks good from my server.
Sorry for not coming back earlier; yesterday I had a terrible migraine. I can still see the query occasionally firing up (almost every time I switch rooms in Nheko). However, it feels smoother; at least sending messages isn't slowed down, which is a great win in my book.
Let's be clear: this issue is about the fact that we run multiple identical queries in parallel, which is inefficient even if the queries themselves perform correctly. If the queries aren't terminating (or if you're not seeing identical queries with identical constants in the
@richvdh: Thanks for the clarification. I also have multiple queries running, but the constants are different. I will open a new issue for this then, as they are not terminating.
@reivilibre Does your branch get updated from master occasionally? I just upgraded to 1.54 and this issue persists; can I use your branch without essentially downgrading? (Not that I mind if it does mean a downgrade, though.)
Upgrading to 1.54 (and thus reverting this change) caused a significant performance regression, so I went back to this branch.
This was biting again, so I've updated the branch to 1.57.0 (as a new branch: Edit: it seemed to help a bit, but we still had OOMing afterwards. |
We had a federation sender instance which died. On restart, it rapidly consumed all available RAM and OOMed again.
Inspection from the postgres side shows it is doing many duplicate queries of the form:
This query is _get_state_groups_from_groups, which is called from _get_state_for_groups. Although the latter has a cache, if many threads hit it at the same time, all will find the cache empty and go on to hit the database. I think we need a zero-timeout ResponseCache on _get_state_groups_from_groups.
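To make that suggestion concrete, here is a minimal sketch of the deduplication idea, with invented class and function names (this is not Synapse's actual ResponseCache): concurrent calls with the same key share one in-flight lookup, and because the effective timeout is zero the entry is forgotten as soon as the lookup completes, so only overlapping callers benefit.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Hashable, TypeVar

T = TypeVar("T")


class InFlightCache:
    """Coalesce concurrent calls with the same key onto a single lookup.

    Illustrative only: the real ResponseCache in Synapse differs in detail.
    """

    def __init__(self) -> None:
        self._in_flight: Dict[Hashable, "asyncio.Future"] = {}

    async def wrap(self, key: Hashable, fetch: Callable[[], Awaitable[T]]) -> T:
        existing = self._in_flight.get(key)
        if existing is not None:
            # An identical lookup is already running: wait for its result
            # instead of issuing another database query.
            return await existing

        task = asyncio.ensure_future(fetch())
        self._in_flight[key] = task
        try:
            return await task
        finally:
            # "Zero timeout": drop the entry as soon as the lookup finishes,
            # so later (non-overlapping) callers hit the database again.
            self._in_flight.pop(key, None)


# Hypothetical usage: deduplicate concurrent state-group lookups.
# _state_cache = InFlightCache()
#
# async def get_state_groups_from_groups(db, groups, state_filter):
#     key = (tuple(sorted(groups)), state_filter)
#     return await _state_cache.wrap(
#         key, lambda: db.fetch_state_groups(groups, state_filter)
#     )
```

The zero timeout matters because holding results any longer would just duplicate the existing higher-level cache; the only goal is to stop many simultaneous cache misses from each issuing the same expensive query.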