Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Improve performance of remove_hidden_devices_from_device_inbox #11420

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog.d/11420.bugfix
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
Fix performance issues with the `remove_hidden_devices_from_device_inbox` database background update.
23 changes: 10 additions & 13 deletions synapse/storage/databases/main/deviceinbox.py
Original file line number Diff line number Diff line change
Expand Up @@ -737,31 +737,28 @@ async def _remove_hidden_devices_from_device_inbox(
def _remove_hidden_devices_from_device_inbox_txn(
txn: LoggingTransaction,
) -> int:
"""stream_id is not unique
we need to use an inclusive `stream_id >= ?` clause,
since we might not have deleted all hidden device messages for the stream_id
returned from the previous query
"""Retrieve the list of hidden devices that appear in the device_inbox table,
ordered by user_id so the query uses the index on the devices table.

Then delete only rows matching the `(user_id, device_id, stream_id)` tuple,
to avoid problems of deleting a large number of rows all at once
due to a single device having lots of device messages.
"""

last_stream_id = progress.get("stream_id", 0)
last_user_id = progress.get("user_id", "")
babolivier marked this conversation as resolved.
Show resolved Hide resolved

sql = """
SELECT device_id, user_id, stream_id
FROM device_inbox
SELECT user_id, device_id, stream_id
FROM devices
INNER JOIN device_inbox USING (device_id, user_id)
WHERE
stream_id >= ?
AND (device_id, user_id) IN (
SELECT device_id, user_id FROM devices WHERE hidden = ?
)
ORDER BY stream_id
user_id >= ?
AND hidden = ?
ORDER BY user_id
LIMIT ?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this limit will mean that we may not find all the device_inbox rows for a given user - we may only end up deleting half of them before deciding to move onto the next user.

I think a better strategy is to remove the join against device_inbox, and just look for hidden devices, without worrying about whether they have device_inbox rows. Then, do a DELETE FROM device_inbox for each such device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to do this initially, but the original docstring mentions that it might be a bit heavy if the device has tons of pending messages in its inbox. For example, on abolivier.bzh's database, I've got devices with over 60k rows in device_inbox (which aren't hidden devices but that's probably because I've already run the previous incarnation of this update), which sounds like a lot to delete in one go.

On top of that, I don't see how we might not delete entries for all devices for a given user, given the condition in the query is user_id >= ?, so if we don't do every device from a user we should just do the rest in the next run?

Copy link
Member

@richvdh richvdh Nov 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On top of that, I don't see how we might not delete entries for all devices for a given user, given the condition in the query is user_id >= ?, so if we don't do every device from a user we should just do the rest in the next run?

oh sorry, you're right.

However, from your analysis in the PR comment:

         ->  Index Only Scan using device_inbox_user_stream_id on device_inbox  (cost=0.54..100217.89 rows=109879 width=42) (actual time=0.141..92.477 rows=101229 loops=1)
               Heap Fetches: 6968

this is no good. It's a scan of the entirety of the device_inbox_user_stream_id index: there is no index cond here. It's acceptable for your local db with 100k rows, but won't be for a larger db.

Honestly, looking at this again, I think we're better off rewriting it again (sorry!) to do the same as #11421 (ie, walk through the device_inbox table for a sequence of stream_ids. Hell, why not combine it with #11421?

other ideas...

  • we can probably get away with deleting all the messages for a device at once, even if its in the 10s of thousands, though it gets really nasty given we'll be considering 100 devices on the first pass.
  • maybe a two-stage loop: first find a device, then work through a range of stream_ids for that device on each run of the bg update. That gets annoyingly fiddly in terms of tracking state, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think you're right here, let's combine it with #11421

"""

txn.execute(sql, (last_stream_id, True, batch_size))
txn.execute(sql, (last_user_id, True, batch_size))
rows = txn.fetchall()

num_deleted = 0
Expand Down