This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

device_inbox never gets emptied #3599

Open
ara4n opened this issue Jul 24, 2018 · 20 comments
Labels
A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db A-Disk-Space things which fill up the disk S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@ara4n
Member

ara4n commented Jul 24, 2018

It's up to 85M rows or so on matrix.org, which seems insane. Similarly, the *_stream tables never seem to get purged, nor does device_federation_inbox/outbox.

@richvdh
Member

richvdh commented Jul 25, 2018

The main problem is that it's hard to tell when we can delete to-device messages. They don't have an expiry time on them.

@ukcb

ukcb commented Jul 27, 2018

I made a small test script for private use with PostgreSQL. Let's see if it works. :-)

#!/usr/bin/perl

use strict;
use warnings;

use DBI;
use DateTime;
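# Approach: record, in a helper table, the date on which each device_inbox
# stream_id was first seen by this script, then delete any device_inbox row
# whose stream_id was first seen more than $time_to_keep_in_days days ago.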

my $db_name = 'synapse';
my $db_host = '127.0.0.1';
my $db_port = '5432';
my $db_user = 'synapse';
my $db_pass = 'synapse';

my $time_to_keep_in_days = 7;

my $table_device_inbox = 'device_inbox';
my $table_time_indexer = 'device_inbox_timestamp_indexer';


my $datetime        = DateTime->today;
my $timestamp_today = $datetime->epoch;
$datetime->subtract( days => $time_to_keep_in_days );
my $timestamp_obsolete = $datetime->epoch;


my $dbh = DBI->connect( "dbi:Pg:dbname=$db_name;host=$db_host;port=$db_port",
    $db_user, $db_pass,
    { AutoCommit => 1, RaiseError => 1, PrintError => 1, PrintWarn => 0 } )
  or die $DBI::errstr;

$dbh->do("CREATE TABLE IF NOT EXISTS $table_time_indexer
            ( stream_id BIGINT,
              timestamp BIGINT,
              PRIMARY KEY (stream_id)
            )") or die $dbh->errstr;

my $sth = $dbh->prepare("SELECT stream_id FROM $table_device_inbox")
  or warn $dbh->errstr;
$sth->execute() or warn $dbh->errstr;

my $pool = {};
while ( my $ref = $sth->fetchrow_hashref() ) {
    $pool->{ $ref->{'stream_id'} } = 1;
}

$sth = $dbh->prepare("SELECT stream_id, timestamp FROM $table_time_indexer")
  or warn $dbh->errstr;
$sth->execute() or warn $dbh->errstr;

my $index = {};
while ( my $res = $sth->fetchrow_hashref() ) {
    $index->{ $res->{'stream_id'} } = $res->{'timestamp'};
}


# Walk every stream_id currently in device_inbox: remember new ones with
# today's timestamp, and delete those first seen before the cutoff.
foreach my $key ( keys %{$pool} ) {
    if ( exists $index->{$key} ) {
        if ( $index->{$key} < $timestamp_obsolete ) {
            delete_key_from_tables($key);
        }
    }
    else {
        insert_key_into_indexer($key);
    }
}


sub insert_key_into_indexer {
    my $key = shift;

    return if ( $key !~ / \A \d+ \z /smx );

    # insert key into table $table_time_indexer
    $sth = $dbh->prepare("INSERT INTO $table_time_indexer
                            ( stream_id, timestamp )
                          VALUES
                            ( ?, ? )
                         ") or warn $dbh->errstr;

    $sth->execute( $key, $timestamp_today ) or warn $dbh->errstr;
}


sub delete_key_from_tables {
    my $key = shift;

    return if ( $key !~ / \A \d+ \z /smx );

    # delete key in table $table_time_indexer
    $sth = $dbh->prepare("DELETE FROM $table_time_indexer
                          WHERE stream_id = ?
                         ") or warn $dbh->errstr;

    $sth->execute($key) or warn $dbh->errstr;

    # delete key in table $table_device_inbox
    $sth = $dbh->prepare("DELETE FROM $table_device_inbox
                          WHERE stream_id = ?
                         ") or warn $dbh->errstr;

    $sth->execute($key) or warn $dbh->errstr;
}

# eof

@ukcb

ukcb commented Jul 28, 2018

It would be better if the table device_inbox already had a timestamp. That would make purging a lot easier. Technically, that should not be a problem.
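Purely as an illustration of that idea: if device_inbox carried a receive timestamp, a periodic purge would be a single statement. The column name received_ts and the 7-day cutoff below are hypothetical, not part of the actual schema (which is the whole problem).

-- Hypothetical sketch only: Synapse would have to populate received_ts itself
-- when it inserts each to-device message.
ALTER TABLE device_inbox ADD COLUMN received_ts BIGINT;

DELETE FROM device_inbox
 WHERE received_ts IS NOT NULL
   AND received_ts < (EXTRACT(EPOCH FROM now()) - 7 * 24 * 3600);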

@Midek

Midek commented Dec 13, 2021

Is there currently any official solution to this?
device_inbox is currently the largest table in my DB, taking up 82GB

synapse=# SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 20;
               relation               | total_size 
--------------------------------------+------------
 public.device_inbox                  | 82 GB

@aaronraimist
Contributor

@Midek what version of Synapse are you on? Recent versions of Synapse should have deleted quite a bit.

@Midek

Midek commented Dec 13, 2021

My synapse version is:

"version": "1.48.0"

If anything, I have noticed it started growing more rapidly quite recently (so perhaps after one of the more recent updates), but I cannot pinpoint exactly when, since the table was always quite large.

//edit
Is there some other info I can provide to help debug this?
I have 3770 users, not counting the ones from the IRC and Discord bridges:

synapse=# select count(*) from users where name not like '%irc%'  and name  not like '%discord' and deactivated='0';
 count 
-------
  3770
(1 row)

And device_inbox has this many records:

synapse=# select count(*) from device_inbox;
  count   
----------
 66963488
(1 row)

And my synapse instance is using workers.

@DMRobertson
Contributor

@Midek there is a background update job in 1.47 (whose performance was improved in 1.48) which removes to_device messages for hidden and deleted devices. You could check to see what background updates are running with select * from background_updates;. Perhaps something else, e.g. the event_arbitrary_relations job, is running.

FWIW, device_inbox messages should get deleted once they're:

  • delivered to the client in a /sync, and
  • acknowledged by that client requesting another /sync with the next_batch from the first /sync response.

I would guess that to_device messages are accumulating for devices that aren't active at the moment (possibly in a large encrypted room?).

You could try looking to see which devices have all these unread messages. Something like this, perhaps, but beware: I expect this query will take a while if it's got to scan 80GB of records.

select devices.user_id, count(*) 
from device_inbox join devices using(device_id) 
group by devices.user_id
order by count desc;

I would only start investigating down this line if there are no rows in background_updates, though.

@Midek

Midek commented Dec 14, 2021

Oh, I do have event_arbitrary_relations and remove_dead_devices_from_device_inbox in background_updates.
I guess that might be why my database has been causing 3x higher load than usual for the past month.

synapse=# select * from background_updates;
              update_name              |                          progress_json                           | depends_on | ordering 
---------------------------------------+------------------------------------------------------------------+------------+----------
 remove_dead_devices_from_device_inbox | {}                                                               |            |     6508
 event_arbitrary_relations             | {"last_event_id":"$J31N5_iiqU9cEnXL5qXGbQsSop-X3uHAA1-5CSbWCEM"} |            |     6507
(2 rows)

So this is normal and should just fix itself after the background jobs are done, I guess?
Thanks for your response!

@DMRobertson
Contributor

So this is normal and should just fix itself after the background jobs are done, I guess?

I think things will get better after the remove_dead_devices_from_device_inbox job is done. But maybe not perfect: after all, to_device messages are kept on the HS so we can guarantee their delivery. Maybe there are lots of clients out there who haven't synced in a while and have a large inbox.

As @richvdh notes, there's no inherent mechanism for expiring these messages if the client doesn't acknowledge their receipt. Perhaps there ought to be? Unsure: there may be ramifications for E2EE.

@Midek

Midek commented Dec 22, 2021

The remove_dead_devices_from_device_inbox background job has finished, and instead of 66963488 there are now 47965078 entries in the device_inbox table.
That's better, but still disproportionately large compared to the other tables in the DB.
After running the query

select devices.user_id, count(*) 
from device_inbox join devices using(device_id) 
group by devices.user_id
order by count desc;

I have noticed that there are multiple users with over 1 million rows each.
Many of them seem inactive.
What would be the consequence of deleting those entries? Would that result in them simply being unable to decrypt some messages? Or could it lead to more serious breakage?

@clokep clokep added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. S-Minor Blocks non-critical functionality, workarounds exist. and removed z-p2 (Deprecated Label) z-minor (Deprecated Label) labels Feb 1, 2022
@acheng-floyd

This comment was marked as off-topic.

@dklimpel

This comment was marked as off-topic.

@acheng-floyd

This comment was marked as off-topic.

@acheng-floyd

acheng-floyd commented Jun 4, 2022

[screenshot]
I've noticed that there are always DELETE operations for two particular users on my server. These two accounts are shared by many people, which means many people may log in at the same time; I've also noticed that these people log in with the Element web client. Is that the reason there are so many DELETE operations for different devices on the device_inbox table?

Even if these DELETE operations are taking effect, that doesn't explain why the table still grows.

[screenshot]
The table has now grown to 78M rows. Could I just delete all the data in this table and restart the Synapse server? What would happen if I did that?

@richvdh
Member

richvdh commented Jun 6, 2022

@acheng-floyd your questions are unrelated to this issue, but in short, those DELETE queries look like normal behaviour.

Deleting entries from device_inbox isn't particularly recommended; at the very least, it has the potential to cause problems with encrypted messaging for any currently-logged-in devices.

@anoadragon453
Member

anoadragon453 commented Nov 17, 2022

I have noticed that there are multiple users with over 1 million rows each.
Many of them seem inactive.
What would be the consequence of deleting those entries? Would that result in them simply being unable to decrypt some messages? Or could it lead to more serious breakage?

If the recipient devices are deleted (there is an Admin API for doing so) then the corresponding to-device messages will be deleted as well. Likewise if the user is completely deactivated - which in turn erases all of their devices.

On just removing the entries from the table and leaving the devices in place however - there shouldn't be much that breaks other than the ability to decrypt past messages, no. If the ability to decrypt past messages is definitely not a concern for the accounts in question (know that sometimes people come back to their accounts after years of inactivity!), then feel free to remove them to save space.

While the rows in device_inbox do not have a timestamp, they do have a stream_id. This is a value that increments for every to-device message sent to all users. If you are cautious about preserving new messages and only want to delete older ones to save space, you could delete entries by partitioning them at the median stream_id for that device_id.
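A rough, untested sketch of that idea for a single device (the :target_user and :target_device placeholders are mine; it assumes the device_inbox columns user_id, device_id and stream_id referred to above, and you should back up the database before running anything like this):

-- Delete the older half of one device's queued to-device messages,
-- partitioning at the median stream_id for that device.
WITH median AS (
    SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY stream_id) AS mid
    FROM device_inbox
    WHERE user_id = :target_user AND device_id = :target_device
)
DELETE FROM device_inbox
 WHERE user_id = :target_user
   AND device_id = :target_device
   AND stream_id < (SELECT mid FROM median);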

One last note: @DMRobertson is right in that to-device messages can accumulate for accounts that just sit in encrypted rooms all day, don't /sync and have to-device messages with session keys continuously sent to them. If you have a bot that is doing this and doesn't need to read (encrypted) messages, consider removing any encryption-enabled devices from that account (log out any devices that were created by logging in through an E2EE matrix client). Other clients will then no longer send you keys via to-device messages to decrypt room messages with. Or, just call /sync periodically while your bot is running to clear out the backlog.

And finally: encourage your users to log out old devices that they are no longer using. Chances are these are the ones that are racking up a backlog of to-device messages to begin with.

@MadLittleMods MadLittleMods added A-Disk-Space things which fill up the disk A-Database DB stuff like queries, migrations, new/remove columns, indexes, unexpected entries in the db labels Dec 8, 2022
@ara4n
Member Author

ara4n commented Jan 4, 2023

It's now up to 3.6 billion rows...

@kegsay
Member

kegsay commented Jan 4, 2023

From my understanding of to-device messages for sliding sync, the vast majority (94%) of these messages are m.room_key_request, which come in two flavours: request and request cancellation. These events, AIUI, are sent when a new device for the same user logs in and then requests keys for encrypted messages as a form of gossiping. It should be fine to delete these events and cause minimal disruption in the process. Best case, the request was satisfied by another device and has since been cancelled, which would cause zero impact and is what MSC3944 is advocating. Worst case, the key isn't sent when requested and the new device cannot read that particular message, though if users actually used a proper key management solution, e.g. backing up keys, this would not be needed.
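If you want to check the breakdown on your own instance, something like this should work on PostgreSQL (a sketch that assumes message_json stores the event JSON with a top-level "type" field; expect a slow sequential scan on a table this size):

-- Count queued to-device messages by event type.
SELECT message_json::jsonb ->> 'type' AS type,
       count(*) AS n
  FROM device_inbox
 GROUP BY 1
 ORDER BY n DESC;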

@ara4n
Member Author

ara4n commented Jan 8, 2023

The disk wastage here is crazy. Can I suggest we add a bg job to:

  • delete all key reqs/cancels older than a week or so (rough sketch after this list)
  • log out devices which have been idle for more than 3 months or so, for accts where the user is backing up their keys (to avoid risk of losing their keys) and keeping a single device around (the newest one) so they still have a hope of receiving msgs while they were gone?
    • this would be easier if we had dehydrated devices properly rolled out, as all the non-dehydrated devices could be logged out, without fear of losing msgs
    • this would also be easier with SS everywhere, given login wouldn't be frustratingly slow.
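For the first bullet, a rough, untested sketch of what such a cleanup could look like (it assumes message_json holds the event JSON with a top-level "type" field; device_inbox rows carry no timestamp, so an "older than a week" cutoff would have to be approximated with a stream_id threshold instead):

-- Drop queued m.room_key_request to-device messages (both key requests
-- and their cancellations share this event type).
DELETE FROM device_inbox
 WHERE message_json::jsonb ->> 'type' = 'm.room_key_request';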

@RhysRdm

RhysRdm commented Mar 13, 2023

Was there any resolution to this problem? I see the same thing on my install.
Anything I can help with?
