
Sweep up old federated toots from local DB #1554

Closed
Floppy opened this issue Apr 11, 2017 · 16 comments · Fixed by #10063

@Floppy
Contributor

Floppy commented Apr 11, 2017

#875 talks about removing old toots for users, but this is more about keeping down DB size on instances. Would it be sensible to sweep up old toots from the fediverse on a regular basis?

Perhaps any toot older than X days (configurable) that doesn't mention a local user and wasn't boosted or replied to by a local user could be removed by a rake task similar to rake mastodon:feeds:clear.
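Something along these lines, perhaps (purely a sketch: the task name, env var, and association names are invented, and the "replied to by a local user" check is left out for brevity):

```ruby
# lib/tasks/mastodon.rake -- sketch only; names here are illustrative.
namespace :mastodon do
  namespace :statuses do
    desc 'Remove old federated statuses no local user has interacted with'
    task sweep: :environment do
      max_age = ENV.fetch('SWEEP_MAX_AGE_DAYS', '14').to_i.days

      Status
        .joins(:account)
        .where.not(accounts: { domain: nil })           # remote authors only
        .where('statuses.created_at < ?', max_age.ago)  # older than the TTL
        .find_each do |status|
          # Skip anything a local user has touched in some way.
          next if status.mentions.joins(:account).where(accounts: { domain: nil }).exists?
          next if status.favourites.joins(:account).where(accounts: { domain: nil }).exists?
          next if status.reblogs.joins(:account).where(accounts: { domain: nil }).exists?

          status.destroy
        end
    end
  end
end
```

An admin could then run it from cron on whatever schedule suits their instance, much like rake mastodon:feeds:clear.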

Any thoughts?


  • I searched or browsed the repo’s other issues to ensure this is not a duplicate.
@alastairhm

Maybe have a TTL setting for federated messages which haven't been "touched" by a local user?

@Fastidious

Feeds are fine being ephemeral, but toots are not. I don't want to delete anyone's toots, nor have mine deleted.

@wxcafe
Contributor

wxcafe commented Apr 15, 2017

Deleting federated toots from an instance's local DB doesn't delete those toots from their home instance; they still exist there.

This would be an interesting feature; the software could simply fetch the remote toots again if/when they're accessed.

@wxcafe changed the title from "Sweep up old federated toots" to "Sweep up old federated toots from local DB" on Apr 15, 2017
@Gargron
Member

Gargron commented Apr 15, 2017

Self-hosting means sustainability, and an infinitely growing DB is not sustainable, so I agree that this is needed. We just need a good default strategy. It might not be desirable to just delete all remote toots. What about local users' favourites, reblogs, or conversations? So we should delete only things not touched by local users.

Alternatively, delete all content older than X, including local content. That's something that could be debated.

@Floppy
Contributor Author

Floppy commented Apr 15, 2017

Thanks @Gargron, I quite agree - in theory an instance could end up with every toot from the entire fediverse in it, which would be pretty unscalable. I agree that removing things not "touched" in some way would be a sensible default.

Presumably different instances could have different policies, too.

@danhunsaker
Contributor

As far as having different policies goes, those would have to be implemented in any change to the code. That means either the instance owner needs to make those changes themselves, or the feature proposed here needs to include some configurability to account for multiple approaches to this sort of cleanup. That in turn means deciding what should be configurable, and in what ways. Suggested so far:

  • Federated only vs federated and local
  • TTL (how old it needs to be to consider sweeping it away; possibly configured separately based on local vs federated origins)
  • Whether interactions prevent sweeping, and which interactions qualify; if local toots are sweepable, they should be configured separately from federated ones, and consider interactions from the Fediverse in addition to local ones.

As to defaults, the current consensus seems to be:

  • Federated only
  • Two weeks? This one hasn't really been discussed yet, but that's a default I've seen used fairly commonly.
  • All interactions prevent sweeping (likes, boosts, replies)
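
Expressed as a purely illustrative set of defaults (every constant and env var name here is made up), that might be something like:

```ruby
# Hypothetical defaults -- names and env vars are illustrative only.
SWEEP_DEFAULTS = {
  include_local:      ENV.fetch('SWEEP_INCLUDE_LOCAL', 'false') == 'true',  # federated only by default
  federated_ttl_days: ENV.fetch('SWEEP_FEDERATED_TTL_DAYS', '14').to_i,     # "two weeks" placeholder
  local_ttl_days:     ENV.fetch('SWEEP_LOCAL_TTL_DAYS', '90').to_i,         # only used if include_local
  protected_interactions: %i[favourite reblog reply mention]                # any of these prevents sweeping
}.freeze
```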

Did I miss anything?

@wxcafe
Contributor

wxcafe commented Apr 16, 2017

Looks good to me. I'd reduce the timeout to something like 2 days, but maybe I'm being too aggressive.

@danhunsaker
Contributor

Optimally, we'd have some metrics to look at, such as average consumption per week, day, maybe even hour, and set a default based on that. But short of a survey to collect such data, I'm not sure it's possible to get. Larger instances (or rather, those with more external content flowing in) would of course want lower values than smaller ones (with lower inbound traffic) might.

I think 2 days is perfectly reasonable for high-inbound-traffic instances, or those with excessively small disk space, but that the default might be ok being a bit higher. (Maybe not as high as 2 weeks, though. 5 days? 7?)

@bratta

bratta commented Apr 17, 2017

Personally, I think this would work well as a rake task, leaving it up to the instance admin to schedule it via cron. The documentation repository could contain some suggestions like those outlined above (e.g. 1-2 weeks for smaller instances, 2 days for larger instances).

@nightpool
Member

nightpool commented Apr 17, 2017

Would toots that users have boosted get deleted? What about ones that users have liked? Ones that are replies to users' statuses? We need to make sure that this isn't noticeable in its effect. (Missed this; @Gargron mentions it above.)

Another downside of this is that it could make remote users look like they haven't tooted at all when they in fact have. It also removes the redundancies inherent in a federated system. It also makes slower hashtags effectively useless. (do tag streams currently expire after 7 days? that's probably also an issue for slower tags)

This should definitely be a rake task, and it should be opt-in and up to the admin to schedule when it happens.

@danhunsaker
Contributor

danhunsaker commented Apr 17, 2017

@bratta The frequency at which the task runs is separate from the age at which it sweeps toots. We were discussing the age setting above (frequently called a TTL, or Time To Live, in technical contexts), not the frequency. I definitely agree that the docs should mention the task, what it does, and recommend various cron schedules (and associated TTLs) for various amounts of traffic/load. Thanks for that suggestion! 👍

@nightpool Toots which have been interacted with have been mentioned in various ways since the very first post. I assume from the way it's been rephrased so often that not only is it an important consideration (which we agree it is), but also that it needs to be very clearly communicated in the docs that toots with local interactions will not be affected (unless configured otherwise). Vital feedback. 👍

There are already a number of other situations which prompt users to check remote instances to get a better picture of a user's activity, so I'm not sure "they'll look inactive" is as big an issue as it seems. If nothing else, any toots that have been interacted with locally would still remain available locally. And of course, unless they've gone silent for X days (plus the time since the last sweep), their most recent content would still be available locally, too. So it's probably something to keep in mind, but I'm not convinced it's a blocker for implementation.

While you're correct about reduced redundancy, the benefit is not having your instance run out of disk space and stop working entirely. This issue is all about a trade-off: reduced disk usage at the cost of redundancy. It would also be interesting to consider alternative solutions for conserving disk space, though.

Tags are an interesting facet of this which we really should consider and discuss further. I suspect there's already significant fragmentation affecting tag streams in the first place - if my instance hasn't pulled a given toot with that tag, say because none of my users follow the account which posted it, it won't show up in my results anyway. So that's something to figure out. 👍

@ALL Finally, just to be clear, this is absolutely intended to be a rake task, which automatically means instance admins will be in charge of scheduling sweeps, and must opt-in to run them at all. I feel we all agree these elements are vital.

@ghost

ghost commented Mar 29, 2018

The network has grown quite a bit since the last time this was commented on; is this still planned?

@Gargron
Member

Gargron commented Jul 16, 2018

At the time of writing the mastodon.social statuses table has 52,913,648 rows, and once we introduce federation relays, that number is only going to go up faster. So a solution is needed.

Unfortunately, the lines around what can be deleted are blurry, and finding such items via SQL is far from simple. A naive algorithm (roughly sketched below) might be:

  1. Find all remote users with 0 local followers
  2. Select all toots by those users and queue them into sidekiq for individual checks
  3. In a check, consider:
    • reblogged/faved by local user?
    • mentions local user?
    • in a thread where a local user participates?
    • if all no, then delete
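
A very rough sketch of that per-status check as a worker might look like this (model, scope, and association names are assumptions, not actual code):

```ruby
# Very rough sketch only -- model and association names are assumptions.
# Enqueue side (steps 1-2): for each remote account with no local followers,
#   account.statuses.find_each { |s| SweepStatusWorker.perform_async(s.id) }
class SweepStatusWorker
  include Sidekiq::Worker
  sidekiq_options queue: :pull   # hypothetical queue choice

  def perform(status_id)
    status = Status.find_by(id: status_id)
    return if status.nil?

    # Keep anything reblogged or faved by a local user.
    return if status.reblogs.joins(:account).where(accounts: { domain: nil }).exists?
    return if status.favourites.joins(:account).where(accounts: { domain: nil }).exists?
    # Keep anything that mentions a local user.
    return if status.mentions.joins(:account).where(accounts: { domain: nil }).exists?
    # Keep anything in a thread a local user participates in.
    return if thread_has_local_participant?(status)

    status.destroy
  end

  private

  # Assumes statuses in the same thread share a conversation_id.
  def thread_has_local_participant?(status)
    return false if status.conversation_id.nil?

    Status.where(conversation_id: status.conversation_id)
          .joins(:account)
          .where(accounts: { domain: nil })
          .exists?
  end
end
```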

However, 50 million sidekiq jobs will be backlogged for a massive amount of time. And the checks aren't even complete: would you want to keep a status if it's reblogged by a remote user who does have local followers? What about threads involving remote users with local followers?

IPFS also has this issue, but it has the concept of explicit pinning, and everything that is not pinned by a node gets periodically flushed out. I think it might be worth adding a data structure for explicit pinning of statuses. A local user replying to, faving, or reblogging a remote status would then explicitly pin it. That would make sweeping unpinned ones a lot easier.
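
As a sketch only (the table, columns, and query below are invented for illustration, not Mastodon's schema), the pinning structure could be something like:

```ruby
# Hypothetical "status_retentions" table recording why a remote status is
# pinned locally. Everything here is illustrative.
class CreateStatusRetentions < ActiveRecord::Migration[5.2]
  def change
    create_table :status_retentions do |t|
      t.references :status, null: false, foreign_key: { on_delete: :cascade }, index: { unique: true }
      t.string :reason, null: false # e.g. 'favourite', 'reblog', 'reply', 'mention'
      t.timestamps
    end
  end
end

# A sweep could then drop old remote statuses that have no retention row, e.g.:
#   Status.joins(:account)
#         .where.not(accounts: { domain: nil })
#         .where('statuses.created_at < ?', 14.days.ago)
#         .where.not(id: StatusRetention.select(:status_id))
#         .in_batches(&:destroy_all)
```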

However, even that approach does not help with already stored statuses, we would still need a one-time migration procedure to pin what's necessary.

I need help figuring this out.

@progval
Contributor

progval commented Jul 16, 2018

Rather than deleting them, would it be possible to compact them? For example, remove them from indexes and move them to different storage (e.g. one file per user)?
That approach would be less risky: it would only slow down access to these toots (which is not that bad, as they wouldn't appear in searches anyway), and it would be recoverable in case the algorithm had a bug.
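
A rough sketch of what that compaction could look like, assuming one newline-delimited JSON file per account (paths and attribute handling are illustrative assumptions, not an actual Mastodon API):

```ruby
# Append swept statuses to a per-account archive file, then remove the rows.
require 'fileutils'
require 'json'

ARCHIVE_ROOT = ENV.fetch('STATUS_ARCHIVE_ROOT', '/var/lib/mastodon/archive')

def archive_and_remove(statuses)
  statuses.group_by(&:account_id).each do |account_id, batch|
    dir = File.join(ARCHIVE_ROOT, account_id.to_s)
    FileUtils.mkdir_p(dir)

    File.open(File.join(dir, 'statuses.ndjson'), 'a') do |file|
      batch.each { |status| file.puts(status.attributes.to_json) }
    end
  end

  Status.where(id: statuses.map(&:id)).delete_all
end
```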

@danhunsaker
Contributor

I'm very much in favor of a secondary, long-term storage option. Cleaning things out of the main DB which haven't been interacted with recently is a great idea, but I always favor moving them elsewhere over removing them entirely, unless we're talking about a complete suspension or deletion operation which was manually triggered by the admin or the user (respectively). Things which aren't explicitly requested to be removed should be archived.

I know my stance may not be widely accepted, though, so I propose something a bit more involved: the instance admin would have configuration options for what to archive, and how to archive it. The exact settings for the "what" seem to be the main question here, so I'll come back to those in a second, but as to the "how", I'd propose three options:

  • First, and the default, the filesystem, as instance/user/post.json or similar, for relatively quick management of individual posts.
  • Second, a data source which can be defined and set up via environment variables, similar to how the main DB operates, but preferably with less dependence on a specific DBMS, since Rails abstracts a lot of that anyway.
  • Third, "the abyss", "the void", or "/dev/null": an option to simply delete selected posts entirely rather than archive them anywhere meaningful.

As to the what, I think "no local interactions in the last X days", as described above, is a perfectly acceptable default setting. If the archival option is adopted, I might go slightly more aggressive with it and say "no local interactions in the last X days, and no recorded interactions in the last Y days", where both X and Y are some sane default, such as X=30 and Y=90, but adjustable by the instance admin according to what makes most sense for their users. This will archive local toots, too, but they'll still be accessible when referenced directly, and long-lived threads that are still active will still be in the main, quick-access DB.
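
To make those two thresholds concrete, here's a purely hypothetical selection (the timestamp columns are made up; whether anything like them exists is exactly the question below):

```ruby
# Hypothetical columns tracking when a status was last touched locally / at all.
ARCHIVE_AFTER_LOCAL = ENV.fetch('ARCHIVE_LOCAL_IDLE_DAYS', '30').to_i.days
ARCHIVE_AFTER_ANY   = ENV.fetch('ARCHIVE_ANY_IDLE_DAYS', '90').to_i.days

archivable = Status
  .where('last_local_interaction_at IS NULL OR last_local_interaction_at < ?', ARCHIVE_AFTER_LOCAL.ago)
  .where('last_interaction_at IS NULL OR last_interaction_at < ?', ARCHIVE_AFTER_ANY.ago)
```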

Of course, how do we tell how recently something was interacted with? Does the updated timestamp get updated with an interaction, or ... ? I haven't been in the DB in a while to see for sure how that works.

@deutrino

#34 really needs some attention if we are going to start potentially deleting old toots from remote users.
