Sweep up old federated toots from local DB #1554
Maybe have a TTL setting for federated messages which haven't been "touched" by a local user?
Feeds are fine being ephemeral, but toots are not. I don't want to delete, nor have mine deleted, any toots.
Deleting toots from the fediverse in the DB of an instance doesn't delete these toots from their home instance. They are not deleted. This would be an interesting feature: the software could simply fetch the remote toots if/when they're accessed again.
Self-hosting means sustainability, and an infinitely growing DB is not sustainable, so I agree that this is needed. We just need a good default strategy. It might not be desirable to just delete all remote toots. What about local users' favourites, reblogs, or conversations? So we should delete only things not touched by local users. Alternatively, delete all content, including local content, that is older than x. That's something that could be debated.
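To make the "delete only things not touched by local users" default concrete, here is a minimal sketch of the selection rule. It is not Mastodon code: the `Status` record, its `touched_by_local` flag, and the `sweepable` helper are all hypothetical names, and the 14-day TTL is only an example value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Status:
    # Hypothetical record, not Mastodon's actual schema.
    id: int
    local: bool              # authored on this instance
    created_at: datetime
    touched_by_local: bool   # favourited, reblogged, or replied to by a local user


def sweepable(statuses, ttl, now):
    """Return remote statuses older than `ttl` that no local user has touched."""
    cutoff = now - ttl
    return [
        s for s in statuses
        if not s.local and not s.touched_by_local and s.created_at < cutoff
    ]


now = datetime(2017, 6, 1, tzinfo=timezone.utc)
statuses = [
    Status(1, local=True, created_at=now - timedelta(days=30), touched_by_local=False),
    Status(2, local=False, created_at=now - timedelta(days=30), touched_by_local=True),
    Status(3, local=False, created_at=now - timedelta(days=30), touched_by_local=False),
    Status(4, local=False, created_at=now - timedelta(days=1), touched_by_local=False),
]

# Only status 3 is remote, untouched, and older than the TTL.
print([s.id for s in sweepable(statuses, ttl=timedelta(days=14), now=now)])
```

The alternative policy mentioned above (delete everything older than x, local content included) would simply drop the first two conditions from the filter.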
Thanks @Gargron, I quite agree - in theory an instance could end up with every toot from the entire fediverse in it, which would be pretty unscalable. I agree that removing things not "touched" in some way would be a sensible default. Presumably different instances could have different policies, too.
As far as having other policies, those would have to be implemented in any change to the code. That means either the instance owner needs to make those changes themself, or the feature proposed here needs to include some configurability to account for multiple approaches to handling this sort of cleanup. Which means deciding what should be configurable, and in what ways. So far suggested:
As to defaults, the current consensus seems to be:
Did I miss anything?
Looks good to me. I'd reduce the timeout to something like 2 days but maybe I'm being too
Optimally, we'd have some metrics to look at, such as average consumption per week, day, maybe even hour, and set a default based on that. But short of a survey to collect such data, I'm not sure it's possible to get. Larger instances (or rather, those with more external content flowing in) would of course want lower values than smaller ones (with lower inbound traffic) might. I think 2 days is perfectly reasonable for high-inbound-traffic instances, or those with excessively small disk space, but that the default might be ok being a bit higher. (Maybe not as high as 2 weeks, though. 5 days? 7?)
Personally I think this would be good as a rake task, leaving it up to the instance admin when to schedule it via cron. The documentation repository could contain some suggestions like what has been outlined above (e.g. 1-2 weeks for smaller instances, 2 days for larger instances).
Another downside of this is that it could make remote users look like they haven't tooted at all when they in fact have. It also removes the redundancies inherent in a federated system. It also makes slower hashtags effectively useless. (Do tag streams currently expire after 7 days? That's probably also an issue for slower tags.) This should definitely be a rake task, and it should be opt-in and up to the admin to schedule when it happens.
@bratta The frequency at which the task runs is separate from the age at which it sweeps toots. We were discussing the age setting (frequently called a TTL (Time To Live), when it comes up in technical scenarios), not the frequency, above. I definitely agree that the docs should mention the task, what it does, and recommend various cron schedules (and associated TTLs) for various amounts of traffic/load. Thanks for that suggestion! 👍

@nightpool Toots which have been interacted with have been mentioned in various ways since the very first post. I assume from the way it's been rephrased so often that not only is it an important consideration (which we agree that it is), but also it needs to be very clearly communicated in the docs that toots with local interactions will not be affected (unless configured otherwise). Vital feedback. 👍

There are already a number of other situations which prompt users to check remote instances to get a better picture of a user's activity, so I'm not sure "they'll look inactive" is as big an issue as it seems. If nothing else, any toots that have been interacted with locally would still remain available locally. And of course, unless they've gone silent for X days (plus the time since the last sweep), their most recent content would still be available locally, too. So it's probably something to keep in mind, but I'm not convinced it's a blocker for implementation.

While you're correct about reduced redundancy, the benefit is not having your instance run out of disk space and stop working entirely. This issue is all about a trade-off, reduced disk usage at the cost of redundancy. It would also be interesting to consider alternative solutions, though, for conserving disk space.

Tags are an interesting facet of this which we really should consider and discuss further. I suspect there's already significant fragmentation affecting tag streams in the first place - if my instance hasn't pulled a given toot with that tag, say because none of my users follow the account which posted it, it won't show up in my results anyway. So that's something to figure out. 👍

@ALL Finally, just to be clear, this is absolutely intended to be a
The network has grown quite a bit since the last time this was commented on. Is this still planned?
At the time of writing the mastodon.social statuses table has 52,913,648 rows, and once we introduce federation relays, that number is only going to go up faster. So a solution is needed. Unfortunately, lines between what can be deleted are blurry, and finding such items over SQL is far from simple. A naive algorithm might be:
However, 50 million Sidekiq jobs will be backlogged for a massive amount of time. And the checks aren't even complete, because would you want to keep a status if it's reblogged by a remote user who does have followers? What about threads with remote users with local followers? IPFS also has this issue, but they have the concept of explicit pinning, and everything that is not pinned by a node gets periodically flushed out. I think that it might be worth it to add a data structure for explicit pinning of statuses. So a local user replying/faving/reblogging a remote status would explicitly pin it. That would make sweeping unpinned ones a lot easier. However, even that approach does not help with already stored statuses; we would still need a one-time migration procedure to pin what's necessary. I need help figuring this out.
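The explicit-pinning idea could look roughly like the sketch below. Everything in it is an assumption for illustration: `PinStore`, the reason strings, and `sweep_unpinned` are hypothetical names, not an existing Mastodon or IPFS API. Grouping ids into batches is also an added suggestion, not something proposed in this thread, aimed at avoiding one job per status.

```python
class PinStore:
    """Hypothetical in-memory pin table: status id -> set of pin reasons."""

    def __init__(self):
        self._pins = {}

    def pin(self, status_id, reason):
        # A status may be pinned for several reasons (fav, reblog, reply).
        self._pins.setdefault(status_id, set()).add(reason)

    def unpin(self, status_id, reason):
        reasons = self._pins.get(status_id)
        if reasons is None:
            return
        reasons.discard(reason)
        if not reasons:
            # The last reason is gone, so the status becomes sweepable again.
            del self._pins[status_id]

    def is_pinned(self, status_id):
        return status_id in self._pins


def sweep_unpinned(status_ids, pins, batch_size=1000):
    """Yield lists of unpinned status ids, at most `batch_size` per list."""
    batch = []
    for status_id in status_ids:
        if pins.is_pinned(status_id):
            continue
        batch.append(status_id)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


pins = PinStore()
pins.pin(2, "fav:alice")
pins.pin(2, "reblog:bob")
pins.pin(5, "reply:alice")
pins.unpin(5, "reply:alice")  # alice deleted her reply, so 5 is unpinned again

batches = list(sweep_unpinned(range(1, 7), pins, batch_size=2))
print(batches)  # [[1, 3], [4, 5], [6]]
```

Keeping a set of reasons per status, rather than a single flag, means an unfavourite does not unpin a status that another local user has also reblogged. The one-time migration mentioned above would amount to populating such a table from existing favourites, reblogs, and replies.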
Rather than deleting them, would it be possible to compact them? Like, remove them from indexes, move them to a different storage (e.g. one file per user), ...?
I'm very much in favor of a secondary, long-term storage option. Cleaning things out of the main DB which haven't been interacted with recently is a great idea, but I always favor moving them elsewhere over removing them entirely, unless we're talking about a complete suspension or deletion operation which was manually triggered by the admin or the user (respectively). Things which aren't explicitly requested to be removed should be archived. I know my stance may not be widely accepted, though, so I propose something a bit more involved.

The instance admin would have configuration options for what to archive, and how to archive it. The exact settings for the "what" seem to be the main question here, so I'll come back to those in a second, but as to the how, I'd propose three options. First, and the default, the filesystem, as

As to the what, I think "no local interactions in the last X days", as described above, is a perfectly acceptable default setting. If the archival option is adopted, I might go slightly more aggressive with it and say "no local interactions in the last X days, and no recorded interactions in the last Y days", where both X and Y are some sane default, such as X=30 and Y=90, but adjustable by the instance admin according to what makes most sense for their users. This will archive local toots, too, but they'll still be accessible when referenced directly, and long-lived threads that are still active will still be in the main, quick-access DB.

Of course, how do we tell how recently something was interacted with? Does the
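As a rough illustration of the "X days without local interaction and Y days without any interaction" rule, the sketch below assumes each status carries two last-interaction timestamps. Whether such timestamps exist or would need to be added is exactly the open question above, so `last_local`, `last_any`, `ArchivePolicy`, and `should_archive` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class ArchivePolicy:
    local_idle_days: int = 30  # X: days without a local interaction
    any_idle_days: int = 90    # Y: days without any recorded interaction


def should_archive(
    created_at: datetime,
    last_local: Optional[datetime],
    last_any: Optional[datetime],
    now: datetime,
    policy: ArchivePolicy = ArchivePolicy(),
) -> bool:
    """Archive only when both idle thresholds are exceeded.

    A status that was never interacted with is measured from its creation time.
    """
    local_ref = last_local or created_at
    any_ref = last_any or created_at
    return (
        now - local_ref > timedelta(days=policy.local_idle_days)
        and now - any_ref > timedelta(days=policy.any_idle_days)
    )


now = datetime(2017, 6, 1, tzinfo=timezone.utc)
old = now - timedelta(days=200)

# Never interacted with and 200 days old: archived.
print(should_archive(old, None, None, now))
# A remote interaction 10 days ago keeps an active thread in the main DB.
print(should_archive(old, None, now - timedelta(days=10), now))
```

An admin who wants more aggressive cleanup could pass a different policy, for example `ArchivePolicy(local_idle_days=7, any_idle_days=30)`.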
#34 really needs some attention if we are going to start potentially deleting old toots from remote users.
#875 talks about removing old toots for users, but this is more about keeping down DB size on instances. Would it be sensible to sweep up old toots from the fediverse on a regular basis?
Perhaps any toot older than X days (configurable) that doesn't mention a local user, and hasn't been boosted or replied to by a local user, could be removed by a rake task similar to `rake mastodon:feeds:clear`. Any thoughts?