
Sweep up old federated toots from local DB #1554

Closed
Floppy opened this issue Apr 11, 2017 · 16 comments · Fixed by #10063

@Floppy
Contributor

Floppy commented Apr 11, 2017

#875 talks about removing old toots for users, but this is more about keeping down DB size on instances. Would it be sensible to sweep up old toots from the fediverse on a regular basis?

Perhaps any toot older than X days (configurable) that doesn't mention a local user and wasn't boosted or replied to by a local user could be removed by a rake task similar to rake mastodon:feeds:clear.
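Something along these lines, perhaps (purely a sketch: the task name, env var, and association names are invented, and the "replied to by a local user" check is left out for brevity):

```ruby
# lib/tasks/mastodon.rake -- sketch only; names here are illustrative.
namespace :mastodon do
  namespace :statuses do
    desc 'Remove old federated statuses no local user has interacted with'
    task sweep: :environment do
      max_age = ENV.fetch('SWEEP_MAX_AGE_DAYS', '14').to_i.days

      Status
        .joins(:account)
        .where.not(accounts: { domain: nil })           # remote authors only
        .where('statuses.created_at < ?', max_age.ago)  # older than the TTL
        .find_each do |status|
          # Skip anything a local user has touched in some way.
          next if status.mentions.joins(:account).where(accounts: { domain: nil }).exists?
          next if status.favourites.joins(:account).where(accounts: { domain: nil }).exists?
          next if status.reblogs.joins(:account).where(accounts: { domain: nil }).exists?

          status.destroy
        end
    end
  end
end
```

An admin could then run it from cron on whatever schedule suits their instance, much like rake mastodon:feeds:clear.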

Any thoughts?


  • I searched or browsed the repo’s other issues to ensure this is not a duplicate.
@alastairhm

Maybe have a TTL setting for federated messages which haven't been "touched" by a local user?

@Fastidious

Feeds are fine being ephemeral, but toots are not. I don't want to delete anyone's toots, nor have mine deleted.

@wxcafe
Contributor

wxcafe commented Apr 15, 2017

Deleting federated toots from an instance's local DB doesn't delete those toots from their home instance; they still exist there.

This would be an interesting feature; the software could simply fetch the remote toots again if/when they're accessed.

@wxcafe changed the title from "Sweep up old federated toots" to "Sweep up old federated toots from local DB" on Apr 15, 2017
@Gargron
Member

Gargron commented Apr 15, 2017

Self-hosting means sustainability, and an infinitely growing DB is not sustainable, so I agree that this is needed. We just need a good default strategy. It might not be desirable to just delete all remote toots. What about local users' favourites, reblogs, or conversations? So we should delete only things not touched by local users.

Alternatively, delete all content older than X, including local content. That's something that could be debated.

@Floppy
Contributor Author

Floppy commented Apr 15, 2017

Thanks @Gargron, I quite agree - in theory an instance could end up with every toot from the entire fediverse in it, which would be pretty unscalable. I agree that removing things not "touched" in some way would be a sensible default.

Presumably different instances could have different policies, too.

@danhunsaker
Contributor

As far as having different policies goes, those would have to be implemented in any change to the code. That means either the instance owner needs to make those changes themselves, or the feature proposed here needs to include some configurability to account for multiple approaches to this sort of cleanup. That in turn means deciding what should be configurable, and in what ways. Suggested so far:

  • Federated only vs federated and local
  • TTL (how old it needs to be to consider sweeping it away; possibly configured separately based on local vs federated origins)
  • Whether interactions prevent sweeping, and which interactions qualify; if local toots are sweepable, they should be configured separately from federated ones, and consider interactions from the Fediverse in addition to local ones.

As to defaults, the current consensus seems to be:

  • Federated only
  • Two weeks? This one hasn't really been discussed yet, but that's a default I've seen used fairly commonly.
  • All interactions prevent sweeping (likes, boosts, replies)
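
Expressed as a purely illustrative set of defaults (every constant and env var name here is made up), that might be something like:

```ruby
# Hypothetical defaults -- names and env vars are illustrative only.
SWEEP_DEFAULTS = {
  include_local:      ENV.fetch('SWEEP_INCLUDE_LOCAL', 'false') == 'true',  # federated only by default
  federated_ttl_days: ENV.fetch('SWEEP_FEDERATED_TTL_DAYS', '14').to_i,     # "two weeks" placeholder
  local_ttl_days:     ENV.fetch('SWEEP_LOCAL_TTL_DAYS', '90').to_i,         # only used if include_local
  protected_interactions: %i[favourite reblog reply mention]                # any of these prevents sweeping
}.freeze
```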

Did I miss anything?

@wxcafe
Contributor

wxcafe commented Apr 16, 2017

Looks good to me. I'd reduce the timeout to something like 2 days, but maybe I'm being too aggressive.

@danhunsaker
Contributor

Optimally, we'd have some metrics to look at, such as average consumption per week, day, maybe even hour, and set a default based on that. But short of a survey to collect such data, I'm not sure it's possible to get. Larger instances (or rather, those with more external content flowing in) would of course want lower values than smaller ones (with lower inbound traffic) might.

I think 2 days is perfectly reasonable for high-inbound-traffic instances, or those with excessively small disk space, but that the default might be ok being a bit higher. (Maybe not as high as 2 weeks, though. 5 days? 7?)

@bratta

bratta commented Apr 17, 2017

Personally, I think this would work well as a rake task, leaving it up to the instance admin to schedule it via cron. The documentation repository could contain some suggestions like those outlined above (e.g. 1-2 weeks for smaller instances, 2 days for larger instances).

@nightpool
Member

nightpool commented Apr 17, 2017

Would toots that users have boosted get deleted? What about ones that users have liked? Ones that are replies to users' statuses? We need to make sure that this isn't noticeable in its effect. (Missed this; @Gargron mentions it above.)

Another downside of this is that it could make remote users look like they haven't tooted at all when they in fact have. It also removes the redundancies inherent in a federated system. It also makes slower hashtags effectively useless. (do tag streams currently expire after 7 days? that's probably also an issue for slower tags)

This should definitely be a rake task, and it should be opt-in and up to the admin to schedule when it happens.

@danhunsaker
Contributor

danhunsaker commented Apr 17, 2017

@bratta The frequency at which the task runs is separate from the age at which it sweeps toots. We were discussing the age setting above (frequently called a TTL, or Time To Live, in technical contexts), not the frequency. I definitely agree that the docs should mention the task, what it does, and recommend various cron schedules (and associated TTLs) for various amounts of traffic/load. Thanks for that suggestion! 👍

@nightpool Toots which have been interacted with have been mentioned in various ways since the very first post. I assume from the way it's been rephrased so often that not only is it an important consideration (which we agree it is), but also that it needs to be very clearly communicated in the docs that toots with local interactions will not be affected (unless configured otherwise). Vital feedback. 👍

There are already a number of other situations which prompt users to check remote instances to get a better picture of a user's activity, so I'm not sure "they'll look inactive" is as big an issue as it seems. If nothing else, any toots that have been interacted with locally would still remain available locally. And of course, unless they've gone silent for X days (plus the time since the last sweep), their most recent content would still be available locally, too. So it's probably something to keep in mind, but I'm not convinced it's a blocker for implementation.

While you're correct about reduced redundancy, the benefit is not having your instance run out of disk space and stop working entirely. This issue is all about a trade-off: reduced disk usage at the cost of redundancy. It would also be interesting to consider alternative solutions for conserving disk space, though.

Tags are an interesting facet of this which we really should consider and discuss further. I suspect there's already significant fragmentation affecting tag streams in the first place - if my instance hasn't pulled a given toot with that tag, say because none of my users follow the account which posted it, it won't show up in my results anyway. So that's something to figure out. 👍

@ALL Finally, just to be clear, this is absolutely intended to be a rake task, which automatically means instance admins will be in charge of scheduling sweeps, and must opt-in to run them at all. I feel we all agree these elements are vital.

@ghost

ghost commented Mar 29, 2018

The network has grown quite a bit since the last time this was commented on; is this still planned?

@Gargron
Member

Gargron commented Jul 16, 2018

At the time of writing the mastodon.social statuses table has 52,913,648 rows, and once we introduce federation relays, that number is only going to go up faster. So a solution is needed.

Unfortunately, the lines around what can be deleted are blurry, and finding such items via SQL is far from simple. A naive algorithm (roughly sketched below) might be:

  1. Find all remote users with 0 local followers
  2. Select all toots by those users and queue them into sidekiq for individual checks
  3. In a check, consider:
    • reblogged/faved by local user?
    • mentions local user?
    • in a thread where a local user participates?
    • if all no, then delete
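
A very rough sketch of that per-status check as a worker might look like this (model, scope, and association names are assumptions, not actual code):

```ruby
# Very rough sketch only -- model and association names are assumptions.
# Enqueue side (steps 1-2): for each remote account with no local followers,
#   account.statuses.find_each { |s| SweepStatusWorker.perform_async(s.id) }
class SweepStatusWorker
  include Sidekiq::Worker
  sidekiq_options queue: :pull   # hypothetical queue choice

  def perform(status_id)
    status = Status.find_by(id: status_id)
    return if status.nil?

    # Keep anything reblogged or faved by a local user.
    return if status.reblogs.joins(:account).where(accounts: { domain: nil }).exists?
    return if status.favourites.joins(:account).where(accounts: { domain: nil }).exists?
    # Keep anything that mentions a local user.
    return if status.mentions.joins(:account).where(accounts: { domain: nil }).exists?
    # Keep anything in a thread a local user participates in.
    return if thread_has_local_participant?(status)

    status.destroy
  end

  private

  # Assumes statuses in the same thread share a conversation_id.
  def thread_has_local_participant?(status)
    return false if status.conversation_id.nil?

    Status.where(conversation_id: status.conversation_id)
          .joins(:account)
          .where(accounts: { domain: nil })
          .exists?
  end
end
```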

However, 50 million sidekiq jobs will be backlogged for a massive amount of time. And the checks aren't even complete: would you want to keep a status if it's reblogged by a remote user who does have local followers? What about threads involving remote users with local followers?

IPFS also has this issue, but it has the concept of explicit pinning, and everything that is not pinned by a node gets periodically flushed out. I think it might be worth adding a data structure for explicit pinning of statuses. A local user replying to, faving, or reblogging a remote status would then explicitly pin it. That would make sweeping unpinned ones a lot easier.
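
As a sketch only (the table, columns, and query below are invented for illustration, not Mastodon's schema), the pinning structure could be something like:

```ruby
# Hypothetical "status_retentions" table recording why a remote status is
# pinned locally. Everything here is illustrative.
class CreateStatusRetentions < ActiveRecord::Migration[5.2]
  def change
    create_table :status_retentions do |t|
      t.references :status, null: false, foreign_key: { on_delete: :cascade }, index: { unique: true }
      t.string :reason, null: false # e.g. 'favourite', 'reblog', 'reply', 'mention'
      t.timestamps
    end
  end
end

# A sweep could then drop old remote statuses that have no retention row, e.g.:
#   Status.joins(:account)
#         .where.not(accounts: { domain: nil })
#         .where('statuses.created_at < ?', 14.days.ago)
#         .where.not(id: StatusRetention.select(:status_id))
#         .in_batches(&:destroy_all)
```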

However, even that approach does not help with already stored statuses, we would still need a one-time migration procedure to pin what's necessary.

I need help figuring this out.

@progval
Contributor

progval commented Jul 16, 2018

Rather than deleting them, would it be possible to compact them? For example, remove them from indexes and move them to different storage (e.g. one file per user)?
That approach would be less risky: it would only slow down access to these toots (which is not that bad, as they wouldn't appear in searches anyway), and it would be recoverable in case the algorithm had a bug.
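
A rough sketch of what that compaction could look like, assuming one newline-delimited JSON file per account (paths and attribute handling are illustrative assumptions, not an actual Mastodon API):

```ruby
# Append swept statuses to a per-account archive file, then remove the rows.
require 'fileutils'
require 'json'

ARCHIVE_ROOT = ENV.fetch('STATUS_ARCHIVE_ROOT', '/var/lib/mastodon/archive')

def archive_and_remove(statuses)
  statuses.group_by(&:account_id).each do |account_id, batch|
    dir = File.join(ARCHIVE_ROOT, account_id.to_s)
    FileUtils.mkdir_p(dir)

    File.open(File.join(dir, 'statuses.ndjson'), 'a') do |file|
      batch.each { |status| file.puts(status.attributes.to_json) }
    end
  end

  Status.where(id: statuses.map(&:id)).delete_all
end
```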

@danhunsaker
Contributor

I'm very much in favor of a secondary, long-term storage option. Cleaning things out of the main DB which haven't been interacted with recently is a great idea, but I always favor moving them elsewhere over removing them entirely, unless we're talking about a complete suspension or deletion operation which was manually triggered by the admin or the user (respectively). Things which aren't explicitly requested to be removed should be archived.

I know my stance may not be widely accepted, though, so I propose something a bit more involved: the instance admin would have configuration options for what to archive, and how to archive it. The exact settings for the "what" seem to be the main question here, so I'll come back to those in a second, but as to the "how", I'd propose three options:

  • First, and the default, the filesystem, as instance/user/post.json or similar, for relatively quick management of individual posts.
  • Second, a data source which can be defined and set up via environment variables, similar to how the main DB operates, but preferably with less dependence on a specific DBMS, since Rails abstracts a lot of that anyway.
  • Third, "the abyss", "the void", or "/dev/null": an option to simply delete selected posts entirely rather than archive them anywhere meaningful.

As to the what, I think "no local interactions in the last X days", as described above, is a perfectly acceptable default setting. If the archival option is adopted, I might go slightly more aggressive with it and say "no local interactions in the last X days, and no recorded interactions in the last Y days", where both X and Y are some sane default, such as X=30 and Y=90, but adjustable by the instance admin according to what makes most sense for their users. This will archive local toots, too, but they'll still be accessible when referenced directly, and long-lived threads that are still active will still be in the main, quick-access DB.
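
To make those two thresholds concrete, here's a purely hypothetical selection (the timestamp columns are made up; whether anything like them exists is exactly the question below):

```ruby
# Hypothetical columns tracking when a status was last touched locally / at all.
ARCHIVE_AFTER_LOCAL = ENV.fetch('ARCHIVE_LOCAL_IDLE_DAYS', '30').to_i.days
ARCHIVE_AFTER_ANY   = ENV.fetch('ARCHIVE_ANY_IDLE_DAYS', '90').to_i.days

archivable = Status
  .where('last_local_interaction_at IS NULL OR last_local_interaction_at < ?', ARCHIVE_AFTER_LOCAL.ago)
  .where('last_interaction_at IS NULL OR last_interaction_at < ?', ARCHIVE_AFTER_ANY.ago)
```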

Of course, how do we tell how recently something was interacted with? Does the updated timestamp get updated with an interaction, or ... ? I haven't been in the DB in a while to see for sure how that works.

@deutrino

#34 really needs some attention if we are going to start potentially deleting old toots from remote users.
