Automated Watermark based GC and Transient Quota allocation #134
Comments
With two-watermarks systems, the goal tends to be to keep the value between the watermarks. What's described here seems to be more of a trigger/target system? ("When value is above , activate GC to bring it to, or below, ")
The edge case seems pretty dangerous. Is it possible to identify this livelock situation in the garbage collector, and interrupt transient downloads to vacate more space?
Note that there are new edge cases that emerge from such situations, e.g. a malicious user forcing the system to download a huge transient to DoS all other active downloads.
Which protocols are unable to report a shard size in your use case? Having unknown shard sizes is acceptable for trusted scenarios, but definitely a no-go for untrusted/adversarial scenarios. An attacker may exploit the system by forcing it to (1) download a shard with unknown size from themselves, and (2) send infinite garbage (cheap to do).
This is a meta-issue to track the work of introducing an automated watermark based LRU GC of transients along with a quota reservation mechanism to allow for downloading transients whose size we do not know upfront.
The work is spread across multiple PRs.
High level overview
The dagstore now performs automated high->low watermark based GC for transient files.
Users who want to use this feature will have to configure a maximum size for the transients directory. The dagstore guarantees that the size of the transients directory will never exceed that limit.
Users will also have to configure a high and low watermark for the transients directory. The dagstore will kickstart an automated GC when it detects that the size of the transients directory has crossed the high watermark and will attempt to bring down the directory size below the low watermark threshold.
Users will have to configure a GC Strategy that recommends the order in which the automated GC mechanism should reclaim shards. The dagstore ships with a built-in LRU GC Strategy, but users are free to implement their own. See the documentation of GarbageCollectionStrategy for more details.
A quota reservation mechanism has been introduced for downloading transients whose size we do not know upfront. To download such a CAR, the downloader first gets a reservation from the dagstore for a preconfigured number of bytes, downloads that many bytes, and then goes back to the allocator for another reservation if it hasn't finished downloading the transient. At the end, it releases any unused reserved bytes back to the allocator.
The existing manual GC mechanism works as is and no changes have been made to it.
Known Edge Case
There is an unhandled known edge case in the code.
If a group of concurrent transient downloads ends up reserving all the available space in the transients directory, but none of them has reserved enough to complete its own download, then all of them will back off and retry together, waiting for more space to become available. However, no space will become available until one of them exhausts its backoff-retry attempts -> fails the download -> releases its reserved space. Thus, the dagstore will not make any progress with new downloads until one of the downloads fails and releases its reservation.
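To make the stall concrete, here is a small deterministic simulation of the scenario. Everything in it (the `dl` type, `simulate`, the round-robin scheduling, and the numbers) is an assumption chosen for illustration; it models the reservation bookkeeping only and performs no real downloads.

```go
package main

import "fmt"

// dl is a hypothetical in-flight download of unknown size.
type dl struct {
	name         string
	need         int64 // true size, not known to the downloader up front
	reserved     int64
	retries      int
	done, failed bool
}

// simulate gives each active download one reservation attempt per round.
// A download that cannot reserve counts a retry; once it exceeds
// maxRetries it fails and releases everything it holds.
func simulate(capacity, step int64, maxRetries int, downloads []*dl) []string {
	var used int64
	var events []string
	for active := len(downloads); active > 0; {
		for _, d := range downloads {
			if d.done || d.failed {
				continue
			}
			if used+step <= capacity {
				used += step
				d.reserved += step
				d.retries = 0
				if d.reserved >= d.need {
					d.done = true
					active--
					events = append(events, d.name+" done")
				}
				continue
			}
			// No space left: back off and retry in a later round.
			d.retries++
			if d.retries > maxRetries {
				used -= d.reserved
				d.reserved = 0
				d.failed = true
				active--
				events = append(events, d.name+" failed")
			}
		}
	}
	return events
}

func main() {
	downloads := []*dl{{name: "A", need: 30}, {name: "B", need: 30}}
	fmt.Println(simulate(40, 10, 3, downloads)) // prints [A failed B done]
}
```

Both downloads hold 20 of the 40 available bytes and each needs 10 more, so neither progresses until A runs out of retries and releases its reservation, at which point B can finish.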
However, this edge case should be mitigated by:
PRs