Prune concurrency #813
Conversation
There have been a lot of code changes since these patches were written; one of the stumbling blocks was what they were originally doing. In prune at least we now have two functions, so I had to add locking there. This is going to need careful auditing and review. There are also a few compiler errors still to fix, which I left intentionally to be sure we don't merge this until it's fixed up more.
I think it would be great (if possible) to add some tests using ….
Minor, but I realized `checkout_tree_at()` is a better place to do common setup before checkout. Prep for ostreedev#813
☔ The latest upstream changes (presumably 8d8f06f) made this pull request unmergeable. Please resolve the merge conflicts.
Force-pushed from 7a1f9fd to e0695ad
Did some work on this; I rebased 🏄‍♂️ and worked on a test case which blows up with concurrent prunes; will debug more later.
I've been looking at locking for some work at Endless. There are a few other spots I think could use some locking.
☔ The latest upstream changes (presumably e6f17b9) made this pull request unmergeable. Please resolve the merge conflicts.
Yeah...there is a long tail of stuff here. I think delta pruning falls under commit pruning, no? But today it is possible to regenerate a delta, which isn't really safe at all against concurrent readers...that's its own problem. I'm not sure why one would be signing a commit that's about to be deleted? If it failed, it'd mostly be an operator issue. At a high level, I'd say we should focus on, in order, given concurrent commit + prune operations:
When a transaction is finished and we have moved all the staged loose objects into the repo, we fsync all the object directories to ensure the filenames are stable before we update the refs files to point to the new commits. Without this, an unclean shutdown after the transaction is finished could result in a refs file that points to an incomplete commit. https://bugzilla.gnome.org/show_bug.cgi?id=759442
In general, we're pretty robust when there are multiple ongoing operations, because transactions are isolated to per-transaction staging dirs, and the transaction commit is a last-writer-wins replace operation on immutable data. However, any non-atomic read operation can fail if it is concurrent with a prune, as objects may then disappear under the operation's feet halfway through.

This patch set makes this robust by having a repo lock that is taken in shared mode for all transactions and for checkouts, but in exclusive mode for prune. It also adds a non-blocking prune mode that allows you to do opportunistic prunes that don't block if there is an ongoing operation.

There are a bunch of operations that are still unsafe (they do not block prune), such as: cat, ls, show, log, diff, generate-delta, rev-parse. I don't think that is necessarily a huge problem, as these are mainly used in development or debugging, and a failure here will just be an error printed; it will never cause a broken repo.

This is a simpler version of what I proposed at https://mail.gnome.org/archives/ostree-list/2015-December/msg00024.html - it takes a lock during the whole transaction rather than retrying the transaction under a lock when committing it.

https://bugzilla.gnome.org/show_bug.cgi?id=759442
Force-pushed from e0695ad to 472863d
For signing a commit that's getting deleted, it's not that you'd do it intentionally; it's that someone else starts pruning while you're trying to do it. I suppose you could just let it fail in that case, but it also seems like you could race with the prune and end up leaving dangling detached metadata around. There's a separate entry point for ….

Currently each caller except for the transaction lock separately opens the lock file. This has pros and cons. On the plus side, you're not at risk of lowering the lock state (either unlocking or dropping exclusive to shared) if some other part of the code has locked the repo. On the down side, there's no way to coordinate locking access, and it leaves things open for deadlocking; i.e., it would be ideal to take an exclusive lock in ….

This also means, for example, that you can't lock the repo in your application before starting a chain of operations. This is the type of thing I'd like to do during a release: take an exclusive lock immediately after opening the repo and maintain it for the entire release process. It might be overkill, but I'd really like to prevent anything else from touching the repo during that process. In order to support that, you'd probably want to keep one lock in the ….

Another ugly option would be to add a bunch of ….
Yeah, I think you're right. The recursive case could be handled by tracking whether or not we already own the lock in this process, but to handle the general case of wanting exclusion across multiple operations, I'd agree we need to hoist the locking to be caller-controlled.
```c
                             GError   **error);

gboolean
_ostree_repo_lock_exclusive (OstreeRepo *self,
```
Do you think it would make sense to expose these two functions externally? In general I think they can be useful when more complex prune logic is implemented by an application. In this particular case, there is still an issue with Atomic system containers when `atomic images prune` is used at the same time as an image referring to a dangling layer (which is already present locally) is being pulled: prune might delete the layer, but the pull process has already checked that it is present in the repository, so it doesn't pull it again. If these two functions are made public, I could use them directly; otherwise we will probably need to implement the locking part separately.
Working on converting this to explicit locking.
☔ The latest upstream changes (presumably d0d5f54) made this pull request unmergeable. Please resolve the merge conflicts.
I was thinking about this more, and while it wouldn't be a complete fix, we could greatly mitigate things if we followed what low-pause memory GC algorithms do, which is: compute references without locking, then stop and look for any new objects that have appeared, rescan those, etc. In general, we can also assume objects in any transaction staging dir are referenced.
I've been thinking about this off and on for a while. I think that in order to be really useful, the lock should be acquirable both within ostree and from outside it. I'm definitely no locking guru, but here are the features I think are needed to accomplish that:
I came up with something layered on top of flock(2) that seems to work. It uses ….
A …. Obviously that's not totally usable as is (you'd want to control the lock file path, use …).
I came up with an alternative that I think handles most use cases in #1292.
Closing this in favor of #813 (comment)
Migrating from https://bugzilla.gnome.org/show_bug.cgi?id=759442