Concurrency issue when pvc is mounted to multiple pods #4654
Comments
@z2000l Instead of reopening the issue, can you be more specific about the problem and provide details?
Yes, we're trying to test CephFS encryption. We used Rook to orchestrate our Ceph cluster and applied the patch to ceph-csi 3.10.2. The patch improved fscrypt a lot, but we still experienced problems when the replica count is high, especially when pods were running on multiple nodes: a subset of the replicas ran without a problem, while some were stuck in ContainerCreating with error messages. We suspect that when multiple nodes try to set up the fscrypt policy at the same time, the failed nodes don't refresh or sync to get the policy and get stuck. And which canary cephcsi image would you like us to try? Thanks.
@z2000l Thanks for the response. @NymanRobin, is this still happening or has it been fixed?
I can confirm that this is consistently reproducible when using a deployment with multiple replicas spread over nodes and RWX accessMode for the PVC. I will start looking into a solution; I am not sure yet whether the problem is in fscrypt or in the ceph-csi implementation.
I now understand the error clearly. The fscrypt client handles metadata and policy differently, which leads to this error. When checking for metadata, we create it if it doesn't exist; however, this isn't the case for policy checks. This results in scenarios where, for instance, if three pods are created simultaneously, the first creates the metadata and the subsequent two find it, but none of them find the policy. This leads to a mismatch, because the subsequent two have metadata but no policy.

The solution is to properly set up the policy and metadata initially. Once this is done, the race condition won't occur. As a temporary workaround, one can start the deployment with one replica, wait for it to be in the running state, and then scale up without issue.

I believe a permanent solution shouldn't be too difficult. The best idea I currently have is to check whether the initial setup is done; this check needs to be behind a lock, and the lock can be released when the check passes or when the setup is done (a minimal sketch follows below). I will still do some more thinking about the best course of action, but how does this sound to you initially, @Madhu-1?

EDIT: Okay, actually the pods are spread over nodes, so a mutex won't solve this; it will need a lease or some other mechanism to ensure the setup is run only once.

/assign
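To make the proposed flow concrete, here is a minimal sketch of the check-behind-a-lock idea. The helper names are hypothetical, not ceph-csi functions, and, as the EDIT above notes, a process-local mutex only serializes pods on the same node; a cross-node mechanism is still needed.

```go
// A minimal sketch of the "check whether the initial setup is done behind a
// lock" idea. isFscryptSetupDone and setupFscryptPolicyAndMetadata are
// hypothetical placeholders. A process-local sync.Mutex like this only covers
// pods on the same node; the real fix needs a lease or RADOS-based lock.
package example

import "sync"

var fscryptSetupMu sync.Mutex

// ensureFscryptSetup makes sure policy and metadata are created together,
// so no pod ever observes metadata without a matching policy.
func ensureFscryptSetup(volumeID string) error {
	fscryptSetupMu.Lock()
	defer fscryptSetupMu.Unlock()

	done, err := isFscryptSetupDone(volumeID)
	if err != nil {
		return err
	}
	if done {
		// Another pod already completed the setup; nothing to do.
		return nil
	}
	return setupFscryptPolicyAndMetadata(volumeID)
}

// Hypothetical hooks, stubbed so the sketch compiles.
func isFscryptSetupDone(volumeID string) (bool, error)    { return false, nil }
func setupFscryptPolicyAndMetadata(volumeID string) error { return nil }
```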
Thanks for taking this issue! Let us know if you have any questions!
The issue you are trying to assign to yourself is already assigned.
> EDIT: Okay, actually the pods are spread over nodes, so a mutex won't solve this; it will need a lease or some other mechanism to ensure the setup is run only once.

Thanks for exploring it. Yes, as you mentioned, internal locking might not be good; we need to see if we can have something else for this one.
Hi @NymanRobin, you might be able to use … Thanks for looking into this!
Hey @nixpanic! Thanks for the suggestion, but based on our last discussion I still think the Rados omap is the way to go. Let me try to explain: as I understand it, the lock seems to be already acquired in … Do you agree with this understanding? I think my best idea currently would be to implement an interface similar to the reftracker that would track the encryption of the CephFS filesystem, to ensure the encryption is only set up once.
I tested the Rados omap locking with a simple proof of concept, and it appears to be working well! During testing, I created a reference object based on the volume in the refTracker. If this succeeded, I proceeded with the encryption setup and, upon exiting the function, deleted the reference. This straightforward lock mechanism resolved all issues, and all pods started up correctly. However, I believe this can be further improved by creating a dedicated interface for encryption tracking.
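A rough sketch of that proof-of-concept flow, assuming a go-ceph rados.IOContext is available on the nodeplugin; the object name is made up for illustration and this is not the actual ceph-csi reftracker code:

```go
// Rough sketch of the proof of concept described above: exclusively create a
// per-volume reference object, run the encryption setup only if the create
// succeeded, and delete the reference when done. Retry handling is left out.
package example

import (
	"errors"
	"fmt"

	"github.com/ceph/go-ceph/rados"
)

func setupEncryptionWithRef(ioctx *rados.IOContext, volumeID string, setup func() error) error {
	refObj := "csi.volume." + volumeID + ".enc-ref" // hypothetical object name

	// The exclusive create acts as the lock: only one node wins the race.
	err := ioctx.Create(refObj, rados.CreateExclusive)
	if errors.Is(err, rados.ErrObjectExists) {
		// Another node is doing (or already did) the setup; the caller can
		// retry the mount later.
		return fmt.Errorf("encryption setup for volume %s is in progress elsewhere: %w", volumeID, err)
	}
	if err != nil {
		return err
	}
	defer func() {
		// Best-effort removal of the reference once the setup has finished.
		if delErr := ioctx.Delete(refObj); delErr != nil {
			fmt.Printf("failed to delete %s: %v\n", refObj, delErr)
		}
	}()

	return setup()
}
```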
Yes, if …
Great work! Sounds really good 👏
If you have a generic "Rados lock/mutex" function, we might be able to use it for other things too. I can't immediately think of anything other than encryption, though. Making it encryption-specific is fine, but ideally it is easily adaptable later on to be more generic (if we find a use case for it). Thanks again!
@NymanRobin what about the case where we took the lock, the CSI driver restarted, and the app pod was scaled down to 1 replica that needs to run on a specific node instead of two nodes? Will this also help there?
@Madhu-1, I currently do not see any problems with the scale-down scenario, but maybe I am missing something? Ideally, once a node has set up the encryption, there is no need to acquire the lock again. This is why I considered a dedicated encryption interface that first does a lookup to check whether the volume has already been encrypted, avoiding the need to lock again if it has (sketched below). This approach could be an extension of the lock mechanism. A plain lock could also work, but it would be slightly slower during large-scale operations; the key point is that the lock is only held for a very short period of time.

On another note, if the update/delete call for the lock fails, my thought is that we need to maintain a list of orphaned locks and attempt to clean them up periodically.

That said, the concern about the CSI driver restarting while holding the lock is a valid point I hadn't considered earlier. Handling a pod restart in a locked state is more challenging, and I do not have a perfect solution off the top of my head for this scenario. Maybe implementing an expiry time for the lock could solve this? What do you think @Madhu-1 and @nixpanic?
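A sketch of the lookup-first idea, with a made-up completion-marker object and stubbed lock helpers, assuming go-ceph's Stat/Create calls:

```go
// Sketch of the "look up first, lock only if needed" idea above: a small
// marker object records that a volume's encryption setup already completed,
// so later mounts (scale-ups, restarts) skip the lock entirely. Object names
// and the lock helpers are hypothetical.
package example

import (
	"errors"

	"github.com/ceph/go-ceph/rados"
)

func ensureEncrypted(ioctx *rados.IOContext, volumeID string, setup func() error) error {
	doneObj := "csi.volume." + volumeID + ".enc-done" // hypothetical marker

	// Fast path: setup already recorded, no locking required.
	if _, err := ioctx.Stat(doneObj); err == nil {
		return nil
	} else if !errors.Is(err, rados.ErrNotFound) {
		return err
	}

	// Slow path: serialize the one-time setup, then record completion so
	// future mounts take the fast path.
	if err := acquireVolumeLock(ioctx, volumeID); err != nil {
		return err
	}
	defer releaseVolumeLock(ioctx, volumeID)

	if err := setup(); err != nil {
		return err
	}
	return ioctx.Create(doneObj, rados.CreateIdempotent)
}

// Hypothetical lock helpers, stubbed so the sketch compiles.
func acquireVolumeLock(ioctx *rados.IOContext, volumeID string) error { return nil }
func releaseVolumeLock(ioctx *rados.IOContext, volumeID string)       {}
```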
A lock that can time out would be ideal, I guess. Depending on how complex you want to make it, this could be an approach: …
Hey everyone, I've just opened a draft PR (#4688) that addresses the concurrency issue we've been facing. While it's still a bit rough around the edges, it implements the functionality necessary to resolve the concurrency issue, and I made a checklist in the PR of what could still be done or looked into.

I opened the PR in this state now because I'm going on vacation for the next four weeks and don't want to lose momentum on this issue. I'll be working to finalize as much as possible today, but any remaining tasks are open for anyone who feels inspired to take over. From our side, @Sunnatillo will be available to assist with this. Thanks!
@NymanRobin Thank you for taking care of it. Enjoy your vacation!
Thanks a lot @Madhu-1!
Hi @Madhu-1, @nixpanic. Should we use LockExclusive from go-ceph, or is @NymanRobin's current implementation fine?
@Sunnatillo yes, I think using https://github.com/ceph/go-ceph/blob/ee25db94c6595ad3dca6cd66d7f290df3baa7c9a/rados/ioctx.go#L463-L508 (LockExclusive from rados) does what @nixpanic described above; I am fine with using it instead of reinventing the same thing. @nixpanic can you please check this as well?
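For reference, a sketch of how go-ceph's LockExclusive/Unlock could wrap the one-time setup, assuming the signatures at the linked revision; the object name, lock name, cookie, and timeout are illustrative, and retry/backoff on a busy lock is left to the caller:

```go
// Sketch of guarding the one-time fscrypt setup with go-ceph's object lock,
// as suggested above. All names and the timeout are illustrative.
package example

import (
	"fmt"
	"time"

	"github.com/ceph/go-ceph/rados"
)

func setupEncryptionLocked(ioctx *rados.IOContext, volumeID string, setup func() error) error {
	lockObj := "csi.volume." + volumeID + ".enc-lock" // hypothetical object name
	const lockName, cookie = "fscrypt-setup", "nodeplugin"

	// A non-zero duration makes the lock expire on its own, so a nodeplugin
	// that crashes or restarts mid-setup cannot leave the volume locked.
	ret, err := ioctx.LockExclusive(lockObj, lockName, cookie, "fscrypt initial setup", 30*time.Second, nil)
	if err != nil {
		return err
	}
	if ret != 0 {
		// Non-zero (e.g. -EBUSY) means another node holds the lock; the
		// caller should retry later.
		return fmt.Errorf("volume %s: encryption setup is locked elsewhere (ret=%d)", volumeID, ret)
	}
	defer func() {
		if _, unlockErr := ioctx.Unlock(lockObj, lockName, cookie); unlockErr != nil {
			fmt.Printf("failed to unlock %s: %v\n", lockObj, unlockErr)
		}
	}()

	return setup()
}
```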
The simpler it can be done, the better!
Describe the bug
This is an extension/reopening of https://github.com/ceph/ceph-csi/issues/4592. Our test showed that when creating a deployment with 2 replicas, there is a high chance that one pod will fail to mount the PVC.
Environment details
Same as https://github.com/ceph/ceph-csi/issues/4592
Steps to reproduce
Steps to reproduce the behavior:
Actual results
Only one pod will be running
Expected behavior
All pods will be running