
Support for persistent locks/semaphores #84

Open

stensonb opened this issue Dec 12, 2014 · 9 comments

Comments

@stensonb

Use case:

My code performs a cluster-level operation that MUST succeed before other members of the cluster are allowed to begin (think "restarting a web service" on a node in a load-balanced cluster).

"MUST" here includes the case where the program dies (either due to its own exception, or due to a system exception).

I'd like to:

  1. get the lock
  2. perform the operation
  3. remove the lock

If/When the operation (step 2) dies, the lock should persist (so no other members of the cluster perform the operation). Additionally, I'd like to be able to restart the program/application and resume with the same lock.

Currently, locks/semaphores are created with :ephemeral_sequential, which means the lock is automatically removed if/when the operation dies.
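For concreteness, the difference I'm after is just the create mode. A minimal sketch with the zk gem (the paths are placeholders, and the real locker classes obviously do more than this):

```ruby
require 'zk'

zk = ZK.new('localhost:2181')
zk.mkdir_p('/my-app/locks')   # parent path is a placeholder

# What the built-in locker does today: the znode is ephemeral, so it is
# removed automatically when the client session dies along with the process.
zk.create('/my-app/locks/guard-', '', mode: :ephemeral_sequential)

# What this issue asks for: a znode that outlives the client session, so a
# crashed operation leaves the lock in place until it is explicitly deleted.
zk.create('/my-app/locks/guard-', '', mode: :persistent_sequential)
```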

@slyphon
Contributor

slyphon commented Dec 13, 2014

Why not just have your cluster wait on a node? You can set a watch on a node that hasn't been created yet: have your cluster check for the existence of /path/to/foo, and if it doesn't exist, watch /path/to/foo. When it's created by the process-that-needs-to-do-something-before-the-cluster-can-start, the cluster will be notified of the event and start running.

Locks were intended for locking around updating a record or running a job, or for ensuring a single-writer setup where you have two writers: one active, one on standby.
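Something along these lines (the path is a placeholder; this is just a sketch of the register + exists?(watch: true) idiom, not a complete solution):

```ruby
require 'zk'

zk    = ZK.new('localhost:2181')
queue = Queue.new

# Fire when /path/to/foo is created by the process that must run first.
sub = zk.register('/path/to/foo') do |event|
  queue.push(:ready) if event.node_created?
end

# Set the existence watch; if the node already exists we can start right away.
queue.push(:ready) if zk.exists?('/path/to/foo', watch: true)

queue.pop        # block here until the node shows up
sub.unregister
# ...cluster member proceeds with startup...
```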

@stensonb
Author

Either I didn't explain my requirements well enough, or I'm not understanding what you're suggesting.

Each of my clustered nodes is trying to obtain the lock so it can do something locally (restart a service). Whether the service restarts successfully or the restart fails wildly, that lock is removed (because it's ephemeral), and the other nodes in the cluster proceed with their get-lock-or-block-then-do-stuff loop.

I feel like I'm missing something...

@stensonb
Author

To clarify, WHEN the service fails to start (for whatever reason), I want the node to continue to hold the lock to prevent other nodes from proceeding (maybe the service we're restarting is configured incorrectly, and sequencing through each of them will bring the entire load-balanced solution down).

@tobowers

I think what Jonathan is saying is to not use the actual locking class and instead have your first process write out a new node when it completes its part of the process. Then have your secondary process watch for the presence of that node.


@stensonb
Author

I think I understand. But, that still won't work for me for a few reasons:

  1. None of my nodes have a higher priority.
  2. While it is required that they perform the job in series, the order in which they perform it is irrelevant (and does not need to be deterministic).
  3. The number of nodes in the cluster is dynamic and could/will change during this "locking" process, so having a node depend on another's lock is impractical. Similarly, the number of nodes in the cluster may not be known before the locking sequence begins (an approach of "wait until I see X nodes before I do my thing" would not work since I don't know how large X is a priori).

Finally, once my process completes on any given node, the process running the zk client exits, so from the ZooKeeper cluster's perspective I cannot tell whether the restart was successful or not.

@stensonb
Author

Well, I think I've worked around this...as suggested, I'm not using the built-in lock/semaphore objects. I'm simply doing the following (rough sketch below):

  1. Get the lock by creating a persistent sequential node.
  2. If I have the lock (identified by the lowest sequence #), I perform the service restart and delete my persistent node.
  3. If I don't have the lock (my znode is not the lowest sequence #), I block using the ZK::NodeDeletionWatcher class on the node that holds the lock.
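Roughly this, assuming the zk gem's create/children/delete calls and ZK::NodeDeletionWatcher#block_until_deleted; the lock path and the restart_service! helper are placeholders for my actual setup:

```ruby
require 'zk'

LOCK_DIR = '/my-app/restart-lock'   # placeholder path

# Placeholder for the real "restart my local service" step.
def restart_service!
  system('systemctl restart my-service') or raise 'restart failed'
end

zk = ZK.new('localhost:2181')
zk.mkdir_p(LOCK_DIR)

# 1. "Take a ticket" with a persistent sequential node; it survives crashes
#    and client restarts, which is the whole point.
my_path = zk.create("#{LOCK_DIR}/node-", '', mode: :persistent_sequential)
my_name = File.basename(my_path)

loop do
  waiting = zk.children(LOCK_DIR).sort
  if waiting.first == my_name
    # 2. Lowest sequence number: I hold the lock. Do the work, then release
    #    the lock by deleting my znode (only on success).
    restart_service!
    zk.delete(my_path)
    break
  else
    # 3. Someone else holds the lock: block until their znode is deleted,
    #    then re-check.
    ZK::NodeDeletionWatcher.new(zk, "#{LOCK_DIR}/#{waiting.first}").block_until_deleted
  end
end
```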

@stensonb
Author

Having to implement this workflow by hand just to get a persistent lock, however, seems like it would be a common case.

I still think the idea of expanding the "lock" construct (semaphores too) to support persistent nodes would be a great feature.

Anybody else?

@eric
Member

eric commented Dec 17, 2014

If you have a node that is persistent, how does a client know which node is theirs after their process restarts?

It may be worth trying to separate these concepts into two things:

  1. A directory that contains one ephemeral node per service that is currently active
  2. A task that ensures at least N services are active and when that is met, attempts to get a lock (the existing ephemeral one) and performs a restart
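Very roughly, something like this (the paths, the threshold, and the use of the gem's with_lock helper are just illustrative):

```ruby
require 'zk'
require 'socket'

ACTIVE_DIR = '/my-app/active'   # placeholder registry path
MIN_ACTIVE = 3                  # placeholder threshold

zk = ZK.new('localhost:2181')
zk.mkdir_p(ACTIVE_DIR)

# 1. Every running service advertises itself with an ephemeral node, which
#    disappears automatically if the service (or its session) dies.
zk.create("#{ACTIVE_DIR}/#{Socket.gethostname}", '', mode: :ephemeral)

# 2. The restart task only proceeds once enough services are up, and then
#    takes the ordinary (ephemeral) lock for the restart itself.
if zk.children(ACTIVE_DIR).size >= MIN_ACTIVE
  zk.with_lock('rolling-restart') do
    # ...restart this machine's service...
  end
end
```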

@rehevkor5

rehevkor5 commented May 16, 2018

I'd like to do the same thing, for restarts across clusters like Cassandra, Kafka, even ZooKeeper itself. I only want one machine to be down for a restart/reboot at any given time.

> how does a client know which node is theirs after their process restarts

I was thinking of including the machine's IP (it's unique and static, in my case) in the ZK node name. That way the client would always be able to tell if it has an existing lock node or not, and can delete the right one.
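Roughly what I have in mind (the lock path and IP are placeholder values, and this assumes the zk gem's create/children API):

```ruby
require 'zk'

LOCK_DIR = '/my-app/restart-lock'   # placeholder path
MY_IP    = '10.0.0.12'              # this machine's unique, static IP (placeholder)

zk = ZK.new('localhost:2181')
zk.mkdir_p(LOCK_DIR)

# After a process restart, look for a lock node carrying our IP; if none
# exists, create a fresh persistent sequential one prefixed with the IP.
mine = zk.children(LOCK_DIR).find { |name| name.start_with?("#{MY_IP}-") }
mine ||= File.basename(zk.create("#{LOCK_DIR}/#{MY_IP}-", '', mode: :persistent_sequential))
```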

> A task that ensures at least N services are active

I thought about this as a way to use ephemeral nodes instead of non-ephemeral nodes. Unfortunately, it relies on knowing how many nodes should be up at any given time. I'm not sure if that's always a straightforward thing to determine, and it may remove the advantage of decentralization that ZK affords. You'd probably end up having to create non-ephemeral nodes to record all the machines which are "supposed" to be up, and make sure that you create & delete those at the right time. If you accidentally don't create one, then bad things can happen like rebooting too many machines at once. If you accidentally don't delete one, the end result is the same as if the lock wasn't freed, so you need intervention anyway.

Using non-ephemeral nodes makes it somewhat more likely to encounter locks that are stuck. But in that situation I need a human to intervene anyway, so I just need to create the right tooling. I'm going to try to build my functionality on top of this library: hopefully it won't require too many big changes.
