jetcd locking problem #291
@dinoopt The lock API was added recently and possibly contains bugs. If you can reproduce this issue with a simple script, that would help me debug it for you.
@fanminshi Sorry for coming back late on this. The initial problem I faced was that there were multiple threads waiting for a lock, and each of these lock attempts had a wait timeout and a lease timeout. After the lock was released by the holding thread, it would go to a thread that had already timed out waiting but whose lease had not yet expired, leaving all the other threads waiting for that thread to release the lock. I added a step to revoke the lease on timeout, which fixed the issue to a certain extent, but during concurrent runs we are still seeing it. I'm trying to reproduce this with a standalone program, but have not been successful so far. Below is the sequence we use for lock and unlock; could you help me understand if something is missing here?
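(The author's code did not survive in this copy of the thread. As a rough sketch of the sequence described above, lock under a lease with a bounded wait and revoke the lease on timeout, assuming the jetcd 0.0.2 API; the endpoint, lock name, TTL, and wait timeout are placeholders, not values from the original.)

```java
import com.coreos.jetcd.Client;
import com.coreos.jetcd.Lease;
import com.coreos.jetcd.Lock;
import com.coreos.jetcd.data.ByteSequence;
import com.coreos.jetcd.lock.LockResponse;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class LockSequenceSketch {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder().endpoints("http://127.0.0.1:2379").build();
        Lease leaseClient = client.getLeaseClient();
        Lock lockClient = client.getLockClient();

        // Grant a lease so the lock is released even if this process dies.
        long leaseId = leaseClient.grant(30).get().getID();
        ByteSequence lockName = ByteSequence.fromString("my-lock");
        try {
            // Wait for the lock, but give up after a bounded timeout.
            LockResponse lock = lockClient.lock(lockName, leaseId).get(10, TimeUnit.SECONDS);
            try {
                // ... perform the etcd update while holding the lock ...
            } finally {
                lockClient.unlock(lock.getKey()).get();
            }
        } catch (TimeoutException e) {
            // Revoke the lease on timeout so our queued lock key does not
            // keep other waiters blocked (the workaround described above).
            leaseClient.revoke(leaseId).get();
        } finally {
            client.close();
        }
    }
}
```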
I also ensure that when the lock is acquired, it is held only for a very short time to perform an update to etcd and is released immediately. What I notice during the runs is that once the locking problem surfaces, many threads keep creating lock keys and trying to acquire the lock, but fail to acquire it. I see many instances of this thread in the stack, and the count keeps increasing once the locking problem starts.
If there are any other best practices for using jetcd locking, please let me know. Thank you in advance for your help.
@fanminshi I face this issue only on an etcd cluster running version 3.2.13 and don't see it in my other instances, which use etcd clusters running versions 3.2.9 and 3.2.11. I'm not able to reproduce the issue on my local standalone etcd server running 3.2.13. After adding some debug messages to the jetcd code, I tried the scenario below.
When running against the 3.2.9 version, I see this behavior:
When running against the 3.2.13 version, I see this behavior:
Could you kindly let me know if the difference in etcd server versions can result in this kind of behavior?
@dinoopt Hey, thanks for the detailed explanation. I am currently busy with many things; I'll take a look at this once I am free.
cc/ @Grisu118
@fanminshi Thank you for your response. I'm including the standalone Scala program I was using to reproduce the issue against an etcd cluster running version 3.2.13, in case that helps.
I did a quick try, but I'm not able to reproduce the bug. I always get the following output from your test. As far as I understand, this is the correct output, right? Tested with a single node, a cluster, and versions 3.2.9/3.2.10/3.2.13.
@Grisu118 Sorry, I forgot to mention that I was running with some debug points added to the jetcd code. The code output will be the same when run against all the etcd server versions. The difference I saw was in the following behavior:
This is the piece of code in jetcd's Util.java where the other three threads trying to acquire the lock wait for the future to complete:
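(The snippet itself was lost in extraction. As an illustrative sketch only, not the verbatim jetcd source: the pattern being described is a helper that converts a future by blocking on get() inside a retry loop, so a source future that never completes leaves the waiting thread parked forever.)

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Sketch of the blocking-retry pattern described above; names and signature
// are stand-ins, not the actual jetcd Util.java code.
public final class RetrySketch {
    static <T> CompletableFuture<T> withRetry(Supplier<Future<T>> source,
                                              Predicate<Exception> shouldRetry,
                                              ExecutorService executor) {
        CompletableFuture<T> target = new CompletableFuture<>();
        executor.submit(() -> {
            while (true) {
                try {
                    // No timeout on get(): if the source future never
                    // completes, this thread blocks here indefinitely.
                    target.complete(source.get().get());
                    return;
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    target.completeExceptionally(e);
                    return;
                } catch (ExecutionException e) {
                    if (!shouldRetry.test(e)) {
                        target.completeExceptionally(e);
                        return;
                    }
                    // Retryable failure: loop and ask for a fresh future.
                }
            }
        });
        return target;
    }
}
```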
If I take a jstack dump immediately after the unlock and before the run ends, I still see three threads stuck in toCompletableFutureWithRetry(), with the stack below.
I also tried adding a timeout value to f.get(), in which case the threads are able to terminate after the timeout. I was also not able to reproduce the problem on a standalone etcd server running on my system; it seems to happen only when I run against the provisioned etcd cluster we are using in our project. Please let me know if you need any additional details. Thank you for your help.
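(As an illustration of that workaround, assuming f stands in for the blocking future inside the retry helper; the 30-second bound is an arbitrary placeholder.)

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical replacement for the bare f.get(): a bounded wait lets a stuck
// thread give up instead of parking forever.
static <T> T getWithTimeout(Future<T> f) throws Exception {
    try {
        return f.get(30, TimeUnit.SECONDS); // placeholder bound
    } catch (TimeoutException e) {
        f.cancel(true); // best effort: stop waiting on a future that never completes
        throw e;
    }
}
```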
I cannot reproduce this against an etcd cluster on my local machine (Windows 10). The cluster is created with Docker, similar to the jetcd unit tests. Every future accessed in the code snippet you mentioned is completed.
@Grisu118 Thanks for trying it out. Yesterday I got a new etcd cluster provisioned with 3.2.13, and the problem is not happening there. In the clusters which were upgraded from a lower version to 3.2.13, I can still see the problem. I tried upgrading a local etcd server installed on my Mac with brew from 3.2.9 to 3.2.13, but was not able to reproduce the issue.
@fanminshi I also encountered the same problem. Using a distributed lock, two programs compete to acquire the lock, each keeping its lock's associated lease alive before the expiration time (keepAliveOnce). When the lock-holding program is terminated, the lease should in theory expire, the lock should be released, and the other program should acquire it; in practice, it fails to get the lock. The problem is not easy to reproduce, and it is currently hard to tell whether it is a server-side issue (etcd version 3.2.18) or a jetcd (version 0.0.2) issue.
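(A minimal sketch of the competing-lock scenario described, assuming the jetcd 0.0.2 API; the endpoint, TTL, and keep-alive period are placeholders. Run two copies and kill the one holding the lock.)

```java
import com.coreos.jetcd.Client;
import com.coreos.jetcd.Lease;
import com.coreos.jetcd.Lock;
import com.coreos.jetcd.data.ByteSequence;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Two copies of this program compete for the same lock; each refreshes its
// own lease with keepAliveOnce(). Killing the holder should let its lease
// expire so the surviving copy can take the lock.
public class CompetingLockSketch {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder().endpoints("http://127.0.0.1:2379").build();
        Lease leaseClient = client.getLeaseClient();
        Lock lockClient = client.getLockClient();

        long leaseId = leaseClient.grant(10).get().getID();

        // Refresh the lease periodically while this process is alive.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> leaseClient.keepAliveOnce(leaseId), 0, 3, TimeUnit.SECONDS);

        ByteSequence lockKey =
                lockClient.lock(ByteSequence.fromString("shared-lock"), leaseId).get().getKey();
        System.out.println("lock acquired: " + lockKey);
        // Hold the lock until the process is killed.
        Thread.sleep(Long.MAX_VALUE);
    }
}
```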
@fanminshi @Grisu118, I also have the same problem in a multi-threaded environment.
Below is the EtcdClient code:
And this is the result:
Is there any problem with my test code? Thanks.
Maybe the reason is the timeout parameter.
Changing 1000 to 2000 makes everything work.
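(The thread does not say which parameter the 1000/2000 refers to; assuming it is a millisecond bound on a blocking wait for the lock future, the change would look like this hypothetical helper.)

```java
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical reading of the fix above: if 1000 was a millisecond bound on
// the blocking wait, doubling it gives slow cluster round-trips room to finish.
static <T> T awaitLock(Future<T> lockFuture) throws Exception {
    return lockFuture.get(2000, TimeUnit.MILLISECONDS); // was 1000
}
```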
I'm using jetcd in my project to lock/unlock an etcd key before updating its value. I verified from the code and logs that every time a key is locked, it is also unlocked.
Whenever we fail to acquire a lock, we keep retrying until we get it.
However, we still end up in scenarios where, after some time, none of the threads are able to acquire any lock. Eventually we see most of the threads getting stuck on locks which no other thread appears to be holding, as confirmed from the logs. Even when we fail to acquire a lock, we see the lock key getting created in etcd.
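(A hedged sketch of the retry-until-acquired pattern just described, under the same assumed jetcd 0.0.2 API as the sketches above; the lock name, lease TTL, and per-attempt timeout are placeholders, not the reporter's actual code.)

```java
import com.coreos.jetcd.Client;
import com.coreos.jetcd.Lease;
import com.coreos.jetcd.Lock;
import com.coreos.jetcd.data.ByteSequence;
import com.coreos.jetcd.lock.LockResponse;

import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Retry until the lock is acquired; hold it only long enough to do the
// update, then unlock immediately.
public final class RetryLockSketch {
    static void updateUnderLock(Client client, Runnable update) throws Exception {
        Lease leaseClient = client.getLeaseClient();
        Lock lockClient = client.getLockClient();
        ByteSequence lockName = ByteSequence.fromString("update-lock");

        LockResponse lock = null;
        while (lock == null) {
            long leaseId = leaseClient.grant(5).get().getID();
            try {
                lock = lockClient.lock(lockName, leaseId).get(2, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                // Failed attempt: revoke the lease so the queued lock key
                // does not linger, then retry.
                leaseClient.revoke(leaseId).get();
            }
        }
        try {
            update.run(); // the etcd update, done while holding the lock
        } finally {
            lockClient.unlock(lock.getKey()).get();
        }
    }
}
```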
I'm using the recent 0.0.2 version with all the latest changes. Is this a known issue? Is a fix available? The etcd server version I'm using is 3.2.13.