
EtcdLeaderElection doesn't lose leadership when network unavailable #49

Open
belovaf opened this issue Jan 24, 2021 · 7 comments

Comments


belovaf commented Jan 24, 2021

Hello!

I think that when the SessionLease becomes EXPIRED, the owning EtcdLeaderElection should call ElectionListener.leadershipChange(false).

With the current behavior we can end up with several leaders at the same time.
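
To make the expectation concrete, here is a minimal sketch of the application side. The only thing taken from the library is the ElectionListener.leadershipChange(boolean) callback referenced above; the start/stop hooks are hypothetical application methods.

// Minimal sketch, not library code: the only assumption from etcd-java is the
// ElectionListener.leadershipChange(boolean) callback referenced in this issue;
// startLeaderOnlyWork/stopLeaderOnlyWork are hypothetical application methods.
public void leadershipChange(boolean isLeader) {
    if (isLeader) {
        startLeaderOnlyWork();  // hypothetical: begin work only the leader may do
    } else {
        // This branch should also be reached when the SessionLease becomes
        // EXPIRED; otherwise a partitioned node keeps acting as leader and
        // several leaders can exist at once.
        stopLeaderOnlyWork();   // hypothetical: stop leader-only work immediately
    }
}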


njhill commented Jan 24, 2021

Thanks @belovaf, I will take a closer look at this. It actually used to be handled via the PersistentLeaseKey's connection to the RangeCache, but this behaviour was modified in version 0.0.11 and later, I think due to some unintended side effects.


belovaf commented Jan 24, 2021

Also I noticed that etcd supports election natively:
https://etcd.io/docs/v3.3.13/dev-guide/api_concurrency_reference_v3/

Looks like it can be implemented with SessionLease only.
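
Purely as an illustration of that flow (this uses the separate jetcd client rather than etcd-java, and the exact method signatures are from memory, so treat them as an assumption; the endpoint, election name, TTL, and candidate value are made-up example values):

// Rough sketch of etcd's native election API (the Campaign/Resign RPCs from the
// linked concurrency reference), shown with the separate jetcd client purely for
// illustration; signatures are from memory and should be double-checked.
import java.nio.charset.StandardCharsets;
import io.etcd.jetcd.ByteSequence;
import io.etcd.jetcd.Client;
import io.etcd.jetcd.election.CampaignResponse;
import io.etcd.jetcd.election.LeaderKey;

public class NativeElectionSketch {
    public static void main(String[] args) throws Exception {
        Client client = Client.builder().endpoints("http://127.0.0.1:2379").build();

        // Leadership is tied to a lease: if the lease expires (e.g. the node is
        // partitioned and keep-alives stop), the server ends the leadership
        // automatically; no client-side bookkeeping is required.
        long leaseId = client.getLeaseClient().grant(10).get().getID();

        ByteSequence election = ByteSequence.from("my-election", StandardCharsets.UTF_8);
        ByteSequence candidate = ByteSequence.from("node-1", StandardCharsets.UTF_8);

        // The campaign future completes only once this candidate actually holds
        // leadership.
        CampaignResponse resp = client.getElectionClient()
                .campaign(election, leaseId, candidate).get();
        LeaderKey leaderKey = resp.getLeader();

        // ... keep the lease alive and do leader-only work here ...

        // Step down explicitly when finished.
        client.getElectionClient().resign(leaderKey).get();
        client.close();
    }
}

The relevant point for this issue is that leadership is bound to the lease on the server side, so an expired lease genuinely ends leadership rather than leaving the old leader running.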


sherman commented Feb 21, 2021

Hi, all!

I think it would be nice to have a configurable option for what to do when the network or the etcd server is unavailable. Here's a simple log: the server became the leader, then I killed the etcd server and finally restarted it. There are no events about lost/restored leadership.


sherman commented Feb 21, 2021

AFAICS, when a lease's state changes to EXPIRED, this line of code is executed. Maybe that is the right place to notify listeners about this situation?


sherman commented Feb 21, 2021

The patch illustrates the idea.

diff --git a/src/main/java/com/ibm/etcd/client/utils/PersistentLeaseKey.java b/src/main/java/com/ibm/etcd/client/utils/PersistentLeaseKey.java
index ab4ce68..076e37f 100644
--- a/src/main/java/com/ibm/etcd/client/utils/PersistentLeaseKey.java
+++ b/src/main/java/com/ibm/etcd/client/utils/PersistentLeaseKey.java
@@ -94,6 +94,9 @@ public class PersistentLeaseKey extends AbstractFuture<ByteString> implements Au
                 putKey(lease.getLeaseId());
             } else {
                 leaseActive = false;
+                if (rangeCache != null && newState == LeaseState.EXPIRED) {
+                    rangeCache.deleteLocaly(key);
+                }
             }
         });
     }
diff --git a/src/main/java/com/ibm/etcd/client/utils/RangeCache.java b/src/main/java/com/ibm/etcd/client/utils/RangeCache.java
index bc82dd5..4eaa96c 100644
--- a/src/main/java/com/ibm/etcd/client/utils/RangeCache.java
+++ b/src/main/java/com/ibm/etcd/client/utils/RangeCache.java
@@ -425,6 +425,14 @@ public class RangeCache implements AutoCloseable, Iterable<KeyValue> {
         return listeners.remove(listener);
     }
 
+    public void deleteLocaly(ByteString key) {
+        KeyValue existKv = entries.get(key);
+        if (existKv != null) {
+            entries.remove(key);
+            notifyListeners(EventType.DELETED, existKv, true);
+        }
+    }
+
     /**
      * Interface for listening to update events from
      * the cache


njhill commented Feb 25, 2021

Thanks @sherman, sorry I'm a bit swamped at the moment but I will have a closer look at this soon. What you suggest is similar to the prior logic that was there; I need to look again at what the problematic side effects were. I think it could result in an inconsistent cache state in some circumstances - i.e. the key ending up absent from the cache when it should not have been.


sherman commented Mar 15, 2021

Hi, @njhill.

Additionally, I've tested the patch against a 3-node etcd cluster with a leader/follower configuration. I can't see any problems.

Cases:

  1. Run a leader and a follower. Block the network between the leader and etcd. When the leader changes, restore the network.
  2. Restart the etcd cluster.

What kind of circumstances do you mean? Could you please provide any information on how they could be reproduced?
