
[BUG] The remote cluster connection times out when the node restarts and the remote cluster seeds point to itself #7950

Closed
kkewwei opened this issue Jun 7, 2023 · 9 comments · Fixed by #8038
Labels
Cluster Manager, enhancement (Enhancement or improvement to existing feature or request)

Comments

@kkewwei
Contributor

kkewwei commented Jun 7, 2023

Describe the bug
We configure many remote clusters pointing to the local cluster to meet our needs. The seeds are the domain name of the local cluster, and in our case the domain name resolves to the local node.

If the node restarts, it will not join the cluster successfully because it is blocked.

To Reproduce
Steps to reproduce the behavior:
The cluster: data0 (127.0.0.1:9200), master1 (127.0.0.1:9201), master2 (127.0.0.1:9202)
Step 1: the cluster is healthy and contains the three nodes.
Step 2: set 10 remote cluster settings, remote_name={remote1...remote10}, all of which point to data0:

PUT _cluster/settings
{
 "persistent" : {
    "cluster" : {
      "remote" : {
        "{remote_name}" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "127.0.0.1:9300"
          ]
        }
      }
    }
 }
}

Step 3: kill data0; when it restarts, it is blocked while joining the cluster.

Screenshots

The blocked JVM stack is as follows:

"elasticsearch[data0][clusterApplierService#updateTask][T#1]" #42 daemon prio=5 os_prio=0 cpu=2872.02ms elapsed=415.73s tid=0x00007fc440e08980 nid=0x130 runnable  [0x00007fc448af4000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@15.0.1/Native Method)
        - parking to wait for  <0x00007fcb8a8527c8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(java.base@15.0.1/LockSupport.java:252)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@15.0.1/AbstractQueuedSynchronizer.java:716)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@15.0.1/AbstractQueuedSynchronizer.java:1073)
        at java.util.concurrent.CountDownLatch.await(java.base@15.0.1/CountDownLatch.java:278)
        at org.elasticsearch.transport.RemoteClusterService.updateRemoteCluster(RemoteClusterService.java:298)
        at org.elasticsearch.transport.RemoteClusterAware.validateAndUpdateRemoteCluster(RemoteClusterAware.java:111)
        at org.elasticsearch.transport.RemoteClusterAware$$Lambda$2441/0x0000000801519ca8.accept(Unknown Source)
        at org.elasticsearch.common.settings.AbstractScopedSettings$2.apply(AbstractScopedSettings.java:354)
        at org.elasticsearch.common.settings.AbstractScopedSettings$2.apply(AbstractScopedSettings.java:326)
        at org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:627)
        at org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater$$Lambda$3229/0x0000000801753d60.run(Unknown Source)
        at org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:193)
        - locked <0x00007fc58b5c5f30> (a org.elasticsearch.common.settings.ClusterSettings)
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:466)
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418)
        at org.elasticsearch.cluster.service.ClusterApplierService.access$000(ClusterApplierService.java:68)
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:162)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:684)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252)
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@15.0.1/ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@15.0.1/ThreadPoolExecutor.java:630)
        at java.lang.Thread.run(java.base@15.0.1/Thread.java:832)

The related log is as follows:

[2023-06-06T07:25:21,664][DEBUG][o.e.c.c.PublicationTransportHandler] [data0] received full cluster state version [2643818] with size [2199150]
[2023-06-06T07:25:21,666][DEBUG][o.e.c.c.Coordinator      ] [data0] handlePublishRequest: coordinator becoming FOLLOWER of [{es-master-ith17x5iwky9i9am-0}{eUjPI4KKTGCTAG6QubjM3g}{0V0TUH8NS3eQezmZFKjm5g}{172.24.70.226}{172.24.70.226:9300}{imr}{data_node=master}] in term 118 (was CANDIDATE, lastKnownLeader was [Optional.empty])
[2023-06-06T07:25:21,689][DEBUG][o.e.c.s.ClusterApplierService] [data0] processing [ApplyCommitRequest{term=118, version=2643818, sourceNode={es-master-ith17x5iwky9i9am-0}{eUjPI4KKTGCTAG6QubjM3g}{0V0TUH8NS3eQezmZFKjm5g}{172.24.70.226}{172.24.70.226:9300}{imr}{data_node=master}}]: execute
[2023-06-06T07:25:21,691][DEBUG][o.e.c.s.ClusterApplierService] [data0] cluster state updated, version [2643818], source [ApplyCommitRequest{term=118, version=2643818, sourceNode={es-master-ith17x5iwky9i9am-0}{eUjPI4KKTGCTAG6QubjM3g}{0V0TUH8NS3eQezmZFKjm5g}{172.24.70.226}{172.24.70.226:9300}{imr}{data_node=master}}]
[2023-06-06T07:25:21,693][INFO ][o.e.c.s.ClusterApplierService] [data0] master node changed {previous [], current [{es-master-ith17x5iwky9i9am-0}{eUjPI4KKTGCTAG6QubjM3g}{0V0TUH8NS3eQezmZFKjm5g}{172.24.70.226}{172.24.70.226:9300}{imr}{data_node=master}]}, added {{es-master-ith17x5iwky9i9am-0}{eUjPI4KKTGCTAG6QubjM3g}{0V0TUH8NS3eQezmZFKjm5g}{172.24.70.226}{172.24.70.226:9300}{imr}{data_node=master},{es-master-ith17x5iwky9i9am-2}{TjTqqe5VS9G-D2XXbq4A1w}{hzLZnAaSRQyrd4jWLuw2vg}{172.19.120.103}{172.19.120.103:9300}{imr}{data_node=master},{es-data-ith17x5iwky9i9am-4}{KGr3nbpMTouGRp8UeUFFqQ}{9qUT47sFTEev4uqG47GE7g}{172.19.149.62}{172.19.149.62:9300}{dir}{data_node=hot},{es-master-ith17x5iwky9i9am-1}{QIlaC9VkTeKAr-ezekbCsw}{tTPydcqQToWfZRp7CphMHQ}{172.29.103.63}{172.29.103.63:9300}{imr}{data_node=master},{es-data-ith17x5iwky9i9am-1}{MvUONrMhT0q9mURbJMyyLA}{8TfnB5G1SKu4p87tMaEeRg}{172.29.103.64}{172.29.103.64:9300}{dir}{data_node=hot},{es-data-ith17x5iwky9i9am-3}{MsgCwRoQSLijqhCeouYI9Q}{eWIBx3uBTFiBkD-nfkiZjA}{172.23.119.192}{172.23.119.192:9300}{dir}{data_node=hot},{es-data-ith17x5iwky9i9am-5}{lIlPN12eSNS64CWJPjwVeA}{lnRCt5DWTpKroAcAoKjTyw}{172.26.13.99}{172.26.13.99:9300}{dir}{data_node=hot},{es-data-ith17x5iwky9i9am-6}{keu3oi4cT1WvY4cdI6Vt5Q}{QQA_FeABQpuJOzqdO_xkUQ}{172.19.124.80}{172.19.124.80:9300}{dir}{data_node=hot},{es-data-ith17x5iwky9i9am-2}{8yYzycxUQbmBWiF5pn-zxg}{drmhwzadQ0uU18sOC1Gmdw}{172.19.120.104}{172.19.120.104:9300}{dir}{data_node=hot}}, term: 118, version: 2643818, reason: ApplyCommitRequest{term=118, version=2643818, sourceNode={es-master-ith17x5iwky9i9am-0}{eUjPI4KKTGCTAG6QubjM3g}{0V0TUH8NS3eQezmZFKjm5g}{172.24.70.226}{172.24.70.226:9300}{imr}{data_node=master}}
[2023-06-06T07:25:21,697][DEBUG][o.e.c.NodeConnectionsService] [data0] connecting to {es-master-ith17x5iwky9i9am-2}{TjTqqe5VS9G-D2XXbq4A1w}{hzLZnAaSRQyrd4jWLuw2vg}{172.19.120.103}{172.19.120.103:9300}{imr}{data_node=master}
[2023-06-06T07:25:21,699][DEBUG][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [data0] Connection to 172.19.120.103 needs to be ssl, adding ssl handler to the client channel
[2023-06-06T07:25:21,699][DEBUG][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [data0] Connection to 172.19.120.103 needs to be ssl, adding ssl handler to the client channel
[2023-06-06T07:25:21,700][DEBUG][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [data0] Connection to 172.19.120.103 needs to be ssl, adding ssl handler to the client channel
[2023-06-06T07:25:21,700][DEBUG][c.a.o.s.s.t.OpenDistroSecuritySSLNettyTransport] [data0] Connection to 172.19.120.103 needs to be ssl, adding ssl handler to the client channel
......
[2023-06-06T07:25:48,697][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [data0] no known master node, scheduling a retry

Failed to connect to the remote clusters:

[2023-06-06T07:26:35,242][WARN ][o.e.t.RemoteClusterService] [data0] failed to connect to new remote cluster remote1 within 10s
[2023-06-06T07:26:45,242][WARN ][o.e.t.RemoteClusterService] [data0] failed to connect to new remote cluster remote2 within 10s
[2023-06-06T07:26:55,243][WARN ][o.e.t.RemoteClusterService] [data0] failed to connect to new remote cluster remote3 within 10s

The reason why data0 fails to join the cluster (ClusterApplierService.applyChanges):

  1. data0 restarts and receives the new ClusterState.
  2. data0 starts to connect to the remote clusters, but fails to build the remote connection that maps back to itself, because the connection sends a request to the master.
    https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/transport/SniffConnectionStrategy.java#L330

But the master is null in the local ClusterState, and the new ClusterState has not been applied yet (it will be applied later), so the request is blocked.

That means there is a deadlock: connecting to the remote node depends on applying the new ClusterState, but applying the new ClusterState is blocked by connecting to the remote node.

So every remote cluster connection (pointing to the local IP) times out after 10s (see above); if we have ten or more such remote cluster settings, it costs 100s, which means the node will be removed from the cluster by the master.
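To make the blocking pattern concrete, here is a minimal, self-contained sketch of the synchronous flow described above (class and method names are simplified stand-ins for illustration, not the actual OpenSearch code):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Simplified illustration of the synchronous pattern: the applier thread
// waits on a latch for every remote connection before the new state is
// considered applied. Names and structure are invented for this sketch.
class BlockingRemoteConnectSketch {

    // Simulates the cluster applier thread applying a new ClusterState.
    static void applyClusterState(String[] remoteClusters) throws InterruptedException {
        for (String remote : remoteClusters) {
            CountDownLatch connected = new CountDownLatch(1);
            connectToRemote(remote, connected::countDown);
            // The applier thread parks here, like the CountDownLatch.await
            // in the stack trace above. If the seed points back to this node,
            // the handshake needs the very ClusterState we are still applying,
            // so the latch is only released by the 10s timeout.
            if (!connected.await(10, TimeUnit.SECONDS)) {
                System.out.println("failed to connect to new remote cluster " + remote + " within 10s");
            }
        }
        // Only after every remote has connected or timed out does the node
        // finish applying the state; with 10 self-referencing remotes this
        // adds roughly 100s, exceeding the leader's lag timeout.
    }

    static void connectToRemote(String remote, Runnable onConnected) {
        // In the real code this triggers a sniff/handshake request that is
        // only answered once the node knows its master; for a self-referencing
        // seed it never completes, which is what this sketch mimics by doing
        // nothing here.
    }
}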

There are two problems when applying the new ClusterState:

  1. Connecting to a remote cluster that points to the local IP costs 10s (the timeout) when the node first joins the cluster.
  2. Connecting to remote clusters is part of applying the new ClusterState; if there are thousands of remote clusters, it may cost a lot of time and the node may be removed by the master.

Solution:
Should we handle these remote connections asynchronously?

Host/Environment (please complete the following information):

  • OS: Debian 4.19.117.bsk.12 Thu Mar 17 16:41:16 x86_64 x86_64 x86_64 GNU/Linux
  • Version: ES 7.10.2
@kkewwei kkewwei added bug Something isn't working untriaged labels Jun 7, 2023
@dblock
Member

dblock commented Jun 7, 2023

I suspect @shwetathareja knows a lot about this problem, care to comment?

@shwetathareja
Member

Thanks @kkewwei for raising the issue.
In general I agree that applying a cluster state shouldn't run expensive appliers (including the cluster settings applier) or listeners, as it can stall the application of cluster state updates on nodes and eventually result in node drops. The leader kicks out any node after 90s (the default lag timeout) if it has not applied the latest state.

And in the case of a remote cluster, establishing the connection can take longer, and a failure to connect to the remote cluster shouldn't cause a node drop in the local cluster. It gives up after 10s anyway, so why should it even waste 10s trying synchronously on the ClusterApplier thread? We could definitely consider establishing remote connections asynchronously after evaluating in detail.

But the underlying issue reported, about why it took longer to connect, is not exactly a valid deadlock in the code, though it helped surface the above problem of the local node being unable to join the cluster because the remote connection takes longer.

Step 2: set 10 remote cluster settings, remote_name={remote1...remote10}, all of which point to data0

My understanding: your local cluster, which has data0 as one of its nodes, is trying to connect to 10 remote clusters (remote1 .. remote10).

For each remote cluster, the seeds should point to a node in that remote cluster rather than a data node in the local cluster, as it did in your case with data0. The list of seed nodes is used to sniff the remote cluster state, not the local cluster state. This self-loop was created by a wrong configuration where, instead of pointing to a node in the remote cluster, the setting points to itself.

I am wondering whether we should have best-effort validation checks to prevent adding a transport address of the local cluster to the seeds of a remote cluster setting. But this is not foolproof, as a node with the same local transport address can also join the local cluster later.
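A rough sketch of what such a best-effort check could look like (purely illustrative; the helper and the way the local publish addresses are obtained are assumptions, not an existing OpenSearch API):

import java.net.InetSocketAddress;
import java.util.List;
import java.util.Set;

// Illustrative only: a best-effort validator that rejects remote-cluster
// seeds resolving to one of the local node's own transport addresses.
// As noted above, this cannot be fully reliable: DNS can change and a node
// with the same address can join the local cluster later.
class SeedValidatorSketch {

    static void validateSeeds(String remoteClusterAlias,
                              List<String> seeds,
                              Set<InetSocketAddress> localPublishAddresses) {
        for (String seed : seeds) {
            InetSocketAddress resolved = parseSeed(seed);
            if (localPublishAddresses.contains(resolved)) {
                throw new IllegalArgumentException(
                    "seed [" + seed + "] of remote cluster [" + remoteClusterAlias
                        + "] points to this local node; seeds must belong to the remote cluster");
            }
        }
    }

    private static InetSocketAddress parseSeed(String seed) {
        // "host:port" -> socket address (best-effort DNS resolution)
        int idx = seed.lastIndexOf(':');
        return new InetSocketAddress(seed.substring(0, idx), Integer.parseInt(seed.substring(idx + 1)));
    }
}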

@ankitkala
Member

I agree with @shwetathareja. This isn't a valid deadlock scenario with multiple remotes pointing to the same local node.

For the usual use cases, there will be a 10-second delay in the node joining the cluster. We can definitely explore avoiding this altogether by establishing connections asynchronously.

@kkewwei
Contributor Author

kkewwei commented Jun 10, 2023

I also agree with @shwetathareja and @ankitkala; the usage seems unreasonable.

But it is indeed used in production. The reasons why multiple remotes point to the same local node are as follows:
1. The index is in multiple clusters. For convenience, the user establishes remote clusters in each cluster for the index.
2. The user uses a domain name to point to the local cluster, but it always resolves to the local node (maybe because of DNS, or other reasons).

We could definitely consider establishing remote connections asynchronously after evaluating in detail.
@shwetathareja @ankitkala, do you have any ideas? Otherwise I will write up my idea later.

Should we add a timeout when establishing a remote connection? If the process is async, it doesn't matter if we wait for a while, even when pointing to the local node.

@shwetathareja
Member

But it is indeed used in production. The reasons why multiple remotes point to the same local node are as follows:
1. The index is in multiple clusters. For convenience, the user establishes remote clusters in each cluster for the index.
2. The user uses a domain name to point to the local cluster, but it always resolves to the local node (maybe because of DNS, or other reasons).

@kkewwei Let's say there is an index index1 which is present in clusters c1, c2, c3. Now, if you intend to land on any of the clusters c1 to c3 and still be able to search index1, then every cluster should have a remote configured for every other cluster except itself. Also, when configuring a remote for a cluster, the node IP should be from that cluster.

Example:

Cluster c1

PUT _cluster/settings
{
 "persistent" : {
    "cluster" : {
      "remote" : {
        "c2" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "127.0.0.1:9305"           // "127.0.0.1:9305" exists in c2 cluster
          ]
        },
        "c3" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "127.0.0.1:9306"        //  "127.0.0.1:9306" exists in c3 cluster
          ]
        },
...
      }
    }
 }
}

Cluster c2

PUT _cluster/settings
{
 "persistent" : {
    "cluster" : {
      "remote" : {
        "c1" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "127.0.0.1:9304"           // "127.0.0.1:9304" exists in c1 cluster
          ]
        },
        "c3" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "127.0.0.1:9306"        //  "127.0.0.1:9306" exists in c3 cluster
          ]
        },
...
      }
    }
 }
}

Cluster c3

PUT _cluster/settings
{
 "persistent" : {
    "cluster" : {
      "remote" : {
        "c1" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "127.0.0.1:9304"           // "127.0.0.1:9304" exists in c1 cluster
          ]
        },
        "c2" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "127.0.0.1:9305"        //  "127.0.0.1:9305" exists in c2 cluster
          ]
        },
...
      }
    }
 }
}

@shwetathareja @ankitkala, do you have any ideas? Otherwise I will write up my idea later.

I have not started looking into it, but in general, establishing the connection with the remote cluster can be made async. Please feel free to take it forward if you want to do the investigation or raise a PR.

if the process is async, it doesn't matter if we wait for a while, even when pointing to the local node.

It should never point to the local node. That is not how it is supposed to work. The seeds should belong to the remote cluster, not the local cluster.

@shwetathareja shwetathareja added enhancement Enhancement or improvement to existing feature or request and removed bug Something isn't working labels Jun 13, 2023
@ankitkala
Member

I've raised a PR for establishing the connections asynchronously: #8038
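For readers following the thread, a minimal sketch of the general idea (this is not the code in the PR; names and structure are invented for illustration): the settings applier only schedules the connection attempt and returns immediately, so applying the cluster state is no longer gated on remote handshakes.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative sketch only: hand the connection attempt to another executor
// and report the outcome via a callback instead of blocking the applier thread.
class AsyncRemoteConnectSketch {

    private final ExecutorService connectExecutor =
        Executors.newSingleThreadExecutor(r -> new Thread(r, "remote-cluster-connect"));

    // Called from the settings applier; returns without blocking.
    void updateRemoteCluster(String alias, Runnable connect) {
        CompletableFuture
            .runAsync(connect, connectExecutor)
            .orTimeout(10, TimeUnit.SECONDS)   // report failure after 10s, off the applier thread
            .whenComplete((ignored, error) -> {
                if (error != null) {
                    System.out.println("failed to connect to remote cluster [" + alias + "]: " + error);
                } else {
                    System.out.println("connected to remote cluster [" + alias + "]");
                }
            });
    }
}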

@dblock
Member

dblock commented Jun 20, 2023

@ankitkala Other than fixing a deadlock, is there a performance benefit with your implementation, in that a cluster comes up faster?

@ankitkala
Member

ankitkala commented Jun 20, 2023

Yes. If any seed node connection call was timing out, we were waiting for 30 seconds (cluster.remote.initial_connect_timeout) during node bootstrap.

With this change, this logic runs asynchronously, so the cluster will come up faster in such cases.

@dblock
Member

dblock commented Jun 26, 2023

@ankitkala This is a neat change that hides a performance improvement that could use a blog post on opensearch.org. Hint hint.
