
Cross-cluster search and default connections can get crossed #29321

Closed
jasontedor opened this issue Mar 30, 2018 · 11 comments · Fixed by #31331

Comments

@jasontedor
Member

This is arising during recovery where it appears that the cross-cluster search connection profile is being selected:

Caused by: java.lang.IllegalStateException: can't select channel size is 0 for types: [RECOVERY, BULK, STATE]
	at org.elasticsearch.transport.ConnectionProfile$ConnectionTypeHandle.getChannel(ConnectionProfile.java:216) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.transport.TcpTransport$NodeChannels.channel(TcpTransport.java:416) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.transport.TcpTransport$NodeChannels.sendRequest(TcpTransport.java:441) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:586) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:519) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.transport.TransportService.submitRequest(TransportService.java:481) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.writeFileChunk(RemoteRecoveryTargetHandler.java:151) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler$RecoveryOutputStream.lambda$sendNextChunk$0(RecoverySourceHandler.java:537) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:105) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler$RecoveryOutputStream.sendNextChunk(RecoverySourceHandler.java:536) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler$RecoveryOutputStream.write(RecoverySourceHandler.java:529) ~[elasticsearch-5.6.3.jar:5.6.3]
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82) ~[?:1.8.0_144]
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140) ~[?:1.8.0_144]
	at org.elasticsearch.common.io.Streams.copy(Streams.java:83) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.common.io.Streams.copy(Streams.java:60) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.sendFiles(RecoverySourceHandler.java:554) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.phase1(RecoverySourceHandler.java:276) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.RecoverySourceHandler.recoverToTarget(RecoverySourceHandler.java:138) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.recover(PeerRecoverySourceService.java:132) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService.access$100(PeerRecoverySourceService.java:54) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:141) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.indices.recovery.PeerRecoverySourceService$StartRecoveryTransportRequestHandler.messageReceived(PeerRecoverySourceService.java:138) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.6.3.jar:5.6.3]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1539) ~[elasticsearch-5.6.3.jar:5.6.3]
	... 5 more
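The failure mode in the trace can be illustrated with a small self-contained sketch (these are not the real Elasticsearch classes; names and shapes are simplified): a connection-type handle that owns zero channels for the requested request types throws exactly this kind of IllegalStateException when asked to select a channel.

```java
import java.util.EnumSet;
import java.util.Set;

// Minimal sketch of ConnectionProfile.ConnectionTypeHandle.getChannel-style
// behaviour: selecting a channel fails when the profile configured none
// for the requested request types.
class ConnectionProfileSketch {
    enum RequestType { RECOVERY, BULK, STATE, REG, PING }

    // a handle owns `length` channels for a set of request types
    static int getChannel(int length, Set<RequestType> types) {
        if (length == 0) {
            throw new IllegalStateException("can't select channel size is 0 for types: " + types);
        }
        return 0; // the real code round-robins across its channel range
    }

    public static void main(String[] args) {
        // a slim cross-cluster-search profile configures no channels for
        // RECOVERY/BULK/STATE, so selecting one for a recovery request fails
        try {
            getChannel(0, EnumSet.of(RequestType.RECOVERY, RequestType.BULK, RequestType.STATE));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```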
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@Tim-Brooks
Contributor

Tim-Brooks commented Apr 6, 2018

I looked at this today. It appears to me that the majority of connections use the TransportService openConnection method to open the connection (which is not managed by the TransportService). However, one node is picked as the "handshake node". This node is officially added to the TransportService through the connectToNode method. I think that this means that the cross-cluster search profile is stored in the TransportService and used for that node's connection (if it is registered first).

I'm not super familiar with the cross-cluster search design. But I assume registering the handshake node with the TransportService is intentional @s1monw?

We should consider:

  1. If we want cross-cluster search nodes to be registered with the TransportService.
  2. Adding validation to prevent adding the current cluster as a remote cluster for cross-cluster search.

In RemoteClusterConnection:

    try {
        handshakeNode = transportService.handshake(connection, remoteProfile.getHandshakeTimeout().millis(),
            (c) -> remoteClusterName.get() == null ? true : c.equals(remoteClusterName.get()));
    } catch (IllegalStateException ex) {
        logger.warn((Supplier<?>) () -> new ParameterizedMessage("seed node {} cluster name mismatch expected " +
            "cluster name {}", connection.getNode(), remoteClusterName.get()), ex);
        throw ex;
    }
    if (nodePredicate.test(handshakeNode) && connectedNodes.size() < maxNumRemoteConnections) {
        transportService.connectToNode(handshakeNode, remoteProfile);
        connectedNodes.add(handshakeNode);
    }

@Tim-Brooks
Contributor

I just wanted to ping about this issue @s1monw. @jasontedor wanted me to investigate, and the last comment I posted was the result of what I found. I'm not sure what action we want to take now (such as preventing adding the current cluster in settings, or whether there is something about the "handshake node" concept that we need to change).

@s1monw
Contributor

s1monw commented May 31, 2018

@tbrooks8 this can only happen if we are connecting to a node in the same cluster, right? I mean, if it's a remote node we should be connected to it. So is this what is happening here?

@s1monw
Contributor

s1monw commented May 31, 2018

I think a simple fix would be to compare the local cluster UUID to the remote UUID and then not pass the profile. This should be doable. @tbrooks8 @javanna WDYT?
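A toy illustration of this kind of guard (hypothetical names, not the real API; strings stand in for connection profiles): only use the slim remote-cluster profile when the node we handshaked with reports a different cluster identity than the local cluster, and otherwise fall back to the default profile.

```java
// Sketch of the proposed check: pick the connection profile based on
// whether the handshaked node belongs to the local cluster.
class ProfileSelectionSketch {
    static String pickProfile(String localCluster, String handshakedCluster) {
        // same cluster -> use the default (full) connection profile,
        // which has channels for RECOVERY/BULK/STATE traffic
        return localCluster.equals(handshakedCluster) ? "default" : "remote";
    }

    public static void main(String[] args) {
        System.out.println(pickProfile("cluster-a", "cluster-a")); // default
        System.out.println(pickProfile("cluster-a", "cluster-b")); // remote
    }
}
```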

@treyd

treyd commented May 31, 2018

Just wanted to add my perspective as a user about the CCS configuration and adding a remote cluster pointing to itself... to me, it actually makes more sense to have the same cross-cluster search configuration in all clusters that use each other for search. My use case (described also in #27006 (comment)) has multiple clusters that I deploy alongside a multi-data-center system to index that system's logs, so I'd like each cluster to be able to search across logs in both data centers. I'd prefer to have the same CCS configuration in both clusters, e.g.:

search:
    remote:
        region1: 
            seeds: 10.0.10.10:9300
        region2: 
            seeds: 10.0.20.10:9301

This also has the benefit of being able to use the search API the same way in both clusters (e.g. GET *:logstash-*/_search), and explicitly see which cluster the results are coming from.

@Tim-Brooks
Contributor

@s1monw - I don't follow your suggested fix. But essentially, as @treyd describes, some users reuse cross-cluster search configs between clusters. This leads to configs having the "local" cluster added to the search.remote settings.

When we open a connection to the remote cluster nodes, we create a managed connection in the TransportService. Once the TransportService has a connection open, it reuses that connection for further attempts to open a managed connection. So if the RemoteClusterConnection opens a connection to a local node before the normal connection is opened by the cluster, the remote cluster search connection is used for normal cluster operations (by the TransportService). Unfortunately, the remote cluster connection does not have channels configured for normal cluster operations such as bulk. This eventually leads to an exception being thrown.
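This reuse behaviour can be modelled with a toy sketch (hypothetical names; strings stand in for nodes and profiles): the first managed connection registered for a node wins, and a later connection attempt with a different profile silently reuses it.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of managed-connection reuse: whichever profile registers a
// node's connection first is the one all later callers get.
class ManagedConnectionSketch {
    final Map<String, String> connectedNodes = new HashMap<>(); // nodeId -> profile

    String connectToNode(String nodeId, String profile) {
        // an existing managed connection is reused regardless of the requested profile
        return connectedNodes.computeIfAbsent(nodeId, id -> profile);
    }

    public static void main(String[] args) {
        ManagedConnectionSketch ts = new ManagedConnectionSketch();
        // RemoteClusterConnection registers the local node first with the slim CCS profile...
        System.out.println(ts.connectToNode("node-1", "remote-profile")); // remote-profile
        // ...so the normal cluster connection attempt later reuses that connection,
        // which has no channels for bulk/recovery traffic
        System.out.println(ts.connectToNode("node-1", "default-profile")); // remote-profile
    }
}
```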

It seems to me that a reasonable fix would be to detect if the node is a member of the local cluster, do not open a managed connection from the RemoteClusterConnection, and log a warning.

If we knew the cluster UUID of the node we are connecting to prior to opening a managed connection, that would work. We do open a non-managed connection and make a HandshakeRequest. The HandshakeResponse returns the clusterName. Is that the UUID you are referring to? We could propagate the HandshakeResponse back from the TransportService (right now we only pass the DiscoveryNode back) to know if this node belongs to the local cluster.

@s1monw
Contributor

s1monw commented Jun 4, 2018

It seems to me that a reasonable fix would be to detect if the node is a member of the local cluster, do not open a managed connection from the RemoteClusterConnection, and log a warning.

++

If we knew the cluster uuid of the node we are connecting to prior opening a managed connection that would work. We do open a non-managed connection and make a HandshakeRequest. The HandshakeResponse returns the clusterName. Is that the uuid you are referring to? We could propagate the HandshakeResponse back from the TransportService (right now we only pass the DiscoveryNode back) to know if this node belongs to the local cluster.

that's the one I was referring to. Yet, I think we could also skip adding the managed connection until we fetch the cluster state and only add it there if we really need to. WDYT?

@Tim-Brooks
Contributor

that's the one I was referring to. Yet, I think we could also skip adding the managed connection until we fetch the cluster state and only add it there if we really need to. WDYT?

Yes I think that fix would work.

@s1monw
Contributor

s1monw commented Jun 7, 2018

@tbrooks8 are you going to look into it?

@Tim-Brooks
Contributor

Yeah sometime in the next few days.

@jpountz jpountz added v5.6.11 and removed v5.6.10 labels Jun 13, 2018
s1monw added a commit to s1monw/elasticsearch that referenced this issue Jun 14, 2018
If we run into a race condition between a node being configured
as a remote node for cross-cluster search etc. and that node joining
the cluster, we might connect to that node with a remote profile. If that
node now joins the cluster that connected to it as a CCS remote node, we use
the wrong profile and can't use bulk connections etc. anymore. This change
uses the remote profile only if we connect to a node that has a different cluster
name than the local cluster. This is not a perfect fix for this situation, but
it is the safe option; we potentially only lose a small optimization of using
fewer connections per node, which is minor anyway since we only connect to a
small set of nodes.

Closes elastic#29321
s1monw added a commit that referenced this issue Jun 17, 2018
s1monw added a commit that referenced this issue Jun 17, 2018
s1monw added a commit that referenced this issue Jun 17, 2018
s1monw added a commit to s1monw/elasticsearch that referenced this issue Jun 17, 2018
s1monw added a commit that referenced this issue Jun 17, 2018