[ci] :x-pack:rolling-upgrade:with-system-key times out when starting oneThirdUpgradedTestCluster node0 #32566

andyb-elastic · 2018-08-01T22:58:58Z

This happened in the CI intake job and on a PR job, and I was able to reproduce it locally.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2448/console

Test cluster logs
v6.5.0-SNAPSHOT#oldClusterTestCluster-node0.log
v6.5.0-SNAPSHOT#oldClusterTestCluster-node1.log
v6.5.0-SNAPSHOT#oldClusterTestCluster-node2.log
v6.5.0-SNAPSHOT#oneThirdUpgradedTestCluster-node0.log

I'm not sure whether this deserialization error is the real cause, but it appeared in the cluster logs of all three instances I looked at. There may have been some recent changes in this area (for example #32319).

[2018-08-01T20:22:35,314][WARN ][o.e.d.z.ZenDiscovery     ] [node-1] failed to validate incoming join request from node [{upgraded-node-0}{_DluXPheQ3q0NQzXEPKpzQ}{pb58T916Q0azZH_9t9KsVw}{127.0.0.1}{127.0.0.1:44168}{testattr=test, upgraded=true, ml.machine_memory=31606448128, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]
org.elasticsearch.transport.RemoteTransportException: [upgraded-node-0][127.0.0.1:44168][internal:discovery/zen/join/validate]
Caused by: java.lang.IllegalStateException: unexpected byte [0x04]
        at org.elasticsearch.common.io.stream.StreamInput.readBoolean(StreamInput.java:439) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readBoolean(StreamInput.java:429) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readOptionalLong(StreamInput.java:322) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.xpack.core.ml.job.config.Job.<init>(Job.java:242) ~[?:?]
        at org.elasticsearch.xpack.core.ml.MlMetadata.<init>(MlMetadata.java:140) ~[?:?]
        at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:46) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:39) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.MetaData.readFrom(MetaData.java:834) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.cluster.ClusterState.readFrom(ClusterState.java:727) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.MembershipAction$ValidateJoinRequest.readFrom(MembershipAction.java:173) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.Streamable.lambda$newWriteableReader$0(Streamable.java:51) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:56) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1633) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1501) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]
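For readers unfamiliar with this failure mode, here is a minimal sketch of how a strict boolean read can fail like this when writer and reader disagree on the field layout. It assumes, based only on the trace above, that an optional long is encoded as a presence flag followed by the value. It uses plain `java.io` streams rather than Elasticsearch's `StreamInput`, and the class name, helper names, and the extra one-byte field are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified stand-in for a binary wire format; not Elasticsearch code.
public class StreamDesyncSketch {

    // Strict boolean read: only 0 and 1 are valid. Any other byte means the
    // reader is no longer aligned with what the writer emitted.
    static boolean readStrictBoolean(DataInputStream in) throws IOException {
        int b = in.readUnsignedByte();
        if (b == 0) {
            return false;
        }
        if (b == 1) {
            return true;
        }
        throw new IllegalStateException(String.format("unexpected byte [0x%02x]", b));
    }

    // Optional long: a presence flag, then the value if present.
    static Long readOptionalLong(DataInputStream in) throws IOException {
        return readStrictBoolean(in) ? in.readLong() : null;
    }

    static void writeOptionalLong(DataOutputStream out, Long value) throws IOException {
        out.writeBoolean(value != null);
        if (value != null) {
            out.writeLong(value);
        }
    }

    // Writer side: emits a hypothetical extra one-byte field that the reader
    // below does not know about, followed by the optional long.
    static byte[] writeWithExtraField() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(0x04);
            writeOptionalLong(out, 42L);
        }
        return bytes.toByteArray();
    }

    // Reader side: expects the optional long first, so it interprets the
    // extra byte as the presence flag. Returns the value or the error message.
    static String readWithoutExtraField(byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            return String.valueOf(readOptionalLong(in));
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws IOException {
        // prints "unexpected byte [0x04]"
        System.out.println(readWithoutExtraField(writeWithExtraField()));
    }
}
```

The point of the sketch is only that an error in a boolean read usually indicates an earlier layout mismatch, not a problem with the boolean field itself.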

Bonus logs from my local reproduction
oldClusterTestCluster node0 run.log
oldClusterTestCluster node1 run.log
oldClusterTestCluster node2 run.log
oneThirdUpgradedTestCluster node0 run.log

elasticmachine · 2018-08-01T22:58:59Z

Pinging @elastic/ml-core

elasticmachine · 2018-08-01T22:59:00Z

Pinging @elastic/es-distributed

ywelsch · 2018-08-02T06:29:56Z

@dimitris-athanasiou can you take a look?

dimitris-athanasiou · 2018-08-02T09:57:52Z

@ywelsch That's weird. I haven't even merged the PR I was talking about! Looking.

dimitris-athanasiou · 2018-08-02T10:27:48Z

OK, this is my bad. Somehow I cherry-picked and merged the commit from #32496 into 6.x before ever merging the PR. I must have done so accidentally while backporting some other bug fixes; I don't recall at all how it happened :-(. I've now merged #32496 on master, which should fix the build. Apologies for the noise.
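As a general illustration of why a serialization change has to exist on both branches, here is a small sketch of a version-gated field. It is not taken from #32496; the class, the `V_WITH_FIELD` constant, and the field names are hypothetical, and plain `java.io` streams stand in for the real transport layer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative only: a new field guarded by a wire-version check.
public class VersionGateSketch {

    // Hypothetical wire version in which the new field was introduced.
    static final int V_WITH_FIELD = 2;

    // Writer: emit the new field only if the stream version understands it.
    static byte[] write(int streamVersion, long newField, String name) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (streamVersion >= V_WITH_FIELD) {
                out.writeLong(newField);
            }
            out.writeUTF(name);
        }
        return bytes.toByteArray();
    }

    // Reader: must apply exactly the same gate, in the same position.
    static String read(int streamVersion, byte[] payload) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload))) {
            if (streamVersion >= V_WITH_FIELD) {
                in.readLong(); // consume the new field
            }
            return in.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        // Both sides apply the same gate: the name round-trips.
        System.out.println(read(2, write(2, 7L, "job-1")));
        // Reader lacks the change (modelled by the older gate): the name is misparsed.
        System.out.println("job-1".equals(read(1, write(2, 7L, "job-1"))));
    }
}
```

When only one side carries the change, the reader starts parsing at the wrong offset, which is the kind of mismatch that appears once a commit lands on one branch but not the other.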

andyb-elastic added the >test-failure, :Distributed/Cluster Coordination, and :ml labels Aug 1, 2018

ywelsch removed the :Distributed/Cluster Coordination label Aug 2, 2018

dimitris-athanasiou closed this as completed Aug 2, 2018

dimitris-athanasiou self-assigned this Aug 2, 2018

colings86 mentioned this issue Aug 2, 2018

[CI] :x-pack:qa:rolling-upgrade:with-system-key:v6.5.0-SNAPSHOT#oneThirdUpgradedTestCluster#wait timeout waiting for cluster to start #32581

Closed
