zookeeper needs to be restarted after upgrading instance type #80

jazzl0ver · 2018-10-11T10:17:40Z

How to reproduce:

Create a working zookeeper/kafka cluster (firecamp version 1.0) using t2.large instances
Using CloudFormation update stack control, upgrade the instances to m5.large
Zookeeper containers are all up, while Kafka can't start up
Restart zookeeper service using firecamp-service-cli and the issue gets resolved

Zookeeper logs before step 4 contain a lot of similar messages:

2018-10-11 10:09:13,821 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@237] - Unexpected exception, tries=0, connecting to zoo-prod-2.firecamp-prod-firecamp.com/172.31.4.202:2888
java.net.ConnectException: Connection refused (Connection refused)

The text was updated successfully, but these errors were encountered:

JuniusLuo · 2018-10-15T16:45:24Z

This looks like a rolling restart steps for zk. Would be an issue of zk itself. The full logs would help to further analyse the issue.

jazzl0ver · 2018-10-15T17:00:59Z

The cluster has been running for several months w/o issues and I probably don't have the logs of point of start anymore
zoo.log.gz
. Please, check the logs attached and let me know if you need something else.

JuniusLuo · 2018-10-22T06:26:04Z

Not sure the root cause. The node looks not able to connect to zk on 172.31.4.202.

2018-10-11 10:10:29,214 [myid:2] - WARN [WorkerSender[myid=2]:QuorumCnxManager@584] - Cannot open channel to 3 at election address zoo-prod-2.firecamp-prod-firecamp.com/172.31.4.202:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:558)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:534)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:454)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:435)
at java.lang.Thread.run(Thread.java:748)
2018-10-11 10:10:29,216 [myid:2] - INFO [WorkerSender[myid=2]:QuorumPeer$QuorumServer@184] - Resolved hostname: zoo-prod-2.firecamp-prod-firecamp.com to address: zoo-prod-2.firecamp-prod-firecamp.com/172.31.4.202

jazzl0ver · 2018-10-22T09:28:30Z

Yeah, I'm not sure as well. The main point is that it's got fixed after restarting zookeeper. Probably we need to wait for all zookeeper nodes to start up before starting up kafka?

JuniusLuo · 2018-10-22T15:58:09Z

Are you able to get logs from node 172.31.4.202?

Kafka does rely on ZK. If zk cluster is not working, kafka will not work. We could consider to introduce the dependency between services. While, how to detecting one service is healthy might not be easy. It has to look into the service internal status. For kafka, it is not necessary to do so. Kafka itself will wait till zk is running.

jazzl0ver · 2018-10-22T17:15:41Z

What kind of logs do you need?

JuniusLuo · 2018-10-22T17:25:39Z

zk logs, to see if it showed some information about why connecting fails.

jazzl0ver · 2018-10-23T10:13:38Z

@JuniusLuo, please check #80 (comment) for zookeeper logs

JuniusLuo · 2018-10-26T00:42:25Z

there is only one log file. We need the log file for zk on 172.31.4.202

jazzl0ver · 2018-10-26T12:32:38Z

zoo1.log.gz
zoo2.log.gz

JuniusLuo · 2018-11-03T05:17:18Z

It looks like 4.202 is zoo2. zoo2 was not started at 10:10:29. The first log in zoo2.log.gz was at 10:11:14.
2018-10-11 10:11:14,536 [myid:] - INFO [main:QuorumPeerConfig@136] - Reading configuration from: /etc/zk/zoo.cfg
2018-10-11 10:11:14,560 [myid:] - INFO [main:QuorumPeer$QuorumServer@184] - Resolved hostname: zoo-prod-2.firecamp-prod-firecamp.com to address: zoo-prod-2.firecamp-prod-firecamp.com/172.31.4.202

Probably when system was not stable, zk instance kept restarting itself. Restarting the zk service helps to bring all instances up around the same time. This looks like a zk issue itself. Probably zk instance should just wait and retry.

jazzl0ver · 2018-11-06T12:06:14Z

Could you please make firecamp-manager aware of this issue, so Kafka will be restarted only after all ZK instances are up and running?

JuniusLuo · 2018-11-08T06:42:59Z

The manager service aims to be a common service. Monitoring the specific service healthy status is too specific to the service. Looks not the best fit to have the manager service to do this.

jazzl0ver · 2018-11-08T09:32:10Z

@JuniusLuo , in general, I agree with you. But this particular service - Kafka - does not work at all without Zookeeper. So, they're tied by design. Thinking that would be a good feature to handle this case in the manager.
Otherwise, we probably need to put that in docs, so other people won't be messed.

JuniusLuo · 2019-01-09T06:43:34Z

This is not an easy task. This is like requiring the full monitoring ability to ZooKeeper. For example, if ZooKeeper fails because of some bug/issue, Kafka will not work as well.

jazzl0ver · 2019-01-09T11:37:57Z

We just need to start Kafka containers after Zookeeper is got up. Do you really think it's hard to implement?

JuniusLuo · 2019-01-12T06:26:23Z

Currently we don't have the control for which service starts first. Firecamp is simple. The manage service is simply responsible for initializing the service and updating the service configs, such as creating volumes, etc. Then it is ECS's responsible to schedule the containers, and the manage service does not involve on the scheduling. Filecamp plugin will talk with DynamoDB to grab the volume and update network.

jazzl0ver · 2019-01-16T17:08:25Z

Understand. What do you think of adding this issue somewhere in the wiki? So people might be aware of such things

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zookeeper needs to be restarted after upgrading instance type #80

zookeeper needs to be restarted after upgrading instance type #80

jazzl0ver commented Oct 11, 2018 •

edited

Loading

JuniusLuo commented Oct 15, 2018

jazzl0ver commented Oct 15, 2018

JuniusLuo commented Oct 22, 2018

jazzl0ver commented Oct 22, 2018

JuniusLuo commented Oct 22, 2018

jazzl0ver commented Oct 22, 2018

JuniusLuo commented Oct 22, 2018

jazzl0ver commented Oct 23, 2018

JuniusLuo commented Oct 26, 2018

jazzl0ver commented Oct 26, 2018

JuniusLuo commented Nov 3, 2018

jazzl0ver commented Nov 6, 2018

JuniusLuo commented Nov 8, 2018

jazzl0ver commented Nov 8, 2018

JuniusLuo commented Jan 9, 2019

jazzl0ver commented Jan 9, 2019

JuniusLuo commented Jan 12, 2019

jazzl0ver commented Jan 16, 2019

zookeeper needs to be restarted after upgrading instance type #80

zookeeper needs to be restarted after upgrading instance type #80

Comments

jazzl0ver commented Oct 11, 2018 • edited Loading

JuniusLuo commented Oct 15, 2018

jazzl0ver commented Oct 15, 2018

JuniusLuo commented Oct 22, 2018

jazzl0ver commented Oct 22, 2018

JuniusLuo commented Oct 22, 2018

jazzl0ver commented Oct 22, 2018

JuniusLuo commented Oct 22, 2018

jazzl0ver commented Oct 23, 2018

JuniusLuo commented Oct 26, 2018

jazzl0ver commented Oct 26, 2018

JuniusLuo commented Nov 3, 2018

jazzl0ver commented Nov 6, 2018

JuniusLuo commented Nov 8, 2018

jazzl0ver commented Nov 8, 2018

JuniusLuo commented Jan 9, 2019

jazzl0ver commented Jan 9, 2019

JuniusLuo commented Jan 12, 2019

jazzl0ver commented Jan 16, 2019

jazzl0ver commented Oct 11, 2018 •

edited

Loading