
Second ZK pod doesn't start due to 'java.lang.RuntimeException: My id 2 not in the peer list' exception #315

Closed
BassinD opened this issue Mar 29, 2021 · 31 comments · Fixed by #135
Labels: kind/bug (Something isn't working)

@BassinD

BassinD commented Mar 29, 2021

Description

We are using zookeeper v0.2.9.
Sometimes (not in all environments) the zookeeper-1 pod is unable to start due to a RuntimeException.
Pod's log:

kubectl logs deployment-zookeeper-1
+ source /conf/env.sh
++ DOMAIN=deployment-zookeeper-headless.objectstore-large.svc.cluster.local
++ QUORUM_PORT=2888
++ LEADER_PORT=3888
++ CLIENT_HOST=deployment-zookeeper-client
++ CLIENT_PORT=9277
++ CLUSTER_NAME=deployment-zookeeper
++ CLUSTER_SIZE=3
+ source /usr/local/bin/zookeeperFunctions.sh
++ set -ex
++ hostname -s
+ HOST=deployment-zookeeper-1
+ DATA_DIR=/data
+ MYID_FILE=/data/myid
+ LOG4J_CONF=/conf/log4j-quiet.properties
+ DYNCONFIG=/data/zoo.cfg.dynamic
+ STATIC_CONFIG=/data/conf/zoo.cfg
+ [[ deployment-zookeeper-1 =~ (.*)-([0-9]+)$ ]]
+ NAME=deployment-zookeeper
+ ORD=1
+ MYID=2
+ WRITE_CONFIGURATION=true
+ REGISTER_NODE=true
+ ONDISK_MYID_CONFIG=false
+ ONDISK_DYN_CONFIG=false
+ '[' -f /data/myid ']'
++ cat /data/myid
+ EXISTING_ID=2
+ [[ 2 == \2 ]]
+ [[ -f /data/conf/zoo.cfg ]]
+ ONDISK_MYID_CONFIG=true
+ '[' -f /data/zoo.cfg.dynamic ']'
+ ONDISK_DYN_CONFIG=true
+ set +e
+ [[ -n '' ]]
+ set -e
+ set +e
+ nslookup deployment-zookeeper-headless.objectstore-large.svc.cluster.local
Server:		10.96.0.10
Address:	10.96.0.10#53

Name:	deployment-zookeeper-headless.objectstore-large.svc.cluster.local
Address: 192.168.55.175

+ [[ 0 -eq 0 ]]
+ ACTIVE_ENSEMBLE=true
+ [[ true == true ]]
+ [[ true == true ]]
+ WRITE_CONFIGURATION=false
+ [[ true == false ]]
+ [[ true == false ]]
+ [[ true == false ]]
+ REGISTER_NODE=false
+ [[ false == true ]]
+ [[ false == true ]]
+ ZOOCFGDIR=/data/conf
+ export ZOOCFGDIR
+ echo Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
+ [[ ! -d /data/conf ]]
+ echo Copying the /conf/zoo.cfg contents except the dynamic config file during restart
Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ head -n -1 /conf/zoo.cfg
++ tail -n 1 /data/conf/zoo.cfg
+ echo -e '4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
dataDir=/data
standaloneEnabled=false
reconfigEnabled=true
skipACL=yes
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
initLimit=10
syncLimit=2
tickTime=2000
quorumListenOnAllIPs=false\ndynamicConfigFile=/data/zoo.cfg.dynamic'
+ cp -f /conf/log4j.properties /data/conf
+ cp -f /conf/log4j-quiet.properties /data/conf
+ cp -f /conf/env.sh /data/conf
+ '[' -f /data/zoo.cfg.dynamic ']'
Starting zookeeper service
+ echo Starting zookeeper service
+ zkServer.sh --config /data/conf start-foreground
ZooKeeper JMX enabled by default
Using config: /data/conf/zoo.cfg
2021-03-29 13:33:55,679 [myid:] - INFO  [main:QuorumPeerConfig@173] - Reading configuration from: /data/conf/zoo.cfg
2021-03-29 13:33:55,686 [myid:] - INFO  [main:QuorumPeerConfig@450] - clientPort is not set
2021-03-29 13:33:55,686 [myid:] - INFO  [main:QuorumPeerConfig@463] - secureClientPort is not set
2021-03-29 13:33:55,686 [myid:] - INFO  [main:QuorumPeerConfig@479] - observerMasterPort is not set
2021-03-29 13:33:55,744 [myid:] - INFO  [main:QuorumPeerConfig@496] - metricsProvider.className is org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
2021-03-29 13:33:55,753 [myid:] - WARN  [main:QuorumPeerConfig@727] - No server failure will be tolerated. You need at least 3 servers.
2021-03-29 13:33:55,758 [myid:2] - INFO  [main:DatadirCleanupManager@78] - autopurge.snapRetainCount set to 3
2021-03-29 13:33:55,758 [myid:2] - INFO  [main:DatadirCleanupManager@79] - autopurge.purgeInterval set to 0
2021-03-29 13:33:55,758 [myid:2] - INFO  [main:DatadirCleanupManager@101] - Purge task is not scheduled.
2021-03-29 13:33:55,762 [myid:2] - INFO  [main:ManagedUtil@44] - Log4j 1.2 jmx support found and enabled.
2021-03-29 13:33:55,771 [myid:2] - INFO  [main:QuorumPeerMain@151] - Starting quorum peer
2021-03-29 13:33:55,850 [myid:2] - INFO  [main:PrometheusMetricsProvider@74] - Initializing metrics, configuration: {exportJvmInfo=true, httpPort=7000}
2021-03-29 13:33:55,850 [myid:2] - INFO  [main:PrometheusMetricsProvider@82] - Starting /metrics HTTP endpoint at port 7000 exportJvmInfo: true
2021-03-29 13:33:55,895 [myid:2] - INFO  [main:Log@169] - Logging initialized @1300ms to org.eclipse.jetty.util.log.Slf4jLog
2021-03-29 13:33:56,180 [myid:2] - INFO  [main:Server@359] - jetty-9.4.24.v20191120; built: 2019-11-20T21:37:49.771Z; git: 363d5f2df3a8a28de40604320230664b9c793c16; jvm 11.0.8+10
2021-03-29 13:33:56,280 [myid:2] - INFO  [main:ContextHandler@825] - Started o.e.j.s.ServletContextHandler@69c81773{/,null,AVAILABLE}
2021-03-29 13:33:56,356 [myid:2] - INFO  [main:AbstractConnector@330] - Started ServerConnector@771a660{HTTP/1.1,[http/1.1]}{0.0.0.0:7000}
2021-03-29 13:33:56,356 [myid:2] - INFO  [main:Server@399] - Started @1767ms
2021-03-29 13:33:56,368 [myid:2] - INFO  [main:ServerMetrics@62] - ServerMetrics initialized with provider org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider@7d7758be
2021-03-29 13:33:56,449 [myid:2] - INFO  [main:QuorumPeer@752] - zookeeper.quorumCnxnTimeoutMs=-1
2021-03-29 13:33:56,463 [myid:2] - WARN  [main:ContextHandler@1520] - o.e.j.s.ServletContextHandler@a82c5f1{/,null,UNAVAILABLE} contextPath ends with /*
2021-03-29 13:33:56,463 [myid:2] - WARN  [main:ContextHandler@1531] - Empty contextPath
2021-03-29 13:33:56,467 [myid:2] - INFO  [main:X509Util@77] - Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation
2021-03-29 13:33:56,470 [myid:2] - INFO  [main:FileTxnSnapLog@124] - zookeeper.snapshot.trust.empty : false
2021-03-29 13:33:56,473 [myid:2] - INFO  [main:QuorumPeer@1680] - Local sessions disabled
2021-03-29 13:33:56,473 [myid:2] - INFO  [main:QuorumPeer@1691] - Local session upgrading disabled
2021-03-29 13:33:56,474 [myid:2] - INFO  [main:QuorumPeer@1658] - tickTime set to 2000
2021-03-29 13:33:56,474 [myid:2] - INFO  [main:QuorumPeer@1702] - minSessionTimeout set to 4000
2021-03-29 13:33:56,474 [myid:2] - INFO  [main:QuorumPeer@1713] - maxSessionTimeout set to 40000
2021-03-29 13:33:56,497 [myid:2] - INFO  [main:QuorumPeer@1738] - initLimit set to 10
2021-03-29 13:33:56,497 [myid:2] - INFO  [main:QuorumPeer@1920] - syncLimit set to 2
2021-03-29 13:33:56,497 [myid:2] - INFO  [main:QuorumPeer@1935] - connectToLearnerMasterLimit set to 0
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] - 
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -   ______                  _                                          
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -  |___  /                 | |                                         
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -     / /    ___     ___   | | __   ___    ___   _ __     ___   _ __   
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -    / /    / _ \   / _ \  | |/ /  / _ \  / _ \ | '_ \   / _ \ | '__|
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -   / /__  | (_) | | (_) | |   <  |  __/ |  __/ | |_) | |  __/ | |    
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -  /_____|  \___/   \___/  |_|\_\  \___|  \___| | .__/   \___| |_|
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -                                               | |                     
2021-03-29 13:33:56,553 [myid:2] - INFO  [main:ZookeeperBanner@42] -                                               |_|                     
2021-03-29 13:33:56,554 [myid:2] - INFO  [main:ZookeeperBanner@42] - 
2021-03-29 13:33:56,555 [myid:2] - INFO  [main:Environment@98] - Server environment:zookeeper.version=3.6.1--104dcb3e3fb464b30c5186d229e00af9f332524b, built on 04/21/2020 15:01 GMT
2021-03-29 13:33:56,555 [myid:2] - INFO  [main:Environment@98] - Server environment:host.name=deployment-zookeeper-1.deployment-zookeeper-headless.objectstore-large.svc.cluster.local
2021-03-29 13:33:56,555 [myid:2] - INFO  [main:Environment@98] - Server environment:java.version=11.0.8
2021-03-29 13:33:56,555 [myid:2] - INFO  [main:Environment@98] - Server environment:java.vendor=N/A
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.home=/usr/local/openjdk-11
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.class.path=/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/target/classes:/apache-zookeeper-3.6.1-bin/bin/../build/classes:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/target/lib/*.jar:/apache-zookeeper-3.6.1-bin/bin/../build/lib/*.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-prometheus-metrics-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-jute-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/zookeeper-3.6.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/snappy-java-1.1.7.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/slf4j-log4j12-1.7.25.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/slf4j-api-1.7.25.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_servlet-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_hotspot-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient_common-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/simpleclient-0.6.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-native-unix-common-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-native-epoll-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-transport-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-resolver-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-handler-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-common-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-codec-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/netty-buffer-4.1.48.Final.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/metrics-core-3.2.5.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/log4j-1.2.17.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/json-simple-1.1.1.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jline-2.11.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-util-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-servlet-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-server-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-security-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-io-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jetty-http-9.4.24.v20191120.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/javax.servlet-api-3.1.0.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-databind-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-core-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/jackson-annotations-2.10.3.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/commons-lang-2.6.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/commons-cli-1.2.jar:/apache-zookeeper-3.6.1-bin/bin/../lib/audience-annotations-0.5.0.jar:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-*.jar:/apache-zookeeper-3.6.1-bin/bin/../zookeeper-server/src/main/resources/lib/*.jar:/data/conf:
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.library.path=/usr/java/packages/lib:/usr/lib64:/lib64:/lib:/usr/lib
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.io.tmpdir=/tmp
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:java.compiler=<NA>
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:os.name=Linux
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:os.arch=amd64
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:os.version=4.15.0-76-generic
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:user.name=root
2021-03-29 13:33:56,556 [myid:2] - INFO  [main:Environment@98] - Server environment:user.home=/root
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:Environment@98] - Server environment:user.dir=/apache-zookeeper-3.6.1-bin
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.free=7MB
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.max=966MB
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:Environment@98] - Server environment:os.memory.total=14MB
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@128] - zookeeper.enableEagerACLCheck = false
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@132] - zookeeper.skipACL=="yes", ACL checks will be skipped
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@136] - zookeeper.digest.enabled = true
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@140] - zookeeper.closeSessionTxn.enabled = true
2021-03-29 13:33:56,557 [myid:2] - INFO  [main:ZooKeeperServer@1434] - zookeeper.flushDelay=0
2021-03-29 13:33:56,558 [myid:2] - INFO  [main:ZooKeeperServer@1443] - zookeeper.maxWriteQueuePollTime=0
2021-03-29 13:33:56,558 [myid:2] - INFO  [main:ZooKeeperServer@1452] - zookeeper.maxBatchSize=1000
2021-03-29 13:33:56,558 [myid:2] - INFO  [main:ZooKeeperServer@241] - zookeeper.intBufferStartingSizeBytes = 1024
2021-03-29 13:33:56,563 [myid:2] - INFO  [main:WatchManagerFactory@42] - Using org.apache.zookeeper.server.watch.WatchManager as watch manager
2021-03-29 13:33:56,563 [myid:2] - INFO  [main:WatchManagerFactory@42] - Using org.apache.zookeeper.server.watch.WatchManager as watch manager
2021-03-29 13:33:56,565 [myid:2] - INFO  [main:ZKDatabase@132] - zookeeper.snapshotSizeFactor = 0.33
2021-03-29 13:33:56,565 [myid:2] - INFO  [main:ZKDatabase@152] - zookeeper.commitLogCount=500
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@2001] - Using insecure (non-TLS) quorum communication
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@2007] - Port unification disabled
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@174] - multiAddress.enabled set to false
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@199] - multiAddress.reachabilityCheckEnabled set to true
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@186] - multiAddress.reachabilityCheckTimeoutMs set to 1000
2021-03-29 13:33:56,578 [myid:2] - INFO  [main:QuorumPeer@2461] - QuorumPeer communication is not secured! (SASL auth disabled)
2021-03-29 13:33:56,579 [myid:2] - INFO  [main:QuorumPeer@2486] - quorum.cnxn.threads.size set to 20
2021-03-29 13:33:56,585 [myid:2] - INFO  [main:AbstractConnector@380] - Stopped ServerConnector@771a660{HTTP/1.1,[http/1.1]}{0.0.0.0:7000}
2021-03-29 13:33:56,587 [myid:2] - INFO  [main:ContextHandler@1016] - Stopped o.e.j.s.ServletContextHandler@69c81773{/,null,UNAVAILABLE}
2021-03-29 13:33:56,645 [myid:2] - ERROR [main:QuorumPeerMain@113] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
	at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1073)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:227)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:136)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:90)
2021-03-29 13:33:56,647 [myid:2] - INFO  [main:ZKAuditProvider@42] - ZooKeeper audit is disabled.
2021-03-29 13:33:56,649 [myid:2] - ERROR [main:ServiceUtils@42] - Exiting JVM with code 1
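
For reference, the trace above shows how the start script derives the server id from the StatefulSet ordinal (ordinal 1 becomes myid 2), which is the id the exception complains about. A minimal bash sketch of that derivation, following the variables in the trace (not the full zookeeperStart.sh):

HOST=$(hostname -s)                # e.g. deployment-zookeeper-1
if [[ $HOST =~ (.*)-([0-9]+)$ ]]; then
  NAME=${BASH_REMATCH[1]}          # deployment-zookeeper
  ORD=${BASH_REMATCH[2]}           # 1
fi
MYID=$((ORD + 1))                  # ordinal 1 -> myid 2; startup fails if this
                                   # id is absent from the dynamic peer list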

Zookeeper cluster CRD description

kubectl describe zookeepercluster deployment-zookeeper
Name:         deployment-zookeeper
Namespace:    objectstore-large
Labels:       app=zookeeper
              component=zk
              release=deployment
Annotations:  <none>
API Version:  zookeeper.pravega.io/v1beta1
Kind:         ZookeeperCluster
Metadata:
  Creation Timestamp:  2021-03-29T13:31:16Z
  Finalizers:
    cleanUpZookeeperPVC
  Generation:  3
    Manager:      objectscale-operator
    Operation:    Update
    Time:         2021-03-29T13:31:17Z
    API Version:  zookeeper.pravega.io/v1beta1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:finalizers:
      f:spec:
        f:ports:
      f:status:
        .:
        f:conditions:
        f:externalClientEndpoint:
        f:internalClientEndpoint:
        f:members:
          .:
          f:ready:
          f:unready:
        f:readyReplicas:
        f:replicas:
    Manager:    zookeeper-operator
    Operation:  Update
    Time:       2021-03-29T14:50:03Z
  Resource Version:        10195029
  Self Link:               /apis/zookeeper.pravega.io/v1beta1/namespaces/objectstore-large/zookeeperclusters/deployment-zookeeper
  UID:                     903303a2-80e1-47bf-94a4-a4cf8cae3f3f
Spec:
  Admin Server Service:
  Client Service:
  Config:
    Auto Purge Purge Interval:     1
    Auto Purge Snap Retain Count:  3
    Commit Log Count:              500
    Global Outstanding Limit:      1000
    Init Limit:                    10
    Max Client Cnxns:              60
    Max Session Timeout:           40000
    Min Session Timeout:           4000
    Pre Alloc Size:                65536
    Snap Count:                    10000
    Snap Size Limit In Kb:         4194304
    Sync Limit:                    2
    Tick Time:                     2000
  Headless Service:
  Image:
    Pull Policy:  IfNotPresent
    Repository:   pravega/zookeeper
    Tag:          0.2.9
  Labels:
    App:        zookeeper
    Component:  zk
    Release:    deployment
  Persistence:
    Reclaim Policy:  Delete
    Spec:
      Access Modes:
        ReadWriteOnce
      Resources:
        Requests:
          Storage:         20Gi
      Storage Class Name:  csi-baremetal-sc-hddlvg
  Pod:
    Affinity:
      Pod Anti Affinity:
        Preferred During Scheduling Ignored During Execution:
          Pod Affinity Term:
            Label Selector:
              Match Expressions:
                Key:       app
                Operator:  In
                Values:
                  deployment-zookeeper
            Topology Key:  kubernetes.io/hostname
          Weight:          20
    Labels:
      App:      deployment-zookeeper
      Release:  deployment-zookeeper
    Resources:
      Limits:
        Cpu:     1
        Memory:  256M
      Requests:
        Cpu:                           500m
        Memory:                        128M
    Service Account Name:              default
    Termination Grace Period Seconds:  30
  Ports:
    Container Port:  9277
    Name:            client
    Container Port:  2888
    Name:            quorum
    Container Port:  3888
    Name:            leader-election
    Container Port:  7000
    Name:            metrics
    Container Port:  8080
    Name:            admin-server
  Probes:            <nil>
  Replicas:          3
  Storage Type:      persistence
Status:
  Conditions:
    Status:                  False
    Type:                    Upgrading
    Status:                  False
    Type:                    PodsReady
    Status:                  False
    Type:                    Error
  External Client Endpoint:  N/A
  Internal Client Endpoint:  10.102.68.149:9277
  Members:
    Ready:
      deployment-zookeeper-0
    Unready:
      deployment-zookeeper-1
  Ready Replicas:  1
  Replicas:        2
Events:            <none>

Previously we used ZK v0.2.7 and did not have this issue.
I also tried the fix described in issue #259, but it didn't help.

Importance

Blocker issue. We need some fixes related to the 0.2.9 version (#257), so the upgrade is required.

Location

(Where is the piece of code, package, or document affected by this issue?)

Suggestions for an improvement

(How do you suggest to fix or proceed with this issue?)

@priyavj08

@BassinD I have been hit by this issue for quite some time, with no resolution so far. This is a critical issue for me as well. It happens intermittently, even in our production environment.

From what I understand, this failure can happen when the promotion of a node from observer to participant fails in zookeeperReady.sh, which is run as part of the readiness probe. Sometimes I get the error below:

++ echo ruok
++ nc 127.0.0.1 2181
(UNKNOWN) [127.0.0.1] 2181 (?) : Connection refused

This could mean the ZK server is not ready to accept connections. I was wondering about adding an initial delay to the readiness probe, so that the "ruok" requests are sent only after a delay (by which time the ZK server will hopefully be started and running); see the sketch below.
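
For context, the check that fails here is essentially ZooKeeper's four-letter-word ping; a minimal sketch of such a probe script (an assumption — the real zookeeperReady.sh wraps more logic around this):

# send "ruok" to the local client port; a healthy server answers "imok"
OK=$(echo ruok | nc 127.0.0.1 2181)
if [[ "$OK" == "imok" ]]; then
  exit 0   # ready
else
  exit 1   # not ready yet (e.g. connection refused, as above)
fi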

I had requested a fix to make these probes configurable in #275. Any idea when the next release of the ZK operator will be, so we can get these fixes?
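
For illustration, once that lands, the override could look roughly like this on the ZookeeperCluster resource (a sketch only; the probes field layout is an assumption based on #275, not a confirmed API):

apiVersion: zookeeper.pravega.io/v1beta1
kind: ZookeeperCluster
metadata:
  name: deployment-zookeeper
spec:
  replicas: 3
  probes:                       # hypothetical field, per the request in #275
    readinessProbe:
      initialDelaySeconds: 60   # send the first ruok only after 60s
      failureThreshold: 3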

@priyavj08

hi, when will this fix be available?

thanks

@anishakj
Contributor

@priyavj08 The fix is available in master and will be part of the next release. If you want to use it now, you can git clone the code from master, build with make build-zk-image, and use that image for testing.

@priyavj08

thanks @anishakj

any ETA on next release so I can plan?

thanks

@anishakj
Contributor

@priyavj08 We are planning to do a release sometime this week.

@priyavj08

Thanks for your continued support @anishakj @amuraru

I pulled the latest code from master, built the ZK image, and ran continuous install/uninstall tests. It was OK for about 20-odd iterations, then failed at the 25th. This issue still exists; unfortunately it happens most of the time in production (Murphy's law coming into play).

When I was able to exec into the failing pod's container, I found:

cat /data/myid
2

cat /data/conf/zoo.cfg
4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
dataDir=/data
standaloneEnabled=false
reconfigEnabled=true
skipACL=yes
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
initLimit=10
syncLimit=2
tickTime=2000
globalOutstandingLimit=1000
preAllocSize=65536
snapCount=10000
commitLogCount=500
snapSizeLimitInKb=4194304
maxCnxns=0
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
quorumListenOnAllIPs=false
admin.serverPort=8080
dynamicConfigFile=/data/conf/zoo.cfg.dynamic.1000000b4

cat /data/conf/zoo.cfg.dynamic.1000000b4
server.1=fed-kafka-affirmedzk-0.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local:2888:3888:participant;0.0.0.0:21
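
For comparison, a fully registered three-member ensemble would carry one server.N entry per member in that dynamic file, along these lines (hypothetical entries, following the naming pattern above):

server.1=fed-kafka-affirmedzk-0.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local:2888:3888:participant;0.0.0.0:2181
server.2=fed-kafka-affirmedzk-1.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local:2888:3888:participant;0.0.0.0:2181
server.3=fed-kafka-affirmedzk-2.fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local:2888:3888:participant;0.0.0.0:2181

With only server.1 present, a peer starting as myid 2 fails QuorumPeer's "My id 2 not in the peer list" check.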

This seems to be some timing issue; please make this part of the code in zookeeperStart.sh more reliable.

thanks

@anishakj
Contributor

@priyavj08 Please provide the logs from zookeeper-1 before it performs a restart.

@pbelgundi
Contributor

Reopening the issue since the problem is still not fixed.

@pbelgundi pbelgundi reopened this Apr 20, 2021
@priyavj08

I am trying to reproduce this to get the first log from the zk-1 pod when it crashes, but in the meantime here is the log from the ZK-1 pod:
ZK1-log.txt

Also, here is the describe output for pod ZK-0; note the connection refused errors from the probes:

kc describe pod -n fed-kafka fed-kafka-affirmedzk-0
Name: fed-kafka-affirmedzk-0
Namespace: fed-kafka
Priority: 0
Node: priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa/10.163.66.203
Start Time: Tue, 20 Apr 2021 06:16:05 +0000
Labels: app=fed-kafka-affirmedzk
controller-revision-hash=fed-kafka-affirmedzk-545b7fbd4
kind=ZookeeperMember
release=fed-kafka-affirmedzk
statefulset.kubernetes.io/pod-name=fed-kafka-affirmedzk-0
Annotations: cni.projectcalico.org/podIP: 192.168.173.46/32
cni.projectcalico.org/podIPs: 192.168.173.46/32,fde6:7f0d:5c6c:ad36:56d7:55a8:2f31:c652/128
kubernetes.io/psp: permissive-network
Status: Running
IP: 192.168.173.46
IPs:
IP: 192.168.173.46
Controlled By: StatefulSet/fed-kafka-affirmedzk
Containers:
zookeeper:
Container ID: docker://a9b7dc72376f2849148f313ad4d34d55bc51c9dce6ec9d66e6c8ec62268f53b2
Image: cnreg-dev:5000/priya_vijaya/affirmed/zookeeper:0.2.12
Image ID: docker-pullable://cnreg-dev:5000/priya_vijaya/affirmed/zookeeper@sha256:2207ced4485ad175e6dc1ece88d44a43238db77b20e6fee543decd4d29f500e6
Ports: 2181/TCP, 2888/TCP, 3888/TCP, 7000/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
/usr/local/bin/zookeeperStart.sh
State: Running
Started: Tue, 20 Apr 2021 06:16:06 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 1536Mi
Requests:
cpu: 10m
memory: 1Gi
Liveness: exec [zookeeperLive.sh] delay=10s timeout=10s period=10s #success=1 #failure=3
Readiness: exec [zookeeperReady.sh] delay=10s timeout=10s period=10s #success=1 #failure=3
Environment:
ENVOY_SIDECAR_STATUS: (v1:metadata.annotations['sidecar.istio.io/status'])
Mounts:
/conf from conf (rw)
/data from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from zookeeper-sa-token-lfl4l (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-fed-kafka-affirmedzk-0
ReadOnly: false
conf:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: fed-kafka-affirmedzk-configmap
Optional: false
zookeeper-sa-token-lfl4l:
Type: Secret (a volume populated by a Secret)
SecretName: zookeeper-sa-token-lfl4l
Optional: false
QoS Class: Burstable
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message


Normal Scheduled 5m44s default-scheduler Successfully assigned fed-kafka/fed-kafka-affirmedzk-0 to priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa
Normal Pulled 5m43s kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa Container image "cnreg-dev:5000/priya_vijaya/affirmed/zookeeper:0.2.12" already present on machine
Normal Created 5m43s kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa Created container zookeeper
Normal Started 5m43s kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa Started container zookeeper
Warning Unhealthy 5m31s kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa Liveness probe failed: + source /conf/env.sh
++ DOMAIN=fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
++ QUORUM_PORT=2888
++ LEADER_PORT=3888
++ CLIENT_HOST=fed-kafka-affirmedzk-client
++ CLIENT_PORT=2181
++ ADMIN_SERVER_HOST=fed-kafka-affirmedzk-admin-server
++ ADMIN_SERVER_PORT=8080
++ CLUSTER_NAME=fed-kafka-affirmedzk
++ CLUSTER_SIZE=3
++ nc 127.0.0.1 2181
++ echo ruok
(UNKNOWN) [127.0.0.1] 2181 (?) : Connection refused

+ OK=
Warning Unhealthy 5m29s kubelet, priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa Readiness probe failed: + source /conf/env.sh
++ DOMAIN=fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
++ QUORUM_PORT=2888
++ LEADER_PORT=3888
++ CLIENT_HOST=fed-kafka-affirmedzk-client
++ CLIENT_PORT=2181
++ ADMIN_SERVER_HOST=fed-kafka-affirmedzk-admin-server
++ ADMIN_SERVER_PORT=8080
++ CLUSTER_NAME=fed-kafka-affirmedzk
++ CLUSTER_SIZE=3
+ source /usr/local/bin/zookeeperFunctions.sh
++ set -ex
++ hostname -s
+ HOST=fed-kafka-affirmedzk-0
+ DATA_DIR=/data
+ MYID_FILE=/data/myid
+ LOG4J_CONF=/conf/log4j-quiet.properties
+ STATIC_CONFIG=/data/conf/zoo.cfg
++ nc 127.0.0.1 2181
++ echo ruok
(UNKNOWN) [127.0.0.1] 2181 (?) : Connection refused
+ OK=

@priyavj08

priyavj08 commented Apr 20, 2021

In my recent tests of repeated install/uninstall, it got into a bad state after the 18th iteration.

For some reason ZK-1 failed to join the ensemble, and ZK-2 is in a CrashLoopBackOff state (but without the "my id 2 not in the peer list" error).

Attaching all the logs:

failure-zk0.log
failure-zk1.log
failure-zk2.log

Attaching the pod describe output:

desc-output.log

@anishakj
Contributor

@priyavj08, could you please confirm that the base zookeeper image version used is 3.6.1?

@anishakj
Contributor

@priyavj08 From failure-zk1.log, I can see a connectivity issue between the second node and the first one:

2021-04-20 10:30:23,206 [myid:2] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1, my id = 2
java.io.EOFException
	at java.base/java.io.DataInputStream.readInt(Unknown Source)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383)
2021-04-20 10:30:23,208 [myid:2] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 2
2021-04-20 10:30:23,209 [myid:2] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@1281] - Interrupted while waiting for message on queue
java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
	at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
	at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1446)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:98)
	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1270)
2021-04-20 10:30:23,210 [myid:2] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@1293] - Send worker leaving thread id 1 my id = 2

@priyavj08

Ran this inside the ZK pod; I am using build 3.6.1:

Zookeeper version: 3.6.1--104dcb3e3fb464b30c5186d229e00af9f332524b, built on 04/21/2020 15:01 GMT

@priyavj08

With the latest ZK image and initialDelaySeconds set to "60" for the readiness probe, I was able to run install/uninstall tests continuously, trying 2 sets of 30 iterations. I haven't seen pod 1/pod 2 crashing with the error "My id 2 not in the peer list".

Question: will setting 60 seconds as the initial delay for the readiness probe have any impact on the functionality of the ZK ensemble?

Another concern: during the first set of tests I saw the other issue, where the ZK server in pod 0 wasn't running and this caused an issue in ZK pod 1:
2021-04-20 11:34:50,819 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@373] - Close of session 0x0
java.io.IOException: ZooKeeperServer not running
	at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
	at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
	at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)

@anishakj
Contributor

anishakj commented Apr 21, 2021

In a zookeeper cluster deployment, the second pod starts only after the first pod has started and is running. Is that not happening in your case?

Also, initialDelaySeconds only ensures that the probes start after that time. Other than that it won't have any effect.

@priyavj08

It looks like the ZK-0 pod was in the Running state even though the ZK server wasn't running, and the logs show that error; similarly, the ZK-1 pod was in the Running state with errors. You can see the pods' status in the desc-output I attached.

@anishakj
Contributor

Are you seeing these errors with the increased initial delay?

@priyavj08

@anishakj Yes, it happened in the set of tests I did after adding the initial delay, but it is also not always seen.

@anishakj
Contributor

By default initialDelaySeconds is 10. Could you please set it to 30 and give it a try?

@priyavj08

priyavj08 commented Apr 22, 2021

The ZK-1 pod got into CrashLoopBackOff a couple of times but eventually worked; here is the log. This is the new fix, right?

+ source /conf/env.sh
++ DOMAIN=fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
++ QUORUM_PORT=2888
++ LEADER_PORT=3888
++ CLIENT_HOST=fed-kafka-affirmedzk-client
++ CLIENT_PORT=2181
++ ADMIN_SERVER_HOST=fed-kafka-affirmedzk-admin-server
++ ADMIN_SERVER_PORT=8080
++ CLUSTER_NAME=fed-kafka-affirmedzk
++ CLUSTER_SIZE=3
+ source /usr/local/bin/zookeeperFunctions.sh
++ set -ex
++ hostname -s
+ HOST=fed-kafka-affirmedzk-1
+ DATA_DIR=/data
+ MYID_FILE=/data/myid
+ LOG4J_CONF=/conf/log4j-quiet.properties
+ DYNCONFIG=/data/zoo.cfg.dynamic
+ STATIC_CONFIG=/data/conf/zoo.cfg
+ [[ fed-kafka-affirmedzk-1 =~ (.*)-([0-9]+)$ ]]
+ NAME=fed-kafka-affirmedzk
+ ORD=1
+ MYID=2
+ WRITE_CONFIGURATION=true
+ REGISTER_NODE=true
+ ONDISK_MYID_CONFIG=false
+ ONDISK_DYN_CONFIG=false
+ '[' -f /data/myid ']'
++ cat /data/myid
+ EXISTING_ID=2
+ [[ 2 == \2 ]]
+ [[ -f /data/conf/zoo.cfg ]]
+ ONDISK_MYID_CONFIG=true
+ '[' -f /data/zoo.cfg.dynamic ']'
+ set +e
+ [[ -n '' ]]
+ set -e
+ set +e
+ nslookup fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
+ [[ 1 -eq 0 ]]
+ nslookup fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local
+ grep -q 'server can'''t find fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local'
Server: 10.96.0.10
Address: 10.96.0.10#53

** server can't find fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local: NXDOMAIN

+ echo 'there is no active ensemble'
+ ACTIVE_ENSEMBLE=false
+ [[ true == true ]]
+ [[ false == true ]]
+ WRITE_CONFIGURATION=true
+ [[ false == false ]]
+ REGISTER_NODE=false
+ [[ true == true ]]
+ echo 'Writing myid: 2 to: /data/myid.'
+ echo 2
+ [[ 2 -eq 1 ]]
+ [[ false == true ]]
+ ZOOCFGDIR=/data/conf
+ export ZOOCFGDIR
+ echo Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
+ [[ ! -d /data/conf ]]
+ echo Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ head -n -1 /conf/zoo.cfg
there is no active ensemble
Writing myid: 2 to: /data/myid.
Copying /conf contents to writable directory, to support Zookeeper dynamic reconfiguration
Copying the /conf/zoo.cfg contents except the dynamic config file during restart
++ tail -n 1 /data/conf/zoo.cfg
+ echo -e '4lw.commands.whitelist=cons, envi, conf, crst, srvr, stat, mntr, ruok
dataDir=/data
standaloneEnabled=false
reconfigEnabled=true
skipACL=yes
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
metricsProvider.exportJvmInfo=true
initLimit=10
syncLimit=2
tickTime=2000
globalOutstandingLimit=1000
preAllocSize=65536
snapCount=10000
commitLogCount=500
snapSizeLimitInKb=4194304
maxCnxns=0
maxClientCnxns=60
minSessionTimeout=4000
maxSessionTimeout=40000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
quorumListenOnAllIPs=false
admin.serverPort=8080\ndynamicConfigFile=/data/zoo.cfg.dynamic'
+ cp -f /conf/log4j.properties /data/conf
+ cp -f /conf/log4j-quiet.properties /data/conf
+ cp -f /conf/env.sh /data/conf
+ '[' -f /data/zoo.cfg.dynamic ']'
+ echo 'Node failed to register!'
Node failed to register!
+ exit 1

fed-kafka-affirmedzk-0 1/1 Running 0 122m
fed-kafka-affirmedzk-1 1/1 Running 2 121m
fed-kafka-affirmedzk-2 1/1 Running 0 117m
fed-kafka-affirmedzk-5bd966c46d-5j6hh 1/1 Running 0 122m

@anishakj
Contributor

Yes, with this fix the pod will become ready on restart. If it is working for you after a couple of pod restarts, could you please close the issue?

@priyavj08

So far it is working fine. Please close this bug.

@anishakj
Contributor

@priyavj08 Thanks for the confirmation. Please feel free to reopen if you see the same issue again.

@iampranabroy

iampranabroy commented Jul 15, 2022

Did anyone try setting initialDelaySeconds when using the zookeeper helm chart?
I don't see that option in the pravega/zookeeper-operator chart v0.2.12 or later.
@priyavj08 @anishakj

@alexangas

@iampranabroy Does this work for you:

initialDelaySeconds: 10

@mmoscher

mmoscher commented Sep 26, 2022

I'm facing the same issue. @anishakj Even with a long initialDelaySeconds (20, 30, 60) for the liveness and readiness probes, I'm unable to successfully deploy a zookeeper cluster with replicas > 1. Any further hints on what could go wrong?

The issue stays the same:
2022-09-26 15:14:06,975 [myid:2] - ERROR [main:QuorumPeerMain@114] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: My id 2 not in the peer list
	at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:1128)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:229)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:137)
	at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:91)
The second replica tries to start and ends up in a CrashLoopBackOff.

The Zookeeper-Operator is deployed indirectly via the Solr-Operator, version 0.2.14.

@iampranabroy

Hey @mmoscher - Can you please try setting only initialDelaySeconds to 60 for the readiness probe (probes.readiness.initialDelaySeconds), as suggested above, and see if it works?

@priyavj08 @anishakj - Did you try any additional changes?
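
For reference, such an override might look like this in the chart values (a sketch; the probes.readiness path is taken from the comment above and may differ between chart versions):

# values.yaml
probes:
  readiness:
    initialDelaySeconds: 60   # delay the first ruok check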

@mmoscher

mmoscher commented Sep 27, 2022

Issue solved - at least for me, I think.

There were two distinct issues (on my side) that kept me running into the error mentioned above.
Reading this thread led me in a (somewhat) wrong direction, so I was probing/testing the wrong solution.

Issue 1 - wrong network policies
My deployment blocks all egress/ingress traffic via network policies unless it is explicitly allowed.
For zookeeper-internal traffic, i.e. zookeeper <-> zookeeper, I had one policy in place allowing only port 2181 (client connections).
However, for leader election this is not enough. After opening ports 2888 (quorum), 3888 (leader election), 7000 (metrics), and 8080 (admin), a fresh (new, unique) zookeeper cluster was able to bootstrap in the same namespace 🤦
However, my solr-cloud-related zookeeper cluster kept failing, even after deleting the zookeeper cluster resource and having the solr-operator recreate it automatically.

Issue 2 - old state/wrong data on the PVCs
My zookeeper and solr keep their PVCs after deletion and reuse them when redeployed (due to dataStorage.persistent.reclaimPolicy: "Retain"). After dropping all related PVCs and bootstrapping a fresh solrcloud cluster, the zookeeper cluster bootstraps successfully 🤔. This also explains why I was able to successfully deploy a zookeeper cluster in my local dev environment, as we do not use persistent storage during development.
It seems that zookeeper stores some data about its id/state on disk, which can lead to later failures if it is corrupt and not cleaned up correctly. Unfortunately, I did not inspect the volumes and their data.
Maybe @priyavj08 and/or @anishakj have a clue/guess on this.

TL;DR: after setting correct network policies and cleaning up old (corrupt) data, i.e. all zookeeper-related PVCs, I was able to successfully bootstrap a zookeeper cluster (4 times in a row, same namespace and same k8s cluster). No need to adjust the liveness or readiness probes.
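
As a concrete cleanup sketch (hypothetical PVC names, following the data-<cluster-name>-<ordinal> pattern visible in the describe output earlier in this thread):

# drop stale zookeeper state so each member re-registers cleanly on redeploy
kubectl delete pvc -n fed-kafka \
  data-fed-kafka-affirmedzk-0 \
  data-fed-kafka-affirmedzk-1 \
  data-fed-kafka-affirmedzk-2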

@iampranabroy hopefully these tips help you!

@iampranabroy

Thanks much @mmoscher
Is it possible for you to share the updated network policy that you applied for zookeeper <-> zookeeper internal traffic?

@mmoscher

mmoscher commented Oct 25, 2022

@iampranabroy sure, here we go:

We have two policies in place related to solr/zookeeper: one (a) to allow traffic among the zookeeper members themselves (z<->z) and another (b) to allow traffic from solr to the zookeeper (s->z) pods.
Note: we block all egress traffic from all pods by default, following the "default-deny-all-egress-traffic" principle. If you're doing it the other way around, e.g. blocking ingress traffic, you need to change the policies accordingly. Furthermore, the solr instances accessing your zookeeper pods need to have the custom label allow-zookeeper-access: "true" set.

a)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-zookeeper-access-zookeeper
spec:
  egress:
  - ports:
    - port: 2181
      protocol: TCP
    - port: 2888
      protocol: TCP
    - port: 3888
      protocol: TCP
    - port: 7000
      protocol: TCP
    - port: 8080
      protocol: TCP
    to:
    - podSelector:
        matchLabels:
          kind: ZookeeperMember
          technology: zookeeper
  podSelector:
    matchLabels:
      kind: ZookeeperMember
      technology: zookeeper
  policyTypes:
  - Egress

b)

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-solr-access-zookeeper
spec:
  egress:
  - ports:
    - port: 2181
      protocol: TCP
    - port: 7000
      protocol: TCP
    - port: 8080
      protocol: TCP
    to:
    - podSelector:
        matchLabels:
          kind: ZookeeperMember
          technology: zookeeper
  podSelector:
    matchLabels:
      allow-zookeeper-access: "true"
  policyTypes:
  - Egress

Good luck 🤞

@iampranabroy

Thanks, @mmoscher, for sharing the details. A good point to keep in mind about NetworkPolicies for zookeeper and solr clusters' internal communication.

In my case, the zookeeper and solr clusters are in the same namespace and they sometimes come up fine, but I have noticed this error a few times. The error itself is not consistent, which makes it hard to debug.
