
core: last node in 3-node cluster fails to shut down gracefully due to livelock #40834

Closed
cockroach-teamcity opened this issue Sep 17, 2019 · 7 comments

cockroach-teamcity commented Sep 17, 2019

Found by the CLI test test_multiple_nodes.tcl in #1493442:

The scenario is pretty simple:

  1. start n1
  2. add n2
  3. add n3
  4. run a query to confirm all 3 nodes are connected
  5. quit n2 - this succeeds and the log file confirms n2 has shut down
  6. quit n3 - this succeeds and the log file confirms n3 has shut down
  7. quit n1

At this point, the quit process on n1 livelocks and n1 fails to shut down. Log file here: cockroach.log
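For reference, a scripted form of those seven steps. This is a hedged sketch, not the Tcl test itself: it shells out to the `cockroach` binary, and the ports, store paths, and the `crdb_internal.gossip_nodes` connectivity check for step 4 are stand-ins I chose for this sketch.

```go
// Rough Go transliteration of the repro steps driven by
// test_multiple_nodes.tcl. Ports, store paths, and the connectivity
// query are illustrative, not the test's actual values.
package main

import (
	"fmt"
	"os/exec"
)

func run(args ...string) {
	out, err := exec.Command("cockroach", args...).CombinedOutput()
	fmt.Printf("cockroach %v: err=%v\n%s", args, err, out)
}

func main() {
	// Steps 1-3: start n1, then join n2 and n3 to it.
	run("start", "--insecure", "--store=/tmp/n1",
		"--listen-addr=localhost:26257", "--http-addr=localhost:8080", "--background")
	run("start", "--insecure", "--store=/tmp/n2",
		"--listen-addr=localhost:26258", "--http-addr=localhost:8081",
		"--join=localhost:26257", "--background")
	run("start", "--insecure", "--store=/tmp/n3",
		"--listen-addr=localhost:26259", "--http-addr=localhost:8082",
		"--join=localhost:26257", "--background")

	// Step 4: run a query to confirm all 3 nodes are connected.
	run("sql", "--insecure", "--host=localhost:26257",
		"-e", "SELECT node_id FROM crdb_internal.gossip_nodes")

	// Steps 5-7: quit n2, then n3, then n1. Per this issue, the
	// final quit livelocks instead of shutting n1 down.
	run("quit", "--insecure", "--host=localhost:26258")
	run("quit", "--insecure", "--host=localhost:26259")
	run("quit", "--insecure", "--host=localhost:26257")
}
```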

Relevant log lines from n1:

  • at step 4, just before n2 is asked to quit:
I190917 15:18:20.358811 110 storage/stores.go:259  [n1] wrote 2 node addresses to persistent storage
I190917 15:18:20.543398 1841 storage/replica_command.go:1580  [n1,replicate,s1,r2/1:/System/NodeLiveness{-Max}] change replicas (add [(n3,s3):2LEARNER] remove []): existing descriptor r2:/System/NodeLiveness{-Max} [(n1,s1):1, next=2, gen=0]
I190917 15:18:20.554046 1841 storage/replica_raft.go:291  [n1,s1,r2/1:/System/NodeLiveness{-Max}] proposing ADD_REPLICA[(n3,s3):2LEARNER]: after=[(n1,s1):1 (n3,s3):2LEARNER] next=3
I190917 15:18:20.560840 1841 storage/store_snapshot.go:995  [n1,replicate,s1,r2/1:/System/NodeLiveness{-Max}] sending LEARNER snapshot ed7d457c at applied index 23
I190917 15:18:20.560983 1841 storage/store_snapshot.go:1038  [n1,replicate,s1,r2/1:/System/NodeLiveness{-Max}] streamed snapshot to (n3,s3):2LEARNER: kv pairs: 10, log entries: 0, rate-limit: 8.0 MiB/sec, 0.00s
I190917 15:18:20.564621 1900 storage/raft_snapshot_queue.go:125  [n1,raftsnapshot,s1,r2/1:/System/NodeLiveness{-Max}] skipping snapshot; replica is likely a learner in the process of being added: (n3,s3):2LEARNER
I190917 15:18:20.593802 1841 storage/replica_command.go:1580  [n1,replicate,s1,r2/1:/System/NodeLiveness{-Max}] change replicas (add [(n3,s3):2] remove []): existing descriptor r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=1]
I190917 15:18:20.603547 1841 storage/replica_raft.go:291  [n1,s1,r2/1:/System/NodeLiveness{-Max}] proposing ADD_REPLICA[(n3,s3):2]: after=[(n1,s1):1 (n3,s3):2] next=3
I190917 15:18:20.763654 1886 storage/store.go:2613  [n1,replicaGC,s1,r2/1:/System/NodeLiveness{-Max}] removing replica r2/1
I190917 15:18:20.766736 1886 storage/replica_destroy.go:152  [n1,replicaGC,s1,r2/1:/System/NodeLiveness{-Max}] removed 10 (3+7) keys in 3ms [clear=0ms commit=3ms]
  • at step 5, when n2 leaves:
W190917 15:18:21.020789 1828 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Unavailable desc = transport is closing:
W190917 15:18:21.020840 1790 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:21.020854 1582 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:21.020875 1480 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Unavailable desc = transport is closing:
W190917 15:18:21.141698 1919 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
I190917 15:18:21.141888 1923 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to connect to n2 at ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
I190917 15:18:21.141923 1923 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257 [n2] event: BreakerTripped
I190917 15:18:21.141946 1923 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n2: failed to connect to n2 at ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
W190917 15:18:21.241802 1954 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
I190917 15:18:21.241955 198 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to connect to n2 at ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
I190917 15:18:21.241968 198 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257 [n2] event: BreakerTripped
I190917 15:18:21.241987 198 rpc/nodedialer/nodedialer.go:160  [n1] unable to connect to n2: failed to connect to n2 at ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
  • at step 6 when n3 leaves:
I190917 15:18:21.543864 1866 storage/replica_command.go:1580  [n1,replicate,s1,r6/1:/Table/{SystemCon…-11}] change replicas (add [(n3,s3):2LEARNER] remove []): existing descriptor r6:/Table/{SystemConfigSpan/Start-11} [(n1,s1):1, next=2, gen=0]
W190917 15:18:22.109660 1876 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:22.109719 1783 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:22.109739 1825 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: rpc error: code = Unavailable desc = transport is closing:
  • at step 7 when n1 is instructed to leave:
I190917 15:18:22.135193 1 cli/start.go:860  received signal 'terminated'
I190917 15:18:22.137128 1 cli/start.go:925  initiating graceful shutdown of server
W190917 15:18:22.141659 1869 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:22.141997 1919 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I190917 15:18:22.143602 1970 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to connect to n3 at ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
I190917 15:18:22.147242 1970 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190917 15:18:22.149341 1970 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n3: failed to connect to n3 at ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
W190917 15:18:22.146443 1925 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
I190917 15:18:22.153243 1856 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to connect to n3 at ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
I190917 15:18:22.154910 1856 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190917 15:18:22.156976 1856 rpc/nodedialer/nodedialer.go:160  [server drain process,n1,txn=61eb57a5] unable to connect to n3: failed to connect to n3 at ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
W190917 15:18:22.214587 1929 storage/intentresolver/intent_resolver.go:508  failed to push during intent resolution: failed to push id=b7f82d25 key=/Local/Range/Table/SystemConfigSpan/Start/RangeDescriptor rw=true pri=0.03822413 stat=PENDING epo=0 ts=1568733501.543484158,0 orig=0.000000000,0 min=1568733501.543484158,0 max=0.000000000,0 wto=false seq=7
W190917 15:18:22.242227 1954 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:22.341772 1973 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:22.419949 110 gossip/gossip.go:1517  [n1] no incoming or outgoing connections
W190917 15:18:22.423219 1991 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
I190917 15:18:22.425611 1988 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: gossip [::]:26257->ff5f9200c59e:26258 tripped: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
I190917 15:18:22.427705 1988 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: gossip [::]:26257->ff5f9200c59e:26258 event: BreakerTripped
W190917 15:18:22.429760 1988 gossip/client.go:118  [n1] failed to start gossip client to ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
W190917 15:18:22.641760 2000 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:23.143794 1869 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:23.153431 1925 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:23.344661 1973 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:23.424255 2023 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:23.425768 1991 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I190917 15:18:23.427116 2020 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: gossip [::]:26257->ff5f9200c59e:26259 tripped: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
I190917 15:18:23.430985 2020 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: gossip [::]:26257->ff5f9200c59e:26259 event: BreakerTripped
W190917 15:18:23.433033 2020 gossip/client.go:118  [n1] failed to start gossip client to ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
W190917 15:18:23.647012 2000 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:23.941760 2037 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:24.424810 2067 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:24.427360 2023 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:24.541821 2014 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:24.641776 2103 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:24.769245 2114 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:24.944635 2037 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:25.341824 2117 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:25.427558 2067 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:25.544280 2014 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:25.645857 2103 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:25.772193 2114 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:26.344751 2117 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:26.425449 2161 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:26.641770 2055 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:26.741879 2059 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:26.860424 2187 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:26.941537 2193 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
cockroach-teamcity added the C-test-failure and O-robot labels on Sep 17, 2019
cockroach-teamcity added this to the 19.2 milestone on Sep 17, 2019

knz commented Sep 17, 2019

Thanks. This is (1) a separate issue from the one before, and (2) apparently also a new/recent issue.

Investigating...

knz changed the title from "teamcity: failed test: TestDockerCLI" to "core: last node in 3-node cluster fails to shut down gracefully due to livelock" on Sep 17, 2019

knz commented Sep 17, 2019

I have updated the issue description with the analysis. This seems to be a new problem. @andreimatei, can you have a quick look and help me triage this? For liveness issues like this I'd usually ask Tobias, and you're next in line.

This seems to be a low-priority issue (and not a release blocker) because there's an easy production workaround: simply kill the last process.

However, the issue unfortunately does cause tests to flake: even though I could tweak the one test I used for the analysis to issue a kill instead of a graceful quit, I suspect other tests are failing non-deterministically because of this.

cc @andy-kimball, maybe you have an opinion on how to prioritize this.

knz added the C-bug, A-kv-gossip, A-kv-replication, and A-kv-server labels on Sep 17, 2019
knz added a commit to knz/cockroach that referenced this issue Sep 18, 2019
Release justification: deflakes a test and makes CI run faster

Prior to this patch the test would attempt to shut down the cluster
gracefully after asserting that the 3 nodes are properly joined.

Unfortunately this wait is running into separate issue
cockroachdb#40834 and this makes
the test clean-up flaky.

Since this unit test is not about quitting a cluster but merely
checking that the join is successful, this patch both works around the
related issue and accelerates the test by simply killing the nodes.

Release note: None
craig bot pushed a commit that referenced this issue Sep 18, 2019
40867: cli/interactive_tests: deflake and accelerate test_multiple_nodes r=knz a=knz

(same commit message as above)

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
andreimatei commented

This reproduces every time?
I don't see anything in the logs to tell me what's going on; for some reason, the draining process is not succeeding on n1. A stack dump would probably help. Is it possible that the client still has a transaction open against n1? That could hold up draining for a while.

FWIW, I believe that after waiting a minute the shutdown would proceed (or at least try to) because of this code:

return errors.Errorf("time limit reached, initiating hard shutdown%s", hardShutdownHint)
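That return is the escape hatch of a drain-then-deadline loop. A minimal sketch of the shape (the helper names, the retry cadence, and the deadline used in `main` are mine, not CockroachDB's actual code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errStillDraining = errors.New("drain incomplete")

// drainOnce stands in for one attempt to drain the server gracefully;
// in the livelock described above it would keep failing forever.
func drainOnce(ctx context.Context) error { return errStillDraining }

// drainWithDeadline retries the graceful drain until it succeeds or the
// time limit expires, at which point it gives up and reports that a hard
// shutdown should begin, mirroring the quoted cli/start.go line.
func drainWithDeadline(limit time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	for {
		if err := drainOnce(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("time limit reached, initiating hard shutdown")
		case <-time.After(50 * time.Millisecond):
			// Back off briefly, then retry the drain.
		}
	}
}

func main() {
	fmt.Println(drainWithDeadline(time.Second))
}
```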


knz commented Sep 19, 2019

Is it possible that the client still has a transaction opened against n1? That could hold up draining for a while.

No, but that makes me think of something else: is it possible for an internal client to hold up draining in this way?

andreimatei commented

I'm not completely sure, but I don't think an InternalExecutor can hold back the draining. I believe the way in which SQL connections hold up draining is at the pgwire.Server level, which accounts for all pgwire connections. The draining process signals all connections to "find a good time to close" (meaning, we'll close them in between transactions). InternalExecutors are not accounted for.
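A minimal sketch of that accounting, as I understand it (assumed structure, not the actual pgwire.Server code): drain flips a flag on every registered connection, which each session checks between transactions; an InternalExecutor never registers, so the drain never waits on it.

```go
// Hypothetical model of the connection accounting described above;
// types and method names are illustrative, not CockroachDB's.
package main

import (
	"fmt"
	"sync"
)

// conn models one pgwire connection. Its handler checks draining at
// transaction boundaries and closes the session when it is set.
type conn struct {
	mu       sync.Mutex
	draining bool
}

func (c *conn) shouldClose() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.draining
}

// server models the pgwire.Server bookkeeping: only registered
// connections participate in the drain.
type server struct {
	mu    sync.Mutex
	conns map[*conn]struct{}
}

func newServer() *server { return &server{conns: make(map[*conn]struct{})} }

func (s *server) register(c *conn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.conns[c] = struct{}{}
}

// setDraining asks every registered connection to find a good time to
// close, i.e. to exit at its next transaction boundary.
func (s *server) setDraining() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for c := range s.conns {
		c.mu.Lock()
		c.draining = true
		c.mu.Unlock()
	}
}

func main() {
	s := newServer()
	external := &conn{}
	s.register(external) // a pgwire session registers itself
	internal := &conn{}  // an InternalExecutor-style user: never registered

	s.setDraining()
	fmt.Println(external.shouldClose(), internal.shouldClose()) // true false
}
```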


knz commented Sep 19, 2019

This reproduces every time?

No it does not.

I don't think an InternalExecutor can hold back the draining.

Well then we must have something else to investigate. There is certainly no external client connected at that point.

tbg added the branch-master label on Jan 22, 2020

knz commented Apr 22, 2020

It has become clear what is happening here since #45149: there's no way for the last node to shut down gracefully, because its liveness record is unavailable at that point. The hard shutdown timeout should take care of this now.

knz closed this as completed on Apr 22, 2020