
core: last node in 3-node cluster fails to shut down gracefully due to livelock #40834

Closed
cockroach-teamcity opened this issue Sep 17, 2019 · 7 comments

cockroach-teamcity commented Sep 17, 2019

Found by the CLI test test_multiple_nodes.tcl in #1493442:

The scenario is pretty simple:

  1. start n1
  2. add n2
  3. add n3
  4. run a query to confirm all 3 nodes are connected
  5. quit n2 - this succeeds and the log file confirms n2 has shut down
  6. quit n3 - this succeeds and the log file confirms n3 has shut down
  7. quit n1

At this point, the quit process on n1 livelocks and n1 fails to shut down. Log file here: cockroach.log
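For reference, a scripted form of those seven steps. This is a hedged sketch, not the Tcl test itself: it shells out to the `cockroach` binary, and the ports, store paths, and the `crdb_internal.gossip_nodes` connectivity check for step 4 are stand-ins I chose for this sketch.

```go
// Rough Go transliteration of the repro steps driven by
// test_multiple_nodes.tcl. Ports, store paths, and the connectivity
// query are illustrative, not the test's actual values.
package main

import (
	"fmt"
	"os/exec"
)

func run(args ...string) {
	out, err := exec.Command("cockroach", args...).CombinedOutput()
	fmt.Printf("cockroach %v: err=%v\n%s", args, err, out)
}

func main() {
	// Steps 1-3: start n1, then join n2 and n3 to it.
	run("start", "--insecure", "--store=/tmp/n1",
		"--listen-addr=localhost:26257", "--http-addr=localhost:8080", "--background")
	run("start", "--insecure", "--store=/tmp/n2",
		"--listen-addr=localhost:26258", "--http-addr=localhost:8081",
		"--join=localhost:26257", "--background")
	run("start", "--insecure", "--store=/tmp/n3",
		"--listen-addr=localhost:26259", "--http-addr=localhost:8082",
		"--join=localhost:26257", "--background")

	// Step 4: run a query to confirm all 3 nodes are connected.
	run("sql", "--insecure", "--host=localhost:26257",
		"-e", "SELECT node_id FROM crdb_internal.gossip_nodes")

	// Steps 5-7: quit n2, then n3, then n1. Per this issue, the
	// final quit livelocks instead of shutting n1 down.
	run("quit", "--insecure", "--host=localhost:26258")
	run("quit", "--insecure", "--host=localhost:26259")
	run("quit", "--insecure", "--host=localhost:26257")
}
```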

Relevant log lines from n1:

  • at step 4, just before n2 is asked to quit:
I190917 15:18:20.358811 110 storage/stores.go:259  [n1] wrote 2 node addresses to persistent storage
I190917 15:18:20.543398 1841 storage/replica_command.go:1580  [n1,replicate,s1,r2/1:/System/NodeLiveness{-Max}] change replicas (add [(n3,s3):2LEARNER] remove []): existing descriptor r2:/System/NodeLiveness{-Max} [(n1,s1):1, next=2, gen=0]
I190917 15:18:20.554046 1841 storage/replica_raft.go:291  [n1,s1,r2/1:/System/NodeLiveness{-Max}] proposing ADD_REPLICA[(n3,s3):2LEARNER]: after=[(n1,s1):1 (n3,s3):2LEARNER] next=3
I190917 15:18:20.560840 1841 storage/store_snapshot.go:995  [n1,replicate,s1,r2/1:/System/NodeLiveness{-Max}] sending LEARNER snapshot ed7d457c at applied index 23
I190917 15:18:20.560983 1841 storage/store_snapshot.go:1038  [n1,replicate,s1,r2/1:/System/NodeLiveness{-Max}] streamed snapshot to (n3,s3):2LEARNER: kv pairs: 10, log entries: 0, rate-limit: 8.0 MiB/sec, 0.00s
I190917 15:18:20.564621 1900 storage/raft_snapshot_queue.go:125  [n1,raftsnapshot,s1,r2/1:/System/NodeLiveness{-Max}] skipping snapshot; replica is likely a learner in the process of being added: (n3,s3):2LEARNER
I190917 15:18:20.593802 1841 storage/replica_command.go:1580  [n1,replicate,s1,r2/1:/System/NodeLiveness{-Max}] change replicas (add [(n3,s3):2] remove []): existing descriptor r2:/System/NodeLiveness{-Max} [(n1,s1):1, (n3,s3):2LEARNER, next=3, gen=1]
I190917 15:18:20.603547 1841 storage/replica_raft.go:291  [n1,s1,r2/1:/System/NodeLiveness{-Max}] proposing ADD_REPLICA[(n3,s3):2]: after=[(n1,s1):1 (n3,s3):2] next=3
I190917 15:18:20.763654 1886 storage/store.go:2613  [n1,replicaGC,s1,r2/1:/System/NodeLiveness{-Max}] removing replica r2/1
I190917 15:18:20.766736 1886 storage/replica_destroy.go:152  [n1,replicaGC,s1,r2/1:/System/NodeLiveness{-Max}] removed 10 (3+7) keys in 3ms [clear=0ms commit=3ms]
  • at step 5, when n2 leaves:
W190917 15:18:21.020789 1828 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Unavailable desc = transport is closing:
W190917 15:18:21.020840 1790 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:21.020854 1582 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:21.020875 1480 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Unavailable desc = transport is closing:
W190917 15:18:21.141698 1919 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
I190917 15:18:21.141888 1923 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to connect to n2 at ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
I190917 15:18:21.141923 1923 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257 [n2] event: BreakerTripped
I190917 15:18:21.141946 1923 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n2: failed to connect to n2 at ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
W190917 15:18:21.241802 1954 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
I190917 15:18:21.241955 198 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n2] tripped: failed to connect to n2 at ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
I190917 15:18:21.241968 198 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257 [n2] event: BreakerTripped
I190917 15:18:21.241987 198 rpc/nodedialer/nodedialer.go:160  [n1] unable to connect to n2: failed to connect to n2 at ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
  • at step 6 when n3 leaves:
I190917 15:18:21.543864 1866 storage/replica_command.go:1580  [n1,replicate,s1,r6/1:/Table/{SystemCon…-11}] change replicas (add [(n3,s3):2LEARNER] remove []): existing descriptor r6:/Table/{SystemConfigSpan/Start-11} [(n1,s1):1, next=2, gen=0]
W190917 15:18:22.109660 1876 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:22.109719 1783 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:22.109739 1825 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: rpc error: code = Unavailable desc = transport is closing:
  • at step 7 when n1 is instructed to leave:
I190917 15:18:22.135193 1 cli/start.go:860  received signal 'terminated'
I190917 15:18:22.137128 1 cli/start.go:925  initiating graceful shutdown of server
W190917 15:18:22.141659 1869 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:22.141997 1919 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I190917 15:18:22.143602 1970 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to connect to n3 at ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
I190917 15:18:22.147242 1970 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190917 15:18:22.149341 1970 rpc/nodedialer/nodedialer.go:160  [ct-client] unable to connect to n3: failed to connect to n3 at ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
W190917 15:18:22.146443 1925 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
I190917 15:18:22.153243 1856 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: rpc [::]:26257 [n3] tripped: failed to connect to n3 at ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
I190917 15:18:22.154910 1856 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: rpc [::]:26257 [n3] event: BreakerTripped
I190917 15:18:22.156976 1856 rpc/nodedialer/nodedialer.go:160  [server drain process,n1,txn=61eb57a5] unable to connect to n3: failed to connect to n3 at ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
W190917 15:18:22.214587 1929 storage/intentresolver/intent_resolver.go:508  failed to push during intent resolution: failed to push id=b7f82d25 key=/Local/Range/Table/SystemConfigSpan/Start/RangeDescriptor rw=true pri=0.03822413 stat=PENDING epo=0 ts=1568733501.543484158,0 orig=0.000000000,0 min=1568733501.543484158,0 max=0.000000000,0 wto=false seq=7
W190917 15:18:22.242227 1954 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:22.341772 1973 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:22.419949 110 gossip/gossip.go:1517  [n1] no incoming or outgoing connections
W190917 15:18:22.423219 1991 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
I190917 15:18:22.425611 1988 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: gossip [::]:26257->ff5f9200c59e:26258 tripped: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
I190917 15:18:22.427705 1988 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: gossip [::]:26257->ff5f9200c59e:26258 event: BreakerTripped
W190917 15:18:22.429760 1988 gossip/client.go:118  [n1] failed to start gossip client to ff5f9200c59e:26258: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused"
W190917 15:18:22.641760 2000 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:23.143794 1869 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:23.153431 1925 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:23.344661 1973 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:23.424255 2023 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:23.425768 1991 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
I190917 15:18:23.427116 2020 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:322  [n1] circuitbreaker: gossip [::]:26257->ff5f9200c59e:26259 tripped: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
I190917 15:18:23.430985 2020 vendor/github.com/cockroachdb/circuitbreaker/circuitbreaker.go:447  [n1] circuitbreaker: gossip [::]:26257->ff5f9200c59e:26259 event: BreakerTripped
W190917 15:18:23.433033 2020 gossip/client.go:118  [n1] failed to start gossip client to ff5f9200c59e:26259: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused"
W190917 15:18:23.647012 2000 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:23.941760 2037 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:24.424810 2067 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:24.427360 2023 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:24.541821 2014 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:24.641776 2103 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:24.769245 2114 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:24.944635 2037 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:25.341824 2117 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:25.427558 2067 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:25.544280 2014 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:25.645857 2103 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:25.772193 2114 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:26.344751 2117 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing cannot reuse client connection". Reconnecting...
W190917 15:18:26.425449 2161 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:26.641770 2055 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:26.741879 2059 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26258 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26258: connect: connection refused". Reconnecting...
W190917 15:18:26.860424 2187 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
W190917 15:18:26.941537 2193 vendor/google.golang.org/grpc/clientconn.go:1206  grpc: addrConn.createTransport failed to connect to {ff5f9200c59e:26259 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 172.31.0.2:26259: connect: connection refused". Reconnecting...
cockroach-teamcity added the C-test-failure and O-robot labels on Sep 17, 2019
cockroach-teamcity added this to the 19.2 milestone on Sep 17, 2019

knz commented Sep 17, 2019

Thanks. This is (1) a separate issue from the one before, and (2) apparently also a new/recent issue.

Investigating...

knz changed the title from "teamcity: failed test: TestDockerCLI" to "core: last node in 3-node cluster fails to shut down gracefully due to livelock" on Sep 17, 2019

knz commented Sep 17, 2019

I have updated the issue description with the analysis. This seems to be a new problem. @andreimatei, can you have a quick look and help me triage this? For liveness issues like this I'd usually ask Tobias, and you're next in line.

This seems to be a low-priority issue (and not a release blocker) because there's an easy production workaround: simply kill the last process.

However, the issue unfortunately does cause tests to flake: even though I could tweak the one test I used for the analysis to issue a kill instead of a graceful quit, I suspect other tests are failing non-deterministically because of this.

cc @andy-kimball, maybe you have an opinion on how to prioritize this.

knz added the C-bug, A-kv-gossip, A-kv-replication, and A-kv-server labels on Sep 17, 2019
knz added a commit to knz/cockroach that referenced this issue Sep 18, 2019
Release justification: deflakes a test and makes CI run faster

Prior to this patch the test would attempt to shut down the cluster
gracefully after asserting that the 3 nodes are properly joined.

Unfortunately this wait is running into separate issue
cockroachdb#40834 and this makes
the test clean-up flaky.

Since this unit test is not about quitting a cluster but merely
checking that the join is successful, this patch both works around the
related issue and accelerates the test by simply killing the nodes.

Release note: None
craig bot pushed a commit that referenced this issue Sep 18, 2019
40867: cli/interactive_tests: deflake and accelerate test_multiple_nodes r=knz a=knz

(same commit message as above)

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
andreimatei commented

This reproduces every time?
I don't see anything in the logs to tell me what's going on; for some reason, the draining process is not succeeding on n1. A stack dump would probably help. Is it possible that the client still has a transaction open against n1? That could hold up draining for a while.

FWIW, I believe that after waiting a minute the shutdown would proceed (or at least try to) because of this code:

return errors.Errorf("time limit reached, initiating hard shutdown%s", hardShutdownHint)
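That return is the escape hatch of a drain-then-deadline loop. A minimal sketch of the shape (the helper names, the retry cadence, and the deadline used in `main` are mine, not CockroachDB's actual code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errStillDraining = errors.New("drain incomplete")

// drainOnce stands in for one attempt to drain the server gracefully;
// in the livelock described above it would keep failing forever.
func drainOnce(ctx context.Context) error { return errStillDraining }

// drainWithDeadline retries the graceful drain until it succeeds or the
// time limit expires, at which point it gives up and reports that a hard
// shutdown should begin, mirroring the quoted cli/start.go line.
func drainWithDeadline(limit time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), limit)
	defer cancel()
	for {
		if err := drainOnce(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("time limit reached, initiating hard shutdown")
		case <-time.After(50 * time.Millisecond):
			// Back off briefly, then retry the drain.
		}
	}
}

func main() {
	fmt.Println(drainWithDeadline(time.Second))
}
```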


knz commented Sep 19, 2019

Is it possible that the client still has a transaction opened against n1? That could hold up draining for a while.

No, but that makes me think of something else: is it possible for an internal client to hold up draining in this way?

andreimatei commented

I'm not completely sure, but I don't think an InternalExecutor can hold back the draining. I believe the way in which SQL connections hold up draining is at the pgwire.Server level, which accounts for all pgwire connections. The draining process signals all connections to "find a good time to close" (meaning, we'll close them in between transactions). InternalExecutors are not accounted for.
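A minimal sketch of that accounting, as I understand it (assumed structure, not the actual pgwire.Server code): drain flips a flag on every registered connection, which each session checks between transactions; an InternalExecutor never registers, so the drain never waits on it.

```go
// Hypothetical model of the connection accounting described above;
// types and method names are illustrative, not CockroachDB's.
package main

import (
	"fmt"
	"sync"
)

// conn models one pgwire connection. Its handler checks draining at
// transaction boundaries and closes the session when it is set.
type conn struct {
	mu       sync.Mutex
	draining bool
}

func (c *conn) shouldClose() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.draining
}

// server models the pgwire.Server bookkeeping: only registered
// connections participate in the drain.
type server struct {
	mu    sync.Mutex
	conns map[*conn]struct{}
}

func newServer() *server { return &server{conns: make(map[*conn]struct{})} }

func (s *server) register(c *conn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.conns[c] = struct{}{}
}

// setDraining asks every registered connection to find a good time to
// close, i.e. to exit at its next transaction boundary.
func (s *server) setDraining() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for c := range s.conns {
		c.mu.Lock()
		c.draining = true
		c.mu.Unlock()
	}
}

func main() {
	s := newServer()
	external := &conn{}
	s.register(external) // a pgwire session registers itself
	internal := &conn{}  // an InternalExecutor-style user: never registered

	s.setDraining()
	fmt.Println(external.shouldClose(), internal.shouldClose()) // true false
}
```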


knz commented Sep 19, 2019

This reproduces every time?

No it does not.

I don't think an InternalExecutor can hold back the draining.

Well then we must have something else to investigate. There is certainly no external client connected at that point.

tbg added the branch-master label on Jan 22, 2020

knz commented Apr 22, 2020

It has become clear what is happening here since #45149: there's no way for the last node to shut down gracefully, because its liveness record is unavailable at that point. The hard shutdown timeout should take care of this now.

knz closed this as completed on Apr 22, 2020