teamcity: failed test: TestDistSQLDrainingHosts #31235

Closed · cockroach-teamcity opened this issue on Oct 11, 2018 · 8 comments
Labels: C-test-failure (Broken test, automatically or manually discovered), O-robot (Originated from a bot)
Milestone: 2.2

@cockroach-teamcity (Member) commented:
The following tests appear to have failed on master (testrace): TestDistSQLDrainingHosts

You may want to check for open issues.

#957737:

TestDistSQLDrainingHosts
...t running (UNKNOWN), cannot determine version
I181011 01:44:15.834468 13924 sql/event_log.go:126  [n2] Event: "node_join", target: 2, info: {Descriptor:{NodeID:2 Address:{NetworkField:tcp AddressField:127.0.0.1:36621} Attrs: Locality: ServerVersion:2.1 BuildTag:v2.2.0-alpha.00000000-1548-geebedfb StartedAt:1539222255767483113 LocalityAddress:[]} ClusterID:8107b94a-1278-4a2d-b8c6-21c85cbdcb34 StartedAt:1539222255767483113 LastUp:1539222255767483113}
I181011 01:44:15.838598 13913 sql/event_log.go:126  [n1,client=127.0.0.1:53332,user=root] Event: "create_database", target: 52, info: {DatabaseName:test Statement:CREATE DATABASE IF NOT EXISTS test User:root}
I181011 01:44:15.844545 13913 sql/event_log.go:126  [n1,client=127.0.0.1:53332,user=root] Event: "create_table", target: 53, info: {TableName:test.public.nums Statement:CREATE TABLE test.public.nums (num INT) User:root}
I181011 01:44:15.850063 13913 storage/replica_command.go:298  [n1,s1,r1/1:/M{in-ax}] initiating a split of this range at key /Table/53/1/1 [r2]
I181011 01:44:15.924668 13913 storage/store_snapshot.go:615  [n1,s1,r2/1:/{Table/53/1/1-Max}] sending preemptive snapshot e0fae41c at applied index 11
I181011 01:44:15.924928 13913 storage/store_snapshot.go:657  [n1,s1,r2/1:/{Table/53/1/1-Max}] streamed snapshot to (n2,s2):?: kv pairs: 8, log entries: 1, rate-limit: 2.0 MiB/sec, 1ms
I181011 01:44:15.926042 13962 storage/replica_raftstorage.go:803  [n2,s2,r2/?:{-}] applying preemptive snapshot at index 11 (id=e0fae41c, encoded size=380, 1 rocksdb batches, 1 log entries)
I181011 01:44:15.926360 13962 storage/replica_raftstorage.go:809  [n2,s2,r2/?:/{Table/53/1/1-Max}] applied preemptive snapshot in 0ms [clear=0ms batch=0ms entries=0ms commit=0ms]
I181011 01:44:15.926886 13913 storage/replica_command.go:812  [n1,s1,r2/1:/{Table/53/1/1-Max}] change replicas (ADD_REPLICA (n2,s2):2): read existing descriptor r2:/{Table/53/1/1-Max} [(n1,s1):1, next=2, gen=0]
I181011 01:44:15.930519 13913 storage/replica.go:3899  [n1,s1,r2/1:/{Table/53/1/1-Max}] proposing ADD_REPLICA((n2,s2):2): updated=[(n1,s1):1 (n2,s2):2] next=3
I181011 01:44:15.934564 13813 storage/replica_proposal.go:211  [n2,s2,r2/2:/{Table/53/1/1-Max}] new range lease repl=(n2,s2):2 seq=3 start=1539222255.930901995,0 epo=1 pro=1539222255.930904776,0 following repl=(n1,s1):1 seq=2 start=1539222255.561957387,0 exp=1539222264.562457433,0 pro=1539222255.562478480,0
I181011 01:44:15.935938 14012 storage/replica_command.go:812  [n2,s2,r2/2:/{Table/53/1/1-Max}] change replicas (REMOVE_REPLICA (n1,s1):1): read existing descriptor r2:/{Table/53/1/1-Max} [(n1,s1):1, (n2,s2):2, next=3, gen=0]
I181011 01:44:15.944741 14012 storage/replica.go:3899  [n2,s2,r2/2:/{Table/53/1/1-Max}] proposing REMOVE_REPLICA((n1,s1):1): updated=[(n2,s2):2] next=3
I181011 01:44:15.946738 14034 storage/store.go:2744  [n1,replicaGC,s1,r2/1:/{Table/53/1/1-Max}] removing replica r2/1
I181011 01:44:15.946983 14034 storage/replica.go:878  [n1,replicaGC,s1,r2/1:/{Table/53/1/1-Max}] removed 9 (2+7) keys in 0ms [clear=0ms commit=0ms]
I181011 01:44:15.986820 14058 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
1      [async] auto-upgrade
I181011 01:44:15.992705 14058 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I181011 01:44:15.986946 14057 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W181011 01:44:15.987104 14003 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 1: rpc error: code = Unavailable desc = transport is closing:
W181011 01:44:15.987579 13968 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 2: EOF:
I181011 01:44:15.992878 14058 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] closedts-subscription



TestDistSQLDrainingHosts
...istsqlplan/decode.html#eJyskT1rwzAQhvf-inJTCoJETrpoSumUoXbJBx2KCap1GEMsmZMELcH_vdgaEodYTSGjTn7uef3qCNooTGWNFsQncGCQQM6gIVOgtYa6cfhopb5BzBhUuvGuG-cMCkMI4giucgcEAVv5dcA1SoU0nQEDhU5Wh351Q1Ut6WepfW2BQeadeEyNRshbBsa701LrZIkgeMtuF7-UJWEpnaFpMvS-Zrt0u19nH5vJ06grGXWdFF4bUkioBvvzNp5mMUyz2b3tV-l2suTjYeaDMPz2xvldG_9DfPaP87s2fsW1RtsYbfGi-eubZ92LoCoxPJ81ngp8J1P0mnDMeq4fKLQu3PJwWOlw1QU8h3kUTgYwv4STKPwcN8-j8CIOL_4VO28ffgMAAP__nC9YuA==]]
got:[[https://cockroachdb.github.io/distsqlplan/decode.html#eJyUkEFL9DAQhu_fr_h4TwqBbfeYk-JpL63UFQ8SJDZDKLSZMklAWfrfpc1BV1jR47yT533CnBDYUWMnitDPqGEUZuGeYmRZo_Lg4N6gK4UhzDmtsVHoWQj6hDSkkaBxtK8jdWQdya6CgqNkh3GrnWWYrLzfhDxFKLQ56f8NB4JZFDinz9KYrCfoelG_F996L-RtYtnV59679rE5vnTt08PV9UXX_i-ujuLMIdKZ51JztRgFcp7KISNn6eleuN80ZWw3bgscxVS2dRkOoazWD36F6x_h_TfYLP8-AgAA__-zG6EE]]
I181011 01:50:47.819361 14067 sql/distsql_physical_planner_test.go:513  SucceedsSoon: 
expected:[[https://cockroachdb.github.io/distsqlplan/decode.html#eJyskT1rwzAQhvf-inJTCoJETrpoSumUoXbJBx2KCap1GEMsmZMELcH_vdgaEodYTSGjTn7uef3qCNooTGWNFsQncGCQQM6gIVOgtYa6cfhopb5BzBhUuvGuG-cMCkMI4giucgcEAVv5dcA1SoU0nQEDhU5Wh351Q1Ut6WepfW2BQeadeEyNRshbBsa701LrZIkgeMtuF7-UJWEpnaFpMvS-Zrt0u19nH5vJ06grGXWdFF4bUkioBvvzNp5mMUyz2b3tV-l2suTjYeaDMPz2xvldG_9DfPaP87s2fsW1RtsYbfGi-eubZ92LoCoxPJ81ngp8J1P0mnDMeq4fKLQu3PJwWOlw1QU8h3kUTgYwv4STKPwcN8-j8CIOL_4VO28ffgMAAP__nC9YuA==]]
got:[[https://cockroachdb.github.io/distsqlplan/decode.html#eJyUkEFL9DAQhu_fr_h4TwqBbfeYk-JpL63UFQ8SJDZDKLSZMklAWfrfpc1BV1jR47yT533CnBDYUWMnitDPqGEUZuGeYmRZo_Lg4N6gK4UhzDmtsVHoWQj6hDSkkaBxtK8jdWQdya6CgqNkh3GrnWWYrLzfhDxFKLQ56f8NB4JZFDinz9KYrCfoelG_F996L-RtYtnV59679rE5vnTt08PV9UXX_i-ujuLMIdKZ51JztRgFcp7KISNn6eleuN80ZWw3bgscxVS2dRkOoazWD36F6x_h_TfYLP8-AgAA__-zG6EE]]
I181011 01:50:48.833410 14067 sql/distsql_physical_planner_test.go:513  SucceedsSoon: 
expected:[[https://cockroachdb.github.io/distsqlplan/decode.html#eJyskT1rwzAQhvf-inJTCoJETrpoSumUoXbJBx2KCap1GEMsmZMELcH_vdgaEodYTSGjTn7uef3qCNooTGWNFsQncGCQQM6gIVOgtYa6cfhopb5BzBhUuvGuG-cMCkMI4giucgcEAVv5dcA1SoU0nQEDhU5Wh351Q1Ut6WepfW2BQeadeEyNRshbBsa701LrZIkgeMtuF7-UJWEpnaFpMvS-Zrt0u19nH5vJ06grGXWdFF4bUkioBvvzNp5mMUyz2b3tV-l2suTjYeaDMPz2xvldG_9DfPaP87s2fsW1RtsYbfGi-eubZ92LoCoxPJ81ngp8J1P0mnDMeq4fKLQu3PJwWOlw1QU8h3kUTgYwv4STKPwcN8-j8CIOL_4VO28ffgMAAP__nC9YuA==]]
got:[[https://cockroachdb.github.io/distsqlplan/decode.html#eJyUkEFL9DAQhu_fr_h4TwqBbfeYk-JpL63UFQ8SJDZDKLSZMklAWfrfpc1BV1jR47yT533CnBDYUWMnitDPqGEUZuGeYmRZo_Lg4N6gK4UhzDmtsVHoWQj6hDSkkaBxtK8jdWQdya6CgqNkh3GrnWWYrLzfhDxFKLQ56f8NB4JZFDinz9KYrCfoelG_F996L-RtYtnV59679rE5vnTt08PV9UXX_i-ujuLMIdKZ51JztRgFcp7KISNn6eleuN80ZWw3bgscxVS2dRkOoazWD36F6x_h_TfYLP8-AgAA__-zG6EE]]
I181011 01:50:49.834763 14933 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] transport racer
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I181011 01:50:49.838623 14170 kv/transport_race.go:113  transport race promotion: ran 62 iterations on up to 511 requests
W181011 01:50:49.843125 14666 storage/raft_transport.go:583  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W181011 01:50:49.843471 14616 storage/raft_transport.go:583  [n2] while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
W181011 01:50:49.844300 14454 gossip/gossip.go:1496  [n2] no incoming or outgoing connections
I181011 01:50:49.845574 14933 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I181011 01:50:49.834815 14934 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I181011 01:50:49.846322 14934 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] closedts-subscription
I181011 01:50:49.846745 14933 util/stop/stopper.go:537  quiescing; tasks left:
1      [async] closedts-subscription




Please assign, take a look and update the issue accordingly.

cockroach-teamcity added this to the 2.2 milestone on Oct 11, 2018
cockroach-teamcity added the C-test-failure and O-robot labels on Oct 11, 2018
@jordanlewis (Member) commented:

soon.go:49: condition failed to evaluate within 45s: 
	expected:[[https://cockroachdb.github.io/distsqlplan/decode.html#eJyskT1rwzAQhvf-inJTCoJETrpoSumUoXbJBx2KCap1GEMsmZMELcH_vdgaEodYTSGjTn7uef3qCNooTGWNFsQncGCQQM6gIVOgtYa6cfhopb5BzBhUuvGuG-cMCkMI4giucgcEAVv5dcA1SoU0nQEDhU5Wh351Q1Ut6WepfW2BQeadeEyNRshbBsa701LrZIkgeMtuF7-UJWEpnaFpMvS-Zrt0u19nH5vJ06grGXWdFF4bUkioBvvzNp5mMUyz2b3tV-l2suTjYeaDMPz2xvldG_9DfPaP87s2fsW1RtsYbfGi-eubZ92LoCoxPJ81ngp8J1P0mnDMeq4fKLQu3PJwWOlw1QU8h3kUTgYwv4STKPwcN8-j8CIOL_4VO28ffgMAAP__nC9YuA==]]
	got:[[https://cockroachdb.github.io/distsqlplan/decode.html#eJyUkEFL9DAQhu_fr_h4TwqBbfeYk-JpL63UFQ8SJDZDKLSZMklAWfrfpc1BV1jR47yT533CnBDYUWMnitDPqGEUZuGeYmRZo_Lg4N6gK4UhzDmtsVHoWQj6hDSkkaBxtK8jdWQdya6CgqNkh3GrnWWYrLzfhDxFKLQ56f8NB4JZFDinz9KYrCfoelG_F996L-RtYtnV59679rE5vnTt08PV9UXX_i-ujuLMIdKZ51JztRgFcp7KISNn6eleuN80ZWw3bgscxVS2dRkOoazWD36F6x_h_TfYLP8-AgAA__-zG6EE]]

Looks like the second node didn't make it into the cache of available nodes within 45 seconds.
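
For context, that message comes from a SucceedsSoon-style retry loop that gives the condition 45 seconds to become true before failing. A minimal sketch of the mechanism, using illustrative stand-ins (succeedsSoon, checkPlan) rather than the real testutils helpers:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// succeedsSoon retries fn until it returns nil or the timeout elapses,
// then reports the last error, mirroring the 45s window in the log above.
func succeedsSoon(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition failed to evaluate within %s: %v", timeout, err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}

func main() {
	// checkPlan stands in for re-running EXPLAIN (DISTSQL) and comparing
	// the resulting diagram URL against the expected one.
	checkPlan := func() error {
		return errors.New("physical plan still routed to a single node")
	}
	fmt.Println(succeedsSoon(45*time.Second, checkPlan))
}
```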

@jordanlewis (Member) commented:

Seems like the same thing as #31068 and co. I'll send a PR to add a "verification query" to populate the range cache after the relocate.
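
A minimal sketch of that idea, assuming the standard CockroachDB test harness (testcluster, sqlutils) and an illustrative table; the exact statements in the eventual PR may differ:

```go
package sql_test

import (
	"context"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/base"
	"github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
	"github.com/cockroachdb/cockroach/pkg/testutils/testcluster"
)

// Sketch only: the real test lives in sql/distsql_physical_planner_test.go.
func TestPlanAfterRelocateSketch(t *testing.T) {
	tc := testcluster.StartTestCluster(t, 2, base.TestClusterArgs{})
	defer tc.Stopper().Stop(context.Background())

	r := sqlutils.MakeSQLRunner(tc.ServerConn(0))
	r.Exec(t, "CREATE DATABASE test")
	r.Exec(t, "CREATE TABLE test.nums (num INT)")
	// Move the range to node 2.
	r.Exec(t, "ALTER TABLE test.nums EXPERIMENTAL_RELOCATE VALUES (ARRAY[2], 1)")

	// Verification query: a cheap read whose only purpose is its side
	// effect of refreshing the gateway's range cache with the new
	// leaseholder before any plan assertions run.
	r.Exec(t, "SELECT count(*) FROM test.nums")
}
```

The point of the final SELECT is purely its side effect: it forces the gateway to look up the range's new leaseholder, so later plan assertions don't race against a stale cache.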

@jordanlewis (Member) commented:

It's weird that this is so hard to reproduce on a gceworker.

5779 runs so far, 0 failures, over 17m45s

@jordanlewis (Member) commented:

13832 runs so far, 0 failures, over 42m25s

@tbg (Member) commented Oct 11, 2018:

make stressrace PKG=./pkg/sql TESTS=TestDistSQLDrainingHosts repros this for me immediately, once the 45s are up. Did you maybe run make stress instead?

I'll try with your PR to save you the trouble of doing so yourself.

craig bot pushed a commit that referenced this issue Oct 11, 2018
30913: server: add a configuration to enable GC of system.rangelog r=tschottdorf a=mvijaykarthik

The system.rangelog table currently grows without bound. The rate of
growth is slow (as long as there is no replica-rebalancing thrashing),
but it can still become a problem in long-running clusters.

This commit adds cluster settings to specify the GC interval and the TTL
for rows in system.rangelog. By default, the TTL of system.rangelog is
set to 30 days.

Fixes #21260

Release note: Add configuration to enable GC of system.rangelog
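
In outline, the GC this describes is a periodic, bounded DELETE of expired rows. A hedged sketch, assuming illustrative setting values and SQL rather than the shipped implementation:

```go
package main

import (
	"context"
	gosql "database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // SQL driver for connecting to CockroachDB
)

// gcRangeLog deletes system.rangelog rows older than ttl, once per
// interval. The two durations stand in for the new cluster settings;
// 30 days matches the default TTL mentioned above.
func gcRangeLog(ctx context.Context, db *gosql.DB, interval, ttl time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			cutoff := time.Now().Add(-ttl)
			// Bounded batches keep a large backlog from becoming one
			// huge transaction (LIMIT on DELETE is CockroachDB syntax).
			if _, err := db.ExecContext(ctx,
				"DELETE FROM system.rangelog WHERE timestamp <= $1 LIMIT 1000",
				cutoff,
			); err != nil {
				log.Printf("rangelog GC: %v", err)
			}
		}
	}
}

func main() {
	db, err := gosql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	gcRangeLog(context.Background(), db, 10*time.Minute, 30*24*time.Hour)
}
```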

31239: sql: attempt to deflake distsql physical planner tests r=tschottdorf a=jordanlewis

Make sure the range cache is populated before verifying things about it.

This seems like a hack, but otherwise I think these will just keep flaking.

Closes #25808.
Closes #31235.

Release note: None

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
craig bot closed this as completed in #31239 on Oct 11, 2018
@jordanlewis (Member) commented:

> Did you maybe run make stress instead?

No, I was running make stressrace on a gceworker. I would love to get to the bottom of why that didn't catch the issue for me. Were you running on a smaller machine than the gceworker spec?

@tbg (Member) commented Oct 11, 2018:

No, I was literally using my gceworker and the exact stressrace invocation above.

@asubiotto (Contributor) commented:

I was having the same issue trying to reproduce #25808.

craig bot pushed a commit that referenced this issue Oct 11, 2018
31285: backport-2.1: sql: attempt to deflake distsql physical planner tests r=jordanlewis a=tschottdorf

Backport 1/1 commits from #31239.

Flaky as hell on release-2.1. Someone please ~~bors~~merge this for me with the LGTM.

/cc @cockroachdb/release

---

Make sure the range cache is populated before verifying things about it.

This seems like a hack, but otherwise I think these will just keep flaking.

Closes #25808.
Closes #31235.

Release note: None


Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>