ccl/backupccl: TestBackupRestoreWithConcurrentWrites failed under stress #40361

Closed
cockroach-teamcity opened this issue Aug 30, 2019 · 11 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/66bd279c9aa682c2b7adcec87ec0c639b8039a33

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestBackupRestoreWithConcurrentWrites PKG=github.com/cockroachdb/cockroach/pkg/ccl/backupccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1461687&tab=buildLog

I190830 05:34:37.282071 77288 util/stop/stopper.go:542  quiescing; tasks left:
7      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282133 77288 util/stop/stopper.go:542  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282141 77289 util/stop/stopper.go:542  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] intent_resolver_ir_batcher
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282248 77288 util/stop/stopper.go:542  quiescing; tasks left:
5      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282314 77288 util/stop/stopper.go:542  quiescing; tasks left:
4      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190830 05:34:37.282332 70617 gossip/gossip.go:1517  [n2] no incoming or outgoing connections
I190830 05:34:37.282377 77288 util/stop/stopper.go:542  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282483 77288 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282556 77288 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282611 77288 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282666 77289 util/stop/stopper.go:542  quiescing; tasks left:
4      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282683 77290 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282745 77289 util/stop/stopper.go:542  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.282826 77289 util/stop/stopper.go:542  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
I190830 05:34:37.283011 77289 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
I190830 05:34:37.283037 77289 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
I190830 05:34:37.283141 77288 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.283168 77288 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190830 05:34:37.283214 77289 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-subscription
I190830 05:34:37.283235 77290 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190830 05:34:37.283353 77289 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] storage.IntentResolver: cleanup txn records
--- FAIL: TestBackupRestoreWithConcurrentWrites (3.42s)
    backup_test.go:1888: error executing 'SELECT id, balance, payload FROM data.bank WHERE id != balance': pq: table "bank" is offline: restoring

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Aug 30, 2019
@cockroach-teamcity cockroach-teamcity added this to the 19.2 milestone Aug 30, 2019
@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestBackupRestoreWithConcurrentWrites PKG=github.com/cockroachdb/cockroach/pkg/ccl/backupccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463610&tab=buildLog

10     [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_ir_batcher
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.324837 77781 util/stop/stopper.go:542  quiescing; tasks left:
9      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] intent_resolver_ir_batcher
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.324864 77781 util/stop/stopper.go:542  quiescing; tasks left:
9      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.324912 77781 util/stop/stopper.go:542  quiescing; tasks left:
8      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190902 05:31:18.325333 73347 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: EOF:
I190902 05:31:18.325503 77781 util/stop/stopper.go:542  quiescing; tasks left:
7      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.325557 77781 util/stop/stopper.go:542  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.325603 77781 util/stop/stopper.go:542  quiescing; tasks left:
5      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.325657 77781 util/stop/stopper.go:542  quiescing; tasks left:
4      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.325700 77781 util/stop/stopper.go:542  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.325780 77781 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.325827 77781 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.325874 77781 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190902 05:31:18.325931 74233 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
I190902 05:31:18.327114 77782 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.327153 77782 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber
W190902 05:31:18.327197 71708 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
I190902 05:31:18.327363 77780 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190902 05:31:18.327435 77780 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
W190902 05:31:18.327466 70989 gossip/gossip.go:1517  [n2] no incoming or outgoing connections
I190902 05:31:18.329196 77781 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190902 05:31:18.329316 77781 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
--- FAIL: TestBackupRestoreWithConcurrentWrites (2.38s)
    backup_test.go:1888: error executing 'SELECT id, balance, payload FROM data.bank WHERE id != balance': pq: table "bank" is offline: restoring

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/82de9887be1e9625c83cfd9903e9f5bf1590ef9d

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestBackupRestoreWithConcurrentWrites PKG=github.com/cockroachdb/cockroach/pkg/ccl/backupccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1473423&tab=buildLog

2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190907 05:17:43.665711 79814 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] intent_resolver_ir_batcher
1      [async] closedts-rangefeed-subscriber
W190907 05:17:43.665712 71672 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: EOF:
W190907 05:17:43.665788 72941 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 3: EOF:
W190907 05:17:43.665793 73705 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 3: EOF:
W190907 05:17:43.665826 79703 storage/intentresolver/intent_resolver.go:818  failed to gc transaction record: could not GC completed transaction anchored at /Table/3/1/54/2/1: node unavailable; try another peer
I190907 05:17:43.665892 79814 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
W190907 05:17:43.665926 71101 gossip/gossip.go:1517  [n2] no incoming or outgoing connections
W190907 05:17:43.666002 71938 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: EOF:
I190907 05:17:43.666014 79814 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190907 05:17:43.666102 71059 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190907 05:17:43.666224 71685 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190907 05:17:43.666354 71702 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190907 05:17:43.666448 71363 gossip/gossip.go:1517  [n3] no incoming or outgoing connections
W190907 05:17:43.666544 74093 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190907 05:17:43.666613 72936 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190907 05:17:43.666870 71652 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
W190907 05:17:43.667032 71960 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: EOF:
I190907 05:17:43.667252 79816 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190907 05:17:43.667415 79816 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190907 05:17:43.667679 79815 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] intent_resolver_ir_batcher
1      [async] closedts-rangefeed-subscriber
I190907 05:17:43.668097 79815 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190907 05:17:43.668143 79815 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190907 05:17:43.668160 79814 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190907 05:17:43.668196 79814 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-rangefeed-subscriber
I190907 05:17:43.669195 79815 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
--- FAIL: TestBackupRestoreWithConcurrentWrites (3.45s)
    backup_test.go:1899: error executing 'SELECT id, balance, payload FROM data.bank WHERE id != balance': pq: table "bank" is offline: restoring

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/e186c663735a9210a92a81f36f30aaabadc2754c

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestBackupRestoreWithConcurrentWrites PKG=github.com/cockroachdb/cockroach/pkg/ccl/backupccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1474512&tab=buildLog

I190908 05:18:01.784375 77982 util/stop/stopper.go:542  quiescing; tasks left:
5      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190908 05:18:01.784404 77982 util/stop/stopper.go:542  quiescing; tasks left:
4      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190908 05:18:01.784445 77982 util/stop/stopper.go:542  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190908 05:18:01.784486 77982 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190908 05:18:01.784656 77982 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
I190908 05:18:01.784717 77982 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190908 05:18:01.785116 71525 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: EOF:
W190908 05:18:01.785212 73941 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 3: EOF:
W190908 05:18:01.785262 71743 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: EOF:
W190908 05:18:01.785268 71763 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: EOF:
W190908 05:18:01.785263 73357 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 3: EOF:
W190908 05:18:01.785283 71299 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: EOF:
W190908 05:18:01.785332 71504 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 1: EOF:
W190908 05:18:01.785361 71256 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
W190908 05:18:01.785418 71434 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
W190908 05:18:01.785427 71542 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 1: EOF:
W190908 05:18:01.785493 74163 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: EOF:
W190908 05:18:01.785560 72716 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: EOF:
I190908 05:18:01.783824 77981 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
18     [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
I190908 05:18:01.786596 77981 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
I190908 05:18:01.786647 77981 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] intent_resolver_gc_batcher
I190908 05:18:01.786662 77981 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190908 05:18:01.786677 77980 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190908 05:18:01.786811 77982 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190908 05:18:01.786834 77982 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190908 05:18:01.787007 77981 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
--- FAIL: TestBackupRestoreWithConcurrentWrites (2.04s)
    backup_test.go:1899: error executing 'SELECT id, balance, payload FROM data.bank WHERE id != balance': pq: table "bank" is offline: restoring

@dt dt assigned pbardea and unassigned dt Sep 8, 2019
@dt
Member

dt commented Sep 8, 2019

backup_test.go:1899: error executing 'SELECT id, balance, payload FROM data.bank WHERE id != balance': pq: table "bank" is offline: restoring.

It looks like as soon as the RESTORE statement returns, the test expects the concurrent traffic to stop observing errors. At first I was thinking maybe the concurrent traffic was using a stale, cached copy of the OFFLINE desc, but I thought that offline descriptors can't be leased?

@pbardea
Contributor

pbardea commented Sep 10, 2019

Have been looking into this, and it seems very similar to (possibly the same issue as) #36841, which is the same failure but after an IMPORT rather than a RESTORE. It does indeed look like offline descriptors can't be leased.

From my investigation so far, the issue seems to be caused (or at least made a lot more prominent) by the concurrent writes on the same connection as the BACKUP and SELECT statements in the test. One hypothesis is that the concurrent UPDATEs are holding an outdated lease on bank, so the SELECT is not seeing the latest descriptor.
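
For context, a rough sketch of the pattern under discussion, in Go. This is not the actual backup_test.go code (the real test backs up and restores across clusters, and the table names, RESTORE URI, and flag handling here are placeholders): writers keep updating data.bank while RESTORE runs, and the validation SELECT is issued as soon as RESTORE returns.

// Illustrative only; names, SQL statements, and URIs are hypothetical.
package backupflake

import (
	"database/sql"
	"log"
	"sync/atomic"

	_ "github.com/lib/pq" // assumption: any Postgres-wire driver works against CockroachDB
)

// runScenario keeps UPDATE traffic running against data.bank while RESTORE
// executes, then validates with the same SELECT that fails in this issue.
func runScenario(db *sql.DB) {
	var allowErrors int32 = 1 // writes may fail while the table is offline
	done := make(chan struct{})

	go func() {
		for {
			select {
			case <-done:
				return
			default:
			}
			_, err := db.Exec(`UPDATE data.bank SET balance = balance + 1 WHERE id = 1`)
			if err != nil && atomic.LoadInt32(&allowErrors) == 0 {
				log.Fatalf("unexpected write error: %v", err)
			}
		}
	}()

	if _, err := db.Exec(`RESTORE data.bank FROM 'nodelocal:///backup'`); err != nil {
		log.Fatal(err)
	}
	atomic.StoreInt32(&allowErrors, 0) // from here on, traffic should not see errors

	// The check that flakes: immediately after RESTORE returns, the restored
	// table is expected to be public and readable.
	rows, err := db.Query(`SELECT id, balance, payload FROM data.bank WHERE id != balance`)
	if err != nil {
		log.Fatalf("error executing validation query: %v", err)
	}
	rows.Close()
	close(done)
}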

@dt
Member

dt commented Sep 10, 2019

I'm a bit confused, though, since if RESTORE returned, then a public desc for that table should have been committed. If that SELECT ran after that, it shouldn't be able to get that error: offline means the descriptor can't be leased, so the query must have read it directly.

Ah, wait, actually, maybe that SELECT isn't being run afterwards at all: we atomically read allowErr after we get a result back, but that doesn't mean the result we got back was generated after the flag flipped, just that the flag had flipped by the time we checked it. I think we might need to read the atomic flag before we execute the statement and then use that read value to check the result? EDIT: that would explain it if the UPDATE were the flake, but not that SELECT -- the SELECT is sent in the same goroutine that got back the RESTORE result, so it shouldn't be able to observe the old desc. Never mind me.
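
To illustrate the ordering question above, a minimal Go sketch; the names allowErrors and runQuery are hypothetical, not the actual test's identifiers:

// Sketch of the check-after vs. check-before flag ordering discussed above.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var allowErrors int32 = 1 // cleared once RESTORE has returned

func runQuery() error { return errors.New(`table "bank" is offline: restoring`) }

func main() {
	// Check-after pattern: the error may have been produced while errors were
	// still allowed, but the flag can flip before we inspect it, so a
	// legitimately old error gets reported as a test failure.
	err := runQuery()
	atomic.StoreInt32(&allowErrors, 0) // RESTORE returns concurrently with the query
	if err != nil && atomic.LoadInt32(&allowErrors) == 0 {
		fmt.Println("check-after reports:", err)
	}

	// Check-before pattern: snapshot the flag before issuing the statement;
	// even if RESTORE returns (and the flag clears) mid-query, the result is
	// judged against the state that held when the query was sent.
	atomic.StoreInt32(&allowErrors, 1)
	allowed := atomic.LoadInt32(&allowErrors) == 1
	atomic.StoreInt32(&allowErrors, 0) // RESTORE returns mid-query
	if err := runQuery(); err != nil && !allowed {
		fmt.Println("check-before reports:", err) // not reached: the old error is tolerated
	}
}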

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/99bed2718a9b5d1c6b94ce0d65dc40d1e746de16

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestBackupRestoreWithConcurrentWrites PKG=github.com/cockroachdb/cockroach/pkg/ccl/backupccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1481803&tab=buildLog

27     [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
I190912 05:24:23.101648 86299 util/stop/stopper.go:542  quiescing; tasks left:
9      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
I190912 05:24:23.101675 86299 util/stop/stopper.go:542  quiescing; tasks left:
9      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] intent_resolver_ir_batcher
1      [async] closedts-rangefeed-subscriber
I190912 05:24:23.101735 86299 util/stop/stopper.go:542  quiescing; tasks left:
8      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] intent_resolver_ir_batcher
1      [async] closedts-rangefeed-subscriber
W190912 05:24:23.101820 79076 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
I190912 05:24:23.101987 86299 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
W190912 05:24:23.102042 80071 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Unavailable desc = transport is closing:
W190912 05:24:23.102229 79606 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
W190912 05:24:23.101629 80041 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190912 05:24:23.101552 81779 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: rpc error: code = Unavailable desc = transport is closing:
I190912 05:24:23.102360 86298 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190912 05:24:23.102365 81990 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 3: EOF:
W190912 05:24:23.102419 80947 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 3: EOF:
I190912 05:24:23.102509 86298 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
W190912 05:24:23.102559 79577 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
I190912 05:24:23.102607 86298 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
W190912 05:24:23.101544 82614 storage/intentresolver/intent_resolver.go:818  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/53/1/2/RangeDescriptor: node unavailable; try another peer
I190912 05:24:23.102720 86299 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190912 05:24:23.102732 86300 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190912 05:24:23.102771 86299 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190912 05:24:23.102884 86300 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190912 05:24:23.102890 86299 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190912 05:24:23.103124 86300 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
--- FAIL: TestBackupRestoreWithConcurrentWrites (2.11s)
    backup_test.go:1899: error executing 'SELECT id, balance, payload FROM data.bank WHERE id != balance': pq: table "bank" is offline: restoring

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/99ba774d014955965f819d36fac7e14b2d405aa9

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestBackupRestoreWithConcurrentWrites PKG=github.com/cockroachdb/cockroach/pkg/ccl/backupccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1489828&tab=buildLog

2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190916 05:30:02.084110 82327 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 3: EOF:
W190916 05:30:02.084178 81876 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 3: EOF:
W190916 05:30:02.084243 79925 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: EOF:
W190916 05:30:02.084464 79327 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 3: EOF:
W190916 05:30:02.084493 81890 storage/intentresolver/intent_resolver.go:818  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/53/RangeDescriptor: node unavailable; try another peer
I190916 05:30:02.084532 85799 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
19     [async] storage.IntentResolver: cleanup txn records
1      [async] intent_resolver_ir_batcher
1      [async] intent_resolver_gc_batcher
1      [async] closedts-rangefeed-subscriber
W190916 05:30:02.084591 82285 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190916 05:30:02.084656 79661 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190916 05:30:02.084780 79654 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 1: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190916 05:30:02.084823 81802 storage/raft_transport.go:620  [n3] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190916 05:30:02.085317 79694 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
I190916 05:30:02.085346 85799 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190916 05:30:02.085416 79701 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
W190916 05:30:02.085418 79685 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
W190916 05:30:02.085495 82889 storage/intentresolver/intent_resolver.go:818  failed to gc transaction record: could not GC completed transaction anchored at /Local/Range/Table/53/1/7/RangeDescriptor: node unavailable; try another peer
W190916 05:30:02.085646 79912 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
I190916 05:30:02.085828 85800 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
I190916 05:30:02.085857 85800 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190916 05:30:02.086205 85801 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190916 05:30:02.086381 85799 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190916 05:30:02.086429 85800 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190916 05:30:02.086467 85801 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190916 05:30:02.086477 85799 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190916 05:30:02.086535 85800 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
--- FAIL: TestBackupRestoreWithConcurrentWrites (2.14s)
    backup_test.go:1899: error executing 'SELECT id, balance, payload FROM data.bank WHERE id != balance': pq: table "bank" is offline: restoring

@cockroach-teamcity
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a92c7d01d3076eabafbd536d8a344511ec9081c6

Parameters:

TAGS=
GOFLAGS=-parallel=4

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=TestBackupRestoreWithConcurrentWrites PKG=github.com/cockroachdb/cockroach/pkg/ccl/backupccl TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1500229&tab=buildLog

2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196452 85262 util/stop/stopper.go:542  quiescing; tasks left:
9      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196477 85263 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196496 85262 util/stop/stopper.go:542  quiescing; tasks left:
8      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196516 85263 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196533 85262 util/stop/stopper.go:542  quiescing; tasks left:
7      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196575 85262 util/stop/stopper.go:542  quiescing; tasks left:
6      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196611 85262 util/stop/stopper.go:542  quiescing; tasks left:
5      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196652 85262 util/stop/stopper.go:542  quiescing; tasks left:
4      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190921 05:17:57.196681 79183 storage/raft_transport.go:620  [n1] while processing outgoing Raft queue to node 2: rpc error: code = Canceled desc = grpc: the client connection is closing:
I190921 05:17:57.196690 85262 util/stop/stopper.go:542  quiescing; tasks left:
3      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196730 85262 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] storage.IntentResolver: cleanup txn records
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196781 85262 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] storage.IntentResolver: cleanup txn records
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.196819 85262 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
W190921 05:17:57.197189 79024 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
W190921 05:17:57.197271 79202 storage/raft_transport.go:620  [n2] while processing outgoing Raft queue to node 1: EOF:
I190921 05:17:57.198233 85264 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.198341 85263 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
1      [async] closedts-rangefeed-subscriber
I190921 05:17:57.198365 85263 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190921 05:17:57.198383 85264 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190921 05:17:57.198592 85264 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
I190921 05:17:57.199411 85262 util/stop/stopper.go:542  quiescing; tasks left:
2      [async] closedts-subscription
I190921 05:17:57.199437 85262 util/stop/stopper.go:542  quiescing; tasks left:
1      [async] closedts-subscription
--- FAIL: TestBackupRestoreWithConcurrentWrites (2.19s)
    backup_test.go:1896: error executing 'SELECT id, balance, payload FROM data.bank WHERE id != balance': pq: table "bank" is offline: restoring

craig bot pushed a commit that referenced this issue Sep 23, 2019
40996: sql: fix lease interaction with offline tables r=dt a=pbardea

This updates the lease management code to handle offline tables the
same way dropped tables are handled. We are not supposed to be able
to acquire a lease on an offline table, just as with a dropped table.

Fixes #40951 and therefore #40361.

Release justification: Fixes test flakes where a table descriptor lease
is being taken on an offline table.

Release note: None

Co-authored-by: Paul Bardea <pbardea@gmail.com>
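
For illustration only, a minimal Go sketch of the idea in #40996 (rejecting OFFLINE descriptors at lease-acquisition time exactly like DROPPED ones), using hypothetical names rather than the actual CockroachDB lease-manager code:

// Illustrative only -- descState, checkLeasable, and the error values are
// hypothetical, not the real lease-manager API.
package lease

import "errors"

type descState int

const (
	statePublic descState = iota
	stateDropped
	stateOffline
)

var (
	errTableDropped = errors.New("table is being dropped")
	errTableOffline = errors.New("table is offline")
)

// checkLeasable mirrors the fix described above: an OFFLINE descriptor is
// rejected at lease-acquisition time just like a DROPPED one, so concurrent
// queries can never hold a lease on a table that is mid-RESTORE.
func checkLeasable(state descState) error {
	switch state {
	case stateDropped:
		return errTableDropped
	case stateOffline:
		return errTableOffline
	default:
		return nil
	}
}
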
@pbardea
Contributor

pbardea commented Sep 24, 2019

Closed by #40996

@pbardea pbardea closed this as completed Sep 24, 2019