Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: election-after-restart failed [skipped] #35047

Closed
cockroach-teamcity opened this issue Feb 19, 2019 · 28 comments · Fixed by #53689
Closed

roachtest: election-after-restart failed [skipped] #35047

cockroach-teamcity opened this issue Feb 19, 2019 · 28 comments · Fixed by #53689
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog

The test failed on master:
	election.go:83,test.go:1212: expected query to succeed in less than 15s, took 29.657844041s

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Feb 19, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Feb 19, 2019
@tbg
Copy link
Member

tbg commented Feb 19, 2019

Attached the artifacts. A cursory glance shows that we're taking a long time to acquire a lease. This doesn't quite come out at the top of my priority queue right now, so I'm going to defer.

election-after-restart.zip

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d0a93d286ee42ceb94f986b6a06a1afd52cf914e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149020&tab=buildLog

The test failed on master:
	election.go:83,test.go:1211: expected query to succeed in less than 15s, took 31.799701966s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a119a3a158725c9e3f9b8084d9398601c0e67007

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1170795&tab=buildLog

The test failed on master:
	election.go:83,test.go:1214: expected query to succeed in less than 15s, took 31.431615225s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog

The test failed on master:
	election.go:83,test.go:1214: expected query to succeed in less than 15s, took 29.659654027s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a512e390f7f2f2629f3f811bab5866c46e3e5713

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1183678&tab=buildLog

The test failed on provisional_201903122203_v19.1.0-beta.20190318:
	election.go:83,test.go:1214: expected query to succeed in less than 15s, took 26.450120864s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9399d559ae196e5cf2ad122195048ff9115ab56a

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1194326&tab=buildLog

The test failed on release-19.1:
	election.go:83,test.go:1214: expected query to succeed in less than 15s, took 21.725999878s

@tbg
Copy link
Member

tbg commented Mar 25, 2019

Taken from the last run, presumably the others are similar:

https://gist.github.com/tbg/895c05d4b6672e382bc8a09cf2c15999

Looks like most of the gaps (not all) have the following pattern

  1. request lease
  2. nothing for 3s
  3. "refresh (reason: reasonTicks) returning AmbiguousResultError for command without MaxLeaseIndex"
  4. lease retried
  5. success

We are presumably dropping the pending lease request (why exactly? No leaseholder known or an election fight?) and because it's a special request it has no replay protection and we "have" to wait for the caller.

@bdarnell I wonder what a good way to fix this without additional hacks could look like. The real solution is to handle ErrProposalDropped (if that is the problem here).

2019-03-23 07:32:39.147341+00:00	00:00:02.704266	request range lease (attempt #2)	[n2,s2,r34/2:/{Table/53/1/1…-Max}]		/cockroach.roachpb.Internal/Batch	514
2019-03-23 07:32:41.083074+00:00	00:00:04.639999	request range lease (attempt #3)	[n2,s2,r34/2:/{Table/53/1/1…-Max}]		/cockroach.roachpb.Internal/Batch	514
2019-03-23 07:32:41.083094+00:00	00:00:04.640019	=== SPAN START: request range lease ===			request range lease	515
2019-03-23 07:32:41.083108+00:00	00:00:04.640033	=== SPAN START: [async] storage.pendingLeaseRequest: requesting lease ===		
2019-03-23 07:32:41.096306+00:00	00:00:04.653231	acquired replica mu	[n2,s2,r35/2:/{Table/53/1/1…-Max}]		[async] storage.pendingLeaseRequest: requesting lease	519
2019-03-23 07:32:43.94729+00:00	00:00:07.504215	"refresh (reason: reasonTicks) returning AmbiguousResultError for command without MaxLeaseIndex: proposer_replica:<node_id:2 store_id:2 replica_id:2 > proposer_lease_sequence:1 replicated_eval_result:<block_reads:true state:<lease:<start:<wall_time:1553326361096119110 > replica:<node_id:2 store_id:2 replica_id:2 > proposed_ts:<wall_time:1553326361096123051 > epoch:2 sequence:2 > > is_lease_request:true timestamp:<wall_time:1553326361096159869 > delta:<sys_bytes:10 > prev_lease_proposal:<wall_time:1553326348962369099 > > write_batch:<data:""\000\000\000\000\000\000\000\000\001\000\000\000\001\t\001i\253rrll-\0007\022\004\010\000\020\000\030\000 \000(\0002)L\372\220\025\003\n\n\010\306\306\211\253\207\303\241\307\025\032\006\010\002\020\002\030\002*\n\010\253\345\211\253\207\303\241\307\0250\0028\002"" > "	[n2,s2,r35/2:/{Table/53/1/1…-Max}]		[async] storage.pendingLeaseRequest: requesting lease	519
2019-03-23 07:32:43.947329+00:00	00:00:07.504254	replica.Send got error: result is ambiguous (unknown status for command without MaxLeaseIndex at refreshProposalsLocked time (refresh reason: reasonTicks))	[n2,s2,r35/2:/{Table/53/1/1…-Max}]		[async] storage.pendingLeaseRequest: requesting lease	519
2019-03-23 07:32:43.947594+00:00	00:00:07.504519	request range lease (attempt #2)	[n2,s2,r35/2:/{Table/53/1/1…-Max}]		/cockroach.roachpb.Internal/Batch	517
2019-03-23 07:32:44.186478+00:00	00:00:07.743402	request range lease (attempt #3)	[n2,s2,r35/2:/{Table/53/1/1…-Max}]		/cockroach.roachpb.Internal/Batch	517
2019-03-23 07:32:44.194452+00:00	00:00:07.751377	acquired replica mu	[n2,s2,r36/2:/{Table/53/1/1…-Max}]		[async] storage.pendingLeaseRequest: requesting lease	527
2019-03-23 07:32:47.147471+00:00	00:00:10.704396	"refresh (reason: reasonTicks) returning AmbiguousResultError for command without MaxLeaseIndex: proposer_replica:<node_id:2 store_id:2 replica_id:2 > proposer_lease_sequence:1 replicated_eval_result:<block_reads:true state:<lease:<start:<wall_time:1553326364194269660 > replica:<node_id:2 store_id:2 replica_id:2 > proposed_ts:<wall_time:1553326364194273583 > epoch:2 sequence:2 > > is_lease_request:true timestamp:<wall_time:1553326364194313992 > delta:<sys_bytes:10 > prev_lease_proposal:<wall_time:1553326348962369099 > > write_batch:<data:""\000\000\000\000\000\000\000\000\001\000\000\000\001\t\001i\254rrll-\0007\022\004\010\000\020\000\030\000 \000(\0002)\262\227*7\003\n\n\010\334\323\261\360\222\303\241\307\025\032\006\010\002\020\002\030\002*\n\010\257\362\261\360\222\303\241\307\0250\0028\002"" > "	[n2,s2,r36/2:/{Table/53/1/1…-Max}]		[async] storage.pendingLeaseRequest: requesting lease	527
2019-03-23 07:32:47.147512+00:00	00:00:10.704437	replica.Send got error: result is ambiguous (unknown status for command without MaxLeaseIndex at refreshProposalsLocked time (refresh reason: reasonTicks))	[n2,s2,r36/2:/{Table/53/1/1…-Max}]		[async] storage.pendingLeaseRequest: requesting lease	527
2019-03-23 07:32:47.147885+00:00	00:00:10.70481	evaluated request	[n2,s2,r36/2:/{Table/53/1/1…-Max}]		[async] storage.pendingLeaseRequest: requesting lease	529
2019-03-23 07:32:47.14789+00:00	00:00:10.704815	acquired replica mu	[n2,s2,r36/2:/{Table/53/1/1…-Max}]		[async] storage.pendingLeaseRequest: requesting lease	529
2019-03-23 07:32:48.987474+00:00	00:00:12.544399	"refresh (reason: reasonNewLeader) returning AmbiguousResultError for command without MaxLeaseIndex: proposer_replica:<node_id:2 store_id:2 replica_id:2 > proposer_lease_sequence:1 replicated_eval_result:<block_reads:true state:<lease:<start:<wall_time:1553326367147574607 > replica:<node_id:2 store_id:2 replica_id:2 > proposed_ts:<wall_time:1553326367147585736 > epoch:2 sequence:2 > > is_lease_request:true timestamp:<wall_time:1553326367147671582 > delta:<sys_bytes:10 > prev_lease_proposal:

@bdarnell
Copy link
Contributor

ErrProposalDropped is returned when there is no known raft leader. When that happens, it should be followed by a reasonNewLeader reproposal as soon as the election completes, which avoids waiting out reasonTicks.

One scenario in which you do have to wait for ticks is if n1 proposes a new lease while (n1 believes) n2 is leader (it must have been true at some point for n1 to believe this). The proposal is forwarded to n2. If n2 has lost leadership at this point, it will drop the proposal (and now we're too far removed from the original proposer to do anything with the error). I think that's probably what's happening here although I don't know why this appears to have become more common recently.

I don't immediately see a better/quicker solution than #21849. (which involves introducing a new return path so we can tell the original proposer that its forwarded proposal has been dropped.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/1cbf3680129e47bd310640bf32b665662f30faa9

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1217781&tab=buildLog

The test failed on release-19.1:
	election.go:83,test.go:1228: expected query to succeed in less than 15s, took 24.772590044s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/76e8e78d2eca7c81ce7f76aa0504a0cf4b91e27f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1231790&tab=buildLog

The test failed on release-19.1:
	election.go:83,test.go:1237: expected query to succeed in less than 15s, took 15.42340216s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/134478e4dde16919eb86efb81fb22d8cce8a9701

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1234680&tab=buildLog

The test failed on release-19.1:
	election.go:83,test.go:1237: expected query to succeed in less than 15s, took 20.74453444s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/509c5b130fb1ad0042beb74e083817aa68e4fc92

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1237068&tab=buildLog

The test failed on release-19.1:
	election.go:83,test.go:1237: expected query to succeed in less than 15s, took 25.338079789s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d3f704f839ccaef7f10c3af48c78a26d390ae1dc

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1241436&tab=buildLog

The test failed on master:
	election.go:83,test.go:1237: expected query to succeed in less than 15s, took 25.608109605s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9938cb1a2cca4c0350244f76845f0c61391d44a7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1249130&tab=buildLog

The test failed on release-19.1:
	election.go:83,test.go:1237: expected query to succeed in less than 15s, took 18.662278092s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c0d8e9d838fca9f79bc10d9fb43eeeaa502fdd91

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1255139&tab=buildLog

The test failed on release-19.1:
	election.go:83,test.go:1237: expected query to succeed in less than 15s, took 25.030747296s

@solongordon solongordon mentioned this issue Apr 23, 2019
18 tasks
@bdarnell
Copy link
Contributor

Looking back at the recent reports of this (the ones that still have their artifacts), there are a lot of ADD_REPLICAs, snapshots, and snapshot failures because the follower nodes haven't yet caught up on the splits from before the restart.

It looks like for whatever reason the initial ranges aren't getting fully replicated before the ALTER TABLE SPLIT starts, so when we reach the restart we have a bunch of incompletely-replicated ranges, and followers who haven't even caught up on splits, so they'll reject incoming snapshots. I'm not 100% sure how this leads to the observed behavior, but it could fairly easily explain a lot of slow starts.

Since this isn't core to what the test is trying to prove, I think adding a sleep before the restart to let all the replication and split processing catch up would be an easy way to deflake the test. I'm not sure whether there would still be stuff to do (like the ErrProposalDropped issue discussed above) after that.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 24, 2019
Logs suggest that we're seeing failures when nodes aren't caught up on
all the split processing before the restart, so give this a chance to
finish.

Updates cockroachdb#35047

Release note: None
craig bot pushed a commit that referenced this issue Apr 24, 2019
36984: storage: Add comments about leases and liveness r=tbg a=bdarnell

I had to study this code for #35986, so I took the opportunity to
update the documentation while it's fresh in my mind.

Release note: None

37100: roachtest: Hopefully deflake election-after-restart r=tbg a=bdarnell

Logs suggest that we're seeing failures when nodes aren't caught up on
all the split processing before the restart, so give this a chance to
finish.

Updates #35047

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/99306ec3e9fcbba01c05431cbf496e8b5b8954b4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1260033&tab=buildLog

The test failed on master:
	election.go:90,test.go:1245: expected query to succeed in less than 15s, took 30.653788984s

@tbg
Copy link
Member

tbg commented Apr 25, 2019

^- the roachtest version here seems to have had #37100:

image

Back to square one.

@tbg
Copy link
Member

tbg commented Apr 25, 2019

Actually this is the SHA that matters:

image

bc40e38 has that PR too, though.

@bdarnell
Copy link
Contributor

Symptoms look similar to before, with each slow range having a failed lease request. This time it's a mix of reasonNewLeaderOrConfigChange and reasonTicks. I think before we were seeing mostly reasonTicks, but I'm not sure.

2019-04-25 07:25:39.083595+00:00	00:00:09.941142	"refresh (reason: reasonNewLeaderOrConfigChange) returning AmbiguousResultError for command without MaxLeaseIndex: proposer_replica:<node_id:3 store_id:3 replica_id:2 > proposer_lease_sequence:1 replicated_eval_result:<block_reads:true state:<lease:<start:<wall_time:1556177139081205475 > replica:<node_id:3 store_id:3 replica_id:2 > proposed_ts:<wall_time:1556177139081225857 > epoch:2 sequence:2 > > is_lease_request:true timestamp:<wall_time:1556177139081275451 > delta:<sys_bytes:10 > prev_lease_proposal:<wall_time:1556177116058892330 > > write_batch:<data:""\000\000\000\000\000\000\000\000\001\000\000\000\001\t\001i\252rrll-\0007\022\004\010\000\020\000\030\000 \000(\0002)4er\372\003\n\n\010\343\255\223\364\316\333\251\314\025\032\006\010\003\020\003\030\002*\n\010\201\315\224\364\316\333\251\314\0250\0028\002"" > "	[n3,s3,r34/2:/Table/53/1/1{100-200}]		[async] storage.pendingLeaseRequest: requesting lease	510
2019-04-25 07:25:39.083625+00:00	00:00:09.941171	replica.Send got error: result is ambiguous (unknown status for command without MaxLeaseIndex at refreshProposalsLocked time (refresh reason: reasonNewLeaderOrConfigChange))	[n3,s3,r34/2:/Table/53/1/1{100-200}]		[async] storage.pendingLeaseRequest: requesting lease	510

The same split-related messages i saw before #37100 are there.

E190425 07:25:49.296397 1110 storage/queue.go:974  [n2,raftsnapshot,s2,r114/3:/Table/53/1/9{100-200}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot 169529cc at applied index 22 with error: [n3,s3],r114: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r38/2:/{Table/53/1/1…-Max}] (incoming /Table/53/1/9{100-200})
E190425 07:25:49.297292 1073 storage/queue.go:974  [n2,raftsnapshot,s2,r69/3:/Table/53/1/4{600-700}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot d831e6d9 at applied index 22 with error: [n3,s3],r69: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r38/2:/{Table/53/1/1…-Max}] (incoming /Table/53/1/4{600-700})
E190425 07:25:49.298246 1123 storage/queue.go:974  [n2,raftsnapshot,s2,r61/3:/Table/53/1/3{800-900}] snapshot failed: (n3,s3):2: remote couldn't accept Raft snapshot de384a90 at applied index 22 with error: [n3,s3],r61: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n3,s3,r38/2:/{Table/53/1/1…-Max}] (incoming /Table/53/1/3{800-900})
W190425 07:25:48.402389 115 storage/split_trigger_helper.go:150  [n3,s3,r91/2:{-}] would have dropped incoming MsgApp to wait for split trigger, but allowing due to 105 (>100) ticks
W190425 07:25:48.801983 105 storage/split_trigger_helper.go:150  [n3,s3,r43/2:{-}] would have dropped incoming MsgApp to wait for split trigger, but allowing due to 101 (>100) ticks
W190425 07:25:49.202276 131 storage/split_trigger_helper.go:150  [n3,s3,r69/2:{-}] would have dropped incoming MsgApp to wait for split trigger, but allowing due to 101 (>100) ticks

I still see a lot of ADD_REPLICA messages interleaved with the "initiating a split" messages and continuing afterwards. It looks like the initial ranges aren't getting fully upreplicated before the splits, and then maybe the serialization of snapshots is enough to keep it from catching up before the restart? Adding another sleep before the split might fix things, although it's just adding another hack.

bdarnell added a commit to bdarnell/cockroach that referenced this issue Apr 30, 2019
The sleep added in the previous PR didn't eliminate the flakiness,
because it appears that the initial ranges weren't fully replicated.
Add another sleep before the splits in another attempt to get
everything into a stable configuration before restart.

Updates cockroachdb#35047

Release note: None
craig bot pushed a commit that referenced this issue Apr 30, 2019
37212: roachtest: Another attempt at deflaking election-after-restart r=tbg a=bdarnell

The sleep added in the previous PR didn't eliminate the flakiness,
because it appears that the initial ranges weren't fully replicated.
Add another sleep before the splits in another attempt to get
everything into a stable configuration before restart.

Updates #35047

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/24feca7a4106f08c73534e16ebb79d949a479f35

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1268176&tab=buildLog

The test failed on branch=master, cloud=gce:
	election.go:96,test.go:1251: expected query to succeed in less than 15s, took 31.802816962s

@bdarnell
Copy link
Contributor

bdarnell commented May 1, 2019

Sigh. That failure includes my most recent change and has exactly the same behavior. Looks like we're going to have to actually understand this instead of just hacking around it with sleeps.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/23155799e92e54915ae66259d06a630e981afbeb

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1277061&tab=buildLog

The test failed on branch=master, cloud=gce:
	election.go:96,test.go:1251: expected query to succeed in less than 15s, took 30.100598236s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/45e15e05abff25e099ca59f4c5cb40a6cf695e6d

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1285294&tab=buildLog

The test failed on branch=master, cloud=gce:
	election.go:96,test.go:1251: expected query to succeed in less than 15s, took 17.437465613s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/c25518b4e9a723d8de0dba30a95ce0ade7963aed

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1286188&tab=buildLog

The test failed on branch=master, cloud=gce:
	election.go:96,test.go:1251: expected query to succeed in less than 15s, took 27.903806268s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dcd4cc5e37ebbcebbbf0f01670811b176a58bf89

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1288281&tab=buildLog

The test failed on branch=release-19.1, cloud=gce:
	election.go:96,test.go:1251: expected query to succeed in less than 15s, took 16.477467074s

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/ba5c092a726134b73e789c2047f7ec151be7c1a1

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=election-after-restart PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1288263&tab=buildLog

The test failed on branch=master, cloud=gce:
	election.go:96,test.go:1251: expected query to succeed in less than 15s, took 21.037465538s

bdarnell added a commit to bdarnell/cockroach that referenced this issue May 14, 2019
craig bot pushed a commit that referenced this issue May 14, 2019
37507: roachtest: Skip flaky election-after-restart r=tbg a=bdarnell

See #35047

Release note: None

Co-authored-by: Ben Darnell <ben@bendarnell.com>
@bdarnell bdarnell changed the title roachtest: election-after-restart failed roachtest: election-after-restart failed [skipped] May 14, 2019
@bdarnell bdarnell removed their assignment May 14, 2019
ajwerner pushed a commit to ajwerner/cockroach that referenced this issue May 15, 2019
The sleep added in the previous PR didn't eliminate the flakiness,
because it appears that the initial ranges weren't fully replicated.
Add another sleep before the splits in another attempt to get
everything into a stable configuration before restart.

Updates cockroachdb#35047

Release note: None
@tbg tbg added the branch-master Failures and bugs on the master branch. label Jan 22, 2020
@tbg tbg assigned knz Aug 27, 2020
@knz
Copy link
Contributor

knz commented Aug 31, 2020

I investigated this: I can't repro the issue after many runs.

I also see that the failure was for stressrace. Any race-enabled build makes everything slow. The fact the test "failed" by merely running for 20s when 15s were expected is... a waste of everyone's time.

I think that roachtests should not run under stressrace, full stop (or perhaps just once a week, with no timeout, under the condition that they succeed without race prior)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants