
roachtest: jepsen/3/register/split failed #30792

Closed
cockroach-teamcity opened this issue Sep 29, 2018 · 59 comments · Fixed by #32166
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently.
Milestone

Comments

@cockroach-teamcity
Member

SHA: https://github.com/cockroachdb/cockroach/commits/031748232a601aab191fd3fb00dc298ca20e4506

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=jepsen/3/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=932135&tab=buildLog

The test failed on master:
	test.go:498,jepsen.go:232,jepsen.go:285: /home/agent/work/.go/bin/roachprod run teamcity-932135-jepsen-3:6 -- bash -e -c "\
		cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		 ~/lein run test \
		   --tarball file://${PWD}/cockroach.tgz \
		   --username ${USER} \
		   --ssh-private-key ~/.ssh/id_rsa \
		   --os ubuntu \
		   --time-limit 300 \
		   --concurrency 30 \
		   --recovery-time 25 \
		   --test-count 1 \
		   -n 10.128.0.29 -n 10.128.0.26 -n 10.128.0.28 -n 10.128.0.27 -n 10.128.0.25 \
		   --test register --nemesis split \
		> invoke.log 2>&1 \
		" returned:
		stderr:
		
		stdout:
		Error:  exit status 1
		: exit status 1

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Sep 29, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Sep 29, 2018
@petermattis petermattis assigned bdarnell and unassigned petermattis Sep 29, 2018
@bdarnell
Contributor

bdarnell commented Oct 1, 2018

This looks like a real one! I'm still working on making sense of the output but jepsen ran to completion and found a non-linearizable history.

@bdarnell
Contributor

bdarnell commented Oct 1, 2018

Could be a red herring, but #30778 merged in the day before this failure.

@petermattis
Collaborator

@bdarnell What are you looking at that makes you suspect #30778? (I'm not fluent in looking through jepsen failures).

@bdarnell
Contributor

bdarnell commented Oct 1, 2018

Just that this was the split nemesis, and #30778 touched the split code. Nothing else in the last day or so (prior to commit 0317482) caught my attention, but since this failure is non-deterministic it wasn't necessarily introduced within the previous day.

@petermattis
Collaborator

Ok. #30778 seems pretty innocuous. It is not removing a replica from the replicas map temporarily while committing a split. I'll poke around in the logs for this failure to see if anything jumps out at me.

@bdarnell
Contributor

bdarnell commented Oct 1, 2018

OK, I think these are the relevant portions of the logs. We see from the reads here (starting at line 15) that key 167 has value 1. Several writes (with values 2 and 4) are attempted, but they all fail (unambiguously, I think, although I don't see any explicit handling of ambiguity here), just after the nemesis causes a split at key 165.

INFO [2018-09-29 11:54:47,520] jepsen worker 16 - jepsen.util 16	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 14 - jepsen.util 14	:invoke	:write	[167 2]
INFO [2018-09-29 11:54:47,520] jepsen worker 15 - jepsen.util 15	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 10 - jepsen.util 10	:invoke	:write	[167 2]
INFO [2018-09-29 11:54:47,520] jepsen worker 17 - jepsen.util 17	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 19 - jepsen.util 19	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 13 - jepsen.util 13	:invoke	:write	[167 2]
INFO [2018-09-29 11:54:47,520] jepsen worker 11 - jepsen.util 11	:invoke	:cas	[167 [2 3]]
INFO [2018-09-29 11:54:47,520] jepsen worker 18 - jepsen.util 18	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:47,520] jepsen worker 12 - jepsen.util 12	:invoke	:write	[167 4]
INFO [2018-09-29 11:54:47,522] jepsen worker 6 - jepsen.util 6	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,522] jepsen worker 8 - jepsen.util 8	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,523] jepsen worker 7 - jepsen.util 7	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,523] jepsen worker 9 - jepsen.util 9	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,523] jepsen worker 15 - jepsen.util 15	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,523] jepsen worker 16 - jepsen.util 16	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,523] jepsen worker 17 - jepsen.util 17	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,523] jepsen worker 11 - jepsen.util 11	:fail	:cas	[167 [2 3]]
INFO [2018-09-29 11:54:47,524] jepsen worker 3 - jepsen.util 3	:fail	:cas	[166 [3 2]]
INFO [2018-09-29 11:54:47,524] jepsen worker 19 - jepsen.util 19	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,524] jepsen worker 18 - jepsen.util 18	:ok	:read	[167 1]
INFO [2018-09-29 11:54:47,528] jepsen worker 4 - jepsen.util 4	:fail	:cas	[166 [2 1]]
INFO [2018-09-29 11:54:47,530] jepsen worker 20 - jepsen.util 20	:ok	:write	[165 0]
INFO [2018-09-29 11:54:47,531] jepsen worker 1 - jepsen.util 1	:ok	:write	[166 4]
INFO [2018-09-29 11:54:47,541] jepsen nemesis - jepsen.util :nemesis	:info	["splits" :split]	[:split :test 165]
INFO [2018-09-29 11:54:47,562] jepsen worker 2 - jepsen.util 2	:fail	:write	[166 2]
INFO [2018-09-29 11:54:47,564] jepsen worker 12 - jepsen.util 12	:fail	:write	[167 4]
INFO [2018-09-29 11:54:47,566] jepsen worker 5 - jepsen.util 5	:ok	:read	[166 4]
INFO [2018-09-29 11:54:47,571] jepsen worker 14 - jepsen.util 14	:fail	:write	[167 2]
INFO [2018-09-29 11:54:47,574] jepsen worker 0 - jepsen.util 0	:ok	:write	[166 1]
INFO [2018-09-29 11:54:47,581] jepsen worker 10 - jepsen.util 10	:fail	:write	[167 2]
INFO [2018-09-29 11:54:47,589] jepsen worker 13 - jepsen.util 13	:fail	:write	[167 2]

Next time we touch key 167, it has value 4:

INFO [2018-09-29 11:54:48,020] jepsen worker 16 - jepsen.util 16	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 17 - jepsen.util 17	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 10 - jepsen.util 10	:invoke	:cas	[167 [4 4]]
INFO [2018-09-29 11:54:48,020] jepsen worker 15 - jepsen.util 15	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 13 - jepsen.util 13	:invoke	:cas	[167 [2 3]]
INFO [2018-09-29 11:54:48,020] jepsen worker 19 - jepsen.util 19	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 18 - jepsen.util 18	:invoke	:read	[167 nil]
INFO [2018-09-29 11:54:48,020] jepsen worker 12 - jepsen.util 12	:invoke	:cas	[167 [4 3]]
INFO [2018-09-29 11:54:48,020] jepsen worker 14 - jepsen.util 14	:invoke	:write	[167 1]
INFO [2018-09-29 11:54:48,020] jepsen worker 11 - jepsen.util 11	:invoke	:cas	[167 [2 1]]
INFO [2018-09-29 11:54:48,021] jepsen worker 6 - jepsen.util 6	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,022] jepsen worker 8 - jepsen.util 8	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,022] jepsen worker 1 - jepsen.util 1	:fail	:cas	[166 [3 3]]
INFO [2018-09-29 11:54:48,022] jepsen worker 9 - jepsen.util 9	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,022] jepsen worker 7 - jepsen.util 7	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,023] jepsen worker 5 - jepsen.util 5	:ok	:read	[166 1]
INFO [2018-09-29 11:54:48,023] jepsen worker 19 - jepsen.util 19	:ok	:read	[167 4]
INFO [2018-09-29 11:54:48,023] jepsen worker 17 - jepsen.util 17	:ok	:read	[167 4]
INFO [2018-09-29 11:54:48,023] jepsen worker 15 - jepsen.util 15	:ok	:read	[167 4]

So either A) pending writes on the RHS of a split are reporting unambiguous failures when they are in fact being applied or B) jepsen is not correctly accounting for ambiguous failures in this test.

@bdarnell
Contributor

bdarnell commented Oct 1, 2018

What kind of errors are happening here? Fortunately we have packet captures.

$ for i in */*.pcap; do tshark -d tcp.port==26257,pgsql -O pgsql -Y 'pgsql.message' -r $i; done|grep Message|cut -d: -f1-4|sort |uniq -c |sort -n
tshark: The file "10.128.0.26/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.27/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.28/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.29/trace.pcap" appears to have been cut short in the middle of a packet.
   3     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionAbortedError(ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY)
   5     Message: query execution canceled
  50     Message: restart transaction: HandledRetryableTxnError: WriteTooOldError
1031     Message [truncated]: restart transaction: HandledRetryableTxnError: ReadWithinUncertaintyIntervalError
1576     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError

The test considers any restart transaction error to be an unambiguous failure. query execution canceled should be getting treated as ambiguous. So maybe the (infrequent) ABORT_REASON_TIMESTAMP_CACHE_REJECTED_POSSIBLE_REPLAY error is in fact ambiguous (at least when it is caused by a split, which advances the timestamp cache for the RHS), but it looks unambiguous. Unfortunately, it looks like that's not it. I checked out all three of these errors and none of them apply to a query involving the key 167 (although one of them involves the key 169). None of the "query execution canceled" errors involve this key either.

Let's approach this from the other direction and find queries involving key 167. This is a little tricky because of the use of prepared statements; we can't actually see the query and the parameters together. But the filter pgsql.val.data==00:00:00:00:00:00:00:a7 && pgsql.val.data==00:00:00:4 gives us queries involving both 167 (as an int64, the type of our keys) and 4 (as an int32, the type of our values). We can further narrow it down because of the arity of the statements: blind writes (which is what we're looking for) have two arguments, while reads have 1 and CAS queries have 3.
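As a sanity check on those filter values, here's a tiny standalone Go sketch (not part of the test suite, just illustration) that prints the big-endian encodings being matched on; 167 is an int64 key and 4 an int32 value, per the types noted above:

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// The register test's keys are int64 and its values int32, so the pgwire
	// binary format encodes them as 8 and 4 big-endian bytes respectively --
	// exactly the byte strings the pgsql.val.data filter matches on.
	key := make([]byte, 8)
	binary.BigEndian.PutUint64(key, 167)
	val := make([]byte, 4)
	binary.BigEndian.PutUint32(val, 4)
	fmt.Printf("key 167 as int64: % x\n", key) // 00 00 00 00 00 00 00 a7
	fmt.Printf("value 4 as int32: % x\n", val) // 00 00 00 04
}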

That narrows it down to a single blind write of value 4 to key 167. Here's the full dump: https://gist.github.com/bdarnell/f07d1cd944d7dc7c1c4a0ec15fd4d235

Prepared statement decoder ring:

  • S_2: BEGIN
  • S_3: select val from test where id = $1
  • S_5: UPDATE test SET val = $1 WHERE id = $2
  • S_1: COMMIT

So the query does a read followed by an update, then tries to commit and gets a RETRY_ASYNC_WRITE_FAILURE.

  Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE): "sql txn" id=55f33e25 key=/Table/53/1/167/0 rw=true pri=0.03783833 iso=SERIALIZABLE stat=PENDING epo=0 ts=15
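For orientation, the client side of that transaction is just the BEGIN/SELECT/UPDATE/COMMIT sequence from the decoder ring above. Here's a minimal standalone sketch of the same sequence using Go's database/sql (the connection string and the lib/pq driver are illustrative assumptions; the actual Jepsen client is Clojure/JDBC):

package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/lib/pq" // any Postgres-wire driver works against CRDB
)

func main() {
	// Hypothetical connection string for illustration only.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/test?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// BEGIN (S_2), SELECT (S_3), UPDATE (S_5), COMMIT (S_1) -- the same
	// statement sequence seen in the packet capture.
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		log.Fatal(err)
	}
	var val int
	if err := tx.QueryRowContext(ctx, "SELECT val FROM test WHERE id = $1", 167).Scan(&val); err != nil {
		log.Fatal(err)
	}
	if _, err := tx.ExecContext(ctx, "UPDATE test SET val = $1 WHERE id = $2", 4, 167); err != nil {
		log.Fatal(err)
	}
	// In the captured run, this COMMIT is what came back with the
	// "restart transaction: ... RETRY_ASYNC_WRITE_FAILURE" error, even though
	// the write later turned out to be visible to readers.
	if err := tx.Commit(); err != nil {
		log.Printf("commit failed: %v", err)
	}
}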

Repeating the bashreduce above with more detail on the TransactionRetryErrors shows that this variant is rare:

$ for i in */*.pcap; do tshark -d tcp.port==26257,pgsql -O pgsql -Y 'pgsql.message' -r $i; done|grep Message|grep TransactionRetryError |cut -d: -f1-5|sort |uniq -c |sort -n
tshark: The file "10.128.0.26/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.27/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.28/trace.pcap" appears to have been cut short in the middle of a packet.
tshark: The file "10.128.0.29/trace.pcap" appears to have been cut short in the middle of a packet.
   5     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE)
  44     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE)
1527     Message [truncated]: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_WRITE_TOO_OLD)

So it looks like async writes (possibly combined with splits) are failing in an ambiguous way that is getting reported to the client as unambiguous. cc @nvanbenschoten

@bdarnell
Contributor

bdarnell commented Oct 1, 2018

Some timing details that aren't in the gist:

  • The BEGIN and SELECT were sent in a single packet, and their response was also one packet, 1ms later
  • The UPDATE took 1.3ms (really just to return, not to execute, since it's being queued up as an async write)
  • The COMMIT took 39ms to fail. The split took 36ms.

Here's how the operations were interleaved (split is coming from a different clock, so it might be a bit off)

07:54:47.477 split begins
07:54:47.492 begin+select begins
07:54:47.493 begin+select ends
07:54:47.493 update begins
07:54:47.494 update ends
07:54:47.495 commit begins
07:54:47.513 split ends
07:54:47.534 commit ends

@nvanbenschoten
Member

If we see a RETRY_ASYNC_WRITE_FAILURE error then the transaction should have been retried without committing, just like any other retry error that doesn't hit one of our optimizations.

There are two interesting things to note here that aren't immediately clear:

  • the Put request will be in the same batch as the BeginTxn request since it's the first write in the txn, so if the async write failed, there shouldn't be a txn record.
  • the txn only wrote a single key, so the QueryIntent and the EndTxn requests will also be in the same batch.

Together, this means that if the QueryIntent missed its intent, then the transaction record shouldn't even exist for the EndTxn to commit. It also means that if the QueryIntent returned an error during evaluation, the EndTxn should never have been evaluated.

It is however possible that the txn record exists from a previous epoch, but I don't see how that would cause issues because intents at the old epoch would not be committed.

@tbg
Member

tbg commented Oct 2, 2018

Is the following or some variation of it possible?

  1. {Begin,Put} gets proposed but is "somehow delayed or replayed in a way that doesn't block a subsequent QueryIntent" (magic)
  2. QueryIntent comes in (in the same batch as the commit) and produces an IntentMissingError:
    return result.Result{}, roachpb.NewIntentMissingError(curIntent)
  3. which means that this request doesn't mutate the timestamp cache (?) due to the error
  4. {Begin,Put} gets "replayed" and writes the txn record and intent
  5. for some reason the QueryIntent,Commit also gets replayed and actually commits the txn

Independently of whether this is even at fault, shouldn't QueryIntent use the updatesTSCacheOnError flag?

I'm also not sure how the split would've made the above possible. There's seemingly no reason the Begin,Put would ever get replayed as it doesn't even get into Raft in the first place.

@bdarnell do we have gRPC packet captures too? It'll be quite nasty to decode but now might be the time to put together a tool that does it for us.

@tbg tbg added A-kv-client Relating to the KV client and the KV interface. S-0-visible-logical-error Database stores inconsistent data in some cases, or queries return invalid results silently. labels Oct 2, 2018
@tbg tbg modified the milestones: 2.2, 2.1 Oct 2, 2018
@nvanbenschoten
Member

The part of that which doesn't seem possible is the "magic". If {Begin,Put} returned successfully to the client then it would have already been proposed to Raft. If the leaseholder stays the same then the command will block the subsequent QueryIntent in the command queue until it has applied. If the leaseholder changed hands before the {Begin,Put} applied then the proposal would have gotten rejected beneath Raft. Either way, the {Begin,Put} shouldn't apply successfully after the {QueryIntent,Commit} executes.

@tbg
Member

tbg commented Oct 2, 2018

Well, the "magic" replay could be that the RPC arrives twice or some tryExecute stuff goes wrong. But yeah, it doesn't seem likely (but then again, this is probably a rare bug).
I also don't see how step 5) would be possible. I guess the same reasoning could do it, namely that the request pops out of gRPC twice? That seems... just unlikely.

@bdarnell
Contributor

bdarnell commented Oct 2, 2018

My working theory is that the Begin+Put batch applies normally, and all the weirdness is on the QueryIntent+Commit batch. This batch fails with an error that is flagged as retriable and unambiguous, but it is in fact ambiguous and the batch goes on to apply successfully. Then the retry happens and QueryIntent fails because the transaction and its local intent are already gone.

The question is, what is that error on the first attempt to commit? We can't see that here because we don't have grpc packet captures or verbose-enough logging. We could get that, but so far the error has proven difficult to reproduce.

@tbg
Member

tbg commented Oct 3, 2018

Looking at this again from @bdarnell's perspective, the RETRY_ASYNC_WRITE_FAILURE doesn't mean anything. It's what would be expected if we accidentally committed the value and then went to check it via a QueryIntent (which would return an IntentMissingError that gets turned into the async write failure). So the "real" problem is unknown. It must've happened before the retry that caught the async write failure.

@tbg
Member

tbg commented Oct 3, 2018

I see one at least remotely plausible way for the write to be applied twice.

If this line fires, we might return ambiguous results when refreshing commands:

refreshReason = reasonSnapshotApplied

but we could unconditionally overwrite that variable later

refreshReason = reasonNewLeaderOrConfigChange

The justification is given in the comment above as saying "it's the most aggressive mode anyway", but it's not true:

if reason == reasonSnapshotApplied {
	p.finishApplication(proposalResult{ProposalRetry: proposalAmbiguousShouldBeReevaluated})
} else {
	p.finishApplication(proposalResult{ProposalRetry: proposalIllegalLeaseIndex})
}

We will get an unambiguous retry instead of an ambiguous one. The retry will then fail checking the intent and the result would be what we're seeing.

Oh, it actually becomes easier because the snapshot reason assignment doesn't do it if there's also a new leader:

if !r.store.TestingKnobs().DisableRefreshReasonSnapshotApplied &&
	refreshReason == noReason {
	refreshReason = reasonSnapshotApplied
}

So it seems reasonable that this would happen when the replica has the batch pending, but commits it via a snapshot that gets applied simultaneously with a leadership change (not too unlikely: we were leader before, and put the command in the log but then lose leadership and fall behind, getting caught up by the new leader).
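To spell the failure mode out, here's a toy, self-contained model of that control flow (a hypothetical sketch whose names mirror the real code, not the replica implementation itself): the later unconditional assignment clobbers the snapshot reason, so the proposal gets finished with the unambiguous retry.

package main

import "fmt"

// Toy stand-ins for the refresh reasons and retry kinds discussed above.
type refreshRaftReason int

const (
	noReason refreshRaftReason = iota
	reasonSnapshotApplied
	reasonNewLeaderOrConfigChange
)

type proposalRetry int

const (
	proposalIllegalLeaseIndex proposalRetry = iota
	proposalAmbiguousShouldBeReevaluated
)

// retryFor mirrors the branch quoted above: only the snapshot reason yields
// an ambiguous retry.
func retryFor(reason refreshRaftReason) proposalRetry {
	if reason == reasonSnapshotApplied {
		return proposalAmbiguousShouldBeReevaluated
	}
	return proposalIllegalLeaseIndex
}

func main() {
	refreshReason := noReason

	// A snapshot containing our pending command is applied: the outcome of the
	// proposal is now ambiguous, so we want the ambiguous retry below.
	if refreshReason == noReason {
		refreshReason = reasonSnapshotApplied
	}

	// Later in the same handling we learn about a new leader and
	// unconditionally overwrite the reason ("most aggressive mode anyway").
	refreshReason = reasonNewLeaderOrConfigChange

	// The ambiguity is lost: the proposal is retried unambiguously even though
	// it may still apply via the snapshot.
	fmt.Println(retryFor(refreshReason) == proposalAmbiguousShouldBeReevaluated) // prints false
}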

I'm not confident that this is really the bug we're seeing here. Seems like a bug though. I hope we can get a repro.

@andreimatei
Contributor

I'm trying to get a repro, but have been unsuccessful so far. Might have to also write a unit test trying to replicate the spirit of the jepsen test+nemesis.

Incidentally to your findings, I've discussed recently with @bdarnell about getting rid of reproposals on "snapshot applied" conditions - because they're hard to test (i.e. we're not testing them) and supposed to be rare and of questionable value. Instead, we'd treat the respective proposals as ambiguous without trying to resolve the ambiguity in any direction. I was going to do that when I have a chance, and then presumably the funky code that you're pointing to would be forced to change.

@tbg
Member

tbg commented Oct 3, 2018

That sounds good to me. If what I wrote above is actually what caused the bug, you're going to have a terrible time reproducing this.

It's a good idea to return a few more ambiguous errors here and to remove the incorrect code.

@andreimatei
Contributor

Well, what kind of chaos do you reckon will make the bug you found more likely to surface?

@tbg
Member

tbg commented Oct 3, 2018

Snapshots and leaseholder changes are associated with ... poorly functioning clusters. So you need one of those, and then you're not getting good qps = low frequency of repro.

I still "hope" (well, not exactly hope) that this is another bug that actually has to do with the split.

@tbg
Member

tbg commented Oct 3, 2018

But, we should at least check the logs to see whether there was a snapshot for this range.

@bdarnell
Contributor

bdarnell commented Nov 3, 2018

Ah, right. It did go to a replica, but if the transaction has already committed and resolved, evaluating the rollback results in an empty write batch and slips through a hole in our logging. So this doesn't really tell us anything new (it comes after the async write failure, and we already know that the damage is done at that point).

OK, next line of inquiry: Is there anything other than EndTransaction that could have resolved this intent? There's a failed push in the middle of this run (we've seen that in some but not all of the occurrences of this bug, I believe, so this may not prove anything).

123732:I181102 20:08:08.455585 16981 storage/replica_proposal.go:112  XXX proposal.finishApplication. cmd: 52797ada9d4e9bde. endCmds: 0xc425869a40, doneCh: 0xc4268240c0, err: failed to push "sql txn" id=d370e84f key=/Table/53/1/77/0 rw=true pri=0.00604256 iso=SERIALIZABLE stat=PENDING epo=0 ts=1541189283.465060250,1 orig=1541189283.463512172,0 max=1541189283.963512172,0 wto=false rop=false seq=3, retry: 0

Here's the gateway-side logs for all push and resolve traffic targeting this key during the relevant second:

123718:I181102 20:08:08.454999 16981 kv/dist_sender.go:1052  [n1,s1] XXX sendPartialBatch: PushTxn [/Table/53/1/77/0,/Min)
123719:I181102 20:08:08.455023 16981 kv/dist_sender.go:1091  [n1,s1] XXX sendPartialBatch attempt 0: PushTxn [/Table/53/1/77/0,/Min)
123720:I181102 20:08:08.455046 16981 kv/dist_sender.go:1314  [n1,s1] XXX sendToReplicas: PushTxn [/Table/53/1/77/0,/Min)
123721:I181102 20:08:08.455078 16981 kv/transport.go:169  [n1,s1] XXX gprcTransport sending req: PushTxn [/Table/53/1/77/0,/Min)...
123826:I181102 20:08:08.467231 16981 kv/transport.go:171  [n1,s1] XXX gprcTransport sending req: PushTxn [/Table/53/1/77/0,/Min)...  done. got resp: (err: <nil>), *roachpb.PushTxnResponse, err: <nil>
123828:I181102 20:08:08.467264 16981 kv/dist_sender.go:1345  [n1,s1] XXX : sendToReplicas loop (0). Req: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse, err: <nil>
123842:I181102 20:08:08.467317 16981 kv/dist_sender.go:1318  [n1,s1] XXX sendToReplicas returning. ba: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
123845:I181102 20:08:08.467677 16981 kv/dist_sender.go:1111  [n1,s1] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.PushTxnResponse err: <nil> (ba: PushTxn [/Table/53/1/77/0,/Min))
123847:I181102 20:08:08.467694 16981 kv/dist_sender.go:1054  [n1,s1] XXX sendPartialBatch returning. ba: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse err: <nil>
123849:I181102 20:08:08.467733 16981 kv/dist_sender.go:1052  [n1,s1] XXX sendPartialBatch: ResolveIntent [/Table/53/1/77/0,/Min)
123850:I181102 20:08:08.467747 16981 kv/dist_sender.go:1091  [n1,s1] XXX sendPartialBatch attempt 0: ResolveIntent [/Table/53/1/77/0,/Min)
123852:I181102 20:08:08.467765 16981 kv/dist_sender.go:1314  [n1,s1] XXX sendToReplicas: ResolveIntent [/Table/53/1/77/0,/Min)
123853:I181102 20:08:08.467797 16981 kv/transport.go:169  [n1,s1] XXX gprcTransport sending req: ResolveIntent [/Table/53/1/77/0,/Min)...
123900:I181102 20:08:08.472985 16981 kv/transport.go:171  [n1,s1] XXX gprcTransport sending req: ResolveIntent [/Table/53/1/77/0,/Min)...  done. got resp: (err: <nil>), *roachpb.ResolveIntentResponse, err: <nil>
123901:I181102 20:08:08.473037 16981 kv/dist_sender.go:1345  [n1,s1] XXX : sendToReplicas loop (0). Req: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse, err: <nil>
123902:I181102 20:08:08.473052 16981 kv/dist_sender.go:1318  [n1,s1] XXX sendToReplicas returning. ba: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
123903:I181102 20:08:08.473076 16981 kv/dist_sender.go:1111  [n1,s1] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil> (ba: ResolveIntent [/Table/53/1/77/0,/Min))
123904:I181102 20:08:08.473089 16981 kv/dist_sender.go:1054  [n1,s1] XXX sendPartialBatch returning. ba: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil>
123929:I181102 20:08:08.474028 589 kv/dist_sender.go:1052  [n1,s1] XXX sendPartialBatch: PushTxn [/Table/53/1/77/0,/Min)
123930:I181102 20:08:08.474055 589 kv/dist_sender.go:1091  [n1,s1] XXX sendPartialBatch attempt 0: PushTxn [/Table/53/1/77/0,/Min)
123931:I181102 20:08:08.474074 589 kv/dist_sender.go:1314  [n1,s1] XXX sendToReplicas: PushTxn [/Table/53/1/77/0,/Min)
123932:I181102 20:08:08.474099 589 kv/transport.go:169  [n1,s1] XXX gprcTransport sending req: PushTxn [/Table/53/1/77/0,/Min)...
124002:I181102 20:08:08.488006 589 kv/transport.go:171  [n1,s1] XXX gprcTransport sending req: PushTxn [/Table/53/1/77/0,/Min)...  done. got resp: (err: <nil>), *roachpb.PushTxnResponse, err: <nil>
124003:I181102 20:08:08.488033 589 kv/dist_sender.go:1345  [n1,s1] XXX : sendToReplicas loop (0). Req: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse, err: <nil>
124004:I181102 20:08:08.488084 589 kv/dist_sender.go:1318  [n1,s1] XXX sendToReplicas returning. ba: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
124005:I181102 20:08:08.488107 589 kv/dist_sender.go:1111  [n1,s1] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.PushTxnResponse err: <nil> (ba: PushTxn [/Table/53/1/77/0,/Min))
124008:I181102 20:08:08.488137 589 kv/dist_sender.go:1054  [n1,s1] XXX sendPartialBatch returning. ba: PushTxn [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.PushTxnResponse err: <nil>
124011:I181102 20:08:08.488247 589 kv/dist_sender.go:1052  [n1,s1] XXX sendPartialBatch: ResolveIntent [/Table/53/1/77/0,/Min)
124012:I181102 20:08:08.488269 589 kv/dist_sender.go:1091  [n1,s1] XXX sendPartialBatch attempt 0: ResolveIntent [/Table/53/1/77/0,/Min)
124013:I181102 20:08:08.488290 589 kv/dist_sender.go:1314  [n1,s1] XXX sendToReplicas: ResolveIntent [/Table/53/1/77/0,/Min)
124014:I181102 20:08:08.488315 589 kv/transport.go:169  [n1,s1] XXX gprcTransport sending req: ResolveIntent [/Table/53/1/77/0,/Min)...
124030:I181102 20:08:08.490684 589 kv/transport.go:171  [n1,s1] XXX gprcTransport sending req: ResolveIntent [/Table/53/1/77/0,/Min)...  done. got resp: (err: <nil>), *roachpb.ResolveIntentResponse, err: <nil>
124031:I181102 20:08:08.490699 589 kv/dist_sender.go:1345  [n1,s1] XXX : sendToReplicas loop (0). Req: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse, err: <nil>
124032:I181102 20:08:08.490713 589 kv/dist_sender.go:1318  [n1,s1] XXX sendToReplicas returning. ba: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil>. ambiguousErr: <nil>. haveCommit: false
124033:I181102 20:08:08.490728 589 kv/dist_sender.go:1111  [n1,s1] XXX sendPartialBatch: sendSingleRange returned: resp: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil> (ba: ResolveIntent [/Table/53/1/77/0,/Min))
124034:I181102 20:08:08.490741 589 kv/dist_sender.go:1054  [n1,s1] XXX sendPartialBatch returning. ba: ResolveIntent [/Table/53/1/77/0,/Min). br: (err: <nil>), *roachpb.ResolveIntentResponse err: <nil>

This appears to show a successful push and resolve. The failed push isn't visible here (are we not logging the right error? Also, I'm only seeing the gateway stuff coming from a single node, which is suspicious).

Of course, we don't see here which txn is being pushed (the test is deliberately racing multiple transactions against each other), or what kind of pushes these are. Maybe we need more logging of successful commands and first-time proposals, not just errors and reproposals.

@bdarnell
Contributor

bdarnell commented Nov 3, 2018

OK, time to look at this from the other side. The EndTransaction(commit=True) is behaving normally as far as we can tell; is there any other way we could end up reading the value it was supposed to write? I can think of a few ways which seem far-fetched but would be consistent with the symptoms we're seeing.

  • If the Put were processed as a non-transactional write. However, we see in the replica logs that the BeginTransaction and Put are processed in the same batch and it seems very unlikely that this link would be severed.
  • If the intent were still pending but the MVCCMetadata were somehow skipped over at read time (due to a misused time-bounded iterator, an off-by-one seek, or something like that)
  • The intent resolved with a status of ABORTED but the value was not correctly deleted. My specific concern here is that this is possible if MVCCMetadata.Timestamp were out of sync with the timestamp encoded in the MVCC key.

I think the last of these is most plausible, but I haven't found any evidence that any of them are possible.

@andreimatei
Contributor

andreimatei commented Nov 3, 2018 via email

@tbg
Member

tbg commented Nov 3, 2018 via email

@tbg
Member

tbg commented Nov 5, 2018

@andreimatei got another failure with full Raft logging enabled:

2018-11-04 23:23:40,710{GMT} INFO [jepsen worker 9] jepsen.util: 9 :ok :read [34 3]
2018-11-04 23:23:40,710{GMT} INFO [jepsen worker 8] jepsen.util: 8 :ok :read [34 4]
2018-11-04 23:23:40,716{GMT} INFO [jepsen worker 3] jepsen.util: 3 :fail :cas [34 [3 4]] [:restart-transaction "ERROR: restart transaction: HandledRetryableTxnError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE): "sql txn" id=8c066bb5 key=/Table/53/1/34/0 rw=true pri=0.01295777 iso=SERIALIZABLE stat=PENDING epo=0 ts=1541373815.494010172,2 orig=1541373813.645676518,0 max=1541373814.145676518,0 wto=false rop=false seq=5"]

We're going to dig in later.

@tbg
Member

tbg commented Nov 5, 2018

I might be getting over-excited here, but I think I have something (it's not based on any of the new logging). @andreimatei commented yesterday that one thing he saw in a previous repro was that there was a transaction that was waiting on the intent and that got unblocked just around the time the committing proposals were reproposed (but apparently both bounced because of invalid lease index).

This smelled to me like a side-effect leak: we're signaling the txnwait queue with the proto resulting from a commit operation. What if we accidentally did that even when the commit didn't happen?

Looking into it, this seems true. Step by step, we're seeing the command pop out of Raft for the first time (so we populate proposal) and it catches the forced error.

proposal, proposedLocally := r.mu.proposals[idKey]
// TODO(tschottdorf): consider the Trace situation here.
if proposedLocally {
	// We initiated this command, so use the caller-supplied context.
	ctx = proposal.ctx
	proposal.ctx = nil // avoid confusion
	delete(r.mu.proposals, idKey)
}

Now note how we try to apply a no-op proposal, but we're not resetting the local eval result (only the replicated one):

if forcedErr != nil {
	// Apply an empty entry.
	raftCmd.ReplicatedEvalResult = storagepb.ReplicatedEvalResult{}
	raftCmd.WriteBatch = nil
	raftCmd.LogicalOpLog = nil
}

We then continue to populate lResult (since proposedLocally is true), despite the error, and run the handler:

	lResult = proposal.Local
}
// Handle the Result, executing any side effects of the last
// state machine transition.
//
// Note that this must happen after committing (the engine.Batch), but
// before notifying a potentially waiting client.
r.handleEvalResultRaftMuLocked(ctx, lResult,
	raftCmd.ReplicatedEvalResult, raftIndex, leaseIndex)

This eventually ends up in

if lResult.UpdatedTxns != nil {
	for _, txn := range *lResult.UpdatedTxns {
		r.txnWaitQueue.UpdateTxn(ctx, txn)
		lResult.UpdatedTxns = nil
	}
}

and so what is likely happening is that we try to commit a txn (but don't), yet hand the committed proto to some waiting pushers, who will happily go and resolve the intent they had been blocked on:

case txn := <-push.pending:
	log.VEventf(ctx, 2, "result of pending push: %v", txn)
	// If txn is nil, the queue was cleared, presumably because the
	// replica lost the range lease. Return not pushed so request
	// proceeds and is redirected to the new range lease holder.
	if txn == nil {
		return nil, nil
	}
	// Transaction was committed, aborted or had its timestamp
	// pushed. If this PushTxn request is satisfied, return
	// successful PushTxn response.
	if isPushed(req, txn) {
		log.VEvent(ctx, 2, "push request is satisfied")
		return createPushTxnResponse(txn), nil
	}

@bdarnell
Contributor

bdarnell commented Nov 5, 2018

I think you've found it! I thought I had looked through processRaftCommand for a bug like this, but I misread the fact that we use lResult even on error.

This bug lay dormant for a long while - most of the stuff in LocalResult is harmless (e.g. Intents will trigger a push which will verify the state of the world before doing anything; Metrics will cause minor discrepancies in exported metrics, etc). But EndTxns and UpdatedTxns are special, since they will cause intent resolution to act on the contained Transaction proto even if it didn't get applied. The use of EndTxns has a check on response.Err, but UpdatedTxns does not.

On a "forced" error (pErr != nil), we should just be throwing away the LocalResult entirely. I don't think there's any reason to do any of the stuff in it for errors arising from lease/proposal ordering. For response errors (response.Err != nil), it looks like there are sometimes reasons to process parts of the LocalResult (since EndTxns is specifically aware of this case), but it seems very fragile to address this for each field. I'd rather have a whitelist of fields that are meaningful on error instead of processing all of them by default.

This should be a very simple patch (at least to test the hypothesis): just wrap the line lResult = proposal.Local in if pErr == nil.
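To make the shape of that guard concrete, here's a toy, self-contained sketch of the idea (all types and names are stand-ins, not the actual replica code): the locally evaluated result is only consumed when the command applied without an error, so a rejected EndTxn can no longer wake pushers with a COMMITTED proto.

package main

import "fmt"

// Illustrative stand-ins for the storage package types.
type Transaction struct{ ID string }

type LocalResult struct {
	// UpdatedTxns is what gets handed to the txn wait queue, waking pushers.
	UpdatedTxns []*Transaction
}

type proposal struct{ Local *LocalResult }

// applyCommand models the tail of command application with the proposed
// guard: the LocalResult is discarded whenever the command was rejected
// (pErr != nil), so its side effects never reach the wait queue.
func applyCommand(p *proposal, pErr error, notifyWaitQueue func(*Transaction)) {
	var lResult *LocalResult
	if pErr == nil { // the proposed one-line guard
		lResult = p.Local
	}
	if lResult != nil {
		for _, txn := range lResult.UpdatedTxns {
			notifyWaitQueue(txn)
		}
	}
}

func main() {
	p := &proposal{Local: &LocalResult{UpdatedTxns: []*Transaction{{ID: "8c066bb5"}}}}
	notify := func(txn *Transaction) { fmt.Println("waking pushers for txn", txn.ID) }

	applyCommand(p, fmt.Errorf("rejected: illegal lease index"), notify) // no side effects
	applyCommand(p, nil, notify)                                         // side effects run
}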

@andreimatei
Contributor

Yeah, the theory seems very likely.

The analysis is more interesting: the bug (as it relates to EndTxn specifically) was probably introduced with the txn wait queue. But the introduction of the QueryIntent mechanism in 2.1 makes the bug a lot more exposed: before, the re-evaluation of the EndTxn would generally succeed and so the full txn would end up being committed. If the node doing the reevaluation were to crash, then we'd end up with non-atomic txn writes. If the server did not crash, there was a window where "dirty reads" were possible - a reader would see data that was not exactly committed, but since the txn is known to commit afterwards, maybe these were benign.
But in 2.1 we introduced this QueryIntent request, which makes the re-evaluation fail. And so now it's easy to lose txn atomicity.

andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 6, 2018
This patch fixes a bug with the application of LocalResults: they were
applied even in cases when the processing of the respective command
failed. So, we were evaluating a batch, proposing a command, failing to
apply the command (in particular because of the lease index check), but
then still applying the LocalResult even though the replicated state was
not updated.
This bug had catastrophic consequences: part of the LocalResult is a
list of UpdatedTxn - this is used to ping the txn wait queue and unblock
waiters, communicating a final txn disposition to them. Now imagine an
EndTxn batch that evaluates to a command that fails the lease check. The
respective LocalResult is populated with our txn, updated to the
COMMITTED disposition. The command fails to process, but we still ping
the wait queue telling all the waiters that the transaction committed.
The waiters, if any, now go and send ResolveIntent requests, which
actually commit intents (thinking that the intent's txn committed). And
so we end up with "partially committed transactions" - some of the
writes from a txn are committed (the ones that happened to have waiters
on their intents), others aren't.

In order for this bug to manifest itself, we need:
- read/write contention on some keys
- a command to fail processing. Generally this happens either because of
the lease index check or because of the lease check; so either a
Raft leadership change or a lease change can potentially cause the
badness. In the case of a Raft leadership change, proposals can get
dropped, which leads to reproposals, which seem to frequently(?) end up
processing at the wrong lease index and thus be rejected (and it's the
reproposal processing that triggers the bug). This kind of situation is
triggered by the Jepsen register test, with various nemeses. The lease
check failure can probably be triggered simply with lease transfers
(e.g. lease rebalancing).

Interestingly, the rate of occurrence of this bug probably changed
between 2.0 and 2.1 because of the introduction, in 2.1, of the
QueryIntent request (and also, separately, because of increased usage of
lease transfers; this claim was not verified though). Speaking of the
Raft leadership change scenario, once the reproposal fails to be applied
(but the wait queue is erroneously signalled), we also re-evaluate the
EndTransaction batch. Unless something else goes wrong, in 2.0 this
re-evaluation was likely succeeding. In 2.1, however, it tends to fail
if there was a waiter on our txn because the EndTxn requests are usually
bundled with QueryIntent requests - and these QueryIntents fail during
re-evaluation because they don't find the intent - it was, by then,
erroneously committed by a waiter through a ResolveIntent. Except for
transactions that wrote to multiple ranges: I think for those the
QueryIntent requests are split off from the EndTransaction request by
the DistSender, and so we're back to the situation in 2.0.

Fixes cockroachdb#30792

Release note (bug fix): Fix a bug causing transactions to appear to be
partially committed. CRDB was sometimes claiming to have failed to
commit a transaction but some (or all) of its writes were actually
persisted.
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 12, 2018
craig bot pushed a commit that referenced this issue Nov 12, 2018
32166: storage: don't apply local results if cmd processing failed r=andreimatei a=andreimatei

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 12, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 12, 2018
@craig craig bot closed this as completed in #32166 Nov 12, 2018
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 12, 2018
storage: add a test for interactions between rejected commands and readers

Another test for cockroachdb#30792, this time testing the specific interaction
between readers and rejected Raft commands.

Release note: None
andreimatei added a commit to andreimatei/cockroach that referenced this issue Nov 13, 2018
craig bot pushed a commit that referenced this issue Nov 13, 2018
32236: storage: add a test for interactions between rejected commands and readers r=andreimatei a=andreimatei

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
tbg added a commit to tbg/cockroach that referenced this issue Feb 6, 2019
A recent commit (master only) reintroduced a bug that we ironically had
spent a lot of time on [before]. In summary, it would allow the result
of an EndTransaction which would in itself *not* apply to leak and would
result in intents being committed even though their transaction
ultimately would not:

cockroachdb#34025 (comment)

We've diagnosed this pretty quickly the second time around, but clearly
we didn't do a good job at preventing the regression. I can see how this
would happen as the method this code is in is notoriously difficult to
test; it interfaces so much with everything else that it's difficult
to unit test it; one needs to jump through lots of hoops to target it,
and so we do it less than we ought to.

I believe this wasn't released in any alpha (nor backported anywhere),
so no release note is necessary.

Fixes cockroachdb#34025.

[before]: cockroachdb#30792 (comment)

Release note: None
craig bot pushed a commit that referenced this issue Feb 6, 2019
34651: server: rework TestClusterVersionBootstrapStrict r=andreimatei a=andreimatei

This test... I'm not entirely sure what it was supposed to test to be
honest, but it seemed to be more complicated than it needed to be. It
forced and emphasized MinSupportedVersion being equal to
BinaryServerVersion (which is generally not a thing). I've simplified
it, making it not muck with the versions, while keep (I think) the
things it was testing (to the extent that it was testing anything).

This test was also in my way because it created servers that pretended
to be versions that are not technically supported by the binary, and
this kind of funkiness is making my life hard as I'm trying to rework
the way in which versions are propagated and what knobs servers have,
etc.

Release note: None

34659: storage: don't leak committed protos to pushers on reproposal r=bdarnell,andreimatei a=tbg

TODO: test


Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>