roachtest: tpcc-1000 roachtests have failing checks #34025

thoszhang · 2019-01-15T18:22:41Z

Some roachtests that involve running tpcc-1000 are failing the tpcc checks at the end, e.g.:

		I190114 15:41:47.835436 1 workload/tpcc/tpcc.go:290  check 3.3.2.1 took 162.495391ms
		Error: check failed: 3.3.2.1: 12 rows returned, expected zero

The tests affected by this:

roachtest: schemachange/index/tpcc-1000 failed #33624 (schemachange/index/tpcc-1000)
roachtest: schemachange/indexrollback/tpcc-1000 failed #33335 (schemachange/indexrollback/tpcc-1000)
roachtest: scrub/index-only/tpcc-1000 failed #33151 (scrub/index-only/tpcc-1000)
roachtest: scrub/all-checks/tpcc-1000 failed #33149 (scrub/all-checks/tpcc-1000)
roachtest: cdc/tpcc-1000 failed #32813 (comment) (cdc/tpcc-1000) is potentially related

https://teamcity.cockroachdb.com/viewLog.html?buildId=1085451&buildTypeId=Cockroach_Nightlies_WorkloadNightly&tab=buildResultsDiv seems to be the first test failure of this kind. I reproduced this failure on scrub/index-only/tpcc-1000 on a separate roachprod cluster, and ran SCRUB on the entire tpcc database, which did not turn up any anomalies.

It's not clear whether this is a schema change problem, a problem with the tests, or something else. Some next steps would be to determine whether this occurs when just running tpcc-1000 by itself, and look at what the anomalous rows are in the results of the check query.

The text was updated successfully, but these errors were encountered:

vivekmenezes · 2019-01-21T20:47:05Z

It's running
SELECT count(*) FROM (SELECT max(no_o_id) - min(no_o_id) - count(*) AS nod FROM new_order GROUP BY no_w_id, no_d_id) WHERE nod != -1

expecting 0 to be returned. The schema changes do not run on the new_order table which makes it especially confounding.

jordanlewis · 2019-01-22T13:34:50Z

Yeah I would say that this probably isn't a schema change issue per se. When we've seen issues like this before, it's been either a problem with the test or a problem with transactionality in the database itself.

petermattis · 2019-01-22T18:02:22Z

Do we run the TPC-C checks after the restore/import is finished? I'm wondering if this is possibly a problem with restore/import.

jordanlewis · 2019-01-22T21:32:14Z

I believe we do run the checks after the restore/import, right @lucy-zhang? I think we run two sets - after the restore/import and after the test, and this problem was after the test.

thoszhang · 2019-01-23T16:15:06Z

Yes, the checks are run immediately after the restore/import, and they pass.

vivekmenezes · 2019-01-24T19:51:32Z

https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&testNameId=-6687189696048107282&tab=testDetails

starting failing on Jan 11th on master

https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&testNameId=-2631868135262591507&tab=testDetails

started failing on Jan 14th on master

https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&testNameId=9189024510331884225&tab=testDetails

started failing on Jan 14th on master

So it does appear something changed around Jan 10-11th that is causing this problem

vivekmenezes · 2019-01-24T20:12:33Z

#33566 was a complex change that got merged before the failure.

The first failure seen was https://teamcity.cockroachdb.com/viewLog.html?buildId=1088848&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_WorkloadNightly#testNameId-6687189696048107282

tbg · 2019-01-28T15:47:11Z

A different check failed here:

#34293

nvanbenschoten · 2019-01-28T15:53:18Z

Yeah, these all look related. I'm working on isolating #33566 as the culprit now. Once that's confirmed, I'll dig into what's going wrong.

tbg · 2019-01-28T16:03:59Z

SGTM, let me know if I can be of any help.

…

On Mon, Jan 28, 2019 at 4:53 PM Nathan VanBenschoten < ***@***.***> wrote: Yeah, these all look related. I'm working on isolating #33566 <#33566> as the culprit now. Once that's confirmed, I'll dig into what's going wrong. — You are receiving this because you commented. Reply to this email directly, view it on GitHub <#34025 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE135BMY2nGUdXGtJd9pipd2nDIgjWEFks5vHxz8gaJpZM4aBiGL> .

vivekmenezes · 2019-01-28T17:03:31Z

@nvanbenschoten thanks for your help on this issue!

nvanbenschoten · 2019-01-29T10:15:29Z

https://teamcity.cockroachdb.com/viewLog.html?buildId=1085451&buildTypeId=Cockroach_Nightlies_WorkloadNightly&tab=buildResultsDiv seems to be the first test failure of this kind.

This corresponds to 12e2815, which was actually merged before #33566. That doesn't mean its sibling change 04189f5 isn't responsible (it still probably is), but it narrows the search.

nvanbenschoten · 2019-01-29T13:31:59Z

I just got a reproduction on c5516ef. That doesn't tell us too much since we saw it before, but it's nice to have some degree of reproducibility, even if it takes a few hours.

I convinced myself that the duplicate key value (o_w_id,o_d_id,o_id)=(979,7,3002) violates unique constraint "primary" (SQLSTATE 23505) errors are different symptoms for the same problem demonstrated by the check failures. For instance,

Error: check failed: 3.3.2.2: inequality at idx 435: order: 3010.000000, newOrder: 3010.000000, district-1: 3009.000000

is just another way of saying the the order table has an order id (3010) that does not line up with the d_next_o_id from the district table (also 3010, but should at least be 3011). This should not be possible given a properly functioning new_order transaction.

Error: check failed: 3.3.2.1: 12 rows returned, expected zero

is similar in that it demonstrates a partially applied payment txn.

These two check failures also fit a similar pattern. In both, the first write in the transaction seems to be missing. Interestingly, in both of these transactions, the writes are meant to be the most heavily contended. I suspect that we're seeing something like the following series of events:

transaction A writes intent X for statements 1
transaction B begins blocking on transaction A, intent X because its trying to write to the same row
transaction A writes intents Y and Z for statements 2 and 3
transaction B succesfully aborts transaction A and removes intent X
transaction A's aborted record is removed (GC?)
transaction A somehow "comes back to life" and re-writes its record either through a HeartbeatRequest or an EndTransaction request
transaction A commits and resolves Y and Z. It fails to find X, so it throws this warning (or one of the ones below it)

Unfortunately I don't actually see a warning like that in the logs, but perhaps the transaction didn't commit at epoch 0. I'm going to add some logging in and see if I can get a repro again.

nvanbenschoten · 2019-01-29T15:07:34Z

An interesting note is that these scrub and schemachange roachtests are dramatically underprovisioned, which leads to them having extremely long transactions. They run TPC-C 1000 on 3-node clusters with 4 cores each. This cluster config can usually only support around 450 warehouses before falling over.

nvanbenschoten · 2019-02-03T02:18:10Z

I'm still not to the bottom of this, but it's been my main focus for about a week and feel like I'm slowly painting a full picture of what's going on. The slow reproduction cycle (1-2 times a day) doesn't make quick iteration easy.

First off, I'll retract my previous statement that the duplicate key value / 3.3.2.2 check failure is the same issue as what we see here. I've been stressing tpcc (through the roachtests listed in this issue) for the past six days on c5516ef and have yet to see a single instance of this (granted I've only seen about 5 3.3.2.1 failures with 16 clusters continuously stressing this). @awoods187 seems to be able to reproduce these at-will on master and we saw the first instance about a week after we first saw the first 3.3.2.1 failure. So the conclusion I've been pushed towards is that they're not the same. If anyone wants to help, investigating that in parallel would be very helpful.

In terms of the easiest way to reproduce this issue, I've found that the scrub tests fail most frequently. I believe this is because they most severely underprovision a cluster for tpcc, leading to more transaction aborts. This is supported by the failure being slightly easier to reproduce when I dropped TxnLivenessThreshold down to 1 * base.DefaultHeartbeatInterval instead of 2 * base.DefaultHeartbeatInterval, which should cause more transaction aborts due to delayed txn heartbeats.

Another interesting point which @danhhz and I discussed in #32813 (comment) is that the failure usually occurs within a minute of starting load on a tpcc cluster. By dropping the load duration in the scrub tests down to 10 minutes I was able to shorten the iteration time without reducing the failure rate (noticeably?). More on why I think this is later.

Given that our prime suspect for this is 04189f5 (although other than #33149 (comment), I've never seen this fail before c5516ef), I started by adding a lot of logging around transaction state transitions. Specifically, I logged about transaction pushes, transaction aborts, transaction record creation, transaction commits, etc. When I finally got a reproduction, I parsed all the logs, threw them into an advanced system specializing in ad-hoc query processing over large data sets (a CockroachDB cluster), and began looking for strange transaction timelines. I didn't find anything that stood out and soon realized that I had been logging so heavily for ~3 hours that most of the log files had rolled over and I was missing anything of interest. After fixing that and waiting another day for a repro, I finally had a full set of logs. Unfortunately, on its own, I still couldn't find anything that looked strange.

Since I still had the cluster up and running, I could verify that the query from check 3.3.2.1 still reproduced. The check confirmed that the year-to-date sum on a warehouse row equaled the combined year-to-date sum of all of that warehouse's districts. By adjusting it slightly, it was possible to determine the warehouse that was inconsistent and the price difference that it was inconsistent by. I took this and found that in most cases I could find a corresponding row in the history table with the same amount! I then realized just how brilliant AS OF SYSTEM TIME queries are. I could perform the query with AS OF SYSTEM TIME specifiers to determine exactly when this history row was inserted. Mapping that back to my log of transactions I was able to find the exact transaction that performed this insert. Since the warehouse table was missing this amount, I figured the transaction's intent on the warehouse table had somehow been removed prematurely. This was essentially the theory in #34025 (comment).

I spent a while looking at neighboring transactions in the logs and how they interacted with our suspect transaction (let's call it Txn A). Everything checked out. No one thought they had aborted A (other than after it had committed and its txn record had been removed, but that's kosher). I went back through the reproduction cycle twice more just to get more logging on intent resolution, intent removal, and txn record removal. Still, nothing popped up. I then realized I could even query the warehouse table immediately after the transaction was supposed to be committed using AOST and see the update. I found that the warehouse table had been correctly updated by Txn A. I then confirmed that the 3.3.2.1 check passed after Txn A was supposed to have committed. At this point, I was thoroughly confused.

This is when I did what I should have done in the first place and ran the 3.3.2.1 check using AOST to determine exactly when it started failing. This ended up being about 10 seconds after Txn A committed (which remember, updated w_ytd by exactly the amount it was later missing). I looked in the history table again and sure enough, found a row inserted by a transaction (call it Txn B) at the same time. It turns out that Txn B updated the warehouse row by exactly TxnB.amount - TxnA.amount. In other words, Txn B had completed ignore Txn A's original update to the warehouse row. This was classic write skew of two read-modify-write operations!

I've since seen this exact behavior on three different clusters (every repro since I started looking for it). I've even seen it four times on the same cluster once, which corresponded to a 4 rows returned, expected zero failure. Here are what I see to be the commonalities between all of these cases:

Txn A always commits with a lower commit timestamp than Txn B's original timestamp
Everyone always commits at epoch 0 (i.e. no restarts)
Txn A always seems to hit a RETRY_SERIALIZABLE error, so I assume it's refreshing to avoid restarting (see fact add pre-commit hook & convenience script to add it. fix some vet errors. #2)
Txn B usually but not always hits a RETRY_SERIALIZABLE error. In other words, it doesn't always refresh.
Txn A always seems to be surrounded by the following operations on its range: Add replica, Remove replica, and one or more splits. I highly suspect that this is related because it would explain why we see this so frequently at the beginning of runs.
From what I can tell, Txn A usually seems to commit on a different leaseholder for its warehouse range (where it's txn record is) than the leaseholder who existed when it stated. However, I can see cases where its heartbeat creates a txn record on the first and second leaseholder. Unfortunately, I've never been able to see the exact lease history of the range.
In the 4 rows returned, expected zero case, the Txn A in all fours cases operated at essentially the exact same time, on the same range.

Since Txn B always has a higher orig timestamp than Txn A's commit timestamp (although they are active concurrently) it's baffling that Txn B misses Txn A's write. What should happen is either

(if the read comes after the intent write) Txn B observes Txn A's write. If Txn A has already resolved its intent then it should absolutely respect it. If Txn A has not, then Txn B would start pushing Txn A. If Txn B is not able to abort Txn A then it should observe its update after the intent is resolved.
(if the read comes before the intent write) Txn B bumps the timestamp cache to TxnB.OrigTimestamp after its read, preventing Txn A from writing without bumping TxnA.Timestamp to at least TxnB.OrigTimestamp. This timestamp bump does not happen (see fact tidy up some correctness issues reported by go vet #1).

I haven't been able to think of any way that the second mechanism could break down in general or due to any of the changes in 04189f5. Even if TxnB incorrectly thought that TxnA was aborted (which I don't see evidence for), I don't see evidence that it tried to remove TxnA's intent. Without removing the intent, I don't think it's possible for it to have read the previously committed value. So I think a breakdown in the second mechanism is more likely, especially because this seems to be related to lease changes. But I also don't see how that could break down.

Random theories I explored but disproved:

the write skew is a result of a follower read gone wrong. This would explain why Txn B's read would either miss Txn A's intent/committed value or fail to prevent the intent by bumping the read timestamp cache. It turns out that kv.closed_timestamp.follower_reads_enabled defaults to false.
Txn B performed the read portion of its UPDATE statement in a different txn (with a lower original timestamp) and performed the write portion of its UPDATE statement as Txn B. This idea is similar to what happened in sql/opt: avoid swallowing TransactionAbortedErrors #33312. Since I started looking for this, I have always seen Txn B be a result of a previously aborted txn, but that txn always tried to commit unsuccessfully with 6 intents before being aborted. This means that it couldn't have been aborted in the middle of the SQL transaction's UPDATE statement.

Any insight or wild guesses here would be helpful. This is going to delay the next Alpha (as will the duplicate key value error I mentioned above) if we can't fix it by then, so there is some urgency here.

cc. @tbg @bdarnell @petermattis

petermattis · 2019-02-03T20:52:43Z

This is a fantastic write-up, though I'm afraid I don't have any good ideas to offer. One thing to keep in mind is that it is possible that 04189f5 simply revealed a previously existing bug (perhaps one that was not exercisable prior to that change).

In the 4 rows returned, expected zero case, the Txn A in all fours cases operated at essentially the exact same time, on the same range.

That's super curious. Something happened on that range that affected 4 different txns? I suppose this lends weight to the suspicion that an add replica, remove replica, or split, is related to the problem.

tbg · 2019-02-04T00:10:51Z

I'll sleuth some more code and read your report more thoroughly tomorrow, but I'll let this script run overnight on my worker in the hope that it reproduces. It basically merges-splits-scatters warehouses repeatedly and checks 3.3.2.1 after each iteration. I have it on c5516ef currently.

tbg · 2019-02-04T00:19:41Z

Very likely unrelated, but my tpcc workload always dies with (right at the beginning, takes a few sec though):

$ ./bin/workload run tpcc --expensive-checks --warehouses 100 --ramp 30s --wait=false --tolerate-errors --duration=24h 'postgres://root@localhost:26257?sslmode=disable' --histograms tpcc.json
Error: preparing
		INSERT INTO "order" (o_id, o_d_id, o_w_id, o_c_id, o_entry_d, o_ol_cnt, o_all_local)
		VALUES ($1, $2, $3, $4, $5, $6, $7): ERROR: root: memory budget exceeded: 10240 bytes requested, 134215680 currently allocated, 134217728 bytes in budget (SQLSTATE 53200)

That's 134mb.

tbg · 2019-02-05T20:39:13Z

Hooray!

…

On Tue, Feb 5, 2019 at 9:14 PM craig[bot] ***@***.***> wrote: Closed #34025 <#34025> via #34548 <#34548>. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#34025 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AE135OYYNCWhZ71SB4_uGI3gyNufJUaxks5vKeYNgaJpZM4aBiGL> .

34566: storage/roachpb: various cleanup around Refresh{Range}Request r=nvanbenschoten a=nvanbenschoten This PR includes a few small changes that I noticed while investigating #34025. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

nvanbenschoten · 2019-02-06T06:36:32Z

For completeness, I spent about 6 hours stressing the test with c5516ef + the patch from #34548. I didn't see a single failure.

I then did the same test on master. Here I stumbled upon some good news and some bad news. The bad news is that I saw the check failure on one of my clusters. The good news is that I saw the same failure on every single other cluster! The test actually reliably fails on master.

The test still took 2 hours to run, but now that its failures were frequent, I was able to bisect the problem to 80eea9e. So it appears that we messed something up in that commit. The next step here is to figure out what that is.

This is actually in some senses a huge relief because it explains why the frequency of these failures (along with a few others with similar symptoms) has appeared to rise over the month of January. Amusingly, I mentioned something along these lines in #34025 (comment) - check out the dates between #33381 (comment) and #33396 (comment).

So it looks like the test failed very rarely before 80eea9e because of the bug that was fixed today, but began failing regularly because of something in that commit. I feel a little silly that I never once thought to test on master before looking back at commits in the vacinity of where we saw the test first fail. On the bright side, it seems like this will lead to two bug fixes instead of just one, including one that has been lingering for years! I think that warrants another shout out to @lucy-zhang for introducing the test.

Of course, I think this also means we'll need to add whatever fix we create into the alpha that we cut today. Or we can just revert 80eea9e, as it was just a refactor and I don't think there's very much built on top of it.

cc. @andreimatei

nvanbenschoten · 2019-02-06T06:38:04Z

So with all that said, don't be surprised if we still see test failures tonight.

tbg · 2019-02-06T08:00:15Z

I wasn't able to find anything wrong in 80eea9e (though I should look again), but I figured that maybe we're returning ambiguous results more than before and that messes something up in higher layers. Not really sure this is related to this particular bug, but it seems weird that the DistSender code below returns ambiguous results either as br.Err or as err.

cockroach/pkg/kv/dist_sender.go

Lines 1475 to 1484 in 70be833

    
           if propagateError { 
        
           	if ambiguousError != nil { 
        
           		return nil, roachpb.NewAmbiguousResultError(fmt.Sprintf("error=%s [propagate]", ambiguousError)) 
        
           	} 
        
           	// The error received is likely not specific to this 
        
           	// replica, so we should return it instead of trying other 
        
           	// replicas. 
        
           	return br, nil 
        
           }

tbg · 2019-02-06T08:04:59Z

Hm, seems like both cases get joined later:

cockroach/pkg/kv/dist_sender.go

Lines 496 to 512 in 70be833

    
           br, err := ds.sendRPC(ctx, desc.RangeID, replicas, ba, cachedLeaseHolder) 
        
           if err != nil { 
        
           	log.VErrEvent(ctx, 2, err.Error()) 
        
           	return nil, roachpb.NewError(err) 
        
           } 
        
           // If the reply contains a timestamp, update the local HLC with it. 
        
           if br.Error != nil && br.Error.Now != (hlc.Timestamp{}) { 
        
           	ds.clock.Update(br.Error.Now) 
        
           } else if br.Now != (hlc.Timestamp{}) { 
        
           	ds.clock.Update(br.Now) 
        
           } 
        
           // Untangle the error from the received response. 
        
           pErr := br.Error 
        
           br.Error = nil // scrub the response error 
        
           return br, pErr

tbg · 2019-02-06T08:09:16Z

Heh, this is more likely the cause of the bug:

note how response.Err is used to determine whether to hand out the intents to the caller. I think what that could do is reinstantiate a bug we'v fixed before: if an EndTransaction evaluates successfully but bounces below Raft because of an illegal lease index, we will now give the committed proto to waiting pushers (or similar during a PushTxn that aborts). I'll see if that fixes it on master.

tbg · 2019-02-06T08:55:32Z

Running two instances of scrub-index-only-tpcc-1000 on master and two on a binary that has the one-line fix (replacing response.Err by pErr in the highlighted line above).

tbg · 2019-02-06T11:58:24Z

The first failure on master is in:

--- FAIL: scrub/index-only/tpcc-1000 (10829.34s)
	test.go:743,cluster.go:1226,tpcc.go:130,scrub.go:58: /Users/tschottdorf/go/bin/roachprod run tobias-1549443202-scrub-index-only-tpcc-1000:5 -- ./workload check tpcc --warehouses=1000 {pgurl:1} returned:
		stderr:
		I190206 11:53:20.623699 1 workload/tpcc/tpcc.go:288  check 3.3.2.1 took 2.090668716s
		Error: check failed: 3.3.2.1: 9 rows returned, expected zero
		Error:  exit status 1

		stdout:
		: exit status 1

the second master instance is still running. The two fixed instances have already made it past check 3.3.2.1 🙌 and probably even all the checks (the test is "waiting" which I assume is the scrub part).

So assuming the second master run also fails, I'm fairly comfortable that that was the bug.

tbg · 2019-02-06T11:59:39Z

and here's the second failure on master:

--- FAIL: scrub/index-only/tpcc-1000 (11138.28s)
	test.go:743,cluster.go:1226,tpcc.go:130,scrub.go:58: /Users/tschottdorf/go/bin/roachprod run tobias-1549443223-scrub-index-only-tpcc-1000:5 -- ./workload check tpcc --warehouses=1000 {pgurl:1} returned:
		stderr:
		I190206 11:58:48.216353 1 workload/tpcc/tpcc.go:288  check 3.3.2.1 took 3.921505659s
		Error: check failed: 3.3.2.1: 7 rows returned, expected zero
		Error:  exit status 1

		stdout:
		: exit status 1

A recent commit (master only) reintroduced a bug that we ironically had spent a lot of time on [before]. In summary, it would allow the result of an EndTransaction which would in itself *not* apply to leak and would result in intents being committed even though their transaction ultimately would not: cockroachdb#34025 (comment) We've diagnosed this pretty quickly the second time around, but clearly we didn't do a good job at preventing the regression. I can see how this would happen as the method this code is in is notoriously difficult to test for it interfaces so much with everything else that it's difficult to unit test it; one needs to jump through lots of hoops to target it, and so we do it less than we ought to. I believe this wasn't released in any alpha (nor backported anywhere), so no release note is necessary. Fixes cockroachdb#34025. [before]: cockroachdb#30792 (comment) Release note: None

bdarnell · 2019-02-06T18:03:26Z

Fool me twice, shame on me. This definitely reveals a flaw in our testing. We added a test #32236 for #30792, but apparently it was too narrowly-targeted and wasn't sensitive to other related bugs. We should think about a more general way to test for things like this. I'd have thought the jepsen bank test would at least be sensitive to this kind of thing.

One thought is to run tests with a flag that turns any reevaluations into non-retryable errors. This means that any "leakage" like this would result in long-lasting inconsistencies, instead of only a very short-lived window in which the side effects are applied before the command ultimately succeeds after retry.

34651: server: rework TestClusterVersionBootstrapStrict r=andreimatei a=andreimatei This test... I'm not entirely sure what it was supposed to test to be honest, but it seemed to be more complicated than it needed to be. It forced and emphasized MinSupportedVersion being equal to BinaryServerVersion (which is generally not a thing). I've simplified it, making it not muck with the versions, while keep (I think) the things it was testing (to the extent that it was testing anything). This test was also in my way because it created servers that pretended to be versions that are not technically supported by the binary, and this kind of funkiness is making my life hard as I'm trying to rework the way in which versions are propagated and what knobs servers have, etc. Release note: None 34659: storage: don't leak committed protos to pushers on reproposal r=bdarnell,andreimatei a=tbg TODO: test ---- A recent commit (master only) reintroduced a bug that we ironically had spent a lot of time on [before]. In summary, it would allow the result of an EndTransaction which would in itself *not* apply to leak and would result in intents being committed even though their transaction ultimately would not: #34025 (comment) We've diagnosed this pretty quickly the second time around, but clearly we didn't do a good job at preventing the regression. I can see how this would happen as the method this code is in is notoriously difficult to test for it interfaces so much with everything else that it's difficult to unit test it; one needs to jump through lots of hoops to target it, and so we do it less than we ought to. I believe this wasn't released in any alpha (nor backported anywhere), so no release note is necessary. Fixes #34025. [before]: #30792 (comment) Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com> Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>

The quoted reason for the tolerance was cockroachdb#34025, which has long been fixed. Release note: None

61668: roachtest/cdc: don't tolerate TPCC errors r=andreimatei a=andreimatei The quoted reason for the tolerance was #34025, which has long been fixed. Release note: None Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>

thoszhang added the A-schema-changes label Jan 15, 2019

vivekmenezes added the C-test-failure Broken test (automatically or manually discovered). label Jan 21, 2019

jordanlewis mentioned this issue Jan 25, 2019

TPC-C performance regression on 3 node and 6 node #34234

Closed

nvanbenschoten self-assigned this Jan 28, 2019

nvanbenschoten mentioned this issue Jan 29, 2019

roachtest: tpcc/w=100/nodes=3/chaos=true failed #34293

Closed

This was referenced Jan 29, 2019

roachtest: tpcc/nodes=3/w=max failed #30886

Closed

roachtest: cdc/tpcc-1000 failed #32813

Closed

*log.safeError: txn_interceptor_heartbeat.go:429: txn status: %s, but heartbeat loop hasn't been signaled to stop... #34341

Closed

nvanbenschoten mentioned this issue Feb 5, 2019

storage: have sequenced reads use the intent history #33244

Merged

nvanbenschoten reopened this Feb 6, 2019

tbg mentioned this issue Feb 6, 2019

storage: don't leak committed protos to pushers on reproposal #34659

Merged

This was referenced Feb 6, 2019

roachtest: schemachange/index/tpcc-1000 failed #34669

Closed

roachtest: scrub/index-only/tpcc-1000 failed #34670

Closed

craig bot closed this as completed in #34659 Feb 6, 2019

bdarnell mentioned this issue Apr 11, 2019

roachtest: scrub/all-checks/tpcc/w=1000 failed #35986

Closed

asubiotto mentioned this issue Jan 23, 2020

roachtest: tpcc/mixed-headroom/n5cpu16 failed #43110

Closed

andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 8, 2021

roachtest/cdc: don't tolerate TPCC errors

f8bdb34

The quoted reason for the tolerance was cockroachdb#34025, which has long been fixed. Release note: None

andreimatei mentioned this issue Mar 8, 2021

roachtest/cdc: don't tolerate TPCC errors #61668

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: tpcc-1000 roachtests have failing checks #34025

roachtest: tpcc-1000 roachtests have failing checks #34025

thoszhang commented Jan 15, 2019

vivekmenezes commented Jan 21, 2019

jordanlewis commented Jan 22, 2019

petermattis commented Jan 22, 2019

jordanlewis commented Jan 22, 2019

thoszhang commented Jan 23, 2019

vivekmenezes commented Jan 24, 2019

vivekmenezes commented Jan 24, 2019 •

edited

Loading

tbg commented Jan 28, 2019

nvanbenschoten commented Jan 28, 2019

tbg commented Jan 28, 2019 via email

vivekmenezes commented Jan 28, 2019

nvanbenschoten commented Jan 29, 2019

nvanbenschoten commented Jan 29, 2019

nvanbenschoten commented Jan 29, 2019

nvanbenschoten commented Feb 3, 2019

petermattis commented Feb 3, 2019

tbg commented Feb 4, 2019

tbg commented Feb 4, 2019 •

edited

Loading

tbg commented Feb 5, 2019 via email

nvanbenschoten commented Feb 6, 2019 •

edited

Loading

nvanbenschoten commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

bdarnell commented Feb 6, 2019

roachtest: tpcc-1000 roachtests have failing checks #34025

roachtest: tpcc-1000 roachtests have failing checks #34025

Comments

thoszhang commented Jan 15, 2019

vivekmenezes commented Jan 21, 2019

jordanlewis commented Jan 22, 2019

petermattis commented Jan 22, 2019

jordanlewis commented Jan 22, 2019

thoszhang commented Jan 23, 2019

vivekmenezes commented Jan 24, 2019

vivekmenezes commented Jan 24, 2019 • edited Loading

tbg commented Jan 28, 2019

nvanbenschoten commented Jan 28, 2019

tbg commented Jan 28, 2019 via email

vivekmenezes commented Jan 28, 2019

nvanbenschoten commented Jan 29, 2019

nvanbenschoten commented Jan 29, 2019

nvanbenschoten commented Jan 29, 2019

nvanbenschoten commented Feb 3, 2019

petermattis commented Feb 3, 2019

tbg commented Feb 4, 2019

tbg commented Feb 4, 2019 • edited Loading

tbg commented Feb 5, 2019 via email

nvanbenschoten commented Feb 6, 2019 • edited Loading

nvanbenschoten commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

tbg commented Feb 6, 2019

bdarnell commented Feb 6, 2019

vivekmenezes commented Jan 24, 2019 •

edited

Loading

tbg commented Feb 4, 2019 •

edited

Loading

nvanbenschoten commented Feb 6, 2019 •

edited

Loading