roachtest: tpcc-1000 roachtests have failing checks #34025
It's run expecting 0 to be returned. The schema changes do not run on the new_order table, which makes this especially confounding. |
Yeah I would say that this probably isn't a schema change issue per se. When we've seen issues like this before, it's been either a problem with the test or a problem with transactionality in the database itself. |
Do we run the TPC-C checks after the restore/import is finished? I'm wondering if this is possibly a problem with restore/import. |
I believe we do run the checks after the restore/import, right @lucy-zhang? I think we run two sets - after the restore/import and after the test, and this problem was after the test. |
Yes, the checks are run immediately after the restore/import, and they pass. |
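For context, each of those checks is a SQL query over the tpcc database that must come back empty (or return 0). Below is a minimal sketch of what one such check could look like, using TPC-C consistency condition 3.3.2.1 (warehouse w_ytd must equal the sum of its districts' d_ytd) as an example; the connection string, function names, and query are illustrative assumptions, not the actual workload code.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol.
)

// checkWarehouseYTD verifies TPC-C consistency condition 3.3.2.1: for every
// warehouse, w_ytd must equal the sum of d_ytd over its districts. The query
// returns the violating warehouses, so an empty result means the check passes.
func checkWarehouseYTD(db *sql.DB) error {
	rows, err := db.Query(`
		SELECT w.w_id
		FROM warehouse AS w
		JOIN district AS d ON d.d_w_id = w.w_id
		GROUP BY w.w_id, w.w_ytd
		HAVING w.w_ytd != sum(d.d_ytd)`)
	if err != nil {
		return err
	}
	defer rows.Close()
	var bad []int
	for rows.Next() {
		var wID int
		if err := rows.Scan(&wID); err != nil {
			return err
		}
		bad = append(bad, wID)
	}
	if err := rows.Err(); err != nil {
		return err
	}
	if len(bad) > 0 {
		return fmt.Errorf("3.3.2.1 violated for warehouses %v", bad)
	}
	return nil
}

func main() {
	// Assumes an insecure local cluster with the tpcc database loaded.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := checkWarehouseYTD(db); err != nil {
		log.Fatal(err)
	}
	fmt.Println("check passed")
}
```

A "failing check" in this thread means a query of this kind returning rows after the run, i.e. an invariant that every committed transaction should have preserved no longer holds.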
Started failing on Jan 11th on master.
Started failing on Jan 14th on master.
Started failing on Jan 14th on master.
So it does appear something changed around Jan 10-11th that is causing this problem. |
#33566 was a complex change that got merged before the failure. The first failure seen was https://teamcity.cockroachdb.com/viewLog.html?buildId=1088848&tab=buildResultsDiv&buildTypeId=Cockroach_Nightlies_WorkloadNightly#testNameId-6687189696048107282 |
A different check failed here: |
Yeah, these all look related. I'm working on isolating #33566 as the culprit now. Once that's confirmed, I'll dig into what's going wrong. |
SGTM, let me know if I can be of any help. |
@nvanbenschoten thanks for your help on this issue! |
This corresponds to 12e2815, which was actually merged before #33566. That doesn't mean its sibling change 04189f5 isn't responsible (it still probably is), but it narrows the search. |
I just got a reproduction on c5516ef. That doesn't tell us too much since we saw it before, but it's nice to have some degree of reproducibility, even if it takes a few hours. I convinced myself that the
is just another way of saying the
is similar in that it demonstrates a partially applied transaction. These two check failures also fit a similar pattern. In both, the first write in the transaction seems to be missing. Interestingly, in both of these transactions, the missing writes are the ones meant to be the most heavily contended. I suspect that we're seeing something like the following series of events:
Unfortunately, I don't actually see a warning like that in the logs, but perhaps the transaction didn't commit at epoch 0. I'm going to add some logging and see if I can get a repro again. |
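To make "the first write is also the most contended write" concrete: in the TPC-C New-Order transaction, the first statement that writes is the d_next_o_id bump on the district row, which every New-Order for that district touches. The sketch below is a rough, hypothetical illustration of that shape; the real workload code is more involved and the column lists are abbreviated.

```go
package tpccsketch

import "database/sql"

// newOrderSketch illustrates (very roughly) the write order of a TPC-C
// New-Order transaction. The first write, advancing d_next_o_id on the
// district row, is also the most heavily contended key the transaction
// touches, since every New-Order for that district updates the same row.
func newOrderSketch(tx *sql.Tx, wID, dID, cID int) error {
	// 1. First write: allocate the next order ID on the hot district row.
	var oID int
	if err := tx.QueryRow(
		`UPDATE district SET d_next_o_id = d_next_o_id + 1
		 WHERE d_w_id = $1 AND d_id = $2
		 RETURNING d_next_o_id - 1`, wID, dID,
	).Scan(&oID); err != nil {
		return err
	}
	// 2. Later writes: the order and new_order rows derived from that ID.
	if _, err := tx.Exec(
		`INSERT INTO "order" (o_id, o_d_id, o_w_id, o_c_id, o_entry_d, o_ol_cnt, o_all_local)
		 VALUES ($1, $2, $3, $4, now(), 1, 1)`, oID, dID, wID, cID); err != nil {
		return err
	}
	_, err := tx.Exec(
		`INSERT INTO new_order (no_o_id, no_d_id, no_w_id)
		 VALUES ($1, $2, $3)`, oID, dID, wID)
	// Atomicity should guarantee that either all of these writes become
	// visible or none of them do; the failures described above look like the
	// first write went missing while the later ones stuck.
	return err
}
```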
An interesting note is that these |
I'm still not at the bottom of this, but it's been my main focus for about a week and I feel like I'm slowly painting a full picture of what's going on. The slow reproduction cycle (1-2 times a day) doesn't make quick iteration easy.

First off, I'll retract my previous statement that the

In terms of the easiest way to reproduce this issue, I've found that the

Another interesting point which @danhhz and I discussed in #32813 (comment) is that the failure usually occurs within a minute of starting load on a tpcc cluster. By dropping the load duration in the scrub tests down to 10 minutes I was able to shorten the iteration time without reducing the failure rate (noticeably?). More on why I think this is later.

Given that our prime suspect for this is 04189f5 (although other than #33149 (comment), I've never seen this fail before c5516ef), I started by adding a lot of logging around transaction state transitions. Specifically, I logged about transaction pushes, transaction aborts, transaction record creation, transaction commits, etc. When I finally got a reproduction, I parsed all the logs, threw them into an advanced system specializing in ad-hoc query processing over large data sets (a CockroachDB cluster), and began looking for strange transaction timelines. I didn't find anything that stood out and soon realized that I had been logging so heavily for ~3 hours that most of the log files had rolled over and I was missing anything of interest.

After fixing that and waiting another day for a repro, I finally had a full set of logs. Unfortunately, on its own, I still couldn't find anything that looked strange. Since I still had the cluster up and running, I could verify that the query from check

I spent a while looking at neighboring transactions in the logs and how they interacted with our suspect transaction (let's call it Txn A). Everything checked out. No one thought they had aborted A (other than after it had committed and its txn record had been removed, but that's kosher). I went back through the reproduction cycle twice more just to get more logging on intent resolution, intent removal, and txn record removal. Still, nothing popped up. I then realized I could even query the

This is when I did what I should have done in the first place and ran the

I've since seen this exact behavior on three different clusters (every repro since I started looking for it). I've even seen it four times on the same cluster once, which corresponded to a
Since Txn B always has a higher orig timestamp than Txn A's commit timestamp (although they are active concurrently), it's baffling that Txn B misses Txn A's write. What should happen is either
I haven't been able to think of any way that the first mechanism could break down in general or due to any of the changes in 04189f5. Even if Txn B incorrectly thought that Txn A was aborted (which I don't see evidence for), I don't see evidence that it tried to remove Txn A's intent. Without removing the intent, I don't think it's possible for it to have read the previously committed value.

So I think a breakdown in the second mechanism is more likely, especially because this seems to be related to lease changes. But I also don't see how that could break down.

Random theories I explored but disproved:
Any insight or wild guesses here would be helpful. This is going to delay the next Alpha (as will the |
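For readers following along, here is a toy model of why a reader at a higher timestamp shouldn't be able to miss a committed write. It is not CockroachDB's MVCC code; the types and the single-node in-memory structure are assumptions made to show the two mechanisms above in miniature: either the reader collides with the writer's still-unresolved intent, or it reads the newer committed version.

```go
package mvcctoy

import (
	"errors"
	"sort"
)

// version is one committed value of a key at a commit timestamp.
type version struct {
	ts  int64
	val string
}

// key holds committed versions plus at most one provisional intent.
type key struct {
	versions []version // sorted by ts ascending
	intent   *version  // provisional write by an in-flight txn, if any
}

var errIntent = errors.New("conflicting intent: reader must push the writer or wait")

// read returns the newest committed value at or below readTS.
// Mechanism 1: if the writer's intent is still there, the read cannot simply
// proceed; it must first find out the writer's fate (push or wait).
// Mechanism 2: once the writer has committed at commitTS <= readTS, the
// committed version is among k.versions, so a later reader cannot miss it.
func (k *key) read(readTS int64) (string, error) {
	if k.intent != nil && k.intent.ts <= readTS {
		return "", errIntent
	}
	i := sort.Search(len(k.versions), func(i int) bool { return k.versions[i].ts > readTS })
	if i == 0 {
		return "", nil // no visible version
	}
	return k.versions[i-1].val, nil
}
```

In this toy, a reader can only observe a stale value if the intent is removed without the committed version ever being written, which is why the comment above focuses on whether anyone could have removed Txn A's intent.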
This is a fantastic write-up, though I'm afraid I don't have any good ideas to offer. One thing to keep in mind is that it is possible that 04189f5 simply revealed a previously existing bug (perhaps one that was not exercisable prior to that change).
That's super curious. Something happened on that range that affected 4 different txns? I suppose this lends weight to the suspicion that an add replica, remove replica, or split is related to the problem. |
I'll sleuth some more code and read your report more thoroughly tomorrow, but I'll let this script run overnight on my worker in the hope that it reproduces. It basically merges-splits-scatters |
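The script referenced above isn't shown here, but a rough sketch of what such a merge/split/scatter loop could look like follows; it assumes the tpcc database, the new_order table's (no_w_id, no_d_id, ...) primary key prefix, and the kv.range_merge.queue_enabled cluster setting, and is only meant to show the kind of range churn being induced.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq"
)

// Repeatedly toggle range merges and split/scatter a table to force splits,
// lease transfers, and replica movement while load is running.
func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	for {
		// Let the merge queue collapse ranges, then turn it off and re-split.
		exec(db, `SET CLUSTER SETTING kv.range_merge.queue_enabled = true`)
		time.Sleep(2 * time.Minute)
		exec(db, `SET CLUSTER SETTING kv.range_merge.queue_enabled = false`)
		exec(db, `ALTER TABLE new_order SPLIT AT SELECT DISTINCT no_w_id, no_d_id FROM new_order LIMIT 100`)
		exec(db, `ALTER TABLE new_order SCATTER`)
		time.Sleep(2 * time.Minute)
	}
}

func exec(db *sql.DB, stmt string) {
	if _, err := db.Exec(stmt); err != nil {
		log.Printf("%s: %v", stmt, err)
	}
}
```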
Very likely unrelated, but my tpcc workload always dies right at the beginning (though it takes a few sec) with:
That's 134mb. |
Hooray!
…On Tue, Feb 5, 2019 at 9:14 PM craig[bot] wrote: Closed #34025 via #34548. |
For completeness, I spent about 6 hours stressing the test with c5516ef + the patch from #34548. I didn't see a single failure. I then did the same test on

The test still took 2 hours to run, but now that its failures were frequent, I was able to bisect the problem to 80eea9e. So it appears that we messed something up in that commit. The next step here is to figure out what that is.

This is actually in some senses a huge relief because it explains why the frequency of these failures (along with a few others with similar symptoms) has appeared to rise over the month of January. Amusingly, I mentioned something along these lines in #34025 (comment) - check out the dates between #33381 (comment) and #33396 (comment). So it looks like the test failed very rarely before 80eea9e because of the bug that was fixed today, but began failing regularly because of something in that commit.

I feel a little silly that I never once thought to test on master before looking back at commits in the vicinity of where we saw the test first fail. On the bright side, it seems like this will lead to two bug fixes instead of just one, including one that has been lingering for years! I think that warrants another shout out to @lucy-zhang for introducing the test.

Of course, I think this also means we'll need to add whatever fix we create into the alpha that we cut today. Or we can just revert 80eea9e, as it was just a refactor and I don't think there's very much built on top of it.

cc @andreimatei |
So with all that said, don't be surprised if we still see test failures tonight. |
I wasn't able to find anything wrong in 80eea9e (though I should look again), but I figured that maybe we're returning ambiguous results more than before and that messes something up in higher layers. Not really sure this is related to this particular bug, but it seems weird that the DistSender code below returns ambiguous results either as
(cockroach/pkg/kv/dist_sender.go, lines 1475 to 1484 at 70be833) |
Hm, seems like both cases get joined later:
(cockroach/pkg/kv/dist_sender.go, lines 496 to 512 at 70be833) |
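As background on the "ambiguous result" discussion: the underlying issue is not specific to any one code path. Once a write RPC may have reached a replica, the sender can no longer claim the write either definitely applied or definitely didn't, and must say so. The toy sketch below illustrates that decision with made-up names; it is not the actual DistSender code.

```go
package ambiguity

import (
	"context"
	"errors"
)

// errAmbiguous means a write RPC may or may not have applied; the sender
// cannot safely retry it as if it never happened, nor report it as a clean
// failure.
var errAmbiguous = errors.New("result is ambiguous (write may or may not have applied)")

// sendWrite is a toy version of the decision a KV client layer has to make.
// If the transport gives up while the request is in flight, the only honest
// answer is "ambiguous": blindly retrying could apply the write twice, and
// blindly reporting failure could hide a write that actually committed.
func sendWrite(ctx context.Context, send func(context.Context) error) error {
	if err := send(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
			// We stopped waiting, but the request may still have applied.
			return errAmbiguous
		}
		// The request demonstrably failed before it could apply.
		return err
	}
	return nil
}
```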
Heh, this is more likely the cause of the bug: note how |
Running two instances of scrub-index-only-tpcc-1000 on |
The first failure on
the second
So assuming the second master run also fails, I'm fairly comfortable that that was the bug. |
and here's the second failure on
|
A recent commit (master only) reintroduced a bug that we ironically had spent a lot of time on [before]. In summary, it would allow the result of an EndTransaction which would in itself *not* apply to leak, and would result in intents being committed even though their transaction ultimately would not: cockroachdb#34025 (comment)

We've diagnosed this pretty quickly the second time around, but clearly we didn't do a good job at preventing the regression. I can see how this would happen, as the method this code is in is notoriously difficult to test: it interfaces so much with everything else that it's difficult to unit test, one needs to jump through lots of hoops to target it, and so we do it less than we ought to.

I believe this wasn't released in any alpha (nor backported anywhere), so no release note is necessary.

Fixes cockroachdb#34025.

[before]: cockroachdb#30792 (comment)

Release note: None
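A schematic of the bug class that commit message describes, with made-up names (the real replica code is far more involved): the locally evaluated result of a command, e.g. "this EndTransaction committed the txn," may only be exposed to waiters such as pushers if that exact command actually applied.

```go
package repropose

// result is the locally evaluated outcome of a command, e.g. "this
// EndTransaction committed the txn, so its intents may be resolved".
type result struct {
	txnCommitted bool
}

// proposal is a toy stand-in for a command sitting below Raft.
type proposal struct {
	applied bool   // did this exact command apply to the state machine?
	res     result // side effects computed when the command was evaluated
}

// outcome decides what may be exposed to waiters (e.g. pushers watching the
// transaction record). The rule the regression violated: if the command did
// not apply, say because it is about to be reproposed, its evaluated result
// is not yet a fact about the database and must not leak; otherwise a pusher
// could treat the transaction as committed and resolve its intents even
// though the commit may ultimately never happen.
func (p *proposal) outcome() (result, bool) {
	if !p.applied {
		return result{}, false // nothing to report yet
	}
	return p.res, true
}
```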
Fool me twice, shame on me. This definitely reveals a flaw in our testing. We added a test #32236 for #30792, but apparently it was too narrowly-targeted and wasn't sensitive to other related bugs. We should think about a more general way to test for things like this. I'd have thought the jepsen bank test would at least be sensitive to this kind of thing. One thought is to run tests with a flag that turns any reevaluations into non-retryable errors. This means that any "leakage" like this would result in long-lasting inconsistencies, instead of only a very short-lived window in which the side effects are applied before the command ultimately succeeds after retry. |
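A sketch of the kind of test-only switch suggested above, with hypothetical names (nothing like this necessarily exists in the codebase): turning reevaluations into hard errors so that any state leaked by the failed attempt persists long enough for the consistency checks to catch it, rather than being hidden once the retried command eventually applies.

```go
package knobs

import "errors"

// TestingKnobs is a hypothetical knob of the kind suggested above, not an
// existing CockroachDB API. When set, a command that would normally be
// retried/reevaluated after failing to apply instead fails hard, so any side
// effects that leaked from the failed attempt stay visible to later checks.
type TestingKnobs struct {
	FailInsteadOfReproposing bool
}

var errNoRetry = errors.New("reproposal disabled by testing knob")

// maybeRepropose is the toy decision point: retry the command, or surface a
// non-retryable error when the knob is set.
func maybeRepropose(knobs TestingKnobs, repropose func() error) error {
	if knobs.FailInsteadOfReproposing {
		return errNoRetry
	}
	return repropose()
}
```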
34651: server: rework TestClusterVersionBootstrapStrict r=andreimatei a=andreimatei

This test... I'm not entirely sure what it was supposed to test to be honest, but it seemed to be more complicated than it needed to be. It forced and emphasized MinSupportedVersion being equal to BinaryServerVersion (which is generally not a thing). I've simplified it, making it not muck with the versions, while keeping (I think) the things it was testing (to the extent that it was testing anything). This test was also in my way because it created servers that pretended to be versions that are not technically supported by the binary, and this kind of funkiness is making my life hard as I'm trying to rework the way in which versions are propagated and what knobs servers have, etc.

Release note: None

34659: storage: don't leak committed protos to pushers on reproposal r=bdarnell,andreimatei a=tbg

TODO: test

----

A recent commit (master only) reintroduced a bug that we ironically had spent a lot of time on [before]. In summary, it would allow the result of an EndTransaction which would in itself *not* apply to leak, and would result in intents being committed even though their transaction ultimately would not: #34025 (comment)

We've diagnosed this pretty quickly the second time around, but clearly we didn't do a good job at preventing the regression. I can see how this would happen, as the method this code is in is notoriously difficult to test: it interfaces so much with everything else that it's difficult to unit test, one needs to jump through lots of hoops to target it, and so we do it less than we ought to.

I believe this wasn't released in any alpha (nor backported anywhere), so no release note is necessary.

Fixes #34025.

[before]: #30792 (comment)

Release note: None

Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
The quoted reason for the tolerance was cockroachdb#34025, which has long been fixed. Release note: None
Some roachtests that involve running tpcc-1000 are failing the tpcc checks at the end, e.g.:
The tests affected by this:
https://teamcity.cockroachdb.com/viewLog.html?buildId=1085451&buildTypeId=Cockroach_Nightlies_WorkloadNightly&tab=buildResultsDiv seems to be the first test failure of this kind. I reproduced this failure on
scrub/index-only/tpcc-1000
on a separate roachprod cluster, and ran SCRUB on the entire tpcc database, which did not turn up any anomalies. It's not clear whether this is a schema change problem, a problem with the tests, or something else. Some next steps would be to determine whether this occurs when just running tpcc-1000 by itself, and to look at what the anomalous rows are in the results of the check query.