roachtest: import/tpcc/warehouses=1000/nodes=32 failed #39072
SHA: https://github.com/cockroachdb/cockroach/commits/26edea51118a0e16b61748c08068bfa6f76543ca
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1404886&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/65055d6c16bf9386d8c4f4f9cd23e0a848814dc9
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1411157&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/40f8f0eb00f4b3bf5bac11fb5ae132e33a492713
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1452154&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/d51fa78ff90a113c9009d263dfaf58d3672670a6
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1463583&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/c3f447650742706c52354f629b30bb64e5a6959c
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1468445&tab=buildLog

SHA: https://github.com/cockroachdb/cockroach/commits/4e1deb21a3e28387c45e6a87a3cefe6bb49f39ac
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1473326&tab=buildLog
Could this be a release blocker? Isn't this an import into an empty cluster failing? cc @dt

I wonder if it's the same thing, because this issue and 39071 seem to time out, whereas this newest thing is failing with the
In cockroachdb#39072 we see:

```
Error: importing fixture: importing table stock: pq: descriptor changed: [expected] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239] != [actual] r1155:/Table/5{7/2-8} [(n16,s16):1, (n8,s8):5, (n30,s30):11, next=13, gen=239]: unexpected value: raw_bytes:"\363P\244T\003\010\203\t\022\002\301\212\032\001\302\"\006\010\020\020\020\030\001\"\006\010\010\020\010\030\005\"\006\010\036\020\036\030\013(\r0\357\001:\n\010\234\366\273\371\232\322\204\341\025@\001" timestamp:<wall_time:1567832452567360492 logical:1 >
```

Which is frustrating because those stringified values look identical to me. Hopefully this debugging will reveal something in any future repros.

Release note: None
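For readers following along: two descriptors can "look identical" and still fail the expected-value check because the stringified form doesn't print every field that the stored value encodes. A minimal, self-contained Go sketch of that mismatch (the types and the sticky-bit field are illustrative assumptions, not CockroachDB's actual RangeDescriptor):

```go
// Sketch: a String() method that omits a field (here a hypothetical sticky-bit
// timestamp) makes two values print identically even though their encoded bytes,
// which a conditional put compares, differ.
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
	"time"
)

type rangeDesc struct {
	RangeID    int
	StartKey   string
	EndKey     string
	Generation int
	StickyBit  time.Time // not included in String(), but part of the encoded value
}

func (d rangeDesc) String() string {
	return fmt.Sprintf("r%d:%s-%s [gen=%d]", d.RangeID, d.StartKey, d.EndKey, d.Generation)
}

func encode(d rangeDesc) []byte {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(d); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	expected := rangeDesc{RangeID: 1155, StartKey: "/Table/57/2", EndKey: "/Table/58", Generation: 239}
	actual := expected
	actual.StickyBit = time.Unix(1567832452, 0) // differs only in a field String() hides

	fmt.Println("strings equal:", expected.String() == actual.String())          // true
	fmt.Println("bytes equal:  ", bytes.Equal(encode(expected), encode(actual))) // false
}
```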
Is this possibly #37499 and the
The line in question is cockroach/pkg/storage/replica_command.go, line 73 in 3ad8bda.
Err, more likely this is a sticky bit thing.
There are two classes of failures here. One is too many retries in AdminSplit. Some of that comes from the setting of the sticky bit (I think). The others (where the error is
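As a rough illustration of the retry pattern being discussed, here is a sketch under assumed behavior, not the actual AdminSplit code in replica_command.go; `maxRetries`, `tryAdminSplit`, and the sentinel error are hypothetical names:

```go
// Sketch: a split attempt conditionally rewrites the range descriptor, and when
// a concurrent change (split, merge, replica movement, sticky-bit update) bumps
// the descriptor first, the attempt fails and is retried up to a fixed cap.
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

const maxRetries = 10 // the "10" mentioned in this thread; purely illustrative

var errDescriptorChanged = errors.New("descriptor changed")

// tryAdminSplit stands in for one attempt: it fails whenever another actor has
// already modified the range descriptor out from under us.
func tryAdminSplit() error {
	if rand.Intn(3) != 0 { // simulate heavy contention on the descriptor
		return errDescriptorChanged
	}
	return nil
}

func adminSplitWithRetries() error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = tryAdminSplit(); err == nil {
			return nil
		}
		if !errors.Is(err, errDescriptorChanged) {
			return err // only descriptor-changed conflicts are worth retrying
		}
	}
	return fmt.Errorf("split failed after %d retries: %w", maxRetries, err)
}

func main() {
	fmt.Println(adminSplitWithRetries())
}
```

The worry later in the thread is that a cap like this can mask a correctness bug rather than just smoothing over contention.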
After looking a little deeper into the above, these new failures I see seem to involve joint consensus and splits and merges. They both seem to involve the error caused by the workaround to #40333 in #40363.
I'm trying to get some more repros now.

It's possible that the sticky bit issue is resolved by a533375, which was merged yesterday? I think this was broken by one of the atomic replica change refactors before being resolved in that commit.

One thing to note is that these were produced on a branch designed to fail this test due to the race. It does not internally retry AdminSplit as much. It's possible that the error from AdminSplit is actually what introduces the hazard. The build was this diff off of 3ad8bda: https://gist.github.com/ajwerner/9e98a6ed5285408b57275de60faba872
Does that change matter in completely-19.2 clusters? I think it's just the case that each processor sets the sticky bit to a different time (see the sketch below):
And also there's a ton of concurrent data movement going on, and nowadays with learners and atomic changes we update the replica descriptor a bunch, so it's not the most insane thing to me that we'd need to retry 10 times.
I hope will make it somewhat better. I'm also open to increasing that 10 to something higher.
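A sketch of the suspected interaction (the types and the one-hour expiration are assumptions for illustration, not the IMPORT code): because each processor stamps its own sticky-bit expiration, split requests for the same key are not idempotent and keep invalidating each other's expected descriptor.

```go
// Sketch: several processors presplit the same boundaries, but each computes its
// own expiration, so otherwise-identical splits still rewrite the descriptor and
// trip the expected-value checks of the other processors.
package main

import (
	"fmt"
	"time"
)

type splitRequest struct {
	Key       string
	StickyBit time.Time // expiration chosen by the issuing processor
}

func main() {
	boundaries := []string{"/Table/57/1", "/Table/57/2", "/Table/58"}

	// Each processor computes "now + 1h" independently, so the sticky bits differ.
	var requests []splitRequest
	for proc := 0; proc < 3; proc++ {
		expiration := time.Now().Add(time.Hour)
		for _, key := range boundaries {
			requests = append(requests, splitRequest{Key: key, StickyBit: expiration})
		}
	}

	// Requests for the same key are not idempotent: the descriptor changes on
	// every one of them, which is exactly the contention the retries paper over.
	seen := map[string]time.Time{}
	for _, r := range requests {
		if prev, ok := seen[r.Key]; ok && !prev.Equal(r.StickyBit) {
			fmt.Printf("key %s: sticky bit rewritten %v -> %v\n", r.Key, prev, r.StickyBit)
		}
		seen[r.Key] = r.StickyBit
	}
}
```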
It's not clear that this repros as readily on vanilla master. There are two interesting things in that diff: the scatter retries on condition failed and the number of retries in the split command. I'm trying to isolate which is the culprit. I'm thinking it's the retries, which are masking a correctness bug.
I did hit:
on master
Oof so the consistency failure repro'd with the following patch:
off of 3ad8bda. The cluster is alive: ajwerner-1568147181-02-n32cpu4. EDIT: that one was 2; I rebuilt with 1 thinking it wasn't going to repro.
Internal data link for the test failure above: https://drive.google.com/file/d/1i9YsjvWXtc0b6D627-tRvLwkkKygSLtJ/view?usp=sharing
Okay, 5 for 5 of these seem to follow soon after
And now of course #6 doesn't. Attached are the files from another repro. Given how easy it is to repro, I'm going to destroy these clusters. Pretty scary that a retry loop makes this okay. Internal: https://drive.google.com/file/d/16aDQF21j37m3YRyjwCiiauPHBug8Ob5a/view?usp=sharing
@andreimatei I'm passing this one over, at least for the consistency failures. We can talk tomorrow.
I have been able to reproduce the replica inconsistency crash without atomic replica reconfiguration enabled, so it doesn't look like this is related to that change.

On a hunch, I tried reverting #39709 to see if we were failing to fully replicate a range during snapshots. That did not fix the issue.
So far I've run this test with #40676 (to fix the extra splits during IMPORT) 5 times with
I can try running with max retries set to 10, but do we have a principled way of setting the max retries? Is it fine to just have a very high probability of not failing due to this issue, or do we expect to never get it at all?
Internal link to files for one of the runs: https://drive.google.com/open?id=1xQ6ER_LUzbSdqAcKz3ZA9pbaxoTBrgNv (cc @pbardea)
I think a very high probability of not failing is ok. @ajwerner and I talked last week about introducing some kind of best-effort semaphore on leaseholders to reduce contention on the descriptor, but I think fundamentally we'll never be able to place a hard upper bound on the number of retries that this operation requires.
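One way such a best-effort semaphore could look, as a sketch of the idea only (nothing here is implemented in CockroachDB; the channel-based `bestEffortSem` is hypothetical):

```go
// Sketch: cap the number of descriptor-mutating operations in flight on a
// leaseholder, but never block correctness on it - if the semaphore is full,
// proceed anyway and rely on the existing retry loop.
package main

import "fmt"

type bestEffortSem chan struct{}

func newBestEffortSem(n int) bestEffortSem { return make(chan struct{}, n) }

// tryAcquire never blocks; callers that lose the race just skip the throttling.
func (s bestEffortSem) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s bestEffortSem) release() { <-s }

func main() {
	sem := newBestEffortSem(1)

	fmt.Println("first op acquires: ", sem.tryAcquire()) // true
	fmt.Println("second op acquires:", sem.tryAcquire()) // false: proceeds anyway, best effort
	sem.release()
	fmt.Println("third op acquires: ", sem.tryAcquire()) // true again after release
}
```

The point of making it best-effort is that it only reduces contention on the descriptor; it deliberately does not try to turn the retry bound into a hard guarantee.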
40676: sql: stop presplitting spans on every import proc r=pbardea a=pbardea

We are currently presplitting spans (on every table and index boundary) for every processor. We should only be doing this once, so this PR moves it to the planning stage.

Addresses #39072.

Release note: None

40838: makefile: increase test timeout r=andreimatei a=andreimatei

This patch increases the test timeout for every package from 12m to 20m (and under stress from 25m to 30m). This is motivated by the sql TestLogic, which has been inching towards the 12m limit recently (according to TeamCity history) and is sometimes timing out. Hopefully it's because we've been adding more tests...

Fixes #40572

Release note: None

Release justification: fix timeout flakes

Co-authored-by: Paul Bardea <pbardea@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
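To make the shape of that fix concrete, here is a toy Go sketch of "split once at planning time vs. once per processor" (the `runBefore`/`runAfter` names are illustrative, not the actual IMPORT planner API):

```go
// Sketch: issuing the boundary splits once during planning instead of in every
// processor cuts the number of contended split requests by the processor count.
package main

import "fmt"

type processor struct{ id int }

// Before: each processor presplit every boundary itself, multiplying the number
// of (contended) split requests by the number of processors.
func runBefore(procs []processor, boundaries []string, split func(string)) {
	for range procs {
		for _, b := range boundaries {
			split(b)
		}
	}
}

// After: the planner splits each boundary exactly once, then hands the work out.
func runAfter(procs []processor, boundaries []string, split func(string)) {
	for _, b := range boundaries {
		split(b)
	}
}

func main() {
	procs := []processor{{1}, {2}, {3}}
	boundaries := []string{"/Table/57/1", "/Table/57/2", "/Table/58"}

	count := 0
	split := func(string) { count++ }

	runBefore(procs, boundaries, split)
	fmt.Println("splits issued before the change:", count) // 9

	count = 0
	runAfter(procs, boundaries, split)
	fmt.Println("splits issued after the change:", count) // 3
}
```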
Closed by #40676.
SHA: https://github.com/cockroachdb/cockroach/commits/86eab2ff0a1a4c2d9b5f7e7a45deda74c98c6c37
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1402541&tab=buildLog