roachtest: ycsb/F/nodes=3/cpu=32 failed #70019
Comments
Moving to SQL-Queries for the same reasons as the master sibling; see #69545 (comment).
This does not seem like a blocker, but I'm also not sure this is a SQL Queries issue. @rafiss you seem to have the most familiarity with this issue -- can you comment on this?
@rafiss did you get a chance to take a look at this at all?
I did look for a few hours and couldn't repro or find anything helpful yet.
Have you tried reproducing via something like
I ran with roachstress locally and with a smaller count; I didn't know I could use a GCE project. I'll try that.
I spent a couple of hours poring over the artifacts but couldn't find anything. The only idea I had was to come up with some way for the client to signal all the nodes in the database to panic all at once when an unexpected error like this occurs, with the understanding that panic will dump a core when GOTRACEBACK=crash is specified.
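For reference, a minimal sketch of the GOTRACEBACK=crash behavior that idea relies on; the program and error below are made up for illustration:

```go
// Run with: GOTRACEBACK=crash go run main.go
// With GOTRACEBACK=crash, a panic prints every goroutine's stack and then the
// runtime aborts the process (SIGABRT on Unix), which produces a core file if
// the OS allows it (e.g. ulimit -c unlimited).
package main

import (
	"errors"
	"fmt"
)

func doWork() error {
	// Stand-in for the real workload query that fails unexpectedly.
	return errors.New("driver: bad connection")
}

func main() {
	if err := doWork(); err != nil {
		// Hypothetical: crash deliberately on the unexpected error so a core
		// dump is available for post-mortem debugging.
		panic(fmt.Sprintf("unexpected error, crashing for core dump: %v", err))
	}
}
```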
FYI, I'm using this test as a guinea pig for stressing a roachtest in CI: #70435 (comment). Will let you know if that turns up a repro. @cucaroach, when the roachtest times out, the test harness sends SIGSEGV to ~everything to get stack traces, and we already collect the cores (though not as artifacts; with
@tbg, thanks, I didn't know that. I have some changes I'm playing with to re-enable core gathering and enable gzip compression of cores in order to streamline OOM investigations (#69317). Maybe --debug could drive core gathering as well if we don't want to do it all the time? They usually compress very well. I'm assuming that in cases like this, even though we got a "bad connection", the node is still running, because there doesn't appear to be anything amiss in the logs. Any hints about why the connection died could be long gone by the time the core is generated, but who knows, we might get lucky.
With |
roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ c46f5a5a098577b936e56f03d20c97300b4cce61:
Same failure on other branches
Something I noticed the other day while using the debugger on a local unit test was that if you pause the process for a while, you can also get a "bad connection" error.
Worth looking into -- though to clarify, our |
roachtest.ycsb/F/nodes=3 failed with artifacts on release-21.2 @ 24021ba163e4ac438b169d575cf1527a4aae394d:
Same failure on other branches
Any updates to report for the triage meeting?
I'm wondering if it could still be lib/pq#1000. I previously made #68665 to handle this, but maybe the check is too strict; perhaps it should be more lenient.
The reason it could be that is that the error always happens exactly 120.0s after the test starts, which is how long the ramp-up period for this test is. After the ramp period, the context is cancelled and a new context is created; see cockroach/pkg/workload/cli/run.go Lines 443 to 477 in b50be15.
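A rough sketch of that ramp-then-run pattern (illustrative only; the identifiers and durations are assumptions, not the actual run.go code):

```go
// Illustrative sketch of the ramp-then-run pattern; identifiers and durations
// are assumptions, not the real cockroach/pkg/workload/cli/run.go code.
package main

import (
	"context"
	"fmt"
	"time"
)

// runWorkers stands in for the workload's worker loop: it issues queries until
// its context is done.
func runWorkers(ctx context.Context, label string) {
	<-ctx.Done()
	fmt.Println(label, "phase stopped:", ctx.Err())
}

func main() {
	const rampDuration = 2 * time.Second // the real test uses a 120s ramp

	// Ramp phase: runs under its own context, cancelled once the ramp elapses.
	rampCtx, cancelRamp := context.WithCancel(context.Background())
	go func() {
		time.Sleep(rampDuration)
		cancelRamp() // queries in flight on pooled connections get interrupted here
	}()
	runWorkers(rampCtx, "ramp")

	// Measured phase: a brand-new context. If lib/pq marked a pooled connection
	// as broken when the ramp cancellation interrupted it, the next query on
	// that connection can return driver.ErrBadConn even though this context
	// was never cancelled.
	workCtx, cancelWork := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancelWork()
	runWorkers(workCtx, "work")
}
```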
So you're suggesting we ignore ErrBadConn in that case?
Yeah, my thinking was that there might be a weird case where the client hasn't seen the
ErrBadConn is the error you get if the connection is unexpectedly broken, which could perhaps include the node being down. But I'm pretty sure you'd see a timeout first.
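For context, driver.ErrBadConn is the sentinel defined in database/sql/driver; a lenient check along the lines being discussed might look something like the sketch below (the package and function names are made up, and this is not the actual code from #68665):

```go
// Package workloadcheck is an illustrative sketch, not the actual #68665 code.
package workloadcheck

import (
	"context"
	"database/sql/driver"
	"errors"
)

// tolerateRampError reports whether an error observed right at the end of the
// ramp period can be ignored: either it is the ramp context's own cancellation,
// or it is driver.ErrBadConn from a pooled connection that the cancellation
// left in a broken state.
func tolerateRampError(rampCtx context.Context, err error) bool {
	if errors.Is(err, context.Canceled) && rampCtx.Err() != nil {
		return true
	}
	return errors.Is(err, driver.ErrBadConn)
}
```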
We have already picked up the linked PR: lib/pq#1000 first appears in v1.9.0, and we are now using v1.10.2. To be clear, lib/pq#1000 is the PR that caused the behavior of
I added the following diff so I could get a stack trace of which worker was actually getting ErrBadConn:
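A rough stand-in for that kind of instrumentation (illustrative only; the helper below and its name are assumptions, not the original diff):

```go
// Package workloaddebug is an illustrative stand-in for the debugging diff,
// not the original change.
package workloaddebug

import (
	"database/sql/driver"
	"errors"
	"fmt"
	"os"
	"runtime/debug"
)

// logBadConn prints a goroutine stack trace the moment a worker observes
// driver.ErrBadConn, so the failing call site (and the context it ran under)
// shows up in the test logs.
func logBadConn(err error) {
	if errors.Is(err, driver.ErrBadConn) {
		fmt.Fprintf(os.Stderr, "worker got ErrBadConn:\n%s\n", debug.Stack())
	}
}
```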
After running roachstress.sh for a while:
That corresponds to this line: cockroach/pkg/workload/cli/run.go Line 475 in bdb4c1a
The notable thing here is that it's coming from the non-cancelled context. I'm going to try to see where exactly ErrBadConn is coming from...
I wrote this quick test based on what the workload does. It has non-deterministic behavior: sometimes the test passes, but sometimes it gets a "bad connection" error.
I have a non-YCSB repro and was able to reproduce the same behavior on 21.1. So this is almost certainly a workload-side problem, and if not that, it is definitely not a new CRDB v21.2 bug. It seems like a race condition in lib/pq. The test below fails somewhere around 50% of the time. It fails much less frequently when I run against a local single-node CRDB cluster, but more reliably if I run against a 3-node roachprod cluster. If I add a
So I'll remove the GA-blocker tag since it shouldn't block any release, but I'll keep working on this to prevent it from flaking, and will backport whatever workload fix I make to release-21.2.
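A sketch in the spirit of that test, not the original code (the connection string, query, timings, and test name below are assumptions):

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
	"testing"
	"time"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

// TestBadConnAfterCancel: cancel a context while a query is in flight, then
// issue another query with a fresh context and check whether
// driver.ErrBadConn surfaces on the reused pooled connection.
func TestBadConnAfterCancel(t *testing.T) {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		t.Fatal(err)
	}
	defer db.Close()
	// Force reuse of a single connection so the second query lands on the one
	// that the cancellation interrupted.
	db.SetMaxOpenConns(1)

	rampCtx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel() // cancel while the query below may still be in flight
	}()
	// Expected to fail with a context cancellation error; ignored on purpose.
	_, _ = db.ExecContext(rampCtx, "SELECT pg_sleep(1)")

	// Fresh, non-cancelled context, analogous to the post-ramp context.
	workCtx, cancelWork := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancelWork()
	if _, err := db.ExecContext(workCtx, "SELECT 1"); err != nil {
		if errors.Is(err, driver.ErrBadConn) {
			t.Fatalf("got ErrBadConn on the fresh context: %v", err)
		}
		t.Fatalf("unexpected error: %v", err)
	}
}
```

Pinning the pool to a single connection makes the reuse of the interrupted connection deterministic, which is presumably why a real workload with many pooled connections only hits this intermittently.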
roachtest.ycsb/F/nodes=3/cpu=32 failed with artifacts on release-21.2 @ 99a4816fc272228a63df20dae3cc41d235e705f3:
Reproduce
See: roachtest README
Same failure on other branches