roachtest: hotspotsplits/nodes=4 failed #33660
on GCE. What gives? This only started happening yesterday, but the stall detection has been in for longer. |
The mechanism seems to do what it's supposed to - the log is full of slow disk:
and logging mostly stops around that time, until the crash a few sec later. The latest runtime stats are
That cgo number is low for a node that is supposedly running lots of splits. |
@petermattis where do you think we should go from here? Can you check (or find someone to do so) whether we're catching write throttling or whether the GCE disk stats for these nodes shows any degradation? |
Perhaps the disk stall test should also, for informative purposes, use a raw write to a temp file in the storage directory (distinguishing between RocksDB slowness and actual block device slowness). |
For example, we could've run into Rocks write throttling. |
We've only seen this failure once. Certainly could be some GCE badness, or it could be something else. I'll poke around and see if I notice anything you didn't. |
The logs contain lots of |
It's curious that most of the slow
It's unfortunate that those log messages don't indicate the replica being processed. I'll fix that. |
Looks like the long "handle Raft ready" messages are due to RocksDB sync commits, I think. I'm adding more instrumentation to be sure. |
99th-percentile log commit latencies are around 1s. That seems surprisingly high. |
SHA: https://github.com/cockroachdb/cockroach/commits/395d842feb97c5bd8cad2b32b71a5156c03061eb Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1115923&tab=buildLog
|
The last failure is fixed in #34399. |
SHA: https://github.com/cockroachdb/cockroach/commits/8179cd9efec890f1ba063488c7a502a96b8241dc Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1119877&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/82026117d83262e87873aad52b8eca2dd0bea335 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1126329&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog
|
Two different nodes hit OOMs a few minutes apart. Both have the majority of their memory (2.65 GB and 3.86 GB) allocated by
At the top of this stack, we can see a We see from the heap profile that the memory allocated by @RaduBerinde or @rytaft any idea why |
The It's possible that the extra memory usage is due to the hashes stored in the HyperLogLog sketches, but in that case I don't think the memory should show up as being allocated at the storage layer. Also, the memory used by the sketches should be pretty small unless this is a very wide table with a lot of indexes. Are there multiple @RaduBerinde do you have any ideas? |
The sampler is not holding on to any memory from the input rows. It only maintains the sketches, which are updated with hashes of input data. The processors are fused, so it's probably the fetcher inside the tablereader holding on to this memory. The only other thing I can think of is what Becca pointed out: that we might be running many of these processors in parallel (auto stats should only run one at a time, so that would be a bug). |
No, there is only one
How about in the
Also, FWIW we see transaction interceptors like |
The sampler only sends sketch data to the sampleAggregator. Note that the sampler is throttled (sleeps a lot) so it might be holding onto a given KV batch for much longer than a regular scan. But it shouldn't be holding on to more than one batch. |
SHA: https://github.com/cockroachdb/cockroach/commits/b5768aecd39461ab9a54e2e7db059a3fe8b00459 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1191957&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/bb50fd396f6ce79258b744e1f8efa2d1bc9dfbd2 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1272551&tab=buildLog
|
(roachtest).hotspotsplits/nodes=4 failed on master@349d2c0fff138c2a3f452b11b900924d6dc8b445:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
(roachtest).hotspotsplits/nodes=4 failed on master@08bb94d98b7f60126666e9c5a317afca61422d7f:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
(roachtest).hotspotsplits/nodes=4 failed on master@c4cf5585cbda55caadb1782f6705e52a83659538:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
(roachtest).hotspotsplits/nodes=4 failed on master@fde991a0f5aba7ffd66112295996ad26f3725036:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
(roachtest).hotspotsplits/nodes=4 failed on master@b797cad6d130714748983bc53d4611ddc6151153:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
(roachtest).hotspotsplits/nodes=4 failed on master@a51ddb8342ff54283baf5c7556cb1d4ff8c4e5da:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
(roachtest).hotspotsplits/nodes=4 failed on master@49e2b18471a42aed8a8ad7cb658863f045ea1aac:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
@ajwerner is this you? hotspotsplits.go:85,hotspotsplits.go:102,test_runner.go:741: range size 490 MiB exceeded 192 MiB |
Fixes cockroachdb#33660. Release note (bug fix): Significantly reduce the amount of memory allocated while scanning tables with a large average row size.
Ah yeah obviously 660b3e7. I'll fix, stressing this test right now anyway |
660b3e7 bumped the default range size by a factor of 8. Do the same in this test. Addresses one failure mode of cockroachdb#33660. Release note: None
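The arithmetic behind this failure mode can be sketched as follows. The 192 MiB limit and the 490 MiB observation come from the failure message above; that the test's limit is scaled by exactly the same factor of 8 (giving 192 × 8 = 1536 MiB) is an assumption for illustration, and `scaledLimit` and `checkRangeSize` are hypothetical names, not functions from `hotspotsplits.go`.

```go
package main

import "fmt"

const mib = 1 << 20

// scaledLimit (hypothetical) scales a size limit by the same factor that
// was applied to the default range size.
func scaledLimit(oldLimit, factor int64) int64 {
	return oldLimit * factor
}

// checkRangeSize (hypothetical) returns an error when size exceeds limit,
// formatted like the failure message in this thread.
func checkRangeSize(size, limit int64) error {
	if size > limit {
		return fmt.Errorf("range size %d MiB exceeded %d MiB", size/mib, limit/mib)
	}
	return nil
}

func main() {
	oldLimit := int64(192 * mib)
	observed := int64(490 * mib)

	// With the old limit, the observed range trips the check.
	fmt.Println(checkRangeSize(observed, oldLimit))

	// Assumption: the limit is bumped by the same factor of 8.
	newLimit := scaledLimit(oldLimit, 8)
	fmt.Println(newLimit/mib, checkRangeSize(observed, newLimit))
}
```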
(roachtest).hotspotsplits/nodes=4 failed on master@abe944fb0ad990ceaba7e592cc83b8922639016c:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
(roachtest).hotspotsplits/nodes=4 failed on master@8b5adba703fae9b6961623f65b685d93b0fe0290:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
(roachtest).hotspotsplits/nodes=4 failed on master@317061202f4510e8087ff6ba9e2e5cb2fce0bb70:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
I’ll take the more recent failures here that are related to the range size changes. |
(roachtest).hotspotsplits/nodes=4 failed on master@fc5c7f093bf1e86852c3b839bc0f6710d9902729:
Artifacts: /hotspotsplits/nodes=4
See this test on roachdash |
45323: row: set TargetBytes for kvfetcher r=nvanbenschoten a=tbg This finishes up the work in #44925 and completes the TargetBytes functionality. This is then used in kvfetcher, which henceforth aims to return no more than ~1 MB per request. Additional commits restore the hotspotsplits roachtest and fix it. With the relevant commit from this PR reverted, the test failed nine out of ten times. With all commits, it passed ten times. The one question I have is whether TargetBytes should be set to a value higher than 1 MB to avoid a performance regression (where should I look?). Fixes #33660. Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
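To illustrate why a per-request byte target bounds the fetcher's memory when rows are large, here is a rough sketch of byte-budgeted pagination. It is not the kvfetcher code or the real TargetBytes semantics: the `kv` type, `scanPage`, and the rule of always returning at least one row are assumptions made for the example.

```go
package main

import "fmt"

// kv is a hypothetical key/value pair, not CockroachDB's type.
type kv struct {
	key   string
	value []byte
}

// scanPage (hypothetical) returns rows starting at index start until adding
// another row would exceed targetBytes. It always returns at least one row
// when any remain, so a single oversized row cannot stall the scan. next is
// the index at which the following page should resume.
func scanPage(rows []kv, start, targetBytes int) (page []kv, next int) {
	used := 0
	i := start
	for i < len(rows) {
		sz := len(rows[i].key) + len(rows[i].value)
		if len(page) > 0 && used+sz > targetBytes {
			break
		}
		page = append(page, rows[i])
		used += sz
		i++
	}
	return page, i
}

func main() {
	// Ten rows of roughly 400 KiB each, i.e. a large average row size.
	rows := make([]kv, 10)
	for i := range rows {
		rows[i] = kv{key: fmt.Sprintf("k%02d", i), value: make([]byte, 400<<10)}
	}

	const targetBytes = 1 << 20 // ~1 MB per request, as in the PR description
	pages := 0
	for start := 0; start < len(rows); {
		var page []kv
		page, start = scanPage(rows, start, targetBytes)
		pages++
		fmt.Printf("page %d: %d rows\n", pages, len(page))
	}
}
```

With a row-count limit alone, a batch's size grows with the average row size; with a byte target, the consumer holds roughly one target's worth of data at a time regardless of how wide the rows are.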
SHA: https://github.com/cockroachdb/cockroach/commits/f5e3c29b2eed92868cf3d449575283e2e383f199
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1088848&tab=buildLog