Single node has a very high CPU utilization when attempting a split of a range, causing range to be inaccessible. #43106
CCing @ajwerner @petermattis
Zendesk ticket #4246 has been linked to this issue.
2.1.9 has multiple infinite loop fixes: https://www.cockroachlabs.com/docs/releases/v2.1.9.html. One for compaction and one for reverse scan. Note there is yet another infinite loop reverse scan bug that is not present in any 2.1 patch release: #35505.
The goroutine traces are not implicating MVCC scan or compaction. All the stuck goroutines are in DBIterSeek.
Got it. These are the stack traces I'm seeing:

```
Thread 29 (LWP 315581):
#0  0x00007fbeb90c1600 in ?? ()
#1  0x000000000209058d in rocksdb::Slice::compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/../rocksdb/include/rocksdb/slice.h:223
#2  cockroach::DBComparator::Compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/comparator.cc:29
#3  0x00000000021759b2 in rocksdb::InternalKeyComparator::Compare () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/dbformat.cc:156
#4  0x00000000021aea42 in operator() () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:33
#5  _M_upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_tree.h:1686
#6  upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_tree.h:1111
#7  upper_bound () at /x-tools/x86_64-unknown-linux-gnu/x86_64-unknown-linux-gnu/include/c++/6.3.0/bits/stl_map.h:1195
#8  rocksdb::CollapsedRangeDelMap::GetTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:339
#9  0x00000000021ac45a in rocksdb::RangeDelAggregator::GetTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/range_del_aggregator.cc:602
#10 0x000000000222ceec in rocksdb::BlockBasedTableIterator::InitRangeTombstone () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:2133
#11 0x000000000222f900 in rocksdb::BlockBasedTableIterator::FindKeyForward () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:2039
#12 0x000000000222fc51 in rocksdb::BlockBasedTableIterator::SeekToFirst () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/block_based_table_reader.cc:1877
#13 0x00000000021c16a6 in rocksdb::IteratorWrapper::SeekToFirst () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/iterator_wrapper.h:69
#14 SkipEmptyFileForward () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/version_set.cc:685
#15 0x000000000224dd5d in rocksdb::IteratorWrapper::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/iterator_wrapper.h:63
#16 rocksdb::MergingIterator::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/table/merging_iterator.cc:112
#17 0x000000000216d886 in rocksdb::DBIter::Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_iter.cc:1165
#18 0x000000000208e038 in Seek () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/batch.cc:155
#19 0x0000000002092d5a in DBIterSeek () at /go/src/github.com/cockroachdb/cockroach/c-deps/libroach/db.cc:518
#20 0x0000000001ff1733 in _cgo_77181d5e6ea5_Cfunc_DBIterSeek (v=0xc467b10468) at cgo-gcc-prolog:646
```

I am not sure it rules out an MVCC scan loop, as presumably that is a caller of DBIterSeek. But I am still looking and don't have evidence to rule it in either.
Actually it does rule it out, as an MVCC scan would show up in the C++ stack trace, sorry.
We have a goroutine dump as well which does not show the MVCC scan routines. This is the first I'm seeing the C++ stacks. Looks range tombstone related.
Here is a list of the threads from the core file: […] It looks like there is a pattern of […]
Based on some potentially error-prone steps I suspect the offending file is an L5 file, "2326537.sst". Another notable thing is the cached range tombstone appears to have the same start and end key. Maybe the skipping optimization gets stuck in this case.
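For concreteness, a minimal C++ sketch (illustrative only; the names and logic are simplified stand-ins, not the libroach/RocksDB code) of why a cached tombstone whose start and end user keys are equal could leave a skip-ahead optimization spinning in place:

```cpp
#include <iostream>
#include <string>

// Toy model of a range tombstone cached by the iterator. In the real code
// these are internal keys; plain strings are enough to show the failure mode.
struct RangeTombstone {
  std::string start_key;
  std::string end_key;
};

int main() {
  // The degenerate state observed above: start and end user keys are equal.
  // "X" stands in for the redacted user key.
  RangeTombstone cached{"X", "X"};

  std::string iter_pos = "X";  // the point key the scan is stuck on
  int iterations = 0;
  const int kGuard = 10;  // guard so this sketch terminates; the real loop had none

  // The optimization: while the cached tombstone is considered relevant for
  // the current position, re-seek to its end key to skip the deleted span.
  while (iter_pos >= cached.start_key && iter_pos <= cached.end_key &&
         iterations < kGuard) {
    iter_pos = cached.end_key;  // the "seek" lands exactly where we already are
    ++iterations;
  }

  std::cout << "gave up after " << iterations << " iterations, still at key "
            << iter_pos << "\n";
  return 0;
}
```

With distinct start and end keys the re-seek makes progress; with equal keys it never does, which would be consistent with the stacks above cycling through GetTombstone and the table iterator's forward-skipping path.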
Mind walking through your analysis? Some folks here in NY would like to follow along. An empty tombstone might certainly cause problems.
It isn't recommended but I rebuilt v2.1.6 with `-g` instead of `-g1` (which does not include local variables) and the debug info looks good to me. Picked a few of the stacks stuck in SeekToFirst(), changed frame to one in BlockBasedTableIterator, then printed the file number. It was the same for the few cases I checked:

```
(gdb) p this->file_meta_->fd.packed_number_and_path_id & kFileNumberMask
$20 = 2326537
```
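As a side note, a small sketch of what that gdb expression computes, assuming RocksDB packs the SST path id into the bits above kFileNumberMask (0x3FFFFFFFFFFFFFFF) and the file number below it; worth double-checking against db/version_edit.h in the vendored RocksDB:

```cpp
#include <cstdint>
#include <iostream>

// Assumed layout of FileDescriptor::packed_number_and_path_id: the low 62
// bits hold the SST file number, the remaining high bits hold the path id.
constexpr uint64_t kFileNumberMask = 0x3FFFFFFFFFFFFFFFULL;

uint64_t FileNumber(uint64_t packed) { return packed & kFileNumberMask; }
uint32_t PathId(uint64_t packed) {
  return static_cast<uint32_t>(packed / (kFileNumberMask + 1));
}

int main() {
  const uint64_t packed = 2326537;  // value printed from the core dump
  std::cout << FileNumber(packed) << ".sst (path id " << PathId(packed) << ")\n";
  return 0;
}
```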
Thanks. That is exactly what Andrei wanted to try tonight. I'd feel OK about this if you used the builder image so we have exactly the same toolchain.
Yes, it was done using the builder image. Here's something else interesting: the empty range tombstone originates from two points in the collapsed map with the same user key, where one has […]. Used std::map pretty printers from https://sourceware.org/ml/gdb/2008-02/msg00064/stl-views.gdb. Ran in a stack frame of […]: […]
User keys replaced with "X" for privacy. Verified they are the same before replacement.
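To make the "two points with the same user key" observation concrete, here is a toy model (not the RocksDB CollapsedRangeDelMap implementation; the keys and seqnums are made up) of a collapsed map keyed by internal key, where two adjacent entries sharing a user key describe a fragment that is empty in user-key space:

```cpp
#include <cstdint>
#include <iostream>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Internal key modeled as (user_key, seqnum), ordered by user key ascending
// and then seqnum descending, mirroring internal-key ordering.
using InternalKey = std::pair<std::string, uint64_t>;

struct InternalKeyLess {
  bool operator()(const InternalKey& a, const InternalKey& b) const {
    if (a.first != b.first) return a.first < b.first;
    return a.second > b.second;  // higher seqnums sort first
  }
};

int main() {
  // Map from fragment start key to the tombstone seqnum in effect until the
  // next entry. The two entries below share the user key "X".
  std::map<InternalKey, uint64_t, InternalKeyLess> collapsed;
  collapsed[{"X", 123456}] = 99;  // a tombstone fragment starts at X@123456...
  collapsed[{"X", 0}] = 0;        // ...and is cut off again at X@0

  // Looking up a key between the two entries yields a fragment whose start
  // and end user keys are both "X" -- an empty range at the user-key level.
  auto next = collapsed.upper_bound({"X", 1000});
  auto cur = std::prev(next);
  std::cout << "fragment [" << cur->first.first << ", " << next->first.first
            << ") at seqnum " << cur->second << "\n";
  return 0;
}
```

A fragment like that is exactly the kind of degenerate tombstone the seek-skipping optimization above cannot make progress past.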
I think the […]
I looked into how we got into a situation where the collapsed range tombstone map can return a […]. The point key we noted earlier that the scan gets stuck at is "X" in "2326537.sst" at L5. Further investigation shows the covering range tombstone is three levels up (i.e., L2) and split across files "2632281.sst" and "2327340.sst". Note the split point is the same "X" as in L5 where the scan is stuck.
Now, we need to figure out why the collapsed map contains an internal key whose seqnum is smaller than that of a point key at a lower level. Knowing the latter of the above two files has a range tombstone spanning beyond its earliest key, we can look at that table's smallest key to see how its tombstones would be truncated: […]
The above output indicates the table's smallest key has type 15 (range delete) and seqnum zero. This is highly unusual: typically a file's smallest key corresponds to its first point key, which should have a non-zero seqnum at an upper level, and we would not expect the file to be extended leftwards by range tombstones. It means truncation of our range tombstone produces "X@0" to be inserted into the collapsed map. And the "X" point key at L5 certainly has a higher seqnum than zero, so the necessary condition for the infinite loop is satisfied.

Another unusual thing about the above two files is that, although they have overlapping endpoints, they do not have consecutive file numbers. In fact the latter's file number is smaller than the former's. This points to subcompaction, which is indeed enabled in v2.1.6. Additionally, L2 appears to be the base level, and L0->Lbase is the case where subcompaction would happen.

So, without a complete understanding/repro, my feeling so far is (1) the unusual database state is related to subcompactions; and (2) although the state is unusual, it doesn't look like any corruption happened, so our seek-to-end-key optimization should handle this case.
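For reference, a short sketch of the internal-key trailer that "type 15 (range delete) and seqnum zero" refers to, assuming the usual RocksDB encoding of (seqnum << 8) | type in the last 8 bytes of an internal key and kTypeRangeDeletion == 0xF (worth confirming against db/dbformat.h in the vendored tree):

```cpp
#include <cstdint>
#include <cstdio>

constexpr uint8_t kTypeRangeDeletion = 0xF;  // value type 15

// Assumed trailer encoding: sequence number in the high 56 bits, value type
// in the low 8 bits.
uint64_t PackSequenceAndType(uint64_t seq, uint8_t type) {
  return (seq << 8) | type;
}

int main() {
  // The truncated tombstone start referred to above as "X@0": seqnum 0,
  // type 15. A real point key "X" would carry a non-zero seqnum and so sort
  // before X@0 within the same user key (higher seqnums sort first).
  uint64_t trailer = PackSequenceAndType(/*seq=*/0, kTypeRangeDeletion);
  std::printf("trailer=0x%016llx seqnum=%llu type=%u\n",
              static_cast<unsigned long long>(trailer),
              static_cast<unsigned long long>(trailer >> 8),
              static_cast<unsigned>(trailer & 0xff));
  return 0;
}
```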
Excellent sleuthing, @ajkr. Definitely looks like a partial tombstone with the same start/end user key would cause a problem. Something that was confusing when I was verifying this is that we seek to […]. Seems like we could just skip this optimization if […]. Does the same problem exist for […]?
Looks like this bug is present in 19.1.x as well. The code in question is very different in 19.2, and this particular optimization is not present.
Yes, it sounds right. Or […] By the way, the condition for the infinite loop turned out to be slightly different from a […]
The second bullet point makes it hard to repro. Requiring the file's max seqnum to be less than […]
The comment in […]
Can you repro with a test directly on […]?
Well, I was able to write a test of the condition using subcompaction: ajkr/rocksdb@6b55f57. Sadly the result is reappearing keys under the subcompaction split point rather than an infinite loop.
Can you elaborate on this? Did the test find another bug?
Yes. Well, we already knew there are bugs with DeleteRange + subcompactions -- it just materialized one. Let me explain by printing the […]
The subcompaction split points were […]. This happens because the last file in the subcompaction to the left of the split point has largest key […]
It's fixed by facebook/rocksdb@64aabc9. We don't need to backport it as we already disabled subcompactions in v2.1. For the original bug, I am still trying to understand what conditions caused the collapsed range-del map to contain a […]
Ack. Glad you tracked that down.
In parallel, it would be good to fix the […]
Will do. Here is the test: ajkr/rocksdb@c46e860. While writing it I became even more surprised by the presence of […]
A user is currently on 2.1.6 and is running into an issue where a single node in the five-node cluster is experiencing nearly 100% CPU usage on 12 cores. Logs show the following error on repeat for nearly 6 hours: […]

The range report for r14507 shows everything expected, with the node in question also being the leaseholder. An output of all goroutines from the Debugging page of the Admin UI shows that a high number of routines are in rocksDBIterator.Seek.

A core dump of the cockroach process has been created and is available here for further analysis. Of note, when creating the core dump, the cockroach process was paused long enough for the leaseholder to be moved to a new node, which caused the range to become available again. The user has confirmed that previously inaccessible rows are now accessible. The cockroach process on the node is still running at near max CPU usage as of the creation of this issue.