Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect mark is found while running debug tests #18913

Closed
alexey-milovidov opened this issue Jan 11, 2021 · 6 comments · Fixed by #19123
Closed

Incorrect mark is found while running debug tests #18913

alexey-milovidov opened this issue Jan 11, 2021 · 6 comments · Fixed by #19123
Assignees
Labels
bug-unreleased Bug that affects unreleased or experimental code fuzz Problem found by one of the fuzzers

Comments

@alexey-milovidov
Copy link
Member

Describe the bug
#18899 (comment)

@alexey-milovidov alexey-milovidov added the fuzz Problem found by one of the fuzzers label Jan 11, 2021
@alesapin
Copy link
Member

2021.01.10 17:54:10.588483 [ 229040 ] {} <Debug> system.text_log (29ec3865-0db2-4a55-834b-8fcd9c7fc97e) (MergerMutator): Merging 6 parts: from 202101_1_236_38 to 202101_1214_1276_12 into Wide
2021.01.10 17:54:10.588922 [ 229040 ] {} <Debug> system.text_log (29ec3865-0db2-4a55-834b-8fcd9c7fc97e) (MergerMutator): Selected MergeAlgorithm: Vertical
2021.01.10 17:54:10.589664 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 30 marks from part 202101_1_236_38, total 231097 rows starting from the beginning of the part
2021.01.10 17:54:10.590701 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 60 marks from part 202101_237_630_29, total 486482 rows starting from the beginning of the part
2021.01.10 17:54:10.591648 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 42 marks from part 202101_631_803_15, total 334259 rows starting from the beginning of the part
2021.01.10 17:54:10.592655 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 28 marks from part 202101_804_1009_19, total 216171 rows starting from the beginning of the part
2021.01.10 17:54:10.593577 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 16 marks from part 202101_1010_1213_25, total 111962 rows starting from the beginning of the part
2021.01.10 17:54:10.594493 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 4 marks from part 202101_1214_1276_12, total 22793 rows starting from the beginning of the part
2021.01.10 17:54:10.750617 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 30 marks from part 202101_1_236_38, total 231097 rows starting from the beginning of the part, column event_time_microseconds
2021.01.10 17:54:10.751308 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 60 marks from part 202101_237_630_29, total 486482 rows starting from the beginning of the part, column event_time_microseconds
2021.01.10 17:54:10.751918 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 42 marks from part 202101_631_803_15, total 334259 rows starting from the beginning of the part, column event_time_microseconds
2021.01.10 17:54:10.752542 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 28 marks from part 202101_804_1009_19, total 216171 rows starting from the beginning of the part, column event_time_microseconds
2021.01.10 17:54:10.753175 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 16 marks from part 202101_1010_1213_25, total 111962 rows starting from the beginning of the part, column event_time_microseconds
2021.01.10 17:54:10.753799 [ 229040 ] {} <Trace> MergeTreeSequentialSource: Reading 4 marks from part 202101_1214_1276_12, total 22793 rows starting from the beginning of the part, column event_time_microseconds
2021.01.10 17:54:11.277107 [ 229040 ] {} <Fatal> : Logical error: 'Incorrect mark rows for mark #174 (compressed offset 6655963, decompressed offset 0), actually in bin file 7610, in mrk file 8139'.

Unfortunately it's a merge of text log :( so it's not easy to reproduce. Will try to read the code...

@alesapin
Copy link
Member

In the failed example all source parts participating in the merge are wide and were merged with the horizontal merge. It means that their marks were correct and passed the same check which failed for vertical merge.

So the error related to a single vertical merge and it's not accumulated from some previous merges. During the vertical merge, we don't compute index granularity. We use source blocks sizes as granules size. So we should never adjust granules, have partially written granules, and avoid other complex cases. The only way how we can break this invariant -- when we get a block that is bigger than the fixed index_granularity for table https://github.com/ClickHouse/ClickHouse/blob/master/src/Storages/MergeTree/MergeTreeDataPartWriterOnDisk.cpp#L130. But I don't know how to achieve this, because merged blocks from merging algorithm must have average granularity from their source blocks. And their granularity must be always lower than the fixed index_granularity.

I don't want to make a blind fix, so I've just added more checks to the writer which should better detect this case. Hopefully, we will catch it in CI again.

@alesapin
Copy link
Member

Got it: #19011

2021.01.13 17:39:45.410267 [ 5781 ] {} <Fatal> : Logical error: 'Incomplete granules are not allowed while blocks are granules size. Mark number 9 (rows 8192), rows written in last mark 0, rows to write in last mark from block 4434 (from row 8192), total marks currently 10'.
2021.01.13 17:39:45.411101 [ 38507 ] {} <Fatal> BaseDaemon: ########################################
2021.01.13 17:39:45.411249 [ 38507 ] {} <Fatal> BaseDaemon: (version 21.1.1.5665, build id: 5B2C6781B06DCEB60EFD88F724D1A954AF2632DE) (from thread 5781) (no query) Received signal Aborted (6)
2021.01.13 17:39:45.411361 [ 38507 ] {} <Fatal> BaseDaemon:
2021.01.13 17:39:45.411526 [ 38507 ] {} <Fatal> BaseDaemon: Stack trace: 0x7f929722918b 0x7f9297208859 0x10734316 0x107343c1 0x1a7542b0 0x1a74d4e7 0x1a74ddbf 0x1a8893d9 0x1a889361 0x1a6e0195 0x1a3f6f41 0x1a40352f 0x1a40347d 0x1a40344d 0x1a403425 0x1a403400 0x10773c09 0x10772e45 0x1a62ce5a 0x1a62ce0d 0x1a62cddd 0x1a62cdb5 0x1a62cd90 0x10773c09 0x10772e45 0x107978cf 0x107a1214 0x107a11dd
2021.01.13 17:39:45.411751 [ 38507 ] {} <Fatal> BaseDaemon: 4. raise @ 0x4618b in /usr/lib/x86_64-linux-gnu/libc-2.31.so
2021.01.13 17:39:45.411895 [ 38507 ] {} <Fatal> BaseDaemon: 5. abort @ 0x25859 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
2021.01.13 17:39:45.412153 [ 38507 ] {} <Fatal> BaseDaemon: 6. ./obj-x86_64-linux-gnu/../src/Common/Exception.cpp:50: DB::handle_error_code(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int)
@ 0x10734316 in /usr/bin/clickhouse
2021.01.13 17:39:45.412403 [ 38507 ] {} <Fatal> BaseDaemon: 7. ./obj-x86_64-linux-gnu/../src/Common/Exception.cpp:56: DB::Exception::Exception(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int) @ 0x107343c1 in /usr/bin/clickhouse
2021.01.13 17:39:45.464910 [ 38507 ] {} <Fatal> BaseDaemon: 8. ./obj-x86_64-linux-gnu/../src/Common/Exception.h:38: DB::Exception::Exception<unsigned long&, unsigned long, unsigned long&, unsigned long&, unsigned long&, unsigned long>(int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, unsigned long&, unsigned long&&, unsigned long&, unsigned long&, unsigned long&, unsigned long&&) @ 0x1a7542b0 in /usr/bin/clickhouse
2021.01.13 17:39:45.515754 [ 38507 ] {} <Fatal> BaseDaemon: 9. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergeTreeDataPartWriterWide.cpp:144: DB::MergeTreeDataPartWriterWide::shiftCurrentMark(std::__1::vector<DB::Granule, std::__1::allocator<DB::Granule> > const&) @ 0x1a74d4e7 in /usr/bin/clickhouse
2021.01.13 17:39:45.567325 [ 38507 ] {} <Fatal> BaseDaemon: 10. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergeTreeDataPartWriterWide.cpp:238: DB::MergeTreeDataPartWriterWide::write(DB::Block const&, DB::PODArray<unsigned long, 4096ul, Allocator<false, false>, 15ul, 16ul> const*) @ 0x1a74ddbf in /usr/bin/clickhouse
2021.01.13 17:39:45.620697 [ 38507 ] {} <Fatal> BaseDaemon: 11. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergedBlockOutputStream.cpp:189: DB::MergedBlockOutputStream::writeImpl(DB::Block const&, DB::PODArray<unsigned long, 4096ul,
Allocator<false, false>, 15ul, 16ul> const*) @ 0x1a8893d9 in /usr/bin/clickhouse
2021.01.13 17:39:45.673558 [ 38507 ] {} <Fatal> BaseDaemon: 12. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergedBlockOutputStream.cpp:45: DB::MergedBlockOutputStream::write(DB::Block const&) @ 0x1a889361 in /usr/bin/clickhouse
2021.01.13 17:39:45.725997 [ 38507 ] {} <Fatal> BaseDaemon: 13. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/MergeTreeDataMergerMutator.cpp:941: DB::MergeTreeDataMergerMutator::mergePartsToTemporaryPart(DB::FutureMergedMutatedPart const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, DB::BackgroundProcessListEntry<DB::MergeListElement, DB::MergeInfo>&, std::__1::shared_ptr<DB::RWLockImpl::LockHolderImpl>&, long, DB::Context const&, std::__1::unique_ptr<DB::IReservation, std::__1::default_delete<DB::IReservation> > const&, bool, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) @ 0x1a6e0195 in /usr/bin/clickhouse
2021.01.13 17:39:45.775888 [ 38507 ] {} <Fatal> BaseDaemon: 14. ./obj-x86_64-linux-gnu/../src/Storages/StorageMergeTree.cpp:793: DB::StorageMergeTree::mergeSelectedParts(std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, bool, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&, DB::StorageMergeTree::MergeMutateSelectedEntry&, std::__1::shared_ptr<DB::RWLockImpl::LockHolderImpl>&) @ 0x1a3f6f41 in /usr/bin/clickhouse
2021.01.13 17:39:45.825927 [ 38507 ] {} <Fatal> BaseDaemon: 15. ./obj-x86_64-linux-gnu/../src/Storages/StorageMergeTree.cpp:962: DB::StorageMergeTree::getDataProcessingJob()::$_4::operator()() @ 0x1a40352f in /usr/bin/clickhouse
2021.01.13 17:39:45.876727 [ 38507 ] {} <Fatal> BaseDaemon: 16. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/type_traits:3676: decltype(std::__1::forward<DB::StorageMergeTree::getDataProcessingJob()::$_4&>(fp)()) std::__1::__invoke<DB::StorageMergeTree::getDataProcessingJob()::$_4&>(DB::StorageMergeTree::getDataProcessingJob()::$_4&) @ 0x1a40347d in /usr/bin/clickhouse
2021.01.13 17:39:45.926640 [ 38507 ] {} <Fatal> BaseDaemon: 17. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/__functional_base:349: void std::__1::__invoke_void_return_wrapper<void>::__call<DB::StorageMergeTree::getDataProcessingJob()::$_4&>(DB::StorageMergeTree::getDataProcessingJob()::$_4&) @ 0x1a40344d in /usr/bin/clickhouse
2021.01.13 17:39:45.976836 [ 38507 ] {} <Fatal> BaseDaemon: 18. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:1608: std::__1::__function::__default_alloc_func<DB::StorageMergeTree::getDataProcessingJob()::$_4, void ()>::operator()() @ 0x1a403425 in /usr/bin/clickhouse
2021.01.13 17:39:46.026321 [ 38507 ] {} <Fatal> BaseDaemon: 19. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2089: void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<DB::StorageMergeTree::getDataProcessingJob()::$_4, void ()> >(std::__1::__function::__policy_storage const*) @ 0x1a403400 in /usr/bin/clickhouse
2021.01.13 17:39:46.026813 [ 38507 ] {} <Fatal> BaseDaemon: 20. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2221: std::__1::__function::__policy_func<void ()>::operator()() const @ 0x10773c09 in /usr/bin/clickhouse
2021.01.13 17:39:46.027117 [ 38507 ] {} <Fatal> BaseDaemon: 21. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2560: std::__1::function<void ()>::operator()() const @ 0x10772e45 in /usr/bin/clickhouse
2021.01.13 17:39:46.079672 [ 38507 ] {} <Fatal> BaseDaemon: 22. ./obj-x86_64-linux-gnu/../src/Storages/MergeTree/BackgroundJobsExecutor.cpp:101: DB::IBackgroundJobExecutor::jobExecutingTask()::$_0::operator()() const @ 0x1a62ce5a in /usr/bin/clickhouse
2021.01.13 17:39:46.130978 [ 38507 ] {} <Fatal> BaseDaemon: 23. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/type_traits:3676: decltype(std::__1::forward<DB::IBackgroundJobExecutor::jobExecutingTask()::$_0&>(fp)()) std::__1::__invoke<DB::IBackgroundJobExecutor::jobExecutingTask()::$_0&>(DB::IBackgroundJobExecutor::jobExecutingTask()::$_0&) @ 0x1a62ce0d in /usr/bin/clickhouse
2021.01.13 17:39:46.182159 [ 38507 ] {} <Fatal> BaseDaemon: 24. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/__functional_base:349: void std::__1::__invoke_void_return_wrapper<void>::__call<DB::IBackgroundJobExecutor::jobExecutingTask()::$_0&>(DB::IBackgroundJobExecutor::jobExecutingTask()::$_0&) @ 0x1a62cddd in /usr/bin/clickhouse
2021.01.13 17:39:46.234318 [ 38507 ] {} <Fatal> BaseDaemon: 25. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:1608: std::__1::__function::__default_alloc_func<DB::IBackgroundJobExecutor::jobExecutingTask()::$_0, void ()>::operator()() @ 0x1a62cdb5 in /usr/bin/clickhouse
2021.01.13 17:39:46.284851 [ 38507 ] {} <Fatal> BaseDaemon: 26. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2089: void std::__1::__function::__policy_invoker<void ()>::__call_impl<std::__1::__function::__default_alloc_func<DB::IBackgroundJobExecutor::jobExecutingTask()::$_0, void ()> >(std::__1::__function::__policy_storage const*) @ 0x1a62cd90 in /usr/bin/clickhouse
2021.01.13 17:39:46.285320 [ 38507 ] {} <Fatal> BaseDaemon: 27. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2221: std::__1::__function::__policy_func<void ()>::operator()() const @ 0x10773c09 in /usr/bin/clickhouse
2021.01.13 17:39:46.285630 [ 38507 ] {} <Fatal> BaseDaemon: 28. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/functional:2560: std::__1::function<void ()>::operator()() const @ 0x10772e45 in /usr/bin/clickhouse
2021.01.13 17:39:46.286047 [ 38507 ] {} <Fatal> BaseDaemon: 29. ./obj-x86_64-linux-gnu/../src/Common/ThreadPool.cpp:243: ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) @ 0x107978cf in /usr/bin/clickhouse
2021.01.13 17:39:46.287005 [ 38507 ] {} <Fatal> BaseDaemon: 30. ./obj-x86_64-linux-gnu/../src/Common/ThreadPool.cpp:124: void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()::operator()() const @ 0x107a1214 in /usr/bin/clickhouse
2021.01.13 17:39:46.287983 [ 38507 ] {} <Fatal> BaseDaemon: 31. ./obj-x86_64-linux-gnu/../contrib/libcxx/include/type_traits:3682: decltype(std::__1::forward<void>(fp)(std::__1::forward<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&>(fp0)...)) std::__1::__invoke_constexpr<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&>(void&&, void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::'lambda1'()&...) @ 0x107a11dd in /usr/bin/clickhouse
2021.01.13 17:39:47.411615 [ 38507 ] {} <Fatal> BaseDaemon: Calculated checksum of the binary: E6571A7BE27437B0088CCB4685C0CF26. There is no information about the reference checksum.
2021.01.13 17:39:57.628690 [ 154 ] {} <Fatal> Application: Child process was terminated by signal 6.

@alesapin
Copy link
Member

alesapin commented Jan 15, 2021

Unfortunately, the problem is deeper. We get marks bigger than fixed granularity during the horizontal merge. Vertical merge error is a consequence.
The additional check will fail before this bad horizontal merge, but I hope to figure out the error earlier.
#19123

@alesapin
Copy link
Member

alesapin commented Jan 15, 2021

Work as expected during manual insert/select from bad part:

$ ~/code/cpp/BuildCH/utils/check-marks/check-marks message.mrk2 message.bin
Mark 0, points to 0, 0, has rows after 8192, decompressed size 1048576.
Mark 1, points to 116015, 0, has rows after 8192, decompressed size 791228.
Mark 2, points to 217163, 0, has rows after 8192, decompressed size 815819.
Mark 3, points to 314244, 0, has rows after 8192, decompressed size 637427.
Mark 4, points to 397913, 0, has rows after 8192, decompressed size 721264.
Mark 5, points to 503491, 0, has rows after 8192, decompressed size 462223.
Mark 6, points to 572762, 0, has rows after 8192, decompressed size 744004.
Mark 7, points to 655562, 0, has rows after 8192, decompressed size 765670.
Mark 8, points to 723566, 0, has rows after 8192, decompressed size 1048576.
Mark 9, points to 826343, 0, has rows after 8192, decompressed size 1048576.
Mark 10, points to 959079, 0, has rows after 8192, decompressed size 1048576.
Mark 11, points to 1543735, 0, has rows after 3608, decompressed size 293846.
Mark 12, points to 1592516, 0, has rows after 0, decompressed size 293846.

@alesapin
Copy link
Member

Another example:

$ ~/code/cpp/BuildCH/utils/check-marks/check-marks memory_usage.mrk2 memory_usage.bin
Mark 0, points to 0, 0, has rows after 7452, decompressed size 122216.
Mark 1, points to 0, 59616, has rows after 7825, decompressed size 62600.
Mark 2, points to 2795, 0, has rows after 8192, decompressed size 65536.
Mark 3, points to 3215, 0, has rows after 8192, decompressed size 65536.
Mark 4, points to 3862, 0, has rows after 6670, decompressed size 148824.
Mark 5, points to 3862, 53360, has rows after 11933, decompressed size 95464. <--  so it can happen in the middle of granule
Mark 6, points to 8669, 0, has rows after 5351, decompressed size 95784.
Mark 7, points to 8669, 42808, has rows after 6622, decompressed size 52976.
Mark 8, points to 11876, 0, has rows after 7543, decompressed size 61896.
Mark 9, points to 11876, 60344, has rows after 194, decompressed size 1552.
Mark 10, points to 11876, 61896, has rows after 0, decompressed size 0.

@alesapin alesapin added the bug-unreleased Bug that affects unreleased or experimental code label Jan 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug-unreleased Bug that affects unreleased or experimental code fuzz Problem found by one of the fuzzers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants