Support Segment for BetaRowset #1577

imay · 2019-08-02T13:14:08Z

We create a new segment format for BetaRowset. The new format merges
the data file and the index file into one file. We also create a new format
for the short key index. In the original code, the index is stored in a format
like RowCursor, which is not efficient to compare. Now we encode multiple
columns into a binary string, and we ensure that this binary sorts in the same
order as the key columns.
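
To illustrate the idea of a binary key that sorts in the same order as its columns, here is a minimal sketch. It is not the encoding used in this patch: the two-column key (int32, string), the helper names, and the 0x00 terminator are all assumptions made for the example, and strings are assumed to contain no embedded 0x00 byte.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Append a signed 32-bit integer so that byte-wise comparison matches numeric
// order: flip the sign bit, then write the bytes big-endian.
void encode_int32(int32_t value, std::string* out) {
    uint32_t bits = static_cast<uint32_t>(value) ^ 0x80000000u;
    for (int shift = 24; shift >= 0; shift -= 8) {
        out->push_back(static_cast<char>((bits >> shift) & 0xFF));
    }
}

// Append a string followed by a 0x00 terminator, so that a shorter string
// sorts before any longer string it is a prefix of.
// Assumption: the value has no embedded 0x00 byte.
void encode_string(const std::string& value, std::string* out) {
    assert(value.find('\0') == std::string::npos);
    out->append(value);
    out->push_back('\0');
}

// Hypothetical two-column key (int32, string) encoded into one binary string.
std::string encode_key(int32_t k1, const std::string& k2) {
    std::string out;
    encode_int32(k1, &out);
    encode_string(k2, &out);
    return out;
}
```

With such an encoding, comparing two keys reduces to one byte-wise comparison of the encoded strings (std::string compares bytes as unsigned values), instead of a column-by-column comparison.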

be/src/olap/rowset/segment_v2/segment_writer.cpp

gensrc/proto/segment_v2.proto

be/src/olap/iterators.h

be/src/olap/rowset/segment_v2/segment_iterator.cpp

gaodayue · 2019-08-05T02:54:56Z

be/src/olap/rowset/segment_v2/segment_iterator.cpp

+    auto start_iter = _segment->lower_bound(index_key);
+    if (start_iter.valid()) {
+        // Because previous block may contain this key, so we should set rowid to
+        // last block's first row.


Is it only possible for duplicated key model?

This is for all key models.

What we store is the short key, so the full key is truncated. If we see an equal short key, the previous block may contain rows with the same full key.

Now I think it's even possible when full key is stored. Say the 1st block contains ('aaa', 'baa'), the 2nd block contains ('bab', 'bac'). If we are searching for key >= 'baa', short key index returns the 2nd block, but the first matching key resides in the 1st block.

Yes, you are right. The root cause is that this index is a sparse index.
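
The step-back rule discussed in this thread can be sketched as follows. This is only an illustration, not the code in segment_iterator.cpp: it assumes a hypothetical in-memory index holding the first key of each block, and the function name is invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sparse index: first_keys[i] is the first key of block i,
// sorted ascending. Returns the block where a scan for rows with
// key >= target has to begin.
std::size_t first_block_to_scan(const std::vector<std::string>& first_keys,
                                const std::string& target) {
    assert(!first_keys.empty());
    // First block whose first key is >= target.
    auto it = std::lower_bound(first_keys.begin(), first_keys.end(), target);
    std::size_t block = static_cast<std::size_t>(it - first_keys.begin());
    // The index only knows each block's first key, so matching rows may
    // already start inside the previous block: step back by one block.
    return block > 0 ? block - 1 : 0;
}
```

For the example above, the index holds 'aaa' and 'bab'. A search for key >= 'baa' finds 'bab' (block 1) in the index, and stepping back yields block 0, which is where 'baa' actually resides.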

gaodayue · 2019-08-05T04:00:02Z

be/test/olap/rowset/segment_v2/segment_test.cpp

+    std::shared_ptr<TabletSchema> tablet_schema(new TabletSchema());
+    tablet_schema->_num_columns = 4;
+    tablet_schema->_num_key_columns = 3;
+    tablet_schema->_num_short_key_columns = 2;


What's the difference between _num_key_columns and _num_short_key_columns? When could they be unequal?

We use only some of the key columns as the index; that is the short key. We don't index all key columns because of memory concerns: we want to load the whole index into memory to accelerate reads.

Now I understand, thanks
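
A small sketch of the relationship, using the numbers from the test above (3 key columns, 2 short key columns). The FullKey alias and short_key_of helper are hypothetical and only model keys as one string per column:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical row key: one string per key column, in schema order.
using FullKey = std::vector<std::string>;

// Keep only the leading num_short_key_columns columns of the full key.
FullKey short_key_of(const FullKey& full_key, std::size_t num_short_key_columns) {
    assert(num_short_key_columns <= full_key.size());
    return FullKey(full_key.begin(),
                   full_key.begin() + static_cast<std::ptrdiff_t>(num_short_key_columns));
}
```

Two rows that differ only in the third key column map to the same short key, which is why the index is smaller than the full key set and also why an equal short key does not pin down a single row.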

be/src/olap/rowset/segment_v2/segment_writer.cpp

In this patch, we create a new format for the short key index. In the original code, the index is stored in a format like RowCursor, which is not efficient to compare. Now we encode multiple columns into a binary string, and we ensure that this binary sorts in the same order as the key columns.

kangpinghuang · 2019-08-05T08:27:49Z

be/src/olap/rowset/segment_v2/segment.h

+private:
+    friend class SegmentIterator;
+
+    Status new_column_iterator(uint32_t cid, ColumnIterator** iter);


Suggested change

-    Status new_column_iterator(uint32_t cid, ColumnIterator** iter);
+    Status _new_column_iterator(uint32_t cid, ColumnIterator** iter);

The same applies to the following two functions.

This is used by SegmentIterator, so I didn't add the _ prefix.

kangpinghuang · 2019-08-05T08:37:25Z

be/src/olap/rowset/segment_v2/segment_writer.cpp

+}
+
+SegmentWriter::~SegmentWriter() {
+    for (auto writer : _column_writers) {


Suggested change

-    for (auto writer : _column_writers) {
+    for (auto& writer : _column_writers) {

For a pointer, I think there is no need to use a reference.
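
As a sketch of why iterating by value is harmless here: assuming _column_writers holds raw pointers (as the reply implies), each iteration copies only a pointer, and deleting through the copy frees the same object. The Writer type and helper names below are hypothetical stand-ins, not the classes in this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ColumnWriter that counts live instances.
struct Writer {
    static int live;
    Writer() { ++live; }
    ~Writer() { --live; }
};
int Writer::live = 0;

// Mirrors the destructor loop: each iteration copies one raw pointer (cheap),
// then deletes the object it points to.
void release_all(std::vector<Writer*>* writers) {
    for (auto writer : *writers) {
        delete writer;
    }
    writers->clear();
}

// Allocates n writers, releases them, and returns how many are still alive.
int live_after_release(std::size_t n) {
    std::vector<Writer*> writers;
    for (std::size_t i = 0; i < n; ++i) {
        writers.push_back(new Writer());
    }
    assert(Writer::live == static_cast<int>(n));
    release_all(&writers);
    return Writer::live;
}
```

A reference would only matter if the loop needed to modify the stored pointers themselves, for example to reset them to nullptr in place.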

kangpinghuang · 2019-08-05T08:43:18Z

be/src/olap/rowset/segment_v2/segment_writer.cpp

+        DCHECK(type_info != nullptr);
+
+        ColumnWriterOptions opts;
+        std::unique_ptr<ColumnWriter> writer(new ColumnWriter(opts, type_info, is_nullable, _output_file.get()));


Can type_info, is_nullable, and the output file be put into ColumnWriterOptions?

ColumnWriterOptions contains options. If there is nothing else to set, this could also work. However, I think type_info and is_nullable are what we really need, so I'd like to keep them out of the options.

kangpinghuang · 2019-08-05T09:33:02Z

be/src/olap/storage_engine.h

@@ -53,6 +52,7 @@ namespace doris {
 class Tablet;
 class DataDir;
 class EngineTask;
+class SegmentGroup;


Why add SegmentGroup here? I think SegmentGroup should not be used here.

Because _gc_files uses it.

_gc_files is useless; delete it and SegmentGroup from the storage engine.

gaodayue reviewed Aug 5, 2019

imay mentioned this pull request Aug 5, 2019

Add new format short key index #1572

Closed

zhaochun added 3 commits August 5, 2019 14:12

Add Segment Iterator for BetaRowset

d7358d7

Change according to comments

940831f

imay force-pushed the add-segment-iter branch from b876627 to 25fb13d Compare August 5, 2019 07:34

Update according to review

983ca60

imay force-pushed the add-segment-iter branch from 25fb13d to 983ca60 Compare August 5, 2019 07:36

imay closed this Aug 5, 2019

imay reopened this Aug 5, 2019

gaodayue previously approved these changes Aug 5, 2019

kangpinghuang requested changes Aug 5, 2019

Update according to review

4984b43

imay dismissed gaodayue’s stale review via 4984b43 August 6, 2019 01:59

kangpinghuang approved these changes Aug 6, 2019

chaoyli approved these changes Aug 6, 2019

imay merged commit b2e678d into apache:master Aug 6, 2019

imay deleted the add-segment-iter branch August 6, 2019 09:15
