Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Segment for BetaRowset #1577

Merged
merged 5 commits into from
Aug 6, 2019
Merged

Support Segment for BetaRowset #1577

merged 5 commits into from
Aug 6, 2019

Conversation

imay
Copy link
Contributor

@imay imay commented Aug 2, 2019

We create a new segment format for BetaRowset. New format merge
data file and index file into one file. And we create a new format
for short key index. In origin code index is stored in format like
RowCusor which is not efficient to compare. Now we encode multiple
column into binary, and we assure that this binary is sorted same
with the key columns.

be/src/olap/rowset/segment_v2/segment_writer.cpp Outdated Show resolved Hide resolved
be/src/olap/rowset/segment_v2/segment_writer.cpp Outdated Show resolved Hide resolved
gensrc/proto/segment_v2.proto Outdated Show resolved Hide resolved
gensrc/proto/segment_v2.proto Outdated Show resolved Hide resolved
be/src/olap/iterators.h Show resolved Hide resolved
be/src/olap/rowset/segment_v2/segment_iterator.cpp Outdated Show resolved Hide resolved
auto start_iter = _segment->lower_bound(index_key);
if (start_iter.valid()) {
// Because previous block may contain this key, so we should set rowid to
// last block's first row.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it only possible for duplicated key model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for all key models.

What we store is short key, full key will be truncated. If we see an equal short key, there may be some same full key in previous block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I think it's even possible when full key is stored. Say the 1st block contains ('aaa', 'baa'), the 2nd block contains ('bab', 'bac'). If we are searching for key >= 'baa', short key index returns the 2nd block, but the first matching key resides in the 1st block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. The root cause is that this index is a sparse index.

std::shared_ptr<TabletSchema> tablet_schema(new TabletSchema());
tablet_schema->_num_columns = 4;
tablet_schema->_num_key_columns = 3;
tablet_schema->_num_short_key_columns = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between _num_key_columns and _num_short_key_columns, when could them be unequal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use some of keys as index, which is short key. We don't use all key columns as key for memory concern. We want to load all index in to memory to accelerate reading.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I understand, thanks

be/src/olap/rowset/segment_v2/segment_writer.cpp Outdated Show resolved Hide resolved
zhaochun added 3 commits August 5, 2019 14:12
In this patch, we create a new format for short key index. In orgin code
index is stored in format like RowCusor which is not effecient to
compare. Now we encode multiple column into binary, and we assure that
this binary is sorted same with the key columns.
@imay imay force-pushed the add-segment-iter branch from b876627 to 25fb13d Compare August 5, 2019 07:34
@imay imay force-pushed the add-segment-iter branch from 25fb13d to 983ca60 Compare August 5, 2019 07:36
@imay imay closed this Aug 5, 2019
@imay imay reopened this Aug 5, 2019
gaodayue
gaodayue previously approved these changes Aug 5, 2019
auto start_iter = _segment->lower_bound(index_key);
if (start_iter.valid()) {
// Because previous block may contain this key, so we should set rowid to
// last block's first row.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I think it's even possible when full key is stored. Say the 1st block contains ('aaa', 'baa'), the 2nd block contains ('bab', 'bac'). If we are searching for key >= 'baa', short key index returns the 2nd block, but the first matching key resides in the 1st block.

private:
friend class SegmentIterator;

Status new_column_iterator(uint32_t cid, ColumnIterator** iter);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Status new_column_iterator(uint32_t cid, ColumnIterator** iter);
Status _new_column_iterator(uint32_t cid, ColumnIterator** iter);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the same to the following two functions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used by SegmentIterator. So I don't add _

}

SegmentWriter::~SegmentWriter() {
for (auto writer : _column_writers) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (auto writer : _column_writers) {
for (auto& writer : _column_writers) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For pointer, I think no need to use reference

DCHECK(type_info != nullptr);

ColumnWriterOptions opts;
std::unique_ptr<ColumnWriter> writer(new ColumnWriter(opts, type_info, is_nullable, _output_file.get()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can type_info and is_nullable and output file be put into ColumnWriterOptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ColumnWirterOptions contains options. If there is no other set, this also can work. However I think type_info and is_nullable is what we really need, so I'd like put these out of options

@@ -53,6 +52,7 @@ namespace doris {
class Tablet;
class DataDir;
class EngineTask;
class SegmentGroup;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why add SegmentGroup here? I think segment group should not be used here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because _gc_files use this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_gc_files is useless, delete it and SegmentGroup from storage engine

@imay imay merged commit b2e678d into apache:master Aug 6, 2019
@imay imay deleted the add-segment-iter branch August 6, 2019 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants