Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HLL support default value #1825

Merged
merged 11 commits into from
Sep 25, 2019
Merged

HLL support default value #1825

merged 11 commits into from
Sep 25, 2019

Conversation

HangyuanLiu
Copy link
Contributor

@HangyuanLiu HangyuanLiu commented Sep 18, 2019

When load date don't have HLL value , HLL value is a empty HyperLogLog

eg:
CREATE TABLE test_uv_9 (
pin_id bigint(20) NOT NULL COMMENT "",
id bigint(20) NULL COMMENT "",
uv1 hll HLL_UNION NOT NULL COMMENT "",
uv2 hll HLL_UNION NOT NULL COMMENT ""
) ENGINE=OLAP
AGGREGATE KEY(pin_id, id)
DISTRIBUTED BY HASH(pin_id) BUCKETS 16
PROPERTIES (
"storage_type" = "COLUMN"
)
1、curl --location-trusted -u root:123456 -H column_separator:, -H label:test_uv_14 -H "columns:pin_id,idx,u1,u2,id=12 ,uv1=hll_hash(u1)" -T uv_test http://11.40.166.162:8030/api/test/test_uv_9/_stream_load
Result :hll_union_agg(uv2) should be 0,And hll_union_agg(uv1) has values > 0

2、curl --location-trusted -u root:123456 -H column_separator:, -H label:test_uv_15 -H "columns:pin_id,idx,u1,u2,id=12" -T uv_test http://11.40.166.162:8030/api/test/test_uv_10/_stream_load
Result :hll_union_agg(uv1) and hll_union_agg(uv2) should be 0

@imay
Copy link
Contributor

imay commented Sep 18, 2019

@HangyuanLiu
Can you add some document or usage example for this change?

@@ -408,7 +408,11 @@ struct AggregateFuncTraits<OLAP_FIELD_AGGREGATION_HLL_UNION, OLAP_FIELD_TYPE_HLL
dst_slice->size = sizeof(HyperLogLog);
// use 'placement new' to allocate HyperLogLog on arena, so that we can control the memory usage.
char* mem = arena->Allocate(dst_slice->size);
dst_slice->data = (char*) new (mem) HyperLogLog(src_slice->data);
if (src_slice->empty()) {
dst_slice->data = (char*) new (mem) HyperLogLog();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the slice size is empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When HLL columns are NULL in the LOAD data

@HangyuanLiu
Copy link
Contributor Author

Update example @imay

@kangkaisen
Copy link
Contributor

curl --location-trusted -u root:123456 -H column_separator:, -H label:test_uv_15 -H "columns:pin_id,idx,u1,u2,id=12" -T uv_test http://11.40.166.162:8030/api/test/test_uv_10/_stream_load
Result :hll_union_agg(uv1) and hll_union_agg(uv2) should be 0

I think we don't support this usage.

When load data with HLL column, we must use hll_hash function, hll_hash function will handle null value.

@HangyuanLiu
Copy link
Contributor Author

This PR will support this capability.
There is a scenario where a Doris table may have multiple HLL columns that may come from different streams. The A stream may not contain the required HLL columns in the B stream and may not have NULL fields. The number of fields in this stream may be less than the number of fields in the Doris table
@kangkaisen

@kangkaisen
Copy link
Contributor

@HangyuanLiu Hi, I See.
Could we add a empty_hll function support this capability? add then, we could still ensure the HLL input in aggregate_func.h will not null.

@HangyuanLiu
Copy link
Contributor Author

HangyuanLiu commented Sep 19, 2019

@kangkaisen
If we add a empty_hll function , so we should write load command like .

LOAD LABEL test.uv
DATA INFILE ("hdfs://ns1017/streamA/*") INTO TABLE test_uv
(pin, id, u1)
set (
uv1=hll_hash(u1),
uv2= empty_hll(),
);

LOAD LABEL test.uv
DATA INFILE ("hdfs://ns1017/streamB/*") INTO TABLE test_uv
(pin, id, u2)
set (
uv1= empty_hll(),
uv2= hll_hash(u2),
);

uv2 may be another business stream
It is unreasonable to let the import data for A business care about the other B business HLL type.

@morningman
Copy link
Contributor

morningman commented Sep 21, 2019

@HangyuanLiu I think using empty_hll() is more reasonable. If business A does not care not column uv2, It should leave uv2 unchanged by explicitly assigning a function means unchanged, which is what empty_hll() will do.

If user assign nothing to HLL column, HLL column's content will be undefined. Because HLL is not like other types of column, which can be filled by default value or null.

empty_hll() to HLL is like 0 to SUM.

And also, should will add a function empty_bitmap() for BITMAP column? @kangkaisen

@kangkaisen
Copy link
Contributor

And also, should will and a function empty_bitmap() for BITMAP column

@morningman I agree with you.

@imay
Copy link
Contributor

imay commented Sep 21, 2019

And also, should will and a function empty_bitmap() for BITMAP column

@morningman I agree with you.

I agree too

@HangyuanLiu

Will you add a hll_empty() function? And it will be better if you can create an issue to record this improvement.

@HangyuanLiu
Copy link
Contributor Author

OK,I can add a function

const int HLL_EMPTY_SIZE = 1;
std::string buf;
std::unique_ptr<HyperLogLog> hll;
hll.reset(new HyperLogLog());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A stack variable is enough

@@ -103,6 +103,14 @@ class HyperLogLog {

int64_t estimate_cardinality();

std::string empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not make this function static?

@@ -18,6 +18,9 @@

HLL_HASH(column_name)
生成HLL列类型,用于insert或导入的时候,导入的使用见相关说明

EMPTY_HLL()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

english doc

@imay imay merged commit 40b9c35 into apache:master Sep 25, 2019
@imay imay mentioned this pull request Sep 26, 2019
@HangyuanLiu
Copy link
Contributor Author

#1895

SWJTU-ZhangLei pushed a commit to SWJTU-ZhangLei/incubator-doris that referenced this pull request Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants