Enhancement
Check the following example:
drop table if exists t1;
create table t1(c1 int, c2 varchar(100), c3 int);
insert into t1 values(1, 'a', 1), (2, 'b', 2), ..., (10000, 'xxx', 10000);
select sum(c1), c2, c3 from t1 group by c2, c3;
For the query above, the current TiFlash HashAgg implementation computes the result as follows:
1. For each row, insert the serialized group by key (c2 + c3) into the HashMap and update the agg state stored in the HashMap.
2. After all rows are handled, read the HashMap and copy its data into result Columns. The actual copies are:
   i. Copy the result of sum(c1) to a ColumnDecimal.
   ii. Copy the result of first_row(c2) to a ColumnString.
   iii. Copy the result of first_row(c3) to a ColumnInt.
   iv. Copy the result of any(c2) to a ColumnString.
   v. Copy c3 to a ColumnInt.
2.i, 2.ii and 2.iii correspond to the select items sum(c1), c2 and c3.
2.iv and 2.v correspond to the group by columns c2 and c3.
However, the copy in 2.ii duplicates the copy in 2.iv, and the copy in 2.iii duplicates the copy in 2.v. So we can do the following optimizations:
1. Eliminate the first_row(c3) agg func (a.k.a. 2.iii) and make a pointer that references c3 (2.v) directly. This avoids both the computation of first_row(c3) and the copy of its result to the result ColumnInt.
2. Eliminate the any(c2) agg func (a.k.a. 2.iv) and make a pointer that references first_row(c2) (2.ii). This avoids both the computation of any(c2) and the copy of its result to the result ColumnString.
If c2 and c3 have a high NDV (number of distinct values), this optimization is significant, because the number of groups, and therefore the number of values copied per column, grows with the NDV.
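To make the saving concrete, here is a minimal Python sketch of the procedure described above. It is a toy model, not TiFlash code: the function name hash_agg, the dedup flag, and the dict/list stand-ins for HashMap and Column are all assumptions made for illustration, and it ignores collation, so any(c2) and first_row(c2) always hold the same value here.

```python
def hash_agg(rows, dedup=False):
    """Toy model of: select sum(c1), c2, c3 from t1 group by c2, c3.

    rows is an iterable of (c1, c2, c3). Returns (result columns, copied values).
    With dedup=True the duplicated agg funcs are dropped and aliased instead.
    """
    groups = {}  # hypothetical stand-in for the HashMap: (c2, c3) -> agg state
    for c1, c2, c3 in rows:
        key = (c2, c3)
        state = groups.get(key)
        if state is None:
            state = {"sum": 0, "first_row_c2": c2}
            if not dedup:
                # Redundant states that the proposal eliminates.
                state["first_row_c3"] = c3
                state["any_c2"] = c2
            groups[key] = state
        state["sum"] += c1

    names = ("sum(c1)", "first_row(c2)", "first_row(c3)", "any(c2)", "c3")
    cols = {name: [] for name in names}
    copied = 0
    for (c2, c3), state in groups.items():
        cols["sum(c1)"].append(state["sum"])                  # 2.i
        cols["first_row(c2)"].append(state["first_row_c2"])   # 2.ii
        cols["c3"].append(c3)                                 # 2.v
        copied += 3
        if not dedup:
            cols["first_row(c3)"].append(state["first_row_c3"])  # 2.iii
            cols["any(c2)"].append(state["any_c2"])              # 2.iv
            copied += 2
    if dedup:
        # "Make a pointer": reuse the already-built columns instead of copying.
        cols["first_row(c3)"] = cols["c3"]
        cols["any(c2)"] = cols["first_row(c2)"]
    return cols, copied


# 10000 rows with 10000 distinct (c2, c3) pairs, i.e. the high-NDV case.
rows = [(i, f"v{i}", i) for i in range(1, 10001)]
naive_cols, naive_copied = hash_agg(rows)
opt_cols, opt_copied = hash_agg(rows, dedup=True)
print(naive_copied, opt_copied)  # 50000 30000
```

In this model both variants return the same column contents, but the deduplicated one copies three values per group instead of five, so the saved work scales linearly with the number of groups.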