Support array_distinct function. #8268
Thanks for the new function support. I did some tests and have a few suggestions:
❯ select array_distinct([]);
Optimizer rule 'simplify_expressions' failed
caused by
Internal error: could not cast value to arrow_array::array::list_array::GenericListArray<i32>.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
I think this empty array case should be handled inside the implementation (and also covered by a sqllogictest case).
I merged #8269 so we can probably pick up the change for this PR
PTAL @alamb @jayzhan211 @2010YOUY01, thanks.
Thanks 👍
let converter = RowConverter::new(vec![SortField::new(dt.clone())])?;
// distinct for each list in ListArray
for arr in array.iter().flatten() {
    let values = converter.convert_columns(&[arr])?;
Why not deduplicate the array in a columnar way? arr has only one column, and using the row format requires extra encoding and decoding.
It would be great to deduplicate the array without the row converter, but I don't think we can do that without first downcasting arr to its exact array type. Is there a recommended way?
I also don't have an idea other than downcasting arr. I was just wondering whether it is worth downcasting to the exact array type.
Downcasting to the exact array type can result in faster code in many cases, as the Rust compiler can generate specialized implementations for each type. However, there are a lot of DataTypes, including nested ones like Dict, List, Struct, etc., so making specialized implementations often requires a lot of work.
The row converter handles all the types internally.
What we have typically done in the past with DataFusion is to use non-type-specific code like RowConverter for the general case, and then, if we find that a particular use case needs faster performance, we make special implementations. For example, we do so for grouping by single primitive columns (GROUP BY int32).
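To illustrate the trade-off described above, here is a minimal std-only sketch. It is not arrow's RowConverter or its actual row format: the Column enum, encode_i32, decode_i32, distinct_general and distinct_int32 are hypothetical names invented for this example. The general path encodes every value into order-preserving bytes, deduplicates the bytes, and decodes them again, while the specialized path works on the i32 values directly.

```rust
use std::collections::BTreeSet;

/// Hypothetical dynamically typed column, standing in for an Arrow array.
#[derive(Debug, PartialEq)]
enum Column {
    Int32(Vec<i32>),
    Utf8(Vec<String>),
}

/// Encode an i32 so that byte-wise comparison matches numeric order:
/// flip the sign bit, then write the value big-endian.
fn encode_i32(x: i32) -> Vec<u8> {
    ((x as u32) ^ 0x8000_0000).to_be_bytes().to_vec()
}

/// Inverse of `encode_i32`. Expects at least four bytes.
fn decode_i32(bytes: &[u8]) -> i32 {
    let raw = u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    (raw ^ 0x8000_0000) as i32
}

/// General path: encode to comparable byte rows, deduplicate the rows,
/// then decode back. Works for every supported type, at the cost of the
/// extra encode and decode steps.
fn distinct_general(col: &Column) -> Column {
    match col {
        Column::Int32(values) => {
            let rows: BTreeSet<Vec<u8>> = values.iter().map(|x| encode_i32(*x)).collect();
            Column::Int32(rows.iter().map(|r| decode_i32(r)).collect())
        }
        Column::Utf8(values) => {
            let rows: BTreeSet<Vec<u8>> =
                values.iter().map(|s| s.as_bytes().to_vec()).collect();
            Column::Utf8(
                rows.into_iter()
                    .map(|r| String::from_utf8(r).expect("rows were built from valid UTF-8"))
                    .collect(),
            )
        }
    }
}

/// Specialized path: only handles i32, but skips the byte encoding entirely.
fn distinct_int32(values: &[i32]) -> Vec<i32> {
    let mut out = values.to_vec();
    out.sort_unstable();
    out.dedup();
    out
}
```

Both paths return the same sorted distinct values for an Int32 column; the specialized one simply does less work per element, which is why it is usually added only once a concrete use case shows it is needed.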
LGTM
implement slt & proto fix null & empty list
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Since there have been no more comments for a few days, I think maybe this PR can go ahead?
Thanks @my-vegetable-has-exploded -- I'll take a look hopefully today or maybe tomorrow
let array = as_list_array(&args[0])?;
general_array_distinct(array, field)
nit: move let array = as_list_array(&args[0])?; into general_array_distinct
Sorry, I don't get your point. LargeList differs from List, so I think it may be better to handle that before calling the generic function?
I mean change the function signature:
fn general_array_distinct<OffsetSize: OffsetSizeTrait>(
    array: &ArrayRef,
    field: &FieldRef,
)
Then cast the array inside the function.
I think you are referring to something like general_array_has_dispatch
I think we can merge this PR as is and then add support for LargeList (using the OffsetSize trait) as a follow-on PR
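As a rough idea of what being generic over the offset size means, here is a std-only sketch. It does not use arrow's OffsetSizeTrait or GenericListArray: the Offset trait, the ListColumn struct and general_distinct are hypothetical stand-ins written for this example, with i32 offsets playing the role of List and i64 offsets the role of LargeList.

```rust
/// Hypothetical stand-in for an offset-size trait:
/// i32 offsets model List, i64 offsets model LargeList.
trait Offset: Copy {
    fn to_usize(self) -> usize;
    fn from_usize(n: usize) -> Self;
}

impl Offset for i32 {
    fn to_usize(self) -> usize {
        self as usize
    }
    fn from_usize(n: usize) -> Self {
        n as i32
    }
}

impl Offset for i64 {
    fn to_usize(self) -> usize {
        self as usize
    }
    fn from_usize(n: usize) -> Self {
        n as i64
    }
}

/// Flattened values plus offsets: list `i` is `values[offsets[i]..offsets[i + 1]]`.
#[derive(Debug, PartialEq)]
struct ListColumn<O: Offset> {
    offsets: Vec<O>,
    values: Vec<i32>,
}

/// One implementation serves both offset widths: sort and deduplicate each
/// list, then rebuild the offsets for the shortened lists.
fn general_distinct<O: Offset>(input: &ListColumn<O>) -> ListColumn<O> {
    let mut offsets = Vec::with_capacity(input.offsets.len().max(1));
    offsets.push(O::from_usize(0));
    let mut values = Vec::new();
    for window in input.offsets.windows(2) {
        let (start, end) = (window[0].to_usize(), window[1].to_usize());
        let mut list = input.values[start..end].to_vec();
        list.sort_unstable();
        list.dedup();
        values.extend(list);
        offsets.push(O::from_usize(values.len()));
    }
    ListColumn { offsets, values }
}
```

With this shape, the caller picks the offset type once (for example general_distinct::<i32> or general_distinct::<i64>) and the body of the function never needs to know which list flavor it is handling.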
Thank you for this contribution @my-vegetable-has-exploded, and thank you @Weijun-H and @jayzhan211 for the help getting this PR ready.
I think it looks very nice and is a good example of collaboration 🦾
I took the liberty of merging up from main to make sure there are no logical conflicts. I intend to merge the PR when the tests pass.
Thanks all.
* implement distinct func implement slt & proto fix null & empty list
* add comment for slt Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
* fix largelist
* add largelist for slt
* Use collect for rows & init capcity for offsets.
* fixup: remove useless match
* fix fmt
* fix fmt
---------
Co-authored-by: Alex Huang <huangweijun1001@gmail.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Which issue does this PR close?
Closes #7289
Rationale for this change
Use list.iter().sorted().dedup() to remove duplicates from each list in a ListArray.
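The sort-then-dedup idea can be sketched with the standard library alone. This is only an illustration under assumptions, not the PR's code: ListRows and distinct_per_list are hypothetical names, nested Vec and Option values stand in for a ListArray with nulls, and plain sort() plus dedup() replace the itertools-style sorted().dedup() chain.

```rust
/// Hypothetical stand-in for a ListArray: each row is an optional list
/// of optional i32 values (None models a null row or a null element).
type ListRows = Vec<Option<Vec<Option<i32>>>>;

/// Sort each non-null list and drop adjacent duplicates.
/// A null row stays null and an empty list stays empty.
fn distinct_per_list(rows: &ListRows) -> ListRows {
    rows.iter()
        .map(|row| {
            row.as_ref().map(|values| {
                let mut sorted = values.clone();
                // Option<i32> orders None before Some(_), so null elements sort first.
                sorted.sort();
                // After sorting, equal values are adjacent, so dedup removes all repeats.
                sorted.dedup();
                sorted
            })
        })
        .collect()
}
```

Note that this approach returns each list's distinct values in sorted order rather than in order of first appearance, and that the null and empty-list rows raised earlier in the review pass through unchanged.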
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?