-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-14705: [C++][Python] Handle more types in UnifySchemas #12000
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some high level questions from a quick look. But this will be a nice improvement to have more flexible schema unification!
(I personally don't have a strong opinion on the need of exposing the fine-grained options or not. It seems fine to me to keep this)
On the python side, I think this can be exposed in concat_tables
, which uses ConcatenateTables
under the hood which already supports passing those field_merge_options
.
cpp/src/arrow/type.h
Outdated
bool promote_decimal_float = false; | ||
|
||
/// Allow an integer of a given bit width to be promoted to a | ||
/// float of an equal or greater bit width. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, that would mean that int64 cannot be unified with float32? (since that bitwidth is not "equal or greater") Or that it would be unified to float64?
For reference, numpy promotes that specific case to float64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As implemented, you need also promote_numeric_width
to unify int64 and float32 (to float64).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wait, I'm wrong (awkward). Yes, int64 + float32 = float64.
cpp/src/arrow/type.h
Outdated
bool promote_integer_float = false; | ||
|
||
/// Allow an unsigned integer of a given bit width to be promoted | ||
/// to a signed integer of the same bit width. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.. or greater bit width?
Eg in numpy uint32 + int64 -> int64
options.promote_integer_sign = true; | ||
options.promote_large = true; | ||
options.promote_binary = true; | ||
return options; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reading clarity, can we list here the ones that are still left to False in a comment? (eg promote_dictionary_ordered
is still False, which I think is good)
promote_decimal
should probably be added as true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, I didn't update this after adding some of the additional options. I'll document what's not enabled as you suggest, thanks.
ee3be7c
to
a80f94f
Compare
Corrected the doc comments. For concat_tables, this requires us to have access to Cast() inside table.cc, so we need ARROW-8891 or some #ifdef'ing to implement this. |
I've used some #ifdefs so that ConcatenateTables can cast columns as necessary now. I still have some TODOs listed up top before I can undraft this. |
@lidavidm Do we still want to do this? If so, do you want to push it forward? |
I think it's still useful to have, but I don't think I'll have time to push this forward anytime soon. |
…36846) Revival of #12000 ### Rationale for this change It would be great to be able to do promotions when `concat`'ing a table, such as: ```python def test_concat_tables_with_promotion_int(): import pyarrow as pa t1 = pa.Table.from_arrays( [pa.array([1, 2], type=pa.int64())], ["int"]) t2 = pa.Table.from_arrays( [pa.array([3, 4], type=pa.int32())], ["int"]) result = pa.concat_tables([t1, t2], promote=True) assert result.equals(pa.Table.from_arrays([ pa.array([1, 2, 3, 4], type=pa.int64()) ], ["int"])) ``` ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: #36845 Lead-authored-by: Fokko Driesprong <fokko@tabular.io> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…es` (apache#36846) Revival of apache#12000 ### Rationale for this change It would be great to be able to do promotions when `concat`'ing a table, such as: ```python def test_concat_tables_with_promotion_int(): import pyarrow as pa t1 = pa.Table.from_arrays( [pa.array([1, 2], type=pa.int64())], ["int"]) t2 = pa.Table.from_arrays( [pa.array([3, 4], type=pa.int32())], ["int"]) result = pa.concat_tables([t1, t2], promote=True) assert result.equals(pa.Table.from_arrays([ pa.array([1, 2, 3, 4], type=pa.int64()) ], ["int"])) ``` ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#36845 Lead-authored-by: Fokko Driesprong <fokko@tabular.io> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…es` (apache#36846) Revival of apache#12000 ### Rationale for this change It would be great to be able to do promotions when `concat`'ing a table, such as: ```python def test_concat_tables_with_promotion_int(): import pyarrow as pa t1 = pa.Table.from_arrays( [pa.array([1, 2], type=pa.int64())], ["int"]) t2 = pa.Table.from_arrays( [pa.array([3, 4], type=pa.int32())], ["int"]) result = pa.concat_tables([t1, t2], promote=True) assert result.equals(pa.Table.from_arrays([ pa.array([1, 2, 3, 4], type=pa.int64()) ], ["int"])) ``` ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#36845 Lead-authored-by: Fokko Driesprong <fokko@tabular.io> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…es` (apache#36846) Revival of apache#12000 ### Rationale for this change It would be great to be able to do promotions when `concat`'ing a table, such as: ```python def test_concat_tables_with_promotion_int(): import pyarrow as pa t1 = pa.Table.from_arrays( [pa.array([1, 2], type=pa.int64())], ["int"]) t2 = pa.Table.from_arrays( [pa.array([3, 4], type=pa.int32())], ["int"]) result = pa.concat_tables([t1, t2], promote=True) assert result.equals(pa.Table.from_arrays([ pa.array([1, 2, 3, 4], type=pa.int64()) ], ["int"])) ``` ### What changes are included in this PR? ### Are these changes tested? ### Are there any user-facing changes? * Closes: apache#36845 Lead-authored-by: Fokko Driesprong <fokko@tabular.io> Co-authored-by: David Li <li.davidm96@gmail.com> Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
This expands UnifySchemas to handle more types, e.g. int32 + float64 = float64, by adding flags for various types.
The flags here may be overly fine-grained.
Not implemented here: