Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-14705: [C++][Python] Handle more types in UnifySchemas #12000

Closed
wants to merge 22 commits into from

Conversation

lidavidm
Copy link
Member

This expands UnifySchemas to handle more types, e.g. int32 + float64 = float64, by adding flags for various types.

The flags here may be overly fine-grained.

Not implemented here:

@github-actions
Copy link

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some high level questions from a quick look. But this will be a nice improvement to have more flexible schema unification!
(I personally don't have a strong opinion on the need of exposing the fine-grained options or not. It seems fine to me to keep this)

On the python side, I think this can be exposed in concat_tables, which uses ConcatenateTables under the hood which already supports passing those field_merge_options.

bool promote_decimal_float = false;

/// Allow an integer of a given bit width to be promoted to a
/// float of an equal or greater bit width.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be clear, that would mean that int64 cannot be unified with float32? (since that bitwidth is not "equal or greater") Or that it would be unified to float64?

For reference, numpy promotes that specific case to float64.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As implemented, you need also promote_numeric_width to unify int64 and float32 (to float64).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait, I'm wrong (awkward). Yes, int64 + float32 = float64.

bool promote_integer_float = false;

/// Allow an unsigned integer of a given bit width to be promoted
/// to a signed integer of the same bit width.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.. or greater bit width?

Eg in numpy uint32 + int64 -> int64

options.promote_integer_sign = true;
options.promote_large = true;
options.promote_binary = true;
return options;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reading clarity, can we list here the ones that are still left to False in a comment? (eg promote_dictionary_ordered is still False, which I think is good)

promote_decimal should probably be added as true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I didn't update this after adding some of the additional options. I'll document what's not enabled as you suggest, thanks.

@lidavidm
Copy link
Member Author

Corrected the doc comments.

For concat_tables, this requires us to have access to Cast() inside table.cc, so we need ARROW-8891 or some #ifdef'ing to implement this.

@lidavidm
Copy link
Member Author

I've used some #ifdefs so that ConcatenateTables can cast columns as necessary now. I still have some TODOs listed up top before I can undraft this.

@pitrou
Copy link
Member

pitrou commented Jul 13, 2022

@lidavidm Do we still want to do this? If so, do you want to push it forward?

@lidavidm
Copy link
Member Author

I think it's still useful to have, but I don't think I'll have time to push this forward anytime soon.

@lidavidm lidavidm closed this Feb 16, 2023
jorisvandenbossche added a commit that referenced this pull request Oct 10, 2023
…36846)

Revival of #12000

### Rationale for this change

It would be great to be able to do promotions when `concat`'ing a table, such as:

```python
def test_concat_tables_with_promotion_int():
    import pyarrow as pa
    t1 = pa.Table.from_arrays(
        [pa.array([1, 2], type=pa.int64())], ["int"])
    t2 = pa.Table.from_arrays(
        [pa.array([3, 4], type=pa.int32())], ["int"])

    result = pa.concat_tables([t1, t2], promote=True)

    assert result.equals(pa.Table.from_arrays([
        pa.array([1, 2, 3, 4], type=pa.int64())
    ], ["int"]))
```

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: #36845

Lead-authored-by: Fokko Driesprong <fokko@tabular.io>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 23, 2023
…es` (apache#36846)

Revival of apache#12000

### Rationale for this change

It would be great to be able to do promotions when `concat`'ing a table, such as:

```python
def test_concat_tables_with_promotion_int():
    import pyarrow as pa
    t1 = pa.Table.from_arrays(
        [pa.array([1, 2], type=pa.int64())], ["int"])
    t2 = pa.Table.from_arrays(
        [pa.array([3, 4], type=pa.int32())], ["int"])

    result = pa.concat_tables([t1, t2], promote=True)

    assert result.equals(pa.Table.from_arrays([
        pa.array([1, 2, 3, 4], type=pa.int64())
    ], ["int"]))
```

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#36845

Lead-authored-by: Fokko Driesprong <fokko@tabular.io>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…es` (apache#36846)

Revival of apache#12000

### Rationale for this change

It would be great to be able to do promotions when `concat`'ing a table, such as:

```python
def test_concat_tables_with_promotion_int():
    import pyarrow as pa
    t1 = pa.Table.from_arrays(
        [pa.array([1, 2], type=pa.int64())], ["int"])
    t2 = pa.Table.from_arrays(
        [pa.array([3, 4], type=pa.int32())], ["int"])

    result = pa.concat_tables([t1, t2], promote=True)

    assert result.equals(pa.Table.from_arrays([
        pa.array([1, 2, 3, 4], type=pa.int64())
    ], ["int"]))
```

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#36845

Lead-authored-by: Fokko Driesprong <fokko@tabular.io>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…es` (apache#36846)

Revival of apache#12000

### Rationale for this change

It would be great to be able to do promotions when `concat`'ing a table, such as:

```python
def test_concat_tables_with_promotion_int():
    import pyarrow as pa
    t1 = pa.Table.from_arrays(
        [pa.array([1, 2], type=pa.int64())], ["int"])
    t2 = pa.Table.from_arrays(
        [pa.array([3, 4], type=pa.int32())], ["int"])

    result = pa.concat_tables([t1, t2], promote=True)

    assert result.equals(pa.Table.from_arrays([
        pa.array([1, 2, 3, 4], type=pa.int64())
    ], ["int"]))
```

### What changes are included in this PR?

### Are these changes tested?

### Are there any user-facing changes?

* Closes: apache#36845

Lead-authored-by: Fokko Driesprong <fokko@tabular.io>
Co-authored-by: David Li <li.davidm96@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants