
ARROW-1883: [Python] Fix handling of metadata in to_pandas when not all columns are present #1386

Conversation

@jorisvandenbossche (Member)

This closes ARROW-1883.

So basically what I did in _add_any_metadata was to replace col = table[i] with:

idx = schema.get_field_index(raw_name)
if idx != -1:
    col = table[idx]

to check that the column is actually present in the schema. However, that required some more code to get at raw_name (the name under which the column is present in the schema), as this does not always match the name in pandas_metadata['columns'][..]['name']. I am not sure if there is a better way to get that name.
(Or whether it would be better to filter pandas_metadata earlier on, instead of checking while actually processing the metadata of each column.)
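The guard above can be sketched with a plain-Python stand-in: get_field_index below mimics pyarrow.Schema.get_field_index, which returns -1 for a missing field, and columns_to_process is a hypothetical helper (not the actual _add_any_metadata code) showing the filtering idea.

```python
def get_field_index(schema_names, name):
    """Mimic pyarrow.Schema.get_field_index: index of name, or -1 if absent."""
    try:
        return schema_names.index(name)
    except ValueError:
        return -1

def columns_to_process(schema_names, columns_metadata):
    """Yield (index_in_schema, column_metadata) only for columns that
    actually exist in the (possibly subset) schema."""
    for col_meta in columns_metadata:
        idx = get_field_index(schema_names, col_meta['name'])
        if idx != -1:
            yield idx, col_meta

# Example: metadata describes columns a, b, c but only a and c were read
schema = ['a', 'c']
meta = [{'name': 'a'}, {'name': 'b'}, {'name': 'c'}]
result = list(columns_to_process(schema, meta))
# → [(0, {'name': 'a'}), (1, {'name': 'c'})]
```

Column 'b' is simply skipped instead of raising or attaching metadata to the wrong column, which is the bug ARROW-1883 describes.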

@cpcloud (Contributor) commented Dec 4, 2017

The Travis failure looks unrelated.

@cpcloud (Contributor) left a comment

@jorisvandenbossche Thanks for reporting and fixing! Couple of minor things to address before merging.

raw_name = col_meta['name']
if i >= n_columns:
    # index columns
    raw_name = pandas_metadata['index_columns'][i - n_columns]
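The raw-name lookup in this hunk can be sketched in isolation. resolve_raw_name is a hypothetical helper, not the code under review (the real code iterates over the columns metadata rather than indexing it), and the metadata below is illustrative: index columns are stored after the n_columns data columns, so their names come from 'index_columns'.

```python
def resolve_raw_name(i, n_columns, pandas_metadata):
    # pull 'index_columns' out into a variable, as the reviewer suggests
    index_columns = pandas_metadata['index_columns']
    if i >= n_columns:
        # index columns come after the data columns in the table
        return index_columns[i - n_columns]
    return pandas_metadata['columns'][i]['name']

meta = {
    'columns': [{'name': 'a'}, {'name': 'b'}],
    'index_columns': ['__index_level_0__'],
}
resolve_raw_name(0, 2, meta)  # → 'a'
resolve_raw_name(2, 2, meta)  # → '__index_level_0__'
```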

Can you pull pandas_metadata['index_columns'] out into a variable?

@@ -1221,6 +1221,21 @@ def test_array_from_pandas_typed_array_with_mask(self, t, data, expected):
        assert pa.Array.from_pandas(expected,
                                    type=pa.list_(t())).equals(result)

    def test_table_column_subset_metadata(self):

Can you add two tests: one against a datetime index and one against another index that isn't datetime? We need to make sure the i >= n_columns condition is tested.

You'll need to pass preserve_index=True to pa.Table.from_pandas().

@jorisvandenbossche (Member Author)

Yes, I was actually playing with such frames while testing this out interactively, but forgot to add them to the actual tests.

preserve_index=True is the default, but I can add it for explicitness if you want.

@jorisvandenbossche (Member Author)

Thanks for the review, added some more tests.

@wesm (Member)

wesm commented Dec 9, 2017

How does this patch align with #1397?

@cpcloud (Contributor)

cpcloud commented Dec 9, 2017

@wesm I believe @jorisvandenbossche said he was going to rebase on top of #1397 when that's merged

@jorisvandenbossche (Member Author)

@wesm yes, and I have time to do that today (or tomorrow) once the other PR is merged.

@jorisvandenbossche (Member Author)

I updated this now that #1397 is merged. But I could not really simplify the code to just use the new 'field_name', as I suppose we still want to be able to read metadata written by older versions of Arrow?
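The backwards-compatibility fallback described here might look something like the following. This is a minimal sketch, assuming the new metadata key is 'field_name' (as mentioned above) and that files written by older Arrow versions carry only 'name'; get_field_name is a hypothetical helper, not the actual PR code.

```python
def get_field_name(col_meta):
    """Return the name a column has in the Arrow schema.

    Prefer the new 'field_name' key; fall back to 'name' so that
    pandas metadata written by older Arrow versions still resolves.
    """
    return col_meta.get('field_name', col_meta['name'])

# new-style metadata: an unnamed index keeps a sanitized field_name
get_field_name({'field_name': '__index_level_0__', 'name': None})
# old-style metadata: only 'name' is present
get_field_name({'name': 'a'})
```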

@jorisvandenbossche (Member Author)

Added a test that a Parquet file written by Arrow 0.7.1 is correctly read.

@cpcloud @wesm @xhochy this is ready again for review

@jorisvandenbossche (Member Author)

For reference, the test file is written as:

In [114]: pa.__version__
Out[114]: '0.7.1'

In [117]: df = pd.DataFrame({
     ...:             'a': [1, 2, 3],
     ...:             'b': [.1, .2, .3], 'c': pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')})

In [118]: df.index = pd.MultiIndex.from_arrays([['a', 'b', 'c'], pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')], names=['index', None])

In [119]: df
Out[119]: 
                                 a    b                         c
index                                                            
a     2017-01-01 00:00:00+01:00  1  0.1 2017-01-01 00:00:00+01:00
b     2017-01-02 00:00:00+01:00  2  0.2 2017-01-02 00:00:00+01:00
c     2017-01-03 00:00:00+01:00  3  0.3 2017-01-03 00:00:00+01:00

In [120]: import pyarrow.parquet as pq

In [123]: table = pa.Table.from_pandas(df)

In [124]: table
Out[124]: 
pyarrow.Table
a: int64
b: double
c: timestamp[ns, tz=Europe/Brussels]
index: string
__index_level_1__: timestamp[ns, tz=Europe/Brussels]
metadata
--------
{b'pandas': b'{"index_columns": ["index", "__index_level_1__"], "pandas_versio'
            b'n": "0.22.0.dev0+313.g7105339", "columns": [{"numpy_type": "int6'
            b'4", "metadata": null, "pandas_type": "int64", "name": "a"}, {"nu'
            b'mpy_type": "float64", "metadata": null, "pandas_type": "float64"'
            b', "name": "b"}, {"numpy_type": "datetime64[ns, Europe/Brussels]"'
            b', "metadata": {"timezone": "Europe/Brussels"}, "pandas_type": "d'
            b'atetimetz", "name": "c"}, {"numpy_type": "object", "metadata": n'
            b'ull, "pandas_type": "unicode", "name": "index"}, {"numpy_type": '
            b'"datetime64[ns, Europe/Brussels]", "metadata": {"timezone": "Eur'
            b'ope/Brussels"}, "pandas_type": "datetimetz", "name": "__index_le'
            b'vel_1__"}]}'}

In [125]: pq.write_table(table, 'v0.7.1.column-metadata-handling.parquet')

(similar code to construct the dataframe is in the test to create the expected result)

@wesm (Member) left a comment

+1. We should probably remove the backwards compatibility code after some statute of limitations has passed (1 or 2 major releases) so we aren't maintaining it forever. Having this release available will give people the ability to "fix" their files if needed.

@jorisvandenbossche (Member Author)

We should probably remove the backwards compatibility code after some statute of limitations has passed

To make this cleaner, I could also move all back-compat handling code into a single function that takes a metadata object and rewrites it in the new form, so the rest of the code doesn't need to care about it and it is easier to remove later on. If you would be interested in that, I can do it here, or in a new PR (as it is not release critical).
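The suggested refactor could be sketched as a single normalization pass over the metadata. normalize_pandas_metadata is a hypothetical name, not code from this PR: it upgrades old-style column metadata (which lacks 'field_name') to the new shape once, so downstream code only ever deals with one form and the compat logic lives in one removable place.

```python
def normalize_pandas_metadata(metadata):
    """Rewrite old-style pandas metadata into the new form.

    Old files carry only 'name' per column; newer files also carry
    'field_name'. After this pass every column has 'field_name'.
    """
    columns = [
        dict(col, field_name=col.get('field_name', col['name']))
        for col in metadata['columns']
    ]
    return dict(metadata, columns=columns)

old = {
    'index_columns': ['__index_level_0__'],
    'columns': [
        {'name': 'a', 'pandas_type': 'int64'},
        {'name': '__index_level_0__', 'pandas_type': 'int64'},
    ],
}
new = normalize_pandas_metadata(old)
# every column in `new` now carries a 'field_name' key
```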

@wesm (Member)

wesm commented Dec 10, 2017

No need to rock the boat here; I will merge this once the AppVeyor build runs.

@wesm wesm closed this in 97678c1 Dec 10, 2017
@wesm (Member)

wesm commented Dec 10, 2017

thanks @jorisvandenbossche!!

@jorisvandenbossche jorisvandenbossche deleted the parquet-column-selection branch December 10, 2017 23:41
@jorisvandenbossche (Member Author)

You're welcome! And happy to have a first patch in Arrow :)
