DOC: Update parquet metadata format description around index levels #18201

Merged
4 commits merged into pandas-dev:master on Dec 7, 2018
Conversation

cpcloud
Member

@cpcloud cpcloud commented Nov 9, 2017

@jreback
Contributor

jreback commented Nov 17, 2017

lgtm. merge when ready @cpcloud

@jreback
Contributor

jreback commented Nov 22, 2017

@cpcloud ?

@cpcloud cpcloud closed this Nov 22, 2017
@cpcloud cpcloud reopened this Nov 22, 2017
@codecov

codecov bot commented Nov 22, 2017

Codecov Report

Merging #18201 into master will decrease coverage by 0.03%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #18201      +/-   ##
==========================================
- Coverage   91.59%   91.55%   -0.04%     
==========================================
  Files         153      153              
  Lines       51257    51257              
==========================================
- Hits        46949    46929      -20     
- Misses       4308     4328      +20
Flag Coverage Δ
#multiple 89.41% <ø> (-0.03%) ⬇️
#single 40.68% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 64.78% <0%> (-1.74%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.59% <0%> (-0.1%) ⬇️
pandas/util/testing.py 82.01% <0%> (+0.19%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jreback
Contributor

jreback commented Dec 2, 2017

@cpcloud good to go?

@jorisvandenbossche
Member

I think the example at the end also should be updated: https://github.com/cpcloud/pandas/blame/6183078033adc62c1ccf043dac15d5b4f121fdac/doc/source/developer.rst#L157 (it still uses '__index_level_0__' in the columns part of the metadata, where the actual name (None in this case) should be included).

@jorisvandenbossche jorisvandenbossche added the IO Parquet label Dec 5, 2017
@jorisvandenbossche
Member

Having dived a little bit into the pyarrow side of the code in apache/arrow#1386, I have two design-level comments or questions related to the index names:

  • Currently there is no strict mapping possible (or at least it is harder than it should be) between the pandas metadata and the actual field names in the schema: index column names in the pandas_metadata['columns'] entries don't match the field names, and the order / number of entries in the pandas_metadata is not always the same as the fields when subsetting data (so you can't just enumerate both together).

    Therefore I was wondering whether it would be worth adding a 'field_name' entry to each pandas_metadata['columns'] entry (next to the 'name' entry). 'field_name' would then strictly match the names of the schema, making lookups much easier (see the sketch after this list).

  • The choice was made to always encode index columns in the schema with __index_level_n__-like names, even if the index (or index level) of the pandas DataFrame has a name that is not None. The actual name is still preserved in pandas_metadata['columns'][..]['name'], so it is not a problem for roundtripping.
    But for usability of pyarrow itself, I would much prefer having a field with the actual name and not __index_level_n__. Also, in pandas we are trying in some places to reduce the difference between index levels and columns (seeing the index as a special column), e.g. in groupby, merging, and sorting you can now mix references to columns and index levels by name.
    This change was done IIRC to accommodate index level names that were the same as column names, but I would personally just regard that case the same as duplicate column names (thus: not supported).

I think those changes are not yet in a released version of pyarrow? (And I think the last one is also not yet implemented in fastparquet, to align with pyarrow.)
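
For illustration only, a sketch of what such a 'columns' entry could look like for an unnamed index level under this proposal (the key name 'field_name' is the suggestion being discussed here, not yet part of the released spec at this point):

# Hypothetical pandas_metadata['columns'] entry for an unnamed index level:
# 'name' keeps the pandas-level index name (None here), while 'field_name'
# matches the Arrow/Parquet schema field exactly.
index_column_entry = {
    'name': None,
    'field_name': '__index_level_0__',
    'pandas_type': 'int64',
    'numpy_type': 'int64',
    'metadata': None,
}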

@wesm
Member

wesm commented Dec 5, 2017

Well, we are running out of time to make any more such changes for Arrow 0.8.0. What about using the disambiguated names only when there is a naming conflict?

@wesm
Member

wesm commented Dec 5, 2017

cc @cpcloud

@cpcloud
Member Author

cpcloud commented Dec 5, 2017

This change was done IIRC to accommodate index level names that were the same as column names, but I would personally just regard that case the same as duplicate column names (thus: not supported).

We introduced this to accommodate that use case specifically. See ARROW-1754

Currently there is no strict mapping possible (or at least harder than it should be) between the pandas metadata and the actual field names in the schema

I don't understand this comment. Can you give an example of what isn't possible or is cumbersome here?

because the order / number of entries in the pandas_metadata is not always the same as the fields when subsetting data

That's also true for some cases independent of how we deal with index names. For example pq.read_parquet(..., columns=['b', 'a']) if the order of b and a in the schema is ['a', 'b'].

@jorisvandenbossche Can you give some examples to illustrate your points?

@jorisvandenbossche
Member

This change was done IIRC to accommodate index level names that were the same as column names, but I would personally just regard that case the same as duplicate column names (thus: not supported).

We introduced this to accommodate that use case specifically. See ARROW-1754

Yes, I know, that's why I mentioned it. But I would say to that issue: "pity, won't fix, this is not supported" (or of course raise an informative error message; the solution for the user can simply be to specify preserve_index=False in Table.from_pandas). For me this is very similar to writing dataframes with duplicate column names to parquet, which is also not supported.

Currently there is no strict mapping possible (or at least harder than it should be) between the pandas metadata and the actual field names in the schema

I don't understand this comment. Can you give an example of what isn't possible or is cumbersome here?

In apache/arrow#1386, I needed to do the following to get the field name corresponding to a specific pandas metadata column entry:

index_columns = pandas_metadata['index_columns']
n_index_levels = len(index_columns)
n_columns = len(pandas_metadata['columns']) - n_index_levels

for i, col_meta in enumerate(pandas_metadata['columns']):
    raw_name = col_meta['name']
    if i >= n_columns:
        # index levels are assumed to come after the data columns,
        # stored under their generated __index_level_n__ names
        raw_name = index_columns[i - n_columns]
    if raw_name is None:
        # unnamed columns are serialized under the string 'None'
        raw_name = 'None'

    idx = schema.get_field_index(raw_name)

It might be that I am missing something and the above is overly complicated. But if there were a 'field_name' entry in the pandas metadata column entries, the above code could be simplified to:

for col_meta in pandas_metadata['columns']:
    # 'field_name' matches the schema field exactly, so no special-casing
    raw_name = col_meta['field_name']
    idx = schema.get_field_index(raw_name)

I would think such an operation (getting the field name corresponding to a pandas column name) is something rather common (it is also something they will need in fastparquet), and it is of course not "impossible" (that was stated a bit too strongly), but just rather hard.
On the other hand, in pyarrow itself you can easily turn this into a helper function if you need it in multiple places, so it is also not that much of a problem. But for somebody starting to play with pyarrow Tables created from pandas dataframes, figuring this out (the relationship between the pandas names and the field names, and coming up with the above code snippet) was not that straightforward.

This is indeed not fully related to the index names discussion, because even if we reverted the "always use the __index_level_n__ pattern" behaviour, you would need the above code anyhow. So they are two separate issues.

@jorisvandenbossche
Member

@cpcloud Another illustration of the possible difficulties with the mapping between the pandas metadata and the actual field names in the schema: currently this mapping is possible in pyarrow because it assumes that the index columns are the last ones in the schema.
However, this assumption breaks e.g. with files written by fastparquet, because they do not put index levels at the end, but at the beginning:

In [79]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3], 'c':pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')}, index=pd.Index(['A', 'B', 'C'], name='index'))

In [80]: df
Out[80]: 
       a    b                         c
index                                  
A      1  0.1 2017-01-01 00:00:00+01:00
B      2  0.2 2017-01-02 00:00:00+01:00
C      3  0.3 2017-01-03 00:00:00+01:00

In [81]: fastparquet.write("__test_index_fastparquet-dev.parquet", df)
/home/joris/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/fastparquet/writer.py:136: UserWarning: Coercing datetimes to UTC
  warnings.warn('Coercing datetimes to UTC')

In [82]: pq.read_table("__test_index_fastparquet-dev.parquet")
Out[82]: 
pyarrow.Table
index: string
a: int64
b: double
c: timestamp[us]
metadata
--------
{b'pandas': b'{"columns": [{"metadata": null, "name": "index", "numpy_type": "'
            b'object", "pandas_type": "unicode"}, {"metadata": null, "name": "'
            b'a", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata":'
            b' null, "name": "b", "numpy_type": "float64", "pandas_type": "flo'
            b'at64"}, {"metadata": {"timezone": "Europe/Brussels"}, "name": "c'
            b'", "numpy_type": "datetime64[ns, Europe/Brussels]", "pandas_type'
            b'": "datetimetz"}], "index_columns": ["index"], "pandas_version":'
            b' "0.22.0.dev0+260.g5da3759"}'}

In [83]: pq.read_table("__test_index_fastparquet-dev.parquet").to_pandas()
...
ArrowNotImplementedError: No cast implemented from string to timestamp[ns, tz=Europe/Brussels]

Of course, this could (and maybe should) be fixed in fastparquet (they still have to update with regard to the index name change as well: dask/fastparquet#251). And once fixed, this again will not be an impossible problem. However, this assumption was until now not written down in the pandas metadata specification (which is what you are updating now), so they are not to blame. But to me it shows a bit the brittleness of this assumption.

@TomAugspurger
Contributor

I'm looking into this on the fastparquet / dask side.

But if you would have a field_name entry in the pandas metadata column entries

Joris' suggestion seems sensible to me.

@jorisvandenbossche
Member

Whatever the decision for this discussion, this PR should close #16391 (where Wes proposes an additional name entry for the index to disambiguate with column names)

@cpcloud
Member Author

cpcloud commented Dec 6, 2017

@jorisvandenbossche @TomAugspurger Thanks for commenting. I agree adding 'field_name' will be helpful and will allow writing index metadata that fastparquet can read as well. PR is in the works for arrow. I'll also update this PR with the changes.

@TomAugspurger
Contributor

TomAugspurger commented Dec 6, 2017 via email

@jorisvandenbossche
Member

Thanks! Happy to review when there is a PR.

Regarding my other point (the second bullet point here: #18201 (comment)) about always using __index_level_n__ as the index column name in the schema:
I was thinking a bit more about it, and there may be an alternative that does not bring back the bug fixed in ARROW-1754: use __index_level_n__ only if the index name is None or already exists in the column names (both conditions should be easy to check). A rough sketch of that rule follows below.
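
A minimal sketch of that naming rule (the helper name and signature are mine, purely for illustration; this is not pyarrow's actual implementation):

def index_field_name(level, name, column_names):
    # Use the real index name when it is usable; fall back to the generated
    # __index_level_n__ name when it is None or clashes with a column name
    # (the case fixed by ARROW-1754).
    if name is None or name in column_names:
        return '__index_level_{}__'.format(level)
    return name

# A named, non-clashing index level keeps its name ...
index_field_name(0, 'date', ['a', 'b'])    # -> 'date'
# ... while an unnamed or conflicting level gets the dunder name.
index_field_name(0, None, ['a', 'b'])      # -> '__index_level_0__'
index_field_name(0, 'a', ['a', 'b'])       # -> '__index_level_0__'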

The reason I argue for this is the following: as long as one uses to_parquet/read_parquet to roundtrip pandas dataframes, it does not really matter how the index columns are stored in the parquet file. But if one also wants to use those files in other systems, I can imagine that names like __index_level_n__ can be rather annoying, certainly if your actual dataframe had a sensible index name.
And we will write index columns by default in to_parquet (see the in-progress PR #18629; until now only default integer range indexes were supported), so this will become more common.

I understand that the 0.8.0 release is close, so you might not want to reconsider this at this point. But I can certainly do a PR with the change to help out (I think it should be a relatively straightforward patch, just as apache/arrow#1271 was also not that big a patch).

@wesm
Member

wesm commented Dec 6, 2017

Here's the PR apache/arrow#1397

@TomAugspurger
Contributor

This is unrelated to the changes here, but I noticed pyarrow uses "string" for the "pandas_type":

>>> import io, json
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> df = pd.DataFrame({"A": [1, 2]})
>>> sink = io.BytesIO()
>>> pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>>> json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['column_indexes']
[{'metadata': None,
  'name': None,
  'numpy_type': 'object',
  'pandas_type': 'string'}]

Though the spec says just 'unicode' or 'bytes' for the String type. Could we add 'string' as an alias for 'unicode'?

@cpcloud
Member Author

cpcloud commented Dec 7, 2017

@TomAugspurger That only appears in the 'column_indexes' metadata, because that doesn't go through arrow schemas, so it's a bug. This was added between 0.7.1 and the upcoming 0.8.0, so we can change it without any backward-compatibility consequences.

@cpcloud
Member Author

cpcloud commented Dec 7, 2017

@jorisvandenbossche @TomAugspurger Thanks for ironing this stuff out, it's very helpful to have a third and fourth pair of eyes on this code.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.21.1, 0.22.0 Dec 7, 2017
@wesm
Member

wesm commented Dec 9, 2017

We could always use guids for the index column names in the Parquet schema instead of the generated dunder-names (and use the index column metadata to list the guids corresponding to each index level).

I think this is the last blocking issue for Arrow 0.8.0 -- there are a couple other small items I'm going to address today/tomorrow

@cpcloud
Member Author

cpcloud commented Dec 9, 2017

I took the guid approach back when I first did this and it wasn't a viable solution because there are new guids generated for every dataframe. It becomes really difficult to do something like a concat operation under this scheme. You need reliable and reproducible names for the index columns.
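
A quick illustration of the problem, using a purely hypothetical guid-based naming scheme (not what pyarrow actually does):

import uuid
import pandas as pd

def guid_index_field_names(df):
    # One fresh guid per index level: fine within a single write, but a
    # different name is produced for every DataFrame/file.
    return ['index_{}'.format(uuid.uuid4().hex) for _ in range(df.index.nlevels)]

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})
# The same logical index level ends up with a different field name in each
# file, so schemas from two files never line up for operations like concat.
print(guid_index_field_names(df1))
print(guid_index_field_names(df2))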

@wesm
Member

wesm commented Dec 9, 2017

Ah, good point, since you need to be deterministic across files

@jorisvandenbossche
Member

If unique column names are not required in parquet and/or arrow, I think the is_index flag is a good solution.
You will still run into trouble when having to process specific columns for their metadata and that column's field name is not unique, but that is not something we are going to solve with a better specification (unless we require unique column names).

@jorisvandenbossche
Member

Regarding preserving the index name where possible: I quickly tried this and opened apache/arrow#1408, just as a proof of concept that this is IMO not difficult to do while also keeping the bug fix of ARROW-1754.

@TomAugspurger
Contributor

What's the recommended way for determining whether df.index.name was originally None? Special case __index_level_0__?

@jorisvandenbossche
Member

The actual name is stored in the metadata (name vs field_name) and can be null I think?
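
For illustration, a minimal sketch of recovering the original index names, assuming 0.8.0-style metadata where index levels appear in 'columns' with a 'field_name' entry (as in the examples below):

def original_index_names(pandas_metadata):
    # Map each schema field name back to the original pandas name; 'name'
    # is null/None for an unnamed index level.
    by_field = {c['field_name']: c['name'] for c in pandas_metadata['columns']}
    return [by_field[field] for field in pandas_metadata['index_columns']]

# With the 0.8.0 metadata shown below this returns [None], i.e. an unnamed index.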

@TomAugspurger
Contributor

Ah, it wasn't null as of 0.7.1:

In [13]: import json; import pandas as pd; import pyarrow as pa; import pyarrow.parquet as pq

In [14]: df = pd.DataFrame({"x": [1, 2]})

In [15]: pq.write_table(pa.Table.from_pandas(df), '/tmp/a.parq')

In [16]: json.loads(pq.read_metadata('/tmp/a.parq').metadata[b'pandas'])
Out[16]:
{'columns': [{'metadata': None,
   'name': 'x',
   'numpy_type': 'int64',
   'pandas_type': 'int64'},
  {'metadata': None,
   'name': '__index_level_0__',
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': ['__index_level_0__'],
 'pandas_version': '0.21.0'}

But it is null in 0.8.0:

In [8]: import json; import pandas as pd; import pyarrow as pa; import pyarrow.parquet as pq

In [9]: df = pd.DataFrame({"x": [1, 2]})

In [10]: pq.write_table(pa.Table.from_pandas(df), '/tmp/b.parq')

In [11]: json.loads(pq.read_metadata('/tmp/b.parq').metadata[b'pandas'])
Out[11]:
{'column_indexes': [{'field_name': None,
   'metadata': {'encoding': 'UTF-8'},
   'name': None,
   'numpy_type': 'object',
   'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'x',
   'metadata': None,
   'name': 'x',
   'numpy_type': 'int64',
   'pandas_type': 'int64'},
  {'field_name': '__index_level_0__',
   'metadata': None,
   'name': None,
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': ['__index_level_0__'],
 'pandas_version': '0.21.0'}

In [12]: pa.__version__
Out[12]: '0.8.0'

@TomAugspurger
Contributor

https://issues.apache.org/jira/browse/ARROW-1941 was just opened, and I noticed that the pyarrow pandas_type metadata for nested types looks like list[float64]:

metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
            b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
            b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
            b' "object", "metadata": null}, {"name": null, "field_name": "__in'
            b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
            b'metadata": null}], "pandas_version": "0.21.1"}'}

I don't believe that nested types are mentioned anywhere for pandas_type. We should update that as well (either here or in a new issue).

@wesm
Member

wesm commented Dec 26, 2017

wesm pushed a commit to apache/arrow that referenced this pull request Feb 2, 2018: alternative fix for duplicate index/column name that preserves index name if available

Related to the discussion about the pandas metadata specification in pandas-dev/pandas#18201, and an alternative to #1271.

I'm not opening this PR because it should necessarily be merged; I just want to show that it is not that difficult to both fix [ARROW-1754](https://issues.apache.org/jira/browse/ARROW-1754) and preserve index names as field names when possible (this was mentioned in pandas-dev/pandas#18201 as the reason for making the change to not preserve index names).
The diff is partly a revert of #1271, but then adapted to the current codebase.

Main reasons I prefer to preserve index names: 1) usability in pyarrow itself (if you want to work with pyarrow Tables created from pandas), and 2) when interchanging parquet files with other people / other non-pandas systems, it would be much nicer not to have `__index_level_n__` column names if possible.

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes #1408 from jorisvandenbossche/index-names and squashes the following commits:

eef1d33 [Joris Van den Bossche] alternative fix for duplicate index/column name that preserves index name if available
@jorisvandenbossche
Member

@cpcloud can you update this? (Among other things, recent changes in apache/arrow#1408 affect this.)
I can also push some changes to this branch.

@jreback jreback modified the milestones: 0.23.0, 0.24.0 Mar 30, 2018
@jreback
Contributor

jreback commented Mar 30, 2018

@cpcloud update on this?

@cpcloud
Member Author

cpcloud commented Mar 30, 2018

@jorisvandenbossche Any chance you can submit a PR to my fork for the changes you have in mind?

@jreback
Contributor

jreback commented Nov 1, 2018

ping @jorisvandenbossche

@cpcloud can you rebase when you have a chance.

@jreback jreback removed this from the 0.24.0 milestone Nov 6, 2018
Member

@datapythonista datapythonista left a comment


I think we can merge this, and make any further updates in a separate PR. There is just one required change, as we are not using the right directive for the example.


Here's an example of how the index metadata is structured in pyarrow:

.. code-block:: python

Suggested change
.. code-block:: python
.. ipython:: python

@jreback
Contributor

jreback commented Dec 3, 2018

@datapythonista that is fine. or if @cpcloud can update would be great.

@datapythonista datapythonista self-assigned this Dec 3, 2018
Member

@datapythonista datapythonista left a comment


I double checked, and the code block is correct, as it doesn't show any output, and it's actually not expected to run (some variables are not initialized).

lgtm then

@jreback can we merge this then?

@jreback
Contributor

jreback commented Dec 7, 2018

sure

@datapythonista datapythonista merged commit c911151 into pandas-dev:master Dec 7, 2018
@datapythonista
Member

thanks @cpcloud

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019