DOC: Update parquet metadata format description around index levels #18201

Merged
4 commits merged into pandas-dev:master on Dec 7, 2018
Conversation

cpcloud
Member

@cpcloud cpcloud commented Nov 9, 2017

@jreback
Contributor

jreback commented Nov 17, 2017

lgtm. merge when ready @cpcloud

@jreback
Contributor

jreback commented Nov 22, 2017

@cpcloud ?

@cpcloud cpcloud closed this Nov 22, 2017
@cpcloud cpcloud reopened this Nov 22, 2017
@codecov

codecov bot commented Nov 22, 2017

Codecov Report

Merging #18201 into master will decrease coverage by 0.03%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #18201      +/-   ##
==========================================
- Coverage   91.59%   91.55%   -0.04%     
==========================================
  Files         153      153              
  Lines       51257    51257              
==========================================
- Hits        46949    46929      -20     
- Misses       4308     4328      +20
Flag Coverage Δ
#multiple 89.41% <ø> (-0.03%) ⬇️
#single 40.68% <ø> (-0.11%) ⬇️
Impacted Files Coverage Δ
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 64.78% <0%> (-1.74%) ⬇️
pandas/core/frame.py 97.81% <0%> (-0.1%) ⬇️
pandas/core/indexes/datetimes.py 95.59% <0%> (-0.1%) ⬇️
pandas/util/testing.py 82.01% <0%> (+0.19%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jreback
Contributor

jreback commented Dec 2, 2017

@cpcloud good to go?

@jorisvandenbossche
Member

I think the example at the end also should be updated: https://github.com/cpcloud/pandas/blame/6183078033adc62c1ccf043dac15d5b4f121fdac/doc/source/developer.rst#L157 (it still uses '__index_level_0__' in the columns part of the metadata, where the actual name (None in this case) should be included).

@jorisvandenbossche jorisvandenbossche added the IO Parquet label Dec 5, 2017
@jorisvandenbossche
Member

Having dived a little bit into the pyarrow side of the code in apache/arrow#1386, I have two design-level comments or questions related to the index names:

  • Currently there is no strict mapping possible (or at least it is harder than it should be) between the pandas metadata and the actual field names in the schema: index column names in the pandas_metadata['columns'] entries don't match the field names, and the order / number of entries in the pandas_metadata is not always the same as the fields when subsetting data (so you can't just enumerate both together).

    Therefore I was wondering whether it would be worth adding a 'field_name' entry to each pandas_metadata['columns'] entry (next to the 'name' entry). 'field_name' would then strictly match the names of the schema, making lookups much easier (see the sketch after this list).

  • The choice was made to always encode index columns in the schema with __index_level_n__-like names, even if the index (or index level) of the pandas DataFrame has a name that is not None. The actual name is still preserved in pandas_metadata['columns'][..]['name'], so it is not a problem for roundtripping.
    But for usability of pyarrow itself, I would much prefer having a field with the actual name and not __index_level_n__. Also, in pandas we are trying in some places to reduce the difference between index levels and columns (seeing the index as a special column), e.g. in groupby, merging, and sorting you can now mix references to columns and index levels by name.
    This change was done IIRC to accommodate index level names that were the same as column names, but I would personally just regard that case the same as duplicate column names (thus: not supported).

I think those changes are not yet in a released version of pyarrow? (And I think the last one is also not yet implemented in fastparquet, to align with pyarrow.)
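
For illustration only, a sketch of what such a 'columns' entry could look like for an unnamed index level under this proposal (the key name 'field_name' is the suggestion being discussed here, not yet part of the released spec at this point):

# Hypothetical pandas_metadata['columns'] entry for an unnamed index level:
# 'name' keeps the pandas-level index name (None here), while 'field_name'
# matches the Arrow/Parquet schema field exactly.
index_column_entry = {
    'name': None,
    'field_name': '__index_level_0__',
    'pandas_type': 'int64',
    'numpy_type': 'int64',
    'metadata': None,
}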

@wesm
Member

wesm commented Dec 5, 2017

Well, we are running out of time to make any more such changes for Arrow 0.8.0. What about using the disambiguated names only when there is a naming conflict?

@wesm
Member

wesm commented Dec 5, 2017

cc @cpcloud

@cpcloud
Member Author

cpcloud commented Dec 5, 2017

This change was done IIRC to accommodate index level names that were the same as column names, but I would personally just regard that case the same as duplicate column names (thus: not supported).

We introduced this to accommodate that use case specifically. See ARROW-1754

Currently there is no strict mapping possible (or at least harder than it should be) between the pandas metadata and the actual field names in the schema

I don't understand this comment. Can you give an example of what isn't possible or is cumbersome here?

because the order / number of entries in the pandas_metadata is not always the same as the fields when subsetting data

That's also true for some cases independent of how we deal with index names. For example pq.read_parquet(..., columns=['b', 'a']) if the order of b and a in the schema is ['a', 'b'].

@jorisvandenbossche Can you give some examples to illustrate your points?

@jorisvandenbossche
Member

This change was done IIRC to accommodate index level names that were the same as column names, but I would personally just regard that case the same as duplicate column names (thus: not supported).

We introduced this to accommodate that use case specifically. See ARROW-1754

Yes, I know, that's why I mentioned it. But I would say to that issue: "pity, won't fix, this is not supported" (or of course raise an informative error message; the solution for the user can simply be to specify preserve_index=False in Table.from_pandas). For me this is very similar to writing dataframes with duplicate column names to parquet, which is also not supported.

Currently there is no strict mapping possible (or at least harder than it should be) between the pandas metadata and the actual field names in the schema

I don't understand this comment. Can you give an example of what isn't possible or is cumbersome here?

In apache/arrow#1386, I needed to do the following to get the field name corresponding to a specific pandas metadata column entry:

index_columns = pandas_metadata['index_columns']
n_index_levels = len(index_columns)
n_columns = len(pandas_metadata['columns']) - n_index_levels

for i, col_meta in enumerate(pandas_metadata['columns']):
    raw_name = col_meta['name']
    if i >= n_columns:
        # index levels are assumed to come after the data columns,
        # stored under their generated __index_level_n__ names
        raw_name = index_columns[i - n_columns]
    if raw_name is None:
        # unnamed columns are serialized under the string 'None'
        raw_name = 'None'

    idx = schema.get_field_index(raw_name)

It might be that I am missing something and the above is overly complicated. But if there were a 'field_name' entry in the pandas metadata column entries, the above code could be simplified to:

for col_meta in pandas_metadata['columns']:
    # 'field_name' matches the schema field exactly, so no special-casing
    raw_name = col_meta['field_name']
    idx = schema.get_field_index(raw_name)

I would think such an operation (getting the field name corresponding to a pandas column name) is something rather common (it is also something they will need in fastparquet), and it is of course not "impossible" (that was stated a bit too strongly), but just rather hard.
On the other hand, in pyarrow itself you can easily turn this into a helper function if you need it in multiple places, so it is also not that much of a problem. But for somebody starting to play with pyarrow Tables created from pandas dataframes, figuring this out (the relationship between the pandas names and the field names, and coming up with the above code snippet) was not that straightforward.

This is indeed not fully related to the index names discussion, because even if we reverted the "always use the __index_level_n__ pattern" behaviour, you would need the above code anyhow. So they are two separate issues.

@jorisvandenbossche
Member

@cpcloud Another illustration of the possible difficulties with the mapping between the pandas metadata and the actual field names in the schema: currently this mapping is possible in pyarrow because it assumes that the index columns are the last ones in the schema.
However, this assumption breaks e.g. with files written by fastparquet, because they do not put index levels at the end, but at the beginning:

In [79]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [.1, .2, .3], 'c':pd.date_range("2017-01-01", periods=3, tz='Europe/Brussels')}, index=pd.Index(['A', 'B', 'C'], name='index'))

In [80]: df
Out[80]: 
       a    b                         c
index                                  
A      1  0.1 2017-01-01 00:00:00+01:00
B      2  0.2 2017-01-02 00:00:00+01:00
C      3  0.3 2017-01-03 00:00:00+01:00

In [81]: fastparquet.write("__test_index_fastparquet-dev.parquet", df)
/home/joris/miniconda3/envs/pyarrow-dev/lib/python3.6/site-packages/fastparquet/writer.py:136: UserWarning: Coercing datetimes to UTC
  warnings.warn('Coercing datetimes to UTC')

In [82]: pq.read_table("__test_index_fastparquet-dev.parquet")
Out[82]: 
pyarrow.Table
index: string
a: int64
b: double
c: timestamp[us]
metadata
--------
{b'pandas': b'{"columns": [{"metadata": null, "name": "index", "numpy_type": "'
            b'object", "pandas_type": "unicode"}, {"metadata": null, "name": "'
            b'a", "numpy_type": "int64", "pandas_type": "int64"}, {"metadata":'
            b' null, "name": "b", "numpy_type": "float64", "pandas_type": "flo'
            b'at64"}, {"metadata": {"timezone": "Europe/Brussels"}, "name": "c'
            b'", "numpy_type": "datetime64[ns, Europe/Brussels]", "pandas_type'
            b'": "datetimetz"}], "index_columns": ["index"], "pandas_version":'
            b' "0.22.0.dev0+260.g5da3759"}'}

In [83]: pq.read_table("__test_index_fastparquet-dev.parquet").to_pandas()
...
ArrowNotImplementedError: No cast implemented from string to timestamp[ns, tz=Europe/Brussels]

Of course, this could (and maybe should) be fixed in fastparquet (they still have to update with regard to the index name change as well: dask/fastparquet#251). And once fixed, this again will not be an impossible problem. However, this assumption was until now not written down in the pandas metadata specification (which is what you are updating now), so they are not to blame. But to me it shows a bit the brittleness of this assumption.

@TomAugspurger
Contributor

I'm looking into this on the fastparquet / dask side.

But if you would have a field_name entry in the pandas metadata column entries

Joris' suggestion seems sensible to me.

@jorisvandenbossche
Member

Whatever the decision for this discussion, this PR should close #16391 (where Wes proposes an additional name entry for the index to disambiguate with column names)

@cpcloud
Member Author

cpcloud commented Dec 6, 2017

@jorisvandenbossche @TomAugspurger Thanks for commenting. I agree adding 'field_name' will be helpful and will allow writing index metadata that fastparquet can read as well. PR is in the works for arrow. I'll also update this PR with the changes.

@TomAugspurger
Contributor

TomAugspurger commented Dec 6, 2017 via email

@jorisvandenbossche
Member

Thanks! Happy to review when there is a PR.

Regarding my other point (the second bullet point here: #18201 (comment)) about always using __index_level_n__ as the index column name in the schema:
I was thinking a bit more about it, and there may be an alternative that does not bring back the bug fixed in ARROW-1754: use __index_level_n__ only if the index name is None or already exists in the column names (both conditions should be easy to check). A rough sketch of that rule follows below.
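
A minimal sketch of that naming rule (the helper name and signature are mine, purely for illustration; this is not pyarrow's actual implementation):

def index_field_name(level, name, column_names):
    # Use the real index name when it is usable; fall back to the generated
    # __index_level_n__ name when it is None or clashes with a column name
    # (the case fixed by ARROW-1754).
    if name is None or name in column_names:
        return '__index_level_{}__'.format(level)
    return name

# A named, non-clashing index level keeps its name ...
index_field_name(0, 'date', ['a', 'b'])    # -> 'date'
# ... while an unnamed or conflicting level gets the dunder name.
index_field_name(0, None, ['a', 'b'])      # -> '__index_level_0__'
index_field_name(0, 'a', ['a', 'b'])       # -> '__index_level_0__'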

The reason I argue for this is the following: as long as one uses to_parquet/read_parquet to roundtrip pandas dataframes, it does not really matter how the index columns are stored in the parquet file. But if one also wants to use those files in other systems, I can imagine that names like __index_level_n__ can be rather annoying, certainly if your actual dataframe had a sensible index name.
And we will write index columns by default in to_parquet (see the in-progress PR #18629; until now only default integer range indexes were supported), so this will become more common.

I understand that the 0.8.0 release is close, so you might not want to reconsider this at this point. But I can certainly do a PR with the change to help out (I think it should be a relatively straightforward patch, just as apache/arrow#1271 was also not that big a patch).

@wesm
Member

wesm commented Dec 6, 2017

Here's the PR apache/arrow#1397

@TomAugspurger
Contributor

This is unrelated to the changes here, but I noticed pyarrow uses "string" for the "pandas_type":

>>> import io, json
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> df = pd.DataFrame({"A": [1, 2]})
>>> sink = io.BytesIO()
>>> pq.write_metadata(pa.Table.from_pandas(df).schema, sink)
>>> json.loads(pq.read_metadata(sink).metadata[b'pandas'].decode('utf-8'))['column_indexes']
[{'metadata': None,
  'name': None,
  'numpy_type': 'object',
  'pandas_type': 'string'}]

Though the spec says just 'unicode' or 'bytes' for the String type. Could we add 'string' as an alias for 'unicode'?

@cpcloud
Member Author

cpcloud commented Dec 7, 2017

@TomAugspurger That only appears in the 'column_indexes' metadata, because that doesn't go through arrow schemas, so it's a bug. This was added between 0.7.1 and the upcoming 0.8.0, so we can change it without any backward-compatibility consequences.

@cpcloud
Member Author

cpcloud commented Dec 7, 2017

@jorisvandenbossche @TomAugspurger Thanks for ironing this stuff out, it's very helpful to have a third and fourth pair of eyes on this code.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.21.1, 0.22.0 Dec 7, 2017
@wesm
Member

wesm commented Dec 9, 2017

We could always use guids for the index column names in the Parquet schema instead of the generated dunder-names (and use the index column metadata to list the guids corresponding to each index level).

I think this is the last blocking issue for Arrow 0.8.0 -- there are a couple other small items I'm going to address today/tomorrow

@cpcloud
Member Author

cpcloud commented Dec 9, 2017

I took the guid approach back when I first did this and it wasn't a viable solution because there are new guids generated for every dataframe. It becomes really difficult to do something like a concat operation under this scheme. You need reliable and reproducible names for the index columns.
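
A quick illustration of the problem, using a purely hypothetical guid-based naming scheme (not what pyarrow actually does):

import uuid
import pandas as pd

def guid_index_field_names(df):
    # One fresh guid per index level: fine within a single write, but a
    # different name is produced for every DataFrame/file.
    return ['index_{}'.format(uuid.uuid4().hex) for _ in range(df.index.nlevels)]

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})
# The same logical index level ends up with a different field name in each
# file, so schemas from two files never line up for operations like concat.
print(guid_index_field_names(df1))
print(guid_index_field_names(df2))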

@wesm
Member

wesm commented Dec 9, 2017

Ah, good point, since you need to be deterministic across files

@jorisvandenbossche
Member

If unique column names are not required in parquet and/or arrow, I think the is_index flag is a good solution.
You will still run into trouble when having to process specific columns for their metadata and that column's field name is not unique, but that is not something we are going to solve with a better specification (unless we require unique column names).

@jorisvandenbossche
Member

Regarding preserving the index name where possible: I quickly tried this and opened apache/arrow#1408, just as a proof of concept that this is IMO not difficult to do while also keeping the bug fix of ARROW-1754.

@TomAugspurger
Contributor

What's the recommended way for determining whether df.index.name was originally None? Special case __index_level_0__?

@jorisvandenbossche
Member

The actual name is stored in the metadata (name vs field_name) and can be null I think?
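
For illustration, a minimal sketch of recovering the original index names, assuming 0.8.0-style metadata where index levels appear in 'columns' with a 'field_name' entry (as in the examples below):

def original_index_names(pandas_metadata):
    # Map each schema field name back to the original pandas name; 'name'
    # is null/None for an unnamed index level.
    by_field = {c['field_name']: c['name'] for c in pandas_metadata['columns']}
    return [by_field[field] for field in pandas_metadata['index_columns']]

# With the 0.8.0 metadata shown below this returns [None], i.e. an unnamed index.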

@TomAugspurger
Contributor

Ah, it wasn't null as of 0.7.1:

In [13]: import json; import pandas as pd; import pyarrow as pa; import pyarrow.parquet as pq

In [14]: df = pd.DataFrame({"x": [1, 2]})

In [15]: pq.write_table(pa.Table.from_pandas(df), '/tmp/a.parq')

In [16]: json.loads(pq.read_metadata('/tmp/a.parq').metadata[b'pandas'])
Out[16]:
{'columns': [{'metadata': None,
   'name': 'x',
   'numpy_type': 'int64',
   'pandas_type': 'int64'},
  {'metadata': None,
   'name': '__index_level_0__',
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': ['__index_level_0__'],
 'pandas_version': '0.21.0'}

But it is null in 0.8.0:

In [8]: import json; import pandas as pd; import pyarrow as pa; import pyarrow.parquet as pq

In [9]: df = pd.DataFrame({"x": [1, 2]})

In [10]: pq.write_table(pa.Table.from_pandas(df), '/tmp/b.parq')

In [11]: json.loads(pq.read_metadata('/tmp/b.parq').metadata[b'pandas'])
Out[11]:
{'column_indexes': [{'field_name': None,
   'metadata': {'encoding': 'UTF-8'},
   'name': None,
   'numpy_type': 'object',
   'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'x',
   'metadata': None,
   'name': 'x',
   'numpy_type': 'int64',
   'pandas_type': 'int64'},
  {'field_name': '__index_level_0__',
   'metadata': None,
   'name': None,
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': ['__index_level_0__'],
 'pandas_version': '0.21.0'}

In [12]: pa.__version__
Out[12]: '0.8.0'

@TomAugspurger
Contributor

https://issues.apache.org/jira/browse/ARROW-1941 was just opened, and I noticed that the pyarrow pandas_type metadata for nested types looks like list[float64]:

metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
            b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
            b' [{"name": "c1", "field_name": "c1", "pandas_type": "list[unicod'
            b'e]", "numpy_type": "object", "metadata": null}, {"name": "c2", "'
            b'field_name": "c2", "pandas_type": "list[float64]", "numpy_type":'
            b' "object", "metadata": null}, {"name": null, "field_name": "__in'
            b'dex_level_0__", "pandas_type": "int64", "numpy_type": "int64", "'
            b'metadata": null}], "pandas_version": "0.21.1"}'}

I don't believe that nested types are mentioned anywhere for pandas_type. We should update that as well (either here or in a new issue).

@wesm
Member

wesm commented Dec 26, 2017

wesm pushed a commit to apache/arrow that referenced this pull request Feb 2, 2018: alternative fix for duplicate index/column name that preserves index name if available

Related to the discussion about the pandas metadata specification in pandas-dev/pandas#18201, and an alternative to #1271.

I'm not opening this PR because it should necessarily be merged; I just want to show that it is not that difficult to both fix [ARROW-1754](https://issues.apache.org/jira/browse/ARROW-1754) and preserve index names as field names when possible (this was mentioned in pandas-dev/pandas#18201 as the reason for making the change to not preserve index names).
The diff is partly a revert of #1271, but then adapted to the current codebase.

Main reasons I prefer to preserve index names: 1) usability in pyarrow itself (if you want to work with pyarrow Tables created from pandas), and 2) when interchanging parquet files with other people / other non-pandas systems, it would be much nicer not to have `__index_level_n__` column names if possible.

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes #1408 from jorisvandenbossche/index-names and squashes the following commits:

eef1d33 [Joris Van den Bossche] alternative fix for duplicate index/column name that preserves index name if available
@jorisvandenbossche
Member

@cpcloud can you update this? (Among other things, recent changes in apache/arrow#1408 affect this.)
I can also push some changes to this branch.

@jreback jreback modified the milestones: 0.23.0, 0.24.0 Mar 30, 2018
@jreback
Contributor

jreback commented Mar 30, 2018

@cpcloud update on this?

@cpcloud
Member Author

cpcloud commented Mar 30, 2018

@jorisvandenbossche Any chance you can submit a PR to my fork for the changes you have in mind?

@jreback
Contributor

jreback commented Nov 1, 2018

ping @jorisvandenbossche

@cpcloud can you rebase when you have a chance.

@jreback jreback removed this from the 0.24.0 milestone Nov 6, 2018
Member

@datapythonista datapythonista left a comment


I think we can merge this, and make any further updates in a separate PR. There is just one required change, as we are not using the right directive for the example.


Here's an example of how the index metadata is structured in pyarrow:

.. code-block:: python

Suggested change
.. code-block:: python
.. ipython:: python

@jreback
Contributor

jreback commented Dec 3, 2018

@datapythonista that is fine. or if @cpcloud can update would be great.

@datapythonista datapythonista self-assigned this Dec 3, 2018
Member

@datapythonista datapythonista left a comment


I double checked, and the code block is correct, as it doesn't show any output, and it's actually not expected to run (some variables are not initialized).

lgtm then

@jreback can we merge this then?

@jreback
Contributor

jreback commented Dec 7, 2018

sure

@datapythonista datapythonista merged commit c911151 into pandas-dev:master Dec 7, 2018
@datapythonista
Member

thanks @cpcloud

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019