
ENH: support non default indexes in writing to Parquet #18629

Merged

Conversation

@dhirschfeld (Contributor) commented Dec 4, 2017

@jorisvandenbossche (Member)

We need to check how this interacts with the columns keyword when reading back in. Ideally selecting only a few columns to read will still preserve your index (I think?), but I guess this is currently not happening (at least with pyarrow).

codecov bot commented Dec 4, 2017

Codecov Report

Merging #18629 into master will decrease coverage by 0.01%.
The diff coverage is 90.9%.


@@            Coverage Diff             @@
##           master   #18629      +/-   ##
==========================================
- Coverage   91.59%   91.58%   -0.02%     
==========================================
  Files         155      155              
  Lines       51255    51260       +5     
==========================================
- Hits        46949    46945       -4     
- Misses       4306     4315       +9
Flag        Coverage         Δ
#multiple   89.44% <90.9%>   (ø) ⬆️
#single     40.66% <0%>      (-0.12%) ⬇️

Impacted Files                     Coverage         Δ
pandas/io/parquet.py               66.26% <90.9%>   (+0.88%) ⬆️
pandas/io/gbq.py                   25% <0%>         (-58.34%) ⬇️
pandas/core/frame.py               97.81% <0%>      (-0.1%) ⬇️
pandas/core/indexes/datetimes.py   95.68% <0%>      (+0.09%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 2c903d5...a14edbe.

codecov bot commented Dec 4, 2017

Codecov Report

Merging #18629 into master will decrease coverage by 0.03%.
The diff coverage is 53.19%.


@@            Coverage Diff             @@
##           master   #18629      +/-   ##
==========================================
- Coverage    91.6%   91.57%   -0.04%     
==========================================
  Files         153      153              
  Lines       51317    51363      +46     
==========================================
+ Hits        47011    47034      +23     
- Misses       4306     4329      +23
Flag        Coverage          Δ
#multiple   89.43% <53.19%>   (-0.02%) ⬇️
#single     40.73% <17.02%>   (-0.1%) ⬇️

Impacted Files                     Coverage          Δ
pandas/io/parquet.py               60.19% <53.19%>   (-4.37%) ⬇️
pandas/io/gbq.py                   25% <0%>          (-58.34%) ⬇️
pandas/plotting/_converter.py      64.78% <0%>       (-1.74%) ⬇️
pandas/core/frame.py               97.81% <0%>       (-0.1%) ⬇️
pandas/core/generic.py             95.9% <0%>        (ø) ⬆️
pandas/core/indexes/interval.py    93.8% <0%>        (ø) ⬆️
pandas/core/indexes/base.py        96.43% <0%>       (ø) ⬆️
pandas/core/indexes/multi.py       96.29% <0%>       (+0.01%) ⬆️
pandas/core/indexes/datetimes.py   95.7% <0%>        (+0.01%) ⬆️
... and 8 more
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 37086a0...0fbdc14.

"to make the index into column(s)".format(
type(df.index)))

if not df.index.equals(RangeIndex.from_range(range(len(df)))):
Member

Could create the RangeIndex via RangeIndex(len(df)) here to make this a little cleaner.
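For reference, the two spellings construct the same default index, so the shorter form is a drop-in replacement:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
# both construct RangeIndex(start=0, stop=len(df), step=1)
df.index.equals(pd.RangeIndex(len(df)))                    # True
df.index.equals(pd.RangeIndex.from_range(range(len(df))))  # True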

Contributor Author

Done.

@jreback (Contributor) left a comment

lgtm. some changes; pls. add a bug fix note in 0.22.0

        if df.index.name is not None:
            raise ValueError("parquet does not serialize index meta-data on a "
                             "default index")
        # *unless* we're using pyarrow >= 0.7.1 which does support multi-indexes
Contributor

so it might be better (though a bit more refactoring), to move the validation of index and validation of columns into methods on a base class to the impl

e.g.

class BaseImpl(object):

    def validate_index(....):
        ..

    def validate_columns(....):
        ....


class PyArrowImpl(BaseImpl):

    def write(...):
        if pyarrow < 0.7.1:
            self.validate_index()
        self.validate_columns()


class FastParquetImpl(BaseImpl):
    .....

as it makes things cleaner.
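A slightly fleshed-out version of that sketch (the `_pyarrow_version` attribute and the method bodies are illustrative placeholders, not pandas' final code):

from distutils.version import LooseVersion


class BaseImpl(object):
    # validation shared by both engine implementations

    def validate_index(self, df):
        # placeholder: raise if the index cannot be serialized
        pass

    def validate_columns(self, df):
        # placeholder: raise if the column names are not all strings
        pass


class PyArrowImpl(BaseImpl):

    def write(self, df, path, **kwargs):
        # `self._pyarrow_version` is an assumed attribute holding the
        # detected pyarrow version
        if self._pyarrow_version < LooseVersion('0.7.1'):
            self.validate_index(df)
        self.validate_columns(df)
        # ... hand off to pyarrow ...


class FastParquetImpl(BaseImpl):

    def write(self, df, path, **kwargs):
        self.validate_index(df)
        self.validate_columns(df)
        # ... hand off to fastparquet ...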

Contributor Author

Refactored in the latest commits...

df.index = index
self.check_round_trip(df, engine)

# index with meta-data
Contributor

could factor these 2 tests here from pyarrow impl (and also from fastparquet) into another test (that succeeds on all impls and versions)

Contributor Author

Since the behaviour changes based on the library and its version, I'm not sure how else to do it while maintaining test coverage across all versions.

Feel free to push a fix if you'd like as I'm knocking off for the evening shortly...

@jreback (Contributor) left a comment

minor comments. if you can refactor the tests a bit ok (if not no big deal). ping on green.

        self._validate_index(df)

    def write(self, df, path, compression, **kwargs):
        raise NotImplementedError()
Contributor

import and use pandas.core.common.AbstractMethodError here instead
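The suggested replacement, roughly: AbstractMethodError (in pandas.core.common at the time) names the concrete subclass in its message, which a bare NotImplementedError does not:

from pandas.core.common import AbstractMethodError


class BaseImpl(object):

    def write(self, df, path, compression, **kwargs):
        # raising with the instance lets the error name the concrete
        # engine implementation that is missing the override
        raise AbstractMethodError(self)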

@@ -77,6 +77,7 @@ Other Enhancements
- :func:`Series.fillna` now accepts a Series or a dict as a ``value`` for a categorical dtype (:issue:`17033`)
- :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`)
- Improved wording of ``ValueError`` raised in :func:`read_csv` when the ``usecols`` argument cannot match all columns. (:issue:`17301`)
- Enabled the use of non-default indexes in ``to_parquet`` with pyarrow>=0.7.0 (:issue:`18581`)
Contributor

use :func:`DataFrame.to_parquet` here, use double-backticks around pyarrow>=0.7.0

@jorisvandenbossche (Member) left a comment

fastparquet also supports writing the index, so no need to still raise the errors in that case.

Also, to repeat my comment from before: we need to decide how this interacts with columns keyword before merging this.

@dhirschfeld (Contributor Author)

fastparquet also supports writing the index, so no need to still raise the errors in that case.

It seems it might be easier all-round to just bump the pyarrow/fastparquet deps to the minimum required to support non-default indices - i.e. >=0.7.0 for pyarrow and ??? for fastparquet

@jreback (Contributor) commented Dec 5, 2017

would be ok with bumping pyarrow dep
i think fp is ok as is (i bumped it when doing this originally)

pep8speaks commented Dec 6, 2017

Hello @dhirschfeld! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on December 11, 2017 at 17:00 Hours UTC

@dhirschfeld dhirschfeld force-pushed the parquet-non-default-indexes branch 2 times, most recently from 0f7007a to 7b25fbe Compare December 6, 2017 12:12
@jorisvandenbossche jorisvandenbossche changed the title Parquet non default indexes ENH: support non default indexes in writing to Parquet Dec 6, 2017
@jreback (Contributor) left a comment

doc changes, ping on green.

@@ -8,4 +8,4 @@ echo "install 35"
conda remove -n pandas python-dateutil --force
pip install python-dateutil

Contributor

can you check if we have pyarrow installed in ci/* anywhere else (and if so update the pinned versions)

Contributor Author

I checked and changed any pinned versions to the new minimum. If it wasn't already pinned I left it as is, since a normal install will pull in later versions than the minimum.

Contributor Author

[screenshot]

@@ -163,7 +163,7 @@ Other Enhancements
- :func:`pandas.read_clipboard` updated to use qtpy, falling back to PyQt5 and then PyQt4, adding compatibility with Python3 and multiple python-qt bindings (:issue:`17722`)
- Improved wording of ``ValueError`` raised in :func:`read_csv` when the ``usecols`` argument cannot match all columns. (:issue:`17301`)
- :func:`DataFrame.corrwith` now silently drops non-numeric columns when passed a Series. Before, an exception was raised (:issue:`18570`).

- Enabled the use of non-default indexes in :func:`DataFrame.to_parquet` where the underlying engine supports it (:issue:`18581`)
Contributor

also there is another section on changed deps, pls add another entry which says pyarrow is now min 0.7.0 for parquet (also may need to update in io.rst)

@jorisvandenbossche (Member)

@jreback don't merge on green - see my request for changes, to which I got no reply yet

BTW, I think we can do this for the 0.21.1 milestone. The other enhancements to to/read_parquet also went into that one (seeing it as fixes on very new functionality)

@dhirschfeld (Contributor Author) commented Dec 7, 2017

@jorisvandenbossche - as of fastparquet=0.1.3 it doesn't appear to support multi-indexes

[screenshot]

...at least reading them anyway. Interestingly, it seems to write them fine and the file written by fastparquet can then be read in perfectly by pyarrow.

Not sure what to do about that but since fastparquet throws a sensible error I'd be inclined to just leave it and a future update will likely add the missing functionality.

from numpy.random import randn
import pandas as pd

dates = pd.date_range('01-Jan-2018', '01-Dec-2018', freq='MS')
df = pd.DataFrame(randn(2*len(dates), 3), columns=list('ABC'))
index = pd.MultiIndex.from_product([['Level1', 'Level2'], dates])
df.index = index

df.to_parquet('tmp_pa.pq', engine='pyarrow')
df.to_parquet('tmp_fp.pq', engine='fastparquet', compression='UNCOMPRESSED')

df1 = pd.io.parquet.read_parquet('tmp_pa.pq', engine='pyarrow')
df1.equals(df)

@dhirschfeld (Contributor Author)

@jorisvandenbossche - you're right about the columns argument not preserving the index:

[screenshot]

The workaround, at the cost of efficiency, is simple enough:

pd.io.parquet.read_parquet('tmp_pa.pq', engine='pyarrow')[['A','B']]

...but you have to know beforehand that the file was saved with a multi-index - i.e. it's a silent failure mode which isn't ideal.

If I name the index levels you can actually pull them out by name:

df = pd.DataFrame(randn(2*len(dates), 3), columns=list('ABC'))
index = pd.MultiIndex.from_product([['Level1', 'Level2'], dates], names=['level','date'])
df.index = index

[screenshot]

pd.io.parquet.read_parquet('tmp_pa.pq', engine='pyarrow', columns=['A','B','level','date']).head()

[screenshot]

...so ideally, if pandas can tell that the file has a multi-index, the index names could always be prepended to the columns argument.

@dhirschfeld (Contributor Author)

@jorisvandenbossche - were there other changes you wanted that I've missed?

@jorisvandenbossche (Member) commented Dec 7, 2017

Not sure what to do about that but since fastparquet throws a sensible error I'd be inclined to just leave it and a future update will likely add the missing functionality.

It's a bit unfortunate that it only comes up when reading back in, so either we are OK with that, or we need to raise on writing MultiIndexes with fastparquet.

If I name the index levels you can actually pull them out by name:

That will not always work, e.g. if the levels have no names. pyarrow master has also changed this: the index columns will no longer carry the level names but follow an __index_level_n__ pattern, so the workaround you show will not work anymore with the new release.

The thing we can do is the following:

In [85]: import pyarrow.parquet as pq

In [86]: f = pq.ParquetFile("__test_mi.parquet")

In [87]: f.metadata.metadata[b'pandas']
Out[87]: b'{"columns": [{"name": "A", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "B", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "C", "pandas_type": "float64", "numpy_type": "float64", "metadata": null}, {"name": "__index_level_0__", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "__index_level_1__", "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}], "index_columns": ["__index_level_0__", "__index_level_1__"], "pandas_version": "0.22.0.dev0+288.g6b6cfb8"}'

In [88]: import json

In [89]: pandas_metadata = json.loads(f.metadata.metadata[b'pandas'].decode('utf8'))

In [90]: pandas_metadata['index_columns']
Out[90]: ['__index_level_0__', '__index_level_1__']

and then add those to the columns list.
And something similar for fastparquet:

In [98]: import fastparquet

In [99]: f = fastparquet.ParquetFile("__test_mi.parquet")

In [100]: json.loads(f.key_value_metadata['pandas'])['index_columns']
Out[100]: ['__index_level_0__', '__index_level_1__']

@dhirschfeld (Contributor Author)

Heh, beat me to it:

import pyarrow.parquet as pq
filemetadata = pq.read_metadata('tmp_fp.pq')
metadata = eval(filemetadata.metadata.get(b'pandas'), None, dict(null=None))
metadata['index_columns']

...eval is pretty nasty!
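The safer equivalent with json.loads (the same approach as in the snippets above):

import json
import pyarrow.parquet as pq

filemetadata = pq.read_metadata('tmp_fp.pq')
# json natively handles the embedded nulls that the eval
# workaround had to patch around
metadata = json.loads(filemetadata.metadata[b'pandas'].decode('utf8'))
metadata['index_columns']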

@jorisvandenbossche (Member)

I also think that if there is a default index (a range-like index without a name, the case that was already allowed), we should not write it to the parquet file, because that would just be a waste of time and space.
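A minimal sketch of that idea, assuming pyarrow's Table.from_pandas supports the preserve_index keyword (it does in recent versions); the default-index check mirrors the one used earlier in this PR:

import pandas as pd
import pyarrow as pa


def _to_arrow_table(df):
    # a default RangeIndex carries no information beyond the row
    # count, so skip serializing it entirely
    has_default_index = (df.index.name is None
                         and df.index.equals(pd.RangeIndex(len(df))))
    return pa.Table.from_pandas(df, preserve_index=not has_default_index)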

@jorisvandenbossche (Member)

Yep, the json.loads is the better alternative :-)

@jreback jreback added this to the 0.22.0 milestone Dec 7, 2017
@jreback (Contributor) commented Dec 7, 2017

BTW, I think we can do this for the 0.21.1 milestone. The other enhancements to to/read_parquet also went into that one (seeing it as fixes on very new functionality)

since we are bumping the version I don't think this is a good idea. marked for 0.22

I am ok with this fix, @jorisvandenbossche

@dhirschfeld (Contributor Author)

For pyarrow the columns fix is as simple as setting use_pandas_metadata=True:

    def read(self, path, columns=None, **kwargs):
        path, _, _ = get_filepath_or_buffer(path)
        parquet_file = self.api.parquet.ParquetFile(path)
        kwargs['use_pandas_metadata'] = True
        return parquet_file.read(columns=columns, **kwargs).to_pandas()

For fastparquet I would use:

    def read(self, path, columns=None, **kwargs):
        path, _, _ = get_filepath_or_buffer(path)
        parquet_file = self.api.ParquetFile(path)
        metadata = json.loads(parquet_file.key_value_metadata['pandas'])
        if columns is not None:
            columns = set(chain(columns, metadata['index_columns']))
        return parquet_file.to_pandas(columns=columns, **kwargs)

...but since it doesn't support multi-indexes anyway it's a moot point.

@@ -6,6 +6,7 @@
from warnings import catch_warnings

import numpy as np
from numpy.random import randn
Contributor

we don't do the import like this anywhere else, appreciate changing it

        df.columns = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)])
        self.check_error_on_write(df, engine, ValueError)

    def test_multiindex_with_columns(self, engine):
Contributor

you can just pass pa instead of engine to skip fp
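For context, a hedged sketch of what that means with pytest fixtures (the fixture definition here is illustrative; pandas' actual test module defines its own):

import pytest


@pytest.fixture
def pa():
    # pyarrow-only engine fixture; skips when pyarrow is unavailable
    pytest.importorskip('pyarrow')
    return 'pyarrow'


def test_multiindex_with_columns(pa):
    # requesting `pa` instead of the general `engine` fixture runs the
    # test only against pyarrow, skipping fastparquet
    engine = pa
    # ... test body as before, using `engine` ...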

@jorisvandenbossche (Member) left a comment

@dhirschfeld Thanks for the update. I think there are still some leftovers to clean up from undoing the removal of support for older versions, but I can also put some time into it this evening if needed.

        self.validate_dataframe(df)
        if self._pyarrow_lt_070:
            self._validate_write_lt_070(
                df, path, compression, coerce_timestamps, **kwargs
Member

Is there a reason you pass path, compression, coerce_timestamps here?

Contributor Author

Just a leftover from a refactoring. They should be removed as they're no longer needed...

        if self._pyarrow_lt_060:
            table = self.api.Table.from_pandas(df, timestamps_to_ms=True)
            self.api.parquet.write_table(
                table, path, compression=compression, **kwargs)
Member

Do we need to keep this for older pyarrow versions?

        if not isinstance(df.index, Int64Index):
            msg = (
                "parquet does not support serializing {} for the index; "
                "you can .reset_index() to make the index into column(s)"
Member

maybe we should add here that they can also install latest pyarrow or fastparquet version (same for the others)
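An illustrative wording for that suggestion (not the merged text):

msg = ("parquet does not support serializing {} for the index; "
       "you can .reset_index() to make the index into column(s), "
       "or install pyarrow >= 0.7.0 or a recent fastparquet, "
       "which can serialize the index")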

class FastParquetImpl(object):

        parquet_file = self.api.parquet.ParquetFile(path)
        if self._pyarrow_lt_070:
            parquet_file.path = path
Member

I would just pass path to the _read_lt_070 function?

@jorisvandenbossche (Member)

@dhirschfeld sorry for taking over your PR, but I want to get this merged for 0.21.1, and didn't know if you would have time today/tomorrow.
I restored some of the testing code for the older versions, fixed some bugs to get those working, and addressed the comments from myself and @jreback.

But feel free to review my changes

@dhirschfeld (Contributor Author)

@jorisvandenbossche - that's fine, I'd only be able to look into it tonight so am happy for you to take the lead! I'm hoping to make use of it so am keen to get it in too...

@dhirschfeld (Contributor Author)

LGTM!

@jorisvandenbossche (Member)

And all CI is passing!

@jreback more comments?
(the compat for older versions is maybe not the cleanest, but it will be removed directly on master anyway... I think we want to do that?)

@jreback jreback added this to the 0.21.1 milestone Dec 11, 2017
@jreback (Contributor) left a comment

tiny style comments, otherwise lgtm.

    def validate_dataframe(df):
        if not isinstance(df, DataFrame):
            raise ValueError("to_parquet only supports IO with DataFrames")
        # must have value column names (strings only)
Contributor

can you add a blank line before comments, easier to read

@TomAugspurger TomAugspurger mentioned this pull request Dec 11, 2017
58 tasks
@jorisvandenbossche jorisvandenbossche merged commit 8d7e876 into pandas-dev:master Dec 11, 2017
@jorisvandenbossche (Member)

@dhirschfeld Thanks a lot for starting the PR!

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this pull request Dec 11, 2017

fastparquet automatically names an index 'index' if it doesn't
already have a name

(cherry picked from commit 8d7e876)
@dhirschfeld dhirschfeld deleted the parquet-non-default-indexes branch December 12, 2017 01:06
@dhirschfeld (Contributor Author)

Thanks @jorisvandenbossche for getting this over the line! I'm excited to make use of this new functionality! :)

TomAugspurger pushed a commit that referenced this pull request Dec 12, 2017
fastparquet automatically names an index 'index' if it doesn't
already have a name

(cherry picked from commit 8d7e876)
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018
Version 0.22.0

* tag 'v0.22.0': (777 commits)
  RLS: v0.22.0
  DOC: Fix min_count docstring (pandas-dev#19005)
  DOC: More 0.22.0 updates (pandas-dev#19002)
  TST: Remove pow test in expressions
  COMPAT: Avoid td.skip decorator
  DOC: 0.22.0 release docs (pandas-dev#18983)
  DOC: Include 0.22.0 whatsnew
  Breaking changes for sum / prod of empty / all-NA (pandas-dev#18921)
  ENH: Added a min_count keyword to stat funcs (pandas-dev#18876)
  RLS: v0.21.1
  DOC: Add date to whatsnew (pandas-dev#18740)
  DOC: Include 0.21.1 whatsnew
  DOC: Update relase notes (pandas-dev#18739)
  CFG: Ignore W503
  DOC: fix options table (pandas-dev#18730)
  ENH: support non default indexes in writing to Parquet (pandas-dev#18629)
  BUG: Fix to_latex with longtable (pandas-dev#17959) (pandas-dev#17960)
  Parquet: Add error message for no engine found (pandas-dev#18717)
  BUG: Categorical data fails to load from hdf when all columns are NaN (pandas-dev#18652)
  DOC: clean-up whatsnew file for 0.21.1 (pandas-dev#18690)
  ...
yarikoptic added a commit to neurodebian/pandas that referenced this pull request Feb 22, 2018

* releases: (777 commits; same list as above)
Successfully merging this pull request may close these issues:
Support non-default indexes in to_parquet