Support non-default indexes in to_parquet #18581

dhirschfeld · 2017-11-30T23:15:08Z

Calling to_parquet on a DataFrame with a non-default index results in the error below:

ValueError: parquet does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)

While you can work around this by calling reset_index() as the message says, doing so loses the information about which columns made up the index, so you can't round-trip a DataFrame with a non-default index.
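To make the limitation concrete, here is a minimal sketch of the workaround, assuming the caller records the index names by hand. The named index `key` and the variable `index_names` are illustrative and not part of the report. The Parquet write and read are left as a comment so the snippet runs without an engine installed.

```python
import pandas as pd

# Illustrative frame with a named, non-default index (the name "key" is an assumption).
df = pd.DataFrame({"A": list("abc")}, index=pd.Index([2, 3, 4], name="key"))

# reset_index() does not record which columns used to be the index,
# so that information has to be carried separately.
index_names = list(df.index.names)

# After this call the frame has a default RangeIndex and "key" is an ordinary column.
flat = df.reset_index()

# flat.to_parquet(...) followed by pd.read_parquet(...) would go here.

# Restoring the index only works because index_names was saved out of band.
restored = flat.set_index(index_names)
```

Without `index_names`, nothing in `flat` distinguishes the former index column from any other column, which is the round-trip problem described above.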

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 9.0.1
setuptools: 37.0.0
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.4.4


jreback · 2017-12-01T01:23:03Z

If I remove the checking code, the round trip works (this is with pyarrow 0.7.1):

In [2]: df = pd.DataFrame({'A': list('abc')}, index=[2, 3, 4])

In [3]: df
Out[3]: 
   A
2  a
3  b
4  c

In [4]: df.to_parquet('foo.parquet')

In [5]: pd.read_parquet('foo.parquet')
Out[5]: 
   A
2  a
3  b
4  c

diff --git a/pandas/io/parquet.py b/pandas/io/parquet.py
index 4a13d2c9d..b7cce3ae7 100644
--- a/pandas/io/parquet.py
+++ b/pandas/io/parquet.py
@@ -147,28 +147,6 @@ def to_parquet(df, path, engine='auto', compression='snappy', **kwargs):
 
     valid_types = {'string', 'unicode'}
 
-    # validate index
-    # --------------
-
-    # validate that we have only a default index
-    # raise on anything else as we don't serialize the index
-
-    if not isinstance(df.index, Int64Index):
-        raise ValueError("parquet does not support serializing {} "
-                         "for the index; you can .reset_index()"
-                         "to make the index into column(s)".format(
-                             type(df.index)))
-
-    if not df.index.equals(RangeIndex.from_range(range(len(df)))):
-        raise ValueError("parquet does not support serializing a "
-                         "non-default index for the index; you "
-                         "can .reset_index() to make the index "
-                         "into column(s)")
-
-    if df.index.name is not None:
-        raise ValueError("parquet does not serialize index meta-data on a "
-                         "default index")
-
     # validate columns
     # ----------------

We support pyarrow >= 0.4.1. I don't remember exactly when index support was added (and it had a bug or two), but we could check conditionally, as we already have compat code for pyarrow < 0.5.0 and < 0.6.0 for other items. Alternatively, bumping the minimum to 0.6.0 is OK too.
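A rough sketch of what such a conditional check could look like, using only the standard library. The helper names `parse_version` and `engine_supports_index` are hypothetical. The `0.6.0` threshold is a placeholder taken from the suggestion above, not a confirmed version in which index support landed.

```python
def parse_version(version):
    """Turn a string such as '0.7.1' into (0, 7, 1); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


# ASSUMPTION: placeholder threshold; the thread does not say when index support was added.
HYPOTHETICAL_MIN_INDEX_VERSION = "0.6.0"


def engine_supports_index(installed_version, minimum=HYPOTHETICAL_MIN_INDEX_VERSION):
    """Return True if the installed engine version is at least the assumed minimum."""
    return parse_version(installed_version) >= parse_version(minimum)
```

With a helper like this, the index validation would run only when `engine_supports_index(pyarrow.__version__)` is false. Comparing tuples rather than raw strings keeps a version such as `0.10.0` ordered after `0.6.0`.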

@dhirschfeld would love a PR.

cc @cpcloud @wesm

dhirschfeld · 2017-12-01T02:21:50Z

Seems a simple fix! Will see about putting in a PR shortly...

jreback added the Compat, Difficulty Intermediate, Enhancement, and IO Parquet labels Dec 1, 2017

jreback added this to the Next Major Release milestone Dec 1, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 4, 2017

Allow non-default indexes in to_parquet.

30d85d8

...when supported by the underlying engine. Fixes pandas-dev#18581

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 4, 2017

Allow non-default indexes in to_parquet.

3ca913a

...when supported by the underlying engine. Fixes pandas-dev#18581

dhirschfeld mentioned this issue Dec 4, 2017

ENH: support non default indexes in writing to Parquet #18629

Merged

4 tasks

jreback modified the milestones: Next Major Release, 0.22.0 Dec 5, 2017

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 5, 2017

Added a whatsnew entry for issue pandas-dev#18581

9f16982

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 6, 2017

Allow non-default indexes in to_parquet.

0710646

...when supported by the underlying engine. Fixes pandas-dev#18581

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017

Allow non-default indexes in to_parquet.

5afb7e8

...when supported by the underlying engine. Fixes pandas-dev#18581

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017

Allow non-default indexes in to_parquet.

8529343

...when supported by the underlying engine. Fixes pandas-dev#18581

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017

Allow non-default indexes in to_parquet.

da6cc14

...when supported by the underlying engine. Fixes pandas-dev#18581

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017

Allow non-default indexes in to_parquet.

4bf7f56

...when supported by the underlying engine. Fixes pandas-dev#18581

jreback modified the milestones: 0.22.0, 0.21.1 Dec 11, 2017

jorisvandenbossche closed this as completed in #18629 Dec 11, 2017
