Support non-default indexes in to_parquet #18581

Closed
dhirschfeld opened this issue Nov 30, 2017 · 2 comments · Fixed by #18629
Labels
Compat pandas objects compatability with Numpy or Python functions Enhancement IO Parquet parquet, feather

@dhirschfeld
Contributor

Calling to_parquet on a DataFrame with a non-default index results in the error below:

ValueError: parquet does not support serializing a non-default index for the index; you can .reset_index() to make the index into column(s)

While you can work around this by calling reset_index() as the message says, doing so loses the information about which columns made up the index, so you can't round-trip a DataFrame with a non-default index.
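
To make the round-trip problem concrete, here is a minimal in-memory sketch of the reset_index() workaround. It does not call to_parquet at all; the index name "key" and the `index_cols` variable are hypothetical and only illustrate that the caller has to track the index columns separately.

```python
import pandas as pd

# A frame with a non-default, named index ("key" is a made-up name).
df = pd.DataFrame({"A": list("abc")}, index=pd.Index([2, 3, 4], name="key"))

# Workaround suggested by the error message: move the index into a column.
flat = df.reset_index()

# `flat` now has a default RangeIndex and the columns ["key", "A"].
# Nothing in `flat` records that "key" used to be the index, so the
# caller has to remember it out of band.
index_cols = ["key"]

# After writing and reading `flat`, the index must be restored by hand.
restored = flat.set_index(index_cols)
```

The manual set_index() step is what an index-aware to_parquet would make unnecessary.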

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: 3.2.5
pip: 9.0.1
setuptools: 37.0.0
Cython: 0.27.3
numpy: 1.13.3
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.4.4
@jreback
Contributor

jreback commented Dec 1, 2017

If I remove the index-checking code (this is with pyarrow 0.7.1), a non-default index round-trips correctly:

In [2]: df = pd.DataFrame({'A': list('abc')}, index=[2, 3, 4])

In [3]: df
Out[3]: 
   A
2  a
3  b
4  c

In [4]: df.to_parquet('foo.parquet')

In [5]: pd.read_parquet('foo.parquet')
Out[5]: 
   A
2  a
3  b
4  c
diff --git a/pandas/io/parquet.py b/pandas/io/parquet.py
index 4a13d2c9d..b7cce3ae7 100644
--- a/pandas/io/parquet.py
+++ b/pandas/io/parquet.py
@@ -147,28 +147,6 @@ def to_parquet(df, path, engine='auto', compression='snappy', **kwargs):
 
     valid_types = {'string', 'unicode'}
 
-    # validate index
-    # --------------
-
-    # validate that we have only a default index
-    # raise on anything else as we don't serialize the index
-
-    if not isinstance(df.index, Int64Index):
-        raise ValueError("parquet does not support serializing {} "
-                         "for the index; you can .reset_index()"
-                         "to make the index into column(s)".format(
-                             type(df.index)))
-
-    if not df.index.equals(RangeIndex.from_range(range(len(df)))):
-        raise ValueError("parquet does not support serializing a "
-                         "non-default index for the index; you "
-                         "can .reset_index() to make the index "
-                         "into column(s)")
-
-    if df.index.name is not None:
-        raise ValueError("parquet does not serialize index meta-data on a "
-                         "default index")
-
     # validate columns
     # ----------------
 

We support pyarrow >= 0.4.1. I don't remember exactly when index support was added (it had a bug or two), but we could check conditionally, as we already have compat code for pyarrow < 0.5.0 and < 0.6.0 for other items. Alternatively, bumping the minimum to 0.6.0 is OK too.
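
A rough sketch of what "check conditionally" could look like, using only the standard library. The names `version_tuple`, `engine_supports_index`, `validate_index` and the constant `ASSUMED_MIN_INDEX_VERSION` are hypothetical and not part of pandas; the 0.6.0 threshold is an assumption borrowed from the bump suggested above, since the thread does not say which pyarrow release first handled indexes correctly.

```python
import re

# Assumption: first pyarrow release whose index handling is trusted.
# The thread does not state this; 0.6.0 is only the bump floated above.
ASSUMED_MIN_INDEX_VERSION = "0.6.0"


def version_tuple(version):
    """Turn a string like '0.7.1' into (0, 7, 1), padded to three parts."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def engine_supports_index(installed, minimum=ASSUMED_MIN_INDEX_VERSION):
    """Return True if the installed engine version is at least `minimum`."""
    return version_tuple(installed) >= version_tuple(minimum)


def validate_index(index_is_default, installed):
    """Raise only when the index is non-default AND the engine is too old."""
    if index_is_default or engine_supports_index(installed):
        return
    raise ValueError("parquet does not support serializing a non-default "
                     "index with this engine version; you can "
                     ".reset_index() to make the index into column(s)")
```

With this shape, the existing error is kept for old engine versions while newer ones are allowed to serialize the index.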

@dhirschfeld would love a PR.

cc @cpcloud @wesm

@jreback jreback added Compat pandas objects compatability with Numpy or Python functions Difficulty Intermediate Enhancement IO Parquet parquet, feather labels Dec 1, 2017
@jreback jreback added this to the Next Major Release milestone Dec 1, 2017
@dhirschfeld
Contributor Author

Seems a simple fix! Will see about putting in a PR shortly...

dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 4, 2017
...when supported by the underlying engine.
Fixes pandas-dev#18581
@jreback jreback modified the milestones: Next Major Release, 0.22.0 Dec 5, 2017
dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 5, 2017
dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 6, 2017
...when supported by the underlying engine.
Fixes pandas-dev#18581
dhirschfeld added a commit to dhirschfeld/pandas that referenced this issue Dec 9, 2017
...when supported by the underlying engine.
Fixes pandas-dev#18581
@jreback jreback modified the milestones: 0.22.0, 0.21.1 Dec 11, 2017
2 participants