ENH: add to/from_parquet with pyarrow & fastparquet #15838

jreback · 2017-03-29T17:39:58Z

xref dask/dask#2127

TODO: these are fixed, waiting for release to update tests.

fastparquet: duplicate columns errors msg
pyarrow 0.3: passing dataframe with non-string object columns

This is a wrapper around pyarrow and fastparquet to provide seamless IO interop within pandas.

cc @wesm
cc @martindurant
cc @mrocklin

jreback · 2017-03-29T17:40:35Z

note that there are several specific tests that will need changing for pyarrow 0.3.0 (soon).

jreback · 2017-03-29T17:41:01Z

doc/source/io.rst

+
+.. versionadded:: 0.20.0
+
+Parquet provides a sharded binary columnar serialization for data frames. It is designed to make reading and writing data


pretty much a copy-paste of the feather section, so prob needs some updating.

It would be nice to explain to users what the difference between the two is

The main difference between Feather and Parquet is that Feather stores the data as-it's-in-memory whereas Parquet uses a variety of encoding and compression techniques to shrink the file size as much as possible while still maintaining good read performance.

Furthermore, Parquet adds metadata so that you can query the files efficiently ("predicate pushdown").

jreback · 2017-03-29T17:42:03Z

pandas/io/parquet.py

+    elif engine == 'fastparquet':
+        fastparquet = _try_import_fastparquet()
+
+        # thriftpy/protocol/compact.py:339:


cc @martindurant. from thriftpy

jreback · 2017-03-29T17:42:39Z

pandas/tests/io/test_parquet.py

+        df = pd.DataFrame({'a': pd.period_range('2013', freq='M', periods=3)})
+        self.check_error_on_write(df, pa, ArrowException)
+
+        # categorical


@wesm these last 3 pretty sure supported in pyarrow 0.3

I just created PARQUET-929 and PARQUET-930, I think a little bit of work there is needed

do these have links?

https://issues.apache.org/jira/browse/PARQUET-930 , and https://issues.apache.org/jira/browse/PARQUET-929

jreback · 2017-03-29T17:43:05Z

pandas/tests/io/test_parquet.py

+        self.check_error_on_write(df, pa, ArrowException)
+
+    def test_mixed(self, pa):
+        # mixed python objects are returned as None ATM


@wesm not sure if this is addressed pa 0.3

What would you expect this to do?

I would raise as this is not serializable. (could certainly do this at pandas level, but I think this is something that the impl library should do).

a mixed-dtype column needs to have something happen to it (via encoding or otherwise). But these are decisions that cannot be made automatically. IIRC fastparquet has the option to make these into JSON string. Not suggesting anything, just error checking / NotImplementedError for now.

Right, I agree this is buggy

In [4]: df = pd.DataFrame({'a': ['a', 1, 2.0]})

In [5]: df
Out[5]: 
   a
0  a
1  1
2  2

In [6]: pt = pa.Table.from_pandas(df)

In [7]: pt.to_pandas()
Out[7]: 
      a
0     a
1  None
2  None

https://issues.apache.org/jira/browse/ARROW-736
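The "raise instead of returning None" behaviour discussed above could be sketched roughly as follows. This is only an illustration, not what pandas or pyarrow do: `find_mixed_type_columns` and `check_serializable` are hypothetical helpers, and a plain dict of lists stands in for a DataFrame so the snippet has no dependencies.

```python
def find_mixed_type_columns(columns):
    """Return the names of columns whose non-null values have more than one Python type.

    `columns` maps a column name to a list of values (a stand-in for a DataFrame).
    """
    mixed = []
    for name, values in columns.items():
        kinds = {type(v) for v in values if v is not None}
        if len(kinds) > 1:
            mixed.append(name)
    return mixed


def check_serializable(columns):
    """Raise up front rather than silently writing unsupported values as None."""
    mixed = find_mixed_type_columns(columns)
    if mixed:
        raise NotImplementedError(
            "columns with mixed Python types are not supported: {}".format(mixed)
        )
```

With the frame from the session above, `check_serializable({'a': ['a', 1, 2.0]})` would raise, while a column holding only strings passes.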

jreback · 2017-03-29T17:43:20Z

pandas/tests/io/test_parquet.py

+
+        self.check_round_trip(df, fp)
+
+    @pytest.mark.skip(reason="not supported")


cc @martindurant ?

Correct, column names should be unique. Although the schema definition could contain multiple items with the same name, column chunks (the data) identify themselves by name, so I don't think this is doable.

yeah was trying to have helpful errors. ok I think both pyarrow and fastparquet should fail gracefully here then.

In [1]: df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('aaa'))

In [3]: df.to_parquet('foo', 'pyarrow')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/Users/jreback/pandas/pandas/core/common.py in _asarray_tuplesafe(values, dtype)
    398             result = np.empty(len(values), dtype=object)
--> 399             result[:] = values
    400         except ValueError:

ValueError: could not broadcast input array from shape (4,3) into shape (4)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-3-185ceaef9fe4> in <module>()
----> 1 df.to_parquet('foo', 'pyarrow')

/Users/jreback/pandas/pandas/core/frame.py in to_parquet(self, fname, engine, compression)
   1538         """
   1539         from pandas.io.parquet import to_parquet
-> 1540         to_parquet(self, fname, engine, compression=compression)
   1541 
   1542     @Substitution(header='Write out column names. If a list of string is given, \

/Users/jreback/pandas/pandas/io/parquet.py in to_parquet(df, path, engine, compression)
     97         from pyarrow import parquet as pq
     98 
---> 99         table = pyarrow.Table.from_pandas(df)
    100         pq.write_table(table, path, compression=compression)

ValueError: cannot copy sequence with size 3 to array axis with dimension 4

In [4]: df.to_parquet('foo', 'fastparquet')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-6b46c1abdc2f> in <module>()
----> 1 df.to_parquet('foo', 'fastparquet')

/Users/jreback/pandas/pandas/core/frame.py in to_parquet(self, fname, engine, compression)
   1538         """
   1539         from pandas.io.parquet import to_parquet
-> 1540         to_parquet(self, fname, engine, compression=compression)
   1541 
   1542     @Substitution(header='Write out column names. If a list of string is given, \

/Users/jreback/pandas/pandas/io/parquet.py in to_parquet(df, path, engine, compression)
    107         # Use tobytes() instead.
    108         with catch_warnings(record=True):
--> 109             fastparquet.write(path, df, compression=compression)
    110 
    111 

/Users/jreback/miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/writer.py in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times)
    747     fmd = make_metadata(data, has_nulls=has_nulls, ignore_columns=ignore,
    748                         fixed_text=fixed_text, object_encoding=object_encoding,
--> 749                         times=times)
    750 
    751     if file_scheme == 'simple':

/Users/jreback/miniconda3/envs/pandas/lib/python3.6/site-packages/fastparquet/writer.py in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times)
    608                                     object_encoding.get(column, None))
    609         fixed = None if fixed_text is None else fixed_text.get(column, None)
--> 610         if str(data[column].dtype) == 'category':
    611             se, type = find_type(data[column].cat.categories,
    612                                  fixed_text=fixed, object_encoding=oencoding)

/Users/jreback/pandas/pandas/core/generic.py in __getattr__(self, name)
   2888             if name in self._info_axis:
   2889                 return self[name]
-> 2890             return object.__getattribute__(self, name)
   2891 
   2892     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'dtype'
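Both tracebacks surface deep inside the engines. A check before handing the frame to either engine would give the "fail gracefully" behaviour asked for above. A minimal sketch, assuming a hypothetical helper `check_unique_columns` that receives the column labels (e.g. `list(df.columns)`); it is not part of pandas, pyarrow or fastparquet:

```python
from collections import Counter


def check_unique_columns(column_names):
    """Raise a clear error when column names are duplicated."""
    counts = Counter(column_names)
    duplicates = [name for name, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(
            "parquet requires unique column names; duplicated: {}".format(duplicates)
        )
```

For the frame built with `columns=list('aaa')`, this raises a `ValueError` naming `'a'` instead of the broadcasting or attribute errors shown above.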

jreback · 2017-03-29T17:43:28Z

pandas/tests/io/test_parquet.py

+        df = pd.DataFrame({'a': pd.period_range('2013', freq='M', periods=3)})
+        self.check_error_on_write(df, fp, ValueError)
+
+        # mixed


cc @martindurant

If a column has object dtype, there are a few encoding options. Most typical is UTF8 for strings, but 'infer' will guess from the values; the only way to encode mixed types like this would be JSON or BSON.

is this done at the user level?

Yes, the user can specify the "object_encoding" keyword to fastparquet.writer.write. A value of "infer" will default to JSON representation only for dicts and lists (as guessed from the first ten values) - perhaps it should be for all types that are not str or bytes?

I see you have lots of options in fastparquet.write. since I don't want to duplicate them at all! maybe just pass thru to you directly. going to think about this.

added passing thru kwargs, so this should be easy to test
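To make the 'infer' behaviour described above concrete, here is a rough sketch of guessing an encoding from the first ten values. It is an illustration only and does not reproduce fastparquet's actual logic: `infer_object_encoding` is a hypothetical function, and refusing mixed samples with an error is an assumption taken from the suggestion in this thread.

```python
def infer_object_encoding(values, sample_size=10):
    """Guess an encoding label for an object column from its first few values."""
    # Look only at the first `sample_size` values, ignoring nulls among them.
    sample = [v for v in list(values)[:sample_size] if v is not None]
    if not sample:
        raise ValueError("cannot infer an encoding from an empty or all-null sample")
    if all(isinstance(v, str) for v in sample):
        return "utf8"
    if all(isinstance(v, bytes) for v in sample):
        return "bytes"
    if all(isinstance(v, (dict, list)) for v in sample):
        return "json"
    # Mixed types: the caller has to pick an encoding explicitly.
    raise ValueError("mixed types in column; pass an explicit object_encoding")
```

With kwargs passed through, a user who hits the last branch could choose the encoding explicitly via the `object_encoding` keyword mentioned above.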

jreback · 2017-03-29T17:43:44Z

pandas/tests/io/test_parquet.py

+                                              tz='US/Eastern')})
+
+        # warns on the coercion
+        with catch_warnings(record=True):


cc @martindurant this warning is odd btw.

Parquet does not store timezone information (except maybe in specialized key-value metadata), so this would amount to a silent loss of data.

We should create a spec document for "Python metadata" (that we can implement in either library) to store as key-value metadata so that we can persist anything that does not fit into the format as is

The to_stata function warns when it writes things that won't load back the same way or when it's changing things - I think that'd also be very cool, and there could be a strict_mode=True that will raise exceptions instead.

Aside from this, it'd be very helpful from an interop perspective to know what things are parquet-native and thus readable from another application, vs what things are pandas-only.
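The warn-by-default, raise-in-strict-mode idea could be sketched like this. `LossyWriteWarning`, `report_lossy_columns` and the `strict_mode` flag are hypothetical names used for illustration; nothing here is an existing pandas API.

```python
import warnings


class LossyWriteWarning(UserWarning):
    """Hypothetical warning for values that will not round-trip unchanged."""


def report_lossy_columns(lossy_columns, strict_mode=False):
    """Warn (or raise, in strict mode) about columns that lose information on write."""
    if not lossy_columns:
        return
    message = "columns will not round-trip unchanged: {}".format(list(lossy_columns))
    if strict_mode:
        raise ValueError(message)
    warnings.warn(message, LossyWriteWarning, stacklevel=2)
```

A writer would collect the affected columns first (for example tz-aware datetimes whose timezone is dropped) and then call `report_lossy_columns(names, strict_mode=...)` once.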

wesm · 2017-03-29T19:07:44Z

pandas/io/parquet.py

+    elif engine == 'fastparquet':
+        fastparquet = _try_import_fastparquet()
+        pf = fastparquet.ParquetFile(path)
+        return pf.to_pandas()


you could encapsulate the details for both implementations in a FastparquetImpl and PyarrowImpl which have the same API
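A sketch of what that encapsulation might look like. The class names come from the comment above; `BaseImpl`, the lazy `api` property and the exact method signatures are assumptions. The engine calls are the ones already used in this diff (`pyarrow.Table.from_pandas`, `write_table`, `read_table`, `fastparquet.write`, `fastparquet.ParquetFile`).

```python
import importlib


class BaseImpl:
    """Common API; the engine module is imported on first use only."""

    module_name = None

    def __init__(self):
        self._api = None

    @property
    def api(self):
        if self._api is None:
            self._api = importlib.import_module(self.module_name)
        return self._api

    def write(self, df, path, compression="snappy", **kwargs):
        raise NotImplementedError

    def read(self, path, **kwargs):
        raise NotImplementedError


class PyarrowImpl(BaseImpl):
    module_name = "pyarrow.parquet"

    def write(self, df, path, compression="snappy", **kwargs):
        import pyarrow  # needed to build the Table

        table = pyarrow.Table.from_pandas(df)
        self.api.write_table(table, path, compression=compression, **kwargs)

    def read(self, path, **kwargs):
        return self.api.read_table(path, **kwargs).to_pandas()


class FastparquetImpl(BaseImpl):
    module_name = "fastparquet"

    def write(self, df, path, compression="snappy", **kwargs):
        self.api.write(path, df, compression=compression, **kwargs)

    def read(self, path, **kwargs):
        return self.api.ParquetFile(path).to_pandas(**kwargs)
```

The top-level `to_parquet` / `read_parquet` functions then only need to pick an implementation and call `impl.write(...)` or `impl.read(...)`.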

codecov · 2017-03-29T20:05:08Z

Codecov Report

Merging #15838 into master will decrease coverage by 0.02%.
The diff coverage is 68.75%.

@@            Coverage Diff             @@
##           master   #15838      +/-   ##
==========================================
- Coverage   91.02%   90.99%   -0.03%     
==========================================
  Files         161      162       +1     
  Lines       49414    49494      +80     
==========================================
+ Hits        44980    45039      +59     
- Misses       4434     4455      +21

Flag	Coverage Δ
#multiple	`88.77% <68.75%> (-0.01%)`	⬇️
#single	`40.26% <26.25%> (-0.09%)`	⬇️

Impacted Files	Coverage Δ
pandas/util/_print_versions.py	`15.71% <ø> (ø)`	⬆️
pandas/io/feather_format.py	`85.71% <ø> (ø)`	⬆️
pandas/core/frame.py	`97.66% <100%> (-0.1%)`	⬇️
pandas/io/api.py	`100% <100%> (ø)`	⬆️
pandas/core/config_init.py	`94.48% <100%> (+0.13%)`	⬆️
pandas/io/parquet.py	`65.75% <65.75%> (ø)`
pandas/io/gbq.py	`25% <0%> (-58.34%)`	⬇️
pandas/core/series.py	`95.04% <0%> (+0.09%)`	⬆️
... and 1 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

wesm · 2017-03-29T22:07:49Z

doc/source/install.rst

@@ -236,6 +236,7 @@ Optional Dependencies
 * `xarray <http://xarray.pydata.org>`__: pandas like handling for > 2 dims, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
 * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required, Version 3.2.1 or higher highly recommended.
 * `Feather Format <https://github.com/wesm/feather>`__: necessary for feather-based storage, version 0.3.1 or higher.
+* ``Parquet Format``, either `pyarrow <https://github.com/apache/parquet-cpp>`__ or `fastparquet <https://fastparquet.readthedocs.io/en/latest/necessary>`__ for parquet-based storage. The `snappy <https://pypi.python.org/pypi/python-snappy>`__ and `brotli <https://pypi.python.org/pypi/brotlipy>`__ are available for compression support.


The link is wrong for pyarrow

wesm · 2017-03-29T22:09:09Z

pandas/io/parquet.py

+        # we need to import on first use
+
+        try:
+            import pyarrow  # noqa


I might recommend

import pyarrow.parquet as pq
self.api = pq

updated, though I still need to keep a ref to pyarrow to construct the table. maybe you can suggest a better way.

sorted; I realized that pyarrow.parquet is an explicit import which is fine.

TomAugspurger · 2017-04-01T19:57:45Z

pandas/core/frame.py

+        Parameters
+        ----------
+        fname : str
+            string file path


I haven't played with either of the engines, but do they both share similar semantics on the path argument? Does it have to be a string, or can it be an open file object, or pathlib.Path? Can it be an s3 path?

In the case of fastparquet, this is anything that can be passed to open; and you can specify what function to open files with (open_with=), which must return a file-like; this is how you open with s3 etc., by passing S3FileSystem.open.

Only in dask can you supply something like 's3://user:pass@bucket/path', and get it parsed to pass the correct open_with automatically.

It might be useful for pandas to handle the conversion of a file path into a file-like object for semantic conformity. An exception would be unless a particular engine can do better with a local file path -- as an example, in pyarrow, we memory map local files which has generally better performance than Python file objects

Please note that parquet data-sets are not necessarily single-file, so I don't think it's a great idea to pass open files, local or otherwise.

On the other hand, from a Dask perspective it might be nice to one day rely entirely on a pandas.read_parquet function for chunk-wise logic. In this case we would want to hand pandas a file-like object and ask it to get us a few particular row groups from that object. If inconvenient I don't think we should worry about this use case near-term. I just thought I'd bring it up.

we handle path_or_buffers in the following way:

path-like objects (pathlib and py.local), we stringify

if a string

if its a url we turn this into a Bytes object (also handling gzip content encoding)

if its a s3 url we defer to s3fs for opening

else we would do things like expand_user

we can infer a compression from the filepath itself (we just pass this thru if it's found),
mainly useful for text files where we decompress.

file-like we pass thru

for csv reading we will handle the file io & encoding

all others we pass the string-path thru

So i don't see any reason to handle this differently. The IO engine gets to handle a fully qualified string path. (e.g. HDF5, excel, pickle, json) look all the same to pandas. The IO engine is in charge of opening closing the actual files.
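The dispatch rules listed above can be summarised in a small sketch. It only classifies the input and does not open anything; `normalize_path_or_buffer` is a hypothetical stand-in, not the real pandas helper, and the set of URL schemes is an assumption.

```python
import os
from urllib.parse import urlparse


def normalize_path_or_buffer(path_or_buffer):
    """Return (value, kind) where kind is 'buffer', 's3', 'url' or 'local'."""
    # File-like objects are passed through untouched.
    if hasattr(path_or_buffer, "read") or hasattr(path_or_buffer, "write"):
        return path_or_buffer, "buffer"
    # Path-like objects (e.g. pathlib.Path) are stringified.
    if isinstance(path_or_buffer, os.PathLike):
        path_or_buffer = os.fspath(path_or_buffer)
    if not isinstance(path_or_buffer, str):
        raise TypeError("expected a path, a URL or a file-like object")
    scheme = urlparse(path_or_buffer).scheme
    if scheme == "s3":
        return path_or_buffer, "s3"  # would be handed to s3fs
    if scheme in ("http", "https", "ftp"):
        return path_or_buffer, "url"  # would be fetched into an in-memory buffer
    # Plain local path: expand "~" and hand the string to the IO engine.
    return os.path.expanduser(path_or_buffer), "local"
```

The engine still receives a fully qualified string for local paths, which matches the "IO engine is in charge of opening and closing the files" position above.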

Please note that parquet data-sets are not necessarily single-file, so I don't think it's a great idea to pass open files, local or otherwise.

For this, there's a good argument that pandas should define a file system abstract interface that 3rd parties can implement. In practice in dask and pandas, this is already the case, but it may be worth defining with more formal rigor (as far as pandas is concerned at least) to help with API conformity. pandas doesn't really have a "plugin" API, but this is something to consider more and more as we try to be less monolithic

TomAugspurger · 2017-04-01T20:04:55Z

pandas/tests/io/test_parquet.py

+    def test_compression(self, engine, compression):
+
+        if compression == 'snappy':
+            try:


@jreback you can use importorskip, so I think

pytest.importorskip("snappy")

instead of the try / except. I don't know if they can be used in fixture functions to handle the above skipifs.

TomAugspurger · 2017-04-01T20:08:21Z

pandas/core/frame.py

+        ----------
+        fname : str
+            string file path
+        engine : parquet engine


Either here, or as a followup, we could add a config option to control the default reader / writer. This will be like io.excel.xls.writer, so io.parquet.writer/reader?

jreback · 2017-04-01T22:30:15Z

does parquet automatically infer compression from a filepath?

e.g. say foo.pq.gz, does this imply compression='gzip' or do you have to actually pass the parameter? (I can do either, because I actually DO infer this), just need to know the standard.

martindurant · 2017-04-01T22:53:42Z

No, compression is an internal thing, and can vary between columns/chunks. You can typically (e.g., from spark) have files that are name.parquet.gz and have gzip compression throughout, but that's not the spec.

jreback · 2017-04-02T00:00:13Z

No, compression is an internal thing, and can vary between columns/chunks. You can typically (e.g., from spark) have files that are name.parquet.gz and have gzip compression throughout, but that's not the spec.

ok, that's what I did already, the actual path/extension will not impact the compression argument which must be explicit.
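In other words, the path is never consulted when choosing a codec. A minimal sketch of that rule, assuming a hypothetical `resolve_compression` helper and the codec names listed in the docstring under review:

```python
SUPPORTED_COMPRESSION = (None, "gzip", "snappy", "brotli")


def resolve_compression(path, compression="snappy"):
    """Validate the explicit `compression` argument; `path` is deliberately ignored."""
    if compression not in SUPPORTED_COMPRESSION:
        raise ValueError(
            "compression must be one of {}, got {!r}".format(
                SUPPORTED_COMPRESSION, compression
            )
        )
    return compression
```

So `resolve_compression('foo.pq.gz')` still returns `'snappy'`; the `.gz` suffix has no effect.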

jorisvandenbossche

Some doc comments

jorisvandenbossche · 2017-04-02T20:27:31Z

doc/source/io.rst

+- Non supported types include ``Period`` and actual python object types. These will raise a helpful error message
+  on an attempt at serialization.
+
+See the documentation for `pyarrow <https://pyarrow.readthedocs.io/en/latest/` and `fastparquet <https://fastparquet.readthedocs.io/en/latest/necessary>`


Do those links need a __ after the `....<..>`?

also the fastparquet link is not valid. "necessary" needs to be removed I think

jorisvandenbossche · 2017-04-02T20:34:45Z

pandas/core/frame.py

+        ----------
+        fname : str
+            string file path
+        engine : parquet engine


engine : str

jorisvandenbossche · 2017-04-02T20:37:32Z

pandas/io/parquet.py

+
+def to_parquet(df, path, engine=None, compression=None, **kwargs):
+    """
+    Write a DataFrame to the pyarrow


"the pyarrow" -> "parquet" file

jorisvandenbossche · 2017-04-02T20:38:32Z

pandas/core/frame.py

+            string file path
+        engine : parquet engine
+            supported are {'pyarrow', 'fastparquet'}
+            if None, will use the option: io.parquet.engine


Can you use proper punctuation / capital letters here? (in online html docs, line breaks are gone, so this would read strange)

jorisvandenbossche · 2017-04-02T20:39:32Z

pandas/io/parquet.py

+    # raise on anything else as we don't serialize the index
+
+    if not isinstance(df.index, Int64Index):
+        raise ValueError("parquet does not serializing {} "


"does not support serializing" or "does not serialize"

(and same comment for the one below)

jorisvandenbossche · 2017-04-02T20:41:37Z

pandas/io/parquet.py

+
+    if not df.index.equals(RangeIndex.from_range(range(len(df)))):
+        raise ValueError("parquet does not serializing a non-default index "
+                         "for the index; you can .reset_index()"


space at the end of this line (same for error message above)

jorisvandenbossche · 2017-04-02T20:44:28Z

pandas/core/config_init.py

+: string
+    The default parquet reader/writer engine. Available options:
+    None, 'pyarrow', 'fastparquet'
+"""


What is the default?

Also, can you add this option to the options.rst docs?

jorisvandenbossche · 2017-04-02T20:46:20Z

doc/source/io.rst

+
+.. versionadded:: 0.20.0
+
+Parquet provides a sharded binary columnar serialization for data frames. It is designed to make reading and writing data


It would be nice to explain to users what the difference between the two is

martindurant · 2017-07-22T21:35:14Z

done.

jreback · 2017-07-26T23:42:46Z

going to merge on pass (as testing with pyarrow 0.5.0 as well as 0.4.1 / fp 0.1.0)

@wesm @martindurant

gfyoung · 2017-07-27T04:53:02Z

@jreback @wesm @martindurant : Everything looks green and good to go!

jorisvandenbossche

Looks good, added a bunch of mainly doc comments (docs section is also not building at the moment).

Additional non-doc comment:

When you try to write an unsupported type with pyarrow (this was the case in the doc build), a file is actually created, but it is empty. Not sure whether this is a pandas or a pyarrow issue, but it is certainly a bug.
As a result, trying to read this file afterwards gives an unexpected "Memory mapping file failed" error.

jorisvandenbossche · 2017-07-27T07:28:55Z

doc/source/io.rst

+
+.. versionadded:: 0.21.0
+
+Parquet provides a sharded binary columnar serialization for data frames. It is designed to make reading and writing data


IMO 'sharded' is too technical for this introduction (many users will already not understand this sentence).

Can you also include a link to the parquet website?

jorisvandenbossche · 2017-07-27T07:30:15Z

doc/source/io.rst

+Several caveats.
+
+- The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an
+  error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.


would it be worth adding ".. or reset_index(drop=True) to ignore the index." ?
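For the docs, the two options could be illustrated with pandas alone (no parquet engine needed). `has_default_index` is a hypothetical helper that mirrors the default-index check in this PR; it is not a pandas function.

```python
import pandas as pd


def has_default_index(df):
    """True when the index is the default 0..n-1 range."""
    return df.index.equals(pd.RangeIndex(len(df)))


df = pd.DataFrame({"value": [10, 20, 30]},
                  index=pd.Index(["x", "y", "z"], name="key"))

# Keep the index values: they become an ordinary column named "key".
kept = df.reset_index()

# Ignore the index: the values are discarded.
dropped = df.reset_index(drop=True)
```

Both `kept` and `dropped` have a default index afterwards, so either could be passed to `to_parquet` under the restriction described in this caveat.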

jorisvandenbossche · 2017-07-27T07:38:45Z

doc/source/io.rst

+
+.. ipython:: python
+
+   df.to_parquet('example_pa.parquet', engine='pyarrow')


This raises an error, since categorical is not supported by pyarrow?

https://issues.apache.org/jira/browse/ARROW-1285
(and going to remove the cat; we test this but docs didn't get updated)

jorisvandenbossche · 2017-07-27T07:39:07Z

doc/source/io.rst

+
+   These engines are very similar and should read/write nearly identical parquet format files.
+   These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).
+   TODO: differing options to write non-standard columns & null treatment


leftover TODO ?

jorisvandenbossche · 2017-07-27T07:39:10Z

doc/source/io.rst

+- Non supported types include ``Period`` and actual python object types. These will raise a helpful error message
+  on an attempt at serialization.
+
+See the documentation for `pyarrow <http://arrow.apache.org/docs/python/`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__


Can you mention you need at least one of those libraries installed ? (that those are required dependencies, and you can specify which is used with the engine keyword. Now 'engine' is used in the note without really explaining what it is)

jorisvandenbossche · 2017-07-27T07:51:25Z

pandas/core/frame.py

+            if None, will use the option: `io.parquet.engine`
+        compression : str, optional, default 'snappy'
+            compression method, includes {'gzip', 'snappy', 'brotli'}
+        kwargs passed to the engine


kwargs Additional keyword arguments passed to the engine

jorisvandenbossche · 2017-07-27T07:51:59Z

pandas/core/config_init.py

+parquet_engine_doc = """
+: string
+    The default parquet reader/writer engine. Available options:
+    'pyarrow', 'fastparquet', the default is 'pyarrow'


What if fastparquet is installed and pyarrow not ?

it will raise, we don't do a detection at startup. i guess we could but hate increasing the import footprint even more.

jorisvandenbossche · 2017-07-27T07:53:59Z

pandas/io/parquet.py

+                              "you can install via conda\n"
+                              "conda install pyarrow -c conda-forge\n"
+                              "\nor via pip\n"
+                              "pip install pyarrow\n")


I would add a -U otherwise pip will say it is already installed (I think?)
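A sketch of the import helper with the suggested `-U` flag folded into the message. `try_import` and its `conda_channel` parameter are hypothetical; the message text loosely follows the strings in the diff above.

```python
import importlib


def try_import(module_name, conda_channel="conda-forge"):
    """Import an optional dependency or raise ImportError with install hints."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            "{name} is required for parquet support\n\n"
            "you can install via conda\n"
            "conda install {name} -c {channel}\n"
            "\nor via pip\n"
            "pip install -U {name}\n".format(name=module_name, channel=conda_channel)
        )
```

The same helper would serve both engines, e.g. `try_import('pyarrow')` or `try_import('fastparquet')`.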

jorisvandenbossche · 2017-07-27T07:54:30Z

pandas/io/parquet.py

+                              "you can install via conda\n"
+                              "conda install fastparquet -c conda-forge\n"
+                              "\nor via pip\n"
+                              "pip install fastparquet")


jorisvandenbossche · 2017-07-27T07:55:36Z

pandas/core/frame.py

+    def to_parquet(self, fname, engine=None, compression='snappy',
+                   **kwargs):
+        """
+        write out the binary parquet for DataFrames


The io.parquet.to_parquet had a nicer docstring intro IMO, I would use that one here as well: "Write a DataFrame to the binary parquet format."

hmm, I just changed the first line, they are the same otherwise (ex arg)

jreback · 2017-07-27T11:18:45Z

In [3]:    df = pd.DataFrame({'a': list('abc'),
   ...:                       'b': list(range(1, 4)),
   ...:                       'c': np.arange(3, 6).astype('u1'),
   ...:                       'd': np.arange(4.0, 7.0, dtype='float64'),
   ...:                       'e': [True, False, True],
   ...:                       'f': pd.date_range('20130101', periods=3),
   ...:                       'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
   ...:                       'h': pd.date_range('20130101', periods=3, freq='ns')})
   ...: 
   ...: 

In [4]: df.to_parquet('foo.pq', engine='pyarrow')

In [5]: pd.read_parquet('foo.pq', engine='pyarrow')
Out[5]: 
   a  b  c    d      e          f                   g          h
0  a  1  3  4.0   True 2013-01-01 2013-01-01 05:00:00 2013-01-01
1  b  2  4  5.0  False 2013-01-02 2013-01-02 05:00:00 2013-01-01
2  c  3  5  6.0   True 2013-01-03 2013-01-03 05:00:00 2013-01-01

In [6]: df.to_parquet('foo2.pq', engine='fastparquet')

In [7]: pd.read_parquet('foo2.pq', engine='fastparquet')
Out[7]: 
   a  b  c    d      e          f                   g          h
0  a  1  3  4.0   True 2013-01-01 2013-01-01 05:00:00 2013-01-01
1  b  2  4  5.0  False 2013-01-02 2013-01-02 05:00:00 2013-01-01
2  c  3  5  6.0   True 2013-01-03 2013-01-03 05:00:00 2013-01-01

In [8]: pd.read_parquet('foo.pq', engine='fastparquet')
Out[8]: 
                   a  b  c    d      e          f                   g          h
__index_level_0__                                                               
0                  a  1  3  4.0   True 2013-01-01 2013-01-01 05:00:00 2013-01-01
1                  b  2  4  5.0  False 2013-01-02 2013-01-02 05:00:00 2013-01-01
2                  c  3  5  6.0   True 2013-01-03 2013-01-03 05:00:00 2013-01-01

In [9]: pd.read_parquet('foo2.pq', engine='pyarrow')
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-9-bd20d9f6a7ee> in <module>()
----> 1 pd.read_parquet('foo2.pq', engine='pyarrow')

/Users/jreback/pandas/pandas/io/parquet.py in read_parquet(path, engine, **kwargs)
    177 
    178     impl = get_engine(engine)
--> 179     return impl.read(path)

/Users/jreback/pandas/pandas/io/parquet.py in read(self, path)
     57     def read(self, path):
     58         path, _, _ = get_filepath_or_buffer(path)
---> 59         return self.api.parquet.read_table(path).to_pandas()
     60 
     61 

/Users/jreback/miniconda3/envs/pandas/lib/python3.6/site-packages/pyarrow/parquet.py in read_table(source, columns, nthreads, metadata, use_pandas_metadata)
    711     pf = ParquetFile(source, metadata=metadata)
    712     return pf.read(columns=columns, nthreads=nthreads,
--> 713                    use_pandas_metadata=use_pandas_metadata)
    714 
    715 

/Users/jreback/miniconda3/envs/pandas/lib/python3.6/site-packages/pyarrow/parquet.py in read(self, columns, nthreads, use_pandas_metadata)
    116             columns, use_pandas_metadata=use_pandas_metadata)
    117         return self.reader.read_all(column_indices=column_indices,
--> 118                                     nthreads=nthreads)
    119 
    120     def _get_column_indices(self, column_names, use_pandas_metadata=False):

_parquet.pyx in pyarrow._parquet.ParquetReader.read_all()

error.pxi in pyarrow.lib.check_status()

ArrowIOError: IOError: Unknown encoding type.

@wesm @martindurant

[8] is fine (fixed in fp master)
[9] is odd. seems some encoding information is not being written or a default is different.

pyarrow 0.5.0, fp 0.1.0

wesm · 2017-07-27T14:54:13Z

@jreback on [9] it looks like this error is coming from https://github.com/apache/parquet-cpp/blob/2f5ef8957851fe13dfb1b8c67f7a6786730a404e/src/parquet/column_reader.cc#L221. That suggests the metadata is malformed somehow, can you link me to a generated file so I can take a look?

Since Parquet datasets generally consist of many files instead of a single file, we should think about extending this function to be able to read directories of files or a list of file paths in a follow up PR

jreback · 2017-07-27T14:59:54Z

@wesm
foo2.pq.gz

jorisvandenbossche · 2017-07-27T16:33:39Z

[about defaulting to 'arrow', what if fastparquet is installed and arrow not] it will raise, we don't do a detection at startup. i guess we could but hate increasing the import footprint even more.

@jreback I understand that it is some more complexity, so won't argue further for it, but it feels a bit strange that pyarrow is the 'hard' default when only fastparquet is installed, while fastparquet can actually handle more data types, like Categorical (which seems a nice plus?)

wesm · 2017-07-27T16:44:35Z

Seems reasonable to fall back, perhaps with a warning (like when passing engine='python' to read_csv). The Categorical round trip (which requires the special pandas metadata, since Parquet does not have a categorical type) can be sorted out in the near future; if anyone would like to help with that it would be very welcome

TomAugspurger · 2017-07-27T17:37:01Z

+1 for using fastparquet if that's the only available engine (with a warning is probably best).

mrocklin · 2017-07-27T17:45:50Z

FWIW here is the policy used by dask.dataframe.read_parquet when @wesm added Arrow support to dask.dataframe:

https://github.com/dask/dask/pull/2223/files

engine : {'auto', 'fastparquet', 'arrow'}, default 'auto'
    Parquet reader library to use. If only one library is installed, it
    will use that one; if both, it will use 'fastparquet'

I am not surprised to see folks here prefer to set Arrow as the default (nor do I disagree with this choice). It does seem odd to warn fastparquet users by default though.

jreback · 2017-07-27T22:51:57Z

ok, revised to use 'auto', 'pyarrow', 'fastparquet'.
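One way the 'auto' option could resolve to a concrete engine is sketched below. The preference order (pyarrow first, then fastparquet) is an assumption based on the earlier default in this thread, and `resolve_engine` with its injectable `is_installed` check is hypothetical.

```python
import importlib.util

ENGINES = ("pyarrow", "fastparquet")  # assumed preference order for 'auto'


def _is_installed(module_name):
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


def resolve_engine(engine="auto", is_installed=_is_installed):
    """Return the engine name to use, falling back when only one is installed."""
    if engine == "auto":
        for candidate in ENGINES:
            if is_installed(candidate):
                return candidate
        raise ImportError("no parquet engine found; install pyarrow or fastparquet")
    if engine not in ENGINES:
        raise ValueError("engine must be one of 'auto', 'pyarrow', 'fastparquet'")
    if not is_installed(engine):
        raise ImportError("{} is not installed".format(engine))
    return engine
```

Because no warning is emitted on fallback, users who only have fastparquet installed are not nagged, which addresses the concern raised above.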

wesm · 2017-07-29T14:03:47Z

pandas/io/parquet.py

+
+    def write(self, df, path, compression='snappy', **kwargs):
+        path, _, _ = get_filepath_or_buffer(path)
+        table = self.api.Table.from_pandas(df, timestamps_to_ms=True)


We might want to remove timestamps_to_ms here, or add a pandas compatibility wrapper in pyarrow. Ultimately this is going away

@wesm is there a JIRA link for this? maybe link it in #17102 so we don't forget to do this.

added link to ARROW-622

jreback · 2017-08-02T09:47:49Z

merging, will followup in #17102 as needed.

@TomAugspurger

* Let _get_dtype accept Categoricals and CategoricalIndex (pandas-dev#16887) * Fixes for pandas-dev#16896(TimedeltaIndex indexing regression for strings) (pandas-dev#16907) * Fix for pandas-dev#16909(DeltatimeIndex.get_loc is not working on np.deltatime64 data type) (pandas-dev#16912) * DOC: Recommend sphinx 1.5 for now (pandas-dev#16929) For the SciPy sprint tomorrow, until the cause of the doc-building slowdown is fully identified. * BUG: Allow value labels to be read with iterator (pandas-dev#16926) All value labels to be read before the iterator has been used Fix issue where categorical data was incorrectly reformatted when write_index was False closes pandas-dev#16923 * DOC: Update flake8 command instructions (pandas-dev#16919) * TST: Don't assert that a bug exists in numpy (pandas-dev#16940) Better to ignore the warning from the bug, rather than assert the bug is still there After this change, numpy/numpy#9412 _could_ be backported to fix the bug * CI: add .pep8speakes.yml * CLN16668: remove OrderedDefaultDict (pandas-dev#16939) * Change "pls" to "please" in error message (pandas-dev#16947) * BUG: MultiIndex sort with ascending as list (pandas-dev#16937) * DOC: Improving docstring of pop method (pandas-dev#16416) (pandas-dev#16520) * PEP8 * WARN: add stacklevel to to_dict() UserWarning (pandas-dev#16927) (pandas-dev#16936) * ERR: add stacklevel to to_dict() UserWarning (pandas-dev#16927) * TST: Add warning testing to to_dict() * Fix warning assertion on to_dict() test * Add github issue to documentation on to_dict() warning test * CI: fix pep8speaks .yml file * DOC: whatsnew 0.21.0 edits * CI: disable codecov reporting * MAINT: Move series.remove_na to core.dtypes.missing.remove_na_arraylike Closes pandas-devgh-16935 * Support non unique period indexes on join and merge operations (pandas-dev#16949) * Support non unique period indexes on join and merge operations * Add frame assertion on tests and release notes * Explicitly use dtype int64 on arange * BUG: 
Set secondary axis font size for `secondary_y` during plotting The parameter was not being respected for `secondary_y`. Closes pandas-devgh-12565 * DOC: more whatsnew fixes * DOC: Reset index examples closes pandas-dev#16416 Author: aernlund <awe220@nyumc.org> Closes pandas-dev#16967 from aernlund/reset_index_docs and squashes the following commits: 3c6a4b6 [aernlund] DOC: added examples to reset_index 4838155 [aernlund] DOC: added examples to reset_index 2a51e2b [aernlund] DOC: added examples to reset_index * channel from pandas to conda-forge (pandas-dev#16966) * BUG: coercing of bools in groupby transform (pandas-dev#16895) * DOC: misspelling in DatetimeIndex.indexer_between_time [CI skip] (pandas-dev#16963) * CLN: some residual code removed, xref to pandas-dev#16761 (pandas-dev#16974) * ENH: Create a 'Y' alias for date_range yearly frequency Closes pandas-devgh-9313 * Revert "ENH: Create a 'Y' alias for date_range yearly frequency" (pandas-dev#16976) This reverts commit 9c096d2, as it was prematurely made. * DOC: behavior when slicing with missing bounds (pandas-dev#16932) closes pandas-dev#16917 * TST: Add test for sub-char in read_csv (pandas-dev#16977) Closes pandas-devgh-16893. * DEPR: deprecate html.border option (pandas-dev#16970) * DOC: document convention argument for resample() (pandas-dev#16965) * DOC: document convention argument for resample() * DOC: Clarify 'it' in aggregate doc (pandas-dev#16989) Closes pandas-devgh-16988. * CLN/COMPAT: for various py2/py3 in doc/bench scripts (pandas-dev#16984) * PERF: SparseDataFrame._init_dict uses intermediary dict, not DataFrame (pandas-dev#16883) Closes pandas-devgh-16773. * MAINT: Drop line_width and height from options (pandas-dev#16993) Deprecated since 0.11 and 0.12 respectively. * COMPAT: Add back remove_na for seaborn (pandas-dev#16992) Closes pandas-devgh-16971. 
* COMPAT: np.full not available in all versions, xref pandas-dev#16773 (pandas-dev#17000) * DOC, TST: Clarify whitespace behavior in read_fwf documentation (pandas-dev#16950) Closes pandas-devgh-16772 * API: add infer_objects for soft conversions (pandas-dev#16915) * API: add infer_objects for soft conversions * doc fixups * fixups * doc * BUG: np.inf now causes Index to upcast from int to float (pandas-dev#16996) Closes pandas-devgh-16957. * DOC: Make highlight functions match documentation (pandas-dev#16999) Closes pandas-devgh-16998. * BUG: Large object array isin closes pandas-dev#16012 Author: Morgan Stuart <morgansstuart243@gmail.com> Closes pandas-dev#16969 from Morgan243/large_array_isin and squashes the following commits: 31cb4b3 [Morgan Stuart] Removed unneeded details from whatsnew description 4b59745 [Morgan Stuart] Linting errors; additional test clarification 186607b [Morgan Stuart] BUG pandas-dev#16012 - fix isin for large object arrays * BUG: reindex would throw when a categorical index was empty pandas-dev#16770 closes pandas-dev#16770 Author: ri938 <r_irv938@hotmail.com> Author: Jeff Reback <jeff@reback.net> Author: Tuan <tuan.d.tran@hotmail.com> Author: Forbidden Donut <forbdonut@gmail.com> This patch had conflicts when merged, resolved by Committer: Jeff Reback <jeff@reback.net> Closes pandas-dev#16820 from ri938/bug_issue16770 and squashes the following commits: 0e2d315 [ri938] Merge branch 'master' into bug_issue16770 9802288 [ri938] Update v0.20.3.txt 1f2865e [ri938] Update v0.20.3.txt 83fd749 [ri938] Update v0.20.3.txt eab3192 [ri938] Merge branch 'master' into bug_issue16770 7acc09f [ri938] Minor correction to previous submit 6e8f1b3 [ri938] Minor corrections to previous submit (pandas-dev#16820) 9ed80f0 [ri938] Bring documentation into line with master branch. 
26e1a60 [ri938] Move documentation of change to the next major release 0.21.0 59b17cd [Jeff Reback] BUG: rolling.cov with multi-index columns should presever the MI (pandas-dev#16825) 5362447 [Tuan] fix BUG: ValueError when performing rolling covariance on multi indexed DataFrame (pandas-dev#16814) 800b40d [ri938] BUG: render dataframe as html do not produce duplicate element id's (pandas-dev#16780) (pandas-dev#16801) a725fbf [Forbidden Donut] BUG: Fix read of py3 PeriodIndex DataFrame HDF made in py2 (pandas-dev#16781) (pandas-dev#16790) 8f8e3d6 [ri938] TST: register slow marker (pandas-dev#16797) 0645868 [ri938] Add backticks in documentation 0a20024 [ri938] Minor correction to previous submit 69454ec [ri938] Minor corrections to previous submit (pandas-dev#16820) 3092bbc [ri938] BUG: reindex would throw when a categorical index was empty pandas-dev#16770 * BUG: Don't with empty Series for .isin (pandas-dev#17006) Empty Series initializes to float64, even when the data type is object for .isin, leading to an error with membership. Closes pandas-devgh-16991. * ENH: Use 'Y' as an alias for end of year (pandas-dev#16978) Closes pandas-devgh-9313 Redo of pandas-devgh-16958 * DOC: infer_objects doc fixup (pandas-dev#17018) * Fixes SparseSeries initiated with dictionary raising AttributeError (pandas-dev#16960) * DOC: Improving docstring of reset_index method (pandas-dev#16416) (pandas-dev#16975) * DOC: add warning to append about inefficiency (pandas-dev#17017) * DOC : Remove redundant backtick (pandas-dev#17025) * DOC: Document business frequency aliases (pandas-dev#17028) Follow-up to pandas-devgh-16978. * DOC: Fix double back-tick in 'Reshaping by Melt' section (pandas-dev#17030) See current stable docs for the issue: https://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-melt The double ` is causing the entire paragraph to be fixed width until the next double `. 
This commit removes the extra "`" * Define DataFrame plot methods in DataFrame (pandas-dev#17020) * CLN: move safe_sort from core.algorithms to core.sorting (pandas-dev#17034) COMPAT: safe_sort will only coerce list-likes to object, not a numpy string type xref: pandas-dev#17003 (comment) * DOC: Fixed Minor Typo (pandas-dev#17043) Cocumentation to Documentation * BUG: do not cast ints to floats if inputs o crosstab are not aligned (pandas-dev#17011) closes pandas-dev#17005 * BUG in merging categorical dates closes pandas-dev#16900 Author: Dave Willmer <dave.willmer@gmail.com> This patch had conflicts when merged, resolved by Committer: Jeff Reback <jeff@reback.net> Closes pandas-dev#16986 from dwillmer/cat_fix and squashes the following commits: 1ea1977 [Dave Willmer] Minor tweaks + comment 21a35a0 [Dave Willmer] Merge branch 'cat_fix' of https://github.com/dwillmer/pandas into cat_fix 04d5404 [Dave Willmer] Update tests 3cc5c24 [Dave Willmer] Merge branch 'master' into cat_fix 5e8e23b [Dave Willmer] Add whatsnew item b82d117 [Dave Willmer] Lint fixes a81933d [Dave Willmer] Remove unused import 218da66 [Dave Willmer] Generic solution to categorical problem 48e7163 [Dave Willmer] Test inner join 8843c10 [Dave Willmer] Fix TypeError when merging categorical dates * BUG: __setitem__ with a tuple induces NaN with a tz-aware DatetimeIndex (pandas-dev#16889) (pandas-dev#16897) * Added test for _get_dtype_type. (pandas-dev#16899) * BUG/API: dtype inconsistencies in .where / .setitem / .putmask / .fillna (pandas-dev#16821) * CLN/BUG: fix ndarray assignment may cause unexpected cast supersedes pandas-dev#14145 closes pandas-dev#14001 * API: This fixes a number of inconsistencies and API issues w.r.t. dtype conversions. This is a reprise of pandas-dev#14145 & pandas-dev#16408. This removes some code from the core structures & pushes it to internals, where the primitives are made more consistent. This should all us to be a bit more consistent for pandas2 type things. 
closes pandas-dev#16402 supersedes pandas-dev#14145 closes pandas-dev#14001 CLN: remove uneeded code in internals; use split_and_operate when possible * BUG: Improved thread safety for read_html() GH16928 (pandas-dev#16930) * Fixed 'add_methods' when the 'select' argument is specified. (pandas-dev#17045) * TST: Fix error message check in np.argsort comparision (pandas-dev#17051) Closes pandas-devgh-17046. * TST: Move some Series ctor tests to SharedWithSparse (pandas-dev#17050) * BUG: Made SparseDataFrame.fillna() fill all NaNs A continuation of pandas-dev#16178 closes pandas-dev#16112 closes pandas-dev#16178 Author: Kernc <kerncece@gmail.com> Author: keitakurita <kris337jbn@yahoo.co.jp> This patch had conflicts when merged, resolved by Committer: Jeff Reback <jeff@reback.net> Closes pandas-dev#16892 from kernc/sparse-fillna and squashes the following commits: c1cd33e [Kernc] fixup! BUG: Made SparseDataFrame.fillna() fill all NaNs 2974232 [Kernc] fixup! BUG: Made SparseDataFrame.fillna() fill all NaNs 4bc01a1 [keitakurita] BUG: Made SparseDataFrame.fillna() fill all NaNs * BUG: Use size_t to avoid array index overflow; add missing malloc of error_msg Fix a few locations where a parser's `error_msg` buffer is written to without having been previously allocated. This manifested as a double free during exception handling code making use of the `error_msg`. Additionally, use `size_t/ssize_t` where array indices or lengths will be stored. Previously, int32_t was used and would overflow on columns with very large amounts of data (i.e. greater than INTMAX bytes). 
xref pandas-dev#14696 closes pandas-dev#16798 Author: Jeff Knupp <jeff.knupp@enigma.com> Author: Jeff Knupp <jeff@jeffknupp.com> Closes pandas-dev#17040 from jeffknupp/16790-core-on-large-csv and squashes the following commits: 6a1ba23 [Jeff Knupp] Clear up prose a5d5677 [Jeff Knupp] Fix linting issues 4380c53 [Jeff Knupp] Fix linting issues 7b1cd8d [Jeff Knupp] Fix linting issues e3cb9c1 [Jeff Knupp] Add unit test plus '--high-memory' option, *off by default*. 2ab4971 [Jeff Knupp] Remove debugging code 2930eaa [Jeff Knupp] Fix line length to conform to linter rules e4dfd19 [Jeff Knupp] Revert printf format strings; fix more comment alignment 3171674 [Jeff Knupp] Fix some leftover size_t references 0985cf3 [Jeff Knupp] Remove debugging code; fix type cast 669d99b [Jeff Knupp] Fix linting errors re: line length 1f24847 [Jeff Knupp] Fix comment alignment; add whatsnew entry e04d12a [Jeff Knupp] Switch to use int64_t rather than size_t due to portability concerns. d5c75e8 [Jeff Knupp] BUG: Use size_t to avoid array index overflow; add missing malloc of error_msg * TST: remove some test warnings in parser tests (pandas-dev#17057) TST: move highmemory test to proper location in c_parser_only xref pandas-dev#16798 * DOC: Add more examples for reset_index (pandas-dev#17055) * MAINT: Add dash in high memory message Follow-up to pandas-devgh-17057. * MAINT: kwards --> kwargs in parsers.pyx * CLN: Cleanup comments in before_install_travis.sh envars.sh doesn't exist anymore. In fact, it's been gone for awhile. * MAINT: Remove duplicate Series sort_index check Duplicate boolean validation check for sort_index in series/test_validate.py * BLD: Pin pyarrow=0.4.1 (pandas-dev#17065) Addresses pandas-devgh-17064. 
Also add some additional build information when calling `pd.show_versions` * ENH: provide "inplace" argument to set_axis() closes pandas-dev#14636 Author: Pietro Battiston <me@pietrobattiston.it> Closes pandas-dev#16994 from toobaz/set_axis_inplace and squashes the following commits: 8fb9d0f [Pietro Battiston] REF: adapt NDFrame.set_axis() calls to new signature 409f502 [Pietro Battiston] ENH: provide "inplace" argument to set_axis(), change signature * BUG: Fix parser field type compatability on 32-bit systems. (pandas-dev#17071) Closes pandas-devgh-17063 * COMPAT: rename isnull -> isna, notnull -> notna (pandas-dev#16972) closes pandas-dev#15001 * BUG: Thoroughly dedup columns in read_csv (pandas-dev#17060) * ENH: Add skipna parameter to infer_dtype (pandas-dev#17066) Currently defaults to False for backwards compatibility. Will default to True in the future. Closes pandas-devgh-17059. * MAINT: Remove unused variable in test_scalar.py The "expected" variable is unused at the end of a test in indexing/test_scalar.py * TST: Add tests/indexing/ and reshape/ to setup.py (pandas-dev#17076) Looks like we just forgot about them. Oops. * CI: partially revert pandas-dev#17065, un-pin pyarrow on some builds * DOC: whatsnew typos * TST: Check more error messages in tests (pandas-dev#17075) * BUG: Respect dtype when calling pivot_table with margins=True closes pandas-dev#17013 This fix actually exposed an occurrence of pandas-dev#17035 in an existing test (as well as in one I added). Author: Pietro Battiston <me@pietrobattiston.it> Closes pandas-dev#17062 from toobaz/pivot_margin_int and squashes the following commits: 2737600 [Pietro Battiston] Removed now obsolete workaround 956c4f9 [Pietro Battiston] BUG: respect dtype when calling pivot_table with margins=True * MAINT: Add missing space in parsers.pyx "2< heuristic" --> "2 < heuristic" * MAINT: Add missing paren around print statement Stray verbose print statement in parsers.pyx was bare without any parentheses. 
* DOC: fix typos in missing.rst xref pandas-dev#16972 * DOC: further clean-up null/na changes (pandas-dev#17113) * BUG: Allow pd.unique to accept tuple of strings (pandas-dev#17108) * BUG: Allow Series with same name with crosstab (pandas-dev#16028) Closes pandas-devgh-13279 * COMPAT: make sure use_inf_as_null is deprecated (pandas-dev#17126) closes pandas-dev#17115 * CI: bump version of xlsxwriter to 0.5.2 (pandas-dev#17142) * DOC: Clean up instructions in ISSUE_TEMPLATE (pandas-dev#17146) * Add missing space to the NotImplementedError's message for compound dtypes (pandas-dev#17140) * DOC: (de)type the return value of concat (pandas-dev#17079) (pandas-dev#17119) * BUG: Thoroughly dedup column names in read_csv (pandas-dev#17095) * DOC: Additions/updates to documentation (pandas-dev#17150) * ENH: add to/from_parquet with pyarrow & fastparquet (pandas-dev#15838) * DOC: doc typos, xref pandas-dev#15838 * TST: test for categorical index monotonicity (pandas-dev#17152) * correctly determine bottleneck version * tests for categorical index monotonicity * fix Index.is_monotonic to point to Index.is_monotonic_increasing directly * MAINT: Remove non-standard and inconsistently-used imports (pandas-dev#17085) * DOC: typos in whatsnew * DOC: whatsnew 0.21.0 fixes * BUG: Fix CSV parsing of singleton list header (pandas-dev#17090) Closes pandas-devgh-7757. 
* ENH: Support strings containing '%' in add_prefix/add_suffix (pandas-dev#17151) (pandas-dev#17162) * REF: repr - allow block to override values that get formatted (pandas-dev#17143) * MAINT: Drop unnecessary newlines in issue template * remove direct import of nan Author: Brock Mendel <jbrockmendel@gmail.com> Closes pandas-dev#17185 from jbrockmendel/dont_import_nan and squashes the following commits: ee260b8 [Brock Mendel] remove direct import of nan * use == to test String equality (pandas-dev#17171) * ENH: Add warning when setting into nonexistent attribute (pandas-dev#16951) closes pandas-dev#7175 closes pandas-dev#5904 * DOC: added string processing comparison with SAS (pandas-dev#16497) * CLN: remove unused get methods in internals (pandas-dev#17169) * Remove unused get methods that would raise AttributeError if called * Remove unnecessary import * TST: Partial Boolean DataFrame Indexing (pandas-dev#17186) Closes pandas-devgh-17170 * CLN: Reformat docstring for IPython fixture * Define Series.plot and Series.hist in class definition (pandas-dev#17199) * BUG: support pandas objects in iloc with old numpy versions (pandas-dev#17194) closes pandas-dev#17193 * Implement _make_accessor classmethod for PandasDelegate (pandas-dev#17166) * Create ABCDateOffset (pandas-dev#17165) * BUG: resample and apply modify the index type for empty Series (pandas-dev#17149) * DOC: Updated NDFrame.astype docs (pandas-dev#17203) * MAINT: Minor touch-ups to GitHub PULL_REQUEST_TEMPLATE (pandas-dev#17207) Remove leading space from task-list so that tasks aren't nested. 
* CLN: replace %s syntax with .format in core.computation (pandas-dev#17209) * Bugfix for multilevel columns with empty strings in Python 2 (pandas-dev#17099) * CLN/ASV clean-up frame stat ops benchmarks (pandas-dev#17205) * BUG: Rolling apply on DataFrame with Datetime index returns NaN (pandas-dev#17156) * CLN: Remove import exception handling (pandas-dev#17218) Imports should succeed on all versions of Python that pandas supports. * MAINT: Remove extra the's in deprecation messages (pandas-dev#17222) * DOC: Patch docs in _decorators.py * CLN: replace %s syntax with .format in pandas.util (pandas-dev#17224) * Add 'See also' sections (pandas-dev#17223) * move pivot_table doc-string to DataFrame (pandas-dev#17174) * Remove import of pandas as pd in core.window (pandas-dev#17233) * TST: Move more frame tests to SharedWithSparse (pandas-dev#17227) * REF: _get_objs_combined_axis (pandas-dev#17217) * ENH/PERF: Remove frequency inference from .dt accessor (pandas-dev#17210) * ENH/PERF: Remove frequency inference from .dt accessor * BENCH: Add DatetimeAccessor benchmark * DOC: Whatsnew * Fix apparent typo in tests (pandas-dev#17247) * COMPAT: avoid calling getsizeof() on PyPy closes pandas-dev#17228 Author: mattip <matti.picus@gmail.com> Closes pandas-dev#17229 from mattip/getsizeof-unavailable and squashes the following commits: d2623e4 [mattip] COMPAT: avoid calling getsizeof() on PyPy * CLN: replace %s syntax with .format in pandas.core.reshape (pandas-dev#17252) Replaced %s syntax with .format in pandas.core.reshape. Additionally, made some of the existing positional .format code more explicit. 
* ENH: Infer compression from non-string paths (pandas-dev#17206) * Fix bugs in IntervalIndex.is_non_overlapping_monotonic (pandas-dev#17238) * BUG: Fix behavior of argmax and argmin with inf (pandas-dev#16449) (pandas-dev#16449) Closes pandas-dev#13595 * CLN: Remove have_pytz (pandas-dev#17266) Closes pandas-devgh-17251 * CLN: replace %s syntax with .format in core.dtypes and core.sparse (pandas-dev#17270) * Replace imports of * with explicit imports (pandas-dev#17269) xref pandas-dev#17234 * TST: pytest deprecation warnings GH17197 (pandas-dev#17253) Test parameters with marks are updated according to the updated API of Pytest. https://docs.pytest.org/en/latest/changelog.html#pytest-3-2-0-2017-07-30 https://docs.pytest.org/en/latest/parametrize.html * Handle more date/datetime/time formats (pandas-dev#15871) * DOC: add example on json_normalize (pandas-dev#16438) * BUG: Have object dtype for empty Categorical.categories (pandas-dev#17249) * BUG: Have object dtype for empty Categorical ctor Previously we had a `Float64Index`, which is inconsistent with, e.g., the regular Index constructor. * TST: Update tests in multi for new return Previously these relied worked around the return type by wrapping list-likes in `np.array` and relying on that to cast to float. These workarounds are no longer nescessary. * TST: Update union_categorical tests This relied on `NaN` being a float and empty being a float. Not a necessary test anymore. 
* TST: set object dtype * CLN: replace %s syntax with .format in pandas.tseries (pandas-dev#17290) * TST: parameterize consistency tests for rolling/expanding windows (pandas-dev#17292) * FIX: define `DataFrame.items` for all versions of python (pandas-dev#17214) * PERF: Update ASV publish config (pandas-dev#17293) Stricter cutoffs for considering regressions [ci skip] * DOC: Expand docstrings for head / tail methods (pandas-dev#16941) * MAINT: Use set literal for unsupported + depr args Initializes unsupported and deprecated argument sets with set literals instead of the set constructor in pandas/io/parsers.py, as the former is slightly faster than the latter. * DOC: Add proper docstring to maybe_convert_indices Patches several spelling errors and expands current doc to a proper doc-string. * DOC: Improving docstring of take method (pandas-dev#16948) * BUG: Fixed regex in asv.conf.json (pandas-dev#17300) In pandas-dev#17293 I messed up the syntax. I used a glob instead of a regex. According to the docs at http://asv.readthedocs.io/en/latest/asv.conf.json.html#regressions-thresholds we want to use a regex. I've actually manually tested this change and verified that it works. 
[ci skip] * Remove unnecessary usage of _TSObject (pandas-dev#17297) * BUG: clip should handle null values closes pandas-dev#17276 Author: Michael Gasvoda <mgasvoda@mercatus.gmu.edu> Author: mgasvoda <mgasvoda01@gmail.com> Closes pandas-dev#17288 from mgasvoda/master and squashes the following commits: a1dbdf2 [mgasvoda] Merge branch 'master' into master 9333952 [Michael Gasvoda] Checking output of tests 4e0464e [Michael Gasvoda] fixing whatsnew text c442040 [Michael Gasvoda] formatting fixes 7e23678 [Michael Gasvoda] formatting updates 781ea72 [Michael Gasvoda] whatsnew entry d9627fe [Michael Gasvoda] adding clip tests 9aa0159 [Michael Gasvoda] Treating na values as none for clips * BUG: fillna returns frame when inplace=True if value is a dict (pandas-dev#16156) (pandas-dev#17279) * CLN: Index.append() refactoring (pandas-dev#16236) * DEPS: set min versions (pandas-dev#17002) closes pandas-dev#15206, numpy >= 1.9 closes pandas-dev#15543, matplotlib >= 1.4.3 scipy >= 0.14.0 * CLN: replace %s syntax with .format in core.tools, algorithms.py, base.py (pandas-dev#17305) * BUG: Fix strange behaviour of Series.iloc on MultiIndex Series (pandas-dev#17148) (pandas-dev#17291) * DOC: Add module doc-string to tseries/api.py * MAINT: Clean up docs in pandas/errors/__init__.py * CLN: replace %s syntax with .format in missing.py, nanops.py, ops.py (pandas-dev#17322) Replaced %s syntax with .format in missing.py, nanops.py, ops.py. Additionally, made some of the existing positional .format code more explicit. 
* Make pd.Period immutable (pandas-dev#17239) * Bug: groupby multiindex levels equals rows (pandas-dev#16859) closes pandas-dev#16843 * BUG: Cannot use tz-aware origin in to_datetime (pandas-dev#16842) closes pandas-dev#16842 Author: step4me <prosikeffect@gmail.com> Closes pandas-dev#17244 from step4me/step4me-feature and squashes the following commits: 09d051d [step4me] BUG: Cannot use tz-aware origin in to_datetime (pandas-dev#16842) * Replace usage of total_seconds compat func with timedelta method (pandas-dev#17289) * CLN: replace %s syntax with .format in core/indexing.py (pandas-dev#17357) Progress toward issue pandas-dev#16130. Converted old string formatting to new string formatting in core/indexing.py. * DOC: Point to dev-docs in issue template (pandas-dev#17353) [ci skip] * CLN: remove total_seconds compat from json (pandas-dev#17341) * CLN: Move test_intersect_str_dates (pandas-dev#17366) Moves test_intersect_str_dates from tests/indexes/test_range.py to tests/indexes/test_base.py. * BUG: Respect dups in reindexing CategoricalIndex (pandas-dev#17355) When the indexer is identical to the elements. We should still return duplicates when the indexer contains duplicates. Closes pandas-devgh-17323. * Unify Index._dir_* with Series implementation (pandas-dev#17117) * BUG: make order of index from pd.concat deterministic (pandas-dev#17364) closes pandas-dev#17344 * Fix typo that causes several NaT methods to have incorrect docstrings (pandas-dev#17327) * CLN: replace %s syntax with .format in io/formats/format.py (pandas-dev#17358) Progress toward issue pandas-dev#16130. Converted old string formatting to new string formatting in io/formats/format.py. 
* PKG: Added pyproject.toml for PEP 518 (pandas-dev#16745) Declaring build-time requirements: https://www.python.org/dev/peps/pep-0518/ * DOC: Update Overview page in documentation (pandas-dev#17368) * Update Overview page in documentation * DOC Revise Overview page * DOC Make further revisions in Overview webpage * Update overview.rst Remove references to Panel * API: Have MultiIndex consturctors always return a MI (pandas-dev#17236) * API: Have MultiIndex constructors return MI This removes the special case for MultiIndex constructors returning an Index if all the levels are length-1. Now this will return a MultiIndex with a single level. This is a backwards incompatabile change, with no clear method for deprecation, so we're making a clean break. Closes pandas-dev#17178 * fixup! API: Have MultiIndex constructors return MI * Update for comments

jreback added the IO Data and Enhancement labels Mar 29, 2017

jreback added this to the 0.20.0 milestone Mar 29, 2017

jreback commented Mar 29, 2017

jreback mentioned this pull request Mar 29, 2017

nicer error message on duplicate columns names dask/fastparquet#118

Closed

wesm reviewed Mar 29, 2017

jreback force-pushed the parquet branch 2 times, most recently from 9662db7 to 628b62c March 31, 2017 15:54

TomAugspurger reviewed Apr 1, 2017

TomAugspurger approved these changes Apr 1, 2017

jreback force-pushed the parquet branch from 52ef75b to f238f9b April 1, 2017 22:21

jreback force-pushed the parquet branch from 4aceb7e to a017d8f April 2, 2017 00:17

jreback mentioned this pull request Apr 2, 2017

API: formalize the pandas IO API #15862

Closed

jorisvandenbossche reviewed Apr 2, 2017

jreback mentioned this pull request Apr 3, 2017

DOC: section on differences from the pyarrow impl dask/fastparquet#122

Closed

jreback force-pushed the parquet branch from a017d8f to 6a95c81 April 3, 2017 12:24

jreback force-pushed the parquet branch 2 times, most recently from 37cc304 to e45dfe8 July 26, 2017 23:22

jorisvandenbossche reviewed Jul 27, 2017

jreback force-pushed the parquet branch from e45dfe8 to d3ec8b5 July 27, 2017 11:27

jreback mentioned this pull request Jul 27, 2017

ENH: followup support for pd.read_parquet #17102

Closed

3 tasks

wesm reviewed Jul 29, 2017

ENH: add to/from_parquet with pyarrow & fastparquet

f553a5f

jreback force-pushed the parquet branch from bb87d0d to f553a5f August 1, 2017 22:28

jreback merged commit f433061 into pandas-dev:master Aug 2, 2017

jreback added a commit that referenced this pull request Aug 2, 2017

DOC: doc typos, xref #15838

8e6b09f

chris-b1 mentioned this pull request Aug 22, 2017

Update Performance Considerations section in docs #17303

Merged

4 tasks

jreback added the IO Parquet label Sep 6, 2017

jowens pushed a commit to jowens/pandas that referenced this pull request Sep 20, 2017

ENH: add to/from_parquet with pyarrow & fastparquet (pandas-dev#15838)

5ce00e1

jowens pushed a commit to jowens/pandas that referenced this pull request Sep 20, 2017

DOC: doc typos, xref pandas-dev#15838

9aadb64

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017

ENH: add to/from_parquet with pyarrow & fastparquet (pandas-dev#15838)

52c8cab

alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017

DOC: doc typos, xref pandas-dev#15838

6177c81

jakirkham mentioned this pull request Jan 22, 2018

Allow the CI dev build to fail dask/dask#3088

Merged


		self.check_round_trip(df, fp)

		@pytest.mark.skip(reason="not supported")


		.. versionadded:: 0.21.0

		Parquet provides a sharded binary columnar serialization for data frames. It is designed to make reading and writing data


		.. ipython:: python

		df.to_parquet('example_pa.parquet', engine='pyarrow')

codecov bot commented Mar 29, 2017

wesm Apr 1, 2017

jreback Apr 1, 2017

jreback commented Apr 1, 2017

martindurant commented Apr 1, 2017

jreback commented Apr 2, 2017

jorisvandenbossche left a comment

martindurant commented Jul 22, 2017

jreback commented Jul 26, 2017

gfyoung commented Jul 27, 2017

jorisvandenbossche left a comment
