
Creating an array from a pandas object column with mostly ints raises #3479

Closed
max-sixty opened this issue Jan 24, 2019 · 7 comments

@max-sixty
Contributor

It looks like an object column with mostly ints is interpreted as a column with all ints:

import pandas as pd

df = pd.DataFrame(dict(a=range(100000)))
df.iloc[-1] = 'a'  # last value is a str; the rest are ints, so the column dtype becomes object

df.to_parquet('x')  # raises ArrowTypeError

ArrowTypeErrorTraceback (most recent call last)
<ipython-input-73-aa5b9f7c0625> in <module>()
----> 1 df.to_parquet('x')

/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in to_parquet(self, fname, engine, compression, **kwargs)
   1943         from pandas.io.parquet import to_parquet
   1944         to_parquet(self, fname, engine,
-> 1945                    compression=compression, **kwargs)
   1946 
   1947     @Substitution(header='Write out the column names. If a list of strings '

/usr/local/lib/python2.7/site-packages/pandas/io/parquet.pyc in to_parquet(df, path, engine, compression, **kwargs)
    255     """
    256     impl = get_engine(engine)
--> 257     return impl.write(df, path, compression=compression, **kwargs)
    258 
    259 

/usr/local/lib/python2.7/site-packages/pandas/io/parquet.pyc in write(self, df, path, compression, coerce_timestamps, **kwargs)
    116 
    117         else:
--> 118             table = self.api.Table.from_pandas(df)
    119             self.api.parquet.write_table(
    120                 table, path, compression=compression,

/usr/local/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
   1139         <pyarrow.lib.Table object at 0x7f05d1fb1b40>
   1140         """
-> 1141         names, arrays, metadata = pdcompat.dataframe_to_arrays(
   1142             df,
   1143             schema=schema,

/usr/local/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    435             arrays = list(executor.map(convert_column,
    436                                        columns_to_convert,
--> 437                                        convert_types))
    438 
    439     types = [x.type for x in arrays]

/usr/local/lib/python2.7/site-packages/concurrent/futures/_base.pyc in result_iterator()
    639                     # Careful not to keep a reference to the popped future
    640                     if timeout is None:
--> 641                         yield fs.pop().result()
    642                     else:
    643                         yield fs.pop().result(end_time - time.time())

/usr/local/lib/python2.7/site-packages/concurrent/futures/_base.pyc in result(self, timeout)
    453                 raise CancelledError()
    454             elif self._state == FINISHED:
--> 455                 return self.__get_result()
    456 
    457             self._condition.wait(timeout)

/usr/local/lib/python2.7/site-packages/concurrent/futures/thread.pyc in run(self)
     61 
     62         try:
---> 63             result = self.fn(*self.args, **self.kwargs)
     64         except:
     65             e, tb = sys.exc_info()[1:]

/usr/local/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in convert_column(col, ty)
    424             e.args += ("Conversion failed for column {0!s} with type {1!s}"
    425                        .format(col.name, col.dtype),)
--> 426             raise e
    427 
    428     if nthreads == 1:

ArrowTypeError: ('an integer is required', 'Conversion failed for column a with type object')

Tracking down the call within the pyarrow library:

import pyarrow as pa
col = df['a']

# Mirrors the per-column conversion made in pyarrow.pandas_compat above;
# raises the same ArrowTypeError.
pa.array(col, type=None, from_pandas=True, safe=True)

ArrowTypeErrorTraceback (most recent call last)
<ipython-input-71-f418a2733bf5> in <module>()
      4 col = df['a']
      5 
----> 6 pa.array(col, type=None, from_pandas=True, safe=True)

/usr/local/lib/python2.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
    169             values, type = pdcompat.get_datetimetz_type(values, obj.dtype,
    170                                                         type)
--> 171             return _ndarray_to_array(values, mask, type, from_pandas, safe,
    172                                      pool)
    173     else:

/usr/local/lib/python2.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
     78 
     79     with nogil:
---> 80         check_status(NdarrayToArrow(pool, values, mask, from_pandas,
     81                                     c_type, cast_options, &chunked_out))
     82 

/usr/local/lib/python2.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
     89             raise ArrowNotImplementedError(message)
     90         elif status.IsTypeError():
---> 91             raise ArrowTypeError(message)
     92         elif status.IsCapacityError():
     93             raise ArrowCapacityError(message)

ArrowTypeError: an integer is required

  • Is there a way, from the pandas API, to specify the schema?
  • Could pyarrow be given the column's pandas dtype, rather than (apparently) inferring the type from the initial column values?

Thanks!
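
A minimal workaround sketch, assuming it is acceptable to store the mixed column as strings (not something proposed in the thread itself): cast the column on the pandas side before writing, so pyarrow's type inference sees a homogeneous column rather than settling on int64 and then failing.

import pandas as pd

df = pd.DataFrame(dict(a=range(100000)))
df.iloc[-1] = 'a'

# Workaround sketch: make the object column homogeneous before handing it
# to pyarrow, so inference doesn't pick int64 from the leading values.
df['a'] = df['a'].astype(str)
df.to_parquet('x')  # column 'a' is now written as a string column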

@wesm
Member

wesm commented Jan 24, 2019

On the first question, that would be one for the pandas project, cc @jreback @TomAugspurger.

On the second question, what are you hoping to happen? We have discussed having an option to treat errors as nulls (https://issues.apache.org/jira/browse/ARROW-2098), so one possibility, if you indicate that the column should be an integer, is that the lone string would be made null. You'd have to opt in to this behavior, though.

@max-sixty
Contributor Author

max-sixty commented Jan 24, 2019

Thanks for the lightning-fast response @wesm

On the second question, what are you hoping to happen?

I was hoping that pyarrow could see that the column had an object dtype and attempt to encode it as a string, rather than guessing based on the initial values. (I wouldn't think that the correct result would be to null the lone string.)

Is that reasonable?
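
A small sketch of the inference behavior being described, reusing the df from the original report: pyarrow infers int64 from a slice that contains only ints, while the full object column raises the error shown above.

import pyarrow as pa

# A slice containing only ints is inferred as int64...
pa.array(df['a'].head(10), from_pandas=True).type  # -> int64

# ...while the full column (ints plus one trailing str) raises
# ArrowTypeError, as in the traceback above:
# pa.array(df['a'], from_pandas=True)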

@TomAugspurger
Contributor

TomAugspurger commented Jan 24, 2019

Pandas doesn't currently pass a schema through. IIUC, we could add a schema argument to to_parquet, and pass it through at https://github.com/pandas-dev/pandas/blob/5761e359b65631db491349a65fc21d0da51dcc0f/pandas/io/parquet.py#L103-L113, with little effort.

That seems generally useful, but it wouldn't have helped with this exception, right?

Edit: Oh, reading #3479 (comment), I see that it may have helped if you opted into it.
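
For reference, a sketch of the pyarrow side such an argument would forward to: Table.from_pandas already accepts a schema (visible in the traceback above). The to_parquet(schema=...) argument is hypothetical, and with the mixed column from this issue the conversion may still fail unless the column is made homogeneous first; a clean string column is used here.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Sketch only: the plumbing a hypothetical to_parquet(schema=...) argument
# could forward to on the pyarrow side.
df = pd.DataFrame(dict(a=['x', 'y', 'z']))
schema = pa.schema([pa.field('a', pa.string())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'x')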

@max-sixty
Contributor Author

On the pandas / first question:

Pandas doesn't currently pass a schema through. IIUC, we could add a schema argument to to_parquet, and pass it through at

I think a user could now pass schema as a kwarg to to_parquet, which would then be passed through here: https://github.com/pandas-dev/pandas/blob/5761e359b65631db491349a65fc21d0da51dcc0f/pandas/io/parquet.py#L113 (though an explicit arg may be slightly friendlier).

@TomAugspurger
Contributor

@max-sixty I thought that at first too, but from_pandas_kwargs is built internally. User-provided kwargs aren't passed there (they're sent to write_table).
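
For concreteness, a rough paraphrase of the pyarrow writer in pandas/io/parquet.py at that commit (not the exact source), showing where user kwargs end up:

# Rough paraphrase of PyArrowImpl.write, not the exact pandas source:
def write(self, df, path, compression='snappy', index=None, **kwargs):
    # from_pandas_kwargs is constructed internally; user kwargs never reach it
    from_pandas_kwargs = {} if index is None else {'preserve_index': index}
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    # user-provided **kwargs are forwarded to write_table instead
    self.api.parquet.write_table(table, path,
                                 compression=compression, **kwargs)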

@max-sixty
Contributor Author

Right, thanks @TomAugspurger

@wesm
Member

wesm commented Jan 27, 2019

I'm closing this for now. If you have a well-scoped feature request, can you please open a JIRA issue?
