
Creating an array from a pandas object column with mostly ints raises #3479

Closed
max-sixty opened this issue Jan 24, 2019 · 7 comments

@max-sixty
Contributor

It looks like an object column with mostly ints is interpreted as a column with all ints:

import pandas as pd

df = pd.DataFrame(dict(a=range(100000)))
df.iloc[-1] = 'a'  # last value is a str; the rest are ints, so the column dtype becomes object

df.to_parquet('x')  # raises ArrowTypeError

ArrowTypeErrorTraceback (most recent call last)
<ipython-input-73-aa5b9f7c0625> in <module>()
----> 1 df.to_parquet('x')

/usr/local/lib/python2.7/site-packages/pandas/core/frame.pyc in to_parquet(self, fname, engine, compression, **kwargs)
   1943         from pandas.io.parquet import to_parquet
   1944         to_parquet(self, fname, engine,
-> 1945                    compression=compression, **kwargs)
   1946 
   1947     @Substitution(header='Write out the column names. If a list of strings '

/usr/local/lib/python2.7/site-packages/pandas/io/parquet.pyc in to_parquet(df, path, engine, compression, **kwargs)
    255     """
    256     impl = get_engine(engine)
--> 257     return impl.write(df, path, compression=compression, **kwargs)
    258 
    259 

/usr/local/lib/python2.7/site-packages/pandas/io/parquet.pyc in write(self, df, path, compression, coerce_timestamps, **kwargs)
    116 
    117         else:
--> 118             table = self.api.Table.from_pandas(df)
    119             self.api.parquet.write_table(
    120                 table, path, compression=compression,

/usr/local/lib/python2.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
   1139         <pyarrow.lib.Table object at 0x7f05d1fb1b40>
   1140         """
-> 1141         names, arrays, metadata = pdcompat.dataframe_to_arrays(
   1142             df,
   1143             schema=schema,

/usr/local/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    435             arrays = list(executor.map(convert_column,
    436                                        columns_to_convert,
--> 437                                        convert_types))
    438 
    439     types = [x.type for x in arrays]

/usr/local/lib/python2.7/site-packages/concurrent/futures/_base.pyc in result_iterator()
    639                     # Careful not to keep a reference to the popped future
    640                     if timeout is None:
--> 641                         yield fs.pop().result()
    642                     else:
    643                         yield fs.pop().result(end_time - time.time())

/usr/local/lib/python2.7/site-packages/concurrent/futures/_base.pyc in result(self, timeout)
    453                 raise CancelledError()
    454             elif self._state == FINISHED:
--> 455                 return self.__get_result()
    456 
    457             self._condition.wait(timeout)

/usr/local/lib/python2.7/site-packages/concurrent/futures/thread.pyc in run(self)
     61 
     62         try:
---> 63             result = self.fn(*self.args, **self.kwargs)
     64         except:
     65             e, tb = sys.exc_info()[1:]

/usr/local/lib/python2.7/site-packages/pyarrow/pandas_compat.pyc in convert_column(col, ty)
    424             e.args += ("Conversion failed for column {0!s} with type {1!s}"
    425                        .format(col.name, col.dtype),)
--> 426             raise e
    427 
    428     if nthreads == 1:

ArrowTypeError: ('an integer is required', 'Conversion failed for column a with type object')

Tracking down the call within the pyarrow library:

import pyarrow as pa
col = df['a']

# Mirrors the per-column conversion made in pyarrow.pandas_compat above;
# raises the same ArrowTypeError.
pa.array(col, type=None, from_pandas=True, safe=True)

ArrowTypeErrorTraceback (most recent call last)
<ipython-input-71-f418a2733bf5> in <module>()
      4 col = df['a']
      5 
----> 6 pa.array(col, type=None, from_pandas=True, safe=True)

/usr/local/lib/python2.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()
    169             values, type = pdcompat.get_datetimetz_type(values, obj.dtype,
    170                                                         type)
--> 171             return _ndarray_to_array(values, mask, type, from_pandas, safe,
    172                                      pool)
    173     else:

/usr/local/lib/python2.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()
     78 
     79     with nogil:
---> 80         check_status(NdarrayToArrow(pool, values, mask, from_pandas,
     81                                     c_type, cast_options, &chunked_out))
     82 

/usr/local/lib/python2.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
     89             raise ArrowNotImplementedError(message)
     90         elif status.IsTypeError():
---> 91             raise ArrowTypeError(message)
     92         elif status.IsCapacityError():
     93             raise ArrowCapacityError(message)

ArrowTypeError: an integer is required

  • Is there a way, from the pandas API, to specify the schema?
  • Could pyarrow be given the column's pandas dtype, rather than (apparently) inferring the type from the initial column values?

Thanks!
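
A minimal workaround sketch, assuming it is acceptable to store the mixed column as strings (not something proposed in the thread itself): cast the column on the pandas side before writing, so pyarrow's type inference sees a homogeneous column rather than settling on int64 and then failing.

import pandas as pd

df = pd.DataFrame(dict(a=range(100000)))
df.iloc[-1] = 'a'

# Workaround sketch: make the object column homogeneous before handing it
# to pyarrow, so inference doesn't pick int64 from the leading values.
df['a'] = df['a'].astype(str)
df.to_parquet('x')  # column 'a' is now written as a string column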

@wesm
Member

wesm commented Jan 24, 2019

On the first question, that would be one for the pandas project, cc @jreback @TomAugspurger.

On the second question, what are you hoping to happen? We have discussed having an option to treat errors as nulls (https://issues.apache.org/jira/browse/ARROW-2098), so one possibility, if you indicate that the column should be an integer, is that the lone string would be made null. You'd have to opt in to this behavior, though.

@max-sixty
Contributor Author

max-sixty commented Jan 24, 2019

Thanks for the lightning-fast response @wesm

On the second question, what are you hoping to happen?

I was hoping that pyarrow could see that the column had an object dtype and attempt to encode it as a string, rather than guessing based on the initial values. (I wouldn't think that the correct result would be to null the lone string.)

Is that reasonable?
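
A small sketch of the inference behavior being described, reusing the df from the original report: pyarrow infers int64 from a slice that contains only ints, while the full object column raises the error shown above.

import pyarrow as pa

# A slice containing only ints is inferred as int64...
pa.array(df['a'].head(10), from_pandas=True).type  # -> int64

# ...while the full column (ints plus one trailing str) raises
# ArrowTypeError, as in the traceback above:
# pa.array(df['a'], from_pandas=True)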

@TomAugspurger
Contributor

TomAugspurger commented Jan 24, 2019

Pandas doesn't currently pass a schema through. IIUC, we could add a schema argument to to_parquet, and pass it through at https://github.com/pandas-dev/pandas/blob/5761e359b65631db491349a65fc21d0da51dcc0f/pandas/io/parquet.py#L103-L113, with little effort.

That seems generally useful, but it wouldn't have helped with this exception, right?

Edit: Oh, reading #3479 (comment), I see that it may have helped if you opted into it.
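
For reference, a sketch of the pyarrow side such an argument would forward to: Table.from_pandas already accepts a schema (visible in the traceback above). The to_parquet(schema=...) argument is hypothetical, and with the mixed column from this issue the conversion may still fail unless the column is made homogeneous first; a clean string column is used here.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Sketch only: the plumbing a hypothetical to_parquet(schema=...) argument
# could forward to on the pyarrow side.
df = pd.DataFrame(dict(a=['x', 'y', 'z']))
schema = pa.schema([pa.field('a', pa.string())])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
pq.write_table(table, 'x')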

@max-sixty
Contributor Author

On the pandas / first question:

Pandas doesn't currently pass a schema through. IIUC, we could add a schema argument to to_parquet, and pass it through at

I think a user could now pass schema as a kwarg to to_parquet, which would then be passed through here: https://github.com/pandas-dev/pandas/blob/5761e359b65631db491349a65fc21d0da51dcc0f/pandas/io/parquet.py#L113 (though an explicit arg may be slightly friendlier).

@TomAugspurger
Contributor

@max-sixty I thought that at first too, but from_pandas_kwargs is built internally. User-provided kwargs aren't passed there (they're sent to write_table).
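
For concreteness, a rough paraphrase of the pyarrow writer in pandas/io/parquet.py at that commit (not the exact source), showing where user kwargs end up:

# Rough paraphrase of PyArrowImpl.write, not the exact pandas source:
def write(self, df, path, compression='snappy', index=None, **kwargs):
    # from_pandas_kwargs is constructed internally; user kwargs never reach it
    from_pandas_kwargs = {} if index is None else {'preserve_index': index}
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    # user-provided **kwargs are forwarded to write_table instead
    self.api.parquet.write_table(table, path,
                                 compression=compression, **kwargs)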

@max-sixty
Contributor Author

Right, thanks @TomAugspurger

@wesm
Member

wesm commented Jan 27, 2019

I'm closing this for now. If you have a well-scoped feature request, can you please open a JIRA issue?
