ARROW-369: [Python] Convert multiple record batches at once to Pandas #216
Conversation
ping @wesm. I still need to check that the RecordBatch schemas are equal and add some negative unit tests, but I wanted to check with you if the approach so far looks good. Could you please take a look when you get a chance? Thanks!
c_col.reset(new CColumn(schema.sp_schema.get().field(i), c_array_vec[i]))
# TODO - why need PyObject ref? arr is placeholder
check_status(pyarrow.ConvertColumnToPandas(
    c_col, <PyObject*> arr, &np_arr))
the PyObject* param doesn't seem to do anything, is it needed?
I think you can remove the cast.
If you instead construct a Table from a list of RecordBatches you can avoid the code duplication. See Table.to_pandas
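As a point of reference, here is a minimal sketch of that suggestion in today's pyarrow spelling (pa.Table.from_batches may not have existed under this name at the time of the PR):

    import pyarrow as pa

    # Two batches that share a schema; from_batches stitches them into one
    # Table whose columns are chunked across the batches.
    batch1 = pa.RecordBatch.from_pydict({"x": [1, 2], "y": ["a", "b"]})
    batch2 = pa.RecordBatch.from_pydict({"x": [3, 4], "y": ["c", "d"]})

    table = pa.Table.from_batches([batch1, batch2])
    df = table.to_pandas()  # single conversion path, no per-batch duplication
    print(df)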
- if (arr->null_count() > 0) {
+ if (data->null_count() > 0) {
      RETURN_NOT_OK(AllocateOutput(NPY_FLOAT64));
Just to make sure: an integer column only needs to be cast to a double if there are null values, right?
right
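This is easy to observe from the Python side; a small demonstration with modern pyarrow (NumPy integer arrays have no null slot, so nulls force a cast to float64 with NaN as the sentinel):

    import pyarrow as pa

    with_nulls = pa.array([1, 2, None], type=pa.int64())
    print(with_nulls.to_pandas().dtype)  # float64: nulls force the cast

    no_nulls = pa.array([1, 2, 3], type=pa.int64())
    print(no_nulls.to_pandas().dtype)    # int64: no nulls, no cast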
This seems fine so far, minor comments
'''

@staticmethod
def to_pandas(batches):
This is a Java-like construct -- just make this a plain old function "dataframe_from_batches" or something like that
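A sketch of the plain-function shape being suggested; the name dataframe_from_batches is the one proposed in this thread, and the body shown here simply delegates to the Table route from the earlier comment rather than reproducing the PR's actual implementation:

    import pyarrow as pa

    def dataframe_from_batches(batches):
        """Convert a list of RecordBatches with equal schemas to a DataFrame."""
        return pa.Table.from_batches(batches).to_pandas()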
for batch in batches:
    for i in range(K):
        arr = batch[i]
        c_array_vec[i].push_back(arr.sp_array)
If you instead loop over column index, then loop over batches, you can avoid this c_array_vec business and construct the Column directly.
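A runnable sketch of that inversion in plain Python (modern pyarrow; pa.chunked_array plays the role of constructing the per-column Column directly, so the intermediate c_array_vec disappears):

    import pyarrow as pa

    batches = [
        pa.RecordBatch.from_pydict({"x": [1, 2], "y": ["a", "b"]}),
        pa.RecordBatch.from_pydict({"x": [3, 4], "y": ["c", "d"]}),
    ]

    columns = []
    for i in range(batches[0].num_columns):              # outer: column index
        chunks = [batch.column(i) for batch in batches]  # inner: batches
        columns.append(pa.chunked_array(chunks))         # build column directly

    table = pa.Table.from_arrays(columns, schema=batches[0].schema)
    print(table.to_pandas())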
arrow::PrimitiveArray* prim_arr = static_cast<arrow::PrimitiveArray*>(
    arr.get());
const T* in_values = reinterpret_cast<const T*>(prim_arr->data()->data());
T* out_values = reinterpret_cast<T*>(PyArray_DATA(out_)) + chunk_offset;
you're welcome to use auto with these casts for better DRY
# TODO: Enable when subclass of unittest.TestCase
# with self.assertRaises(pyarrow.ArrowException):
#     self.assertRaises(pa.dataframe_from_batches([batch1, batch2]))
Can these tests be changed to a unittest.TestCase subclass? I could open another JIRA to do this if that's OK.
I don't have a strong opinion -- I think we've been trying to use pytest (i.e. fixtures over TestCase classes), @xhochy do you have an opinion?
pytest is fine too. I saw unittest being used somewhere, so I wasn't sure if there was a preference.
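For illustration, the commented-out TODO above would read roughly like this in pytest style (hedged: this uses modern pyarrow, where Table.from_batches raises pa.ArrowInvalid on mismatched schemas; the PR's dataframe_from_batches would be exercised the same way):

    import pytest
    import pyarrow as pa

    def test_mismatched_schemas_raise():
        batch1 = pa.RecordBatch.from_pydict({"x": [1, 2]})
        batch2 = pa.RecordBatch.from_pydict({"x": ["a", "b"]})  # string, not int64
        with pytest.raises(pa.ArrowInvalid):
            pa.Table.from_batches([batch1, batch2])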
Thanks for the review @wesm, I fixed the issues you pointed out and added a check that the schemas in all batches are equal.
# check schemas are equal
for i in range(1, len(batches)):
    schema_comp = batches[i].schema
    if not schema.sp_schema.get().Equals(schema_comp.sp_schema):
Should the method CSchema.Equals also be in the pyarrow class?
definitely, pyarrow.schema.Schema.equals
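That binding exists in modern pyarrow as Schema.equals; a small usage sketch (not code from this PR):

    import pyarrow as pa

    s1 = pa.schema([("x", pa.int64()), ("y", pa.string())])
    s2 = pa.schema([("x", pa.float64()), ("y", pa.string())])

    print(s1.equals(s1))  # True: same fields, same types
    print(s1.equals(s2))  # False: "x" differs (int64 vs float64)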
@wesm, hopefully this is good to go now. Please take another look when you can, thanks!
+1, LGTM
+1
# check schemas are equal
if any(not schema.equals(other.schema) for other in batches[1:]):
    raise ArrowException("Error converting list of RecordBatches to "
                         "DataFrame, not all schemas are equal")
Later we'll want to display the mismatched schemas in the error message but this is ok for now
Yeah, I should have added that. I'll make a note to do it later.
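A hedged sketch of that follow-up, wrapped as a self-contained helper (check_equal_schemas is a hypothetical name, and pa.ArrowInvalid stands in for the ArrowException used in the snippet above):

    import pyarrow as pa

    def check_equal_schemas(batches):
        """Raise if any batch's schema differs from the first batch's."""
        schema = batches[0].schema
        for i, other in enumerate(batches[1:], start=1):
            if not schema.equals(other.schema):
                raise pa.ArrowInvalid(
                    "Schema at index {} is different:\n{}\n\nvs\n\n{}"
                    .format(i, other.schema, schema))

    batches = [
        pa.RecordBatch.from_pydict({"x": [1, 2]}),
        pa.RecordBatch.from_pydict({"x": [1.5, 2.5]}),  # float64: mismatched
    ]
    check_equal_schemas(batches)  # raises, message shows both schemas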
I was thinking a better API for this would be:
what do you think?
Yeah, it would be more flexible that way. I can go ahead and make a PR for it.
Modified the Pandas adapter to handle columns with multiple chunks in ConvertColumnToPandas. This modifies the pyarrow public API by adding a class RecordBatchList with a static method toPandas, which takes a list of Arrow RecordBatches and outputs a Pandas DataFrame. Adds a unit test in test_table.py that performs the conversion for each column with typed specialization.