GH-26685: [Python] use IPC for pickle serialisation #37683

anjakefala · 2023-09-12T18:21:45Z

Rationale for this change

Existing pickling serialises the whole buffer, even if the Array is sliced.

What changes are included in this PR?

Pickling and restoring now reuse the buffer truncation that Arrow already implements for IPC serialization.

The approach wraps each Array chunk in a RecordBatch, which adds ~230 bytes to the pickled payload per chunk.

Chunks are not automatically combined pre-pickling.

Are these changes tested?

Yes

Are there any user-facing changes?

No

Closes: [Python] Pickling a sliced array serializes all the buffers #26685

anjakefala · 2023-09-12T19:12:04Z

I was trying to be clever by creating a set of parameters that all of the pickling tests used, but it may not be appropriate for the protocol5 ones.

The other thing I am noticing is that sliced + pickled Boolean arrays are a lot larger than any other data-type:

      # Check truncation upon serialization
>       assert len(serialized_slice) <= 0.5 * len(serialized_arr)
E       AssertionError: assert 367 <= (0.5 * 607)

Existing pickling serialises the whole buffer, even if the Array is sliced. Now we use Arrow's buffer truncation implemented for IPC serialization for pickling. Relies on a RecordBatch wrapper, adding ~230 bytes to the pickled payload per Array chunk. Closes apache#26685

AlenkaF · 2023-10-04T07:35:28Z

> I was trying to be clever by creating a set of parameters that all of the pickling tests used, but it may not be appropriate for the protocol5 ones.

If you keep the parameters as they were, do the tests pass?

> The other thing I am noticing is that sliced + pickled Boolean arrays are a lot larger than any other data-type:

I am curious: if you pickle the whole Boolean array (not sliced), what is the difference in size?

anjakefala · 2023-10-05T23:06:38Z

> If you keep the parameters as they were, do the tests pass?

No! They seem to be legitimately failing due to the changes in the pickling process.

I am still wrapping my head around the PEP, but the changes do not seem to adhere to PEP-574.

One example from the test CI:
pyarrow/tests/test_array.py::test_array_pickle_protocol5[builtin_pickle-data0-typ0] - assert [0, 140696923189512] == [0, 140696923177856]

anjakefala · 2023-10-10T23:29:29Z

Confirmed that this approach would break support for pickle protocol 5 with out-of-band data.
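For context, here is a small standard-library sketch of what out-of-band pickling guarantees and why re-encoding the data loses it. It does not use pyarrow and is not the failing test itself; the helper names `dumps_out_of_band` and `dumps_copying` are made up for the example. With protocol 5 and a `buffer_callback`, a `pickle.PickleBuffer` is handed over without copying, so the restored object still points at the original memory. A serializer that first copies the data into new bytes (as an IPC round trip would) has nothing left to pass out of band.

```python
import pickle


def dumps_out_of_band(buf):
    """Pickle a buffer with protocol 5, collecting out-of-band buffers."""
    buffers = []
    payload = pickle.dumps(
        pickle.PickleBuffer(buf), protocol=5, buffer_callback=buffers.append
    )
    return payload, buffers


def dumps_copying(buf):
    """Pickle a fresh copy of the bytes, as a re-encoding serializer would."""
    buffers = []
    payload = pickle.dumps(
        bytes(buf), protocol=5, buffer_callback=buffers.append
    )
    return payload, buffers


source = bytearray(1024)

# Out-of-band: the payload holds only a placeholder, the data travels in `buffers`.
payload, buffers = dumps_out_of_band(source)
restored = pickle.loads(payload, buffers=buffers)
shares_memory = memoryview(restored).obj is source

# Copying: the data is embedded in the payload and no buffer is exported.
payload_copy, buffers_copy = dumps_copying(source)
restored_copy = pickle.loads(payload_copy, buffers=buffers_copy)

print(len(payload), len(buffers), shares_memory)
print(len(payload_copy), len(buffers_copy))
```

The protocol 5 tests compare buffer addresses before and after the round trip, which is the same property `shares_memory` checks here.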

anjakefala requested review from jorisvandenbossche and AlenkaF September 12, 2023 18:21

github-actions bot added Component: Python awaiting review Awaiting review labels Sep 12, 2023

anjakefala added 3 commits October 3, 2023 11:25

Ran linter

2124aab

Fix variable names

d1cdc4e

anjakefala force-pushed the kef-26685 branch from cf189cc to d1cdc4e Compare October 3, 2023 18:25

Refactor pickle tests to use test-specific parameters

d79b837

Run linter and fix indenting to minimise diff

3e54f20

anjakefala closed this Oct 12, 2023

anjakefala mentioned this pull request Oct 12, 2023

[Python] Pickling a sliced array serializes all the buffers #26685

Open
