PERF: Improve pickle support with BZ2 & LZMA #49068

jakirkham · 2022-10-13T08:50:55Z

* Fixes ENH: Always write directly to output in to_pickle #46747
* Tests added and passed if fixing a bug or adding a new feature
* All code checks passed.
* Added type annotations to new arguments/methods/functions.
* Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

`PickleBuffer` isn't currently included in `SupportsBytes`, which causes issues with pyright when passing `PickleBuffer` instances to `bytes`. Passing `PickleBuffer` instances to `memoryview` appears to be fine, though, so do that instead. This is functionally equivalent. There is a slight performance cost to making a `memoryview`, but this is likely negligible compared to copying to `bytes`.
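As a rough illustration of the difference described above (a minimal standard-library sketch, not code from this PR; the variable names are made up for the example), wrapping a `PickleBuffer` in a `memoryview` shares memory with the source, while `bytes` makes an independent copy:

```python
from pickle import PickleBuffer

source = bytearray(b"abcd")
buffer = PickleBuffer(source)

view = memoryview(buffer)  # zero-copy: shares memory with `source`
copied = bytes(buffer)     # allocates a new, independent bytes object

source[0] = ord("z")       # mutate the underlying data in place

# `view` observes the mutation; `copied` still holds the original contents.
print(view.tobytes(), copied)
```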

mroeschke · 2022-10-13T16:44:07Z

Noting that we're currently having unrelated CI issues. Hopefully they will be fixed today.

jakirkham · 2022-10-13T18:11:38Z

No worries. Thanks Matthew! 🙏

Looks like any related test failures have been fixed, but will recheck once CI issues clear.

If you have thoughts on this approach, please let me know. Though no worries if you are busy 🙂

mroeschke · 2022-10-13T18:22:19Z

The approach seems reasonable; it would be good to make sure we have tests covering both the raw and memoryview paths.

Also, am I correct in understanding from #46747 that once we drop 3.9 this won't be entirely necessary, or would it still be a good performance benefit to keep?

jakirkham · 2022-10-13T19:16:37Z

Mostly. Python 3.9.6 has the fix. The fix is also included in Python 3.10.0b4 and Python 3.11.0a1. So Python 3.9.6+ doesn't need the workaround. Python 3.8 and earlier versions of Python 3.9 would need the workaround. IOW it is probably safer to keep until Python 3.10 is the minimum (or when Python 3.9 is dropped).

Once Python 3.10 is minimum, we could drop the wrapper classes and just keep the simplified to_pickle.

Happy to flag the code based on Python version somehow if that makes it easier to track.
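One way such version flagging could look is sketched below. This is an illustration under assumptions, not the merged code: the names `PY310`, `PatchedBZ2File`, and `compress_buffer` are hypothetical here, and the override is only defined on interpreters older than 3.10.

```python
import bz2
import io
import sys
from array import array
from pickle import PickleBuffer

# Hypothetical flag: True when the interpreter is new enough to have the fix.
PY310 = sys.version_info >= (3, 10)


class PatchedBZ2File(bz2.BZ2File):
    if not PY310:
        # Only override `write` on older interpreters; delete with the flag.
        def write(self, b) -> int:
            try:
                flat = PickleBuffer(b).raw()  # zero-copy 1-D byte view
            except BufferError:
                flat = memoryview(b).tobytes("A")  # copy if non-contiguous
            return super().write(flat)


def compress_buffer(buf) -> bytes:
    """Compress any buffer-like object through the (possibly patched) file."""
    raw = io.BytesIO()
    with PatchedBZ2File(raw, mode="wb") as f:
        f.write(buf)
    return raw.getvalue()


values = array("d", [1.0, 2.0, 3.0])
compressed = compress_buffer(memoryview(values))
```

Keeping the condition at class-definition time means that, once the minimum supported Python has the fix, the subclass body is empty and can be removed in one step.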

In terms of testing, I think most of what we need is covered in these tests. The default `raw` path should be handled there (though please correct me if I'm missing something). The `memoryview` case would only come up if the data was not contiguous somehow (for example `numpy.arange(10)[::2]`). Do you know of good examples in pandas where that would come up?
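The contiguity distinction can be reproduced without NumPy. The sketch below uses a strided `memoryview` as a stand-in for `numpy.arange(10)[::2]`; the helper name `can_flatten_without_copy` is invented for this example.

```python
from pickle import PickleBuffer

data = bytes(range(10))
strided = memoryview(data)[::2]      # every other byte: not contiguous
contiguous = memoryview(data)[2:6]   # a plain slice: still contiguous


def can_flatten_without_copy(buf) -> bool:
    """Return True if PickleBuffer.raw() can expose `buf` without copying."""
    try:
        PickleBuffer(buf).raw()
    except BufferError:
        # raw() refuses buffers that are neither C- nor Fortran-contiguous
        return False
    return True


print(can_flatten_without_copy(contiguous), can_flatten_without_copy(strided))
```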

mroeschke · 2022-10-18T16:53:39Z

> Happy to flag the code based on Python version somehow if that makes it easier to track.

That would be really helpful as cleaning up code is always nice :)

> The memoryview case would only come up if the data was not contiguous somehow (for example numpy.arange(10)[::2]). Do you know of good examples in Pandas where that would come up?

Yeah, this can come up in pandas with ExtensionArrays (1D) being indexed in a DataFrame (2D). We have some tests checking contiguousness of the underlying numpy array when indexing:

pandas/pandas/tests/extension/base/dim2.py

Line 274 in 2e7f5a3

class NDArrayBacked2DTests(Dim2CompatTests):

If a `memoryview` is returned, make sure it is as close to `bytes` | `bytearray` as possible. This ensures that if other functions assume something like `bytes` (for example assuming `len(b)` is the number of bytes contained), things will continue to work even though this is a `memoryview`.
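To see why the `len(b)` assumption matters, consider a typed buffer. In this small standard-library sketch (the variable names are only for illustration), `len` counts items on a typed `memoryview`, whereas the view returned by `PickleBuffer.raw()` is 1-D unsigned bytes, so `len` and `nbytes` agree:

```python
from array import array
from pickle import PickleBuffer

values = array("d", [1.0, 2.0, 3.0])  # three 8-byte doubles

typed = memoryview(values)            # format "d": len() counts items
flat = PickleBuffer(values).raw()     # format "B": len() counts bytes

print(len(typed), typed.nbytes)       # item count differs from byte count
print(len(flat), flat.nbytes)         # both are the byte count
```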

jakirkham · 2022-10-18T19:18:00Z

Sounds good. Have refactored out a common utility function, flatten_buffer, which simplifies both subclasses. Also flatten_buffer is tested in a variety of cases to make sure it is coercing data into something both compressors can handle (without the fix). Also conditioned the subclass patches on Python version support to clarify things. The subclasses themselves have been moved into compat.
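A helper along the lines described in this comment might look like the following. Treat it as a sketch based on the descriptions in this thread (`PickleBuffer.raw(...)` when possible, `memoryview.tobytes(...)` with `order="A"` otherwise); the exact merged signature and behaviour may differ.

```python
from __future__ import annotations

from array import array
from pickle import PickleBuffer


def flatten_buffer(
    b: bytes | bytearray | memoryview | PickleBuffer,
) -> bytes | bytearray | memoryview:
    """Coerce a buffer into 1-D bytes-like data, copying only when required."""
    if isinstance(b, (bytes, bytearray)):
        return b  # already flat: return unaltered
    if not isinstance(b, PickleBuffer):
        b = PickleBuffer(b)
    try:
        # Zero-copy 1-D `uint8` view for contiguous data
        return b.raw()
    except BufferError:
        # Non-contiguous data needs an in-memory copy
        return memoryview(b).tobytes("A")


doubles = array("d", [1.0, 2.0])
strided = memoryview(bytes(range(10)))[::2]
print(type(flatten_buffer(doubles)), type(flatten_buffer(strided)))
```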

jakirkham · 2022-10-21T00:16:43Z

Anything else needed here?

jakirkham · 2022-10-21T06:58:10Z

How does the updated changelog entry look?

mroeschke · 2022-10-21T17:29:45Z

Sweet thanks @jakirkham!

jakirkham · 2022-10-21T17:34:38Z

Awesome! Thank you both 😄

Would it be possible to backport this change as well? If so, what is needed from me to do that?

mroeschke · 2022-10-21T18:11:28Z

Backports are currently done for regressions (bug/performance) that were introduced in 1.5. IIUC the performance of to_pickle in this situation didn't necessarily regress in 1.5?

jakirkham · 2022-10-21T18:27:31Z

Good question. The history goes like this:

1. PR #37056 (Write pickle to file-like without intermediate in-memory buffer) enabled this for all compressors in 1.2.
2. Users discovered issues with BZ2 & LZMA (#39002, BUG: to_pickle() raises TypeError when compressing large dataframe).
3. PR #39376 (REGR: write compressed pickle files with protocol=5) special-cased BZ2 & LZMA to fix the bug in 1.2.2 (though this reduced perf, which of course is the right tradeoff).

So the perf improvement and then reduction occurred in 1.2 (not 1.5).
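The two write strategies in that history can be contrasted with a short standard-library sketch. This is not the pandas implementation; `dump_direct` and `dump_via_bytes` are hypothetical names, and the direct path assumes an interpreter whose compressor accepts the buffers the pickler hands it (per this thread, Python 3.9.6+ or a patched wrapper).

```python
import io
import lzma
import pickle


def dump_direct(obj) -> bytes:
    """Pickle straight into the compressor, with no full in-memory pickle."""
    raw = io.BytesIO()
    with lzma.open(raw, "wb") as f:
        pickle.dump(obj, f, protocol=5)
    return raw.getvalue()


def dump_via_bytes(obj) -> bytes:
    """Build the whole pickle in memory first, then compress it."""
    raw = io.BytesIO()
    with lzma.open(raw, "wb") as f:
        f.write(pickle.dumps(obj, protocol=5))
    return raw.getvalue()


payload = {"values": bytearray(range(256)) * 4}
direct = dump_direct(payload)
buffered = dump_via_bytes(payload)
```

Both produce a stream that decompresses to an equivalent pickle; the difference is peak memory use, since the second variant holds the full uncompressed pickle before writing.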

That said, maybe there is still value in having this perf improvement in some 1.x version (especially given the move to 2 occurring). What do you think? 🙂

Edit: I should add that I have no strong feelings here. Just wondering if we can help 1.x users with this change.

mroeschke · 2022-10-21T18:39:50Z

Would be a nice performance benefit for 1.5.x users. Also need to weigh any unknown bugs/behavior changes this might introduce. Thoughts @pandas-dev/pandas-core?

WillAyd · 2022-10-21T18:52:48Z

I would err on the side of caution and go 2.0

* Add `BZ2File` wrapper for pickle protocol 5
* Add `LZMAFile` wrapper for pickle protocol 5
* Use BZ2 & LZMA wrappers for full pickle support
* Workaround linter issue
  `PickleBuffer` isn't currently included in `SupportsBytes`, which causes issues with pyright when passing `PickleBuffer` instances to `bytes`. Passing `PickleBuffer` instances to `memoryview` appears to be fine, though, so do that instead. This is functionally equivalent. There is a slight performance cost to making a `memoryview`, but this is likely negligible compared to copying to `bytes`.
* Refactor out `flatten_buffer`
* Refactor `B2File` into separate module
* Test `flatten_buffer`
  This provides a reasonable proxy for testing patched `BZ2File` and `LZMAFile` objects.
* Move `flatten_buffer` to `_utils`
  This ran into cyclic import issues in `pickle_compat`. So move `flatten_buffer` to its own module free of these issues.
* Import `annotations` to fix `|` usage
* Sort `import`s to fix lint
* Patch `BZ2File` & `LZMAFile` on Python pre-3.10
  This should limit the effects of this patch. Also should make it easier to remove this backport later once all supported Python versions have the fix.
* Test C & F contiguous NumPy arrays
  Also test another non-contiguous array.
* Test `memoryview` is 1-D `uint8` contiguous data
  If a `memoryview` is returned, make sure it is as close to `bytes` | `bytearray` as possible. This ensures that if other functions assume something like `bytes` (for example assuming `len(b)` is the number of bytes contained), things will continue to work even though this is a `memoryview`.
* Run `black` on `bz2` and `lzma` compat files
* One more lint fix
* Drop unused `PickleBuffer` `import`s
* Simplify change to `panda.compat.__init__`
  Now that the LZMA changes are in a separate file, clean up the changes to `pandas.compat.__init__`.
* Type `flatten_buffer` result
* Use `order="A"` in `memoryview.tobytes(...)`
  In the function `flatten_buffer`, the order is already effectively enforced when copying can be avoided by using `PickleBuffer.raw(...)`. However, some test comparisons failed (when they shouldn't have) as this wasn't specified. So add the `order` in both the function and the test. This should fix that test failure.
* Move all compat compressors into a single file
* Fix `BZ2File` `import`
* Refactor out common compat constants
* Fix `import` sorting
* Drop unused `import`
* Ignore `flake8` errors on wildcard `import`
* Revert "Ignore `flake8` errors on wildcard `import`"
  This reverts commit f1f1a2e.
* Explicitly `import` all constants
* Assign `IS64` first
* Try `noqa` on wildcard `import` again
* Declare `BZ2File` & `LZMAFile` once
  Fixes a linter issue from pyright.
* `import PickleBuffer` for simplicity
* Add `bytearray` to return type
* Test `bytes` & `bytearray` are returned unaltered
* Explicit list all constants
* Trick linter into thinking constants are used ;)
* Add new entry to 2.0.0
* Assign constants to themselves
  Should work around linter issues.
* Update changelog entry [skip ci]
* Add constants to `__all__`
* Update changelog entry [ci skip]
* Use Sphinx method annotation

phofl · 2022-10-22T00:42:10Z

I'd also prefer to keep this on main only.


jakirkham force-pushed the fix_pickle5 branch 2 times, most recently from 9bbb696 to 794c315, October 13, 2022 09:22

jakirkham added 3 commits October 13, 2022 02:53

Add BZ2File wrapper for pickle protocol 5

b3e1bc5

Add LZMAFile wrapper for pickle protocol 5

17f725b

Use BZ2 & LZMA wrappers for full pickle support

280731e

jakirkham force-pushed the fix_pickle5 branch from 794c315 to 280731e, October 13, 2022 09:53

mroeschke added the labels Performance (memory or execution speed performance) and IO Pickle (read_pickle, to_pickle) Oct 13, 2022

twoertwein reviewed Oct 13, 2022

pandas/io/common.py (outdated)

jakirkham added 6 commits October 18, 2022 03:15

Refactor out flatten_buffer

08c37e5

Refactor B2File into separate module

8109338

Merge pandas-dev/main into jakirkham/fix_pickle5

3c498bd

Test flatten_buffer

691eba7

This provides a reasonable proxy for testing patched `BZ2File` and `LZMAFile` objects.

Move flatten_buffer to _utils

7a93b70

This ran into cyclic import issues in `pickle_compat`. So move `flatten_buffer` to its own module free of these issues.

Import annotations to fix | usage

8f5b0a1

jakirkham added 5 commits October 18, 2022 11:54

Merge pandas-dev/main into jakirkham/fix_pickle5

b5ce67c

Sort imports to fix lint

c54529a

Patch BZ2File & LZMAFile on Python pre-3.10

7604d48

This should limit the effects of this patch. Also should make it easier to remove this backport later once all supported Python versions have the fix.

Test C & F contiguous NumPy arrays

9f3d387

Also test another non-contiguous array.

twoertwein reviewed Oct 18, 2022

pandas/compat/_utils.py (outdated)

jakirkham added 2 commits October 18, 2022 13:10

Run black on bz2 and lzma compat files

6df7e08

One more lint fix

39ffab0

jakirkham added 2 commits October 19, 2022 18:33

Merge pandas-dev/main into jakirkham/fix_pickle5

b18a3f0

Add new entry to 2.0.0

0dae476

jakirkham commented Oct 20, 2022

doc/source/whatsnew/v2.0.0.rst (outdated)

jakirkham commented Oct 20, 2022

pandas/compat/__init__.py (outdated)

Assign constants to themselves

366f645

Should work around linter issues.

mroeschke reviewed Oct 20, 2022

pandas/compat/__init__.py (outdated)

jakirkham requested a review from mroeschke October 20, 2022 18:23

jakirkham added 3 commits October 20, 2022 12:50

Update changelog entry [skip ci]

092e726

Merge pandas-dev/main into jakirkham/fix_pickle5

03b8eac

Add constants to __all__

e49ba4f

twoertwein reviewed Oct 21, 2022

doc/source/whatsnew/v2.0.0.rst (outdated)

Update changelog entry [ci skip]

453b4e3

twoertwein reviewed Oct 21, 2022

doc/source/whatsnew/v2.0.0.rst (outdated)

jakirkham added 2 commits October 21, 2022 08:22

Use Sphinx method annotation

30124dd

Merge pandas-dev/main into jakirkham/fix_pickle5

72aeff2

mroeschke added this to the 2.0 milestone Oct 21, 2022

mroeschke approved these changes Oct 21, 2022

mroeschke merged commit 4a2b068 into pandas-dev:main Oct 21, 2022

jakirkham deleted the fix_pickle5 branch October 21, 2022 17:33
