Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Improve pickle support with BZ2 & LZMA #49068

Merged
merged 49 commits into from
Oct 21, 2022

Conversation

jakirkham
Copy link
Contributor

@jakirkham jakirkham commented Oct 13, 2022

@jakirkham jakirkham force-pushed the fix_pickle5 branch 2 times, most recently from 9bbb696 to 794c315 Compare October 13, 2022 09:22
`PickleBuffer` isn't currently included in `SupportBytes`, which causes
issues with pyright when passing `PickleBuffer` instances to `bytes`.
Though it appears ok passing `PickleBuffer` instances to `memoryview`s.
So do that instead. This is functionaly very equivalent. There is a
slight performance cost to making a `memoryview`, but this is likely
negligible compared to copying to `bytes`.
@mroeschke mroeschke added Performance Memory or execution speed performance IO Pickle read_pickle, to_pickle labels Oct 13, 2022
@mroeschke
Copy link
Member

Noting that we're having unrelated, CI issues currently. Hopefully it should be fixed today.

@jakirkham
Copy link
Contributor Author

No worries. Thanks Matthew! 🙏

Looks like any related test failures have been fixed, but will recheck once CI issues clear.

If you have thoughts on this approach, please let me know. Though no worries if you are busy 🙂

@mroeschke
Copy link
Member

The approach seems reasonable; it would be good to ensure we have tests to ensure the raw and memoryview path are both tested.

Also am I correct in understanding in #46747 that once we drop 3.9 that this won't be entirely necessary or would it still be a good performance benefit to keep?

@jakirkham
Copy link
Contributor Author

jakirkham commented Oct 13, 2022

Mostly, Python 3.9.6 has the fix. Also the fix is included in Python 3.10.0b4 and Python 3.11.0a1. So Python 3.9.6+ don't need the workaround. Python 3.8 and earlier versions of Python 3.9 would need the workaround. IOW it is probably safer to keep until Python 3.10 is the minimum (or when Python 3.9 is dropped).

Once Python 3.10 is minimum, we could drop the wrapper classes and just keep the simplified to_pickle.

Happy to flag the code based on Python version somehow if that makes it easier to track.

In terms of testing, think most of what we need is covered in these tests. The default path raw should be handled there (though please correct me if I'm missing something). The memoryview case would only come up if the data was not contiguous somehow (for example numpy.arange(10)[::2]). Do you know of good examples in Pandas where that would come up?

pandas/io/common.py Outdated Show resolved Hide resolved
This provides a reasonable proxy for testing patched `BZ2File` and
`LZMAFile` objects.
This ran into cyclic import issues in `pickle_compat`. So move
`flatten_buffer` to its own module free of these issues.
@mroeschke
Copy link
Member

Happy to flag the code based on Python version somehow if that makes it easier to track.

That would be really helpful as cleaning up code is always nice :)

The memoryview case would only come up if the data was not contiguous somehow (for example numpy.arange(10)[::2]). Do you know of good examples in Pandas where that would come up?

Yeah this can come up in pandas with ExtensionArrays (1D) being indexed in a DataFrame (2D). We have some tests checking contiguousness of the underlying numpy array when indexing

class NDArrayBacked2DTests(Dim2CompatTests):

This should limit the effects of this patch. Also should make it easier
to remove this backport later once all supported Python versions have
the fix.
Also test another non-contiguous array.
If a `memoryview` is returned, make sure it as close to `bytes` |
`bytearray` as possible. This ensures if other functions assume
something like `bytes` (for example assuming `len(b)` is the number of
bytes contained), things will continue to work even though this is a
`memoryview`.
@jakirkham
Copy link
Contributor Author

Sounds good. Have refactored out a common utility function, flatten_buffer, which simplifies both subclasses. Also flatten_buffer is tested in a variety of cases to make sure it is coercing data into something both compressors can handle (without the fix). Also conditioned the subclass patches on Python version support to clarify things. The subclasses themselves have been moved into compat.

pandas/compat/_utils.py Outdated Show resolved Hide resolved
pandas/compat/__init__.py Outdated Show resolved Hide resolved
Should work around linter issues.
@jakirkham
Copy link
Contributor Author

Anything else needed here?

@jakirkham
Copy link
Contributor Author

How does the updated changelog entry look?

@mroeschke mroeschke added this to the 2.0 milestone Oct 21, 2022
@mroeschke mroeschke merged commit 4a2b068 into pandas-dev:main Oct 21, 2022
@mroeschke
Copy link
Member

Sweet thanks @jakirkham!

@jakirkham jakirkham deleted the fix_pickle5 branch October 21, 2022 17:33
@jakirkham
Copy link
Contributor Author

Awesome! Thank you both 😄

Would it be possible to backport this change as well? If so, what is needed from me to do that?

@mroeschke
Copy link
Member

Backports are currently done for regressions (bug/performance) that were introduced in 1.5. IIUC the performance of to_pickle in this situation didn't necessarily regress in 1.5?

@jakirkham
Copy link
Contributor Author

jakirkham commented Oct 21, 2022

Good question. The history goes like this:

So the perf improvement and then reduction occurred in 1.2 (not 1.5).

That said, maybe there is still value in having this perf improvement in some 1.x version (especially given the move to 2 occurring). What do you think? 🙂

Edit: Should add have no strong feelings here. Just wondering if we can help 1.x users with this change.

@mroeschke
Copy link
Member

Would be a nice performance benefit for 1.5.x users. Also need to weigh any unknown bugs/behavior changes this might introduce. Thoughts @pandas-dev/pandas-core?

@WillAyd
Copy link
Member

WillAyd commented Oct 21, 2022

I would err on the side of caution and go 2.0

phofl pushed a commit to phofl/pandas that referenced this pull request Oct 21, 2022
* Add `BZ2File` wrapper for pickle protocol 5

* Add `LZMAFile` wrapper for pickle protocol 5

* Use BZ2 & LZMA wrappers for full pickle support

* Workaround linter issue

`PickleBuffer` isn't currently included in `SupportBytes`, which causes
issues with pyright when passing `PickleBuffer` instances to `bytes`.
Though it appears ok passing `PickleBuffer` instances to `memoryview`s.
So do that instead. This is functionaly very equivalent. There is a
slight performance cost to making a `memoryview`, but this is likely
negligible compared to copying to `bytes`.

* Refactor out `flatten_buffer`

* Refactor `B2File` into separate module

* Test `flatten_buffer`

This provides a reasonable proxy for testing patched `BZ2File` and
`LZMAFile` objects.

* Move `flatten_buffer` to `_utils`

This ran into cyclic import issues in `pickle_compat`. So move
`flatten_buffer` to its own module free of these issues.

* Import `annotations` to fix `|` usage

* Sort `import`s to fix lint

* Patch `BZ2File` & `LZMAFile` on Python pre-3.10

This should limit the effects of this patch. Also should make it easier
to remove this backport later once all supported Python versions have
the fix.

* Test C & F contiguous NumPy arrays

Also test another non-contiguous array.

* Test `memoryview` is 1-D `uint8` contiguous data

If a `memoryview` is returned, make sure it as close to `bytes` |
`bytearray` as possible. This ensures if other functions assume
something like `bytes` (for example assuming `len(b)` is the number of
bytes contained), things will continue to work even though this is a
`memoryview`.

* Run `black` on `bz2` and `lzma` compat files

* One more lint fix

* Drop unused `PickleBuffer` `import`s

* Simplify change to `panda.compat.__init__`

Now that the LZMA changes are in a separate file, cleanup the changes to
`pandas.compat.__init__`.

* Type `flatten_buffer` result

* Use `order="A"` in `memoryview.tobytes(...)`

In the function `flatten_buffer`, the order is already effectively
enforced when copying can be avoided by using `PickleBuffer.raw(...)`.
However some test comparisons failed (when they shouldn't have) as this
wasn't specified. So add the `order` in both the function and the test.
This should fix that test failure.

* Move all compat compressors into a single file

* Fix `BZ2File` `import`

* Refactor out common compat constants

* Fix `import` sorting

* Drop unused `import`

* Ignore `flake8` errors on wildcard `import`

* Revert "Ignore `flake8` errors on wildcard `import`"

This reverts commit f1f1a2e.

* Explicitly `import` all constants

* Assign `IS64` first

* Try `noqa` on wildcard `import` again

* Declare `BZ2File` & `LZMAFile` once

Fixes a linter issue from pyright.

* `import PickleBuffer` for simplicity

* Add `bytearray` to return type

* Test `bytes` & `bytearray` are returned unaltered

* Explicit list all constants

* Trick linter into thinking constants are used ;)

* Add new entry to 2.0.0

* Assign constants to themselves

Should work around linter issues.

* Update changelog entry    [skip ci]

* Add constants to `__all__`

* Update changelog entry    [ci skip]

* Use Sphinx method annotation
@phofl
Copy link
Member

phofl commented Oct 22, 2022

I'd also prefer to keep this in on main only

noatamir pushed a commit to noatamir/pandas that referenced this pull request Nov 9, 2022
* Add `BZ2File` wrapper for pickle protocol 5

* Add `LZMAFile` wrapper for pickle protocol 5

* Use BZ2 & LZMA wrappers for full pickle support

* Workaround linter issue

`PickleBuffer` isn't currently included in `SupportBytes`, which causes
issues with pyright when passing `PickleBuffer` instances to `bytes`.
Though it appears ok passing `PickleBuffer` instances to `memoryview`s.
So do that instead. This is functionaly very equivalent. There is a
slight performance cost to making a `memoryview`, but this is likely
negligible compared to copying to `bytes`.

* Refactor out `flatten_buffer`

* Refactor `B2File` into separate module

* Test `flatten_buffer`

This provides a reasonable proxy for testing patched `BZ2File` and
`LZMAFile` objects.

* Move `flatten_buffer` to `_utils`

This ran into cyclic import issues in `pickle_compat`. So move
`flatten_buffer` to its own module free of these issues.

* Import `annotations` to fix `|` usage

* Sort `import`s to fix lint

* Patch `BZ2File` & `LZMAFile` on Python pre-3.10

This should limit the effects of this patch. Also should make it easier
to remove this backport later once all supported Python versions have
the fix.

* Test C & F contiguous NumPy arrays

Also test another non-contiguous array.

* Test `memoryview` is 1-D `uint8` contiguous data

If a `memoryview` is returned, make sure it as close to `bytes` |
`bytearray` as possible. This ensures if other functions assume
something like `bytes` (for example assuming `len(b)` is the number of
bytes contained), things will continue to work even though this is a
`memoryview`.

* Run `black` on `bz2` and `lzma` compat files

* One more lint fix

* Drop unused `PickleBuffer` `import`s

* Simplify change to `panda.compat.__init__`

Now that the LZMA changes are in a separate file, cleanup the changes to
`pandas.compat.__init__`.

* Type `flatten_buffer` result

* Use `order="A"` in `memoryview.tobytes(...)`

In the function `flatten_buffer`, the order is already effectively
enforced when copying can be avoided by using `PickleBuffer.raw(...)`.
However some test comparisons failed (when they shouldn't have) as this
wasn't specified. So add the `order` in both the function and the test.
This should fix that test failure.

* Move all compat compressors into a single file

* Fix `BZ2File` `import`

* Refactor out common compat constants

* Fix `import` sorting

* Drop unused `import`

* Ignore `flake8` errors on wildcard `import`

* Revert "Ignore `flake8` errors on wildcard `import`"

This reverts commit f1f1a2e.

* Explicitly `import` all constants

* Assign `IS64` first

* Try `noqa` on wildcard `import` again

* Declare `BZ2File` & `LZMAFile` once

Fixes a linter issue from pyright.

* `import PickleBuffer` for simplicity

* Add `bytearray` to return type

* Test `bytes` & `bytearray` are returned unaltered

* Explicit list all constants

* Trick linter into thinking constants are used ;)

* Add new entry to 2.0.0

* Assign constants to themselves

Should work around linter issues.

* Update changelog entry    [skip ci]

* Add constants to `__all__`

* Update changelog entry    [ci skip]

* Use Sphinx method annotation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Pickle read_pickle, to_pickle Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Always write directly to output in to_pickle
5 participants