PERF: Improve pickle support with BZ2 & LZMA #49068
`PickleBuffer` isn't currently included in `SupportsBytes`, which causes issues with pyright when passing `PickleBuffer` instances to `bytes`. Passing `PickleBuffer` instances to `memoryview` appears to be fine, so do that instead. This is functionally equivalent. There is a slight performance cost to making a `memoryview`, but this is likely negligible compared to copying to `bytes`.
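To illustrate the trade-off described above, here is a minimal sketch (not code from this PR) showing that wrapping a `PickleBuffer` in a `memoryview` shares memory with the underlying object, whereas `bytes(...)` makes a copy. The variable names are made up for the example.

```python
import pickle

data = bytearray(b"pandas" * 4)
buf = pickle.PickleBuffer(data)

view = memoryview(buf)  # shares memory with `data`; nothing is copied
copied = bytes(view)    # allocates a new, independent bytes object

data[0] = ord("P")      # mutate the underlying buffer in place

view_sees_change = view[0] == ord("P")   # the view reflects the mutation
copy_unchanged = copied[0] == ord("p")   # the copy still holds the old byte
```

The `memoryview` is cheap to create because it only records a reference plus shape information, which is why the note above expects its cost to be small next to a full copy.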
Note that we're currently having unrelated CI issues. Hopefully they will be fixed today.
No worries. Thanks Matthew! 🙏 Looks like any related test failures have been fixed, but will recheck once CI issues clear. If you have thoughts on this approach, please let me know. Though no worries if you are busy 🙂
The approach seems reasonable; it would be good to ensure we have tests to ensure the Also am I correct in understanding in #46747 that once we drop 3.9 that this won't be entirely necessary or would it still be a good performance benefit to keep?
Mostly, Python 3.9.6 has the fix. Also the fix is included in Python 3.10.0b4 and Python 3.11.0a1. So Python 3.9.6+ don't need the workaround. Python 3.8 and earlier versions of Python 3.9 would need the workaround. IOW it is probably safer to keep until Python 3.10 is the minimum (or when Python 3.9 is dropped). Once Python 3.10 is minimum, we could drop the wrapper classes and just keep the simplified Happy to flag the code based on Python version somehow if that makes it easier to track. In terms of testing, think most of what we need is covered in these tests. The default path
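As a rough illustration of the version reasoning in the comment above, the rule could be written as a small predicate. `needs_write_workaround` is a hypothetical helper made up for this sketch, not something in pandas, and it deliberately ignores 3.10 pre-releases older than 3.10.0b4.

```python
import sys


def needs_write_workaround(version=sys.version_info[:3]):
    """Hypothetical helper: True when the interpreter predates the upstream fix.

    Based only on the versions named in the comment above: Python 3.9.6+
    has the fix, while 3.8 and earlier 3.9 releases do not.
    """
    return tuple(version) < (3, 9, 6)
```

A stricter `sys.version_info < (3, 10)` check is simpler to track and remove later, at the price of also patching 3.9.6+ interpreters that no longer need it.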
This provides a reasonable proxy for testing patched `BZ2File` and `LZMAFile` objects.
This ran into cyclic import issues in `pickle_compat`. So move `flatten_buffer` to its own module free of these issues.
That would be really helpful as cleaning up code is always nice :)
Yeah this can come up in pandas with ExtensionArrays (1D) being indexed in a DataFrame (2D). We have some tests checking contiguousness of the underlying numpy array when indexing: pandas/pandas/tests/extension/base/dim2.py, line 274 in 2e7f5a3
This should limit the effects of this patch. Also should make it easier to remove this backport later once all supported Python versions have the fix.
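A minimal sketch of what such a version-gated patch might look like, assuming the wrappers only need to hand `write` a flat byte view. This is a reconstruction from the commit notes rather than the PR's actual code, and it omits the copy fallback for non-contiguous buffers that `flatten_buffer` is described as providing.

```python
import bz2
import io
import lzma
import sys
from array import array
from pickle import PickleBuffer

if sys.version_info >= (3, 10):
    # The standard library classes already handle arbitrary buffers here.
    BZ2File = bz2.BZ2File
    LZMAFile = lzma.LZMAFile
else:

    class BZ2File(bz2.BZ2File):
        def write(self, b):
            # Hand the parent a 1-D uint8 view so len(b) equals the byte count.
            # (Raises BufferError for non-contiguous input in this sketch.)
            return super().write(PickleBuffer(b).raw())

    class LZMAFile(lzma.LZMAFile):
        def write(self, b):
            return super().write(PickleBuffer(b).raw())


# A buffer whose element count differs from its byte count.
values = array("i", range(1000))
sink = io.BytesIO()
with BZ2File(sink, "wb") as f:
    written = f.write(values)
roundtrip = bz2.decompress(sink.getvalue())
```

Keeping the subclass definitions inside the `else` branch means interpreters that already have the fix use the unmodified standard library classes, and the whole branch can be deleted once older versions are dropped.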
Also test another non-contiguous array.
If a `memoryview` is returned, make sure it is as close to `bytes` | `bytearray` as possible. This ensures that if other functions assume something like `bytes` (for example, assuming `len(b)` is the number of bytes contained), things will continue to work even though this is a `memoryview`.
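The behaviour described in these commit notes can be sketched roughly as follows. This is a reconstruction under stated assumptions (return `bytes`/`bytearray` unaltered, otherwise try a zero-copy 1-D `uint8` view, and copy only when the data is not contiguous), not necessarily the exact function merged in the PR. It uses only the standard library, so the non-contiguous case is built by slicing a `memoryview` rather than with NumPy.

```python
from __future__ import annotations

from pickle import PickleBuffer


def flatten_buffer(
    b: bytes | bytearray | memoryview | PickleBuffer,
) -> bytes | bytearray | memoryview:
    """Sketch: return something that behaves like a flat run of bytes."""
    if isinstance(b, (bytes, bytearray)):
        return b  # already flat; return unaltered
    if not isinstance(b, PickleBuffer):
        b = PickleBuffer(b)
    try:
        # Zero-copy: 1-D, format "B", contiguous view of the raw memory.
        return b.raw()
    except BufferError:
        # Non-contiguous data cannot be viewed flat, so copy it instead.
        return memoryview(b).tobytes(order="A")


# 2-D contiguous input comes back as a flat uint8 view of the same bytes.
grid = memoryview(bytes(range(6))).cast("B", shape=[2, 3])
flat = flatten_buffer(grid)

# Strided (non-contiguous) input falls back to a copy.
strided = memoryview(bytes(range(10)))[::2]
copied = flatten_buffer(strided)
```

With the flat view, `len(flat)` matches the number of bytes, which is the property the test described above is meant to protect.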
Sounds good. Have refactored out a common utility function,
Should work around linter issues.
Anything else needed here?
How does the updated changelog entry look?
Sweet, thanks @jakirkham!
Awesome! Thank you both 😄 Would it be possible to backport this change as well? If so, what is needed from me to do that?
Backports are currently done for regressions (bug/performance) that were introduced in 1.5. IIUC the performance of
Good question. The history goes like this:
So the perf improvement and then reduction occurred in 1.2 (not 1.5). That said, maybe there is still value in having this perf improvement in some 1.x version (especially given the move to 2 occurring). What do you think? 🙂 Edit: Should add that I have no strong feelings here. Just wondering if we can help 1.x users with this change.
Would be a nice performance benefit for 1.5.x users. Also need to weigh any unknown bugs/behavior changes this might introduce. Thoughts @pandas-dev/pandas-core?
I would err on the side of caution and go with 2.0
* Add `BZ2File` wrapper for pickle protocol 5
* Add `LZMAFile` wrapper for pickle protocol 5
* Use BZ2 & LZMA wrappers for full pickle support
* Workaround linter issue
  `PickleBuffer` isn't currently included in `SupportsBytes`, which causes issues with pyright when passing `PickleBuffer` instances to `bytes`. Passing `PickleBuffer` instances to `memoryview` appears to be fine, so do that instead. This is functionally equivalent. There is a slight performance cost to making a `memoryview`, but this is likely negligible compared to copying to `bytes`.
* Refactor out `flatten_buffer`
* Refactor `BZ2File` into separate module
* Test `flatten_buffer`
  This provides a reasonable proxy for testing patched `BZ2File` and `LZMAFile` objects.
* Move `flatten_buffer` to `_utils`
  This ran into cyclic import issues in `pickle_compat`. So move `flatten_buffer` to its own module free of these issues.
* Import `annotations` to fix `|` usage
* Sort `import`s to fix lint
* Patch `BZ2File` & `LZMAFile` on Python pre-3.10
  This should limit the effects of this patch. Also should make it easier to remove this backport later once all supported Python versions have the fix.
* Test C & F contiguous NumPy arrays
  Also test another non-contiguous array.
* Test `memoryview` is 1-D `uint8` contiguous data
  If a `memoryview` is returned, make sure it is as close to `bytes` | `bytearray` as possible. This ensures that if other functions assume something like `bytes` (for example, assuming `len(b)` is the number of bytes contained), things will continue to work even though this is a `memoryview`.
* Run `black` on `bz2` and `lzma` compat files
* One more lint fix
* Drop unused `PickleBuffer` `import`s
* Simplify change to `pandas.compat.__init__`
  Now that the LZMA changes are in a separate file, clean up the changes to `pandas.compat.__init__`.
* Type `flatten_buffer` result
* Use `order="A"` in `memoryview.tobytes(...)`
  In the function `flatten_buffer`, the order is already effectively enforced when copying can be avoided by using `PickleBuffer.raw(...)`. However, some test comparisons failed (when they shouldn't have) as this wasn't specified. So add the `order` in both the function and the test. This should fix that test failure.
* Move all compat compressors into a single file
* Fix `BZ2File` `import`
* Refactor out common compat constants
* Fix `import` sorting
* Drop unused `import`
* Ignore `flake8` errors on wildcard `import`
* Revert "Ignore `flake8` errors on wildcard `import`"
  This reverts commit f1f1a2e.
* Explicitly `import` all constants
* Assign `IS64` first
* Try `noqa` on wildcard `import` again
* Declare `BZ2File` & `LZMAFile` once
  Fixes a linter issue from pyright.
* `import PickleBuffer` for simplicity
* Add `bytearray` to return type
* Test `bytes` & `bytearray` are returned unaltered
* Explicitly list all constants
* Trick linter into thinking constants are used ;)
* Add new entry to 2.0.0
* Assign constants to themselves
  Should work around linter issues.
* Update changelog entry [skip ci]
* Add constants to `__all__`
* Update changelog entry [ci skip]
* Use Sphinx method annotation
I'd also prefer to keep this on main only
to_pickle
#46747
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.