
BUG: to_pickle() raises TypeError when compressing large DataFrame #39002

Closed
2 of 3 tasks
zhuoqiang opened this issue Jan 6, 2021 · 16 comments · Fixed by #39376
Labels
Bug IO Pickle read_pickle, to_pickle Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@zhuoqiang

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import pandas as pd

df = pd.DataFrame(range(100000))
df.to_pickle("df.pkl.xz", protocol=5)

Problem description

The code above raises a TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-220-8c4b6e9cecfc> in <module>
      2 
      3 df = pd.DataFrame(range(100000))
----> 4 df.to_pickle("df.pkl.xz", protocol=5)

/python3.9/site-packages/pandas/core/generic.py in to_pickle(self, path, compression, protocol, storage_options)
   2859         from pandas.io.pickle import to_pickle
   2860 
-> 2861         to_pickle(
   2862             self,
   2863             path,

/python3.9/site-packages/pandas/io/pickle.py in to_pickle(obj, filepath_or_buffer, compression, protocol, storage_options)
     95         storage_options=storage_options,
     96     ) as handles:
---> 97         pickle.dump(obj, handles.handle, protocol=protocol)  # type: ignore[arg-type]
     98 
     99 

/python3.9/lzma.py in write(self, data)
    232         compressed = self._compressor.compress(data)
    233         self._fp.write(compressed)
--> 234         self._pos += len(data)
    235         return len(data)
    236 

TypeError: object of type 'pickle.PickleBuffer' has no len()
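
The root cause is visible in that last frame: with protocol 5, pickle.dump can hand pickle.PickleBuffer objects directly to the file's write method, and PickleBuffer does not implement __len__. A minimal sketch isolating that behavior (independent of pandas):

```python
import pickle

buf = pickle.PickleBuffer(b"hello")

# PickleBuffer does not implement __len__, which is exactly what
# lzma.LZMAFile.write() calls on its argument in affected Python versions.
try:
    len(buf)
except TypeError as exc:
    print(exc)  # object of type 'pickle.PickleBuffer' has no len()

# Its raw() memoryview does have a length, so coercing first avoids the error.
print(len(buf.raw()))  # 5
```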

Note that reproducing the bug requires all of the following:

  • the DataFrame is large enough
  • compression is used (".xz" or ".zip")
  • the latest pickle protocol (the default) is used; setting protocol=4 explicitly works around the bug

Expected Output

The DataFrame should be pickled successfully.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.9.0.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : zh_CN.UTF-8
LOCALE : zh_CN.UTF-8

pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.5
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.21
pytest : 6.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: 0.9.0
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

@zhuoqiang zhuoqiang added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 6, 2021
@jreback
Contributor

jreback commented Jan 6, 2021

this is failing in lzma.py, which is part of Python itself; search for an issue there.

@mhoene

mhoene commented Jan 9, 2021

Can confirm this bug here with pandas 1.2.0, but it works with 1.1.5.
@jreback, I'm not sure this is an issue in Python itself, both for that reason and because it fails while applying a function to a pickle object.

@asishm
Contributor

asishm commented Jan 10, 2021

Bisecting gives

first bad commit: [0fa47b6] Write pickle to file-like without intermediate in-memory buffer (#37056)

although looking at the PR - I can't see why that would introduce the regression

@tengels

tengels commented Jan 12, 2021

I confirm having the same issue (Python 3.8.5), after upgrading to pandas 1.2.0.

@SebDarco

Hello,
I had similar issues with Python 3.8.5 and pandas 1.2.0.
Not exactly the one reported, but the 'zip' option would not work properly with a 3 MB DataFrame, saving multiple files within one pickle archive. I tried changing the protocol to 4 and even to 3; the errors were different, but the pickle archive was still not readable.
I reverted to pandas 1.1.5 and everything went fine.
Thank you so much for all the work here!!

@TNieuwdorp

Same here! The exact same error for all the compression algorithms I tried (gz, bz2, xz, zip), although gz doesn't throw an error but crashes the kernel instead.

@twoertwein
Member

twoertwein commented Jan 24, 2021

I think that pickle.dump (https://github.com/pandas-dev/pandas/pull/37056/files#diff-039bb99cc2b18c72809cb81401901b2a29cb650e80490ade09a6f4cc66090023R101)
calls write multiple times. That would explain the zip behavior in #39002 (comment) (we had a similar issue for to_csv: #38714).

I could imagine that other non-zip compression algorithms also do not like write being called multiple times but exhibit different 'symptoms'.

edit: it just fails for big dataframes

from bz2 import BZ2File
from gzip import GzipFile
from lzma import LZMAFile
import pickle
from zipfile import ZipFile

import pandas as pd

small_object = True
big_object = b'a' * 1000000000
small_dataframe = pd.DataFrame(range(100))
big_dataframe = pd.DataFrame(range(100000))

for obj in (small_object, big_object, small_dataframe, big_dataframe):
    for module in (GzipFile, BZ2File, LZMAFile):
        print(module)
        with module('test.foo', mode="w") as compressed:
            pickle.dump(obj, compressed, protocol=5)  # fails
            # compressed.write(pickle.dumps(obj, protocol=5)) # does not fail

@twoertwein twoertwein added IO Pickle read_pickle, to_pickle Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 24, 2021
@twoertwein
Member

@TNieuwdorp you said gzip isn't working for you as well. It works for me locally and also on pandas's CI.

@jreback jreback added this to the 1.2.2 milestone Jan 24, 2021
@TNieuwdorp

TNieuwdorp commented Jan 25, 2021 via email

@eelcovv

eelcovv commented Feb 15, 2021

Confirming the same bug with pandas 1.2.1 using bz2 compression.

@jreback
Contributor

jreback commented Feb 15, 2021

try 1.2.2, which patches this

@bluehope

Confirming the same bug with pandas 1.2.3 using lzma compression.
For me, setting protocol=4 worked around it.
pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f2c8480af2f25efdbd803218b9d87980f416563e
python           : 3.8.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.4.0-203-generic
Version          : #235-Ubuntu SMP Tue Feb 2 02:49:08 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : ko_KR.UTF-8
LANG             : ko_KR.UTF-8
LOCALE           : ko_KR.UTF-8

pandas           : 1.2.3
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.0
sqlalchemy       : 1.3.20
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

@jreback
Contributor

jreback commented Mar 10, 2021

@bluehope pls make a new issue if you are using 1.2.3 as this was patched

@bluehope

@bluehope pls make a new issue if you are using 1.2.3 as this was patched

Sorry, it was a problem with pickle itself, not with pandas.

@ghost

ghost commented Jul 22, 2021

It's a Python bug, fixed in Python 3.9.6 / 3.10 beta 4:
https://bugs.python.org/issue44439

@jakirkham
Contributor

jakirkham commented Apr 11, 2022

An alternative approach that would allow pickle protocol 5 in these cases (where the bug fixes are not available) would be to wrap the write method provided by these other file objects. For example...

from bz2 import BZ2File as _BZ2File

try:
    from pickle import PickleBuffer
except ImportError:
    # On Python 3.7 or earlier
    PickleBuffer = None


class BZ2File(_BZ2File):
    def write(self, b):
        if PickleBuffer is not None and isinstance(b, PickleBuffer):
            try:
                b = b.raw()  # coerce to 1-D `uint8` C-contiguous `memoryview` zero-copy
            except BufferError:
                b = bytes(b)  # perform in-memory copy if buffer is not contiguous
        return super(BZ2File, self).write(b)

Then use these wrapped versions with to_pickle.

It seems like there are already some other IO classes being added here. Maybe that would be a natural place for this logic?

Edit: Filed this suggestion as issue ( #46747 ).
