
BUG: to_pickle() raises TypeError when compressing large DataFrame #39002

Closed
2 of 3 tasks
zhuoqiang opened this issue Jan 6, 2021 · 16 comments · Fixed by #39376
Labels
Bug IO Pickle read_pickle, to_pickle Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@zhuoqiang

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample

import pandas as pd

df = pd.DataFrame(range(100000))
df.to_pickle("df.pkl.xz", protocol=5)

Problem description

The code above raises a TypeError:

TypeError                                 Traceback (most recent call last)
<ipython-input-220-8c4b6e9cecfc> in <module>
      2 
      3 df = pd.DataFrame(range(100000))
----> 4 df.to_pickle("df.pkl.xz", protocol=5)

/python3.9/site-packages/pandas/core/generic.py in to_pickle(self, path, compression, protocol, storage_options)
   2859         from pandas.io.pickle import to_pickle
   2860 
-> 2861         to_pickle(
   2862             self,
   2863             path,

/python3.9/site-packages/pandas/io/pickle.py in to_pickle(obj, filepath_or_buffer, compression, protocol, storage_options)
     95         storage_options=storage_options,
     96     ) as handles:
---> 97         pickle.dump(obj, handles.handle, protocol=protocol)  # type: ignore[arg-type]
     98 
     99 

/python3.9/lzma.py in write(self, data)
    232         compressed = self._compressor.compress(data)
    233         self._fp.write(compressed)
--> 234         self._pos += len(data)
    235         return len(data)
    236 

TypeError: object of type 'pickle.PickleBuffer' has no len()
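
The root cause is visible in that last frame: with protocol 5, pickle.dump can hand pickle.PickleBuffer objects directly to the file's write method, and PickleBuffer does not implement __len__. A minimal sketch isolating that behavior (independent of pandas):

```python
import pickle

buf = pickle.PickleBuffer(b"hello")

# PickleBuffer does not implement __len__, which is exactly what
# lzma.LZMAFile.write() calls on its argument in affected Python versions.
try:
    len(buf)
except TypeError as exc:
    print(exc)  # object of type 'pickle.PickleBuffer' has no len()

# Its raw() memoryview does have a length, so coercing first avoids the error.
print(len(buf.raw()))  # 5
```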

Note that reproducing the bug requires all of the following:

  • the DataFrame is large enough
  • compression is used (".xz" or ".zip")
  • the latest pickle protocol (the default) is used; setting protocol=4 explicitly works around the bug

Expected Output

The DataFrame should be pickled successfully.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 3e89b4c
python : 3.9.0.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Mon Aug 31 22:12:52 PDT 2020; root:xnu-6153.141.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : zh_CN.UTF-8
LOCALE : zh_CN.UTF-8

pandas : 1.2.0
numpy : 1.19.4
pytz : 2020.5
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0
Cython : 0.29.21
pytest : 6.2.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: 0.9.0
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

@zhuoqiang zhuoqiang added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 6, 2021
@jreback
Contributor

jreback commented Jan 6, 2021

this is failing in lzma.py, which is part of Python itself; search for an issue there.

@mhoene

mhoene commented Jan 9, 2021

Can confirm this bug here with pandas 1.2.0, but it works with 1.1.5.
@jreback, I'm not sure this is an issue in Python itself, both for that reason and because it fails while applying a function to a pickle object.

@asishm
Contributor

asishm commented Jan 10, 2021

Bisecting gives

first bad commit: [0fa47b6] Write pickle to file-like without intermediate in-memory buffer (#37056)

although looking at the PR - I can't see why that would introduce the regression

@tengels

tengels commented Jan 12, 2021

I confirm having the same issue (Python 3.8.5), after upgrading to pandas 1.2.0.

@SebDarco

Hello,
I had similar issues with Python 3.8.5 and pandas 1.2.0.
Not exactly the one reported, but the 'zip' option would not work properly with a 3 MB DataFrame, saving multiple files within one pickle archive. I tried changing the protocol to 4 and even to 3; the errors were different, but the pickle archive was still not readable.
I reverted to pandas 1.1.5 and everything went fine.
Thank you so much for all the work here!!

@TNieuwdorp

Same here! The exact same error for all the compression algorithms I tried (gz, bz2, xz, zip), although gz doesn't throw an error but crashes the kernel instead.

@twoertwein
Member

twoertwein commented Jan 24, 2021

I think that pickle.dump (https://github.com/pandas-dev/pandas/pull/37056/files#diff-039bb99cc2b18c72809cb81401901b2a29cb650e80490ade09a6f4cc66090023R101)
calls write multiple times. That would explain the zip behavior in #39002 (comment) (we had a similar issue for to_csv: #38714).

I could imagine that other non-zip compression algorithms also do not like write being called multiple times but exhibit different 'symptoms'.

edit: it just fails for big dataframes

from bz2 import BZ2File
from gzip import GzipFile
from lzma import LZMAFile
import pickle
from zipfile import ZipFile

import pandas as pd

small_object = True
big_object = b'a' * 1000000000
small_dataframe = pd.DataFrame(range(100))
big_dataframe = pd.DataFrame(range(100000))

for obj in (small_object, big_object, small_dataframe, big_dataframe):
    for module in (GzipFile, BZ2File, LZMAFile):
        print(module)
        with module('test.foo', mode="w") as compressed:
            pickle.dump(obj, compressed, protocol=5)  # fails
            # compressed.write(pickle.dumps(obj, protocol=5)) # does not fail

@twoertwein twoertwein added IO Pickle read_pickle, to_pickle Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 24, 2021
@twoertwein
Member

@TNieuwdorp you said gzip isn't working for you as well. It works for me locally and also on pandas's CI.

@jreback jreback added this to the 1.2.2 milestone Jan 24, 2021
@TNieuwdorp

TNieuwdorp commented Jan 25, 2021 via email

@eelcovv

eelcovv commented Feb 15, 2021

Confirming the same bug with pandas 1.2.1 using bz2 compression.

@jreback
Contributor

jreback commented Feb 15, 2021

try 1.2.2, which patches this

@bluehope

Confirming the same bug with pandas 1.2.3 using lzma compression.
For me, setting protocol=4 worked around it.
pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f2c8480af2f25efdbd803218b9d87980f416563e
python           : 3.8.8.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.4.0-203-generic
Version          : #235-Ubuntu SMP Tue Feb 2 02:49:08 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : ko_KR.UTF-8
LANG             : ko_KR.UTF-8
LOCALE           : ko_KR.UTF-8

pandas           : 1.2.3
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.0.1
setuptools       : 49.6.0.post20210108
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.0
sqlalchemy       : 1.3.20
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

@jreback
Contributor

jreback commented Mar 10, 2021

@bluehope pls make a new issue if you are using 1.2.3 as this was patched

@bluehope

@bluehope pls make a new issue if you are using 1.2.3 as this was patched

Sorry, it was a problem with pickle itself, not with pandas.

@ghost

ghost commented Jul 22, 2021

It's a Python bug, fixed in Python 3.9.6 / 3.10 beta 4:
https://bugs.python.org/issue44439

@jakirkham
Contributor

jakirkham commented Apr 11, 2022

An alternative approach that would allow pickle protocol 5 in these cases (where the bug fixes are not available) would be to wrap the write method provided by these other file objects. For example...

from bz2 import BZ2File as _BZ2File

try:
    from pickle import PickleBuffer
except ImportError:
    # On Python 3.7 or earlier
    PickleBuffer = None


class BZ2File(_BZ2File):
    def write(self, b):
        if PickleBuffer is not None and isinstance(b, PickleBuffer):
            try:
                b = b.raw()  # coerce to 1-D `uint8` C-contiguous `memoryview` zero-copy
            except BufferError:
                b = bytes(b)  # perform in-memory copy if buffer is not contiguous
        return super(BZ2File, self).write(b)

Then use these wrapped versions with to_pickle.

It seems like there are already some other IO classes being added here. Maybe that would be a natural place for this logic?

Edit: Filed this suggestion as issue ( #46747 ).
