DOC: Serialization recommendation is deprecated #39956

Closed
chrisroat opened this issue Feb 21, 2021 · 12 comments · Fixed by #41899
Labels: Docs, IO Data (IO issues that don't fit into a more specific label)

@chrisroat

Location of the documentation

https://pandas.pydata.org/pandas-docs/dev/user_guide/io.html#io-msgpack

Documentation problem

Since msgpack was deprecated for on-the-wire transmission, the docs recommend pyarrow serialization/deserialization instead. However, as of pyarrow 2.0, that functionality is itself deprecated for arbitrary objects, and the documented code snippet now emits a deprecation warning.

$ python
Python 3.8.5 (default, Jul 28 2020, 12:59:40) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow as pa
>>> pa.__version__
'3.0.0'
>>> pa.default_serialization_context()
<stdin>:1: DeprecationWarning: 'pyarrow.default_serialization_context' is deprecated as of 2.0.0 and will be removed in a future version. Use pickle or the pyarrow IPC functionality instead.
<pyarrow.lib.SerializationContext object at 0x7f23dfa36940>
>>> 

Suggested fix for documentation

Would pickle be next in line for a recommended on-the-wire format?

@chrisroat added the Docs and Needs Triage labels on Feb 21, 2021
@jreback
Contributor

jreback commented Feb 21, 2021

No, we just need to update to the renamed pyarrow format.

@chrisroat
Author

I may not understand the situation fully -- what is renamed? The deprecation message in the pyarrow docs linked above recommends pickle for non-pyarrow objects. One can convert a dataframe to/from pyarrow table, but it may not be fully compatible.

@jorisvandenbossche
Member

@chrisroat thanks for the report! We should indeed have updated our docs after pyarrow deprecated the serialization functionality.

The most appropriate alternative will depend on your exact use case, but in general I think we can indeed refer users to use pickle instead.
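
As an illustration of the pickle route, here is a minimal sketch of a DataFrame round trip through bytes. The example data is hypothetical, and whether pickle is the right choice still depends on the use case: in particular, unpickling executes arbitrary code, so it should only be used with data from a trusted source.

```python
import pickle

import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Serialize to a bytes object suitable for sending over the wire.
payload = pickle.dumps(df, protocol=pickle.HIGHEST_PROTOCOL)

# On the receiving side, rebuild the DataFrame from the bytes.
restored = pickle.loads(payload)

# Raises AssertionError if the round trip changed the frame.
pd.testing.assert_frame_equal(df, restored)
```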

@jorisvandenbossche added the IO Data label and removed the Needs Triage label on Feb 23, 2021
@jorisvandenbossche added this to the 1.3 milestone on Feb 23, 2021
@simonjayhawkins
Member

removing 1.3 milestone.

@simonjayhawkins changed the milestone from 1.3 to Contributions Welcome on Jun 8, 2021
@jorisvandenbossche changed the milestone from Contributions Welcome to 1.3 on Jun 8, 2021
@jorisvandenbossche
Member

It's important to fix this, as our docs are simply pointing to an alternative that will soon no longer exist. Since this is arrow-related, I will look into it in the next few days (not crucial for the RC, of course).

@jorisvandenbossche
Member

Opened a PR for this at #41899

@stochastic-thread

Does anyone have any information as to why this was deprecated?
Pandas deprecated to_msgpack
Now PyArrow plans to deprecate a simple and useful serialization option.
Kind of annoying if I'm being honest.

@Neltherion

Neltherion commented Sep 26, 2021

Using Pickle5 (as suggested) doesn't seem to give the same performance as PyArrow's deprecated serialization method. Are there ANY proper replacements for pa.serialize() and pa.deserialize()? I hate this kind of deprecation....

@jorisvandenbossche
Member

@Neltherion can you show some example code that illustrates the performance difference? That might help finding out the reason / how this can be improved.

@Neltherion

Neltherion commented Sep 26, 2021

@jorisvandenbossche Here's a simplified example that compares PyArrow and Pickle when serializing/deserializing:

import time

import numpy as np
import pickle5
import pyarrow as pa


class Person:
    def __init__(self, Thumbnail: np.ndarray = None):
        if Thumbnail is not None:
            self.Thumbnail: np.ndarray = Thumbnail
        else:
            self.Thumbnail: np.ndarray = np.random.rand(256, 256, 3)


def serialize_Person(person):
    return {'Thumbnail': person.Thumbnail}


def deserialize_Person(person):
    return Person(person['Thumbnail'])


context = pa.SerializationContext()
context.register_type(Person, 'Person', custom_serializer=serialize_Person, custom_deserializer=deserialize_Person)

PERSONS = [Person() for i in range(100)]

"""
PyArrow
"""
t1 = time.time()
persons_serialized = pa.serialize(PERSONS, context=context).to_buffer()
persons_deserialized = pa.deserialize(persons_serialized, context=context)
t2 = time.time()
print(f'PyArrow Time => {t2 - t1}')

"""
Pickle
"""
t1 = time.time()
persons_pickled = pickle5.dumps(PERSONS, protocol=5)
persons_depickled = pickle5.loads(persons_pickled)
t2 = time.time()
print(f'Pickle Time => {t2 - t1}')

The outputs on my system are:

PyArrow Time => 0.04499983787536621
Pickle Time => 0.2220008373260498

@Neltherion

@jorisvandenbossche Did the example help?

@jorisvandenbossche
Member

@Neltherion I answered at apache/arrow#11239

6 participants