Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Support metadata at de/serialization time #3297

Closed
wants to merge 1 commit into from
Closed

WIP: Support metadata at de/serialization time #3297

wants to merge 1 commit into from

Conversation

ghost
Copy link

@ghost ghost commented Apr 9, 2013

#2485, with much reduced scope.

While collecting vbench data recently, it became painfully
obvious how useful it would be to be able to attach metadata
tags to serialized frames/series/panel.

This is a rough draft for something I'd like to see happen in 0.12.
It creates yet another version of pickle files, so much testing
needs to be done there.

There's a plan (#686, timeline unclear) for implementing a binary serialization
format for pandas, which will need to replicate this functionality if this makes it in.

@jreback, if you have something planned in this direction, I'm glad to
withdraw this PR, it's just a statement of intent to prod us into getting something
working during the 0.12 release cycle.

The design choices I'm going for right now:

  • no propagation semantics through operations, the metadata tags are restored
    if present, and are guranteed on load, only.
    Example use case : store measurements as dataframes with data,
    location, etc'. Then, load a mess of them back up and (generally) you may
    use the metadata either as column/index labels, or just pick out a subset
    based on them, sort based on them, etc'. edit: Document this very clearly.
  • only JSON-able data allowed in dicts. Allowing anything else would bring
    us back to the pickle problem, might create issues if Create efficient binary storage format alternative to pickle #686 gets implemented
    (difficulty in serializing arbitrary objects in metadata without using pickle), and is generally
    an unknown quantity. JSON can cover a lot of mileage.
    letting objects define their own way to be serialized for metadata purposes
    is reinventing pickle, so no. (JSONable not yet enforced in the code).
  • TODO: consider the binary blob case.
  • TODO: if there's no propogation, should the data live on the object at all? ( what about o, meta = pd.load()?)
  • reserved keywords/namespace for pandas' use?
  • establish naming convention (for reserved kwds or more broadly?)
  • make it minimal, no setdefault(), save/load -time hooks etc, users should roll their own there.
  • currently, it's just a dict .meta that gets pickled and unpickled with the object.
  • versioning/schema to be baked in as reserved keywords? the pickle
    code taught me that much.
  • The usual "I want tab completion/attr access/call it _foo" discussion to follow.
In [8]: df=mkdf(10,5)
   ...: df.meta['Im']="persistent"
   ...: df.save('/tmp/1.pickle')
   ...: dfrt =pd.load('/tmp/1.pickle')
   ...: 

In [9]: dfrt.meta
Out[9]: {'Im': 'persistent'}

@jreback
Copy link
Contributor

jreback commented Apr 9, 2013

@y-p nothing really, but I think that can change pickle to make this compatible in an easy way, because all objects will have a block manager pickle (rather than separately like now), should be pretty straightforward to add. Adding to HDFStore is also straightforward.

That said, we again come back to the problem of propogation (which I agree no prop is prob right thing), What are the use cases of this? IOW have you actually needed something like this?

@ghost
Copy link
Author

ghost commented Apr 9, 2013

I've never needed propgation and I don't think it fits pandas' data model well.
If #686 hadn't gone down that road, this probably could have been implemented
sooner. as it is, it took some time to reorient.

@ghost
Copy link
Author

ghost commented Apr 9, 2013

Is there anything you care about for interop of this with HDFStore?

@jreback
Copy link
Contributor

jreback commented Apr 9, 2013

@y-p no HDFStore of meta is trivial, see storing attributes to the group node: http://pandas.pydata.org/pandas-docs/dev/cookbook.html#hdfstore, which in fact pickles the data

@ghost
Copy link
Author

ghost commented Apr 9, 2013

Is pickle compatible across py2-py3, at least for JSONables? strings obviously
merit consideration.

If there's another breaking change planned for pickle, let's roll it all into one - every extra variation is pain,
and it's already pretty hairy.
is there an ongoing issue for that to link to so we don't forget?

@jreback
Copy link
Contributor

jreback commented Apr 9, 2013

Not going to break pickle at all, well only sort of. The code will be able to read all existing pickles (as there exists a compatibiliy layer), but new pickles will only be readable by that version going forward. I don't think that's a problem? (well I haven't changed anything, but this would)

@ghost
Copy link
Author

ghost commented Apr 9, 2013

ofcourse, I meant a new pandas pickle "version", though unfortunately that's
an adhoc concept at this point (not backed up by schema versioning in the pickle files themselves).

@jreback
Copy link
Contributor

jreback commented Apr 9, 2013

thats what I was saying, its not hard to add a new 'format' and keep it reasonably back compatibly, e.g.
instead of saving a (self._data) which is how its done now

{ '_data' : self._data, '_typ' : 'this_type', '_version' : version }

make things a bit easier, and make it reasonably future proof (as far as pickle can be)
and we can still disambiguate these

@ghost
Copy link
Author

ghost commented Apr 9, 2013

that's only frame/panel, series does things differently, having to do with
it's ndarray heritage probably. Another thing to clean up in "the great refactor".

@dalejung
Copy link
Contributor

dalejung commented Apr 9, 2013

My only concern is that users would expect .meta to propagate like .name. Things like a dynamic attribute or subclass disappearing seem to be the common cause for confusion.

@ghost
Copy link
Author

ghost commented Apr 9, 2013

We'll be very clear in the documentation and RELEASE.rst.

and .name does not propogate everywhere either. because that's not well defined.

@jreback
Copy link
Contributor

jreback commented Apr 9, 2013

@y-p thats what the great unification is about, series is a sub of NDFrame (almost done)

@wesm
Copy link
Member

wesm commented Apr 10, 2013

Note that I'm pretty sure we have to solve the "pickle problem" before Series can be made not an ndarray.

@jreback
Copy link
Contributor

jreback commented Apr 10, 2013

This is an interesting idea, basically allows meta data attached to the index labels which naturally carry around anyhow. Don't have to deal with the propo issue at all. I guess I would call this a shadow index (kind of like the name attribute on an index)

http://stackoverflow.com/questions/15932896/is-it-a-good-idea-to-store-metadata-in-the-column-label-of-pandas-dataframe

@ghost
Copy link
Author

ghost commented Apr 10, 2013

Something to think about for the future, but I want to keep this PR strictly about
tagging df/s/p at de/serialization time. Once we have some experience and
user feedback on that, we can consider upping the ante.

@westurner
Copy link
Contributor

http://docs.python.org/2/library/pickle.html#pickle-python-object-serialization

Warning The pickle module is not intended to be secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.

@ghost
Copy link
Author

ghost commented Apr 27, 2013

Yes, we are aware of the pickle security issues, though that's more an issue to warn
users about then an argument to scrap the whole thing.
Plans for another serialization format go way back, but for now pickle is what we have in practice.

Please, try to add a smidgen of prose in the future, so It doesn't feel like
I'm discussing technical issues with a search engine. thanks.

@ghost
Copy link
Author

ghost commented Apr 27, 2013

Since this PR is only concerned with attaching metadata to a serialized object and
getting it back on load, I'm planning to dump the .meta attribute (saving it for something
more appropriate in the future), and just extending the signatures of serializations
class/methods.

df.save(filname,meta=dict())
meta, o = pd.load_with_meta()

or some such.

Although I ran across the need for this myself in the past, The difficulty I have
convincing myself to merge something in is probably an indication of something.

@westurner
Copy link
Contributor

@y-p

Please, try to add a smidgen of prose in the future, so It doesn't feel like
I'm discussing technical issues with a search engine. thanks.

Upon unserialization (.loads, .load), Python Pickles may execute arbitrary code.

Because of the warning in the Python documentation, this functionality of Pickle is not an:

JSON is not as fast; but is there a way to not execute arbitrary python code when loading a DataFrame?

@ghost
Copy link
Author

ghost commented Apr 29, 2013

I see, well, if that's a real concern in the context you're using pandas in, there are the following
options:

  • Only use data from trusted sources
  • Use an alternate format for storing data, HDFStore, csv files and rely on pd.read_x
    to reconstruct dataframes as needed. perhaps with some glue code added.
  • You can use pandasjson, although I can't vouch for it's
    maturity.
  • You can write your own serialization solution. If this out wonderfully well, perhaps a PR
    could be made.

Hope that helps.

@ghost
Copy link
Author

ghost commented Apr 29, 2013

I've pushed #3472. I agree that some minimal effort to raise awareness among users is
a good thing to do. Thanks for prompting.

@ghost ghost closed this May 18, 2013
@ghost
Copy link
Author

ghost commented May 18, 2013

moved to #3643.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants