Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easier sub-classing for Series and DataFrame #4271

Closed
wants to merge 4 commits into from
Closed

Easier sub-classing for Series and DataFrame #4271

wants to merge 4 commits into from

Conversation

carsonfarmer
Copy link

This is a relatively trival 'fix', which makes it easier to sub-class Pandas Series' and Dataframes. Basically, it makes more consistent use of self._constructor when constructing outputs from operations on Series' and DataFrames. For example, with these changes, this is now possible:

In [1]: from pandas import Series
In [2]: class MySeries(Series):
    def say_hello(self):
        print "hello!"   
In [3]: s = MySeries([1,2,3,4,5])
In [4]: s.say_hello()
hello!
In [5]: s.__class__
Out[5]: __main__.MySeries
In [6]: s2 = s[1:3]
In [7]: s2.__class__
Out[7]: __main__.MySeries
In [8]: s2.say_hello()
hello!

This addresses (to some extent) issues #1713 and #60 and the issues mentioned therein. After these changes, all nosetests continue to pass (except those that skip).

Note: This is my first 'pull request', so be gentle!

@jtratner
Copy link
Contributor

Thanks for this PR - looks interesting! There are a few things you can do to make it easier for the pandas core team to incorporate this into pandas. I'm definitely interested in making it easier to subclass Series and Frame (as are others), so if you have any questions please ask - I'm happy to help out. :)

Can you run test_perf.sh on this? _constructor is a property, so it would be useful to know if this adds any performance penalties. If your master is tracking pydata/pandas, this should work for you.

./test_perf.sh -b master -t `git rev-parse HEAD`

Even though this looks trivial, it's going to need to have some test cases too. The test cases you've shown above are a good start, but you'd want to have tests that cover all the operations that ought to maintain the class given your changes. We have a wiki page on testing that hopefully will help you write some tests - https://github.com/pydata/pandas/wiki/Testing . Series tests should go into tests/test_series and Frame tests should go into tests/test_frame.py.

@jtratner
Copy link
Contributor

More general notes for @cfarmer and others: I believe this will be inconsistent across the entire suite of pandas operations (and that could be okay for now). E.g., this wouldn't work if you used groupby, merge or PyTables, right? Would be good to test that out and see if there's an easy way to make this work (e.g., groupby and others could try to take the constructor from the object that's being grouped).

Also, how should pandas handle things like stacking and unstacking? (I don't know what the current behavior is in regards to subclasses of series and frame like the sparse variants).

@cpcloud
Copy link
Member

cpcloud commented Jul 16, 2013

I think pytables has its classes hard coded. i use a subclass of frame regularly and i have to convert before saving to hdf5. might be nice to have a way to register valid classes to pytables when they are subclassed, might be harder than it sound tho

@jreback
Copy link
Contributor

jreback commented Jul 16, 2013

@cfarmer this really doesn't do anything for you

after all you can get exactly the same functionaility by doing monkeypatching see here: http://pandas.pydata.org/pandas-docs/dev/faq.html#adding-features-to-your-pandas-installation

def say_hello(self):
   print "hello!"

Series.say_hello = say_hello

and @jtratner has valid points, this is a quite complicated issue, some of which is solved by eg. #3482

most of the time inheritance is not what you want in any event, much better to use some sort of composition
see very long discussion on this and other issues: #2485, #2695

@jreback
Copy link
Contributor

jreback commented Jul 16, 2013

@cpcloud you are one of those huh :)

maybe give a short description of why/what you are doing?

(and registering serializers is pretty easy if all you are after is have a different constructor called), or something more complicated you can just create a Table sub-class and handle it directly (you can also patch the mappers at run-time)

@jtratner
Copy link
Contributor

@jreback I appreciate what you're saying here, but there are others who want to be able to subclass and have the class be maintained through operations. If we explicitly note that it only works for some functions (i.e., only basic arithmetic and slicing) and it doesn't have a performance hit, is it problematic to start employing the NDFrame idiom of relying on self._constructor to build the object? If nothing else, might help us see which functions could be abstracted, etc.

@cpcloud
Copy link
Member

cpcloud commented Jul 16, 2013

I wrote a little package called span (be nice (or don't, whatever), I haven't done much with it since I started writing my thesis!) to deal with a proprietary neurophys format and a few other things that were very specific to the field.

i found monkey-patching to be impractical for experimenting, as i wanted to add very domain specific methods to DataFrame, but I still needed the frame to have all the arith and cmp ops and I didn't want to write yet another decorator to wrap the ops around, a values element (for example) using composition. Maybe there's a way to "forward" methods that i don't know about. Still, I use the index and columns in a very specific way.

Actually DataFrame was too general because I wanted to enforce a specific type of index (DatetimeIndex) to compute the sampling rate of a data set as a property (again, maybe it could have been done without inheritance)

I also wanted to be able to know when I was using a SpikeDataFrame vs a DataFrame when I was doing stuff in the shell without having to type hasattr(df, 'a_method_i_monkey_patched')

@jreback
Copy link
Contributor

jreback commented Jul 16, 2013

@jtratner conceptually what this PR does is not a problem and it does make it easier to subclass

the issue is since subclasses aren't supported per se it's really at your own risk

my point is really more of a design point - a pandas object is really good at a small set of operations - that is the point of oo design
make a clear cut object

99% of the time I dissuade people from using sub classing because they get into trouble

composition is almost always better (with the main exception being 'changing' the way certain operations work)

sub classing is generally less transparent and more error prone

I think making it a bit harder to subclass actually is a feature! (flames appear here :)

@jreback
Copy link
Contributor

jreback commented Jul 16, 2013

one of the first issues you are going to see is what @cpcloud raised, 'why can't I store my shiny sub-classed DataFrame in an HDFStore` or how do I serialize it?

@jreback
Copy link
Contributor

jreback commented Jul 16, 2013

@cpcloud

I like this!

@property
def super(self):
        return super(SpikeDataFrame, self)

@cpcloud
Copy link
Member

cpcloud commented Jul 16, 2013

@jreback thanks. i actually stopped using it because i saw it recently and i was like "wtf is that?" but it does make it easier to call super methods...

it would be great if you could do

@property
def super(self):
    return super(self.__class__, self)

but i htink there's a reason that won't work, but i can't remember why

@jtratner
Copy link
Contributor

@jreback I understand your point, but given that we already have an _constructor on NDFrame, doesn't it make sense to actually use it?

Also, your example only works in Python3. In Python 2 you'll end up with an unbound instance method.

@jtratner
Copy link
Contributor

@cpcloud If you do super like that, you never end up going up the mro, since it looks for super with getattribute on the current class, which defaults to the super for the class (i.e., if you wanted to use super on the __init__ method and be able to go up multiple levels). You'd need to do something like __super(self) and explicitly state the current class.

@cpcloud
Copy link
Member

cpcloud commented Jul 16, 2013

ah i c. super is starting to not look like a word anymore...good to know

what does python3 do? you can do super() iirc

@jtratner
Copy link
Contributor

@cpcloud yep, you can just do super(self)...much better.

@carsonfarmer
Copy link
Author

@jtratner The PR doesn't work for groupby etc yet, but this is actually (relatively) easy to implement. I have it working on my local copy. So will add to PR when ready. As for things like merge etc... I'm not 100% clear how these cases should be handled, so my preference would be to add them as needed (i.e., not in this initial PR).

@carsonfarmer
Copy link
Author

@jreback I totally agree with you: subclassing is often not necessary, and I've recently found several nice ways to avoid it given the current difficulty in subclassing pandas objects. Having said that, subclassing in Pandas does seem to come up often, and I think if people (including me!) are smart about it, it could prove useful... now to create some tests!

@hayd
Copy link
Contributor

hayd commented Jul 18, 2013

@Komnomnomnom
Copy link
Contributor

I am also somewhat in this boat, for my use-case I'd like to subclass so I can add new property descriptors, but you have to do that at the class level and I only want these properties on certain instances. Easy to avoid by just not using properties but it's a drawback nonetheless...

FWIW I hacked my subclass into the HDF5 store using something like:

import pandas as pd
from pandas.io import pytables 

class MyPanel(pd.Panel):
    pass

class MyPanelStorer(pytables.PanelStorer):
    pandas_kind = 'mywide'
    obj_type = MyPanel
pytables._TYPE_MAP[MyPanel] = 'mywide'
pytables._STORER_MAP['mywide'] = 'MyPanelStorer'
pytables.MyPanelStorer = MyPanelStorer 

@carsonfarmer
Copy link
Author

As @jreback mentioned previously, most of the issues that this PR addresses are also addressed by #3482. I think the PR here could probably be dropped in favour of the changes in #3482.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants