New accessor API #26710

Open
datapythonista opened this issue Jun 7, 2019 · 17 comments
Labels
Accessors accessor registration mechanism (not .str, .dt, .cat) API Design Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@datapythonista
Member

datapythonista commented Jun 7, 2019

Currently, to extend pandas Series, DataFrame and Index with user-defined methods, we use accessors in the following way:

@pandas.api.extensions.register_series_accessor('emoji')
class Emoji:
    def __init__(self, data):
        self.data = data

    def is_monkey(self):
        """
        This would create `Series().emoji.is_monkey`
        """
        return self.data.isin(['🙈', '🙉', '🙊'])
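For reference, a minimal runnable sketch of how the accessor registered above is used once the class has been defined. The sample Series is made up for illustration; `register_series_accessor` is the existing pandas API discussed in this issue.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("emoji")
class Emoji:
    def __init__(self, data):
        # `data` is the Series the accessor is being called on.
        self.data = data

    def is_monkey(self):
        return self.data.isin(["🙈", "🙉", "🙊"])


# Illustrative data only.
s = pd.Series(["🙈", "🐶", "🙊"])
result = s.emoji.is_monkey()
print(result.tolist())
```

The method is reachable only through the `emoji` namespace, which is the limitation the second bullet below refers to.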

While this works well, I think there are two problems with this approach:

  • The API looks somewhat intimidating, and it's not well known. I think that's because pandas.api.extensions.register_series_accessor is too long and lives in pandas.api, separate from the functionality most users know.
  • It's not possible to register methods directly (Series().is_monkey instead of Series().emoji.is_monkey)

I think all the projects extending pandas I've seen, simply "inject" the methods (except the ones implemented by pandas maintainers). For example:

What I propose is to have an easier/simpler API for the user. To be specific, this is the syntax I'd like when extending Series...

import pandas

@pandas.Series.extend('emoji')
class Emoji:
    def __init__(self, data):
        self.data = data

    def is_monkey(self):
        """
        This would create `Series().emoji.is_monkey`
        """
        return self.data.isin(['🙈', '🙉', '🙊'])

@pandas.Series.extend(namespace='emoji')
def is_monkey(data):
    """
    This would also create `Series().emoji.is_monkey`
    """
    return data.isin(['🙈', '🙉', '🙊'])

@pandas.Series.extend
class Emoji:
    def __init__(self, data):
        self.data = data

    def is_monkey(self):
        """
        This would directly create `Series().is_monkey`
        """
        return self.data.isin(['🙈', '🙉', '🙊'])

@pandas.Series.extend
def is_monkey(data):
    """
    This would directly create `Series().is_monkey`
    """
    return data.isin(['🙈', '🙉', '🙊'])

This would make things much easier for the user, because:

  • The name pandas.Series.extend is much easier to remember
  • A single function can be used (without creating a class)
  • A direct method of Series... can be created
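One way the four decorator forms above could behave is sketched below. This is not pandas code: `extend_series` is a hypothetical free function standing in for the proposed `pandas.Series.extend`, and the helpers `_direct_method`, `_namespace_class` and `is_cat` are made up for the example. It builds on the existing `register_series_accessor` for the namespaced cases and on plain `setattr` for the direct cases, so it patches `pandas.Series` globally.

```python
import inspect

import pandas as pd
from pandas.api.extensions import register_series_accessor

_namespaces = {}  # namespace name -> accessor class built from plain functions


def _direct_method(cls, name):
    # Wrap a method of an accessor-style class so it can live on Series itself.
    def method(series, *args, **kwargs):
        return getattr(cls(series), name)(*args, **kwargs)
    method.__name__ = name
    return method


def _namespace_class(namespace):
    # Create (once) and register an accessor class that holds plain functions.
    if namespace not in _namespaces:
        def __init__(self, data):
            self._parent = data
        cls = type(namespace.title(), (), {"__init__": __init__})
        register_series_accessor(namespace)(cls)
        _namespaces[namespace] = cls
    return _namespaces[namespace]


def extend_series(arg=None, *, namespace=None):
    """Hypothetical stand-in for the proposed ``pandas.Series.extend``."""
    if isinstance(arg, str):  # called as extend_series("emoji")
        arg, namespace = None, arg

    def decorator(target):
        if inspect.isclass(target) and namespace is not None:
            # Class plus namespace: same as today's accessor registration.
            register_series_accessor(namespace)(target)
        elif inspect.isclass(target):
            # Class, no namespace: expose each public method on Series.
            for name, _ in inspect.getmembers(target, inspect.isfunction):
                if not name.startswith("_"):
                    setattr(pd.Series, name, _direct_method(target, name))
        elif namespace is not None:
            # Function plus namespace: attach it to that namespace's class.
            def method(self, *args, **kwargs):
                return target(self._parent, *args, **kwargs)
            setattr(_namespace_class(namespace), target.__name__, method)
        else:
            # Function, no namespace: becomes Series.<function name>.
            setattr(pd.Series, target.__name__, target)
        return target

    return decorator if arg is None else decorator(arg)


@extend_series(namespace="emoji")
def is_monkey(data):
    return data.isin(["🙈", "🙉", "🙊"])


@extend_series
def is_cat(data):
    return data.isin(["😺", "😸", "😹"])


s = pd.Series(["🙈", "😺", "🐶"])
namespaced = s.emoji.is_monkey().tolist()
direct = s.is_cat().tolist()
print(namespaced, direct)
```

The sketch does no conflict checking for the direct cases, so it would silently overwrite an existing Series method of the same name.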

CC: @pandas-dev/pandas-core

@datapythonista datapythonista added API Design Needs Discussion Requires discussion from core team before further action labels Jun 7, 2019
@gfyoung
Member

gfyoung commented Jun 7, 2019

@pandas.Series.extend(namespace='emoji')
def is_monkey(data):
    """
    This would also create `Series().emoji.is_monkey`
    """
    return data.isin(['🙈', '🙉', '🙊'])

The second option (replicated above) seems like a logical one IMO. No overhead of OOP.

@datapythonista
Member Author

To be clear, what I'm proposing is:

  1. Let users be able to register both, classes (as we do now) and also single functions
  2. The name change pandas.api.extensions.register_series_accessor -> pandas.Series.extend
  3. Make the parameter of the decorator optional (the one currently named name, and in my example named namespace). If it's not present, register the methods directly in Series,... and not with an accessor (e.g. str, dt,...)

@gfyoung
Member

gfyoung commented Jun 7, 2019

Let users be able to register both, classes (as we do now) and also single functions

That's fair, though I think we should encourage functional over OOP.

Make optional the parameter of the decorator

Right

@datapythonista
Member Author

That's fair, though I think we should encourage functional over OOP.

Agree, as long as the class doesn't add value we should encourage using a function, but there will be cases where a class is useful, for example:

@pandas.Series.extend
class Emoji:
    def __init__(self, data):
        self.data = data

    def is_monkey(self):
        return self.data.isin(['🙈', '🙉', '🙊'])

    def is_cat(self):
        # Element-wise range check; a chained comparison on a Series raises ValueError
        return (self.data > '😺') & (self.data < '😾')

    def is_animal(self):
        return self.is_monkey() | self.is_cat()

@gfyoung
Member

gfyoung commented Jun 7, 2019

but there will be cases where a class is useful

Hmmm...that's a good point. Not sure right now how we could compose in the functional version, though that would be quite useful.
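One possible answer, sketched under assumptions: in a functional version the functions can simply call each other, and a small helper can then expose them together under one namespace. `register_functions` below is a hypothetical helper, not an existing pandas function; it builds an accessor class on the fly and hands it to the existing `register_series_accessor`.

```python
import pandas as pd
from pandas.api.extensions import register_series_accessor


def is_monkey(data):
    return data.isin(["🙈", "🙉", "🙊"])


def is_cat(data):
    # Element-wise comparison by code point, strict on both ends.
    return (data > "😺") & (data < "😾")


def is_animal(data):
    # Composition is just one function calling the others.
    return is_monkey(data) | is_cat(data)


def register_functions(namespace, *funcs):
    """Hypothetical helper: expose functions as ``Series.<namespace>.<name>``."""

    def __init__(self, data):
        self._parent = data

    def as_method(func):
        # Adapt func(series, ...) to an accessor method.
        def method(self, *args, **kwargs):
            return func(self._parent, *args, **kwargs)
        return method

    attrs = {func.__name__: as_method(func) for func in funcs}
    attrs["__init__"] = __init__
    register_series_accessor(namespace)(type(namespace, (), attrs))


register_functions("emoji", is_monkey, is_cat, is_animal)

s = pd.Series(["🙈", "😻", "🐶"])
animals = s.emoji.is_animal().tolist()
print(animals)
```

The functions stay usable without pandas registration at all (`is_animal(s)` works on its own), which is the main difference from the class-based version.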

@jreback
Contributor

jreback commented Jun 7, 2019

what is the reason for this? is there some notion that things are 'hard' to extend? is that actually a bad thing? these are generally only for other libraries and NOT for users.

I think all the projects extending pandas I've seen, simply "inject" the methods (except the ones implemented by pandas maintainers). For example:

better to actually have these projects use an official api. if they want to do something ad-hoc that is up to them.

@jbrockmendel
Member

Two nits to pick:

  1. Use a name other than "extend". There is already list.extend and Index.extend (and unfortunately these behave slightly differently). A user could be forgiven for expecting Series.extend to behave like the others.

  2. IIRC our internally-implemented accessors have standardized on self._parent to avoid (further) overloading self._data. We should encourage this idiom, even if it isn't required.
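As an illustration of the second point, here is the emoji accessor from the top of the thread rewritten to store the Series under `self._parent`. The idiom is the one described above; the sample data is made up.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("emoji")
class Emoji:
    def __init__(self, parent):
        # Keep the Series being accessed under `_parent`,
        # leaving names such as `_data` untouched.
        self._parent = parent

    def is_monkey(self):
        return self._parent.isin(["🙈", "🙉", "🙊"])


flags = pd.Series(["🙉", "x"]).emoji.is_monkey()
print(flags.tolist())
```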

@datapythonista
Member Author

Good points @jbrockmendel, I was a bit unsure about extend, but couldn't find anything much better, maybe register?

@jreback I used those libraries as example, but they are not the point. I think it's about code readability, of third-party libraries, pandas itself, and users of pandas. Adding methods to Series,... is something that applies to the 3 cases.

For pandas, an example where this could be useful is DataFrame.to_stata. Personally I think it'd make more sense for that method definition to live in pandas/core/stata.py, where the rest of the related code is, and to have it registered in the simplest possible way there, so just importing the module adds it to DataFrame. It would even be cool to be able to deregister. I personally never used stata and would be happy to have one method less in DataFrame. :)

For third-party packages and users code, I agree that they should use an official API. If they do, we can warn them when they overwrite an attribute, we can keep track of registered methods... We do all that for accessors, but without providing a way to register methods directly, they use DataFrame.attr = whatever and we can't offer them much.

Not sure what's the drawback here. I see a lot of potential on better code organisation of pandas, more modularity, and more scalability. And may be we'd give up in something by implementing this, but I don't see what.

@shoyer
Member

shoyer commented Jun 7, 2019

I don't love encouraging users to monkey-patch methods directly onto pandas.Series. I guess the argument is that people do it anyways, but that feels like an anti-pattern to me.

I like the class method, though. pandas.Series.extension could be a good name.

@TomAugspurger
Contributor

TomAugspurger commented Jun 7, 2019 via email

@shoyer
Member

shoyer commented Jun 7, 2019

I think all the projects extending pandas I've seen, simply "inject" the methods (except the ones implemented by pandas maintainers). For example:

These examples look like more of a case of not knowing about pandas' accessor API. They already use a prefix for their special methods, so they might as well use a namespace:

  • Pandas-Bokeh: all methods are grouped under plot_bokeh already
  • pandarallel: it injects parallel_apply, parallel_applymap and parallel_map methods, which could all go under parallel
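To make the namespace suggestion concrete, here is a toy sketch of a `parallel` accessor built with the existing accessor API. It is not taken from pandarallel or Pandas-Bokeh: the `ParallelAccessor` class, its `map` method and the thread-pool implementation are assumptions chosen only to show the shape `Series.parallel.map(...)` instead of `Series.parallel_map(...)`.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


@pd.api.extensions.register_series_accessor("parallel")
class ParallelAccessor:
    def __init__(self, parent):
        self._parent = parent

    def map(self, func, max_workers=4):
        # Toy implementation: apply func to each element in a thread pool.
        # Executor.map preserves input order, so the index can be reused.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            values = list(pool.map(func, self._parent))
        return pd.Series(values, index=self._parent.index, name=self._parent.name)


squares = pd.Series([1, 2, 3]).parallel.map(lambda x: x * x)
print(squares.tolist())
```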

@datapythonista
Member Author

I like to see the question as the same as Python with the standard library. Python was designed "batteries included" with lots of modules, but also with a standard and easy way for developers to implement an ecosystem of modules around it. Those modules, while not included with Python and not maintained by the Python core devs, work exactly the same as the ones in the standard library. Once a module is installed, the difference between a module of the standard library and a third-party module is minimal. And I guess we are all happy with, and have all benefited from, this design.

In contrast, pandas is designed as a single piece, with an increasing integration with the ecosystem, but still with a clear distinction between what we provide and third-party packages. To me, conceptually, pandas.io.stata and pandas-bokeh look like the same concept: an application that is plugged into the pandas core to provide extra functionality. But while conceptually they can be the same, in practice there are some important differences:

  • pandas.io.stata is not decoupled from the core of pandas (I personally think our code would be much better if it was)
  • pandas-bokeh doesn't have the right to create Series or DataFrame methods, it can only use our second-class accessor system, or monkey patch pandas

Personally, I think the modular design of Python or Django (which also follows the same model) worked really well for them. And I think the steps taken with extension arrays, to create a single interface no matter whether it's the core numpy ones, the others we provide, or third-party ones, are also simplifying things for us.

I see this as moving in the same direction for Series, DataFrame and Index methods. And I think there are many immediate advantages:

  • A clearer and more uniform source code
  • More freedom for our users (to customize pandas, as you can customize most open source projects for your needs)
  • It'd be trivial to add/remove things from the functionalities we provide. Like moving to a third-party package the io for some format that doesn't seem to be popular anymore, when this happens. Or adding to pandas some cool feature that a third-party package implements.
  • And if, for example, we need to refactor all the statistical functions to work with extension arrays based on arrow, it'd be much easier to do if this can be developed as a third-party library, and users are able to deregister the numpy stuff and register the arrow functions. Once the third-party package is mature, we could replace one with the other just by moving code around.

I think the proposal here is a good first step to move in this direction, and I don't see any drawback.

@jreback
Contributor

jreback commented Jun 7, 2019

@datapythonista your points are pretty general, not objectionable but orthogonal to the issues at hand

how does your proposal
advance the current state in a meaningful way?

I am also -1 on patching directly to the main namespace as this is very very confusing

how does a shorter accessor api actually help here?
I am -1 if you are attempting to make this user accessible

it is library accessible, a crucial difference

@datapythonista
Member Author

For me the key issue of this proposal is being able to register methods directly in Series,... I guess we agree that renaming pandas.api.extensions.register_series_accessor to something shorter, or implementing accessors as classes or functions, is not that relevant, just making the code more beautiful (sorry for mixing the 3 here).

I think being able to register methods directly does advance the current state significantly. For example, I could register plot from pandas.plotting, and nothing in the rest of pandas should import it, solving problems with cycles in the imports. Or as I said in the examples, we could have all the stata functionality in pandas.io.stata (and same for excel, gbq...), and not split between the pandas core and their modules.

I understand your point about patching the main pandas classes, but Series has currently 204 methods (not counting attributes, accessors,...). I think defining a standard way of patching some of these methods, and using it consistently will make things clearer/easier, and not more confusing.

@jorisvandenbossche
Member

Something I have been thinking about, not the same but certainly related (quickly going to put it here before I am away for the weekend): that the data type can decide which methods are available on a Series.
This could also be a way for an external party to decide on methods on a Series directly, but specifically for when using ExtensionArrays (so certainly not as a replacement of Marc's idea, as not every extension of pandas needs an extension dtype).

Like we now have the dt accessor, the Series could also say: OK, I am a datetime dtype, so for getting my methods/attributes, I will also check a list of methods that the dtype/EA listed as methods to be dispatched to the EA (we can also do this in __dir__ so that tab completion on actual objects works).
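A toy sketch of that lookup, to make the idea less abstract. It uses a plain wrapper class rather than touching `Series` itself, and everything in it is an assumption for illustration: the `_dtype_methods` registry keyed by `dtype.kind`, the `DispatchingSeries` class, and the `is_weekend` method.

```python
import pandas as pd

# Hypothetical registry: dtype kind -> {method name: function taking the Series}.
_dtype_methods = {
    "M": {"is_weekend": lambda s: s.dt.dayofweek >= 5},  # datetime64 dtypes
}


class DispatchingSeries:
    """Toy wrapper whose extra methods depend on the wrapped Series' dtype."""

    def __init__(self, series):
        self._series = series

    def _extra(self):
        return _dtype_methods.get(self._series.dtype.kind, {})

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        extra = self._extra()
        if name in extra:
            func = extra[name]
            return lambda *args, **kwargs: func(self._series, *args, **kwargs)
        # Everything else is delegated to the wrapped Series.
        return getattr(self._series, name)

    def __dir__(self):
        # Include the dtype-specific names so tab completion can find them.
        return sorted(set(dir(self._series)) | set(self._extra()))


dates = DispatchingSeries(pd.Series(pd.to_datetime(["2019-06-07", "2019-06-08"])))
numbers = DispatchingSeries(pd.Series([1, 2]))

weekend = dates.is_weekend().tolist()
print(weekend)
print("is_weekend" in dir(dates), "is_weekend" in dir(numbers))
```

In a real design the registry would presumably be supplied by the dtype or ExtensionArray itself rather than by a module-level dictionary.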

@jbrockmendel
Member

pandas.io.stata is not decoupled from the core of pandas (I personally think our code would be much better if it was)

@datapythonista I think this merits its own discussion. Framing the issue in terms of decoupling will make it appealing.

@datapythonista
Member Author

Thanks @jbrockmendel, that makes sense. I thought this would be non-controversial besides naming things, and once implemented would allow us to have the discussion over a simple prototype PR, which would make things less abstract.

Will see what I can do, but I really think a more modular code base for pandas would be extremely beneficial, so I will open the discussion again once I can present my ideas in a clearer way.

@mroeschke mroeschke added Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. labels Jul 10, 2021
@jbrockmendel jbrockmendel added the Accessors accessor registration mechanism (not .str, .dt, .cat) label Jul 27, 2023
@mroeschke mroeschke removed the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 25, 2024
8 participants