Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Refactor accessors, unify usage, make "recipe" #17042

Closed
wants to merge 21 commits into from

Conversation

jbrockmendel
Copy link
Member

xref #14781, #16890

There are currently accessors for dt, str, cat, and plot. These are each constructed in different ways and using slightly different naming conventions, turning a fairly simple underlying logic into a bit of a maze. This goals for this PR are to:

  • Document a recipe for users to create custom accessors. This now boils down to "subclass PandasDelegate".
  • Unify the ways in which these accessors are constructed, including clarifying naming conventions.
  • Keep never-directly-used methods like _make_dt_accessor, _make_cat_accessor out of the Series namespace.
  • Avoid injecting names into existing namespaces.

The biggest simplification (and the main change in logic) is in where the _make_foo_accessor is defined. Under the status quo, this looks roughly like:

    def _make_foo_acessor(self):
        [validation that self is foo-like]
        return FooDelegate(self)

    foo = AccessorProperty(FooDelegate, _make_foo_acessor)

This PR changes that, so that _make_foo_accessor is now a classmethod _make_accessor of FooDelegate. Then in Series all we need is foo = AccessorProperty(FooDelegate).

At the moment the documentation/recipe is in the form of an example in the module docstring to core.accessors. Not sure where that belongs. I'm not wild about the going back-and-forth between code that is copy/paste-able and ">>>" interludes. test_accessors is a bare-bones copy of that example with some assertions.

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Make str_normalize as func; rename copy--> copy_doc to avoid stdlib name
Rename strings classes in tests

Update imports in tests
flake8 fixes

Update import name
def func(self, *args, **kwargs):
return self._delegate_method(name, *args, **kwargs)

if callable(name):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: you tried this option and decided against it.

the class definition. For things that we cannot keep directly
in the class definition, a decorator is more directly tied to
the definition than a method call outside the definition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The paragraph above may be helpful for the dev discussion, but probably doesn't belong in the docstring.

func = Delegator.create_delegator_method(name, delegate)

# Allow for a callable to be passed instead of a name.
title = com._get_callable_name(name)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: you decided against this option.

@@ -6006,8 +6005,8 @@ def _put_str(s, space):

# ----------------------------------------------------------------------
# Add plotting methods to DataFrame
DataFrame.plot = base.AccessorProperty(gfx.FramePlotMethods,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs rebasing.



DatetimeAccessor = CombinedDatetimelikeProperties
# Alias to mirror CategoricalAccessor
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're subclassing PandasDelegate, would it make more sense for these to be named e.g. DatetimeDelegate?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing these names over now.

method = getattr(self.values, name)
res = method(*args, **kwargs)
# TODO: Should this get wrapped in an index?
return res
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: even though it isn't explicitly needed, having StringMethods subclass PandasDelegate will clarify its parallels with the other accessors.

Plus actually using _delegate_method could get rid of a lot of boilerplate.

# and the _dir_additions/_dir_deletions won't play nicely with
# any other class this gets mixed into that *does* implement its own
# _dir_additions/_dir_deletions. This should be deprecated.
class StringAccessorMixin(object):
Copy link
Member Author

@jbrockmendel jbrockmendel Jul 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StringAccessorMixin should be considered for deprecation. See comment above.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is internal, you can just remove it if not needed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great; wasn't sure if it should be saved for a separate PR. Doing so requires copy/pasting _accessors, _dir_deletions, and _dir_additions from Series (which I see as a good thing, per #8162 , #17061)

def __init__(self, *args, **kwargs):
# Override to prevent accessors.PandasDelegate.__init__ from executing
# This is a kludge.
pass
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: fix this.

@@ -104,6 +106,9 @@ def _delegate_property_get(self, name):
result = result.astype('int64')
elif not is_list_like(result):
return result
elif isinstance(result, DataFrame):
# e.g. TimedeltaProperties.components
return result.set_index(self.index)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This allows _delegate_property_get to correctly handle components.

@pep8speaks
Copy link

pep8speaks commented Jul 20, 2017

Hello @jbrockmendel! Thanks for updating the PR.

Line 2072:5: E303 too many blank lines (2)

Line 1950:1: W391 blank line at end of file

Comment last updated on September 20, 2017 at 04:01 Hours UTC

@gfyoung gfyoung added API Design Clean Internals Related to non-user accessible pandas implementation labels Jul 20, 2017
@codecov
Copy link

codecov bot commented Jul 20, 2017

Codecov Report

Merging #17042 into master will decrease coverage by 0.02%.
The diff coverage is 94.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17042      +/-   ##
==========================================
- Coverage   91.02%   90.99%   -0.03%     
==========================================
  Files         161      162       +1     
  Lines       49308    49344      +36     
==========================================
+ Hits        44883    44903      +20     
- Misses       4425     4441      +16
Flag Coverage Δ
#multiple 88.77% <94.41%> (-0.01%) ⬇️
#single 40.21% <79.32%> (-0.05%) ⬇️
Impacted Files Coverage Δ
pandas/core/base.py 96.56% <ø> (+0.34%) ⬆️
pandas/core/categorical.py 95.5% <100%> (+0.03%) ⬆️
pandas/core/series.py 94.85% <100%> (-0.06%) ⬇️
pandas/core/frame.py 97.75% <100%> (-0.11%) ⬇️
pandas/core/indexes/category.py 98.5% <100%> (-0.01%) ⬇️
pandas/core/accessors.py 90% <90%> (ø)
pandas/core/strings.py 98.07% <94.91%> (-0.41%) ⬇️
pandas/core/indexes/accessors.py 88.29% <95.45%> (+0.2%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e58225...953598a. Read the comment docs.

@codecov
Copy link

codecov bot commented Jul 20, 2017

Codecov Report

Merging #17042 into master will decrease coverage by 0.19%.
The diff coverage is 97.94%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #17042     +/-   ##
=========================================
- Coverage   91.19%      91%   -0.2%     
=========================================
  Files         163      162      -1     
  Lines       49626    49430    -196     
=========================================
- Hits        45258    44984    -274     
- Misses       4368     4446     +78
Flag Coverage Δ
#multiple 88.78% <97.94%> (-0.19%) ⬇️
#single 40.28% <76.92%> (+0.02%) ⬆️
Impacted Files Coverage Δ
pandas/core/base.py 96.56% <ø> (+0.54%) ⬆️
pandas/core/frame.py 97.65% <100%> (-0.22%) ⬇️
pandas/core/series.py 94.89% <100%> (-0.04%) ⬇️
pandas/core/indexes/base.py 95.96% <100%> (-0.32%) ⬇️
pandas/core/strings.py 97.61% <100%> (-0.85%) ⬇️
pandas/core/categorical.py 95.5% <100%> (-0.08%) ⬇️
pandas/core/indexes/category.py 98.48% <100%> (-0.07%) ⬇️
pandas/core/indexes/accessors.py 88.29% <93.1%> (-0.95%) ⬇️
pandas/core/accessors.py 96.36% <96.36%> (ø)
pandas/io/s3.py 0% <0%> (-85%) ⬇️
... and 82 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 21a3800...264a7e7. Read the comment docs.

# -*- coding: utf-8 -*-
"""

An example/recipe for creating a custom accessor.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove all about the custom use of this. This as a refactor is fine, but this would really need a good usecase to actually add a custom delegate. Happy to have this in internals.rst if its necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer: You commented on this in #14781. Have you found that custom accessors are frequently used by xarray users?

@jreback I can delete it or move it somewhere out of the way. The test_accessors file might make a good place to park the discussion seeing as how it uses the same example anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 on the general idea of exposing a way to register custom accessors, though the api in your example seems a little clunky compared to what xarray does? That said, may be better to punt that piece to a follow-up PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that docs should live in internals.rts, logical companion to the subclassing docs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

example seems a little clunky compared to what xarray does

The examples are for pretty different use cases. The xarray example defines a center attribute in terms of multiple existing [column]s of a [DataFrame]. That's why it doesn't need any vectorization boilerplate.

The example I used vectorizes properties/methods of the elements in a Series. StringMethods or CategoricalAccessor work roughly this way. (CombinedDatetimelikeProperties doesn't need to apply the property/method point-wise because they already exist in the underlying Series/Index.)

That said, if this isn't already obvious, that is a shortcoming of the documentation.

There is some avoidable clunkiness in that we are requiring every accessible attribute to be explicitly listed.

Copy link
Member Author

@jbrockmendel jbrockmendel Jul 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just pushed commits that address some of these. Took the example out of the module docstring, put a comment in PandasDelegate.__doc__ pointing readers towards the existing examples.

@shoyer
Copy link
Member

shoyer commented Jul 22, 2017

This interface still looks really complicated to me, much more complex than the recipe we have in xarray. Based on the xarray recipe, you could write something like:

@pd.register_series_accessor('my_accessor')
class MyAccessor(object):
    def __init__(self, series):
        ...

and voila, series.my_accessor returns a MyAccessor object.

In your case, library writers need to define lots of special methods like _make_accessor, _delegate_property_get, etc., not to mention the decorator that requires explicitly enumerating every method.

Why is all this necessary? This sort of complexity is OK internally, but not desirable for user facing API.

One reason I can think of is that we don't even define accessors if they aren't valid (e.g., .cat only exists on Series with a categorical dtype). But even here, I think this could require one extra method at most, e.g.,

@pd.register_series_accessor('cat')
class CategoricalAccessor(object):
    def __init__(self, series):
        ...
    @classmethod
    def _should_exist(cls, series):
        # it might also work to just raise an error in __init__
        # in case where the accessor is invalid
        return series.dtype.is_categorical()

@jbrockmendel
Copy link
Member Author

Why is all this necessary? This sort of complexity is OK internally

@shoyer The first and most accurate answer is that it isn't, but there was a limit to how much legacy logic I wanted to change in my first non-trivial PR here.

Second, be sure that you are comparing analogous functions. xr.register_dataset_accessor('geo')(GeoAccessor) corresponds roughly to str = accessors.AccessorProperty(StringAccessor).

Using _make_accessor instead of just putting the validation in __init__ is there for the branching cases in pd.core.indexes.accessors. I mainly focused on getting _make_dt_accessor and _make_cat_accessor out of the Series namespace. Now that I think about it, checking for _make_accessor and then falling back to __init__ might be simpler.

I haven't looked through the history, but I expect that explicitly listing the delegated names was motivated by the desire to surface some properties/methods DatetimeIndex without inheriting all of it.

@shoyer
Copy link
Member

shoyer commented Jul 23, 2017

The first and most accurate answer is that it isn't, but there was a limit to how much legacy logic I wanted to change in my first non-trivial PR here.

This makes sense. I'll leave it up to others to judge if this internal clean-up is worth it, but let's refrain from adding any new public API until we have that cleaner interface.

@jbrockmendel
Copy link
Member Author

If this is a sticking point, I'd argue that there are two refactorings here that are much more important than everything else.

  1. AccessorProperty.__init__ currently takes arguments accessor_cls and construct_accesor. That constructor can often just be accessor_cls.__init__, but should always be method of the accessor_cls. The Series (and StringAccessorMixin) namespace is definitely not where these belong.

  2. PandasDelegate._add_delegate_accessors took a while to figure out. De-nesting that method could save the next guy a headache.

@jbrockmendel
Copy link
Member Author

A note transplanted from #17061: StringAccessorMixin can/should be deprecated. Doing so requires only a small fraction of this PR.

Then Index can have _accessor, _dir_deletions, and dir_additions copy/pasted from Series (or both could inherit from something more generic).

@jreback Would you prefer 17061-style discussion go somewhere else like the mailing list?

@jreback
Copy link
Contributor

jreback commented Jul 24, 2017

u can certainly try the ML

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generally looks ok. lmk when you are ready ping.

Delegate : instance of PandasDelegate or subclass

"""
raise NotImplementedError(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use AbstractMethodError

[...]


This replaces the older usage in which following a class definition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need to say what is was, just what it does now.

typ : 'property' or 'method'
overwrite : boolean, default False
overwrite the method/property in the target class if it exists
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe some asserts to valid types of the accessors and such

@@ -2013,7 +2014,20 @@ def repeat(self, repeats, *args, **kwargs):
# The Series.cat accessor


class CategoricalAccessor(PandasDelegate, NoNewAttributesMixin):
@accessors.wrap_delegate_names(delegate=Categorical,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alternative way to do this is to add in the classes themselves which methods / properties should be considered accessors (either in a class method or via decorators), i did do this for a while, but actually though it simpler here. but can revisit at a later point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm amenable to that. Over in core.indexes.accessors there is a comment above DatetimeAccessor noting an alternative way to define the accessed properties/methods.

The only pattern I actively dislike is CategoricalIndex._add_accessors; pinning attributes to the class from a distance can lead to surprises (like the still-mysterious circular import core.series<-->plotting._core).

def _delegate_method(self, name, *args, **kwargs):
from pandas import Series
method = getattr(self.categorical, name)
res = method(*args, **kwargs)
if res is not None:
return Series(res, index=self.index)

# TODO: Can we get this from _delegate_property_get?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but its simpler here as it



DatetimeAccessor = CombinedDatetimelikeProperties
# Alias to mirror CategoricalAccessor
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

import pandas.core.strings as strings
from pandas.core.indexes.accessors import (
maybe_to_datetimelike, CombinedDatetimelikeProperties)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally don't move imports around unless they are germane, it makes the diff hard to read

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

@@ -1316,6 +1358,27 @@ def str_encode(arr, encoding, errors="strict"):
return _na_map(f, arr)


def str_normalize(arr, form):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nearly all the StringMethods methods follow a pattern of defining str_foo as a module-level function and then wrapping it as the foo method. StringMethods.normalize didn't follow this pattern, so I moved it out. At the time I intended to use _delegate_method to get rid of a bunch of boilerplate in the class (and clarify its relation to other PandasDelegate subclasses), but eventually chose to hold off on that.

So this isn't strictly pertinent to this PR; I'll be happy to revert it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would getting rid of the not-strictly-relevant edits here in strings.py grease the wheels on this?

# and the _dir_additions/_dir_deletions won't play nicely with
# any other class this gets mixed into that *does* implement its own
# _dir_additions/_dir_deletions. This should be deprecated.
class StringAccessorMixin(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is internal, you can just remove it if not needed

@jbrockmendel
Copy link
Member Author

Does the CircleCI timeout require action?

@chris-b1
Copy link
Contributor

I restarted it; think it's been a bit flaky

@gfyoung
Copy link
Member

gfyoung commented Jul 26, 2017

I restarted it; think it's been a bit flaky

Yeah, the CI services can sometimes be annoying like. Travis I've found is generally the least "disobedient" (for lack of a better term), but it has its own share of problems 😢

@jreback
Copy link
Contributor

jreback commented Aug 8, 2017

you can rebase or close this. prob easier to do step by step changes.

@jbrockmendel
Copy link
Member Author

Do you mind letting this stay open until #17117 is resolved? I'll have to rebase again after that anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants