ENH/DOC: reimplement Series delegates/accessors using descriptors #9322

Merged (3 commits) on Jan 25, 2015

Conversation

shoyer
Member

@shoyer shoyer commented Jan 21, 2015

Fixes #9184

This PR fixes the API docs to use Series.str and Series.dt instead of StringMethods and DatetimeProperties.

It will need a rebase once #9318 is merged.

CC @jorisvandenbossche @jreback

@shoyer shoyer added the Docs label Jan 21, 2015
@shoyer shoyer added this to the 0.16.0 milestone Jan 21, 2015
@jorisvandenbossche
Member

I wanted to say, "I don't think this is going to work, as the dt/str are None for not initialised Series", but then I saw your conf.py addition -> I like!

Of course, this is not fully 'correct', and users will still see the full name of the objects when requesting help with ?, but as this is how users interact with these accessors, I think this is much better!

@@ -449,109 +449,103 @@ Datetimelike Properties

``Series.dt`` can be used to access the values of the series as
datetimelike and return several properties.
Due to implementation details the methods show up here as methods of the
``DatetimeProperties/PeriodProperties/TimedeltaProperties`` classes. These can be accessed like ``Series.dt.<property>``.
Member

I was thinking, should we keep this sentence (but e.g. in parentheses as a note that they are implemented like that, since users can still see this if they do s.str? or type(s.str))?

Member Author

I think this is an implementation detail that users don't need to know, kind of like the various indexer classes. So my vote is for not mentioning it. Note that s.str? does work, at least if s is a Series object.

Member

yes, I know s.str? works, and there the StringMethods name is visible to users; that is why I was wondering whether we should mention it or not

Contributor

Maybe add it as a note after the introduction/methods?

@jorisvandenbossche
Member

One more place where this has to be replaced: reshaping.rst L481

And then: should we do something about the obsolete links that will arise? Eg the link http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.strings.StringMethods.cat.html will no longer exist due to this change.

@shoyer shoyer force-pushed the better-delegate-api-docs branch 2 times, most recently from 242f536 to 342e776 Compare January 21, 2015 09:17
@shoyer
Member Author

shoyer commented Jan 21, 2015

Not quite sure what to do about obsolete links, but made your other suggested changes.

ReadTheDocs has a user-defined redirects feature, but I can't find that option for standard sphinx. We could certainly use a hack to generate pages at the old URLs, but that's not really ideal either.

@jorisvandenbossche
Member

Hmm, one problem: this patching of pandas.Series.str/dt also messes up the docs themselves, because in the tutorial docs the use of these functions doesn't work anymore. See e.g. the Travis doc build log; a typical error message is: unbound method str_lower() must be called with StringMethods instance as first argument (got nothing instead)

Series.cat.remove_categories
Series.cat.remove_unused_categories
Series.cat.set_categories
Series.cat.codes

To create a Series of dtype ``category``, use ``cat = s.astype("category")``.

Contributor

maybe move this to the Series section and/or keep the originals here - these are valid Categorical properties that should be listed

@jorisvandenbossche
Member

About the failures in the code snippets due to this patching, should the code snippets run in the ipython directive normally not be run in a separate process?

@jorisvandenbossche
Member

So I don't think this is going to be this easy:

  • patching pandas in the conf file also patches it for all code examples in docs -> not possible

  • patching pandas would possibly work when we only do it in api.rst itself, and when this is parsed as last. But, the problem is that the autosummary extension already gathers objects to build in the beginning, and then fails with WARNING: [autosummary] failed to import 'pandas.Series.dt.microseconds': no module named pandas.Series.dt.microseconds

  • I thought that the IPython directive would create a new InteractiveShell instance for each file, but this is not the case. So the full doc build is done in one and the same InteractiveShell. We could reset this at the beginning of each rst file (this resets the namespace, something we should maybe do in any case), but this does not seem to remove the patching:

    In [1]: import pandas as pd
    
    In [2]: pd.Series.str
    
    In [3]: pd.Series.str = 'test'
    
    In [4]: pd.Series.str
    Out[4]: 'test'
    
    In [5]: from IPython import Config, InteractiveShell
    
    In [6]: IP = InteractiveShell.instance()
    
    In [7]: IP.reset()
    
    In [2]: pd.Series.str
    ---------------------------------------------------------------------------
    NameError                                 Traceback (most recent call last)
    <ipython-input-2-bc49ee86f4ca> in <module>()
    ----> 1 pd.Series.str
    
    NameError: name 'pd' is not defined
    
    In [3]: import pandas as pd
    
    In [4]: pd.Series.str
    Out[4]: 'test'
    

    But I am not really familiar with the IPython internals, so maybe my reasoning is a bit simple.
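A likely explanation for the session above, sketched with a standard-library class rather than pandas: clearing a namespace only drops the names, while the module object (and anything patched onto its classes) stays cached in sys.modules. The `namespace` dict below is a stand-in for the interactive namespace and is an assumption for illustration, not IPython's actual reset mechanism.

```python
import sys

# A plain dict stands in for the interactive namespace (an assumption for
# illustration; IPython's reset() does more than this).
namespace = {}

# Patch a class attribute, comparable to `pd.Series.str = 'test'` above.
exec("import string\nstring.Template.patched = 'test'", namespace)

# "Reset": every user-defined name is gone, including `string`.
namespace.clear()
name_survived_reset = "string" in namespace

# Re-importing hands back the cached module from sys.modules, patch included.
exec("import string", namespace)
module_is_cached = namespace["string"] is sys.modules["string"]
patched_value = namespace["string"].Template.patched

# Undo the patch so the sketch leaves no side effects behind.
del namespace["string"].Template.patched
```

Under this reading, only a separate process (or explicitly undoing the patch) would give each rst file an unpatched pandas.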

@shoyer shoyer force-pushed the better-delegate-api-docs branch from 342e776 to 0331b64 Compare January 21, 2015 19:13
@shoyer
Member Author

shoyer commented Jan 21, 2015

OK, tried again with more serious trickery and reverted Series.cat -> Categorical.

@shoyer
Member Author

shoyer commented Jan 21, 2015

Another approach, which would be more robust and make this work even in normal code, would be to make StringMethods, etc. property subclasses (or something similar with descriptors). See http://stackoverflow.com/questions/12405087/subclassing-pythons-property

@shoyer
Member Author

shoyer commented Jan 21, 2015

OK, descriptors are definitely the right way to solve this problem -- we can write a custom property which can be defined (differently!) on both the type and instance. This means autocomplete with pd.Series.str.<tab> will work interactively, not just in the docs.

The main annoyance is that descriptors are created when the class is defined (unlike the current delegates which are only defined on instances), so we need to do some import reorganization to avoid newly recursive imports.

Still can't quite believe I'm writing my own descriptor...
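For readers unfamiliar with descriptors, the idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `ToySeries`, `ToyStringMethods`, and `AccessorDescriptor` are hypothetical names, and plain lists stand in for Series data.

```python
class ToyStringMethods:
    """Vectorized string functions (toy stand-in for StringMethods)."""

    def __init__(self, values):
        self._values = values

    def lower(self):
        return [v.lower() for v in self._values]


class AccessorDescriptor:
    """Hypothetical descriptor that behaves differently on type and instance."""

    def __init__(self, accessor_cls):
        self._accessor_cls = accessor_cls
        self.__doc__ = accessor_cls.__doc__

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class (ToySeries.str): expose the accessor type
            # so tab completion and documentation tools can inspect it.
            return self._accessor_cls
        # Accessed on an instance (s.str): build an accessor bound to its data.
        return self._accessor_cls(instance.values)


class ToySeries:
    # Created when the class body runs, which is why import order matters.
    str = AccessorDescriptor(ToyStringMethods)

    def __init__(self, values):
        self.values = list(values)


s = ToySeries(["A", "b"])
lowered = s.str.lower()          # instance access: a bound accessor
accessor_type = ToySeries.str    # class access: the accessor class itself
```

Because `__get__` receives `instance=None` for class-level access, the same attribute name can serve both interactive completion on the type and normal use on instances.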

@shoyer shoyer force-pushed the better-delegate-api-docs branch 2 times, most recently from 29a1ec3 to d6d82e8 Compare January 22, 2015 03:34
@shoyer shoyer changed the title DOC: refer to Series delegates in API docs directly ENH/DOC: reimplement Series delegates/accessors using descriptors Jan 22, 2015
@shoyer shoyer force-pushed the better-delegate-api-docs branch 2 times, most recently from 5272b97 to 2b012f2 Compare January 22, 2015 03:42
@shoyer
Member Author

shoyer commented Jan 22, 2015

OK, new commit implements the descriptors solution. pd.Series.str.<tab> should work now! (along with .dt and .cat)

This is now a bit more than a DOC fix; PR title updated accordingly (and I added what's new notes).

@shoyer shoyer force-pushed the better-delegate-api-docs branch from 2b012f2 to b7f5775 Compare January 22, 2015 04:38
@@ -579,6 +571,8 @@ To create a Series of dtype ``category``, use ``cat = s.astype("category")``.
The following two ``Categorical`` constructors are considered API but should only be used when
adding ordering information or special categories is need at creation time of the categorical data:

.. currentmodule:: pandas.core.categorical
Member

BTW, this does not need to be pandas.core.categorical, the user API is just pandas.Categorical

@jorisvandenbossche
Member

In any case, this looks like a nice solution. The tab completion works, and the tutorial docs are also OK. Only the API docs don't seem to like it yet... (I think it's an autodoc problem; now doing a full doc build)

@jorisvandenbossche
Member

@shoyer I think I have got something working for the autodoc problem, will send a patch soon

@jorisvandenbossche
Member

@shoyer The result of my hacking: jorisvandenbossche@14b743b (If it looks OK, I can do a PR against your branch if you want: this)

I don't know if this is the best way, but at least I got it working that way.
The problem was that autosummary split it up as module 'pandas.Series' and object 'dt.hour' (this is fixed with the autosummary template), and afterwards autodoc split it as module 'Series' and object path 'dt'->'hour' (and then you got 'module Series could not be imported') instead of module 'pandas' and object path 'Series'->'dt'->'hour'. This is fixed with the custom Documenters.
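The split between "module" and "object path" can be illustrated with a small helper. This is only a sketch of the general idea using the standard library; `resolve_dotted` is a hypothetical function, not part of Sphinx, and it does not reproduce how autosummary or autodoc actually parse names.

```python
import importlib


def resolve_dotted(name):
    """Split a dotted name into (module name, attribute path, object).

    Tries the longest importable prefix as the module, then walks the
    remaining parts with getattr. Hypothetical helper for illustration.
    """
    parts = name.split(".")
    for i in range(len(parts), 0, -1):
        modname = ".".join(parts[:i])
        try:
            obj = importlib.import_module(modname)
        except ImportError:
            continue  # prefix is not a module; try a shorter one
        attr_path = parts[i:]
        for attr in attr_path:
            obj = getattr(obj, attr)
        return modname, attr_path, obj
    raise ImportError("no importable prefix in %r" % name)


# Analogous to module 'pandas' with object path 'Series' -> 'dt' -> 'hour':
modname, attr_path, obj = resolve_dotted("json.decoder.JSONDecoder.decode")
```

Treating too long a prefix as the module (here that would be 'json.decoder.JSONDecoder') is the kind of mistake that produces a "could not be imported" message.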

@shoyer
Member Author

shoyer commented Jan 22, 2015

@jorisvandenbossche Thanks for figuring out that sphinx mess! I did a full docs rebuild myself and your fix does the job (I didn't quite realize my version was broken)

I've merged your commit and did a bit of history rewriting to separate this into two commits (both of which should pass CI tests), mine with the descriptors and yours with the new sphinx directive + doc changes. This needs a review from @jreback but otherwise I think it could be merged.

@jreback
Contributor

jreback commented Jan 22, 2015

@shoyer looks good. I would raise the TypeError on using .str on non-object (btw I think there might be an open issue about that, pls link it here so it gets closed as well).

assume you guys are happy with the doc build.

never used descriptors myself... but seems solid.

@jreback
Contributor

jreback commented Jan 22, 2015

it's #9184

@shoyer
Member Author

shoyer commented Jan 23, 2015

@jreback thanks for the tip.

For future reference, I found this SO post showing how to implement property as a descriptor in pure Python very helpful:
http://stackoverflow.com/questions/12405087/subclassing-pythons-property

@shoyer
Member Author

shoyer commented Jan 23, 2015

OK, added another commit adding TypeError for .str on non-object dtypes.

FYI @cpcloud I had to remove a test you wrote in #3645 (test_iter_numeric_try_string) to get this to pass -- Series.str now raises for numeric Series, so it should not be possible to reach this edge case.

@jreback
Contributor

jreback commented Jan 23, 2015

so we prob need a slightly stronger check for whether something is string-like

eg you have a Series of Python objects
however, figuring this out can be perf intensive so we don't want to do it
but the str methods may bork on this type of input

so maybe make a note / open issue for this

@shoyer
Member Author

shoyer commented Jan 23, 2015

@jreback I agree, I think we pretty much need a real string dtype to do that, but at least this should cover us 90% of the time. I'll add the note/issue.
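The trade-off discussed here can be sketched with plain lists and a dtype label. Everything below is a toy illustration under assumed names (`cheap_check`, `thorough_check`, `toy_upper`); it is not the check added in this PR, only the shape of the argument: a dtype-only test is O(1) but lets mixed object data through, while a per-element test is accurate but costs O(n).

```python
STR_ERROR = ("Can only use .str accessor with string values, "
             "which use np.object_ dtype in pandas")


def cheap_check(dtype):
    """O(1): look only at the dtype label."""
    return dtype == "object"


def thorough_check(values):
    """O(n): inspect every element; accurate but costly on large data."""
    return all(isinstance(v, str) for v in values)


def toy_upper(values, dtype):
    """Toy string method guarded only by the cheap dtype check."""
    if not cheap_check(dtype):
        raise TypeError(STR_ERROR)
    return [v.upper() for v in values]


ok = toy_upper(["a", "b"], "object")   # genuine strings work
mixed = ["a", 1]                       # object dtype, but not all strings
passes_cheap = cheap_check("object")
passes_thorough = thorough_check(mixed)
```

With `mixed`, the cheap check passes and the failure only surfaces later, inside the method, as an AttributeError on the integer element.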

Fixes GH9184

Also includes a fix for Series.apply to ensure that it propagates
metadata and dtypes properly for empty Series (this was necessary to
fix a Stata test)
@shoyer shoyer force-pushed the better-delegate-api-docs branch from 8e4a49b to b7a6d1b Compare January 23, 2015 08:06
@shoyer
Member Author

shoyer commented Jan 23, 2015

@jreback Note and new issue added. Also needed a bug fix for Series.apply to ensure it preserves dtypes, added a change and tests.

@jorisvandenbossche
Member

Just a small remark: we could also opt to only do the check when an actual method/attribute gets called on the accessor?
Maybe it is not worth it, but now if you have a numeric Series s = pd.Series([1]), you can still get str with tab completion, but you cannot get the help:

In [10]: s.str?
Object `s.str` not found.

If you just access it with s.str without the help, you get the TypeError

@shoyer
Member Author

shoyer commented Jan 23, 2015

@jorisvandenbossche The trickiness here would be for .dt. If we defer the check until an actual method/attribute is called, then we can't ensure that autocomplete only gets the right attributes (e.g., datetime vs timedelta).
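A rough sketch of why .dt needs to know the data type up front: the accessor object itself differs by type, so completion can only be right if the type is inspected when the accessor is created. The class and function names below are hypothetical toys built on the standard library, not the pandas implementation.

```python
from datetime import datetime, timedelta


class ToyDatetimeProperties:
    def __init__(self, values):
        self._values = values

    @property
    def hour(self):
        return [v.hour for v in self._values]


class ToyTimedeltaProperties:
    def __init__(self, values):
        self._values = values

    @property
    def days(self):
        return [v.days for v in self._values]


def make_dt_accessor(values):
    """Pick the accessor type eagerly, based on the values (toy dispatch)."""
    if values and all(isinstance(v, datetime) for v in values):
        return ToyDatetimeProperties(values)
    if values and all(isinstance(v, timedelta) for v in values):
        return ToyTimedeltaProperties(values)
    raise TypeError("Can only use .dt accessor with datetimelike values")


def public_names(obj):
    """Names that tab completion would offer (no leading underscore)."""
    return [n for n in dir(obj) if not n.startswith("_")]


dt_names = public_names(make_dt_accessor([datetime(2015, 1, 25, 9)]))
td_names = public_names(make_dt_accessor([timedelta(days=2)]))
```

Here `dt_names` contains only `hour` and `td_names` only `days`; a single lazily-checked accessor could not offer such type-specific completions.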

@shoyer
Member Author

shoyer commented Jan 25, 2015

OK, a few things we could do:

  1. Do checks to take str out of __dir__ for invalid types. This would eliminate the auto-complete issue, but I think s.str? would give the same message you showed above (object s.str not found).
  2. Return a standard StringMethods object, but add some sort of hook that checks that the type is valid before every method lookup. You could still auto-complete str methods, though, and this is more complex for .dt, because it can create several sub-types of accessors.
  3. Make s.str for invalid types some sort of "deferred error" object that raises TypeError when any attribute is accessed but with a copied docstring from StringMethods. I tossed together an implementation, which gives us functionality like the following:
In [15]: s = pd.Series([1])

In [16]: s.str.<tab>

In [17]: s.str
Out[17]: <pandas.core.series.InvalidStringMethods at 0x107a32fd0>

In [18]: s.str?
Type:        InvalidStringMethods
String form: <pandas.core.series.InvalidStringMethods object at 0x107a8e150>
File:        /Users/shoyer/dev/pandas/pandas/core/series.py
Docstring:
Vectorized string functions for Series. NAs stay NA unless handled
otherwise by a particular method. Patterned after Python's string methods,
with some inspiration from R's stringr package.

Examples
--------
>>> s.str.split('_')
>>> s.str.replace('_', '')

In [19]: s.str.cat
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-e75e3f77c883> in <module>()
----> 1 s.str.cat

/Users/shoyer/dev/pandas/pandas/core/series.py in __getattr__(self, name)
   2552
   2553     def __getattr__(self, name):
-> 2554         raise self._error
   2555
   2556

TypeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Unfortunately, it is not possible (AFAICT) to make an object on which repr raises a TypeError but for which __doc__ is well defined.

I'm -0 on these options. They add complexity and I don't think they're that much more usable -- if s.str? says not found, the first thing I'm going to try to do is see what s.str is, which will raise the TypeError. I also don't think there are that many who search through the Series namespace for methods -- there are simply too many methods/properties for that to be very useable.
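For reference, the "deferred error" idea from option 3 can be sketched in a few lines. This is a toy reconstruction based only on the traceback shown above (`raise self._error` inside `__getattr__`); the class names are hypothetical and the rest is an assumption.

```python
class ToyStringMethods:
    """Vectorized string functions for Series (toy stand-in)."""


class InvalidToyStringMethods:
    # Copy the docstring so help on the object still shows something useful.
    __doc__ = ToyStringMethods.__doc__

    def __init__(self, error):
        self._error = error

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for any accessor
        # method such as .cat or .split; _error itself is found normally.
        raise self._error


invalid = InvalidToyStringMethods(
    TypeError("Can only use .str accessor with string values, "
              "which use np.object_ dtype in pandas")
)
doc = invalid.__doc__        # docstring is available
text = repr(invalid)         # repr works: special methods bypass __getattr__
```

One side effect worth noting: `hasattr(invalid, "anything")` also raises TypeError, since `hasattr` only swallows AttributeError. That is part of the extra complexity weighed above.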

@jankatins
Contributor

can't these accessors be added after object creation in __init__()? So 's.str' is really not there for Series of type int?

@shoyer
Member Author

shoyer commented Jan 25, 2015

@JanSchulz Yes, but not if we want pd.Series.str (on the type object) to be well defined.
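A small sketch of that limitation, using hypothetical toy classes: an attribute assigned in `__init__` exists only on instances that received it, so the class itself has nothing for documentation tools or `Type.attr.<tab>` to find.

```python
class ToyStringMethods:
    def __init__(self, values):
        self._values = values


class ToySeriesInit:
    """Toy class that attaches .str per instance, only for string data."""

    def __init__(self, values):
        self.values = list(values)
        if all(isinstance(v, str) for v in self.values):
            self.str = ToyStringMethods(self.values)


on_string_instance = hasattr(ToySeriesInit(["a", "b"]), "str")
on_int_instance = hasattr(ToySeriesInit([1, 2]), "str")
on_class = hasattr(ToySeriesInit, "str")
```

The integer instance indeed has no `.str`, but neither does the class, which is the trade-off against the descriptor approach.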

@jreback
Contributor

jreback commented Jan 25, 2015

does the behavior you want exist in the impl for .dt?

@jorisvandenbossche
Member

@jreback The current behaviour on master for .dt is the same as now implemented here for .str:

In [5]: s = pd.Series([1])

In [6]: s.dt
TypeError: Can only use .dt accessor with datetimelike values

In [7]: s.dt?
Object `s.dt` not found.

So my remark was whether this last one could be solved (by letting the TypeError occur only when a method is really called on s.dt, and not when just accessing dt, so that s.dt? still returns the docstring).

But @shoyer, I fully agree that the possibilities you mention just add complexity for only a small gain. So I agree we shouldn't add them. I think the current behaviour is OK (in any case s.dt/str does return useful feedback to the user).

@jorisvandenbossche
Member

There are still some warnings with the doc building. I added AccessorMethod and AccessorAttribute documenters, but will probably have to do something similar for the accessor itself (Series.str and Series.dt), as these show up somewhere in the docs and are generating the warnings (possibly in the pandas.Series api page, as this page automatically lists all methods and attributes).

But these are only warnings (the rest of the docs build fine), and on Travis the API docs aren't even built, so it's OK to merge this if it is ready for you.
And I can look into the doc issue later this week (probably no time first two days)

@jreback
Contributor

jreback commented Jan 25, 2015

@jorisvandenbossche

ok, if s.dt? doesn't work no biggie....

ok by me

@jreback
Contributor

jreback commented Jan 25, 2015

@shoyer we have 2 PRs after this (that are about string methods), so ping when you merge

shoyer added a commit that referenced this pull request Jan 25, 2015
ENH/DOC: reimplement Series delegates/accessors using descriptors
@shoyer shoyer merged commit 327340b into pandas-dev:master Jan 25, 2015
@shoyer
Member Author

shoyer commented Jan 25, 2015

@jreback Merged!

Development

Successfully merging this pull request may close these issues.

str.contains - returns series of zeroes instead of series of bools when all values are NaNs.
4 participants