Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Why does pandas work around numpy limitations with custom dtypes instead of fixing them upstream? #8350

Closed
shoyer opened this issue Sep 22, 2014 · 17 comments

Comments

@shoyer
Copy link
Member

shoyer commented Sep 22, 2014

I was thinking particularly of cases like datetime64, Categorical and GeoSeries (from geopandas).

I recently posted on the numpy discussion mailing to try to get a sense of what solutions exist for writing custom dtypes without writing C. Unfortunately, it appears there's not much hope!
http://mail.scipy.org/pipermail/numpy-discussion/2014-September/071231.html

@njsmith suggested that I really should ask pandas developers to chime in to find out why they choose to work around numpy's limitations rather than enhance it. I would love it if someone who understands the "why" for the choices pandas made could add their perspective to that thread.

@jreback @JanSchulz any thoughts to add? or did a sum it up well enough with "nobody wants to write C"?

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

@shoyer I'll give some comments, can copy to the list in a bit. Hopefully not TL;DR!

Their are 3 'dtype' likes that exist in pandas that could in theory mostly be migrated back to numpy. These currently exist as the .values in-other-words the object to which pandas defers data storage and computation for some/most of operations.

  1. SparseArray: This is the basis for SparseSeries. It is ndarray-like (its actually a ndarray-sub-class) and optimized for the 1-d case. My guess is that @wesm created this because it a) didn't exist in numpy, and b) didn't want scipy as a dependency.

  2. datetime support: This is not a target dtype per se, but really a reimplementation over the top of datetime64[ns], with the associated scalar Timestamp which is a proper sub-class of datetime.datetime. I believe @wesm created this because numpy datetime support was (and still is to some extent) just completely broken. It doesn't support proper timezones. And the scalar type (np.datetime64) is not extensible at all (e.g. so have not easy to have custom printing, or parsing).

  3. pd.Categorical: This was another class wesm wrote several years ago. It is actually could be a numpy sub-class, though its a bit awkward as its really a numpy-like sub-class that contains 2 ndarray-like arrays, and is more appropriately implemented as a container of multiple-ndarrays.

So when we added support for Categoricals recently, why didn't we say try to push a categorical dtype? I think their are several reasons, in no particular order:

  • pd.Categorical is really a container of multiple ndarrays, and is ndarray-like. Further its API is somewhat constrained. It was simpler to make a python container class rather than try to sub-class ndarray and basically override / throw out many methods (as a lot of computation methods simply don't make sense between 2 categoricals). You can make a case that this *should not * be in numpy for this reason.
  • The changes in pandas for the 3 cases outlined above, were mostly on how to integrate these with the top-level containers (Series/DataFrame), rather than actually writing / re-writing a new dtype for a ndarray class. We always try to reuse, so we just try to extend the ndarray-like rather than create a new one from scratch.
  • Getting for example a Categorical dtype into numpy prob would take a pretty long cycle time. I think you need a champion for new features to really push them. It hasn't happened with datetime and that's been a while (of course its possible that pandas diverted some of this need)
  • API design: I think this is a big issue actually. When I added Categorical container support, I didn't want to change the API of Categorical much (and it pretty much worked out that way, mainly adding to it). So, say we took the path of assuming that numpy would have a nice categorical data dtype. We would almost certainly have to wrap it in something to provided needed functionaility that would necessarily be missing in an initial version. (of course eventually that may not be necessary).
  • So the 'nobody wants to write in C' argument is true for datetimes, but not for SparseArray/Categorical. In fact much of that code is just calling out to numpy (though some cython code too).
  • from a performance perspective, numpy needs a really good hashtable in order to support proper factorizing, which @wesm co-opted klib to do (see this thread here

So I know I am repeating myself, but it comes down to this. The API/interface of the delegated methods needs to be defined. For ndarrays it is long established and well-known. So easy to gear pandas to that. However with a newer type that is not the case, so pandas can easily decide, hey this is the most correct behavior, let's do it this way, nothing to break, no back compat needed.

@ischwabacher
Copy link
Contributor

I'm so glad you asked this question; I've been wondering about it, too.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

@ischwabacher does my answer shed some light? anything still not clear?

@ischwabacher
Copy link
Contributor

It surely does. At one point I did go poking around in numpy to see if I could figure out where one would start to update datetime64, but it got intimidating very quickly. :/

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

FYI: Here are some of the numpy issue / proposals for datetime:

numpy/numpy#3290

http://numpy-tst.readthedocs.org/en/latest/neps/datetime-proposal3.html

http://numpy-discussion.10968.n7.nabble.com/timezones-and-datetime64-td33407.html (I seem to remember a more recent 2014 discussion though)

@cpcloud
Copy link
Member

cpcloud commented Sep 22, 2014

Additionally, this all has to work with missing data. numpy support for dtyped missing data values doesn't seem like it'll happen anytime soon.

@jtratner
Copy link
Contributor

@cpcloud yeah, I think that's a really key additional point: good handling of missing data. Plus, the need to provide a single API across multiple versions of numpy leads to some additional hacking around.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

@cpcloud maybe you'd like to send a follow-up to the numpy list w.r.t. missing data support et all.

@cpcloud
Copy link
Member

cpcloud commented Sep 22, 2014

@jreback I'm not sure that would be very helpful. There's already been a pretty epic discussion: http://www.numpy.org/NA-overview.html. Not sure if there's much more to add.

current status: https://github.com/njsmith/numpy/wiki/NA-discussion-status
ML discussion: http://thread.gmane.org/gmane.comp.python.numeric.general/46704/focus=46739

@jreback
Copy link
Contributor

jreback commented Sep 22, 2014

@cpcloud good point :)

@shoyer
Copy link
Member Author

shoyer commented Sep 22, 2014

All we needed for pandas was NaN for integers and datetimes... alas!

On Mon, Sep 22, 2014 at 12:03 PM, jreback notifications@github.com wrote:

@cpcloud https://github.com/cpcloud good point :)


Reply to this email directly or view it on GitHub
#8350 (comment).

@njsmith
Copy link

njsmith commented Sep 22, 2014

Maybe I should add-- The reason this question came up is, the only reason
numpy doesn't ship with lovely working datetime64/categorical/missing data
support out of the box, is that we have very limited personpower and
no-one's gotten around to implementing them. Yet obviously you folks do
care enough to invest the effort in implementing them :-). So maybe
possibly you have some insights into why you decided to take what seems
like a potentially more awkward route, and/or ways that numpy could
encourage upstream contributions instead of just downstream workarounds.

On Mon, Sep 22, 2014 at 2:43 PM, Phillip Cloud notifications@github.com
wrote:

Additionally, this all has to work with missing data. numpy support for
dtyped missing data values doesn't seem like it'll happen anytime soon.


Reply to this email directly or view it on GitHub
#8350 (comment).

Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org

@cpcloud
Copy link
Member

cpcloud commented Sep 22, 2014

@njsmith thanks for bringing all this up

here's my 2 cents.

we have very limited personpower and no-one's gotten around to implementing them

same here. we have around i would say 3-7 active devs at any given time, with @jreback doing a large majority of the work. i personally have dropped off quite a bit ever since starting work at continuum.

So maybe possibly you have some insights into why you decided to take what seems like a potentially more awkward route

  1. i think the choice to use nan to indicate missing was made awhile ago to avoid tying pandas to MaskedArray but wanting to move forward with implementing something rather than waiting for NA support.
  2. pandas has a lot of momentum
  3. nan is "good enough" in many cases

I agree that it's more awkward and i do understand the need to differentiate a nan calculation from an NA calculation (for example), however i think back compat + nan being satisfactory in many cases has kept anyone from being interested enough to finish up the work done in with_maskna.

that said, @njsmith is there a TODO list for that branch anywhere? there is a bloomberg hackathon this weekend and maybe folks could make some headway there.

plenty of interesting low-level stuff there, i'm sure a ton to learn about numpy internals since i think the impl touches a lot of different parts of the code base

i don't have much to say re getting people to contribute. i personally haven't contributed anything huge (i think maybe a bug report or two) but not for any particular reason other than lack of time.

@shoyer
Copy link
Member Author

shoyer commented Sep 23, 2014

Thanks everyone for adding your perspective -- this has been very enlightening for me.

@njsmith After reading your wiki page and NEPs, I am still not entirely sure I understand the resolution of the NAs in numpy discussion. What is the blocker for your miniNEP2 proposal to implement bit-pattern NAs via optional dtypes? Just the lack of an implementation? Or are there still arguments about whether it is even a good idea? Of course, I think your proposal is/was a great idea :).

From a practical perspective, Categorical does seem like a case where -- hypothetically -- it could be done as a custom numpy dtype (if extending dtypes were easier), but probably implemented in pandas to avoid waiting on the numpy release cycle and to enable using klib. In theory, I don't think API design would be a blocker if we wrote the numpy dtype specifically for pandas.

@jankatins
Copy link
Contributor

Just to chime in on the "nobody wants to write C": I tried to see how Categorical could be implemented in numpy but I don't know C, so finding a starting point for a "monkey see, monkey do" kind of (a start of) implementation is hard.

On the other hand: I also didn't know where to start such an implementation in pandas and without @jreback start I wouldn't have found it either.

@ischwabacher
Copy link
Contributor

Speaking of fixing things upstream, we should probably weigh in on PEP 431 -- Time zone support improvements. It seems like it would be bad if the stdlib finally implemented decent time zone support and it were incompatible with ours.

@jreback
Copy link
Contributor

jreback commented Sep 23, 2014

@ischwabacher you and @rockg are the DST sticklers! go for it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants