-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Series constructor converts dicts with tuples in keys to MultiIndex #4805
Conversation
Previous discussion of:
|
@@ -499,7 +499,11 @@ def _init_dict(self, data, index, columns, dtype=None): | |||
keys = list(data.keys()) | |||
if not isinstance(data, OrderedDict): | |||
keys = _try_sort(list(data.keys())) | |||
columns = data_names = Index(keys) | |||
if keys and isinstance(keys[0], tuple): | |||
index_func = MultiIndex.from_tuples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know whether it's worth considering the edge case where only the first element is a tuple, and just doing Index.
You have some failing tests, because this isn't valid in python 3 (IIRC keys() is a set rather than a list):
You need to add some tests to demonstrate this behaviour for both Series and DataFrame. I guess for consistency, this should also be the case if you are passing a list of tuples to index argument...? Are there any other cases where this would also make sense? |
Hey @hayd, Thanks for your response. I was waiting for a response before continuing with this pr, hence the general incompleteness of the pr. I think your idea to put this in the Index constructor is great, because then we don't need duplicate code in the Series or DataFrames. However, I've been trying to make this change, and it's actually quite complicated. First, I'm getting multiple seg faults (one of which I fixed and this fix might resolve other bugs in pandas). Second, because the Index constructor converts a list of tuples into a MultiIndex, it breaks a few tests. It's fairly complicated to resolve, especially since I don't know an efficient way for figuring out causes for seg faults. Hopefully, I'll update this pr later this weekend with some progress. |
@adgaudio if you're experiencing segfaults, you might want to check that you aren't doing the following (with Cython code):
For example, if you pass a tuple to a function that expects type |
@hayd if I had to hazard a guess, it's probably because |
Thanks for that tips @jtratner and @hayd! I just figured out some interesting details about GH #4187 that could let us propose a workaround fix to that one... When the Series constructor is called, it calls _sanitize_array, which eventually calls infer_dtype and causes the seg fault. ...So, if you make the cpython call to "infer_dtype(...)" (see inference.pyx) with a MultiIndex, you can cause the seg fault. If we're interested in avoiding workarounds, I think the core question we need to ask is just what @jtratner basically asked: Is it possible to make a MultiIndex compatible with Cython? If we can't make the MultiIndex (or pa.Array) cython compatible, it seems pretty clear that we will have a million little work arounds for these damn MultiIndexes. The work-arounds I'd propose are to add an "if isinstance(X, MultiIndex): data = data.values" to _sanitize_array and a variation of that to _homogenize() (in frame.py) |
Its faulting somewhere else....this is safe to call (as it accepts an object type)
|
@adgaudio show the traceback where it faults (the whole one) |
Sorry, I realized that I mis-spoke and updated my comment a few mins ago. This is the error I get. No traceback:
|
I'm getting this segfault as well. I don't even get an error message :( |
Do this.....the problem is an index/mi is a 'special' that overrides some ndarray methods.... This is called in _possibly_convert_datetime, so that will do nothing (which is correct)
|
also...pls check perf, e.g. test_perf.sh for any change you make....it is important that Series construction doesn't change the common cases too much as an aside...I don't think doing an |
@adgaudio @jreback yeah, it's tricky. MultiIndex is only nominally an ndarray, because all of its features are actually reimplemented in Python so if Cython does any optimizations for ndarray, it'll mess up around this. Btw, for cython, better to do this: values = getattr(values, 'values', values) than the hasattr check, because it can optimize it better. |
good 2 know about |
Hmm - I think the topic of this pr has strayed a little from its original goal (which is primarily my fault!). The original goal was to make pandas.Index(list_of_tuples) return a MultiIndex. Having spent several hours in the past couple days looking at pandas internals, it seems pretty clear to me now that the Index should not return a MultiIndex. For instance, methods like MultiIndex._tuple_index rely on the fact that the Index won't return a MI... And as a further argument, if an Index constructor was allowed to return a MultiIndex, then shouldn't a DataFrame be able to return a Panel? That doesn't seem like functionality pandas would support. On the other hand, we discussed some useful things that are probably worth implementing:
Is everyone okay if I close this PR and maybe make a new one to address the seg faults? In sum, I changed my mind and don't think an Index should return a MultiIndex. |
@adgaudio if you want that makes sense. That said, I don't think it's unreasonable to have a dict input to constructor return something with a MultiIndex. |
Ok - I can certainly do that! In that case, I may just try to merge seg fault stuff into here. Will be interesting to see whether there's a significant perf hit with the changes this brings. ... will be in prog here for a few days |
Ok...this is ready for review. Sorry for the delay In summary, MultiIndexes are automatically created for index and columns when passing a dict to Series and DataFrames. I also ran a perf test and will be posting those results in a follow-up comment. I described() the last column of numbers, which represent the ratio of master to "my changes rebased against master". If I understand this correctly, this feature makes pandas slower by 1.005 times on average. count 171.000000 |
|
can you add releaset notes? does it close all 3 of those listed issues? have tests from all 3? pls add the issues it closes to the very top of the PR, with something like |
can you perf tests the bottom 2 (period_setitem and series_constructor) with a higher number of ncalls (so it will avg them) |
If it turns out to be slow, might make sense to add a keyword argument to disable/enable this processing. (similar to tupleize_cols for csv, etc) |
@@ -481,7 +482,7 @@ def _init_dict(self, data, index, columns, dtype=None): | |||
data = dict((k, v) for k, v in compat.iteritems(data) if k in columns) | |||
|
|||
if index is None: | |||
index = extract_index(list(data.values())) | |||
index = extract_index(data.values()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you need to keep the list
constructor here because dict.values()
is a view in Python 3, unless extract_index
does the conversion for you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extract_index just iterates over the values and appends its elements to some list. So isn't the list(...) call unnecessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's passing on the Python 3 builds then yes :)
It's not so crazy that passing in tuples to index could lead to a mi... (You can also get a DatetimeIndex when creating an Index and it's very different from the Panel vs DataFrame!*)
I think this should be consistent however you construct... imo this is a good opportunity to correct the above behaviour, where you get a different index depending! Should be the same via Index, Series(index=) or passing dict... atm it's not +1 for adding some keyword to disable (?) this (like tuplize_col..) |
I think it would be a great idea to have the Index constructure also return a MultiIndex, but only if a kwarg like "infer_multiindex=True" was passed to the constructor. By default, the "infer_multiindex=False" because if it was True, it would cause seg faults and break several tests. I might be able to tackle this in this PR. If not, perhaps another pr... Logically, something also seems funky about letting an Index return a MultiIndex. For instance, should a DataFrame be able to return a Panel? I think you hinted at this in your comment. |
@jtratner - thanks for your comments. Here are the perf results with ncalls=5. I also just rebased against master. The average ratio is now actually a tiny bit worse (was 1.004 and now is 1.01), but the basic distribution of ratios looks more or less like the picture I posted above. Are these performance hits alright for this PR, or should I add a kwarg to the DataFrame and Series constructors that disables it by default? count 190.000000
|
s = Series([1, 2, 3], index=idx) | ||
s | ||
|
||
s.index.identical(idx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take this out; identical is public but will confuse users; they don't need to know about it
ready for review/merge again. just rebased master. |
|
||
.. ipython:: python | ||
|
||
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], | ||
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
take this space out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done - btw why are there 3 spaces in these sections instead of 4? Is this some kind of documentation standard?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's not a standard. It's just that it doesn't matter, and there is no real standard. In reStructuredText, you will often see that it is indented like the start of the directive name (here ipython
), so then you have 3 spaces (but even in their documentation they are not consequent with that). But I personally like to use 4 spaces as this is much more convenient to type with tab.
@adgaudio can you rebase ....i'll take another look |
- Series and DataFrame constructor autodetect when index/columns should be MultiIndex - prevents some seg faults in calls to cython funcs - add tupleize_cols kwarg and update tests to git PR comments - support name= xor names= in Index(tuples, ....) constructor - docs
merged via f8c566c...9119641 (also added a bit of docs to dsintro.rst) thanks @adgaudio ! nice addition! |
@adgaudio side note, this is actually a dict of tuples, right? (you refer to it as a list of tuples) (I guess its technically a list of tuples too), or are the docs I added not clear? |
@adgaudio There also a couple of doc failures due to your changes, see here https://travis-ci.org/pydata/pandas/jobs/22652839#L1855 and http://pandas-docs.github.io/pandas-docs-travis/indexing.html#creating-a-multiindex-hierarchical-index-object Can you fix them? You can also check the log of the docs build on travis (build 3, fully below), or build them first locally (see docs on this here: http://pandas.pydata.org/pandas-docs/stable/contributing.html#contributing-to-the-documentation) |
fixed partially in c1440e9 |
This is great! Thank you for merging this work :) @jreback Thanks for the partial fix to documentation. Also, you are correct that the PR works for both dicts of tuples and lists of tuples. @jorisvandenbossche I should be able to get to these soon in another PR. I guess I was wrong to assume passing tests meant the documentation wouldn't have errors. Why do Travis builds still pass if I introduce documentation errors? Can/should we change that behavior? |
@adgaudio I think the docs are OK now. Yes, indeed, getting green light from travis does not say anything about the doc build, only about the test suite that is passing. The docs are built on travis so you can see the log of that, but it does not notify when there were warnings (but we were thinking to maybe try to change this, see #3800) |
closes #3323
related #4187
related #1848
Hi,
I'd love for Series (and DataFrame) initialized with a dict to automatically create the MultiIndex. See below example.
Desired Result:
Current result:
Is this something I can go ahead and update tests for? It seems like we should have it.
Also, what are your thoughts about adding this to the DataFrame constructor as well?
Thanks,
Alex