ENH: Series constructor converts dicts with tuples in keys to MultiIndex #4805

adgaudio · 2013-09-10T18:58:31Z

closes #3323
related #4187
related #1848

Hi,

I'd love for Series (and DataFrame) initialized with a dict to automatically create the MultiIndex. See below example.

Desired Result:

In [1]: pandas.Series({('a', 'b'): 1, ('a', 'a'): 0, ('a', 'c'): 2, ('b', 'a'): 3, ('b', 'b'): 4})
Out[1]:
a  a    0
   b    1
   c    2
b  a    3
   b    4
dtype: int64

In [2]:

Current result:

In [1]: pandas.Series({('a', 'b'): 1, ('a', 'a'): 0, ('a', 'c'): 2, ('b', 'a'): 3, ('b', 'b'): 4})
Out[1]:
(a, a)    0
(a, b)    1
(a, c)    2
(b, a)    3
(b, b)    4
dtype: int64

In [2]:

Is this something I can go ahead and update tests for? It seems like we should have it.

Also, what are your thoughts about adding this to the DataFrame constructor as well?

Thanks,
Alex

adgaudio · 2013-09-10T19:03:57Z

Previous discussion of:

Series + MultiIndex:
- FIX disallow MultiIndex in Series constructor GH4187 #4190
- ENH: Series from MultiIndex causes segfault #4187
DataFrames + MultiIndex:
- ENH/API: Allow dicts with tuple keys in DataFrame constructor #3323

hayd · 2013-09-12T10:47:25Z

pandas/core/frame.py

@@ -499,7 +499,11 @@ def _init_dict(self, data, index, columns, dtype=None):
            keys = list(data.keys())
            if not isinstance(data, OrderedDict):
                keys = _try_sort(list(data.keys()))
-            columns = data_names = Index(keys)
+            if keys and isinstance(keys[0], tuple):
+                index_func = MultiIndex.from_tuples


I don't know whether it's worth considering the edge case where only the first element is a tuple, and just doing Index.

hayd · 2013-09-12T10:52:20Z

You have some failing tests, because this isn't valid in python 3 (IIRC keys() is a set rather than a list):

data.keys()[0], tuple):
TypeError: 'dict_keys' object does not support indexing

You need to add some tests to demonstrate this behaviour for both Series and DataFrame.

I guess for consistency, this should also be the case if you are passing a list of tuples to index argument...? Are there any other cases where this would also make sense?

adgaudio · 2013-09-14T06:17:12Z

Hey @hayd,

Thanks for your response. I was waiting for a response before continuing with this pr, hence the general incompleteness of the pr.

I think your idea to put this in the Index constructor is great, because then we don't need duplicate code in the Series or DataFrames. However, I've been trying to make this change, and it's actually quite complicated. First, I'm getting multiple seg faults (one of which I fixed and this fix might resolve other bugs in pandas). Second, because the Index constructor converts a list of tuples into a MultiIndex, it breaks a few tests.

It's fairly complicated to resolve, especially since I don't know an efficient way for figuring out causes for seg faults. Hopefully, I'll update this pr later this weekend with some progress.

jtratner · 2013-09-14T14:18:06Z

@adgaudio if you're experiencing segfaults, you might want to check that you aren't doing the following (with Cython code):

passing incorrectly typed objects to the code (Cython functions need to have exactly the correct type)
passing objects of the wrong length (i.e., with code with the boundscheck false decorator)

For example, if you pass a tuple to a function that expects type list or a defaultdict to a function expecting dict, it could fail (though usually it'll explicitly tell you).

hayd · 2013-09-14T17:33:03Z

@jtratner I've seen some segfaulting with the MultiIndexes before (example was Series of a mi #4187, now NotImplemented), I didn't look into the why it was happening...

jtratner · 2013-09-14T17:41:56Z

@hayd if I had to hazard a guess, it's probably because MultiIndex is basically a fake ndarray, as opposed to Index which really is an ndarray. So you can pass a MultiIndex to Cython and it will try to do things with it, but all of them will fail, since the internal datastructure does't really have anything to do with ndarray.

adgaudio · 2013-09-14T21:30:28Z

Thanks for that tips @jtratner and @hayd! I just figured out some interesting details about GH #4187 that could let us propose a workaround fix to that one... When the Series constructor is called, it calls _sanitize_array, which eventually calls infer_dtype and causes the seg fault. ...So, if you make the cpython call to "infer_dtype(...)" (see inference.pyx) with a MultiIndex, you can cause the seg fault.

If we're interested in avoiding workarounds, I think the core question we need to ask is just what @jtratner basically asked: Is it possible to make a MultiIndex compatible with Cython?

If we can't make the MultiIndex (or pa.Array) cython compatible, it seems pretty clear that we will have a million little work arounds for these damn MultiIndexes. The work-arounds I'd propose are to add an "if isinstance(X, MultiIndex): data = data.values" to _sanitize_array and a variation of that to _homogenize() (in frame.py)

jreback · 2013-09-14T21:37:09Z

Its faulting somewhere else....this is safe to call (as it accepts an object type)

In [1]: pd.lib.infer_dtype(np.array([1,2,3]))
Out[1]: 'integer'

In [2]: pd.lib.infer_dtype(np.array([1,2,3],dtype=object))
Out[2]: 'integer'

In [3]: pd.lib.infer_dtype(np.array([1,2,3.0],dtype=object))
Out[3]: 'mixed-integer-float'

In [4]: pd.lib.infer_dtype(np.array(['a','b','c'],dtype=object))
Out[4]: 'string'

jreback · 2013-09-14T21:37:29Z

@adgaudio show the traceback where it faults (the whole one)

adgaudio · 2013-09-14T21:41:25Z

Sorry, I realized that I mis-spoke and updated my comment a few mins ago. This is the error I get. No traceback:

In [13]: a = pandas.MultiIndex.from_tuples(zip(range(10), range(10)))

In [14]: pandas.lib.infer_dtype(a)
zsh: segmentation fault (core dumped)  ipython

-- OR --

In [2]: pandas.lib.fast_multiget({1: 2}, a)
zsh: segmentation fault (core dumped)  ipython

cpcloud · 2013-09-14T21:42:49Z

I'm getting this segfault as well. I don't even get an error message :(

jreback · 2013-09-14T21:51:23Z

Do this.....the problem is an index/mi is a 'special' that overrides some ndarray methods....
I don't think this will break anything else.

This is called in _possibly_convert_datetime, so that will do nothing (which is correct)

diff --git a/pandas/src/inference.pyx b/pandas/src/inference.pyx
index e0bbc1a..a37de47 100644
--- a/pandas/src/inference.pyx
+++ b/pandas/src/inference.pyx
@@ -41,6 +41,9 @@ def infer_dtype(object _values):
             _values = list(_values)
         values = list_to_object_array(_values)

+    if hasattr(values,'values'):
+        values = values.values
+
     val_kind = values.dtype.type
     if val_kind in _TYPE_MAP:
         return _TYPE_MAP[val_kind]

jreback · 2013-09-14T21:53:40Z

also...pls check perf, e.g. test_perf.sh for any change you make....it is important that Series construction doesn't change the common cases too much

as an aside...I don't think doing an isinstance(value, Index) in _sanitize_array is a big deal (after all they ARE (currrently) ndarrays)

jtratner · 2013-09-15T03:03:21Z

@adgaudio @jreback yeah, it's tricky. MultiIndex is only nominally an ndarray, because all of its features are actually reimplemented in Python so if Cython does any optimizations for ndarray, it'll mess up around this.

Btw, for cython, better to do this:

values = getattr(values, 'values', values)

than the hasattr check, because it can optimize it better.

cpcloud · 2013-09-15T03:40:26Z

good 2 know about hasattr vs. getattr.

adgaudio · 2013-09-15T06:17:13Z

Hmm - I think the topic of this pr has strayed a little from its original goal (which is primarily my fault!). The original goal was to make pandas.Index(list_of_tuples) return a MultiIndex. Having spent several hours in the past couple days looking at pandas internals, it seems pretty clear to me now that the Index should not return a MultiIndex. For instance, methods like MultiIndex._tuple_index rely on the fact that the Index won't return a MI... And as a further argument, if an Index constructor was allowed to return a MultiIndex, then shouldn't a DataFrame be able to return a Panel? That doesn't seem like functionality pandas would support.

On the other hand, we discussed some useful things that are probably worth implementing:

There may be some benefit to adding additional safeguards against seg faults.
- in fast_multiget(...) and infer_dtype(...)
Series(some_multi_index) should work
- (I think) more to the point, np.array(some_multi_index) should not return an empty list
- this also would depend on @jreback's seg fault fix 3 comments up.

Is everyone okay if I close this PR and maybe make a new one to address the seg faults? In sum, I changed my mind and don't think an Index should return a MultiIndex.

jtratner · 2013-09-15T06:20:26Z

@adgaudio if you want that makes sense. That said, I don't think it's unreasonable to have a dict input to constructor return something with a MultiIndex.

adgaudio · 2013-09-15T19:03:17Z

Ok - I can certainly do that! In that case, I may just try to merge seg fault stuff into here. Will be interesting to see whether there's a significant perf hit with the changes this brings.

... will be in prog here for a few days

adgaudio · 2013-09-24T06:24:57Z

Ok...this is ready for review. Sorry for the delay

In summary, MultiIndexes are automatically created for index and columns when passing a dict to Series and DataFrames.

I also ran a perf test and will be posting those results in a follow-up comment. I described() the last column of numbers, which represent the ratio of master to "my changes rebased against master". If I understand this correctly, this feature makes pandas slower by 1.005 times on average.

count 171.000000
mean 1.004673
std 0.076243
min 0.779600
25% 0.977800
50% 0.993800
75% 1.026200
max 1.498600

adgaudio · 2013-09-24T06:25:04Z

Invoked with :
--ncalls: 1
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_reindex_both_axes                      |  15.2452 |  32.3210 |   0.4717 |
series_value_counts_strings                  |   4.0641 |   5.2130 |   0.7796 |
frame_add                                    |   3.9530 |   4.9031 |   0.8062 |
timeseries_asof_single                       |   0.0291 |   0.0360 |   0.8079 |
indexing_dataframe_boolean                   |  49.4530 |  57.8468 |   0.8549 |
indexing_dataframe_boolean_rows              |   0.2789 |   0.3099 |   0.9000 |
join_dataframe_index_single_key_bigger       |   5.2481 |   5.7790 |   0.9081 |
write_csv_standard                           |  30.0810 |  33.0341 |   0.9106 |
stats_rolling_mean                           |   1.0490 |   1.1470 |   0.9146 |
timeseries_add_irregular                     |  15.7468 |  17.2162 |   0.9147 |
reindex_frame_level_reindex                  |   0.5591 |   0.6101 |   0.9164 |
reshape_pivot_time_series                    | 129.9939 | 141.0310 |   0.9217 |
mask_bools                                   |  24.4670 |  26.4800 |   0.9240 |
frame_to_string_floats                       |  30.3071 |  32.7630 |   0.9250 |
frame_repr_wide                              |   0.7708 |   0.8330 |   0.9253 |
series_string_vector_slice                   | 146.1499 | 156.5220 |   0.9337 |
frame_getitem_single_column                  |  17.7062 |  18.9250 |   0.9356 |
stats_rank_average                           |  23.2301 |  24.7140 |   0.9400 |
append_frame_single_mixed                    |   0.5651 |   0.5999 |   0.9420 |
sort_level_one                               |   3.7041 |   3.9291 |   0.9427 |
mask_floats                                  |  12.7940 |  13.5541 |   0.9439 |
melt_dataframe                               |   1.6730 |   1.7691 |   0.9457 |
groupby_multi_series_op                      |  12.2011 |  12.8889 |   0.9466 |
stat_ops_frame_mean_int_axis_1               |   5.3780 |   5.6581 |   0.9505 |
frame_fancy_lookup                           |   2.7161 |   2.8570 |   0.9507 |
reshape_unstack_simple                       |   2.7251 |   2.8622 |   0.9521 |
frame_insert_500_columns_end                 |  79.2389 |  83.1900 |   0.9525 |
frame_drop_duplicates_na                     |  13.4761 |  14.1251 |   0.9541 |
frame_drop_duplicates                        |  13.3901 |  14.0190 |   0.9551 |
index_datetime_intersection                  |   8.4031 |   8.7969 |   0.9552 |
stat_ops_frame_sum_int_axis_0                |   4.8449 |   5.0640 |   0.9567 |
frame_to_csv2                                |  81.8870 |  85.3350 |   0.9596 |
stats_rank2d_axis1_average                   |  10.9470 |  11.3881 |   0.9613 |
stat_ops_frame_sum_float_axis_1              |   4.5249 |   4.6859 |   0.9657 |
frame_fillna_inplace                         |  10.3819 |  10.7038 |   0.9699 |
indexing_panel_subset                        |   0.3691 |   0.3800 |   0.9711 |
frame_to_csv                                 |  91.5911 |  94.3000 |   0.9713 |
frame_ctor_list_of_dict                      |  64.0192 |  65.8691 |   0.9719 |
merge_2intkey_sort                           |  32.4190 |  33.3450 |   0.9722 |
frame_multi_and                              |  69.8371 |  71.7850 |   0.9729 |
stat_ops_frame_sum_int_axis_1                |   4.9090 |   5.0452 |   0.9730 |
frame_reindex_both_axes_ix                   |  32.0160 |  32.8770 |   0.9738 |
series_drop_duplicates_int                   |   0.5660 |   0.5791 |   0.9774 |
timeseries_asof                              |   6.8550 |   7.0112 |   0.9777 |
join_dataframe_index_multi                   |  16.8700 |  17.2520 |   0.9779 |
join_dataframe_index_single_key_bigger_sort  |  13.3920 |  13.6938 |   0.9780 |
frame_loc_dups                               |   0.6258 |   0.6399 |   0.9780 |
frame_constructor_ndarray                    |   0.0429 |   0.0439 |   0.9783 |
reindex_fillna_backfill                      |   0.5271 |   0.5388 |   0.9783 |
series_align_left_monotonic                  |  10.3209 |  10.5450 |   0.9787 |
groupby_multi_cython                         |  13.3641 |  13.6521 |   0.9789 |
index_int64_union                            |  64.3780 |  65.7530 |   0.9791 |
stat_ops_frame_sum_float_axis_0              |   4.4549 |   4.5469 |   0.9798 |
dti_reset_index_tz                           |  10.0758 |  10.2811 |   0.9800 |
read_csv_vb                                  |  16.2349 |  16.5591 |   0.9804 |
stat_ops_series_std                          |   0.8879 |   0.9048 |   0.9813 |
timeseries_asof_nan                          |   6.3412 |   6.4590 |   0.9818 |
stat_ops_frame_mean_int_axis_0               |   4.8530 |   4.9391 |   0.9826 |
frame_to_csv_mixed                           | 191.1800 | 194.5429 |   0.9827 |
stat_ops_frame_mean_float_axis_0             |   4.4539 |   4.5290 |   0.9834 |
groupby_multi_different_numpy_functions      |  10.8140 |  10.9761 |   0.9852 |
frame_reindex_upcast                         |  10.7739 |  10.9279 |   0.9859 |
groupby_last                                 |   3.2680 |   3.3140 |   0.9861 |
stat_ops_level_series_sum                    |   2.3448 |   2.3770 |   0.9865 |
series_value_counts_int64                    |   1.8809 |   1.9040 |   0.9879 |
groupby_multi_python                         | 104.6109 | 105.8180 |   0.9886 |
merge_2intkey_nosort                         |  14.5009 |  14.6639 |   0.9889 |
frame_reindex_axis1                          | 454.3140 | 459.0740 |   0.9896 |
match_strings                                |   0.2739 |   0.2768 |   0.9897 |
groupby_series_simple_cython                 |   4.0159 |   4.0562 |   0.9901 |
series_align_int64_index                     |  23.7849 |  24.0159 |   0.9904 |
read_csv_thou_vb                             |  14.1041 |  14.2410 |   0.9904 |
frame_fancy_lookup_all                       |  12.5110 |  12.6290 |   0.9907 |
frame_mult                                   |   3.9458 |   3.9830 |   0.9907 |
lib_fast_zip                                 |   7.6091 |   7.6802 |   0.9907 |
stats_rank_average_int                       |  18.6820 |  18.8482 |   0.9912 |
groupby_pivot_table                          |  16.5930 |  16.7401 |   0.9912 |
timeseries_timestamp_downsample_mean         |   3.5810 |   3.6111 |   0.9917 |
frame_reindex_axis0                          |  77.9679 |  78.6190 |   0.9917 |
stats_corr_spearman                          |  70.6170 |  71.1949 |   0.9919 |
timeseries_infer_freq                        |   6.5250 |   6.5749 |   0.9924 |
timeseries_period_downsample_mean            |   5.3890 |   5.4300 |   0.9924 |
stat_ops_level_frame_sum                     |   2.9190 |   2.9399 |   0.9929 |
join_dataframe_integer_2key                  |   4.4181 |   4.4479 |   0.9933 |
series_align_irregular_string                |  50.0710 |  50.4081 |   0.9933 |
reindex_frame_level_align                    |   0.5832 |   0.5870 |   0.9935 |
stat_ops_level_frame_sum_multiple            |   7.2091 |   7.2539 |   0.9938 |
groupby_multi_size                           |  24.5860 |  24.7180 |   0.9947 |
stats_rank2d_axis0_average                   |  18.5640 |  18.6539 |   0.9952 |
sparse_series_to_frame                       | 117.1200 | 117.5661 |   0.9962 |
unstack_sparse_keyspace                      |   1.3909 |   1.3959 |   0.9964 |
groupby_first                                |   3.0730 |   3.0839 |   0.9964 |
frame_ctor_nested_dict                       |  57.8749 |  58.0280 |   0.9974 |
reindex_fillna_backfill_float32              |   0.4182 |   0.4191 |   0.9977 |
reshape_stack_simple                         |   1.4410 |   1.4429 |   0.9987 |
join_dataframe_integer_key                   |   1.4930 |   1.4949 |   0.9987 |
replace_replacena                            |   3.6631 |   3.6659 |   0.9992 |
groupby_frame_median                         |   5.4619 |   5.4641 |   0.9996 |
read_csv_standard                            |   9.2361 |   9.2380 |   0.9998 |
groupby_last_float32                         |   3.2320 |   3.2320 |   1.0000 |
series_constructor_ndarray                   |   0.0160 |   0.0160 |   1.0000 |
frame_fillna_many_columns_pad                |  12.1920 |  12.1920 |   1.0000 |
timeseries_large_lookup_value                |   0.0231 |   0.0231 |   1.0000 |
reindex_daterange_backfill                   |   2.4390 |   2.4371 |   1.0008 |
timeseries_sort_index                        |  18.2889 |  18.2478 |   1.0022 |
reindex_daterange_pad                        |   2.4340 |   2.4259 |   1.0033 |
frame_sort_index_by_columns                  |  31.2161 |  31.1110 |   1.0034 |
read_csv_comment2                            |  17.5040 |  17.4401 |   1.0037 |
groupby_indices                              |   6.3572 |   6.3312 |   1.0041 |
frame_boolean_row_select                     |   0.2131 |   0.2120 |   1.0056 |
timeseries_to_datetime_iso8601               |   3.8910 |   3.8612 |   1.0077 |
replace_fillna                               |   3.6750 |   3.6449 |   1.0082 |
lib_fast_zip_fillna                          |  10.5841 |  10.4971 |   1.0083 |
groupby_multi_different_functions            |  10.8781 |  10.7880 |   1.0084 |
datetimeindex_normalize                      |   2.2500 |   2.2280 |   1.0098 |
stat_ops_level_series_sum_multiple           |   6.6872 |   6.6178 |   1.0105 |
index_datetime_union                         |   8.6920 |   8.6010 |   1.0106 |
frame_get_dtype_counts                       |   0.0880 |   0.0870 |   1.0110 |
frame_iteritems                              |  25.3410 |  25.0630 |   1.0111 |
reindex_fillna_pad                           |   0.5260 |   0.5200 |   1.0115 |
frame_insert_100_columns_begin               |  15.4629 |  15.2860 |   1.0116 |
panel_from_dict_two_different_indexes        |  39.9530 |  39.4778 |   1.0120 |
groupby_first_float32                        |   3.0582 |   3.0160 |   1.0140 |
frame_getitem_single_column2                 |  18.0881 |  17.8120 |   1.0155 |
stat_ops_frame_mean_float_axis_1             |   5.0459 |   4.9639 |   1.0165 |
indexing_dataframe_boolean_rows_object       |   0.4680 |   0.4601 |   1.0171 |
groupby_sum_booleans                         |   0.7811 |   0.7651 |   1.0209 |
frame_drop_dup_inplace                       |   2.2190 |   2.1710 |   1.0221 |
read_parse_dates_iso8601                     |   1.1170 |   1.0900 |   1.0247 |
groupby_simple_compress_timing               |  29.3281 |  28.5370 |   1.0277 |
sparse_frame_constructor                     |   9.3942 |   9.1388 |   1.0279 |
sort_level_zero                              |   3.7801 |   3.6709 |   1.0297 |
panel_from_dict_all_different_indexes        |  53.7281 |  52.1729 |   1.0298 |
groupby_frame_singlekey_integer              |   2.1300 |   2.0669 |   1.0306 |
reindex_multiindex                           |   0.8409 |   0.8130 |   1.0343 |
append_frame_single_homogenous               |   0.2060 |   0.1991 |   1.0347 |
reindex_fillna_pad_float32                   |   0.4320 |   0.4170 |   1.0360 |
dataframe_reindex                            |   0.3462 |   0.3340 |   1.0364 |
groupby_transform                            | 117.1820 | 112.8821 |   1.0381 |
frame_get_numeric_data                       |   0.0780 |   0.0751 |   1.0381 |
groupby_apply_dict_return                    |  24.9550 |  23.9921 |   1.0401 |
index_int64_intersection                     |  23.8118 |  22.8329 |   1.0429 |
concat_series_axis1                          |  65.4199 |  62.6309 |   1.0445 |
read_table_multiple_date_baseline            |  75.4662 |  72.0320 |   1.0477 |
panel_from_dict_equiv_indexes                |  24.9131 |  23.7088 |   1.0508 |
timeseries_1min_5min_mean                    |   0.4830 |   0.4590 |   1.0525 |
dti_reset_index                              |   0.1781 |   0.1690 |   1.0536 |
panel_from_dict_same_index                   |  24.9889 |  23.6831 |   1.0551 |
datetimeindex_unique                         |   0.0939 |   0.0889 |   1.0563 |
read_table_multiple_date                     | 163.5280 | 153.4901 |   1.0654 |
frame_drop_dup_na_inplace                    |   2.2259 |   2.0850 |   1.0676 |
datetimeindex_add_offset                     |   0.1731 |   0.1619 |   1.0692 |
frame_iloc_dups                              |   0.2170 |   0.2019 |   1.0744 |
frame_iteritems_cached                       |   0.4621 |   0.4299 |   1.0749 |
concat_small_frames                          |  12.3281 |  11.4250 |   1.0790 |
frame_xs_row                                 |   0.0370 |   0.0341 |   1.0839 |
series_drop_duplicates_string                |   0.4370 |   0.4032 |   1.0840 |
frame_reindex_columns                        |   0.3030 |   0.2789 |   1.0863 |
join_dataframe_index_single_key_small        |   5.3852 |   4.9469 |   1.0886 |
frame_repr_tall                              |   2.2731 |   2.0730 |   1.0965 |
frame_ctor_nested_dict_int64                 |  82.3250 |  75.0730 |   1.0966 |
groupby_frame_cython_many_columns            |   3.0849 |   2.8129 |   1.0967 |
frame_xs_col                                 |   0.0231 |   0.0210 |   1.1023 |
timeseries_1min_5min_ohlc                    |   0.5291 |   0.4780 |   1.1067 |
timeseries_to_datetime_YYYYMMDD              |   7.9660 |   7.1719 |   1.1107 |
ctor_index_array_string                      |   0.0169 |   0.0150 |   1.1270 |
timeseries_slice_minutely                    |   0.0441 |   0.0381 |   1.1562 |
timeseries_timestamp_tzinfo_cons             |   0.0129 |   0.0110 |   1.1739 |
groupby_frame_apply_overhead                 |   8.7209 |   7.2041 |   1.2106 |
groupby_frame_apply                          |  42.1200 |  34.2922 |   1.2283 |
series_ctor_from_dict                        |   2.9459 |   2.1560 |   1.3664 |
period_setitem                               | 156.1499 | 104.1942 |   1.4986 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [da81e40] : ENH: Series and DataFrame constructor autodetect when index/columns
should be MultiIndex

ENH: prevents some seg faults in calls to cython funcs
Base   [d702de0] : Merge pull request #4783 from jtratner/wrap-compressed-fileobjects-in-py3

BUG: Fix input bytes conversion in Py3 to return str


<class 'pandas.core.frame.DataFrame'>
Index: 172 entries, frame_reindex_both_axes to period_setitem
Data columns (total 8 columns):
count    172  non-null values
mean     172  non-null values
std      172  non-null values
min      172  non-null values
25%      172  non-null values
50%      172  non-null values
75%      172  non-null values
max      172  non-null values
dtypes: float64(8)

jreback · 2013-09-24T10:18:29Z

can you add releaset notes? does it close all 3 of those listed issues? have tests from all 3? pls add the issues it closes to the very top of the PR, with something like closes #issue

jreback · 2013-09-24T10:22:08Z

can you perf tests the bottom 2 (period_setitem and series_constructor) with a higher number of ncalls (so it will avg them)

jtratner · 2013-09-24T10:23:32Z

If it turns out to be slow, might make sense to add a keyword argument to disable/enable this processing. (similar to tupleize_cols for csv, etc)

cpcloud · 2013-09-24T12:01:20Z

pandas/core/frame.py

@@ -481,7 +482,7 @@ def _init_dict(self, data, index, columns, dtype=None):
            data = dict((k, v) for k, v in compat.iteritems(data) if k in columns)

            if index is None:
-                index = extract_index(list(data.values()))
+                index = extract_index(data.values())


i think you need to keep the list constructor here because dict.values() is a view in Python 3, unless extract_index does the conversion for you

extract_index just iterates over the values and appends its elements to some list. So isn't the list(...) call unnecessary?

if it's passing on the Python 3 builds then yes :)

hayd · 2013-09-24T15:37:00Z

It's not so crazy that passing in tuples to index could lead to a mi... (You can also get a DatetimeIndex when creating an Index and it's very different from the Panel vs DataFrame!*)

In [1]: s = pd.Series(index=[[1, 3], [3, 4]])

In [2]: s.index
Out[2]:
MultiIndex
[(1, 3), (3, 4)]

In [3]: isinstance(s.index, pd.Index)
Out[3]: True

In [4]: pd.Index([[1, 3], [3, 4]])  # inconsistent / weird
Out[4]: Index([[1, 3], [3, 4]], dtype=object)

I think this should be consistent however you construct... imo this is a good opportunity to correct the above behaviour, where you get a different index depending! Should be the same via Index, Series(index=) or passing dict... atm it's not

+1 for adding some keyword to disable (?) this (like tuplize_col..)

adgaudio · 2013-09-24T18:38:09Z

@hayd,

I think it would be a great idea to have the Index constructure also return a MultiIndex, but only if a kwarg like "infer_multiindex=True" was passed to the constructor. By default, the "infer_multiindex=False" because if it was True, it would cause seg faults and break several tests. I might be able to tackle this in this PR. If not, perhaps another pr...

Logically, something also seems funky about letting an Index return a MultiIndex. For instance, should a DataFrame be able to return a Panel? I think you hinted at this in your comment.

adgaudio · 2013-09-24T19:46:13Z

@jtratner - thanks for your comments.

Here are the perf results with ncalls=5. I also just rebased against master. The average ratio is now actually a tiny bit worse (was 1.004 and now is 1.01), but the basic distribution of ratios looks more or less like the picture I posted above.

Are these performance hits alright for this PR, or should I add a kwarg to the DataFrame and Series constructors that disables it by default?

count 190.000000
mean 1.010531
std 0.041385
min 0.877500
25% 0.995000
50% 1.003550
75% 1.018125
max 1.231000

[... truncated ...]
frame_ctor_list_of_dict                      |  68.6394 |  64.7336 |   1.0603 |
merge_2intkey_sort                           |  34.2390 |  32.1878 |   1.0637 |
frame_fancy_lookup_all                       |  13.8032 |  12.9736 |   1.0639 |
frame_sort_index_by_columns                  |  33.8332 |  31.7710 |   1.0649 |
frame_ctor_nested_dict_int64                 |  81.2192 |  75.8838 |   1.0703 |
series_value_counts_strings                  |   4.4824 |   4.1206 |   1.0878 |
stats_rank_average                           |  23.8794 |  21.7868 |   1.0961 |
merge_2intkey_nosort                         |  15.9592 |  14.2020 |   1.1237 |
join_dataframe_index_multi                   |  19.3036 |  16.8800 |   1.1436 |
mask_bools                                   |  13.2502 |  11.4928 |   1.1529 |
groupby_pivot_table                          |  20.3042 |  16.7432 |   1.2127 |
groupby_multi_different_functions            |  13.4090 |  11.0168 |   1.2171 |
groupby_multi_different_numpy_functions      |  13.3692 |  10.8602 |   1.2310 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [8e15111] : rel notes  ### FYI - I squashed this, and it represents my latest commit.
Base   [275a42c] : Merge pull request #4965 from westurner/patch-1

DOC: remote_data.rst: Fama/French punctuation

jreback · 2014-03-09T19:32:23Z

doc/source/basics.rst

+    s = Series([1, 2, 3], index=idx)
+    s
+
+    s.index.identical(idx)


take this out; identical is public but will confuse users; they don't need to know about it

adgaudio · 2014-03-21T04:02:20Z

ready for review/merge again. just rebased master.

jreback · 2014-03-21T12:40:32Z

doc/source/indexing.rst


 .. ipython:: python

-   arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
+    arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],


take this space out

done - btw why are there 3 spaces in these sections instead of 4? Is this some kind of documentation standard?

No, it's not a standard. It's just that it doesn't matter, and there is no real standard. In reStructuredText, you will often see that it is indented like the start of the directive name (here ipython), so then you have 3 spaces (but even in their documentation they are not consequent with that). But I personally like to use 4 spaces as this is much more convenient to type with tab.

jreback · 2014-03-30T14:48:41Z

@adgaudio can you rebase ....i'll take another look

- Series and DataFrame constructor autodetect when index/columns should be MultiIndex - prevents some seg faults in calls to cython funcs - add tupleize_cols kwarg and update tests to git PR comments - support name= xor names= in Index(tuples, ....) constructor - docs

jreback · 2014-04-09T23:32:01Z

merged via f8c566c...9119641

(also added a bit of docs to dsintro.rst)

thanks @adgaudio !

nice addition!

jreback · 2014-04-09T23:34:52Z

@adgaudio side note, this is actually a dict of tuples, right? (you refer to it as a list of tuples) (I guess its technically a list of tuples too), or are the docs I added not clear?

jorisvandenbossche · 2014-04-10T08:20:27Z

@adgaudio There also a couple of doc failures due to your changes, see here https://travis-ci.org/pydata/pandas/jobs/22652839#L1855 and http://pandas-docs.github.io/pandas-docs-travis/indexing.html#creating-a-multiindex-hierarchical-index-object
(this is because you changed something in the index, and the same object is used in all subsequent examples in the file indexing.rst)

Can you fix them? You can also check the log of the docs build on travis (build 3, fully below), or build them first locally (see docs on this here: http://pandas.pydata.org/pandas-docs/stable/contributing.html#contributing-to-the-documentation)

jreback · 2014-04-10T12:20:56Z

fixed partially in c1440e9

adgaudio · 2014-04-11T05:03:40Z

This is great! Thank you for merging this work :)

@jreback Thanks for the partial fix to documentation. Also, you are correct that the PR works for both dicts of tuples and lists of tuples.

@jorisvandenbossche I should be able to get to these soon in another PR. I guess I was wrong to assume passing tests meant the documentation wouldn't have errors. Why do Travis builds still pass if I introduce documentation errors? Can/should we change that behavior?

jorisvandenbossche · 2014-04-11T08:10:28Z

@adgaudio I think the docs are OK now. Yes, indeed, getting green light from travis does not say anything about the doc build, only about the test suite that is passing. The docs are built on travis so you can see the log of that, but it does not notify when there were warnings (but we were thinking to maybe try to change this, see #3800)

hayd reviewed Sep 12, 2013
View reviewed changes

cpcloud reviewed Sep 24, 2013
View reviewed changes

jreback reviewed Mar 9, 2014
View reviewed changes

jreback reviewed Mar 21, 2014
View reviewed changes

adgaudio added 3 commits March 30, 2014 20:53

BUG: Index.identical(other) didn't compare type(other) to type(self)

9ddb811

updates

79a717a

jreback closed this Apr 9, 2014

jorisvandenbossche mentioned this pull request Apr 17, 2014

DOC: Incorrect whatsnew entry for GH6459 #6895

Closed

ENH: Series constructor converts dicts with tuples in keys to MultiIndex #4805

ENH: Series constructor converts dicts with tuples in keys to MultiIndex #4805

Conversation

adgaudio commented Sep 10, 2013

adgaudio commented Sep 10, 2013

Choose a reason for hiding this comment

hayd commented Sep 12, 2013

adgaudio commented Sep 14, 2013

jtratner commented Sep 14, 2013

hayd commented Sep 14, 2013

jtratner commented Sep 14, 2013

adgaudio commented Sep 14, 2013

jreback commented Sep 14, 2013

jreback commented Sep 14, 2013

adgaudio commented Sep 14, 2013

cpcloud commented Sep 14, 2013

jreback commented Sep 14, 2013

jreback commented Sep 14, 2013

jtratner commented Sep 15, 2013

cpcloud commented Sep 15, 2013

adgaudio commented Sep 15, 2013

jtratner commented Sep 15, 2013

adgaudio commented Sep 15, 2013

adgaudio commented Sep 24, 2013

adgaudio commented Sep 24, 2013

jreback commented Sep 24, 2013

jreback commented Sep 24, 2013

jtratner commented Sep 24, 2013

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hayd commented Sep 24, 2013

adgaudio commented Sep 24, 2013

adgaudio commented Sep 24, 2013

Choose a reason for hiding this comment

adgaudio commented Mar 21, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Mar 30, 2014

jreback commented Apr 9, 2014

jreback commented Apr 9, 2014

jorisvandenbossche commented Apr 10, 2014

jreback commented Apr 10, 2014

adgaudio commented Apr 11, 2014

jorisvandenbossche commented Apr 11, 2014