Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Series constructor converts dicts with tuples in keys to MultiIndex #4805

Closed
wants to merge 3 commits into from

Conversation

adgaudio
Copy link
Contributor

closes #3323
related #4187
related #1848

Hi,

I'd love for Series (and DataFrame) initialized with a dict to automatically create the MultiIndex. See below example.

Desired Result:

In [1]: pandas.Series({('a', 'b'): 1, ('a', 'a'): 0, ('a', 'c'): 2, ('b', 'a'): 3, ('b', 'b'): 4})
Out[1]:
a  a    0
   b    1
   c    2
b  a    3
   b    4
dtype: int64

In [2]:

Current result:

In [1]: pandas.Series({('a', 'b'): 1, ('a', 'a'): 0, ('a', 'c'): 2, ('b', 'a'): 3, ('b', 'b'): 4})
Out[1]:
(a, a)    0
(a, b)    1
(a, c)    2
(b, a)    3
(b, b)    4
dtype: int64

In [2]:

Is this something I can go ahead and update tests for? It seems like we should have it.

Also, what are your thoughts about adding this to the DataFrame constructor as well?

Thanks,
Alex

@adgaudio
Copy link
Contributor Author

@@ -499,7 +499,11 @@ def _init_dict(self, data, index, columns, dtype=None):
keys = list(data.keys())
if not isinstance(data, OrderedDict):
keys = _try_sort(list(data.keys()))
columns = data_names = Index(keys)
if keys and isinstance(keys[0], tuple):
index_func = MultiIndex.from_tuples
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know whether it's worth considering the edge case where only the first element is a tuple, and just doing Index.

@hayd
Copy link
Contributor

hayd commented Sep 12, 2013

You have some failing tests, because this isn't valid in python 3 (IIRC keys() is a set rather than a list):

data.keys()[0], tuple):
TypeError: 'dict_keys' object does not support indexing

You need to add some tests to demonstrate this behaviour for both Series and DataFrame.

I guess for consistency, this should also be the case if you are passing a list of tuples to index argument...? Are there any other cases where this would also make sense?

@adgaudio
Copy link
Contributor Author

Hey @hayd,

Thanks for your response. I was waiting for a response before continuing with this pr, hence the general incompleteness of the pr.

I think your idea to put this in the Index constructor is great, because then we don't need duplicate code in the Series or DataFrames. However, I've been trying to make this change, and it's actually quite complicated. First, I'm getting multiple seg faults (one of which I fixed and this fix might resolve other bugs in pandas). Second, because the Index constructor converts a list of tuples into a MultiIndex, it breaks a few tests.

It's fairly complicated to resolve, especially since I don't know an efficient way for figuring out causes for seg faults. Hopefully, I'll update this pr later this weekend with some progress.

@jtratner
Copy link
Contributor

@adgaudio if you're experiencing segfaults, you might want to check that you aren't doing the following (with Cython code):

  1. passing incorrectly typed objects to the code (Cython functions need to have exactly the correct type)
  2. passing objects of the wrong length (i.e., with code with the boundscheck false decorator)

For example, if you pass a tuple to a function that expects type list or a defaultdict to a function expecting dict, it could fail (though usually it'll explicitly tell you).

@hayd
Copy link
Contributor

hayd commented Sep 14, 2013

@jtratner I've seen some segfaulting with the MultiIndexes before (example was Series of a mi #4187, now NotImplemented), I didn't look into the why it was happening...

@jtratner
Copy link
Contributor

@hayd if I had to hazard a guess, it's probably because MultiIndex is basically a fake ndarray, as opposed to Index which really is an ndarray. So you can pass a MultiIndex to Cython and it will try to do things with it, but all of them will fail, since the internal datastructure does't really have anything to do with ndarray.

@adgaudio
Copy link
Contributor Author

Thanks for that tips @jtratner and @hayd! I just figured out some interesting details about GH #4187 that could let us propose a workaround fix to that one... When the Series constructor is called, it calls _sanitize_array, which eventually calls infer_dtype and causes the seg fault. ...So, if you make the cpython call to "infer_dtype(...)" (see inference.pyx) with a MultiIndex, you can cause the seg fault.


If we're interested in avoiding workarounds, I think the core question we need to ask is just what @jtratner basically asked: Is it possible to make a MultiIndex compatible with Cython?

If we can't make the MultiIndex (or pa.Array) cython compatible, it seems pretty clear that we will have a million little work arounds for these damn MultiIndexes. The work-arounds I'd propose are to add an "if isinstance(X, MultiIndex): data = data.values" to _sanitize_array and a variation of that to _homogenize() (in frame.py)

@jreback
Copy link
Contributor

jreback commented Sep 14, 2013

Its faulting somewhere else....this is safe to call (as it accepts an object type)

In [1]: pd.lib.infer_dtype(np.array([1,2,3]))
Out[1]: 'integer'

In [2]: pd.lib.infer_dtype(np.array([1,2,3],dtype=object))
Out[2]: 'integer'

In [3]: pd.lib.infer_dtype(np.array([1,2,3.0],dtype=object))
Out[3]: 'mixed-integer-float'

In [4]: pd.lib.infer_dtype(np.array(['a','b','c'],dtype=object))
Out[4]: 'string'

@jreback
Copy link
Contributor

jreback commented Sep 14, 2013

@adgaudio show the traceback where it faults (the whole one)

@adgaudio
Copy link
Contributor Author

Sorry, I realized that I mis-spoke and updated my comment a few mins ago. This is the error I get. No traceback:

In [13]: a = pandas.MultiIndex.from_tuples(zip(range(10), range(10)))

In [14]: pandas.lib.infer_dtype(a)
zsh: segmentation fault (core dumped)  ipython

-- OR --

In [2]: pandas.lib.fast_multiget({1: 2}, a)
zsh: segmentation fault (core dumped)  ipython

@cpcloud
Copy link
Member

cpcloud commented Sep 14, 2013

I'm getting this segfault as well. I don't even get an error message :(

@jreback
Copy link
Contributor

jreback commented Sep 14, 2013

Do this.....the problem is an index/mi is a 'special' that overrides some ndarray methods....
I don't think this will break anything else.

This is called in _possibly_convert_datetime, so that will do nothing (which is correct)

diff --git a/pandas/src/inference.pyx b/pandas/src/inference.pyx
index e0bbc1a..a37de47 100644
--- a/pandas/src/inference.pyx
+++ b/pandas/src/inference.pyx
@@ -41,6 +41,9 @@ def infer_dtype(object _values):
             _values = list(_values)
         values = list_to_object_array(_values)

+    if hasattr(values,'values'):
+        values = values.values
+
     val_kind = values.dtype.type
     if val_kind in _TYPE_MAP:
         return _TYPE_MAP[val_kind]

@jreback
Copy link
Contributor

jreback commented Sep 14, 2013

also...pls check perf, e.g. test_perf.sh for any change you make....it is important that Series construction doesn't change the common cases too much

as an aside...I don't think doing an isinstance(value, Index) in _sanitize_array is a big deal (after all they ARE (currrently) ndarrays)

@jtratner
Copy link
Contributor

@adgaudio @jreback yeah, it's tricky. MultiIndex is only nominally an ndarray, because all of its features are actually reimplemented in Python so if Cython does any optimizations for ndarray, it'll mess up around this.

Btw, for cython, better to do this:

values = getattr(values, 'values', values)

than the hasattr check, because it can optimize it better.

@cpcloud
Copy link
Member

cpcloud commented Sep 15, 2013

good 2 know about hasattr vs. getattr.

@adgaudio
Copy link
Contributor Author

Hmm - I think the topic of this pr has strayed a little from its original goal (which is primarily my fault!). The original goal was to make pandas.Index(list_of_tuples) return a MultiIndex. Having spent several hours in the past couple days looking at pandas internals, it seems pretty clear to me now that the Index should not return a MultiIndex. For instance, methods like MultiIndex._tuple_index rely on the fact that the Index won't return a MI... And as a further argument, if an Index constructor was allowed to return a MultiIndex, then shouldn't a DataFrame be able to return a Panel? That doesn't seem like functionality pandas would support.

On the other hand, we discussed some useful things that are probably worth implementing:

  • There may be some benefit to adding additional safeguards against seg faults.
    • in fast_multiget(...) and infer_dtype(...)
  • Series(some_multi_index) should work
    • (I think) more to the point, np.array(some_multi_index) should not return an empty list
    • this also would depend on @jreback's seg fault fix 3 comments up.

Is everyone okay if I close this PR and maybe make a new one to address the seg faults? In sum, I changed my mind and don't think an Index should return a MultiIndex.

@jtratner
Copy link
Contributor

@adgaudio if you want that makes sense. That said, I don't think it's unreasonable to have a dict input to constructor return something with a MultiIndex.

@adgaudio
Copy link
Contributor Author

Ok - I can certainly do that! In that case, I may just try to merge seg fault stuff into here. Will be interesting to see whether there's a significant perf hit with the changes this brings.

... will be in prog here for a few days

@adgaudio
Copy link
Contributor Author

Ok...this is ready for review. Sorry for the delay

In summary, MultiIndexes are automatically created for index and columns when passing a dict to Series and DataFrames.

I also ran a perf test and will be posting those results in a follow-up comment. I described() the last column of numbers, which represent the ratio of master to "my changes rebased against master". If I understand this correctly, this feature makes pandas slower by 1.005 times on average.

count 171.000000
mean 1.004673
std 0.076243
min 0.779600
25% 0.977800
50% 0.993800
75% 1.026200
max 1.498600

ratios

@adgaudio
Copy link
Contributor Author

Invoked with :
--ncalls: 1
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_reindex_both_axes                      |  15.2452 |  32.3210 |   0.4717 |
series_value_counts_strings                  |   4.0641 |   5.2130 |   0.7796 |
frame_add                                    |   3.9530 |   4.9031 |   0.8062 |
timeseries_asof_single                       |   0.0291 |   0.0360 |   0.8079 |
indexing_dataframe_boolean                   |  49.4530 |  57.8468 |   0.8549 |
indexing_dataframe_boolean_rows              |   0.2789 |   0.3099 |   0.9000 |
join_dataframe_index_single_key_bigger       |   5.2481 |   5.7790 |   0.9081 |
write_csv_standard                           |  30.0810 |  33.0341 |   0.9106 |
stats_rolling_mean                           |   1.0490 |   1.1470 |   0.9146 |
timeseries_add_irregular                     |  15.7468 |  17.2162 |   0.9147 |
reindex_frame_level_reindex                  |   0.5591 |   0.6101 |   0.9164 |
reshape_pivot_time_series                    | 129.9939 | 141.0310 |   0.9217 |
mask_bools                                   |  24.4670 |  26.4800 |   0.9240 |
frame_to_string_floats                       |  30.3071 |  32.7630 |   0.9250 |
frame_repr_wide                              |   0.7708 |   0.8330 |   0.9253 |
series_string_vector_slice                   | 146.1499 | 156.5220 |   0.9337 |
frame_getitem_single_column                  |  17.7062 |  18.9250 |   0.9356 |
stats_rank_average                           |  23.2301 |  24.7140 |   0.9400 |
append_frame_single_mixed                    |   0.5651 |   0.5999 |   0.9420 |
sort_level_one                               |   3.7041 |   3.9291 |   0.9427 |
mask_floats                                  |  12.7940 |  13.5541 |   0.9439 |
melt_dataframe                               |   1.6730 |   1.7691 |   0.9457 |
groupby_multi_series_op                      |  12.2011 |  12.8889 |   0.9466 |
stat_ops_frame_mean_int_axis_1               |   5.3780 |   5.6581 |   0.9505 |
frame_fancy_lookup                           |   2.7161 |   2.8570 |   0.9507 |
reshape_unstack_simple                       |   2.7251 |   2.8622 |   0.9521 |
frame_insert_500_columns_end                 |  79.2389 |  83.1900 |   0.9525 |
frame_drop_duplicates_na                     |  13.4761 |  14.1251 |   0.9541 |
frame_drop_duplicates                        |  13.3901 |  14.0190 |   0.9551 |
index_datetime_intersection                  |   8.4031 |   8.7969 |   0.9552 |
stat_ops_frame_sum_int_axis_0                |   4.8449 |   5.0640 |   0.9567 |
frame_to_csv2                                |  81.8870 |  85.3350 |   0.9596 |
stats_rank2d_axis1_average                   |  10.9470 |  11.3881 |   0.9613 |
stat_ops_frame_sum_float_axis_1              |   4.5249 |   4.6859 |   0.9657 |
frame_fillna_inplace                         |  10.3819 |  10.7038 |   0.9699 |
indexing_panel_subset                        |   0.3691 |   0.3800 |   0.9711 |
frame_to_csv                                 |  91.5911 |  94.3000 |   0.9713 |
frame_ctor_list_of_dict                      |  64.0192 |  65.8691 |   0.9719 |
merge_2intkey_sort                           |  32.4190 |  33.3450 |   0.9722 |
frame_multi_and                              |  69.8371 |  71.7850 |   0.9729 |
stat_ops_frame_sum_int_axis_1                |   4.9090 |   5.0452 |   0.9730 |
frame_reindex_both_axes_ix                   |  32.0160 |  32.8770 |   0.9738 |
series_drop_duplicates_int                   |   0.5660 |   0.5791 |   0.9774 |
timeseries_asof                              |   6.8550 |   7.0112 |   0.9777 |
join_dataframe_index_multi                   |  16.8700 |  17.2520 |   0.9779 |
join_dataframe_index_single_key_bigger_sort  |  13.3920 |  13.6938 |   0.9780 |
frame_loc_dups                               |   0.6258 |   0.6399 |   0.9780 |
frame_constructor_ndarray                    |   0.0429 |   0.0439 |   0.9783 |
reindex_fillna_backfill                      |   0.5271 |   0.5388 |   0.9783 |
series_align_left_monotonic                  |  10.3209 |  10.5450 |   0.9787 |
groupby_multi_cython                         |  13.3641 |  13.6521 |   0.9789 |
index_int64_union                            |  64.3780 |  65.7530 |   0.9791 |
stat_ops_frame_sum_float_axis_0              |   4.4549 |   4.5469 |   0.9798 |
dti_reset_index_tz                           |  10.0758 |  10.2811 |   0.9800 |
read_csv_vb                                  |  16.2349 |  16.5591 |   0.9804 |
stat_ops_series_std                          |   0.8879 |   0.9048 |   0.9813 |
timeseries_asof_nan                          |   6.3412 |   6.4590 |   0.9818 |
stat_ops_frame_mean_int_axis_0               |   4.8530 |   4.9391 |   0.9826 |
frame_to_csv_mixed                           | 191.1800 | 194.5429 |   0.9827 |
stat_ops_frame_mean_float_axis_0             |   4.4539 |   4.5290 |   0.9834 |
groupby_multi_different_numpy_functions      |  10.8140 |  10.9761 |   0.9852 |
frame_reindex_upcast                         |  10.7739 |  10.9279 |   0.9859 |
groupby_last                                 |   3.2680 |   3.3140 |   0.9861 |
stat_ops_level_series_sum                    |   2.3448 |   2.3770 |   0.9865 |
series_value_counts_int64                    |   1.8809 |   1.9040 |   0.9879 |
groupby_multi_python                         | 104.6109 | 105.8180 |   0.9886 |
merge_2intkey_nosort                         |  14.5009 |  14.6639 |   0.9889 |
frame_reindex_axis1                          | 454.3140 | 459.0740 |   0.9896 |
match_strings                                |   0.2739 |   0.2768 |   0.9897 |
groupby_series_simple_cython                 |   4.0159 |   4.0562 |   0.9901 |
series_align_int64_index                     |  23.7849 |  24.0159 |   0.9904 |
read_csv_thou_vb                             |  14.1041 |  14.2410 |   0.9904 |
frame_fancy_lookup_all                       |  12.5110 |  12.6290 |   0.9907 |
frame_mult                                   |   3.9458 |   3.9830 |   0.9907 |
lib_fast_zip                                 |   7.6091 |   7.6802 |   0.9907 |
stats_rank_average_int                       |  18.6820 |  18.8482 |   0.9912 |
groupby_pivot_table                          |  16.5930 |  16.7401 |   0.9912 |
timeseries_timestamp_downsample_mean         |   3.5810 |   3.6111 |   0.9917 |
frame_reindex_axis0                          |  77.9679 |  78.6190 |   0.9917 |
stats_corr_spearman                          |  70.6170 |  71.1949 |   0.9919 |
timeseries_infer_freq                        |   6.5250 |   6.5749 |   0.9924 |
timeseries_period_downsample_mean            |   5.3890 |   5.4300 |   0.9924 |
stat_ops_level_frame_sum                     |   2.9190 |   2.9399 |   0.9929 |
join_dataframe_integer_2key                  |   4.4181 |   4.4479 |   0.9933 |
series_align_irregular_string                |  50.0710 |  50.4081 |   0.9933 |
reindex_frame_level_align                    |   0.5832 |   0.5870 |   0.9935 |
stat_ops_level_frame_sum_multiple            |   7.2091 |   7.2539 |   0.9938 |
groupby_multi_size                           |  24.5860 |  24.7180 |   0.9947 |
stats_rank2d_axis0_average                   |  18.5640 |  18.6539 |   0.9952 |
sparse_series_to_frame                       | 117.1200 | 117.5661 |   0.9962 |
unstack_sparse_keyspace                      |   1.3909 |   1.3959 |   0.9964 |
groupby_first                                |   3.0730 |   3.0839 |   0.9964 |
frame_ctor_nested_dict                       |  57.8749 |  58.0280 |   0.9974 |
reindex_fillna_backfill_float32              |   0.4182 |   0.4191 |   0.9977 |
reshape_stack_simple                         |   1.4410 |   1.4429 |   0.9987 |
join_dataframe_integer_key                   |   1.4930 |   1.4949 |   0.9987 |
replace_replacena                            |   3.6631 |   3.6659 |   0.9992 |
groupby_frame_median                         |   5.4619 |   5.4641 |   0.9996 |
read_csv_standard                            |   9.2361 |   9.2380 |   0.9998 |
groupby_last_float32                         |   3.2320 |   3.2320 |   1.0000 |
series_constructor_ndarray                   |   0.0160 |   0.0160 |   1.0000 |
frame_fillna_many_columns_pad                |  12.1920 |  12.1920 |   1.0000 |
timeseries_large_lookup_value                |   0.0231 |   0.0231 |   1.0000 |
reindex_daterange_backfill                   |   2.4390 |   2.4371 |   1.0008 |
timeseries_sort_index                        |  18.2889 |  18.2478 |   1.0022 |
reindex_daterange_pad                        |   2.4340 |   2.4259 |   1.0033 |
frame_sort_index_by_columns                  |  31.2161 |  31.1110 |   1.0034 |
read_csv_comment2                            |  17.5040 |  17.4401 |   1.0037 |
groupby_indices                              |   6.3572 |   6.3312 |   1.0041 |
frame_boolean_row_select                     |   0.2131 |   0.2120 |   1.0056 |
timeseries_to_datetime_iso8601               |   3.8910 |   3.8612 |   1.0077 |
replace_fillna                               |   3.6750 |   3.6449 |   1.0082 |
lib_fast_zip_fillna                          |  10.5841 |  10.4971 |   1.0083 |
groupby_multi_different_functions            |  10.8781 |  10.7880 |   1.0084 |
datetimeindex_normalize                      |   2.2500 |   2.2280 |   1.0098 |
stat_ops_level_series_sum_multiple           |   6.6872 |   6.6178 |   1.0105 |
index_datetime_union                         |   8.6920 |   8.6010 |   1.0106 |
frame_get_dtype_counts                       |   0.0880 |   0.0870 |   1.0110 |
frame_iteritems                              |  25.3410 |  25.0630 |   1.0111 |
reindex_fillna_pad                           |   0.5260 |   0.5200 |   1.0115 |
frame_insert_100_columns_begin               |  15.4629 |  15.2860 |   1.0116 |
panel_from_dict_two_different_indexes        |  39.9530 |  39.4778 |   1.0120 |
groupby_first_float32                        |   3.0582 |   3.0160 |   1.0140 |
frame_getitem_single_column2                 |  18.0881 |  17.8120 |   1.0155 |
stat_ops_frame_mean_float_axis_1             |   5.0459 |   4.9639 |   1.0165 |
indexing_dataframe_boolean_rows_object       |   0.4680 |   0.4601 |   1.0171 |
groupby_sum_booleans                         |   0.7811 |   0.7651 |   1.0209 |
frame_drop_dup_inplace                       |   2.2190 |   2.1710 |   1.0221 |
read_parse_dates_iso8601                     |   1.1170 |   1.0900 |   1.0247 |
groupby_simple_compress_timing               |  29.3281 |  28.5370 |   1.0277 |
sparse_frame_constructor                     |   9.3942 |   9.1388 |   1.0279 |
sort_level_zero                              |   3.7801 |   3.6709 |   1.0297 |
panel_from_dict_all_different_indexes        |  53.7281 |  52.1729 |   1.0298 |
groupby_frame_singlekey_integer              |   2.1300 |   2.0669 |   1.0306 |
reindex_multiindex                           |   0.8409 |   0.8130 |   1.0343 |
append_frame_single_homogenous               |   0.2060 |   0.1991 |   1.0347 |
reindex_fillna_pad_float32                   |   0.4320 |   0.4170 |   1.0360 |
dataframe_reindex                            |   0.3462 |   0.3340 |   1.0364 |
groupby_transform                            | 117.1820 | 112.8821 |   1.0381 |
frame_get_numeric_data                       |   0.0780 |   0.0751 |   1.0381 |
groupby_apply_dict_return                    |  24.9550 |  23.9921 |   1.0401 |
index_int64_intersection                     |  23.8118 |  22.8329 |   1.0429 |
concat_series_axis1                          |  65.4199 |  62.6309 |   1.0445 |
read_table_multiple_date_baseline            |  75.4662 |  72.0320 |   1.0477 |
panel_from_dict_equiv_indexes                |  24.9131 |  23.7088 |   1.0508 |
timeseries_1min_5min_mean                    |   0.4830 |   0.4590 |   1.0525 |
dti_reset_index                              |   0.1781 |   0.1690 |   1.0536 |
panel_from_dict_same_index                   |  24.9889 |  23.6831 |   1.0551 |
datetimeindex_unique                         |   0.0939 |   0.0889 |   1.0563 |
read_table_multiple_date                     | 163.5280 | 153.4901 |   1.0654 |
frame_drop_dup_na_inplace                    |   2.2259 |   2.0850 |   1.0676 |
datetimeindex_add_offset                     |   0.1731 |   0.1619 |   1.0692 |
frame_iloc_dups                              |   0.2170 |   0.2019 |   1.0744 |
frame_iteritems_cached                       |   0.4621 |   0.4299 |   1.0749 |
concat_small_frames                          |  12.3281 |  11.4250 |   1.0790 |
frame_xs_row                                 |   0.0370 |   0.0341 |   1.0839 |
series_drop_duplicates_string                |   0.4370 |   0.4032 |   1.0840 |
frame_reindex_columns                        |   0.3030 |   0.2789 |   1.0863 |
join_dataframe_index_single_key_small        |   5.3852 |   4.9469 |   1.0886 |
frame_repr_tall                              |   2.2731 |   2.0730 |   1.0965 |
frame_ctor_nested_dict_int64                 |  82.3250 |  75.0730 |   1.0966 |
groupby_frame_cython_many_columns            |   3.0849 |   2.8129 |   1.0967 |
frame_xs_col                                 |   0.0231 |   0.0210 |   1.1023 |
timeseries_1min_5min_ohlc                    |   0.5291 |   0.4780 |   1.1067 |
timeseries_to_datetime_YYYYMMDD              |   7.9660 |   7.1719 |   1.1107 |
ctor_index_array_string                      |   0.0169 |   0.0150 |   1.1270 |
timeseries_slice_minutely                    |   0.0441 |   0.0381 |   1.1562 |
timeseries_timestamp_tzinfo_cons             |   0.0129 |   0.0110 |   1.1739 |
groupby_frame_apply_overhead                 |   8.7209 |   7.2041 |   1.2106 |
groupby_frame_apply                          |  42.1200 |  34.2922 |   1.2283 |
series_ctor_from_dict                        |   2.9459 |   2.1560 |   1.3664 |
period_setitem                               | 156.1499 | 104.1942 |   1.4986 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [da81e40] : ENH: Series and DataFrame constructor autodetect when index/columns
should be MultiIndex

ENH: prevents some seg faults in calls to cython funcs
Base   [d702de0] : Merge pull request #4783 from jtratner/wrap-compressed-fileobjects-in-py3

BUG: Fix input bytes conversion in Py3 to return str


<class 'pandas.core.frame.DataFrame'>
Index: 172 entries, frame_reindex_both_axes to period_setitem
Data columns (total 8 columns):
count    172  non-null values
mean     172  non-null values
std      172  non-null values
min      172  non-null values
25%      172  non-null values
50%      172  non-null values
75%      172  non-null values
max      172  non-null values
dtypes: float64(8)

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

can you add releaset notes? does it close all 3 of those listed issues? have tests from all 3? pls add the issues it closes to the very top of the PR, with something like closes #issue

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

can you perf tests the bottom 2 (period_setitem and series_constructor) with a higher number of ncalls (so it will avg them)

@jtratner
Copy link
Contributor

If it turns out to be slow, might make sense to add a keyword argument to disable/enable this processing. (similar to tupleize_cols for csv, etc)

@@ -481,7 +482,7 @@ def _init_dict(self, data, index, columns, dtype=None):
data = dict((k, v) for k, v in compat.iteritems(data) if k in columns)

if index is None:
index = extract_index(list(data.values()))
index = extract_index(data.values())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you need to keep the list constructor here because dict.values() is a view in Python 3, unless extract_index does the conversion for you

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extract_index just iterates over the values and appends its elements to some list. So isn't the list(...) call unnecessary?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's passing on the Python 3 builds then yes :)

@hayd
Copy link
Contributor

hayd commented Sep 24, 2013

It's not so crazy that passing in tuples to index could lead to a mi... (You can also get a DatetimeIndex when creating an Index and it's very different from the Panel vs DataFrame!*)

In [1]: s = pd.Series(index=[[1, 3], [3, 4]])

In [2]: s.index
Out[2]:
MultiIndex
[(1, 3), (3, 4)]

In [3]: isinstance(s.index, pd.Index)
Out[3]: True

In [4]: pd.Index([[1, 3], [3, 4]])  # inconsistent / weird
Out[4]: Index([[1, 3], [3, 4]], dtype=object)

I think this should be consistent however you construct... imo this is a good opportunity to correct the above behaviour, where you get a different index depending! Should be the same via Index, Series(index=) or passing dict... atm it's not

+1 for adding some keyword to disable (?) this (like tuplize_col..)

@adgaudio
Copy link
Contributor Author

@hayd,

I think it would be a great idea to have the Index constructure also return a MultiIndex, but only if a kwarg like "infer_multiindex=True" was passed to the constructor. By default, the "infer_multiindex=False" because if it was True, it would cause seg faults and break several tests. I might be able to tackle this in this PR. If not, perhaps another pr...

Logically, something also seems funky about letting an Index return a MultiIndex. For instance, should a DataFrame be able to return a Panel? I think you hinted at this in your comment.

@adgaudio
Copy link
Contributor Author

@jtratner - thanks for your comments.

Here are the perf results with ncalls=5. I also just rebased against master. The average ratio is now actually a tiny bit worse (was 1.004 and now is 1.01), but the basic distribution of ratios looks more or less like the picture I posted above.

Are these performance hits alright for this PR, or should I add a kwarg to the DataFrame and Series constructors that disables it by default?

count 190.000000
mean 1.010531
std 0.041385
min 0.877500
25% 0.995000
50% 1.003550
75% 1.018125
max 1.231000

[... truncated ...]
frame_ctor_list_of_dict                      |  68.6394 |  64.7336 |   1.0603 |
merge_2intkey_sort                           |  34.2390 |  32.1878 |   1.0637 |
frame_fancy_lookup_all                       |  13.8032 |  12.9736 |   1.0639 |
frame_sort_index_by_columns                  |  33.8332 |  31.7710 |   1.0649 |
frame_ctor_nested_dict_int64                 |  81.2192 |  75.8838 |   1.0703 |
series_value_counts_strings                  |   4.4824 |   4.1206 |   1.0878 |
stats_rank_average                           |  23.8794 |  21.7868 |   1.0961 |
merge_2intkey_nosort                         |  15.9592 |  14.2020 |   1.1237 |
join_dataframe_index_multi                   |  19.3036 |  16.8800 |   1.1436 |
mask_bools                                   |  13.2502 |  11.4928 |   1.1529 |
groupby_pivot_table                          |  20.3042 |  16.7432 |   1.2127 |
groupby_multi_different_functions            |  13.4090 |  11.0168 |   1.2171 |
groupby_multi_different_numpy_functions      |  13.3692 |  10.8602 |   1.2310 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [8e15111] : rel notes  ### FYI - I squashed this, and it represents my latest commit.
Base   [275a42c] : Merge pull request #4965 from westurner/patch-1

DOC: remote_data.rst: Fama/French punctuation

s = Series([1, 2, 3], index=idx)
s

s.index.identical(idx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take this out; identical is public but will confuse users; they don't need to know about it

@adgaudio
Copy link
Contributor Author

ready for review/merge again. just rebased master.


.. ipython:: python

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take this space out

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done - btw why are there 3 spaces in these sections instead of 4? Is this some kind of documentation standard?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's not a standard. It's just that it doesn't matter, and there is no real standard. In reStructuredText, you will often see that it is indented like the start of the directive name (here ipython), so then you have 3 spaces (but even in their documentation they are not consequent with that). But I personally like to use 4 spaces as this is much more convenient to type with tab.

@jreback
Copy link
Contributor

jreback commented Mar 30, 2014

@adgaudio can you rebase ....i'll take another look

- Series and DataFrame constructor autodetect when index/columns
should be MultiIndex
- prevents some seg faults in calls to cython funcs
- add tupleize_cols kwarg and update tests to git PR comments
- support name= xor names= in Index(tuples, ....) constructor
- docs
@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

merged via f8c566c...9119641

(also added a bit of docs to dsintro.rst)

thanks @adgaudio !

nice addition!

@jreback jreback closed this Apr 9, 2014
@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

@adgaudio side note, this is actually a dict of tuples, right? (you refer to it as a list of tuples) (I guess its technically a list of tuples too), or are the docs I added not clear?

@jorisvandenbossche
Copy link
Member

@adgaudio There also a couple of doc failures due to your changes, see here https://travis-ci.org/pydata/pandas/jobs/22652839#L1855 and http://pandas-docs.github.io/pandas-docs-travis/indexing.html#creating-a-multiindex-hierarchical-index-object
(this is because you changed something in the index, and the same object is used in all subsequent examples in the file indexing.rst)

Can you fix them? You can also check the log of the docs build on travis (build 3, fully below), or build them first locally (see docs on this here: http://pandas.pydata.org/pandas-docs/stable/contributing.html#contributing-to-the-documentation)

@jreback
Copy link
Contributor

jreback commented Apr 10, 2014

fixed partially in c1440e9

@adgaudio
Copy link
Contributor Author

This is great! Thank you for merging this work :)

@jreback Thanks for the partial fix to documentation. Also, you are correct that the PR works for both dicts of tuples and lists of tuples.

@jorisvandenbossche I should be able to get to these soon in another PR. I guess I was wrong to assume passing tests meant the documentation wouldn't have errors. Why do Travis builds still pass if I introduce documentation errors? Can/should we change that behavior?

@jorisvandenbossche
Copy link
Member

@adgaudio I think the docs are OK now. Yes, indeed, getting green light from travis does not say anything about the doc build, only about the test suite that is passing. The docs are built on travis so you can see the log of that, but it does not notify when there were warnings (but we were thinking to maybe try to change this, see #3800)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH/API: Allow dicts with tuple keys in DataFrame constructor
6 participants