CLN: series to now inherit from NDFrame #3482

Merged
merged 8 commits into from
Aug 16, 2013

jreback
Contributor

@jreback jreback commented Apr 29, 2013

Major refactor primarily to make Series inherit from NDFrame

affects #4080, #3862, #816, #3217, #3386, #4463, #4204, #4118 , #4555

Preserves pickle compat
very few tests were changed (and only for compat on return objects)
a few performance enhancements, a couple of regressions (see bottom)

obviously this is a large change in terms of the codebase, but it brings more consistency between series/frame/panel (not all of this is there yet, but future changes are much easier)

Series is now like Frame in that it has a BlockManager (called SingleBlockManager), which holds a block (of any type we support). This introduced some overhead in doing certain operations, which I spent a lot of time optimizing away. Further optimizations will come from cythonizing core/internals, which should be straightforward at this point.
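
A minimal sketch of the layering described above, for readers unfamiliar with it. The class names (`ToyBlock`, `ToySingleBlockManager`, `ToySeries`) are hypothetical and are not the pandas internals; the point is only that the series delegates storage to a manager that holds one block and one axis, instead of subclassing an array.

```python
# Toy illustration only: these classes are assumptions for the example,
# not the actual pandas implementation.

class ToyBlock:
    """Holds the values of a single dtype."""

    def __init__(self, values):
        self.values = list(values)
        # Record a simple type label taken from the first element.
        self.dtype = type(self.values[0]).__name__ if self.values else "object"


class ToySingleBlockManager:
    """Pairs exactly one block with one axis (the index)."""

    def __init__(self, block, index):
        index = list(index)
        if len(block.values) != len(index):
            raise ValueError("block and index must have the same length")
        self.block = block
        self.index = index


class ToySeries:
    """Delegates storage to the manager rather than being an array itself."""

    def __init__(self, values, index=None):
        values = list(values)
        if index is None:
            index = range(len(values))
        self._data = ToySingleBlockManager(ToyBlock(values), index)

    @property
    def values(self):
        return self._data.block.values

    def __len__(self):
        return len(self._data.index)

    def __getitem__(self, label):
        # Label lookup goes through the manager's axis, then into the block.
        position = self._data.index.index(label)
        return self._data.block.values[position]


s = ToySeries([1.5, 2.5, 3.5], index=["a", "b", "c"])
print(s["b"], len(s), s._data.block.dtype)
```

The extra indirection (series, manager, block) is where the construction overhead mentioned above would come from in a design like this.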

Highlights below:

In 0.13.0 there is a major refactor primarily to subclass Series from NDFrame,
which is the base class currently for DataFrame and Panel, to unify methods
and behaviors. Series formerly subclassed directly from ndarray.

  • Refactor of series.py/frame.py/panel.py to move common code to generic.py
    • added _setup_axes to create generic NDFrame structures
    • moved methods
      • from_axes,_wrap_array,axes,ix,shape,empty,swapaxes,transpose,pop
      • __iter__,keys,__contains__,__len__,__neg__,__invert__
      • convert_objects,as_blocks,as_matrix,values
      • __getstate__,__setstate__ (though compat remains in frame/panel)
      • __getattr__,__setattr__
      • _indexed_same,reindex_like,align,where,mask,replace
      • filter (also added axis argument to selectively filter on a different axis)
      • reindex,reindex_axis (which was the biggest change to make generic)
      • truncate (moved to become part of NDFrame)
  • These are API changes which make Panel more consistent with DataFrame
    • swapaxes on a Panel with the same axes specified now returns a copy
    • support attribute access for setting
    • filter supports same api as original DataFrame filter
  • Reindex called with no arguments will now return a copy of the input object
  • Series now inherits from NDFrame rather than directly from ndarray.
    There are several minor changes that affect the API.
    • numpy functions that do not support the array interface will now
      return ndarrays rather than series, e.g. np.diff and np.where
    • Series(0.5) would previously return the scalar 0.5; this is no
      longer supported
    • several methods from frame/series have moved to NDFrame
      (convert_objects,where,mask)
    • TimeSeries is now an alias for Series. the property is_time_series
      can be used to distinguish (if desired)
  • Refactor of Sparse objects to use BlockManager
    • Created a new block type in internals, SparseBlock, which can hold multi-dtypes
      and is non-consolidatable. SparseSeries and SparseDataFrame now inherit
      more methods from their hierarchy (Series/DataFrame), and no longer inherit
      from SparseArray (which instead is the object of the SparseBlock)
    • Sparse suite now supports integration with non-sparse data. Non-float sparse
      data is supportable (partially implemented)
    • Operations on sparse structures within DataFrames should preserve sparseness,
      merging type operations will convert to dense (and back to sparse), so might
      be somewhat inefficient
    • enable setitem on SparseSeries for boolean/integer/slices
    • SparsePanels implementation is unchanged (e.g. not using BlockManager, needs work)
  • added ftypes method to Series/DataFrame, similar to dtypes, but indicates
    if the underlying is sparse/dense (as well as the dtype)
  • All NDFrame objects now have a _prop_attributes, which can be used to indicate various
    values to propagate to a new object from an existing one (e.g. name in Series will follow
    more automatically now)
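
The attribute-propagation idea in the last bullet can be sketched roughly as follows. This is an illustration under assumptions: `ToyFrameBase`, `_propagate_to`, and this `ToySeries` are hypothetical names, and only the `_prop_attributes` name comes from the description above. The real mechanism in the PR may differ.

```python
# Hypothetical sketch: a class-level list names the attributes that should
# be copied from an existing object onto any object derived from it.

class ToyFrameBase:
    _prop_attributes = []  # subclasses list the attribute names to carry over

    def _propagate_to(self, other):
        """Copy every listed attribute from self onto other, then return other."""
        for attr in self._prop_attributes:
            setattr(other, attr, getattr(self, attr, None))
        return other


class ToySeries(ToyFrameBase):
    _prop_attributes = ["name"]

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    def astype(self, typ):
        # Build the new object first, then let the base class carry
        # the listed metadata across.
        result = ToySeries(typ(v) for v in self.values)
        return self._propagate_to(result)


s = ToySeries([1, 2, 3], name="price")
f = s.astype(float)
print(f.name, f.values)
```

Centralizing the copy in one helper means a method that builds a new object only has to remember one call, rather than each attribute by hand.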

Perf changed a bit, primarily in groupby, where a Series has to be reconstructed in order to be passed to the function (in some cases). I basically pass a Series-like class to the grouped function to see whether it raises. If it's OK, that class is used rather than a full Series, which reduces the overhead of creating a Series for each group.
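
The probe-then-fall-back approach described above might look roughly like this. It is a sketch under assumptions: `LightGroup`, `FullGroup`, and `apply_groups` are hypothetical stand-ins, not the code in this PR.

```python
# Hypothetical sketch of "try the cheap object first, fall back if it raises".

class LightGroup:
    """Cheap stand-in that exposes only .values."""

    def __init__(self, values):
        self.values = values


class FullGroup(LightGroup):
    """Stand-in for the fully featured (more expensive) object."""

    def __init__(self, values):
        super().__init__(values)
        self.index = list(range(len(values)))  # extra per-group construction work


def apply_groups(groups, func):
    """Apply func to each group, using the cheap wrapper when func accepts it."""
    if not groups:
        return {}
    probe = next(iter(groups.values()))
    try:
        func(LightGroup(probe))   # probe once with the cheap object
        wrapper = LightGroup
    except Exception:
        wrapper = FullGroup       # func needs something the cheap object lacks
    return {key: func(wrapper(vals)) for key, vals in groups.items()}


groups = {"a": [1, 2, 3], "b": [4, 5]}
sums = apply_groups(groups, lambda g: sum(g.values))   # cheap path works
lens = apply_groups(groups, lambda g: len(g.index))    # falls back to FullGroup
print(sums, lens)
```

One caveat of this pattern is that the probe calls the function an extra time on the first group, which matters if the function has side effects.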

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_multi_python                         | 109.3636 |  78.3370 |   1.3961 |
frame_iteritems                              |   3.4664 |   2.0154 |   1.7200 |
frame_fancy_lookup                           |   3.3991 |   1.6137 |   2.1064 |
sparse_frame_constructor                     |  11.7100 |   5.3363 |   2.1944 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Target [c5d9495] : BUG: fix ujson handling of new series object
Base   [1b91f4f] : BUG: Fixed non-unique indexing memory allocation issue with .ix/.loc (GH4280)

@jreback
Contributor Author

jreback commented Jul 17, 2013

cc @Komnomnomnom

If you have a chance...this is my refactor of Series to inherit from NDFrame, like DataFrame and co.

I wrote this a while back and just rebased to current master. Almost all passing, except for the ujson stuff.

I took a brief look, but not easy for me to debug this.

Series looks the same for all intents and purposes (however there may be some memory differences, e.g.
.values now returns the contained numpy array directly rather than a view on it)

can you have a look see...

thanks

Jeff

@@ -29,19 +29,30 @@
except Exception: # pragma: no cover
pass


Contributor

FYI: PEP8 standard is two lines between declarations at the top level

@Komnomnomnom
Contributor

@jreback the issues were caused by PyArray_SIZE now always returning 1 for a Series object. I've fixed that up and the problems appear to be gone. Most tests pass (see below) and the output from valgrind appears to be ok.

The one remaining test failure is due to astype no longer preserving Series.name, with your changes I'm unsure where to fix this or if it's intentional. Let me know if I can help further though!

What's the etiquette for a pull request on a pull request? I've just pushed the code to my fork, presumably the best way is to cherry-pick the commits from there?

@jreback
Contributor Author

jreback commented Jul 18, 2013

@Komnomnomnom

thanks so much!!!!

I will take a look at the astype...prob not even testing that...will take care of that easy..

I cc'd @cpcloud because not 100% sure how to push to this PR....my git-fu is sometimes non-fu...

@jreback
Contributor Author

jreback commented Jul 18, 2013

ok...seems that best way here is just to pull down your branch and cherry-pick....

@cpcloud suggest that you could submit a PR to my branch...I guess that's sort of the same thing

@Komnomnomnom
Contributor

Yeah that's what I was talking about with the cherry-pick, I can do a PR on your fork though. Might be cleaner that way, give me a sec.

@Komnomnomnom
Contributor

OK pull request created jreback#2

@jreback
Contributor Author

jreback commented Jul 18, 2013

worked beautifully (well, after I accidentally merged ALL of your branch in and had to rebase it out)

@jreback
Contributor Author

jreback commented Jul 22, 2013

@cpcloud thanks...pep8d most of the major changes..easy

@cpcloud
Member

cpcloud commented Jul 22, 2013

np. really excited about this pr

@jreback
Contributor Author

jreback commented Jul 22, 2013

well...it technically doesn't change anything!!! (except for lots of code)

@wesm
Member

wesm commented Jul 22, 2013

Well, this is downright epic, Jeff. To start, can you post a test_perf.sh run of this versus master? I think we should discuss some big picture pandas things at some point. For example, I'm starting to become a bit down on the BlockManager concept generally. It was a nice idea in practice but it's not clear that it's helping us terribly-- it's only beneficial when you have a homogeneously-typed DataFrame that you want to do fast row-wise operations on, so that consolidation yields performance boosts rather than having to "glue" the pieces together each time. For mixed-type frames, it's basically a bust.

We also need to plot a way to place a layer between pandas and NumPy so we can have better control over the data representation. For example, I would like to have integer NAs. From my point of view at some point using NumPy at all won't continue to make sense (example: why are we forcing people to import the whole numpy library when all we need is an array object, basically).

@jreback
Contributor Author

jreback commented Jul 22, 2013

I had to optimize Series/Block/Index creation (basically shortcut it)
to get close perf (and in SeriesGrouper had to create a SeriesNDArray
in order to avoid creation overhead); groupby_multi_python was actually 2x slower before that

e.g. the diff in frame_iteritems is all about Block creation time (it's just the difference in the number of function calls needed here)

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_apply_dict_return                    |  30.1254 |  42.8553 |   0.7030 |
read_store_table                             |   2.2089 |   2.6564 |   0.8316 |
timeseries_slice_minutely                    |   0.0457 |   0.0544 |   0.8406 |
frame_mult_no_ne                             |   4.8280 |   5.7413 |   0.8409 |
query_store_table_wide                       |   9.2204 |  10.8593 |   0.8491 |
indexing_dataframe_boolean_no_ne             |  47.2077 |  55.0553 |   0.8575 |
write_store_table_wide                       | 119.3500 | 133.1810 |   0.8961 |
frame_add_no_ne                              |   4.8010 |   5.3360 |   0.8997 |
join_dataframe_index_single_key_small        |   5.0484 |   5.6057 |   0.9006 |
frame_add                                    |   4.9837 |   5.5303 |   0.9012 |
write_store_mixed                            |  14.0210 |  15.5247 |   0.9031 |
indexing_panel_subset                        |   0.4337 |   0.4776 |   0.9080 |
indexing_dataframe_boolean_st                |   9.0740 |   9.9507 |   0.9119 |
query_store_table                            |   3.9940 |   4.2813 |   0.9329 |
read_store_mixed                             |   3.8550 |   4.1220 |   0.9352 |
write_store_table_panel                      |  86.0810 |  92.0393 |   0.9353 |
write_store                                  |   5.4080 |   5.7473 |   0.9410 |
series_align_int64_index                     |  23.9603 |  25.2674 |   0.9483 |
merge_2intkey_sort                           |  34.7170 |  36.5040 |   0.9510 |
stats_rank_average                           |  26.5393 |  27.8924 |   0.9515 |
read_store_table_panel                       |  20.4090 |  21.4440 |   0.9517 |
datetimeindex_unique                         |   0.1070 |   0.1123 |   0.9533 |
write_store_table                            |  54.3770 |  56.9147 |   0.9554 |
frame_multi_and_st                           |  33.6154 |  35.1160 |   0.9573 |
groupby_series_simple_cython                 |   3.9733 |   4.1423 |   0.9592 |
frame_mult_st                                |   4.9059 |   5.1064 |   0.9607 |
join_dataframe_integer_2key                  |   5.4057 |   5.6264 |   0.9608 |
groupbym_frame_apply                         |  42.1120 |  43.8066 |   0.9613 |
frame_multi_and_no_ne                        |  94.2967 |  98.0600 |   0.9616 |
read_store                                   |   1.6804 |   1.7427 |   0.9642 |
groupby_frame_cython_many_columns            |   3.2183 |   3.3157 |   0.9706 |
groupby_multi_series_op                      |  12.3080 |  12.6344 |   0.9742 |
frame_reindex_upcast                         |  11.9727 |  12.2847 |   0.9746 |
timeseries_add_irregular                     |  17.7580 |  18.2043 |   0.9755 |
frame_iteritems_cached                       |   0.0580 |   0.0594 |   0.9772 |
join_dataframe_index_single_key_bigger       |   5.7073 |   5.8173 |   0.9811 |
groupby_multi_cython                         |  13.6540 |  13.8827 |   0.9835 |
series_drop_duplicates_int                   |   0.6660 |   0.6760 |   0.9852 |
frame_reindex_both_axes                      |  35.1970 |  35.7217 |   0.9853 |
match_strings                                |   0.3043 |   0.3087 |   0.9858 |
write_store_table_mixed                      | 111.2707 | 112.5107 |   0.9890 |
index_int64_intersection                     |  20.1653 |  20.3880 |   0.9891 |
stats_corr_spearman                          |  84.6077 |  85.5140 |   0.9894 |
timeseries_asof_nan                          |   8.7376 |   8.8310 |   0.9894 |
timeseries_sort_index                        |  18.8496 |  19.0480 |   0.9896 |
groupby_frame_singlekey_integer              |   1.9586 |   1.9767 |   0.9908 |
frame_reindex_both_axes_ix                   |  34.5476 |  34.8340 |   0.9918 |
frame_fillna_many_columns_pad                |  12.4653 |  12.5554 |   0.9928 |
frame_to_csv                                 | 109.6930 | 110.4193 |   0.9934 |
groupby_pivot_table                          |  15.6794 |  15.7820 |   0.9935 |
frame_fillna_inplace                         |  10.8307 |  10.9014 |   0.9935 |
groupby_simple_compress_timing               |  27.0474 |  27.2213 |   0.9936 |
timeseries_infer_freq                        |   8.4887 |   8.5417 |   0.9938 |
timeseries_timestamp_downsample_mean         |   4.0991 |   4.1237 |   0.9940 |
frame_multi_and                              |  33.7937 |  33.9783 |   0.9946 |
groupby_first                                |   3.1400 |   3.1540 |   0.9955 |
groupby_frame_apply_overhead                 |   8.4897 |   8.5273 |   0.9956 |
frame_insert_500_columns                     | 101.0017 | 101.4206 |   0.9959 |
read_table_multiple_date                     | 166.4723 | 167.1600 |   0.9959 |
read_store_table_mixed                       |   4.6920 |   4.7084 |   0.9965 |
period_setitem                               | 136.5511 | 136.8690 |   0.9977 |
frame_to_csv_mixed                           | 175.8357 | 176.2343 |   0.9977 |
read_table_multiple_date_baseline            |  76.1763 |  76.3150 |   0.9982 |
datetimeindex_add_offset                     |   0.1957 |   0.1960 |   0.9984 |
replace_replacena                            |   3.7770 |   3.7830 |   0.9984 |
groupby_transform                            | 123.8833 | 123.9620 |   0.9994 |
reindex_multiindex                           |   1.1110 |   1.1117 |   0.9994 |
frame_insert_100_columns_begin               |  19.5011 |  19.5107 |   0.9995 |
stat_ops_level_frame_sum_multiple            |   6.4936 |   6.4936 |   1.0000 |
index_int64_union                            |  69.1380 |  69.0633 |   1.0011 |
frame_reindex_axis1                          | 557.9860 | 557.2210 |   1.0014 |
timeseries_asof                              |   9.2630 |   9.2460 |   1.0018 |
stats_rank2d_axis1_average                   |  11.4047 |  11.3780 |   1.0023 |
timeseries_period_downsample_mean            |   5.5366 |   5.5220 |   1.0026 |
series_align_left_monotonic                  |  10.9127 |  10.8780 |   1.0032 |
datetimeindex_normalize                      |   3.2537 |   3.2427 |   1.0034 |
stats_rank2d_axis0_average                   |  20.2977 |  20.2250 |   1.0036 |
lib_fast_zip_fillna                          |  11.4247 |  11.3740 |   1.0045 |
reshape_unstack_simple                       |   3.0840 |   3.0696 |   1.0047 |
ctor_index_array_string                      |   0.0161 |   0.0160 |   1.0050 |
mask_floats                                  |   4.0720 |   4.0517 |   1.0050 |
append_frame_single_homogenous               |   0.2440 |   0.2426 |   1.0056 |
groupby_multi_different_functions            |  10.1254 |  10.0676 |   1.0057 |
timeseries_1min_5min_ohlc                    |   0.6196 |   0.6157 |   1.0065 |
stats_rank_average_int                       |  18.8187 |  18.6690 |   1.0080 |
read_parse_dates_iso8601                     |   1.2647 |   1.2543 |   1.0082 |
stat_ops_level_series_sum_multiple           |   5.7410 |   5.6937 |   1.0083 |
replace_fillna                               |   3.7707 |   3.7390 |   1.0085 |
frame_ctor_nested_dict                       |  71.0493 |  70.4453 |   1.0086 |
groupby_sum_booleans                         |   0.8903 |   0.8817 |   1.0098 |
write_csv_standard                           |  36.4190 |  36.0450 |   1.0104 |
frame_boolean_row_select                     |   0.2124 |   0.2100 |   1.0110 |
series_align_irregular_string                |  59.1147 |  58.4103 |   1.0121 |
groupby_last                                 |   3.3380 |   3.2949 |   1.0131 |
sort_level_one                               |   3.8170 |   3.7677 |   1.0131 |
mask_bools                                   |  15.1087 |  14.9036 |   1.0138 |
stat_ops_level_frame_sum                     |   2.5753 |   2.5396 |   1.0141 |
stat_ops_series_std                          |   0.2380 |   0.2347 |   1.0142 |
append_frame_single_mixed                    |   0.7380 |   0.7260 |   1.0165 |
indexing_dataframe_boolean                   |   9.0990 |   8.9504 |   1.0166 |
index_datetime_union                         |  10.5806 |  10.3947 |   1.0179 |
timeseries_1min_5min_mean                    |   0.5693 |   0.5590 |   1.0185 |
groupby_multi_different_numpy_functions      |  10.2143 |  10.0257 |   1.0188 |
series_ctor_from_dict                        |   2.9153 |   2.8613 |   1.0189 |
frame_fancy_lookup_all                       |  14.4053 |  14.1333 |   1.0192 |
groupby_first_float32                        |   3.1363 |   3.0750 |   1.0200 |
unstack_sparse_keyspace                      |   1.4650 |   1.4363 |   1.0200 |
frame_loc_dups                               |   0.6834 |   0.6693 |   1.0210 |
groupby_last_float32                         |   3.3177 |   3.2420 |   1.0233 |
stats_rolling_mean                           |   1.2956 |   1.2640 |   1.0250 |
groupby_indices                              |   5.9083 |   5.7550 |   1.0266 |
frame_drop_duplicates_na                     |  15.6890 |  15.2790 |   1.0268 |
write_store_table_dc                         | 157.9533 | 153.6417 |   1.0281 |
index_datetime_intersection                  |  10.6143 |  10.3124 |   1.0293 |
frame_mult                                   |   5.1623 |   5.0100 |   1.0304 |
timeseries_timestamp_tzinfo_cons             |   0.0130 |   0.0126 |   1.0314 |
join_dataframe_integer_key                   |   1.6936 |   1.6403 |   1.0325 |
frame_wide_repr                              |   0.8543 |   0.8263 |   1.0340 |
frame_to_string_floats                       |  35.7140 |  34.5340 |   1.0342 |
read_csv_thou_vb                             |  31.3516 |  30.3144 |   1.0342 |
frame_drop_duplicates                        |  15.4893 |  14.9683 |   1.0348 |
dti_reset_index_tz                           |  11.5233 |  11.1303 |   1.0353 |
read_csv_vb                                  |  19.4740 |  18.7774 |   1.0371 |
frame_sort_index_by_columns                  |  31.9623 |  30.7306 |   1.0401 |
timeseries_asof_single                       |   0.0317 |   0.0304 |   1.0445 |
reindex_frame_level_align                    |   0.6257 |   0.5977 |   1.0468 |
read_csv_standard                            |  11.3810 |  10.8480 |   1.0491 |
reindex_daterange_pad                        |   0.1620 |   0.1543 |   1.0494 |
concat_small_frames                          |  13.2267 |  12.5913 |   1.0505 |
timeseries_to_datetime_iso8601               |   4.9740 |   4.7104 |   1.0560 |
reindex_frame_level_reindex                  |   0.5933 |   0.5607 |   1.0581 |
panel_from_dict_all_different_indexes        |  59.2630 |  55.9990 |   1.0583 |
stat_ops_level_series_sum                    |   1.8779 |   1.7667 |   1.0630 |
series_value_counts_strings                  |   4.1653 |   3.9126 |   1.0646 |
frame_drop_dup_inplace                       |   2.6743 |   2.5120 |   1.0646 |
dti_reset_index                              |   0.2023 |   0.1900 |   1.0648 |
groupby_frame_median                         |   6.1757 |   5.7983 |   1.0651 |
series_value_counts_int64                    |   2.0937 |   1.9613 |   1.0675 |
frame_to_csv2                                | 102.8513 |  95.9927 |   1.0714 |
panel_from_dict_two_different_indexes        |  44.5470 |  41.4983 |   1.0735 |
frame_ctor_list_of_dict                      |  76.4464 |  71.1413 |   1.0746 |
groupby_multi_size                           |  25.6070 |  23.8177 |   1.0751 |
join_dataframe_index_multi                   |  18.1350 |  16.7967 |   1.0797 |
indexing_dataframe_boolean_rows_object       |   0.4670 |   0.4294 |   1.0875 |
panel_from_dict_equiv_indexes                |  26.9910 |  24.7306 |   1.0914 |
series_drop_duplicates_string                |   0.6380 |   0.5840 |   1.0925 |
reindex_daterange_backfill                   |   0.1640 |   0.1500 |   1.0927 |
frame_ctor_nested_dict_int64                 |  84.7967 |  77.5804 |   1.0930 |
sort_level_zero                              |   4.0953 |   3.7367 |   1.0960 |
dataframe_reindex                            |   0.4083 |   0.3713 |   1.0997 |
reshape_pivot_time_series                    | 157.4883 | 143.1960 |   1.0998 |
read_store_table_wide                        |  18.9090 |  17.1623 |   1.1018 |
join_dataframe_index_single_key_bigger       |  14.5197 |  13.1764 |   1.1019 |
frame_reindex_axis0                          |  88.7680 |  80.4106 |   1.1039 |
panel_from_dict_same_index                   |  27.6597 |  25.0386 |   1.1047 |
indexing_dataframe_boolean_rows              |   0.2530 |   0.2290 |   1.1052 |
read_csv_comment2                            |  23.0260 |  20.8197 |   1.1060 |
merge_2intkey_nosort                         |  17.2536 |  15.5774 |   1.1076 |
frame_reindex_columns                        |   0.3567 |   0.3210 |   1.1112 |
frame_constructor_ndarray                    |   0.0470 |   0.0420 |   1.1172 |
concat_series_axis1                          |  67.7530 |  60.4870 |   1.1201 |
sparse_series_to_frame                       | 135.2253 | 119.4077 |   1.1325 |
frame_add_st                                 |   5.5370 |   4.8783 |   1.1350 |
frame_get_dtype_counts                       |   0.0960 |   0.0843 |   1.1385 |
frame_drop_dup_na_inplace                    |   2.6460 |   2.3100 |   1.1454 |
series_string_vector_slice                   | 167.6626 | 145.2027 |   1.1547 |
frame_iloc_dups                              |   0.2377 |   0.2053 |   1.1580 |
lib_fast_zip                                 |   9.6827 |   8.3347 |   1.1617 |
frame_xs_col                                 |   0.0273 |   0.0233 |   1.1741 |
frame_get_numeric_data                       |   0.0970 |   0.0817 |   1.1877 |
reshape_stack_simple                         |   1.4356 |   1.2027 |   1.1937 |
timeseries_large_lookup_value                |   0.0257 |   0.0213 |   1.2052 |
frame_repr_tall                              |   2.5503 |   2.0956 |   1.2170 |
melt_dataframe                               |   2.0140 |   1.6440 |   1.2251 |
frame_xs_row                                 |   0.0443 |   0.0339 |   1.3068 |
reindex_fillna_pad                           |   0.1480 |   0.1127 |   1.3131 |
reindex_fillna_backfill                      |   0.1489 |   0.1130 |   1.3179 |
reindex_fillna_backfill_float32              |   0.1370 |   0.0970 |   1.4131 |
reindex_fillna_pad_float32                   |   0.1363 |   0.0963 |   1.4150 |
series_constructor_ndarray                   |   0.0170 |   0.0117 |   1.4558 |
groupby_multi_python                         | 119.5020 |  78.5190 |   1.5219 |
dataframe_getitem_scalar                     |   0.0103 |   0.0063 |   1.6456 |
frame_iteritems                              |   3.5400 |   2.0394 |   1.7359 |
frame_fancy_lookup                           |   3.2120 |   1.5923 |   2.0172 |
sparse_frame_constructor                     |  12.0353 |   5.4206 |   2.2203 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [a126b7e] : BLD: pep8 major changes
Base   [d7c6eb1] : DOC: cookbook example
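
For reading the table above: the ratio column appears to be head time divided by base time. A small sketch of that arithmetic, using two rows copied from the table; `perf_ratio` is a hypothetical helper, not part of the benchmark tooling. Because the printed timings are already rounded, a recomputed ratio can differ from the printed one in the last digit.

```python
def perf_ratio(head_ms, base_ms):
    """Head (target) time over base time; below 1.0 means the target is faster."""
    return head_ms / base_ms


# (head[ms], base[ms]) pairs copied from the table above.
rows = {
    "groupby_apply_dict_return": (30.1254, 42.8553),
    "sparse_frame_constructor": (12.0353, 5.4206),
}

for name, (head, base) in rows.items():
    ratio = perf_ratio(head, base)
    verdict = "faster" if ratio < 1.0 else "slower"
    print(f"{name}: {ratio:.4f} ({verdict} than base)")
```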

@jreback
Contributor Author

jreback commented Jul 22, 2013

@wesm here's my 2c on basic design:

The row centric view of the arrays is a natural way of looking at things. However, you are right, the machinery to hold it really doesn't make it any faster or less complicated (and more complicated when dealing with mixed types as the blocks need to be split).

I have seen various reasons why people go to column oriented structures.. (e.g. blaze, ctable), and then just combine or operate on demand.

However, you pretty much need a Holder of some kind to provide the indirection layer between the n-d objects and the actual data.

An ArrayBlockManager could be much simpler, holding essentially a dict of Series, and a Panel would be a dict of DataFrames, along with axes info (maybe with a type map internally as well, e.g. a dict of the FloatBlocks etc)

pros:

  • Duplicate management is simple
  • add/delete very simple.
  • back compat
  • no need for merging blocks or consolidation
  • no need for splitting of blocks
  • type promotion/demotion is easy

cons:

  • column oriented ops will need essentially a concat before execution,
    but easy to pick out, and this is pretty fast
  • row oriented ops can be executed on the underlying data
  • the underlying data holders are more typed Series than Blocks (but similar), so
    need to have a code mechanism to deal with this
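
A rough sketch of the dict-of-columns holder floated above, to make the trade-offs concrete. `ToyColumnStore` and its methods are hypothetical names chosen for illustration; nothing like this exists in the PR.

```python
# Hypothetical sketch: each column is stored independently, keyed by name,
# alongside one shared index.

class ToyColumnStore:
    def __init__(self, index):
        self.index = list(index)
        self.columns = {}  # column name -> list of values

    def add(self, name, values):
        values = list(values)
        if len(values) != len(self.index):
            raise ValueError("column length must match the index")
        # No consolidation and no block splitting: just store the column.
        self.columns[name] = values

    def delete(self, name):
        del self.columns[name]

    def dtypes(self):
        # A per-column type map; mixed types never interact.
        return {n: type(v[0]).__name__ for n, v in self.columns.items() if v}

    def row(self, label):
        # A row has to be gathered across the separate columns on demand.
        i = self.index.index(label)
        return {n: v[i] for n, v in self.columns.items()}


store = ToyColumnStore(index=["x", "y"])
store.add("a", [1, 2])
store.add("b", [1.5, 2.5])
store.add("c", ["u", "v"])
store.delete("c")
print(store.dtypes(), store.row("y"))
```

Adding and deleting touch a single dict entry, while any operation spanning several columns has to assemble its input first.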

Here's what I see as the fundamental issue: pandas has to be very general. Though there are better
structures for row-type ops (vs column-type ops), it is very tricky to specify a priori what you think you need. The user essentially has to designate it (I am not a big fan of having 2 different reps of DataFrame, e.g. row-major vs. column-major, though I guess you could).

@jreback
Contributor Author

jreback commented Jul 22, 2013

on removing some dependence on numpy. A lot of the bug fixes have been aimed at making pandas very consistent in spite of numpy issues/bugs.

I think I once mentioned a possible solution to integer NaN: just use a sentinel value, like NaT. Pretty straightforward to do (though this entails type promotion of int16/int32 types).
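
A minimal sketch of the sentinel idea, assuming the smallest int64 value is reserved as the marker (the same value pandas uses internally for NaT). `INT_NA`, `isna`, `nansum`, and `fits_int16` are hypothetical names for illustration only.

```python
# Hypothetical sketch: reserve one integer value to mean "missing".

INT64_MIN = -2**63
INT_NA = INT64_MIN  # assumed sentinel for a missing integer


def isna(values):
    """Flag the positions holding the sentinel."""
    return [v == INT_NA for v in values]


def nansum(values):
    """Sum that skips missing entries."""
    return sum(v for v in values if v != INT_NA)


def fits_int16(value):
    """True if value is representable as a signed 16-bit integer."""
    return -2**15 <= value < 2**15


data = [3, INT_NA, 4]
print(isna(data), nansum(data))

# The sentinel does not fit in a narrower integer type, which is why
# int16/int32 data would need promotion under this scheme.
print(fits_int16(INT_NA))
```

The cost of this design is that one legitimate integer value becomes unusable as data, and every reduction has to check for it.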

pd.eval will allow numexpr evaluation of expressions; this can (someday) be extended to evaluation by other
engines, e.g. blaze or a numpy replacement.

Not sure what goals you have for a numpy replacement though.

@jreback
Contributor Author

jreback commented Jul 23, 2013

related #816

@wesm
Member

wesm commented Aug 16, 2013

Just took a more serious look through this PR and if you're all satisfied might as well pull the trigger on merging. Any API incompatibilities introduced (beyond the obvious, isinstance(obj, np.ndarray) where obj is a Series) can be fixed after merging.

jreback and others added 8 commits August 16, 2013 15:09
         axis creation routines now commonized under _setup_axes

ENH: more methods added

PERF: was missing multi-take opportunity in reindex
      was incorrectly passing to com._count_not_none
      doing an extra copy in certain cases

BUG: reindex with called with no args will by default return a copy (fixed bug)

ENH: moved filter and added axis arg
     moved where,mask,align

TST: make reindex benchmarks longer

CLN: fixed up names for creation in panelnd.py

DOC: minor release notes changes

ENH: initial commite - attempt to reengineer series to inherit from NDFrame rather than ndarray

ENH: fixed SparseDataFrame constructor with scalar values
     reindex still broken
     removed refs to SparseSeries in internals (not all SparseArray)

TST: more fixed

TST: more fixes

TST: more tests

TST: fixed up indexing

TST: more sparse fixes

BUG: reindex with single block manager now correctly fills with a method

BUG: fixed pickle I think

BUG: fixed set in internals for sparse

fixed boolean indexing iin series I thnk

BUG: fixed printing and inclusion of sparse series in DataFrame (now keeps its type),
     converted to dense for printing

CLN: took out SeriesIndex, now uses regular indexing properties

BUG: fixed copy (was using series method, bad)
     block filling for datetimes now ok (was filling with NaT, not iNaT)
     NaN in boolean ops now correctly handled (was not working for Datetimes)

BUG: fixed set_item in SparseFrame if only a scalar is passed (needed index)

BUG: sparse join fixed, did I break something in merge?

BUG: consolidated block slicing under _slice

BUG: added Series to santize_array
     all numeric methods now call get_values() rather than values

ENH: partial SparsePanel support

ENH: reverted SparsePanel changes, save for later
     fixed up xs in SparseFrame

BUG: SparsePanel was using an inherited as_matrix(), bad

TST: fixed shift
     default in class creation wrapper is to not pass existing fillers
     added sanitize column for generalitiy
     fixed count (in series)

CLN: modify core/expressions to use get_values()
     remove methods from SparseFrame (and use inherited):
       combine_first,icol,as_matrix,get_dtype_counts
     bug fix in core/internals/get_dtype_counts

CLN: use _values_from_object instead of direct call to get_values()

BUG: fixed set_value semantics, as it could possibily change the index

BUG: fixed tseries/period indexing
     fixed some bugs showing up in 32-bit (in nanops)

BUG: fix incorrect exception raised in indexing (on 32-bit)

BUG: fixed get_merge_keys (add Series to ndarray testing)

BUG: fixed pivot table maybe???x
     core/internals/_ref_locs will now set indexer if ref_items==items

TST: apply_reduce in tests/test_frame still failing

BUG: fixed getitem_boolean_object finally I think (was issue in set_value in Series)
BUG: fixed putmasking mess in Series, now in core/internals

BUG: more fixes

BUG: fixed core/internals/replace as choking on input

BUG: refixed groupby

BUG: fix test_where in series

BUG: fixed reindex on a sparse block (was not taking correctly)

BUG: fixed sparse filling!!!!!

BUG: fixed pivot, need to define __hash__ to raise TypeError in NDFrame

BUG: downcast argument not in SparseBlock or sparse/frame.py for fillna

BUG: fix apply_reduce?

BUG: fixes in reduce.pyx to deal with reconstrucing a Series argument to the function
     if needed

BUG: reducer now produces a Series with its index (to the called function)
     ols converts to_dense to avoid some issues

ENH: fixed core/frame/apply to accept reduce argument (default True),

     to allow turning off the reduction attempt (to preserver the column character)
     if say self.values would change it

BUG: finally fixed reducer?

BUG: reduce on frame bug (showing in py3)

BUG: ols not working with sparse

TST: stats.tests.test_ols/test_wls is not testing for the correct version

     of statsmodels (fails on 32-bit)

     PTF

TST: make sure to skip the test_wls if our version isn't enough

PERF: some perf enhancements

BUG: fix sparse/array/make_sparse to take objects and extract the arrays

PERF: series construction now much faster

PERF: improvements in core/internals

MERGE: updated to master and merged in

MERGE: more merging fixes

PERF: fixed null tests to be MUCH faster

PERF: improvements in series construction via from_array

PERF: merge improvements by using _has_sparse in bms

PERF: some improvements

PERF: more internals optimizations

CLN: Index now subclassed off of PandasObject

BUG: fixed inheritance for core/index.py (Index), solves unicode issues

BUG: some merge errors in sparse

VB: modernize the sparse vb suite

BUG: fixed merging by single item (was broken for sparse for some reason)
     names not propagating in Series constructor on _slice

BUG: add name back to series constructor

ENH: pickle compatibility for Series/SparseSeries prior to 0.12!

ENH: added pickle_compat to common/load

BUG: in core/series on fastpath and index is actually changed

     (e.g. it's actually a datelike index, but is of type object),
     need to set the axis in the BlockManager

BUG: _getitem__bool only is active for Index/Int64Index (issues with DatetimeIndex/PeriodIndex)

     so default to having it call (slower) __getitem__

COMPAT: py3 compat fixes

TST: recover pickles in a particular order or names

MERGE: fixup merging with 0.11.0 final

BUG: set _subtyp in sparse (use main type of object)

BUG: fixed merging on need to reindex sparse

BUG: fixed consolidation issue prior to merge

BUG: construction of a series with another series odd bug

BUG: fix series constructor when passed a dtype (and no copy)

BUG: fixed sparse slicing via blocks (don't use a sparse block when slicing)

BUG: fixed remaining sparse issue (SparseDataFrame was converting SparseArray incorrectly)

BUG: dtypes in groupby nth fixed (converting on aggregation item_by_item)

BUG: partial fix on groupby?

BUG: restored groupby back to master (SeriesGrouper)

BUG: more fixes on groupby

BUG: fixed all groupbys!

BUG: get_median in core/nanops.py complaining

PERF: made constructions of SparseFrame have less redundant steps

PERF: minor series perf improvement

TST: trying to fix how_lambda in tseries/resample

     PTF

PERF: addtl groupby multi_python perf improvements

PERF: speeds up for Series.__getitem__

PERF: some perf on groupby.....

      added _block, _values in SingleBlockManager

PERF: more reducer improvements

BUG: fixed SeriesBinGrouper hopefully

BUG: tseries/index.py was missing __str__ = __repr__
BUG: groupby filter that returns a series/ndarray truth testing

BUG: refixed GH3880, prop name index

BUG: not handling sparse block deletes in internals/_delete_from_block

BUG: refix generic/truncate

TST: refixed generic/replace (bug in core/internals/putmask) revealed as well

TST: fix sparse_array to put up correct type exceptions rather than Exception

CLN: cleanups

BUG: fix stata dtype inference (error in core/internals/astype)

BUG: fix ujson handling of new series object

BUG: fixed scalar coercion (e.g. calling float(series)) to work
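As a rough illustration of the scalar coercion mentioned above, a container can support `float(obj)` by defining `__float__`. This is a sketch under assumptions: `ToySeries` is a hypothetical class, and restricting coercion to length-1 containers is an assumed rule chosen for the example rather than a statement of what the PR implements.

```python
class ToySeries:
    """Hypothetical one-dimensional container, not the pandas Series."""

    def __init__(self, values):
        self.values = list(values)

    def __float__(self):
        # Assumed rule: only a single-element container has an obvious scalar value.
        if len(self.values) != 1:
            raise TypeError("only length-1 containers can be converted to a scalar")
        return float(self.values[0])


single = float(ToySeries([3]))
```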

BUG: fixed astyping with and w/o copy

ENH: added _propogate_attributes method to generic.py to allow
     subclasses to automatically propagate things like name
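The idea of letting subclasses declare which attributes should follow an object through operations can be sketched as below. All names here (`ToyBase`, `_carry_over`, `_carry_attributes`, `ToyNamed`) are hypothetical and chosen for illustration; they are not the names or signatures used in generic.py.

```python
class ToyBase:
    # Hypothetical registry of attribute names to carry over to derived objects.
    _carry_over = ()

    def _carry_attributes(self, other):
        """Copy each registered attribute from self onto other and return other."""
        for name in self._carry_over:
            setattr(other, name, getattr(self, name, None))
        return other


class ToyNamed(ToyBase):
    # The subclass only declares what to carry; the base class does the copying.
    _carry_over = ("name",)

    def __init__(self, values, name=None):
        self.values = list(values)
        self.name = name

    def head(self, n):
        # Build the result, then let the base class carry metadata across.
        result = ToyNamed(self.values[:n])
        return self._carry_attributes(result)


first_two = ToyNamed([1, 2, 3], name="x").head(2)
```

With this layout, a new subclass gets attribute propagation by listing names in one place instead of repeating the copy logic in every method.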

DOC: added v0.13.0.txt feature descriptions

CLN: pep8ish cleanups

BUG: fix 32-bit,numpy 1.6.1 issue with datetimes in astype_nansafe

PERF: speedup for groupby by passing a SNDArray (Series like ndarray) object to evaluation functions
      if allowed, can avoid Series creation overhead

BUG: issue with older numpy (1.6.1) in SeriesGrouper, fallback to passing a Series
     rather than SNDArray

DOC: release notes & doc updates

DOC: fixup doc build failures

DOC: change passing of direct ndarrays to cython doc functions (enhancedperformance.rst)
…cache based on

      changes (GH4080)

BUG: Series not updating properly with object dtype (GH3217)

BUG: (GH3386) fillna same issue as (GH4080), not updating cacher
CLN: cleaned up internal block action routines, now always return a list of blocks
Instead of the `is_series`, `is_generic`, etc methods, can use the ABC*
methods to check for certain pandas types. This is useful because it
helps decrease issues with circular imports (since they can be easily
imported from core/common).

The checks take advantage of the `_typ` and `_subtyp` attributes
(e.g. `DataFrame` now has a `_typ` of `"dataframe"`). See the code for
specifics.
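A minimal sketch of how such attribute-based type checks could work, assuming a metaclass that overrides `__instancecheck__`. The factory `create_abc_type` and the `Toy*`/`ABC*Like` names are hypothetical and are not taken from core/common; only the `_typ` attribute and its `"dataframe"` value come from the description above (the `"series"` value is an assumption).

```python
def create_abc_type(name, attr, allowed):
    """Build a class whose isinstance() check inspects `attr` on the instance."""

    class _Meta(type):
        def __instancecheck__(cls, inst):
            # No import of the concrete class is needed, only an attribute lookup.
            return getattr(inst, attr, None) in allowed

    return _Meta(name, (object,), {})


ABCSeriesLike = create_abc_type("ABCSeriesLike", "_typ", ("series",))
ABCFrameLike = create_abc_type("ABCFrameLike", "_typ", ("dataframe",))


class ToySeries:
    _typ = "series"  # assumed value, for illustration


class ToyFrame:
    _typ = "dataframe"


s = ToySeries()
s_is_series = isinstance(s, ABCSeriesLike)
s_is_frame = isinstance(s, ABCFrameLike)
```

Because the check only reads an attribute, a module can test for a Series-like or DataFrame-like object without importing the module that defines it, which is how this style of check sidesteps circular imports.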

PERF: register _cacher as an internal name

BUG: fixed abstract base class type checking bug in py2.6

DOC: updates for abc type checking

PERF: small perf gains in _get_item_cache
TST/BUG: test/bugfix for GH4463

BUG: fix core/internals/setitem to work for boolean types (weird numpy bug!)

BUG: partial frame setting with dtype change (GH4204)

BUG: Indexing with dtype conversions fixed GH4463 (int->float), GH4204(boolean->float)

BUG: provide better ndarray compat

CLN: removed some duped methods

MERGE: fix an issue cropping up on the rebase
TST: additional test for series dtype conversion with where (and fix!)

DOC: update docstrings in to_json/to_hdf/pd.read_hdf

BLD: ujson rebase issue fixed
@jreback
Contributor Author

jreback commented Aug 16, 2013

@wesm .... ok... bombs away shortly

@jreback
Contributor Author

jreback commented Aug 16, 2013

thanks to @jtratner, @Komnomnomnom, @cpcloud, @jseabold, @wesm for assistance with various aspects of this PR! squash those bugs!

@jtratner
Contributor

@wesm I'm thinking of how to make the instance check work. Maybe we could
create a filler class that stops the ndarray from controlling class
creation, so that Series can still subclass ndarray but not be an ndarray
(the checks are controlled by the meta class of the class you compare
against instead of the meta class of the object itself) working on a PR
right now.

@jreback
Contributor Author

jreback commented Aug 19, 2013

@jtratner isinstance checking is not necessary
all of the internal code is changed already

what is the purpose of this? some sort of back compat?

@jtratner
Contributor

@wesm made the comment. It's actually impossible anyways.
