ENH: sortlevel, docs, vbench for #720 #725

adamklein · 2012-01-31T21:53:06Z

wesm · 2012-02-01T04:26:00Z

Boo. it's actually slower now. after:


In [12]: a = np.repeat(np.arange(100), 1000)

In [13]: b = np.tile(np.arange(1000), 100)

In [14]: mindex = MultiIndex.from_arrays([a, b])

In [15]: mindex = mindex.take(np.random.permutation(np.arange(100000)))

In [16]: timeit mindex.sortlevel(0)
10 loops, best of 3: 44 ms per loop

before:


In [1]: a = np.repeat(np.arange(100), 1000)

In [2]: b = np.tile(np.arange(1000), 100)

In [3]: mindex = MultiIndex.from_arrays([a, b])

In [4]: mindex = mindex.take(np.random.permutation(np.arange(100000)))

In [5]: timeit mindex.sortlevel(0)
10 loops, best of 3: 23.9 ms per loop

here's the line timing:

before (faster):

Timer unit: 1e-06 s

File: pandas/core/index.py
Function: sortlevel at line 1507
Total time: 0.478246 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1507                                               def sortlevel(self, level=0, ascending=True):
  1508                                                   """
  1509                                                   Sort MultiIndex lexicographically by requested level
  1510                                           
  1511                                                   Parameters
  1512                                                   ----------
  1513                                                   level : int or str, default 0
  1514                                                       If a string is given, must be a name of the level
  1515                                                   ascending : boolean, default True
  1516                                                       False to sort in descending order
  1517                                           
  1518                                                   Returns
  1519                                                   -------
  1520                                                   sorted_index : MultiIndex
  1521                                                   """
  1522                                                   # TODO: check if lexsorted when level=0
  1523                                           
  1524        10           81      8.1      0.0          labels = list(self.labels)
  1525        10          319     31.9      0.1          level = self._get_level_number(level)
  1526        10           34      3.4      0.0          primary = labels.pop(level)
  1527                                           
  1528                                                   # Lexsort starts from END
  1529        10       156646  15664.6     32.8          indexer = np.lexsort(tuple(labels[::-1]) + (primary,))
  1530                                           
  1531        10           22      2.2      0.0          if not ascending:
  1532                                                       indexer = indexer[::-1]
  1533                                           
  1534        30         4524    150.8      0.9          new_labels = [lab.take(indexer) for lab in self.labels]
  1535        10           14      1.4      0.0          new_index = MultiIndex(levels=self.levels, labels=new_labels,
  1536        10       316586  31658.6     66.2                                 names=self.names, sortorder=level)
  1537                                           
  1538        10           20      2.0      0.0          return new_index, indexer

after (slower)

In [5]: %lprun -f MultiIndex.sortlevel for _ in xrange(10): mindex.sortlevel(0)
Timer unit: 1e-06 s

File: pandas/core/index.py
Function: sortlevel at line 1507
Total time: 0.64143 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1507                                               def sortlevel(self, level=0, ascending=True):
  1508                                                   """
  1509                                                   Sort MultiIndex at the requested level. The result will respect the
  1510                                                   original ordering of the associated factor at that level.
  1511                                           
  1512                                                   Parameters
  1513                                                   ----------
  1514                                                   level : int or str, default 0
  1515                                                       If a string is given, must be a name of the level
  1516                                                   ascending : boolean, default True
  1517                                                       False to sort in descending order
  1518                                           
  1519                                                   Returns
  1520                                                   -------
  1521                                                   sorted_index : MultiIndex
  1522                                                   """
  1523        10          192     19.2      0.0          from pandas.core.frame import _indexer_from_factorized
  1524                                           
  1525        10           53      5.3      0.0          labels = list(self.labels)
  1526                                           
  1527        10          283     28.3      0.0          level = self._get_level_number(level)
  1528        10           33      3.3      0.0          primary = labels.pop(level)
  1529        10           36      3.6      0.0          indexer = _indexer_from_factorized((primary,) + tuple(labels),
  1530        10       316850  31685.0     49.4                                             self.levshape)
  1531        10           27      2.7      0.0          if not ascending:
  1532                                                       indexer = indexer[::-1]
  1533                                           
  1534        30         8472    282.4      1.3          new_labels = [lab.take(indexer) for lab in self.labels]
  1535                                           
  1536        10           23      2.3      0.0          new_index = MultiIndex(levels=self.levels, labels=new_labels,
  1537        10       315437  31543.7     49.2                                 names=self.names, sortorder=level)
  1538                                           
  1539        10           24      2.4      0.0          return new_index, indexer

hmm. most perplexing. deeper down the rabbit hole:


In [7]: %lprun -f _indexer_from_factorized for _ in xrange(10): mindex.sortlevel(0)
Timer unit: 1e-06 s

File: pandas/core/frame.py
Function: _indexer_from_factorized at line 4058
Total time: 0.313475 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4058                                           def _indexer_from_factorized(labels, shape):
  4059        10           74      7.4      0.0      from pandas.core.groupby import get_group_index, _compress_group_index
  4060                                           
  4061        10        13780   1378.0      4.4      group_index = get_group_index(labels, shape)
  4062        10       281747  28174.7     89.9      comp_ids, obs_ids = _compress_group_index(group_index)
  4063        10           39      3.9      0.0      max_group = len(obs_ids)
  4064        10        17818   1781.8      5.7      indexer, _ = lib.groupsort_indexer(comp_ids.astype('i4'), max_group)
  4065                                           
  4066        10           17      1.7      0.0      return indexer

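I think the answer here is not to compress the group labels in the sortlevel case: _compress_group_index accounts for 89.9% of the time in _indexer_from_factorized, and a MultiIndex's labels are "more likely" to be dense already.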
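
For illustration, here is a minimal NumPy sketch of the two strategies. It is not the pandas implementation: `np.ravel_multi_index`, `np.unique` and a stable `np.argsort` stand in for `get_group_index`, `_compress_group_index` and `lib.groupsort_indexer`, and the function names are made up for this example.

```python
import numpy as np


def group_index(labels, shape):
    """Combine per-level integer labels into one flat key per row."""
    # Row-major flattening: key = l0 * n1 + l1 for two levels.
    return np.ravel_multi_index(tuple(labels), shape)


def sort_indexer(labels, shape, compress=True):
    """Return an indexer that orders rows by their combined key."""
    keys = group_index(labels, shape)
    if compress:
        # Relabel the observed keys as 0..k-1. This is an extra pass over
        # the data; it only pays off when few of the possible keys occur.
        _, keys = np.unique(keys, return_inverse=True)
    return np.argsort(keys, kind="stable")


# Same data as the benchmark above: every (a, b) pair occurs exactly once,
# so the keys are already dense.
a = np.repeat(np.arange(100), 1000)
b = np.tile(np.arange(1000), 100)
perm = np.random.default_rng(0).permutation(len(a))
a, b = a[perm], b[perm]
shape = (100, 1000)

dense = sort_indexer((a, b), shape, compress=False)
compressed = sort_indexer((a, b), shape, compress=True)
```

Both calls produce the same ordering here. Without compression the keys can take any of `np.prod(shape)` values, so a counting-style sort would need its group count from the shape rather than from the number of observed keys.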

secondly, i'm questioning again my decision to store the tuples in the MultiIndex. The take operation on the values themselves is not all that fast:

In [14]: timeit mindex.values.take(indexer)
100 loops, best of 3: 4 ms per loop

but faster than the multiindex construction

In [16]: timeit MultiIndex(levels=mindex.levels, labels=mindex.labels)
100 loops, best of 3: 9.34 ms per loop

thus, can you please make the following modifications:

1. Add an option to the indexer method (_indexer_from_factorized) to skip compression. Compress in the DataFrame.sort_index case but not in this case. Note that without compression you'll have to compute the max_group value from the SHAPE (np.prod)!
2. Add an alternate private constructor for MultiIndex: something that takes an 'O' array of tuples, labels, and levels. It should look like

index = values.view(MultiIndex)
index.levels = levels
index.labels = labels
index.names = names
return index

then you can call

new_tuples = self.values.take(indexer)

inside sortlevel
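
A rough sketch of that view-based construction, using a hypothetical `TupleIndex` class and `_from_parts` method rather than the real MultiIndex (both names are assumptions made for this example). The point is that `view` reuses the already-taken object array instead of rebuilding the tuples from levels and labels.

```python
import numpy as np


class TupleIndex(np.ndarray):
    """Hypothetical stand-in for MultiIndex: an object array of tuples."""

    @classmethod
    def _from_parts(cls, values, levels, labels, names=None):
        # view() reinterprets the existing array as the subclass without
        # copying it or re-deriving the tuples.
        index = values.view(cls)
        index.levels = levels
        index.labels = labels
        index.names = names
        return index


tuples = [(0, "a"), (0, "b"), (1, "a")]
values = np.empty(len(tuples), dtype=object)
for i, tup in enumerate(tuples):
    values[i] = tup  # assign element-wise so each slot holds one tuple

levels = [np.array([0, 1]), np.array(["a", "b"], dtype=object)]
labels = [np.array([0, 0, 1]), np.array([0, 1, 0])]

# Reorder the tuples and the labels with the same indexer, then wrap them.
indexer = np.array([2, 0, 1])
new_values = values.take(indexer)
new_labels = [lab.take(indexer) for lab in labels]
index = TupleIndex._from_parts(new_values, levels, new_labels,
                               names=["first", "second"])
```

Because no validation or tuple reconstruction happens, a constructor like this is only safe to call from code that already guarantees the values, labels and levels agree.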

run the above benchmarks to ensure faster and then i'll merge this

adamklein · 2012-02-01T13:36:23Z

Got it. Awesome analysis.

wesm · 2012-02-01T14:29:38Z

Did the vbench catch this btw? If not maybe use my example for the vbench instead

adamklein · 2012-02-01T14:36:24Z

Something broke when I ran the vbench before; the results were empty. Will make sure the vbench does catch it: if not with the existing test, then with the one above.

adamklein · 2012-02-01T15:40:30Z

I thought I was going crazy, but I'm not: test_sortlevel in test_index fails only sometimes, probably because random.shuffle(tuples) produces a different permutation on each run. I'll isolate the failure, but it is not related to the changes you outlined above.

wesm · 2012-02-01T15:54:43Z

before or after the changes above?

adamklein · 2012-02-01T15:55:40Z

Before. I think it's because

indexer = _indexer_from_factorized((primary,) + tuple(labels),
                                   self.levshape, compress=False)

when the labels are reordered (primary first), we don't reorder levshape to match

wesm · 2012-02-01T15:57:01Z

oh sorry i meant before those commits.

yes that is definitely the problem. just have to do the same reordering song and dance with self.levshape as self.labels
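
A small NumPy sketch of that reordering, assuming row-major key arithmetic (`np.ravel_multi_index` stands in for the pandas group-index helper, and `level_sort_indexer` is a name invented for this example):

```python
import numpy as np


def level_sort_indexer(labels, shape, level=0):
    """Sort rows by `level` first, then by the remaining levels in order."""
    labels = list(labels)
    shape = list(shape)
    # Move the requested level to the front of BOTH lists.
    primary = labels.pop(level)
    primary_size = shape.pop(level)
    keys = np.ravel_multi_index((primary,) + tuple(labels),
                                (primary_size,) + tuple(shape))
    return np.argsort(keys, kind="stable")


# Level 0 has 5 distinct values, level 1 has 2; every pair occurs once.
order = np.random.default_rng(0).permutation(10)
l0 = np.repeat(np.arange(5), 2)[order]
l1 = np.tile(np.arange(2), 5)[order]
shape = (5, 2)

indexer = level_sort_indexer([l0, l1], shape, level=1)

# Keys built from the reordered labels but the un-reordered shape: the
# stride applied to l1 is 2, yet l0 runs up to 4, so distinct rows collide.
wrong_keys = l1 * shape[1] + l0
```

Whether the mismatch produces a wrong ordering depends on which label values actually occur, which fits a test that fails only for some shuffles.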

adamklein · 2012-02-01T15:58:07Z

Cool. B/c of randomness of test, it wasn't always caught. Wild goose chase. Should be closing this out soon.

wesm · 2012-02-01T15:59:06Z

Yeah-- if the index being used were made bigger it would probably fail every time

adamklein · 2012-02-01T16:08:33Z

still need to make sure vbench catches improvement, will let you know in a few mins

ENH: sortlevel, docs, vbench. close #719 close #720

adamklein added 4 commits January 31, 2012 12:28

BUG: closes #719, check for sortedness of multiindex in to_panel

542b8ff

BUG: closes #719, sortedness check in to_panel

263b15c

BUG: closes #719, sortedness check in to_panel, fixed

30d8d6a

ENH: closes #720, clarification on docs, vbench for sortlevel

b6ee864

ENH: re #720, added alternative private constructor

ce3c4fa

ENH: sortlevel fixes per comments for #720

6194afe

adamklein added 2 commits February 1, 2012 11:19

TST: changed vbench for #720

8b0dd91

fixed pep8 spacing issue

2307902

wesm added a commit that referenced this pull request Feb 1, 2012

Merge pull request #725 from adamklein/IS720

3fd516a

ENH: sortlevel, docs, vbench. close #719 close #720

wesm merged commit 3fd516a into pandas-dev:master Feb 1, 2012
