ENH: sortlevel, docs, vbench for #720 #725

Merged
merged 8 commits into from
Feb 1, 2012

adamklein
Contributor

@wesm
Member

wesm commented Feb 1, 2012

Boo. it's actually slower now. after:


In [12]: a = np.repeat(np.arange(100), 1000)

In [13]: b = np.tile(np.arange(1000), 100)

In [14]: mindex = MultiIndex.from_arrays([a, b])

In [15]: mindex = mindex.take(np.random.permutation(np.arange(100000)))

In [16]: timeit mindex.sortlevel(0)
10 loops, best of 3: 44 ms per loop

before:


In [1]: a = np.repeat(np.arange(100), 1000)

In [2]: b = np.tile(np.arange(1000), 100)

In [3]: mindex = MultiIndex.from_arrays([a, b])

In [4]: mindex = mindex.take(np.random.permutation(np.arange(100000)))

In [5]: timeit mindex.sortlevel(0)
10 loops, best of 3: 23.9 ms per loop

here's the line timing:

before (faster):

Timer unit: 1e-06 s

File: pandas/core/index.py
Function: sortlevel at line 1507
Total time: 0.478246 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1507                                               def sortlevel(self, level=0, ascending=True):
  1508                                                   """
  1509                                                   Sort MultiIndex lexicographically by requested level
  1510                                           
  1511                                                   Parameters
  1512                                                   ----------
  1513                                                   level : int or str, default 0
  1514                                                       If a string is given, must be a name of the level
  1515                                                   ascending : boolean, default True
  1516                                                       False to sort in descending order
  1517                                           
  1518                                                   Returns
  1519                                                   -------
  1520                                                   sorted_index : MultiIndex
  1521                                                   """
  1522                                                   # TODO: check if lexsorted when level=0
  1523                                           
  1524        10           81      8.1      0.0          labels = list(self.labels)
  1525        10          319     31.9      0.1          level = self._get_level_number(level)
  1526        10           34      3.4      0.0          primary = labels.pop(level)
  1527                                           
  1528                                                   # Lexsort starts from END
  1529        10       156646  15664.6     32.8          indexer = np.lexsort(tuple(labels[::-1]) + (primary,))
  1530                                           
  1531        10           22      2.2      0.0          if not ascending:
  1532                                                       indexer = indexer[::-1]
  1533                                           
  1534        30         4524    150.8      0.9          new_labels = [lab.take(indexer) for lab in self.labels]
  1535        10           14      1.4      0.0          new_index = MultiIndex(levels=self.levels, labels=new_labels,
  1536        10       316586  31658.6     66.2                                 names=self.names, sortorder=level)
  1537                                           
  1538        10           20      2.0      0.0          return new_index, indexer
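
For readers unfamiliar with the "Lexsort starts from END" comment in the listing above: `np.lexsort` treats the last key in the tuple as the most significant one, which is why the primary level is appended last. A minimal sketch on toy label arrays (the data here is made up for illustration, not taken from the benchmark):

```python
import numpy as np

# Two integer label arrays, as a MultiIndex would hold them (toy data).
primary = np.array([1, 0, 1, 0])
secondary = np.array([0, 1, 1, 0])

# np.lexsort sorts by the LAST key first, so the primary level goes
# at the end of the tuple and the tie-breaker goes before it.
indexer = np.lexsort((secondary, primary))

# Rows reordered by the indexer come out in lexicographic order.
sorted_pairs = list(zip(primary[indexer].tolist(), secondary[indexer].tolist()))
print(indexer.tolist())
print(sorted_pairs)
```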

after (slower)

In [5]: %lprun -f MultiIndex.sortlevel for _ in xrange(10): mindex.sortlevel(0)
Timer unit: 1e-06 s

File: pandas/core/index.py
Function: sortlevel at line 1507
Total time: 0.64143 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1507                                               def sortlevel(self, level=0, ascending=True):
  1508                                                   """
  1509                                                   Sort MultiIndex at the requested level. The result will respect the
  1510                                                   original ordering of the associated factor at that level.
  1511                                           
  1512                                                   Parameters
  1513                                                   ----------
  1514                                                   level : int or str, default 0
  1515                                                       If a string is given, must be a name of the level
  1516                                                   ascending : boolean, default True
  1517                                                       False to sort in descending order
  1518                                           
  1519                                                   Returns
  1520                                                   -------
  1521                                                   sorted_index : MultiIndex
  1522                                                   """
  1523        10          192     19.2      0.0          from pandas.core.frame import _indexer_from_factorized
  1524                                           
  1525        10           53      5.3      0.0          labels = list(self.labels)
  1526                                           
  1527        10          283     28.3      0.0          level = self._get_level_number(level)
  1528        10           33      3.3      0.0          primary = labels.pop(level)
  1529        10           36      3.6      0.0          indexer = _indexer_from_factorized((primary,) + tuple(labels),
  1530        10       316850  31685.0     49.4                                             self.levshape)
  1531        10           27      2.7      0.0          if not ascending:
  1532                                                       indexer = indexer[::-1]
  1533                                           
  1534        30         8472    282.4      1.3          new_labels = [lab.take(indexer) for lab in self.labels]
  1535                                           
  1536        10           23      2.3      0.0          new_index = MultiIndex(levels=self.levels, labels=new_labels,
  1537        10       315437  31543.7     49.2                                 names=self.names, sortorder=level)
  1538                                           
  1539        10           24      2.4      0.0          return new_index, indexer

hmm. most perplexing. deeper down the rabbit hole:


In [7]: %lprun -f _indexer_from_factorized for _ in xrange(10): mindex.sortlevel(0)
Timer unit: 1e-06 s

File: pandas/core/frame.py
Function: _indexer_from_factorized at line 4058
Total time: 0.313475 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  4058                                           def _indexer_from_factorized(labels, shape):
  4059        10           74      7.4      0.0      from pandas.core.groupby import get_group_index, _compress_group_index
  4060                                           
  4061        10        13780   1378.0      4.4      group_index = get_group_index(labels, shape)
  4062        10       281747  28174.7     89.9      comp_ids, obs_ids = _compress_group_index(group_index)
  4063        10           39      3.9      0.0      max_group = len(obs_ids)
  4064        10        17818   1781.8      5.7      indexer, _ = lib.groupsort_indexer(comp_ids.astype('i4'), max_group)
  4065                                           
  4066        10           17      1.7      0.0      return indexer

i think the answer here is not to compress the group labels in the sortlevel case because they're "more likely" to be dense
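
A rough sketch of what skipping the compression step could look like, assuming the labels are dense enough that the flat key range (the product of the level sizes) is affordable. `indexer_from_labels` is a hypothetical name, `np.ravel_multi_index` stands in for `get_group_index`, and the pure-Python counting sort stands in for the compiled `lib.groupsort_indexer`, whose internals are not shown in this thread:

```python
import numpy as np

def indexer_from_labels(labels, shape):
    """Hypothetical uncompressed path: sort rows by a flat group key."""
    # Combine the per-level integer labels into one flat key per row.
    group_index = np.ravel_multi_index(tuple(labels), tuple(shape))
    # Without compression the number of possible groups comes from the shape.
    max_group = int(np.prod(shape))
    # Stable counting sort over the dense key range (slow in pure Python,
    # shown only to make the role of max_group visible).
    counts = np.bincount(group_index, minlength=max_group)
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    indexer = np.empty(len(group_index), dtype=np.intp)
    for row, key in enumerate(group_index):
        indexer[starts[key]] = row
        starts[key] += 1
    return indexer

# Small shuffled example: 3 outer groups x 4 inner groups.
rng = np.random.RandomState(0)
perm = rng.permutation(12)
a = np.repeat(np.arange(3), 4)[perm]
b = np.tile(np.arange(4), 3)[perm]
dense_indexer = indexer_from_labels((a, b), (3, 4))
```

The trade-off is memory: the counting arrays have `max_group` entries, so this only pays off when most label combinations actually occur.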

secondly, i'm questioning again my decision to store the tuples in the MultiIndex. The take operation on the values themselves is not all that fast:

In [14]: timeit mindex.values.take(indexer)
100 loops, best of 3: 4 ms per loop

but faster than the multiindex construction

In [16]: timeit MultiIndex(levels=mindex.levels, labels=mindex.labels)
100 loops, best of 3: 9.34 ms per loop

thus, can you please make the following modifications

  • add an option to the indexer method to skip compression. Keep compressing in the DataFrame.sort_index case, but not in this case. Note that you'll have to compute the max_group value from the SHAPE (np.prod)!
  • add alternate private constructor for MultiIndex-- something that takes an 'O' array of tuples, labels, and levels. it should look like
index = values.view(MultiIndex)
index.levels = levels
index.labels = labels
index.names = names
return index

then you can call

new_tuples = self.values.take(indexer)

inside sortlevel
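
The view-based constructor idea can be sketched with a toy ndarray subclass. `TupleIndex` and `_from_parts` are hypothetical names used only for illustration; they are not the pandas API, and the real MultiIndex may need more bookkeeping than this shows:

```python
import numpy as np

class TupleIndex(np.ndarray):
    """Toy ndarray subclass standing in for MultiIndex (hypothetical)."""

    @classmethod
    def _from_parts(cls, values, levels, labels, names=None):
        # Reinterpret the existing object array as the subclass;
        # no tuples are rebuilt and no data is copied.
        index = values.view(cls)
        index.levels = levels
        index.labels = labels
        index.names = names
        return index

# Object array of tuples, filled element by element so NumPy does not
# try to turn the tuples into a 2-D array.
tuples = [(0, "a"), (1, "b"), (0, "b")]
values = np.empty(len(tuples), dtype=object)
for i, tup in enumerate(tuples):
    values[i] = tup

levels = [np.array([0, 1]), np.array(["a", "b"])]
labels = [np.array([0, 1, 0]), np.array([0, 1, 1])]

# Sort once, then reuse the indexer for both the tuples and the labels.
indexer = np.lexsort((labels[1], labels[0]))
new_tuples = values.take(indexer)
new_labels = [lab.take(indexer) for lab in labels]
index = TupleIndex._from_parts(new_tuples, levels, new_labels, names=["x", "y"])
```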

run the above benchmarks to ensure faster and then i'll merge this

@adamklein
Contributor Author

Got it. Awesome analysis.

@wesm
Member

wesm commented Feb 1, 2012

Did the vbench catch this btw? If not maybe use my example for the vbench instead

@adamklein
Contributor Author

Something broke in the vbench run before, and the results were empty. Will make sure vbench does catch it: if not with an existing test, then with the one above.

@adamklein
Contributor Author

I thought I was going crazy, but i'm not: there is a failure in test_sortlevel in test_index, that fails only sometimes, probably due to the random.shuffle(tuples) producing different permutations. I'll isolate the failure; but this is not related to changes you outlined above.

@wesm
Member

wesm commented Feb 1, 2012

before or after the changes above?

@adamklein
Contributor Author

Before. I think it's because

indexer = _indexer_from_factorized((primary,) + tuple(labels),
                                   self.levshape, compress=False)

if we have a permutation of the labels, we don't permute levshape accordingly

@wesm
Member

wesm commented Feb 1, 2012

oh sorry i meant before those commits.

yes that is definitely the problem. just have to do the same reordering song and dance with self.levshape as self.labels
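
A minimal sketch of that reordering, assuming the flat key is built from the labels and the level sizes together. `sortlevel_indexer` is a hypothetical helper, `np.ravel_multi_index` stands in for `get_group_index`, and a stable `np.argsort` stands in for the group sort:

```python
import numpy as np

def sortlevel_indexer(labels, levshape, level):
    """Hypothetical helper: sort by `level` first, remaining levels after."""
    labels = list(labels)
    shape = list(levshape)
    # Move the requested level to the front of BOTH lists so each label
    # array stays paired with the size of its own level.
    primary = labels.pop(level)
    primary_size = shape.pop(level)
    keys = (primary,) + tuple(labels)
    dims = (primary_size,) + tuple(shape)
    group_index = np.ravel_multi_index(keys, dims)
    return np.argsort(group_index, kind="stable")

# Level 0 has 2 distinct values, level 1 has 5 (toy data).
lab0 = np.array([0, 1, 1, 0, 1, 0])
lab1 = np.array([4, 0, 2, 2, 4, 1])
indexer = sortlevel_indexer([lab0, lab1], (2, 5), level=1)
```

With the level sizes left in their original order, a label of 4 would be paired with a level size of 2, so the flat keys would no longer reflect the intended ordering. Whether a given test catches that depends on which labels the shuffle happens to produce.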

@adamklein
Contributor Author

Cool. B/c of randomness of test, it wasn't always caught. Wild goose chase. Should be closing this out soon.

@wesm
Member

wesm commented Feb 1, 2012

Yeah-- if the index being used were made bigger it would probably fail every time

@adamklein
Contributor Author

still need to make sure vbench catches improvement, will let you know in a few mins

wesm added a commit that referenced this pull request Feb 1, 2012
ENH: sortlevel, docs, vbench. close #719 close #720
@wesm wesm merged commit 3fd516a into pandas-dev:master Feb 1, 2012