Add set_index, reset_index and reorder_levels methods #1028

Merged: shoyer merged 17 commits into pydata:master from the multi-index_methods branch on Dec 27, 2016

Member

@benbovy benbovy commented Oct 3, 2016

Another item in #719.

I added tests and updated the docs, so this is ready for review.

Member

@shoyer shoyer left a comment

Very nice -- thanks for working on this!

--------
DataArray.reset_index
"""
indexers = utils.combine_pos_and_kw_args(indexers, kw_indexers,
Member

@shoyer shoyer Oct 3, 2016

I know I wrote this utility (and it is currently used by Dataset.reindex), but I regret it now! I feel like it's usually better to only have a single call signature, accepting either a dictionary or **kwargs, but not both.

This goes for the other new methods, too.

Member Author

I would be for using **kwargs unless we choose the alternative signatures suggested in the comments above and below.
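
To make the trade-off in this thread concrete, here is a minimal sketch contrasting a dual call signature with a keyword-only one. All function names below are hypothetical stand-ins for illustration; they are not the actual `utils.combine_pos_and_kw_args` implementation or the final xarray API.

```python
def combine_pos_and_kw(pos_arg, kw_args):
    """Dual signature: accept a dict OR keyword arguments.

    Hypothetical stand-in for the utility discussed above.
    """
    if pos_arg is None:
        return dict(kw_args)
    if kw_args:
        raise ValueError(
            "cannot specify both a dictionary and keyword arguments")
    return dict(pos_arg)


def set_index_dual(indexes=None, **kw_indexes):
    # Two ways to call the same method: more flexible, more to test.
    return combine_pos_and_kw(indexes, kw_indexes)


def set_index_kwargs_only(**indexes):
    # Single call signature: keyword arguments only.
    return dict(indexes)


# Both spellings resolve to the same mapping with the dual signature.
via_dict = set_index_dual({"record": ["level_0", "level_1"]})
via_kwargs = set_index_dual(record=["level_0", "level_1"])
```

With the keyword-only variant, passing a dictionary positionally fails immediately with a `TypeError`, so there is only one code path to document and test.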

append : bool, optional
If True, append the supplied indexers to the existing indexes.
Otherwise replace the existing indexes (default).
inplace : bool, optional
Member

@shoyer shoyer Oct 3, 2016

I'm not sure we need an inplace=True option. I guess it doesn't hurt. (Just more to test)


Parameters
----------
indexers : dict, optional
Member

@shoyer shoyer Oct 3, 2016

Instead of array.set_index({'record': ['level_0', 'level_1']}) or array.set_index(record=['level_0', 'level_1']), it might suffice to simply use array.set_index(['level_0', 'level_1']), without specifying which dimension these get added to. We can infer the dimension names from the variables.

The upside is that accepting a list instead of a dict is a little more succinct. The downside is that it's less explicit/self-documenting.

Member Author

Another downside of using a single argument is that it is less consistent with other parts of xarray's API that use keyword arguments (e.g., reindex, sel, etc.). IMHO the little additional verbosity is worth the gain in explicitness here.

That said, a single dim_levels or levels argument may be more consistent with the signature you suggest below for reset_index: reset_index(dim, levels=None,...). So in the end I don't really know which is better.

Member Author

After the discussion in #1030, I'm wondering whether it is common to rename a dimension after setting a new (multi-)index, so that either
set_index(['level_0', 'level_1'], name='new_dim_name') or
set_index(new_dim_name=['level_0', 'level_1']) would be a useful shortcut to set_index(...).rename(...).
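
The list-based alternative raised earlier in this thread relies on inferring the dimension from the variables themselves. A rough sketch of that inference, assuming a plain mapping from variable name to its dimension names (the `infer_dim` helper and the `variable_dims` mapping are hypothetical, not xarray internals):

```python
def infer_dim(level_names, variable_dims):
    """Infer the single dimension shared by the named 1-D variables.

    `variable_dims` maps a variable name to a tuple of dimension names.
    Hypothetical helper for illustration only.
    """
    dims = set()
    for name in level_names:
        var_dims = variable_dims[name]
        if len(var_dims) != 1:
            raise ValueError("%r is not one-dimensional" % name)
        dims.add(var_dims[0])
    if len(dims) != 1:
        raise ValueError(
            "variables do not share a single dimension: %s" % sorted(dims))
    return dims.pop()


variable_dims = {
    "level_0": ("record",),
    "level_1": ("record",),
    "time": ("t",),
}
inferred = infer_dim(["level_0", "level_1"], variable_dims)
```

The inference only works when every listed variable lives on the same dimension, which is part of why the keyword form is more explicit.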

else:
    return self._replace(coords=coords)

def reset_index(self, dim_levels=None, drop=False, inplace=False,
Member

Maybe switch the signature to use separate arguments, like reset_index(dim, levels=None, ...), instead of a dict/**kwargs. This would make the usual use clearer, e.g., array.reset_index('record') instead of array.reset_index(record=None).

Also, after #1017 (optional indexes), the ability to write array.reset_index('x', drop=True) for clearing an index could be nice to have.

Member Author

Makes sense!

if isinstance(current_index, pd.MultiIndex):
    names.extend(current_index.names)
    for i in range(current_index.nlevels):
        arrays.append(current_index.get_level_values(i))
Member

@shoyer shoyer Oct 3, 2016

It might be premature optimization to worry about this, but something to watch out for is that this will refactorize the existing MultiIndex. It might be better to simply reuse existing levels and labels, if possible, and directly pass those into the MultiIndex constructor.

Note that to create levels and labels directly from an array (if necessary) you can use Categorical as used in MultiIndex.from_arrays, e.g.,

cat = pd.Categorical(array, ordered=True)
levels = cat.categories
labels = cat.codes
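
A small sketch of the suggestion above: reuse the levels of an existing MultiIndex as-is and factorize only the newly added array. It assumes pandas 0.24 or later, where the MultiIndex constructor argument that this comment calls `labels` is named `codes`. The sample data and the `level_2` name are made up for illustration.

```python
import numpy as np
import pandas as pd

# An existing two-level index whose levels and codes we want to reuse.
existing = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=["level_0", "level_1"])

# Only the new array needs to be factorized.
extra = np.array(["x", "y", "x", "y"])
cat = pd.Categorical(extra, ordered=True)

combined = pd.MultiIndex(
    levels=list(existing.levels) + [cat.categories],
    codes=list(existing.codes) + [cat.codes],
    names=list(existing.names) + ["level_2"],
)
```

Going through `get_level_values` and `from_arrays` instead would rebuild the categories of every level, including the ones that already exist.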

@@ -816,6 +816,118 @@ def swap_dims(self, dims_dict):
ds = self._to_temp_dataset().swap_dims(dims_dict)
return self._from_temp_dataset(ds)

def set_index(self, indexers=None, append=False, inplace=False,
Member

A note on terminology: in xarray/pandas, indexer usually refers to the argument/key used for indexing, whereas index refers to the set of existing labels, e.g., df.index vs df.loc[indexer]. So in this case I think the argument name indexes would make more sense.

Member Author

OK, I've naively copied the signature of reindex but now I get it.
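
The index/indexer distinction from this thread, shown in pandas terms. The DataFrame contents are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "b", "c"])

# The index: the set of labels that already exists on the object.
index = df.index

# An indexer: the key passed in to select along that index.
indexer = ["a", "c"]
selected = df.loc[indexer]
```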

Member Author

benbovy commented Oct 20, 2016

Some API design questions (mostly from @shoyer's review) we need to fix:

  • We need to choose whether to use dim=indexes kwargs or fixed arg/kwarg relative to a given dimension for the signatures of .set_index(), .reset_index() and .reorder_levels().
  • Do we also allow .set_index() to rename the dimension(s) if needed, instead of doing .set_index(...).rename(...) ? Is this a common use case that is worth it?
  • After discussion in #1017 (WIP: Optional indexes, no more default coordinates given by range(n)), it seems that we need an easy way to (re)set indexes either to no index or to range(n).

For point 1, my preference goes to dim=indexes kwargs, especially if we need 2 and 3. It's less succinct, but it's closer to the signatures of other xarray methods like .reindex() or .sel(), and it allows (re)setting the indexes of multiple dimensions in a single call. Given 2, I find set_index(new_dim_name=['level_1', 'level_2']) a bit more elegant than set_index(['level_1', 'level_2'], name='new_dim_name'). Given 3, array.reset_index('x') seems ambiguous compared to array.reset_index(x=None) (no index) and, e.g., array.reset_index(x='range') (range(n) index).

@benbovy benbovy mentioned this pull request Oct 20, 2016
7 tasks
Member

shoyer commented Oct 22, 2016

We need to choose whether to use dim=indexes kwargs or fixed arg/kwarg relative to a given dimension for the signatures of .set_index(), .reset_index() and .reorder_levels().

For set_index and reorder_levels, I like the kwargs or a dictionary. It's nice and explicit. But for reset_index, I think we probably want a list.

It's not at all obvious to me what array.reset_index(x=None) does. It could just as easily mean "reset nothing from x" as "reset x to have a null index". In fact, the former seems more consistent with how we handle levels. In contrast, array.reset_index(['x']) pretty clearly means that the 'x' index should be reset.

Do we also allow .set_index() to rename the dimension(s) if needed, instead of doing .set_index(...).rename(...) ? Is this a common use case that is worth it?

My inclination is yes -- this feels like a common thing to do. But we could also safely add this later.

After discussion in #1017, it seems that we need an easy way to (re)set indexes either to no index or to range(n).

We definitely need a way to reset indexes to the default, but after #1017, I'm not sure we will need a way to set them to range(n).

Unfortunately, if x is a normal (non-multi) Index, array.reset_index('x') is not well defined. We need a name for the variable that was formerly named x (or could drop it, e.g., with array.reset_index('x', drop=True) or array.drop('x')), otherwise it will still be the index for the x-axis. Or, I suppose we could rename the x dimension to something else.

One option is to add some sort of prefix or suffix to the index name when it becomes a new variable, e.g., array.reset_index('x') renames the coordinate x to x_. This seems like a probably safe choice, though I hate to add more automatic names to the API.
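
The suffix idea can be sketched on a plain dictionary of coordinates. This only illustrates the proposed naming rule; `reset_plain_index`, its `suffix` parameter, and the dict-based coordinates are hypothetical and say nothing about what xarray ended up doing.

```python
def reset_plain_index(coords, dim, drop=False, suffix="_"):
    """Demote the index coordinate `dim` to a regular variable.

    `coords` maps coordinate names to values. The old index is either
    dropped or kept under a suffixed name (e.g. 'x' becomes 'x_').
    Hypothetical helper for illustration only.
    """
    new_coords = dict(coords)
    values = new_coords.pop(dim)
    if not drop:
        new_name = dim + suffix
        if new_name in new_coords:
            raise ValueError("%r already exists" % new_name)
        new_coords[new_name] = values
    return new_coords


renamed = reset_plain_index({"x": [10, 20, 30]}, "x")
dropped = reset_plain_index({"x": [10, 20, 30]}, "x", drop=True)
```

The name-collision check shows the cost of automatic names: the rule needs a fallback whenever the suffixed name is already taken.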

Member Author

benbovy commented Nov 4, 2016

Sorry for the delay @shoyer. I've read your comments above and they all seem relevant. I'll find some time next week to get back on this.

Member Author

benbovy commented Nov 7, 2016

Just committed review changes.

.reset_index() doesn't accept kwargs anymore, though I don't know what to choose between the options below (currently option A is implemented):

  • option A: reset_index(dim, levels=None) where dim may accept multiple dimension names (in
    that case levels must be a list of lists with the same length as dim, or simply None, which
    would then be applied to all given dimensions).
  • option B: same as option A, reset_index(dim, levels=None), except that dim only accepts
    one dimension (thus a bit simpler but less flexible).
  • option C: reset_index(dim_or_levels) where one can provide a list of dimension(s) and/or
    level(s). This is the most flexible and concise, though maybe less readable. Allowing both
    dimensions and levels may be ambiguous too.
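
Option C hinges on telling dimension names apart from level names. A minimal sketch of that resolution step, assuming level names never collide with dimension names (the `split_dims_and_levels` helper and the `level_names_by_dim` mapping are hypothetical, not the code in this PR):

```python
def split_dims_and_levels(dims_or_levels, level_names_by_dim):
    """Map each requested name to the dimension and levels to reset.

    `level_names_by_dim` maps a multi-indexed dimension to its level
    names. Returns {dim: [levels to reset]}.
    Hypothetical helper for illustration only.
    """
    if isinstance(dims_or_levels, str):
        dims_or_levels = [dims_or_levels]
    level_to_dim = {
        level: dim
        for dim, levels in level_names_by_dim.items()
        for level in levels
    }
    to_reset = {}
    for name in dims_or_levels:
        if name in level_names_by_dim:
            # A dimension name resets all of its levels.
            to_reset[name] = list(level_names_by_dim[name])
        elif name in level_to_dim:
            # A level name resets only that level of its dimension.
            levels = to_reset.setdefault(level_to_dim[name], [])
            if name not in levels:
                levels.append(name)
        else:
            raise ValueError("%r is not a dimension or a level" % name)
    return to_reset


indexes = {"x": ["a", "b"], "record": ["level_0", "level_1"]}
mixed = split_dims_and_levels(["x", "level_1"], indexes)
single = split_dims_and_levels("record", indexes)
```

If a name could be both a dimension and a level, the first branch would silently win, which is the ambiguity noted in the option C bullet.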

if x is a normal (non-multi) Index, array.reset_index('x') is not well defined

Currently .reset_index() doesn't allow resetting normal indexes, but we can wait for #1017 before merging this.

@@ -102,6 +103,105 @@ def calculate_dimensions(variables):
return dims


def merge_indexes(indexes, variables, coord_names, append=False):
Member

Can you try adding type annotations here, like the ones I started adding in core/merge.py? I'm not even running mypy yet but I think these could significantly improve readability, and are lighter weight than adding a full docstring.


Returns
-------
reindexed : DataArray
Member

wrong name -- should not be reindexed

Member

shoyer commented Nov 9, 2016

I kind of like Option C, given that we have guaranteed level names and variables to have no conflicts.

Did you go for allowing set_index() to rename variables?

Member Author

benbovy commented Nov 15, 2016

This is ready for another round of review.

I've changed the signature of reset_index to option C. It is also almost ready for #1017 (just added two small TODOs).

Did you go for allowing set_index() to rename variables?

Not yet, but as you said we could safely add this later.

Member

shoyer commented Dec 16, 2016

Can we update this for optional indexes? (now on master)

Member Author

benbovy commented Dec 20, 2016

This should now behave correctly with optional indexes.

Member

@shoyer shoyer left a comment

Looks great, thanks for persevering on this! My only concern is about the best place to put this new info in the docs (see comments inline).

@@ -478,6 +478,49 @@ Both ``reindex_like`` and ``align`` work interchangeably between
# this is a no-op, because there are no shared dimension names
ds.reindex_like(other)

.. _multi-index handling:

Multi-index handling
Member

These docs are great, but I wouldn't call them "indexing methods" exactly. Maybe move this section to Reshaping and reorganizing data?

Member Author

Makes sense!

@shoyer shoyer merged commit 7ad2544 into pydata:master Dec 27, 2016
@benbovy benbovy deleted the multi-index_methods branch August 30, 2023 09:28