Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document the new __repr__ #1199

Closed
fmaussion opened this issue Jan 11, 2017 · 23 comments · Fixed by #1221 or #1236
Closed

Document the new __repr__ #1199

fmaussion opened this issue Jan 11, 2017 · 23 comments · Fixed by #1221 or #1236

Comments

@fmaussion
Copy link
Member

Sorry I missed that one when it was decided upon in #1017, but I think the changes in repr should be documented somewhere (at the minimum in the "Breaking Changes" section of what's new).

I just updated Salem for it to work well with xarray 0.9.0. The changes I had to make where quite small (that's a good thing), but it took me a bit of time to understand what was going on.

What I found confusing is following:

In [1]: import xarray as xr
In [2]: ds = xr.DataArray([1, 2, 3]).to_dataset(name='var')
In [3]: ds
Out[3]: 
<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  o dim_0    (dim_0) -
Data variables:
    var      (dim_0) int64 1 2 3

In [4]: 'dim_0' in ds.coords
Out[4]: False

dim_0is listed as coordinate, but 'dim_0' in ds.coords is False. I think it should remain like this, but maybe we should document somewhere what the "o" and "*" mean?

(possibly here)

@shoyer
Copy link
Member

shoyer commented Jan 11, 2017

Agreed, we should definitely document this and highlight the change in the release notes. I'm still not super happy with using o in the new repr, but it is the best we came up with so far.

@shoyer shoyer added this to the 0.9.0 milestone Jan 11, 2017
@shoyer
Copy link
Member

shoyer commented Jan 15, 2017

I wonder if instead this should be taken as an indication that we need to make the repr more self explanatory, e.g., by writing something like *missing* instead of -. This would be consistent with how we use *empty* in the Dataset repr for empty coords or data_vars.

Current version:

<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  o dim_0    (dim_0) -
Data variables:
    var      (dim_0) int64 1 2 3

Alternative A: write *missing* instead of -:

<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  o dim_0    (dim_0) *missing*
Data variables:
    var      (dim_0) int64 1 2 3

Alternative B: write *missing* instead of -, use * for the marker instead of o:

<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  * dim_0    (dim_0) *missing*
Data variables:
    var      (dim_0) int64 1 2 3

I think I like Alternative B best. @MaximilianR @crusaderky @benbovy any opinions?

@fmaussion
Copy link
Member Author

fmaussion commented Jan 16, 2017

[edited to remove a less relevant comment, now moved to a separate issue]

After having a look at the doc about coordinates (http://xarray.pydata.org/en/latest/data-structures.html#coordinates), here a few thoughts:

  • it would be good to have clear semantics about the different coordinates. For example, "positional coordinates" for coords with "*", "auxilliary coordinates" for coords without "*". How do we call the coordinates with a "o" ?
  • after giving it a second thought: have you considered not to list the coordinates with "o" in the coords repr? After all, what they really are is dimensions, not coordinates (as shown by the test above: 'dim_0' in ds.coords == False). The operations available on them (e.g. isel) really look like operations available on dimensions (similar to axis in numpy), not so much like operations available on coordinates.

I'll go on with a simple example that I just tried out to let you know how I dealt with this new behavior of xarray:

In [1]: import xarray as xr

In [2]: a = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])

In [3]: da = xr.DataArray(a, dims=['y', 'x'],
                          coords={'x':[0.1, 1.1, 2.2], 'xy':(['y', 'x'], a)})

In [4]: da
Out[4]: 
<xarray.DataArray (y: 2, x: 3)>
array([[ 1.1,  2.2,  3.3],
       [ 4.4,  5.5,  6.6]])
Coordinates:
  * x        (x) float64 0.1 1.1 2.2
    xy       (y, x) float64 1.1 2.2 3.3 4.4 5.5 6.6
  o y        (y) -

"OK, I have three types of coordinates here. This makes sense, they are all defined in a different way. y is a dimension of my dataset, so at least I should be able to select along that dimension:".

In [6]: da.isel(y=1)
Out[6]: 
<xarray.DataArray (x: 3)>
array([ 4.4,  5.5,  6.6])
Coordinates:
  * x        (x) float64 0.1 1.1 2.2
    xy       (x) float64 4.4 5.5 6.6

"so far so good! x is a coordinate, so here I should be able to do more complex selection based on values:"

In [7]: da.sel(x=2.2)
Out[7]: 
<xarray.DataArray (y: 2)>
array([ 3.3,  6.6])
Coordinates:
    x        float64 2.2
    xy       (y) float64 3.3 6.6
  o y        (y) -

"That makes sense. However, this is probably not possible with y, which is not defined:"

In [8]: da.sel(y=0)
Out[8]: 
<xarray.DataArray (x: 3)>
array([ 1.1,  2.2,  3.3])
Coordinates:
  * x        (x) float64 0.1 1.1 2.2
    xy       (x) float64 1.1 2.2 3.3

"wait... So I can select by label on an undefined coordinate? Maybe y is a coordinate after all?"

In [9]: 'y' in da.coords
Out[9]: False

"ah, no."

I actually think that .sel() shoud not work on y. I understand that this would break a bunch of existing code, but this is a confusing (if not inconsistent) behavior.

Maybe I'm completely missing the point here: don't hesitate to correct me!

@benbovy
Copy link
Member

benbovy commented Jan 16, 2017

I'm still not super happy with using o in the new repr

Yes this does not seem like the best alternative.

after giving it a second thought: have you considered not to list the coordinates with "o" in the coords repr?

+1. That was the initial implementation in #1017 I think, which is the most consistent. But @crusaderky raised an issue of readability (see #1017 (comment)).

IMO this issue of readability is more a bad habit of using the Coordinates section to actually inspect the dimensions (and thus not focusing enough on the Dimensions section), resulting from the old behavior where each dimension was guaranteed to have a (default) coordinate. Now that this is not the case anymore I wonder if new xarray users (i.e., after #1017) will encounter this issue of readability.

If this remains an issue, another option than the ones above would be to list dimensions with no index in a separate, dedicated section of the repr, e.g. with the examples above,

<xarray.Dataset>
Dimensions:  (dim_0: 3)
Unindexed dimensions:
    dim_0
Data variables:
    var      (dim_0) int64 1 2 3
<xarray.DataArray 'xy' (y: 2, x: 3)>
array([[ 1.1,  2.2,  3.3],
       [ 4.4,  5.5,  6.6]])
Unindexed dimensions:
    y
Coordinates:
  * x        (x) float64 0.1 1.1 2.2
    xy       (y, x) float64 1.1 2.2 3.3 4.4 5.5 6.6

To not add too much verbosity, multiple dimensions without coord may be displayed inline:

<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 4)
Unindexed dimensions:
    dim_0, dim_1
Data variables:
    var      (dim_0) int64 1 2 3

@shoyer
Copy link
Member

shoyer commented Jan 17, 2017

I actually think that .sel() shoud not work on y. I understand that this would break a bunch of existing code, but this is a confusing (if not inconsistent) behavior.

See #1017 (comment) and discussion below for comments on this. This does make .sel a little more complex, but I think it's pretty reasonable given that the fall-back behavior is well defined and not value dependent like pandas's soon to be deprecated .ix.

@fmaussion
Copy link
Member Author

This does make .sel a little more complex, but I think it's pretty reasonable

Sorry, I'm not sure to understand: do you mean you'd rather stay with the current behavior as you said in #1017 (comment) ?

@shoyer
Copy link
Member

shoyer commented Jan 17, 2017

Sorry, I'm not sure to understand: do you mean you'd rather stay with the current behavior as you said in #1017 (comment) ?

I like the currently implemented behavior, which is the second bullet in #1017 (comment). We could potentially switch to the behavior of the first bullet (issue TypeError instead), but I think this would be less convenient.

@shoyer
Copy link
Member

shoyer commented Jan 20, 2017

OK, I'm going to put together a PR to change this behavior to add the Unindexed dimensions section instead of putting these dimensions under Coordinates.

This feels a little overly verbose (I still feel like there may be some sort of solution with symbols here), but it's explicit and fully self-explanatory, which is a big advantage of over what we currently have.

@shoyer
Copy link
Member

shoyer commented Jan 20, 2017

See #1221.

@shoyer shoyer reopened this Jan 26, 2017
@shoyer
Copy link
Member

shoyer commented Jan 26, 2017

We may not have gotten this right yet. See StackOverflow: What are “unindexed dimensions” and why are coordinates empty?

@shoyer
Copy link
Member

shoyer commented Jan 26, 2017

OK, let's go back to the drawing board.

Let ds = xr.Dataset({'foo': (('x', 'y'), [[1, 2], [3, 4]])}, {'x': [1, 2]}).

Current repr (v0.9.0):

<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Unindexed dimensions:
    y
Data variables:
    foo      (x, y) int64 1 2 3 4

Some alternatives:

  • Rename "Unindexed dimensions" to "Dimensions without coordinates". This is clearer but pretty verbose:
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Dimensions without coordinates:
    y
Data variables:
    foo      (x, y) int64 1 2 3 4
  • Same as above, but with everything on one line (less obtrusive):
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo      (x, y) int64 1 2 3 4
Dimensions without coordinates: y
  • Mark Dimensions with an index with * before there name. @crusaderky suggested this previously but I rejected it as ugly:
<xarray.Dataset>
Dimensions:  (*x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo      (x, y) int64 1 2 3 4
  • Use another marker character in Dimensions, maybe ' after dimension names for dimensions with indexes?
<xarray.Dataset>
Dimensions:  (x': 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo      (x, y) int64 1 2 3 4

@gerritholl
Copy link
Contributor

I think "Dimensions without coordinates" is clearer than "Unindexed dimensions", and only marginally more verbose (30 characters instead of 20). Any dimension can be indexed, just the index lookup is by position rather than by coordinate/label. I don't think marking the dimension/coordinate matches makes it any clearer as this matching is by name anyway, and my confusion was due to none of the dimensions having coordinates. I would support simply changing the label.

@benbovy
Copy link
Member

benbovy commented Jan 26, 2017

We may not have gotten this right yet

I agree.

As any dimension can be indexed (at least lookup by position), the name "coordinate" may be indeed more appropriate, but we need also to make the distinction between coordinates which are IndexVariable objects and those which are Variable objects.

Any use case for dimensions which don't have an index (i.e., an IndexVariable) but have coordinates? This would be confusing too:

>>> xr.Dataset({'foo': (('x', 'y'), [[1, 2], [3, 4]])},
...            {'c': (('x', 'y'), [[5, 6], [7, 8]])})
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
    c        (x, y) int64 5 6 7 8
Data variables:
    foo      (x, y) int64 1 2 3 4
Dimensions without coordinates: x, y

@benbovy
Copy link
Member

benbovy commented Jan 26, 2017

So the original issue was about highlighting in the repr which dimensions have an IndexVariable object and which dimensions don't.

"Dimensions without index variable" would be unambiguous in all cases, but it doesn't look nice.

Mirroring * in Dimensions is also the less ambiguous (and most concise) of the options above, but it's not very nice too (IMHO not that ugly, though).

@gerritholl
Copy link
Contributor

With any kind of marking (such as with *) the problem is that the user might not know what the marking is for, and syntax is hard to google. When I see *x without whitespace I think of iterable unpacking...

@shoyer
Copy link
Member

shoyer commented Jan 27, 2017

Any use case for dimensions which don't have an index (i.e., an IndexVariable) but have coordinates? This would be confusing too:

I suppose we do something even more explicit, e.g., "Dimensions without a corresponding coordinate". That feels too long to me, though.

I don't like "Dimensions without index variable" because it emphasizes that they don't have an index rather than that they don't have a variable.

For now I think "Dimensions without coordinates" is my favorite.

@fmaussion
Copy link
Member Author

We may not have gotten this right yet. See StackOverflow: What are “unindexed dimensions” and why are coordinates empty?

The problem for the user is that the __repr__ makes him feel that his/her file is "not standard", as there are no proper coordinates in the file (at least not in the nice, CF compliant way).

I'd add another suggestion to the list that @shoyer proposed, which is simply to do nothing with the unindexed dimensions:

<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo      (x, y) int64 1 2 3 4

This has the advantage to be unambiguous and close to the data model of the file at hand (with dimensions but no coordinates).

After that, my preference goes for "Dimensions without coordinates" too.

In the SO post, the OP also wondered about the "empty" coordinates. Any plan to change this too? Maybe a - instead of *empty*?

@gerritholl
Copy link
Contributor

Perhaps more broadly documentation-wise, it might be good to add a terminology list. For example, that could clarify the difference and relation between dimensions, labels, indices, coordinates, etc.. There are dimensions without coordinates, dimensions that are labelled or unlabelled, there are coordinates that are indices, coordinates that are not indices. I'm still figuring out how all of those relate to each other and how I use them.

@shoyer
Copy link
Member

shoyer commented Jan 27, 2017

In the SO post, the OP also wondered about the "empty" coordinates. Any plan to change this too? Maybe a - instead of empty?

I think I want to remove the appearance of *empty* for coordinates, and maybe for data variables, too. The intent was to remove confusion about missing fields in the repr, to avoid something like

<xarray.Dataset>
Dimensions:  ()

for xr.Dataset().

It gives a subtle signal that the user is doing something wrong, but that isn't necessarily the case.

I'd add another suggestion to the list that @shoyer proposed, which is simply to do nothing with the unindexed dimensions

Yes, this is what I did originally, and I think it's a very elegant solution.

Unfortunately, for large Dataset objects, there were valid complaints it can be easy to miss dimensions without coordinates. There's some threshold, probably around five or so items, at which point it becomes hard to see at a glance what's missing from a list.

I don't see it working well to make repr change suddenly when a certain threshold number of items is reached, but maybe there's another way (a different method? an option?) we can use for exposing this information to the power-users who need to see it without making it confusing for newcomers.

@shoyer
Copy link
Member

shoyer commented Jan 27, 2017

Perhaps more broadly documentation-wise, it might be good to add a terminology list. For example, that could clarify the difference and relation between dimensions, labels, indices, coordinates, etc.. There are dimensions without coordinates, dimensions that are labelled or unlabelled, there are coordinates that are indices, coordinates that are not indices. I'm still figuring out how all of those relate to each other and how I use them.

Agreed, this is definitely worth doing. I also made a diagram recently to summarize the xarray data model that I'd like to put in the docs:

response_to_reviewers

(feedback on the diagram is very welcome if anyone has suggestions!)

@benbovy
Copy link
Member

benbovy commented Jan 27, 2017

Nice diagram! Do you think it's worth to also add Variable and IndexVariable (along with inheritance)?

Finally I would be also +1 to do nothing with the unindexed dimensions in the repr, even though complaints for large Dataset are valid.

Although this wouldn't fully solve the problem, maybe an html repr for the notebook would help here?

@shoyer
Copy link
Member

shoyer commented Jan 27, 2017

Nice diagram! Do you think it's worth to also add Variable and IndexVariable (along with inheritance)?

Yes, but probably only for a separate "internal/advanced API" diagram. I want this to focus on the user facing and public API.

Although this wouldn't fully solve the problem, maybe an html repr for the notebook would help here?

Indeed, I think this is quite possible -- we could squeeze a lot more information into an HTML repr. For example, with a little bit of JavaScript (or maybe CSS these days?) we could highlight all appearances of a dimension name when you hover over it. Maybe we can find someone with design talent/interest to work on this?

@shoyer
Copy link
Member

shoyer commented Jan 30, 2017

See #1236 for a proposed fix. After merging it, I will release v0.9.1 and issue the delayed release announcement for xarray v0.9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants