Document the new repr #1199

fmaussion · 2017-01-11T15:37:37Z

Sorry I missed that one when it was decided upon in #1017, but I think the changes in repr should be documented somewhere (at the minimum in the "Breaking Changes" section of what's new).

I just updated Salem for it to work well with xarray 0.9.0. The changes I had to make where quite small (that's a good thing), but it took me a bit of time to understand what was going on.

What I found confusing is following:

In [1]: import xarray as xr
In [2]: ds = xr.DataArray([1, 2, 3]).to_dataset(name='var')
In [3]: ds
Out[3]: 
<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  o dim_0    (dim_0) -
Data variables:
    var      (dim_0) int64 1 2 3

In [4]: 'dim_0' in ds.coords
Out[4]: False

dim_0is listed as coordinate, but 'dim_0' in ds.coords is False. I think it should remain like this, but maybe we should document somewhere what the "o" and "*" mean?

(possibly here)

The text was updated successfully, but these errors were encountered:

shoyer · 2017-01-11T17:10:20Z

Agreed, we should definitely document this and highlight the change in the release notes. I'm still not super happy with using o in the new repr, but it is the best we came up with so far.

shoyer · 2017-01-15T03:06:25Z

I wonder if instead this should be taken as an indication that we need to make the repr more self explanatory, e.g., by writing something like *missing* instead of -. This would be consistent with how we use *empty* in the Dataset repr for empty coords or data_vars.

Current version:

<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  o dim_0    (dim_0) -
Data variables:
    var      (dim_0) int64 1 2 3

Alternative A: write *missing* instead of -:

<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  o dim_0    (dim_0) *missing*
Data variables:
    var      (dim_0) int64 1 2 3

Alternative B: write *missing* instead of -, use * for the marker instead of o:

<xarray.Dataset>
Dimensions:  (dim_0: 3)
Coordinates:
  * dim_0    (dim_0) *missing*
Data variables:
    var      (dim_0) int64 1 2 3

I think I like Alternative B best. @MaximilianR @crusaderky @benbovy any opinions?

fmaussion · 2017-01-16T10:27:59Z

[edited to remove a less relevant comment, now moved to a separate issue]

After having a look at the doc about coordinates (http://xarray.pydata.org/en/latest/data-structures.html#coordinates), here a few thoughts:

it would be good to have clear semantics about the different coordinates. For example, "positional coordinates" for coords with "*", "auxilliary coordinates" for coords without "*". How do we call the coordinates with a "o" ?
after giving it a second thought: have you considered not to list the coordinates with "o" in the coords repr? After all, what they really are is dimensions, not coordinates (as shown by the test above: 'dim_0' in ds.coords == False). The operations available on them (e.g. isel) really look like operations available on dimensions (similar to axis in numpy), not so much like operations available on coordinates.

I'll go on with a simple example that I just tried out to let you know how I dealt with this new behavior of xarray:

In [1]: import xarray as xr

In [2]: a = np.array([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]])

In [3]: da = xr.DataArray(a, dims=['y', 'x'],
                          coords={'x':[0.1, 1.1, 2.2], 'xy':(['y', 'x'], a)})

In [4]: da
Out[4]: 
<xarray.DataArray (y: 2, x: 3)>
array([[ 1.1,  2.2,  3.3],
       [ 4.4,  5.5,  6.6]])
Coordinates:
  * x        (x) float64 0.1 1.1 2.2
    xy       (y, x) float64 1.1 2.2 3.3 4.4 5.5 6.6
  o y        (y) -

"OK, I have three types of coordinates here. This makes sense, they are all defined in a different way. y is a dimension of my dataset, so at least I should be able to select along that dimension:".

In [6]: da.isel(y=1)
Out[6]: 
<xarray.DataArray (x: 3)>
array([ 4.4,  5.5,  6.6])
Coordinates:
  * x        (x) float64 0.1 1.1 2.2
    xy       (x) float64 4.4 5.5 6.6

"so far so good! x is a coordinate, so here I should be able to do more complex selection based on values:"

In [7]: da.sel(x=2.2)
Out[7]: 
<xarray.DataArray (y: 2)>
array([ 3.3,  6.6])
Coordinates:
    x        float64 2.2
    xy       (y) float64 3.3 6.6
  o y        (y) -

"That makes sense. However, this is probably not possible with y, which is not defined:"

In [8]: da.sel(y=0)
Out[8]: 
<xarray.DataArray (x: 3)>
array([ 1.1,  2.2,  3.3])
Coordinates:
  * x        (x) float64 0.1 1.1 2.2
    xy       (x) float64 1.1 2.2 3.3

"wait... So I can select by label on an undefined coordinate? Maybe y is a coordinate after all?"

In [9]: 'y' in da.coords
Out[9]: False

"ah, no."

I actually think that .sel() shoud not work on y. I understand that this would break a bunch of existing code, but this is a confusing (if not inconsistent) behavior.

Maybe I'm completely missing the point here: don't hesitate to correct me!

benbovy · 2017-01-16T11:47:47Z

I'm still not super happy with using o in the new repr

Yes this does not seem like the best alternative.

after giving it a second thought: have you considered not to list the coordinates with "o" in the coords repr?

+1. That was the initial implementation in #1017 I think, which is the most consistent. But @crusaderky raised an issue of readability (see #1017 (comment)).

IMO this issue of readability is more a bad habit of using the Coordinates section to actually inspect the dimensions (and thus not focusing enough on the Dimensions section), resulting from the old behavior where each dimension was guaranteed to have a (default) coordinate. Now that this is not the case anymore I wonder if new xarray users (i.e., after #1017) will encounter this issue of readability.

If this remains an issue, another option than the ones above would be to list dimensions with no index in a separate, dedicated section of the repr, e.g. with the examples above,

<xarray.Dataset>
Dimensions:  (dim_0: 3)
Unindexed dimensions:
    dim_0
Data variables:
    var      (dim_0) int64 1 2 3

<xarray.DataArray 'xy' (y: 2, x: 3)>
array([[ 1.1,  2.2,  3.3],
       [ 4.4,  5.5,  6.6]])
Unindexed dimensions:
    y
Coordinates:
  * x        (x) float64 0.1 1.1 2.2
    xy       (y, x) float64 1.1 2.2 3.3 4.4 5.5 6.6

To not add too much verbosity, multiple dimensions without coord may be displayed inline:

<xarray.Dataset>
Dimensions:  (dim_0: 3, dim_1: 4)
Unindexed dimensions:
    dim_0, dim_1
Data variables:
    var      (dim_0) int64 1 2 3

shoyer · 2017-01-17T01:02:46Z

I actually think that .sel() shoud not work on y. I understand that this would break a bunch of existing code, but this is a confusing (if not inconsistent) behavior.

See #1017 (comment) and discussion below for comments on this. This does make .sel a little more complex, but I think it's pretty reasonable given that the fall-back behavior is well defined and not value dependent like pandas's soon to be deprecated .ix.

fmaussion · 2017-01-17T12:20:36Z

This does make .sel a little more complex, but I think it's pretty reasonable

Sorry, I'm not sure to understand: do you mean you'd rather stay with the current behavior as you said in #1017 (comment) ?

shoyer · 2017-01-17T18:19:46Z

Sorry, I'm not sure to understand: do you mean you'd rather stay with the current behavior as you said in #1017 (comment) ?

I like the currently implemented behavior, which is the second bullet in #1017 (comment). We could potentially switch to the behavior of the first bullet (issue TypeError instead), but I think this would be less convenient.

shoyer · 2017-01-20T00:30:56Z

OK, I'm going to put together a PR to change this behavior to add the Unindexed dimensions section instead of putting these dimensions under Coordinates.

This feels a little overly verbose (I still feel like there may be some sort of solution with symbols here), but it's explicit and fully self-explanatory, which is a big advantage of over what we currently have.

shoyer · 2017-01-20T05:41:25Z

See #1221.

shoyer · 2017-01-26T17:56:53Z

We may not have gotten this right yet. See StackOverflow: What are “unindexed dimensions” and why are coordinates empty?

shoyer · 2017-01-26T18:06:14Z

OK, let's go back to the drawing board.

Let ds = xr.Dataset({'foo': (('x', 'y'), [[1, 2], [3, 4]])}, {'x': [1, 2]}).

Current repr (v0.9.0):

<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Unindexed dimensions:
    y
Data variables:
    foo      (x, y) int64 1 2 3 4

Some alternatives:

Rename "Unindexed dimensions" to "Dimensions without coordinates". This is clearer but pretty verbose:

<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Dimensions without coordinates:
    y
Data variables:
    foo      (x, y) int64 1 2 3 4

Same as above, but with everything on one line (less obtrusive):

<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo      (x, y) int64 1 2 3 4
Dimensions without coordinates: y

Mark Dimensions with an index with * before there name. @crusaderky suggested this previously but I rejected it as ugly:

<xarray.Dataset>
Dimensions:  (*x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo      (x, y) int64 1 2 3 4

Use another marker character in Dimensions, maybe ' after dimension names for dimensions with indexes?

<xarray.Dataset>
Dimensions:  (x': 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo      (x, y) int64 1 2 3 4

gerritholl · 2017-01-26T18:49:41Z

I think "Dimensions without coordinates" is clearer than "Unindexed dimensions", and only marginally more verbose (30 characters instead of 20). Any dimension can be indexed, just the index lookup is by position rather than by coordinate/label. I don't think marking the dimension/coordinate matches makes it any clearer as this matching is by name anyway, and my confusion was due to none of the dimensions having coordinates. I would support simply changing the label.

benbovy · 2017-01-26T20:07:25Z

We may not have gotten this right yet

I agree.

As any dimension can be indexed (at least lookup by position), the name "coordinate" may be indeed more appropriate, but we need also to make the distinction between coordinates which are IndexVariable objects and those which are Variable objects.

Any use case for dimensions which don't have an index (i.e., an IndexVariable) but have coordinates? This would be confusing too:

>>> xr.Dataset({'foo': (('x', 'y'), [[1, 2], [3, 4]])},
...            {'c': (('x', 'y'), [[5, 6], [7, 8]])})
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
    c        (x, y) int64 5 6 7 8
Data variables:
    foo      (x, y) int64 1 2 3 4
Dimensions without coordinates: x, y

benbovy · 2017-01-26T21:05:23Z

So the original issue was about highlighting in the repr which dimensions have an IndexVariable object and which dimensions don't.

"Dimensions without index variable" would be unambiguous in all cases, but it doesn't look nice.

Mirroring * in Dimensions is also the less ambiguous (and most concise) of the options above, but it's not very nice too (IMHO not that ugly, though).

gerritholl · 2017-01-26T21:23:32Z

With any kind of marking (such as with *) the problem is that the user might not know what the marking is for, and syntax is hard to google. When I see *x without whitespace I think of iterable unpacking...

shoyer · 2017-01-27T01:01:50Z

Any use case for dimensions which don't have an index (i.e., an IndexVariable) but have coordinates? This would be confusing too:

I suppose we do something even more explicit, e.g., "Dimensions without a corresponding coordinate". That feels too long to me, though.

I don't like "Dimensions without index variable" because it emphasizes that they don't have an index rather than that they don't have a variable.

For now I think "Dimensions without coordinates" is my favorite.

fmaussion · 2017-01-27T08:59:45Z

We may not have gotten this right yet. See StackOverflow: What are “unindexed dimensions” and why are coordinates empty?

The problem for the user is that the __repr__ makes him feel that his/her file is "not standard", as there are no proper coordinates in the file (at least not in the nice, CF compliant way).

I'd add another suggestion to the list that @shoyer proposed, which is simply to do nothing with the unindexed dimensions:

<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    foo      (x, y) int64 1 2 3 4

This has the advantage to be unambiguous and close to the data model of the file at hand (with dimensions but no coordinates).

After that, my preference goes for "Dimensions without coordinates" too.

In the SO post, the OP also wondered about the "empty" coordinates. Any plan to change this too? Maybe a - instead of *empty*?

gerritholl · 2017-01-27T12:02:46Z

Perhaps more broadly documentation-wise, it might be good to add a terminology list. For example, that could clarify the difference and relation between dimensions, labels, indices, coordinates, etc.. There are dimensions without coordinates, dimensions that are labelled or unlabelled, there are coordinates that are indices, coordinates that are not indices. I'm still figuring out how all of those relate to each other and how I use them.

shoyer · 2017-01-27T17:00:44Z

In the SO post, the OP also wondered about the "empty" coordinates. Any plan to change this too? Maybe a - instead of empty?

I think I want to remove the appearance of *empty* for coordinates, and maybe for data variables, too. The intent was to remove confusion about missing fields in the repr, to avoid something like

<xarray.Dataset>
Dimensions:  ()

for xr.Dataset().

It gives a subtle signal that the user is doing something wrong, but that isn't necessarily the case.

I'd add another suggestion to the list that @shoyer proposed, which is simply to do nothing with the unindexed dimensions

Yes, this is what I did originally, and I think it's a very elegant solution.

Unfortunately, for large Dataset objects, there were valid complaints it can be easy to miss dimensions without coordinates. There's some threshold, probably around five or so items, at which point it becomes hard to see at a glance what's missing from a list.

I don't see it working well to make repr change suddenly when a certain threshold number of items is reached, but maybe there's another way (a different method? an option?) we can use for exposing this information to the power-users who need to see it without making it confusing for newcomers.

shoyer · 2017-01-27T17:04:53Z

Perhaps more broadly documentation-wise, it might be good to add a terminology list. For example, that could clarify the difference and relation between dimensions, labels, indices, coordinates, etc.. There are dimensions without coordinates, dimensions that are labelled or unlabelled, there are coordinates that are indices, coordinates that are not indices. I'm still figuring out how all of those relate to each other and how I use them.

Agreed, this is definitely worth doing. I also made a diagram recently to summarize the xarray data model that I'd like to put in the docs:

(feedback on the diagram is very welcome if anyone has suggestions!)

benbovy · 2017-01-27T17:31:01Z

Nice diagram! Do you think it's worth to also add Variable and IndexVariable (along with inheritance)?

Finally I would be also +1 to do nothing with the unindexed dimensions in the repr, even though complaints for large Dataset are valid.

Although this wouldn't fully solve the problem, maybe an html repr for the notebook would help here?

shoyer · 2017-01-27T17:41:50Z

Nice diagram! Do you think it's worth to also add Variable and IndexVariable (along with inheritance)?

Yes, but probably only for a separate "internal/advanced API" diagram. I want this to focus on the user facing and public API.

Although this wouldn't fully solve the problem, maybe an html repr for the notebook would help here?

Indeed, I think this is quite possible -- we could squeeze a lot more information into an HTML repr. For example, with a little bit of JavaScript (or maybe CSS these days?) we could highlight all appearances of a dimension name when you hover over it. Maybe we can find someone with design talent/interest to work on this?

shoyer · 2017-01-30T00:27:10Z

See #1236 for a proposed fix. After merging it, I will release v0.9.1 and issue the delayed release announcement for xarray v0.9.

shoyer added this to the 0.9.0 milestone Jan 11, 2017

shoyer added the topic-documentation label Jan 11, 2017

shoyer mentioned this issue Jan 15, 2017

WIP: Optional indexes (no more default coordinates given by range(n)) #1017

Merged

5 tasks

shoyer mentioned this issue Jan 20, 2017

List "Unindexed dimensions" separately in __repr__ #1221

Merged

4 tasks

shoyer closed this as completed in #1221 Jan 22, 2017

shoyer reopened this Jan 26, 2017

shoyer mentioned this issue Jan 30, 2017

Better repr for coordinates without dimensions #1236

Merged

4 tasks

shoyer closed this as completed in #1236 Jan 30, 2017

fmaussion mentioned this issue Mar 4, 2017

Terminology for the various coordinates #1295

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document the new repr #1199

Document the new repr #1199

fmaussion commented Jan 11, 2017

shoyer commented Jan 11, 2017

shoyer commented Jan 15, 2017 •

edited

Loading

fmaussion commented Jan 16, 2017 •

edited

Loading

benbovy commented Jan 16, 2017

shoyer commented Jan 17, 2017

fmaussion commented Jan 17, 2017

shoyer commented Jan 17, 2017

shoyer commented Jan 20, 2017

shoyer commented Jan 20, 2017

shoyer commented Jan 26, 2017

shoyer commented Jan 26, 2017

gerritholl commented Jan 26, 2017

benbovy commented Jan 26, 2017 •

edited

Loading

benbovy commented Jan 26, 2017

gerritholl commented Jan 26, 2017

shoyer commented Jan 27, 2017

fmaussion commented Jan 27, 2017

gerritholl commented Jan 27, 2017

shoyer commented Jan 27, 2017

shoyer commented Jan 27, 2017

benbovy commented Jan 27, 2017

shoyer commented Jan 27, 2017

shoyer commented Jan 30, 2017

Document the new __repr__ #1199

Document the new __repr__ #1199

Comments

fmaussion commented Jan 11, 2017

shoyer commented Jan 11, 2017

shoyer commented Jan 15, 2017 • edited Loading

fmaussion commented Jan 16, 2017 • edited Loading

benbovy commented Jan 16, 2017

shoyer commented Jan 17, 2017

fmaussion commented Jan 17, 2017

shoyer commented Jan 17, 2017

shoyer commented Jan 20, 2017

shoyer commented Jan 20, 2017

shoyer commented Jan 26, 2017

shoyer commented Jan 26, 2017

gerritholl commented Jan 26, 2017

benbovy commented Jan 26, 2017 • edited Loading

benbovy commented Jan 26, 2017

gerritholl commented Jan 26, 2017

shoyer commented Jan 27, 2017

fmaussion commented Jan 27, 2017

gerritholl commented Jan 27, 2017

shoyer commented Jan 27, 2017

shoyer commented Jan 27, 2017

benbovy commented Jan 27, 2017

shoyer commented Jan 27, 2017

shoyer commented Jan 30, 2017

Document the new repr #1199

Document the new repr #1199

shoyer commented Jan 15, 2017 •

edited

Loading

fmaussion commented Jan 16, 2017 •

edited

Loading

benbovy commented Jan 26, 2017 •

edited

Loading