
Move ctype/segments from Coordinates to DataSource #357

Closed
jmilloy opened this issue Jan 13, 2020 · 11 comments
jmilloy (Collaborator) commented Jan 13, 2020

I think we should remove segment_lengths from Coordinates.

  • Users would only be able to evaluate point coordinates. I think this makes sense: it would be simpler for users and much simpler to code and maintain. (Note that when a user wants to average over segments, an explicit convolution is better and is already implemented.)

  • Data sources are segmented, not coordinates. The segment information would be moved to the DataSource node alongside the native_coordinates. It would be very clear that the interpolator is responsible for handling segmented data sources correctly, and I believe this is the only use case for segments. I think this would actually be fairly straightforward in most cases (it would make #238, "Interpolation ignores ctype", less daunting).

  • We would not need to transform the segments, which I don't believe is possible to do reasonably. See #252, "Convert segment_lengths when transforming lat and lon coordinates".

  • The Coordinates class would be much simpler.

This would really need to be done in 2.0. I note that #252 and #238 are included in 2.0.

@jmilloy added the discussion and backwards-incompatible labels Jan 13, 2020
@jmilloy self-assigned this Jan 13, 2020
jmilloy (Collaborator, Author) commented Jan 13, 2020

Ah, I guess I brought this up before in #266. We've never really discussed it.

mpu-creare (Contributor) commented
This kept getting pushed off, but I think you've finally convinced me.
Since we always do the interpolation in the source's coordinate system, you are correct that we'd never have to convert segments if we keep them with the data source. Please go ahead and make it so.

I think my main question would be how the segment information would be propagated through an algorithm pipeline. I think metadata in the xarray DataArray should suffice, though we'd have to think carefully about how that would look.

The reason this is important is that, fundamentally, the question is: what does the data point actually represent? Is it data at a point, or over an area? Is it a probe in the ground, or the active area of an instrument's pixel? Is it the value of the sensor at a point in time, or the number of counts in an hour, or a running count? That type of thing.

jmilloy (Collaborator, Author) commented Jan 14, 2020

> I think my main question would be how the segment information would be propagated through an algorithm pipeline.

AFAIK, this information is not currently propagated in any way, so this would be a new feature.

Are you thinking something like a dictionary mapping data sources by name to their segment information? Like a flat dictionary with the same names we use in the pipeline definitions? I could write out a spec.

mpu-creare (Contributor) commented
Yeah, some sort of dictionary.

We can put it in the attrs of the xarray DataArray, or pass it around like we do the coordinates and make it optional. Either way, the interpolator will know where to pick it up:

{
    'segment_lengths': {
        'lat': 0.5, 'lon': 0.25, 'time': '1,D'
    },
    'segment_position': {
        'lat': 'left', 'lon': 'middle', 'time': 'right'
    }
}

Then if it's missing, the defaults are 0 (a point) and 'midpoint'.
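For concreteness, a minimal sketch of how an interpolator might read these attrs with those defaults (the attrs dict here is illustrative, not actual podpac API):

attrs = {
    'segment_lengths': {'lat': 0.5, 'lon': 0.25, 'time': '1,D'},
    'segment_position': {'lat': 'left', 'lon': 'middle', 'time': 'right'},
}

# Defaults when a dimension is missing: 0 (a point) and 'midpoint'.
length = attrs.get('segment_lengths', {}).get('alt', 0)
position = attrs.get('segment_position', {}).get('alt', 'midpoint')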

That would be the basics for a uniform grid, or a point sensor with a defined area of effect (e.g. COSMOS). (It would be nice if we could support a circular footprint as well; maybe setting 'lat_lon': 5 would imply a radius of 5.) This approach is quite storage-efficient and probably covers 90% of the cases.

For greater generality, e.g. a non-uniform grid, say with lat=[0, 1, 3], you could have:

'segment_coords': {  # or segment_bounds?
    'lat': [-0.5, 0.5, 2, 4], ...
}

And then for dependent coordinates you can have an N-D array of coordinates describing the boundary of each point (so the size is shape + 1 in each dimension: fence posts).
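A minimal numpy sketch of that fence-post layout, assuming boundaries are stored as corner arrays with one extra "post" per dimension (the linear lat grid is just illustrative):

import numpy as np

n, m = 3, 4
i, j = np.meshgrid(np.arange(n), np.arange(m), indexing='ij')
lat = 10.0 + 0.5 * i + 0.1 * j           # dependent lat values, shape (3, 4)

# Corner posts: one extra row and column of boundary points.
ib, jb = np.meshgrid(np.arange(n + 1) - 0.5, np.arange(m + 1) - 0.5, indexing='ij')
lat_bounds = 10.0 + 0.5 * ib + 0.1 * jb  # shape (4, 5), i.e. lat.shape + 1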

Actually, you could probably implement segment_coords as a podpac.Coordinates object -- that would make a lot of sense.

I'd have to think more about how this would affect stacked coordinates.

That means DataSource nodes have to implement another function, get_segments(self), where the default is an empty dictionary, indicating that the native coordinates are points.

Now the segment information and the coordinates are completely separated.

jmilloy (Collaborator, Author) commented Jan 16, 2020

Oh, by "propagated through an algorithm pipeline" you are just talking about how to represent the segments in the data source and pass them to the interpolator. We're basically on the same page, then. They just need to be passed to the interpolator within the DataSource; that's not a problem. I don't think they need to propagate anywhere else; they are unnecessary once the data source is evaluated.

It seems that there are 4 ways to define the segments for a single dimension:

  1. a uniform segment length (a single value) and position, e.g. coordinates [0, 1, 3] with segment length 0.5 and position 'midpoint'.
  2. an array of segment lengths (shape n) e.g. [0, 1, 3] with segment lengths [0.5, 1.0, 2.0].
  3. an array of "fence posts" (shape n+1), e.g. [0, 1, 3] with segment boundaries [-0.5, 0.5, 2, 4].
  4. an array of segment bounds (shape (n, 2)), e.g. [0, 1, 3] with segment bounds [(-0.5, 0.5), (0.5, 1.5), (2, 4)].

What do you think about supporting just (1) simple segment length and position and (4) array of bounds?

The simple length (1) is the most common, and the bounds array (4) is the most flexible and covers all use cases. Both are easy to implement and understand, and (4) generalizes to non-square N-d boundaries. Fence posts (3), by contrast, are limited because they assume the segments cover the entire space, and a segment-lengths array (2) is limited because all of the coordinates must be positioned at the same location within their segments. (Less important, but with fence posts it is not always possible to index; e.g. coordinates[::2] has no way to represent the corresponding segments. It would be good for coordinates[I] and segments[I] to work for any I.)
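To make the comparison concrete, here's a rough numpy sketch (not podpac API) of how (2) and (3) reduce to the bounds array (4), and of the indexing point:

import numpy as np

coords = np.array([0.0, 1.0, 3.0])

# (2) segment-lengths array -> (4) bounds array, assuming 'midpoint' position
lengths = np.array([0.5, 1.0, 2.0])
bounds = np.stack([coords - lengths / 2, coords + lengths / 2], axis=1)

# (3) fence posts -> (4) bounds array
posts = np.array([-0.5, 0.5, 2.0, 4.0])
bounds = np.stack([posts[:-1], posts[1:]], axis=1)

# (4) supports any indexing the coordinates support:
sub_coords, sub_bounds = coords[::2], bounds[::2]  # fence posts cannot express this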

jmilloy (Collaborator, Author) commented Jan 16, 2020

> That means DataSource nodes have to implement another function, get_segments(self), where the default is an empty dictionary, indicating that the native coordinates are points.

Currently it is possible to specify the segment type and podpac detects the segment lengths automatically. Do you still want to do that autodetection?

jmilloy (Collaborator, Author) commented Jan 16, 2020

Uniform segment lengths:

coordinates = Coordinates([lat, lon, time], dims=['lat', 'lon', 'time'])
segments = {
    'lat': (1.0, 'midpoint'),
    'lon': (1.0, 'midpoint'),
    'time': ('1,D', 'left')
}

Uniform time segments, lat/lon points:

coordinates = Coordinates([lat, lon, time], dims=['lat', 'lon', 'time'])
segments = {
    'time': ('1,D', 'left')
}

Uniform time segments, fully defined lat/lon segments:

coordinates = Coordinates([lat, lon, time], dims=['lat', 'lon', 'time'])
segments = {
    'lat': [(0, 1), (1, 2), (2, 4)],
    'lon': [(0.0, 0.1), (1.0, 1.1), (2.0, 2.1)],
    'time': ('1,D', 'midpoint')
}

Autodetect segments (for convenience):

coordinates = Coordinates([lat, lon, time], dims=['lat', 'lon', 'time'])
segments = {
    'lat': 'midpoint',
    'lon': 'midpoint',
    'time': 'left'
}

This maps to a helper function, e.g.:

>>> auto_bounds([0, 1, 2, 3], mode='midpoint')
(1.0, 'midpoint')

>>> auto_bounds([0, 1, 3, 11], mode='midpoint')
array([(-0.5, 0.5), (0.5, 2.0), (2.0, 7.0), (7.0, 15.0)])

>>> auto_bounds([0, 1, 3, 11], mode='left')
array([(0, 1), (1, 3.0), (3.0, 11.0), (11.0, 18.0)])
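A sketch of what that helper could look like (the name auto_bounds and the edge-extrapolation rule are my assumptions; the actual autodetection may differ):

import numpy as np

def auto_bounds(coords, mode='midpoint'):
    coords = np.asarray(coords, dtype=float)
    deltas = np.diff(coords)
    if np.allclose(deltas, deltas[0]):
        return (deltas[0], mode)  # uniform spacing: simple (length, position)
    if mode == 'midpoint':
        mids = (coords[:-1] + coords[1:]) / 2
        lo = np.concatenate([[coords[0] - deltas[0] / 2], mids])
        hi = np.concatenate([mids, [coords[-1] + deltas[-1] / 2]])
    elif mode == 'left':
        lo = coords
        hi = np.concatenate([coords[1:], [coords[-1] + deltas[-1]]])
    else:
        raise ValueError("unsupported mode: %s" % mode)
    return np.stack([lo, hi], axis=1)  # (n, 2) bounds array

This reproduces the midpoint examples above; the final bound in 'left' mode depends on how the last segment is extrapolated, and the rule here (repeat the last spacing) may not match the example exactly.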

Lastly, as far as a technical spec goes, I might make a light UniformSegments class so that the uniform segment tuples can be "indexed" the same as the array segment bounds:

# segment bounds
>>> segments['lat']
array([(0.0, 1.0), (1.0, 2.0), (2.0, 4.0), (4.0, 5.0)])
>>> segments['lat'][1:3]
array([(1.0, 2.0), (2.0, 4.0)])

# uniform segments
>>> segments['lat']
UniformSegments(1.0, 'midpoint')
>>> segments['lat'][1:3]
UniformSegments(1.0, 'midpoint')
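A minimal sketch of that class (illustrative only):

class UniformSegments(object):
    """Uniform segment length and position, indexable like a bounds array."""

    def __init__(self, length, position):
        self.length = length
        self.position = position  # 'left', 'midpoint', or 'right'

    def __getitem__(self, index):
        # Uniform segments are invariant under indexing: any subset of the
        # coordinates still has the same segment length and position.
        return UniformSegments(self.length, self.position)

    def __repr__(self):
        return "UniformSegments(%r, %r)" % (self.length, self.position)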

mpu-creare (Contributor) commented
I like your proposal of doing (1) and (4).

So, for dependent coordinates, instead of (n, 2) and (m, 2) for lat and lon, you could have (n*m, 4) and (n*m, 4) to give the lat/lon coordinates of the boxes surrounding each node. And then, as you pointed out, you could have (n*m, x) for an x-sided polygon.
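A toy numpy sketch of that flattening (the half-width and the lat values are arbitrary):

import numpy as np

n, m = 2, 3
lat = np.arange(n * m, dtype=float).reshape(n, m)  # toy dependent lat grid
half = 0.5                                          # assumed box half-width
# Four corner lats per node, flattened to (n*m, 4); an x-sided polygon
# would generalize this to (n*m, x).
lat_corners = lat.ravel()[:, None] + np.array([-half, -half, half, half])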

Thanks for the examples above. I think it's worth thinking about stacked coordinates as well...

jmilloy (Collaborator, Author) commented Jan 16, 2020

Okay, let me think about that a little bit.

Stacked coordinates with uniform rectangular regions are the same:

coordinates = Coordinates([[lat, lon]], dims=['lat_lon'])
segments = {
    'lat': (1.0, 'midpoint'),
    'lon': (2.0, 'midpoint'),
}

Stacked coordinates with nonuniform rectangular regions could be the same, too:

coordinates = Coordinates([[lat, lon]], dims=['lat_lon'])
segments = {
    'lat': [(0, 1), (1, 2), ...],
    'lon': [(10, 20), (20, 30), ...]
}

Polygons in general can be the same as the dependent coordinates, except that only 1-d coordinates and 1-d boundary entries are allowed:

coordinates = Coordinates([[lat, lon]], dims=['lat_lon'])
segments = {
    'lat': [latbounds_0, latbounds_1, ..., latbounds_n],
    'lon': [lonbounds_0, lonbounds_1, ..., lonbounds_n]
}

(where each latbounds_i, lonbounds_i pair is a 1-d array of the same length, but all of the pairs need not be the same length). Note: maybe if all of the entries are length 2, it is rectangular bounds; if all are length > 2, it is polygons; otherwise it is invalid.
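That check could be a one-liner; a sketch (function name hypothetical):

def boundary_kind(lat_bounds, lon_bounds):
    # Each latbounds_i/lonbounds_i pair is assumed to already have
    # matching lengths; this only classifies rectangles vs polygons.
    lengths = {len(b) for b in lat_bounds} | {len(b) for b in lon_bounds}
    if lengths == {2}:
        return 'rectangles'
    if all(n > 2 for n in lengths):
        return 'polygons'
    raise ValueError("mixed rectangular and polygon boundary entries")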

For a circle we would need a new type, maybe:

coordinates = Coordinates([[lat, lon]], dims=['lat_lon'])
segments = {
    'lat': (1.0, 'radius'),
    'lon': (1.0, 'radius')
}

So then there is the question of whether the dictionary should be nested to match the coordinates instead, e.g.

coordinates = Coordinates([[lat, lon], time], dims=['lat_lon', 'time'])
segments = {
    'lat_lon': {
        'lat': (1.0, 'midpoint'),
        'lon': (2.0, 'midpoint')
    },
    'time': ('1,D', 'left')
}

I'll have to see what is better in practice. The nested version will make it easier to grab the segment info for stacked coordinates in one go (e.g. segments['lat_lon']), which will be good for the interpolator, plus it will allow convenient shortcuts such as:

# square regions
coordinates = Coordinates([[lat, lon], time], dims=['lat_lon', 'time'])
segments = {
    'lat_lon': (1.0, 'midpoint'),
    'time': ('1,D', 'left')
}

# circular regions
coordinates = Coordinates([[lat, lon], time], dims=['lat_lon', 'time'])
segments = {
    'lat_lon': (1.0, 'radius'),
    'time': ('1,D', 'left')
}

But on the other hand, there can be stacked coordinates with mixed regions, such as lat_lon_time where the spatial part is circular but the time is a segment or even a point. In these cases we probably have to list all three separately anyway, and then the nested version seems unnecessarily cumbersome.

coordinates = Coordinates([[lat, lon, time]], dims=['lat_lon_time'])

# flat
segments = {
    'lat': (1.0, 'radius'),
    'lon': (1.0, 'radius'),
    'time': ('1,D', 'left')
}

# nested
segments = {
    'lat_lon_time': {
        'lat': (1.0, 'radius'),
        'lon': (1.0, 'radius'),
        'time': ('1,D', 'left')
    }
}
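If we end up supporting both forms, a small (hypothetical) adapter could normalize nested to flat so the interpolator only has to handle one:

def flatten_segments(segments):
    flat = {}
    for key, value in segments.items():
        if isinstance(value, dict):
            flat.update(value)   # nested stacked entry, e.g. 'lat_lon_time'
        else:
            flat[key] = value    # already flat (or a stacked shortcut)
    return flat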

We'll see.

mpu-creare (Contributor) commented
I really like the radius type -- that means we can add it as an extension later. It also supports ellipses!

I think I prefer the flat structure... we can always use udims in the interpolator, and we'll know the stacking situation from the coordinates. The biggest disadvantage in my mind is that given a segments dictionary, you can't know the stacking situation... which suggests nested is better.

@mlshapiro any opinions?

mpu-creare (Contributor) commented
See #395.
