
Move ctype/segments from Coordinates to DataSource #357

Closed
jmilloy opened this issue Jan 13, 2020 · 11 comments
jmilloy (Collaborator) commented Jan 13, 2020

I think we should remove segment_lengths from Coordinates.

  • Users would only be able to evaluate point coordinates. I think this makes sense: it would be simpler for users and much simpler to code and maintain. (Note that when a user wants to average over segments, an explicit convolution is better and is already implemented.)

  • Data sources are segmented, not coordinates. The segment information would be moved to the DataSource node alongside the native_coordinates. It would be very clear that the interpolator is responsible for handling segmented data sources correctly, and I believe this is the only use case for segments. I think this would actually be fairly straightforward in most cases (it would make #238, "Interpolation ignores ctype", less daunting).

  • We would not need to transform the segments, which I don't believe is possible to do reasonably. See #252, "Convert segment_lengths when transforming lat and lon coordinates".

  • The Coordinates class would be much simpler.

This would really need to be done in 2.0. I note that #252 and #238 are included in 2.0.

@jmilloy added the discussion and backwards-incompatible labels Jan 13, 2020
@jmilloy self-assigned this Jan 13, 2020
jmilloy (Collaborator, Author) commented Jan 13, 2020

Ah, I guess I brought this up before in #266. We've never really discussed it.

mpu-creare (Contributor) commented
This kept getting pushed off, but I think you've finally convinced me.
Since we always do the interpolation in the source's coordinate system, you are correct that we'd never have to convert segments if we keep them with the data source. Please go ahead and make it so.

I think my main question would be how the segment information would be propagated through an algorithm pipeline. I think metadata in the xarray DataArray should suffice, though we'd have to think carefully about how that would look.

The reason this is important is that, fundamentally, the question is: what does the data point actually represent? Is it data at a point, or over an area? Is it a probe in the ground, or the active area of an instrument's pixel? Is it the value of the sensor at a point in time, or the number of counts in an hour, or a running count? That type of thing.

jmilloy (Collaborator, Author) commented Jan 14, 2020

> I think my main question would be how the segment information would be propagated through an algorithm pipeline.

AFAIK, this information is not currently propagated in any way, so this would be a new feature.

Are you thinking something like a dictionary mapping data sources by name to their segment information? Like a flat dictionary with the same names we use in the pipeline definitions? I could write out a spec.

mpu-creare (Contributor) commented
Yeah, some sort of dictionary.

We can put it in the attrs of the xarray DataArray, or pass it around like we do the coordinates and make it optional. Either way, the interpolator will know where to pick it up:

{
    'segment_lengths': {
        'lat': 0.5, 'lon': 0.25, 'time': '1,D'
    },
    'segment_position': {
        'lat': 'left', 'lon': 'middle', 'time': 'right'
    }
}

Then if it's missing, the defaults are 0 (a point) and 'midpoint'.
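For concreteness, a minimal sketch of how an interpolator might read these attrs with those defaults (the attrs dict here is illustrative, not actual podpac API):

attrs = {
    'segment_lengths': {'lat': 0.5, 'lon': 0.25, 'time': '1,D'},
    'segment_position': {'lat': 'left', 'lon': 'middle', 'time': 'right'},
}

# Defaults when a dimension is missing: 0 (a point) and 'midpoint'.
length = attrs.get('segment_lengths', {}).get('alt', 0)
position = attrs.get('segment_position', {}).get('alt', 'midpoint')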

That would be the basics for a uniform grid, or a point sensor with a defined area of effect (e.g. COSMOS). (It would be nice if we could support a circular footprint as well; maybe setting 'lat_lon': 5 would imply a radius of 5.) This approach is quite storage-efficient and probably covers 90% of the cases.

For greater generality, e.g. a non-uniform grid, say with lat=[0, 1, 3], you could have:

'segment_coords': {  # or segment_bounds?
    'lat': [-0.5, 0.5, 2, 4], ...
}

And then for dependent coordinates you can have an N-D array of coordinates describing the boundary of each point (so the size is shape + 1 in each dimension: fence posts).
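A minimal numpy sketch of that fence-post layout, assuming boundaries are stored as corner arrays with one extra "post" per dimension (the linear lat grid is just illustrative):

import numpy as np

n, m = 3, 4
i, j = np.meshgrid(np.arange(n), np.arange(m), indexing='ij')
lat = 10.0 + 0.5 * i + 0.1 * j           # dependent lat values, shape (3, 4)

# Corner posts: one extra row and column of boundary points.
ib, jb = np.meshgrid(np.arange(n + 1) - 0.5, np.arange(m + 1) - 0.5, indexing='ij')
lat_bounds = 10.0 + 0.5 * ib + 0.1 * jb  # shape (4, 5), i.e. lat.shape + 1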

Actually, you could probably implement segment_coords as a podpac.Coordinates object -- that would make a lot of sense.

I'd have to think more about how this would affect stacked coordinates.

That means DataSource nodes have to implement another function, get_segments(self), where the default is an empty dictionary, indicating that the native coordinates are points.

Now the segment information and the coordinates are completely separated.

jmilloy (Collaborator, Author) commented Jan 16, 2020

Oh, by "propagated through an algorithm pipeline" you are just talking about how to represent the segments in the data source and pass them to the interpolator. We're basically on the same page, then. They just need to be passed to the interpolator within the DataSource; that's not a problem. I don't think they need to propagate anywhere else; they are unnecessary once the data source is evaluated.

It seems that there are 4 ways to define the segments for a single dimension:

  1. a uniform segment length (a single value) and position, e.g. coordinates [0, 1, 3] with segment length 0.5 and position 'midpoint'.
  2. an array of segment lengths (shape n) e.g. [0, 1, 3] with segment lengths [0.5, 1.0, 2.0].
  3. an array of "fence posts" (shape n+1), e.g. [0, 1, 3] with segment boundaries [-0.5, 0.5, 2, 4].
  4. an array of segment bounds (shape (n, 2)), e.g. [0, 1, 3] with segment bounds [(-0.5, 0.5), (0.5, 1.5), (2, 4)].

What do you think about supporting just (1) simple segment length and position and (4) array of bounds?

The simple length (1) is the most common, and the bounds array (4) is the most flexible and covers all use cases. Both are easy to implement and understand, and (4) generalizes to non-square N-d boundaries. Fence posts (3), by contrast, are limited because they assume the segments cover the entire space, and a segment-lengths array (2) is limited because all of the coordinates must be positioned at the same location within their segments. (Less important, but with fence posts it is not always possible to index; e.g. coordinates[::2] has no way to represent the corresponding segments. It would be good for coordinates[I] and segments[I] to work for any I.)
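To make the comparison concrete, here's a rough numpy sketch (not podpac API) of how (2) and (3) reduce to the bounds array (4), and of the indexing point:

import numpy as np

coords = np.array([0.0, 1.0, 3.0])

# (2) segment-lengths array -> (4) bounds array, assuming 'midpoint' position
lengths = np.array([0.5, 1.0, 2.0])
bounds = np.stack([coords - lengths / 2, coords + lengths / 2], axis=1)

# (3) fence posts -> (4) bounds array
posts = np.array([-0.5, 0.5, 2.0, 4.0])
bounds = np.stack([posts[:-1], posts[1:]], axis=1)

# (4) supports any indexing the coordinates support:
sub_coords, sub_bounds = coords[::2], bounds[::2]  # fence posts cannot express this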

jmilloy (Collaborator, Author) commented Jan 16, 2020

> That means DataSource nodes have to implement another function, get_segments(self), where the default is an empty dictionary, indicating that the native coordinates are points.

Currently it is possible to specify the segment type and podpac detects the segment lengths automatically. Do you still want to do that autodetection?

jmilloy (Collaborator, Author) commented Jan 16, 2020

Uniform segment lengths:

coordinates = Coordinates([lat, lon, time], dims=['lat', 'lon', 'time'])
segments = {
    'lat': (1.0, 'midpoint'),
    'lon': (1.0, 'midpoint'),
    'time': ('1,D', 'left')
}

Uniform time segments, lat/lon points:

coordinates = Coordinates([lat, lon, time], dims=['lat', 'lon', 'time'])
segments = {
    'time': ('1,D', 'left')
}

Uniform time segments, fully defined lat/lon segments:

coordinates = Coordinates([lat, lon, time], dims=['lat', 'lon', 'time'])
segments = {
    'lat': [(0, 1), (1, 2), (2, 4)],
    'lon': [(0.0, 0.1), (1.0, 1.1), (2.0, 2.1)],
    'time': ('1,D', 'midpoint')
}

Autodetect segments (for convenience):

coordinates = Coordinates([lat, lon, time], dims=['lat', 'lon', 'time'])
segments = {
    'lat': 'midpoint',
    'lon': 'midpoint',
    'time': 'left'
}

This maps to a helper function, e.g.:

>>> auto_bounds([0, 1, 2, 3], mode='midpoint')
(1.0, 'midpoint')

>>> auto_bounds([0, 1, 3, 11], mode='midpoint')
array([(-0.5, 0.5), (0.5, 2.0), (2.0, 7.0), (7.0, 15.0)])

>>> auto_bounds([0, 1, 3, 11], mode='left')
array([(0, 1), (1, 3.0), (3.0, 11.0), (11.0, 18.0)])
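A sketch of what that helper could look like (the name auto_bounds and the edge-extrapolation rule are my assumptions; the actual autodetection may differ):

import numpy as np

def auto_bounds(coords, mode='midpoint'):
    coords = np.asarray(coords, dtype=float)
    deltas = np.diff(coords)
    if np.allclose(deltas, deltas[0]):
        return (deltas[0], mode)  # uniform spacing: simple (length, position)
    if mode == 'midpoint':
        mids = (coords[:-1] + coords[1:]) / 2
        lo = np.concatenate([[coords[0] - deltas[0] / 2], mids])
        hi = np.concatenate([mids, [coords[-1] + deltas[-1] / 2]])
    elif mode == 'left':
        lo = coords
        hi = np.concatenate([coords[1:], [coords[-1] + deltas[-1]]])
    else:
        raise ValueError("unsupported mode: %s" % mode)
    return np.stack([lo, hi], axis=1)  # (n, 2) bounds array

This reproduces the midpoint examples above; the final bound in 'left' mode depends on how the last segment is extrapolated, and the rule here (repeat the last spacing) may not match the example exactly.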

Lastly, as far as a technical spec goes, I might make a light UniformSegments class so that the uniform segment tuples can be "indexed" the same as the array segment bounds:

# segment bounds
>>> segments['lat']
array([(0.0, 1.0), (1.0, 2.0), (2.0, 4.0), (4.0, 5.0)])
>>> segments['lat'][1:3]
array([(1.0, 2.0), (2.0, 4.0)])

# uniform segments
>>> segments['lat']
UniformSegments(1.0, 'midpoint')
>>> segments['lat'][1:3]
UniformSegments(1.0, 'midpoint')
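A minimal sketch of that class (illustrative only):

class UniformSegments(object):
    """Uniform segment length and position, indexable like a bounds array."""

    def __init__(self, length, position):
        self.length = length
        self.position = position  # 'left', 'midpoint', or 'right'

    def __getitem__(self, index):
        # Uniform segments are invariant under indexing: any subset of the
        # coordinates still has the same segment length and position.
        return UniformSegments(self.length, self.position)

    def __repr__(self):
        return "UniformSegments(%r, %r)" % (self.length, self.position)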

mpu-creare (Contributor) commented
I like your proposal of doing (1) and (4).

So, for dependent coordinates, instead of (n, 2) and (m, 2) for lat and lon, you could have (n*m, 4) and (n*m, 4) to give the lat/lon coordinates of the boxes surrounding each node. And then, as you pointed out, you could have (n*m, x) for an x-sided polygon.
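A toy numpy sketch of that flattening (the half-width and the lat values are arbitrary):

import numpy as np

n, m = 2, 3
lat = np.arange(n * m, dtype=float).reshape(n, m)  # toy dependent lat grid
half = 0.5                                          # assumed box half-width
# Four corner lats per node, flattened to (n*m, 4); an x-sided polygon
# would generalize this to (n*m, x).
lat_corners = lat.ravel()[:, None] + np.array([-half, -half, half, half])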

Thanks for the examples above. I think it's worth thinking about stacked coordinates as well...

jmilloy (Collaborator, Author) commented Jan 16, 2020

Okay, let me think about that a little bit.

Stacked coordinates with uniform rectangular regions are the same:

coordinates = Coordinates([[lat, lon]], dims=['lat_lon'])
segments = {
    'lat': (1.0, 'midpoint'),
    'lon': (2.0, 'midpoint'),
}

Stacked coordinates with nonuniform rectangular regions could be the same, too:

coordinates = Coordinates([[lat, lon]], dims=['lat_lon'])
segments = {
    'lat': [(0, 1), (1, 2), ...],
    'lon': [(10, 20), (20, 30), ...]
}

Polygons in general can be the same as the dependent coordinates, except that only 1-d coordinates and 1-d boundary entries are allowed:

coordinates = Coordinates([[lat, lon]], dims=['lat_lon'])
segments = {
    'lat': [latbounds_0, latbounds_1, ..., latbounds_n],
    'lon': [lonbounds_0, lonbounds_1, ..., lonbounds_n]
}

(where each latbounds_i, lonbounds_i pair is a 1-d array of the same length, but all of the pairs need not be the same length). Note: maybe if all of the entries are length 2, it is rectangular bounds; if all are length > 2, it is polygons; otherwise it is invalid.
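That check could be a one-liner; a sketch (function name hypothetical):

def boundary_kind(lat_bounds, lon_bounds):
    # Each latbounds_i/lonbounds_i pair is assumed to already have
    # matching lengths; this only classifies rectangles vs polygons.
    lengths = {len(b) for b in lat_bounds} | {len(b) for b in lon_bounds}
    if lengths == {2}:
        return 'rectangles'
    if all(n > 2 for n in lengths):
        return 'polygons'
    raise ValueError("mixed rectangular and polygon boundary entries")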

For a circle we would need a new type, maybe:

coordinates = Coordinates([[lat, lon]], dims=['lat_lon'])
segments = {
    'lat': (1.0, 'radius'),
    'lon': (1.0, 'radius')
}

So then there is the question of whether the dictionary should be nested to match the coordinates instead, e.g.

coordinates = Coordinates([[lat, lon], time], dims=['lat_lon', 'time'])
segments = {
    'lat_lon': {
        'lat': (1.0, 'midpoint'),
        'lon': (2.0, 'midpoint')
    },
    'time': ('1,D', 'left')
}

I'll have to see what is better in practice. The nested version will make it easier to grab the segment info for stacked coordinates in one go (e.g. segments['lat_lon']), which will be good for the interpolator, plus it will allow convenient shortcuts such as:

# square regions
coordinates = Coordinates([[lat, lon], time], dims=['lat_lon', 'time'])
segments = {
    'lat_lon': (1.0, 'midpoint'),
    'time': ('1,D', 'left')
}

# circular regions
coordinates = Coordinates([[lat, lon], time], dims=['lat_lon', 'time'])
segments = {
    'lat_lon': (1.0, 'radius'),
    'time': ('1,D', 'left')
}

But on the other hand, there can be stacked coordinates with mixed regions, such as lat_lon_time where the spatial part is circular but the time is a segment or even a point. In these cases we probably have to list all three separately anyway, and then the nested version seems unnecessarily cumbersome.

coordinates = Coordinates([[lat, lon, time]], dims=['lat_lon_time'])

# flat
segments = {
    'lat': (1.0, 'radius'),
    'lon': (1.0, 'radius'),
    'time': ('1,D', 'left')
}

# nested
segments = {
    'lat_lon_time': {
        'lat': (1.0, 'radius'),
        'lon': (1.0, 'radius'),
        'time': ('1,D', 'left')
    }
}
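If we end up supporting both forms, a small (hypothetical) adapter could normalize nested to flat so the interpolator only has to handle one:

def flatten_segments(segments):
    flat = {}
    for key, value in segments.items():
        if isinstance(value, dict):
            flat.update(value)   # nested stacked entry, e.g. 'lat_lon_time'
        else:
            flat[key] = value    # already flat (or a stacked shortcut)
    return flat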

We'll see.

mpu-creare (Contributor) commented
I really like the radius type -- that means we can add it as an extension later. It also supports ellipses!

I think I prefer the flat structure... we can always use udims in the interpolator, and we'll know the stacking situation from the coordinates. The biggest disadvantage in my mind is that given a segments dictionary, you can't know the stacking situation... which suggests nested is better.

@mlshapiro any opinions?

mpu-creare (Contributor) commented
See #395.
