Local hypsometric interpolation #36

erikmannerfelt · 2021-03-10T11:34:24Z

I've added a function xdem.volume.hypsometric_binning() which takes two 1D arrays and returns the following dataframe (on a glacier on Svalbard):

                                            median       mean        std   count
(227.21670532226562, 277.2167053222656] -16.151154 -16.702609   9.718108   597.0
(277.2167053222656, 327.2167053222656]  -33.600098 -29.986920  12.277790   736.0
(327.2167053222656, 377.2167053222656]  -34.610580 -33.213806   7.857803   904.0
(377.2167053222656, 427.2167053222656]  -24.568985 -24.089276   4.625535   772.0
(427.2167053222656, 477.2167053222656]  -19.961365 -19.116361   3.405295  1219.0
(477.2167053222656, 527.2167053222656]  -16.065674 -15.813945   4.325823  1415.0
(527.2167053222656, 577.2167053222656]  -10.828613 -10.168528   4.474668  1065.0
(577.2167053222656, 627.2167053222656]   -3.929382  -4.687018   4.310905  1072.0
(627.2167053222656, 677.2167053222656]    2.258606   1.073841   4.334557   821.0
(677.2167053222656, 727.2167053222656]    0.829834   0.944558   3.454747   495.0
(727.2167053222656, 777.2167053222656]    0.249207   0.142597   3.973010   295.0
(777.2167053222656, 827.2167053222656]   -1.081299  -0.164995   5.557001   298.0
(827.2167053222656, 877.2167053222656]    1.049011  -0.785810   7.675866   273.0
(877.2167053222656, 927.2167053222656]    0.102966  -1.146541   7.116261   242.0
(927.2167053222656, 977.2167053222656]   -3.863617  -5.664418   8.469879   166.0

The index is a pd.IntervalIndex instance, with nice convenience properties like df.index.mid for the midpoint of the intervals.
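For readers who want to see how a frame of this shape can be built, here is a minimal sketch in plain pandas. It is not the code of xdem.volume.hypsometric_binning(): the helper name bin_by_elevation, its arguments and the 50 m default bin height are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def bin_by_elevation(ddem, ref_dem, bin_height=50.0):
    """Hypothetical helper: summarise dDEM values per elevation bin."""
    ddem = np.asarray(ddem, dtype=float)
    ref_dem = np.asarray(ref_dem, dtype=float)

    # Keep only pixels where both arrays are finite.
    valid = np.isfinite(ddem) & np.isfinite(ref_dem)
    ddem, ref_dem = ddem[valid], ref_dem[valid]

    # Bin edges of equal height spanning the elevation range.
    edges = np.arange(ref_dem.min(), ref_dem.max() + bin_height, bin_height)
    index = pd.IntervalIndex.from_breaks(edges)  # intervals are (left, right]

    # Bin number of each pixel; the lowest pixel is clipped into the first bin.
    bin_id = np.digitize(ref_dem, edges, right=True) - 1
    bin_id = np.clip(bin_id, 0, len(index) - 1)

    stats = (
        pd.DataFrame({"bin": bin_id, "ddem": ddem})
        .groupby("bin")["ddem"]
        .agg(["median", "mean", "std", "count"])
        .reindex(range(len(index)))  # empty bins become NaN rows
    )
    stats.index = index
    return stats
```

With a frame like this, stats.index.mid gives the bin midpoints directly, which is convenient for plotting or for fitting a curve through the bins.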

What do you think of this structure?

erikmannerfelt · 2021-03-10T11:35:01Z

Aand I screwed up the double PRs again... I'm too tired to do git properly!

xdem/volume.py

… added documentation

tests/test_volume.py

xdem/volume.py

…ck to handle nans appropriately

erikmannerfelt · 2021-03-10T13:43:55Z

For the hypsometry interpolation/extrapolation, I went with the easiest possible approach using the builtin .interpolate() method in Pandas.

One could go a bit further and add polynomial estimation using RANSAC to filter bad bins. scikit-learn has an excellent implementation which would be quite easy to use. It would be one more dependency, though.

What do you guys think?
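As a rough illustration of the .interpolate() approach, here is a minimal sketch. It assumes a frame indexed by a pd.IntervalIndex with a "median" column; the helper name fill_empty_bins is hypothetical and this is not the body of interpolate_hypsometric_bins().

```python
import numpy as np
import pandas as pd


def fill_empty_bins(bins, column="median"):
    """Hypothetical helper: fill NaN bins by interpolating along elevation."""
    # Use the bin midpoints as the x-axis so that uneven bins are handled correctly.
    series = pd.Series(bins[column].to_numpy(dtype=float), index=bins.index.mid)

    # method="index" interpolates linearly against the index values.
    # limit_direction="both" also fills leading and trailing gaps, but with the
    # nearest valid value rather than a true extrapolation.
    filled = series.interpolate(method="index", limit_direction="both")

    out = bins.copy()
    out[column] = filled.to_numpy()
    return out
```

Gaps between valid bins are filled linearly, while gaps at either end simply repeat the closest valid bin, so a real extrapolation would need a different method.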

…r masked_arrays

xdem/volume.py

adehecq · 2021-03-10T14:18:24Z

xdem/volume.py

+    return output
+
+
+def interpolate_hypsometric_bins(hypsometric_bins: pd.DataFrame, height_column="median", method="polynomial", order=3,


…it only gives one per default

rhugonnet · 2021-03-10T14:46:04Z

For the hypsometry interpolation/extrapolation, I went with the easiest possible approach using the builtin .interpolate() method in Pandas.

One could go a bit further and add polynomial estimation using RANSAC to filter bad bins. scikit-learn has an excellent implementation which would be quite easy. It would be one more dependency though..

What do you guys think?

I'm fine with adding scikit-learn with RANSAC; it would be a good addition. Although, to be honest, I'm not sure it will outperform a traditional polynomial fit by much. I think the issue is mostly the parametric nature of polynomial regression rather than the mitigation of outliers. I got good performance when implementing a 1D LOWESS, which is nonparametric. It was a while ago and I had to code most of it by hand (https://github.com/iamdonovan/pyddem/blob/master/pyddem/volint_tools.py#L148); I'm not sure what is available now.

erikmannerfelt · 2021-03-10T16:07:52Z

I just added a hypsometry area calculation script. The workflow would now be:

ddem_bins = xdem.volume.hypsometric_binning(ddem, reference_dem)
interpolated_bins = xdem.volume.interpolate_hypsometric_bins(ddem_bins)

bin_area = xdem.volume.calculate_hypsometry_area(interpolated_ddem_bins, reference_dem, pixel_size=pixel_size)

# This will be a float
volume_change = (ddem_bins * bin_area).sum()

Thoughts?
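To make the arithmetic behind the last line concrete, here is a small sketch in plain NumPy with made-up numbers. None of the names below are xdem API, and taking the bin area as the reference-DEM pixel count times the squared pixel size is an assumption of this sketch.

```python
import numpy as np

# Hypothetical inputs: bin edges (m), mean elevation change per bin (m) and
# a small reference DEM (m) with square pixels of 20 m.
edges = np.array([200.0, 250.0, 300.0, 350.0])
bin_dh = np.array([-20.0, -10.0, -2.0])
reference_dem = np.array([[210.0, 240.0, 260.0], [290.0, 310.0, 340.0]])
pixel_size = 20.0

# Area of each bin: number of reference pixels in the bin times the pixel area.
pixel_count, _ = np.histogram(reference_dem, bins=edges)
bin_area = pixel_count * pixel_size**2

# Volume change: sum over bins of (elevation change x area), in cubic metres.
volume_change = float((bin_dh * bin_area).sum())
```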

adehecq · 2021-03-10T17:31:21Z

It makes sense. But what exactly is in ddem_bins and interpolated_ddem_bins (there's a typo, I think, as you called it interpolated_bins elsewhere too)?
Does it contain the elevation bins and the median/mean dh? Because I don't understand why the median dh is needed in calculate_hypsometry_area, or why the elevation bins are used in the volume calculation...

erikmannerfelt · 2021-03-11T08:53:49Z

@adehecq ddem_bins looks like this:

                                             value   count
(225.28848266601562, 275.2884826660156] -19.098885   724.0
(275.2884826660156, 325.2884826660156]  -36.254318  1034.0
(325.2884826660156, 375.2884826660156]  -31.608276   617.0
(375.2884826660156, 425.2884826660156]  -22.635315   752.0
(425.2884826660156, 475.2884826660156]  -19.853256  1294.0
(475.2884826660156, 525.2884826660156]  -15.140961  1321.0
(525.2884826660156, 575.2884826660156]  -10.407410   997.0
(575.2884826660156, 625.2884826660156]   -3.927429  1021.0
(625.2884826660156, 675.2884826660156]    2.108887   791.0
(675.2884826660156, 725.2884826660156]    1.071594   530.0
(725.2884826660156, 775.2884826660156]    0.579987   296.0
(775.2884826660156, 825.2884826660156]   -1.488831   302.0
(825.2884826660156, 875.2884826660156]    1.046814   282.0
(875.2884826660156, 925.2884826660156]    0.201599   248.0
(925.2884826660156, 975.2884826660156]   -2.662354   161.0

Indeed, interpolated_bins should have been interpolated_ddem_bins or vice versa!

The reason calculate_hypsometry_area() needs ddem_bins or interpolated_ddem_bins is to make a representative DEM:

ddem_func = scipy.interpolate.interp1d(ddem_bins.index.mid, ddem_bins.values, kind="linear", fill_value="extrapolate")

mean_dem = ref_dem - (ddem_func(ref_dem) / 2)

Would it make more sense to make ddem_bins an optional argument (even though it really should be included at all times)?
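A small sketch of the "representative DEM" idea with made-up numbers. It swaps scipy's interp1d for np.interp to stay dependency-free, which means values outside the bin midpoints are held constant instead of being extrapolated; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical bin midpoints (m) and elevation change per bin (m).
bin_mid = np.array([250.0, 350.0, 450.0])
bin_dh = np.array([-30.0, -20.0, -10.0])

ref_dem = np.array([250.0, 300.0, 450.0])

# Elevation change expected at each reference elevation (linear between midpoints).
# Unlike interp1d(..., fill_value="extrapolate"), np.interp holds the end values constant.
dh_at_ref = np.interp(ref_dem, bin_mid, bin_dh)

# Shift the reference surface by half of the expected elevation change.
mean_dem = ref_dem - dh_at_ref / 2
```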

adehecq · 2021-03-11T09:28:17Z

But shouldn't calculate_hypsometry_area depend only on the bin edges and a reference DEM? In that case, I would rather pass just the elevation bins instead of ddem_bins.
Or do you use the ddem elevation for the calculation of the hypsometry? If so, yes, it should be optional, because I believe that by default only the reference elevation should be used (an elevation that we trust, as opposed to an elevation that might have outliers).

erikmannerfelt · 2021-03-12T15:38:42Z

Is this approved, @rhugonnet and @adehecq, or should we discuss something more?

…hypsometric

…to local_hypsometric

…m' bin options

erikmannerfelt · 2021-03-15T12:43:40Z

I've added "equal area" binning in 3b33d4b, according to @rhugonnet's approach in #39. Right now, the terminology is a bit all over the place. This is how I've made the options, but feel free to suggest changes:

volume.hypsometric_binning(ddem, ref_dem, bins=50, kind="see below")

"equal_height": The elevation range of each bin is 50 georeferenced units.

"fixed_count": There are 50 bins in total with equal elevation range.

"equal_area": There are 50 bins in total, and all cover almost exactly the same area (#39).
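One way to read the three kinds is as three recipes for the bin edges. The sketch below is an illustration under that reading, not the code from 3b33d4b; the helper name bin_edges is hypothetical and the quantile-based edges for "equal_area" are an assumption.

```python
import numpy as np


def bin_edges(ref_dem, bins, kind):
    """Hypothetical helper returning bin edges for the three kinds above."""
    elevation = np.asarray(ref_dem, dtype=float)
    elevation = elevation[np.isfinite(elevation)]
    low, high = elevation.min(), elevation.max()

    if kind == "equal_height":
        # Every bin spans `bins` georeferenced units.
        return np.arange(low, high + bins, bins)
    if kind == "fixed_count":
        # `bins` bins of equal elevation range.
        return np.linspace(low, high, bins + 1)
    if kind == "equal_area":
        # `bins` bins holding (almost) the same number of pixels.
        return np.quantile(elevation, np.linspace(0, 1, bins + 1))
    raise ValueError(f"Unknown kind: {kind}")
```

Note that with "equal_height" the number of bins depends on the elevation range, whereas the other two kinds always return bins + 1 edges.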

To exemplify the outputs (the values don't mean much because I've neither coregistered nor masked the DEMs):

In [1]: import xdem

In [2]: ref_dem = xdem.dem.DEM(xdem.examples.FILEPATHS["longyearbyen_ref_dem"])
No metadata could be read from filename.

In [3]: tba_dem = xdem.dem.DEM(xdem.examples.FILEPATHS["longyearbyen_tba_dem"])
No metadata could be read from filename.

In [4]: ddem = ref_dem.data - tba_dem.data

In [5]: xdem.volume.hypsometric_binning(ddem, ref_dem.data, bins=10, kind="equal_height")
Out[5]:
                                            value     count
(8.6741304397583, 18.6741304397583]      -1.760605  15780.0
(18.6741304397583, 28.6741304397583]     -2.291516  45596.0
(28.6741304397583, 38.6741304397583]     -1.547386  33645.0
(38.6741304397583, 48.6741304397583]     -1.166397  26651.0
(48.6741304397583, 58.6741304397583]     -1.175062  27800.0
...                                            ...      ...
(978.6741304397583, 988.6741304397583]   -1.051819    265.0
(988.6741304397583, 998.6741304397583]   -0.553040    203.0
(998.6741304397583, 1008.6741304397583]  -0.421906    148.0
(1008.6741304397583, 1018.6741304397583]  0.948212     68.0
(1018.6741304397583, 1028.6741304397583]  0.209137     16.0

[102 rows x 2 columns]

In [6]: xdem.volume.hypsometric_binning(ddem, ref_dem.data, bins=10, kind="fixed_count")
Out[6]:
                                            value     count
(8.6741304397583, 109.9881199936731]    -1.807365  242028.0
(109.9881199936731, 211.3021095475879]  -2.855278  142598.0
(211.3021095475879, 312.6160991015027]  -2.900841  183366.0
(312.6160991015027, 413.9300886554175]  -3.097565  177362.0
(413.9300886554175, 515.2440782093322]  -3.637207  160876.0
(515.2440782093322, 616.5580677632471]  -3.866760  144277.0
(616.5580677632471, 717.8720573171619]  -3.371338  121107.0
(717.8720573171619, 819.1860468710767]  -3.312836   90834.0
(819.1860468710767, 920.5000364249914]  -2.082214   41171.0
(920.5000364249914, 1021.8140259789062] -0.364868    8401.0

In [7]: xdem.volume.hypsometric_binning(ddem, ref_dem.data, bins=10, kind="equal_area")
Out[7]:
                                            value     count
(8.6741304397583, 51.949240112304686]    -1.686300  131202.0
(51.949240112304686, 126.75953216552733] -2.185017  131202.0
(126.75953216552733, 216.52808837890626] -2.811653  131202.0
(216.52808837890626, 287.96279296875]    -2.922958  131202.0
(287.96279296875, 362.9493103027344]     -3.012344  131202.0
(362.9493103027344, 439.10780029296876]  -3.240845  131202.0
(439.10780029296876, 523.1372314453124]  -3.678909  131202.0
(523.1372314453124, 615.8244506835937]   -3.877625  131202.0
(615.8244506835937, 726.0829772949219]   -3.362030  131202.0
(726.0829772949219, 1021.8140268789062]  -2.746582  131202.0

EDIT: The text below was written before the name change "equal_count" -> "fixed_count"

One potential limitation in this terminology is that "equal_count" and the "count" column are not really related. "equal_count" is an equal bin count, while the "count" column is the pixel/value count for each bin! Is this confusing or should we just leave it as is?

erikmannerfelt · 2021-03-15T12:48:35Z

Crazy how much a 1 minute bathroom break's reflection can matter.

What about ~~"fixed_height"~~, "fixed_count" and "equal_area" instead?

Especially the kind "equal_count" doesn't make much sense now. Equal to what? (what I mean is that the bin count is fixed!)

EDIT: I've replaced all instances of "equal_count" with "fixed_count". I haven't changed "equal_height".

erikmannerfelt · 2021-03-15T13:05:40Z

Oh no, I just found a large unintended feature in this approach. The "count" is not analogous to "area" when there are nans in the dDEM! That is why I didn't want to mix those terms in the first place, I recall...

What is your opinion on this, @rhugonnet and @adehecq? To be clear, the "equal_area" binning kind makes it so that bins are created with an equal number of finite dDEM values, not an equal glacier area. That might be what we want, but it wasn't what I intended!
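The mismatch is easy to reproduce with a toy example. The arrays and the pixel area below are made up for illustration and the variable names are not xdem API.

```python
import numpy as np

# Hypothetical 1D example: reference elevations (m) and a dDEM with two data gaps.
ref_dem = np.array([100.0, 110.0, 120.0, 130.0, 210.0, 220.0])
ddem = np.array([-5.0, np.nan, np.nan, -4.0, -1.0, -2.0])
edges = np.array([100.0, 200.0, 300.0])
pixel_area = 25.0  # assumed 5 m x 5 m pixels

# Pixels per bin according to the reference DEM (what the glacier area depends on).
ref_count, _ = np.histogram(ref_dem, bins=edges)

# Pixels per bin that have a finite dDEM value (what the "count" column reports).
finite_count, _ = np.histogram(ref_dem[np.isfinite(ddem)], bins=edges)

area_from_ref = ref_count * pixel_area
area_from_count = finite_count * pixel_area
```

Here the lower bin covers four reference pixels but only two finite dDEM values, so an area derived from "count" would be half of the real one.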

erikmannerfelt · 2021-03-15T13:14:04Z

Another suggestion for alternate names:
"equal_height" like before
"fixed_bincount" for fixed bin count with equal height interval
"equal_count" for a fixed bin count with equal point counts (thus consistent in terminology with the "count" column)

rhugonnet · 2021-03-15T13:31:23Z

What is your opinion on this, @rhugonnet and @adehecq? To be clear, the "equal_area" binning kind makes it so that bins are created with an equal number of finite dDEM values, not an equal glacier area. That might be what we want, but it wasn't what I intended!

Intuitively, I expect the most "performant" one to be a uniform (equal_count) binning accounting for NaNs. That is to say, equal count for the hypsometric binning, independently of what values are passed in the ddem array.
Let me take a bathroom break and see how much it enlightens me on the naming :D

erikmannerfelt · 2021-03-15T13:37:46Z

Intuitively, I expect the most "performant" one to be a uniform (equal_count) binning accounting for NaNs. That is to say, equal count for the hypsometric binning, independently of what values are passed in the ddem array.
Let me take a bathroom break and see how much it enlightens me on the naming :D

Please correct me if I'm mistaken, @rhugonnet , but that implies the bins have to be calculated using the reference DEM elevations, not the mean elevations. This may be problematic with large change, like we discussed before:

If a glacier has lost 100 m at 1000-1100 m a.s.l. between 1900 and 2000, the 2000 area may be 0, so the apparent volume change will be 0.

rhugonnet · 2021-03-15T13:52:08Z

FYI, pandas naming:
qcut (quantile-cut) = "equal_count"; cut (simple cut) = "any bin range customized binning"
Not very useful...
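For reference, the difference between the two pandas functions on a small, skewed set of made-up elevations:

```python
import pandas as pd

elevation = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 50.0, 100.0])

# cut: two bins of equal width, so pixel counts per bin may differ a lot.
by_width = pd.cut(elevation, bins=2)
# qcut: two bins split at the median, so counts are (nearly) equal.
by_quantile = pd.qcut(elevation, q=2)

# Count the values per bin, ordered from the lowest to the highest bin.
width_counts = by_width.value_counts().sort_index().tolist()
quantile_counts = by_quantile.value_counts().sort_index().tolist()
```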

The problem, I think, is that "fixed_bincount" is a sub-category of equal height! And equal_area is actually also based on height binning (although it translates into area). So using "height" or "area" is a bit confusing to me. You have only 1 category (height), and then several ways of dividing it.
It could be as simple as this:

binning_type = ['custom', 'fixed', 'quantile', 'count']  # you can bin by a fixed number of heights or areas
binning_width = [[], '50 m', '10%', 100]

For count, we could let the user work out the bin height in one or two lines (elevation range / bin_count) and use "fixed" if desired?

rhugonnet · 2021-03-15T13:58:29Z

Intuitively, I expect the most "performant" one to be a uniform (equal_count) binning accounting for NaNs. That is to say, equal count for the hypsometric binning, independently of what values are passed in the ddem array.
Let me take a bathroom break and see how much it enlightens me on the naming :D

Please correct me if I'm mistaken, @rhugonnet , but that implies the bins have to be calculated using the reference DEM elevations, not the mean elevations. This may be problematic with large change, like we discussed before:

If a glacier has lost 100 m at 1000-1100 m a.s.l. between 1900 and 2000, the 2000 area may be 0, so the apparent volume change will be 0.

I'm not sure this makes any difference: if the elevation change is 0 at lower elevation, you can propagate that 0 to the rest of the bin without any issue.
One of the advantages of the hypsometric method is to mitigate the effects of no-data in the dDEM. By binning according to an external reference with almost no data gaps, you propagate the change to where you are missing it. So the count should absolutely be that of the reference DEM with the fewest data gaps, no?

adehecq · 2021-03-15T14:08:58Z

I would say Romain's last suggestion is the most intuitive? So
'fixed' -> fixed bin height
'quantile' -> bin size is based on quantiles
'count' -> a fixed number of bins
If needed to make it clearer, we could have 'fixed' -> 'fixed_height', 'count' -> 'fixed_count'?

erikmannerfelt · 2021-03-15T14:11:34Z

Please correct me if I'm mistaken, @rhugonnet , but that implies the bins have to be calculated using the reference DEM elevations, not the mean elevations. This may be problematic with large change, like we discussed before:
If a glacier has lost 100 m at 1000-1100 m a.s.l. between 1900 and 2000, the 2000 area may be 0, so the apparent volume change will be 0.

I'm not sure this makes any difference: if the elevation change is 0 at lower elevation, you can propagate that 0 to the rest of the bin without any issue.
One of the advantages of the hypsometric method is to mitigate the effects of no-data in the dDEM. By binning according to an external reference with almost no data gaps, you propagate the change to where you are missing it. So the count should absolutely be that of the reference DEM with the fewest data gaps, no?

Sorry, @rhugonnet, but I honestly don't understand what you mean. If we have a glacier:

Elevation	dH	Area_pre	Area_post
1000-1100	-100	10000	0
1100-1200	-70	12000	5000
1300-1400	-50	15000	10000

etc., the integrated volume change in the first elevation bin would be 0, assuming one uses the modern hypsometry. How and where is this propagated?

erikmannerfelt · 2021-03-15T14:12:07Z

I would say Romain's last suggestion is the most intuitive? So
'fixed' -> fixed bin height
'quantile' -> bin size is based on quantiles
'count' -> a fixed number of bins
If needed to make it clearer, we could have 'fixed' -> 'fixed_height', 'count' -> 'fixed_count'?

I agree with "fixed", "quantile" and "count", like @rhugonnet suggested. I'll implement that!

erikmannerfelt · 2021-03-15T14:16:36Z

Done in 5646dca

rhugonnet · 2021-03-15T14:31:08Z

Sorry, @rhugonnet, but I honestly don't understand what you mean. If we have a glacier:

Elevation dH Area_pre Area_post
1000-1100 -100 10000 0
1100-1200 -70 12000 5000
1300-1400 -50 15000 10000
etc., the integrated volume change in the first elevation bin would be 0, assuming one uses the modern hypsometry. How and where is this propagated?

The date of the reference DEM just shifts all bins, you don't really lose any elevation change information:
If you use a DEM_pre:

Elevation dH Area_pre
900-1000 0 0
1000-1100 -100 10000
1100-1200 -70 10000
1300-1400 -50 10000
1400-1500 -10 10000
1500-1600 0 0

If you use a DEM_post:

Elevation dH Area_pre
800-900 0 0
900-1000 -100 10000
1000-1100 -70 10000
1100-1200 -50 10000
1300-1400 -10 10000
1400-1500 0 0

Hope this makes sense in relation to the comment !

erikmannerfelt · 2021-03-15T14:43:16Z

Ooooooh, of course! Wow, now I feel stupid. I've made things so hard for myself the whole way.

I'll correct this (meaning this will be true: "area" == "count_sum" * "pixel_area") and set the default calculate_hypsometry_area() surface to be the reference_dem, just like you've both suggested for a week now!

…he surface

erikmannerfelt · 2021-03-19T13:16:28Z

Since the latest commit, I daresay this is ready. We solved the issue I had with reference/mean elevations (turns out I was just poorly informed) and the functions are now in a working state.

Once merged, I can start implementing them in the DEMCollection class.

Is there anything else we should consider first?

adehecq · 2021-03-19T13:32:07Z

Feel free to merge and move on with the DEMCollection class. Maybe we should have a meeting next week to make sure we agree on the features to implement in xdem.

erikmannerfelt · 2021-03-19T13:34:58Z

Okay! Sounds great to have a discussion about moving forward next week.

Added hypsometry binning function

5e23674

erikmannerfelt added the enhancement Feature improvement or request label Mar 10, 2021

erikmannerfelt requested review from adehecq and rhugonnet March 10, 2021 11:34

erikmannerfelt self-assigned this Mar 10, 2021

rhugonnet reviewed Mar 10, 2021
