Add robust polynomial, sum of sinusoids fitting #151

rhugonnet · 2021-06-22T08:37:00Z

Polynomial fitting + Sum of sin fitting + Across/Along-track sampling

Resolves #50

coefs, deg = xdem.spatial_tools.robust_polynomial_fit(x, y, estimator='Theil-Sen')

2 robust polynomial fitting solutions

1/ One combining sklearn.linear_model and sklearn.preprocessing.PolynomialFeatures: fits a polynomial with one of the estimators Linear Regression (ordinary least squares, the non-robust baseline), Theil-Sen (median approach), RANSAC or Huber.
2/ One using scipy.optimize.least_squares with a specific loss function; some of these losses are quite robust to outliers (see example in tests).
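For readers unfamiliar with the first solution, here is a minimal sketch of how a polynomial can be fitted by chaining PolynomialFeatures with a scikit-learn linear estimator. It is not the code of this PR: the function name `fit_polynomial_sklearn`, the `ESTIMATORS` mapping and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import (
    HuberRegressor,
    LinearRegression,
    RANSACRegressor,
    TheilSenRegressor,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical name-to-estimator mapping, for illustration only.
ESTIMATORS = {
    "Linear": lambda: LinearRegression(),
    "Theil-Sen": lambda: TheilSenRegressor(random_state=42),
    "RANSAC": lambda: RANSACRegressor(random_state=42),
    "Huber": lambda: HuberRegressor(),
}


def fit_polynomial_sklearn(x, y, degree, estimator="RANSAC"):
    """Fit a polynomial of the given degree and return the fitted pipeline."""
    # PolynomialFeatures expands x into the columns [x, x^2, ..., x^degree];
    # the linear estimator then fits one coefficient per column plus an intercept.
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        ESTIMATORS[estimator](),
    )
    model.fit(np.asarray(x).reshape(-1, 1), np.asarray(y))
    return model


# Synthetic data (assumed): a quadratic with small noise and 10% gross outliers.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)
y_true = 1.0 + 2.0 * x - 0.5 * x**2
y = y_true + rng.normal(scale=0.05, size=x.size)
y[::10] += 20.0

model = fit_polynomial_sklearn(x, y, degree=2, estimator="RANSAC")
y_pred = model.predict(x.reshape(-1, 1))
# Largest deviation of the fitted curve from the outlier-free polynomial.
max_error = float(np.max(np.abs(y_pred - y_true)))
```

Swapping the `estimator` argument is enough to compare the four estimators on the same data.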

Both implemented in the same function

Linear can be solved with either scipy or sklearn, while Theil-Sen, RANSAC and Huber are only available with sklearn.
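The scipy route can be sketched in a similar way. The snippet below is an illustrative assumption, not the PR's implementation: `fit_polynomial_scipy`, the choice of the `soft_l1` loss, the `f_scale` value and the synthetic data are all made up for the example. It starts from the ordinary least-squares solution and refines it with a loss that penalizes large residuals less than the square does.

```python
import numpy as np
from scipy.optimize import least_squares


def fit_polynomial_scipy(x, y, degree, loss="soft_l1", f_scale=0.1):
    """Return polynomial coefficients, highest degree first (np.polyval order)."""

    def residuals(coefs):
        return np.polyval(coefs, x) - y

    # Initial guess: plain least squares, which outliers may bias.
    x0 = np.polyfit(x, y, degree)
    result = least_squares(residuals, x0, loss=loss, f_scale=f_scale)
    return result.x


# Synthetic data (assumed): a quadratic with small noise and 10% gross outliers.
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 200)
true_coefs = np.array([-0.5, 2.0, 1.0])
y = np.polyval(true_coefs, x) + rng.normal(scale=0.05, size=x.size)
y[::10] += 20.0

robust_coefs = fit_polynomial_scipy(x, y, degree=2, loss="soft_l1")
# loss="linear" is the standard, non-robust least-squares problem.
plain_coefs = fit_polynomial_scipy(x, y, degree=2, loss="linear")
```

On this data the plain fit absorbs the outliers into its constant term, while the `soft_l1` fit stays close to the coefficients used to generate the data.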

Input/Output

Simple input: x, y as input, choice of estimator, cost function cost_func (by default, median absolute error), and a few other options described in the docs.
Simple output: polynomial degree as integer, and coefficients in a vector.

Choosing the best polynomial

Simply using the polynomial that has the lowest cost (less spread between true and predicted values) is known to not be a good approach for choosing the optimal degree, as it can lead to overfitting. Here I wrote a simple function that selects the polynomial of smallest degree within a percentage margin of the "best cost" found by the fit.

So, for instance, if degree 1 fit has a cost of 100, degree 2 fit a cost of 20, degree 3 fit a cost of 5 and then degrees 4 to 6 fits have a cost between 4 and 5, the function (with a margin of 20% by default) will select degree 3 as the optimal solution.
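The selection rule can be sketched as below. This is one possible reading of "within a percentage margin of the best cost" (cost no larger than the best cost increased by the margin), and `select_degree` is a hypothetical name. The costs for degrees 4 to 6 are assumed values inside the range quoted above.

```python
def select_degree(costs, margin_percent=20.0):
    """Return the smallest degree whose cost lies within the margin of the best cost.

    costs[i] is the cost of the fit of degree i + 1.
    """
    best_cost = min(costs)
    threshold = best_cost * (1.0 + margin_percent / 100.0)
    # The best cost itself always passes, so a match is guaranteed.
    return next(i + 1 for i, cost in enumerate(costs) if cost <= threshold)


# Degrees 1 to 6; the last three values are assumptions for the example.
costs = [100.0, 20.0, 5.0, 4.6, 4.4, 4.3]
degree_default_margin = select_degree(costs)
degree_tight_margin = select_degree(costs, margin_percent=10.0)
```

With these values the 20% margin keeps degree 3, while a tighter 10% margin moves the choice to degree 4.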

EDIT:

TODOLIST to finalize PR

* Fix tests
* Re-organize estimator subfunction wrappers to be more generic
* Refactor poly and sumofsin functions to call new wrapper functions
* Add a kwargs argument whose entries can be forwarded to the relevant subfunction call (same logic as in spatialstats.py)
* Fix circular import
* Move functions to new module fit.py, following "Sort the mess in spatial_tools.py" (#157)


erikmannerfelt · 2021-06-22T09:21:22Z

Nice! I presume this is the first step toward your bias corrections?

Is scikit-learn already installed due to some other dependency? It is not a direct dependency in the environment file.

rhugonnet · 2021-06-22T09:28:08Z

> Nice! I presume this is the first step toward your bias corrections?
>
> Is scikit-learn already installed due to some other dependency? It is not a direct dependency in the environment file.

Yes, a first step towards a series of BiasCorr classes. I thought it would make more sense to partition it, especially as those 1D fitting functions can be used for other applications!

Oops! Didn't check, it was in my local xdem environment for some reason.
I guess that then the main question is: should we have scikit-learn as a dependency @adehecq @erikmannerfelt ?
It's becoming very popular (linked to Dask for example), and it has a lot of great statistical tools (including Gaussian Processes) so I'm clearly for it 😄

adehecq · 2021-06-22T09:45:57Z

Looks very promising!! :-)

Regarding the dependency on scikit-learn (and of other packages in general), I believe that unless not importing the module makes xdem useless, e.g. rasterio, we should not make it a hard dependency. So basically, it should only be imported within the functions where it is needed.
We could have two requirement files, one with only the hard dependencies, and one with all, so that it's easier to install if one wants to use all the xdem functionalities.

rhugonnet · 2021-06-22T11:23:58Z

For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked in the function, as @erikmannerfelt did for csv2, richDEM.
Is this what you meant by "importing within the function", @adehecq ? I'm not sure of the best practice here.

Should we open an issue (improvement) for creating a file for "full environment" and one for "minimal environment"?

adehecq · 2021-06-22T14:17:08Z

> For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked in the function, as @erikmannerfelt did for csv2, richDEM.
> Is this what you meant by "importing within the function", @adehecq ? I'm not sure of the best practice here.
>
> Should we open an issue (improvement) for creating a file for "full environment" and one for "minimal environment"?

My idea was to have the import statement within the function directly. I was discussing with Fabien about it and there are pros/cons to each approach:

* with the current approach, if you don't catch and raise a specific error (as you did perfectly here!) then the error becomes difficult to trace for the user. So we need to make sure that the error is caught everywhere in the code.
* with the import within the function, it's more transparent and an ImportError is raised when calling the function, which is then easy to understand. But if using the function in multiple processes, then the module will be loaded each time...

Any opinion on this @erikmannerfelt ?

erikmannerfelt

Nice functionality! I have some quite small remarks but generally I like it!


erikmannerfelt · 2021-06-22T14:04:50Z

xdem/spatial_tools.py

@@ -398,6 +408,149 @@ def hillshade(dem: Union[np.ndarray, np.ma.masked_array], resolution: Union[floa
    # The output is scaled by "(x + 0.6) / 1.84" to make it more similar to GDAL.
    return np.clip(255 * (shaded + 0.6) / 1.84, 0, 255).astype("float32")

+def get_xy_rotated(raster: gu.georaster.Raster, myang: float) -> tuple[np.ndarray, np.ndarray]:
+    """
+    Rotate x, y axes of image to get along- and cross-track distances.


This returns pixel coordinates rotated around the lower left corner, right? Where do the "cross-track distances" come in?

It could also forego the raster class (if we want) by using xdem.coreg._get_x_and_y_coords on a transform and the shape of the array. I don't know if that is necessary or better; just a suggestion.

Yes, could be a nice idea to combine both with an optional rotation argument.
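As a point of reference for this exchange, rotating pixel offsets about the lower-left corner can be sketched with plain NumPy. Everything here is an assumption for illustration: the function name `rotated_pixel_coords`, the counter-clockwise angle convention and the use of an array shape plus resolution instead of a raster object. It is not the `get_xy_rotated` from the diff.

```python
import numpy as np


def rotated_pixel_coords(shape, resolution, angle_deg):
    """Rotate pixel offsets, measured from the lower-left corner, by angle_deg."""
    n_rows, n_cols = shape
    # Offsets from the lower-left pixel; row 0 is the top row of the array.
    xx, yy = np.meshgrid(
        np.arange(n_cols) * resolution,
        np.arange(n_rows)[::-1] * resolution,
    )
    theta = np.deg2rad(angle_deg)
    # Project each offset onto axes rotated counter-clockwise by theta.
    x_rot = xx * np.cos(theta) + yy * np.sin(theta)
    y_rot = -xx * np.sin(theta) + yy * np.cos(theta)
    return x_rot, y_rot


x0, y0 = rotated_pixel_coords((3, 4), resolution=10.0, angle_deg=0.0)
x90, y90 = rotated_pixel_coords((3, 4), resolution=10.0, angle_deg=90.0)
```

If the angle is chosen so that one rotated axis follows the satellite track, the two outputs could be read as along-track and cross-track distances, which is one possible interpretation of the docstring.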


erikmannerfelt · 2021-06-22T14:26:04Z

> > For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked in the function, as @erikmannerfelt did for csv2, richDEM.
> > Is this what you meant by "importing within the function", @adehecq ? I'm not sure of the best practice here.
> > Should we open an issue (improvement) for creating a file for "full environment" and one for "minimal environment"?
>
> My idea was to have the import statement within the function directly. I was discussing with Fabien about it and there are pros/cons to each approach:
>
> * with the current approach, if you don't catch and raise a specific error (as you did perfectly here!) then the error becomes difficult to trace for the user. So we need to make sure that the error is caught everywhere in the code.
> * with the import within the function, it's more transparent and an ImportError is raised when calling the function, which is then easy to understand. But if using the function in multiple processes, then the module will be loaded each time...
>
> Any opinion on this @erikmannerfelt ?

I think generally it is advised to import at the top-level because running the function multiple times would otherwise run the imports multiple times. This is not very slow, but it's also not instantaneous. See StackOverflow for reference. I think both approaches are fine; as you mention, there are pros and cons to both. I think we should just be consistent, and for now, we have the _has_blabla syntax in other places.

erikmannerfelt · 2021-06-22T14:30:06Z

> For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked in the function, as @erikmannerfelt did for csv2, richDEM.
> Is this what you meant by "importing within the function", @adehecq ? I'm not sure of the best practice here.
>
> Should we open an issue (improvement) for creating a file for "full environment" and one for "minimal environment"?

My two cents are that environment.yml should contain all optional dependencies as well. Most people (that are not developers) will either use pip, whereby the dependencies/optionals are parsed from setup.py, or from conda-forge, whereby it will be read from the conda recipe (which only specifies the mandatory dependencies, thus skipping the optionals). Thus, environment.yml has a weird niche where it is mostly only used by people that either want to develop, those who want main instead of the latest release, or those that want to modify the code. In these cases, I think there's no real drawback to have them install optional dependencies as well.

Thoughts, @adehecq and @rhugonnet ?

rhugonnet · 2021-06-22T14:34:22Z

> My two cents are that environment.yml should contain all optional dependencies as well. Most people (that are not developers) will either use pip, whereby the dependencies/optionals are parsed from setup.py, or from conda-forge, whereby it will be read from the conda recipe (which only specifies the mandatory dependencies, thus skipping the optionals). Thus, environment.yml has a weird niche where it is mostly only used by people that either want to develop, those who want main instead of the latest release, or those that want to modify the code. In these cases, I think there's no real drawback to have them install optional dependencies as well.
>
> Thoughts, @adehecq and @rhugonnet ?

Fully agree, setup.py will do the job for the mandatory ones (almost forgot about it)

adehecq · 2021-06-22T15:46:42Z

> > For now, I have added a _has_sklearn flag triggered by the sklearn import failure in 78e2ef7, and checked in the function, as @erikmannerfelt did for csv2, richDEM.
> > Is this what you meant by "importing within the function", @adehecq ? I'm not sure of the best practice here.
> > Should we open an issue (improvement) for creating a file for "full environment" and one for "minimal environment"?
>
> My two cents are that environment.yml should contain all optional dependencies as well. Most people (that are not developers) will either use pip, whereby the dependencies/optionals are parsed from setup.py, or from conda-forge, whereby it will be read from the conda recipe (which only specifies the mandatory dependencies, thus skipping the optionals). Thus, environment.yml has a weird niche where it is mostly only used by people that either want to develop, those who want main instead of the latest release, or those that want to modify the code. In these cases, I think there's no real drawback to have them install optional dependencies as well.
>
> Thoughts, @adehecq and @rhugonnet ?

Just to clarify, where is the information on the dependencies for conda-forge stored? I thought it was in the environment.yml. If not, then I agree that this file is a good place to put all optional requirements. But in that case, we could probably merge it with the dev-environment.yml, if it's only to be used by people developing/testing the code.
And what is the point of the dev-requirements.txt?
In all cases, it would be great to have a place where we explain where dependencies should be added. For now there are at least 3-4 files where this should be updated no?

adehecq · 2021-06-22T15:48:48Z

> I think generally it is advised to import at the top-level because running the function multiple times would otherwise run the imports multiple times.

But a module that was previously imported is not imported again, so the function will take a little bit more time on the first call, but then it should be almost the same no?
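The caching behaviour mentioned here can be checked with the standard library alone. The sketch below uses `json` purely as a stand-in module: after the first call, the module object stored in `sys.modules` is reused, so later executions of the `import` statement only perform a lookup.

```python
import sys


def dumps_with_local_import(value):
    # The import statement runs on every call, but Python loads the module
    # only once and afterwards returns the cached entry from sys.modules.
    import json

    return json.dumps(value)


dumps_with_local_import({"a": 1})
module_after_first_call = sys.modules["json"]

dumps_with_local_import({"a": 2})
same_module = sys.modules["json"] is module_after_first_call
```

Note that this cache is per interpreter process, so how often a module is loaded across several worker processes depends on how those processes are started.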


adehecq · 2021-06-29T09:39:25Z

Looks great! This makes me think we could use your robust polynomial fit in coreg deramp.
I have some minor comments, but after that it's good for me.

adehecq · 2021-07-27T13:04:02Z

@rhugonnet what's the status of this PR?

rhugonnet · 2021-09-06T20:52:42Z

> @rhugonnet what's the status of this PR?

On it, trying to homogenize things and push a final version!

rhugonnet · 2021-09-07T10:25:52Z

Again, I have an assertion error that happens only in CI, while everything passes locally... and this is for a calculation with a fixed random_state
So glad to be back 😭

rhugonnet · 2021-09-07T14:25:23Z

@adehecq @erikmannerfelt All ready for approval to be merged, except for that one test that seems to fail randomly in CI (while it never does locally, where the random_state is properly fixed and gives a constant result...).

rhugonnet · 2021-09-07T14:26:02Z

I'm still desperately trying to understand just the base: why 🙏

rhugonnet · 2021-09-07T15:09:04Z

> I'm still desperately trying to understand just the base: why

After a few hours, still can't trace it back. 😢
I'm giving up for now, let's just open an issue that I'll reference to this comment.
Description here:

test_robust_sumsin_fit randomly fails, in CI only, for the following assertion:

        for i in range(2):
            assert coefs[3*i] == pytest.approx(true_coefs[3*i], abs=0.02)

The function call is based on scipy.optimize.basinhopping. I cannot reproduce the issue locally, even when setting up xdem-dev environment the same way as in CI.
This is very strange because the test uses a random state random_state=42 that is used both by xdem.spatial_tools.subsample_raster and by basinhopping (seed argument). And locally the output is indeed not subject to any random variation (constant).
The error seems to occur 75% of the time in CI. When it does, the scipy version (1.7.1) and numpy version (1.20.3) are exactly the same as the ones I have locally without error after running 15+ times. So the issue doesn't seem to be from there either...
The mystery remains!
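For context, the kind of seeded basinhopping call described above can be sketched as follows. This is not the failing test: the single-sinusoid model, its three parameters (amplitude, frequency, phase), the squared-error cost and the function names are assumptions for illustration. It only shows that, within one environment, two runs with the same seed return identical parameters. It does not explain the CI behaviour.

```python
import numpy as np
from scipy.optimize import basinhopping


def sinusoid(params, x):
    amplitude, frequency, phase = params
    return amplitude * np.sin(2 * np.pi * frequency * x + phase)


def fit_sinusoid(x, y, x0, seed):
    """Fit one sinusoid by global optimization with a fixed seed."""

    def cost(params):
        return float(np.sum((sinusoid(params, x) - y) ** 2))

    # The seed fixes the random perturbations between local minimizations.
    return basinhopping(cost, np.asarray(x0, dtype=float), niter=20, seed=seed)


# Synthetic, noise-free data (assumed).
x = np.linspace(0, 10, 200)
y = sinusoid((2.0, 0.2, 0.5), x)

first = fit_sinusoid(x, y, x0=(1.0, 0.25, 0.0), seed=42)
second = fit_sinusoid(x, y, x0=(1.0, 0.25, 0.0), seed=42)
identical = bool(np.array_equal(first.x, second.x))
```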

adehecq

Great job !
I haven't looked in detail at your latest changes, but if you took into account our comments, it should be alright.
It's annoying you couldn't find the issue behind the test randomly failing...

rhugonnet added 5 commits March 27, 2021 14:26

Merge pull request #6 from GlacioHack/main

f7f6108

Update upstream

Merge branch 'GlacioHack:main' into main

9c97f54

add robust polynomial fit + tests

4c4bfbf

add array dimension to docs

54872f8

streamline test polynomial fit

bafec5e

rhugonnet requested review from erikmannerfelt and adehecq June 22, 2021 08:37


fix polynomial fit tests

204434a

import scikit-learn as optional dependency

78e2ef7

rhugonnet added 4 commits June 22, 2021 13:25

Merge branch 'main' into angle_binning

eb397fa

use new subsample function + small fixes

20e8e5e

fix test

3c4ea4a

add comments

6abe090

erikmannerfelt reviewed Jun 22, 2021


rhugonnet mentioned this pull request Jun 24, 2021

Improve management of optional imports #154

Open

rhugonnet added 4 commits June 24, 2021 10:27

fixes with Eriks comments

e6034ea

improve tests with Erik comments

86d565b

fix test

54d4a22

add draft for robust scaling using ML methods

cc9398f

adehecq reviewed Jun 29, 2021


Merge remote-tracking branch 'upstream/main' into angle_binning

6e7cf23

rhugonnet dismissed erikmannerfelt’s stale review via 6e7cf23 September 6, 2021 20:18

rhugonnet added 2 commits September 6, 2021 22:34

rewrite tests with pytest.approx

5cfe2e7

use np.polyval instead of writing out the polynomial

1ca8c4f

rhugonnet added 7 commits September 6, 2021 23:00

rest of amaury comments

f060cf9

add fit module, refactor nmad into spatialstats

616bd7e

fix tests

700bf85

finish refactor nmad, fix tests

a2d4acf

increase error margin of test

dba11f4

try fixing test

28e96ed

add print statement to check values in CI

aaf818f

rhugonnet added 5 commits September 7, 2021 12:28

move print statement to the right place

317c58c

streamline comments

3051384

further streamline comments

9dc9d5f

remove print statement

5387422

subdivide scipy and sklearn into wrapper functions for reuse and clarity

917d905

rhugonnet mentioned this pull request Sep 7, 2021

Basinhopping randomly fails in CI #209

Closed

rhugonnet added 2 commits September 7, 2021 19:12

skip randomly failing test

6588c01

fix skip syntax

0d2e7ec

adehecq approved these changes Sep 8, 2021


rhugonnet merged commit bd40f92 into GlacioHack:main Sep 8, 2021

rhugonnet deleted the angle_binning branch September 8, 2021 07:53
