Fix/check estimator #196

rosecers · 2023-05-17T20:47:45Z

scikit-learn-contrib (our PR is scikit-learn-contrib/scikit-learn-contrib#62) requires that all estimators pass the estimator checks, à la https://scikit-learn.org/stable/modules/generated/sklearn.utils.estimator_checks.parametrize_with_checks.html#sklearn.utils.estimator_checks.parametrize_with_checks, which ours currently do not. I've been going through everything that inherits from the BaseEstimator class (or should) and making the required changes. @agoscinski, I would appreciate your input on the linear_models section: I'm not sure that OrthogonalRegression can pass the estimator checks because, to my knowledge, the predicted values may have a different shape than the fitted y, no? Please correct me if I'm wrong.

📚 Documentation preview 📚: https://scikit-matter--196.org.readthedocs.build/en/196/

rosecers · 2023-05-17T23:33:15Z

@agoscinski I am honestly perplexed by some of the testing errors coming up right now (matmul failing for correctly-shaped matrices). Would appreciate thoughts -- @ceriottm @PicoCentauri @Luthaf and others also welcome to weigh in

Running tests/test_metrics.py, I get things like:

ERROR: test_local_reconstruction_error_train_idx (__main__.ReconstructionMeasuresTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rca/source_installs/scikit-matter/tests/test_metrics.py", line 141, in test_local_reconstruction_error_train_idx
    lfre_val = pointwise_local_reconstruction_error(
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/skmatter/metrics/_reconstruction_measures.py", line 456, in pointwise_local_reconstruction_error
    - 2 * X_test @ X_train.T
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/numpy/ma/core.py", line 3077, in __array_wrap__
    m = reduce(mask_or, [getmaskarray(arg) for arg in input_args])
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/numpy/ma/core.py", line 1757, in mask_or
    return make_mask(umath.logical_or(m1, m2), copy=copy, shrink=shrink)
ValueError: operands could not be broadcast together with shapes (15,4) (4,5)

I've checked, it's within the @ operator, not the *. Also checked that both matrices are the same type.

For the record, this only happens for tests/test_metrics.py and tests/test_sparse_kernel_centerer.py on my end.

agoscinski · 2023-05-18T12:09:09Z

In the StandardFlexibleScaler, np.ma.* is used, which turns the mean into a <class 'numpy.ma.core.MaskedArray'> and thus also the transformed array.
https://github.com/lab-cosmo/scikit-matter/blob/766e3dabc42f26727b484b37fcd696ad64a9c222/src/skmatter/preprocessing/_data.py#L152
Then in the metrics code the @ operation on the masked arrays combines their masks elementwise, which raises the error because the shapes cannot be broadcast together.

Just replace np.ma.* with regular np.*
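
A minimal sketch of how the masked type propagates, assuming only NumPy. The array `X` and the variable names are made up for illustration and are not taken from `_data.py`. It compares `np.ma.average` with plain `np.average` and shows one way to drop the masked type again:

```python
import numpy as np

# Toy data standing in for a feature matrix (hypothetical values).
X = np.arange(12, dtype=float).reshape(4, 3)

# The np.ma variant returns a MaskedArray even though X has no mask.
masked_mean = np.ma.average(X, axis=0)
# The plain variant returns a regular ndarray.
plain_mean = np.average(X, axis=0)

# Arithmetic with a MaskedArray yields a MaskedArray again,
# so the "transformed" data silently changes type.
centered_masked = X - masked_mean
centered_plain = X - plain_mean

print(type(masked_mean).__name__, type(centered_masked).__name__)
print(type(plain_mean).__name__, type(centered_plain).__name__)

# If np.ma has to stay, converting at the boundary restores a plain ndarray.
as_plain = np.asarray(centered_masked)
```

Both variants give the same numbers here; only the array type differs, which is what later operations such as `@` react to.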

rosecers · 2023-05-18T15:15:50Z

Unfortunately, `np.average` flags an error with scikit-learn (scikit-learn does not like the name `average`, even if we rename it upon import). Unless there's another function we can use, we might need to write a spoof.

agoscinski · 2023-05-18T15:25:06Z

I don't fully understand the problem with np.average; could you elaborate? Where does scikit-learn raise an error?

rosecers · 2023-05-18T19:00:54Z

Everything passes! Ready for a proper review with one or two outstanding q's:

* Technically, the sample selectors inherit from BaseEstimator, but do not follow the estimator behavior to a tee (their transformed objects do not have the same number of rows as their inputs). In order to delay the selector-refactor conversation, I suggest we leave them out of test_check_estimators.

* Should OrthogonalRegression inherit from BaseEstimator? @agoscinski

.gitignore

rosecers · 2023-05-18T19:08:41Z

I should note what many of the necessary fixes were:

* Only assigning values to variable_ type attributes outside of the initializer

* Making sure input arrays are always validated

* Making sure errors match the expected behavior for scikit-learn (the weird additions of "Reshape your data")

* Making sure to always define n_features_in_ and n_samples_in during fit functions
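
To make these conventions concrete, here is a rough sketch of a toy transformer that follows them. `MeanCenterer` and its `with_mean` parameter are hypothetical and not part of scikit-matter; the sketch only assumes scikit-learn's `BaseEstimator`, `TransformerMixin`, `check_array` and `check_is_fitted`, and makes no claim to pass the full estimator checks:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical toy transformer used only to illustrate the conventions."""

    def __init__(self, with_mean=True):
        # The initializer only stores its arguments: no validation,
        # no attributes with a trailing underscore.
        self.with_mean = with_mean

    def fit(self, X, y=None):
        # Validate the input; for 1D input this raises the standard
        # scikit-learn error that mentions "Reshape your data".
        X = check_array(X)
        # Fitted attributes (trailing underscore) are only set during fit.
        self.n_features_in_ = X.shape[1]
        self.mean_ = X.mean(axis=0) if self.with_mean else np.zeros(X.shape[1])
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return X - self.mean_
```

A short usage example under the same assumptions:

```python
X = np.array([[1.0, 2.0], [3.0, 6.0]])
centered = MeanCenterer().fit(X).transform(X)
```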

hurricane642 · 2023-05-18T23:07:18Z

I have a question here: shouldn't we explicitly add tests with @parametrize_with_checks to our test sets? Or is the requirement just that it should be possible to run such tests?
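
For reference, an explicit test of this kind usually follows the pattern from the scikit-learn documentation, sketched below. The two scikit-learn estimators are stand-ins chosen for illustration; which scikit-matter estimators would go in the list is not settled here, and pytest is assumed to be installed:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import parametrize_with_checks


# Stand-in estimators; a project would list its own estimator instances here.
@parametrize_with_checks([StandardScaler(), PCA()])
def test_sklearn_compatible_estimator(estimator, check):
    # pytest generates one test case per (estimator, check) pair.
    check(estimator)
```

Running pytest on a file containing this function reports each check separately, which makes it easier to see which estimator fails which check.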

agoscinski · 2023-05-19T07:55:51Z

technically, the sample selectors inherit from BaseEstimator, but do not follow the estimator behavior to a tee (their transformed objects do not have the same number of rows as their inputs). In order to delay the selector-refactor conversation, I suggest we leave them out of test_check_estimators

If they are okay with it, sure. We can also write in the docs that the selectors are not compatible with Pipeline yet, if this helps convince them. In the end we need to mark this somehow programmatically. I am not sure how to do this, since the scikit-learn Pipeline never checks the type of the transformers, so changing the base class does not help here. Maybe the contributors have some useful suggestions.

Should orthogonalregression inherit from base estimator? @agoscinski

Yes, it is an estimator, so it should inherit from it. What is the problem with it? That its output does not always agree with the shape of the input, right? We can mark it as a private class (_OrthogonalRegression), because it is only used for the reconstruction measures; then I think we can omit it from the estimator check. But it basically has the same problems as the sample selection classes, so I would apply the same solution.

agoscinski

Looks good overall. I would use the chance to add tests for code that is not yet covered.

src/skmatter/decomposition/_kernel_pcovr.py

src/skmatter/preprocessing/_data.py

tests/test_sample_simple_cur.py

src/skmatter/utils/_pcovr_utils.py

agoscinski · 2023-05-30T08:24:48Z

src/skmatter/utils/_orthogonalizers.py


-        xnew -= col @ (col.T @ xnew)
+        xnew -= (col @ (col.T @ xnew)).astype(xnew.dtype)


Why is this necessary? It seems weird to suddenly enforce the type here

Without this we get numpy.core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'divide' output from dtype('float64') to dtype('int64') with casting rule 'same_kind'
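
A small sketch of the casting issue, assuming only NumPy. The arrays are toy values and not the ones from `_orthogonalizers.py`. It reproduces the in-place casting error for integer input, shows the cast from the diff, and, as an alternative that is only an assumption here and not what the PR does, promotes the input to float first:

```python
import numpy as np

xnew = np.array([[1, 2], [3, 4]])   # integer input, e.g. from an integer test fixture
col = np.array([[0.6], [0.8]])      # unit-norm float column

update = col @ (col.T @ xnew)       # the projection is float64

# In-place subtraction cannot write a float64 result into an integer array.
try:
    xnew -= update
    raised = False
except TypeError as err:            # NumPy's casting error derives from TypeError
    raised = True
    print(err)

# Variant from the diff: cast the update to the dtype of xnew.
# This runs, but for integer input the update is truncated.
x_cast = xnew.copy()
x_cast -= update.astype(x_cast.dtype)

# Alternative (assumption): promote the input to float before the update,
# which keeps the result orthogonal to col.
x_float = xnew.astype(float)
x_float -= update
print(col.T @ x_float)
```

The cast only silences the error; whether truncation is acceptable depends on whether integer input is expected to reach this function at all.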

src/skmatter/linear_model/_ridge.py

tests/test_greedy_selector.py

agoscinski

This has become quite a huge PR. I have the feeling that if we merge it as it is, we will merge some buggy code that is hard to identify because it is entangled with so many changes. I therefore suggest moving the renaming of private variables into a separate PR to reduce the noise here. Then I can review it again.

These are the changes I could identify from this PR

* renaming member variables marked as private to sklearn style

* consistently validate and check input data in fit functions

* adding whitening option in PCovR

* KernelFlexibleCenterer was not consistently using validated kernel, this has been fixed

* adding tests tests/test_kernel_pcovr.py for different solvers

* adding tests tests/test_standard_flexible_scaler.py for taking average

* add sklearn estimator_checks tests

Co-authored-by: Alexander Goscinski <alex.goscinski@posteo.de>

agoscinski

One comment only. Otherwise looks fine.

Suggested merge commit

add sklearn estimator_checks tests and fix emerging test errors

* consistently validate and check input data in fit functions

* adding whitening option in PCovR

* KernelFlexibleCenterer was not consistently using validated kernel, this has been fixed

* adding tests tests/test_standard_flexible_scaler.py for taking average

* create new test file tests/test_check_estimators.py with sklearn estimator_checks tests

src/skmatter/_selection.py

src/skmatter/utils/_pcovr_utils.py

Co-authored-by: Alexander Goscinski <alex.goscinski@posteo.de>

agoscinski

Looks good.

rosecers force-pushed the fix/check_estimator branch 8 times, most recently from 11218b2 to af60553 May 17, 2023 23:31

rosecers force-pushed the fix/check_estimator branch 3 times, most recently from 33fe51b to 766e3da May 17, 2023 23:51

rosecers marked this pull request as draft May 18, 2023 15:34

rosecers force-pushed the fix/check_estimator branch 4 times, most recently from 126a26b to 1ba16f8 May 18, 2023 18:53

rosecers marked this pull request as ready for review May 18, 2023 19:01

rosecers requested a review from hurricane642 May 18, 2023 19:01

rosecers commented May 18, 2023

rosecers force-pushed the fix/check_estimator branch from 1ba16f8 to 5dce498 May 18, 2023 19:07

rosecers requested a review from agoscinski May 18, 2023 19:10

agoscinski reviewed May 19, 2023
