Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/check estimator #196

Merged
merged 9 commits into from
Jun 1, 2023
Merged

Fix/check estimator #196

merged 9 commits into from
Jun 1, 2023

Conversation

rosecers
Copy link
Collaborator

@rosecers rosecers commented May 17, 2023

scikit-learn-contrib (our PR is scikit-learn-contrib/scikit-learn-contrib#62) requires that all estimators pass a check_estimators test ala https://scikit-learn.org/stable/modules/generated/sklearn.utils.estimator_checks.parametrize_with_checks.html#sklearn.utils.estimator_checks.parametrize_with_checks, which ours were not currently doing. I've been going through everything that inherits from the BaseEstimator class (or should) and making the required changes. @agoscinski would appreciate your input on the linear_models section, as I'm not sure that OrthogonalRegression can pass the estimators, as to my knowledge the predicted values may have a different shape that the fitted y, no? Please correct me if I'm wrong.


📚 Documentation preview 📚: https://scikit-matter--196.org.readthedocs.build/en/196/

@rosecers rosecers force-pushed the fix/check_estimator branch 8 times, most recently from 11218b2 to af60553 Compare May 17, 2023 23:31
@rosecers
Copy link
Collaborator Author

@agoscinski I am honestly perplexed by some of the testing errors coming up right now (matmul failing for correctly-shaped matrices). Would appreciate thoughts -- @ceriottm @PicoCentauri @Luthaf and others also welcome to weigh in

Running tests/test_metrics.py, I get things like:

ERROR: test_local_reconstruction_error_train_idx (__main__.ReconstructionMeasuresTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rca/source_installs/scikit-matter/tests/test_metrics.py", line 141, in test_local_reconstruction_error_train_idx
    lfre_val = pointwise_local_reconstruction_error(
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/skmatter/metrics/_reconstruction_measures.py", line 456, in pointwise_local_reconstruction_error
    - 2 * X_test @ X_train.T
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/numpy/ma/core.py", line 3077, in __array_wrap__
    m = reduce(mask_or, [getmaskarray(arg) for arg in input_args])
  File "/Users/rca/miniconda3/lib/python3.10/site-packages/numpy/ma/core.py", line 1757, in mask_or
    return make_mask(umath.logical_or(m1, m2), copy=copy, shrink=shrink)
ValueError: operands could not be broadcast together with shapes (15,4) (4,5) 

I've checked, it's within the @ operator, not the *. Also checked that both matrices are the same type.

For the record, this only happens for tests/test_metrics.py and tests/test_sparse_kernel_centerer.py on my end.

@rosecers rosecers force-pushed the fix/check_estimator branch 3 times, most recently from 33fe51b to 766e3da Compare May 17, 2023 23:51
@agoscinski
Copy link
Collaborator

In the StandartFlexibleScaler np.ma.* is used which transforms the mean to a <class 'numpy.ma.core.MaskedArray'> and thus also the transformed array.c
https://github.com/lab-cosmo/scikit-matter/blob/766e3dabc42f26727b484b37fcd696ad64a9c222/src/skmatter/preprocessing/_data.py#L152
Then in the metrics code the @ operation between masks is executed as an elementwise multiplication which results in the error because of not broadcastability of the shapes

Just replace np.ma.* with regular np.*

@rosecers
Copy link
Collaborator Author

rosecers commented May 18, 2023 via email

@agoscinski
Copy link
Collaborator

I dont fully understand the problem with np.average, could you elaborate. Where does scikit-learn return an error?

@rosecers rosecers marked this pull request as draft May 18, 2023 15:34
@rosecers rosecers force-pushed the fix/check_estimator branch 4 times, most recently from 126a26b to 1ba16f8 Compare May 18, 2023 18:53
@rosecers
Copy link
Collaborator Author

rosecers commented May 18, 2023

Everything passes! Ready for a proper review with one or two outstanding q's:

  • technically, the sample selectors inherit from BaseEstimator, but do not follow the estimator behavior to a tee (their transformed objects do not have the same number of rows as their inputs). In order to delay the selector-refactor conversation, I suggest we leave them out of test_check_estimators
  • Should orthogonalregression inherit from base estimator? @agoscinski

@rosecers rosecers marked this pull request as ready for review May 18, 2023 19:01
@rosecers rosecers requested a review from hurricane642 May 18, 2023 19:01
.gitignore Show resolved Hide resolved
@rosecers rosecers force-pushed the fix/check_estimator branch from 1ba16f8 to 5dce498 Compare May 18, 2023 19:07
@rosecers
Copy link
Collaborator Author

I should note what many of the necessary fixes were:

  • Only assigning values to variable_ type attributes outside of the initializer
  • Making sure input arrays are always validated
  • Making sure errors match the expected behavior for scikit-learn (the weird additions of Reshape your data)
  • Making sure to always define n_features_in_ and n_samples_in during fit functions

@rosecers rosecers requested a review from agoscinski May 18, 2023 19:10
@hurricane642
Copy link
Contributor

I have a question here, shouldn't we explicitly add tests with @parametrize_with_checks in our test sets? Or the requirement that it should just be possible to run such tests?

@agoscinski
Copy link
Collaborator

technically, the sample selectors inherit from BaseEstimator, but do not follow the estimator behavior to a tee (their transformed objects do not have the same number of rows as their inputs). In order to delay the selector-refactor conversation, I suggest we leave them out of test_check_estimators

If they are okay with it, sure. We can write in the doc also that it is not compatible with Pipeline yet, if this helps convincing them. In the end we need to mark this somehow programmatically. I am not sure how to do this, since scikit-learn Pipeline never checks for the type of the transformers, so changing the base class does not help here. Maybe the contributors have some useful suggestions.

Should orthogonalregression inherit from base estimator? @agoscinski

Yes it is an estimator so it should inherit from it, what is the problem with it? That it does not always agrees with the shape of the input, right? We can make mark it as private class (_OrthogonalRegression), because it is only use for reconstruction measures, then I think we can ommit it for the estimator check. But it basically has the same problems as the sample selection classes, so I would do the apply the same solution.

Copy link
Collaborator

@agoscinski agoscinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks overall good. I would use the chance to add tests for not covered code

src/skmatter/decomposition/_kernel_pcovr.py Outdated Show resolved Hide resolved
src/skmatter/preprocessing/_data.py Outdated Show resolved Hide resolved
tests/test_sample_simple_cur.py Outdated Show resolved Hide resolved

xnew -= col @ (col.T @ xnew)
xnew -= (col @ (col.T @ xnew)).astype(xnew.dtype)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary? It seems weird to suddenly enforce the type here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without this we get numpy.core._exceptions._UFuncOutputCastingError: Cannot cast ufunc 'divide' output from dtype('float64') to dtype('int64') with casting rule 'same_kind'

Copy link
Collaborator

@agoscinski agoscinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It became quite a huge PR. I have the feeling if we merge it like it is, we will merge some buggy code which is hard to identify because it is entangled with so many changes, therefore I suggest to put the renaming of private variables into a separate PR to reduce the noise here. Then I can review it again.

These are the changes I could identify from this PR

* renaming member variables marked as private to sklearn style

* consistently validate and check input data in fit functions

* adding whitening option in PCovR

* KernelFlexibleCenterer was not consistently using validated kernel, this has been fixed

* adding tests tests/test_kernel_pcovr.py for different solvers

* adding tests tests/test_standard_flexible_scaler.py for taking average

* add sklearn estimator_checks tests

@rosecers rosecers force-pushed the fix/check_estimator branch 4 times, most recently from f0663c5 to cdefe67 Compare May 30, 2023 15:46
@rosecers rosecers force-pushed the fix/check_estimator branch 3 times, most recently from d5cb77d to 097b273 Compare June 1, 2023 01:58
@rosecers rosecers force-pushed the fix/check_estimator branch from 097b273 to cb82ceb Compare June 1, 2023 16:58
Copy link
Collaborator

@agoscinski agoscinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment only. Otherwise looks fine.

Suggested merge commit

add sklearn estimator_checks tests and fix emerging test errors

* consistently validate and check input data in fit functions

* adding whitening option in PCovR

* KernelFlexibleCenterer was not consistently using validated kernel, this has been fixed

* adding tests tests/test_standard_flexible_scaler.py for taking average

* create new test file tests/test_check_estimators.py with sklearn estimator_checks tests

src/skmatter/_selection.py Outdated Show resolved Hide resolved
src/skmatter/utils/_pcovr_utils.py Show resolved Hide resolved
Co-authored-by: Alexander Goscinski <alex.goscinski@posteo.de>
Copy link
Collaborator

@agoscinski agoscinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@rosecers rosecers merged commit 929fb7a into main Jun 1, 2023
@rosecers rosecers deleted the fix/check_estimator branch June 1, 2023 19:55
@agoscinski agoscinski mentioned this pull request Jun 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants