
ENH duck-typing scikit-learn estimator instead of inheritance #858

Merged
merged 53 commits on Jan 16, 2022

Conversation

NV-jpt
Contributor

@NV-jpt NV-jpt commented Sep 2, 2021

Reference Issue

Fixes #856

What does this implement/fix? Explain your changes.

This PR has two parts:

First, it switches the logical check in check_neighbors_object() from an inheritance-based check to a duck-typing-based one (verifying against the interface). Previously, a neighbors object was deemed appropriate if it inherited from the scikit-learn mixin class KNeighborsMixin; this PR instead checks whether the object's interface is consistent with KNeighborsMixin, rather than requiring explicit subclassing.

Second, it modifies several resamplers' _validate_estimator() methods to follow the strategy adopted in KMeansSMOTE: verifying estimators via sklearn.base.clone() instead of an isinstance check.
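Validation via clone relies only on scikit-learn's get_params/set_params contract, so estimators from other libraries pass as long as they honour it. A minimal sketch, assuming scikit-learn is installed (MiniKMeans is a hypothetical stand-in, not a real estimator):

```python
from sklearn.base import clone


class MiniKMeans:
    """KMeans-like estimator that does NOT inherit from sklearn.cluster.KMeans."""

    def __init__(self, n_clusters=8, random_state=None):
        self.n_clusters = n_clusters
        self.random_state = random_state

    def get_params(self, deep=True):
        # The contract clone() relies on: report constructor params.
        return {"n_clusters": self.n_clusters, "random_state": self.random_state}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self


est = MiniKMeans(n_clusters=3)
cloned = clone(est)  # passes: validated by clone-ability, not by class
print(cloned.n_clusters)  # 3
```

An isinstance(est, sklearn.cluster.KMeans) check would have rejected this object outright.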

These changes allow for greater flexibility: users can cleanly integrate estimators from libraries that do not directly depend on scikit-learn but implement the same API contract.

Any other comments?

@NV-jpt NV-jpt changed the title Switch check_neighbors_object()'s KNeighborsMixin type checks to be interface based rather than inheritance based [ENH] Switch estimator operator logical check in check_neighbors_object() to be interface based rather than inheritance based Sep 2, 2021
@pep8speaks

pep8speaks commented Sep 9, 2021

Hello @NV-jpt! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-09-24 19:07:23 UTC

@codecov

codecov bot commented Sep 9, 2021

Codecov Report

Merging #858 (b75b77d) into master (b3a3dce) will increase coverage by 0.06%.
The diff coverage is 97.95%.

❗ Current head b75b77d differs from pull request most recent head b627cf1. Consider uploading reports for the commit b627cf1 to get more accurate results

@@            Coverage Diff             @@
##           master     #858      +/-   ##
==========================================
+ Coverage   97.20%   97.26%   +0.06%     
==========================================
  Files          96       97       +1     
  Lines        6253     6373     +120     
  Branches      712      721       +9     
==========================================
+ Hits         6078     6199     +121     
- Misses        104      105       +1     
+ Partials       71       69       -2     
Impacted Files Coverage Δ
imblearn/over_sampling/_adasyn.py 87.50% <ø> (-0.20%) ⬇️
imblearn/over_sampling/_smote/base.py 98.97% <ø> (ø)
imblearn/over_sampling/_smote/cluster.py 100.00% <ø> (ø)
imblearn/over_sampling/_smote/tests/test_smote.py 100.00% <ø> (ø)
imblearn/over_sampling/tests/test_adasyn.py 100.00% <ø> (ø)
...mpling/_prototype_selection/tests/test_nearmiss.py 100.00% <ø> (ø)
...election/tests/test_neighbourhood_cleaning_rule.py 100.00% <ø> (ø)
imblearn/utils/tests/test_validation.py 98.91% <81.81%> (-1.09%) ⬇️
imblearn/utils/testing.py 98.82% <96.15%> (-1.18%) ⬇️
imblearn/over_sampling/_smote/filter.py 87.82% <100.00%> (+1.38%) ⬆️
... and 8 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b3a3dce...b627cf1. Read the comment docs.

@NV-jpt
Contributor Author

NV-jpt commented Sep 9, 2021

@glemaitre, please let me know what you think about these proposed changes when you have a moment. If they are agreeable, I would be happy to update the docstring and related documentation to align with these logical changes.

I would also be happy to put together a Jupyter notebook tutorial showing how integrating cuML into imbalanced-learn can provide significant speed-ups when resampling large datasets.

@glemaitre
Member

I will have a look soon. We are in the release process of scikit-learn, so I will probably do a small sprint right afterwards to make a synced release and look at all the available issues/PRs.

@lgtm-com

lgtm-com bot commented Sep 13, 2021

This pull request introduces 1 alert when merging 94b0725 into edf6eae - view on LGTM.com

new alerts:

  • 1 for Unused import

@NV-jpt NV-jpt changed the title [ENH] Switch estimator operator logical check in check_neighbors_object() to be interface based rather than inheritance based [ENH] Switch estimator operator logical checks to be interface based rather than inheritance based Sep 13, 2021
else:
raise_isinstance_error("svm_estimator", [SVC], self.svm_estimator)
self.svm_estimator_ = clone(self.svm_estimator)
Contributor Author

@NV-jpt NV-jpt Sep 13, 2021


This change removes the explicit isinstance check for validating the SVC estimator in SVMSMOTE's _validate_estimator method; the estimator is instead validated by way of sklearn.base.clone(), similar to that of KMeansSMOTE.

This will enable the integration of SVM estimators that follow the same API contract as scikit-learn, instead of requiring the explicit class check (isinstance(svm_estimator, sklearn.svm.SVC)).

As a motivating example, the integration of a GPU-accelerated SVC from cuML can offer significant performance gains when working with large datasets.

[benchmark image]

Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698, 2.2 GHz, 16-cores & NVIDIA V100 32 GB GPU

Benchmarking gist:
https://gist.github.com/NV-jpt/039a8d9c7d37365379faa1d7c7aafc5e

f"`estimator` has to be a KMeans clustering."
f" Got {type(self.estimator)} instead."
)
self.estimator_ = clone(self.estimator)
Contributor Author

@NV-jpt NV-jpt Sep 14, 2021


This change removes the explicit isinstance check for validating the KMeans estimator in ClusterCentroid's _validate_estimator method; the estimator is instead validated by way of sklearn.base.clone(), similar to that of KMeansSMOTE.

This will enable the integration of KMeans estimators that follow the same API contract as scikit-learn, instead of requiring the explicit class check (isinstance(estimator, sklearn.cluster.KMeans)).

As a motivating example, the integration of a GPU-accelerated KMeans estimator from cuML can offer significant performance gains when working with large datasets.

[benchmark image]

Hardware Specs for the Loose Benchmark:
Intel Xeon E5-2698, 2.2 GHz, 16-cores & NVIDIA V100 32 GB GPU

Benchmarking gist:
https://gist.github.com/NV-jpt/d176d4c1aa3b184103efe605ee96b0e6

Member

@glemaitre glemaitre left a comment


I think that we will need tests to ensure that the functionality would work.
We would need as well some documentation.

@@ -93,7 +101,7 @@ def check_neighbors_object(nn_name, nn_object, additional_neighbor=0):
     """
     if isinstance(nn_object, Integral):
         return NearestNeighbors(n_neighbors=nn_object + additional_neighbor)
-    elif isinstance(nn_object, KNeighborsMixin):
+    elif _is_neighbors_object(nn_object):
         return clone(nn_object)
     else:
         raise_isinstance_error(nn_name, [int, KNeighborsMixin], nn_object)
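A dependency-free sketch of the new dispatch (names and return values are illustrative; the real function returns a NearestNeighbors instance or clone(nn_object)):

```python
from numbers import Integral


def check_neighbors_object_sketch(nn_name, nn_object, additional_neighbor=0):
    if isinstance(nn_object, Integral):
        # The real code builds NearestNeighbors(n_neighbors=...); we
        # return a tag here to keep this sketch dependency-free.
        return ("NearestNeighbors", nn_object + additional_neighbor)
    if hasattr(nn_object, "kneighbors") and hasattr(nn_object, "kneighbors_graph"):
        return ("clone", nn_object)
    raise TypeError(
        f"`{nn_name}` must be an int or expose both `kneighbors` and "
        f"`kneighbors_graph`; got {type(nn_object)} instead."
    )


print(check_neighbors_object_sketch("k_neighbors", 5, additional_neighbor=1))
# ('NearestNeighbors', 6)
```

Note the error message names the required methods rather than KNeighborsMixin, since inheritance is no longer required.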
Member


I think that we should also change the error message, since we don't strictly require a KNeighborsMixin but instead require both kneighbors and kneighbors_graph to be exposed.

Contributor Author


I have been debating between two implementations here (my two latest commits [1 2]).

[1] uses sklearn.base.clone to verify that the nn_object is an sklearn-like estimator that can be cloned. This implementation is more consistent with how the library checks the integrity of other estimators (such as the KMeans estimator check in KMeansSMOTE), but it does not protect users from the mistake of passing in other types of estimators.

[2] raises a TypeError if the nn_object is neither an integer, nor exposes both kneighbors and kneighbors_graph; thus, it protects users from this potential mistake.

Do you prefer one over the other?

@NV-jpt
Contributor Author

NV-jpt commented Sep 28, 2021

Great! Thank you; I will go ahead and update the documentation to accommodate these changes. As for the testing, @glemaitre, do you have any specific tests in mind?

@glemaitre
Member

@glemaitre, do you have any specific tests in mind?

I think that the idea is to make a smoke test plugging in another NearestNeighbors implementation. We could add a CI build with some additional dependency just for testing such behaviour (like what we do with keras or tensorflow).

@@ -18,6 +18,9 @@
JOBLIB_MIN_VERSION = "0.11"
THREADPOOLCTL_MIN_VERSION = "2.0.0"
PYTEST_MIN_VERSION = "5.0.1"
CUML_MIN_VERSION = "21.10"


Just wanted to highlight this comment #858 (comment)

I believe using cuML here requires 21.12. 21.12 is the current nightly, which is on track for release this week https://docs.rapids.ai/maintainers

Member


Yep, but I am kind of struggling to install cudatoolkit with conda. Locally, the package is available but it seems that this is not the case on Azure Pipeline. I am currently making sure that I have the latest version of conda.

This said, since there is no GPU on Azure, will cuML fall back on some sort of CPU computing, or will it just fail? In the latter case, I should investigate finding a free CI service for open source where I can get a GPU.

@glemaitre
Member

On another note, I think that I will make some subsequent PRs to have either cuNumeric/Dask/NumPy as different backends. I already looked at Dask and it seems feasible. It would require bypassing the input validation, though.

@glemaitre
Member

Thank you for the response, @glemaitre !! Does this test I have just now committed look appropriately placed? I can add additional tests for the other resamplers too.

@NV-jpt regarding the test, I would like to make one CI build work. Then, I would rewrite a common test for all samplers that take a NN or KMeans, and check that a cuML estimator can be passed. In a later PR, I will make sure that any estimator parameter (SVM, etc.) accepts a cuML estimator.

@glemaitre
Member

@NV-jpt @beckernick Do you know if it would be possible to add imbalanced-learn in the gpuCI (https://gpuci.gpuopenanalytics.com/) to check the potential regression?

@glemaitre
Member

So I will take an alternative here and use our own KNN, since we only want to check that duck typing is working properly. That will be enough for the purposes of this PR.

@glemaitre glemaitre merged commit afcc8e6 into scikit-learn-contrib:master Jan 16, 2022
@glemaitre
Member

Thanks @NV-jpt. I will make a subsequent PR for the other estimators, namely KNearestNeighbors, which could benefit from the same duck-typing check.
