[REVIEW] Support for sample_weight parameter in LogisticRegression #3572

viclafargue · 2021-03-02T17:15:18Z

Closes #3559

This PR adds the sample_weight and class_weight parameters to the LogisticRegression estimator.

dantegd

C++ looks very clean, most comments were on the Python side (also looks clean, mostly minor stuff)

python/cuml/linear_model/logistic_regression.pyx

python/cuml/solvers/qn.pyx

python/cuml/linear_model/logistic_regression.pyx

python/cuml/test/test_linear_model.py

viclafargue · 2021-03-12T16:33:32Z

This PR requires a RAFT PR to be approved and merged for CI to pass : raft:#168 .

dantegd

Just 2 very small things, otherwise it looks great

dantegd · 2021-03-18T02:15:20Z

python/cuml/linear_model/logistic_regression.pyx

@@ -126,6 +126,15 @@ class LogisticRegression(Base,
       If False, the model expects that you have centered the data.
    class_weight: None
        Custom class weighs are currently not supported.
+    class_weight: dict or 'balanced', default=None
+        By default all classes have a weight one. However, a dictionnary


Suggested change

By default all classes have a weight one. However, a dictionnary

By default all classes have a weight one. However, a dictionary

dantegd · 2021-03-18T02:23:04Z

python/cuml/linear_model/logistic_regression.pyx

+            if class_weight == 'balanced':
+                self.class_weight_ = 'balanced'
+            else:
+                classes = list(class_weight.keys())


Indeed, but my comment applies more so that the classes that the dictionary of weights has (as keys) have to coincide with the classes of the operation self.classes_ = cp.unique(y_m), no? i.e if the user passes a class in fit that was not in the dict of class_weights (when it was passed as a dict) it would be an error, which would be consistent with what Scikit does if I’m not mistaken https://github.com/scikit-learn/scikit-learn/blob/95119c13af77c76e150b753485c662b7c52a41a2/sklearn/utils/class_weight.py#L67

viclafargue · 2021-03-18T11:16:48Z

@dantegd In my understanding there's two cases:

class_weights does not specify a weight for all of the unique labels provided in fit -> not an issue, they will take a value of 1 as a default
some weight(s) are explicitly specified in class_weights but aren't present in the list of unique labels -> should raise an error

Is this right, or am I mistaken somewhere? If I'm right, thanks for noticing this and providing the link, I'll fix this.

dantegd · 2021-03-18T15:46:13Z

@viclafargue yes indeed that's the case and thanks for writing it more clear than I did :). For reference we can see how that works in scikit indeed:

>>> from sklearn.linear_model import LogisticRegression
>>> import numpy as np
>>> weights = {0: 0.2, 1:0.4}
>>> model1 = LogisticRegression(class_weight=weights)
>>> a = np.random.rand(8).reshape(4,2)
>>> a
array([[0.05872595, 0.7262419 ],
       [0.78006228, 0.47405287],
       [0.52005636, 0.63693384],
       [0.66651034, 0.18605195]])
>>> b = np.array([0, 0, 0, 1])
>>> model1.fit(a, b)
LogisticRegression(class_weight={0: 0.2, 1: 0.4})
>>> model1.__dict__
{'penalty': 'l2', 'dual': False, 'tol': 0.0001, 'C': 1.0, 'fit_intercept': True, 'intercept_scaling': 1, 'class_weight': {0: 0.2, 1: 0.4}, 'random_state': None, 'solver': 'lbfgs', 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None, 'n_features_in_': 2, 'classes_': array([0, 1]), 'coef_': array([[ 0.04954003, -0.10064803]]), 'intercept_': array([-0.38777167]), 'n_iter_': array([6], dtype=int32)}
>>> c = np.array([0, 2, 0, 1])
>>> model1.fit(a, c)
LogisticRegression(class_weight={0: 0.2, 1: 0.4})
>>> model1.__dict__
{'penalty': 'l2', 'dual': False, 'tol': 0.0001, 'C': 1.0, 'fit_intercept': True, 'intercept_scaling': 1, 'class_weight': {0: 0.2, 1: 0.4}, 'random_state': None, 'solver': 'lbfgs', 'max_iter': 100, 'multi_class': 'auto', 'verbose': 0, 'warm_start': False, 'n_jobs': None, 'l1_ratio': None, 'n_features_in_': 2, 'classes_': array([0, 1, 2]), 'coef_': array([[-0.13812798,  0.08768523],
       [ 0.00804634, -0.10698594],
       [ 0.13008164,  0.0193007 ]]), 'intercept_': array([-0.25710905, -0.26129142,  0.51840047]), 'n_iter_': array([6], dtype=int32)}
>>> d = np.array([0, 2, 0, 2])
>>> model1.fit(a, d)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 1407, in fit
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose,
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 1041, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 859, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 777, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 665, in _logistic_regression_path
    class_weight_ = compute_class_weight(class_weight,
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/home/galahad/miniconda3/envs/ns0317/lib/python3.8/site-packages/sklearn/utils/class_weight.py", line 68, in compute_class_weight
    raise ValueError("Class label {} not present.".format(c))
ValueError: Class label 1 not present.
>>>

dantegd

lgtm

@viclafargue to merge could you check that we're pinning raft to the latest possible commit, and add an xfail to the non monotonic sil score test if it hasn't been done already in branch-0.19

codecov-io · 2021-03-23T18:10:32Z

Codecov Report

Merging #3572 (a15e538) into branch-0.19 (c2f246a) will increase coverage by 0.13%.
The diff coverage is 88.23%.

@@               Coverage Diff               @@
##           branch-0.19    #3572      +/-   ##
===============================================
+ Coverage        80.87%   81.00%   +0.13%     
===============================================
  Files              228      228              
  Lines            17630    17811     +181     
===============================================
+ Hits             14258    14428     +170     
- Misses            3372     3383      +11

Flag	Coverage Δ
dask	`45.13% <7.35%> (+0.17%)`	⬆️
non-dask	`73.31% <88.23%> (+0.21%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/cuml/solvers/qn.pyx	`97.20% <85.71%> (-0.43%)`	⬇️
python/cuml/linear_model/logistic_regression.pyx	`89.16% <88.88%> (+0.20%)`	⬆️
python/cuml/linear_model/linear_regression.pyx	`88.23% <0.00%> (-3.53%)`	⬇️
python/cuml/dask/solvers/cd.py	`100.00% <0.00%> (ø)`
python/cuml/common/numba_utils.py	`0.00% <0.00%> (ø)`
python/cuml/internals/global_settings.py	`100.00% <0.00%> (ø)`
python/cuml/dask/preprocessing/encoders.py	`100.00% <0.00%> (ø)`
python/cuml/metrics/_ranking.py	`98.57% <0.00%> (+0.02%)`	⬆️
python/cuml/preprocessing/encoders.py	`95.10% <0.00%> (+0.02%)`	⬆️
python/cuml/common/array_descriptor.py	`98.24% <0.00%> (+0.03%)`	⬆️
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c2f246a...a15e538. Read the comment docs.

dantegd · 2021-03-23T20:07:16Z

@gpucibot merge

sample_weight for LogisticRegression

d4e6492

viclafargue requested review from a team as code owners March 2, 2021 17:15

github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Mar 2, 2021

class_weight feature + testing

62c814c

viclafargue changed the title ~~[WIP] sample_weight for LogisticRegression~~ [WIP] sample_weight parameter for LogisticRegression Mar 3, 2021

viclafargue changed the title ~~[WIP] sample_weight parameter for LogisticRegression~~ [WIP] Support for sample_weight parameter in LogisticRegression Mar 3, 2021

viclafargue added 4 commits March 4, 2021 16:07

Fix + tests update + check style

5d61653

class_weight as dict

0292923

Multiclass + misc fixes

571182e

Add binary classification check

9bb23c9

dantegd requested changes Mar 5, 2021

View reviewed changes

dantegd added feature request New feature or request non-breaking Non-breaking change labels Mar 5, 2021

viclafargue added 7 commits March 8, 2021 14:07

Requested changes

95bd1ce

Stress test

724eca9

Update error message

5824cd1

Merge branch 'branch-0.19' into sample-weight-logistic-regression

32b098c

Merge branch 'branch-0.19' into sample-weight-logistic-regression

8e693fb

Update copyright header

5106ae7

Update RAFT commit tag

8add1bd

viclafargue requested a review from a team as a code owner March 12, 2021 16:29

github-actions bot added the CMake label Mar 12, 2021

viclafargue added 4 commits March 15, 2021 09:42

Update RAFT with map operation

4978f2d

Adding include

dd0f956

Improve testing

52b5c37

Merge branch-0.19

515e18e

dantegd requested changes Mar 18, 2021

View reviewed changes

dantegd added the 4 - Waiting on Author Waiting for author to respond to review label Mar 18, 2021

Requested changes

0e98d7e

JohnZed changed the title ~~[WIP] Support for sample_weight parameter in LogisticRegression~~ [REVIEW] Support for sample_weight parameter in LogisticRegression Mar 18, 2021

dantegd approved these changes Mar 19, 2021

View reviewed changes

dantegd added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Waiting on Author Waiting for author to respond to review labels Mar 19, 2021

viclafargue added 5 commits March 19, 2021 09:56

Update RAFT

c9c28c2

Add xfail

8defdf5

Merge branch 'branch-0.19' into sample-weight-logistic-regression

dcf4d7e

Merge branch-0.19

9bd8c15

Downgrade RAFT

a15e538

github-actions bot removed the CMake label Mar 23, 2021

rapids-bot bot merged commit 5bb564e into rapidsai:branch-0.19 Mar 23, 2021

beckernick mentioned this pull request Apr 6, 2021

[FEA] Support for Weights in Nearest Neighbors #3006

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[REVIEW] Support for sample_weight parameter in LogisticRegression #3572

[REVIEW] Support for sample_weight parameter in LogisticRegression #3572

viclafargue commented Mar 2, 2021 •

edited

Loading

dantegd left a comment

viclafargue commented Mar 12, 2021

dantegd left a comment

dantegd Mar 18, 2021

dantegd Mar 18, 2021

viclafargue commented Mar 18, 2021 •

edited

Loading

dantegd commented Mar 18, 2021

dantegd left a comment

codecov-io commented Mar 23, 2021

dantegd commented Mar 23, 2021

	By default all classes have a weight one. However, a dictionnary
	By default all classes have a weight one. However, a dictionary

[REVIEW] Support for sample_weight parameter in LogisticRegression #3572

[REVIEW] Support for sample_weight parameter in LogisticRegression #3572

Conversation

viclafargue commented Mar 2, 2021 • edited Loading

dantegd left a comment

Choose a reason for hiding this comment

viclafargue commented Mar 12, 2021

dantegd left a comment

Choose a reason for hiding this comment

dantegd Mar 18, 2021

Choose a reason for hiding this comment

dantegd Mar 18, 2021

Choose a reason for hiding this comment

viclafargue commented Mar 18, 2021 • edited Loading

dantegd commented Mar 18, 2021

dantegd left a comment

Choose a reason for hiding this comment

codecov-io commented Mar 23, 2021

Codecov Report

dantegd commented Mar 23, 2021

viclafargue commented Mar 2, 2021 •

edited

Loading

viclafargue commented Mar 18, 2021 •

edited

Loading