
Bugfix: sklearn 1.1 and SliceDataset #858

Merged · 6 commits · May 24, 2022

Conversation

BenjaminBossan
Collaborator

@BenjaminBossan BenjaminBossan commented May 21, 2022

Sklearn 1.1 issue

The error message when using set_params with an invalid key has been
altered in sklearn 1.1 to contain quote marks. The tests now check for the
message both with and without quotes.

SliceDataset issue

Using SliceDataset could result in an error because of this line:

self.classes_inferred_ = np.unique(to_numpy(y))

This would fail when y comes from a SliceDataset. However, the error was masked
because GridSearchCV swallows errors by default now. Therefore, we now run these
tests with error_score='raise' to surface the bug.
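The masking effect can be demonstrated with plain sklearn, independent of skorch. This is a sketch, not the skorch test itself: a toy estimator whose fit raises shows that GridSearchCV by default records the failure as a NaN score (with a warning), while error_score='raise' re-raises the underlying exception.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

class FailingClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator whose fit raises when fail=True."""
    def __init__(self, fail=False):
        self.fail = fail
    def fit(self, X, y):
        if self.fail:
            raise ValueError("fit failed")
        self.classes_ = np.unique(y)
        return self
    def predict(self, X):
        return np.full(len(X), self.classes_[0])

rng = np.random.RandomState(0)
X, y = rng.rand(20, 2), np.array([0, 1] * 10)

# Default (error_score=np.nan): the failing candidate is scored NaN and
# the search finishes without raising, hiding the bug.
gs = GridSearchCV(FailingClassifier(), {"fail": [False, True]}, cv=2)
gs.fit(X, y)

# With error_score='raise', the underlying exception surfaces immediately.
gs_raise = GridSearchCV(FailingClassifier(), {"fail": [True]}, cv=2,
                        error_score="raise")
try:
    gs_raise.fit(X, y)
    surfaced = False
except ValueError:
    surfaced = True
```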

Implementation

Regarding the bugfix, it is a bit clumsy because we need to make
to_numpy work with SliceDataset, but we cannot check

    isinstance(X, SliceDataset)

because we don't want to import the helper class. Therefore, we now
check this indirectly by looking at attributes.
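The idea behind the attribute-based check can be sketched like this. The attribute names below are illustrative, not necessarily the exact ones skorch inspects, and the FakeSliceDataset is a stand-in, not the real helper class:

```python
import numpy as np

def _looks_like_slicedataset(X):
    # Duck typing: a SliceDataset-like object wraps a dataset and carries
    # an index, so we check for those attributes instead of importing the
    # class (attribute names here are assumptions for illustration).
    return hasattr(X, "dataset") and hasattr(X, "idx")

def to_numpy(X):
    if _looks_like_slicedataset(X):
        # Materialize item by item, then convert to an ndarray.
        Xs = [X[i] for i in range(len(X))]
        if np.isscalar(Xs[0]):
            return np.array(Xs)
        return np.array([to_numpy(x) for x in Xs])
    return np.asarray(X)

class FakeSliceDataset:
    """Stand-in exposing the attributes the duck-typing check looks for."""
    def __init__(self, data, idx=0):
        self.dataset = data
        self.idx = idx
    def __len__(self):
        return len(self.dataset)
    def __getitem__(self, i):
        return self.dataset[i]

ys = to_numpy(FakeSliceDataset([0, 1, 0, 1]))
classes_inferred = np.unique(ys)
```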

Furthermore, SliceDataset now also works with torch tensors as X and y,
not only numpy arrays. Tests have been extended to cover this.

@BenjaminBossan BenjaminBossan self-assigned this May 21, 2022
Member

@thomasjpfan thomasjpfan left a comment


I'm okay with the workaround with SliceDataset.

skorch/tests/callbacks/test_all.py (outdated; resolved)
skorch/tests/test_net.py (outdated; resolved)
skorch/utils.py (outdated)
Comment on lines 141 to 143
if np.isscalar(X[0]):
    return np.array([X[i] for i in range(len(X))])
return np.array([to_numpy(X[i]) for i in range(len(X))])
Member

@thomasjpfan thomasjpfan May 21, 2022


Nit:

Suggested change:

-if np.isscalar(X[0]):
-    return np.array([X[i] for i in range(len(X))])
-return np.array([to_numpy(X[i]) for i in range(len(X))])
+Xs = [X[i] for i in range(len(X))]
+if np.isscalar(Xs[0]):
+    return np.array(Xs)
+return np.array([to_numpy(x) for x in Xs])

Collaborator Author


Re sklearn version: I was getting ahead of myself a bit :)

Re this suggestion: Could you explain why it is better? (Also, there is no out variable, is it referring to X or Xs?) It seems like with your suggestion, we iterate over X a second time if we don't deal with a scalar, is it not potentially more expensive?

Ideally, we could have a version that doesn't need to test the first element at all, but I'm not sure it's achievable.

Member

@thomasjpfan thomasjpfan May 22, 2022


Also, there is no out variable, is it referring to X or Xs

Yup, I meant to type Xs.

I wanted to guard against the second X[0] access, because SliceDataset.transform can do some computation. I have not benchmarked it, but my sense is that the numpy conversion will take up the majority of the compute compared to iterating X a second time. I'm okay with what you have now.

Ideally, we could have a version that doesn't need to test the first element at all, but I'm not sure it's achievable.

We can have to_numpy support scalars and return np.asarray(scalar) if the input is a scalar?

Side note: A slightly nicer solution would be to define SliceDataset.__array__, so it knows how to turn itself into a ndarray.

Collaborator Author


I wanted to guard against the second X[0] access, because SliceDataset.transform can do some computation. I have not benchmarked it, but my sense is that the numpy conversion will take up the majority of the compute compared to iterating X a second time.

Yeah, it's a difficult call. In the best case, a numpy conversion is very cheap because pytorch and numpy share a memory layout. But in the worst case, as you mentioned, we pay for one additional dataset.__getitem__ call. I think the more cautious solution is to prevent the worst case, i.e. avoid an additional __getitem__ but iterate through the list once more.

We can have to_numpy support scalars and return np.asarray(scalar) if the input is a scalar?

I think this could be a dangerous solution performance-wise, as we would be checking each item of the list. As is, we only check the first item and then assume that the subsequent items have the same type.

Side note: A slightly nicer solution would be to define SliceDataset.__array__, so it knows how to turn itself into a ndarray.

I like that, I moved the conversion code into SliceDataset.__array__. Of course, performance is the same this way or that, but it looks cleaner.
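For readers unfamiliar with the `__array__` protocol: numpy calls this hook whenever `np.asarray()` or `np.array()` receives the object, so the class owns its own conversion. Below is a minimal sketch of the idea using a toy stand-in class, not skorch's actual SliceDataset implementation:

```python
import numpy as np

class SliceLike:
    """Toy stand-in for a SliceDataset-like wrapper."""
    def __init__(self, items):
        self._items = items
    def __len__(self):
        return len(self._items)
    def __getitem__(self, i):
        return self._items[i]
    def __array__(self, dtype=None, copy=None):
        # Same item-by-item conversion as before, but now owned by the
        # class itself; numpy invokes this via np.asarray()/np.array().
        arr = np.array([self[i] for i in range(len(self))])
        if dtype is not None:
            arr = arr.astype(dtype)
        return arr

# numpy discovers the hook automatically:
arr = np.asarray(SliceLike([1.0, 2.0, 3.0]))
```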

BenjaminBossan and others added 2 commits May 22, 2022 13:32
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@BenjaminBossan BenjaminBossan changed the title Bugfix: sklearn 1.11 and SliceDataset Bugfix: sklearn 1.1 and SliceDataset May 22, 2022
@BenjaminBossan BenjaminBossan merged commit 8a0e379 into master May 24, 2022
@BenjaminBossan BenjaminBossan deleted the bugfix/sklearn-1.11-and-sliceds branch May 24, 2022 19:47