
Bugfix: sklearn 1.1 and SliceDataset #858

Merged · 6 commits · May 24, 2022

Conversation

BenjaminBossan
Collaborator

@BenjaminBossan BenjaminBossan commented May 21, 2022

Sklearn 1.1 issue

The error message when using set_params with an invalid key has been
altered in sklearn 1.1 to contain quote marks. The tests now check for the
message both with and without quotes.

SliceDataset issue

Using SliceDataset could result in an error because of this line:

self.classes_inferred_ = np.unique(to_numpy(y))

This would fail when y comes from a SliceDataset. However, the error was masked
because GridSearchCV swallows errors by default now. Therefore, we now run these
tests with error_score='raise' to surface the bug.
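The masking effect can be demonstrated with plain sklearn, independent of skorch. This is a sketch, not the skorch test itself: a toy estimator whose fit raises shows that GridSearchCV by default records the failure as a NaN score (with a warning), while error_score='raise' re-raises the underlying exception.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

class FailingClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator whose fit raises when fail=True."""
    def __init__(self, fail=False):
        self.fail = fail
    def fit(self, X, y):
        if self.fail:
            raise ValueError("fit failed")
        self.classes_ = np.unique(y)
        return self
    def predict(self, X):
        return np.full(len(X), self.classes_[0])

rng = np.random.RandomState(0)
X, y = rng.rand(20, 2), np.array([0, 1] * 10)

# Default (error_score=np.nan): the failing candidate is scored NaN and
# the search finishes without raising, hiding the bug.
gs = GridSearchCV(FailingClassifier(), {"fail": [False, True]}, cv=2)
gs.fit(X, y)

# With error_score='raise', the underlying exception surfaces immediately.
gs_raise = GridSearchCV(FailingClassifier(), {"fail": [True]}, cv=2,
                        error_score="raise")
try:
    gs_raise.fit(X, y)
    surfaced = False
except ValueError:
    surfaced = True
```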

Implementation

Regarding the bugfix, it is a bit clumsy because we need to make
to_numpy work with SliceDataset, but we cannot check

    isinstance(X, SliceDataset)

because we don't want to import the helper class. Therefore, we now
check this indirectly by looking at attributes.
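The idea behind the attribute-based check can be sketched like this. The attribute names below are illustrative, not necessarily the exact ones skorch inspects, and the FakeSliceDataset is a stand-in, not the real helper class:

```python
import numpy as np

def _looks_like_slicedataset(X):
    # Duck typing: a SliceDataset-like object wraps a dataset and carries
    # an index, so we check for those attributes instead of importing the
    # class (attribute names here are assumptions for illustration).
    return hasattr(X, "dataset") and hasattr(X, "idx")

def to_numpy(X):
    if _looks_like_slicedataset(X):
        # Materialize item by item, then convert to an ndarray.
        Xs = [X[i] for i in range(len(X))]
        if np.isscalar(Xs[0]):
            return np.array(Xs)
        return np.array([to_numpy(x) for x in Xs])
    return np.asarray(X)

class FakeSliceDataset:
    """Stand-in exposing the attributes the duck-typing check looks for."""
    def __init__(self, data, idx=0):
        self.dataset = data
        self.idx = idx
    def __len__(self):
        return len(self.dataset)
    def __getitem__(self, i):
        return self.dataset[i]

ys = to_numpy(FakeSliceDataset([0, 1, 0, 1]))
classes_inferred = np.unique(ys)
```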

Furthermore, SliceDataset now also works with torch tensors as X and y,
not only numpy arrays. Tests have been extended to cover this.

@BenjaminBossan BenjaminBossan self-assigned this May 21, 2022
Member

@thomasjpfan thomasjpfan left a comment


I'm okay with the workaround with SliceDataset.

skorch/tests/callbacks/test_all.py (outdated; resolved)
skorch/tests/test_net.py (outdated; resolved)
skorch/utils.py (outdated)
Comment on lines 141 to 143
if np.isscalar(X[0]):
    return np.array([X[i] for i in range(len(X))])
return np.array([to_numpy(X[i]) for i in range(len(X))])
Member

@thomasjpfan thomasjpfan May 21, 2022


Nit:

Suggested change:

-if np.isscalar(X[0]):
-    return np.array([X[i] for i in range(len(X))])
-return np.array([to_numpy(X[i]) for i in range(len(X))])
+Xs = [X[i] for i in range(len(X))]
+if np.isscalar(Xs[0]):
+    return np.array(Xs)
+return np.array([to_numpy(x) for x in Xs])

Collaborator Author


Re sklearn version: I was getting ahead of myself a bit :)

Re this suggestion: Could you explain why it is better? (Also, there is no out variable, is it referring to X or Xs?) It seems like with your suggestion, we iterate over X a second time if we don't deal with a scalar, is it not potentially more expensive?

Ideally, we could have a version that doesn't need to test the first element at all, but I'm not sure it's achievable.

Member

@thomasjpfan thomasjpfan May 22, 2022


Also, there is no out variable, is it referring to X or Xs

Yup, I meant to type Xs.

I wanted to guard against the second X[0] access, because SliceDataset.transform can do some computation. I have not benchmarked it, but my sense is that the numpy conversion will take up the majority of the compute compared to iterating X a second time. I'm okay with what you have now.

Ideally, we could have a version that doesn't need to test the first element at all, but I'm not sure it's achievable.

We can have to_numpy support scalars and return np.asarray(scalar) if the input is a scalar?

Side note: A slightly nicer solution would be to define SliceDataset.__array__, so it knows how to turn itself into a ndarray.

Collaborator Author


I wanted to guard against the second X[0] access, because SliceDataset.transform can do some computation. I have not benchmarked it, but my sense is that the numpy conversion will take up the majority of the compute compared to iterating X a second time.

Yeah, it's a difficult call. In the best case, a numpy conversion is very cheap because pytorch and numpy share a memory layout. But in the worst case, as you mentioned, we pay for one additional dataset.__getitem__ call. I think the more cautious solution is to prevent the worst case, i.e. avoid an additional __getitem__ but iterate through the list once more.

We can have to_numpy support scalars and return np.asarray(scalar) if the input is a scalar?

I think this could be a dangerous solution performance-wise, as we would be checking each item of the list. As is, we only check the first item and then assume that the subsequent items have the same type.

Side note: A slightly nicer solution would be to define SliceDataset.__array__, so it knows how to turn itself into a ndarray.

I like that, I moved the conversion code into SliceDataset.__array__. Of course, performance is the same this way or that, but it looks cleaner.
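For readers unfamiliar with the `__array__` protocol: numpy calls this hook whenever `np.asarray()` or `np.array()` receives the object, so the class owns its own conversion. Below is a minimal sketch of the idea using a toy stand-in class, not skorch's actual SliceDataset implementation:

```python
import numpy as np

class SliceLike:
    """Toy stand-in for a SliceDataset-like wrapper."""
    def __init__(self, items):
        self._items = items
    def __len__(self):
        return len(self._items)
    def __getitem__(self, i):
        return self._items[i]
    def __array__(self, dtype=None, copy=None):
        # Same item-by-item conversion as before, but now owned by the
        # class itself; numpy invokes this via np.asarray()/np.array().
        arr = np.array([self[i] for i in range(len(self))])
        if dtype is not None:
            arr = arr.astype(dtype)
        return arr

# numpy discovers the hook automatically:
arr = np.asarray(SliceLike([1.0, 2.0, 3.0]))
```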

BenjaminBossan and others added 2 commits May 22, 2022 13:32
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@BenjaminBossan BenjaminBossan changed the title Bugfix: sklearn 1.11 and SliceDataset Bugfix: sklearn 1.1 and SliceDataset May 22, 2022
@BenjaminBossan BenjaminBossan merged commit 8a0e379 into master May 24, 2022
@BenjaminBossan BenjaminBossan deleted the bugfix/sklearn-1.11-and-sliceds branch May 24, 2022 19:47