Bugfix: sklearn 1.1 and SliceDataset #858
Conversation
Sklearn 1.1 issue

The error message raised by `set_params` with an invalid key was altered in sklearn 1.1 to contain quote marks. The tests now accept the message both with and without quote marks.

SliceDataset issue

Using `SliceDataset` could result in an error because of this line: `self.classes_inferred_ = np.unique(to_numpy(y))` (skorch/classifier.py, line 119 at 090ece0). This fails when y comes from a `SliceDataset`. However, the error was masked because `GridSearchCV` swallows errors by default. Therefore, we now run these tests with `error_score='raise'` to surface the bug.

Implementation

Regarding the bugfix, it is a bit clumsy because we need to make `to_numpy` work with `SliceDataset`, but we cannot check `isinstance(X, SliceDataset)` because we don't want to import the helper class. Therefore, we now check this indirectly by looking at attributes; a sketch of the idea follows below.

Furthermore, `SliceDataset` now also works with torch tensors as X and y, not only numpy arrays. Tests have been extended to cover this.
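For illustration, a minimal sketch of such a duck-typed check (the attribute names tested here, and the surrounding simplified `to_numpy`, are assumptions made for the example, not skorch's exact implementation):

```python
import numpy as np
import torch

def to_numpy(X):
    # Recognize a SliceDataset-like object by its attributes rather than
    # by isinstance, so that skorch.utils does not need to import the
    # helper class. Which attributes to test is an illustrative assumption.
    if hasattr(X, 'dataset') and hasattr(X, 'idx'):
        if np.isscalar(X[0]):
            return np.asarray([X[i] for i in range(len(X))])
        return np.asarray([to_numpy(X[i]) for i in range(len(X))])
    if isinstance(X, torch.Tensor):
        return X.detach().cpu().numpy()
    return np.asarray(X)
```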
I'm okay with the `SliceDataset` workaround.
skorch/utils.py
```python
if np.isscalar(X[0]):
    return np.array([X[i] for i in range(len(X))])
return np.array([to_numpy(X[i]) for i in range(len(X))])
```
Nit:
```suggestion
Xs = [X[i] for i in range(len(X))]
if np.isscalar(Xs[0]):
    return np.array(Xs)
return np.array([to_numpy(x) for x in Xs])
```
Re sklearn version: I was getting ahead of myself a bit :)

Re this suggestion: Could you explain why it is better? (Also, there is no `out` variable, is it referring to `X` or `Xs`?) It seems like with your suggestion, we iterate over `X` a second time if we don't deal with a scalar, is it not potentially more expensive?

Ideally, we could have a version that doesn't need to test the first element at all, but I'm not sure it's achievable.
> Also, there is no `out` variable, is it referring to `X` or `Xs`?

Yup, I meant to type `Xs`.

I wanted to guard against the second `X[0]` access, because `SliceDataset.transform` can do some computation. I have not benchmarked it, but my sense is that the numpy conversion will take up the majority of the compute compared to iterating `X` a second time. I'm okay with what you have now.

> Ideally, we could have a version that doesn't need to test the first element at all, but I'm not sure it's achievable.

We can have `to_numpy` support scalars and return `np.asarray(scalar)` if the input is a scalar?
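For concreteness, a minimal sketch of that idea (a simplified stand-in, not skorch's actual `to_numpy`):

```python
import numpy as np

def to_numpy(X):
    # Hypothetical extension: handle scalars up front, so callers would
    # not need their own np.isscalar(X[0]) guard.
    if np.isscalar(X):
        return np.asarray(X)
    # ... the existing tensor/array handling would follow here ...
    return np.asarray(X)
```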
Side note: A slightly nicer solution would be to define `SliceDataset.__array__`, so it knows how to turn itself into an ndarray.
> I wanted to guard against the second `X[0]` access, because `SliceDataset.transform` can do some computation. I have not benchmarked it, but my sense is that the numpy conversion will take up the majority of the compute compared to iterating `X` a second time.

Yeah, it's a difficult call. In the best case, a numpy conversion is very cheap because pytorch and numpy can share the same memory. But in the worst case, as you mentioned, we pay for one additional `dataset.__getitem__` call. I think the more cautious solution is to prevent the worst case, i.e. avoid an additional `__getitem__` but iterate through the list once more.

> We can have `to_numpy` support scalars and return `np.asarray(scalar)` if the input is a scalar?

I think this could be a dangerous solution performance-wise, as we would be checking each item of the list. As is, we only check the first item and then assume that the subsequent items have the same type.

> Side note: A slightly nicer solution would be to define `SliceDataset.__array__`, so it knows how to turn itself into an ndarray.

I like that, I moved the conversion code into `SliceDataset.__array__`. Of course, performance is the same either way, but it looks cleaner.
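For illustration, here is a minimal, self-contained sketch of what such an `__array__` hook can look like. The simplified `SliceDataset` and `to_numpy` below are stand-ins made up for this example, not skorch's actual implementation:

```python
import numpy as np
import torch

def to_numpy(x):
    # Simplified stand-in for skorch.utils.to_numpy.
    if isinstance(x, torch.Tensor):
        return x.detach().cpu().numpy()
    return np.asarray(x)

class SliceDataset:
    """Minimal sketch of the helper: yields element ``idx`` of each sample."""

    def __init__(self, dataset, idx=0):
        self.dataset = dataset
        self.idx = idx

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, i):
        return self.dataset[i][self.idx]

    def __array__(self, dtype=None):
        # Called automatically by np.asarray(self) and np.unique(self),
        # so to_numpy no longer needs to special-case this class.
        Xs = [self[i] for i in range(len(self))]
        if np.isscalar(Xs[0]):
            return np.asarray(Xs, dtype=dtype)
        return np.asarray([to_numpy(x) for x in Xs], dtype=dtype)
```

With this hook in place, `np.unique(to_numpy(y))` works because numpy's conversion machinery picks up `__array__` automatically.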