-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle Categorical Boolean values #3960
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3960 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 347 347
Lines 36776 36790 +14
=======================================
+ Hits 36656 36670 +14
Misses 120 120
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@@ -46,6 +47,7 @@ def fit(self, X, y): | |||
if y is None: | |||
raise ValueError("y cannot be None!") | |||
y_ww = infer_feature_types(y) | |||
self.original_typing = str(y_ww.ww.logical_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we want to just put a note into some places to remove this once we fully deprecate typelib.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Cmancuso I think we need to keep this functionality as long as woodwork transforms 1/0, yes/no etc. to True/False unless that change was made for typelib
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we weren't seeing this issue in schemaUpdate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on the EvalML side we'd still want to output the original form of the target instead of outputting True/False if the target is boolean or boolean inferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this will only be an issue for OS users when they use it in this specific scenario. If users pass in a yes/no
dataset, they should still receive yes/no
predictions, so I think this is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
X = pd.DataFrame({}) | ||
# binary | ||
y = pd.Series(["yes", "yes", "no", "yes"]) | ||
y = ww.init_series(y, logical_type="Categorical") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it matter if we init this series as Boolean
or Categorical
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The behavior is the same, but if the type is passed as categorical, we would expect it to remain categorical rather than bool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should probably parametrize this over both Boolean
and `Categorical.
@@ -221,3 +221,17 @@ def test_label_encoder_with_positive_label_with_custom_indices(): | |||
y_with_custom_indices = pd.Series(["b", "a", "a"], index=[5, 6, 7]) | |||
_, y_transformed = encoder.transform(None, y_with_custom_indices) | |||
assert_index_equal(y_with_custom_indices.index, y_transformed.index) | |||
|
|||
|
|||
def test_label_encoder_categorical_boolean_values(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would repeat the same advice for this test as I would for Becca's - what motivated the adding of this test case? I am not sure the test name as-is reflects the "why" of why this test exists. I think it's worthwhile to add a little context via a docstring to the test function...perhaps link it to a woodwork version upgrade.
X = pd.DataFrame({}) | ||
# binary | ||
y = pd.Series(["yes", "yes", "no", "yes"]) | ||
y = ww.init_series(y, logical_type="Categorical") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we should probably parametrize this over both Boolean
and `Categorical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bchen1116 !
Previously, if we passed in a categorical column (ie 'yes', 'no') that gets inferred as boolean through WW's new boolean typing, the
infer_feature_types
line inLabelEncoder.inverseTransform
will transform the mapped column back into boolean.This PR puts up a fix for that.