Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Categorical Boolean values #3960

Merged
merged 8 commits into from
Jan 26, 2023
Merged

Handle Categorical Boolean values #3960

merged 8 commits into from
Jan 26, 2023

Conversation

bchen1116
Copy link
Contributor

Previously, if we passed in a categorical column (ie 'yes', 'no') that gets inferred as boolean through WW's new boolean typing, the infer_feature_types line in LabelEncoder.inverseTransform will transform the mapped column back into boolean.

This PR puts up a fix for that.

@bchen1116 bchen1116 self-assigned this Jan 26, 2023
@codecov
Copy link

codecov bot commented Jan 26, 2023

Codecov Report

Merging #3960 (b8e2497) into main (10a4980) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3960     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        347     347             
  Lines      36776   36790     +14     
=======================================
+ Hits       36656   36670     +14     
  Misses       120     120             
Impacted Files Coverage Δ
.../components/transformers/encoders/label_encoder.py 100.0% <100.0%> (ø)
evalml/tests/component_tests/test_label_encoder.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -46,6 +47,7 @@ def fit(self, X, y):
if y is None:
raise ValueError("y cannot be None!")
y_ww = infer_feature_types(y)
self.original_typing = str(y_ww.ww.logical_type)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we want to just put a note into some places to remove this once we fully deprecate typelib.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Cmancuso I think we need to keep this functionality as long as woodwork transforms 1/0, yes/no etc. to True/False unless that change was made for typelib

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we weren't seeing this issue in schemaUpdate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on the EvalML side we'd still want to output the original form of the target instead of outputting True/False if the target is boolean or boolean inferable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this will only be an issue for OS users when they use it in this specific scenario. If users pass in a yes/no dataset, they should still receive yes/no predictions, so I think this is needed.

Copy link
Collaborator

@jeremyliweishih jeremyliweishih left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

X = pd.DataFrame({})
# binary
y = pd.Series(["yes", "yes", "no", "yes"])
y = ww.init_series(y, logical_type="Categorical")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter if we init this series as Boolean or Categorical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behavior is the same, but if the type is passed as categorical, we would expect it to remain categorical rather than bool

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should probably parametrize this over both Boolean and `Categorical.

@bchen1116 bchen1116 enabled auto-merge (squash) January 26, 2023 18:54
@@ -221,3 +221,17 @@ def test_label_encoder_with_positive_label_with_custom_indices():
y_with_custom_indices = pd.Series(["b", "a", "a"], index=[5, 6, 7])
_, y_transformed = encoder.transform(None, y_with_custom_indices)
assert_index_equal(y_with_custom_indices.index, y_transformed.index)


def test_label_encoder_categorical_boolean_values():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would repeat the same advice for this test as I would for Becca's - what motivated the adding of this test case? I am not sure the test name as-is reflects the "why" of why this test exists. I think it's worthwhile to add a little context via a docstring to the test function...perhaps link it to a woodwork version upgrade.

X = pd.DataFrame({})
# binary
y = pd.Series(["yes", "yes", "no", "yes"])
y = ww.init_series(y, logical_type="Categorical")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we should probably parametrize this over both Boolean and `Categorical.

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bchen1116 !

@bchen1116 bchen1116 merged commit 2dcdcba into main Jan 26, 2023
@bchen1116 bchen1116 deleted the label_encoder_bool branch January 26, 2023 20:40
@chukarsten chukarsten mentioned this pull request Jan 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants