Handle Categorical Boolean values #3960

Merged (8 commits) on Jan 26, 2023
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -3,6 +3,7 @@ Release Notes
 **Future Releases**
     * Enhancements
     * Fixes
+        * Updated ``LabelEncoder`` to store the original typing information :pr:`3960`
         * Fixed bug where all-null ``BooleanNullable`` columns would break the imputer during transform :pr:`3959`
     * Changes
     * Documentation Changes
@@ -23,6 +23,7 @@ class LabelEncoder(Transformer):
     def __init__(self, positive_label=None, random_seed=0, **kwargs):
         parameters = {"positive_label": positive_label}
         parameters.update(kwargs)
+        self.original_typing = ""

         super().__init__(
             parameters=parameters,
@@ -46,6 +47,7 @@ def fit(self, X, y):
         if y is None:
             raise ValueError("y cannot be None!")
         y_ww = infer_feature_types(y)
+        self.original_typing = str(y_ww.ww.logical_type)

I don't know if we want to add a note in a few places to remove this once we fully deprecate typelib.

Collaborator: @Cmancuso I think we need to keep this functionality as long as woodwork transforms 1/0, yes/no, etc. to True/False, unless that change was made for typelib.

I thought we weren't seeing this issue in schemaUpdate?

Collaborator: On the EvalML side, we'd still want to output the original form of the target, rather than True/False, if the target is boolean or boolean-inferable.

Contributor Author: Yeah, this will only be an issue for OS users when they use it in this specific scenario. If users pass in a yes/no dataset, they should still receive yes/no predictions, so I think this is needed.

         self.mapping = {val: i for i, val in enumerate(sorted(y_ww.unique()))}
         if self.parameters["positive_label"] is not None:
             if len(self.mapping) != 2:
@@ -114,5 +116,5 @@ def inverse_transform(self, y):
         if y is None:
             raise ValueError("y cannot be None!")
         y_ww = infer_feature_types(y)
-        y_it = infer_feature_types(y_ww.map(self.inverse_mapping))
+        y_it = infer_feature_types(y_ww.map(self.inverse_mapping), self.original_typing)
         return y_it
18 changes: 18 additions & 0 deletions evalml/tests/component_tests/test_label_encoder.py
@@ -221,3 +221,21 @@ def test_label_encoder_with_positive_label_with_custom_indices():
     y_with_custom_indices = pd.Series(["b", "a", "a"], index=[5, 6, 7])
     _, y_transformed = encoder.transform(None, y_with_custom_indices)
     assert_index_equal(y_with_custom_indices.index, y_transformed.index)
+
+
+@pytest.mark.parametrize("logical_type", ["Categorical", "Boolean"])
+def test_label_encoder_categorical_handled_properly_boolean_values(logical_type):
+    # adding this test after WW version 0.21.2, which introduces auto-boolean inference
+    # This broke this test case where the logical type converts to boolean after inverse_transform
+    # because of woodwork inference
+    X = pd.DataFrame({})
+    # binary
+    y = pd.Series(["yes", "yes", "no", "yes"])
+    y = ww.init_series(y, logical_type=logical_type)
+    y_expected = pd.Series([1, 1, 0, 1])
+    encoder = LabelEncoder()
+    encoder.fit(X, y)
+    X_t, y_t = encoder.transform(X, y)
+    pd.testing.assert_series_equal(y_t, y_expected)
+    y_inverse = encoder.inverse_transform(y_t)
+    pd.testing.assert_series_equal(y_inverse, y)
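
For readers without woodwork installed, the idea behind the `fit` / `inverse_transform` change can be sketched in plain Python. This is only an illustration under stated assumptions: `infer_values` and `SketchLabelEncoder` are hypothetical names, and `infer_values` merely imitates the yes/no to True/False inference described in the review thread; it is not woodwork's or EvalML's API. The sketch records the target's type at fit time and pins it again when decoding, so a yes/no target is not silently turned into booleans.

```python
# Hypothetical stand-in for type inference that treats yes/no strings as booleans.
# It imitates the behaviour discussed in the thread; it is NOT woodwork's API.
BOOLEAN_LIKE = {"yes": True, "no": False}


def infer_values(values, logical_type=None):
    """Return the values, coercing yes/no to booleans unless a type is pinned."""
    if logical_type == "Categorical":
        return list(values)  # pinned type: keep the labels exactly as given
    if all(v in BOOLEAN_LIKE for v in values):
        return [BOOLEAN_LIKE[v] for v in values]  # inference kicks in
    return list(values)


class SketchLabelEncoder:
    """Minimal label encoder that remembers the target's original type."""

    def __init__(self):
        self.original_typing = None
        self.mapping = {}
        self.inverse_mapping = {}

    def fit(self, y, logical_type):
        # Store the type seen at fit time, mirroring `self.original_typing` in the diff.
        self.original_typing = logical_type
        self.mapping = {val: i for i, val in enumerate(sorted(set(y)))}
        self.inverse_mapping = {i: val for val, i in self.mapping.items()}
        return self

    def transform(self, y):
        return [self.mapping[v] for v in y]

    def inverse_transform(self, y, restore_type=True):
        decoded = [self.inverse_mapping[i] for i in y]
        # Passing the stored type suppresses re-inference of the decoded labels.
        pinned = self.original_typing if restore_type else None
        return infer_values(decoded, pinned)


y = ["yes", "yes", "no", "yes"]
encoder = SketchLabelEncoder().fit(y, logical_type="Categorical")
encoded = encoder.transform(y)
restored = encoder.inverse_transform(encoded)                      # type pinned
drifted = encoder.inverse_transform(encoded, restore_type=False)   # type re-inferred
```

With the type pinned, `restored` holds the original yes/no labels; without it, `drifted` holds booleans, which is the mismatch the new test guards against.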