Update explainer_base.py #424

praveenjune17 · 2023-12-23T07:03:36Z

Issue context
I'm getting "ValueError: ('Feature', {}, 'has a value outside the dataset.')" when generating counterfactuals with the following settings:
dice_ml.Data = metadata properties for each feature (private data)
algorithm = genetic
query_size > 1
permitted_range = None

Why does the code fail for the above combination?
It turns out that the categorical feature values in the query instance are not label encoded, while the values in feature_to_vary are label encoded. This mismatch makes the code fail with the ValueError. It happens only with the 'genetic' method, and only when permitted_range is not supplied.
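The mismatch can be illustrated with a small sketch in plain Python. This is not the DiCE code: `raw_levels`, `encoded_levels` and `check_query_value` are hypothetical names, and the label encoding is assumed to be the position of each level in its list.

```python
# Toy illustration only: these names are hypothetical, not DiCE internals.
raw_levels = {
    "workclass": ["Government", "Other/Unknown", "Private", "Self-Employed"],
}

# Assumed label encoding: each categorical level becomes its list position.
encoded_levels = {
    name: list(range(len(levels))) for name, levels in raw_levels.items()
}


def check_query_value(feature, value, allowed):
    """Raise if the query value is not among the allowed values."""
    if value not in allowed[feature] and str(value) not in allowed[feature]:
        raise ValueError("Feature", feature, "has a value outside the dataset.")


query = {"workclass": "Private"}

# Raw query value checked against raw levels: passes silently.
check_query_value("workclass", query["workclass"], raw_levels)

# Raw query value checked against label-encoded levels: raises.
try:
    check_query_value("workclass", query["workclass"], encoded_levels)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)
```

The query value 'Private' is valid, but it can never match the integer codes, so the check reports it as being outside the dataset.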

Code to recreate the issue.

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import dice_ml
from dice_ml.utils import helpers # helper functions

dataset = helpers.load_adult_income_dataset()
target = dataset["income"]
train_dataset, test_dataset, y_train, y_test = train_test_split(dataset,
                                                                target,
                                                                test_size=0.2,
                                                                random_state=0,
                                                                stratify=target)
x_train = train_dataset.drop('income', axis=1)
x_test = test_dataset.drop('income', axis=1)

d = dice_ml.Data(features={'age': [17, 90],
                           'workclass': ['Government', 'Other/Unknown', 'Private', 'Self-Employed'],
                           'education': ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
                                         'Prof-school', 'School', 'Some-college'],
                           'marital_status': ['Divorced', 'Married', 'Separated', 'Single', 'Widowed'],
                           'occupation': ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales', 'Service', 'White-Collar'],
                           'race': ['Other', 'White'],
                           'gender': ['Female', 'Male'],
                           'hours_per_week': [1, 99]},
                 outcome_name='income')

numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
transformations = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)

# Set the number of data points required in the query set

data_point = 2
m = dice_ml.Model(model=model, backend="sklearn")
exp = dice_ml.Dice(d, m, method="genetic")

# query instance in the form of a dictionary; keys: feature name, values: feature value
query_instance = pd.DataFrame({'age': [22]*data_point,
                               'workclass': ['Private']*data_point,
                               'education': ['HS-grad']*data_point,
                               'marital_status': ['Single']*data_point,
                               'occupation': ['Service']*data_point,
                               'race': ['White']*data_point,
                               'gender': ['Female']*data_point,
                               'hours_per_week': [45]*data_point}, index=list(range(data_point)))

# generate counterfactuals
dice_exp = exp.generate_counterfactuals(query_instance,
                                        total_CFs=4,
                                        desired_class="opposite",
                                        initialization="random")

# visualize the results

dice_exp.visualize_as_dataframe(show_only_changes=True)

Proposed fix
This fix makes sure "get_features_range(permitted_range)" is executed whether or not permitted_range is supplied.
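The idea behind the fix can be sketched as follows. This is a simplified toy, not the code in explainer_base.py: `ToyDataInterface`, `resolve_range_before` and `resolve_range_after` are hypothetical, and the toy `get_features_range` takes and returns a plain dictionary, which is an assumption about shape rather than the library's actual signature.

```python
import copy


class ToyDataInterface:
    """Hypothetical stand-in for a data interface built from metadata."""

    def __init__(self, categorical_levels):
        self.categorical_levels = categorical_levels
        # Cached default ranges; other code may re-encode these in place.
        self.permitted_range = copy.deepcopy(categorical_levels)

    def get_features_range(self, permitted_range=None):
        # Always rebuild the ranges from the original, unencoded levels.
        ranges = copy.deepcopy(self.categorical_levels)
        if permitted_range is not None:
            ranges.update(permitted_range)
        return ranges


def resolve_range_before(data_interface, permitted_range):
    # Assumed old behaviour: reuse the cached default when nothing is supplied.
    if permitted_range is None:
        return data_interface.permitted_range
    return data_interface.get_features_range(permitted_range)


def resolve_range_after(data_interface, permitted_range):
    # Behaviour described in this PR: always call get_features_range.
    return data_interface.get_features_range(permitted_range)


di = ToyDataInterface({"gender": ["Female", "Male"]})
di.permitted_range["gender"] = [0, 1]  # simulate an in-place label encoding

print(resolve_range_before(di, None))  # cached, label-encoded values
print(resolve_range_after(di, None))   # freshly computed, raw labels
```

In this toy, the cached default can drift into label-encoded form, while the freshly computed ranges always hold the raw labels that the query instance is checked against.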


gaugup

@praveenjune17, could you please add a unit test for this change? It should be easy since you know how to re-create the issue.


praveenjune17 · 2024-01-01T15:20:36Z

@gaugup, please review the test cases.

Update explainer_base.py

2af7562

Bug fix to resolve "ValueError: ('Feature', {}, 'has a value outside the dataset.')" caused due to 'genetic' method when used for Private data with a query instance size > 1 Signed-off-by: Praveenkumar <praveen1050208@gmail.com>

praveenjune17 requested review from gaugup and amit-sharma as code owners December 23, 2023 07:03

Update private_data_interface.py

fb2a5db

Remove duplicate code Signed-off-by: Praveenkumar <praveen1050208@gmail.com>

gaugup requested changes Dec 27, 2023


praveenjune17 added 2 commits January 1, 2024 20:48

Update conftest.py

bf085e2

Add test query dataset and model Signed-off-by: Praveenkumar <praveen1050208@gmail.com>

Update test_explainer_base.py

73eee99

Add test case for the fix Signed-off-by: Praveenkumar <praveen1050208@gmail.com>

praveenjune17 requested a review from gaugup January 1, 2024 15:20
