Issue context
I'm getting "ValueError: ('Feature', {}, 'has a value outside the dataset.')" when trying to generate counterfactuals by setting
dice_ml.Data = metadata properties for each feature
algorithm = genetic
query_size > 1
permitted_range = None
Why does the code fail for this combination?
Turns out the values of the categorical features in the query instance are not label encoded, while the values in feature_to_vary are label encoded. This mismatch causes the code to fail with the ValueError. It happens only with the 'genetic' method, and only when permitted_range is not supplied.
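The mismatch can be illustrated in isolation (a minimal sketch, not DiCE's internal code): a raw category string from the query instance never matches the integer codes produced by label encoding, so any membership check between the two fails.

```python
from sklearn.preprocessing import LabelEncoder

# Label encode the category levels, as happens internally for feature_to_vary.
le = LabelEncoder()
encoded_levels = list(le.fit_transform(
    ['Government', 'Other/Unknown', 'Private', 'Self-Employed']))

query_value = 'Private'  # raw string, as it appears in the query instance

# The raw string is compared against integer codes and never matches.
print(query_value in encoded_levels)                      # False
# Once the query value is encoded the same way, the check succeeds.
print(int(le.transform(['Private'])[0]) in encoded_levels)  # True
```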
Code to recreate the issue.
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import dice_ml
from dice_ml.utils import helpers # helper functions
dataset = helpers.load_adult_income_dataset()
target = dataset["income"]
train_dataset, test_dataset, y_train, y_test = train_test_split(dataset,
target,
test_size=0.2,
random_state=0,
stratify=target)
x_train = train_dataset.drop('income', axis=1)
x_test = test_dataset.drop('income', axis=1)
d = dice_ml.Data(features={'age': [17, 90],
'workclass': ['Government', 'Other/Unknown', 'Private', 'Self-Employed'],
'education': ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
'Prof-school', 'School', 'Some-college'],
'marital_status': ['Divorced', 'Married', 'Separated', 'Single', 'Widowed'],
'occupation': ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales', 'Service', 'White-Collar'],
'race': ['Other', 'White'],
'gender': ['Female', 'Male'],
'hours_per_week': [1, 99]},
outcome_name='income')
numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)
categorical_transformer = Pipeline(steps=[
('onehot', OneHotEncoder(handle_unknown='ignore'))])
transformations = ColumnTransformer(
transformers=[
('cat', categorical_transformer, categorical)])
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)
# Set the number of data points required in the query set.
data_point = 2
m = dice_ml.Model(model=model, backend="sklearn")
exp = dice_ml.Dice(d, m, method="genetic")
# Query instances as a DataFrame; columns: feature names, values: feature values.
query_instance = pd.DataFrame({'age': [22]*data_point,
'workclass': ['Private']*data_point,
'education': ['HS-grad']*data_point,
'marital_status': ['Single']*data_point,
'occupation': ['Service']*data_point,
'race': ['White']*data_point,
'gender': ['Female']*data_point,
'hours_per_week': [45]*data_point}, index=list(range(data_point)))
# Generate counterfactuals.
dice_exp = exp.generate_counterfactuals(query_instance,
total_CFs=4,
desired_class="opposite",
initialization="random")
# Visualize the results.
dice_exp.visualize_as_dataframe(show_only_changes=True)
Proposed fix
This fix will make sure "get_features_range(permitted_range)" is executed whether or not permitted_range is supplied.
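A minimal sketch of the idea behind the fix (function and variable names here are hypothetical stand-ins, not DiCE's actual internals): derive the per-feature ranges unconditionally, falling back to the full metadata ranges when permitted_range is None, so categorical levels are encoded consistently in both places.

```python
# Hypothetical sketch of the change; names are assumptions, not DiCE's code.
def get_features_range(feature_metadata, permitted_range=None):
    # Start from the full metadata range for every feature, then narrow
    # any feature the caller explicitly constrained.
    ranges = dict(feature_metadata)
    if permitted_range:
        ranges.update(permitted_range)
    return ranges

def initialize_ranges(feature_metadata, permitted_range):
    # Before the fix: the helper ran only when permitted_range was given,
    # so with permitted_range=None the genetic method ended up comparing
    # raw categorical strings against label-encoded values.
    # After the fix: the helper runs whether or not permitted_range is set.
    return get_features_range(feature_metadata, permitted_range)
```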