
~4000 observations -- tibble giving one row of outputs #14

Closed
jg3405 opened this issue Dec 3, 2020 · 14 comments
@jg3405

jg3405 commented Dec 3, 2020

Hi, I'm trying to generate the Shapley values for a dataset with 14 features and around 4000 observations (a data frame with 14 columns and 4000 rows). I create a model using ranger, but when I go to generate my Shapley values I only end up receiving a one-row output. I get different values depending on how many observations I include, but still just one row of output. Please help.

@bgreenwell
Owner

Hi @jg3405, in order to help, I’d need to see a reproducible example. Can you try posting one here?

@jg3405
Author

jg3405 commented Dec 17, 2020

Hi Brian,

Here is an example of a RF model I generated using the ranger package in R.

example.zip

The data used for the model is also included. As you can see, the response variable is a factor (only 0s and 1s). I would like to get Shapley values for each observation, but when using the code:

pfun <- function(object, newdata, type) predict(object, data = newdata, type = "response")$predictions

shap_new <- explain(model_prob, X = all_data_no_nan[, -1], nsim = 10, pred_wrapper = pfun)

I only get one row of data as output, as though I had one observation. Thanks for trying to help.

Julia

@bgreenwell
Owner

Thanks for the example code @jg3405. The issue is that your prediction wrapper is returning probability estimates for both classes, but explain() can only work with one predicted output. All you need to do is specify which class probability you want to explain:

pfun <- function(object, newdata, type) {
  # Return probability estimate for Y = 1 (i.e., the second column)
  predict(object, data = newdata, type = "response")$predictions[, 2L]
}
shap_new <- explain(model_prob, X = all_data_no_nan[,-1], nsim = 10, pred_wrapper = pfun, .progress = "text")

@bgreenwell
Owner

I'll try to add an informative error message for this in the next release!

@jg3405
Author

jg3405 commented Dec 17, 2020

Amazing! Thank you for getting back to me so fast! It works like a charm :)

@DavZim

DavZim commented Mar 22, 2021

The same error occurs when using glmnet, where the prediction function needs to be function(mdl, newdata) as.numeric(predict(mdl, newdata, s = best_lambda)) (notice the as.numeric(); otherwise the predict function would return a numeric matrix rather than a vector, resulting in the single-row output).
By the way, thanks for creating such a wonderful package, greatly appreciated!

Full MWE below:

library(fastshap)
library(AmesHousing)
library(glmnet)

data <- AmesHousing::make_ames()
dim(data)
#> [1] 2930   81

y <- data$Sale_Price
X <- as.matrix(data[, c("Lot_Frontage", "Lot_Area", "Year_Sold", "Year_Built")])

glm <- cv.glmnet(X, y)
expl_glm <- fastshap::explain(
  glm, X = X,
  pred_wrapper = function(mdl, newdata) as.numeric(predict(mdl, newdata, s = glm$lambda.min))
)

sapply(expl_glm, function(x) mean(abs(x)))
#> Lot_Frontage     Lot_Area    Year_Sold   Year_Built 
#>     81.60870     31.03433      3.05142    908.30222 

@trilliumtechnical

trilliumtechnical commented May 20, 2021

A question related to the above:
I am trying to use fastshap after a tidymodels workflow with xgboost for binary classification. My confusion is over what fastshap uses as the "event" or "positive" class (with or without exact = FALSE). In yardstick, the default is the first level. I suspect that the default is the second level in xgboost/fastshap? The suggestion above for the prediction wrapper does not work for selecting the event of interest, because predict() for an xgboost model only returns one set of predictions:

pfun <- function(object, newdata, type) {
  # Return probability estimate for Y = 1 (i.e., the second column)
  predict(object, data = newdata, type = "response")$predictions[, 2L]
}
shap_new <- explain(model_prob, X = all_data_no_nan[,-1], nsim = 10, pred_wrapper = pfun, .progress = "text")

This does not work.

My code passes the xgboost model and a matrix of data to fastshap::explain():

final_fit <- finalize_workflow(wflw_spec_xgb, select_best(tune_results_xgb)) %>%
  fit(dfFinal)

final_model <- pull_workflow_fit(final_fit)

X <- prep(xgboost_recipe, dfFinal) %>%
  juice() %>%
  select(-Class_Pr) %>%
  as.matrix()

shap <- fastshap::explain(final_model$fit, X = X,
                          exact = TRUE)

@bgreenwell
Owner

Hi @trilliumtechnical, it is difficult to provide a good answer without a reproducible example, but in general, the reference class for the Shapley values is dictated by the probability estimates output by the prediction wrapper (as supplied to pred_wrapper)---in your example above, it would be whichever class is stored in the second column, which you extract in your definition of pfun(). In XGBoost, binary outcomes have to be encoded as 0/1, so the reference class is always whichever class is coded as 1. Not sure what tidymodels does during the fitting process with XGBoost, but eventually the outcome has to get encoded as 0/1. Does that help answer your question?
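The reference-class behavior described above can be sketched in plain R. This is a hypothetical illustration (the vector p1 and both wrappers are made up, not from the thread): whichever probability the wrapper returns is the class the Shapley values explain, so returning the complement flips the explained class.

```r
# Hypothetical predicted probabilities for the class encoded as 1
p1 <- c(0.9, 0.2, 0.6)

# Wrapper returning P(Y = 1): the Shapley values explain the class coded as 1
pfun_pos <- function(object, newdata) p1

# Returning the complement instead makes the class coded as 0 the one explained
pfun_neg <- function(object, newdata) 1 - p1

pfun_pos(NULL, NULL)  # 0.9 0.2 0.6
pfun_neg(NULL, NULL)  # 0.1 0.8 0.4
```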

@trilliumtechnical

trilliumtechnical commented May 21, 2021

@bgreenwell
Thank you for the fast response. I think I understand. I am providing what I think is a simple, reproducible example, perhaps to help others, that verifies that xgboost uses whichever class is coded as 1, even though the predict function only returns one column. Note: I excluded all the tuning (the reason for the workflow) to keep the example simple.

dfFinal.zip

library(tidyverse)
library(tidymodels)
library(xgboost)
library(fastshap)


dfFinal <- read_rds(file.path(here::here(), 'dfFinal.RDS'))

# Create a recipe

xgboost_recipe <- recipe(formula = Class_Pr ~ ., data = dfFinal) %>%
    step_zv(all_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_novel(all_nominal(), -all_outcomes(), -has_role('lbl')) %>%
    step_dummy(all_nominal(), -all_outcomes(), -has_role('lbl'), one_hot = TRUE)

# Create a workflow

wflw_spec_xgb <- workflow() %>%
    add_model(
        spec = boost_tree(
            mode       = "classification",
            min_n      = 2,
            learn_rate = 5,
            trees      = 100
            
        ) %>%
            set_engine("xgboost", objective  = 'binary:logistic')
    ) %>%
    add_recipe(xgboost_recipe)

# Fit the original model

final_fit <- wflw_spec_xgb %>% fit(dfFinal)

# Create a matrix from the data

X <- prep(xgboost_recipe, dfFinal) %>% 
    juice() %>% 
    select(-Class_Pr) %>% 
    as.matrix()

# Extract the xgbooster model

final_model <- pull_workflow_fit(final_fit)

# final_model$fit is a xgbooster model
# this predict statement returns a vector of probabilities
# This is the probabilities for the second class of 1

predict(final_model$fit, X)

predict(final_fit, dfFinal, type = 'prob')


pfun <- function(object, newdata) {
    y <- predict(object, newdata = newdata)
}

shap <- fastshap::explain(final_model$fit, X = X, pred_wrapper = pfun,  nsim = 10)

shap %>% autoplot()

A follow-up question: is it possible to write a prediction function for fastshap that uses a workflow? My attempt failed, possibly due to the problem with tibbles (#20).
Perhaps this example will help with the tibble issue?

#  Try to write a wrapper function that uses the workflow

pfunWF <- function(object, newdata) {
    y <- object %>% predict(newdata, type = 'prob') %>% pull(., 2)
}

# Does prediction function work?
yyy <- pfunWF(final_fitWF, dfFinal %>% select(-Class_Pr))

# final_fit is a workflow 
predict(final_fit, dfFinal, type = 'prob')

shap <- fastshap::explain(final_fit, X = dfFinal, pred_wrapper = pfunWF, nsim = 10)

#Error: Can't combine `Variety` <factor<ab9b3>> and `Plant.Day` <double>.


# Tried to correct the problem by changing everything to numeric in the recipe

xgboost_recipe <- recipe(formula = Class_Pr ~ ., data = dfFinal) %>% 
        step_zv(all_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_novel(all_nominal(), -all_outcomes(), -has_role('lbl')) %>% 
    step_dummy(all_nominal(), -all_outcomes(), -has_role('lbl'),  one_hot = TRUE) %>%
    step_mutate_at(all_nominal(), -all_outcomes(), fn = as.numeric)

# Check the result: Are all features numeric
dfCheck        <- bake(xgboost_recipe %>% prep(), new_data = NULL)
glimpse(dfCheck)

# Fit the original model with new recipe
final_fitWF <- wflw_spec_xgb %>% fit(dfFinal)
predict(final_fitWF, dfFinal, type = 'prob')

shap <- fastshap::explain(final_fitWF, X = dfFinal %>% select(-Class_Pr), pred_wrapper = pfunWF, nsim = 10)

# Still get the error.

@bgreenwell
Owner

bgreenwell commented May 21, 2021

@trilliumtechnical thanks for including an example. I see two issues with your code:

  1. pfunWF() does not return anything (so simply omit the y <- part).
  2. Coerce the data to a data frame when passing into explain(), e.g., just use X = as.data.frame(dfFinal) and it will work (at least it did for me).

Adding tibble support isn't out of the question, but I'm not sure how difficult that change would be to implement (fastshap uses logical subsetting and some other operations that may not be allowed on tibbles). For now, it's easy enough to drop the tibble class before passing it in. Also, XGBoost has "exact" Shapley values built in (so-called TreeSHAP), which can be obtained even faster by setting exact = TRUE in explain() (which just calls predict(xgb_object, newdata = X, predcontrib = TRUE) under the hood), but this will only work if given the XGBoost object directly. That said, I'm sure it's possible to handle workflow objects too if they allow you to call predict(xgbobject, newdata = X, predcontrib = TRUE). Maybe @topepo can chime in on that part.
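The tibble point above can be illustrated with plain single-bracket subsetting, the kind of operation fastshap's internals rely on; a minimal base-R sketch (the toy data frame df is made up for illustration):

```r
# With a plain data frame, df[, "a"] drops to an atomic vector,
# which is what column-wise code generally expects
df <- data.frame(a = 1:3, b = 4:6)
is.vector(df[, "a"])  # TRUE

# A tibble would instead return a one-column tibble from the same
# expression, so dropping the tibble class first
# (e.g., X = as.data.frame(dfFinal)) restores the expected behavior
X <- as.data.frame(df)
```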

@Thomassimancik

Thomassimancik commented May 26, 2022

Hello everyone, I am also experiencing difficulties with how to apply the explain() function to a fitted xgboost model from the mlr package. The error reads:

Error in UseMethod("explain") :
  no applicable method for 'explain' applied to an object of class "WrappedModel"

Is it not possible to apply SHAP values to a fitted xgboost model from the mlr package?

For a simplified version of the fitted model, the code looks as follows:

train_df <- as.data.frame(train_df)
test_df <- as.data.frame(test_df)
traintask <- makeClassifTask(data = train_df, target = "SeriousDlqin2yrs")
testtask <- makeClassifTask(data = test_df, target = "SeriousDlqin2yrs")
xgb_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  par.vals = list(
    objective = "binary:logistic",
    eval_metric = "error",
    nrounds = 200
  )
)
xgb_model <- mlr::train(xgb_learner, task = traintask)

####### SHAP (fastshap package) #######

pfunWF <- function(object, newdata) {
  predict(xgb_model, newdata)$data$prob.1
}

shap <- explain(xgb_model,
                X = train_df,
                nsim = 1,
                pred_wrapper = pfunWF)

Any help on how to fix this error would be much appreciated!

@bgreenwell
Owner

Hi @Thomassimancik, not sure why you'd be getting this error message since explain() is a generic and should work as long as you pass in a prediction wrapper. If you post a small reproducible example I can copy and run on my end, I'd be happy to take a closer look.

@Thomassimancik

Thomassimancik commented May 26, 2022

Hi @bgreenwell, thanks for your swift reply! I am sending a small reproducible example below:

library(mlr)

data(mtcars)
mtcars$vs <- as.factor(mtcars$vs)
mtcars <- as.data.frame(mtcars)
traintask <- makeClassifTask(data = mtcars, target = "vs")
xgb_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  par.vals = list(
    objective = "binary:logistic",
    eval_metric = "error",
    nrounds = 50
  )
)
model <- mlr::train(xgb_learner, task = traintask)

pfunWF <- function(object, newdata) {
  predict(object, newdata, type = 'prob')$data$prob.1
}
shap1 <- fastshap::explain(model,
                           X = mtcars[, -9],
                           nsim = 1,
                           pred_wrapper = pfunWF,
                           .progress = "text")

I suspect that it has something to do with the definition of the traintask object via the makeClassifTask() function.

@bgreenwell
Owner

Two issues in the above code: 1) newdata is not the second positional argument in mlr's prediction method, so you'll need to name it; 2) you omitted the wrong column number in X when calling explain(), so the feature names won't match and you'll get another error. Fixes in the example below.

data(mtcars)
mtcars$vs <- as.factor(mtcars$vs)
mtcars <- as.data.frame(mtcars)
traintask <- makeClassifTask(data = mtcars, target = "vs")
xgb_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  par.vals = list(
    objective = "binary:logistic",
    eval_metric = "error",
    nrounds = 50
  )
)
model <- mlr::train(xgb_learner, task = traintask)

pfunWF <- function(object, newdata) {
  # predict(object, newdata, type = 'prob')$data$prob.1
  predict(object, newdata = newdata, type = 'prob')$data$prob.1
}
shap1 <- fastshap::explain(model,
                           #X = mtcars[, -9],
                           X = mtcars[, -8],
                           nsim = 1,
                           pred_wrapper = pfunWF)
