
~4000 observations -- tibble giving one row of outputs #14

Closed
jg3405 opened this issue Dec 3, 2020 · 14 comments
@jg3405

jg3405 commented Dec 3, 2020

Hi, I'm trying to generate the Shapley values for a dataset with 14 features and around 4000 observations (a data frame with 14 columns and 4000 rows). I create a model using ranger, but when I go to generate my Shapley values I only end up receiving a one-row output. I get different values depending on how many observations I include, but still just one row of output. Please help.

@bgreenwell
Owner

Hi @jg3405, in order to help, I’d need to see a reproducible example. Can you try posting one here?

@jg3405
Author

jg3405 commented Dec 17, 2020

Hi Brian,

Here is an example of a RF model I generated using the ranger package in R.

example.zip

The data used for the model is also included. As you can see, the response variable is a factor (only 0s and 1s). I would like to get Shapley values for each observation, but when using the code:

pfun <- function(object, newdata, type) predict(object, data = newdata, type = "response")$predictions

shap_new <- explain(model_prob, X = all_data_no_nan[, -1], nsim = 10, pred_wrapper = pfun)

I only get one row of data as output, as though I had one observation. Thanks for trying to help.

Julia

@bgreenwell
Owner

Thanks for the example code @jg3405. The issue is that your prediction wrapper is returning probability estimates for both classes, but explain() can only work with one predicted output. All you need to do is specify which class probability you want to explain:

pfun <- function(object, newdata, type) {
  # Return probability estimate for Y = 1 (i.e., the second column)
  predict(object, data = newdata, type = "response")$predictions[, 2L]
}
shap_new <- explain(model_prob, X = all_data_no_nan[,-1], nsim = 10, pred_wrapper = pfun, .progress = "text")

@bgreenwell
Owner

I'll try to add an informative error message for this in the next release!

@jg3405
Author

jg3405 commented Dec 17, 2020

Amazing! Thank you for getting back to me so fast! It works like a charm :)

@DavZim

DavZim commented Mar 22, 2021

The same error occurs when using glmnet, where the prediction function needs to be function(mdl, newdata) as.numeric(predict(mdl, newdata, s = best_lambda)) (notice the as.numeric(); otherwise the predict function would return a numeric matrix rather than a vector, resulting in the single-row output).
By the way, thanks for creating such a wonderful package, greatly appreciated!

Full MWE below:

library(fastshap)
library(AmesHousing)
library(glmnet)

data <- AmesHousing::make_ames()
dim(data)
#> [1] 2930   81

y <- data$Sale_Price
X <- as.matrix(data[, c("Lot_Frontage", "Lot_Area", "Year_Sold", "Year_Built")])

glm <- cv.glmnet(X, y)
expl_glm <- fastshap::explain(
  glm, X = X,
  pred_wrapper = function(mdl, newdata) as.numeric(predict(mdl, newdata, s = glm$lambda.min))
)

sapply(expl_glm, function(x) mean(abs(x)))
#> Lot_Frontage     Lot_Area    Year_Sold   Year_Built 
#>     81.60870     31.03433      3.05142    908.30222 

@trilliumtechnical

trilliumtechnical commented May 20, 2021

A question related to the above:
I am trying to use fastshap after a tidymodels workflow with xgboost for binary classification. My confusion is over what fastshap uses as the "event" or "positive" class (with or without exact = FALSE). In yardstick, the default is the first level. I suspect that the default is the second level in xgboost/fastshap? The suggestion above for the prediction wrapper does not work for selecting the event of interest, because predict() for an xgboost model only returns one set of predictions:

pfun <- function(object, newdata, type) {
  # Return probability estimate for Y = 1 (i.e., the second column)
  predict(object, data = newdata, type = "response")$predictions[, 2L]
}
shap_new <- explain(model_prob, X = all_data_no_nan[,-1], nsim = 10, pred_wrapper = pfun, .progress = "text")

This does not work.

My code passes the xgboost model and a matrix of data to fastshap::explain():

final_fit <- finalize_workflow(wflw_spec_xgb, select_best(tune_results_xgb)) %>%
  fit(dfFinal)

final_model <- pull_workflow_fit(final_fit)

X <- prep(xgboost_recipe, dfFinal) %>%
  juice() %>%
  select(-Class_Pr) %>%
  as.matrix()

shap <- fastshap::explain(final_model$fit, X = X,
                          exact = TRUE)

@bgreenwell
Owner

Hi @trilliumtechnical, it is difficult to provide a good answer without a reproducible example, but in general, the reference class for the Shapley values is dictated by the probability estimates output by the prediction wrapper (as supplied to pred_wrapper)---in your example above, it would be whichever class is stored in the second column, which you extract in your definition of pfun(). In XGBoost, binary outcomes have to be encoded as 0/1, so the reference class is always whichever class is coded as 1. Not sure what tidymodels does during the fitting process with XGBoost, but eventually the outcome has to get encoded as 0/1. Does that help answer your question?
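The reference-class behavior described above can be sketched in plain R. This is a hypothetical illustration (the vector p1 and both wrappers are made up, not from the thread): whichever probability the wrapper returns is the class the Shapley values explain, so returning the complement flips the explained class.

```r
# Hypothetical predicted probabilities for the class encoded as 1
p1 <- c(0.9, 0.2, 0.6)

# Wrapper returning P(Y = 1): the Shapley values explain the class coded as 1
pfun_pos <- function(object, newdata) p1

# Returning the complement instead makes the class coded as 0 the one explained
pfun_neg <- function(object, newdata) 1 - p1

pfun_pos(NULL, NULL)  # 0.9 0.2 0.6
pfun_neg(NULL, NULL)  # 0.1 0.8 0.4
```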

@trilliumtechnical

trilliumtechnical commented May 21, 2021

@bgreenwell
Thank you for the fast response. I think I understand. I am providing what I think is a simple, reproducible example, perhaps to help others, that verifies that xgboost uses whichever class is coded as 1, even though the predict function only returns one column. Note: I excluded all the tuning (the reason for the workflow) to keep the example simple.

dfFinal.zip

library(tidyverse)
library(tidymodels)
library(xgboost)
library(fastshap)


dfFinal <- read_rds(file.path(here::here(), 'dfFinal.RDS'))

# Create a recipe

xgboost_recipe <- recipe(formula = Class_Pr ~ ., data = dfFinal) %>%
    step_zv(all_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_novel(all_nominal(), -all_outcomes(), -has_role('lbl')) %>%
    step_dummy(all_nominal(), -all_outcomes(), -has_role('lbl'), one_hot = TRUE)

# Create a workflow

wflw_spec_xgb <- workflow() %>%
    add_model(
        spec = boost_tree(
            mode       = "classification",
            min_n      = 2,
            learn_rate = 5,
            trees      = 100
            
        ) %>%
            set_engine("xgboost", objective  = 'binary:logistic')
    ) %>%
    add_recipe(xgboost_recipe)

# Fit the original model

final_fit <- wflw_spec_xgb %>% fit(dfFinal)

# Create a matrix from the data

X <- prep(xgboost_recipe, dfFinal) %>% 
    juice() %>% 
    select(-Class_Pr) %>% 
    as.matrix()

# Extract the xgbooster model

final_model <- pull_workflow_fit(final_fit)

# final_model$fit is a xgbooster model
# this predict statement returns a vector of probabilities
# This is the probabilities for the second class of 1

predict(final_model$fit, X)

predict(final_fit, dfFinal, type = 'prob')


pfun <- function(object, newdata) {
    y <- predict(object, newdata = newdata)
}

shap <- fastshap::explain(final_model$fit, X = X, pred_wrapper = pfun,  nsim = 10)

shap %>% autoplot()

A follow-up question: is it possible to write a prediction function for fastshap that uses a workflow? My attempt failed, possibly due to the problem with tibbles (#20).
Perhaps this example will help with the tibble issue?

#  Try to write a wrapper function that uses the workflow

pfunWF <- function(object, newdata) {
    y <- object %>% predict(newdata, type = 'prob') %>% pull(., 2)
}

# Does prediction function work?
yyy <- pfunWF(final_fitWF, dfFinal %>% select(-Class_Pr))

# final_fit is a workflow 
predict(final_fit, dfFinal, type = 'prob')

shap <- fastshap::explain(final_fit, X = dfFinal, pred_wrapper = pfunWF, nsim = 10)

#Error: Can't combine `Variety` <factor<ab9b3>> and `Plant.Day` <double>.


# Tried to correct the problem by changing everything to numeric in the recipe

xgboost_recipe <- recipe(formula = Class_Pr ~ ., data = dfFinal) %>% 
        step_zv(all_predictors()) %>%
    step_nzv(all_predictors()) %>%
    step_novel(all_nominal(), -all_outcomes(), -has_role('lbl')) %>% 
    step_dummy(all_nominal(), -all_outcomes(), -has_role('lbl'),  one_hot = TRUE) %>%
    step_mutate_at(all_nominal(), -all_outcomes(), fn = as.numeric)

# Check the result: Are all features numeric
dfCheck        <- bake(xgboost_recipe %>% prep(), new_data = NULL)
glimpse(dfCheck)

# Fit the original model with new recipe
final_fitWF <- wflw_spec_xgb %>% fit(dfFinal)
predict(final_fitWF, dfFinal, type = 'prob')

shap <- fastshap::explain(final_fitWF, X = dfFinal %>% select(-Class_Pr), pred_wrapper = pfunWF, nsim = 10)

# Still get the error.

@bgreenwell
Owner

bgreenwell commented May 21, 2021

@trilliumtechnical thanks for including an example. I see two issues with your code:

  1. pfunWF() does not return anything (so simply omit the y <- part).
  2. Coerce the data to a data frame when passing into explain(), e.g., just use X = as.data.frame(dfFinal) and it will work (at least it did for me).

Adding tibble support isn't out of the question, but I'm not sure how difficult that change would be to implement (fastshap uses logical subsetting and some other operations that may not be allowed on tibbles). For now, it's easy enough to drop the tibble class before passing it in. Also, XGBoost has "exact" Shapley values built in (so-called TreeSHAP), which can be obtained even faster by setting exact = TRUE in explain() (which just calls predict(xgb_object, newdata = X, predcontrib = TRUE) under the hood), but this will only work if given the XGBoost object directly. That said, I'm sure it's possible to handle workflow objects too if they allow you to call predict(xgbobject, newdata = X, predcontrib = TRUE). Maybe @topepo can chime in on that part.
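The tibble point above can be illustrated with plain single-bracket subsetting, the kind of operation fastshap's internals rely on; a minimal base-R sketch (the toy data frame df is made up for illustration):

```r
# With a plain data frame, df[, "a"] drops to an atomic vector,
# which is what column-wise code generally expects
df <- data.frame(a = 1:3, b = 4:6)
is.vector(df[, "a"])  # TRUE

# A tibble would instead return a one-column tibble from the same
# expression, so dropping the tibble class first
# (e.g., X = as.data.frame(dfFinal)) restores the expected behavior
X <- as.data.frame(df)
```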

@Thomassimancik

Thomassimancik commented May 26, 2022

Hello everyone, I am also experiencing difficulties with how to apply the explain() function to a fitted xgboost model from the mlr package. The error reads:

Error in UseMethod("explain") :
  no applicable method for 'explain' applied to an object of class "WrappedModel"

Is it not possible to apply SHAP values to a fitted xgboost model from the mlr package?

For a simplified version of the fitted model, the code looks as follows:

train_df <- as.data.frame(train_df)
test_df <- as.data.frame(test_df)
traintask <- makeClassifTask(data = train_df, target = "SeriousDlqin2yrs")
testtask <- makeClassifTask(data = test_df, target = "SeriousDlqin2yrs")
xgb_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  par.vals = list(
    objective = "binary:logistic",
    eval_metric = "error",
    nrounds = 200
  )
)
xgb_model <- mlr::train(xgb_learner, task = traintask)

####### SHAP (fastshap package) #######

pfunWF <- function(object, newdata) {
  predict(xgb_model, newdata)$data$prob.1
}

shap <- explain(xgb_model,
                X = train_df,
                nsim = 1,
                pred_wrapper = pfunWF)

Any help on how to fix this error would be much appreciated!

@bgreenwell
Owner

Hi @Thomassimancik, not sure why you'd be getting this error message since explain() is a generic and should work as long as you pass in a prediction wrapper. If you post a small reproducible example I can copy and run on my end, I'd be happy to take a closer look.

@Thomassimancik

Thomassimancik commented May 26, 2022

Hi @bgreenwell, thanks for your swift reply! I am sending a small reproducible example below:

library(mlr)

data(mtcars)
mtcars$vs <- as.factor(mtcars$vs)
mtcars <- as.data.frame(mtcars)
traintask <- makeClassifTask(data = mtcars, target = "vs")
xgb_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  par.vals = list(
    objective = "binary:logistic",
    eval_metric = "error",
    nrounds = 50
  )
)
model <- mlr::train(xgb_learner, task = traintask)

pfunWF <- function(object, newdata) {
  predict(object, newdata, type = 'prob')$data$prob.1
}
shap1 <- fastshap::explain(model,
                           X = mtcars[, -9],
                           nsim = 1,
                           pred_wrapper = pfunWF,
                           .progress = "text")

I suspect that it has something to do with the definition of the traintask object via the makeClassifTask() function.

@bgreenwell
Owner

Two issues in the above code: 1) newdata is not the second positional argument in mlr's prediction method, so you'll need to name it; 2) you omitted the wrong column number in X when calling explain(), so the feature names won't match and you'll get another error. Fixes in the example below.

data(mtcars)
mtcars$vs <- as.factor(mtcars$vs)
mtcars <- as.data.frame(mtcars)
traintask <- makeClassifTask(data = mtcars, target = "vs")
xgb_learner <- makeLearner(
  "classif.xgboost",
  predict.type = "prob",
  par.vals = list(
    objective = "binary:logistic",
    eval_metric = "error",
    nrounds = 50
  )
)
model <- mlr::train(xgb_learner, task = traintask)

pfunWF <- function(object, newdata) {
  # predict(object, newdata, type = 'prob')$data$prob.1
  predict(object, newdata = newdata, type = 'prob')$data$prob.1
}
shap1 <- fastshap::explain(model,
                           #X = mtcars[, -9],
                           X = mtcars[, -8],
                           nsim = 1,
                           pred_wrapper = pfunWF)
