Guidance on `method = "permute"` for classification models? #131

Comments
Hi @juliasilge, thanks for reporting the issue! I only get this error when setting …

Ahh, after taking a second look, I see you're passing an integer to …
Ah, I see about the reference class argument:

```r
library(tidymodels)
data("bivariate")

ranger_spec <- rand_forest(trees = 1e3, mode = "classification")

ranger_fit <-
  workflow(Class ~ ., ranger_spec) %>%
  fit(bivariate_train)

pred_fun <- function(object, newdata) {
  predict(object, newdata)$predictions[, 1]
}

library(vip)
#> 
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#> 
#>     vi

ranger_fit %>%
  vi(method = "permute",
     target = "Class", metric = "auc",
     pred_wrapper = pred_fun, train = bivariate_train, reference_class = "One")
#> # A tibble: 2 × 2
#>   Variable Importance
#>   <chr>         <dbl>
#> 1 B             0.426
#> 2 A             0.378
```

Created on 2022-09-04 with reprex v2.0.2

I do still have some questions about what function to use for prediction. I can't get this to work if I use a prediction function for the workflow; I have to use the prediction function for the underlying model. Is that what you expect? Do you know if there's a way to predict on the workflow?
Hmm, this is strange and certainly NOT what I would expect. Your example shouldn't work at all, and in fact it doesn't for me. The …

Does this make sense? Maybe I'm missing something, but … The full example I ran on my end is pasted below.

```r
library(tidymodels)
library(vip)
data("bivariate")

ranger_spec <- rand_forest(trees = 1e3, mode = "classification")

ranger_fit <-
  workflow(Class ~ ., ranger_spec) %>%
  fit(bivariate_train)

pred_fun <- function(object, newdata) {
  predict(object, newdata)$predictions[, 1]
}

ranger_fit %>%
  vi(method = "permute",
     target = "Class", metric = "auc",
     pred_wrapper = pred_fun, train = bivariate_train, reference_class = "One")
#> Warning: Unknown or uninitialised column: `predictions`.
#> Warning: Unknown or uninitialised column: `predictions`.
#> Unknown or uninitialised column: `predictions`.
#> # A tibble: 0 × 2
#> # … with 2 variables: Variable <chr>, Importance <dbl>

class(ranger_fit)
#> [1] "workflow"
ranger_fit$fit$fit$fit$treetype  # hmm...using `metric = "auc"` also shouldn't work here!
#> [1] "Probability estimation"

# This does not work for me, which makes sense because `pred_fun()` is trying to
# extract predictions from the `$predictions` component of the underlying ranger
# object, but we're working with a `"workflow"` object instead, which is missing
# this component and hence throws a warning

# Sanity check (shouldn't work here)
pred_fun(ranger_fit, newdata = head(bivariate_train))
#> NULL
#> Warning message:
#> Unknown or uninitialised column: `predictions`.

# Another sanity check (but it should work here)
pred_fun(ranger_fit$fit$fit$fit, newdata = head(bivariate_train))
#> [1] 0.7267079 0.1212718 0.9590956 0.1580259 0.6113960 0.5332444

# Define a prediction wrapper to tell vi() how to extract predictions from a
# `"workflow"` object instead
pred_fun2 <- function(object, newdata) {
  predict(object, new_data = newdata, type = "class")$.pred_class
}

# One more sanity check (should work now)
pred_fun2(ranger_fit, newdata = head(bivariate_train))
#> [1] One Two One Two One One
#> Levels: One Two

# Now we can get AUC-based permutation VI scores
ranger_fit %>%
  vi(method = "permute",
     target = "Class", metric = "auc",
     pred_wrapper = pred_fun2, train = bivariate_train, reference_class = "One")
#> # A tibble: 2 × 2
#>   Variable Importance
#>   <chr>         <dbl>
#> 1 A            -0.352
#> 2 B            -0.395
```

Created on 2022-09-04 with reprex v2.0.2
Ahh, I think I see the issue. I forgot that @topepo added a method for workflow objects:

```r
#' @export
vi.workflow <- function(object, ...) {  # package: workflows
  vi(workflows::extract_fit_engine(object), ...)
}
```

We might need to alter this so that users can pass in the correct …
A workaround you could try at the moment is to just call …
I think the way to go is to redefine the workflows method using something similar to below:

```r
vi.workflow <- function(object, ...) {  # package: workflows
  dots <- list(...)
  if (!is.null(dots[["method"]])) {
    # FIXME: What if the `method` argument is passed by position only? We could
    # check for that as well by using `if ("model" %in% dots)`, but that could
    # cause other problems if, for example, the user passes in another argument
    # that happens to have the same value (e.g., `target = "model"`)
    if (dots[["method"]] == "model") {
      # Extract underlying model fit
      object <- workflows::extract_fit_engine(object)
    }
  }
  vi.default(object, ...)  # just calling `vi()` would lead to an infinite recursion...
}
```

@juliasilge I think this is the behavior we would want because the other methods that get called (e.g., …
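As a hedged alternative sketch (the `method` choices below mirror vip's documented options, but the exact formal signature and the direct `vi.default()` call are my assumptions, not vip's actual internals): giving the workflow method an explicit `method` formal sidesteps the positional-argument worry entirely, since R's argument matching handles named and positional calls alike:

```r
# Sketch only: assumes vip's vi.default() and the workflows package are available
vi.workflow <- function(object,
                        method = c("model", "firm", "permute", "shap"),
                        ...) {
  method <- match.arg(method)  # matched whether passed by name or position
  if (identical(method, "model")) {
    # Only model-based importance needs the underlying engine fit
    object <- workflows::extract_fit_engine(object)
  }
  # Call vi.default() directly; plain vi() would re-dispatch here and recurse
  vi.default(object, method = method, ...)
}
```

Because `match.arg()` validates against the listed choices, a misspelled `method` would also fail early with an informative error instead of silently falling through.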
But part of me thinks it would be best to just remove the …

```r
ranger_fit %>%
  extract_fit_engine() %>%  ### only needed if `method = "model"` ###
  vi(method = "model",
     target = "Class", metric = "auc",
     pred_wrapper = pred_fun2, train = bivariate_train, reference_class = "One")
```

Since calling …
I'll think about this some more; we are also thinking about how we use vip and DALEX in our packages. We'll be adding a recursive feature elimination tool in tidymodels, and that needs importance scores. Here's some of what I'm thinking about: with model-agnostic tools, we'd like to get importance for the original columns (e.g., before dummies and other features) as well as for derived features (like indicator columns, spline terms, etc.). Of course, model-specific importance is always going to be on the derived features. I bring this up since this might affect the S3 method; I could imagine a parsnip …

I have not looked under the hood of vip in a while. I'll take a look and respond back with some thoughts.
Thanks @topepo, happy to evolve vip to work better with the tidymodels ecosystem, so your and @juliasilge's input is extremely appreciated. I'm also planning on removing the plyr dependency in the next wave of commits and improving the docs/functionality for the other two model-agnostic approaches (e.g., SHAP-based VIPs using the fastshap package). All the model-agnostic procedures in vip (and some benchmark comparisons for permutation methods) are discussed in our R Journal article.
@juliasilge and @topepo, I've got a fix (not sure why I was making it more complicated than it needed to be). Just needed to move the …
Issue should be fixed, @juliasilge; let me know if you find the time to test, and I'll keep the issue open in the meantime. I just pushed @topepo's workflow and parsnip methods from …
Thank you so much for all your work on this! 🙌 I have installed from the main branch here.

I have a question still about how to set up the predictions. The documentation for …

And then it says: …

In tidymodels, you get class labels with …

```r
library(tidymodels)
data("bivariate")

ranger_spec <- rand_forest(trees = 1e3, mode = "classification")

ranger_fit <-
  workflow(Class ~ ., ranger_spec) %>%
  fit(bivariate_train)

pred_fun <- function(object, newdata) {
  predict(object, new_data = newdata, type = "prob")$.pred_One
}

library(vip)
#> 
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#> 
#>     vi

ranger_fit %>%
  vi(method = "permute", target = "Class", metric = "auc", nsim = 10,
     pred_wrapper = pred_fun, train = bivariate_train, reference_class = "One")
#> # A tibble: 2 × 3
#>   Variable Importance   StDev
#>   <chr>         <dbl>   <dbl>
#> 1 B             0.416 0.00626
#> 2 A             0.372 0.0175
```

Created on 2022-10-04 with reprex v2.0.2

The helper function that I wrote here does use the … However, this also "works" when I provide class labels, but returns a wrong answer:

```r
library(tidymodels)
data("bivariate")

ranger_spec <- rand_forest(trees = 1e3, mode = "classification")

ranger_fit <-
  workflow(Class ~ ., ranger_spec) %>%
  fit(bivariate_train)

pred_fun <- function(object, newdata) {
  predict(object, new_data = newdata, type = "class")$.pred_class
}

library(vip)
#> 
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#> 
#>     vi

ranger_fit %>%
  vi(method = "permute", target = "Class", metric = "auc", nsim = 10,
     pred_wrapper = pred_fun, train = bivariate_train, reference_class = "One")
#> # A tibble: 2 × 3
#>   Variable Importance  StDev
#>   <chr>         <dbl>  <dbl>
#> 1 A            -0.350 0.0165
#> 2 B            -0.373 0.0106
```

Created on 2022-10-04 with reprex v2.0.2

Is there a way for these functions to check whether they have a class label (a factor) or a probability (numeric)? This seems like a fairly easy mistake for folks to make.
Hey @juliasilge, good call out. I've thought about this a little bit in the past. I toyed with the idea of checking a sample of the predictions, but I'm not sure how this could actually be done in a generally useful way. For instance, how would the function know what the predictions should be (e.g., class labels vs. probs) if the user supplies their own metric function? I'll put some deeper thought into it!
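For the built-in probability-based metrics at least, one low-tech sketch (every name here is hypothetical, not part of vip) would be to probe the wrapper once on a few training rows before permuting and fail loudly when a probability metric receives factor labels:

```r
# Hypothetical helper, not vip's actual internals: sanity-check a user-supplied
# prediction wrapper before computing permutation importance
check_pred_wrapper <- function(pred_wrapper, object, train, metric) {
  sample_preds <- pred_wrapper(object, utils::head(train))
  prob_metrics <- c("auc", "logloss")  # metrics that need numeric probabilities
  if (metric %in% prob_metrics && !is.numeric(sample_preds)) {
    stop("For metric = \"", metric, "\", `pred_wrapper` must return numeric ",
         "class probabilities, not ", class(sample_preds)[1L], " predictions.",
         call. = FALSE)
  }
  invisible(sample_preds)
}
```

As noted, this can't cover user-supplied metric functions, but it would catch the class-label-vs-probability mix-up from the reprex before any importance scores are returned.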
Should be fixed in devel.
Thank you so much for all your great work on this package! 🙌

We typically see folks have success using vip for regression models but have trouble when trying to use classification models. Take this example:

…

Created on 2022-09-02 with reprex v2.0.2

Here `pred_fun` is getting predictions from the underlying ranger model, not predicting on the workflow. I have also tried something like … which would predict on the workflow, but that doesn't work either.

Do you have advice on how to guide folks with this task? Is something not working as expected?