An R package that performs stepwise forward and backward feature selection

UBC-MDS/punisheR

PunisheR

punisheR is a package for feature and model selection in R. Specifically, it implements tools for forward and backward model selection (see here). To measure model quality during the selection procedures, we have also implemented the Akaike and Bayesian Information Criteria (see below), both of which punish complex models -- hence this package's name.

As examined below, we recognize that well-designed versions of these tools already exist in R. This is acceptable to us because the impetus for this project is primarily pedagogical: it is intended to improve our understanding of model selection techniques and collaborative software development.

Installation

devtools::install_github("UBC-MDS/punisheR")

If you would like to read the comprehensive documentation for punisheR, we recommend passing build_vignettes = TRUE to devtools::install_github() when you install the package.

Functions included:

punisheR has two stepwise feature selection techniques:

  • forward(): a feature selection method in which you start with a null model and iteratively add useful features
  • backward(): a feature selection method in which you start with a full model and iteratively remove the least useful feature at each step
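
The forward procedure described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: forward_sketch() is a hypothetical helper, it scores models with stats::AIC() on a single data set (the built-in mtcars) rather than on a train/validation split, and it stops once the best addition lowers the score by less than min_change.

```r
# forward_sketch() is a hypothetical helper, not the punisheR API.
forward_sketch <- function(data, response, min_change = 0.5) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  # Start from the null (intercept-only) model.
  best_score <- AIC(lm(reformulate("1", response), data = data))
  while (length(candidates) > 0) {
    # Score every model that adds exactly one more candidate feature.
    scores <- vapply(candidates, function(feature) {
      AIC(lm(reformulate(c(selected, feature), response), data = data))
    }, numeric(1))
    # Stop when the best addition does not lower AIC by at least min_change.
    if (best_score - min(scores) < min_change) break
    best_feature <- names(scores)[which.min(scores)]
    selected <- c(selected, best_feature)
    candidates <- setdiff(candidates, best_feature)
    best_score <- min(scores)
  }
  list(features = selected, score = best_score)
}

result <- forward_sketch(mtcars, response = "mpg")
print(result$features)
```

Backward selection mirrors this loop: it starts from the full model and removes one feature per step.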

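This package also has three metrics to evaluate model performance:

  • aic(): the Akaike Information Criterion
  • bic(): the Bayesian Information Criterion
  • r_squared(): the coefficient of determination (r-squared)

These three criteria are used to measure the relative quality of models within forward() and backward(). In general, adding parameters to a model improves its fit to the training data but makes it more susceptible to overfitting. AIC and BIC counter this by adding a penalty for the number of features in a model. The penalty term is larger in BIC than in AIC whenever the sample contains at least 8 observations. The lower the AIC or BIC score, the better the model; for r-squared, a higher score is better.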
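
To make the two penalties concrete, here is a minimal sketch using the standard likelihood-based definitions, AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of estimated parameters, n the number of observations and L the maximised likelihood. It relies only on base R and the built-in mtcars data; whether punisheR's aic() and bic() use exactly this form is an assumption.

```r
# Fit a small linear model on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

log_lik <- as.numeric(logLik(fit))  # maximised log-likelihood
k <- attr(logLik(fit), "df")        # estimated parameters (coefficients + error variance)
n <- nobs(fit)                      # number of observations

aic_manual <- 2 * k - 2 * log_lik
bic_manual <- log(n) * k - 2 * log_lik

# Both hand-computed scores should agree with the built-in stats functions.
print(all.equal(aic_manual, AIC(fit)))
print(all.equal(bic_manual, BIC(fit)))

# The per-parameter penalty is log(n) for BIC and 2 for AIC,
# so BIC penalises more whenever log(n) > 2.
print(bic_manual > aic_manual)
```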

How does the package fit into the existing R ecosystem?

In the R ecosystem, forward and backward selection are implemented in both the olsrr and MASS packages. The former provides ols_step_forward() and ols_step_backward() for forward and backward stepwise selection, respectively. Both of these use the p-value as the metric for feature selection. The latter, MASS, contains stepAIC(), which supports three modes: forward, backward or both. Its selection procedure is based on an information criterion (AIC), as we intend ours to be. Other packages that provide subset selection for regression models are leaps and bestglm.
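
For comparison, the following is a minimal sketch of AIC-based backward selection with MASS. It assumes the MASS package is installed and uses the built-in mtcars data with mpg as the response, which is a choice made here for illustration only.

```r
library(MASS)  # provides stepAIC()

# Start from the full model containing every available predictor.
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination driven by AIC; trace = 0 suppresses the step-by-step log.
reduced <- stepAIC(full_model, direction = "backward", trace = 0)

print(formula(reduced))
```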

In punisheR, users can choose between aic, bic and r-squared as the metric for forward and backward selection. The number of features returned by these selection algorithms can be controlled either with n_features, which fixes the number of features, or with min_change, which sets the minimum change in the criterion score required for an additional feature to be selected.
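
As a rough illustration of the n_features stopping rule, the sketch below runs backward elimination until a fixed number of features remains. backward_sketch() is a hypothetical helper written for this example, not the punisheR API; it scores models with stats::BIC() on the built-in mtcars data instead of using a validation set.

```r
# backward_sketch() is a hypothetical helper, not the punisheR API.
backward_sketch <- function(data, response, n_features) {
  stopifnot(n_features >= 1)
  # Start from the full set of predictors.
  selected <- setdiff(names(data), response)
  while (length(selected) > n_features) {
    # Score each model obtained by dropping exactly one feature.
    scores <- vapply(selected, function(feature) {
      remaining <- setdiff(selected, feature)
      BIC(lm(reformulate(remaining, response), data = data))
    }, numeric(1))
    # Drop the feature whose removal gives the lowest (best) BIC.
    selected <- setdiff(selected, names(scores)[which.min(scores)])
  }
  selected
}

kept <- backward_sketch(mtcars, response = "mpg", n_features = 3)
print(kept)
```

A min_change rule would replace the fixed-size loop condition with a check on how much the score moves between steps.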

Usage examples

Load data

library(punisheR)

data <- mtcars_data()
X_train <- data[[1]]
y_train <- data[[2]]
X_val <- data[[3]]
y_val <- data[[4]]

Forward selection

forward(X_train, y_train, X_val, y_val, min_change=0.5,
    n_features=NULL, criterion='r-squared', verbose=FALSE)
    
#> [1] 10

When run on the demo data, forward() returns the indices of the features in the best model. In this example, we use r-squared to determine the "best" model. Here the function returns a single feature, the one at index 10: with min_change=0.5, a further feature is added only if it changes the r-squared score by at least 0.5.

Backward selection

backward(X_train, y_train, X_val, y_val,
    n_features=1, min_change=NULL, criterion='r-squared',
    verbose=FALSE)
    
#> [1] 10

When run on the demo data, backward() likewise returns the indices of the features in the best model. Because n_features=1, the function correctly returns only 1 feature (index 10).

Scoring a model with AIC, BIC, and r-squared

model <- lm(y_train ~ mpg + cyl + disp, data = X_train)

aic(model)
#> [1] 252.6288

bic(model)
#> [1] 258.5191

Scoring the same model with both criteria shows that the BIC score (258.5191) is higher than the AIC score (252.6288), because bic applies a larger penalty than aic.

r_squared(model, X_val, y_val)
#> [1] 0.7838625

The value returned by r_squared() is typically between 0 and 1, with higher values indicating a better fit.
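
One common way to compute r-squared on held-out data is 1 - SS_res / SS_tot, sketched below. r_squared_sketch() is a hypothetical helper that only mirrors the argument order shown above; it is an assumption that punisheR's r_squared() uses the same formula. The split of the built-in mtcars data into the first 24 and last 8 rows is arbitrary and chosen only to keep the example deterministic.

```r
# r_squared_sketch() is a hypothetical helper, not punisheR's r_squared().
r_squared_sketch <- function(model, X_val, y_val) {
  predictions <- predict(model, newdata = X_val)
  ss_res <- sum((y_val - predictions)^2)  # residual sum of squares
  ss_tot <- sum((y_val - mean(y_val))^2)  # total sum of squares
  1 - ss_res / ss_tot
}

# Deterministic split: first 24 rows for training, last 8 for validation.
train <- mtcars[1:24, ]
val <- mtcars[25:32, ]
model <- lm(mpg ~ wt + hp, data = train)

print(r_squared_sketch(model, val[, c("wt", "hp")], val$mpg))
```

Under this definition the score cannot exceed 1, and on validation data it can fall below 0 when the model predicts worse than the validation mean.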

Vignette

For a more comprehensive guide to punisheR, you can read the vignette here or the HTML version here.

Contributors:

Instructions and guidelines on how to contribute can be found here. To contribute to this project, you must adhere to the terms outlined in our Contributor Code of Conduct.