punisheR is a package for feature and model selection in R. Specifically, this package implements tools for forward and backward model selection (see here). To measure model quality during the selection procedures, we have also implemented the Akaike and Bayesian Information Criteria (see below), both of which punish complex models -- hence this package's name.
As discussed below, we recognize that well-designed versions of these tools already exist in R. This is acceptable to us because the impetus for this project is primarily pedagogical: to improve our understanding of model selection techniques and collaborative software development.
devtools::install_github("UBC-MDS/punisheR")
If you would like to read the comprehensive documentation of punisheR, we recommend that you set build_vignettes = TRUE when you install the package.
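For example, assuming devtools is installed:

# Install punisheR with its vignettes built
devtools::install_github("UBC-MDS/punisheR", build_vignettes = TRUE)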
punisheR has two stepwise feature selection techniques:

- forward(): a feature selection method in which you start with a null model and iteratively add the most useful feature at each step (a rough sketch of this loop is given below)
- backward(): a feature selection method in which you start with a full model and iteratively remove the least useful feature at each step
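To illustrate what the forward procedure does, here is a minimal sketch of the greedy loop. It is not punisheR's actual implementation, and score() is a hypothetical stand-in for whichever criterion (AIC, BIC, or validation r-squared) is used to rank candidate feature sets.

# Minimal sketch of greedy forward selection (illustration only; not the
# punisheR source). `score` is a hypothetical "higher is better" quality
# measure for a candidate feature set.
forward_sketch <- function(X, y, score, n_features) {
  selected  <- integer(0)
  remaining <- seq_len(ncol(X))
  while (length(selected) < n_features) {
    # Score the model obtained by adding each remaining feature in turn
    gains <- sapply(remaining, function(j) {
      score(X[, c(selected, j), drop = FALSE], y)
    })
    best      <- remaining[which.max(gains)]   # keep the best single addition
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected
}

The backward procedure works in reverse, dropping the feature whose removal hurts the score the least at each step.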
This package also has three metrics to evaluate model performance:

- aic(): computes the Akaike information criterion
- bic(): computes the Bayesian information criterion
- r_squared(): computes the coefficient of determination
These three criteria are used to measure the relative quality of models within forward() and backward(). In general, adding more parameters to a model improves its fit to the training data but makes it more susceptible to overfitting. AIC and BIC add a penalty that grows with the number of features in the model; the penalty term is larger in BIC than in AIC. The lower the AIC or BIC score, the better the model.
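Concretely, for a model fit by maximum likelihood, AIC = 2*k - 2*logLik and BIC = log(n)*k - 2*logLik, where k is the number of estimated parameters and n is the number of observations. The penalties can be computed by hand with base R (punisheR's aic() and bic() may differ in implementation detail):

fit <- lm(mpg ~ wt + hp, data = mtcars)
ll  <- as.numeric(logLik(fit))       # maximised log-likelihood
k   <- attr(logLik(fit), "df")       # number of estimated parameters
n   <- nobs(fit)                     # number of observations
aic_by_hand <- 2 * k - 2 * ll        # AIC penalty: 2 per parameter
bic_by_hand <- log(n) * k - 2 * ll   # BIC penalty: log(n) per parameter
c(aic_by_hand, AIC(fit))             # agrees with stats::AIC
c(bic_by_hand, BIC(fit))             # agrees with stats::BIC

Because log(n) exceeds 2 for any sample with more than seven observations, BIC penalizes each additional feature more heavily than AIC does.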
In the R ecosystem, forward and backward selection are implemented in both the olsrr and MASS packages. The former provides ols_step_forward() and ols_step_backward() for forward and backward stepwise selection, respectively; both use the p-value as the metric for feature selection. The latter, MASS, contains stepAIC(), which supports three modes: forward, backward, or both. Its selection procedure is based on an information criterion (AIC), as ours is intended to be. Other packages that provide subset selection for regression models are leaps and bestglm.
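For comparison, a typical stepAIC() call looks roughly like this (using the built-in mtcars data; this is not a punisheR example):

library(MASS)
full <- lm(mpg ~ ., data = mtcars)              # start from the full model
stepAIC(full, direction = "backward", trace = FALSE)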
In punisheR, users can choose between aic, bic, and r-squared as the criterion for forward and backward selection. The number of features returned by these selection algorithms can be controlled with n_features, or with min_change, which specifies the minimum change in the criterion score required for an additional feature to be selected.
library(punisheR)
data <- mtcars_data()
X_train <- data[[1]]
y_train <- data[[2]]
X_val <- data[[3]]
y_val <- data[[4]]
forward(X_train, y_train, X_val, y_val, min_change=0.5,
n_features=NULL, criterion='r-squared', verbose=FALSE)
#> [1] 10
Run on the demo data, forward selection returns a list of features for the best model. In this example, we use r-squared to determine the "best" model, and it can be seen that the function correctly returns only 1 feature.
backward(X_train, y_train, X_val, y_val,
n_features=1, min_change=NULL, criterion='r-squared',
verbose=FALSE)
#> [1] 10
Run on the demo data, backward selection likewise returns a list of features for the best model. With n_features = 1, it can be seen that the function correctly returns only 1 feature.
model <- lm(y_train ~ mpg + cyl + disp, data = X_train)
aic(model)
#> [1] 252.6288
bic(model)
#> [1] 258.5191
When scoring the model using AIC and BIC, we can see that the penalty applied by bic() is greater than the penalty applied by aic(), so the BIC score is higher.
r_squared(model, X_val, y_val)
#> [1] 0.7838625
The value returned by the function r_squared() will be between 0 and 1.
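For reference, the usual definition on held-out data is 1 - SSE/SST. A hand computation along those lines (punisheR's internals may differ in detail, and this assumes X_val contains the same predictor columns used to fit the model) looks like:

preds <- predict(model, newdata = X_val)                   # predictions on the validation set
1 - sum((y_val - preds)^2) / sum((y_val - mean(y_val))^2)  # 1 - SSE/SST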
For a more comprehensive guide to punisheR, you can read the vignette here or the HTML version here.
- Avinash, @avinashkz
- Tariq, @TariqAHassan
- Jill, @topspinj
Instructions and guidelines on how to contribute can be found here. To contribute to this project, you must adhere to the terms outlined in our Contributor Code of Conduct.