punisheR is a package for feature and model selection in R. Specifically, this package implements tools for forward and backward model selection (see here). To measure model quality during the selection procedures, we have also implemented the Akaike and Bayesian Information Criteria (see below), both of which punish complex models -- hence this package's name.
As discussed below, we recognize that well-designed versions of these tools already exist in R. This is acceptable to us because the impetus for this project is primarily pedagogical: to improve our understanding of model selection techniques and collaborative software development.
devtools::install_github("UBC-MDS/punisheR")
If you would like to read the comprehensive documentation of punisheR, we recommend that you set build_vignettes = TRUE when you install the package.
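For example, assuming devtools is installed:

# Install punisheR with its vignettes built
devtools::install_github("UBC-MDS/punisheR", build_vignettes = TRUE)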
punisheR has two stepwise feature selection techniques:

- forward(): a feature selection method in which you start with a null model and iteratively add the most useful feature at each step (a rough sketch of this loop is given below)
- backward(): a feature selection method in which you start with a full model and iteratively remove the least useful feature at each step
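To illustrate what the forward procedure does, here is a minimal sketch of the greedy loop. It is not punisheR's actual implementation, and score() is a hypothetical stand-in for whichever criterion (AIC, BIC, or validation r-squared) is used to rank candidate feature sets.

# Minimal sketch of greedy forward selection (illustration only; not the
# punisheR source). `score` is a hypothetical "higher is better" quality
# measure for a candidate feature set.
forward_sketch <- function(X, y, score, n_features) {
  selected  <- integer(0)
  remaining <- seq_len(ncol(X))
  while (length(selected) < n_features) {
    # Score the model obtained by adding each remaining feature in turn
    gains <- sapply(remaining, function(j) {
      score(X[, c(selected, j), drop = FALSE], y)
    })
    best      <- remaining[which.max(gains)]   # keep the best single addition
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected
}

The backward procedure works in reverse, dropping the feature whose removal hurts the score the least at each step.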
This package also has three metrics to evaluate model performance:

- aic(): computes the Akaike information criterion
- bic(): computes the Bayesian information criterion
- r_squared(): computes the coefficient of determination
These three criteria are used to measure the relative quality of models within forward() and backward(). In general, adding more parameters to a model improves its fit to the training data but makes it more susceptible to overfitting. AIC and BIC add a penalty that grows with the number of features in the model; the penalty term is larger in BIC than in AIC. The lower the AIC or BIC score, the better the model.
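Concretely, for a model fit by maximum likelihood, AIC = 2*k - 2*logLik and BIC = log(n)*k - 2*logLik, where k is the number of estimated parameters and n is the number of observations. The penalties can be computed by hand with base R (punisheR's aic() and bic() may differ in implementation detail):

fit <- lm(mpg ~ wt + hp, data = mtcars)
ll  <- as.numeric(logLik(fit))       # maximised log-likelihood
k   <- attr(logLik(fit), "df")       # number of estimated parameters
n   <- nobs(fit)                     # number of observations
aic_by_hand <- 2 * k - 2 * ll        # AIC penalty: 2 per parameter
bic_by_hand <- log(n) * k - 2 * ll   # BIC penalty: log(n) per parameter
c(aic_by_hand, AIC(fit))             # agrees with stats::AIC
c(bic_by_hand, BIC(fit))             # agrees with stats::BIC

Because log(n) exceeds 2 for any sample with more than seven observations, BIC penalizes each additional feature more heavily than AIC does.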
In the R ecosystem, forward and backward selection are implemented in both the olsrr and MASS packages. The former provides ols_step_forward() and ols_step_backward() for forward and backward stepwise selection, respectively; both use the p-value as the metric for feature selection. The latter, MASS, contains stepAIC(), which supports three modes: forward, backward, or both. Its selection procedure is based on an information criterion (AIC), as ours is intended to be. Other packages that provide subset selection for regression models are leaps and bestglm.
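For comparison, a typical stepAIC() call looks roughly like this (using the built-in mtcars data; this is not a punisheR example):

library(MASS)
full <- lm(mpg ~ ., data = mtcars)              # start from the full model
stepAIC(full, direction = "backward", trace = FALSE)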
In punisheR, users can choose between aic, bic, and r-squared as the criterion for forward and backward selection. The number of features returned by these selection algorithms can be controlled with n_features, or with min_change, which specifies the minimum change in the criterion score required for an additional feature to be selected.
library(punisheR)
data <- mtcars_data()
X_train <- data[[1]]
y_train <- data[[2]]
X_val <- data[[3]]
y_val <- data[[4]]
forward(X_train, y_train, X_val, y_val, min_change=0.5,
n_features=NULL, criterion='r-squared', verbose=FALSE)
#> [1] 10
Run on the demo data, forward selection returns a list of features for the best model. In this example, we use r-squared to determine the "best" model, and it can be seen that the function correctly returns only 1 feature.
backward(X_train, y_train, X_val, y_val,
n_features=1, min_change=NULL, criterion='r-squared',
verbose=FALSE)
#> [1] 10
Run on the demo data, backward selection likewise returns a list of features for the best model. With n_features = 1, it can be seen that the function correctly returns only 1 feature.
model <- lm(y_train ~ mpg + cyl + disp, data = X_train)
aic(model)
#> [1] 252.6288
bic(model)
#> [1] 258.5191
When scoring the model using AIC and BIC, we can see that the penalty applied by bic() is greater than the penalty applied by aic(), so the BIC score is higher.
r_squared(model, X_val, y_val)
#> [1] 0.7838625
The value returned by the function r_squared() will be between 0 and 1.
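For reference, the usual definition on held-out data is 1 - SSE/SST. A hand computation along those lines (punisheR's internals may differ in detail, and this assumes X_val contains the same predictor columns used to fit the model) looks like:

preds <- predict(model, newdata = X_val)                   # predictions on the validation set
1 - sum((y_val - preds)^2) / sum((y_val - mean(y_val))^2)  # 1 - SSE/SST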
For a more comprehensive guide to punisheR, you can read the vignette here or the HTML version here.
- Avinash, @avinashkz
- Tariq, @TariqAHassan
- Jill, @topspinj
Instructions and guidelines on how to contribute can be found here. To contribute to this project, you must adhere to the terms outlined in our Contributor Code of Conduct.