---
title: "BranchGLM: Efficient Variable Selection for GLMs in R"
output: github_document
---
<!-- badges: start -->
[![CRAN status](https://www.r-pkg.org/badges/version/BranchGLM)](https://CRAN.R-project.org/package=BranchGLM)
[![Codecov test coverage](https://codecov.io/gh/JacobSeedorff21/BranchGLM/branch/master/graph/badge.svg)](https://app.codecov.io/gh/JacobSeedorff21/BranchGLM?branch=master)
<!-- badges: end -->
# Overview
**BranchGLM** is an R package for fitting generalized linear models (GLMs) and
performing efficient variable selection for them.
# How to install
**BranchGLM** can be installed using the `install.packages()` function.
```{r, eval = FALSE}
install.packages("BranchGLM")
```
It can also be installed via the `install_github()` function from the
**devtools** package.
```{r, eval = FALSE}
devtools::install_github("JacobSeedorff21/BranchGLM")
```
# Usage
## Fitting GLMs
### Linear regression
**BranchGLM** can fit large linear regression models very quickly; below is a
comparison of runtimes with the built-in `lm()` function. This comparison is
based upon a randomly generated linear regression model with 10000 observations
and 250 covariates.
```{r linear, warning = FALSE, message = FALSE, fig.path = "README_images/"}
# Loading libraries
library(BranchGLM)
library(microbenchmark)
library(ggplot2)

# Setting seed
set.seed(99601)

# Defining function to generate dataset for linear regression
NormalSimul <- function(n, d, Bprob = .5){
  x <- MASS::mvrnorm(n, mu = rep(1, d), Sigma = diag(.5, nrow = d, ncol = d) +
                       matrix(.5, ncol = d, nrow = d))
  beta <- rnorm(d + 1, mean = 1, sd = 1)
  beta[sample(2:length(beta), floor((length(beta) - 1) * Bprob))] <- 0
  y <- x %*% beta[-1] + beta[1] + rnorm(n, sd = 3)
  df <- cbind(y, x) |>
    as.data.frame()
  df$y <- df$V1
  df$V1 <- NULL
  df
}

# Generating linear regression dataset
df <- NormalSimul(10000, 250)

# Timing linear regression methods with microbenchmark
Times <- microbenchmark("BranchGLM" = {BranchGLM(y ~ ., data = df,
                                                 family = "gaussian",
                                                 link = "identity")},
                        "Parallel BranchGLM" = {BranchGLM(y ~ ., data = df,
                                                          family = "gaussian",
                                                          link = "identity",
                                                          parallel = TRUE)},
                        "lm" = {lm(y ~ ., data = df)},
                        times = 100)

# Plotting results
autoplot(Times, log = FALSE)
```
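`BranchGLM()` returns a fitted model object, so a single fit can also be
inspected directly. Below is a minimal sketch (not evaluated here), assuming
the `coef()` and `predict()` methods that **BranchGLM** provides for fitted
model objects:
```{r, eval = FALSE}
# Fitting a single linear regression model on the simulated data
# (assumes the coef() and predict() methods for BranchGLM fits)
fit <- BranchGLM(y ~ ., data = df, family = "gaussian", link = "identity")

# Estimated coefficients and in-sample predictions
head(coef(fit))
head(predict(fit))
```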
### Logistic regression
**BranchGLM** can also fit large logistic regression models very quickly; below
is a comparison of runtimes with the built-in `glm()` function. This comparison
is based upon a randomly generated logistic regression model with 10000 observations
and 100 covariates.
```{r logistic, warning = FALSE, message = FALSE, fig.path = "README_images/"}
# Setting seed
set.seed(78771)

# Defining function to generate dataset for logistic regression
LogisticSimul <- function(n, d, Bprob = .5, sd = 1, rho = 0.5){
  x <- MASS::mvrnorm(n, mu = rep(1, d), Sigma = diag(1 - rho, nrow = d, ncol = d) +
                       matrix(rho, ncol = d, nrow = d))
  beta <- rnorm(d + 1, mean = 0, sd = sd)
  beta[sample(2:length(beta), floor((length(beta) - 1) * Bprob))] <- 0
  beta[beta != 0] <- beta[beta != 0] - mean(beta[beta != 0])
  p <- 1/(1 + exp(-x %*% beta[-1] - beta[1]))
  y <- rbinom(n, 1, p)
  df <- cbind(y, x) |>
    as.data.frame()
  df
}

# Generating logistic regression dataset
df <- LogisticSimul(10000, 100)

# Timing logistic regression methods with microbenchmark
Times <- microbenchmark("BFGS" = {BranchGLM(y ~ ., data = df, family = "binomial",
                                            link = "logit", method = "BFGS")},
                        "L-BFGS" = {BranchGLM(y ~ ., data = df, family = "binomial",
                                              link = "logit", method = "LBFGS")},
                        "Fisher" = {BranchGLM(y ~ ., data = df, family = "binomial",
                                              link = "logit", method = "Fisher")},
                        "Parallel BFGS" = {BranchGLM(y ~ ., data = df, family = "binomial",
                                                     link = "logit", method = "BFGS",
                                                     parallel = TRUE)},
                        "Parallel L-BFGS" = {BranchGLM(y ~ ., data = df, family = "binomial",
                                                       link = "logit", method = "LBFGS",
                                                       parallel = TRUE)},
                        "Parallel Fisher" = {BranchGLM(y ~ ., data = df, family = "binomial",
                                                       link = "logit", method = "Fisher",
                                                       parallel = TRUE)},
                        "glm" = {glm(y ~ ., data = df, family = "binomial")},
                        times = 100)

# Plotting results
autoplot(Times, log = FALSE)
```
## Best subset selection
**BranchGLM** can also perform best subset selection very quickly; here is a
comparison of runtimes with the `bestglm()` function from the **bestglm** package.
This comparison is based upon a randomly generated logistic regression model with 1000
observations and 15 covariates.
```{r, warning = FALSE, message = FALSE}
# Loading bestglm
library(bestglm)

# Setting seed and creating dataset
set.seed(33391)
df <- LogisticSimul(1000, 15, .5, sd = 0.5)

# Times
## Timing switch branch and bound
BranchTime <- system.time(BranchVS <- VariableSelection(y ~ ., data = df,
                                                        family = "binomial", link = "logit",
                                                        type = "switch branch and bound",
                                                        showprogress = FALSE,
                                                        parallel = FALSE, method = "Fisher",
                                                        bestmodels = 10, metric = "AIC"))
BranchTime

## Timing exhaustive search
Xy <- cbind(df[, -1], df[, 1])
ExhaustiveTime <- system.time(BestVS <- bestglm(Xy, family = binomial(), IC = "AIC",
                                                TopModels = 10))
ExhaustiveTime
```
For this simulated logistic regression model with 15 covariates, finding the
top 10 models according to AIC with the switch branch and bound algorithm is about
`r round(ExhaustiveTime[[3]] / BranchTime[[3]], 2)` times faster than an
exhaustive search.
### Checking results
```{r}
# Results
## Checking if both methods give same results
### Variable inclusion indicators from BranchGLM (dropping the intercept row)
BranchModels <- t(BranchVS$bestmodels[-1, ] == 1)
### Variable inclusion indicators from bestglm (dropping the criterion column)
ExhaustiveModels <- as.matrix(BestVS$BestModels[, -16])
identical(BranchModels, ExhaustiveModels)
```
Hence, the two methods find the same top 10 models, and the switch branch and bound
algorithm is much faster than an exhaustive search.
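The AIC values of the top models can also be compared directly. Below is a
minimal sketch (not evaluated here); it assumes the `bestmetrics` component of
`VariableSelection` objects used later in this README and the `Criterion` column
of the `BestModels` output from `bestglm()`:
```{r, eval = FALSE}
# AIC values of the top 10 models from the switch branch and bound algorithm
# (bestmetrics holds the metric values of the best models found)
BranchVS$bestmetrics

# AIC values of the top 10 models from the exhaustive search
# (Criterion is assumed to be the information criterion column of BestModels)
BestVS$BestModels$Criterion
```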
### Visualization
There is also a convenient way to visualize the top models with the **BranchGLM**
package.
```{r visualization1, fig.path = "README_images/"}
# Plotting models
plot(BranchVS, type = "b")
```
## Backward elimination
**BranchGLM** can also perform backward elimination very quickly with a bounding
algorithm; here is a comparison of runtimes with the `step()` function from the
**stats** package. This comparison is based upon a randomly generated logistic
regression model with 1000 observations and 50 covariates.
```{r, warning = FALSE, message = FALSE}
# Setting seed and creating dataset
set.seed(33391)
df <- LogisticSimul(1000, 50, .5, sd = 0.5)

## Times
### Timing fast backward elimination
BackwardTime <- system.time(BackwardVS <- VariableSelection(y ~ ., data = df,
                                                            family = "binomial", link = "logit",
                                                            type = "fast backward",
                                                            showprogress = FALSE,
                                                            parallel = FALSE, method = "LBFGS",
                                                            metric = "AIC"))
BackwardTime

### Timing step function
fullmodel <- glm(y ~ ., data = df, family = binomial(link = "logit"))
stepTime <- system.time(BackwardStep <- step(fullmodel, direction = "backward", trace = 0))
stepTime
```
The fast backward elimination algorithm from the **BranchGLM** package was
about `r round(stepTime[[3]] / BackwardTime[[3]], 2)` times faster than `step()`
for this logistic regression model.
### Checking results
```{r}
## Checking if both methods give same results
### Getting names of variables in final model from BranchGLM
BackwardCoef <- coef(BackwardVS)
BackwardCoef <- BackwardCoef[BackwardCoef != 0, ]
BackwardCoef <- BackwardCoef[order(names(BackwardCoef))]
### Getting names of variables in final model from step
BackwardCoefGLM <- coef(BackwardStep)
BackwardCoefGLM <- BackwardCoefGLM[order(names(BackwardCoefGLM))]
identical(names(BackwardCoef), names(BackwardCoefGLM))
```
Hence, the two methods select the same best model, and the fast backward
elimination algorithm is much faster than `step()`.
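The AIC of the two final models can be compared as well. A minimal sketch (not
evaluated here), using the `bestmetrics` component of `VariableSelection`
objects shown later in this README and the standard `AIC()` generic:
```{r, eval = FALSE}
# AIC of the best model found by fast backward elimination
# (bestmetrics holds the metric values of the best models found)
BackwardVS$bestmetrics[1]

# AIC of the final model found by step()
AIC(BackwardStep)
```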
### Visualization
There is also a convenient way to visualize the backward elimination path with
the **BranchGLM** package.
```{r visualization2, fig.path = "README_images/"}
## Plotting models
plot(BackwardVS, type = "b")
```
## Double backward elimination
**BranchGLM** can also perform a variant of backward elimination where up to
two variables can be removed in one step. We call this method double backward
elimination, and we have also developed a faster variant that we call fast double
backward elimination. This method can result in better solutions than those obtained
from traditional backward elimination, but it is also slower. Next, we show a comparison
of runtimes and BIC values for fast backward elimination and fast double backward
elimination. This comparison is based upon a randomly generated logistic regression model
with 1000 observations and 100 covariates.
```{r, warning = FALSE, message = FALSE}
# Setting seed and creating dataset
set.seed(79897)
df <- LogisticSimul(1000, 100, .5, sd = 0.5)

## Times
### Timing fast backward
BackwardTime <- system.time(BackwardVS <- VariableSelection(y ~ ., data = df,
                                                            family = "binomial", link = "logit",
                                                            type = "fast backward",
                                                            showprogress = FALSE,
                                                            parallel = FALSE, method = "LBFGS",
                                                            metric = "BIC"))
BackwardTime

### Timing fast double backward
DoubleBackwardTime <- system.time(DoubleBackwardVS <- VariableSelection(y ~ ., data = df,
                                                                        family = "binomial",
                                                                        link = "logit",
                                                                        type = "fast double backward",
                                                                        showprogress = FALSE,
                                                                        parallel = FALSE,
                                                                        method = "LBFGS",
                                                                        metric = "BIC"))
DoubleBackwardTime
```
The fast backward elimination algorithm from the **BranchGLM** package was
about `r round(DoubleBackwardTime[[3]] / BackwardTime[[3]], 2)` times faster than
the fast double backward elimination algorithm for this logistic regression model.
However, the final model from double backward elimination had a BIC of
`r round(DoubleBackwardVS$bestmetrics[1], 2)`, while the final model from
traditional backward elimination had a BIC of
`r round(BackwardVS$bestmetrics[1], 2)`. The difference in BIC between these
two methods is fairly small for this logistic regression model, but for some
regression models the difference can be quite large.
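To see where the two final models actually disagree, the selected variables can
be compared directly. Below is a minimal sketch (not evaluated here), reusing
the `coef()` accessor for `VariableSelection` objects shown earlier:
```{r, eval = FALSE}
# Variables kept by each method (nonzero coefficients in the final model)
BackwardCoef <- coef(BackwardVS)
DoubleBackwardCoef <- coef(DoubleBackwardVS)
BackwardVars <- names(BackwardCoef[BackwardCoef != 0, ])
DoubleBackwardVars <- names(DoubleBackwardCoef[DoubleBackwardCoef != 0, ])

# Variables on which the two final models disagree
setdiff(BackwardVars, DoubleBackwardVars)
setdiff(DoubleBackwardVars, BackwardVars)
```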
## Variable Importance
Calculating L0-penalization-based variable importance is quite a bit slower than
just performing best subset selection. However, branch and bound algorithms are
used to make this process considerably faster than an exhaustive search.
```{r}
# Getting variable importance values
VITime <- system.time(BranchVI <- VariableImportance(BranchVS, showprogress = FALSE))
BranchVI
# Displaying time to calculate VI values
VITime
# Plotting variable importance
barplot(BranchVI)
```
Furthermore, p-values based on the variable importance values can be computed
with the parametric bootstrap.
```{r}
# Getting p-values
pvalsTime <- system.time(pvals <- VariableImportance.boot(BranchVI, nboot = 100,
                                                          showprogress = FALSE))
pvals
```
Performing the parametric bootstrap with 100 bootstrap replications took about
`r round(pvalsTime[[3]] / VITime[[3]], 2)` times longer than finding the
variable importance values.
```{r}
# Plotting results
boxplot(pvals, las = 1)
```