---
output: github_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
  echo = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  warning = FALSE,
  message = FALSE
)
options(repos = c(RSPM = "https://packagemanager.posit.co/cran/latest",
                  CRAN = "https://cloud.r-project.org/"),
        install.packages.check.source = "no")
```
# SuperSelector
<!-- badges: start -->
[![R-CMD-check](https://github.com/saraemoore/SuperSelector/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/saraemoore/SuperSelector/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->
## Overview
The `SuperSelector` package extends cross-validated [SuperLearner](https://github.com/ecpolley/SuperLearner) (`SuperLearner::CV.SuperLearner()`) to perform ensemble feature selection. It also provides utility functions for simulating input data for its feature selection algorithm and for visualizing that algorithm's output.
## Installation
The `SuperSelector` package is currently available only from GitHub. To install it using the `remotes` package (itself available from CRAN via `install.packages("remotes")`):
```{r install_pkg, results='hide'}
remotes::install_github("saraemoore/SuperSelector")
```
## Examples
### Simple
From `?sim_toy_data`:
> Simulate data for a 3 continuous covariate, 1 binary outcome example where the outcome (Y) is unrelated to the first covariate (normally-distributed W1) but is related to the second and third covariates (lognormally-distributed W2 and W3, which are also related to one another).
```{r simple_example}
library(SuperSelector)
library(SuperLearner) # for SL.mean, method.NNloglik
library(FSelector) # for cutoff.biggest.diff
dat <- sim_toy_data(n_obs = 1000, rnd_seed = 54321)
# Candidate features: every column except the row identifier and the outcome
res <- cvSLFeatureSelector(dat$Y,
                           dat[, !(colnames(dat) %in% c("ID", "Y"))],
                           family = binomial())
```
```{r simple_example_table, results='asis'}
knitr::kable(res$whichVariable[,"keep_names", drop = FALSE])
```
```{r simple_example_plot1}
res_sum <- summarizeScreen(res$summary,
                           groupCols = "method")
p1 <- cvSLVarImpPlot(res_sum)
p1
```
```{r simple_example_plot2}
res_sum2 <- summarizeScreen(res$summary,
                            groupCols = c("method", "screener"))
p2 <- cvSLVarImpPlot2(res_sum2)
p2
```
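
To make the data-generating process quoted from `?sim_toy_data` more concrete, here is a minimal base R sketch of data with the same shape: one normally distributed noise covariate, two related lognormal covariates, and a binary outcome that depends only on the latter two. The helper name `sketch_toy_data()` and every coefficient in it are assumptions chosen for illustration; this is not the implementation used by `sim_toy_data()`.

```r
# Hypothetical helper, NOT part of SuperSelector: a rough base R sketch of the
# kind of data described in ?sim_toy_data. All coefficients are assumptions.
sketch_toy_data <- function(n_obs = 1000, rnd_seed = 54321) {
  set.seed(rnd_seed)
  W1 <- rnorm(n_obs)                            # normal; unrelated to Y
  W2 <- rlnorm(n_obs)                           # lognormal
  W3 <- rlnorm(n_obs, meanlog = 0.5 * log(W2))  # lognormal; related to W2
  # Outcome probability depends on W2 and W3 only (assumed logistic model)
  p <- plogis(-1.5 + 0.5 * W2 + 0.3 * W3)
  Y <- rbinom(n_obs, size = 1, prob = p)
  data.frame(ID = seq_len(n_obs), W1 = W1, W2 = W2, W3 = W3, Y = Y)
}

toy <- sketch_toy_data(n_obs = 200)
str(toy)
```

With data like this, a well-behaved feature selector would be expected to favor `W2` and `W3` over `W1`, which is the pattern the table and plots above are meant to reveal.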
### Advanced
From `?sim_proppr_data`:
> Simulate data for a 9 covariate, 1 binary outcome example where the outcome (Y) is unrelated to the first four covariates (normally-distributed W1:4) but is related to the remaining five covariates (some of which are also related to one another).
```{r advanced_example}
library(SuperSelector)
library(SuperLearner) # for SL.mean, SL.glm, method.NNloglik
library(FSelector) # for cutoff.biggest.diff, cutoff.k
# remotes::install_github("saraemoore/SLScreenExtra")
library(SLScreenExtra) # for screen.wgtd.lasso, screen.randomForest.imp, screen.earth.backwardprune
dat <- sim_proppr_data(n_obs = 500, rnd_seed = 54321)
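# Drop the identifier, outcome, and raw GCS columns, then append the
# indicator columns that factor_to_indicator() derives from GCS
dat_x <- dplyr::bind_cols(dat[, !(colnames(dat) %in% c("GCS", "ID", "Y"))],
                          factor_to_indicator("GCS", dat))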
# Each entry pairs a prediction algorithm with a screening algorithm
libraryCVSLFeatSel <- list(
  `lasso mean` = c("SL.mean", "screen.wgtd.lasso"),
  `random forest biggest diff mean` = c("SL.mean", "screen.randomForest.imp"),
  `splines biggest diff mean` = c("SL.mean", "screen.earth.backwardprune"),
  `lasso glm` = c("SL.glm", "screen.wgtd.lasso"),
  `random forest biggest diff glm` = c("SL.glm", "screen.randomForest.imp"),
  `splines biggest diff glm` = c("SL.glm", "screen.earth.backwardprune")
)
# Cutoff rules from FSelector; `k` applies only to cutoff.k
libraryMetaFeatSel <- data.frame(selector = c("cutoff.biggest.diff", "cutoff.k", "cutoff.k"),
                                 k = c(NA, 3, 6),
                                 stringsAsFactors = FALSE)
rownames(libraryMetaFeatSel) <- c("biggest diff", "top3", "top6")
res <- cvSLFeatureSelector(as.numeric(dat$Y),
                           dat_x,
                           family = binomial(),
                           SL.library = libraryCVSLFeatSel,
                           selector.library = libraryMetaFeatSel,
                           nFolds = 5)
```
```{r advanced_example_table, results='asis'}
knitr::kable(res$whichVariable[,"keep_names", drop = FALSE])
```
```{r advanced_example_plot1}
res_sum <- summarizeScreen(res$summary,
                           groupCols = "method")
p1 <- cvSLVarImpPlot(res_sum)
p1
```
```{r advanced_example_plot2, fig.height=6}
res_sum2 <- summarizeScreen(res$summary,
                            groupCols = c("method", "screener"))
p2 <- cvSLVarImpPlot2(res_sum2)
p2
```
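
The `selector.library` used in the advanced example names two kinds of cutoff rule: a "top k" rule (`cutoff.k` with `k = 3` or `k = 6`) and a "biggest difference" rule (`cutoff.biggest.diff`). The base R sketch below illustrates the idea behind each rule on a made-up vector of importance scores. The helper names `keep_top_k()` and `keep_biggest_diff()` and the scores themselves are hypothetical; this is not the `FSelector` code, and it does not show how `cvSLFeatureSelector()` applies the rules internally.

```r
# Hypothetical importance scores for five covariates (made-up numbers)
scores <- c(W1 = 0.02, W2 = 0.55, W3 = 0.50, W4 = 0.05, W5 = 0.48)

# "Top k" rule: keep the k highest-scoring features
keep_top_k <- function(scores, k) {
  ranked <- sort(scores, decreasing = TRUE)
  names(ranked)[seq_len(min(k, length(ranked)))]
}

# "Biggest difference" rule: rank the scores, find the largest drop between
# consecutive scores, and keep every feature ranked above that drop
keep_biggest_diff <- function(scores) {
  if (length(scores) < 2) return(names(scores))
  ranked <- sort(scores, decreasing = TRUE)
  gaps <- -diff(unname(ranked))  # drop from each score to the next
  names(ranked)[seq_len(which.max(gaps))]
}

top2 <- keep_top_k(scores, k = 2)
top2       # "W2" "W3"

above_gap <- keep_biggest_diff(scores)
above_gap  # "W2" "W3" "W5"
```

A fixed `k` always returns the same number of features, whereas the biggest-difference rule lets the scores decide where the cut falls. Including both in one selector library makes it possible to compare the two behaviors.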