DAGBagM: Learning Directed Acyclic Graphs via Bootstrap Aggregation for Mixture of Continuous and Binary Variables
This repository contains 3 folders.
dagbagM:
contains the R package "dagbagM" for learning directed acyclic graphs for a mixture of continuous and binary variables
dagbag:
contains the R package "dagbag". The function dagbag::score_shd() is used for aggregating the DAGs learnt from bootstrap resamples
Simulation_scripts:
contains the R scripts for replicating the simulation results in the manuscript.
require(doParallel)
# install_github() is provided by the devtools package
devtools::install_github("jie108/dagbagM/dagbagM")
devtools::install_github("jie108/dagbagM/dagbag")
dagbagM
hc: A function that learns a DAG model from the given data (no bootstrap resampling) by the hill climbing algorithm, for a mixture of continuous and binary variables
dagbagM::hc(Y,nodeType, whiteList, blackList, tol, standardize, maxStep, restart, seed, verbose)
hc_boot_parallel: A function that learns a DAG model for each bootstrap resample of the given data by the hill climbing algorithm, for a mixture of continuous and binary variables
dagbagM::hc_boot_parallel(Y, node.type, n.boot, whiteList, blackList, maxStep, standardize, tol, restart, seed, nodeShuffle, numThread, verbose)
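To make the bootstrap step concrete, here is a minimal base R sketch of what "a bootstrap resample of Y" means: rows of the data matrix are sampled with replacement. The names `Y_toy` and `draw_bootstrap` are hypothetical and exist only for this illustration; this is not the internal code of hc_boot_parallel.

```r
# Toy data: n = 6 samples, p = 3 variables (simulated, for illustration only)
set.seed(1)
Y_toy <- matrix(rnorm(18), nrow = 6, ncol = 3)

# Hypothetical helper: return a list of n.boot resampled copies of Y,
# each built by sampling n row indices with replacement
draw_bootstrap <- function(Y, n.boot) {
  n <- nrow(Y)
  lapply(seq_len(n.boot), function(b) {
    idx <- sample.int(n, size = n, replace = TRUE)
    Y[idx, , drop = FALSE]
  })
}

resamples <- draw_bootstrap(Y_toy, n.boot = 4)
length(resamples)      # number of resamples
dim(resamples[[1]])    # each resample has the same dimensions as Y_toy
```

In the package, a DAG is then learned on each resample, and the resulting DAGs are aggregated with score_shd.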
dagbag
score_shd: A function that aggregates an ensemble of DAGs using the structural Hamming distance (SHD). It returns a DAG that minimizes the overall distance to the ensemble.
score_shd(boot.adj, alpha, threshold, max.step, blacklist, whitelist, print)
Arguments of hc and hc_boot_parallel
Parameter | Default | Description |
---|---|---|
Y | | an n by p data matrix, where n is the sample size and p is the number of variables |
n.boot (only for hc_boot_parallel) | 0 | an integer: the number of bootstrap resamples of the data matrix Y |
node.type | | a vector of length p specifying the type of each variable/node: "c" for continuous and "b" for binary |
maxStep | 500 | an integer: the maximum number of search steps of the hill climbing algorithm |
standardize | TRUE | logical: whether to standardize the data to have mean zero and sd one |
nodeShuffle (only for hc_boot_parallel) | FALSE | logical: whether to shuffle the order of the variables before DAG learning |
restart | 0 | an integer: the number of times to restart the search after a local optimum is reached. The purpose is to search for the global optimum |
blackList | NULL | a p by p 0-1 matrix: if the (i,j)th entry is "1", then the edge i -> j will be excluded from the DAG during the search |
whiteList | NULL | a p by p 0-1 matrix: if the (i,j)th entry is "1", then the edge i -> j will always be included in the DAG during the search |
tol | 1e-06 | a scalar: the threshold below which values are treated as zero |
numThread (only for hc_boot_parallel) | | an integer: the number of threads used to learn DAGs from the bootstrap resamples in parallel |
verbose | FALSE | logical: whether to print the step information |
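The constraint matrices and the node type vector can be built with plain base R. The sketch below is only an illustration under assumptions: the four-variable setup, the choice of which edges to forbid or force, and the object names `node_type`, `black` and `white` are all hypothetical.

```r
p <- 4
node_type <- c("c", "c", "b", "c")   # hypothetical: the third variable is binary

# blackList: entry (i, j) = 1 excludes the edge i -> j from the search
black <- matrix(0, nrow = p, ncol = p)
black[3, c(1, 2, 4)] <- 1            # forbid every edge leaving node 3

# whiteList: entry (i, j) = 1 forces the edge i -> j into the DAG
white <- matrix(0, nrow = p, ncol = p)
white[1, 2] <- 1                     # always keep the edge 1 -> 2

# Sanity check: no edge may be both required and forbidden
stopifnot(!any(black == 1 & white == 1))
```

These objects would then be passed as the node type, blackList and whiteList arguments. The consistency check at the end is a precaution added for this example; whether the package performs a similar check itself is not stated here.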
Arguments of score_shd
Parameter | Default | Description |
---|---|---|
boot.adj | | a p by p by B array, where B is the number of DAGs to be aggregated. It records the adjacency matrices. It may be the output of the "score" function. |
alpha | 1 | a positive scalar: alpha defines which member of the gSHD family is used to aggregate the DAGs. In general, the larger the alpha, the more aggressive the aggregation: fewer edges are retained, leading to a smaller FDR and less power |
threshold | 0 | a scalar: it defines the frequency cut-off value; "0" corresponds to a cut-off of 0.5 |
max.step | 500 | an integer: the maximum number of search steps |
blacklist | NULL | a p by p 0-1 matrix: if the (i,j)th entry is "1", then the edge i -> j will be excluded from the DAG during the search |
whitelist | NULL | a p by p 0-1 matrix: if the (i,j)th entry is "1", then the edge i -> j will always be included in the DAG during the search |
print | FALSE | logical: whether to print the step information |
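To give a feel for the p by p by B input and the frequency cut-off, the sketch below computes edge selection frequencies over a toy ensemble and keeps edges that appear in more than half of the DAGs. This majority vote is a deliberately simplified stand-in, not the search performed by score_shd; the toy array `boot_toy` is made up, and the convention that entry (i, j) = 1 encodes the edge i -> j is an assumption carried over from the blacklist description.

```r
# Three toy DAGs on p = 3 nodes, stored as a p x p x B array
# (assumed convention: entry [i, j, b] = 1 means edge i -> j is in DAG b)
p <- 3
B <- 3
boot_toy <- array(0, dim = c(p, p, B))
boot_toy[1, 2, 1] <- 1; boot_toy[2, 3, 1] <- 1   # DAG 1: 1 -> 2 -> 3
boot_toy[1, 2, 2] <- 1; boot_toy[1, 3, 2] <- 1   # DAG 2: 1 -> 2, 1 -> 3
boot_toy[1, 2, 3] <- 1                           # DAG 3: 1 -> 2

# Selection frequency of each directed edge across the ensemble
freq <- apply(boot_toy, c(1, 2), mean)

# Simplified stand-in for aggregation: keep edges seen in more than half the DAGs
adj_vote <- (freq > 0.5) * 1
adj_vote
```

A plain majority vote like this can produce cycles in general (for example, three edges forming a loop can each appear in two out of three acyclic graphs), which is one reason an aggregation procedure needs a search that keeps the result acyclic.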
hc returns a list of four components
Object | Description |
---|---|
adjacency | adjacency matrix of the learned DAG |
score | BIC score at each search step |
operations | a matrix recording the selected operation, addition, deletion or reversal of an edge, at each search step |
deltaMin | Minimum value of the score change at every step |
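An adjacency matrix is often easier to read as an edge list. The base R sketch below converts a toy matrix; `adj_toy` is made up for the illustration, and reading entry (i, j) = 1 as the edge i -> j is an assumption carried over from the blacklist description.

```r
# Toy adjacency matrix encoding 1 -> 2 and 2 -> 3 (assumed convention:
# a nonzero entry [i, j] means there is an edge from node i to node j)
adj_toy <- matrix(0, nrow = 3, ncol = 3)
adj_toy[1, 2] <- 1
adj_toy[2, 3] <- 1

# One row per edge, with the parent in "from" and the child in "to"
edges <- which(adj_toy != 0, arr.ind = TRUE)
colnames(edges) <- c("from", "to")
edges
```

The same two lines should work on the adjacency component returned by hc, for example `which(temp$adjacency != 0, arr.ind = TRUE)` with `temp` from the example further down.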
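hc_boot_parallel returns a list; the component documented here is
Object | Description |
---|---|
adjacency | a p by p by n.boot array holding the adjacency matrix of the DAG learned from each bootstrap resample |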
score_shd returns a list of three components
Object | Description |
---|---|
adj.matrix | adjacency matrix of the learned DAG |
final.step | a number recording how many search steps are conducted before the procedure stops |
movement | a matrix recording the selected operation, addition, deletion or reversal of an edge, at each search step |
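If you want to confirm that an adjacency matrix really encodes a DAG, a small acyclicity check can be written in base R. The function `is_acyclic` below is a hypothetical helper, not part of dagbag or dagbagM; it repeatedly removes nodes that have no incoming edge and reports a cycle if it gets stuck.

```r
# Hypothetical helper: TRUE if the directed graph encoded by adj has no cycle
# (assumed convention: a nonzero entry [i, j] means edge i -> j)
is_acyclic <- function(adj) {
  adj <- (adj != 0)
  remaining <- seq_len(nrow(adj))
  while (length(remaining) > 0) {
    sub <- adj[remaining, remaining, drop = FALSE]
    no_parents <- which(colSums(sub) == 0)      # nodes with no incoming edge
    if (length(no_parents) == 0) return(FALSE)  # every node has a parent: cycle
    remaining <- remaining[-no_parents]
  }
  TRUE
}

chain <- matrix(0, 3, 3)
chain[1, 2] <- 1; chain[2, 3] <- 1              # 1 -> 2 -> 3

loop <- chain
loop[3, 1] <- 1                                 # adds 3 -> 1, closing a cycle

is_acyclic(chain)
is_acyclic(loop)
```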
(i) DAG learning by hill climbing: no bootstrap resample
data(example)
Y.n <- example$Y               # data matrix
p <- dim(Y.n)[2]               # number of nodes
true.dir <- example$true.dir   # adjacency matrix of the data-generating DAG
true.ske <- example$true.ske   # skeleton graph of the data-generating DAG
# all p nodes are treated as continuous ("c")
temp <- dagbagM::hc(Y = Y.n, nodeType = rep("c", p), whiteList = NULL, blackList = NULL, tol = 1e-6, standardize = TRUE, maxStep = 1000, restart = 10, seed = 1, verbose = FALSE)
(ii) DAG learning by hill climbing: for bootstrap resamples
library(foreach)
library(doParallel)
# learn one DAG per bootstrap resample, using 2 threads
temp.boot <- dagbagM::hc_boot_parallel(Y = Y.n, n.boot = 10, nodeType = rep("c", p), whiteList = NULL, blackList = NULL, standardize = TRUE, tol = 1e-6, maxStep = 1000, restart = 10, seed = 1, nodeShuffle = TRUE, numThread = 2, verbose = FALSE)
boot.adj <- temp.boot$adjacency   # adjacency matrices of the bootstrap DAGs
(iii) Bootstrap aggregation of DAGs learnt from bootstrap resamples
set.seed(1)
temp.bag <- dagbag::score_shd(boot.adj, alpha = 1, threshold = 0)
adj.bag <- temp.bag$adj.matrix   # adjacency matrix of the aggregated DAG
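Since the example data ships with the adjacency matrix of the data-generating DAG, the aggregated DAG can be compared against it. The sketch below shows one simple way to count directed-edge agreements on toy matrices; `compare_dags`, `true_toy` and `est_toy` are hypothetical names, and both matrices are assumed to use the same convention for edge direction.

```r
# Hypothetical helper: count directed edges that match, are spurious, or are missed
compare_dags <- function(est, truth) {
  est <- est != 0
  truth <- truth != 0
  c(correct        = sum(est & truth),
    false_positive = sum(est & !truth),
    missed         = sum(!est & truth))
}

# Toy truth: 1 -> 2 and 2 -> 3
true_toy <- matrix(0, 3, 3)
true_toy[1, 2] <- 1; true_toy[2, 3] <- 1

# Toy estimate: 1 -> 2 recovered, 2 -> 3 reversed into 3 -> 2
est_toy <- matrix(0, 3, 3)
est_toy[1, 2] <- 1; est_toy[3, 2] <- 1

res <- compare_dags(est_toy, true_toy)
res
```

With the objects from the example above, the call would be `compare_dags(adj.bag, true.dir)`. Note that under this strictly directed comparison a reversed edge counts both as a false positive and as a missed edge.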
If you use DAGBagM in your research, please consider citing us:
Chowdhury, S., Wang, R., Yu, Q. et al. DAGBagM: learning directed acyclic graphs of mixed variables with an application to identify protein biomarkers for treatment response in ovarian cancer. BMC Bioinformatics 23, 321 (2022). https://doi.org/10.1186/s12859-022-04864-y.
If you find small bugs, larger issues, or have suggestions, please email the maintainer at jiepeng108@gmail.com. Contributions (via pull requests or otherwise) are welcome.