Graphical Models for Mixed Multi Modal Data
Graphical models provide a powerful and flexible framework for understanding complex multivariate data. These models, sometimes also referred to as network models, capture dependencies in multivariate data, allowing statisticians to discover underlying connections among measured variables. These models have been widely used in applied statistics and machine learning, with particular success in genetics, neuroscience, and finance. Perhaps the clearest sign of their importance to modern data science is the CRAN Task View devoted to “gRaphical Models in R” [1].
In many big-data domains such as finance, internet advertising, social media, environmental studies, integrative genomics, and multi-modal neuroimaging, the observed variables often consist of mixed types. That is, while some features may be continuous random variables, others are count-valued, categorical, bounded, binary, etc. For example, in finance, bond returns are continuous random variables, while ratings are (ordered) categorical, and default probabilities are restricted to the unit interval [0, 1].
Classical approaches to graphical models typically assume that the data are generated from a multivariate Gaussian and that each variable is marginally Gaussian. While mathematically tractable, this assumption is plainly inappropriate for mixed multi-modal data. To address this limitation, two lines of research have been proposed. In the first, the data are transformed to be approximately marginally normal and then a Gaussian graphical model is used [2,3]. In the second, new graphical models which jointly model mixed data are developed [4,5]. The resulting graphical models are referred to as mixed graphical models.
In this vein, Yang et al. developed a theory of mixed graphical models based on univariate exponential families [6,7,8,9,10]. In these models, the conditional distribution of each feature comes from an arbitrary distribution in the exponential family, a large class of probability distributions which includes the Gaussian, Poisson, binomial, and beta distributions, among many others. The resulting joint distributions provide the first multivariate model which directly parameterizes dependencies between mixed data types. By allowing arbitrary distributions at each node, familiar models for each variable may be used, while still allowing essentially arbitrary dependency structures between the nodes to be captured directly. As opposed to non-parametric (copula) approaches, mixed graphical models model the data directly and do not sacrifice statistical power to attain flexibility.
In this project, we propose a new package to make graphical models for mixed multi-modal data readily available to a wide audience. The proposed package will allow for fitting, simulating from, and visualizing mixed graphical models. We anticipate that having an easy-to-use R package will increase adoption of these powerful new models.
A number of existing R packages [11] provide estimation for Gaussian graphical models. In particular, the huge package [12] provides similar functionality to our proposed package, but only for Gaussian graphical models and nonparametric extensions thereof. The XMRF package [13], developed by two of the mentors, fits graphical models with an arbitrary exponential family distribution for each of the nodes, but does not allow for mixed data types across different nodes. No existing software, for R or otherwise, currently handles mixed graphical models with arbitrary node-conditional distributions.
- Fitting: Specialized algorithms are required to fit mixed graphical models efficiently, accurately, and robustly. The mentors have recently developed a block randomized adaptive iterative lasso (“B-RAIL”) procedure to fit these models. This algorithm requires fitting re-weighted or adaptive L1-penalized GLMs in a block-wise manner. Thus, existing optimization algorithms in standard packages such as glmnet [16] or pirls [17] cannot be used. Instead, new fast optimization routines for fitting L1-penalized GLMs must be programmed in C++ for speed.

  Finally, a bootstrap-type procedure known as stability selection [18, 19] will be used to estimate the sparsity of the graph. This procedure works by repeatedly fitting mixed graphical models to resampled versions of the data. Because the samples are created and fit independently, this approach is embarrassingly parallel. To exploit this, the package will use the flexible foreach framework [20] to allow the end-user to seamlessly select from a wide range of parallel computing strategies. Estimated time: 7 weeks.
- Visualization: Graphical models naturally lend themselves to elegant visualizations. The package will provide visualizations using the Cytoscape graph visualization library [15], building upon the existing Bioconductor RCytoscape package [14]. If time allows, interactive visualizations based on the Cytoscape.js library will also be implemented. Estimated time: 3 weeks.
- Sampling: Two key steps in any statistical analysis are i) model checking; and ii) providing an accurate measure of the variability of the estimated model. The ability to simulate data from a model is essential for both of these steps. Straightforward Gibbs sampling techniques can be used to generate data from mixed graphical models, but these iterative algorithms are often slow when implemented in pure R. The package will contain a high-performance C++ sampler to generate synthetic data from arbitrary mixed graphical models. Estimated time: 2 weeks.
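The Gibbs sampling idea from the Sampling item can be sketched in C++ as follows. This is a minimal illustration rather than the package's sampler. It assumes a pairwise model with only Gaussian (unit conditional variance) and Bernoulli nodes, where the natural parameter of node s is theta[s] plus the sum of Theta[s][t] * x[t] over the other nodes. The names `MixedModel`, `NodeType`, and `gibbs_sample` are hypothetical, and the sketch does not check that the parameters define a valid (normalizable) joint distribution.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical node types for this sketch; a full package would support more.
enum class NodeType { Gaussian, Bernoulli };

// Hypothetical container for a pairwise mixed graphical model.
struct MixedModel {
    std::vector<NodeType> type;              // conditional family of each node
    std::vector<double> theta;               // node-wise parameters theta[s]
    std::vector<std::vector<double>> Theta;  // symmetric edge weights, diagonal ignored
};

// Draw n_samples states by systematic-scan Gibbs sampling.
// Each sweep resamples every node from its conditional given the others.
std::vector<std::vector<double>> gibbs_sample(const MixedModel& m,
                                              std::size_t n_samples,
                                              std::size_t burn_in,
                                              std::size_t thin,
                                              unsigned seed) {
    const std::size_t p = m.type.size();
    assert(m.theta.size() == p && m.Theta.size() == p);
    assert(thin > 0);

    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::normal_distribution<double> std_normal(0.0, 1.0);

    std::vector<double> x(p, 0.0);  // current state, started at zero
    std::vector<std::vector<double>> out;
    out.reserve(n_samples);

    const std::size_t total = burn_in + n_samples * thin;
    for (std::size_t sweep = 1; sweep <= total; ++sweep) {
        for (std::size_t s = 0; s < p; ++s) {
            // Natural parameter of node s given all other nodes.
            double eta = m.theta[s];
            for (std::size_t t = 0; t < p; ++t) {
                if (t != s) eta += m.Theta[s][t] * x[t];
            }
            if (m.type[s] == NodeType::Gaussian) {
                x[s] = eta + std_normal(rng);  // N(eta, 1)
            } else {
                const double prob = 1.0 / (1.0 + std::exp(-eta));  // logistic link
                x[s] = (unif(rng) < prob) ? 1.0 : 0.0;
            }
        }
        // Keep every thin-th sweep once the burn-in period is over.
        if (sweep > burn_in && (sweep - burn_in) % thin == 0) out.push_back(x);
    }
    return out;
}
```

Calling `gibbs_sample(model, 1000, 500, 10, 42)` would return 1000 states, each a vector with one entry per node. The burn-in length and thinning interval are tuning choices that depend on how strongly the nodes are coupled.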
Mixed graphical models represent a significant advance in both the theory and practice of (unsupervised) data integration and the proposed package will make them widely available for the first time. By providing robust and efficient tools for fitting, visualization, and simulation, the package will allow the use of mixed graphical models throughout the entire data analysis pipeline. The package will be immediately useful for a wide range of application domains.
By construction, mixed graphical models represent a superset of existing graphical models. As such, the proposed package represents a natural opportunity to provide functionality for all types of graphical models. By providing lossless conversion routines from other graphical model packages to our proposed package, users will be able to use our visualization and simulation tools as part of their own workflows, even if they are not using mixed graphical models directly.
Finally, we expect that the high-performance L1-penalized GLM solver will be of interest independently of the rest of the package. The solvers available in glmnet compromise on numerical accuracy to achieve their remarkable speed in computing the entire regularization path. By specializing in the case where only a single value of the regularization parameter is of interest, our solvers will provide a cross-platform reference implementation of L1-penalized regression.
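To make the single-parameter setting concrete, here is a minimal sketch of a weighted (adaptive) L1-penalized least-squares solver for one fixed value of lambda. It covers the Gaussian case only and uses cyclic coordinate descent as a stand-in technique; it is not the B-RAIL procedure or the solver the package will ship. The function names, the column-wise storage of X, and the objective scaling (1/(2n)) * ||y - X * beta||^2 + lambda * sum_j w[j] * |beta[j]| are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Soft-thresholding operator: the proximal map of t * |z|.
double soft_threshold(double z, double t) {
    if (z > t) return z - t;
    if (z < -t) return z + t;
    return 0.0;
}

// Cyclic coordinate descent for the weighted lasso at a single lambda.
// X is stored column-wise: X[j] is the j-th predictor, of length n.
std::vector<double> weighted_lasso_cd(const std::vector<std::vector<double>>& X,
                                      const std::vector<double>& y,
                                      double lambda,
                                      const std::vector<double>& w,
                                      int max_iter = 1000,
                                      double tol = 1e-10) {
    const std::size_t p = X.size();
    const std::size_t n = y.size();
    assert(w.size() == p && n > 0);
    const double nd = static_cast<double>(n);

    std::vector<double> beta(p, 0.0);
    std::vector<double> r(y);  // residual y - X * beta, with beta = 0 initially

    for (int it = 0; it < max_iter; ++it) {
        double max_change = 0.0;
        for (std::size_t j = 0; j < p; ++j) {
            assert(X[j].size() == n);
            double xr = 0.0, xx = 0.0;
            for (std::size_t i = 0; i < n; ++i) {
                xr += X[j][i] * r[i];
                xx += X[j][i] * X[j][i];
            }
            if (xx == 0.0) continue;  // an all-zero column carries no information

            // (1/n) * x_j' (partial residual with coordinate j added back).
            const double rho = (xr + xx * beta[j]) / nd;
            const double b_new = soft_threshold(rho, lambda * w[j]) / (xx / nd);
            const double delta = b_new - beta[j];
            if (delta != 0.0) {
                for (std::size_t i = 0; i < n; ++i) r[i] -= X[j][i] * delta;
                beta[j] = b_new;
            }
            max_change = std::max(max_change, std::fabs(delta));
        }
        if (max_change < tol) break;  // converged to the requested tolerance
    }
    return beta;
}
```

Setting every weight to 1 gives the ordinary lasso, and larger weights penalize the corresponding coefficients more heavily. Iterating to a tight tolerance at one lambda, rather than warm-starting along a path, is the accuracy-over-speed trade-off described above.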
- Dr. Genevera Allen [Theory and Algorithms], gallen@rice.edu http://www.stat.rice.edu/~gallen
Departments of Statistics and ECE, Rice University
Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital
- Dr. Zhandong Liu [Implementation], zhandong.liu@bcm.edu http://www.liuzlab.org/
Jan and Dan Duncan Neurological Research Institute, Baylor College of Medicine and Texas Children’s Hospital
- Michael Weylandt [Implementation], michael.weylandt@rice.edu
Department of Statistics, Rice University
- Yulia Baker [Algorithms], yulia.baker@rice.edu
Department of Statistics, Rice University
- Convex Optimization
- Probabilistic simulation (Gibbs sampling)
- Coding with R, C++, and Rcpp
Potential applicants must:
- Implement L1-penalized Poisson (log-linear) regression in portable, standard C++ using an ADMM algorithm [21] from scratch (not using an existing library);
- Wrap their implementation using Rcpp;
- Test their implementation using testthat;
- Package their implementation and pass R CMD check on at least two of the three major platforms: Windows, MacOS, and Linux (Debian/Ubuntu).
Numerical results will be compared against glmnet::glmnet(..., family='poisson'). Mentors will check that the package passes R CMD check without any WARNING(s) or ERROR(s).
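One way to compare two solvers numerically is to evaluate the same penalized objective at each solution. The sketch below computes an L1-penalized Poisson objective of the form (1/n) * sum_i (exp(eta_i) - y_i * eta_i) + lambda * sum_j |beta_j|, with an unpenalized intercept and the constant log(y_i!) term dropped. This is believed to match the scaling glmnet documents for the lasso penalty, but treat that as an assumption to verify. It also assumes the predictors are already on the scale at which the penalty is applied (for example, glmnet called with standardize = FALSE). The function name and the column-wise storage of X are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Penalized Poisson objective with a log link.
// X is stored column-wise: X[j] is the j-th predictor, of length n.
double poisson_l1_objective(const std::vector<std::vector<double>>& X,
                            const std::vector<double>& y,
                            double beta0,
                            const std::vector<double>& beta,
                            double lambda) {
    const std::size_t p = X.size();
    const std::size_t n = y.size();
    assert(beta.size() == p && n > 0);

    // Negative log-likelihood, up to the constant sum of log(y_i!).
    double nll = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double eta = beta0;  // linear predictor for observation i
        for (std::size_t j = 0; j < p; ++j) eta += X[j][i] * beta[j];
        nll += std::exp(eta) - y[i] * eta;
    }

    // L1 penalty on the slopes only; the intercept is left unpenalized.
    double l1 = 0.0;
    for (std::size_t j = 0; j < p; ++j) l1 += std::fabs(beta[j]);

    return nll / static_cast<double>(n) + lambda * l1;
}
```

If an applicant's solution attains an objective value no larger than the reference fit at the same lambda, up to a small tolerance, that is reasonable evidence of agreement even when individual coefficients differ slightly.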
Test Solution https://github.com/GaryBAYLOR/testRepo.git
Test Solution https://github.com/Xia-Zhang/Poisson-Regression
Test Solution https://github.com/aditya2410/POISSON_Regresssion
- Constructor and basic methods for EFGM class (print, summary, etc.)
- Implementation of L1-penalized GLMs for the following data types:
  - Gaussian
  - Binary
  - Count
- Serial implementation of B-RAIL
- Test suite for optimization code and B-RAIL
- Stability Selection
- Parallel implementation of B-RAIL
- Initial implementation of sampling routines
- Sampling routines
- Visualization code
- Documentation
[1] https://CRAN.R-project.org/view=gR
[2] Han Liu, Fang Han, Ming Yuan, John Lafferty, Larry Wasserman. “High-dimensional semiparametric Gaussian copula graphical models.” Annals of Statistics 40(4) (2012), 2293-2326. https://projecteuclid.org/euclid.aos/1358951383
[3] John Lafferty, Han Liu, Larry Wasserman. “Sparse Nonparametric Graphical Models.” Statistical Science 27(4) (2012), 519-537. https://projecteuclid.org/euclid.ss/1356098554
[4] Jason D. Lee and Trevor J. Hastie. “Learning the Structure of Mixed Graphical Models.” Journal of Computational and Graphical Statistics 24 (2015), 230-253. http://amstat.tandfonline.com/doi/abs/10.1080/10618600.2014.900500
[5] Jie Cheng, Tianxi Li, Elizaveta Levina, and Ji Zhu. “High-Dimensional Mixed Graphical Models.” To appear in Journal of Computational and Graphical Statistics. http://dx.doi.org/10.1080/10618600.2016.1237362
[6] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu. “Graphical Models via Generalized Linear Models” in Advances in Neural Information Processing Systems (NIPS) 2012. http://papers.nips.cc/paper/4617-graphical-models-via-generalized-linear-models
[7] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu. “Mixed Graphical Models via Exponential Families” in Artificial Intelligence and Statistics (AISTATS) 2014. http://www.jmlr.org/proceedings/papers/v33/yang14a.htm
[8] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu. “Graphical Models via Univariate Exponential Family Distributions.” Journal of Machine Learning Research 16 (2015), 3813-3847. https://www.jmlr.org/papers/volume16/yang15a/yang15a.pdf
[9] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Yulia Baker, Ying-Wooi Wan, Zhandong Liu. “A General Framework for Mixed Graphical Models.” ArXiv 1411.0288
[10] Shizhe Chen, Daniela M. Witten, Ali Shojaie. “Selection and Estimation for Mixed Graphical Models.” Biometrika 102(1) (2015), 47-64. https://doi.org/10.1093/biomet/asu051
[11] E.g. glasso, QUIC, and GGMselect
[12] https://CRAN.R-project.org/package=huge
[13] https://CRAN.R-project.org/package=XMRF; http://dx.doi.org/10.1186/s12918-016-0313-0
[14] http://bioconductor.org/packages/RCytoscape
[16] https://CRAN.R-project.org/package=glmnet
[17] http://github.com/kaneplusplus/pirls
[18] Nicolai Meinshausen, Peter Bühlmann. “Stability Selection.” Journal of the Royal Statistical Society, Series B: Statistical Methodology 72(4) (2010), 417-473. http://dx.doi.org/10.1111/j.1467-9868.2010.00740.x
[19] Han Liu, Kathryn Roeder, Larry Wasserman. “Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models” in Advances in Neural Information Processing Systems (NIPS) 2010. https://papers.nips.cc/paper/3966-stability-approach-to-regularization-selection-stars-for-high-dimensional-graphical-models
[20] https://CRAN.R-project.org/package=foreach
[21] See, e.g. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends in Machine Learning 3(1) (2010), 1-122. http://stanford.edu/~boyd/admm.html