Graphical Models for Mixed Multi-Modal Data

michaelweylandt edited this page Mar 23, 2017 · 9 revisions

Background

Graphical models provide a powerful and flexible framework for understanding complex multivariate data. These models, sometimes also referred to as network models, capture dependencies in multivariate data, allowing statisticians to discover underlying connections among measured variables. These models have been widely used in applied statistics and machine learning, with particular success in genetics, neuroscience, and finance. Perhaps the clearest sign of their importance to modern data science is the CRAN Task View devoted to “gRaphical Models in R” [1].

In many big-data domains such as finance, internet advertising, social media, environmental studies, integrative genomics, and multi-modal neuroimaging, the observed variables often consist of mixed types. That is, while some features may be continuous random variables, others are count-valued, categorical, bounded, binary, etc. For example, in finance, bond returns are continuous random variables, while ratings are (ordered) categorical, and default probabilities are restricted to the $[0, 1]$ interval.

Classical approaches to graphical models typically assume that the data are generated from a multivariate Gaussian, which implies that each variable is marginally Gaussian. While mathematically tractable, this assumption is plainly inappropriate for mixed multi-modal data. Two lines of research address this limitation. In the first, the data are transformed to be approximately marginally normal and then a Gaussian graphical model is used [2,3]. In the second, new graphical models which jointly model mixed data are developed [4,5]. The resulting graphical models are referred to as mixed graphical models.

In this vein, Yang et al. [6,7,8,9] and Chen et al. [10] developed a theory of mixed graphical models based on univariate exponential families. In these models, the conditional distribution of each feature, given all the others, comes from an arbitrary distribution in the exponential family, a large class of probability distributions which includes the Gaussian, Poisson, binomial, and beta distributions, among many others. The resulting joint distributions provide the first multivariate models which directly parameterize dependencies between mixed data types. By allowing arbitrary distributions at each node, familiar models for each variable may be used, while still allowing essentially arbitrary dependency structures between the nodes to be captured directly. As opposed to non-parametric (copula) approaches, mixed graphical models model the data directly and do not sacrifice statistical power to attain flexibility.

In this project, we propose a new package to make graphical models for mixed multi-modal data readily available to a wide audience. The proposed package will allow for fitting, simulating from, and visualizing mixed graphical models. We anticipate that having an easy-to-use R package will increase adoption of these powerful new models.

Related work

A number of existing R packages [11] provide estimation for Gaussian graphical models. In particular, the huge package [12] provides similar functionality to our proposed package, but only for Gaussian graphical models and nonparametric extensions thereof. The XMRF package [13], developed by two of the mentors, fits graphical models with an arbitrary exponential family distribution for each of the nodes, but does not allow for mixed data types across different nodes. No existing software, for R or otherwise, currently handles mixed graphical models with arbitrary node-conditional distributions.

Details of your coding project

  • Fitting: Specialized algorithms are required to fit mixed graphical models efficiently, accurately, and robustly. The mentors have recently developed a block randomized adaptive iterative lasso (“B-RAIL”) procedure to fit these models. This algorithm requires fitting re-weighted or adaptive L1-penalized GLMs in a block-wise manner. Thus, existing optimization algorithms in standard packages such as glmnet [16] or pirls [17] cannot be used. Instead, new fast optimization routines for fitting L1-penalized GLMs must be programmed in C++ for speed.

    Finally, a bootstrap-type procedure known as stability selection [18, 19] will be used to estimate the sparsity of the graph. This procedure works by repeatedly fitting mixed graphical models to resampled versions of the data. Because the samples are created and fit independently, this approach is embarrassingly parallel. To exploit this, the package will use the flexible foreach framework [20] to allow the end-user to seamlessly select from a wide range of parallel computing strategies. Estimated time: 7 weeks.

  • Visualization: Graphical models naturally lend themselves to elegant visualizations. The package will provide visualizations using the Cytoscape graph visualization library [15], building upon the existing Bioconductor RCytoscape package [14]. If time allows, interactive visualizations based on the Cytoscape.js library will also be implemented. Estimated time: 3 weeks.

  • Sampling: Two key steps in any statistical analysis are i) model checking; and ii) providing an accurate measure of the variability of the estimated model. The ability to simulate data from a model is essential for both of these steps. Straightforward Gibbs sampling techniques can be used to generate data from mixed graphical models, but these iterative algorithms are often slow when implemented in pure R. The package will contain a high-performance C++ sampler to generate synthetic data from arbitrary mixed graphical models. Estimated time: 2 weeks.
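
To make the sampling idea concrete, here is a minimal sketch of a Gibbs sampler for the smallest possible mixed model: one Gaussian node x and one binary node y. The joint density, the parameter names `a` and `theta`, and the function name `gibbs_gaussian_bernoulli` are illustrative assumptions, not part of the proposed package; a real sampler would handle arbitrary graphs and node types.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// One draw from the two-node model: x is continuous, y is binary.
struct Draw {
    double x;
    int y;
};

// Gibbs sampler for the illustrative (assumed) joint density
//   p(x, y) proportional to exp(a*y + theta*x*y - x*x/2),  x real, y in {0, 1}.
// Its node-conditionals are both exponential-family distributions:
//   x | y ~ Normal(theta*y, 1)
//   y | x ~ Bernoulli(1 / (1 + exp(-(a + theta*x))))
std::vector<Draw> gibbs_gaussian_bernoulli(double a, double theta,
                                           std::size_t n_draws,
                                           std::size_t burn_in,
                                           unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<double> std_normal(0.0, 1.0);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    double x = 0.0;  // arbitrary starting state
    int y = 0;
    std::vector<Draw> draws;
    draws.reserve(n_draws);

    for (std::size_t it = 0; it < burn_in + n_draws; ++it) {
        // Update the Gaussian node given the current binary node.
        x = theta * y + std_normal(rng);
        // Update the binary node given the new Gaussian node.
        const double p_one = 1.0 / (1.0 + std::exp(-(a + theta * x)));
        y = (unif(rng) < p_one) ? 1 : 0;
        // Keep the state only after the burn-in period.
        if (it >= burn_in) draws.push_back(Draw{x, y});
    }
    return draws;
}
```

For this toy model the marginal of y is available in closed form, P(y = 1) = exp(a + theta^2/2) / (1 + exp(a + theta^2/2)), which gives a simple check that the chain is behaving sensibly. Successive draws are correlated, so thinning or longer runs may be needed in practice.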

Expected Impact

Mixed graphical models represent a significant advance in both the theory and practice of (unsupervised) data integration, and the proposed package will make them widely available for the first time. By providing robust and efficient tools for fitting, visualization, and simulation, the package will allow the use of mixed graphical models throughout the entire data analysis pipeline. The package will be immediately useful for a wide range of application domains.

By construction, mixed graphical models represent a superset of existing graphical models. As such, the proposed package represents a natural opportunity to provide functionality for all types of graphical models. By providing lossless conversion routines from other graphical model packages to our proposed package, users will be able to use our visualization and simulation tools as part of their own workflows, even if they are not using mixed graphical models directly.

Finally, we expect that the high-performance L1-penalized GLM solver will be of interest independently of the rest of the package. The solvers available in glmnet compromise on numerical accuracy to achieve their remarkable speed in computing the entire regularization path. By specializing in the case where only a single value of the regularization parameter is of interest, our solvers will provide a cross-platform reference implementation of L1-penalized regression.
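
As a rough illustration of what a single-lambda solver with re-weighted (adaptive) L1 penalties involves, the sketch below uses plain cyclic coordinate descent for the Gaussian case, minimizing (1/(2n))·||y − Xβ||² + λ·Σ_j w_j·|β_j|. This is not B-RAIL and not the mentors' solver; coordinate descent is chosen here only because it is short, and the names `weighted_lasso_cd`, `Vec`, and `Mat` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major: X[i][j] is observation i, feature j

// Soft-thresholding operator: the proximal map of t * |v|.
double soft_threshold(double v, double t) {
    if (v > t) return v - t;
    if (v < -t) return v + t;
    return 0.0;
}

// Cyclic coordinate descent for the weighted lasso at one value of lambda.
// w[j] is the per-coefficient penalty weight (w[j] = 1 gives the plain lasso).
Vec weighted_lasso_cd(const Mat& X, const Vec& y, const Vec& w, double lambda,
                      int max_iter = 1000, double tol = 1e-10) {
    const std::size_t n = X.size(), p = w.size();
    Vec beta(p, 0.0);
    Vec resid = y;  // residual y - X*beta, valid because beta starts at zero

    // Precompute (1/n) * sum_i X[i][j]^2 for each column.
    Vec col_ss(p, 0.0);
    for (std::size_t j = 0; j < p; ++j)
        for (std::size_t i = 0; i < n; ++i) col_ss[j] += X[i][j] * X[i][j] / n;

    for (int iter = 0; iter < max_iter; ++iter) {
        double max_change = 0.0;
        for (std::size_t j = 0; j < p; ++j) {
            if (col_ss[j] == 0.0) continue;  // skip all-zero columns
            // Correlation of column j with the partial residual (beta_j removed).
            double rho = 0.0;
            for (std::size_t i = 0; i < n; ++i)
                rho += X[i][j] * (resid[i] + X[i][j] * beta[j]);
            rho /= n;
            const double b_new = soft_threshold(rho, lambda * w[j]) / col_ss[j];
            const double delta = b_new - beta[j];
            if (delta != 0.0) {
                for (std::size_t i = 0; i < n; ++i) resid[i] -= X[i][j] * delta;
                beta[j] = b_new;
            }
            max_change = std::max(max_change, std::fabs(delta));
        }
        if (max_change < tol) break;  // converged: no coefficient moved
    }
    return beta;
}
```

Lowering a weight `w[j]` shrinks the threshold applied to coefficient j, which is how an adaptive scheme lets previously selected variables re-enter the model more easily. Binary and count responses would replace the squared-error loss with the corresponding GLM likelihood.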

Mentors

Required Skills

  • Convex Optimization
  • Probabilistic simulation (Gibbs sampling)
  • Coding with R, C++, and Rcpp

Tests

Potential applicants must:

  1. Implement L1-penalized Poisson (log-linear) regression in portable, standard C++ using an ADMM algorithm [21] from scratch (not using an existing library);
  2. Wrap their implementation using Rcpp;
  3. Test their implementation using testthat;
  4. Package their implementation and pass R CMD check on at least two of the three major platforms: Windows, macOS, and Linux (Debian/Ubuntu).
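
For orientation, the following is one possible shape of the ADMM iteration for step 1, written as a sketch under stated assumptions rather than a reference solution. It minimizes (1/n)·Σ_i [exp(x_iᵀβ) − y_i·x_iᵀβ] + λ·||β||₁ using the scaled-form splitting β = z from [21]. The 1/n scaling is chosen to match glmnet's documented objective. The sketch assumes no intercept and no standardization, takes plain Newton steps without a line search (adequate only for small, well-scaled problems), and uses hypothetical names (`poisson_lasso_admm`, `solve_spd`).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // row-major: X[i][j]

// Soft-thresholding operator: the proximal map of t * |v|.
static double soft_threshold(double v, double t) {
    return v > t ? v - t : (v < -t ? v + t : 0.0);
}

// Solve H d = g for symmetric positive-definite H via Cholesky (H = L L^T).
// Only the lower triangle of H is read; arguments are taken by value.
static Vec solve_spd(Mat H, Vec g) {
    const std::size_t p = g.size();
    for (std::size_t j = 0; j < p; ++j) {
        for (std::size_t k = 0; k < j; ++k) H[j][j] -= H[j][k] * H[j][k];
        H[j][j] = std::sqrt(H[j][j]);
        for (std::size_t i = j + 1; i < p; ++i) {
            for (std::size_t k = 0; k < j; ++k) H[i][j] -= H[i][k] * H[j][k];
            H[i][j] /= H[j][j];
        }
    }
    for (std::size_t i = 0; i < p; ++i) {  // forward substitution: L w = g
        for (std::size_t k = 0; k < i; ++k) g[i] -= H[i][k] * g[k];
        g[i] /= H[i][i];
    }
    for (std::size_t i = p; i-- > 0;) {  // back substitution: L^T d = w
        for (std::size_t k = i + 1; k < p; ++k) g[i] -= H[k][i] * g[k];
        g[i] /= H[i][i];
    }
    return g;
}

// Scaled-form ADMM for L1-penalized Poisson regression (no intercept).
Vec poisson_lasso_admm(const Mat& X, const Vec& y, double lambda,
                       double rho = 1.0, int max_iter = 500, double tol = 1e-8) {
    const std::size_t n = X.size(), p = X.empty() ? 0 : X[0].size();
    Vec beta(p, 0.0), z(p, 0.0), u(p, 0.0);
    for (int iter = 0; iter < max_iter; ++iter) {
        // beta-update: Newton steps on loss(beta) + (rho/2)*||beta - z + u||^2.
        for (int newton = 0; newton < 25; ++newton) {
            Vec g(p, 0.0);
            Mat H(p, Vec(p, 0.0));
            for (std::size_t i = 0; i < n; ++i) {
                double eta = 0.0;
                for (std::size_t j = 0; j < p; ++j) eta += X[i][j] * beta[j];
                const double mu = std::exp(eta);  // Poisson mean under the log link
                for (std::size_t j = 0; j < p; ++j) {
                    g[j] += X[i][j] * (mu - y[i]) / n;
                    for (std::size_t k = 0; k <= j; ++k)
                        H[j][k] += X[i][j] * X[i][k] * mu / n;  // lower triangle
                }
            }
            double g_norm = 0.0;
            for (std::size_t j = 0; j < p; ++j) {
                g[j] += rho * (beta[j] - z[j] + u[j]);
                H[j][j] += rho;
                g_norm = std::max(g_norm, std::fabs(g[j]));
            }
            if (g_norm < 1e-10) break;
            const Vec d = solve_spd(H, g);
            for (std::size_t j = 0; j < p; ++j) beta[j] -= d[j];
        }
        // z-update (soft-thresholding) and scaled dual update.
        double primal = 0.0, dual = 0.0;
        for (std::size_t j = 0; j < p; ++j) {
            const double z_old = z[j];
            z[j] = soft_threshold(beta[j] + u[j], lambda / rho);
            u[j] += beta[j] - z[j];
            primal = std::max(primal, std::fabs(beta[j] - z[j]));
            dual = std::max(dual, rho * std::fabs(z[j] - z_old));
        }
        if (primal < tol && dual < tol) break;  // both residuals are small
    }
    return z;  // z carries the exact zeros produced by soft-thresholding
}
```

Two sanity checks follow from the objective itself: with λ = 0 and a single all-ones column, the solution is log(mean(y)); and any λ at least as large as the largest absolute entry of (1/n)·Xᵀ(y − 1) yields the all-zero solution. When comparing against glmnet, note that glmnet fits an unpenalized intercept and standardizes predictors by default, so its `intercept` and `standardize` arguments would need to be set to match this sketch.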

Solutions of Tests

Numerical results will be compared against glmnet::glmnet(..., family='poisson'). Mentors will check that the package passes R CMD check without any WARNING(s) or ERROR(s).

Test Solution https://github.com/GaryBAYLOR/testRepo.git

Test Solution https://github.com/Xia-Zhang/Poisson-Regression

Test Solution https://github.com/aditya2410/POISSON_Regresssion

Coding Milestones

End of First Month

  • Constructor and basic methods for EFGM class (print, summary, etc.)
  • Implementation of L1-penalized GLMs for the following data types:
    • Gaussian
    • Binary
    • Count
  • Serial implementation of B-RAIL
  • Test suite for optimization code and B-RAIL

End of Second Month

  • Stability Selection
  • Parallel implementation of B-RAIL
  • Initial implementation of sampling routines

End of Third Month

  • Sampling routines
  • Visualization code
  • Documentation

References

[1] https://CRAN.R-project.org/view=gR

[2] Han Liu, Fang Han, Ming Yuan, John Lafferty, Larry Wasserman. “High-dimensional semiparametric Gaussian copula graphical models.” Annals of Statistics 40(4) (2012), 2293-2326. https://projecteuclid.org/euclid.aos/1358951383

[3] John Lafferty, Han Liu, Larry Wasserman. “Sparse Nonparametric Graphical Models.” Statistical Science 27(4) (2012), 519-537. https://projecteuclid.org/euclid.ss/1356098554

[4] Jason D. Lee and Trevor J. Hastie. “Learning the Structure of Mixed Graphical Models.” Journal of Computational and Graphical Statistics 24 (2015), 230-253. http://amstat.tandfonline.com/doi/abs/10.1080/10618600.2014.900500

[5] Jie Cheng, Tianxi Li, Elizaveta Levina, and Ji Zhu. “High-Dimensional Mixed Graphical Models.” To appear in Journal of Computational and Graphical Statistics. http://dx.doi.org/10.1080/10618600.2016.1237362

[6] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu. “Graphical Models via Generalized Linear Models” in Advances in Neural Information Processing Systems (NIPS) 2012. http://papers.nips.cc/paper/4617-graphical-models-via-generalized-linear-models

[7] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu. “Mixed Graphical Models via Exponential Families” in Artificial Intelligence and Statistics (AISTATS) 2014. http://www.jmlr.org/proceedings/papers/v33/yang14a.htm

[8] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Zhandong Liu. “Graphical Models via Univariate Exponential Family Distributions.” Journal of Machine Learning Research 16 (2015), 3813-3847. https://www.jmlr.org/papers/volume16/yang15a/yang15a.pdf

[9] Eunho Yang, Pradeep Ravikumar, Genevera I. Allen, Yulia Baker, Ying-Wooi Wan, Zhandong Liu. “A General Framework for Mixed Graphical Models.” arXiv:1411.0288

[10] Shizhe Chen, Daniela M. Witten, Ali Shojaie. “Selection and Estimation for Mixed Graphical Models.” Biometrika 102(1) (2015), 47-64. https://doi.org/10.1093/biomet/asu051

[11] E.g. glasso, QUIC, and GGMselect

[12] https://CRAN.R-project.org/package=huge

[13] https://CRAN.R-project.org/package=XMRF; http://dx.doi.org/10.1186/s12918-016-0313-0

[14] http://bioconductor.org/packages/RCytoscape

[15] http://www.cytoscape.org

[16] https://CRAN.R-project.org/package=glmnet

[17] http://github.com/kaneplusplus/pirls

[18] Nicolai Meinshausen, Peter Bühlmann. “Stability Selection.” Journal of the Royal Statistical Society, Series B: Statistical Methodology 72(4) (2010), 417-473. http://dx.doi.org/10.1111/j.1467-9868.2010.00740.x

[19] Han Liu, Kathryn Roeder, Larry Wasserman. “Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models” in Advances in Neural Information Processing Systems (NIPS) 2010. https://papers.nips.cc/paper/3966-stability-approach-to-regularization-selection-stars-for-high-dimensional-graphical-models

[20] https://CRAN.R-project.org/package=foreach

[21] See, e.g. Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein. “Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.” Foundations and Trends in Machine Learning 3(1) (2010), 1-122. http://stanford.edu/~boyd/admm.html
