
Stochastic Variational Inference Framework for Probabilistic Modeling


Background

Stochastic variational inference (SVI) is a powerful tool for analyzing probabilistic models, especially for large-scale problems. The class of models considered here involves observations, global hidden variables, local hidden variables (each associated with one observation), and fixed parameters. The joint distribution is assumed to factorize into a global term and per-observation local terms: each local hidden variable, together with its observation, is conditionally independent of the remaining local variables and observations given the global variables. The generic goal of probabilistic modeling is to estimate the posterior distribution of the hidden variables given the data.
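Concretely, writing the observations as x = x_{1:N}, the local hidden variables as z = z_{1:N}, the global hidden variables as \beta, and the fixed parameters as \alpha, the assumed factorization (in the notation of [17]) is

    p(x, z, \beta \mid \alpha) = p(\beta \mid \alpha) \prod_{n=1}^{N} p(x_n, z_n \mid \beta),

so each pair (x_n, z_n) is conditionally independent of the remaining pairs given \beta.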

However, this posterior is typically intractable to compute, and approximating it is a central problem of Bayesian statistics, which has motivated alternative approaches to approximate posterior inference. The two most popular approaches are Markov chain Monte Carlo (MCMC) and variational inference (VI) [1]. MCMC offers no theoretical guarantee on its convergence with finite samples, whereas VI, based on the evidence lower bound (ELBO, discussed below), is guaranteed to converge to a locally optimal value with finite samples, and has therefore gained increasing attention in recent years.

VI is a powerful tool that transforms a complex inference problem into an optimization problem. The idea behind VI is intuitive: posit a tractable family of distributions over the hidden variables and minimize the Kullback-Leibler (KL) divergence between this variational distribution and the true posterior. This turns out to be equivalent to maximizing the ELBO, a lower bound on the log marginal likelihood (evidence) of the observations.
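In symbols, for a variational family q over the hidden variables, the two quantities are related by

    \log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}\big( q(z, \beta) \,\|\, p(z, \beta \mid x) \big),
    \qquad \mathrm{ELBO}(q) = \mathbb{E}_q[\log p(x, z, \beta)] - \mathbb{E}_q[\log q(z, \beta)],

so minimizing the KL divergence to the posterior is the same as maximizing the ELBO, because \log p(x) does not depend on q.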

A wide range of probabilistic models is covered by this family of distributions, including latent Dirichlet allocation (LDA) [2], hidden Markov models (HMM) and their extensions [3,4,5], Bayesian mixture models (BMM) [6], Kalman filters (KF) [7], factorial models [8], hierarchical linear regression models [9], hierarchical probit classification models [10], probabilistic factor analysis models [11], probabilistic matrix factorization models [12], and many others. When the model admits tractable complete conditionals, the ELBO objective decomposes into a sum of terms, one per observation, which makes stochastic optimization feasible for large-scale probabilistic modeling problems.
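As a sketch of why subsampling works, under the mean-field family q(z, \beta) = q(\beta \mid \lambda) \prod_n q(z_n \mid \phi_n) used in [17], the ELBO splits into a global term plus a sum over observations:

    \mathcal{L}(\lambda, \phi) = \mathbb{E}_q[\log p(\beta) - \log q(\beta)]
        + \sum_{n=1}^{N} \mathbb{E}_q\big[\log p(x_n, z_n \mid \beta) - \log q(z_n)\big],

so an unbiased estimate of the objective (and of its gradient) can be obtained from a single sampled observation.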

Popular algorithms for maximizing the ELBO include the coordinate ascent method [13] and the natural gradient method [14]. They update the variational parameters one (block) parameter at a time while fixing the rest, which yields alternating updates of local and global parameters in an EM fashion [15]. However, this approach cannot scale efficiently to large problems. Specifically, updating the global parameter requires analyzing the entire data set, which is very expensive when the global variational parameter is exactly what we want to learn efficiently. This motivates the use of stochastic optimization [16]. In each update of the global variational parameter, only one data point (or a small subset of the data) is used to compute a noisy approximation of the direction (gradient) along which the parameter is updated, which is much cheaper than a full update. When the step sizes are chosen properly, stochastic optimization is guaranteed to converge to a local optimum. We discuss the stochastic method in more detail in the next section.
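As an illustration only (not the final implementation), the generic global step of SVI following [17] can be sketched in a few lines of R; compute_lambda_hat below is a hypothetical model-specific routine that forms the intermediate global parameter from one sampled observation:

# Minimal sketch of the generic SVI global update.
# 'compute_lambda_hat' is a hypothetical model-specific routine returning the
# intermediate global parameter computed from one sampled observation.
svi_global_update <- function(X, lambda0, compute_lambda_hat,
                              max.ite = 1000, tau = 1, kappa = 0.7) {
  n <- nrow(X)
  lambda <- lambda0
  for (iter in seq_len(max.ite)) {
    i <- sample.int(n, 1)                                 # sample one observation
    lambda.hat <- compute_lambda_hat(X[i, , drop = FALSE], lambda, n)
    rho <- (iter + tau)^(-kappa)                          # Robbins-Monro step size
    lambda <- (1 - rho) * lambda + rho * lambda.hat       # noisy natural-gradient step
  }
  lambda
}

# Toy usage (purely illustrative): conjugate Gaussian mean with unit variance,
# natural parameters lambda = (scaled mean, scaled precision), prior N(0, 1).
set.seed(1)
X <- matrix(rnorm(500, mean = 2), ncol = 1)
toy_hat <- function(x, lambda, n) c(0, 1) + n * c(x, 1)   # prior + rescaled sufficient statistics
lambda <- svi_global_update(X, lambda0 = c(0, 1), compute_lambda_hat = toy_hat)
lambda[1] / lambda[2]                                     # approximate posterior mean, close to 2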

Project Goal

In this project, the goal is to adopt stochastic variational inference [17] to implement a toolbox that solves a large family of probabilistic modeling problems at large scale. The models we consider in this project include latent Dirichlet allocation (LDA), hidden Markov models (HMM), and Bayesian mixture models (BMM). Optionally, we will consider the hierarchical Dirichlet process (HDP), the HDP-HMM, and their generalizations that model dependence among latent variables [18].

Designs and Implementations

The package will contain the following major modules:

(i) Solvers: The main solver will be based on stochastic optimization built on coordinate ascent and natural gradient methods. We will allow the user to specify the step size or let the program choose it automatically. The following is an example of the main S3 function for model estimation:

svi <- function(X,                               # data matrix
                model = c("lda", "hmm", "bmm"),  # model choices
                step = NULL,                     # step size
                max.ite = NULL,                  # maximum number of iterations
                precision = NULL,                # precision for stopping criterion
                ...)
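A call might then look like the following (a hypothetical example; the argument values are placeholders):

fit <- svi(X, model = "lda", step = NULL, max.ite = 5000, precision = 1e-4)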

(ii) Visualization: we will develop a low-dimensional embedding function for visualization purposes. This will allow users to visualize the estimation results for further interpretation of the model. The core engine of the package will be implemented in C: all core stochastic optimization algorithms will be written in C, with the function interfaces exposed in R. For computations involving matrix operations, we will call BLAS/LAPACK routines for efficiency. We will also provide comprehensive code comments so that readers can easily follow and modify the code as they wish.
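As a rough sketch of the planned R-to-C split (the routine name and signature below are hypothetical), an R-level model function would simply marshal its arguments and dispatch to a registered C routine, which in turn calls BLAS/LAPACK:

# Hypothetical wrapper: the heavy lifting happens in a compiled C routine
# registered under the name "svi_lda_core"; only argument marshaling stays in R.
svi.lda <- function(X, K, step, max.ite, precision) {
  .Call("svi_lda_core", as.matrix(X), as.integer(K),
        as.double(step), as.integer(max.ite), as.double(precision))
}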

If the optional models cannot be implemented within the GSoC timeline, we will keep working on the project after the summer. To the best of my knowledge, this would be the first generic toolbox that considers such a wide family of probabilistic models.

Timeline

05/30/2017 – 06/12/2017 : Implement the framework and initial documentation

06/13/2017 – 06/26/2017 : Implement LDA model;

06/27/2017 – 07/10/2017 : Implement HMM model;

07/11/2017 – 07/24/2017 : Implement BMM model and optional models;

07/25/2017 – 08/07/2017 : Implement visualization module;

08/08/2017 – 08/21/2017 : Optimize the implementation of the whole package for further efficiencies in both R and C;

08/22/2017 – 08/28/2017 : Test the package and finalize the documents.

Student and Mentors

Prospective Student Developer: Jason Ge, Ph.D. Student, Princeton University

Primary Mentor: Tuo Zhao, Assistant Professor, Georgia Tech

Secondary Mentor: Xingguo Li, Ph.D. Student, University of Minnesota

References

[1] M. Wainwright and M. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, Vol. 1, Nos. 1–2, 1–305, 2008.

[2] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.

[3] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257–286, 1989.

[4] S. Fine, Y. Singer, and N. Tishby. The hierarchical hidden Markov model: Analysis and applications. Machine Learning, 32:41–62, 1998.

[5] E. Fox, E. Sudderth, M. Jordan, and A. Willsky. A sticky HDP-HMM with application to speaker diarization. Annals of Applied Statistics, 5(2A):1020–1056, 2011.

[6] H. Attias. A variational Bayesian framework for graphical models. In Neural Information Processing Systems, 2000.

[7] R. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME: Journal of Basic Engineering, 82:35–45, 1960.

[8] Z. Ghahramani and M. Jordan. Factorial hidden Markov models. Machine Learning, 31(1), 1997.

[9] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, 2007.

[10] M. Girolami and S. Rogers. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Computation, 18(8), 2006.

[11] M. Tipping and C. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.

[12] M. Hoffman, D. Blei, and P. Cook. Bayesian nonparametric matrix factorization for recorded music. In International Conference on Machine Learning, 2010.

[13] C. Bishop. Pattern Recognition and Machine Learning. Springer New York, 2006.

[14] S. Amari. Natural gradient works efficiently in learning. Neural computation, 10(2):251–276, 1998.

[15] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[16] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

[17] M. Hoffman, D. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, May 2013.

[18] M. Johnson and A. Willsky. Stochastic variational inference for Bayesian time series models. In Proceedings of the International Conference on Machine Learning, 1854–1862, 2014.
