The core of any Rainier model is the joint density function, P(Θ|x,y). This function (and any other real-valued function of the parameters Θ) is expressed in Rainier as a `com.stripe.rainier.compute.Real` object. `Real` is similar to TensorFlow's static compute graph: it's a DAG whose root node represents the output of the function, whose leaf nodes are either `Variable`s (one for each parameter) or constant `Double`s, and whose interior nodes represent mathematical operations on these. The number of operations supported is deliberately minimal: the 4 arithmetic operators, along with `log`, `exp`, and `abs`. `Real` also includes the standard comparison operators and a simple `If` expression, though since these introduce discontinuities their use should be limited.
Unlike TensorFlow, all operations are scalar. Although your models might well involve vectors or other collections of parameters and observations, the underlying compute graph will be expressed purely in terms of scalar values and operations. This choice simplifies the API and the DAG implementation, and is well-suited to targeting CPU-based execution, especially on the JVM where there isn't reliable access to SIMD. In the future, if we want to support execution on GPUs, adding at least some limited vectorization to the API will be valuable.
Also unlike TensorFlow, there is no concept of a "placeholder" for observations, or other support for mini-batch execution. Instead, the assumption is that all of your data will fit in memory, and data points are treated just like any other constants in your model. This choice is informed by the typical use of full-batch iterations in MCMC algorithms, as well as the generally smaller data sizes for Bayesian inference compared to deep learning applications.
One huge benefit that Rainier gets from this last choice is the ability to partially evaluate the model based on the known observations even when the parameters are still unknown. As the `Real` DAG is being constructed, it will take every opportunity to aggressively pre-compute operations on constants, expand or simplify expressions, and combine common terms, with the goal of minimizing the number of operations that will be needed for each evaluation later on. (For more info on the representation we use to achieve this, see the comments in `Real.scala`.)
In the extreme, this can reduce what would otherwise be O(n*m) work, for n data points and m iterations, into O(n+m), with a single O(n) partial evaluation step followed by m O(1) iterations. As a simple example, consider computing the mean squared error of a series of known data points vs an unknown estimate `y`, eg `data.map{x => Math.pow(x - y,2)}.sum`: this can be reduced to an expression of the form `y*a + y*y*b + c`, with known `a`, `b`, `c`, no matter how big `data` is.
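As a concrete (if toy) illustration of this reduction, outside of Rainier itself: expanding the squared error and collecting coefficients turns the O(n) sum into three constants computed once, after which every evaluation is O(1). The function names here are illustrative, not part of Rainier's API.

```scala
// Expanding sum((x - y)^2) = sum(x^2) - 2*sum(x)*y + n*y^2 lets us
// precompute three constants from the data in a single O(n) pass.
def mseCoefficients(data: Seq[Double]): (Double, Double, Double) = {
  val a = -2.0 * data.sum          // coefficient of y
  val b = data.size.toDouble       // coefficient of y*y
  val c = data.map(x => x * x).sum // constant term
  (a, b, c)
}

// Each later evaluation is O(1), regardless of how big the data was.
def sumSquaredError(y: Double, a: Double, b: Double, c: Double): Double =
  y * a + y * y * b + c
```

For `data = Seq(1.0, 2.0, 3.0)` and `y = 0.5`, the reduced form agrees with the direct `data.map{x => Math.pow(x - y, 2)}.sum`.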
In order to support gradient-based sampling and optimization methods, `Real` provides `gradient`, which uses reverse-mode autodifferentiation to construct the gradient of the output node with respect to each `Variable` leaf in the DAG (in the form of a new, transformed `Real` for each variable). Because the gradient operation is closed (as in, has `Real`s as both input and output), it is possible to take second derivatives, though Rainier does not currently make use of this.
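The mechanics of reverse-mode autodifferentiation can be sketched in a few lines. This is a deliberately minimal toy, not Rainier's `Real` implementation: each node caches its forward value, and the backward pass accumulates an adjoint (the derivative of the output with respect to that node) down to the leaves.

```scala
// Minimal reverse-mode autodiff sketch over a tiny expression type.
sealed trait Node { var value: Double = 0.0; var adjoint: Double = 0.0 }
final class Leaf(v: Double) extends Node { value = v }
final class Mul(val l: Node, val r: Node) extends Node
final class Add(val l: Node, val r: Node) extends Node
final class Log(val x: Node) extends Node

// Forward pass: compute and cache each node's value.
def forward(n: Node): Double = n match {
  case _: Leaf => n.value
  case m: Mul  => m.value = forward(m.l) * forward(m.r); m.value
  case a: Add  => a.value = forward(a.l) + forward(a.r); a.value
  case l: Log  => l.value = math.log(forward(l.x)); l.value
}

// Backward pass: propagate d(output)/d(node) from the root to the leaves.
def backward(n: Node, seed: Double): Unit = {
  n.adjoint += seed
  n match {
    case m: Mul  => backward(m.l, seed * m.r.value); backward(m.r, seed * m.l.value)
    case a: Add  => backward(a.l, seed); backward(a.r, seed)
    case l: Log  => backward(l.x, seed / l.x.value)
    case _: Leaf => ()
  }
}
```

For `x * log(x)` at `x = 2`, after `forward` and `backward(expr, 1.0)` the leaf's adjoint is `log(2) + 1`, the analytic derivative.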
To evaluate a `Real` given parameters Θ, you can construct an `Evaluator` containing a `Map[Variable,Double]` with concrete values for each `Variable` leaf. You can reuse this same `Evaluator` to evaluate multiple `Real`s in sequence (for example, a joint density and also the various components of its gradient), and it will memoize and reuse any shared intermediate results. However, if you know you need to evaluate the same set of `Real`s many times, you are much better off using `Compiler`, which will produce an extremely efficient `Array[Double] => Array[Double]` function for computing the values of multiple `Real`s given values for the union of their input `Variable`s. This makes use of the `rainier.ir` package to dynamically generate and load a custom `.class` with optimized JVM bytecode for running the computation.
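The memoization behaviour described here can be sketched with a toy expression type and an `Evaluator`-style cache. This illustrates the idea only; it is not Rainier's actual `Evaluator`, and the `evaluations` counter exists purely to make the sharing visible.

```scala
import scala.collection.mutable

// Toy expression DAG with an evaluator that caches every sub-expression,
// so shared subgraphs are computed only once across multiple evaluations.
sealed trait Expr
case class Const(v: Double) extends Expr
case class Param(name: String) extends Expr
case class Plus(l: Expr, r: Expr) extends Expr
case class Times(l: Expr, r: Expr) extends Expr

class Evaluator(params: Map[String, Double]) {
  private val cache = mutable.Map.empty[Expr, Double]
  var evaluations = 0 // counts cache misses, for illustration only

  def apply(e: Expr): Double = cache.get(e) match {
    case Some(v) => v
    case None =>
      evaluations += 1
      val v = e match {
        case Const(c)    => c
        case Param(n)    => params(n)
        case Plus(l, r)  => apply(l) + apply(r)
        case Times(l, r) => apply(l) * apply(r)
      }
      cache(e) = v
      v
  }
}
```

Evaluating two expressions that share a `Times(Param("x"), Param("x"))` subgraph computes that subgraph only once, which is the benefit of reusing one evaluator across a density and its gradient components.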
If you're curious about what the generated bytecode looks like for an expression, including to see the effect of the partial evaluation mentioned earlier, `Real.trace(expr)` will dump Scala code to the console that should be equivalent to what `Compiler` produces (although to be clear, this is just used for debugging purposes, and may diverge slightly from the actual, bytecode-based codegen).
The `ir` package provides an intermediate representation AST for mathematical functions from R^N to R^M, which it compiles to a JVM class using the asm4 package. The AST itself is extremely simple, providing nodes for binary and unary mathematical operators, loading constants or parameters, and defining and referencing variables. To ensure that the methods produced are small enough to be JITted by the JVM, the IR's code generator will split a large expression into sub-expressions, each implemented with its own private method. If a variable is defined in one sub-expression and referenced in another, it is assigned a slot in a globals array that is passed around between these methods. The globals array is allocated by the `apply` (`Array[Double] => Array[Double]`) entry point to the generated class.
This package has no dependencies on other Rainier code, and is used only by the `compute` package.
The `sampler` package depends only on the `compute` representation of a model: a sampler will take a `Real` for the joint density and produce, effectively, a stream of `Evaluator`s that encapsulate parameter values sampled from the posterior. These `Evaluator`s can be used to sample from other functions of the same parameters.
Rainier currently provides the `Walkers` affine-invariant sampler and the `HMC` Hamiltonian sampler. It also provides a vanilla gradient descent `MAP` optimizer. All of these share the common `Sampler` interface.
`Walkers` is an ensemble-based method that is fast and robust for small numbers of parameters (< 10). It is currently the default. The best reference for it is Foreman-Mackey, Hogg, Lang & Goodman (2012).
`HMC` is a gradient-based method which requires some tuning, but is quite efficient once tuned, even for large numbers of parameters. The best reference is Betancourt (2017). Our implementation includes the dual-averaging automatic tuning of step size from the NUTS paper, but requires you to manually specify the number of leapfrog steps. In the future, we plan to implement the full NUTS algorithm to also dynamically select the number of steps.
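The leapfrog integration that HMC is built on can be sketched as follows. This is a generic textbook leapfrog for a 1-D target, not Rainier's `HMC` implementation; `logP` and `gradLogP` stand in for the compiled density and its gradient.

```scala
// One leapfrog trajectory: a half momentum step, alternating full
// position and momentum steps, and a final half momentum step.
def leapfrog(q0: Double, p0: Double, stepSize: Double, steps: Int,
             gradLogP: Double => Double): (Double, Double) = {
  var q = q0
  var p = p0 + 0.5 * stepSize * gradLogP(q) // initial half step for momentum
  for (_ <- 1 until steps) {
    q = q + stepSize * p                    // full step for position
    p = p + stepSize * gradLogP(q)          // full step for momentum
  }
  q = q + stepSize * p
  p = p + 0.5 * stepSize * gradLogP(q)      // final half step for momentum
  (q, p)
}

// The Hamiltonian (negative log density plus kinetic energy) that the
// integrator should approximately conserve along the trajectory.
def hamiltonian(q: Double, p: Double, logP: Double => Double): Double =
  -logP(q) + p * p / 2
```

For a standard-normal target (`logP(q) = -q*q/2`), a trajectory of 20 steps at step size 0.1 conserves the Hamiltonian to well under 1e-2, which is what makes the subsequent accept/reject step efficient.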
The `MAP` gradient descent optimizer is likely more of a demonstration than of practical use, but it does have the virtue of simplicity. In the future, we will hopefully add more sophisticated optimizers such as `L-BFGS`.
The `core` package contains the primary API actually used by end users who want to do Bayesian inference. The tour provides an overview of how this API is used, which will be extremely helpful context for understanding what follows.
The central type of this API is `RandomVariable[T]`. `RandomVariable[T]` embodies a model as a pair of related functions, which it labels `value` and `density`. They both take the same real-valued parameter vector Θ as input. `value` is a function that produces the outcome of the random variable (which could be of any type), given values for the parameters. `density` is a real-valued function that produces the log of the unnormalized joint probability of the parameters. Although the density is in terms of the parameters, and not in terms of the outcome, if you sample the parameter values according to the density, and then feed those parameter values into the value function, you end up with a valid sample of the distribution of the outcomes of the random variable.
The representation of the `density` function is a `Real` (or, in fact, several `Real`s whose values are summed; this is discussed below). This is what will ultimately be passed on to the `Sampler`.
The representation of the `value` function for `RandomVariable[T]` is an object of some type `T`. To be able to `sample` from that `RandomVariable`, there must be a typeclass `Sampleable[T,V]` for some `V`; the combination of `value` and its `Sampleable` instance forms the required function from Θ to `V`. (It's a bit unintuitive that in `RandomVariable[T]`, the `T` refers to the type of the function object, rather than the type produced by the function, which is implicit.)
In practice, there are two classes of `T`: deterministic functions with outputs in R^N, and everything else. The former cases are captured by typeclasses such as `Sampleable[Real,Double]` or `Sampleable[Map[String,Real],Map[String,Double]]`. These are statically analyzable, always produce the same value for the same parameters, and can be efficiently evaluated using the `Compiler`, just like the `density` is. (This is true even if there's some structure, like in the `Map` case, wrapped around the `Real` objects.)
All remaining cases are captured by `Sampleable[Generator[V],V]`. A `Generator` object can represent any function, including a non-deterministic function, from Θ => `V`. It will not, however, directly interact with Θ; instead, it will hold onto references to one or more `Real` objects. `Generator` is evaluated with the `get` method, which takes, as one of its arguments, a `Numeric[Real]` - in practice, an `Evaluator` - which allows it to evaluate any referenced `Real`s via `toDouble`, albeit (in the general case) more slowly than with `Compiler`. Thus, a `Generator[V]` is, conceptually, a composition of one or more functions from Θ => `Double`, represented as `Real`s, with some further function from `(Double,...) => V`.
The other argument to `Generator.get` is an `RNG` object, and many `Generator` objects represent sampling functions for various distribution families, parameterized by `Real`. `Generator` implements `map`, `zip`, and `flatMap` in the standard way for a probability monad (similarly to, eg, https://github.com/jliszka/probability-monad).
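A toy version of such a probability monad, with an explicit source of randomness standing in for Rainier's `RNG` and `Numeric[Real]` arguments, might look like this (illustrative only, not Rainier's `Generator`):

```scala
// Toy probability monad: a Gen[T] is a function from a random source to T.
case class Gen[T](get: scala.util.Random => T) {
  def map[U](f: T => U): Gen[U] = Gen(rng => f(get(rng)))
  def flatMap[U](f: T => Gen[U]): Gen[U] = Gen(rng => f(get(rng)).get(rng))
  def zip[U](other: Gen[U]): Gen[(T, U)] =
    Gen(rng => (get(rng), other.get(rng)))
}

// A sampling function for the normal family.
def normal(mu: Double, sigma: Double): Gen[Double] =
  Gen(rng => mu + sigma * rng.nextGaussian())

// Composition: the sum of two independent standard-normal draws,
// which is distributed as Normal(0, sqrt(2)).
val sumOfTwo: Gen[Double] =
  normal(0, 1).zip(normal(0, 1)).map { case (a, b) => a + b }
```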
`RandomVariable` also provides monadic `map` and `flatMap` methods. `map(f: T => U)` returns a new `RandomVariable[U]` whose `density` is unchanged but whose `value` function is replaced with some new object which is a transformation via `f` of its old value function. Intuitively, this is a transformation of the output value without affecting the underlying parameter distribution.
To understand `flatMap(f: T => RandomVariable[U])`, consider `val result: RandomVariable[U] = t.flatMap(f)`, which creates an intermediate `u: RandomVariable[U] = f(t.value)`. Then, `result.value = u.value`, and `result.density = t.density + u.density`. This could be doing any or all of the following:
- Introducing a new parameter. Unlike in the `map` case, where the unchanged `density` means that the parameter space must remain the same, here we could be introducing some new `Variable` as a leaf node of the `u.density` DAG, and so the parameter space of `result` can be of a higher dimension than the parameter space of `t`.
- Conditioning on observed data. We could be leaving `t.value` unchanged, but using it as an input into some likelihood function that also brings in some observational data, whose result will become `u.density` and so be incorporated via the `flatMap` into our existing priors.
- Transforming the `value` function. Like a `map`, a `flatMap` can change the `value`; especially common is to, in combination with conditioning on observed data, change from a deterministic `Real`-based value function to a stochastic `Generator` that acts as a posterior prediction.
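This `flatMap` contract can be sketched with a toy `RandomVariable` whose density is a set of named terms standing in for `Real`s. Everything here is illustrative (the string-based densities especially); it exists only to show `value` being replaced while density terms accumulate without double counting:

```scala
// Toy RandomVariable: value is replaced by map/flatMap, while density
// terms accumulate as a set, so the same term is only counted once.
case class RV[T](value: T, densityTerms: Set[String]) {
  def map[U](f: T => U): RV[U] = RV(f(value), densityTerms)
  def flatMap[U](f: T => RV[U]): RV[U] = {
    val u = f(value)
    RV(u.value, densityTerms union u.densityTerms)
  }
}

// A prior over x, then conditioning on data via flatMap: the result's
// density combines the prior term and the likelihood term.
val prior = RV(1.0, Set("logNormalPrior(x)"))
val model = prior.flatMap { x =>
  RV(x * 2, Set("logLikelihood(data | x)"))
}
```

Referring to `prior` a second time in another `flatMap` leaves its single density term in place rather than adding it twice, which previews the set-union behaviour discussed below.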
It's useful to note that shifting to a `Generator` in this way is a one-way ratchet, and that after that shift you're unlikely to get anything valuable out of any further `flatMap` (apart from the special case of `zip`). That's because a useful `flatMap` is usually using some part of `t.value` as an input into the new `u.density`; once `t.value` is a `Generator`, the `Real` parts of the function are hidden and there's no way to use them to refine the density.
One final subtlety is that `flatMap` does not directly sum `t.density` and `u.density`; instead, the `density` is a `Set` of `Real`s, and these sets are unioned during `flatMap`; at the end, during sampling, all of the elements of this set are summed. The goal of this is that multiple references to the same, identical (in the sense of object identity) `RandomVariable` only affect the density once; for example, introducing a new parameter in one `flatMap` and then later referring to that same parameter in another `flatMap` should not add the parameter's prior into the density a second time.
The other important type in the API is `Distribution[T]`. These objects represent parameterized distributions with support in `T`. Almost all of the implementation work has gone into the `Continuous` distributions, which are those with univariate continuous real support, ie `Distribution[Double]`. As well as `generator` and `logDensity`, which all `Distribution`s implement, the `Continuous` distributions implement `param`, which returns a `RandomVariable[Real]` whose `density` and `value` functions together produce outcomes drawn from the appropriate distribution. These `param` objects form the priors for Rainier models.
Rainier's common implementation pattern is to define a single `Continuous` object for a family of distributions whose density is centered on 0 and has a standard scale; for example, `Normal(0,1)`. Other parameterizations are expressed as transformations of these standard distributions: for example, `Normal(2,3).param` is (conceptually) implemented as `Normal(0,1).param.map{x => x * 3 + 2}`. As noted in the previous section, this `map` does not change the density function, and the raw underlying parameters are still `Normal(0,1)`, but our transformed view of the parameters is as a `Normal(2,3)`.
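To see why this transformed view is valid, here is a standalone sketch (plain Scala, not Rainier API) that draws standard-normal samples and applies the affine map `z => z * sigma + mu`: the transformed draws have the mean and scale of `Normal(mu, sigma)`.

```scala
// Draw n samples from the raw, standard-normal parameter space.
def standardNormalDraws(n: Int, seed: Long): Seq[Double] = {
  val rng = new scala.util.Random(seed)
  Seq.fill(n)(rng.nextGaussian())
}

// The transformed view: z ~ Normal(0,1) mapped to z * sigma + mu,
// which is distributed as Normal(mu, sigma).
def transformed(mu: Double, sigma: Double, raw: Seq[Double]): Seq[Double] =
  raw.map(z => z * sigma + mu)
```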
If we literally just used `map` like that, we could use our transformed `Normal(2,3)` as a prior to get a `param`, but we would be unable to use its `logDensity` as a likelihood function. Instead, we do the transformations with an `Injection` that includes both the forwards transformation to apply to parameters, and the backwards transformation to bring observed data into the raw parameter space, including the necessary Jacobian term related to this change of variables.
Similarly, all of the underlying raw parameters are unbounded, and if we need to express a parameter bounded on one or both sides, we transform it. This currently appears in two places: the `Gamma` distribution family is bounded to be > 0, which it achieves by applying an `exp` transformation to the unbounded raw parameter; and the `Uniform` distribution family is (in the standard form) bounded to (0,1), which it achieves with a logistic transformation. In both cases the appropriate Jacobian term for the transformation is added to the density and packaged up with the transformed value into the `RandomVariable` returned by `param`.
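The log-Jacobian terms for these two transformations have simple closed forms: for `x = exp(z)`, log|dx/dz| = z; for the logistic `x = 1 / (1 + exp(-z))`, log|dx/dz| = log(x) + log(1 - x). A standalone sketch of both (not Rainier's `Injection` API):

```scala
// The logistic transformation, mapping an unbounded z into (0,1).
def logistic(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// log|dx/dz| for x = exp(z): the derivative is exp(z), whose log is z.
def expLogJacobian(z: Double): Double = z

// log|dx/dz| for the logistic: the derivative is x * (1 - x),
// whose log is log(x) + log(1 - x).
def logisticLogJacobian(z: Double): Double = {
  val x = logistic(z)
  math.log(x) + math.log(1 - x)
}
```

Both closed forms can be checked against a central-difference numerical derivative of the corresponding transformation.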