
Rainier implementation notes

rainier.compute

The core of any Rainier model is the joint density function, P(Θ|x,y). This function (and any other real-valued function of the parameters Θ) is expressed in Rainier as a com.stripe.rainier.compute.Real object. Real is similar to TensorFlow's static compute graph: it's a DAG whose root node represents the output of the function, whose leaf nodes are either Variable (one for each parameter) or constant Doubles, and whose interior nodes represent mathematical operations on these. The set of operations supported is deliberately minimal: the 4 arithmetic operators, along with log, exp, and abs. Real also includes the standard comparison operators and a simple If expression, though since these introduce discontinuities their use should be limited.

Unlike TensorFlow, all operations are scalar. Although your models might well involve vectors or other collections of parameters and observations, the underlying compute graph will be expressed purely in terms of scalar values and operations. This choice simplifies the API and the DAG implementation, and is well-suited to targeting CPU-based execution, especially on the JVM where there isn't reliable access to SIMD. In the future, if we want to support execution on GPUs, adding at least some limited vectorization to the API will be valuable.

Also unlike TensorFlow, there is no concept of a "placeholder" for observations, or other support for mini-batch execution. Instead, the assumption is that all of your data will fit in memory, and data points are treated just like any other constants in your model. This choice is informed by the typical use of full-batch iterations in MCMC algorithms, as well as the generally smaller data sizes for Bayesian inference compared to deep learning applications.

One huge benefit that Rainier gets from this last choice is the ability to partially evaluate the model based on the known observations even when the parameters are still unknown. As the Real DAG is being constructed, it will take every opportunity to aggressively pre-compute operations on constants, expand or simplify expressions, and combine common terms, with the goal of minimizing the number of operations that will be needed for each evaluation later on. (For more info on the representation we use to achieve this, see the comments in Real.scala).

In the extreme, this can reduce what would otherwise be O(n*m) work, for n data points and m iterations, into O(n+m): a single O(n) partial evaluation step followed by m O(1) iterations. As a simple example, consider computing the sum of squared errors of a series of known data points against an unknown estimate y, e.g. data.map{x => Math.pow(x - y,2)}.sum; this can be reduced to an expression of the form y*a + y*y*b + c, with known a, b, and c, no matter how big data is.
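The reduction above can be worked through directly, without assuming anything about Rainier's internals:

```scala
object PartialEvalDemo {
  // Worked instance of the reduction described above: the O(n) pass over the
  // known data happens once, after which each evaluation of the sum of
  // squared errors is O(1) in the size of the data.
  val data: List[Double] = List(1.0, 2.0, 3.0, 4.0)

  // One-time partial evaluation over the known constants:
  val b: Double = data.size.toDouble       // coefficient of y*y
  val a: Double = -2.0 * data.sum          // coefficient of y
  val c: Double = data.map(x => x * x).sum // constant term

  // O(1) per evaluation, regardless of how big `data` is:
  def reduced(y: Double): Double = y * a + y * y * b + c

  // The direct O(n) form it replaces:
  def direct(y: Double): Double = data.map(x => math.pow(x - y, 2)).sum
}
```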

In order to support gradient-based sampling and optimization methods, Real provides gradient, which uses reverse-mode autodifferentiation to construct the gradient for each Variable leaf in the DAG with respect to the output node (in the form of a new, transformed Real for each variable). Because the gradient operation is closed (as in, has Reals as both input and output), it is possible to take second derivatives, though Rainier does not currently make use of this.
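To give a flavor of what reverse-mode autodifferentiation over such a DAG involves, here is a minimal, self-contained sketch. This is illustrative Scala only, not Rainier's actual Real implementation (for one thing, it re-evaluates subexpressions rather than caching forward values):

```scala
object AutodiffSketch {
  // A tiny scalar expression DAG, in the spirit of the one described above.
  sealed trait Expr
  final case class Const(v: Double) extends Expr
  final class Variable extends Expr // identity equality, one per parameter
  final case class Add(a: Expr, b: Expr) extends Expr
  final case class Mul(a: Expr, b: Expr) extends Expr
  final case class Log(a: Expr) extends Expr

  def eval(e: Expr, env: Map[Variable, Double]): Double = e match {
    case Const(v)    => v
    case v: Variable => env(v)
    case Add(a, b)   => eval(a, env) + eval(b, env)
    case Mul(a, b)   => eval(a, env) * eval(b, env)
    case Log(a)      => math.log(eval(a, env))
  }

  // Walk back from the root, pushing each node's adjoint (d output / d node)
  // down to its children via the chain rule; adjoints arriving at the same
  // Variable leaf are accumulated.
  def gradient(e: Expr, env: Map[Variable, Double]): Map[Variable, Double] = {
    def go(e: Expr, adjoint: Double, acc: Map[Variable, Double]): Map[Variable, Double] =
      e match {
        case Const(_)    => acc
        case v: Variable => acc.updated(v, acc.getOrElse(v, 0.0) + adjoint)
        case Add(a, b)   => go(b, adjoint, go(a, adjoint, acc))
        case Mul(a, b)   => go(b, adjoint * eval(a, env), go(a, adjoint * eval(b, env), acc))
        case Log(a)      => go(a, adjoint / eval(a, env), acc)
      }
    go(e, 1.0, Map.empty)
  }
}
```

For example, for log(x*x) at x = 3, the gradient accumulates 1/3 from each use of x, giving 2/3, which matches d/dx log(x²) = 2/x.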

To evaluate a Real given parameters Θ, you can construct an Evaluator containing a Map[Variable,Double] with concrete values for each Variable leaf. You can reuse this same Evaluator to evaluate multiple Reals in sequence (for example, a joint density and also the various components of its gradient), and it will memoize and reuse any shared intermediate results. However, if you know you need to evaluate the same set of Reals many times, you are much better off using Compiler, which will produce an extremely efficient Array[Double] => Array[Double] for computing the values of multiple Reals given values for the union of their input Variables. This makes use of the rainier.ir package to dynamically generate and load a custom .class with optimized JVM bytecode for running the computation.
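The memoization behavior can be sketched like this (simplified stand-in types, not Rainier's actual classes): interior nodes shared between several expressions are computed once per evaluator and reused.

```scala
import scala.collection.mutable

object MemoSketch {
  // A tiny DAG: Sum is a plain class, so nodes compare by identity, like
  // interior nodes of a compute graph.
  sealed trait Node
  final case class Leaf(v: Double) extends Node
  final class Sum(val a: Node, val b: Node) extends Node

  final class Memo {
    private val cache = mutable.Map.empty[Sum, Double]
    var computed = 0 // number of interior-node computations, for illustration

    def eval(n: Node): Double = n match {
      case Leaf(v) => v
      case s: Sum =>
        cache.get(s) match {
          case Some(v) => v
          case None =>
            computed += 1
            val v = eval(s.a) + eval(s.b)
            cache(s) = v
            v
        }
    }
  }
}
```

Evaluating two expressions that share a subgraph through the same Memo computes the shared node only once, which is the payoff of reusing one Evaluator across a density and its gradient components.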

If you're curious what the generated code looks like for an expression (for example, to see the effect of the partial evaluation mentioned earlier), Real.trace(expr) will dump Scala code to the console that should be equivalent to what Compiler produces (though to be clear, this is only used for debugging, and may diverge slightly from the actual bytecode-based codegen).

rainier.ir

The ir package provides an intermediate representation AST for mathematical functions from R^N to R^M, which it compiles to a JVM class using the asm4 package. The AST itself is extremely simple, providing nodes for binary and unary mathematical operators, loading constants or parameters, and defining and referencing variables. To ensure that the methods produced are small enough to be JITted by the JVM, the IR's code generator will split a large expression into sub-expressions, each implemented with their own private method. If a variable is defined in one sub-expression and referenced in another, it is assigned a slot in a globals array that is passed around between these methods. The globals array is allocated by the apply(Array[Double] => Array[Double]) entry point to the generated class.
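Written out by hand, the generated class takes roughly this shape (a sketch of the pattern described above, not actual codegen output):

```scala
// Sketch of the method-splitting pattern: a large expression is split across
// private methods small enough for the JVM to JIT, with values that cross
// sub-expression boundaries passed via a `globals` array.
final class CompiledFnSketch {
  def apply(params: Array[Double]): Array[Double] = {
    val globals = new Array[Double](1) // one slot, shared by sub1 and sub2
    val out = new Array[Double](1)
    // sub1 is invoked before sub2, matching the order in which the codegen
    // defines and then references the shared variable:
    out(0) = sub1(params, globals) + sub2(params, globals)
    out
  }

  // Defines globals(0)...
  private def sub1(params: Array[Double], globals: Array[Double]): Double = {
    globals(0) = params(0) * params(0)
    globals(0) + 1.0
  }

  // ...which is referenced here, in a different sub-expression.
  private def sub2(params: Array[Double], globals: Array[Double]): Double =
    globals(0) * params(1)
}
```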

This package has no dependencies on other Rainier code, and is used only by the compute package.

rainier.sampler

The sampler package depends only on the compute representation of a model: a sampler will take a Real for the joint density and produce, effectively, a stream of Evaluators that encapsulate parameter values sampled from the posterior. These Evaluators can be used to sample from other functions of the same parameters.

Rainier currently provides the Walkers affine-invariant ensemble sampler and the HMC (Hamiltonian Monte Carlo) sampler. It also provides a vanilla gradient-descent MAP optimizer. All of these share the common Sampler interface.

Walkers is an ensemble-based method that is fast and robust for small numbers of parameters (< 10). It is currently the default. The best reference for it is Foreman-Mackey, Hogg, Lang & Goodman (2012).

HMC is a gradient-based method which requires some tuning, but is quite efficient once tuned even for large numbers of parameters. The best reference is Betancourt (2017). Our implementation includes the dual-averaging automatic tuning of step-size from the NUTS paper, but requires you to manually specify a number of leapfrog steps. In the future, we plan to implement the full NUTS algorithm to also dynamically select the number of steps.
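The leapfrog integrator at the heart of HMC can be sketched as follows (simplified: identity mass matrix, and no acceptance step or step-size tuning shown):

```scala
object LeapfrogSketch {
  // One leapfrog trajectory: alternating half-step momentum updates with
  // full-step position updates, driven by the gradient of the log density.
  def leapfrog(
      q0: Array[Double], // position: the parameters Θ
      p0: Array[Double], // momentum
      gradLogP: Array[Double] => Array[Double],
      stepSize: Double,
      steps: Int // the leapfrog step count you must currently specify by hand
  ): (Array[Double], Array[Double]) = {
    var q = q0.clone
    var p = p0.clone
    for (_ <- 0 until steps) {
      val g1 = gradLogP(q)
      p = Array.tabulate(p.length)(i => p(i) + 0.5 * stepSize * g1(i)) // half step
      q = Array.tabulate(q.length)(i => q(i) + stepSize * p(i))        // full step
      val g2 = gradLogP(q)
      p = Array.tabulate(p.length)(i => p(i) + 0.5 * stepSize * g2(i)) // half step
    }
    (q, p)
  }
}
```

A useful property to check: for a standard normal log density, the Hamiltonian 0.5*(q² + p²) should be nearly conserved along the trajectory.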

The MAP gradient descent optimizer is likely more of a demonstration than of practical use, but it does have the virtue of simplicity. In the future, we will hopefully add more sophisticated optimizers such as L-BFGS.
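For a single parameter, vanilla gradient ascent on the log density is just (step size and iteration count here are illustrative, not Rainier's defaults):

```scala
object MapSketch {
  // Climb toward the posterior mode by following the gradient of the
  // log density.
  def mapEstimate(gradLogP: Double => Double, init: Double, stepSize: Double, iters: Int): Double =
    (0 until iters).foldLeft(init)((q, _) => q + stepSize * gradLogP(q))
}
```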

rainier.core.RandomVariable

The core package contains the primary API actually used by end-users who want to do Bayesian inference. The tour provides an overview of how this API is used, which is extremely helpful context for understanding what follows.

The central type of this API is RandomVariable[T]. A RandomVariable embodies a model as a pair of related functions, which it labels value and density. Both take the same real-valued parameter vector Θ as input. value is a function that produces the outcome of the random variable (which could be of any type), given values for the parameters. density is a real-valued function that produces the log of the unnormalized joint probability of the parameters. Although the density is in terms of the parameters, and not in terms of the outcome, if you sample the parameter values according to the density, and then feed those parameter values into the value function, you end up with a valid sample of the distribution of the outcomes of the random variable.

The representation of the density function is a Real (or, in fact, several Reals whose values are summed; this is discussed below.) This is what will ultimately be passed on to the Sampler.

The representation of the value function for RandomVariable[T] is an object of some type T. To be able to sample from that RandomVariable, there must be a typeclass Sampleable[T,V] for some V; the combination of value and its Sampleable instance forms the required function from Θ to V. (It's a bit unintuitive that in RandomVariable[T], the T refers to the type of the function object, rather than the type produced by the function, which is implicit.)

In practice, there are two classes of T: deterministic functions with outputs in R^N, and everything else. The former cases are captured by typeclasses such as Sampleable[Real,Double] or Sampleable[Map[String,Real],Map[String,Double]]. These are statically analyzable, always produce the same value for the same parameters, and can be efficiently evaluated using the Compiler just like the density is. (This is true even if there's some structure, like in the Map case, wrapped around the Real objects.)

All remaining cases are captured by Sampleable[Generator[V],V]. A Generator object can represent any function, including a non-deterministic function, from Θ => V. It will not, however, directly interact with Θ; instead, it will hold onto references to one or more Real objects. Generator is evaluated with the get method which takes, as one of its arguments, a Numeric[Real] - in practice, an Evaluator - which allows it to evaluate any referenced Reals via toDouble, albeit (in the general case) more slowly than with Compiler. Thus, a Generator[V] is, conceptually, a composition of one or more functions from Θ => Double, represented as Real, with some further function from (Double,...) => V.

The other argument to Generator.get is an RNG object, and many Generator objects represent sampling functions for various distribution families, parameterized by Real. Generator implements map, zip, and flatMap in the standard way for a probability monad (similar to, e.g., https://github.com/jliszka/probability-monad).
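The shape of Generator described above can be sketched as follows; the names and signatures are simplified stand-ins rather than Rainier's actual API, with a Map of evaluated parameter values standing in for the Numeric[Real]/Evaluator:

```scala
object GeneratorSketch {
  // `get` receives the evaluated parameter values and an RNG; the monadic
  // operations compose in the standard way for a probability monad.
  final case class Gen[V](get: (Map[String, Double], scala.util.Random) => V) {
    def map[U](f: V => U): Gen[U] =
      Gen((env, rng) => f(get(env, rng)))
    def flatMap[U](f: V => Gen[U]): Gen[U] =
      Gen((env, rng) => f(get(env, rng)).get(env, rng))
    def zip[U](other: Gen[U]): Gen[(V, U)] =
      Gen((env, rng) => (get(env, rng), other.get(env, rng)))
  }

  // A stochastic generator for Normal(mu, sigma), where the named entries
  // stand in for referenced Reals evaluated from the parameters:
  def normal(mu: String, sigma: String): Gen[Double] =
    Gen((env, rng) => env(mu) + env(sigma) * rng.nextGaussian())
}
```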

RandomVariable also provides monadic map and flatMap methods. map(f: T => U) returns a new RandomVariable[U] whose density is unchanged but whose value function is replaced with some new object which is a transformation via f of its old value function. Intuitively, this is a transformation of the output value without affecting the underlying parameter distribution.

To understand flatMap(f: T => RandomVariable[U]), consider val result: RandomVariable[U] = t.flatMap(f), which creates an intermediate u: RandomVariable[U] = f(t.value). Then, result.value = u.value, and result.density = t.density + u.density. This could be doing any or all of the following:

  • Introducing a new parameter. Unlike in the map case, where the unchanged density means that the parameter space must be remaining the same, here we could be introducing some new Variable as a leaf node of the u.density DAG, and so the parameter space of result can be of a higher dimension than the parameter space of t.
  • Conditioning on observed data. We could be leaving t.value unchanged, but using it as an input into some likelihood function that also brings in some observational data, whose result will become u.density and so be incorporated via the flatMap into our existing priors.
  • Transforming the value function. Like a map, a flatMap can change the value; especially common is to, in combination with conditioning on observed data, change from a deterministic Real-based value function to a stochastic Generator that acts as a posterior prediction.

It's useful to note that shifting to a Generator in this way is a one-way ratchet: after that shift you're unlikely to get anything valuable out of any further flatMap (apart from the special case of zip). That's because a useful flatMap usually uses some part of t.value as an input into the new u.density; once t.value is a Generator, the Real parts of the function are hidden and there's no way to use them to refine the density.

One final subtlety is that flatMap does not directly sum t.density and u.density; instead, the density is a Set of Reals and these sets are unioned during flatMap; at the end, during sampling, all of the elements of this set are summed. The goal of this is that multiple references to the same, identical (in the sense of object identity) RandomVariable only affect the density once; for example, introducing a new parameter in one flatMap and then later referring to that same parameter in another flatMap should not add the parameter's prior in to the density a second time.
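These set-union semantics can be sketched with simplified types (not Rainier's actual code): density terms here have identity-based equality, like Real DAG nodes, so unioning on flatMap counts each term once no matter how often it is referenced.

```scala
object DensitySetSketch {
  // A density term with reference equality, standing in for a Real.
  final class DensityTerm(val logDensity: Double)

  final case class RV[T](value: T, density: Set[DensityTerm]) {
    def map[U](f: T => U): RV[U] = RV(f(value), density)
    def flatMap[U](f: T => RV[U]): RV[U] = {
      val u = f(value)
      RV(u.value, density ++ u.density) // union, not concatenation
    }
    // Summed only at the end, during sampling:
    def totalDensity: Double = density.toList.map(_.logDensity).sum
  }
}
```

Referring to the same RandomVariable in two flatMaps leaves a single copy of its prior in the density set, which is exactly the double-counting protection described above.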

rainier.core.Distribution

The other important type in the API is Distribution[T]. These objects represent parameterized distributions with support in T. Almost all of the implementation work has gone into the Continuous distributions, which are those with univariate continuous real support, i.e. Distribution[Double]. As well as generator and logDensity, which all Distributions implement, the Continuous distributions implement param, which returns a RandomVariable[Real] whose density and value functions together produce outcomes drawn from the appropriate distribution. These param objects form the priors for Rainier models.

Rainier's common implementation pattern is to define a single Continuous object for a family of distributions whose density is centered on 0 and has a standard scale; for example, Normal(0,1). Other parameterizations are expressed as transformations of these standard distributions: for example, Normal(2,3).param is (conceptually) implemented as Normal(0,1).param.map{x => x * 3 + 2}. As noted in the previous section, this map does not change the density function, and the raw underlying parameters are still Normal(0,1), but our transformed view of the parameters is as a Normal(2,3).

If we literally just used map like that, we could use our transformed Normal(2,3) as a prior to get a param but we would be unable to use its logDensity as a likelihood function. Instead, we do the transformations with an Injection that includes both the forwards transformation to apply to parameters, and the backwards transformation to bring observed data into the raw parameter space, including the necessary jacobian term related to this change of variables.
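The Normal(2,3) case can be made concrete. This is a sketch with assumed names (Rainier's Injection is more general), but the forward/backward/jacobian structure is the one described above:

```scala
object InjectionSketch {
  // forward maps raw parameters into the transformed space; backward maps
  // observed data into the raw space; the log-jacobian term accounts for the
  // change of variables in the density.
  final case class Injection(
      forward: Double => Double,
      backward: Double => Double,
      logJacobian: Double => Double
  )

  // Normal(2,3) as a transformation of the standard Normal(0,1):
  val normal23: Injection = Injection(
    forward = x => x * 3 + 2,
    backward = y => (y - 2) / 3,
    logJacobian = _ => -math.log(3) // log |d backward / dy| = log(1/3)
  )

  def stdLogDensity(x: Double): Double =
    -0.5 * x * x - 0.5 * math.log(2 * math.Pi)

  // Log density of an observation y under Normal(2,3), computed entirely in
  // terms of the standard distribution plus the jacobian correction:
  def logDensity(y: Double): Double =
    stdLogDensity(normal23.backward(y)) + normal23.logJacobian(y)
}
```

At y = 2 (the mode) this yields -log(3) - 0.5*log(2π), matching the Normal(2,3) log density written out directly.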

Similarly, all of the underlying raw parameters are unbounded, and if we need to express a parameter bounded on one or both sides, we transform it. This currently appears in two places: the Gamma distribution family is bounded to be > 0, which it achieves by applying an exp transformation to the unbounded raw parameter; and the Uniform distribution family is (in the standard form) bounded to (0,1), which it achieves with a logistic transformation. In both cases the appropriate jacobian term for the transformation is added to the density and packaged up with the transformed value into the RandomVariable returned by param.
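Both bounding transformations, with the log-jacobian terms that must be added to the density, look like this in sketch form (assumed function names, standard formulas):

```scala
object BoundedSketch {
  // (0,1) bound via the logistic transformation, as used by Uniform:
  def logistic(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // d logistic / dz = logistic(z) * (1 - logistic(z)), so:
  def logisticLogJacobian(z: Double): Double = {
    val s = logistic(z)
    math.log(s) + math.log(1 - s)
  }

  // > 0 bound via exp, as used by Gamma; d exp(z) / dz = exp(z), so the
  // log-jacobian is simply z:
  def expTransform(z: Double): Double = math.exp(z)
  def expLogJacobian(z: Double): Double = z
}
```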