# Some Recent Results on Minibatch Markov Chain Monte Carlo Methods

I have recently been working on minibatch Markov chain Monte Carlo (MCMC) methods for Bayesian posterior inference. In this post, I’d like to give a brief summary of what that means and mention two ICML papers (from 2011 and 2014) that have substantially influenced my thinking.

When we say we do “MCMC for Bayesian posterior inference,” what this typically means is that we have some dataset \(\{x_1,x_2,\ldots,x_N\}\) and a parameter of interest \(\theta \in \mathbb{R}^K\) for some \(K \ge 1\). The posterior distribution \(f(\theta)\) we wish to estimate (via sampling) is

\[f(\theta) = p(\theta \mid x_1, \ldots x_N) = \frac{p(\theta)\prod_{i=1}^Np(x_i \mid \theta)}{p(x_1, \ldots, x_N)} \propto p(\theta)\prod_{i=1}^Np(x_i \mid \theta)\]You’ll notice that we assume the data is conditionally independent given the parameter. In many cases, this is unrealistic, but most of the literature assumes this for simplicity.

After initializing \(\theta_0\), the procedure for sampling the posterior on step \(t+1\) is:

- Draw a candidate \(\theta'\)
- Compute the acceptance probability (note that the denominators of \(f\) cancel out):

- Draw \(u \sim {\rm Unif}[0,1]\), and accept if \(u < P_a\). This means setting \(\theta_{t+1} =
\theta'\). Otherwise, we set \(\theta_{t+1} = \theta_t\) and repeat this loop. This tripped me up
earlier. Just to be clear, we generate a sample \(\theta\)
*every*iteration, even if it is a repeat of the previous one.

This satisfies “detailed balance,” which roughly speaking, means that if one samples long enough, one will arrive at a stationary distribution matching the posterior, though a burn-in period and/or only using samples at regular intervals is often done in practice. The resulting collection of (correlated!) samples \(\{\theta_1, \theta_2, \ldots, \theta_T\}\) for large \(T\) can be used to compute the value of \(p(\theta \mid x_1,\ldots,x_N)\). Let’s consider a very simple example. Say \(\theta\) can take on three values: \(A\), \(B\), and \(C\). If our sampled set is \(\{A,B,B,C\}\), then since \(B\) appears two times out of four, we have \(p(\theta = B \mid x_1,\ldots,x_N) = 2/4\). The other two probabilities would naturally have \(1/4\) each according to the samples.

One issue with the standard procedure above is that in today’s big data world with \(N\) on the
order of billions, it is ridiculously expensive to compute \(P_a\) because that involves determining
all \(N\) of the likelihood factors. Remember, this has to be done *every iteration*! Doing all this
just to get one bit of data (whether to accept or not) is not a good tradeoff. Hence, there has been
substantial research on how to perform *minibatch* MCMC on large datasets. In this case, rather than
use all \(N\) data points, we just use a subset \(n \ll N\) of points each iteration. This
approximates the target distribution. The downside? It no longer satisfies detailed balance. (I
don’t know the details on why, and it probably involves some complicated convergence studies, but I
am willing to believe it.)

Just to be clear, we are focusing on getting a *distribution*, not a point estimate. That’s the
whole purpose of Bayesian estimation! A distribution means we need a full probability function that
sums to one and is non-negative; if \(\theta\) is one or two dimensional, we can easily plot the
posterior estimate (I provide an example of this later in this post). A point estimate means finding
*one* value of \(\theta^*\), usually the “best”, which we commonly express as the maximum likelihood
estimate \(\theta^* = \theta_{\rm MLE}\).

All right, now let’s briefly discuss two papers that tackle the problem of minibatch MCMC.

## Bayesian Learning via Stochastic Gradient Langevin Dynamics

This paper appeared in ICML 2011 and proposes using minibatch update methods for Bayesian posterior
inference, with a concept known as *Langevin Dynamics* to inject the correct amount of noise into
parameter updates so that the set of sampled parameters converges to the posterior, and not just a
mode. To make the distinction clear, let’s see how we can use minibatch updates to converge to a
*mode* — more specifically, the *maximum a posteriori* estimate. The function \(f\) we are trying
to optimize is listed above. So … we just use stochastic gradient ascent! (We are
*ascending*, not descending, because \(f\) is a posterior probability, and we want to make its
values higher.) This means the update rule is as follows: \(\theta_{t+1} = \theta_t + \alpha_t
\nabla f(\theta_t)\). Plugging in \(f\) above, we get

where \(\epsilon_t\) is a sequence of step sizes.

The above is a stochastic gradient ascent update because we use \(n \ll N\) terms to approximate the gradient value, which is why I inserted an approximation symbol (\(\approx\)) in the underbrace. Because we only use \(n\) terms, however, we must multiply the summation by \(N/n\) to rescale the value appropriately. Intuitively, all those terms in that summation are negative since they’re log probabilities. If we use \(n\) instead of \(N\) terms, that summation is strictly smaller in absolute value. So we must rescale to make the value on the same order of magnitude. Another way of putting it is that the subsampled sum will be an unbiased estimator of the full-batch quantity.

What’s the problem with the above for Bayesian posterior inference? *It doesn’t actually do Bayesian
posterior inference.* The above will mean \(\theta\) converges to a single value. We instead want a
*distribution*. So what can we do? We can use Langevin Dynamics, meaning that (for the full batch
case) our updates are:

A couple of things are worth noting:

- We use all \(N\) terms in the summation, so the gradient is in fact exact.
- The injected noise \(\eta_t\) means the \(\theta_{t}\) values will “bounce around” to approximate a distribution and not converge to a single point.
- The \(\epsilon\) is now constant instead of decreasing, and is balanced so that the variance of the injected noise matches that of the posterior.
- We use \(x_i\) instead of \(x_{ti}\) as we did earlier. The only difference is that the \(t\) indicates the randomness in the minibatch. It does not mean that \(x_{ti}\) is one scalar element of a vector. In other words, both \(x_i\) and \(x_{ti}\) are in \(\mathbb{R}^k\) for some \(k\).

For simplicity, the above assumes that \(\eta_t \in \mathbb{R}\). In the general case, these should be multivariate Gaussians, with covariance \(\epsilon I\).

The problem with this, of course, is the need to use all \(N\) points. So let’s use \(n \ll N\) points, and we have the following update:

\[\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^n \nabla \log p(x_{ti} \mid \theta_t)\right) + \eta_t\] \[\eta_t \sim \mathcal{N}(0, \epsilon_t)\]where now, we need \(\epsilon_t\) to vary and decrease towards zero. The reason for this is that as the step size goes to zero, the corresponding (expensive!) Metropolis-Hastings test has rejection rates that decrease to zero. Thus, we can omit the test to make the algorithm computationally cheap.

Fortunately, the authors show that despite how the noise goes to zero, the set of samples from SGLD will indeed converge to the correct posterior distribution, just as with “normal” (i.e., full-batch) Langevin Dynamics noise.

## Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget

This paper appeared in ICML 2014, and is also about minibatch MCMC. Here, instead of relying on simulating the physics of the system (as was the case with Stochastic Gradient Langevin Dynamics), they propose reformulating the standard MCMC method with the standard MH test into a sequential hypothesis test. To frame this, they take the log of both sides of the acceptance inequality:

\[\begin{align*} u < P_a &\iff u < \min \left\{1,\frac{p(\theta')\prod_{i=1}^Np(x_i \mid \theta')q(\theta_t\mid \theta')}{p(\theta_t)\prod_{i=1}^Np(x_i \mid \theta_t)q(\theta' \mid \theta_t)}\right\} \\ &\iff u\frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} < \frac{\prod_{i=1}^Np(x_i \mid \theta')}{\prod_{i=1}^Np(x_i \mid \theta_t)} \\ &\iff \log\left[u \frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} \right] < \log \left(\prod_{i=1}^Np(x_i \mid \theta')\right) - \log\left(\prod_{i=1}^Np(x_i \mid \theta_t)\right) \\ &\iff \frac{1}{N} \log\left[u \frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} \right] < \frac{1}{N}\sum_{i=1}^N (\log p(x_i \mid \theta') - \log p(x_i \mid \theta_t)) \end{align*}\]In the first step we also dropped the initial “min” because if the “1” case applies, we will always
accept. In the last step we divide both sides by \(N\). What is the purpose of this? The above is
equivalent to the original MH test. But the right hand side depends on all \(N\) data points, so
*what happens if we compute the right hand side using \(n \ll N\) points*?

This is the heart of their test. They start out by using a small fraction of the points and compute
the right hand side. If the proposed \(\theta'\) element is so out of whack, then even with just
\(n\) points, we should already know to reject it. (And a similar case holds if \(\theta'\) is
really *good*.) If we cannot tell whether to accept or reject \(\theta'\) with some specified
confidence threshold, then we increase the minibatch size and test again. Their acceptance test
relies on the Central Limit Theorem and the Student-t distribution. The details are in the paper,
but the main idea is straightforward: increasing the number of samples increases our certainty as to
whether we accept or reject, and we can generally make these decisions with far fewer than \(N\)
samples.

What’s the downside of their algorithm? In the worst case, we might need the *entire data* in one
iteration. This may or may not be a problem, depending on the particular circumstances.

Their philosophy runs deeper than what the above says. Here’s a key quote:

We advocate MCMC algorithms with a “bias-knob”, allowing one to dial down the bias at a rate the optimally balances error due to bias and variance.

One other algorithm that adheres to this strategy? Stochastic Gradient Langevin Dynamics! (Not coincidentally, Professor Max Welling is a co-author on both papers.) Side note: the reason why SGLD is biased is because it omits the Metropolis-Hastings test. The reason why this algorithm (Adaptive Sampling) is biased is because it makes decisions based on a fraction of the data. So both are biased, but for slightly different reasons.

## Putting it All Together: A Code Example

In order to better understand the two papers described above, I wrote some Python code to run SGLD and adaptive sampling. I also implemented the standard (full-batch) MCMC method for a baseline.

I tested using the experiment in Section 5.1 of the SGLD paper. The parameter is 2-D, \(\theta = (\theta_1,\theta_2)\), and the parameter/data generation process is

\[(\theta_1,\theta_2) \sim \mathcal{N}((0,0), {\rm diag}(\sigma_1^2, \sigma_2^2))\]and

\[x_i \sim 0.5 \cdot \mathcal{N}(\theta_1, \sigma_x^2) + 0.5 \cdot \mathcal{N}(\theta_1 + \theta_2, \sigma_x^2)\]I’ve pasted all my Python code here. If you put this together in a Jupyter notebook, it should work correctly. If your time is limited, feel free to skip the code and go straight to the output. The code is here mainly for the benefit of current and future researchers who might want a direct implementation of the above algorithms.

The first part of my code imports the necessary libraries and generates the data according to my interpretation of their problem. I am generating 500 points here, whereas the SGLD paper only used 100 points.

Next, I define a bunch of functions that I will need for the future. The most important one is the
`log_f`

function, which returns the log of the posterior. (Implementing this correctly requires
filling in some missing mathematical details that follow from the formulation of multivariate
Gaussian densities.)

Next, I run three methods to estimate the posterior: standard full-batch MCMC, Stochastic Gradient Langevin Dynamics, and the Adaptive Sampling method. Code comments clearly separate and indicate these methods in the following code section. Note the following:

- The standard MCMC and Adaptive Sampling methods use a random walk proposal.
- I used 10000 samples for the methods, except that for SGLD, I found that I needed to increase the number of samples (here it’s 30000) since the algorithm occasionally got stuck at a mode.
- The minibatch size for SGLD is 30, and for the Adaptive Sampling method, it starts at 10 (and can increase by 10 up to 500).

With the sampled \(\theta\) values in `all_1`

, `all_2`

, and `all_3`

, I can now plot
them using my favorite Python library, matplotlib. I also create a contour plot of what the log
posterior should really look like.

By the way, it was only recently that I found out I could put LaTeX directly in matplotlib text. That’s pretty cool!

Here are the results in a scatter plot where darker spots indicate more points:

It looks like all three methods roughly obtain the same posterior form.