Some Recent Results on Minibatch Markov Chain Monte Carlo Methods
I have recently been working on minibatch Markov chain Monte Carlo (MCMC) methods for Bayesian posterior inference. In this post, I’d like to give a brief summary of what that means and mention two ICML papers (from 2011 and 2014) that have substantially influenced my thinking.
When we say we do “MCMC for Bayesian posterior inference,” what this typically means is that we have some dataset \(\{x_1,x_2,\ldots,x_N\}\) and a parameter of interest \(\theta \in \mathbb{R}^K\) for some \(K \ge 1\). The posterior distribution \(f(\theta)\) we wish to estimate (via sampling) is
\[f(\theta) = p(\theta \mid x_1, \ldots, x_N) = \frac{p(\theta)\prod_{i=1}^Np(x_i \mid \theta)}{p(x_1, \ldots, x_N)} \propto p(\theta)\prod_{i=1}^Np(x_i \mid \theta)\]You’ll notice that we assume the data is conditionally independent given the parameter. In many cases, this is unrealistic, but most of the literature assumes this for simplicity.
After initializing \(\theta_0\), the procedure for sampling the posterior on step \(t+1\) is:
- Draw a candidate \(\theta' \sim q(\theta' \mid \theta_t)\) from a proposal distribution \(q\)
- Compute the acceptance probability (note that the denominators of \(f\) cancel out): \[P_a = \min\left\{1, \frac{p(\theta')\prod_{i=1}^N p(x_i \mid \theta')\,q(\theta_t \mid \theta')}{p(\theta_t)\prod_{i=1}^N p(x_i \mid \theta_t)\,q(\theta' \mid \theta_t)}\right\}\]
- Draw \(u \sim {\rm Unif}[0,1]\), and accept if \(u < P_a\), which means setting \(\theta_{t+1} = \theta'\). Otherwise, set \(\theta_{t+1} = \theta_t\) and repeat the loop. This tripped me up earlier, so just to be clear: we record a sample \(\theta\) every iteration, even if it is a repeat of the previous one. (A minimal code sketch of this loop appears right after this list.)
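Here is a minimal Python sketch of that loop, assuming a symmetric proposal so the \(q\) terms cancel; `log_posterior` and `propose` are hypothetical stand-ins for model-specific code:

```python
import numpy as np

def metropolis_hastings(log_posterior, propose, theta0, T):
    """Minimal Metropolis-Hastings loop (sketch).

    log_posterior(theta) returns log p(theta) + sum_i log p(x_i | theta); the
    normalizing constant cancels in the ratio.  propose(theta) draws a
    candidate from a symmetric proposal, so the q terms cancel as well.
    """
    theta = theta0
    samples = []
    for _ in range(T):
        theta_prop = propose(theta)
        # Log acceptance probability: log min{1, f(theta') / f(theta)}.
        log_accept = min(0.0, log_posterior(theta_prop) - log_posterior(theta))
        if np.log(np.random.uniform()) < log_accept:
            theta = theta_prop
        # Record a sample every iteration, even if it repeats the last value.
        samples.append(theta)
    return np.array(samples)
```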
This procedure satisfies “detailed balance,” which, roughly speaking, guarantees that the chain’s stationary distribution is the posterior, so if one samples long enough the samples look like posterior draws; in practice a burn-in period and/or thinning (keeping only every \(k\)-th sample) is often used. The resulting collection of (correlated!) samples \(\{\theta_1, \theta_2, \ldots, \theta_T\}\) for large \(T\) can then be used to approximate \(p(\theta \mid x_1,\ldots,x_N)\). Let’s consider a very simple example. Say \(\theta\) can take on three values: \(A\), \(B\), and \(C\). If our sampled set is \(\{A,B,B,C\}\), then since \(B\) appears two times out of four, we estimate \(p(\theta = B \mid x_1,\ldots,x_N) = 2/4\), with the other two values getting \(1/4\) each.
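To make the counting explicit, here is the same toy computation in Python:

```python
from collections import Counter

samples = ["A", "B", "B", "C"]   # the toy sampled set from above
counts = Counter(samples)
estimate = {k: v / len(samples) for k, v in counts.items()}
print(estimate)   # {'A': 0.25, 'B': 0.5, 'C': 0.25}
```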
One issue with the standard procedure above is that in today’s big data world, with \(N\) on the order of billions, it is ridiculously expensive to compute \(P_a\), because that involves evaluating all \(N\) likelihood factors. Remember, this has to be done every iteration! Doing all that work just to get one bit of information (accept or reject) is not a good tradeoff. Hence, there has been substantial research on how to perform minibatch MCMC on large datasets, where rather than use all \(N\) data points, we use only a subset of \(n \ll N\) points each iteration. This approximates the target distribution. The downside? It no longer satisfies detailed balance: the minibatch acceptance decision is a noisy version of the exact test, so in general the chain’s stationary distribution is only an approximation to the posterior.
Just to be clear, we are focusing on getting a distribution, not a point estimate; that’s the whole purpose of Bayesian inference! A distribution means we need a full probability function that is non-negative and sums (or integrates) to one; if \(\theta\) is one or two dimensional, we can easily plot the posterior estimate (I provide an example of this later in this post). A point estimate means finding one “best” value \(\theta^*\), such as the maximum likelihood estimate \(\theta_{\rm MLE}\) or, in the Bayesian setting, the maximum a posteriori (MAP) estimate.
All right, now let’s briefly discuss two papers that tackle the problem of minibatch MCMC.
Bayesian Learning via Stochastic Gradient Langevin Dynamics
This paper appeared in ICML 2011 and proposes using minibatch update methods for Bayesian posterior inference, with a concept known as Langevin Dynamics to inject the correct amount of noise into parameter updates so that the set of sampled parameters converges to the posterior, and not just a mode. To make the distinction clear, let’s first see how we can use minibatch updates to converge to a mode, more specifically the maximum a posteriori (MAP) estimate. The function \(f\) we are trying to optimize is the posterior listed above. So … we just use stochastic gradient ascent on \(\log f\)! (We are ascending, not descending, because we want to make the posterior probability higher.) The update rule is \(\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \nabla \log f(\theta_t)\). Approximating \(\nabla \log f\) with a minibatch, we get
\[\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\underbrace{\left( \nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^n\nabla \log p(x_{ti} \mid \theta_t)\right)}_{\approx \nabla \log f(\theta_t)}\]where \(\epsilon_t\) is a sequence of step sizes.
The above is a stochastic gradient ascent update because we use only \(n \ll N\) terms to approximate the gradient, which is why I inserted an approximation symbol (\(\approx\)) in the underbrace. Because we only use \(n\) terms, however, we must multiply the summation by \(N/n\) to rescale it appropriately. Intuitively, the terms in that summation are typically negative (they are log probabilities), so a sum over \(n\) instead of \(N\) terms is smaller in magnitude, and we rescale to put it on the right order of magnitude. Another way of putting it is that, when the minibatch is drawn uniformly at random, the rescaled subsampled sum is an unbiased estimator of the full-batch sum.
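As a concrete sketch, a minibatch gradient-ascent loop for finding the MAP estimate might look like this; `grad_log_prior` and `grad_log_lik` are hypothetical stand-ins for model-specific gradients, and the hyperparameter values are just placeholders:

```python
import numpy as np

def sga_map_estimate(grad_log_prior, grad_log_lik, X, theta0,
                     n=30, T=10000, step=1e-4):
    """Stochastic gradient ascent on the log posterior; converges to a mode.

    grad_log_prior(theta) and grad_log_lik(x, theta) are model-specific
    stand-ins.  X has N rows; each step uses a minibatch of n rows.
    """
    N = X.shape[0]
    theta = theta0
    for _ in range(T):
        idx = np.random.choice(N, size=n, replace=False)
        # The N/n factor makes the minibatch sum an unbiased estimate of the
        # full-batch log-likelihood gradient.
        grad = grad_log_prior(theta) + (N / n) * sum(
            grad_log_lik(X[i], theta) for i in idx)
        theta = theta + (step / 2.0) * grad
    return theta
```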
What’s the problem with the above for Bayesian posterior inference? It doesn’t actually do Bayesian posterior inference. The above will mean \(\theta\) converges to a single value. We instead want a distribution. So what can we do? We can use Langevin Dynamics, meaning that (for the full batch case) our updates are:
\[\theta_{t+1} = \theta_t + \frac{\epsilon}{2}\underbrace{\left(\nabla \log p(\theta_t) + \sum_{i=1}^N \nabla \log p(x_i \mid \theta_t)\right)}_{\nabla \log f(\theta_t)} + \eta_t\] \[\eta_t \sim \mathcal{N}(0,\epsilon)\]A couple of things are worth noting:
- We use all \(N\) terms in the summation, so the gradient is in fact exact.
- The injected noise \(\eta_t\) means the \(\theta_{t}\) values will “bounce around” to approximate a distribution and not converge to a single point.
- The \(\epsilon\) is now a (small) constant instead of a decreasing sequence, and the variance of the injected noise is matched to the step size (both are \(\epsilon\)) so that, for small enough \(\epsilon\), the equilibrium distribution of these dynamics is the posterior.
- We use \(x_i\) instead of \(x_{ti}\) as we did earlier. The only difference is that the \(t\) indicates the randomness in the minibatch. It does not mean that \(x_{ti}\) is one scalar element of a vector. In other words, both \(x_i\) and \(x_{ti}\) are in \(\mathbb{R}^k\) for some \(k\).
For simplicity, the above assumes that \(\eta_t \in \mathbb{R}\). In the general case, these should be multivariate Gaussians, with covariance \(\epsilon I\).
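In code, a single full-batch Langevin update is short. Here is a sketch, with `grad_log_posterior` a hypothetical stand-in for the bracketed gradient above:

```python
import numpy as np

def langevin_step(theta, grad_log_posterior, eps):
    """One full-batch Langevin dynamics update (sketch).

    grad_log_posterior(theta) = grad log p(theta) + sum_i grad log p(x_i | theta),
    evaluated on all N points.  The injected noise has variance eps (standard
    deviation sqrt(eps)) in each coordinate, matching the step size.
    """
    noise = np.random.normal(0.0, np.sqrt(eps), size=np.shape(theta))
    return theta + (eps / 2.0) * grad_log_posterior(theta) + noise
```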
The problem with this, of course, is the need to use all \(N\) points. So let’s use \(n \ll N\) points, and we have the following update:
\[\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n} \sum_{i=1}^n \nabla \log p(x_{ti} \mid \theta_t)\right) + \eta_t\] \[\eta_t \sim \mathcal{N}(0, \epsilon_t)\]where now, we need \(\epsilon_t\) to vary and decrease towards zero. The reason for this is that as the step size goes to zero, the corresponding (expensive!) Metropolis-Hastings test has rejection rates that decrease to zero. Thus, we can omit the test to make the algorithm computationally cheap.
Fortunately, the authors show that even though the step size (and hence the injected noise) decays to zero, the set of samples from SGLD will indeed converge to the correct posterior distribution, just as with “normal” (i.e., full-batch) Langevin Dynamics.
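Putting the pieces together, here is a sketch of an SGLD sampling loop. The polynomial step-size schedule \(\epsilon_t = a(b+t)^{-\gamma}\) is one common choice satisfying the paper’s decay conditions, but the particular hyperparameter values here are assumptions, not the paper’s experimental settings:

```python
import numpy as np

def sgld(grad_log_prior, grad_log_lik, X, theta0, n=30, T=30000,
         a=0.01, b=1.0, gamma=0.55):
    """Stochastic Gradient Langevin Dynamics (sketch) with decaying step size."""
    N = X.shape[0]
    theta = theta0
    samples = []
    for t in range(T):
        eps_t = a * (b + t) ** (-gamma)   # step size decays toward zero
        idx = np.random.choice(N, size=n, replace=False)
        grad = grad_log_prior(theta) + (N / n) * sum(
            grad_log_lik(X[i], theta) for i in idx)
        noise = np.random.normal(0.0, np.sqrt(eps_t), size=np.shape(theta))
        theta = theta + (eps_t / 2.0) * grad + noise
        samples.append(theta)
    return np.array(samples)
```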
Austerity in MCMC Land: Cutting the Metropolis-Hastings Budget
This paper appeared in ICML 2014 and is also about minibatch MCMC. Here, instead of relying on simulating the physics of the system (as was the case with Stochastic Gradient Langevin Dynamics), they propose reformulating the standard Metropolis-Hastings acceptance test as a sequential hypothesis test. To frame this, they take the log of both sides of the acceptance inequality:
\[\begin{align*} u < P_a &\iff u < \min \left\{1,\frac{p(\theta')\prod_{i=1}^Np(x_i \mid \theta')q(\theta_t\mid \theta')}{p(\theta_t)\prod_{i=1}^Np(x_i \mid \theta_t)q(\theta' \mid \theta_t)}\right\} \\ &\iff u\frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} < \frac{\prod_{i=1}^Np(x_i \mid \theta')}{\prod_{i=1}^Np(x_i \mid \theta_t)} \\ &\iff \log\left[u \frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} \right] < \log \left(\prod_{i=1}^Np(x_i \mid \theta')\right) - \log\left(\prod_{i=1}^Np(x_i \mid \theta_t)\right) \\ &\iff \frac{1}{N} \log\left[u \frac{p(\theta_t)q(\theta' \mid \theta_t)}{p(\theta')q(\theta_t\mid \theta')} \right] < \frac{1}{N}\sum_{i=1}^N (\log p(x_i \mid \theta') - \log p(x_i \mid \theta_t)) \end{align*}\]In the first step we also dropped the initial “min” because if the “1” case applies, we will always accept. In the last step we divide both sides by \(N\). What is the purpose of this? The above is equivalent to the original MH test. But the right hand side depends on all \(N\) data points, so what happens if we compute the right hand side using \(n \ll N\) points?
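Before answering that, here is a small sketch that makes the two sides of the (still full-batch) test concrete. The helper functions are hypothetical: `log_prior(theta)` is \(\log p(\theta)\), `log_q(a, b)` is \(\log q(a \mid b)\), and `log_lik(x, theta)` is \(\log p(x \mid \theta)\):

```python
import numpy as np

def mh_test_statistics(u, theta_t, theta_prop, log_prior, log_q, log_lik, X):
    """Both sides of the log-transformed MH test (full-batch sketch).

    Accept theta_prop iff mu_0 < the mean of the l_i terms.
    """
    N = X.shape[0]
    # Left-hand side: (1/N) log[ u p(theta_t) q(theta'|theta_t)
    #                            / (p(theta') q(theta_t|theta')) ].
    mu_0 = (np.log(u)
            + log_prior(theta_t) + log_q(theta_prop, theta_t)
            - log_prior(theta_prop) - log_q(theta_t, theta_prop)) / N
    # Right-hand side: the average of l_i = log p(x_i|theta') - log p(x_i|theta_t).
    l = np.array([log_lik(x, theta_prop) - log_lik(x, theta_t) for x in X])
    return mu_0, l.mean()
```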
This is the heart of their test. They start by using a small fraction of the points to compute the right hand side. If the proposed \(\theta'\) is bad enough, then even with just \(n\) points we should already be able to tell that we will reject it (and a similar argument holds if \(\theta'\) is really good). If we cannot tell whether to accept or reject \(\theta'\) with some specified confidence, then we increase the minibatch size and test again. Their acceptance test relies on the Central Limit Theorem and the Student-t distribution. The details are in the paper, but the main idea is straightforward: increasing the number of samples increases our certainty as to whether we accept or reject, and we can usually make these decisions with far fewer than \(N\) points.
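A simplified sketch of that sequential test is below. It is not the paper’s exact algorithm (for instance, it omits the finite-population correction to the standard error), and `log_lik` is again a hypothetical per-point log likelihood:

```python
import numpy as np
from scipy import stats

def approx_mh_accept(mu_0, log_lik, X, theta_t, theta_prop,
                     n_start=10, n_inc=10, eps=0.05):
    """Sequential acceptance test in the spirit of the adaptive sampling paper."""
    N = X.shape[0]
    perm = np.random.permutation(N)
    n = n_start
    while True:
        batch = X[perm[:n]]
        l = np.array([log_lik(x, theta_prop) - log_lik(x, theta_t)
                      for x in batch])
        lbar, s = l.mean(), l.std(ddof=1)
        t_stat = (lbar - mu_0) / (s / np.sqrt(n))
        # How unsure are we about the sign of (mean of l_i) - mu_0?
        delta = 1.0 - stats.t.cdf(abs(t_stat), df=n - 1)
        if delta < eps or n >= N:
            return lbar > mu_0   # accept iff the minibatch mean exceeds mu_0
        n = min(N, n + n_inc)    # not confident enough: grow the minibatch
```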
What’s the downside of their algorithm? In the worst case, a single iteration might need the entire dataset. Whether that is a problem depends on the particular circumstances.
Their philosophy runs deeper than what the above says. Here’s a key quote:
We advocate MCMC algorithms with a “bias-knob”, allowing one to dial down the bias at a rate that optimally balances error due to bias and variance.
One other algorithm that adheres to this strategy? Stochastic Gradient Langevin Dynamics! (Not coincidentally, Professor Max Welling is a co-author on both papers.) Side note: the reason SGLD is biased is that it omits the Metropolis-Hastings test; the reason this paper’s Adaptive Sampling algorithm is biased is that it makes decisions based on a fraction of the data. So both are biased, but for slightly different reasons.
Putting it All Together: A Code Example
In order to better understand the two papers described above, I wrote some Python code to run SGLD and adaptive sampling. I also implemented the standard (full-batch) MCMC method for a baseline.
I tested using the experiment in Section 5.1 of the SGLD paper. The parameter is 2-D, \(\theta = (\theta_1,\theta_2)\), and the parameter/data generation process is
\[(\theta_1,\theta_2) \sim \mathcal{N}((0,0), {\rm diag}(\sigma_1^2, \sigma_2^2))\]and
\[x_i \sim 0.5 \cdot \mathcal{N}(\theta_1, \sigma_x^2) + 0.5 \cdot \mathcal{N}(\theta_1 + \theta_2, \sigma_x^2)\]I’ve pasted all my Python code here. If you put this together in a Jupyter notebook, it should work correctly. If your time is limited, feel free to skip the code and go straight to the output. The code is here mainly for the benefit of current and future researchers who might want a direct implementation of the above algorithms.
The first part of my code imports the necessary libraries and generates the data according to my interpretation of their problem. I am generating 500 points here, whereas the SGLD paper only used 100 points.
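A sketch of that data-generation step follows. The specific variances and the true \(\theta = (0, 1)\) are the values from the SGLD paper’s experiment as I understand them, so treat them as assumptions:

```python
import numpy as np

# Assumed hyperparameters (following the SGLD paper's Section 5.1 setup):
sigma1_sq, sigma2_sq, sigmax_sq = 10.0, 1.0, 2.0
theta1_true, theta2_true = 0.0, 1.0
N = 500   # this post uses 500 points; the paper used 100

np.random.seed(0)
# Each x_i comes from an equal mixture of N(theta_1, sigma_x^2) and
# N(theta_1 + theta_2, sigma_x^2).
components = np.random.binomial(1, 0.5, size=N)
means = np.where(components == 0, theta1_true, theta1_true + theta2_true)
X = np.random.normal(means, np.sqrt(sigmax_sq))
```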
Next, I define a bunch of functions that I will need later. The most important one is the `log_f` function, which returns the log of the posterior. (Implementing this correctly requires filling in some missing mathematical details that follow from the formulation of multivariate Gaussian densities.)
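Here is what a `log_f` for this model might look like; it reuses the constants from the data-generation sketch above and drops additive constants that do not depend on \(\theta\) (they cancel in MH ratios):

```python
import numpy as np

def log_f(theta, X):
    """Unnormalized log posterior for the Gaussian-mixture example (sketch)."""
    theta1, theta2 = theta
    # Diagonal Gaussian prior on (theta_1, theta_2), up to a constant.
    log_prior = -0.5 * (theta1 ** 2 / sigma1_sq + theta2 ** 2 / sigma2_sq)
    # Equal mixture of two Gaussians for each data point; the common
    # 1/sqrt(2*pi*sigma_x^2) factor is dropped as a constant.
    comp1 = np.exp(-0.5 * (X - theta1) ** 2 / sigmax_sq)
    comp2 = np.exp(-0.5 * (X - (theta1 + theta2)) ** 2 / sigmax_sq)
    log_lik = np.sum(np.log(0.5 * comp1 + 0.5 * comp2))
    return log_prior + log_lik
```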
Next, I run three methods to estimate the posterior: standard full-batch MCMC, Stochastic Gradient Langevin Dynamics, and the Adaptive Sampling method. Code comments clearly separate and indicate these methods in the following code section. Note the following (a sketch of the full-batch baseline also appears right after this list):
- The standard MCMC and Adaptive Sampling methods use a random walk proposal.
- I used 10000 samples for the methods, except that for SGLD, I found that I needed to increase the number of samples (here it’s 30000) since the algorithm occasionally got stuck at a mode.
- The minibatch size for SGLD is 30, and for the Adaptive Sampling method, it starts at 10 (and can increase by 10 up to 500).
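To show the structure of the full-batch baseline mentioned in the first bullet, here is a sketch of a random-walk Metropolis-Hastings loop built on the `log_f` and `X` sketches above; the proposal scale is an assumption, not necessarily the setting used for the plots below:

```python
import numpy as np

def random_walk_mh(X, T=10000, scale=0.2):
    """Full-batch MH with an isotropic Gaussian random-walk proposal (sketch)."""
    theta = np.array([0.0, 0.0])
    samples = np.zeros((T, 2))
    curr = log_f(theta, X)
    for t in range(T):
        prop_theta = theta + np.random.normal(0.0, scale, size=2)
        prop = log_f(prop_theta, X)
        # Symmetric proposal, so the acceptance ratio is just f(theta')/f(theta).
        if np.log(np.random.uniform()) < prop - curr:
            theta, curr = prop_theta, prop
        samples[t] = theta
    return samples

all_1 = random_walk_mh(X)   # e.g., the full-batch baseline samples
```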
With the sampled \(\theta\) values in `all_1`, `all_2`, and `all_3`, I can now plot them using my favorite Python library, matplotlib. I also create a contour plot of what the log posterior should really look like.
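A plotting sketch along those lines is below; the variable names `all_1`, `all_2`, and `all_3` (holding the samples from the three methods) follow the text, and the grid bounds are assumptions:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, samples, title in zip(axes[:3],
                              [all_1, all_2, all_3],
                              ["Standard MCMC", "SGLD", "Adaptive Sampling"]):
    ax.scatter(samples[:, 0], samples[:, 1], s=2, alpha=0.1, color="k")
    ax.set_title(title)
    ax.set_xlabel(r"$\theta_1$")
    ax.set_ylabel(r"$\theta_2$")

# Contour plot of the true log posterior on a grid.
g1, g2 = np.meshgrid(np.linspace(-1.5, 2.5, 200), np.linspace(-3.0, 3.0, 200))
logpost = np.array([[log_f(np.array([a, b]), X) for a, b in zip(r1, r2)]
                    for r1, r2 in zip(g1, g2)])
axes[3].contour(g1, g2, logpost, levels=20)
axes[3].set_title("Log posterior (contours)")
plt.tight_layout()
plt.show()
```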
By the way, it was only recently that I found out I could put LaTeX directly in matplotlib text. That’s pretty cool!
Here are the results in a scatter plot where darker spots indicate more points:
It looks like all three methods roughly obtain the same posterior form.