Deep Learning and Importance Sampling Review
This semester, I am a Graduate Student Instructor for Berkeley’s Deep Learning class, now numbered CS 182/282A. I was last a GSI in fall 2016 for the same course, so I hope my teaching skills are not rusty. At least I am a GSI from the start, and not an “emergency appointment” like I was in fall 2016. I view my goal as helping Professor Canny stuff as much Deep Learning knowledge into the students as possible so that they can use the technology to be confident, go forth, and change the world!
All right, that was cheesy, and admittedly there is a bit too much hype. Nonetheless, Deep Learning has been a critical tool in a variety of my past and current research projects, so my investment in learning the technology over the last few years has paid off. I have read nearly the entire Deep Learning textbook, but for good measure, I want to officially finish digesting everything from the book. Thus, (most of) my next few blog posts will be technical, math-oriented posts that chronicle my final journey through the book. In addition, I will bring up related subjects that aren’t explicitly covered in the book, including possibly some research paper summaries.
Let’s start with a review of Chapter 17. It’s about Monte Carlo sampling, the general idea of using samples to approximate some value of interest. This is an extremely important paradigm, because in many cases sampling is the best (or even only) option we have. A common way that sampling arises in Deep Learning is when we use minibatches to approximate a full-data gradient. And even for that, the full data gradient is really one giant minibatch, as Goodfellow nicely pointed out on Quora.
More formally, assume we have some discrete, vector-valued random variable $\bf{x}$ and we want the following expectation:
\[s = \sum_{x} p(x)f(x) = \mathbb{E}_p[f(\bf{x})]\]where $x$ indicates the possible values (or “instantiations” or “realizations” or … you get the idea) of random variable $\bf{x}$. The expectation $\mathbb{E}$ is taken “under the distribution $p$” in my notation, where $p$ must clearly satisfy the definition of being a (discrete) probability distribution. This just means that $\bf{x}$ is sampled based on $p$.
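To make the notation concrete, here is a minimal NumPy sketch with a made-up four-outcome distribution $p$ and a made-up $f$ (both chosen purely for illustration), computing $s$ by direct enumeration:

```python
import numpy as np

# A made-up discrete distribution p over four outcomes, and a made-up f.
x_values = np.array([0, 1, 2, 3])   # possible realizations of the random variable
p = np.array([0.1, 0.2, 0.3, 0.4])  # p(x), sums to 1
f = lambda x: x ** 2                # any function of x

# Exact expectation: s = sum_x p(x) f(x).
s = np.sum(p * f(x_values))
print(s)  # 0.1*0 + 0.2*1 + 0.3*4 + 0.4*9 = 5.0
```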
This formulation is broad, and I like thinking in terms of examples. Let’s turn to reinforcement learning. The goal is to find some parameter $\theta^* \in \Theta$ that maximizes the objective function
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[R(\tau)]\]where $\tau$ is a trajectory induced by the agent’s policy $\pi_\theta$; that probability is $\pi_\theta(\tau) = p(s_1,a_1,\ldots,s_T,a_T)$, and $R(\tau) = \sum_{t=1}^T R(s_t,a_t)$. Here, the objective plays the role of $\mathbb{E}_p[f(\bf{x})]$ from earlier with the trajectory $\tau$ as the vector-valued random variable.
But how exactly would we compute $J(\theta)$? Doing it exactly would require enumerating every trajectory the environment emulator could possibly produce, weighting each one by its probability under $\pi_\theta$, and summing. The number of trajectories grows exponentially in the horizon (and is infinite with continuous states or actions), and this computation would be needed for every gradient update we perform on $\theta$, since the distribution of trajectories directly depends on $\pi_\theta$.
You can see why sampling is critical for us to make any headway.
(For background on this material, please consult my older post on policy gradients, and an even older post on the basics of Markov Decision Processes.)
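To preview where this is going, here is a hedged sketch of the sampling-based alternative for the RL objective: roll out a modest number of trajectories under the current policy and average their returns. The environment dynamics, reward, and Gaussian policy below are all toy placeholders I made up for illustration, not any particular benchmark or library:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, horizon=10):
    """Sample one trajectory from a toy 1D environment under a toy Gaussian
    policy with mean theta * state, and return its total reward R(tau)."""
    state, ret = 1.0, 0.0
    for _ in range(horizon):
        action = rng.normal(theta * state, 1.0)  # a_t ~ pi_theta(. | s_t)
        reward = -(action - 2.0) ** 2            # toy reward
        state = 0.9 * state + 0.1 * action       # toy dynamics
        ret += reward
    return ret

# Monte Carlo estimate of J(theta): average the return over n sampled
# trajectories instead of enumerating every possible trajectory.
theta, n = 0.5, 1000
J_hat = np.mean([rollout(theta) for _ in range(n)])
print(J_hat)
```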
The solution is to take a small set of samples \(\{x^{(1)}, \ldots, x^{(n)}\}\) from the distribution of interest $p$, to obtain our estimator
\[\hat{s}_n = \frac{1}{n} \sum_{i=1}^n f(x^{(i)})\]which is unbiased:
\[\mathbb{E}[\hat{s}_n] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[f(x^{(i)})] = \frac{1}{n}\sum_{i=1}^ns = s\]and converges almost surely to the expected value (by the strong law of large numbers), so long as mild assumptions hold on the samples, such as being i.i.d. with the relevant expectation finite.
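As a quick sanity check, here is a sketch reusing the toy discrete distribution from before (again, $p$ and $f$ are made up): draw $n$ samples from $p$, average $f$ over them, and watch the estimate approach the exact sum as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

x_values = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])
f = lambda x: x ** 2

s_exact = np.sum(p * f(x_values))  # s = 5.0

# Monte Carlo: s_hat_n = (1/n) sum_i f(x^(i)) with x^(i) ~ p.
for n in [10, 1000, 100000]:
    samples = rng.choice(x_values, size=n, p=p)
    s_hat = np.mean(f(samples))
    print(n, s_hat, s_exact)
```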
Now consider importance sampling. As the book nicely points out, when we use $p(x)f(x)$ to compute the expectation, the decomposition of the summand into $p(x)$ and $f(x)$ is not unique. Why? We can introduce a third function $q$:
\[p(x)f(x) = q(x)\frac{p(x)f(x)}{q(x)}\]and we can sample from $q$ and average $\frac{pf}{q}$ and get our importance sampling estimator:
\[\hat{s}_p = \frac{1}{n} \sum_{i=1,\bf{x}^{(i)}\sim p}^n f(x^{(i)}) \quad \Longrightarrow \quad \hat{s}_q = \frac{1}{n} \sum_{i=1,\bf{x}^{(i)}\sim q}^n \frac{p(x^{(i)})f(x^{(i)})}{q(x^{(i)})}\]where the samples are now drawn from $q$. (The $\hat{s}_p$ here is the same as $\hat{s}_n$ from earlier.) In importance sampling lingo, $q$ is often called the proposal distribution.
Think about what just happened. We are still computing the same quantity or sample estimator, and under expectation we still get $\mathbb{E}_q[\hat{s}_q] = s$. But we used a different distribution to get our actual samples. The whole $\bf{x}^{(i)}\sim p$ or $\bf{x}^{(i)}\sim q$ notation is used to control the set of samples that we get for approximating the expectation.
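Here is the same identity in a minimal sketch, again with made-up $p$, $q$, and $f$: we sample from the proposal $q$, but reweighting by $p(x)/q(x)$ keeps the estimator aimed at the same $s$:

```python
import numpy as np

rng = np.random.default_rng(0)

x_values = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])      # distribution we care about
q = np.array([0.25, 0.25, 0.25, 0.25])  # proposal distribution we actually sample from
f = lambda x: x ** 2

n = 100000
idx = rng.choice(len(x_values), size=n, p=q)  # x^(i) ~ q
x = x_values[idx]
weights = p[idx] / q[idx]                     # importance weights p(x) / q(x)
s_hat_q = np.mean(weights * f(x))             # importance sampling estimator
print(s_hat_q)  # close to the exact value 5.0
```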
We employ this technique primarily to (a) sample from “more interesting regions” and (b) reduce variance. For (a), the motivation is often a setup like the following:
We want to use Monte Carlo to compute $\mu = \mathbb{E}[X]$. There is an event $E$ such that $P(E)$ is small but $X$ is small outside of $E$. When we run the usual Monte Carlo algorithm the vast majority of our samples of $X$ will be outside $E$. But outside of $E$, $X$ is close to zero. Only rarely will we get a sample in $E$ where $X$ is not small.
where I’ve quoted this reference. I like this intuition – we need to find the more interesting regions via “overweighting” the sampling distribution there, and then we adjust the probability accordingly for our actual Monte Carlo estimate.
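A classic rare-event illustration of (a), sketched below under a setup I made up for this post: estimate $\mu = P(X > 4)$ for a standard normal. Plain Monte Carlo almost never lands in the event at a modest sample size, while a proposal shifted toward the tail gives a usable estimate from the same budget (the shifted-normal proposal is just one convenient choice, not the only one):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 10000

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Target: mu = P(X > 4) for X ~ N(0, 1), i.e. E_p[f(X)] with f the indicator of {x > 4}.
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))

# Plain Monte Carlo: at this sample size nearly every draw misses the event,
# so the estimate is usually exactly zero.
x_p = rng.normal(0.0, 1.0, size=n)
plain = np.mean(x_p > 4.0)

# Importance sampling: draw from a proposal q = N(4, 1) centered on the
# "interesting region", then reweight by p(x) / q(x).
x_q = rng.normal(4.0, 1.0, size=n)
w = normal_pdf(x_q, 0.0, 1.0) / normal_pdf(x_q, 4.0, 1.0)
is_est = np.mean(w * (x_q > 4.0))

print(exact, plain, is_est)
```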
For (b), given two unbiased estimators, all other things being equal, the better one is the one with lower variance. The variance of $\hat{s}_q$ is
\[{\rm Var}_{\bf{x}\sim q}(\hat{s}_q) = \frac{1}{n}{\rm Var}_{\bf{x}\sim q} \left(\frac{p(\bf{x}) f(\bf{x})}{q(\bf{x})}\right)\]The optimal choice inducing minimum variance is $q^*(x) \propto p(x)|f(x)|$, but this is not usually attainable in practice, so in some sense the task of importance sampling is to find a good sampling distribution $q$. For example, one heuristic I’ve seen is to pick a $q$ with “fatter tails” than $p$, so that we avoid cases where $q(x) \ll p(x)|f(x)|$, which makes the ratio $\frac{p(x)f(x)}{q(x)}$, and hence the variance, blow up. (I’m using absolute values around $f(x)$ since $p(x) \ge 0$.) That said, since we are sampling from $q$, points where $q(x)$ is very small should rarely appear in the first place, but anything can happen in high dimensions.
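To make the variance comparison concrete, here is a small sketch (same made-up discrete example as before) measuring the empirical variance of the ratio $\frac{p(x)f(x)}{q(x)}$ under a uniform proposal, the optimal $q^* \propto p(x)|f(x)|$, and a “thin-tailed” proposal that puts little mass where $p(x)f(x)$ is large:

```python
import numpy as np

rng = np.random.default_rng(0)

x_values = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.2, 0.3, 0.4])
f = lambda x: x ** 2

def ratio_variance(q, n=100000):
    """Empirical variance of p(x) f(x) / q(x) with x ~ q."""
    idx = rng.choice(len(x_values), size=n, p=q)
    ratios = p[idx] * f(x_values[idx]) / q[idx]
    return np.var(ratios)

q_uniform = np.array([0.25, 0.25, 0.25, 0.25])
q_star = p * np.abs(f(x_values))
q_star = q_star / q_star.sum()               # q*(x) proportional to p(x)|f(x)|
q_thin = np.array([0.70, 0.15, 0.10, 0.05])  # little mass where p(x)f(x) is large

print(ratio_variance(q_uniform))  # moderate variance
print(ratio_variance(q_star))     # (numerically) zero variance
print(ratio_variance(q_thin))     # much larger variance
```

With $q^*$ the ratio $\frac{p(x)f(x)}{q^*(x)}$ is constant across the sampled points, so its variance collapses to zero, while the thin-tailed proposal inflates it dramatically.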
In a subsequent post, I will discuss importance sampling in the context of some deep learning applications.