# One-Shot Visual Imitation Learning via Meta-Learning

A follow-up paper to the one I discussed in my previous post is One-Shot Visual Imitation Learning via Meta-Learning. The idea is, again, to train neural network parameters $\theta$ on a distribution of tasks such that the parameters are easy to fine-tune to new tasks sampled from the distribution. In this paper, the focus is on imitation learning from raw pixels and showing the effectiveness of a one-shot imitator on a physical PR2 robot.

Recall that the original MAML paper showed the algorithm applied to supervised regression (for sinusoids), supervised classification (for images), and reinforcement learning (for MuJoCo). This paper shows how to use MAML for imitation learning, and the extension is straightforward. First, each imitation task $\mathcal{T}_i \sim p(\mathcal{T})$ contains the following information:

• A trajectory $\tau = \{o_1,a_1,\ldots,o_T,a_T\} \sim \pi_i^*$ consists of a sequence of states and actions from an expert policy $\pi_i^*$. Remember, this is imitation learning, so we can assume an expert. Also, note that the expert policy is task-specific.

• A loss function $\mathcal{L}(a_{1:T},\hat{a}_{1:T}) \to \mathbb{R}$ providing feedback on how closely our actions match those of the expert’s.

Since the focus of the paper is on “one-shot” learning, we assume we only have one trajectory available for the “inner” gradient update portion of meta-training for each task $\mathcal{T}_i$. However, if you recall from MAML, we actually need at least one more trajectory for the “outer” gradient portion of meta-training, as we need to compute a “validation error” for each sampled task. This is not the overall meta-test time evaluation, which relies on an entirely new task sampled from the distribution (and which only needs one trajectory, not two or more). Yes, the terminology can be confusing. When I refer to “test time evaluation” I always refer to when we have trained $\theta$ and we are doing few-shot (or one-shot) learning on a new task that was not seen during training.

All the tasks in this paper use continuous control, so the loss function for optimizing our neural network policy $f_\theta$ can be described as:

where the first sum normally has one trajectory only, hence the “one-shot learning” terminology, but we can easily extend it to several sampled trajectories if our task distribution is very challenging. The overall objective is now:

and one can simply run Adam to update $\theta$.

This paper uses two new techniques for better performance: a two-headed architecture, and a bias transformation.

• Two-Headed Architecture. Let $y_t^{(j)}$ be the vector of post-activation values just before the last fully connected layer which maps to motor torques. The last layer has parameters $W$ and $b$, so the inner loss function $\mathcal{L}_{\mathcal{T}_i}(f_\theta)$ can be re-written as:

where, I suppose, we should write $\phi = (\theta, W, b)$ and re-define $\theta$ to be all the parameters used to compute $y_t^{(j)}$.

In this paper, the test-time single demonstration of the new task is normally provided as a sequence of observations (images) and actions. However, they also experiment with the more challenging case of removing the provided actions for that single test-time demonstration. They simply remove the action and use this inner loss function:

This is still a bit confusing to me. I’m not sure why this loss function leads to the desired outcome. It’s also a bit unclear how the two-headed architecture training works. After another read, maybe only the $W$ and $b$ are updated in the inner portion?

The two-headed architecture seems to be beneficial on the simulated pushing task, with performance improving by about 5-6 percentage points. That may not sound like a lot, but this was in simulation and they were able to test with 444 total trials.

The other confusing part is that if we assume we’re allowed to have access to expert actions, then the real-world experiment actually used the single-headed architecture, and not the two-headed one. So there wasn’t a benefit to the two-headed one assuming we have actions. Without actions, of course, the two-headed one is our only option.

• Bias Transformation. After a certain neural network layer (which in this paper is after the 2D spatial softmax applied after the convolutions to process the images), they concatenate this vector of parameters. They claim that

[…] the bias transformation increases the representational power of the gradient, without affecting the representation power of the network itself. In our experiments, we found this simple addition to the network made gradient-based meta-learning significantly more stable and effective.

However, the paper doesn’t seem to show too much benefit to using the bias transformation. A comparison is reported in the simulated reaching task, with a dimension of 10, but it could be argued that performance is similar without the bias transformation. For the two other experimental domains, I don’t think they reported with and without the bias transformation.

Furthermore, neural networks already have biases. So is there some particular advantage to having more biases packed in one layer, and furthermore, with that layer being the same spot where the robot configuration is concatenated with the processed image (like what people do with self-supervision)? I wish I understood. The math that they use to justify the gradient representation claim makes sense; I’m just missing a tiny step to figure out its practical significance.

They ran their setups on three experimental domains: simulated reaching, simulated pushing, and (drum roll please) real robotic tasks. For these domains, they seem to have tested up to 5.5K demonstrations for reaching and 8.5K for pushing. For the real robot, they used 1.3K demonstrations (ouch, I wonder how long that took!). The results certainly seem impressive, and I agree that this paper is a step towards generalist robots.

# Model-Agnostic Meta-Learning

One of the recent landmark papers in the area of meta-learning is MAML: Model-Agnostic Meta-Learning. The idea is simple yet surprisingly effective: train neural network parameters $\theta$ on a distribution of tasks so that, when faced with a new task, can be rapidly adjusted through just a few gradient steps. In this post, I’ll briefly go over the notation and problem formulation for MAML, and meta-learning more generally.

Here’s the notation and setup, mostly following the paper:

• The overall model $f_\theta$ is what MAML is optimizing, with parameters $\theta$. We denote $\theta_i'$ as weights that have been adapted to the $i$-th task through one or more gradient steps. Since MAML can be applied to classification, regression, reinforcement learning, and imitation learning (plus even more stuff!) we generically refer to $f_\theta$ as mapping from inputs $x_t$ to outputs $a_t$.

• A task $\mathcal{T}_i$ is defined as a tuple $(T_i, q_i, \mathcal{L}_{\mathcal{T}_i})$, where:

• $T_i$ is the time horizon. For (IID) supervised learning problems like classification, $T_i=1$. For reinforcement learning and imitation learning, it’s whatever the environment dictates.

• $q_i$ is the transition distribution, defining a prior over initial observations $q_i(x_1)$ and the transitions $q_i(x_{t+1}\mid x_{t},a_t)$. Again, we can generally ignore this for simple supervised learning. Also, for imitation learning, this reduces to the distribution over expert trajectories.

• $\mathcal{L}_{\mathcal{T}_i}$ is a loss function that maps the sequence of network inputs $x_{1:T}$ and outputs $a_{1:T}$ to a scalar value indicating the quality of the model. For supervised learning tasks, this is almost always the cross entropy or squared error loss.

• Tasks are drawn from some distribution $p(\mathcal{T})$. For example, we can have a distribution over the abstract concept of doing well at “block stacking tasks”. One task could be about stacking blue blocks. Another could be about stacking red blocks. Yet another could be stacking blocks that are numbered and need to be ordered consecutively. Clearly, the performance of meta-learning (or any alternative algorithm, for that matter) on optimizing $f_\theta$ depends on $p(\mathcal{T})$. The more diverse the distribution’s tasks, the harder it is for $f_\theta$ to quickly learn new tasks.

The MAML algorithm specifically finds a set of weights $\theta$ that are easily fine-tuned to new, held-out tasks (for testing) by optimizing the following:

This assumes that $\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$. It is also possible to do multiple gradient steps, not just one. Thus, if we do $K$-shot learning, then $\theta_i'$ is obtained via $K$ gradient updates based on the task. However, “one shot” is cooler than “few shot” and also easier to write, so we’ll stick with that.

Let’s look at the loss function above. We are optimizing over a sum of loss functions across several tasks. But we are evaluating the (outer-most) loss functions while assuming we made gradient updates to our weights $\theta$. What if the loss function were like this:

This means $f_\theta$ would be capable of learning how to perform well across all these tasks. But there’s no guarantee that this will work on held-out tasks, and generally speaking, unless the tasks are so closely related, it shouldn’t work. (I’ve tried doing some similar stuff in the past with the Atari 2600 benchmark where a “task” was “doing well on game X”, and got networks to optimize across several games, but generalization was not possible without fine-tuning.) Also, even if we were allowed to fine-tune, it’s very unlikely that one or few gradient steps would lead to solid performance. MAML should do better precisely because it optimizes $\theta$ so that it can adapt to new tasks with just a few gradient steps.

MAML is an effective algorithm for meta-learning, and one of its advantages over other algorithms such as ${\rm RL}^2$ is that it is parameter-efficient. The gradient updates above do not introduce extra parameters. Furthermore, the actual optimization over the full model $\theta$ is also done via SGD

again introducing no new parameters. (The update is actually Adam if we’re doing supervised learning, and TRPO if doing RL, but SGD is the foundation of those and it’s easier for me to write the math. Also, even though the updates may be complex, I think the inner part, where we have $f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}$, I think that is always vanilla SGD, but I could be wrong.)

I’d like to emphasize a key point: the above update mandates two instances of $\mathcal{L}_{\mathcal{T}_i}$. One of these — the one in the subscript to get $\theta_i'$ should involve the $K$ training instances from the task $\mathcal{T}_i$ (or more specifically, $q_i$). The outer-most loss function should be computed on testing instances, also from task $\mathcal{T}_i$. This is important because we want our ultimate evaluation to be done on testing instances.

Another important point is that we do not use those “testing instances” for evaluating meta-learning algorithms, as that would be cheating. For testing, one takes a held-out set of test tasks entirely, adjusts $\theta$ for however many steps are allowed (one in the case of one-shot learning, etc.) and then evaluates according to whatever metric is appropriate for the task distribution.

In a subsequent post, I will further investigate several MAML extensions.

# Zero-Shot Visual Imitation

In this post, i will further investigate one of the papers i discussed in an earlier blog post: Zero-Shot Visual Imitation (Pathak et al., 2018).

For notation, I denote states and actions at some time step $t$ as $s_t$ and $a_t$, respectively, if they were obtained through the agent exploring in the environment. A hat symbol, $\hat{s}_t$ or $\hat{a}_t$, refers to a prediction made from some machine learning model.

Basic forward (left) and inverse (right) model designs.

Recall the basic forward and inverse model structure (figure above). A forward model takes in a state-action pair and predicts the subsequent state $\hat{s}_{t+1}$. An inverse model takes in a current state $s_t$ and some goal state $s_g$, and must predict the action that will enable the agent go from $s_t$ to $s_t$.

• It’s easiest to view the goal input to the inverse model as either the very next state $s_{t+1}$, or the final desired goal of the trajectory, but some papers also use $s_g$ as an arbitrary checkpoint (Agrawal et al., 2016, Nair et al., 2017, Pathak et al., 2018). For the simplest model, it probably makes most sense to have $s_g = s_{t+1}$ but I will use $s_g$ to maintain generality. It’s true that $s_g$ may be “far” from $s_t$, but the inverse model can predict a sequence of actions if needed.

• If the states are images, these models tend to use convolutions to get a lower dimensional featurized state representation. For instance, inverse models often process the two input images through tied (i.e., shared) convolutional weights to obtain $\phi(s_t)$ and $\phi(s_{t+1})$, upon which they’re concatenated and then processed through some fully connected layers.

As I discussed earlier, there are a number of issues related to this basic forward/inverse model design, most notably about (a) the high dimensionality of the states, and (b) the multi-modality of the action space. To be clear on (b), there may be many (or no) action(s) that let the agent go from $s_t$ to $s_g$, and the number of possibilities increases with a longer time horizon, if $s_g$ is many states in the future.

Let’s understand how the model proposed in Zero-Shot Visual Imitation mitigates (b). Their inverse model takes in $s_g$ as an arbitrary checkpoint/goal state and must output a sequence of actions that allows the agent to arrive at $s_g$. To simplify the discussion, let’s suppose we’re only interested in predicting one step in the future, so $s_g = s_{t+1}$. Their predictive physics design is shown below.

The basic one-step model, assuming that our inverse model just needs to predict one action. The convolutional layers for the inverse model use the same tied network convolutional weights. The action loss is the cross-entropy loss (assuming discrete actions), and is not written in detail due to cumbersome notation.

The main novelty here is that our predicted action $\hat{a}_t$ from the inverse model is provided as input to the forward model, along with the current state $s_t$. We then try and obtain $s_{t+1}$, the actual state that was encountered during the agent’s exploration. This loss $\mathcal{L}(s_{t+1}, \hat{s}_{t+1})$ is the standard Euclidean distance and is added with the action prediction loss $\mathcal{L}(a_t,\hat{a}_t)$ which is the usual cross-entropy (for discrete actions).

Why is this extra loss function from the successor states used? It’s because we mostly don’t care which action we took, so long as it leads to the desired next state. Thus, we really want $\hat{s}_{t+1} \approx s_{t+1}$.

• There’s some subtlety with making this work. The state loss $\mathcal{L}(s_{t+1}, \hat{s}_{t+1})$ treats $s_{t+1}$ as ground truth, but that assumes we took action $a_t$ from state $s_t$. If we instead took $\hat{a}_t$ from $s_t$, and $\hat{a}_t \ne a_t$, then it seems like the ground-truth should no longer be $s_{t+1}$?

Assuming we’ve trained long enough, then I understand why this will work, because the inverse model will predict $\hat{a}_t = a_t$ most of the time, and hence the forward model loss makes sense. But one has to get to that point first. In short, the forward model training must assume that the given action will actually result in a transition from $s_t$ to $s_{t+1}$.

The authors appear to mitigate this with pre-training the inverse and forward models separately. Given ground truth data $\mathcal{D} = \{s_1,a_1,s_2,\ldots,s_N\}$, we can pre-train the forward model with this collected data (no action predictions) so that it is effective at understanding the effect of actions.

This would also enable better training of the inverse model, which (as the authors point out) depends on an accurate forward model to be able to check that the predicted action $\hat{a}_t$ has the desired effect in state-space. The inverse model itself can also be pre-trained entirely on the ground-truth data while ignoring $\mathcal{L}(s_{t+1}, \hat{s}_{t+1})$ from the training objective.

I think this is what the authors did, though I wish there were a few more details.

• A surprising aspect of the forward model is that it appears to predict the raw states $s_{t+1}$, which could be very high-dimensional. I’m surprised that this works, given that (Agrawal et al., 2016) explicitly avoided this by predicting lower-dimensional features. Perhaps it works, but I wish the network architecture was clear. My guess is that the forward model processes $s_t$ to be a lower dimensional vector $\psi(s_t)$, concatenates it with $\hat{a}_t$ from the inverse model, and then up-samples it to get the original image. Brandon Amos describes up-sampling in his excellent blog post. (Note: don’t call it “deconvolution.”)

Now how do we extend this for multi-step trajectories? The solution is simple: make the inverse model a recurrent neural network. That’s it. The model still predicts $\hat{a}_t$ and we use the same loss function (summing across time steps) and the same forward model. For the RNN, the convolutional layers $\phi$ take in the current state but they always take in $s_g$, the goal state. They also take in $h_{i-1}$ and $a_{i-1}$ the previous hidden unit and the previous action (not the predicted action, that would be a bit silly when we have ground truth).

The multi-step trajectory case, visualizing several steps out of many.

Thoughts:

• Why not make the forward model recurrent?

• Should we weigh shorter-term actions highly instead of summing everything equally as they appear to be doing?

• How do we actually decide the length of the action vector to predict? Or said in a better way, when do we decide that we’ve attained $s_g$?

Fortunately, the authors answer that last thought by training a deep neural network that can learn a stopping criterion. They say:

We sample states at random, and for every sampled state make positives of its temporal neighbors, and make negatives of the remaining states more distant than a certain margin. We optimize our goal classifier by cross-entropy loss.

So, states “close” to each other are positive samples, whereas “father” samples are negative. Sure, that makes sense. By distance I assume simple Euclidean distance on raw pixels? I’m generally skeptical of Euclidean distance but it might be necessary if the forward model also optimizes the same objective. I also assume this is applied after each time step, testing whether $s_i$ at time $i$ has reached $s_g$. Thus, it is not known ahead of time how many actions the RNN must be able to predict before the goal is reset.

An alternative is mentioned about treating stopping as an action. There’s some resemblance to this and DDO’s option termination criterion.

Additionally, we have this relevant comment on OpenReview:

The independent goal recognition network does not require any extra work concerning data or supervision. The data used to train the goal recognition network is the same as the data used to train the PSF. The only prior we are assuming is that nearby states to the randomly selected states are positive and far away are negative which is not domain specific. This prior provides supervision for obtaining positive and negative data points for training the goal classifier. Note that, no human supervision or any particular form of data is required in this self-supervised process.

Yes, this makes sense.

Now let’s discuss the experiments. The authors test several ablations of their model:

• An inverse model with no forward model at all (Nair et al., 2017). This is different from their earlier paper which used a forward model for regularization purposes (Agrawal et al., 2016). The model in (Nair et al., 2017) just used the inverse model for predicting an action given current image $I_t$ and (critically!) a goal image $I_{t+1}'$ specified by a human.

• A more sophisticated inverse model with an RNN, but no forward model. Think of my most recent hand-drawn figure above, except without the forward portion. Furthermore, this baseline also does not use the action $a_i$ as input to the RNN structure.

• An even more sophisticated model where the action history is now input to the RNN. Otherwise, it is the same as the one I just described above.

Thus, all three of their ablations do not use the forward consistency model and are solely trained by minimizing $\mathcal{L}(a_t,\hat{a}_t)$. I suppose this is reasonable, and to be fair, testing these out in physical trials takes a while. (Training should be less cumbersome because data collection is the bottleneck. Once they have data, they can train all of their ablations quickly.) Finally, note that all these inverse models take $(s_t,s_g)$ as input, and $s_g$ is not necessarily $s_{t+1}$. This, I remember from the greedy planner in (Agrawal et al., 2016).

The experiments are: navigating a short mobile robot throughout rooms and performing rope manipulation with the same setup from (Nair et al., 2017).

• Indoor navigation. They show the model an image of the target goal, and check if the robot can use it to arrive there. This obviously works best when few actions are needed; otherwise, waypoints are necessary. However, for results to be interesting enough, the target image should not have any overlap with the starting image.

The actions are: (1) forward 10cm, (2) turn left, (3) turn right, and (4) standing still. They use several “tricks” such as using action repeats, applying a reset maneuver, etc. A ResNet acts as the image processing pipeline, and then (I assume) the ResNet output is fed into the RNN along with the hidden layer and action vector.

Indeed, it seems like their navigating robot can reach goal states and is better than the baselines! They claim their robot learns first to turn and then to move to the target. To make results more impressive, they tested all this on a different floor from where the training data was collected. Nice! The main downside is that they conducted only eight trials for each method, which might not be enough to be entirely convincing.

Another set of experiments tests imitation learning, where the goal images are far away from the robot, thus mandating a series of checkpoint images specified by a human. Every fifth image in a human demonstration was provided as a waypoint. (Note: this doesn’t mean the robot will take exactly five steps for each waypoint even if it was well trained, because it may take four or six or some other number of actions before it deems itself close enough to the target.) Unfortunately, I have a similar complaint as earlier: I wish there were more than just three trials.

• Rope manipulation. They claim almost a 2x performance boost over (Nair et al., 2017) while using the same training data of 60K-70K interaction pairs. That’s the benefit of building upon prior work. They surprisingly never say how many trials they have, and their table reports only a “bootstrapped standard deviation”. Looking at (Nair et al., 2017), I cannot find where the 35.8% figure comes from (I see 38% in that paper but that’s not 35.8%…).

According to OpenReview comments they also trained the model from (Agrawal et al., 2016) and claim 44% accuracy. This needs to be in the final version of the paper. The difference from (Nair et al., 2017) is that (Agrawal et al., 2016) jointly train a forward model (but not to enforce dynamics but just as a regularizer), while (Nair et al., 2017) do not have any forward model.

Despite the lack of detail in some areas of the paper, (where’s the appendix?!?) I certainly enjoyed reading it and would like to try out some of this stuff.

# A Critical Comparison of Three Half Marathons I Have Run

I have now run in three half marathons: the Berkeley Half Marathon (November 2017), the Kaiser Permanente San Francisco Half Marathon (February 2018), and the Oakland Half Marathon (March 2018).

To be clear, the Kaiser Permanente San Francisco half marathon is not the same as a separate set of San Francisco races in the summers. The Oakland Half Marathon is also technically the “Kaiser Permanente […]” but since there’s only one main set of Oakland races a year — known as the “Running Festival” — we can be more lenient in our naming convention.

All these races are popular, and the routes are relatively flat and therefore great for setting PRs. I would be happy to run any of these again. In fact, I’ll probably will, for all three!

In this post, I’ll provide some brief comments on each of the races. Note that:

• When I list registration fees, it’s not always a clear-cut comparison since prices jack up closer to race day. I think I managed to get an “early bird” deal for all these races, so hopefully the prices are somewhat comparable. Also, I include taxes in the fee I list.

• By “packet pickup” I refer to when runners pick up whatever racing material is needed (typically a timing chip, bib, sometimes gear as well) a day or two before the actual race. These pickup events also involve some deals for food and running equipment from race sponsors. Below is a picture that I took of the Oakland package pickup:

• While I list “pros” and “cons” of the races, most are minor in the grand scheme of things, and this review is for those who might be picky. I reiterate that I will probably run in all of these again the next time around.

OK, let’s get started!

## Berkeley Half Marathon

• Website: here.
• Price I paid: about $100, including a$10 bib shipping fee.

Pros:

• The race has a great “local feel” to it, with lots of Berkeley students and residents both running in the race or cheering us as spectators. I saw a number of people that I knew, mostly other student runners, and it was nice to say hi to them. There was also a cool drumming band which played while we were entering the portion of the race close to the San Francisco Bay.

• The course is mostly flat, and enters a few Berkeley neighborhoods (again, a great local feel to it). There’s also a relatively straight section at the roughly 8-11 mile range by the San Francisco Bay and which lets you see the runners ahead of you when you’re entering the portion (for extra motivation). As I discussed two years ago, I regularly run by this area so I was used to the view, but I can see it being attractive for those who don’t use the same routes.

• There are lots of pacers, for half-marathon finish times of 1:27, 1:35 (2x), 1:45, 1:55, etc.

• The post-race food sampling selection was fantastic! There were the obligatory water bottles and bananas, but I also had tasty Power Crunch protein bars, Muscle Milk (this is clearly bad for you, but never mind), pretzels, cookies, coffee, etc. There was also beer, but I didn’t have any.

• Post-race deals are excellent. I used them to order some Power Crunch bars at a discount.

• The packet pickup had some decent free food samples. The race shirt is interesting — it’s a different style from prior years and feels somewhat odd but I surprisingly like it, and I’ll be wearing it both to school and for when I run in my own time.

Cons:

• There’s a $10 bib mailing fee, and I realize now that it’s pointless to pay for it because we also have to pick up a timing chip during packet pickup, and that’s when we could have gotten the bibs. Thus, there seems to be no advantage to paying for the bib to be mailed. Furthermore, I wish the timing chip were attached to the bib; we had to tie it within our shoelaces. I think it’s far easier to stick it on the bib. • The starting location is a bit awkwardly placed in the center of the city, though to be fair, I’m not sure of a better spot. Certainly it’s less convenient for drop-offs and Uber rides compared to, say, Golden Gate Park. • There were seven water stops, one of which had electrolytes and GU energy chews. (Unfortunately, when running, I actually dropped two out of the four GU chews I was given … please use the longer, thinner packages that the Oakland race uses!!) The other two races offered richer goodies at the aid stations so next time, I’ll bring my own energy stuff. • It was the most expensive of the races I’ve run in, though the difference isn’t that much, especially if you avoid making the mistake of getting your bib mailed to you. • The photography selection after the race is excellent, but it’s expensive and most of it is concentrated near the end of the race when it’s crowded, so most pictures weren’t that interesting. ## Kaiser Permanente San Francisco Half Marathon • Website: here. • Price I paid: about$80.

Upsides:

• The race route is great! I enjoyed running through Golden Gate Park and seeing the Japanese Tea Garden, the California Academy of Sciences, and so on. There’s also a very long, straight section in the second half of the race (longer than Berkeley’s!) by the ocean where you can again see the runners ahead of you on their way back.

• There’s a great selection of post-race sampling, arguably on par with Berkeley though there’s no beer. There were water bottles and bananas, along with CLIF Whey protein bars, Ocho candy, some coffee/caffeine-base drinks, etc.

• The price is the cheapest of the three, which is surprising since I figured things in San Francisco would be more expensive. I suspect it has to do with much of the race being in Golden Gate Park, and the course is set so that there isn’t a need to close many roads. On a related note, it’s also easy to drop off and pick up racers.

• You have to finish the race to get your shirt. Of course this is minor, but I believe it’s not a good idea to wear the official race shirt on race day. Incidentally, there’s no package pickup, which means we don’t get free samples or deals, but it’s probably better for me since I would have had to Uber a long distance to and back. You get the bib and timing chip mailed in advance, and the timing chip is (thankfully) attached to the bib.

Downsides:

• No pacers. I don’t normally try to stick to a pacer during my races, but I think they’re useful.

• While there was a great selection of post-race food sampling, there was no beer offered, in contrast to the Berkeley and Oakland races.

• With regards to post-race photographs, my comments on this are basically identical to those of the Berkeley race.

• All the aid stations had electrolytes (I think Nuun) in addition to water. It was a bit unclear to me which cups corresponded to what beverage, though in retrospect I should have realized that the “blank” cups had water and cups with a lightning sign on them had the electrolytes. The drinks situation is better than the Berkeley race, but the downside is that there were no GU energy chews, so perhaps it’s a wash with respect to the aid stations?

• It felt like there were fewer people cheering us on when we raced, particularly compared to the Berkeley race.

• I don’t think there were as many post-race discount deals. I was hoping that there were some deals for the CLIF whey protein bars, which would have been the analogue of the Power Crunch discount for the Berkeley race. The discount deals also lasted only a week, compared to two months for Berkeley’s post-race stuff.

## Oakland Running Festival Half Marathon

• Website: here.

## Group 4: Biographies and Memoirs

I am reading biographies of famous people because I want to be famous someday. My aim is to be famous for a good reason, e.g., developing technology that benefits large swaths of humanity. (It is obviously easier to become famous for a bad reason than a good reason.)

• ** Alan Turing: The Enigma ** is the definitive biography of Alan Turing, quite possibly the best computer scientist of all time. The book was written in 1983 by Andrew Hodges, a British mathematics tutor at the University of Oxford (now retired). I discussed this in a separate blog post so I will not repeat the details here.

• ** My Beloved World ** is Supreme Court Justice Sonia Sotomayor’s memoir, published in 2013. It’s written from the first person perspective and outlines her life from starting in South Bronx and moving up to her appointment as a judge to US District Court, Southern District of New York. It — unfortunately — doesn’t talk much about her experiences after that, getting appointed to the United States Court of Appeals for the Second Circuit in 1998, and of course, her time on the nation’s highest court starting in August 2009. She had a father who struggled with alcoholism and died when she was nine years old, and didn’t appear to be a good student until she was in fifth grade, when she started to obsess over getting “gold stars.” (I can attest to a similar experience over obsessing to get “gold star-like” objects when I was younger.) She then, as we all know, did well in high school and entered Princeton as one of the first incoming batch of women students and Hispanic students, graduating with stellar academic credentials in 1976 and then going on to Yale Law School where she graduated in 1979. The book describes her experiences in vivid terms, and I liked following through her footsteps. I feel and share her pain at not knowing “secrets” that the rich and privileged students had when I was an undergrad (I was clueless about how finance and investment banking jobs worked, and I’m still clueless today.) Overall, I enjoyed the book. It’s brilliantly written, with an engrossing, powerful story. I will be reminded of her key attribute of persistence and determination and focus which she says were key. I’m trying to pursue the same skills myself. While I understand the low likelihood of landing such tiny jobs (e.g., the tech equivalent of a Supreme Court Justice) I do try and think big and that’s what motivates me a lot. I read this book on a day trip where I was sitting in a car passenger seat, and I sometimes dozed off and imagined myself naming various hypothetical Supreme Court Justices.

## Group 5: Conservative Politics and Thoughts

Well, this will be interesting. I’m not a registered Republican, though I possess a surprisingly large amount conservative beliefs, some of which I’m not brave enough to blog about (for obvious reasons). In addition, I believe it is important to understand people’s beliefs across the political spectrum, though for this purpose I exclude the extreme far left (e.g., hardcore Communists) and right (e.g., the fascists and the Ku Klux Klan).

• ** Please Stop Helping Us: How Liberals Make it Harder for Blacks to Succeed ** a 2014 book written by Wall Street Journal columnist Jason Riley. It’s no secret that (a) most blacks tend to be liberal, I would guess due to the liberals getting the civil rights movement correct in the 1960s, and (b) blacks tend to have more credibility when criticizing blacks compared to whites. Riley, as a black conservative, can get away with roundly criticizing blacks in a way that I wouldn’t do since I do not want to be perceived as a racist. In Please Stop Helping Us, Riley “eviscerates nonsense” as described by his hero, Thomas Sowell, criticizing concepts such as the minimum wage, unions, young black culture, and affirmative action policies, among other things, for the decline in black prosperity. His chief claim is that liberals, while having good intentions, have not managed to achieve their desired results with respect to the black population. He also laments that young blacks tend to watch too much TV, engage in hip-hop culture, and the like. One of his stories that stuck with me was when a young (black) relative asked him: “why are you so white”, when all Riley did was speak proper English and tuck in his shirt. Indeed, variants of this story are common complaints that I’ve seen and heard about from black students and professionals across the political spectrum. I don’t agree with Riley on everything. For instance, Riley tends to ignore or explain away issues regarding racism as it relates to the lack of opportunities for job promotions or advancement, or when blacks are penalized more relative to others for a given crime. On the other hand, we agree on affirmative action, which he roundly criticizes, pointing out that no one wants to be the token “diversity hire”. To his credit, he additionally mentions that Asians are hurt the most from affirmative action, as I pointed out in an earlier blog post, making it a dubious policy when it come to advancing racial equality. In the end, this book is a thought-provoking piece about race. My impression is that Riley genuinely wants to see more blacks succeed in America (as I do), but he is disappointed that the major civil rights battles were all won decades ago, and nowadays current policies do not have the same positive impact factor.

• ** The Conservative Heart: How to Build a Fairer, Happier, and More Prosperous America **, is a 2015 book by Arthur Brooks, the president of the American Enterprise Institute, officially a nonpartisan think tank but widely regarded (both inside and outside the organization) as a place for conservative public policy development and analysis. Brooks argues that today’s conservatives, while they have most of the technical arguments right (e.g., on the benefits of free enterprise), lack the “moral high ground” that liberals have. Brooks cites statistics showing that conservatives are seen as less compassionate and less caring than liberals. He argues that conservatives can’t be about being anti-everything: government, minimum wage increases, food stamps, etc. Instead, they have to show that they care about people. They need to emphasize an equal starting line for which people can flourish, which contrasts with the common liberal perspective of making the end product equal (by income redistribution or proportional racial representation). One key point Brooks emphasizes is the need for work fulfillment and purpose instead of lying around while collecting checks from the American welfare state. I liked this book and found it engaging and accessible. It is, Brooks says, a book for a wide range of people, including “open-minded liberals” who wish to understand the conservative perspective. I have two major issues with his book, though. The first is that while he correctly points out the uneven recovery and the lack of progress on fixing poverty, he fails to mention the technological forces that have created these uneven conditions (see my technology, economics, and business related books above), much of which is outside the control of any presidential administration or Congress. The second is that I think he’s been proved wrong on a lot of things. President Donald Trump is virtually none of the stuff that a conservative “heart” would suggest and, well, he was elected President (after this book was published, to boot). I wish President Trump would start following Brooks’ suggestions.

• Conscience of a Conservative: A Rejection of Destructive Politics and a Return to Principle is a brief 2017 book/manifesto by U.S. Senator Jeff Flake of Arizona. Flake is well known for being one of those “Never Trump” style of Republicans since he remains true to certain Republican principles that have arguably fallen out of favor with the recent populist surge of Trump-ian Republicanism in 2016, such as free trade and limited government spending. And yes, I don’t think Republicans can claim to be the party of fiscal prudence nowadays, since Trump is decidedly not a limited spending conservative. In this book, Senator Flake argues that Republicans have to get back to true, Conservative principles and can’t allow populism and immaturity to define their party. He laments at the lack of bipartisanship in Congress, and while he makes it clear that both parties are to blame, in this book he mostly aims at Republicans. This explains why so many Republicans, including Barry Goldwater’s relatives, dislike this book. (Barry Goldwater wrote a book of the same title, “Conscience of a Conservative”, from which Jeff Flake borrowed the title.) I sort of liked this book but didn’t really like it. It still fails to address the notion of how the parties have fallen apart, and he (like everyone else) preaches bipartisanship without proposing clear solutions. Honestly, I think the main reason I read it was not that I think Flake has all the solutions, but that I sometimes think of myself in Congress in my fantasies. Thus, I jumped at the chance to read a book directly from a Congressman, and particularly a book like this where Flake bravely didn’t have his staff revise it to make it more “politically palatable.” It’s a bit raw and lacks the polish of super-skilled writers, but we shouldn’t hold Senators to such a high writing standard so it’s fine with me. It’s unfortunate that Flake isn’t going to seek re-election next year.

## Group 6: Self-Help and Personal Development

I’m reading these “personal development” books because, well, I want to be a far more effective person than I am right now. “Effectiveness” can mean a lot of things. I define it as being vastly more productive in (a) Artificial Intelligence research and (b) my social life.

• ** How to Win Friends and Influence People: The Only Book You Need to Lead You to Success ** is Dale Carnegie’s famous book based on his human interaction courses. It was originally published in 1936, during the depths of the Great Depression, making this book by far the oldest one I’ve read this year. I will not go into too much depth about it since I wrote a summary in an earlier blog post. The good news is that 2017 has been a much better year for me socially, and the book might have helped. I look forward to continuing the upward trend in 2018, and to read other Dale Carnegie books.

• ** The 7 Habits of Highly Effective People: Powerful Lessons in Personal Change **, written by Stephen R. Covey in 1989, is widely considered to be the “successor” to Dale Carnegie’s classic book (see above summary). In The 7 Habits, Covey argues that the habits are based on well-timed principles and thus do not noticeably vary across different religious groups, ethnic groups, and so forth. They are: “Be Proactive”, “Begin With the End in Mind”, “Put First Things First”, “Think Win-Win”, “Seek First to Understand, Then to be Understood”, “Synergize”, and “Sharpen the Saw”. You can find their details on the Wikipedia page so I won’t repeat the points here, but I will say that the one which really touches upon me is “Think Win-Win”. In general, I am always trying to make more friends, and I’d like these to be win-win. My strategy, which aligns with Covey’s (great minds think alike!), is to start a relationship by doing more work than the other person or letting the other person benefit more. Specifically, this means that I will be happy to (a) take the initiative in setting meeting times and any necessary reservations, (b) drive or travel farther distances, (c) let the other person choose the activity, and so forth. At some point, however, the relationship needs to be reciprocal. Indeed, I often tell people, subtly or not so subtly, that the true test of friendship is if friends are willing to do things for you just as much as you do to them. With respect to the six other principles, there isn’t much to disagree. There is striking similarity to Cal Newport’s Deep Work when Covey discusses high-impact, Quadrant II activities. Possibly my main disagreement with the book is that Covey argues how these principles come (to some extent) from religion and God. As an atheist, I do not buy this rationale, but I still agree with the principles themselves and I am trying to follow them as much as I can. This book has earned a place on my desk along with Dale Carnegie’s classic, and I will always remember it because I want to be a highly effective person.

## Group 7: Psychology and Human Relationships

These books are about psychology, broadly speaking, which I suppose can include “human relationships”. I thoroughly enjoyed reading all four of these books.

• ** To Sell Is Human: The Surprising Truth About Moving Others ** is a 2012 book by best-selling author Daniel Pink. He argues that we should stop focusing on outdated views of salespeople. That is, that they are slimy, conniving, attempting to rip of us off, etc. Today, one in nine work in “sales” but Pink’s chief message to the reader is that the other eight of nine are also in sales. We try to influence people all the time. So in some sense this is obvious. I mean, come on, if I’m aiming to get a girlfriend, then I’m trying to influence her based on my positive qualities.11 For academics, we sell our work (i.e., research) all the time. That’s what Pink means when he says “everyone is working in sales.” He argues that nowadays, the barriers have fallen (he almost says “The World is Flat” a la Thomas L. Friedman) and that salespeople are no longer people who walk door by door to ask people to buy things. That’s outdated. One possible negative aspect of the book is that I don’t think we need this much “proof” that we’re all salespeople. Yes, some people think only in terms of that job, but all you have to do is say: “hey everyone is a salesperson, if you try to become friends with someone, that counts…” and if you tell that to people, all of a sudden they’ll get it and I don’t think belaboring the point is necessary. On the positive side, the book contains several case studies and lists of things to do, so that I can think of these and reread the book in case I want to apply them in my life. Indeed, as I was reading this book, I was also thinking of ways I could convince someone to become friends with me.

• ** Originals: How Non-Conformists Move the World ** is a recent 2016 book by famous Wharton professor Adam Grant, also known as the author of Give and Take. I’ve been aware of Grant for some time, in part because he’s been featured in Cal Newport’s writing as someone who engages in the virtues of Deep Work (see an excerpt here). Yeah, he’s really productive, finishing a PhD in less than three years12 and then becoming the youngest tenured professor at his university. But what is this book about, anyway? In Originals, Grant argues that people who “buck the trend” are often ones who can make a difference for the better. As I anticipated ahead of time, Martin Luther King Jr is in the book, but not for all the reasons I thought. One of them — why procrastination might actually have been helpful (i.e., first mover disadvantage) for him when he was crafting his “I Have a Dream” speech, though one was more realistic: focusing on the victims of crimes (blacks facing discrimination) rather than criticizing the perpetrators. Another nice tidbit from Grant was making sure to emphasize the downsides of your work rather than the positives to venture capitalists, as that will help you look more sincere. Other stuff in this book include how to foster a correct sense of dissent in a company (e.g., Bridgewater Associates is unique in this regard because people freely criticize the billionaire founder Ray Dalio). I certainly felt like some of this was cherry-picking, which admittedly is unavoidable, but this book seems to pursue that more than others. Nonetheless, a lot of the advice seems reasonable and I hope to apply it in my life.

## Group 8: Miscellaneous

These books, in my opinion, don’t neatly align in one of the earlier groups.

• ** The Signal and the Noise: Why So Many Predictions Fail – But Some Don’t ** is Nate Silver’s 2012 book where he urges us to consider various issues that might be adversely affecting the quality of predictions. They range from the obvious, such as political biases which affect our assessment of political pundits (known as “hedgehogs” in his book), and perhaps less obvious things such as a bug in the Deep Blue chess program which nonetheless grandmaster Gary Kasparov took to meaning that Deep Blue could “predict twenty moves into advance.” I really enjoyed this book. The examples are far ranging: how to detect terrorist attacks (a major difficulty but one with enormous political importance) to playing poker (Silver’s previous main source of income), to uncertainties involving global warming models (always important to consider), and to the stock market (this one is hardest for me to understand given my lack of background knowledge on the stock market, but I am learning and working to rectify this!). The one issue I have is that Silver seems to just assume: hey let’s apply Bayes’ rule to fix everything, so that we have a prior of $X$, and we assume the probability of $Y$ … and therein lies the problem. In real settings we rarely get those $X$ and $Y$ values to a high degree of accuracy. But I have no issue with the general idea of revising predictions and using Bayes’ rule. I encourage you to see a related critique in The New Yorker. The reality, by the way, is that most current professional statisticians likely employ a mix of Frequentist and Bayesian statistics. For a more technical overview, check out Professor Michael I. Jordan’s talk on Are You A Bayesian or a Frequentist?.

• ** The Soul of an Octopus: A Surprising Exploration into the Wonder of Consciousness ** is a splendid 2015 book by author Sy Montgomery, who has written numerous biology-related books about animals. I would call this entirely a popular science book; it’s more like a combination of the author discovering octopuses and describing her own experience visiting the New England aquarium, learning how to scuba dive, watching octopuses having sex in Seattle, and of course, connecting with octopuses. To be frank, I had no idea octopuses could do any of the things she mentions in the book (such as walking on dry land and squeezing through a tiny hole to get out of a tank). Clearly, aquariums have their hands full trying to deal with octopuses. Much of the book is about trying to connect with the three octopuses the New England aquarium has; the author regularly touches and feeds the octopuses, observing and attempting to understand them. I was impressed by the way Montgomery manages to make the book educational, riveting, and emotional all at once, which was surprising to me when I found out about the book’s title. It’s surely a nice story, and that’s what I care about.

• Nothing Ever Dies: Vietnam and the Memory of War is a book by USC English Professor Viet Thanh Nguyen, published in 2016 and a finalist for the National Book Award in Non-Fiction that same year. It’s not a recap or history of the Vietnam War (since that subject has been beaten to death) but instead it focuses specifically on how people from different sides (obviously, American and Vietnamese, but also the rest of the world) view the war, because that will shape questions such as who is at fault and should make reparations and also how we can avoid similar wars in the future. It’s an academic-style book, so the writing is a bit dry and it’s important not to read this when tired. I think it provides a useful perspective on the Vietnam War and memories in general. Nguyen travels to many areas in Vietnam and Asia and explores how they view America — for instance, he argues that South Korea attempts to both ally with the US and look down on Vietnam with contempt. I found the most thought-provoking discussion to be about identity politics and how minorities often have to be the ones describing their own experiences. I’ve observed this in the books I read, in which if they’re written by a minority author (and here I’ll include Asians despite how critics of the tech industry bizarrely decide otherwise) are often about said minority. Other interesting (though obvious) insights include how the entire war machine and capitalism of the US means it can spread its memories of the war more effectively than Vietnam can. Thus, the favorable American perspective of the US as attempting to “save” minorities is more widespread, which puts America in a better light than (in my opinion, channeling my inner Noam Chomsky) it deserves.

• The Once and Future Liberal: After Identity Politics is a short book (describing it as an essay is probably more accurate) written by humanities professor Mark Lilla of Columbia University. This book grew out of his fantastic (perhaps my all-time favorite) Op-Ed in the NYTimes about the need to end identity politics, or specifically identity liberalism. I agree wholeheartedly; we need to stop treating different groups of people as monolithic. Now, it is certainly the case that racism or mistreating of any group must be called out, and white identity politics is often played on the right, versus the variety of identities on the left. Anyway, this short book is split into three parts: anti-politics, pseudo-politics, and politics, but this doesn’t seem to have registered much to me, and the book is arranged in a different style as I might have hoped. I was mostly intrigued by how he said Roosevelt-esque liberalism dominated from roughly 1930 to 1970. Then the Reagan-esque conservatism (i.e., the era of the individual) has dominated from 1980 to 2016 or so, and now we’re supposed to be starting a new era as Trump has demolished the old conservatism. But Lilla is frustrated that modern liberalism is so obsessed about identity, and quite frankly, so am I. He is correct, and many liberals would agree, that change must be aimed locally now, as Republicans have dominated state and local governments, particularly throughout the Obama years. I do wish, however, that he had focused more directly on (a) how groups are not monolithic, (b) why identity politics is bad politics. I know there was some focus, but there didn’t seem to be enough for me. But I suppose, this being a short essay, he wanted to prioritize the Roosevelt-Reagan parallels, which in all fairness is indeed interesting to ponder.

• ** Climate of Hope: How Cities, Businesses, and Citizens can Save the Planet **, a 2017 book jointly written by Michael Bloomberg and Carl Pope. Surprisingly, considering that I was born and raised in New York state all my life (albeit, upstate and not in the city) the first time I really learned about Bloomberg was when he gave the commencement speech at my college graduation. You can view the video here, and in fact, to the right you can barely see the hands of a sign language interpreter who I really should re-connect with sometime soon. Climate of Hope consists of a series of chapters, which are split into half from Bloomberg’s perspective, half from Pope’s perspective. The dynamics between the two men are interesting. Pope is a “typical” Sierra Nevada member, while Bloomberg is known for being a ridiculously-rich billionaire and a three-term (!!) mayor of New York City.13 The book is about cities, businesses, and citizens, and the omission of national governments is no accident: both men have been critical of Washington’s failure to get things done. Bloomberg and Pope aim their ire at the “climate change deniers” in Washington, though they do levy slight criticism on Democrats for failing to support nuclear power. They offer a brief scientific background on climate change, and then argue that new market forces and the rise of cities (thus greener due to more public transportation and more cramped living quarters) means we should be able to emphasize more renewable energy. One key thing I especially agree with is that to market policies that promote renewable energy — particularly to skeptical conservatives — people cannot talk about how “worldwide temperatures in 2100 will be two degrees higher.” Rather, we need to talk about things we can do now, such as saving money, protecting our cities, creating construction jobs, protecting our health from smog, all these thing we can do right now and which will have the effect of fighting long-term climate change anyway. I enjoyed this easy-to-read and optimistic book, though it’s also fair to say that I tend to view Bloomberg quite favorably and honor his commitment to getting things done rather than having dysfunction in Washington. Or maybe I just want to obtain a fraction of his professional success in my life.

1. Most of the academic papers that I read can be found in this GitHub repository

2. You’ll also notice in that link that Stuart Russell says he thinks superintelligence will happen in “more than 25 years” but he thinks it will happen. Russell’s been one of the leading academics voicing concern about AI. I’m not sure what has been created out of it, except raising a discussion of AI’s risks, kind of like how Barrat’s book doesn’t really propose solutions. (Disclaimer: I have not read all of Russell’s work on this, and I might need to see this page for information.)

3. In this interview, Oren Etzioni said that AI leaders were not concerned about superintelligence, and even quoted an anonymous AAAI Fellow who said that Nick Bostrom was “the Donald Trump of AI”. Stuart Russell, who has praised Superintelligence, wrote a rebuttal to Etzioni, who then apologized to Bostrom.

4. Of course, this raises the other problem with MOOCs. Only people who have sufficient motivation to learn are actually taking advantage of MOOCs, and these tend to be skewed towards those who are already well-educated. Under no circumstances is Brynjolfsson someone who needs a MOOC for his career. But there are many people who cannot afford college and the like, but who don’t have the motivation (or time!) to learn on their own. Is it fair for them to suffer under this new economy?

5. Eric Schmidt got his computer science PhD from Berkeley in 1982. So at least I know someone famous essentially started off on a similar career trajectory as I am.

6. I didn’t realize this until the authors put it in a footnote, but Jonathan Rosenberg’s father is Nathan Rosenberg, who wrote the 1986 book How the West Grew Rich which I also read this year. Heh, the more I read the more I realize that it’s a small world among the academic and technically elite among our society.

7. This blog is hosted on GitHub and built using software called Jekyll. Click here to see the source code

8. To compare, How the West Grew Rich is less than half the length of The Rise and Fall of American Growth. In addition, I skipped most footnotes for the former, but read all the footnotes for the later.

9. A quick thanks to Ben and Marc for helping to fund Berkeley’s Computer Science Graduate Student Association!

10. Dawkins mentions that, if anything was “the making” of him, Oxford was. For me, I consider Berkeley to be “the making of me” as I’ve learned much more, both academically and otherwise, here than at Williams College.

11. For the sake of keeping this blog mostly professional, I won’t list all my positive qualities here. ;)

12. Usually, someone completing a PhD in 2-3 years raises red flags since they likely didn’t get much research done and may have wanted to graduate ASAP. Grant is an exception, and it’s worth noting that there are also exceptions in computer science

13. Given the fact that Bloomberg was able to buy his way into being a politician, I really think the easiest way for me to enter national politics is to have enormous success in the business and technology sector. Then I can just buy my way in, or use my connections. It’s unfortunate that American politics is like this, but at least it’s better than having a king and royal family.

# At Long Last: A Simple Email Subscription for this Blog

It took me six and a half years to do this, but I finally managed to install an email subscription form for readers of this blog. The link is here. No more nasty RSS feeds that no one knows how to use!

The email subscription form for this blog uses MailChimp. Each time I publish a post, I will send an email to everyone on the list using MailChimp’s “Campaign” feature.

Incidentally, this is the same kind of email form we use over at the Berkeley AI Research (BAIR) Blog. If you haven’t already, please subscribe to the BAIR Blog! As a member of the editorial board, I know the posts that are coming up next. I obviously cannot reveal the exact content, though I can say that we have lots of interesting stuff lined up for 2018 and beyond.

For assistance on getting this set up, I thank Jane Liang, a UC Berkeley EECS student who set up MailChimp for the BAIR Blog. I also thank Dominic Spadacene, who wrote dead-simple HTML installation instructions on his Ctrl-F’d blog.

# On the Momentum Sign Flipping for Hamiltonian Monte Carlo

For a long time, I wanted to write a nice, long, friendly blog post on Hamiltonian Monte Carlo that I could come back to for more intuition and understanding as needed.

Fortunately, there’s no need for me to invest a ginormous amount of time I don’t have for that, because physicists/statistician Michael Betancourt has written a fantastic introduction to Hamiltonian Monte Carlo, called A Conceptual Introduction to Hamiltonian Monte Carlo. You can find the preprint here on arXiv. Don’t be deterred by the length; it’s a fast read compared to other academic papers, and certainly a much more intuitive read than Radford Neal’s 2011 review chapter, which I already thought couldn’t be surpassed in terms of a quality introduction to HMC. Indeed, even prominent statisticians such as COPSS Presidents’ Award winner Andrew Gelman have praised the writeup, and someone like him obviously doesn’t need it.

I have extensively read Radford Neal’s writeup, to the point where I was able to reproduce almost all his figures in my MCMC and Dynamics code repository on GitHub. There was, however, one question I had about HMC that I didn’t feel was elaborated upon enough:

Why is it necessary to flip the sign of the momentum to induce a symmetric proposal?

Fortunately, Betancourt’s writeup comes to the rescue! Thus, in this post, I’d like to go through the details on why it is necessary to flip the sign of the momentum term in HMC.

Let $\mathbb{Q}(q' \mid q)$ be the density function defining the current proposal method, whatever that may be. With a Gaussian proposal, we have symmetry in that $\mathbb{Q}(q'\mid q) = \mathbb{Q}(q\mid q')$. The same is true with Hamiltonian Monte Carlo … if we handle the momentum term correctly.

Borrowing Betancourt’s notation (from Section 5.2), we’ll assume that, starting from state $(q_0,p_0)$, we integrate the dynamics for $L$ steps to land at $(q_L,p_L)$, upon which we use that as our proposal:

where $\delta: \mathbb{R} \to \{0,1\}$ is the Dirac delta function, and the difference $q'-q_L$ is assumed to be real-valued; if $q$ and $p$ are vectors, these would need to be done component-wise and then summed up, but the same basic idea holds. Furthermore, $q'$ and $p'$ are “placeholder” random variables, kind of like how we often use $X$ when writing $\mathbb{P}[X=x]$ in introductory probability courses; $X$ is the placeholder and $x$ is the actual quantity.

Reading the definition straight from the Dirac delta functions, we see that our proposal density is one exactly at state $(q',p')=(q_L,p_L)$, and zero everywhere else. This makes sense because Hamiltonian dynamics are deterministic after re-sampling the momentum variables (but it’s understood that $p_0$ represents those states after the re-sampling, not before).

The problem with this is that the proposal becomes “ill-posed”. Betancourt remarks that:

however, I believe that’s a typo and that the numerator and denominator should be flipped, so that the numerator contains the density of the starting state given the proposal.

Regardless, to me it doesn’t make sense to have proposal probabilities or densities with these Dirac delta functions that result in zero everywhere (that means we’d always reject samples). The following figure (from Betancourt) visually depicts the problem:

Because these position and momentum variables are continuous-valued, the probability of actually landing back in the starting state has measure zero.

Suppose, however, that after we integrate for $L$ steps, we flip the sign of the momentum term. Then we have

so that only $(q',p')=(q_L,-p_L)$ results in a probability mass of one. See the following figure for this revision:

The key observation now, of course, is that

Why is this true? The dynamics are time-reversible, and if we set our potential energy to be the usual $K(p) = \frac{p^Tp}{2}$, then flipping the momentum term and going through the leapfrog means the sampler encounters the same exact steps, only in reverse.

To make this concrete, I like to explicitly go through the math of one leapfrog step. It requires some care with notation, but I find it’s worth it. I’ll write $(q_k^{(1)},p_k^{(1)})$ as the $k$-th element encountered during the forward trajectory. For the reverse, I use $(q_k^{(2)},p_k^{(2)})$ so that the superscript is now two instead of one. Furthermore, due to the leapfrog step taking half-steps for momentums, I use $k=0.5$ for this purpose.

Here’s the forward trajectory, starting from $(q_0^{(1)},p_0^{(1)})$:

and the last step negates the momentum, so that the final state is $(q_1^{(1)}, -p_1^{(1)})$.

Here’s the reverse trajectory, starting from $(q_0^{(2)},p_0^{(2)})$:

with our final state as $(q_1^{(2)}, -p_1^{(2)})$. Above, the only difference between the reverse and the forward trajectories is the change in superscripts. But when we do the math for the reverse trajectory while plugging in the values from the forward trajectory, we get:

Gee, this is exactly the negative of the half-step we were at in the first iteration! Similarly, for the position update, we have:

The leapfrog has brought the position back to the starting point. For the final half-step momentum update, we have:

and we see that our reverse trajectory landed us back in $(q_0^{(1)},-p_0^{(1)})$, and flipping the momentum gets us to the same exact starting state.

Thus, using this type of proposal means the proposal densities cancel out, result in a Metropolis test, not a Metropolis-Hastings test.

I should also add one other point: we do not consider the momentum resampling as being part of the proposal, as resampling the momentum can be considered as maintaining the canonical distribution, so it’s something that we can “always” decide to invoke if we want (hopefully that’s not too bad of a hand-wavy explanation).

Hopefully the above discussion clarifies why flipping the sign of the momentum is necessary, at least in theory. In practice, we don’t do it since for the usual Gaussian kinetic energies, $K(p) = K(-p)$ so their energy levels are the same, and because the momentum variables are traditionally entirely resampled after the acceptance test.

# Review of Deep Learning (CS 294-131) at Berkeley

This semester, I took CS 294-131, a Deep Learning “special topics” course which has been offered each semester since Fall 2016 for a variable amount of class units and will be taught again next semester (the course website is already up). As usual, it was co-taught by the Trevor Darrell and Dawn Song team. The course is low-commitment for them because it’s a seminar and they don’t have to give lectures or prepare assignments and exams. CS 294-131 meets only once a week; for us, it was Mondays from 1:00PM to 2:30PM. Each meeting featured a guest speaker from academia or industry who gave a talk on his or her cutting-edge Deep Learning research results.

Here were some of the highlights for me:

• Vladlen Koltun’s talk about his ICLR 2017 paper Learning to Act by Predicting the Future. I enjoyed his presentation, though admittedly most of it was because he was funny and actively engaging with the audience. I previously blogged about the more technical aspects here.

• Barret Zoph and Quoc Le’s joint talk on neural architecture search, also from ICLR 2017 (here’s the OpenReview link) and also (like Koltun’s paper) an oral presentation at that conference. I’ve been hoping to find some time to read their paper and perhaps the winter break will afford me that opportunity. Zoph and Le’s presentation featured a lot of aggressive questioning from students, to the point where Professor Song asked the students to quiet down and let the speakers proceed. Fortunately, at least to me, the technical content of the presentation was interesting enough to keep my attention.

• Ross Girshik’s talk on computer vision and object recognition. Actually, we had a fire alarm for this one, which delayed the start class for about 15 minutes … so then we had to find a new room. Unfortunately, it took about 10 more minutes to get the projector working, and then we were told we had to leave the room at around 2:10PM. At least when Girshik was actually able to talk about computer vision, I found the historical overview to be educational.

• Percy Liang’s presentation on fighting black boxes and adversaries in Deep Learning. This was somewhat more theoretical work but he didn’t go too much into the details. I am less familiar with his work but would like to get accustomed with it as adversarial learning is a pretty hot topic in robotics these days.

In case you’re wondering, yes I had the usual sign language interpreters for the class. Yes they were unhappy, but they tried, and we were able to agree on a few terminology-related issues beforehand. And yes, I didn’t get all the technical details from the talks. I tried to allocate two hours before class to do the background reading, but inevitably that turned into 1.5 hours … and then 1 hour … and then 30 minutes. How do people manage to do class readings ahead of time when they’re juggling four major research projects? Or do people with normal hearing just find it easy to absorb almost all the technical stuff in these talks without prior preparation?1

As I mentioned earlier, CS 294-131 can be taken for a different amount of credits. This year, we had the option of taking it for 1, 2, or 3 credits. We got one credit for doing “arXiv summaries” and “discussion leads” and two for doing a class project. I decided that those summaries and discussions would be too much of a hassle, so I took CS 294-131 for two credits. It helps that I’ve long since finished my course requirements.

While I enjoyed some of the Deep Learning talks, I do have some criticisms about CS 294-131:

• There are too many websites/links related to the class. We have Piazza, the course website, the Google group (seriously?) and Slack channels, with one for each new week. I think this is too much information and things should be centralized in two spots at most — the course website and Piazza.

• I’m also not a fan of the arXiv leads, which was new this semester. Students had to give 1-minute presentations on Deep Learning papers that appeared on arXiv the past week. The problem with this is that the majority of students tried to cram as may technical details in their talk as possible, rather than give the clear key insight from the paper. In addition, students often went over their allotted speaking time (gee, who would have guessed??).

• Finally, I have no idea why class participation is worth 20% of the grading here. On Piazza, we were literally told that we would get class participation credit by simply attending the lectures. Not only does it not make sense to award students who attend lectures but don’t pay attention, it also hurts those who watch the video livestreams to reduce pressure on the lecture room, since the first lecture was “standing-room only.”

To be honest, I didn’t quite enjoy the class as much as I should have, and my project didn’t turn out as well as I would have liked. I worked on a Deep Reinforcement Learning project with three other students, and hopefully that will turn into a research paper later, but in retrospect, it’s difficult to coordinate a four-person project when everyone else has other priorities.

I don’t plan to take the Spring 2018 version of the course, but I’ll certainly keep track of the papers in the background reading. I’m excited to see who the guest speakers will be this time around …

1. For readers of this blog who are EECS PhD students (and yes, I know you read this blog) that means I’m not-so-subtly asking you to tell me how much you can absorb from technical talks and lectures. Posting as a comment here or emailing me personally works.