# Distributed PER, Ape-X DQfD, and Kickstarting Deep RL

In my last post, I briefly mentioned that there were two relevant follow-up papers to the DQfD one: Distributed Prioritized Experience Replay (PER) and the Ape-X DQfD algorithm. In this post, I will briefly review them, along with another relevant follow-up, Kickstarting Deep Reinforcement Learning.

## Distributed PER

This is the epitome of a strong empirical Deep Reinforcement Learning paper. The main idea is to add many actors running in parallel to collect samples. The actors periodically pool their samples into a centralized data repository, which is used for experience replay for a centralized learner to update neural network parameters. Those parameters then get copied back to the actors. That’s it!
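To make the data flow concrete, here is a minimal single-process sketch of the idea: several actors push transitions, together with initial priorities, into one central prioritized buffer, and a learner samples from it and refreshes the priorities. The class and function names, the fake transitions, and the exponent `alpha=0.6` are my own illustrative assumptions, not DeepMind's implementation, and real actors would run in parallel.

```python
import numpy as np


class PrioritizedReplay:
    """Toy central replay buffer with proportional prioritization."""

    def __init__(self, alpha=0.6, seed=0):
        self.alpha = alpha            # assumed prioritization exponent
        self.items = []
        self.priorities = []
        self.rng = np.random.default_rng(seed)

    def add(self, transitions, priorities):
        # Actors supply initial priorities together with their data.
        self.items.extend(transitions)
        self.priorities.extend(priorities)

    def sample(self, batch_size):
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        idx = self.rng.choice(len(self.items), size=batch_size, p=p)
        return idx, [self.items[i] for i in idx]

    def update(self, idx, new_priorities):
        # The learner refreshes priorities after recomputing TD errors.
        for i, pr in zip(idx, new_priorities):
            self.priorities[i] = float(pr)


def actor_rollout(actor_id, n_steps, rng):
    """Hypothetical actor: returns fake transitions and fake |TD error| priorities."""
    transitions = [(actor_id, t) for t in range(n_steps)]
    priorities = rng.uniform(0.1, 1.0, size=n_steps).tolist()
    return transitions, priorities


buffer = PrioritizedReplay()
rng = np.random.default_rng(1)
for actor_id in range(4):                     # sequential stand-in for parallel actors
    buffer.add(*actor_rollout(actor_id, n_steps=8, rng=rng))

idx, batch = buffer.sample(batch_size=5)
buffer.update(idx, np.full(len(idx), 0.05))   # pretend the TD errors shrank
```

The only point of the sketch is the division of labor: actors generate data and initial priorities, while the learner only samples and updates.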

Here’s what they have to say:

In this paper we describe an approach to scaling up deep reinforcement learning by generating more data and selecting from it in a prioritized fashion (Schaul et al., 2016). Standard approaches to distributed training of neural networks focus on parallelizing the computation of gradients, to more rapidly optimize the parameters (Dean et al., 2012). In contrast, we distribute the generation and selection of experience data, and find that this alone suffices to improve results. This is complementary to distributing gradient computation, and the two approaches can be combined, but in this work we focus purely on data-generation.

My first question after reading the introduction was about how Distributed PER differs from their two most relevant prior works: distributed Deep RL, also known as the “Gorila” paper (Nair et al., 2015) and A3C (Mnih et al., 2016). As far as I can tell, here are the differences:

• The Gorila paper used a distributed experience replay buffer (and no prioritization). This paper uses a centralized buffer, with prioritization. There is also a slight algorithmic change as to when priorities are actually computed — they are initialized from the actors, not the learners, so that there isn’t too much priority focused on the most recent samples.

• The A3C paper used one CPU and had multithreading for parallelism, with one copy of the environment per thread. They did not use experience replay, arguing that the diversity of samples from the different environments alleviates the need for replay buffers.

That … seems to be it! The algorithm is not particularly novel. I am also not sure how using one CPU and multiple threads compares to using multiple CPUs, with one copy of the environment in each. Do any hardware experts want to chime in? I ask because the Distributed PER algorithm gets massively better performance on the Atari benchmarks compared to prior work, so something must be working better than before. Well, they probably used far more CPUs; they use 360 in the paper (and only one GPU, interestingly).

They evaluate their framework, Ape-X, on DQN and DPG, so the algorithms are called Ape-X DQN and Ape-X DPG. The experiments are extensive, and they even benchmark with Rainbow! At the time of the paper submission to ICLR, Rainbow was just an arXiv preprint, under review at AAAI 2018, where (unsurprisingly) it got accepted. Thus, with Distributed PER being submitted to ICLR 2018, there was no ambiguity on who the authors were. :-)

The results suggest that Ape-X DQN is better than Rainbow … with DOUBLE (!!) the performance, and using half of the runtime! That is really impressive, because Rainbow was considered to be the most recent state of the art on Atari. As a caveat, it can be difficult to compare these algorithms, because with wall clock time, an algorithm with more parallelism (i.e., Ape-X versions) can get more samples, so it’s not clear if this is a fair comparison. Furthermore, Ape-X can certainly be combined with Rainbow. DeepMind only included a handful of Rainbow’s features in Ape-X DQN, so presumably with more of those features, they would get even better results.

Incidentally, when it comes to state of the art on Atari games, we have to be careful about what we mean. There are at least three ways you can define it:

• One algorithm, applied to each game independently, with no other extra information beyond the reward signal.
• One algorithm, applied to each game independently, but with extra information. This information might be demonstrations, as in DQfD.
• One algorithm, with ONE neural network, applied to all games in a dependent manner. Some examples are IMPALA (ICML 2018), PopArt (AAAI 2019) and Recurrent Distributed PER (ICLR 2019).

In the first two cases, we assume no hyperparameter tuning per game.

In this paper, we are using the first item above. I believe that state of the art in this domain proceeded as follows: DQN -> Double DQN (i.e., DDQN) -> Prioritized DDQN -> Dueling+Prioritized DDQN -> A3C -> Rainbow -> Distributed PER (i.e., Ape-X DQN). Is that it? Have there been any updates after this paper? I suppose I could leave out the dueling architectures one because that came at the same time the A3C paper did (ICML 2016), but I wanted to mention it anyway.

## Ape-X DQfD

The motivation for this paper was as follows:

Despite significant advances in the field of deep Reinforcement Learning (RL), today’s algorithms still fail to learn human-level policies consistently over a set of diverse tasks such as Atari 2600 games. We identify three key challenges that any algorithm needs to master in order to perform well on all games: processing diverse reward distributions, reasoning over long time horizons, and exploring efficiently. In this paper, we propose an algorithm that addresses each of these challenges and is able to learn human-level policies on nearly all Atari games.

The algorithm they propose is Ape-X DQfD. But, as suggested in the abstract, there are actually three main challenges they attempt to address, and the “DQfD” portion is mainly for the exploration one; the demonstrations reduce the need for the agent to randomly explore. The other algorithmic aspects do not appear to be as relevant to DQfD so I will not review them in my usual excess detail. Here they are, briefly:

• They propose a new Bellman operator with the goal of reducing the magnitude of the action-value function. This is meant to be a “better solution” than clipping rewards into $[-1,1]$. See the paper for the mathematical details; the main idea is to insert a function $h(z)$ and its inverse $h^{-1}(z)$ into the Bellman operator. I kind of see how this works, but it would be nice to understand the justification better.

• They want to reason over longer horizons, which they can do by increasing the discount factor from the usual $\gamma=0.99$ to $\gamma=0.999$. Unfortunately, this can cause instability since, according to them, it “[…] decreases the temporal difference in value between non-rewarding states.” I don’t think I understand this justification. I believe they argue that states $(x_i,a_i)$ and $(x_{i+1},a_{i+1})$ should not actually have the same value function, so in some odd sense, generalization of the value function is bad.

with $k \in \mathbb{N}$ the current iteration number.
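For readers who want to play with the first idea, here is a small sketch of a transformed one-step target. I am assuming the squashing function $h(z) = {\rm sign}(z)(\sqrt{|z|+1}-1) + \varepsilon z$ with $\varepsilon = 10^{-2}$, which is how I remember it from the paper, so please double check it there. The function names are mine.

```python
import numpy as np

EPS = 1e-2  # assumed value of epsilon; verify against the paper


def h(z, eps=EPS):
    """Squash large magnitudes while staying invertible."""
    z = np.asarray(z, dtype=np.float64)
    return np.sign(z) * (np.sqrt(np.abs(z) + 1.0) - 1.0) + eps * z


def h_inv(x, eps=EPS):
    """Closed-form inverse of h, obtained by solving a quadratic in sqrt(|z| + 1)."""
    x = np.asarray(x, dtype=np.float64)
    u = (np.sqrt(1.0 + 4.0 * eps * (np.abs(x) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return np.sign(x) * (u ** 2 - 1.0)


def transformed_target(reward, next_q_max, gamma=0.999, done=False):
    """One-step target h(r + gamma * h^{-1}(max_a Q(s', a)))."""
    bootstrap = 0.0 if done else gamma * h_inv(next_q_max)
    return h(reward + bootstrap)
```

The network then regresses onto squashed values, so a game with rewards in the thousands and a game with rewards near one produce targets of comparable scale without clipping the rewards themselves.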

Now let’s get to what I really wanted to see: the DQfD extensions. DeepMind applies the following algorithmic differences:

• They do not have a “pre-training” phase, so the agent learns from its self-generated data and the expert data right away.

• They only apply the imitation loss to the best expert data. Huh, that seems to partially contradict the DQfD result which argued that the imitation loss was a critical aspect of the algorithm. Another relevant change is to increase the margin value from 0.8 to $\sqrt{0.999}$. I am not sure why.

• They use many actors to collect experience in parallel. To be precise on the word “many,” they use 128 parallel actors.

• They only use a fixed 75-25 split of agent-to-expert data.

At first, I was surprised at the last change I wrote above, because I had assumed that adding priorities to both the agent and expert data and putting them in the same replay buffer would be ideal. After all, the Ape-X paper (i.e., the Distributed Prioritized Experience Replay paper that I just discussed) said that prioritization was the most important ingredient in Rainbow:

In an ablation study conducted to investigate the relative importance of several algorithmic ingredients (Hessel et al., 2017), prioritization was found to be the most important ingredient contributing to the agent’s performance.

But after a second read, I realized what the authors meant. They still use priorities for the expert and actor samples, but these are in separate replay buffers. They are then sampled and then combined together at the 75-25 ratio. They explain why in the rebuttal:

Simply adding more actors to the original DQfD algorithm is a poor choice of baseline. First, DQfD uses a single replay buffer and relies on prioritization to select expert transitions. This works fine when there is a single actor process feeding data into the replay buffer. However, in a distributed setup (with 128 actors), the ratio of expert to actor experience drops significantly and prioritization no longer exposes the learner to expert transitions. Second, DQfD uses a pre-training phase during which all actor processes are idle. This wastes a lot of compute resources when running a decent number of actors. Many of our contributions help eliminate these shortcomings.

This makes sense. Did I mention that I really like ICLR’s OpenReview system?
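Here is a rough sketch of how I picture the fixed-ratio sampling: each buffer keeps its own priorities, and every minibatch takes a fixed share from each. The function name, the batch size, and the proportional sampling rule are assumptions for illustration only.

```python
import numpy as np


def sample_mixed_batch(agent_prios, expert_prios, batch_size=32,
                       expert_fraction=0.25, rng=None):
    """Draw indices from two separately prioritized buffers at a fixed ratio."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_expert = int(round(batch_size * expert_fraction))
    n_agent = batch_size - n_expert

    def draw(prios, n):
        p = np.asarray(prios, dtype=np.float64)
        return rng.choice(len(p), size=n, p=p / p.sum())

    return draw(agent_prios, n_agent), draw(expert_prios, n_expert)


# Even when agent data vastly outnumbers expert data, the expert share is fixed.
agent_idx, expert_idx = sample_mixed_batch(np.ones(100000), np.ones(50))
```

With a single shared buffer, the 50 expert entries above would almost never be drawn, which is the failure mode the rebuttal describes.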

The algorithm is named, as mentioned earlier, Ape-X DQfD because it uses ideas from the Ape-X paper, titled “Distributed Prioritized Experience Replay” (ICLR 2018). Yeah, more distributed systems!! I am seeing a clear trend.

What do the experiments show and suggest? Ablation studies are conducted on six carefully-selected games, and then more extensive results are reported with the same 42 (not 57) games from DQfD for which they have demonstrator data. It looks better than prior methods! Still, this is with the usual caveats that it’s near-impossible to reproduce these even if we had the same computational power as DeepMind.

In addition, with DQfD, we assume privileged information in the form of expert transitions, so it might not be fair to evaluate it with other algorithms that only get data from exploration, such as Rainbow DQN or Ape-X DQN.

After reading this paper, there are still some confusing aspects. I agree with one of the reviewer comments that the addition of demonstrations (forming Ape-X DQfD) feels like an orthogonal contribution. Overall, the paper looks like it was a combination of some techniques, but perhaps was not as rigorous as desired? Hopefully DeepMind will expand Ape-X DQfD in its own paper, as I would be interested in seeing their follow-up work.

## Kickstarting Deep Reinforcement Learning

Given the previous discussion on DQfD and combining IL with RL, I thought it would be worthwhile to comment on the Kickstarting Deep Reinforcement Learning paper. It’s closely related; again, we assume we have some pre-trained “expert” policy, and we hope to accelerate the learning process of a new “student” agent via a combination of IL and RL. This paper appears to be the policy gradient analogue to DQfD. Rather than use a shared experience replay buffer, the student incorporates an extra policy distillation loss to the normal RL objective of maximizing reward. Policy distillation (see my earlier blog post about that here) cannot be the sole thing the agent considers. Otherwise, if the agent did not consider the usual environment reward, it would be unlikely to surpass performance of the teacher.

At a high level, here’s how DeepMind describes it:

The main idea is to employ an auxiliary loss function which encourages the student policy to be close to the teacher policy on the trajectories sampled by the student. Importantly, the weight of this loss in the overall learning objective is allowed to change over time, so that the student can gradually focus more on maximising rewards it receives from the environment, potentially surpassing the teacher (which might indeed have an architecture with less learning capacity).

The paper also says one of the two main prior works is “population-based training” but that part is arguably more of an orthogonal contribution, as it is about how to effectively use many agents training in parallel to find better hyperparameters. Thus, I will focus only on the auxiliary loss function aspect.

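Here is a generic loss function that a “kickstarted” student would use:

$$\mathcal{L}_{\rm kick}^{k}(\omega, x, t) = \mathcal{L}_{\rm RL}(\omega, x, t) + \lambda_k H(\pi_T(a | x_t) || \pi_S(a | x_t, \omega))$$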

where $\mathcal{L}_{\rm RL}(\omega, x, t)$ is the usual RL objective formulated as a loss function, $H(\pi_T(a | x_t) || \pi_S(a | x_t, \omega))$ is the cross entropy between a student and a teacher’s output layers, and $\lambda_k$ is a hyperparameter corresponding to optimization iteration $k$, which decreases over the course of training so that the student gradually focuses less on imitating the teacher. Be careful about the notation: $x$ implies a trajectory generated by the student (i.e., the “behavioral” policy) whereas $x_t$ is a single state. In the context of the above notation, it is a state from a teacher trajectory that was generated ahead of time. We know this because it is inside the cross entropy term.
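The weighted sum described above is easy to sketch numerically. The code below is only an illustration: the function names are mine, the RL loss is passed in as a plain number, and the linear decay for $\lambda_k$ is a hypothetical schedule I made up to show the "decreases over training" behavior.

```python
import numpy as np


def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def cross_entropy(teacher_probs, student_logits):
    """H(pi_T || pi_S) for each state in the batch."""
    return -(teacher_probs * log_softmax(student_logits)).sum(axis=-1)


def kickstart_loss(rl_loss, teacher_probs, student_logits, lam):
    """RL loss plus the weighted distillation term, averaged over the batch."""
    return rl_loss + lam * cross_entropy(teacher_probs, student_logits).mean()


def linear_lambda(k, lam0=1.0, decay_steps=1000):
    """Hypothetical schedule: decay the distillation weight linearly to zero."""
    return lam0 * max(0.0, 1.0 - k / decay_steps)


teacher_probs = np.full((2, 4), 0.25)      # uniform teacher over 4 actions
student_logits = np.zeros((2, 4))          # student is also uniform
loss_value = kickstart_loss(1.0, teacher_probs, student_logits, linear_lambda(500))
```

Once the weight reaches zero, the student optimizes the plain RL loss, which is what lets it surpass the teacher.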

This is a general framework, so there are more specific instances of it depending on which RL algorithm is used. For example, here is how to use a kickstarted A3C loss:

where here, $H$ is now used as simply the policy entropy (first time) of the student, and then the cross entropy (second time). As usual, $V$ is a parameterized value function which plays the role of the critic. Unfortunately, the notation for the states can be a bit confusing and conflates the student and teacher states from their respective trajectories. If I were in charge of the notation, I would use $x_t^{S}$ and $x_t^{T}$ for student and teacher states, respectively.

In practice, to get better results, DeepMind uses the Importance Weighted Actor-Learner Architecture (IMPALA) agent, which extends actor-critic methods to a distributed setting with many workers. Does anyone see a trend here?

The experiments are conducted on the DMLab-30 suite of environments. Unfortunately I don’t have any intuition on this benchmark. It would be nice to test this out but the amount of compute necessary to even reproduce the figures they have is insane.

Since it was put on arXiv last year, DeepMind has not updated the paper. There is still a lot to explore here, so I am not sure why — perhaps it was rejected by ICML 2018, and then the authors decided to move on to other projects? More generally, I am interested in seeing if we have explored the limits of accelerating student training via a set of teacher agents.

## Concluding Thoughts

This post reviewed three related works to the Deep Q-Learning from Demonstrations paper. The trends that I can see are these:

• If we have expert (more generally, “demonstrator”) data, we should use it. But we have to take care to ensure that our agents can effectively utilize the expert data, and eventually surpass expert performance.

• Model-free reinforcement learning is incredibly sample inefficient. Thus, distributed architectures — which can collect many samples in parallel — result in the best performance in practice.

# Combining Imitation Learning and Reinforcement Learning Using DQfD

Imitation Learning (IL) and Reinforcement Learning (RL) are often introduced as similar, but separate problems. Imitation learning involves a supervisor that provides data to the learner. Reinforcement learning means the agent has to explore in the environment to get feedback signals. This crude categorization makes sense as a start, but as with many things in life, the line between them is blurry.

I am interested in both learning problems, but am probably even more fascinated about figuring out how best to merge the techniques to get the best of both worlds. A landmark paper in the combination of imitation learning and reinforcement learning is DeepMind’s Deep Q-Learning from Demonstrations (DQfD), which appeared at AAAI 2018. (The paper was originally called “Learning from Demonstrations for Real World Reinforcement Learning” in an earlier version, and somewhat annoyingly, follow-up work has cited both versions of the title.) In this post, I’ll review the DQfD algorithm and the most relevant follow-up work.

The problem setting of DQfD is when data is available from a demonstrator (or “expert”, or “teacher”) for some task, and using this data, we want to train a new “learner” or “student” agent to learn faster as compared to vanilla RL. DeepMind argues that this is often the case, or at least more common than when a perfectly accurate simulator is available:

While accurate simulators are difficult to find, most of these problems have data of the system operating under a previous controller (either human or machine) that performs reasonably well.

Fair enough. The goal of DQfD is to use this (potentially suboptimal!) demonstration data to pre-train an agent, so that it can then perform RL right away and be effective. But the straightforward solution of performing supervised learning on the demonstration data and then applying RL is not ideal. One reason is that it makes sense to use the demonstration data continuously throughout training as needed, rather than ignoring it. This implies that the agent still needs to do supervised learning during the RL phase. The RL phase, incidentally, should allow the learner to eventually out-perform the agent who provided the data. Keep in mind that we also do not want to force a supervisor to be present to label the data, as in DAGGER — we assume that we are stuck with the data we have at the start.

So how does this all work? DQfD cleverly defines a loss function $J(Q)$ based on the current $Q$ network, and applies it in its two phases.

• In Phase I, the agent does not do any exploration. Using the demonstration data, it performs gradient descent to minimize $J(Q)$. DQfD, like DQN, uses an experience replay buffer which stores the demonstration data. This means DQfD is off-policy as the data comes from the demonstrator, not the agent’s current behavioral policy. (Incidentally, I often refer to “teacher samples in a replay buffer” to explain to people why DQN and Q-Learning methods are off-policy.) The demonstration data is sampled with prioritization using prioritized experience replay.

• In Phase II, the agent explores in the environment to collect its own data. It is not explicitly stated in the paper, but upon communication with the author, I know that DQfD follows an $\epsilon$-greedy policy but keeps $\epsilon=0.01$, so effectively there is no exploration and the only reason for “0.01” is to ensure that the agent cannot get stuck in the Atari environments. Key to this phase is that the agent still makes use of the demonstration data. That is, the demonstration data does not go away, and is still occasionally sampled with prioritization along with the new self-generated data. Each minibatch may therefore consist of a mix of demonstration and self-generated data, and these are used for gradient updates to minimize the same loss function $J(Q)$.

The main detail to address is the loss function $J(Q)$, which is:

$$J(Q) = J_{DQ}(Q) + \lambda_1 J_n(Q) + \lambda_2 J_E(Q) + \lambda_3 J_{L_2}(Q)$$

where:

• $J_{DQ}(Q)$ is the usual one-step Double DQN loss.

• $J_n(Q)$ is the $n$-step (Double) DQN loss — from communication with the author, they still used the Double DQN loss, even though it doesn’t seem to be explicitly mentioned. Perhaps the Double DQN loss isn’t as important in this case, because I think having longer returns with a discount factor $\gamma$ decreases the need to have a highly accurate bootstrapped Q-value estimate.

• $J_E(Q)$ is a large margin classification loss, which I sometimes refer to as “imitation” or “supervised” loss. It is defined as

$$J_E(Q) = \max_{a \in \mathcal{A}} \left[ Q(s,a) + \ell(a,a_E) \right] - Q(s,a_E)$$

where $\ell(a,a_E)$ is a “margin” loss which is 0 if $a=a_E$ and positive otherwise (DQfD used 0.8). This loss is only applied on demonstrator data, where an $a_E$ actually exists!

• $J_{L_2}(Q)$ is the usual $L_2$ regularization loss on the Q-network’s weights.

All these are combined into $J(Q)$, which is then minimized via gradient descent on minibatches of samples. In Phase I, the samples all come from the demonstrator. In Phase II, they could be a mixture of demonstrator and self-generated samples, depending on what gets prioritized. The lambda terms are hyperparameters to weigh different aspects of the loss function. DeepMind set them at $\lambda_1 = 1, \lambda_2 = 1$, and $\lambda_3 = 10^{-5}$, with the obvious correction factor of $\lambda_2=0$ if the sample in question is actually self-generated data.
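As a quick sanity check on how the pieces combine, here is a small sketch that takes already-computed per-sample loss terms and applies the weights, including the $\lambda_2=0$ correction for self-generated samples. The function name and the averaging over the minibatch are my own assumptions.

```python
import numpy as np


def dqfd_loss(j_dq, j_n, j_e, j_l2, is_demo, lam1=1.0, lam2=1.0, lam3=1e-5):
    """Combine per-sample loss terms; the imitation term only counts for demo samples."""
    is_demo = np.asarray(is_demo, dtype=np.float64)
    per_sample = (np.asarray(j_dq, dtype=np.float64)
                  + lam1 * np.asarray(j_n, dtype=np.float64)
                  + lam2 * is_demo * np.asarray(j_e, dtype=np.float64))
    # j_l2 is a single scalar for the whole network, not a per-sample quantity.
    return float(per_sample.mean() + lam3 * j_l2)


# Two samples: the first is from the demonstrator, the second is self-generated.
total = dqfd_loss(j_dq=[1.0, 1.0], j_n=[2.0, 2.0], j_e=[0.5, 0.5],
                  j_l2=100.0, is_demo=[1, 0])
```

The mask is what lets Phase I and Phase II share the same code path: in Phase I every entry of `is_demo` is one.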

The large margin imitation loss is simple, but is worth thinking about carefully. Why does it work?

Let’s provide some intuition. Suppose we have a dataset $\{s_0,a_0,s_1,a_1,\ldots\}$ from a demonstrator. For simplicity, suppress the time subscript over $a$; it’s understood to be $a_t$ if we’re considering state $s_t$ in our Q-values. Furthermore, we’ll suppress $\theta$ in $Q_\theta(s,a)$, which represents the parameters of our Q-network, which tries to estimate the state-action values. If our action space has four actions, $\mathcal{A} = \{0,1,2,3\}$, then the output of the Q-network using the following size-3 minibatch (during training) is this $3\times 4$ matrix:

$$\begin{bmatrix} Q(s_0,0) & Q(s_0,1) & Q(s_0,2) & Q(s_0,3) \\ Q(s_1,0) & Q(s_1,1) & Q(s_1,2) & Q(s_1,3) \\ Q(s_2,0) & Q(s_2,1) & Q(s_2,2) & Q(s_2,3) \end{bmatrix}$$

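You can think of DQfD as augmenting that matrix by the margin $\ell$, and then applying the max operator to get this during the training process:

$$\max\left(\begin{bmatrix} Q(s_0,0)+\ell & Q(s_0,1)+\ell & Q(s_0,2)+\ell & Q(s_0,3) \\ Q(s_1,0)+\ell & Q(s_1,1) & Q(s_1,2)+\ell & Q(s_1,3)+\ell \\ Q(s_2,0)+\ell & Q(s_2,1)+\ell & Q(s_2,2) & Q(s_2,3)+\ell \end{bmatrix}\right) - \begin{bmatrix} Q(s_0,3) \\ Q(s_1,1) \\ Q(s_2,2) \end{bmatrix}$$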

where the “max” operator above is applied on each row of the matrix of Q-values. In the above, the demonstrator took action $a_0=3$ when it was in state $s_0$, action $a_1=1$ when it was in state $s_1$, and action $a_2=2$ when it was in state $s_2$. To reiterate, this is only applied on the demonstrator data.

The loss is designed like this to push the Q-values of the actions the demonstrator did not take below the Q-value of the action the demonstrator actually took. Let’s go through the three possible cases.

• Suppose for a given state (e.g., $s_0$ from above), our Q-network thinks that some action (e.g., $a_0=0$) has the highest Q-value despite it not being what the expert actually took. Then

$$J_E(Q) = Q(s_0,0) + \ell - Q(s_0,3) > \ell$$

and our loss is high.

• Suppose for the same state $s_0$, our Q-network thinks that the action the demonstrator took ($a_0=3$ in this case) has the highest Q-value. But, after adding $\ell$ to the other actions’ Q-values, suppose that the maximum Q-value is something else, i.e., $Q(s_0,k)$ but where $k \ne 3$. Then, the loss is in the range

$$0 < J_E(Q) = Q(s_0,k) + \ell - Q(s_0,3) \le \ell$$

which is better than the previous case, but not ideal.

• Finally, suppose we are in the previous case, but even after adding $\ell$ to the other Q-values, the Q-value corresponding to the state-action pair the demonstrator took is still the highest. Then the imitation loss is

$$J_E(Q) = Q(s_0,3) - Q(s_0,3) = 0$$

which is clearly the minimum possible value.
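The three cases are easy to check numerically. Below is a small sketch of the large margin loss on a batch of Q-values, using the margin of 0.8 mentioned earlier. The function name and the made-up Q-values are mine.

```python
import numpy as np


def large_margin_loss(q_values, expert_actions, margin=0.8):
    """Per-sample loss: max_a [Q(s,a) + margin * 1(a != a_E)] - Q(s, a_E)."""
    q = np.asarray(q_values, dtype=np.float64)
    rows = np.arange(q.shape[0])
    augmented = q + margin                       # add the margin to every action ...
    augmented[rows, expert_actions] -= margin    # ... except the demonstrator's action
    return augmented.max(axis=1) - q[rows, expert_actions]


# Three made-up Q-value rows for state s_0, where the demonstrator took action 3.
q_rows = [
    [2.0, 0.0, 0.0, 1.0],   # case 1: a wrong action has the highest Q-value
    [0.5, 0.0, 0.0, 1.0],   # case 2: expert action is best, but not by a full margin
    [0.0, 0.0, 0.0, 1.0],   # case 3: expert action wins even after adding the margin
]
losses = large_margin_loss(q_rows, expert_actions=[3, 3, 3])
```

The first loss exceeds the margin, the second falls between zero and the margin, and the third is exactly zero, matching the three cases.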

An alternative to the imitation loss could be the cross entropy loss to try and make the learner choose the action most similar to what the expert took, but the DQfD paper argues that the imitation loss is better.

Besides the imitation loss, the other thing about the paper that I was initially unsure about was how to implement the prioritization for the 1-step and $n$-step returns. After communicating with the author, I see that probably the easiest way is to treat the 1-step and $n$-step returns as separate entries in the replay buffer. That way, each of these items has independent sampling priorities.

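Said another way, instead of our samples in the replay buffer only coming from tuples of the form:

$$(s_t, a_t, r_t, d_t, s_{t+1})$$

in DQfD, we will also see these terms below which constitute half of the replay buffer’s elements:

$$(s_t, a_t, r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1}, d_{t+n-1}, s_{t+n})$$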

This makes sense, though it is tricky to implement. Above, I’ve included the “done masks” that are used to determine whether we should use a bootstrapped estimate or not. But keep in mind that if an episode actually ends sooner than $n$ steps ahead, the done mask must be applied at the point where the episode ends!
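Since the early-termination detail is where I would expect bugs, here is a sketch of how one might build an $n$-step entry from a stored trajectory. The tuple layout and function names are my own choices and are not taken from the paper.

```python
def n_step_transition(states, actions, rewards, dones, t, n, gamma=0.99):
    """Build one n-step replay entry starting at time t.

    states has one more element than actions, rewards, and dones.
    Returns (s_t, a_t, n_step_return, done, bootstrap_state, bootstrap_discount).
    """
    n_step_return, discount, last = 0.0, 1.0, t
    for k in range(t, min(t + n, len(rewards))):
        n_step_return += discount * rewards[k]
        discount *= gamma
        last = k
        if dones[k]:          # episode ended early: stop accumulating here
            break
    done = bool(dones[last])
    return states[t], actions[t], n_step_return, done, states[last + 1], discount


def target(entry, bootstrap_value):
    """n-step target; the done mask switches off the bootstrapped estimate."""
    _, _, n_step_return, done, _, discount = entry
    return n_step_return + (0.0 if done else discount * bootstrap_value)


states, actions, rewards = [0, 1, 2, 3, 4], [0, 0, 0, 0], [1.0, 1.0, 1.0, 1.0]
full = n_step_transition(states, actions, rewards, [False] * 4, t=0, n=3, gamma=0.5)
cut = n_step_transition(states, actions, rewards,
                        [False, True, False, False], t=0, n=3, gamma=0.5)
```

In the second call the episode ends after two steps, so the return stops there and the bootstrap term is masked out.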

One final comment on replay buffers: if you’re using prioritized experience replay, it’s probably easiest to keep the demonstrator and self-generated data in the same experience replay buffer class, and then build on top of the prioritized buffer class as provided by OpenAI baselines. To ensure that the demonstrator data is never overwritten, you can always “reserve” the first $K$ indices in a list of the transition tuples to be for the teacher.
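The “reserve the first $K$ indices” trick can be sketched without any particular library. The class below is a hypothetical stand-alone ring buffer, not the OpenAI baselines class, and it leaves out prioritization entirely to keep the index logic visible.

```python
class DemoPreservingBuffer:
    """Ring buffer whose first K slots hold demonstrations and are never overwritten."""

    def __init__(self, demos, capacity):
        assert capacity > len(demos), "need room for self-generated data"
        self.storage = list(demos)
        self.num_demos = len(demos)
        self.capacity = capacity
        self.next_idx = self.num_demos   # first writable slot

    def add(self, transition):
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.next_idx] = transition
        self.next_idx += 1
        if self.next_idx >= self.capacity:
            self.next_idx = self.num_demos   # wrap around, skipping the demo slots


buf = DemoPreservingBuffer(demos=["d0", "d1"], capacity=4)
for item in ["a", "b", "c", "d", "e"]:
    buf.add(item)
```

After the five additions, the two demonstration entries are still in slots 0 and 1, and only the agent's slots have been recycled.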

The paper’s experimental results show that DQfD learns faster initially (arguably one of the most critical periods) and often has higher asymptotic performance. You can see the results in the paper.

I like the DQfD paper very much. Otherwise, I would not have invested significant effort into understanding it. Nonetheless, I have at least two concerns:

• The ablation studies seem very minor, at least compared to the imitation loss experiments, and I’m not sure why. It seems like DeepMind prioritized their other results, which probably makes sense. In theory, I think all components in the loss function $J(Q)$ matter, with the possible exception of the $L_2$ loss (see next point).

• The main logic of the $L_2$ loss is to avoid overfitting on the demonstration data. But is that really the best way to prevent overfitting? And in RL here, presumably if the agent gets better and better, it doesn’t need the demonstration data any more. But then we would be effectively applying $L_2$ regularization on a “vanilla” RL agent, and that might be riskier — regularizing RL seems to be an open question.

DeepMind has extended DQfD in several ways. Upon a literature search, it seems like two relevant follow-up works are:

• Distributed Prioritized Experience Replay, with the OpenReview link here for ICLR 2018. The main idea of this paper is to scale up the experience replay data by having many actors collect experience. Their framework is called Ape-X, and they claim that Ape-X DQN achieves a new state of the art performance on Atari games. This paper is not particularly relevant to DQfD, but I include it here mainly because a follow-up paper (see below) used this technique with DQfD.

• Observe and Look Further: Achieving Consistent Performance on Atari. This one appeared to have been rejected from ICLR 2019, unfortunately, and I’m a little concerned about the rationale. If the reviewers say that there are not enough experiments, then what does this mean for people who do not have as much compute? In any case, this paper proposes the Ape-X DQfD algorithm, which as one might expect combines DQfD with the distributed prioritized experience replay algorithm. DeepMind was able to increase the discount factor to $\gamma=0.999$, which they argue allows for an order of magnitude more downstream reward data. Some other hyperparameter changes: the large margin value increases to $\sqrt{0.999}$, and there is now no prioritization, just a fixed proportion of student-teacher samples in each minibatch. Huh, that’s interesting. I would like to understand this rationale better.

Combining IL and RL is a fascinating concept. I will keep reading relevant papers to remain at the frontiers of knowledge.

The following consists of a set of notes that I wrote for students in CS 182/282A about adversarial examples and taxonomies. Before designing these notes, I didn’t know very much about this sub-field, so I had to read some of the related literature. I hope these notes are enough to get the gist of the ideas here. There are two main parts: (1) fooling networks with adversarial examples, and (2) taxonomies of adversarial-related stuff.

## Fooling Networks with Adversarial Examples

Two classic papers on adversarial examples in Deep Learning were (Szegedy et al., 2014) followed by (Goodfellow et al., 2015), which were among the first to show that Deep Neural Networks reliably mis-classify appropriately perturbed images that look indistinguishable from “normal” images to the human eye.

More formally, imagine our deep neural network model $f_\theta$ is parameterized by $\theta$. Given a data point $x$, we want to train $\theta$ so that our $f_\theta(x)$ produces some desirable value. For example, if we have a classification task, then we’ll have a corresponding label $y$, and we would like $f_\theta(x)$ to have an output layer such that the maximal component corresponds to the class index from $y$. Adversarial examples, denoted as $\tilde{x}$, can be written as $\tilde{x} = x + \eta$, where $\eta$ is a small perturbation that causes an imperceptible change to the image, as judged by the naked human eye. Yet, despite the small perturbation, $f_\theta(\tilde{x})$ may behave very differently from $f_\theta(x)$, and this has important implications for safety and security if Deep Neural Networks are to be deployed in the real world (e.g., in autonomous driving where $x$ is from a camera sensor).

One of the contributions of (Goodfellow et al., 2015) was to argue that the linear nature of neural networks, and many other machine learning models, makes them vulnerable to adversarial examples. This differed from the earlier claim of (Szegedy et al., 2014) which suspected that it was the complicated nature of Deep Neural Networks that was the source of their vulnerability. (This may sound weird, since neural networks are most certainly not linear. However, many components behave in a linear fashion, and even advanced layers such as LSTMs and convolutional layers are designed to behave linearly or to use linear arithmetic.)

We can gain intuition from looking at a linear model. Consider a weight vector $w$ and an adversarial example $\tilde{x} = x + \eta$, where we additionally enforce the constraint that $\|\eta\|_\infty \le \epsilon$. A linear model $f_w$ might be a simple dot product:

$$w^T\tilde{x} = w^Tx + w^T\eta$$

Now, how can we make $w^T\tilde{x}$ sufficiently different from $w^Tx$? Subject to the $\|\eta\|_\infty \le \epsilon$ constraint, we can set $\eta = \epsilon \cdot {\rm sign}(w)$. That way, all the components in the dot product of $w^T\eta$ are positive, which forcibly increases the difference between $w^Tx$ and $w^T\tilde{x}$.

The equation above contains just a simple linear model, but notice that $\| \eta \|_\infty$ does not grow with the dimensionality of the problem, because the $L_\infty$-norm takes the maximum over the absolute value of the components, and that must be $\epsilon$ by construction. The change caused by the perturbation, though, grows linearly with respect to the dimension $n$ of the problem. Deep Neural Networks are applied on high dimensional problems, meaning that many small changes can add up to make a large change in the output.
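A few lines of NumPy make the dimension argument tangible. This is just an illustration with a made-up weight vector whose entries all have magnitude 0.5, and the function name is mine.

```python
import numpy as np


def worst_case_shift(w, eps):
    """Return the max-norm-bounded perturbation eps * sign(w) and the shift w^T eta."""
    eta = eps * np.sign(w)
    return eta, float(w @ eta)      # the shift equals eps * ||w||_1


eps = 0.01
for n in (10, 1000, 100000):
    w = np.full(n, 0.5)             # made-up weights with average magnitude 0.5
    eta, shift = worst_case_shift(w, eps)
    print(n, np.abs(eta).max(), shift)
```

The printed max-norm of $\eta$ stays at $\epsilon$ for every $n$, while the shift in the output scales as $\epsilon \cdot 0.5 \cdot n$.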

This equation then motivated the Fast Gradient Sign Method (FGSM) as a cheap and reliable method for generating adversarial examples. Letting $J(\theta, x, y)$ denote the cost (or loss) of training the neural network model, the optimal maximum-norm constrained perturbation $\eta$ based on gradients is

$$\eta = \epsilon \cdot {\rm sign}\left( \nabla_x J(\theta, x, y) \right)$$

to produce adversarial example $\tilde{x} = x + \eta$. You can alternatively write $\eta = \epsilon \cdot {\rm sign}\left( \nabla_x \mathcal{L}(f_\theta(x), y) \right)$ where $\mathcal{L}$ is our loss function, if that notation is easier to digest.
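Here is a tiny FGSM sketch on a logistic-regression model, chosen only because its input gradient can be written by hand without an autodiff library. The model, the random data, and $\epsilon = 0.05$ are all assumptions for illustration and have nothing to do with the GoogLeNet experiments.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss(w, b, x, y):
    """Binary cross-entropy of logistic regression, in a numerically stable form."""
    z = w @ x + b
    return float(np.logaddexp(0.0, z) - y * z)


def grad_wrt_input(w, b, x, y):
    """Analytic gradient of the loss with respect to the input x (not the weights)."""
    return (sigmoid(w @ x + b) - y) * w


def fgsm(w, b, x, y, eps):
    """Take one step of size eps along the sign of the input gradient."""
    return x + eps * np.sign(grad_wrt_input(w, b, x, y))


rng = np.random.default_rng(0)
w, b = rng.normal(size=100) / 10.0, 0.0
x, y = rng.normal(size=100), 1.0

x_adv = fgsm(w, b, x, y, eps=0.05)
clean_loss, adv_loss = loss(w, b, x, y), loss(w, b, x_adv, y)
```

No component of the input moves by more than $\epsilon$, yet the loss on the perturbed input is larger than on the clean one.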

A couple of comments are in order:

• In (Goodfellow et al., 2015) they set $\epsilon = 0.007$ and were able to fool a GoogLeNet on ImageNet data. That $\epsilon$ is so small that it actually corresponds to the magnitude of the smallest bit of an 8 bit image encoding after GoogLeNet’s conversion to real numbers.

• In the FGSM, the gradient involved is taken with respect to $x$, and not the parameters $\theta$. Therefore, this does not involve changing weights of a neural network, but adjusting the data, similar to what one might do for adjusting an image to “represent” a certain class. Here, our goal is not to generate images that maximally activate the network’s nodes, but to generate adversarial examples while keeping changes to the original image as small as possible.

• It’s also possible to derive a similar “fast gradient” adversarial method based on the $L_2$, rather than $L_\infty$ norm.

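• It’s possible to use adversarial examples as part of the training data, in order to make the neural network more robust, such as using the following objective:

$$\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J\left(\theta, x + \epsilon \cdot {\rm sign}\left( \nabla_x J(\theta, x, y) \right), y\right)$$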

Some promising results were shown in (Goodfellow et al., 2015), which found that it was better than dropout.

• The FGSM requires knowledge of the model $f_\theta$, in order to get the gradient with respect to $x$.

The last point suggests a taxonomy of adversarial methods. That brings us to our next section.

Two ways of categorizing methods are:

• White-box attacks. The attacker can “look inside” the learning algorithm he/she is trying to fool. The FGSM is a white-box attack method because “looking inside” implies that computing the gradient is possible. I don’t think there is a precise threshold of information that one must have about the model in order for the attack to be characterized as a “white-box attack.” Just view it as when the attacker has more information as compared to a black box attack.

• Black-box attacks. The attacker has no (or little) information about the model he/she is trying to fool. In particular, “looking inside” the learning algorithm is not possible, and this disallows computation of gradients of a loss function. With black boxes, the attacker can still query the model, which might involve providing the model data and then seeing the result. From there the attacker might decide to compute approximate gradients, if gradients are part of the attack.

As suggested above, attack methods may involve querying some model. By “querying,” I mean providing input $x$ to the model $f_\theta$ and obtaining output $f_\theta(x)$. Attacks can be categorized as:

• Zero-query attacks. Those that don’t involve querying the model.

• Query attacks. Those which do involve querying the model. From the perspective of the attacker, it is better to efficiently use few queries, with potentially more queries if their cost is cheap.
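As a rough illustration of how a black-box attacker might "compute approximate gradients" from queries alone, here is a finite-difference sketch. Everything here is hypothetical: the `BlackBox` class stands in for a remote model whose weights the attacker cannot see, and central differences are just one simple estimator (they cost two queries per input dimension, which is why query efficiency matters).

```python
import numpy as np

class BlackBox:
    """Stand-in for a model we can query but not inspect."""
    def __init__(self, w):
        self._w = w            # hidden from the attacker
        self.n_queries = 0

    def __call__(self, x):
        self.n_queries += 1
        return float(1.0 / (1.0 + np.exp(-(self._w @ x))))

def estimate_gradient(query, x, h=1e-4):
    """Approximate the gradient of query(x) using central differences."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        step = np.zeros_like(x, dtype=float)
        step[i] = h
        grad[i] = (query(x + step) - query(x - step)) / (2.0 * h)
    return grad

w_hidden = np.array([0.5, -2.0, 1.5])
box = BlackBox(w_hidden)
x = np.array([0.2, 0.1, -0.3])
g = estimate_gradient(box, x)
print(g, box.n_queries)
```

The estimated gradient can then be plugged into a sign-based attack like the FGSM, at the price of many queries for high-dimensional inputs.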

There are also two types of adversarial examples in the context of classification:

• Non-Targeted Adversarial Examples: Here, the goal is to mislead a classifier to predict any label other than the ground truth. Formally, given image $x$ with ground truth² $f_\theta(x) = y$, a non-targeted adversarial example can be generated by finding $\tilde{x}$ such that

$$f_\theta(\tilde{x}) \ne y \quad \text{and} \quad d(x, \tilde{x}) \le B$$

where $d(\cdot,\cdot)$ is an appropriate distance metric with a suitable bound $B$.

• Targeted Adversarial Examples: Here, the goal is to mislead a classifier to predict a specific target label. These are harder to generate, because extra work must be done. Not only should the adversarial example reliably cause the classifier to misclassify, it must misclassify to a specific class. This might require finding common perturbations among many different classifiers. We can formalize it as

$$f_\theta(\tilde{x}) = \tilde{y} \quad \text{and} \quad d(x, \tilde{x}) \le B$$

where the difference is that the label $\tilde{y} \ne y$ is specified by the attacker.
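The two definitions are easy to state as checks in code. The sketch below is my own illustration: I assume $d(\cdot,\cdot)$ is the $L_\infty$ distance, and the toy `predict` function (an argmax over the input itself) is a made-up classifier.

```python
import numpy as np

def linf_distance(x, x_adv):
    return float(np.max(np.abs(x - x_adv)))

def is_non_targeted(predict, x, x_adv, y, bound, dist=linf_distance):
    """True if x_adv stays within the bound and is labeled anything but y."""
    return dist(x, x_adv) <= bound and predict(x_adv) != y

def is_targeted(predict, x, x_adv, y_target, bound, dist=linf_distance):
    """True if x_adv stays within the bound and is labeled y_target."""
    return dist(x, x_adv) <= bound and predict(x_adv) == y_target

def predict(x):
    """Toy three-class classifier: the label is the largest coordinate."""
    return int(np.argmax(x))

x = np.array([1.0, 0.9, 0.0])        # classified as label 0
x_adv = np.array([0.9, 1.0, 0.0])    # a small perturbation of x
y = predict(x)

print(is_non_targeted(predict, x, x_adv, y, bound=0.15))
print(is_targeted(predict, x, x_adv, y_target=1, bound=0.15))
print(is_targeted(predict, x, x_adv, y_target=2, bound=0.15))
```

Any targeted example with $\tilde{y} \ne y$ also passes the non-targeted check, which is one way to see why the targeted case is the harder one.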

There is now an enormous amount of research on Deep Learning and adversarial methods. One interesting research direction is to understand how attack methods transfer among different machine learning models and datasets. For example, given adversarial examples generated for one model, do other models also misclassify those? If this were true, then an obvious strategy that an attacker could do would be to construct a substitute of the black-box model, and generate adversarial instances against the substitute to attack the black-box system.

A method proposed in (Liu et al., 2017) for generating adversarial examples using ensembles.

The first large-scale study was reported in (Liu et al., 2017), which studied black-box attacks and transferability of non-targeted and targeted adversarial examples.³ The highlights of the results were that (a) non-targeted examples are easier to transfer, as noted above, and (b) it is possible to make even targeted examples more transferable by using a novel ensemble-based method (sketched in the figure above). The authors applied this on Clarifai.com. This is a black-box image classification system, which we can continually feed images and get outputs. The system in (Liu et al., 2017) demonstrated the ability to transfer targeted adversarial examples generated for models trained on ImageNet to the Clarifai.com system, despite an unknown model, training data, and label set.

Whew! Hopefully this provides a blog-post bite-sized overview of adversarial examples and related themes. Here are the papers that I cited (and read) while reviewing this content.

1. Christian Szegedy et al., Intriguing Properties of Neural Networks. ICLR, 2014.

2. Ian Goodfellow et al., Explaining and Harnessing Adversarial Examples. ICLR, 2015.

3. Yanpei Liu et al., Delving into Transferable Adversarial Examples and Black-box Attacks. ICLR 2017.

1. Research papers about adversarial examples are often about generating adversarial examples, or making neural networks more robust to adversarial examples. As of mid-2019, it seems safe to say that any researcher who claims to have designed a neural network that is fully resistant to all adversarial examples is making an irresponsible claim.

2. The formalism here comes from (Liu et al., 2017), and assumes that $f_\theta(x) = y$ (i.e., that the classifier is correct) so as to make the problem non-trivial. Otherwise, there’s less motivation for generating an adversarial example if the network is already wrong.

3. Part of the motivation of (Liu et al., 2017) was that, at that time, it was an open question as to how to efficiently find adversarial examples for a black box model.

# Transformer Networks for State of the Art Natural Language Processing

The Transformer network, introduced by Google researchers in their paper titled “Attention Is All You Need” at NIPS 2017, was perhaps the most groundbreaking result in all of Deep Learning in the year 2017. In part because the CS 182/282A class has a homework assignment on Transformer networks, I finally got around to understanding some of its details, which I will discuss in this post.

Before we proceed, be aware that despite the similar-sounding names, Transformer networks are not the same thing as spatial transformer networks. Transformer networks are deep neural networks that use attention and eschew convolutional and recurrent layers, while spatial transformer networks are convolutional neural networks with a trainable module that enables increased spatial invariance to the input, which might include size and pose changes.

## Introduction and Context

Way back in Fall 2014, I took CS 288 at Berkeley, which is the graduate-level Natural Language Processing course we offer. That was happening in tandem with some of the earlier post-AlexNet Deep Learning results, but we didn’t do any Deep Learning in 288. Fast forward five years later, and I’m sure the course is vastly different nowadays. Actually, I am not even sure it has been offered since I took it! This is probably because Professor Klein has been working at Semantic Machines (which recently got acquired by Microsoft, congratulations to them) and because we don’t really have another pure NLP research faculty member.

I did not do any NLP whatsoever since taking CS 288, focusing instead on computer vision and robotics, which are two other sub-fields of AI which, like NLP, heavily utilize Deep Learning. Thus, focusing on those fields meant NLP was my weakest subject of the “AI triad.” I am trying to rectify that so I can be conversationally fluent with other NLP researchers and know what words like “Transformers” (the subject of this blog post!), “ELMO,” “BERT,” and others mean.

In lieu of Berkeley-related material, I went over to Stanford to quickly get up to speed on Deep Learning and NLP. Fortunately, Stanford’s been on top of this by providing CS 224n, the NLP analogue of the fantastic CS 231n course for computer vision. There are polished lecture videos online for the 2017 edition and then (yes!!) the 2019 edition, replete with accurate captions! I wonder if the people who worked on the captions first ran a Deep Learning-based automatic speech recognizer, and then fine-tuned the end result? Having the 2019 edition available is critical because the 2017 lectures happened before the Transformer was released to the public.

So, what are Transformers? They are a type of deep neural network originally designed to address the problem of transforming one sequence into another. A common example is with machine translation (or “transduction” I suppose — I think that’s a more general umbrella term), where we can ask the network to translate a source sentence into a target sentence.

From reviewing the 224n material, it seems like from the years 2012 to 2017, the most successful Deep Learning approaches to machine translation involved recurrent neural networks, where the input source was passed through an encoder RNN, and then the final hidden state would be used as input to a decoder RNN which then provided the output sequentially. One of the more popular versions of this model was from (Sutskever et al., 2014) at NIPS. This was back when Ilya was at Google, before he became Chief Scientist of OpenAI.

One downside is that these methods rely on the decoder being able to work with a fixed-size, final hidden state of the encoder, which must somehow contain all the relevant information for translation. Some work has addressed that by allowing the decoder to use attention mechanisms to “search backwards” over the earlier hidden states of the encoder. This appears to be the main idea from (Luong et al., 2015).

But work like that still uses an RNN-based architecture. RNNs, including more advanced versions like LSTMs and GRUs, are known to be difficult to train and parallelize. By parallelize, we mean within each sample; we can clearly still parallelize across $B$ samples within a minibatch. Transformer networks can address this issue, as they are not recurrent (or convolutional, for that matter!).

## (Some) Architectural Details

Though the Transformer is billed as a new neural network architecture, it still follows the standard “encoder-decoder” structure that is commonly used in seq2seq models, where the encoder produces a representation of the input sequence, and the decoder autoregressively produces the output. The main building blocks of the Transformer are also not especially exotic, but rather clever arrangements of existing operations that we already know how to do.

Probably the most important building block within the Transformer is the Scaled Dot-Product Attention. Each involves:

• A set of queries $\{q_1, \ldots, q_{n_q}\}$, each of size $d_k$.
• A set of keys $\{k_1, \ldots, k_{n_k}\}$, each of size $d_k$ (same size as the queries).
• A set of values $\{v_1, \ldots, v_{n_v}\}$, each of size $d_v$. In the paper, they set $d_k=d_v$ for simplicity. Also, $n_v = n_k$ because they refer to a set of key-value pairs.

Don’t get too bogged down with what these three mean. They are vectors that we can abstract away. The output of an attention block is the weighted sum of values, where the weights are determined by the queries and keys from a compatibility function. We’ll go through the details soon, but the “scaled dot-product attention” that I just mentioned is that compatibility function. It involves a dot product between a query and a key, which is why we constrain them to be the same size $d_k$.

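Let’s stack these into matrices, which we would have to do for implementation purposes. We obtain:

$$Q = \begin{bmatrix} - & q_1 & - \\ & \vdots & \\ - & q_{n_q} & - \end{bmatrix} \in \mathbb{R}^{n_q \times d_k}, \quad K = \begin{bmatrix} - & k_1 & - \\ & \vdots & \\ - & k_{n_k} & - \end{bmatrix} \in \mathbb{R}^{n_k \times d_k}, \quad V = \begin{bmatrix} - & v_1 & - \\ & \vdots & \\ - & v_{n_v} & - \end{bmatrix} \in \mathbb{R}^{n_v \times d_v}$$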

where individual queries, keys, and values are row vectors within the arrays. The Transformers paper doesn’t specify the exact row/column orientation because I’m sure it doesn’t actually matter in practice; I found the math to be a little easier to write out with treating elements as row vectors. My LaTeX here won’t be as polished as it would be for an academic paper due to MathJax limitations, but think of the above as showing the pattern for how I portray row vectors.

The paper then states that the output of the self-attention layer follows this equation:

$${\rm Attention}(Q, K, V) = {\rm softmax}\left( \frac{QK^T}{\sqrt{d_k}} \right) V$$

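For the sake of saving valuable column space on this blog, I’ll use ${\rm Attn}$ and ${\rm soft}$ for short. Expanding the above, and flipping the pattern in the $K$ matrix (due to the transpose), we get:

$${\rm Attn}(Q, K, V) = {\rm soft}\left( \frac{1}{\sqrt{d_k}} \begin{bmatrix} - & q_1 & - \\ & \vdots & \\ - & q_{n_q} & - \end{bmatrix} \begin{bmatrix} | & & | \\ k_1^T & \cdots & k_{n_k}^T \\ | & & | \end{bmatrix} \right) V = {\rm soft}\left( \frac{1}{\sqrt{d_k}} \begin{bmatrix} q_1 k_1^T & \cdots & q_1 k_{n_k}^T \\ \vdots & \ddots & \vdots \\ q_{n_q} k_1^T & \cdots & q_{n_q} k_{n_k}^T \end{bmatrix} \right) \begin{bmatrix} - & v_1 & - \\ & \vdots & \\ - & v_{n_v} & - \end{bmatrix}$$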

In the above, I apply the softmax operator row-wise. Notice that this is where the compatibility function happens. We are taking dot products between queries and keys, which are scaled by $1/\sqrt{d_k}$, hence why we call it scaled dot-product attention! It’s also natural why we call this “compatible” — a higher dot product means the vectors are more aligned in their components.

The last thing to note for this sub-module is where the weighted sum happens. It comes directly from the matrix-matrix product that we have above. It’s easier to think of this with respect to one query, so we can zoom in on the first row of the query-key dot product matrix. Rewriting that softmax row vector’s elements as $\{w_1^{(1)}, \ldots, w_{n_k}^{(1)}\}$ to simplify the subsequent notation, we get:

$$\begin{bmatrix} w_1^{(1)} & \cdots & w_{n_k}^{(1)} \end{bmatrix} \begin{bmatrix} - & v_1 & - \\ & \vdots & \\ - & v_{n_v} & - \end{bmatrix} = \sum_{i=1}^{n_k} w_i^{(1)} v_i$$

and that is our weighted sum of values. The $w_i^{(1)}$ are scalars, and the $v_i$ are $d_v$-sized vectors.

All of this is for one query, which takes up one row in the matrix. We extend to multiple queries by adding more queries as rows.
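Here is how I would sketch the scaled dot-product attention described above in NumPy, using the same row-vector convention. The shapes and random inputs are made up for illustration; the final lines check that the first output row really is the weighted sum of values for the first query.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); rows are single vectors."""
    d_k = Q.shape[1]
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))   # shape (n_q, n_k)
    return weights @ V, weights                       # output is (n_q, d_v)

rng = np.random.default_rng(0)
n_q, n_k, d_k, d_v = 2, 5, 4, 4
Q = rng.normal(size=(n_q, d_k))
K = rng.normal(size=(n_k, d_k))
V = rng.normal(size=(n_k, d_v))
out, weights = scaled_dot_product_attention(Q, K, V)

# The first output row is the weighted sum of values for the first query.
first_row = sum(weights[0, i] * V[i] for i in range(n_k))
print(np.allclose(out[0], first_row))
```

Adding more queries only adds more rows to `Q`, `weights`, and `out`; nothing else changes.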

Once you understand the above, the extensions they add onto this are a little more straightforward. For example, they don’t actually use the $Q$, $K$, and $V$ matrices directly, but linearly project them using trainable parameters, which is exactly what a dense layer does. But understanding the above, you can abstract the linear projection part away. They also perform several of these in parallel and then concatenate across one of the axes, but again, that part can be abstracted away easily.
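For completeness, here is a rough sketch of the "project, attend in parallel, concatenate" idea. The parameter names (`W_q`, `W_k`, `W_v`, `W_o`) and sizes are my own placeholders, the weights are random rather than trained, and I include the final output projection that the paper applies after concatenation.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention with a row-wise softmax."""
    scores = Q @ K.T / np.sqrt(Q.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

def multi_head_attention(X_q, X_kv, params):
    """Run one attention per head on projected inputs, then concatenate."""
    heads = [
        attention(X_q @ W_q, X_kv @ W_k, X_kv @ W_v)
        for (W_q, W_k, W_v) in params["heads"]
    ]
    return np.concatenate(heads, axis=1) @ params["W_o"]

rng = np.random.default_rng(0)
n, d_model, n_heads = 3, 8, 2
d_head = d_model // n_heads
params = {
    # One (W_q, W_k, W_v) triple of projection matrices per head.
    "heads": [
        tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
        for _ in range(n_heads)
    ],
    "W_o": rng.normal(size=(n_heads * d_head, d_model)),
}
X = rng.normal(size=(n, d_model))
out = multi_head_attention(X, X, params)   # self-attention: all inputs are X
print(out.shape)
```

Because each head works in a smaller dimension (`d_model / n_heads`), the total cost is similar to a single full-sized attention.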

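The above happens within one module of an encoder or decoder layer. The second module of it is a plain old fully connected layer (really two linear transformations with a ReLU in between):

$${\rm FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$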

They then wrap both the self-attention part and the fully connected portion with a residual connection and layer normalization. Whew! All of this gets repeated for however many layers we can stomach. They used six layers for the encoder and decoder.
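The wrapping can be sketched as `LayerNorm(x + Sublayer(x))`, which is how the paper writes it. In the sketch below the layer normalization has no learned gain or bias (a simplification on my part), and the sizes and random weights are placeholders.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Two linear maps with a ReLU in between, applied to every row."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def sublayer(x, fn):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + fn(x))

rng = np.random.default_rng(0)
n, d_model, d_ff = 3, 8, 32
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
x = rng.normal(size=(n, d_model))
y = sublayer(x, lambda h: feed_forward(h, W1, b1, W2, b2))
print(y.shape)
```

The same `sublayer` wrapper would be applied around the attention module, so one encoder layer is just two wrapped calls in a row.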

Finally, while I won’t expand upon it in this blog post, they rely on positional embeddings in order to make use of the sequential nature of the input. Recurrent neural networks, and even convolutional neural networks, maintain information about the ordering of the sequence. Positional embeddings are thus a way of forcing the network to recognize this ordering in the absence of those two types of layers.
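For the curious, the paper's fixed (non-learned) variant uses sines and cosines of different frequencies, which can be sketched as follows. I am assuming an even `d_model`, and the sizes below are arbitrary.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i / d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_positions)[:, None]          # shape (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # shape (1, d_model / 2)
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even columns
    pe[:, 1::2] = np.cos(angles)                   # odd columns
    return pe

pe = positional_encoding(n_positions=10, d_model=16)
print(pe.shape)
```

These vectors are added to the input embeddings, so two identical words at different positions end up with different representations.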

## Experiments and Results

Since I do not do NLP research, I am not sure about what benchmarks researchers use, which is why I want to understand their experiments and results. Here are the highlights:

• They performed two machine translation tasks: English-to-German and English-to-French. Both use “WMT 2014” datasets. Whatever those are, they must presumably be the standard for machine translation research. The former consists of 4.5 million sentence pairs, while the latter is much larger at 36 million sentence pairs.

• The metric they use is BLEU, the same one that NLP-ers have long been using for translation, and they get higher numbers than prior results. As I explained in my blog post about CS 288, I was not able to understand the spoken utterances of Professor Klein, but I think I remember him saying something like this about BLEU: “nobody likes it, everybody uses it”. Oh well. You can see the exact BLEU scores in the paper. Incidentally, they average across several checkpoints for their actual model. I think this is prediction averaging rather than parameter averaging but I am not sure. Parameter averaging could work in this case because the parameters are relatively constrained in the same subspace if we’re talking about consecutive snapshots.

• They train their model using 8 NVIDIA P100 GPUs. I understand that Transformers are actually far more compute efficient (with respect to floating point operations) than prior state of the art models, but man, AI is really exacerbating compute inequality.

• WOW, Table 3 shows results from varying the Transformer model architecture. This is nice because it shows that Google has sufficiently explored variations on the model, and the one they present to us is the best. I see now why John Canny once told me that this paper was one of the few that met his criteria of “sufficient exploration.”

• Finally, they generalize the Transformer to the task of English constituency parsing. It doesn’t look like they achieve SOTA results, but it’s close.

Needless to say, the results in this paper are supremely impressive.

## The Future

This paper has 1,393 citations at the time of me writing this, and it’s only been on arXiv since June 2017! I wonder how much of the results are still SOTA. I am also curious about how Transformers can be “plugged into” other architectures. I am aware of at least two notable NLP architectures that use Transformers:

• BERT, introduced in the 2018 paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. It’s another Google product, but one which at the time of publication had noticeable SOTA results. You can see one of Jay Alammar’s excellent “illustrations” of it here. It looks like I have another blog added to my reading list!

• GPT-2, released in February 2019 in the form of a blog post and a preprint in ICML format (but I’m not sure if it was actually submitted to ICML, since OpenAI appears to have changed their recent publication habits). Oh yeah. Oh yeah. I have no direct comment on the controversy this created, apart from worrying that the media will continue with their click-baiting headlines that don’t accurately reveal the context and limitations of results.

I’m sure there are so many other applications of Transformers. I wonder how often Transformers have been used in robotics?

# Better Development Environment: Git Branch in bashrc and Virtualenvwrapper

This week is UC Berkeley’s Spring Break. Obviously I’m not on vacation, because I actually need Spring Break to catch up on research due to how much time my GSI duties are consuming; they are taking 25-ish hours a week if I’m tracking my work hours correctly. (I’m not sure if students realize this, but teaching assistants, and probably professors, look forward to the Spring Break just as much as they do.) But I figured that since the week would give me leeway to step off the gas pedal just a tad bit, I should use the time to tinker and improve my development environment. Broadly speaking, this means improving how I code, since my normal daily activity involves lots and lots of programming. Therefore, a proper investment in a development environment should pay for itself.

Almost two years ago, I wrote a blog post stating how I organized my GitHub repositories. That’s technically part of my development environment, but perhaps more important than GitHub repository organization is how I design my .bashrc files (or .bash_profile if I use my Mac OS X laptop) and how I organize my Python virtual environments. Those two are the subject of this blog post.

After reading my fellow lab-mate Roshan Rao’s development environment on GitHub, I realized that I was missing out on a key feature: having the git branch explicitly listed in the command line, like this where the text (master) is located after the name of this repository:

danielseita@dhcp-46-186:~/DanielTakeshi.github.io (master) $ ls
Gemfile       _config.yml  _posts  about.md    css         new-start-here.md
Gemfile.lock  _includes    _sass   archive.md  feed.xml    subscribe.md
README.md     _layouts     _site   assets      index.html
danielseita@dhcp-46-186:~/DanielTakeshi.github.io (master) $

Having the branch listed this way is helpful because otherwise it is easy to forget which one I’m on when coding across multiple projects and multiple machines, and I’d rather not have to keep typing git status to reveal the branch name. Branches are useful for a variety of reasons. In particular, I use them for testing new features and fixing bugs, while keeping my master branch stable for running experiments. Though, even that’s not technically true: I’ll often make “archived” branches solely for the purpose of reproducing experiments. Finally, the value of branches for a project arguably increases when there are multiple collaborators, since each can work on their own feature and then make a pull request from their branch.

All that’s needed for the above change are a couple of lines in the .bashrc file, which I copied from Roshan earlier. No extra “plugins” or fancy installations are needed. Incidentally, I am putting my development environment online, and you can see the relevant changes in .bashrc. I used to have a vimrc repository, but I am deprecating that in favor of putting my vimrc details in the development environment. Note that I won’t be discussing vimrc in this post.

There was one annoying issue after implementing the above fix. I noticed that calling an alias which activates a virtualenv caused the git branch detail to disappear, and additionally reverted some of the .bashrc color changes. To be clear on the “alias” part, as of yesterday (heh!) I organized my Python virtual environments (“virtualenvs”) by saving them all in a directory like this:

danielseita@dhcp-46-186:~ $ ls -lh seita-venvs/
drwxr-xr-x  7 danielseita  staff   224B Feb  4 21:18 py2-cs182
drwxr-xr-x  7 danielseita  staff   224B Mar  8 18:09 py2-cython-tests
drwxr-xr-x  8 danielseita  staff   256B Mar 12 21:45 py3-hw3-cs182
drwxr-xr-x  7 danielseita  staff   224B Mar  8 13:36 py3.6-mujoco
danielseita@dhcp-46-186:~ $

and then putting this in my .bashrc file:

alias  env182='source ~/seita-venvs/py2-cs182/bin/activate'
alias  cython='source ~/seita-venvs/py2-cython-tests/bin/activate'
alias hw3_182='source ~/seita-venvs/py3-hw3-cs182/bin/activate'
alias  mujoco='source ~/seita-venvs/py3.6-mujoco/bin/activate'

So that, for example, if I wanted to go to my environment which I use to test Python MuJoCo bindings, I just type in mujoco on the command line. Otherwise, it’s really annoying to have to type in source <env_path>/bin/activate each time!

But of course people must have settled on a solution to avoiding that. During the process of trying to fix the puzzling issue above of aliases “resetting” some of my .bashrc changes, I came across virtualenvwrapper, a package to manage multiple virtualenvs. I’m embarrassed that I didn’t know about it before today!

Now, instead of doing the silly thing with all the aliases, I installed virtualenvwrapper, adjusted my ~/.bashrc accordingly, and made a new directory ~/Envs in my home level directory which now stores all my environments. Then, I create virtualenvs which are automatically inserted in ~/Envs/. For example, the Python2 virtualenv that I use for testing CS 182 homework could be created like this:

mkvirtualenv py2-cs182

I often put in py2 and similar stuff to prefix the virtualenv to make it easy to remember what Python versions I’m using. Nowadays, I normally need Python3 environments, because Python2 is finally going to be deprecated as of January 1, 2020. Let’s dump Python 2, everyone!!

For Python 3 virtualenvs, the following command works for both Mac OS X and Ubuntu systems:

mkvirtualenv --python=$(which python3) py3.6-mujoco

To switch my virtualenv, all I need to do is type workon <ENV>, and this will tab-complete as needed. And fortunately, it seems like this will preserve the colors and git branch changes from my .bashrc file changes.

One weakness of the virtualenvwrapper solution is that it does require a global pip installation for installing it in the first place, via:

pip install --user virtualenvwrapper

so that you can even start using it. I’m not sure how else to do this without having someone install pip with sudo access. But that should be the only requirement. The --user option means virtualenvwrapper is only installed locally for you, but it also means you have to change your .bashrc to source the local directory, not the global one. It works, but I wish there was an easy option that didn’t require a global pip installation.

As a second aside, I also have a new .bash_aliases file that I use for aliases, mostly for ssh-ing into machines without having to type the full ID, but I’ll be keeping that private for obvious reasons. Previously, I would put them in my .bashrc, but for organizational purposes, it now makes sense to put them in a separate file.

Whew. I don’t know how I managed to work without these changes above. I use multiple machines for research, so I’ll be converting all of my user accounts and file organization to follow the above.

Of course, in about six more months, I will probably be wondering how I managed to work without features XYZ … there’s just so much to learn, yet time is just so limited.

# Forward and Backward Convolution Passes as Matrix Multiplication

As part of my CS 182/282A GSI duties, I have been reviewing the homework assignments and the CS 231n online notes. I don’t do the entire assignments, as that would take too much time away from research, but I do enough to advise the students. Incidentally, I hope they enjoy having me as a GSI! I try to pepper my discussion sections with lots of humor, and I also explain how I “think” through certain computations. I hope that is helpful.

One reason why I’m a huge fan of CS 231n is that, more than two years ago, I stated that their assignments (which CS 182/282A uses) are the best way to learn how backpropagation works, and I still stand by that comment today. In this post, I’d like to go through the backwards pass for a 1D convolution in some detail, and frame it in the lens of matrix multiplication.

My main reference for this will be the CS 231n notes. They are still highly relevant today. (Since they haven’t been updated in a while, I hope this doesn’t mean that Deep Learning as a field has started to “stabilize” …) With respect to the convolution operator, there are two main passages in the notes that interest me. The first explains how to implement convolutions as matrix multiplication:

Implementation as Matrix Multiplication. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows: […]

This allows convolutions to utilize fast, highly-optimized matrix multiplication libraries.

The second relevant passage from the 231n notes mentions how to do the backward pass for a convolution operation:

Backpropagation. The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). This is easy to derive in the 1-dimensional case with a toy example (not expanded on for now).

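As usual, I like to understand these through a simple example. Consider a 1D convolution where we have input vector $\begin{bmatrix}x_1 & x_2 & x_3 & x_4\end{bmatrix}^T$ and a filter with three weights $w_1$, $w_2$, and $w_3$. With a stride of 1 and a padding of 1 on the input, we can implement the convolution operator using the following matrix-vector multiply:

$$\begin{bmatrix} w_1 & w_2 & w_3 & 0 & 0 & 0 \\ 0 & w_1 & w_2 & w_3 & 0 & 0 \\ 0 & 0 & w_1 & w_2 & w_3 & 0 \\ 0 & 0 & 0 & w_1 & w_2 & w_3 \end{bmatrix} \begin{bmatrix} 0 \\ x_1 \\ x_2 \\ x_3 \\ x_4 \\ 0 \end{bmatrix} = \begin{bmatrix} o_1 \\ o_2 \\ o_3 \\ o_4 \end{bmatrix}$$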

or more concisely, $W\mathbf{x}' = \mathbf{o}$ where $\mathbf{x}' \in \mathbb{R}^6$ is the padded version of $\mathbf{x} \in \mathbb{R}^4$.
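Here is a quick NumPy sketch of that matrix-vector view of the forward pass. The helper name `conv_matrix` and the specific numbers are my own; the check against `np.correlate` confirms that the matrix product matches a sliding-window dot product.

```python
import numpy as np

def conv_matrix(w, n_in, pad=1):
    """Build W so that W @ padded_x is a stride-1 1D convolution."""
    k = len(w)
    n_padded = n_in + 2 * pad
    n_out = n_padded - k + 1
    W = np.zeros((n_out, n_padded))
    for i in range(n_out):
        W[i, i:i + k] = w        # each row is the filter shifted by one slot
    return W

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0])
W = conv_matrix(w, n_in=len(x))   # shape (4, 6)
x_padded = np.pad(x, 1)           # zero-pad one element on each side
o = W @ x_padded

print(o)
print(np.correlate(x_padded, w, mode="valid"))   # same numbers
```

As in most deep learning libraries, the "convolution" here does not flip the filter, so strictly speaking it is a cross-correlation.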

As an aside, you can consider what happens with a “transposed” version of the $W$ matrix. I won’t go through the details, but it’s possible to have the matrix-vector multiply be an operator that increases the dimension of the output. (Normally, convolutions decrease the spatial dimension(s) of the input, though they keep the depth consistent.) Justin Johnson calls these “transposed convolutions” for the reason that $W^T$ can be used to implement the operator. Incidentally, he will start as an assistant professor at the University of Michigan later this fall – congratulations to him!

In the backwards pass with loss function $L$, let’s suppose we’re given some upstream gradient, so $\frac{\partial L}{\partial o_i}$ for all components in the output vector $\mathbf{o}$. How can we do the backwards pass for the weights and then the data?

Let’s go through the math. I will now assume the relevant vectors and their derivatives w.r.t. $L$ are row vectors (following the CS 182/282A notation). The other way around shouldn’t matter; we’d just flip the order of multiplication to be matrix-vector rather than vector-matrix.

We have:

and

Recall that in our example, $\mathbf{x} \in \mathbb{R}^4$ and $\mathbf{w} \in \mathbb{R}^3$. Their gradients must have the same shapes, since the loss is a scalar.

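Notice that all the elements in the Jacobians above are from trivial dot products. For example:

$$\frac{\partial o_1}{\partial x_1} = \frac{\partial}{\partial x_1} \left( w_1 \cdot 0 + w_2 x_1 + w_3 x_2 \right) = w_2$$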

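By repeating this process, we end up with:

$$\frac{\partial L}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial L}{\partial o_1} & \frac{\partial L}{\partial o_2} & \frac{\partial L}{\partial o_3} & \frac{\partial L}{\partial o_4} \end{bmatrix} \begin{bmatrix} w_2 & w_3 & 0 & 0 \\ w_1 & w_2 & w_3 & 0 \\ 0 & w_1 & w_2 & w_3 \\ 0 & 0 & w_1 & w_2 \end{bmatrix}$$

and

$$\frac{\partial L}{\partial \mathbf{w}} = \begin{bmatrix} \frac{\partial L}{\partial o_1} & \frac{\partial L}{\partial o_2} & \frac{\partial L}{\partial o_3} & \frac{\partial L}{\partial o_4} \end{bmatrix} \begin{bmatrix} 0 & x_1 & x_2 \\ x_1 & x_2 & x_3 \\ x_2 & x_3 & x_4 \\ x_3 & x_4 & 0 \end{bmatrix}$$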

Upon seeing the two above operations, it should now be clear why these are viewed as convolution operations. In particular, they’re convolutions where the previous incoming (or “upstream” in CS231n verbiage) gradients act as the input, and the Jacobian encodes the convolution operator’s “filters.” If it helps, feel free to transpose the whole thing above to get it in line with my matrix-vector multiply near the beginning of the post.

Now, why does 231n say that filters are “spatially flipped?” It’s easiest to draw this out on pencil and paper by looking at how the math works out for each component in the convolution. Let’s look at the computation for $\frac{\partial L}{\partial \mathbf{x}}$. Imagine the vector $\frac{\partial L}{\partial \mathbf{o}}$ as input to the convolution. The vector-matrix multiply above will result in a filter from $\mathbf{w}$ sliding through from left-to-right (i.e., starting from $\frac{\partial L}{\partial o_1}$) but with the filter actually in reverse: $(w_3,w_2,w_1)$. Technically, the input actually needs to be padded by 1, and the stride for the filter is 1.

For $\frac{\partial L}{\partial \mathbf{w}}$, the filter is now from $\mathbf{x}$. This time, while the filter itself is in the same order, as in $(x_1,x_2,x_3,x_4)$, it is applied in reverse, from right-to-left on the input vector, so the first computation is for $\frac{\partial L}{\partial o_4}$. I assume that’s what the notes mean as “spatially flipped” though it feels a bit misleading in this case. Perhaps I’m missing something? Again, note that we pad 1 on the input and use a stride of 1 for the filter.
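To double check the bookkeeping, here is a small NumPy sketch that computes both gradients twice: once with the explicit Jacobians for this toy example, and once as convolutions over the zero-padded upstream gradient. The numbers for `x`, `w`, and the upstream gradient `d_o` are arbitrary.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -1.0, 2.0])
d_o = np.array([0.1, -0.2, 0.3, 0.4])   # upstream gradient dL/do (made up)

# Jacobians of o with respect to x and w for the padded, stride-1 example.
J_x = np.array([[w[1], w[2], 0.0,  0.0],
                [w[0], w[1], w[2], 0.0],
                [0.0,  w[0], w[1], w[2]],
                [0.0,  0.0,  w[0], w[1]]])
J_w = np.array([[0.0,  x[0], x[1]],
                [x[0], x[1], x[2]],
                [x[1], x[2], x[3]],
                [x[2], x[3], 0.0]])
dL_dx = d_o @ J_x    # row vector times Jacobian
dL_dw = d_o @ J_w

# The same results as convolutions over the padded upstream gradient.
d_o_padded = np.pad(d_o, 1)
# For x: slide the reversed filter (w3, w2, w1) from left to right.
dx_conv = np.correlate(d_o_padded, w[::-1], mode="valid")
# For w: slide x itself, then read the outputs from right to left.
dw_conv = np.correlate(d_o_padded, x, mode="valid")[::-1]

print(np.allclose(dL_dx, dx_conv), np.allclose(dL_dw, dw_conv))
```

The `[::-1]` appears in a different place for each gradient: on the filter for the data gradient, and on the output ordering for the weight gradient, which matches the two "flips" discussed above.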

In theory, generalizing to 2D is, as Professor John Canny has repeatedly said both to me individually and the CS 182/282A class more broadly, just a matter of tracking and then regrouping indices. In practice, unless you’re able to track indices as well as he can, it’s very error-prone. Be very careful. Ideally this is what the CS 182/282A students did to implement the backwards pass, rather than resort to looking up a solution online from someone’s blog post.

# Batch Constrained Deep Reinforcement Learning

An interesting paper that I am reading is Off-Policy Deep Reinforcement Learning without Exploration. You can find the latest version on arXiv, where it clearly appears to be under review for ICML 2019. An earlier version was under review at ICLR 2019 under the earlier title Where Off-Policy Deep Reinforcement Learning Fails. I like the research contribution of the paper, as it falls in line with recent work on how to make deep reinforcement learning slightly more practical. In this case, “practical” refers to how we have a batch of data, from perhaps a simulator or an expert, and we want to train an agent to learn from it without exploration, which would do wonders for safety and sample efficiency.

As is clear from the abstract, the paper introduces the batch-constrained RL algorithm:

We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.

This is clear. We want the set of states the agent experiences to be similar to the set of states from the batch, which might be from an expert (for example). This reminded me of the DART paper (expanded in a BAIR Blog post) that the AUTOLAB developed:

• DART is about applying noise to expert states, so that behavior cloning can see a “wider” distribution of states. This was an imitation learning paper, but the general theme of increasing the variety of states seen has appeared in past reinforcement learning research.
• This paper, though, is about restricting the actions so that the states the agent sees match those of the expert by virtue of taking similar actions.

Many of the most successful modern (i.e., “deep”) off-policy algorithms use some variant of experience replay, but the authors claim that this only works when the data in the buffer is correlated with the data induced by the current agent’s policy. This does not work if there is what the authors define as extrapolation error, which is when there is a mismatch between the two datasets. Yes, I agree. Though experience replay is actually designed to break correlation among samples, the most recent information is put into the buffer, bumping older stuff out. By definition, that means some of the data in the experience replay is correlated with the agent’s policy.

But more generally, we might have a batch of data where nothing came from the current agent’s policy. The more I think about it, the more an action restriction makes sense. With function approximation, unseen state-action pairs $(s,a)$ might be more or less attractive than seen pairs. But, aren’t there more ways to be bad than there are to be good? That is, it’s easy to get terrible reward in environments, but harder to get the highest reward, which one can verify by mathematically assigning the probabilities of each random sequence of actions. This paper is about restricting the actions so that we keep funneling the agent towards the high-quality states in the batch.

To be clear, here’s what “batch reinforcement learning” means, and its advantages:

Batch reinforcement learning, the task of learning from a fixed dataset without further interactions with the environment, is a crucial requirement for scaling reinforcement learning to tasks where the data collection procedure is costly, risky, or time-consuming.

You can also view this through the lens of imitation learning, because the simplest form, behavior cloning, does not require environment interaction.1 Furthermore, one of the fundamental aspects of reinforcement learning is precisely environment interaction! Indeed, this paper benchmarks with behavior cloning, and freely says that “Our algorithm offers a unified view on imitation and off-policy learning.”2

Let’s move on to the technical and algorithmic contribution, because I’m rambling too much. Their first foray is to try and redefine the Bellman operator in finite, discrete MDPs in the context of reducing extrapolation error so that the induced policy will visit the state-action pairs that more closely correspond with the distribution of state-action pairs from the batch.

A summary of the paper’s theory is that batch-constrained learning still converges to an optimal policy for deterministic MDPs. Much of the theory involves redefining or inducing a new MDP based on the batch, and then deferring to standard Q-learning theory. I wish I had time to go through the papers that this one references, such as this old 2000 paper.

For example, the paper claims that normal Q-learning on the batch of data will result in an optimal value function for an alternative MDP, $M_{\mathcal{B}}$, based on the batch $\mathcal{B}$. A related and important definition is the tabular extrapolation error $\epsilon_{\rm MDP}$, defined as the discrepancy between the value function computed with the batch versus the value function computed with the true MDP $M$:

This can be computed recursively using a Bellman-like equation (see the paper for details), but it’s easier to write as:

By using the above, they are able to derive a new algorithm: Batch-Constrained Q-learning (BCQL) which restricts the possible actions to be in the batch:
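As a rough illustration of that restriction, here is a minimal tabular sketch. It is not the paper's code: the function name `bcql`, the transition format `(s, a, r, s_next, done)`, the hyperparameter defaults, and the choice to bootstrap with zero when a next state has no actions in the batch are all assumptions made for this example. The only difference from ordinary Q-learning is that the max in the target ranges over actions that actually appear with the next state in the batch.

```python
from collections import defaultdict


def bcql(batch, gamma=0.9, alpha=0.5, sweeps=200):
    """Tabular Q-learning whose max is restricted to (s', a') pairs in the batch."""
    # Record which actions were actually taken in each state.
    seen_actions = defaultdict(set)
    for s, a, r, s_next, done in batch:
        seen_actions[s].add(a)

    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, done in batch:
            candidates = seen_actions.get(s_next, set())
            if done or not candidates:
                # Assumption: no bootstrapping when s' is terminal or unseen.
                target = r
            else:
                target = r + gamma * max(q[(s_next, a2)] for a2 in candidates)
            q[(s, a)] = (1 - alpha) * q[(s, a)] + alpha * target
    return q


# A tiny hypothetical batch on a 3-state chain. The action "left" is never
# taken in state 0, so it never enters any max.
batch = [
    (0, "right", 0.0, 1, False),
    (1, "right", 1.0, 2, True),
    (1, "left", 0.0, 0, False),
]
q = bcql(batch)
```

Because unseen pairs such as `(0, "left")` never enter a target, their (arbitrary) values under function approximation cannot leak into the learned Q-function.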

Next, let’s introduce their practical algorithm for high-dimensional, continuous control: Batch-Constrained deep Q-learning (BCQ). It utilizes four parameterized networks.

• A Generative model $G_\omega(s)$ which, given the state as input, produces an action. Using a generative model this way assumes we pick actions using:

or in other words, the most likely action given the state, with respect to the data in the batch. This is difficult to model in high dimensional continuous control environments, so they approximate it with a variational autoencoder. This is trained along with the policy parameters during each for loop iteration.

• A Perturbation model $\xi_\phi(s,a,\Phi)$ which aims to “optimally perturb” the actions, so that they don’t need to sample too much from $G_\omega(s)$. The perturbation applies noise in $[-\Phi,\Phi]$. It is updated via a deterministic policy gradient rule:

The above is a maximization problem over a sum of Q-function terms. The Q-function is differentiable as we parameterize it with a deep neural network, and stochastic gradient descent methods will work with stochastic inputs. I wonder, is the perturbation model overkill? Is it possible to do a cross entropy method, like what two of these papers do for robotic grasping?

• Two Q-networks $Q_{\theta_1}(s,a)$ and $Q_{\theta_2}(s,a)$, to help push their policy to select actions that lead to “more certain data.” They used that in their ICML paper last year, so I’ll have to read through the details of that paper to fully understand.

All networks other than the generative model also have target networks, following standard DDPG practices.

All together, their algorithm uses this policy:

To be clear, they approximate this maximization by sampling $n$ actions each time step, and picking the best one. The perturbation model, as stated earlier, increases the diversity of the sampled actions. Once again, it would be nice to confirm that this is necessary, such as via an experiment that shows the VAE collapses to a mode. (I don’t see justification in the paper or the appendix.)

There is a useful interpretation of how this algorithm is a continuum between behavior cloning (if $n=1$ and $\Phi=0$) and Q-learning ($n\to \infty$ and $\Phi \to a_{\rm max}-a_{\rm min}$).
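The sample-then-pick procedure can be sketched in a few lines. This is only a toy under stated assumptions: actions are scalars, and `generate`, `perturb`, and `q_value` are hypothetical hand-written stand-ins for the VAE, the perturbation network, and a Q-network, none of which are trained here.

```python
import random


def select_action(state, generate, perturb, q_value, n=10, phi=0.05, rng=None):
    """Sample n candidates, perturb each within [-phi, phi], return the best by Q."""
    rng = rng or random.Random(0)
    best_action, best_q = None, float("-inf")
    for _ in range(n):
        a = generate(state, rng)
        # Clip the perturbation so it stays inside [-phi, phi].
        delta = max(-phi, min(phi, perturb(state, a)))
        candidate = a + delta
        value = q_value(state, candidate)
        if value > best_q:
            best_action, best_q = candidate, value
    return best_action


# Hypothetical stand-ins, chosen only so the example runs.
def generate(state, rng):
    """Pretend generative model: noisy copy of a behavior action."""
    return rng.gauss(0.5 * state, 0.1)


def q_value(state, action):
    """Pretend Q-function that peaks at action = 0.8 * state."""
    return -(action - 0.8 * state) ** 2


def perturb(state, action):
    """Pretend perturbation model that nudges actions toward the Q peak."""
    return 0.8 * state - action


cloned = select_action(1.0, generate, perturb, q_value, n=1, phi=0.0,
                       rng=random.Random(7))
greedy = select_action(1.0, generate, perturb, q_value, n=50, phi=0.05,
                       rng=random.Random(7))
```

With `n=1` and `phi=0.0` the function simply returns one sample from the generative model, which is the behavior cloning end of the continuum. Increasing `n` and `phi` lets the Q-function take over more of the decision.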

All right, that was their theory and algorithm — now let’s discuss the experiments. They test with DDPG under several different conditions. They assume that there is a “behavioral DDPG” agent which generates the batch of data, from which an “off-policy DDPG” agent learns, without exploration. Their goal is to improve the learning of the “off-policy DDPG.” (Don’t get confused with the actor-critic framework of normal DDPG … just think of the behavioral DDPG as the thing that generates the batch in “batch-constrained RL.”)

• Final Buffer. They train the behavioral DDPG agent from scratch for 1 million steps, adding more noise than usual for extra exploration. Then all of its experience is pooled inside an experience replay. That’s the “batch”. Then, they use it to train the off-policy DDPG agent. That off-policy agent does not interact with the environment — it just draws samples from the buffer. Note that this will result in widespread state coverage, including potentially the early states when the behavioral agent was performing poorly.

• Concurrent. This time, as the behavioral DDPG agent learns, the off-policy one learns as well, using data from the behavioral agent. Moreover, the original behavioral DDPG agent is also learning from the same data, so both agents learn from identical datasets (though, due to minibatch noise, it’s not exactly the same each minibatch…).

• Imitation. After training the behavioral DDPG agent, they run it for 1 million steps. Those experiences are added to the buffer, from which the off-policy DDPG agent learns. Thus, this is basically the imitation learning setting.

• Imperfect Demonstrations. This is the same as the “imitation” case, except some noise is added to the data, through Gaussian noise on the states and randomness in action selection. Thus, it’s like adding more coverage to the expert data.

The experiments use … MuJoCo. Argh, we’re still using it as a benchmark. They test with HalfCheetah-v1, Hopper-v1, and Walker2d-v1. Ideally there would be more, at least in the main part of the paper. The Appendix has some limited Pendulum-v0 and Reacher-v1 results. I wonder if they tried on Humanoid-v1.

They actually performed some initial experiments before presenting the theory, which justifies the need to correct for extrapolation error. The most surprising fact there was that the off-policy DDPG agent failed to match the behavioral agent even in the concurrent learning paradigm, where I think the only differences are the policy initialization and the randomness inherent in each minibatch. I did not expect that!

This was what motivated their Batch-Constrained deep Q-learning (BCQ) algorithm, discussed above.

As for their results, I am a little confused after reading Figure 2. They say that:

Only BCQ matches or outperforms the performance of the behavioral policy in all tasks.

Being color-blind, the BCQ and VAE-BC colors look indistinguishable to me. (And the same goes for the DQN and DDPG baselines, which look like they are orange and orange, respectively.) I wish there was better color contrast, perhaps with light purple and dark blue for the former, and yellow and red for the latter. Oh well. I assume that their BCQ curve is the highest one on the rewards plot … but this means it’s not that much better than the baselines on Hopper-v1 except for the imperfect demonstrations task. Furthermore, the shaded area is only half of a standard deviation, rather than one. Finally, in the imitation task, simple behavior cloning was better. So, it’s hard to tell if these are truly statistically significant results.

While I wish the results were more convincing, I still buy the rationale of their algorithm, and that it is beneficial under the right circumstances.

1. More advanced forms of imitation learning might require substantial environment interaction, such as Generative Adversarial Imitation Learning. (My blog post about that paper is here.)

2. One of the ICLR reviewers brought up that this is more of an imitation learning algorithm than it is a reinforcement learning one …

# Deep Learning and Importance Sampling Review

This semester, I am a Graduate Student Instructor for Berkeley’s Deep Learning class, now numbered CS 182/282A. I was last a GSI in fall 2016 for the same course, so I hope my teaching skills are not rusty. At least I am a GSI from the start, and not an “emergency appointment” like I was in fall 2016. I view my goal as helping Professor Canny stuff as much Deep Learning knowledge into the students as possible so that they can use the technology to be confident, go forth, and change the world!

All right, that was cheesy, and admittedly there is a bit too much hype. Nonetheless, Deep Learning has been a critical tool in a variety of my past and current research projects, so my investment in learning the technology over the last few years has paid off. I have read nearly the entire Deep Learning textbook, but for good measure, I want to officially finish digesting everything from the book. Thus, (most of) my next few blog posts will be technical, math-oriented posts that chronicle my final journey through the book. In addition, I will bring up related subjects that aren’t explicitly covered in the book, including possibly some research paper summaries.

Let’s start with a review of Chapter 17. It’s about Monte Carlo sampling, the general idea of using samples to approximate some value of interest. This is an extremely important paradigm, because in many cases sampling is the best (or even only) option we have. A common way that sampling arises in Deep Learning is when we use minibatches to approximate a full-data gradient. And even for that, the full data gradient is really one giant minibatch, as Goodfellow nicely pointed out on Quora.

More formally, assume we have some discrete, vector-valued random variable $\bf{x}$ and we want the following expectation:

$$s = \sum_{x} p(x) f(x) = \mathbb{E}_p[f(\mathbf{x})]$$

where $x$ indicates the possible values (or “instantiations” or “realizations” or … you get the idea) of random variable $\bf{x}$. The expectation $\mathbb{E}$ is taken “under the distribution $p$” in my notation, where $p$ must clearly satisfy the definition of being a (discrete) probability distribution. This just means that $\bf{x}$ is sampled based on $p$.

This formulation is broad, and I like thinking in terms of examples. Let’s turn to reinforcement learning. The goal is to find some parameter $\theta^* \in \Theta$ that maximizes the objective function

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[R(\tau)\big]$$

where $\tau$ is a trajectory induced by the agent’s policy $\pi_\theta$; that probability is $\pi_\theta(\tau) = p(s_1,a_1,\ldots,s_T,a_T)$, and $R(\tau) = \sum_{t=1}^T R(s_t,a_t)$. Here, the objective plays the role of $\mathbb{E}_p[f(\bf{x})]$ from earlier with the trajectory $\tau$ as the vector-valued random variable.

But how would we exactly compute $J(\theta)$? The process would require us to explicitly enumerate all possible trajectories that could possibly arise from the environment emulator, and then weigh them all accordingly by their probabilities, and compute the expectation from that. The number of trajectories is super-exponential, and this computation would be needed for every gradient update we need to perform on $\theta$, since the distribution of trajectories directly depends on $\pi_\theta(\tau)$.

You can see why sampling is critical for us to make any headway.

(For background on this material, please consult my older post on policy gradients, and an even older post on the basics of Markov Decision Processes.)

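The solution is to take a small set of samples $\{x^{(1)}, \ldots, x^{(n)}\}$ from the distribution of interest, to obtain our estimator

$$\hat{s}_n = \frac{1}{n} \sum_{i=1}^n f(x^{(i)})$$

which is unbiased:

$$\mathbb{E}[\hat{s}_n] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[f(x^{(i)})] = \frac{1}{n} \sum_{i=1}^n s = s$$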

and converges almost surely to the expected value, so long as several mild assumptions are met regarding the samples.
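The sample-average estimator is easy to check numerically on a toy problem. The sketch below uses a small scalar distribution `p` and function `f` that I made up for illustration (the text allows vector-valued $\bf{x}$, but a scalar keeps the example short), and compares the exact expectation with the Monte Carlo estimate.

```python
import random


def exact_expectation(p, f):
    """Enumerate every value: sum_x p(x) f(x)."""
    return sum(prob * f(x) for x, prob in p.items())


def monte_carlo_estimate(p, f, n, rng):
    """Average f over n i.i.d. samples drawn from p."""
    values = list(p.keys())
    weights = list(p.values())
    samples = rng.choices(values, weights=weights, k=n)
    return sum(f(x) for x in samples) / n


# Hypothetical toy distribution and function.
p = {0: 0.5, 1: 0.3, 2: 0.2}


def f(x):
    return x ** 2


s = exact_expectation(p, f)
s_hat = monte_carlo_estimate(p, f, n=100_000, rng=random.Random(0))
```

Here enumeration is trivial because there are only three values. The point of the estimator is that `monte_carlo_estimate` keeps working when the number of possible values (for example, trajectories) is far too large to enumerate.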

Now consider importance sampling. As the book nicely points out, when using $p(x)f(x)$ to compute the expectation, the decomposition does not have to be uniquely set at $p(x)$ and $f(x)$. Why? We can introduce a third function $q$:

$$p(x)f(x) = q(x)\frac{p(x)f(x)}{q(x)}$$

and we can sample from $q$ and average $\frac{pf}{q}$ and get our importance sampling estimator:

$$\hat{s}_q = \frac{1}{n} \sum_{i=1,\; x^{(i)} \sim q}^n \frac{p(x^{(i)})f(x^{(i)})}{q(x^{(i)})}$$

which was sampled from $q$. (The $\hat{s}_p$ is the same as $\hat{s}_n$ from earlier.) In importance sampling lingo, $q$ is often called the proposal distribution.

Think about what just happened. We are still computing the same quantity or sample estimator, and under expectation we still get $\mathbb{E}_q[\hat{s}_q] = s$. But we used a different distribution to get our actual samples. The whole $\bf{x}^{(i)}\sim p$ or $\bf{x}^{(i)}\sim q$ notation is used to control the set of samples that we get for approximating the expectation.

We employ this technique primarily to (a) sample from “more interesting regions” and (b) to reduce variance. For (a), this is often motivated by referring to some setup as follows:

We want to use Monte Carlo to compute $\mu = \mathbb{E}[X]$. There is an event $E$ such that $P(E)$ is small but $X$ is small outside of $E$. When we run the usual Monte Carlo algorithm the vast majority of our samples of $X$ will be outside $E$. But outside of $E$, $X$ is close to zero. Only rarely will we get a sample in $E$ where $X$ is not small.

where I’ve quoted this reference. I like this intuition – we need to find the more interesting regions via “overweighting” the sampling distribution there, and then we adjust the probability accordingly for our actual Monte Carlo estimate.

For (b), given two unbiased estimators, all other things being equal, the better one is the one with lower variance. The variance of $\hat{s}_q$ is

$$\mathrm{Var}[\hat{s}_q] = \frac{1}{n}\mathrm{Var}\left[\frac{p(x)f(x)}{q(x)}\right]$$

The optimal choice inducing minimum variance is $q^*(x) \propto p(x)|f(x)|$ but this is not usually attained in practice, so in some sense the task of importance sampling is to find a good sampling distribution $q$. For example, one heuristic that I’ve seen is to pick a $q$ that has “fatter tails”, so that we avoid cases where $q(x) \ll p(x)|f(x)|$, which causes the variance of $\frac{p(x)f(x)}{q(x)}$ to explode. (I’m using absolute values around $f(x)$ since $p(x) \ge 0$.) Though, since we are sampling from $q$, normally the case where $q(x)$ is very small shouldn’t happen, but anything can happen in high dimensions.
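Here is a small sketch of the rare-event motivation. The setup is my own choice for illustration, not from the book or the quoted reference: $p$ is a standard normal, $f(x)$ is the indicator of $x > 4$, and the proposal $q$ is a unit-variance normal centered at the threshold. With these choices the weight $p(x)/q(x)$ simplifies to $\exp(-tx + t^2/2)$.

```python
import math
import random


def true_tail(t):
    """Exact P(X > t) for a standard normal X."""
    return 0.5 * math.erfc(t / math.sqrt(2))


def naive_estimate(t, n, rng):
    """Plain Monte Carlo: sample from p = N(0, 1) and count hits."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > t)
    return hits / n


def importance_estimate(t, n, rng):
    """Sample from q = N(t, 1) and reweight each hit by p(x) / q(x)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(t, 1.0)
        if x > t:
            # p(x) / q(x) = exp(-x^2/2) / exp(-(x - t)^2/2) = exp(-t*x + t^2/2)
            total += math.exp(-t * x + 0.5 * t * t)
    return total / n


t, n = 4.0, 20_000
truth = true_tail(t)
naive = naive_estimate(t, n, random.Random(0))
weighted = importance_estimate(t, n, random.Random(0))
```

With the same sample budget, the naive estimator sees almost no samples beyond the threshold, so it is usually zero or a coarse multiple of $1/n$. The reweighted one places about half of its samples in the interesting region and should land close to `truth`.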

In a subsequent post, I will discuss importance sampling in the context of some deep learning applications.

# I Will Make a Serious Run for Political Office by January 14, 2044

I have an official announcement. I am giving myself a 25-year deadline for making a serious run for political office. That means I must begin a major political campaign no later than January 14, 2044.

Obviously, I can’t make any guarantees about what the world will be like then. We know there are existential threats about which I worry. My health might suddenly take a nosedive due to an injury or if I somehow quit my addiction to salads and berries. But for the sake of this exercise, let’s assume away these (hopefully unlikely) cases.

People are inspired to run for political office for a variety of reasons. I have repeatedly been thinking about doing so, perhaps (as amazing as it sounds) even more so than I think about existential threats. The tipping point for me making this declaration is our ridiculous government shutdown, now the longest in history.

This shutdown is unnecessary, counterproductive, and is weakening the United States of America. As many as 800,000 federal workers are furloughed or being forced to work without pay. On a more personal note, government cuts disrupt American science, a worrying sign given how China is investing vast sums of money in Artificial Intelligence and other sciences.

I do not know which offices I will target. It could be national or state-wide. Certain environments are far more challenging for political newcomers, such as those with powerful incumbents. But if I end up getting lucky, such as drawing a white supremacist like Steve King as my opponent … well, I’m sure I could position myself to win the respect of the relevant group of voters.

I also cannot state my future political party affiliation with certainty. I am a terrible fit for the modern-day GOP, and an awkward one for the current Democratic party. But, a lot can change in 25 years.

To avoid distracting myself from more pressing circumstances, I will not discuss this in future blog posts. My primary focus is on getting more research done; I currently have about 20 drafts of technical posts to plow through in the next few months.

But stay tuned for what the long-term future may hold.

# What Keeps Me Up at Night

For most of my life, I have had difficulty sleeping, because my mind is constantly whirring about some topic, and I cannot shut it down. I ponder about many things. In recent months, what’s been keeping me up at night are existential threats to humanity. Two classic categories are nuclear warfare and climate change. A more recent one is artificial intelligence.

The threat of civilization-ending nuclear warfare has been on the minds of many thinkers since the days of World War II.

There are nine countries with nuclear weapons: the United States, Russia, United Kingdom, France, China, India, Pakistan, Israel, and North Korea.

The United States and Russia have, by far, the largest nuclear weapons stockpiles. The Israeli government deliberately remains ambiguous about its nuclear arsenal. Iran is close to obtaining nuclear weapons, and it is essential that this does not happen.

I am not afraid of Putin ordering nuclear attacks. I have consistently stated that Russia (essentially, that means Putin) is America’s biggest geopolitical foe. This is not the same as saying that they are the biggest existential threat to humanity. Putin may be a dictator whom I would never want to live under, but he is not suicidal.

North Korea is a different matter. I have little faith in Kim Jong Un’s mental acuity. Unfortunately, his regime still shows no signs of collapse. America must work with China and persuade them that it is in the interest of both countries for China to end their support of the Kim regime.

What about terrorist groups? While white supremacists have, I think, killed more Americans in recent years than radical Islamists, I don’t think white supremacist groups are actively trying to obtain nuclear weapons; they want a racially pure society to live in, which by necessity requires some usable, fallout-free land.

But Islamic State, and other cult-like terrorist groups, could launch suicide attacks by stealing nuclear weapons. Terrorist groups lack homegrown expertise to build and launch such weapons, but they may purchase, steal, bribe, or extort. It is imperative that our nuclear technicians and security guards are well-trained, appropriately compensated, and have no Edward Snowdens hidden among them. It would also be prudent to assist countries such as Pakistan so that they have stronger defenses of their nuclear weapons.

Despite all the things that could go wrong, we are still alive today with no nuclear warfare since World War II. I hope that cool heads continue to prevail among those in possession of nuclear weapons.

A good overview of the preceding issues can be found in Charles D. Ferguson’s book. There is also a nice op-ed by elder statesmen George Shultz, Henry Kissinger, William Perry, and Sam Nunn on a world without nuclear weapons.

Climate change is a second major existential threat.

The good news is that the worst-case predictions from our scientists (and, ahem, Al Gore) have not materialized. We are still alive today, and the climate, at least from my personal experience — which cannot be used as evidence against climate change since it’s one data point — is not notably different from years past. The increasing use of natural gas has substantially slowed down the rate of carbon emissions. Businesses are aiming to be more energy-efficient. Scientists continue to track worldwide temperatures and to make more accurate climate predictions aided by advanced computing hardware.

The bad news is that carbon emissions will continue to grow. As countries develop, they naturally require more energy for the higher-status symbols of civilization (more cars, more air travel, and so on). Their citizens will also want more meat, causing more methane emissions and further strains on our environment.

Moreover, the recent Artificial Intelligence and Blockchain developments are computationally-heavy, due to Deep Learning and mining (respectively). Artificial Intelligence researchers and miners therefore have a responsibility to be frugal about their energy usage.

It would be ideal if the United States could take the lead in fighting climate change in a sensible way without total economic shutdown, such as by applying the carbon tax plan proposed by former Secretary of State George Shultz and policy entrepreneur Ted Halstead. Unfortunately, we lack the willpower to do so, and the Republican party in recent years has placed lower priorities on climate change, with their top politician even once Tweeting the absurd and patently false claim that global warming was a “hoax invented by the Chinese to make American manufacturing less competitive.” That most scientists are Democrats can be attributed in large part to attacks on climate change (and the theory of evolution, I’d add), not to scientists being anti-capitalism. I bet most of us recognize the benefits of a capitalistic society like I do.

While I worry about carbon and temperature, they are not the only things that matter. Climate change can cause more extreme weather, such as droughts which have plagued the Middle East, exacerbating the current refugee crisis and destabilizing governments throughout the world. Droughts are also stressing supplies in South Africa, and even America, as we have sadly seen in California.

A more recent existential threat pertains to artificial intelligence.

Two classes of threats I ponder are (a) autonomous weapons, and a broad category that I call (b) the risks of catastrophic misinformation. Both are compounding factors that contribute to nuclear warfare or a more drastic climate trend.

The danger of autonomous weapons has been widely explored in recent books, such as Army of None (on my TODO list) and in generic Artificial Intelligence books such as Life 3.0 (highly recommended!). There are a number of terrifying ways in which these weapons could wreak havoc among populations throughout the world.

For example, one could also think of autonomous weapons merging with biological terrorism, perhaps via a swarm of “killer bee robots” spreading a virus. Fortunately, as summarized by Steven Pinker in the existential threats chapter of Enlightenment Now, biological agents are actually ill-suited for widespread terrorism and pandemics in the modern era. But autonomous weapons could easily be used for purposes that we can’t even imagine now.

Autonomous weapons will be applied on specially designed hardware. These won’t be like the physical, humanoid robots that Toyota is developing for home robots, because robotic motion that mimics human-like motion is too slow and cumbersome to cause an existential threat. Recent AI advances have been primarily from software. Nowhere was this more apparent to me than with AlphaGo, which astonished the world by defeating a top Go player … but a DeepMind employee, following AlphaGo’s instructions, placed the stones on the board. The irony is that something as “primitive” as finely placing stones on a game board is beyond the ability of current robots. This means that I do not consider situations where a robot must physically acquire resources with its own hardware to be an existential threat.

The second aspect of AI that I worry about is, as stated earlier, “catastrophic misinformation.” What do I mean by this? I refer to how AI might be trained to create material that can drastically mislead a group of people, which might cause them to be belligerent with others, hence increasing the chances of nuclear or widespread warfare.

Consider a more advanced form of AI that can generate images (and perhaps videos!) far more complex than those that the NVIDIA GAN can create. Even today, people have difficulty distinguishing between fake and real news, as noted in LikeWar. A future risk for humanity might involve a world-wide “PizzaGate” incident where misled leaders go to war with each other, provoked by AI-generated misinformation from a terrorist organization running open-source code.

Even if we could count on citizens to hold their leaders accountable, (a) some countries simply don’t have accountable leaders or knowledgeable citizens, and (b) even “educated” people can be silently nudged to support certain issues. North Korea has brainwashed their citizens to obey their leaders without question. China is moving beyond blocking “Tiananmen Square massacre”-like themes on the Internet; they can determine social credit scores, automatically tracked via phone apps and Big Data. China additionally has the technical know-how, hardware, and data, to utilize the latest AI advances.

Imagine what authoritarian leaders could do if they wanted to rouse support for some controversial issue … that they learned via fake-news AI. That succinctly summarizes my concerns.

Nuclear warfare, climate change, and artificial intelligence are currently keeping me up at night.

# How to be Better: 2019 and Earlier Resolutions

I have written New Year’s resolutions since 2014, and do post-mortems to evaluate my progress. All of my resolutions are in separate text documents on my laptop’s desktop, so I see them every morning.

In the past I’ve only blogged about the 2015 edition, where I briefly covered my resolutions for the year. That was four years ago, so how are things looking today?

The good news: I have kept tracking my New Year’s resolutions throughout the years, and have achieved many of my goals. Some resolutions are specific, such as “run a half marathon in under 1:45”, but others are vague, such as “run consistently on Tuesdays and Thursdays”, so I don’t keep track of the number of successes or failures. Instead, I jot down several “positive,” “neutral,” and “negative” conclusions at each year’s end.

Possibly because of my newfound goals and ambitions, my current resolutions are much longer than they were in 2015. My 2019 resolutions are split into six categories: (1) reading books, (2) blogging, (3) academics, education, and work, (4) physical fitness and health, (5) money and finances, and (6) miscellaneous. Each is further sub-divided as needed.

Probably the most notable change I’ve made since 2015 is my book reading habit, which has rapidly turned into my #1 non-academic activity. It’s the one I default to during my evenings, my vacations, my plane rides, and on Saturdays when I generally do not work in order to recharge and to preserve my sanity.

Ultimately, much of my future career/life will depend on how well I meet my goals under class (3) above, in the academics, education, and work category. At a high level, the goals here (which could be applied to my other categories, but I view them mostly under the lens of “work”) are:

• Be Better At Minimizing Distractions. I am reasonably good at this, but there is still a wide chasm between where I’m at and my ideal state. I checked email way too often this past year, and need to cut that down.

• Be Better At Reading Research Papers. Reading academic papers is hard. I have read many, as evidenced by my GitHub paper notes repository. But not all of those notes have reflected true understanding, and it’s easy to get bogged down in irrelevant details. I also need to be more skeptical of research papers, since no paper is perfect.

• Be Better At Learning New Concepts. When learning new concepts (examples: reading a textbook, self-studying an online course, understanding a new code base), apply deliberate practice. It’s the best way to quickly get up to speed and rapidly attain the level of expertise I require.

I hope I make a leap in 2019. Feel free to contact me if you’ve had some good experiences or insights from forming your own New Year’s resolutions!

# All the Books I Read in 2018, Plus My Thoughts

As I did in 2016 and then in 2017, I am reporting the list of books that I read this past year1 along with brief summaries and my colorful commentary. This year, I read 34 books, which is similar to the number in past years (35 and 43, respectively). This page will have any future set of reading list posts.

Here are the categories:

1. Business, Economics, and Technology (9 books)
2. Biographies and Memoirs (9 books)
3. Self-Improvement (6 books)
4. History (3 books)
5. Current Events (3 books)
6. Miscellaneous (4 books)

All books are non-fiction, and I drafted the summaries written below as soon as I had finished reading each book.

As usual, I write the titles below in bold text, and the books that I especially enjoyed reading have double asterisks (**) surrounding the titles.

## Group 1: Business, Economics, and Technology

I’m lumping these all together because the business/econ books that I read tend to be about “high tech” industries.

• Blockchain Revolution: How the Technology Behind Bitcoin and Cryptocurrencies is Changing the World (2016, later updated in 2018) by father-son author team Don Tapscott and Alex Tapscott, describes how the blockchain technology will change the world. To be clear, blockchain already has done that (to some extent), but the book is mostly about the future and its potential. The technology behind blockchain, which has enabled bitcoin, was famously introduced in 2008 by Satoshi Nakamoto, whose true identity remains unknown. Blockchain Revolution gives an overview of Nakamoto’s idea, and then spends most of its ink describing problems that could be solved or ameliorated with blockchain, such as excess centralization of power, suppression of citizens under authoritarian governments, inefficiencies in payment systems, and so forth. This isn’t the book’s main emphasis, but I am particularly intrigued by the potential for combining blockchain technology with artificial intelligence; the Tapscotts are optimistic about automating things with smart devices. I still have lots of questions about blockchain, and to better understand it, I will likely have to implement a simplified form of it myself. That being said, despite the book’s optimism, I remain concerned for a few reasons. The first is that I’m worried about all the energy that we need for mining — isn’t that going to counter any efficiency gains from blockchain technology (e.g., due to smart energy grids)? Second, will this be too complex for ordinary citizens to understand and benefit, leaving the rich to get the fruits? Third, are we really sure that blockchain will help protect citizens from authoritarian governments, and that there aren’t any unanticipated drawbacks? I remain cautiously optimistic. The book is great at trying to match the science fiction potential with reality, but still, I worry that the expectations for blockchain are too high.

• **Machine Platform Crowd: Harnessing our Digital Future** is the most recent book jointly authored by Brynjolfsson and McAfee. It was published in 2017, and I was excited to read it after thoroughly enjoying their 2014 book The Second Machine Age. The title implies that it overlaps with the previous book, and it does: on platforms, the effect of two-sided markets, and how they are disrupting businesses. But there are also two other core aspects: the machine and the crowd. In the former (my favorite part, for obvious reasons), they talk about how AI and machine learning have been able to overcome “Polanyi’s Paradox”, discussing DeepMind’s AlphaGo (yay!). Key insight: experts are often incorrect, and it’s best to leave many decisions to machines. The other part is the crowd, and how a large crowd of participants can do better than a smaller group of so-called experts. One of the more interesting aspects is the debate on Bitcoin as an alternative to cash/currency, and the underlying Blockchain structure to help enforce contracts. However, they say that companies are not becoming obsolete, in part because contracts can never fully specify everything in the possible world, so companies can claim to do anything that’s not specified there if they own an asset, etc. Brynjolfsson and McAfee argue that while the pace of today’s world is incredible, companies will still have a role to play, and so will people and management, since they help to provide a conducive environment or mission to get things done. Overall, these themes combine to form a splendid presentation of, effectively, how to understand all three of these aspects (the machine, the platform, and the crowd) in the context of our world today. Sure, one can’t know everything from reading a book, but it gives a tremendous starting point, which is why I enjoyed it very much.

• **Reinventing Capitalism in the Age of Big Data** is a 2018 book by Oxford professor Viktor Mayer-Schönberger and writer Thomas Ramge that describes their view of how capitalism works today. In particular, they focus on comparing markets versus firms in a manner similar to books such as Platform Revolution (see my comments above), but with perhaps an increased discussion of the role of prices. Historically, humans lacked all the data we have today, and condensing everything about an item for purchase into a single quantity made sense for the sake of efficiency. Things have changed in today’s Big Data world, where data can better connect producers and consumers. In the past, a firm could control data and coordinate efforts, but this advantage has declined over time, causing the authors to argue that markets are making a “comeback” against the firm, while the decline of the firm means we need to rethink our approaches towards employment since stable jobs are less likely. Reinventing Capitalism doesn’t say much about policies to pursue, but one that I remember they suggested is a data tax (or any “data-sharing mandate” for that matter) to help level the playing field, where data effectively plays the role of money from earlier, or fuel in the case of Artificial Intelligence applications. Obviously, this won’t be happening any time soon (and especially not with the Republican party in control of our government) but it’s certainly thought-provoking to consider what the future might bring. I feel that, like a Universal Basic Income (UBI), a data tax is inevitable, but will come too late for most of its benefits to kick in due to delays in government implementation. It’s an interesting book, and I would recommend it along with the other business-related books I’ve read here. For another perspective, see David Leonhardt’s favorable review in The New York Times.

## Group 2: Biographies and Memoirs

This is rapidly becoming a popular genre within nonfiction for me, because I like knowing more about accomplished people who I admire. It helps drive me to become a better person.

• **Worthy Fights: A Memoir of Leadership in War and Peace** is Leon Panetta’s memoir, co-written with Jim Newton. I didn’t know much about Panetta, but after reading this engaging story of his life, I’m amazed by his career and how Panetta has made the United States and the world better off. The memoir starts at his father’s immigration from Italy to the United States, and then discusses Panetta’s early career in Congress (first as an assistant to a Congressman, then as a Congressman himself), and then his time at the Office of Management and Budget, and then as President Clinton’s Chief of Staff, and then (yes, there’s more!) as Director of the CIA, and finally, as President Obama’s Secretary of Defense. Wow, that’s a lot to absorb already, and I wish I could have a fraction of the success and impact that Panetta has had on the world. I appreciate Panetta for several reasons. First, he repeatedly argues for the importance of balancing budgets, something which I believe isn’t a priority for either political party; despite what some may say (especially in the Republican party), their actions suggest otherwise (let’s build a wall!!!). Panetta, though, actually helped to balance the federal budget. Second, I appreciated all the effort that he and the CIA made to find and kill Osama bin Laden; that was one of the best things to happen from the CIA over the last decade, and their efforts should be appreciated. The raid on Osama bin Laden’s fortress was the most thrilling part of the memoir by far, and I could not put the book down. Finally, and while this may just be me, I personally find Panetta to be just the kind of American that we need the most. His commitment to the country is evident from the words in the book, and I can only hope that we see more people like him (whether in politics or not) instead of the ones who try to run government shutdowns[8] and deliberately provoke people for the sake of provocation.
After Enlightenment Now (see below), this was my second favorite book of 2018.

• **My Journey at the Nuclear Brink** is William Perry’s story of his coming of age in the nuclear era. For those who don’t know him (admittedly, this included me before reading this book!), he served as the Secretary of Defense for President Clinton from February 1994 to January 1997. Before that he held an “undersecretary” position in government, and before that he was an aspiring entrepreneur and a mathematician, and earlier still, he was in the military. The book can admittedly be dry at times, but I still liked it, and Perry recounts several occasions when he truly feared that the world would descend into nuclear warfare, most notably during the Cuban Missile Crisis. During the Cold War, as expected, Perry’s focus was on containing possible threats from the Soviet Union. Later, as Secretary of Defense, Perry was faced with a new challenge: the end of the Cold War meant that the Soviet Union dissolved into 15 countries, which left nuclear weapons spread out among different entities, heightening the risks. It is a shame that few people understand how essential Perry was (along with then-Georgia Senator Sam Nunn) in defusing this crisis by destroying or disassembling nuclear silos. It is also a shame that, as painfully recounted by Perry, Russia-U.S. relations have sunk to their lowest point since the high of 1996-1997 that Perry helped to facilitate. Relations sank in large part due to the expansion of NATO to include Eastern European countries. This was an important event discussed by Michael Mandelbaum in Mission Failure, and while Perry argued forcefully against NATO expansion, Clinton overrode him by listening to … Al Gore, of all people. Gaaah. In more recent years, Perry has teamed up with Sam Nunn, Henry Kissinger, and George Shultz to spread knowledge on the dangers of nuclear warfare. These four men aim to move towards a world without nuclear weapons. I can only hope that we achieve that ideal.

• **The Art of Tough: Fearlessly Facing Politics and Life** is Barbara Boxer’s memoir, published in 2016 near the end of her fourth (and last) term as U.S. Senator from California. Before that, she was in the House of Representatives for a decade. Earlier still, Boxer held some local positions while taking part in several other political campaigns. Before moving to California in 2014, I didn’t know about Barbara Boxer, so I learned more about her experiences in the previously mentioned positions; I got a picture of what it’s like to run a political campaign and then later to be a politician. The stories of the Senate are most riveting, since it’s a highly exclusive body that acts as a feeder for presidents. It’s also constantly under public scrutiny (a good thing!). In the Senate, Boxer emphasizes the necessity of developing working relationships among colleagues (are you listening, Ted Cruz?). She also emphasizes the importance of being tough (hence the book’s title), particularly due to being one of the few women in the Senate. Another example of “being tough” is staking out a minority, unpopular political position, such as her vote against the Iraq war in 2002, which was the correct thing to do in hindsight. She concludes the memoir emphasizing that she didn’t retire because of hyper-partisanship, but rather because she thought she could be more effective outside the Senate and that California would produce worthy successors to her. Indeed, her successor Kamala Harris holds very similar political positions. The book was a quick, inspiring read, and I now want to devour more memoirs by famous politicians. My biggest complaint, by far, concerns the 1992 Senate election: Boxer describes herself as “an asterisk in the polls” and says that as recently as a few months before the Democratic primary election, she was thinking of quitting. But then she won … and the book gives no explanation of how she overcame the other contestants. I mean, seriously?
One more thing: truthfully, one reason why I read The Art of Tough was that I wanted to know how people actually get to the House of Representatives or the Senate. In Boxer’s case, her predecessor actually knew her and recommended that she run for his seat. Thus, it seems like I need to know more politically powerful people.

• **Churchill and Orwell: The Fight for Freedom** is a thrilling 2017 book by Thomas E. Ricks, a longtime reporter specializing in military and national security issues who writes the Foreign Policy blog Best Defense. Churchill and Orwell provides a dual biography of these two Englishmen, first discussing them independently before weaving together their stories and then combining their legacies. By the end of the 20th Century, as the book correctly points out, both Churchill and Orwell would be considered two of the most influential figures in protecting the rights and freedoms of people from intrusive state governments and outside adversaries. Churchill, obviously, was the Prime Minister of the United Kingdom during World War II and guided the country through blood and tears to victory versus the decidedly anti-freedom Nazi Germany. Orwell initially played a far lesser role in the fight for freedom, and was still an unknown quantity even during the 1940s as he was writing his two most influential works: Animal Farm and 1984. However, no one could have anticipated at the time of his death in 1950 (one year after publishing 1984) that those books would become two of the most wildly successful novels of all time.[11] As mentioned earlier, this book was published last year, but I think if Ricks had extra time, he would have mentioned Kellyanne Conway’s infamous “alternative facts” statement and how 1984 once again became a bestseller decades after it was originally published. I’m grateful to Ricks for writing such an engaging book, but of course, I’m even more grateful for what Churchill and Orwell have done. Their legacies have a permanent spot in my heart.

• **A Higher Loyalty: Truth, Lies, and Leadership** is the famous 2018 memoir of James Comey, the former FBI director detested by Democrats and Republicans alike. I probably have a (pun intended) higher opinion of him than almost all “serious” Democrats and Republicans, given my sympathy towards people who work in intelligence and military jobs that are supposed to be non-political. I was interested in why Comey discussed Clinton’s emails the way he did, and also how he managed his interactions with Trump. Note that the Robert Mueller thing is largely classified, so there’s nothing in A Higher Loyalty about that, but his interactions with others at the highest levels of American politics are fascinating. Comey’s book, however, starts early, with a harrowing story about how Comey and his brother were robbed at gunpoint while in high school, an event which he would remember forever and which spurred him to join law enforcement. Among other great stories in the book (before the Clinton/Trump stuff) is when he threatened to resign as (deputy) Attorney General. That was when George Bush wanted to renew StellarWind, a program which would surge into public discourse upon Edward Snowden’s leaks. I knew about this, but Comey’s writing made this story thrilling: a race to try and protect a dying Attorney General’s approval to renew a law which Comey and other lawyers thought was completely indefensible. (It was criticized by WSJ writer Karl Rove as “melodramatic flair.”) Regarding the Clinton emails, Comey did a good job explaining to me what needed to happen in order to prosecute Clinton, and I think the explanation he gave was fair. Now, about his renewal of the news 11 days before the election … Comey said he could either say nothing (and destroy the reputation of the FBI if the email investigation was later found to have continued) or say something (and get hammered now).
One of the things that impresses me most about the book is Comey’s praise of Obama, and oddly, Obama said he still thought highly of him at the end of 2016 when Comey was universally pilloried in the press. A Higher Loyalty is another book in my collection by those who have served in high levels of office (Leon Panetta, William Perry, Michael Hayden, Barbara Boxer, Sonia Sotomayor, etc.) so you can tell that there’s a trend here. The WSJ slammed him for being “more like Trump than he admits” but I personally can’t agree with that statement.

• Faith: A Journey for All is one of former President James (“Jimmy”) Carter’s many books,[12] this one published in 2018. I discussed it in this earlier blog post.

## Group 3: Self-Improvement and Skills Development

I have long enjoyed reading these books because I want to use them to become a highly effective person who can change the world for the better.

• **Stress Free For Good: 10 Scientifically Proven Life Skills for Health and Happiness** is a well-known 2005 book[13] co-authored by professors Fred Luskin and Kenneth R. Pelletier. The former is known for writing Forgive for Good and his research on forgiveness, while the latter is more on the medical side. In this book, they discuss two types of stress: Type I and Type II. Type I stress occurs when the stress source is easily identified and resolved, while Type II stress is (as you might guess) when the source cannot be easily resolved. Not all stress is bad (somewhat contradicting the title itself!), as humans clearly need stress and its associated responses when they are absolutely necessary for survival (e.g., running away from a murderer). But this is not the correct response for a chronic but non-lethal condition such as deteriorating familial relationships, challenging work environments, and so forth. Thus, Luskin and Pelletier go through 10 skills, each given its own chapter. Skills include the obvious, such as smiling, and the not-so-obvious, such as … belly-breathing?!? Yes, really. The authors argue that each skill is scientifically proven and back each with anecdotes from their patients. I enjoyed the anecdotes, but I wonder how much scientific evidence qualifies as “proven”. Stress Free For Good does not formally cite any papers, and instead concisely describes work done by groups of researchers. Certainly, I don’t think we need dozens of papers to tell us that smiling is helpful, but I think other chapters (e.g., belly breathing) need more evidence. Also, like most self-help books, it suffers from the medium of the written word. Most people will read passively, and likely forget about the skills. I probably will be one of them, even though I know I should practice these skills. The good news is, while I have lots of stress, it’s not the kind (at least right now, thankfully) that is enormously debilitating and wears me down.
For those in worse positions than me, I can see this book being, if not a literal life saver, at least fundamentally useful.

• How to Invest Your Time Like Money is a brief 2015 essay by time coach Elizabeth Grace Saunders, and I found out about it by reading (no surprise here!) a blog post from Cal Newport. I bought this on my iBooks app while trying to pass the time during a long airport layover in Vancouver when I was returning from ICRA 2018. Like the authors of many similarly-themed books, she urges the reader to drop activities that aren’t high on the priority list and won’t have a huge impact (meetings!!), and to set aside sufficient time for relaxing and sleeping. The main distinction between this book and others in the genre is that Saunders tries to provide a full-blown weekly schedule to the reader, urging them to fill in the blanks with what their schedule will look like. The book also proffers formulaic techniques to figure out which activity should go where. This is the part that I’m not a fan of: I never like having to go into that much detail in my scheduling, and I doubt the effectiveness of applying formulas to figure out my activities. I can usually reduce my work days to one or two critical things that I need to do, and block off large stretches of flexible time. A fixed, rigid schedule (as in, stop working on task A at 10:00am and switch to task B for two hours) rarely works for me, so I am not much of a fan of this book.

• **Peak: Secrets from the New Science of Expertise** is a 2016 book by Florida State University psychologist Anders Ericsson and science writer Robert Pool. Ericsson is well-known for his research on deliberate practice, a proven technique for rapidly improving one’s ability in some field,[14] and this book presents his findings to educate the lay audience. Ericsson and Pool define deliberate practice as a special type of “purposeful practice” in which there are well-defined goals, immediate feedback, total focus, and where the practitioner is slightly outside his or her comfort zone (but not too much!). This starkly contrasts with the kind of ineffective practice where one repeats the same activity over and over again. Ericsson and Pool demonstrate how the principles of deliberate practice were derived not only from “the usual”[15] fields of chess and music, but also from seemingly obscure tasks such as memorizing a string of numerical digits. They provide lessons on developing mental representations for deliberate practice. Ericsson and Pool critique Malcolm Gladwell’s famous “10,000-hour rule” and, while they agree that it is necessary to invest ginormous amounts of time to become an expert, that time must consist of deliberate practice rather than “ordinary” practice. A somewhat controversial topic that appears later is the notion of “natural talent.” Ericsson and Pool claim that it doesn’t exist except for height and body size for sports, and perhaps a few early advantages associated with IQ for mental tasks. They back their argument with evidence of how child prodigies (e.g., Mozart) actually invested lots of meaningful practice beforehand. And therein lies the paradox for me: I’m happy that there isn’t a “natural talent” for computer science and AI research, but I’m not happy that I got a substantially late start in developing my math, programming and AI skills compared to my peers.
That being said, this book proves its worth as an advocate for deliberate practice and for its appropriate myth-busting. I will do my best to apply deliberate practice to my work and physical fitness.

• **Grit: The Power of Passion and Perseverance** is a 2016 book by Angela Duckworth, a 2013 MacArthur Fellow and a professor of psychology at the University of Pennsylvania. Duckworth is noted for winning a “genius” grant even though, when she was growing up, her father would explicitly say that she wasn’t a genius. She explores West Point and the military, athletics, academia, and other areas (e.g., the business world) to understand what causes some people to be high achievers while others achieve less. Her conclusion is that these people have “grit”. She develops a Grit scale, which you can take in the book. (I am always skeptical of these things, but it’s very hard to measure psychological factors.) Duckworth says people with grit combine passion and perseverance (see the book’s subtitle!). She cites West Point survivors, fellow MacArthur Fellow Ta-Nehisi Coates, and Cody Coleman, who is now a computer science PhD candidate at Stanford University. But how do you get grit? “Follow your passion” is bad advice, which by now I’ve internalized. And yes, she cites Cal Newport’s So Good They Can’t Ignore You, but apparently Deep Work must have been published too late to make it into this book, because her FAQ later says she works about 70 hours a week in all; this is shorter than my work schedule but longer than Professor Newport’s.[16] But anyway, she makes it clear that once people have started on their passion or mission, they need to stick with it and not quit just because they’ve had one bad day. For Duckworth, her mission is about using psychology to maximize success in people, and children in particular. Part of this involves deliberate practice, and yes, she cites Anders Ericsson’s work, which is largely compatible with grit. Probably the major gap in the grit hypothesis is that stuff like poverty, racism and other barriers can throw a wrench in success, but grit can still be relatively useful regardless of circumstances.
If you want to know more, you can check out her 6-minute TED talk.

## Group 4: History

This is a relatively short section, with just three books. Still, all three were excellent and highly educational. These books (especially the last two) can be harder to read than biographies, which is why I read fewer of them.

• **The Origins of Political Order: From Prehuman Times to the French Revolution** is a book by political scientist Francis Fukuyama, and one that I’ve wanted to read for several years and finally finished after the ICRA 2019 deadline. I discuss the book in a separate blog post, where I also discuss Jimmy Carter’s book. Fukuyama wrote a follow-up book which I bought after BARS 2018, but alas, I have not even started reading it. Nor have I read Fukuyama’s more famous work, The End of History and the Last Man. There is so much I need to read, but not enough time.

• **Enlightenment Now: The Case for Reason, Science, Humanism, and Progress** is a 2018 book by famous Harvard professor Steven Pinker,[17] known for writing the 2011 bestseller The Better Angels of Our Nature and for research in cognitive psychology. I haven’t read Better Angels (I have a copy of it), but Enlightenment Now seems to be a natural sequel written in a similar style with graphs and facts galore about how the world has been getting better overall, and not worse as some might think from the “Again” in “Make America Great Again!!”. The bulk of the book consists of chapters that each cover one main theme, such as life, the environment, equal rights, democracy, inequality, peace, existential threats, and other topics. For each, Pinker explains why things have gotten better by reporting on relevant long-term statistics. Enlightenment Now is probably as good as you can get in answering as many of humanity’s critical questions together in one bundle, and written by someone who, in the words of Scott Aaronson (amusingly referred to as “Aronson” in the acknowledgments) is “possibly the single person on earth most qualified to tackle those questions.” In the other parts of the book, Pinker defends Enlightenment thinking from other forces, such as religious thinking and authoritarianism. To me, one of the most impressive parts of the book may be that Pinker very often anticipates the counter-arguments and answers them right after making various claims. I find Pinker’s claims to be very reasonable and I can tell why Bill Gates refers to Enlightenment Now as “his new favorite book” (replacing Better Angels). And about Trump, it’s impossible to ignore him in a book about progress, because Trump’s “Make America Great Again” professes a nostalgia for a glorious past, but this would include (in the United States alone) segregation and bans on interracial marriage, gay sex, and birth control.[18] Is that the kind of world we want to live in?
Despite all the real problems we face today, if I had to pick any time to be born, it would be the present. Pinker is a great spokesman for Enlightenment thinking, and I’m happy to consider myself a supporter and ardent defender of these ideals. This was my favorite book I read in 2018.

## Group 5: Current Events

Here are three books published in 2018 about current events, from a US-centric perspective, with some discussions about Russia sprinkled in.

• **The Fifth Risk** is the latest book by author and journalist Michael Lewis, who writes about the consequences of what happens when people in control of government don’t know how it works. In the words of John Williams, “I would read an 800-page history of the stapler if he wrote it”. That’s true for me as well. Lewis quickly hooked me with his writing, which starts off about … you guessed it, Rick Perry and the Department of Energy. The former Texas governor was somehow tapped to run the Department of Energy despite famously campaigning to abolish it back in the 2012 Republican primaries … when, of course, in a televised debate, he failed to remember it as the third government agency he would eliminate. Oops. Later, he admitted he regretted this, but still: of all the people that could possibly lead the Department of Energy, why did it have to be him?!?!?[19] Other departments and agencies are also led either by people with little understanding of how they work, or by industry lobbyists who stand to gain a large paycheck after leaving government. I want the best people to get the job, and that’s unfortunately not happening with Trump’s administration. Furthermore, not only do we have job mismatches, we also have repeated federal government shutdowns, including one at the time of me writing this blog post. Why should Americans want to work for the federal government if we can’t give them a stable wage? (That’s literally why many people aim for federal jobs, due presumably to more stability than the private sector.) The silver lining is that this book also consists of a series of interviews with unsung heroes in our government, who are working to maintain it and counter the influence of misguided decisions made at the top. The Fifth Risk will clearly not have any impact whatsoever on the Trump administration, because they would not bother reading books like this.

## Group 6: Miscellaneous

Finally, we have some random books that didn’t make it into the above categories.

• Nuclear Energy: What Everyone Needs to Know was written by Charles D. Ferguson, and provides an overview of various topics pertinent to nuclear energy. You can explore (Doctor) Ferguson’s background on his LinkedIn page, but to summarize: a PhD in physics followed by various government and think-tank jobs, most of which relate to nuclear energy and make him well-qualified to write this book. Published in 2011, just two weeks after the Fukushima accident and before the Iran Nuclear Deal, Nuclear Energy is organized as a set of eight chapters, each of which is broken up into a list of sections. Each section is headed by a question or two, such as “What is energy, and what is power?” in the first chapter on fundamentals, and “How many nuclear weapons do the nuclear armed countries have?” in the chapter on proliferation. I decided to read this book for two main reasons: the first is that I am worried about existential threats from nuclear warfare (inspired in part by reading William Perry’s book this year; see above), and the second is that I wanted to know whether nuclear energy can be a useful tool for addressing climate change. For the former, I learned about the many agencies and people who are doing their part to stop proliferation, which partially assuages my concerns. For the latter, I got mixed messages, unfortunately. In general, Ferguson does a good job treating issues in a relatively unbiased manner, presenting both pros and cons. The book isn’t a page-turner, and I worry that the first chapter on fundamentals might turn off potential readers, but once a reader gets through the first chapter, the rest is easier reading. I am happy he wrote Nuclear Energy, and I plan to mention more in a subsequent blog post.

• Turing’s Vision: The Birth of Computer Science is a brief book by math professor Chris Bernhardt which attempts to present the themes of Turing’s landmark paper of 1936 (written when he was just 24 years old) on the theory of computation. Most of the material was familiar to me as it is covered in standard theory of computation courses for undergraduates, though I have to confess that I had forgotten much of it. And this, despite writing about theory of computation several times on this blog! You can find the paper online, titled “On Computable Numbers, with an Application to the Entscheidungsproblem”. I think the book is useful as a general introduction for the lay reader (i.e., the non-computer scientist).

Whew, that’s 2018. Up next, 2019. Happy readings!

Update January 2, 2019: I revised the post since I had forgotten to include one book, and I actually read another one in between the December 27 publication date and January 1 of the new year. So that’s 34 books I read, not 32.

1. Technically, books that I finished reading this year, which includes those that I started reading in late 2017, and excludes those that I will finish in 2019.

2. Yeah, yeah, if Andrew Ng says to read a book, then I will read it. Sorry, I can follow the leader a bit too much …

3. One of the phrases that I remember well from the book is something like: “this is a book on how to get into the rich man’s club in the first place” (emphasis mine).

4. I would be interested in being a “science advisor” to the President of the United States.

5. It should surprise no one that I am a vocal proponent of an open society, both politically and economically.

6. There’s less congestion in the air, and the skill required means all the “drivers” are far more sophisticated than the ground counterparts.

7. Singapore is advanced enough that top academic conferences are held there; think ICRA 2017. (Sadly, I was unable to attend, though I heard the venue was excellent.) In addition, Singapore is often the best country in terms of “number of academic papers with respect to total population” for obvious reasons.

8. At the time I finished this book in early 2018 and drafted the summary for Worthy Fights in this blog post, the US Government was reeling from two government shutdowns, one from Chuck Schumer and the other from Rand Paul. And at the end of 2018, when I finished doing minor edits to the entire post before official publication, we were in the midst of the third government shutdown of the year, this time from Donald Trump who famously said he would “own” the shutdown in a televised interview. Don’t worry, this doesn’t hinder my interest in running for political office. If anything, the constant gridlock in Washington increases my interest in being there somehow, since I think I could improve the situation.

9. This raises the question: if Vance says he should do that, shouldn’t other VCs help to invest in areas or in groups of people who haven’t gotten the fruits of VC funding, such as black people?

10. This shocked me. If I were in his position, which admittedly I am not, there’s no way I would not run for office. I mean, he had people (not including, presumably, his relatives) clamoring for him to run!!

11. In 2005, TIME chose both Animal Farm and 1984 to be in their top 100 novels of all time.

12. I mean, look at all of these books.

13. I decided to read it upon seeing it featured on Professor Olga Russakovsky’s website.

14. When I saw the book’s description, I immediately thought of Cal Newport’s Deep Work as a technique that merges well with deliberate practice, and I was therefore not surprised to see that deliberate practice has been mentioned previously on Study Hacks.

15. I say “usual” here because chess and music are common domains where psychologists can run controlled experiments to measure expertise, study habits, and so on.

16. I wonder what she would think of Newport’s Deep Work book.

17. I bumped into Steven Pinker totally by coincidence at San Francisco International Airport (SFO) last month. I was surprised that he was all by himself, even though SFO is filled with people who presumably must have read his book. I only briefly mentioned to him that I enjoyed reading his book. I did not want to disturb him.

18. I should add from my perspective, the past also includes lack of technological and personal support for people with disabilities.

19. Lewis, unfortunately, believes that Perry has not spent much time learning about the department from the previous energy secretary, an MIT nuclear physicist who played a role in the technical negotiations of the Iran nuclear deal. Dude, there’s a reason why President Obama chose nuclear physicists to run the Department of Energy.

20. Unless they’re smooching to get that salary increase, or trying to con people à la Kenneth Lay.

21. Freecycle seems like a cool resource. Think of it as a Craigslist but where all products must be given away for free. I’m surprised I never heard of Freecycle before reading Give and Take, but then again, I didn’t know anything about Craigslist until summer 2014, when I learned about it as I was searching for apartments in Berkeley. That’s why reading books is so useful: I learn

22. I couldn’t help but end this short review with two quick semi-personal comments. First, I didn’t realize until reading the acknowledgments section (yes, I read every name in those!) that he is a close friend of CMU professor Jean Yang, whose blog I have known about for many years. Second, Seth cites the 2015 paper A Century of Portraits: A Visual Historical Record of American High School Yearbooks, by several students affiliated with Alexei Efros’ group. The citation, however, was incorrect since it somehow missed the lead author, so I took pictures and emailed the situation to him and the Berkeley authors. Seth responded with a one-liner: “Sorry. Not sure how that happened. I will change in future editions”, so hopefully there will be future editions (not sure how likely that is with books, though). I bet the Berkeley authors were surprised to see that (a) their work made it in Seth’s book, and (b) someone actually read the endnotes and caught the error.

# Physical Versus Online Terrorist Threats

Talk about unwelcome headlines before the holidays. The sudden decision by President Trump to move American troops out of Syria makes little sense since ISIS is not yet defeated. While ISIS has certainly been battered and pushed back over the last few years, withdrawing now gives them a reprieve and risks further destabilizing the Middle East by allowing actors such as Assad, Russia, Iran, and potentially other terrorist groups to fill the remaining void. Similar logic applies to Trump’s second sudden decision regarding troops in Afghanistan.

Making matters worse is the resignation of Secretary of Defense Jim Mattis, who could no longer contain his disagreements with Trump, and pointedly wrote: “you have the right to have a Secretary of Defense whose views are better aligned with yours” in his resignation letter.

This is disheartening because Secretary Mattis was one of the most competent members of Trump’s administration. While one can certainly disagree with decisions here and there, he has the requisite experience and know-how to run America’s defense department. I am a firm believer that for critical presidential cabinet positions, we must have the best people get the job. President Trump will not find anyone better than Mattis who wants the job.

In addition to all that has and will be said among the Washington class and elites, I urge everyone to remember the non-traditional aspect of our war on terror. ISIS will not be defeated until we have also eradicated its online presence.

More than any other terrorist organization, ISIS became a household name via their effectiveness at utilizing the Internet and social media. Examples range from their nasty online killings, to recruitment and organization via social media, to encouragement of “lone wolf” attacks from law-abiding citizens turned radicalized ISIS agents.

Social media companies deserve their fair share of blame for allowing ISIS and other terrorist organizations to gain a foothold on their platforms. Facebook, Twitter, YouTube, and similar companies have more resources, technical skill, and money than ISIS, yet were (initially) blindsided by terrorist activity. Compounding these issues is a lack of incentives: combating terrorism requires social media companies to allocate resources that could otherwise be used to increase growth.

The good news is that, due to public and government pressure, these companies have dramatically improved their counter-terrorism techniques. I haven’t seen as many headlines about terrorism on social media, but hard data would be more reassuring. (For an overview of how terrorists have utilized social media, I recommend the thrilling yet worrisome book LikeWar, published earlier this year, which has helped shape my thinking on where the real threats lie in the modern era.)

To recap my position, let’s keep our troops in Syria and Afghanistan, but please don’t relieve the pressure on media companies (and on ourselves as consumers) to be vigilant of terrorist groups using social media to advance their malicious agendas.

I would be remiss if I didn’t mention that the next logical step after social media is for terrorists to use new advances in Artificial Intelligence. Look at some recent research results from NVIDIA, for instance; I predict that it will not be long — if it hasn’t already happened — before terrorist groups start buying GPUs and generating fake images.

To be clear, I am not blindly anti-Trump. On the contrary, I want him to succeed as president so that America succeeds. Though I know there’s little chance he will change his mind with respect to traditional military, I hope that he and his administration will do enough to stop the online threat from ISIS. Or, at the very least, that they will not do something that, as a byproduct, makes it easier for terrorists to operate online. I am more concerned about threats from the Internet, social media, and misinformation in the near future, rather than traditional military-style combat.

Overall, it’s a sad day when the leader of the free world’s actions give relief to ISIS and are praised by America’s biggest geopolitical foe — Vladimir Putin.

# Better Saving and Logging for Research Experiments

In many research projects, it is essential to test which of several competing methods and/or hyperparameters works best. The process of saving and logging experiments, however, can create a disorganized jungle of output files. Furthermore, reproducibility can be challenging without knowing all the exact parameter choices that were used to generate results. Inspired in part by Dustin Tran’s excellent Research-to-Engineering framework blog post, in this post I will present several techniques that have worked well for me in managing my research code, with a specific emphasis on logging and saving experimental runs.

Technique 0. I will label this as technique “0” since it should be mandatory and generalizes far beyond logging research code compared to the other tips here: use version control. git, along with the “hub extension” to form GitHub, is the standard in my field, though I’ve also managed projects using GitLab.

In addition, I’ve settled on these relevant strategies:

• To evaluate research code, I create a separate branch strictly for this purpose (which I name eval-[whatever]), so that it doesn’t interfere with my main master branch, and to enable greater ease of reproducing prior results by simply switching to the appropriate branch. The alternative would be to reset and restore to an older commit in master, which can be highly error-prone.
• I make a new Python virtualenv for each major project, and save a requirements.txt somewhere in the repository so that recreating the environment on any of the several machines I have access to is (usually) as simple as pip install -r requirements.txt.
• For major repositories, I like to add a setup.py file so that I can install the library using python setup.py develop, allowing me to freely import the code regardless of where I am in my computer’s directory system, so long as the module is installed in my virtualenv.

Technique 1. In machine learning, and deep learning in particular, hyperparameter tuning is essential. For the ones I frequently modify, I use the argparse library. This lets me run code on the command line like this:

python script.py --batch_size 32 --lrate 1e-5 --num_layers 4 <more args here...>

While this is useful, the downside is readily apparent: I don’t want to have to write down all the hyperparameters each time, and copying and pasting earlier commands might be error prone, particularly when the code constantly changes. There are a few strategies to make this process easier, all of which I employ at some point:

• Make liberal use of default argument settings. I find reasonable values of most arguments, and stick with them for my experiments. That way, I don’t need to specify the values in the command line.
• Create bash scripts. I like to have a separate folder called bash/ where I insert shell scripts (with the extension .sh) with many command line arguments for experiments. Then, after making the scripts executable with chmod, I can call experiment code using ./bash/script_name.sh.
• Make use of json or yaml files. For an alternative (or complementary) technique for managing lots of arguments, consider using .json or .yaml files. Both file types are human-readable and are well supported by Python libraries.
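As a minimal sketch of how the first and third strategies can work together, the code below combines argparse defaults with an optional JSON file of overrides. Only --batch_size, --lrate, and --num_layers come from the command above; the --config flag, the default values, and the function names are assumptions made for illustration.

```python
import argparse
import json


def build_parser():
    """Parser whose defaults are the values used for most experiments."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--lrate', type=float, default=1e-5)
    parser.add_argument('--num_layers', type=int, default=4)
    # Hypothetical flag: path to a JSON file holding argument overrides.
    parser.add_argument('--config', type=str, default=None)
    return parser


def parse_args(argv=None):
    """Precedence: command line > JSON file > built-in defaults."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.config is not None:
        with open(args.config, 'r') as fh:
            overrides = json.load(fh)
        # Values from the file replace the parser defaults ...
        parser.set_defaults(**overrides)
        # ... and re-parsing lets explicit command-line flags still win.
        args = parser.parse_args(argv)
    return args
```

With this in place, a run such as python script.py --config exp.json --num_layers 8 would take most settings from the (hypothetical) exp.json file while still letting one flag be changed by hand.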

Technique 2. I save the results from experiment runs in unique directories using Python’s os.path.join and os.makedirs functions for forming the string and creating the resulting directory, respectively. Do not build the directory path by concatenating strings and slashes by hand, because it’s clumsy and vulnerable to issues with slashes in directory names. Just use os.path.join, which is so ubiquitous in my research code that by habit I import it at the top of many scripts.
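Here is a minimal sketch of the contrast, assuming a POSIX system. The head path, the experiment name, and the make_dir helper are hypothetical and chosen only for illustration.

```python
import os
from os.path import join


def make_dir(head, *parts):
    """Join path components and create the directory if it is missing."""
    path = join(head, *parts)
    # exist_ok=True (Python 3.2+) avoids an error if the directory exists.
    os.makedirs(path, exist_ok=True)
    return path


# Fragile: manual concatenation gives a doubled slash when `head`
# already ends with one, and it hard-codes the separator.
head = '/tmp/experiments/'        # hypothetical HDD location
fragile = head + '/' + 'dqn'      # contains '//'

# Preferred: os.path.join inserts a separator only where one is needed.
robust = join(head, 'dqn')
```

Calling make_dir(head, 'dqn', 'snapshots') would then create the experiment directory and a snapshots/ sub-directory in one step.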

Subdirectories can (and should) be created as needed within the head experiment directory. For example, every now and then I save neural network snapshots in a snapshots/ sub-directory, with the relevant parameter (e.g., epoch) in the snapshot name.

But snapshots and other data files can demand lots of disk space. The machines I use for my research generally have small SSDs and large HDDs. Due to storage constraints on the SSDs, which often have less than 1TB of space, I almost always save experiment logs on my HDDs.

Don’t forget to back up data! I’ve had several machines compromised by “bad guys” in the past, forcing me to reinstall the operating system. HDDs and other large-storage systems can be synced across several machines, making it easy to access. If this isn’t an option, then simply copying files over from machine-to-machine manually every few days will do; I write down reminders in my Google Calendar.

Technique 3. Here’s a surprisingly non-trivial question related to the prior tactic: how shall the directory be named? Ideally, the name should reflect the most important hyperparameters, but it’s too easy for directory names to get out of control when every setting is packed into them.

I focus strictly on three or four of the most important experiment settings and put them in the file name. When random seeds matter, I also put them in the file name.

Then, I use Python’s datetime module to format the date that the experiment started to run, and insert that somewhere in the file name. Concretely, I create a “suffix” using the algorithm name, the date, and the random seed (with str().zfill() to get leading zeros inserted to satisfy my OCD), and join it to the “HEAD”, which is the machine-dependent path to the HDD (see my previous tip).
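A minimal sketch of this naming scheme follows. The function name, the exact date format, the three-digit seed padding, and the "seed" label inside the name are assumptions for illustration rather than the precise format used in my code.

```python
import datetime
import os


def experiment_dir(head, algo, seed, now=None):
    """Build a path of the form HEAD/<algo>-<date>-seed-<seed>."""
    if now is None:
        now = datetime.datetime.now()
    # Year-month-day-hour-minute; seconds are left out on purpose.
    date = now.strftime('%Y-%m-%d-%H-%M')
    # zfill pads the seed with leading zeros so names line up and sort well.
    suffix = '{}-{}-seed-{}'.format(algo, date, str(seed).zfill(3))
    return os.path.join(head, suffix)
```

The optional now argument is there only so that the function can be checked with a fixed timestamp; in normal use it is omitted and the current time is used.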

There are at least two advantages for having the date embedded in the file names:

• It avoids issues with duplicate directory names. This prevents the need to manually delete or re-name older directories.
• It makes it easy to spot-check (via ls -lh on the command line) which experiment runs can be safely deleted if major updates were made since then.

Based on the second point above, I prefer the date to be human-readable, which is why I like formatting it the way I do above. I don’t put in the seconds as I find that to be a bit too much, but one can easily add it.

Technique 4. This last pair of techniques pertains to reproducibility. Don’t neglect them! How many times have you failed to reproduce your own results? I have experienced this before and it is embarrassing.

The first part of this technique happens during code execution: save all arguments and hyperparameters in the output directory. That means, at minimum, dumping the parsed arguments to a pickle file in the save path, denoted as args.save_path, which (as stated earlier) usually points somewhere in my machine’s HDD. Alternatively, or in addition, you can save arguments in human-readable form using json.
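As a minimal sketch, assuming the arguments come from argparse as a Namespace, the following helper writes both versions. The save_args name and the args.pkl and args.json file names are hypothetical.

```python
import json
import os
import pickle


def save_args(args, save_path):
    """Write the parsed arguments to save_path as pickle and as JSON."""
    os.makedirs(save_path, exist_ok=True)
    # Pickle stores the Namespace object itself, so it loads back unchanged.
    with open(os.path.join(save_path, 'args.pkl'), 'wb') as fh:
        pickle.dump(args, fh)
    # JSON is easier to inspect by eye; vars() turns the Namespace into a
    # dict. This assumes every argument value is JSON-serializable.
    with open(os.path.join(save_path, 'args.json'), 'w') as fh:
        json.dump(vars(args), fh, indent=2, sort_keys=True)
```

Calling save_args(args, args.save_path) right after parsing, before any training starts, means the settings are on disk even if the run crashes later.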

The second part of this technique happens during paper writing. Always write down the command that was used to generate figures. I mostly use Overleaf — now merged with ShareLaTeX — for writing up my results, and I insert the command in the comments above the figures, like this:

% Generate with:
% python [script].py --arg1 val1 --arg2 val2
% at commit [hashtag]
\begin{figure}
% LaTeX figure code here...
\end{figure}

It sounds trivial, but it’s helped me several times for last-minute figure changes to satisfy page and margin limits. In many of my research projects, the stuff I save and log changes so often that I have little choice but to have an entire scripts/ folder with various scripts for generating figures depending on the output type, and I can end up with tens of such files.

While I know that TensorBoard is popular for checking results, I’ve actually never used it (gasp!); I find good old matplotlib to serve my needs sufficiently well, even for checking training in progress. Thus, each of my files in scripts/ creates matplotlib plots, all of which are saved in the appropriate experiment directory in my HDDs.

Conclusion. These techniques will hopefully make one’s life easier in managing and parsing the large set of experiment results that are inevitable in empirical research projects. A recent example when these tips were useful to me was with the bed-making paper we wrote, with neural network training code here, where I was running a number of experiments to test different hyperparameters, neural network architectures, and so forth.

I hope these tips prove to be useful for your experimental framework.

# Bay Area Robotics Symposium, 2018 Edition

The auditorium where BARS 2018 talks occurred, which was within the Hoover Institution. The number of attendees was capped at 400.

An example presentation at BARS.

A few weeks ago, I attended the Bay Area Robotics Symposium (BARS). Last year, BARS was at the UC Berkeley International House, and you can see my blog post summary here. This year, it was at Stanford University, within one of the Hoover Institution buildings. Alas, I did not get to meet 97-year-old George Shultz or 91-year-old William Perry so that I could thank them for helping to contain the threat of nuclear warfare from the Cold War to the present day.

Oh, and so that I could also ask how to become a future cabinet member.

The location of BARS alternates between Berkeley and Stanford since those are the primary sources of cutting-edge academic robotics research in the Bay Area. I am not sure what precisely distinguishes “Berkeley-style” robotics from “Stanford-style” robotics. My guess is that due to Pieter Abbeel and Sergey Levine, Berkeley has more of a Deep Reinforcement Learning presence, but we also have a number of researchers in “classical” robotics (who may also use modern Deep Learning technologies) such as our elder statesmen Ken Goldberg and Masayoshi Tomizuka, and elder stateswoman Ruzena Bajcsy.

It is unclear what Stanford specializes in, though perhaps a reasonable answer is “everything important.” Like Berkeley, Deep Learning is extremely popular at Stanford. Pieter and Sergey’s former student, Chelsea Finn, is joining the Stanford faculty next year, which will balance out the Deep Reinforcement Learning research terrain.

The bulk of BARS consists of 10-minute faculty talks. Some interesting tidbits:

• More faculty are doing research in core deep reinforcement learning, or (more commonly) making use of existing algorithms for applications elsewhere. There is also a concern over generalization to new tasks and setups. I distinctly remember Chelsea Finn saying that “this talk is about the less interesting stuff” — because generalizing to new scenarios outside the training distribution is hard.

• Another hot area of research is Human-Robot Interaction (HRI), particularly with respect to communication and safety. With the recent hires of Dorsa Sadigh at Stanford and Anca Dragan at Berkeley, both schools now have at least one dedicated HRI lab.

• Finally, my favorite talk was from Ken Goldberg. I was touched and honored when Ken talked about our work on bed-making, and commented on my BAIR Blog post from October which summarized key themes from the lab’s research.

Since BARS is funded in part by industry sponsors, the sponsors were allotted some presentation time. The majority were about self-driving cars. It was definitely clear what the hot topic was there …

In addition to the faculty and industry talks, there were two keynote talks. Last year, Professor Robert Full’s keynote was on mobile, insect-like robots. This year, Stanford NLP professor Chris Manning had the first slot, and in a sign of the increasing importance of robotics and the law, California Supreme Court Justice Mariano-Florentino Cuéllar gave the second keynote. That was unexpected.

During the Q&A session, I remember someone asking the two men how to deal with the rising pace of change and the threat of unemployment due to intelligent robots automating out jobs. I believe Professor Manning said we needed to be lifelong learners. That was predictable, and no worries, I plan to be one. I hope this was obvious to anyone who knows me! (If it was not, please contact me.)

But … Professor Manning lamented that not everyone will be lifelong learners, and disapprovingly commented about people who spend weekends on “football and beer.”

The Americans among us at BARS are probably not the biggest football fans (I’m not), and that’s before we consider the students from China, India, and other countries where football is actually soccer.

Professor Manning can get away with saying that to a BARS audience, but I would be a little cautious if the audience were instead a random sample of the American population.

BARS had two poster sessions with some reasonable food and coffee from our industry sponsors. These were indoors (rather than outdoors as planned) due to air quality concerns from the tragic California fires up north.

During the poster sessions, it was challenging to communicate with students, since most were clustered in groups and sign language interpreters can have difficulty determining the precise voice that needs to be heard and translated. Probably the most important thing I learned during the poster session was not even a particular research project. I spoke to a recent postdoc graduate from Berkeley who I recognized, and he said that he was part of a new robotics research lab at Facebook. Gee, I was wondering what took Facebook so long to establish one! Now Facebook joins Google, NVIDIA, and OpenAI with robotics research labs that, presumably, use machine learning and deep learning.

After BARS, I ate a quick dinner, bought Fukuyama’s successor book to the one I read and discussed earlier this month at the Stanford bookstore, and drove back home.

Overall, BARS went as reasonably as it probably could have gone for me.

One lasting impression on me is that Stanford’s campus is far nicer than Berkeley’s, and much flatter. No wonder Jitendra Malik was “joking” last year about how robots trained on Stanford’s smooth and orderly design would fail to generalize to Berkeley’s haphazardness.

The Stanford campus.